spark2 thrift server 'Connection to STS still is not created'
This is a record of more than three hours of flailing around to resolve the 'Connection to STS still is not created' error on the spark2 thrift server.
There are surely many possible cases, but this time it happened because yarn.nodemanager.resource.memory-mb was smaller than the spark executor's minimum size (1024+384 MB). Below is the record of how I tracked down that cause.
--------
A while back, for testing an HDP cluster build, I had spun up three VMs on my local PC and built a cluster on them. Recently I needed it again, so I brought the VMs back up and restarted the services, but the thrift server would not come up.
This is what the Ambari screen showed.
I took a look at the task log.
2020-04-20 19:31:13,890 - Generating properties file: /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf
2020-04-20 19:31:13,891 - File['/usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf'] {'owner': 'hive', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2020-04-20 19:31:13,917 - Writing File['/usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf'] because contents don't match
2020-04-20 19:31:13,922 - File['/usr/hdp/current/spark2-thriftserver/conf/spark-thrift-fairscheduler.xml'] {'content': InlineTemplate(...), 'owner': 'spark', 'group': 'spark', 'mode': 0755}
2020-04-20 19:31:13,927 - Execute['/usr/hdp/current/spark2-thriftserver/sbin/start-thriftserver.sh --properties-file /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf '] {'environment': {'JAVA_HOME': u'/usr/jdk64/jdk1.8.0_112'}, 'not_if': 'ambari-sudo.sh -H -E test -f /var/run/spark2/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid && ambari-sudo.sh -H -E pgrep -F /var/run/spark2/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid', 'user': 'spark'}
2020-04-20 19:31:46,702 - Check connection to STS is created.
2020-04-20 19:31:46,704 - Execute['! /usr/hdp/current/spark2-thriftserver/bin/beeline -u 'jdbc:hive2://centos-03:10016/default;transportMode=binary' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL' -e 'Error: Could not open''] {'path': [u'/usr/hdp/current/spark2-thriftserver/bin/beeline'], 'user': 'spark', 'timeout': 60.0}
2020-04-20 19:31:48,131 - Connection to STS still is not created.
2020-04-20 19:31:48,131 - Check STS process status.
2020-04-20 19:31:48,131 - Process with pid 29188 is not running. Stale pid file at /var/run/spark2/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid
Command failed after 1 tries
The part that looks like the error is "Connection to STS still is not created".
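What Ambari's check is actually doing boils down to roughly these two steps (a sketch reconstructed from the Execute lines in the task log above; Ambari wraps them in ambari-sudo.sh and its own retry logic):

# 1. Try to open a session through beeline; these error strings mean STS is not reachable yet
/usr/hdp/current/spark2-thriftserver/bin/beeline \
  -u 'jdbc:hive2://centos-03:10016/default;transportMode=binary' -e '' 2>&1 \
  | grep -i -e 'Connection refused' -e 'Invalid URL' -e 'Error: Could not open'

# 2. Check whether the STS process behind the pid file is still alive
pgrep -F /var/run/spark2/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid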
As always in cases like this, it's time to Google. A whole stream of results comes up.
Reading them one by one, they look like similar cases, but no clear cause emerges.
When Googling doesn't give a clear answer either, in my experience the next step is a reinstall. Something in the environment may be tangled up, so reinstall it cleanly.
I reinstalled the thrift server on different hosts in turn, brought the service up and checked, removed spark2 entirely, installed it again, checked the service... time just dragged on with nothing to show for it, so I went back to the logs.
I ran the command that the ambari agent executes directly in a terminal.
[spark@centos-03 ~]$ /usr/hdp/current/spark2-thriftserver/sbin/start-thriftserver.sh --properties-file /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /var/log/spark2/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-centos-03.out
[spark@centos-03 ~]$
No runtime error is visible. Only the message that STS is starting and the log file location appear. I ran the next command.
[spark@centos-03 ~]$ beeline -u 'jdbc:hive2://centos-03:10016/default;transportMode=binary'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://centos-03:10016/default;transportMode=binary
20/04/20 19:45:27 [main]: ERROR jdbc.HiveConnection: Error opening session
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=default})
        at org.apache.thrift.TApplicationException.read(TApplicationException.java:111) ~[hive-exec-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) ~[hive-exec-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_OpenSession(TCLIService.java:176) ~[hive-exec-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at org.apache.hive.service.rpc.thrift.TCLIService$Client.OpenSession(TCLIService.java:163) ~[hive-exec-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:796) [hive-jdbc-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:305) [hive-jdbc-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107) [hive-jdbc-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
        at java.sql.DriverManager.getConnection(DriverManager.java:664) [?:1.8.0_112]
This is the real error message. beeline cannot connect; it's an Error opening session. So I went to look at the log file that start-thriftserver.sh pointed to when it started.
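The path is the one printed by start-thriftserver.sh above, so something like this is enough to read it (the file name includes the host name, so adjust it for your own machine):

tail -n 200 /var/log/spark2/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-centos-03.out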
ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024+384 MB) is above the max threshold (1024 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
        at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:318)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:166)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
The cause.. the nodemanager's resource memory. Since the cluster runs on VMs, while adjusting container memory sizes to make an environment where even two or three tiny jobs could run (or so I hoped..), I had also, without thinking, shrunk yarn.nodemanager.resource.memory-mb way down..
After readjusting the memory size and restarting STS, it comes up fine.
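For reference, the fix amounts to making both values named in the error at least as large as what the executor needs (1024+384 MB). Something along these lines in yarn-site.xml, or the matching memory fields under YARN's settings in Ambari; the 2048 here is just an example value, not the exact number I used:

<!-- example values only; must cover executor memory + overhead (1024+384 MB) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>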
Everything is in the logs..