Created 10-16-2018 03:38 PM
I am facing an issue starting the Spark Thrift Server when NameNode HA is enabled. I have 2 NameNodes, on host1 and host2. The Thrift Server starts when the NameNode on host1 is active and fails to start when the NameNode on host1 is in standby. Below is the stack trace:
Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1952)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1423)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3085)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1154)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:966)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
);
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:904)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Pasting the contents of spark-thrift-sparkconf.conf:
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.maxExecutors 10
spark.dynamicAllocation.minExecutors 0
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.executor.extraJavaOptions -XX:+UseNUMA
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.hadoop.cacheConf false
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 7d
spark.history.fs.cleaner.maxAge 90d
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.io.compression.lz4.blockSize 128kb
spark.master yarn-client
spark.scheduler.allocation.file /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-fairscheduler.xml
spark.scheduler.mode FAIR
spark.shuffle.file.buffer 1m
spark.shuffle.io.backLog 8192
spark.shuffle.io.serverThreads 128
spark.shuffle.service.enabled true
spark.shuffle.unsafe.file.output.buffer 5m
spark.sql.autoBroadcastJoinThreshold 26214400
spark.sql.hive.convertMetastoreOrc true
spark.sql.hive.metastore.jars /usr/hdp/3.0.0.0-1634/spark2/standalone-metastore/standalone-metastore-1.21.2.3.0.0.0-1634-hive3.jar
spark.sql.hive.metastore.version 3.0
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.statistics.fallBackToHdfs true
spark.sql.warehouse.dir /apps/spark/warehouse
spark.unsafe.sorter.spill.reader.buffer.size 1m
spark.yarn.executor.failuresValidityInterval 2h
spark.yarn.maxAppAttempts 1
spark.yarn.queue default
I checked core-site.xml and hdfs-site.xml on the node where the Spark Thrift Server is running. fs.defaultFS has the proper value (i.e. hdfs://namespace). I am guessing that it is picking up the host1 value from some config file, but I am not sure which one.
Please let me know any other places to look.
Thanks
Created 10-18-2018 05:09 AM
The stack trace is coming from the Hadoop APIs directly, so it will be better to isolate the issue first (i.e. whether the problem is on the Spark config side or in HDFS itself).
Error
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
So have you tried running simple HDFS commands to see whether they return the same exception or a different one?
# su - hdfs -c "hdfs dfs -ls /user"
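It might also help to dump what the client configuration on that node actually resolves for the nameservice. A quick sketch, assuming the standard HA property names and that your nameservice is called "namespace" (adjust to your nameservice id):
# run on the node where the Spark Thrift Server starts
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.nameservices
hdfs getconf -confKey dfs.ha.namenodes.namespace
hdfs getconf -confKey dfs.client.failover.proxy.provider.namespace
If any of these resolve to host1 directly instead of the nameservice, that would explain the standby error.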
If you see the same message, then try restarting the HDFS service once and then try again. If you notice the same issue again (i.e. it fails when the NameNode on host1 is standby), then the NameNode and ZKFC logs might give us a better idea.
It would also be good to check whether the NameNodes have enough memory, i.e. sufficient RAM and a properly sized heap.
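You can also confirm which NameNode is actually active versus standby at the time of the failure. A sketch, assuming your NameNode IDs are nn1 and nn2 (use the IDs from dfs.ha.namenodes.<nameservice>):
# check the HA state of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2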
Created 10-18-2018 05:35 AM
Thanks for the input. I tried running normal HDFS commands and they work fine even when host1 is in standby. I checked the NameNode and ZKFC logs but there is nothing much relevant to this. I also checked the memory settings; they are fine.
Any idea where Spark picks up the NameNode info from? I am guessing it reads it from core-site.xml, but is that correct?
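One thing I can try to confirm this (a rough sketch, assuming spark-shell on the same node picks up the same conf directory as the Thrift Server):
spark-shell --master local <<'EOF'
// print the HDFS settings as seen by Spark's Hadoop configuration
println(spark.sparkContext.hadoopConfiguration.get("fs.defaultFS"))
println(spark.sparkContext.hadoopConfiguration.get("dfs.nameservices"))
EOF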
Created 10-18-2018 01:18 PM
Did you try copying or symlinking /etc/hadoop/conf/core-site.xml into /etc/spark2/conf/?
If not please give it a try and let us know how it goes.
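For example (just a sketch; adjust the paths if your conf dirs differ):
# make the Hadoop client configs visible to Spark2
ln -s /etc/hadoop/conf/core-site.xml /etc/spark2/conf/core-site.xml
ln -s /etc/hadoop/conf/hdfs-site.xml /etc/spark2/conf/hdfs-site.xml
Then restart the Spark Thrift Server so it picks up the HA client settings.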
HTH
Created 10-18-2018 02:39 PM
Yes, I tried copying both core-site.xml and hdfs-site.xml, but I am still facing the same issue. Attaching some logs: the Spark Thrift Server start logs in debug mode and the corresponding YARN application logs.
spark-spark-orgapachesparksqlhivethriftserverhivet.zip
Also made sure that "/hadoop/yarn/local/usercache/spark/filecache/10/__spark_conf__.zip/__spark_conf__/__hadoop_conf__/core-site.xml" has correct content.
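For reference, a quick way to spot-check that file (just a sketch reusing the path above; the filecache index is whatever YARN happened to assign):
# show the fs.defaultFS property and the value line that follows it
grep -A1 "fs.defaultFS" /hadoop/yarn/local/usercache/spark/filecache/10/__spark_conf__.zip/__spark_conf__/__hadoop_conf__/core-site.xml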
Created 10-18-2018 03:02 PM
Could you try setting the nameservice id for the following properties:
spark.history.fs.logDirectory=hdfs://<name_service_id>/spark2-history/
spark.eventLog.dir=hdfs://<name_service_id>/spark2-history/
Created 10-18-2018 07:29 PM
@Aditya Sirna I reproduced the problem. It turns out the issue was caused by the metastore location URI pointing to one of the NameNodes only. To change this you need to run:
hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot
The above will list the FS root locations, and you will be able to spot the ones you need to change. Then issue:
hive --config /etc/hive/conf/conf.server --service metatool -updateLocation <new-location> <old-location>
For example:
hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://c14/apps/spark/warehouse hdfs://c14-node2.squadron-labs.com:8020/apps/spark/warehouse
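If you want to be careful, the metatool also has a dry-run mode (assuming your Hive version supports the -dryRun flag), which shows what would be updated without persisting anything:
# preview the location change without modifying the metastore
hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://c14/apps/spark/warehouse hdfs://c14-node2.squadron-labs.com:8020/apps/spark/warehouse -dryRun
After the update, it is probably worth restarting the Hive Metastore and the Spark Thrift Server.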
HTH
Created 10-19-2018 01:12 AM
This worked like a charm. Thanks a lot for your help, really appreciate it 🙂
However, in the latest version of Ambari this should have been handled by Ambari itself. I do not see the manual step in this doc, so it must be either a doc bug or an Ambari issue in my cluster.