Spark thrift server is failing to start when NN HA is enabled

Super Guru

I am facing an issue starting the Spark Thrift Server when NameNode HA is enabled. I have 2 namenodes, on host1 and host2. The Thrift Server starts when the namenode on host1 is active and fails to start when the namenode on host1 is standby. Below is the stack trace:

Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1952)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1423)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3085)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1154)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:966)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
);
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
        at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
        at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:904)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Pasting the contents of spark-thrift-sparkconf.conf

spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.maxExecutors 10
spark.dynamicAllocation.minExecutors 0
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.executor.extraJavaOptions -XX:+UseNUMA
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.hadoop.cacheConf false
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 7d
spark.history.fs.cleaner.maxAge 90d
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.io.compression.lz4.blockSize 128kb
spark.master yarn-client
spark.scheduler.allocation.file /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-fairscheduler.xml
spark.scheduler.mode FAIR
spark.shuffle.file.buffer 1m
spark.shuffle.io.backLog 8192
spark.shuffle.io.serverThreads 128
spark.shuffle.service.enabled true
spark.shuffle.unsafe.file.output.buffer 5m
spark.sql.autoBroadcastJoinThreshold 26214400
spark.sql.hive.convertMetastoreOrc true
spark.sql.hive.metastore.jars /usr/hdp/3.0.0.0-1634/spark2/standalone-metastore/standalone-metastore-1.21.2.3.0.0.0-1634-hive3.jar
spark.sql.hive.metastore.version 3.0
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.statistics.fallBackToHdfs true
spark.sql.warehouse.dir /apps/spark/warehouse
spark.unsafe.sorter.spill.reader.buffer.size 1m
spark.yarn.executor.failuresValidityInterval 2h
spark.yarn.maxAppAttempts 1
spark.yarn.queue default

I checked core-site.xml and hdfs-site.xml on the node where the Spark Thrift Server is running. fs.defaultFS has the proper value (i.e. hdfs://namespace). I am guessing that it is picking up the host1 value from some config file, but I am not sure which one.
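For reference, the values the Hadoop client configuration actually resolves to on that node can be double-checked with hdfs getconf (the nameservice and NameNode IDs reported are simply whatever is defined in hdfs-site.xml):

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.nameservices
hdfs getconf -namenodes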

Please let me know any other places to look.

.

Thanks


7 REPLIES

Master Mentor

@Aditya Sirna

As the stack trace is coming directly from the Hadoop APIs, it will be better to isolate the issue first (whether the problem is on the Spark config side or in HDFS itself).

Error

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby

So have you tried running simple HDFS commands to see if those also return the same exception or a different one?

# su - hdfs -c "hdfs dfs -ls /user"
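It may also help to confirm the HA state as HDFS itself reports it (here nn1 and nn2 are just placeholders for the NameNode IDs defined in dfs.ha.namenodes.<nameservice>; yours may be named differently):

# su - hdfs -c "hdfs haadmin -getServiceState nn1"
# su - hdfs -c "hdfs haadmin -getServiceState nn2"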

.

If you see the same message, then try restarting the HDFS service once and then try again. If you notice the same issue again (i.e. it fails to start when the namenode on host1 is standby), then the NameNode and ZKFC logs might give us more of an idea.

Also, it would be good to check whether the NameNodes have enough memory (RAM) and whether the heap is set up properly.

.

Super Guru

@Jay Kumar SenSharma,

Thanks for the input. I tried running a normal HDFS command and it works fine even when host1 is in standby. I checked the NameNode and ZKFC logs but there is nothing much relevant to this. I also checked the memory settings; they are fine.

Any idea where Spark picks up the NameNode info from? I am guessing it reads it from core-site.xml, but is that correct?


@Aditya Sirna

Did you try copying or symlinking /etc/hadoop/conf/core-site.xml into /etc/spark2/conf/?

If not please give it a try and let us know how it goes.
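For example, something along these lines (a sketch assuming the default HDP config locations; hdfs-site.xml is worth linking as well, since it holds the HA nameservice mappings):

ln -s /etc/hadoop/conf/core-site.xml /etc/spark2/conf/core-site.xml
ln -s /etc/hadoop/conf/hdfs-site.xml /etc/spark2/conf/hdfs-site.xml

Then restart the Spark2 Thrift Server so it picks the files up.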

HTH

Super Guru

@Felix Albani,

Yes, I tried copying both core-site.xml and hdfs-site.xml, but I am still facing the same issue. Attaching some logs: the Spark Thrift Server start logs in debug mode and the corresponding YARN application logs.

yarn-app-logs.txt

spark-spark-orgapachesparksqlhivethriftserverhivet.zip

I also made sure that "/hadoop/yarn/local/usercache/spark/filecache/10/__spark_conf__.zip/__spark_conf__/__hadoop_conf__/core-site.xml" has the correct content.
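For anyone checking the same thing, a simple grep of the localized file is enough to show the resolved value, e.g.:

grep -A1 "fs.defaultFS" /hadoop/yarn/local/usercache/spark/filecache/10/__spark_conf__.zip/__spark_conf__/__hadoop_conf__/core-site.xml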


Could you try setting the nameservice ID in the following properties:

spark.history.fs.logDirectory=hdfs://<name_service_id>/spark2-history/
spark.eventLog.dir=hdfs://<name_service_id>/spark2-history/
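For example, if the nameservice were called mycluster (a placeholder; use whatever dfs.nameservices is set to in your hdfs-site.xml), the entries in spark-thrift-sparkconf.conf would look like:

spark.history.fs.logDirectory hdfs://mycluster/spark2-history/
spark.eventLog.dir hdfs://mycluster/spark2-history/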


@Aditya Sirna I reproduced the problem. It turns out the issue was caused by the metastore location URI pointing to one of the namenode hosts directly instead of the nameservice. To change this you need to run:

hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot

The above will list the filesystem roots, and you will be able to spot the locations you need to change. Then issue:

hive --config /etc/hive/conf/conf.server --service metatool -updateLocation <new-location> <old-location>

For example:

hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://c14/apps/spark/warehouse hdfs://c14-node2.squadron-labs.com:8020/apps/spark/warehouse
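If I remember correctly, metatool also supports a -dryRun flag, so you can preview what would be rewritten before actually applying the change:

hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://c14/apps/spark/warehouse hdfs://c14-node2.squadron-labs.com:8020/apps/spark/warehouse -dryRun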

HTH

Super Guru

@Felix Albani,

This worked like a charm. Thanks a lot for your help. Really appreciate it 🙂

However, in the latest version of Ambari this should have been handled by Ambari itself. I do not see this manual step in the doc below, so it must be a doc bug or an Ambari issue in my cluster.

https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.1.0/managing-high-availability/content/amb_enab...