Spark - Hive tables not found when running in YARN-Cluster mode

I have a Spark 1.4.1 application on HDP 2.3.2. It works fine in YARN-Client mode. However, when I run it in YARN-Cluster mode, the application cannot find any of my Hive tables.

I submit the application like so:

  ./bin/spark-submit \
  --class com.myCompany.Main \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 10g \
  --executor-cores 1 \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar /home/spark/apps/YarnClusterTest.jar \
  --files /etc/hive/conf/hive-site.xml

Here's an excerpt from the logs:

15/12/02 11:05:13 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/12/02 11:05:14 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/12/02 11:05:14 INFO metastore.ObjectStore: ObjectStore, initialize called
15/12/02 11:05:14 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/12/02 11:05:14 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/12/02 11:05:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker2.xxx.com:34697 with 5.2 GB RAM, BlockManagerId(1, worker2.xxx.com, 34697)
15/12/02 11:05:16 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/12/02 11:05:16 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
15/12/02 11:05:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO metastore.ObjectStore: Initialized ObjectStore
15/12/02 11:05:19 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
15/12/02 11:05:19 INFO metastore.HiveMetaStore: Added admin role in metastore
15/12/02 11:05:19 INFO metastore.HiveMetaStore: Added public role in metastore
15/12/02 11:05:19 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/12/02 11:05:19 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/12/02 11:05:19 INFO parse.ParseDriver: Parsing command: SELECT * FROM streamsummary
15/12/02 11:05:20 INFO parse.ParseDriver: Parse Completed
15/12/02 11:05:20 INFO hive.HiveContext: Initializing HiveMetastoreConnection version 0.13.1 using Spark classes.
15/12/02 11:05:20 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=streamsummary
15/12/02 11:05:20 INFO HiveMetaStore.audit: ugi=spark  ip=unknown-ip-addr  cmd=get_table : db=default tbl=streamsummary   
15/12/02 11:05:20 DEBUG myCompany.Main$: no such table streamsummary; line 1 pos 14

I run into the same 'no such table' problem any time my application needs to read from or write to the Hive tables.

Thanks in advance!

UPDATE:

I tried submitting the spark application with the --files parameter provided before --jars as per @Guilherme Braccialli's suggestion, but doing so now gives me an exception saying that the HiveMetastoreClient could not be instantiated.

spark-submit:

  ./bin/spark-submit \
  --class com.myCompany.Main \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 1g \
  --executor-memory 11g \
  --executor-cores 1 \
  --files /etc/hive/conf/hive-site.xml \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /home/spark/apps/YarnClusterTest.jar

code:

// core.scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

/**
 * This trait should be mixed in by every other class or trait that is dependent on `sc`.
 */
trait Core extends java.io.Serializable {
  val sc: SparkContext
  // lazy, so the HiveContext is created on first use
  lazy val sqlContext = new HiveContext(sc)
}

// yarncore.scala
import org.apache.spark.{SparkConf, SparkContext}

/**
 * This trait initializes the SparkContext with YARN as the master.
 */
trait YarnCore extends Core {
  val conf = new SparkConf().setAppName("my app").setMaster("yarn-cluster")
  val sc = new SparkContext(conf)
}

// main.scala
import org.apache.log4j.Logger  // assumed: the log4j Logger that ships with Spark

object Test {
  def main(args: Array[String]) {
    /** initialize the spark application **/
    val app = new YarnCore  // initializes the SparkContext in YARN mode
      with sqlHelper        // provides SQL functionality (defined elsewhere)
      with Transformer      // provides UDFs for transforming the dataframes into the marts

    /** initialize the logger **/
    val log = Logger.getLogger(getClass.getName)

    // COUNT(*) comes back as a one-row DataFrame, so extract the number from it
    val count = app.sqlContext.sql("SELECT COUNT(*) FROM streamsummary").first().getLong(0)

    // the `s` prefix is needed for string interpolation
    log.info(s"streamsummary has $count records")

    /** shut down the spark app **/
    app.sc.stop()
  }
}

exception:

15/12/11 09:34:55 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/12/11 09:34:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:117)
	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:165)
	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:163)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:170)
	at com.epldt.core.Core$class.sqlContext(core.scala:13)
	at com.epldt.Test$$anon$1.sqlContext$lzycompute(main.scala:17)
	at com.epldt.Test$$anon$1.sqlContext(main.scala:17)
	at com.epldt.Test$.main(main.scala:26)
	at com.epldt.Test.main(main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:486)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
	... 14 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
	... 19 more
Caused by: java.lang.NumberFormatException: For input string: "1800s"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1258)
	at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1211)
	at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1220)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:293)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:214)
	... 24 more
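
The innermost cause in this trace is a NumberFormatException for the string "1800s", raised from HiveConf.getIntVar while HiveMetaStoreClient opens its connection. That points at a property in the shipped hive-site.xml whose value carries a time-unit suffix, which this Hive 0.13.1 client tries to read as a plain integer. A metastore client timeout setting is a likely candidate, but that is an assumption the trace does not confirm. The shell sketch below lists such values. `find_time_suffixed` is a hypothetical helper name, and the sketch assumes each <name> and <value> element sits on a single line.

```shell
# Hypothetical helper: print "name=value" for every property in a
# hive-site.xml whose value is digits followed by letters (e.g. 1800s).
# Assumes each <name>...</name> and <value>...</value> is on one line.
find_time_suffixed() {
  awk '
    /<name>/  { name = $0
                gsub(/.*<name>|<\/name>.*/, "", name) }    # keep only the property name
    /<value>/ { value = $0
                gsub(/.*<value>|<\/value>.*/, "", value)   # keep only the value text
                if (value ~ /^[0-9]+[a-z]+$/) print name "=" value }
  ' "$1"
}

# Example call (path taken from the spark-submit command above):
# find_time_suffixed /etc/hive/conf/hive-site.xml
```

Any line this prints is a value that an integer-only parser would reject in the same way as the trace above.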

Accepted Solutions

@Luis Antonio Torres

I ran a few tests and I think you just need to move --files: it must come before your .jar file.

Find my sample class here:

https://github.com/gbraccialli/SparkUtils/blob/master/src/main/scala/com/github/gbraccialli/spark/Hi...

Project is here:

https://github.com/gbraccialli/SparkUtils

Sample spark-submit with Hive commands passed as parameters:

git clone https://github.com/gbraccialli/SparkUtils
cd SparkUtils/
mvn clean package
spark-submit \
  --class com.github.gbraccialli.spark.HiveCommand \
  --master yarn-cluster \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
 target/SparkUtils-1.0.0-SNAPSHOT.jar "show tables" "select * from sample_08"
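
The ordering matters because spark-submit treats the first positional JAR as the application and hands everything after it to the application's main method, so an option placed after the JAR never reaches spark-submit. A rough shell sketch of a pre-flight check follows. `check_submit_order` is a hypothetical helper, not part of Spark. It assumes every `--option` is followed by a value, so it would misjudge value-less flags such as `--verbose`, and it would also flag an application's own `--` arguments.

```shell
# Hypothetical helper: report the first "--option" that appears after the
# application JAR in a list of spark-submit arguments.
# Heuristic: a *.jar argument counts as the application JAR unless it
# directly follows an option (for example the value given to --jars).
check_submit_order() {
  seen_jar=""
  prev=""
  for arg in "$@"; do
    if [ -n "$seen_jar" ]; then
      case "$arg" in
        --*) echo "misplaced option after $seen_jar: $arg"; return 1 ;;
      esac
    else
      case "$arg" in
        *.jar)
          case "$prev" in
            --*) ;;                  # value of an option such as --jars
            *) seen_jar="$arg" ;;    # first positional JAR: the application
          esac ;;
      esac
    fi
    prev="$arg"
  done
  echo "ok"
}
```

Called with the arguments of a command that ends in `/home/spark/apps/YarnClusterTest.jar --files /etc/hive/conf/hive-site.xml`, it reports `--files` as misplaced. With `--files` moved ahead of the JAR it prints `ok`.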

Re: Spark - Hive tables not found when running in YARN-Cluster mode

I'm uploading a copy of the log excerpt as a text file (log.txt) because it won't format properly in the post.

Re: Spark - Hive tables not found when running in YARN-Cluster mode

Do you have Kerberos enabled on this cluster? Also - are you using HDP 2.3.0 or HDP 2.3.2?

Re: Spark - Hive tables not found when running in YARN-Cluster mode

Could you share the code from the com.myCompany.Main class?

Re: Spark - Hive tables not found when running in YARN-Cluster mode

@Guilherme Braccialli

Thanks for your reply. I tried your suggestion of putting the --files parameter before --jars when submitting, but now I'm running into an exception saying the HiveMetastoreClient could not be instantiated. I'll update my post with the code and new stack trace.

Re: Spark - Hive tables not found when running in YARN-Cluster mode

@Luis Antonio Torres

It worked for me. Can you check the content of the /usr/hdp/current/spark-client/conf/hive-site.xml you are using?

Mine looks like this:

  <configuration>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://sandbox.hortonworks.com:9083</value>
    </property>
  </configuration>

Re: Spark - Hive tables not found when running in YARN-Cluster mode

@Guilherme Braccialli

I just want to start by thanking you for your quick responses. I've been struggling with this problem for a while now, and I've also asked about it on Stack Overflow, but with no luck.

As for /usr/hdp/current/spark-client/conf/hive-site.xml, the content is pretty much the same as yours:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://host.xxx.com:9083</value>
  </property>
</configuration>

Re: Spark - Hive tables not found when running in YARN-Cluster mode

@Luis Antonio Torres

Check your command: you are using /etc/hive/conf/hive-site.xml instead of /usr/hdp/current/spark-client/conf/hive-site.xml.

I think this is the issue.

Re: Spark - Hive tables not found when running in YARN-Cluster mode

@Guilherme Braccialli That did the trick! =) I didn't notice that at first. I wasn't the one who set up our cluster, so I had no idea that the contents of those two files were different. It's a subtle thing, but it caused me a lot of trouble. Thank you very much!
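
Since the fix came down to the two hive-site.xml files having different contents, one quick way to see the difference is to list the property names that appear in one file but not in the other. The following is a minimal shell sketch, assuming one <name> element per line. `list_property_names` and `only_in_first` are hypothetical helper names.

```shell
# Hypothetical helper: print the sorted property names found in a hive-site.xml.
# Assumes each <name>...</name> element is on a single line.
list_property_names() {
  sed -n 's/.*<name>\(.*\)<\/name>.*/\1/p' "$1" | sort
}

# Hypothetical helper: print property names present in the first file
# but missing from the second.
only_in_first() {
  first_tmp=$(mktemp)
  second_tmp=$(mktemp)
  list_property_names "$1" > "$first_tmp"
  list_property_names "$2" > "$second_tmp"
  comm -23 "$first_tmp" "$second_tmp"   # lines unique to the first list
  rm -f "$first_tmp" "$second_tmp"
}

# Example call with the two paths from this thread:
# only_in_first /etc/hive/conf/hive-site.xml /usr/hdp/current/spark-client/conf/hive-site.xml
```

This compares property names only, so a property that is set to different values in both files will not show up.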
