Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Spark - Hive tables not found when running in YARN-Cluster mode

Expert Contributor

I have a Spark (version 1.4.1) application on HDP 2.3.2. It works fine when running in yarn-client mode. However, when running in yarn-cluster mode, none of my Hive tables can be found by the application.

I submit the application like so:

  ./bin/spark-submit \
  --class com.myCompany.Main \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 10g \
  --executor-cores 1 \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /home/spark/apps/YarnClusterTest.jar \
  --files /etc/hive/conf/hive-site.xml

Here's an excerpt from the logs:

15/12/02 11:05:13 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/12/02 11:05:14 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/12/02 11:05:14 INFO metastore.ObjectStore: ObjectStore, initialize called
15/12/02 11:05:14 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/12/02 11:05:14 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/12/02 11:05:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker2.xxx.com:34697 with 5.2 GB RAM, BlockManagerId(1, worker2.xxx.com, 34697)
15/12/02 11:05:16 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/12/02 11:05:16 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
15/12/02 11:05:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO metastore.ObjectStore: Initialized ObjectStore
15/12/02 11:05:19 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
15/12/02 11:05:19 INFO metastore.HiveMetaStore: Added admin role in metastore
15/12/02 11:05:19 INFO metastore.HiveMetaStore: Added public role in metastore
15/12/02 11:05:19 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/12/02 11:05:19 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/12/02 11:05:19 INFO parse.ParseDriver: Parsing command: SELECT * FROM streamsummary
15/12/02 11:05:20 INFO parse.ParseDriver: Parse Completed
15/12/02 11:05:20 INFO hive.HiveContext: Initializing HiveMetastoreConnection version 0.13.1 using Spark classes.
15/12/02 11:05:20 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=streamsummary
15/12/02 11:05:20 INFO HiveMetaStore.audit: ugi=spark  ip=unknown-ip-addr  cmd=get_table : db=default tbl=streamsummary   
15/12/02 11:05:20 DEBUG myCompany.Main$: no such table streamsummary; line 1 pos 14

I basically run into the same 'no such table' problem any time my application needs to read from or write to a Hive table.
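One way to confirm where the lookup is going, as a sketch (assuming Spark 1.4's `SQLContext.getConf(key, default)` is available on the `HiveContext`), is to print the metastore URI from inside the application:

```scala
// Sketch: print which metastore the driver-side HiveContext resolved.
// If hive-site.xml never reached the YARN application master,
// hive.metastore.uris comes back unset and Spark falls back to a
// local Derby metastore, which would explain the missing tables.
val uris = sqlContext.getConf("hive.metastore.uris", "<not set>")
println(s"hive.metastore.uris = $uris")
sqlContext.sql("SHOW TABLES").collect().foreach(println)
```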

Thanks in advance!

UPDATE:

I tried submitting the spark application with the --files parameter provided before --jars as per @Guilherme Braccialli's suggestion, but doing so now gives me an exception saying that the HiveMetastoreClient could not be instantiated.

spark-submit:

  ./bin/spark-submit \
  --class com.myCompany.Main \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 1g \
  --executor-memory 11g \
  --executor-cores 1 \
  --files /etc/hive/conf/hive-site.xml \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /home/spark/apps/YarnClusterTest.jar

code:

// core.scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

/** This trait should be mixed in by every other class or trait that
 *  depends on `sc`.
 */
trait Core extends java.io.Serializable {
  val sc: SparkContext
  lazy val sqlContext = new HiveContext(sc)
}

// yarncore.scala
import org.apache.spark.{SparkConf, SparkContext}

/** This trait initializes the SparkContext with YARN as the master. */
trait YarnCore extends Core {
  val conf = new SparkConf().setAppName("my app").setMaster("yarn-cluster")
  val sc = new SparkContext(conf)
}

// main.scala
object Test {
  def main(args: Array[String]): Unit = {

    /** Initialize the Spark application. */
    val app = new YarnCore   // initializes the SparkContext in YARN mode
      with sqlHelper         // provides SQL functionality
      with Transformer       // provides UDFs for transforming the dataframes into the marts

    /** Initialize the logger. */
    val log = Logger.getLogger(getClass.getName)

    val count = app.sqlContext.sql("SELECT COUNT(*) FROM streamsummary").collect().head.getLong(0)

    log.info(s"streamsummary has ${count} records")

    /** Shut down the Spark app. */
    app.sc.stop()
  }
}

exception:

15/12/11 09:34:55 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/12/11 09:34:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:117)
	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:165)
	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:163)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:170)
	at com.epldt.core.Core$class.sqlContext(core.scala:13)
	at com.epldt.Test$anon$1.sqlContext$lzycompute(main.scala:17)
	at com.epldt.Test$anon$1.sqlContext(main.scala:17)
	at com.epldt.Test$.main(main.scala:26)
	at com.epldt.Test.main(main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.yarn.ApplicationMaster$anon$2.run(ApplicationMaster.scala:486)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
	... 14 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
	... 19 more
Caused by: java.lang.NumberFormatException: For input string: "1800s"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1258)
	at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1211)
	at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1220)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:293)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:214)
	... 24 more
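The NumberFormatException at the bottom points at the root cause: a Hive config value with a time-unit suffix, such as `hive.metastore.client.socket.timeout=1800s` (the exact property name is an assumption here, but the `HiveMetaStoreClient.open` frame in the trace is consistent with it). Newer Hive releases accept such suffixes, but the Hive 0.13 classes bundled with Spark 1.4 read the value through Hadoop's `Configuration.getInt`, which does not. A minimal sketch of the failure mode:

```scala
// Minimal reproduction: Configuration.getInt (and hence Hive 0.13's
// HiveConf.getIntVar) ultimately parses the raw string with
// Integer.parseInt, which rejects a "1800s"-style value.
try {
  Integer.parseInt("1800s")
} catch {
  case e: NumberFormatException =>
    println(e.getMessage) // For input string: "1800s"
}
```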

1 ACCEPTED SOLUTION


@Luis Antonio Torres

I did a few tests, and I think you just need to change the location of --files: it must come before your .jar file.
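The reason the order matters: spark-submit treats everything after the application jar as arguments to the application's own main method, so a --files flag placed after the jar is never seen by spark-submit itself. The general form is:

```shell
# All spark-submit options must come before the application jar;
# anything after the jar is passed straight to the app's main().
./bin/spark-submit [options] <app-jar> [app arguments]
```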

Find my sample class here:

https://github.com/gbraccialli/SparkUtils/blob/master/src/main/scala/com/github/gbraccialli/spark/Hi...

Project is here:

https://github.com/gbraccialli/SparkUtils

Sample spark-submit with hive commands as parameter:

git clone https://github.com/gbraccialli/SparkUtils
cd SparkUtils/
mvn clean package
spark-submit \
  --class com.github.gbraccialli.spark.HiveCommand \
  --master yarn-cluster \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
 target/SparkUtils-1.0.0-SNAPSHOT.jar "show tables" "select * from sample_08"


16 Replies

Expert Contributor

log.txt: uploading a copy of the log excerpt as a text file because it won't format properly in the post.


Do you have Kerberos enabled on this cluster? Also - are you using HDP 2.3.0 or HDP 2.3.2?


Could you share the code from the com.myCompany.Main class?


Expert Contributor

@Guilherme Braccialli

Thanks for your reply. I tried your suggestion of putting the --files parameter before --jars when submitting, but now I'm running into an exception saying the HiveMetastoreClient could not be instantiated. I'll update my post with the code and new stack trace.


@Luis Antonio Torres

It worked for me. Can you check the content of the /usr/hdp/current/spark-client/conf/hive-site.xml you are using?

mine is like this:

  <configuration>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://sandbox.hortonworks.com:9083</value>
    </property>
  </configuration>

Expert Contributor

@Guilherme Braccialli

I just want to start by thanking you for your quick responses. I've been struggling with this problem for a while now; I've actually also asked about it on Stack Overflow, but no luck.

As for /usr/hdp/current/spark-client/conf/hive-site.xml, the content is pretty much the same as yours:

  <configuration>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://host.xxx.com:9083</value>
    </property>
  </configuration>

@Luis Antonio Torres

Check your command: you are using /etc/hive/conf/hive-site.xml instead of /usr/hdp/current/spark-client/conf/hive-site.xml.

I think this is the issue.
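A quick way to spot such a mismatch (a sketch, assuming both files exist on the host you submit from):

```shell
# Compare the two candidate hive-site.xml files; any differing
# properties (e.g. timeout values) will show up in the diff.
diff /etc/hive/conf/hive-site.xml /usr/hdp/current/spark-client/conf/hive-site.xml
```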

Expert Contributor

@Guilherme Braccialli That did the trick! 😃 I didn't notice that at first. I wasn't the one who set up our cluster, so I had no idea the contents of those two files were different. It's a subtle thing, but it caused me a lot of trouble. Thank you very much!