Created 12-10-2015 08:27 AM
I have a Spark (version 1.4.1) application on HDP 2.3.2. It works fine when running in YARN-Client mode. However, when running in YARN-Cluster mode, none of my Hive tables can be found by the application.
I submit the application like so:
./bin/spark-submit \
  --class com.myCompany.Main \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 10g \
  --executor-cores 1 \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /home/spark/apps/YarnClusterTest.jar \
  --files /etc/hive/conf/hive-site.xml
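(Note: spark-submit treats everything after the application JAR as arguments to the application itself, so the trailing --files above is likely being handed to com.myCompany.Main rather than to spark-submit. A sketch of the general form, which you can confirm against the usage text on your install:)

  # spark-submit's general form: options must come before the application JAR;
  # anything after the JAR is forwarded to the application's main class:
  #   ./bin/spark-submit [options] <app-jar> [app arguments]
  # Check the exact usage on your install with:
  ./bin/spark-submit --help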
Here's an excerpt from the logs:
15/12/02 11:05:13 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/12/02 11:05:14 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/12/02 11:05:14 INFO metastore.ObjectStore: ObjectStore, initialize called
15/12/02 11:05:14 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/12/02 11:05:14 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/12/02 11:05:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker2.xxx.com:34697 with 5.2 GB RAM, BlockManagerId(1, worker2.xxx.com, 34697)
15/12/02 11:05:16 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/12/02 11:05:16 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/12/02 11:05:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/02 11:05:18 INFO metastore.ObjectStore: Initialized ObjectStore
15/12/02 11:05:19 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
15/12/02 11:05:19 INFO metastore.HiveMetaStore: Added admin role in metastore
15/12/02 11:05:19 INFO metastore.HiveMetaStore: Added public role in metastore
15/12/02 11:05:19 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/12/02 11:05:19 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/12/02 11:05:19 INFO parse.ParseDriver: Parsing command: SELECT * FROM streamsummary
15/12/02 11:05:20 INFO parse.ParseDriver: Parse Completed
15/12/02 11:05:20 INFO hive.HiveContext: Initializing HiveMetastoreConnection version 0.13.1 using Spark classes.
15/12/02 11:05:20 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=streamsummary
15/12/02 11:05:20 INFO HiveMetaStore.audit: ugi=spark ip=unknown-ip-addr cmd=get_table : db=default tbl=streamsummary
15/12/02 11:05:20 DEBUG myCompany.Main$: no such table streamsummary; line 1 pos 14
I basically run into the same 'no such table' problem any time my application needs to read from or write to Hive tables.
Thanks in advance!
UPDATE:
I tried submitting the Spark application with the --files parameter placed before --jars, as per @Guilherme Braccialli's suggestion, but doing so now gives me an exception saying that the HiveMetaStoreClient could not be instantiated.
spark-submit:
./bin/spark-submit \
  --class com.myCompany.Main \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 1g \
  --executor-memory 11g \
  --executor-cores 1 \
  --files /etc/hive/conf/hive-site.xml \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /home/spark/apps/YarnClusterTest.jar
code:
// core.scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

trait Core extends java.io.Serializable {
  /** This trait should be mixed in by every other class or trait that depends on `sc`. */
  val sc: SparkContext
  lazy val sqlContext = new HiveContext(sc)
}

// yarncore.scala
import org.apache.spark.{SparkConf, SparkContext}

trait YarnCore extends Core {
  /** This trait initializes the SparkContext with YARN as the master. */
  val conf = new SparkConf().setAppName("my app").setMaster("yarn-cluster")
  val sc = new SparkContext(conf)
}

// main.scala
import org.apache.log4j.Logger

object Test {
  def main(args: Array[String]) {
    /** Initialize the spark application. */
    val app = new YarnCore // initializes the SparkContext in YARN mode
      with sqlHelper       // provides SQL functionality
      with Transformer     // provides UDFs for transforming the dataframes into the marts

    /** Initialize the logger. */
    val log = Logger.getLogger(getClass.getName)

    // Collect the single COUNT(*) row and pull out the value.
    val count = app.sqlContext.sql("SELECT COUNT(*) FROM streamsummary").collect().head.getLong(0)
    log.info(s"streamsummary has ${count} records")

    /** Shut down the spark app. */
    app.sc.stop
  }
}
exception:
15/12/11 09:34:55 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/12/11 09:34:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:117)
    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:165)
    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:163)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:170)
    at com.epldt.core.Core$class.sqlContext(core.scala:13)
    at com.epldt.Test$anon$1.sqlContext$lzycompute(main.scala:17)
    at com.epldt.Test$anon$1.sqlContext(main.scala:17)
    at com.epldt.Test$.main(main.scala:26)
    at com.epldt.Test.main(main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$anon$2.run(ApplicationMaster.scala:486)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
    ... 19 more
Caused by: java.lang.NumberFormatException: For input string: "1800s"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1258)
    at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1211)
    at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1220)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:293)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:214)
    ... 24 more
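The last Caused by points at the actual problem: some value of "1800s" in the hive-site.xml being shipped. HDP 2.3's Hive (1.2.x) accepts time-unit suffixes like "1800s", but the Hive 0.13.1 classes bundled with Spark 1.4 parse the value as a plain integer and fail. A quick way to spot such values in the file being shipped (the property name is an inference from HiveMetaStoreClient.open in the trace, and the path is the one from the command above; adjust as needed):

  # Look for time-suffixed values that Hive 0.13.1 cannot parse as integers:
  grep -B2 '1800s' /etc/hive/conf/hive-site.xml
  # or check the metastore socket-timeout property specifically:
  grep -A1 'hive.metastore.client.socket.timeout' /etc/hive/conf/hive-site.xml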
Created 12-10-2015 08:34 AM
log.txt: uploading a copy of the log excerpt as a text file because it won't format properly in the post.
Created 12-10-2015 12:10 PM
Do you have Kerberos enabled on this cluster? Also, are you using HDP 2.3.0 or HDP 2.3.2?
Created 12-10-2015 12:21 PM
Could you share the code from the com.myCompany.Main class?
Created 12-10-2015 12:46 PM
I did a few tests and I think you just need to change the location of --files: it must come before your .jar file.
You can find my sample class in the project here:
https://github.com/gbraccialli/SparkUtils
Sample spark-submit with Hive commands as parameters:
git clone https://github.com/gbraccialli/SparkUtils
cd SparkUtils/
mvn clean package
spark-submit \
  --class com.github.gbraccialli.spark.HiveCommand \
  --master yarn-cluster \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
  target/SparkUtils-1.0.0-SNAPSHOT.jar "show tables" "select * from sample_08"
Created 12-11-2015 01:52 AM
Thanks for your reply. I tried your suggestion of putting the --files parameter before --jars when submitting, but now I'm running into an exception saying the HiveMetaStoreClient could not be instantiated. I'll update my post with the code and the new stack trace.
Created 12-11-2015 01:55 AM
It worked for me. Can you check the content of the /usr/hdp/current/spark-client/conf/hive-site.xml you are using?
mine is like this:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>
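As a quick sanity check, you can confirm the metastore URI from hive-site.xml is reachable. A minimal sketch, assuming netcat is installed (hostname and port taken from the snippet above; substitute your own):

  # Probe the Hive metastore thrift port from the node that runs the driver:
  nc -z -w 5 sandbox.hortonworks.com 9083 && echo "metastore reachable"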
Created 12-11-2015 02:20 AM
I just want to start by thanking you for your quick responses. I've been struggling with this problem for a while now, and I've actually also asked this on Stack Overflow, but no luck.
As for /usr/hdp/current/spark-client/conf/hive-site.xml, the content is pretty much the same as yours:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://host.xxx.com:9083</value>
  </property>
</configuration>
Created 12-11-2015 02:24 AM
Check your command: you are using /etc/hive/conf/hive-site.xml instead of /usr/hdp/current/spark-client/conf/hive-site.xml.
I think this is the issue.
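A minimal sketch of the check and the corrected submission (paths taken from the commands earlier in this thread; adjust to your installation):

  # On HDP the two copies can differ; the Hive service copy may carry settings
  # (e.g. time-unit values like "1800s") that Spark's bundled Hive 0.13.1 cannot parse:
  diff /etc/hive/conf/hive-site.xml /usr/hdp/current/spark-client/conf/hive-site.xml

  # Resubmit shipping the Spark client's copy instead:
  ./bin/spark-submit \
    --class com.myCompany.Main \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 1g \
    --executor-memory 11g \
    --executor-cores 1 \
    --files /usr/hdp/current/spark-client/conf/hive-site.xml \
    --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
    /home/spark/apps/YarnClusterTest.jar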
Created 12-11-2015 02:35 AM
@Guilherme Braccialli That did the trick! 😃 I didn't notice that at first. I wasn't the one who set up our cluster, so I had no idea that the contents of those two files were different. It's a subtle thing, but I had a lot of trouble just because of it. Thank you very much!