Created on 12-22-2016 04:17 AM - edited 09-16-2022 03:52 AM
We are trying to run the Data Science team's Spark JAR on our Hortonworks cluster. They have a long-running Spark application which runs on Mesos in their DataStax cluster; we just need to make it run on our Hortonworks cluster with YARN as the cluster manager. We have a deploy script which starts the application and then posts the input arguments as JSON to that long-running application via HTTP POST.
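To illustrate the pattern (the host, port, path, and payload below are purely hypothetical and not the actual interface of their service; the argument names are just taken from the submit command further down), such a POST might look like:

curl -X POST http://dsc-app-host:8080/run \
  -H "Content-Type: application/json" \
  -d '{"job.type":"pipeline","pipeline":"usagemon","date.start":"2013-02-01","date.end":"2016-05-01"}'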
Created 12-22-2016 04:21 AM
Yes, you can. I found this article useful for understanding how long-running Spark jobs run on YARN.
http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
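To give a flavour of what that involves: the job is submitted in yarn-cluster mode, and the YARN re-attempt settings usually need tuning so the application survives driver failures. A minimal sketch (the class name, jar name, and values here are placeholders, not recommendations):

spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --class com.example.StreamingApp \
  streaming-app-assembly.jar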
Created 12-22-2016 10:28 AM
Thanks Sunile,
This article is really informative. For now we are trying to run this Spark application as a single-run service instead of as a web service, and we are going to pass all the arguments on the command line. We have actually found a different entry point to this service and are now trying to run it through the command line. Since I'm new to Scala and Spark, I'm finding it difficult to start this application on my Hortonworks Sandbox. I've already set up the required JAR files and am able to submit the application to Spark2 on Hortonworks 2.5, but I'm stuck at the point where I have to run it on YARN in cluster mode. I'm getting the error below. I've kept all the required JARs under the root directory from which I'm running the command. Please note that I'm able to execute this in local mode, but I don't know what the issue is with YARN cluster mode.
16/12/22 09:25:40 ERROR ApplicationMaster: User class threw exception: java.sql.SQLException: No suitable driver
java.sql.SQLException: No suitable driver
The command I'm using is given below:
su root --command "/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --class com.kronos.research.svc.datascience.DataScienceApp --verbose --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 2g --executor-cores 1 --jars ./dst-svc-reporting-assembly-0.0.1-SNAPSHOT-deps.jar --conf spark.scheduler.mode=FIFO --conf spark.cassandra.output.concurrent.writes=5 --conf spark.cassandra.output.batch.size.bytes=4096 --conf spark.cassandra.output.consistency.level=ALL --conf spark.cassandra.input.consistency.level=LOCAL_ONE --conf spark.executor.extraClassPath=sqljdbc4.jar:ojdbc6.jar --conf spark.executor.extraJavaOptions=\"-Duser.timezone=UTC \" --driver-class-path sqljdbc4.jar:ojdbc6.jar --driver-java-options \"-Dspark.ui.port=0 -Doracle.jdbc.timezoneAsRegion=false -Dspark.cassandra.connection.host=catalyst-1.int.kronos.com,catalyst-2.int.kronos.com,catalyst-3.int.kronos.com -Duser.timezone=UTC\" /root/dst-svc-reporting-assembly-0.0.1-SNAPSHOT.jar -job.type=pipeline -pipeline=usagemon -jdbc.type=tenant -jdbc.tenant=10013 -date.start=\"2013-02-01\" -date.end=\"2016-05-01\" -labor.level=4 -output.type=console -jdbc.throttle=15 -jdbc.cache.dir=hdfs://sandbox.hortonworks.com:8020/etl"
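One thing worth checking for this particular error: in yarn-cluster mode the driver runs inside the YARN application master on a cluster node, so --driver-class-path entries such as sqljdbc4.jar only resolve if those files actually exist at that path on that node. A sketch of the relevant flags with the JDBC driver jars shipped via --jars instead (jar names taken from the command above, all other options left as they are):

/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --master yarn --deploy-mode cluster \
  --jars ./dst-svc-reporting-assembly-0.0.1-SNAPSHOT-deps.jar,./sqljdbc4.jar,./ojdbc6.jar \
  --conf spark.driver.extraClassPath=sqljdbc4.jar:ojdbc6.jar \
  --conf spark.executor.extraClassPath=sqljdbc4.jar:ojdbc6.jar \
  ... /root/dst-svc-reporting-assembly-0.0.1-SNAPSHOT.jar ...

With --jars, YARN localizes the files into each container's working directory, which is why the bare file names in the extraClassPath settings can resolve on the cluster nodes.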
Created 12-22-2016 09:32 PM
Zeppelin is a good example of one.
Created 12-22-2016 10:16 PM
Without the full exception stack trace, it's difficult to know what happened.
If you are instantiating Hive, then you may need to add hive-site.xml and the DataNucleus jars to the job, e.g.:
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --files /usr/hdp/current/spark-client/conf/hive-site.xml