Created 07-14-2016 03:47 AM
Hi,
We have installed HDP-2.4.0.0. As per the requirement, I need to configure an Oozie job with a Spark action.
I have written the following code.
Workflow.xml:
<?xml version="1.0"?>
<workflow-app name="${OOZIE_WF_NAME}" xmlns="uri:oozie:workflow:0.5">
    <global>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/usr/hdp/2.4.0.0-169/spark/</value>
            </property>
        </configuration>
    </global>
    <start to="spark-mongo-ETL"/>
    <action name="spark-mongo-ETL">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>SparkMongoLoading</name>
            <class>com.SparkSqlExample</class>
            <jar>${nameNode}${WORKFLOW_HOME}/lib/SparkParquetExample-0.0.1-SNAPSHOT.jar</jar>
        </spark>
        <ok to="End"/>
        <error to="killAction"/>
    </action>
    <kill name="killAction">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>
Job.properties:
nameNode=hdfs://nameNode1:8020
jobTracker=yarnNM:8050
queueName=default
user.name=hadoop
oozie.libpath=/user/oozie/share/lib/
oozie.use.system.libpath=true
WORKFLOW_HOME=/user/hadoop/SparkETL
OOZIE_WF_NAME=Spark-Mongo-ETL-wf
SPARK_MONGO_JAR=${nameNode}${WORKFLOW_HOME}/lib/SparkParquetExample-0.0.1-SNAPSHOT.jar
oozie.wf.application.path=${nameNode}/user/hadoop/SparkETL/
Two jars are placed under the lib folder:
SparkParquetExample-0.0.1-SNAPSHOT.jar
spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
When I submit the Oozie job, the action is killed.
Error:
Error: java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
	at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:217)
	at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2624)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2634)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
	at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:342)
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:270)
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:432)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164)
	at org.apache.hadoop.mapred.YarnChild.configureLocalDirs(YarnChild.java:256)
	at org.apache.hadoop.mapred.YarnChild.configureTask(YarnChild.java:314)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:146)
Also, please let me know how to pass jars and files explicitly in the workflow.
Command:
spark-submit --class com.SparkSqlExample --master yarn-cluster \
  --num-executors 2 --driver-memory 1g --executor-memory 2g --executor-cores 2 \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/jackson-core-2.4.4.jar,/usr/hdp/current/spark-client/lib/mongo-hadoop-spark-1.5.2.jar,/usr/share/java/slf4j-simple-1.7.5.jar,/usr/hdp/current/spark-client/lib/spark-core_2.10-1.6.0.jar,/usr/hdp/current/spark-client/lib/spark-hive_2.10-1.6.0.jar,/usr/hdp/current/spark-client/lib/spark-sql_2.10-1.6.0.jar,/usr/hdp/current/spark-client/lib/mongo-hadoop-core-1.5.2.jar,/usr/hdp/current/spark-client/lib/spark-avro_2.10-2.0.1.jar,/usr/hdp/current/spark-client/lib/spark-csv_2.10-1.4.0.jar,/usr/hdp/current/spark-client/lib/spark-mongodb_2.10-0.11.2.jar,/usr/hdp/current/spark-client/lib/spark-streaming_2.10-1.6.0.jar,/usr/hdp/current/spark-client/lib/commons-csv-1.1.jar,/usr/hdp/current/spark-client/lib/mongodb-driver-3.2.2.jar,/usr/hdp/current/spark-client/lib/mongo-hadoop-master-1.5.2.jar,/usr/hdp/current/spark-client/lib/mongo-java-driver-3.2.2.jar,/usr/hdp/current/spark-client/lib/spark-1.6.0.2.4.0.0-169-yarn-shuffle.jar \
  --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar \
  --conf spark.yarn.executor.memoryOverhead=512 \
  /home/hadoop/SparkParquetExample-0.0.1-SNAPSHOT.jar
The above command executes successfully.
Can anyone suggest a solution?
Created 07-14-2016 07:21 AM
I don't know where the TFS bit comes from; it is probably a dependency problem.
To include all dependencies in the workflow, I would recommend building a fat jar (assembly). In Scala with sbt, you can see the idea in "Creating fat jars with sbt"; the same works with Maven's "maven-assembly-plugin". You should be able to call your code as:
spark-submit --master yarn-cluster \
  --num-executors 2 --driver-memory 1g --executor-memory 2g --executor-cores 2 \
  --class com.SparkSqlExample \
  /home/hadoop/SparkParquetExample-0.0.1-SNAPSHOT-with-dependencies.jar
If this works, the jar with dependencies should be the one referenced in the Oozie Spark action.
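Regarding your question about passing jars and files explicitly: the Spark action schema (uri:oozie:spark-action:0.1) also accepts a <spark-opts> element where the usual spark-submit flags can go. A rough, untested sketch; the HDFS paths below are only placeholders and would need to point at where you actually upload the files:
<action name="spark-mongo-ETL">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>SparkMongoLoading</name>
        <class>com.SparkSqlExample</class>
        <jar>${SPARK_MONGO_JAR}</jar>
        <!-- placeholder HDFS paths; adjust to the actual locations in your workflow directory -->
        <spark-opts>--files ${nameNode}${WORKFLOW_HOME}/conf/hive-site.xml --jars ${nameNode}${WORKFLOW_HOME}/lib/mongo-hadoop-spark-1.5.2.jar,${nameNode}${WORKFLOW_HOME}/lib/mongo-java-driver-3.2.2.jar --conf spark.yarn.executor.memoryOverhead=512</spark-opts>
    </spark>
    <ok to="End"/>
    <error to="killAction"/>
</action>
Note that jars placed in the workflow's lib/ directory should already be shipped with the action automatically, so only files that live elsewhere would need to be listed this way.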
Created 07-14-2016 06:15 PM
Hi @Bernhard Walter,
Thanks for the reply!
I have followed your suggestion, but it is now throwing a different error.
Please help me.
diagnostics: Application application_1468279065782_0300 failed 2 times due to AM Container for appattempt_1468279065782_0300_000002 exited with exitCode: -1000
For more detailed output, check application tracking page: http://yarnNM:8088/cluster/app/application_1468279065782_0300 Then, click on links to logs of each attempt.
Diagnostics: Permission denied: user=hadoop, access=EXECUTE, inode="/user/yarn/.sparkStaging/application_1468279065782_0300/__spark_conf__1316069581048982381.zip":yarn:yarn:drwx------
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3866)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1076)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Created 07-15-2016 11:44 AM
It looks like you are executing the job as user hadoop; however, Spark wants to access staging data from /user/yarn (which can only be accessed by yarn). How did you start the job, and with which user?
I am surprised that Spark uses /user/yarn as the staging dir for user hadoop. Is there any staging dir configuration in your system (SPARK_YARN_STAGING_DIR)?
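As a quick check (while an attempt is still staged), listing the staging directory as the same user, e.g.
hdfs dfs -ls /user/yarn/.sparkStaging/application_1468279065782_0300
should fail with the same AccessControlException for user hadoop, given the yarn:yarn drwx------ ownership shown in your diagnostics.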
Created 07-14-2016 09:09 PM
Hi @Bernhard Walter,
In spite of creating the fat jar, the error below also occurred:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.apache.spark.util.Utils$.DEFAULT_DRIVER_MEM_MB()I
java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.DEFAULT_DRIVER_MEM_MB()I
	at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:49)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1120)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
	at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:104)
	at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:95)
	at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
	at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:38)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:241)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Created 07-15-2016 11:46 AM