Created on 09-14-2016 02:59 AM
Running Spark Jobs Through Apache Beam on an HDP 2.5 YARN Cluster Using the Spark Runner
Apache Beam is still an incubating Apache project and is not officially supported on HDP 2.5 or any other platform.
sudo yum -y install git
wget http://www.gtlib.gatech.edu/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
After downloading Maven, extract it and move it to /opt/demo/maven, or put it somewhere on your path. Maven download mirrors change over time, so grab a current URL from http://maven.apache.org/.
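A minimal sketch of the extract-and-move step, assuming the tarball landed in your current directory and /opt/demo is the install location used throughout this article:

# unpack the Maven tarball and move it to the path used in later commands
mkdir -p /opt/demo
tar -xzf apache-maven-3.3.9-bin.tar.gz
mv apache-maven-3.3.9 /opt/demo/maven
# sanity check: prints the Maven version and the JVM it found
/opt/demo/maven/bin/mvn -version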
Installing Maven through yum would give you an older, unsupported version that may interfere with other packages, so I recommend a separate Maven install just for this build. Make sure you have Java 7 or greater, which you should already have on an HDP node; I recommend Java 8 on your new HDP 2.5 nodes if possible.
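A quick check before building (the first line of the output should show 1.7 or 1.8):

# verify the JDK version on the build node
java -version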
cd /opt/demo/
git clone https://github.com/apache/incubator-beam
cd incubator-beam
/opt/demo/maven/bin/mvn clean install -DskipTests
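When the build finishes, it is worth confirming that the Spark runner app jar referenced by the spark-submit command below actually exists. A sketch; the version embedded in the jar name may differ from 0.3.0-incubating-SNAPSHOT depending on when you cloned:

# the spark-submit example below references this jar relative to runners/spark
ls /opt/demo/incubator-beam/runners/spark/target/*spark-app.jar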
If you want to run this on Spark 2.0 instead of Spark 1.6.2, see the Hortonworks guide on changing the Spark environment: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/spark-choo...
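On HDP 2.5 the active Spark version is chosen per session with an environment variable, so switching the run below over to Spark 2.0 would look roughly like this (a sketch based on the linked guide):

# select Spark 2.0 for any spark-submit calls made from this shell
export SPARK_MAJOR_VERSION=2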
For HDP 2.5, these are the parameters:
spark-submit \
  --class org.apache.beam.runners.spark.examples.WordCount \
  --master yarn-client \
  target/beam-runners-spark-0.3.0-incubating-SNAPSHOT-spark-app.jar \
  --inputFile=kinglear.txt \
  --output=out \
  --runner=SparkRunner \
  --sparkMaster=yarn-client
Note that I had to change the parameters to get this working in my environment. You may also need to run /opt/demo/maven/bin/mvn package from the /opt/demo/incubator-beam/runners/spark directory. This runs a Java 7 example from the built-in examples: https://github.com/apache/incubator-beam/tree/master/examples/java
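The job reads kinglear.txt from the submitting user's HDFS home directory, so the input has to be staged there first. A minimal sketch, assuming you already have a local copy of the text:

# upload the input text into your HDFS home directory
hdfs dfs -put kinglear.txt .
# confirm it arrived
hdfs dfs -ls kinglear.txt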
These are the results of running our small Spark job.
16/09/14 02:35:08 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 34.0 KB, free 518.7 KB)
16/09/14 02:35:08 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.26.195.58:39575 (size: 34.0 KB, free: 511.1 MB)
16/09/14 02:35:08 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
16/09/14 02:35:08 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[14] at mapToPair at TransformTranslator.java:568)
16/09/14 02:35:08 INFO YarnScheduler: Adding task set 1.0 with 2 tasks
16/09/14 02:35:08 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, tspanndev13.field.hortonworks.com, partition 0,NODE_LOCAL, 1994 bytes)
16/09/14 02:35:08 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, tspanndev13.field.hortonworks.com, partition 1,NODE_LOCAL, 1994 bytes)
16/09/14 02:35:08 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on tspanndev13.field.hortonworks.com:36438 (size: 34.0 KB, free: 511.1 MB)
16/09/14 02:35:08 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on tspanndev13.field.hortonworks.com:36301 (size: 34.0 KB, free: 511.1 MB)
16/09/14 02:35:08 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to tspanndev13.field.hortonworks.com:52646
16/09/14 02:35:08 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 177 bytes
16/09/14 02:35:08 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to tspanndev13.field.hortonworks.com:52640
16/09/14 02:35:09 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 681 ms on tspanndev13.field.hortonworks.com (1/2)
16/09/14 02:35:09 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1112 ms on tspanndev13.field.hortonworks.com (2/2)
16/09/14 02:35:09 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/14 02:35:09 INFO DAGScheduler: ResultStage 1 (saveAsNewAPIHadoopFile at TransformTranslator.java:745) finished in 1.113 s
16/09/14 02:35:09 INFO DAGScheduler: Job 0 finished: saveAsNewAPIHadoopFile at TransformTranslator.java:745, took 5.422285 s
16/09/14 02:35:09 INFO SparkRunner: Pipeline execution complete.
16/09/14 02:35:09 INFO SparkContext: Invoking stop() from shutdown hook
[root@tspanndev13 spark]# hdfs dfs -ls
Found 5 items
drwxr-xr-x   - root hdfs          0 2016-09-14 02:35 .sparkStaging
-rw-r--r--   3 root hdfs          0 2016-09-14 02:35 _SUCCESS
-rw-r--r--   3 root hdfs     185965 2016-09-14 01:44 kinglear.txt
-rw-r--r--   3 root hdfs      27304 2016-09-14 02:35 out-00000-of-00002
-rw-r--r--   3 root hdfs      26515 2016-09-14 02:35 out-00001-of-00002
[root@tspanndev13 spark]# hdfs dfs -cat out-00000-of-00002
oaths: 1
bed: 7
hearted: 5
warranties: 1
Refund: 1
unnaturalness: 1
sea: 7
sham'd: 1
Only: 2
sleep: 8
sister: 29
Another: 2
carbuncle: 1
As expected, the job produced a two-part output file in HDFS containing the word counts.
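If you want the counts as a single local file, the two shards can be merged straight out of HDFS. A sketch using getmerge, which concatenates the matching parts into one local file:

# pull both output shards down into a single local file
hdfs dfs -getmerge out-0000* wordcounts.txt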
Not much configuration is required to run Apache Beam Java jobs on an HDP 2.5 YARN Spark cluster, so if you have a development cluster, it would be a great place to try this out. Or try it on your own HDP 2.5 sandbox.
Resources:
http://beam.incubator.apache.org/learn/programming-guide/
https://github.com/apache/incubator-beam/tree/master/runners/spark