Member since: 10-01-2015
Posts: 3933
Kudos Received: 1150
Solutions: 374

My Accepted Solutions
Title | Views | Posted
---|---|---
| 3364 | 05-03-2017 05:13 PM
| 2796 | 05-02-2017 08:38 AM
| 3072 | 05-02-2017 08:13 AM
| 3003 | 04-10-2017 10:51 PM
| 1514 | 03-28-2017 02:27 AM
02-17-2017
12:49 PM
1 Kudo
Have you looked at the free-form query option in Sqoop? https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_free_form_query_imports I would suggest saving the result of a complex query in a database table and sqooping that as one dataset. There is, however, a note on free-form queries in the docs:

Note: The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.
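To illustrate, a minimal free-form import might look like the command below (the table and column names are made up for the example; note the mandatory $CONDITIONS token and the --split-by column Sqoop needs to parallelize the query):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --query 'SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON o.cust_id = c.id WHERE $CONDITIONS' \
  --split-by o.id \
  --target-dir /user/etl/orders_import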
02-16-2017
09:07 PM
1 Kudo
@Erik Putrycz Additionally, I added a tutorial here: https://community.hortonworks.com/articles/84071/apache-ambari-workflow-manager-view-for-apache-ooz-2.html
02-16-2017
08:01 PM
5 Kudos
Part 1: https://community.hortonworks.com/articles/82964/getting-started-with-apache-ambari-workflow-design.html
Part 2: https://community.hortonworks.com/articles/82967/apache-ambari-workflow-designer-view-for-apache-oo.html
Part 3: https://community.hortonworks.com/articles/82988/apache-ambari-workflow-designer-view-for-apache-oo-1.html
Part 4: https://community.hortonworks.com/articles/83051/apache-ambari-workflow-designer-view-for-apache-oo-2.html
Part 5: https://community.hortonworks.com/articles/83361/apache-ambari-workflow-manager-view-for-apache-ooz.html
Part 6: https://community.hortonworks.com/articles/83787/apache-ambari-workflow-manager-view-for-apache-ooz-1.html
Part 8: https://community.hortonworks.com/articles/84394/apache-ambari-workflow-manager-view-for-apache-ooz-3.html
Part 9: https://community.hortonworks.com/articles/85091/apache-ambari-workflow-manager-view-for-apache-ooz-4.html
Part 10: https://community.hortonworks.com/articles/85354/apache-ambari-workflow-manager-view-for-apache-ooz-5.html
Part 11: https://community.hortonworks.com/articles/85361/apache-ambari-workflow-manager-view-for-apache-ooz-6.html
Part 12: https://community.hortonworks.com/articles/131389/apache-ambari-workflow-manager-view-for-apache-ooz-7.html

Welcome back, folks! In this tutorial, I'm going to demonstrate how to easily import existing Spark workflows and execute them in WFM, as well as create your own Spark workflows.

As of today, Apache Spark 2.x is not supported in the Apache Oozie bundled with HDP. There is community work on making Spark2 run in Oozie, but it is not released yet, so I'm going to concentrate on Spark 1.6.3 today.

First things first, I'm going to import a workflow into WFM from the Oozie examples: https://github.com/apache/oozie/tree/master/examples/src/main/apps/spark

My cluster setup is:
Ambari 2.5.0
HDP 2.6
HDFS HA
RM HA
Oozie HA
Kerberos

Luckily, for a Spark action in a Kerberos environment I didn't need to add anything else (i.e., a credential). The first thing I need is the dfs.nameservices property from HDFS (Ambari > HDFS > Configs); I'm going to use that for the nameNode variable.

I'm ready to import this workflow into WFM; for the details, please review one of my earlier tutorials. I'm presented with a spark action node. Click on the spark-node and hit the gear icon to preview the properties. Let's also review any arguments for input and output, as well as RM and NameNode. Also notice the prepare step; we can choose to delete a directory if it exists. We're going to leave everything as is.

When we submit the workflow, we're going to supply the nameNode and resourceManager addresses; below are my properties. Notice that jobTracker and resourceManager both appear. Ignore jobTracker: since it was in the original wf, it was inherited; we're concerned with RM going forward. Also, the nameNode value is the dfs.nameservices property from core-site.xml, as I stated earlier.

Once the job completes, you can navigate to the output directory and see that the file was copied.

hdfs dfs -ls /user/aervits/examples/output-data/spark/
Found 3 items
-rw-r--r-- 3 aervits hdfs 0 2017-02-16 17:16 /user/aervits/examples/output-data/spark/_SUCCESS
-rw-r--r-- 3 aervits hdfs 706 2017-02-16 17:16 /user/aervits/examples/output-data/spark/part-00000
-rw-r--r-- 3 aervits hdfs 704 2017-02-16 17:16 /user/aervits/examples/output-data/spark/part-00001
In my case, the sample input was a book in the examples directory.

hdfs dfs -cat /user/aervits/examples/output-data/spark/part-00000
To be or not to be, that is the question;
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing, end them. To die, to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Next up, I'm going to demonstrate authoring a new Spark action instead of importing one. I'm following the guide at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/run-sample-apps.html#run_spark_pi to demonstrate how to add this Pi job to an Oozie workflow via WFM. First you need to create a workflow directory on HDFS along with a lib folder, then upload the Spark examples jar to that directory.

hdfs dfs -mkdir -p oozie/spark/lib
cd /usr/hdp/current/spark-client/lib
hdfs dfs -put spark-examples-1.6.3.2.6.0.0-502-hadoop2.7.3.2.6.0.0-502.jar oozie/spark/lib

Next, let's add a spark action to WFM and edit it. Fill out the properties as below and make sure to select Yarn Cluster; Yarn Client in Oozie will be deprecated soon. Notice you can pass each Spark option on its own line. I also need to add an argument to the SparkPi job, in this case 10. If you didn't figure it out already, I'm trying to recreate the following command in Oozie:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Aside from changing yarn-client to yarn-cluster, everything else is as in the command above. I'd like to preview my XML now.
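For reference, the resulting spark action should look roughly like the sketch below (a hand-written approximation, not the exact XML WFM generates; the jar path follows the workflow directory created above):

<workflow-app name="spark-pi-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-pi"/>
    <action name="spark-pi">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>SparkPi</name>
            <class>org.apache.spark.examples.SparkPi</class>
            <jar>${nameNode}/user/aervits/oozie/spark/lib/spark-examples-1.6.3.2.6.0.0-502-hadoop2.7.3.2.6.0.0-502.jar</jar>
            <spark-opts>--num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1</spark-opts>
            <arg>10</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>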
I'm ready to submit the job and run it.

Next, I'm going to demonstrate how to run a PySpark job in Oozie via WFM. The code I'm going to run is below.

from pyspark import SparkContext, SparkConf
import sys

# input and output directories are passed as command-line arguments
datain = sys.argv[1]
dataout = sys.argv[2]

conf = SparkConf().setAppName('counts_with_pyspark')
sc = SparkContext(conf=conf)

# classic word count: split lines into words, emit (word, 1) pairs, sum counts per word
text_file = sc.textFile(str(datain))
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(str(dataout))
It's taken from http://spark.apache.org/examples.html; I only added the option to pass input and output directories from the command line. I'm going to run the code to make sure it works with the following command:

/usr/hdp/current/spark-client/bin/spark-submit counts.py hdfs://mycluster/user/aervits/examples/input-data/text/ hdfs://mycluster/user/aervits/pyspark-output

This will produce output in the pyspark-output HDFS directory with a count for each instance of a word. Expected output is below.

hdfs dfs -cat pyspark-output/part-0000 | less
(u'and', 7)
(u'slings', 1)
(u'fardels', 1)
(u'mind', 1)
(u'natural', 1)
(u'sea', 1)
(u'For', 2)
(u'arrows', 1)
(u'is', 2)
(u'ills', 1)
(u'resolution', 1)
(u'merit', 1)
(u'death,', 1)
(u'say', 1)
(u'pause.', 1)
(u'bare', 1)
(u'Devoutly', 1)
Next, I'm ready to add a Spark action node to WFM and edit it by populating the properties below. Notice I'm passing the Spark options as well as yarn-cluster as the deployment mode. Then I need to configure input/output and the prepare step; I want to delete the output directory so that I can re-run my wf without manually deleting it each time. Nothing new here: I'm passing the input and output as arguments to the action. I'm ready to preview the XML.
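For reference, the pyspark version of the action might look roughly like this sketch (again approximate, not the exact XML WFM generates; the spark-opts values are illustrative):

<action name="pyspark-counts">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/user/aervits/pyspark-output"/>
        </prepare>
        <master>yarn-cluster</master>
        <name>counts_with_pyspark</name>
        <jar>counts.py</jar>
        <spark-opts>--num-executors 1 --driver-memory 512m --executor-memory 512m</spark-opts>
        <arg>${nameNode}/user/aervits/examples/input-data/text/</arg>
        <arg>${nameNode}/user/aervits/pyspark-output</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>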
The last step is to create the lib directory in the pyspark workflow directory and upload the counts.py file there.

hdfs dfs -mkdir oozie/pyspark/lib
hdfs dfs -put counts.py oozie/pyspark/lib/

Now I'm ready to submit the job, and it succeeds. As usual, you can find my code here:
https://github.com/dbist/oozie/tree/master/apps/pyspark
https://github.com/dbist/oozie/tree/master/apps/spark
02-16-2017
06:00 PM
1 Kudo
@Erik Putrycz I added a pyspark workflow example: https://github.com/dbist/oozie/tree/master/apps/pyspark It works with HDFS HA, RM HA, Oozie HA, and Kerberos.
02-16-2017
02:16 PM
@Georg Heiler we are not keen on publishing timelines. Shouldn't be too long.
02-15-2017
11:22 PM
I had a similar issue writing a Chef recipe; what I did was the following: run the steps with Ambari so that ambari.properties is populated and functional, then take that ambari.properties file, use it as a template in Chef, and add variables where the values change. I am not familiar with Puppet and don't know whether the same template concept is available, but if it is, that's one way. For reference, here's my example in Chef: https://github.com/dbist/smartsense-chef/blob/master/templates/default/hst-agent.ini.erb
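To illustrate the idea, the template swaps the values that vary for ERB variables, and the recipe renders it with a template resource (the property names and attribute paths here are made up for the example):

# templates/default/ambari.properties.erb -- values that change per environment become variables
server.jdbc.hostname=<%= @db_host %>
server.jdbc.user.name=<%= @db_user %>

# recipe: render the template into place with node attributes
template '/etc/ambari-server/conf/ambari.properties' do
  source 'ambari.properties.erb'
  variables(db_host: node['ambari']['db_host'], db_user: node['ambari']['db_user'])
end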
02-15-2017
07:54 PM
2 Kudos
Part 1: https://community.hortonworks.com/articles/82964/getting-started-with-apache-ambari-workflow-design.html
Part 2: https://community.hortonworks.com/articles/82967/apache-ambari-workflow-designer-view-for-apache-oo.html
Part 3: https://community.hortonworks.com/articles/82988/apache-ambari-workflow-designer-view-for-apache-oo-1.html
Part 4: https://community.hortonworks.com/articles/83051/apache-ambari-workflow-designer-view-for-apache-oo-2.html
Part 5: https://community.hortonworks.com/articles/83361/apache-ambari-workflow-manager-view-for-apache-ooz.html
Part 7: https://community.hortonworks.com/articles/84071/apache-ambari-workflow-manager-view-for-apache-ooz-2.html
Part 8: https://community.hortonworks.com/articles/84394/apache-ambari-workflow-manager-view-for-apache-ooz-3.html
Part 9: https://community.hortonworks.com/articles/85091/apache-ambari-workflow-manager-view-for-apache-ooz-4.html
Part 10: https://community.hortonworks.com/articles/85354/apache-ambari-workflow-manager-view-for-apache-ooz-5.html
Part 11: https://community.hortonworks.com/articles/85361/apache-ambari-workflow-manager-view-for-apache-ooz-6.html
Part 12: https://community.hortonworks.com/articles/131389/apache-ambari-workflow-manager-view-for-apache-ooz-7.html

In this tutorial, we're going to leverage Oozie's SLA monitoring features via Workflow Manager. To read more about the SLA features in Oozie, please look at the official documentation: https://oozie.apache.org/docs/4.2.0/DG_SLAMonitoring.html

We will begin with a simple shell action that sleeps for a set duration. Create a new file called script.sh and paste in the code below.

echo "start of script execution"
sleep 60
echo "end of script execution"
We're also going to create a workflow HDFS directory and upload this script to it.

hdfs dfs -mkdir oozie/shell-sla
hdfs dfs -put script.sh oozie/shell-sla/
Let's begin by adding a shell action and populating the script name and file attribute. Don't forget to check the capture-output box; we want to see the output of this action. I want to submit the workflow to make sure everything works as expected before configuring SLA features. It's a good idea to preview the XML to make sure the file and exec tags are filled correctly (see the sketch below).
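For reference, the shell action should end up looking roughly like this (hand-written, not the exact XML WFM generates; the HDFS path follows the shell-sla directory created above):

<action name="shell-sla-node">
    <shell xmlns="uri:oozie:shell-action:0.3">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>script.sh</exec>
        <file>/user/aervits/oozie/shell-sla/script.sh#script.sh</file>
        <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>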
Once the job completes, I want to drill down into the job and view the output. Everything looks good; we're ready to enable the SLA features of Oozie via Workflow Manager. Click on the shell action and then the gear icon. At the bottom of the configuration page you will see the SLA section. Expand it and check the enabled box. Each field is described below:

nominal-time: As the name suggests, this is the time relative to which your jobs' SLAs will be calculated. Generally, since Oozie workflows are aligned with synchronous data dependencies, this nominal time can be parameterized to be passed the value of your coordinator nominal time. Nominal time is also required in the case of independent workflows, where you can specify the time at which you expect the workflow to run if you don't have a synchronous dataset associated with it.

should-start: Relative to nominal-time, this is the amount of time (along with a time-unit: MINUTES, HOURS, DAYS) within which your job should start running to meet SLA. This is optional.

should-end: Relative to nominal-time, this is the amount of time (along with a time-unit: MINUTES, HOURS, DAYS) within which your job should finish to meet SLA.

max-duration: This is the maximum amount of time (along with a time-unit: MINUTES, HOURS, DAYS) your job is expected to run. This is optional.

alert-events: Specify the types of events for which email alerts should be sent. Allowable values in this comma-separated list are start_miss, end_miss and duration_miss. *_met events can generally be deemed low priority, so email alerting for these is not necessary. However, note that this setting applies only to alerts via email and not via JMS messages, where all events send out notifications and the user can filter them using desired selectors. This is optional and only applicable when alert-contact is configured.

alert-contact: Specify a comma-separated list of email addresses where you wish your alerts to be sent. This is optional and need not be configured if you just want to view your job SLA history in the UI and do not want to receive email alerts.

I'm going to simulate each one of the SLA patterns: my job started later than scheduled, my job completed outside the SLA threshold, and my job took longer to complete than expected. To fill out the nominal time, feel free to choose the date and use the clock icon below the date picker for the correct time. Click x when ready.
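With these fields filled in, the workflow XML gains an sla:info section; per the Oozie SLA documentation linked above it looks roughly like the sketch below (values illustrative):

<workflow-app name="shell-sla-wf" xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2">
    ...
    <end name="end"/>
    <sla:info>
        <sla:nominal-time>${nominalTime}</sla:nominal-time>
        <sla:should-start>${10 * MINUTES}</sla:should-start>
        <sla:should-end>${30 * MINUTES}</sla:should-end>
        <sla:max-duration>${30 * MINUTES}</sla:max-duration>
        <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
        <sla:alert-contact>email@email.com</sla:alert-contact>
    </sla:info>
</workflow-app>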
Finally, I'd like to change my script to run for 120 seconds instead of 60 to simulate a long duration. My script should look like so:

echo "start of script execution"
sleep 120
echo "end of script execution"
When ready, re-upload the script. At this point, I want to make sure sending mail from the cluster is possible and will test that by sending a sample email. Enabling mail is beyond the scope of this tutorial; I followed the procedure below, adjust as necessary for your environment.

sudo su
yum install postfix
/etc/init.d/postfix restart
exit

Now we're able to send mail from our node. Note that mail needs to work on any of the nodes where Oozie will execute a wf.

mail -s "test" email@email.com
Hit Ctrl-D, and you should get an email shortly.

Finally, there are some changes we need to implement on the Oozie side. I'm not going to enable JMS alerting and will only concentrate on the email piece; please consult the Oozie docs for the JMS part. This is HDP 2.5.3, and things may look/act differently on your Oozie instance. Let's go to Ambari > Configs and filter by the following property:

oozie.services.ext

We're going to add these services to the existing list:

org.apache.oozie.service.EventHandlerService,
org.apache.oozie.sla.service.SLAService

Once ready, add a couple more custom properties in Oozie; in my environment these properties did not exist. Add:

oozie.service.EventHandlerService.event.listeners

and the value should be:

org.apache.oozie.sla.listener.SLAJobEventListener,
org.apache.oozie.sla.listener.SLAEmailEventListener
The Oozie docs also recommend adding the following property to improve performance of event processing; we're going to add it and set its value to 15:

oozie.service.SchedulerService.threads

Once I saved the changes and restarted Oozie, it failed to start. Looking at the logs, I noticed the following in oozie-error.log:

2017-02-15 18:14:32,757 WARN ConfigUtils:523 - SERVER[wfmanager-test-1.openstacklocal] Using a deprecated configuration property [oozie.service.AuthorizationService.security.enabled], should use [oozie.service.AuthorizationService.authorization.enabled]. Please delete the deprecated property in order for the new property to take effect.
I found the property in Ambari > Configs and set it to false; I was not able to delete it. Once done, restart all Oozie services, and you're now able to see a new tab in Oozie called SLA. Remember, we only configured the email service, not JMS. We're ready to test our wf; before that, I'd like to preview the XML for good measure. At this point, I'm ready to submit the workflow and watch my inbox. I'm expecting to miss my job start, job end, and duration. This is the email output of my workflow. Until next time, folks!
02-15-2017
03:26 PM
@Angelo Alexander please refer to the following doc; you can also download the MySQL driver jar from the MySQL website and place it in /usr/hdp/current/sqoop-client/lib: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-movement-and-integration/content/apache_sqoop_connectors.html
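For example, something along these lines on the Sqoop client node (the connector version and download URL will vary; check the MySQL site for the current one):

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.40.tar.gz
tar -xzf mysql-connector-java-5.1.40.tar.gz
cp mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar /usr/hdp/current/sqoop-client/lib/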
02-14-2017
05:37 PM
My question: if I have a job that missed its SLA, I want to terminate further progress rather than send out a JMS alert. Can I tap into the SLA framework, perhaps with an EL function, and control decisions on the job? Can the SLA framework handle that out of the box? In short: if the SLA is missed, terminate the job.
Labels: Apache Oozie
02-14-2017
01:21 PM
2 Kudos
Sharing exam material goes against HCC policy. Please refrain from using this forum for such activity.