Member since: 10-01-2015
Posts: 3933
Kudos Received: 1150
Solutions: 374

My Accepted Solutions
Title | Views | Posted
---|---|---
| 3364 | 05-03-2017 05:13 PM
| 2796 | 05-02-2017 08:38 AM
| 3072 | 05-02-2017 08:13 AM
| 3003 | 04-10-2017 10:51 PM
| 1514 | 03-28-2017 02:27 AM
02-17-2017
12:49 PM
1 Kudo
Have you looked at the free-form query option in Sqoop? https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_free_form_query_imports I would suggest saving the result of a complex query in a database table and sqooping that as one dataset. There is, however, a note on free-form queries in the docs:

Note: The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.
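To illustrate, a minimal free-form import might look like the command below (the table and column names are made up for the example; note the mandatory $CONDITIONS token and the --split-by column Sqoop needs to parallelize the query):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --query 'SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON o.cust_id = c.id WHERE $CONDITIONS' \
  --split-by o.id \
  --target-dir /user/etl/orders_import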
02-16-2017
09:07 PM
1 Kudo
@Erik Putrycz Additionally, I added a tutorial here: https://community.hortonworks.com/articles/84071/apache-ambari-workflow-manager-view-for-apache-ooz-2.html
02-16-2017
08:01 PM
5 Kudos
Part 1: https://community.hortonworks.com/articles/82964/getting-started-with-apache-ambari-workflow-design.html
Part 2: https://community.hortonworks.com/articles/82967/apache-ambari-workflow-designer-view-for-apache-oo.html
Part 3: https://community.hortonworks.com/articles/82988/apache-ambari-workflow-designer-view-for-apache-oo-1.html
Part 4: https://community.hortonworks.com/articles/83051/apache-ambari-workflow-designer-view-for-apache-oo-2.html
Part 5: https://community.hortonworks.com/articles/83361/apache-ambari-workflow-manager-view-for-apache-ooz.html
Part 6: https://community.hortonworks.com/articles/83787/apache-ambari-workflow-manager-view-for-apache-ooz-1.html
Part 8: https://community.hortonworks.com/articles/84394/apache-ambari-workflow-manager-view-for-apache-ooz-3.html
Part 9: https://community.hortonworks.com/articles/85091/apache-ambari-workflow-manager-view-for-apache-ooz-4.html
Part 10: https://community.hortonworks.com/articles/85354/apache-ambari-workflow-manager-view-for-apache-ooz-5.html
Part 11: https://community.hortonworks.com/articles/85361/apache-ambari-workflow-manager-view-for-apache-ooz-6.html
Part 12: https://community.hortonworks.com/articles/131389/apache-ambari-workflow-manager-view-for-apache-ooz-7.html

Welcome back, folks! In this tutorial, I'm going to demonstrate how to easily import existing Spark workflows and execute them in WFM, as well as create your own Spark workflows.

As of today, Apache Spark 2.x is not supported in the Apache Oozie bundled with HDP. There is community work on making Spark2 run in Oozie, but it is not released yet, so I'm going to concentrate on Spark 1.6.3 today.

First things first, I'm going to import a workflow into WFM from the Oozie examples: https://github.com/apache/oozie/tree/master/examples/src/main/apps/spark

My cluster setup is:
Ambari 2.5.0
HDP 2.6
HDFS HA
RM HA
Oozie HA
Kerberos

Luckily, for a Spark action in a Kerberos environment I didn't need to add anything else (i.e., a credential). The first thing I need is the dfs.nameservices property from HDFS (Ambari > HDFS > Configs); I'm going to use that for the nameNode variable.

I'm ready to import this workflow into WFM; for the details, please review one of my earlier tutorials. I'm presented with a spark action node. Click on the spark-node and hit the gear icon to preview the properties. Let's also review any arguments for input and output, as well as RM and NameNode. Also notice the prepare step; we can choose to delete a directory if it exists. We're going to leave everything as is.

When we submit the workflow, we're going to supply the nameNode and resourceManager addresses; below are my properties. Notice that jobTracker and resourceManager both appear. Ignore jobTracker: since it was in the original wf, it was inherited; we're concerned with RM going forward. Also, the nameNode value is the dfs.nameservices property from core-site.xml, as I stated earlier.

Once the job completes, you can navigate to the output directory and see that the file was copied.

hdfs dfs -ls /user/aervits/examples/output-data/spark/
Found 3 items
-rw-r--r-- 3 aervits hdfs 0 2017-02-16 17:16 /user/aervits/examples/output-data/spark/_SUCCESS
-rw-r--r-- 3 aervits hdfs 706 2017-02-16 17:16 /user/aervits/examples/output-data/spark/part-00000
-rw-r--r-- 3 aervits hdfs 704 2017-02-16 17:16 /user/aervits/examples/output-data/spark/part-00001
In my case, the sample input was a book in the examples directory.

hdfs dfs -cat /user/aervits/examples/output-data/spark/part-00000
To be or not to be, that is the question;
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing, end them. To die, to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Next up, I'm going to demonstrate authoring a new Spark action instead of importing one. I'm following the guide at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/run-sample-apps.html#run_spark_pi to demonstrate how to add this Pi job to an Oozie workflow via WFM. First you need to create a workflow directory on HDFS along with a lib folder, then upload the Spark examples jar to that directory.

hdfs dfs -mkdir -p oozie/spark/lib
cd /usr/hdp/current/spark-client/lib
hdfs dfs -put spark-examples-1.6.3.2.6.0.0-502-hadoop2.7.3.2.6.0.0-502.jar oozie/spark/lib

Next, let's add a spark action to WFM and edit it. Fill out the properties as below and make sure to select Yarn Cluster; Yarn Client in Oozie will be deprecated soon. Notice you can pass each Spark option on its own line. I also need to add an argument to the SparkPi job, in this case 10. If you didn't figure it out already, I'm trying to recreate the following command in Oozie:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Aside from changing yarn-client to yarn-cluster, everything else is as in the command above. I'd like to preview my XML now.
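For reference, the resulting spark action should look roughly like the sketch below (a hand-written approximation, not the exact XML WFM generates; the jar path follows the workflow directory created above):

<workflow-app name="spark-pi-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-pi"/>
    <action name="spark-pi">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>SparkPi</name>
            <class>org.apache.spark.examples.SparkPi</class>
            <jar>${nameNode}/user/aervits/oozie/spark/lib/spark-examples-1.6.3.2.6.0.0-502-hadoop2.7.3.2.6.0.0-502.jar</jar>
            <spark-opts>--num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1</spark-opts>
            <arg>10</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>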
I'm ready to submit the job and run it.

Next, I'm going to demonstrate how to run a PySpark job in Oozie via WFM. The code I'm going to run is below.

from pyspark import SparkContext, SparkConf
import sys

# input and output directories are passed as command-line arguments
datain = sys.argv[1]
dataout = sys.argv[2]

conf = SparkConf().setAppName('counts_with_pyspark')
sc = SparkContext(conf=conf)

# classic word count: split lines into words, emit (word, 1) pairs, sum counts per word
text_file = sc.textFile(str(datain))
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(str(dataout))
It's taken from http://spark.apache.org/examples.html; I only added the option to pass input and output directories from the command line. I'm going to run the code to make sure it works with the following command:

/usr/hdp/current/spark-client/bin/spark-submit counts.py hdfs://mycluster/user/aervits/examples/input-data/text/ hdfs://mycluster/user/aervits/pyspark-output

This will produce output in the pyspark-output HDFS directory with a count for each instance of a word. Expected output is below.

hdfs dfs -cat pyspark-output/part-0000 | less
(u'and', 7)
(u'slings', 1)
(u'fardels', 1)
(u'mind', 1)
(u'natural', 1)
(u'sea', 1)
(u'For', 2)
(u'arrows', 1)
(u'is', 2)
(u'ills', 1)
(u'resolution', 1)
(u'merit', 1)
(u'death,', 1)
(u'say', 1)
(u'pause.', 1)
(u'bare', 1)
(u'Devoutly', 1)
Next, I'm ready to add a Spark action node to WFM and edit it by populating the properties below. Notice I'm passing the Spark options as well as yarn-cluster as the deployment mode. Then I need to configure input/output and the prepare step; I want to delete the output directory so that I can re-run my wf without manually deleting it each time. Nothing new here: I'm passing the input and output as arguments to the action. I'm ready to preview the XML.
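For reference, the pyspark version of the action might look roughly like this sketch (again approximate, not the exact XML WFM generates; the spark-opts values are illustrative):

<action name="pyspark-counts">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/user/aervits/pyspark-output"/>
        </prepare>
        <master>yarn-cluster</master>
        <name>counts_with_pyspark</name>
        <jar>counts.py</jar>
        <spark-opts>--num-executors 1 --driver-memory 512m --executor-memory 512m</spark-opts>
        <arg>${nameNode}/user/aervits/examples/input-data/text/</arg>
        <arg>${nameNode}/user/aervits/pyspark-output</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>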
The last step is to create the lib directory in the pyspark workflow directory and upload the counts.py file there.

hdfs dfs -mkdir oozie/pyspark/lib
hdfs dfs -put counts.py oozie/pyspark/lib/

Now I'm ready to submit the job, and it succeeds. As usual, you can find my code here:
https://github.com/dbist/oozie/tree/master/apps/pyspark
https://github.com/dbist/oozie/tree/master/apps/spark
02-16-2017
06:00 PM
1 Kudo
@Erik Putrycz I added a pyspark workflow example: https://github.com/dbist/oozie/tree/master/apps/pyspark It works with HDFS HA, RM HA, Oozie HA, and Kerberos.
02-16-2017
02:16 PM
@Georg Heiler we are not keen on publishing timelines. Shouldn't be too long.
02-15-2017
11:22 PM
I had a similar issue writing a Chef recipe; what I did was the following: run the steps with Ambari so that ambari.properties is populated and functional, then take that ambari.properties file, use it as a template in Chef, and add variables where the values change. I am not familiar with Puppet and don't know whether the same template concept is available, but if it is, that's one way. For reference, here's my example in Chef: https://github.com/dbist/smartsense-chef/blob/master/templates/default/hst-agent.ini.erb
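To illustrate the idea, the template swaps the values that vary for ERB variables, and the recipe renders it with a template resource (the property names and attribute paths here are made up for the example):

# templates/default/ambari.properties.erb -- values that change per environment become variables
server.jdbc.hostname=<%= @db_host %>
server.jdbc.user.name=<%= @db_user %>

# recipe: render the template into place with node attributes
template '/etc/ambari-server/conf/ambari.properties' do
  source 'ambari.properties.erb'
  variables(db_host: node['ambari']['db_host'], db_user: node['ambari']['db_user'])
end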
02-15-2017
07:54 PM
2 Kudos
Part 1: https://community.hortonworks.com/articles/82964/getting-started-with-apache-ambari-workflow-design.html
Part 2: https://community.hortonworks.com/articles/82967/apache-ambari-workflow-designer-view-for-apache-oo.html
Part 3: https://community.hortonworks.com/articles/82988/apache-ambari-workflow-designer-view-for-apache-oo-1.html
Part 4: https://community.hortonworks.com/articles/83051/apache-ambari-workflow-designer-view-for-apache-oo-2.html
Part 5: https://community.hortonworks.com/articles/83361/apache-ambari-workflow-manager-view-for-apache-ooz.html
Part 7: https://community.hortonworks.com/articles/84071/apache-ambari-workflow-manager-view-for-apache-ooz-2.html
Part 8: https://community.hortonworks.com/articles/84394/apache-ambari-workflow-manager-view-for-apache-ooz-3.html
Part 9: https://community.hortonworks.com/articles/85091/apache-ambari-workflow-manager-view-for-apache-ooz-4.html
Part 10: https://community.hortonworks.com/articles/85354/apache-ambari-workflow-manager-view-for-apache-ooz-5.html
Part 11: https://community.hortonworks.com/articles/85361/apache-ambari-workflow-manager-view-for-apache-ooz-6.html
Part 12: https://community.hortonworks.com/articles/131389/apache-ambari-workflow-manager-view-for-apache-ooz-7.html

In this tutorial, we're going to leverage Oozie's SLA monitoring features via Workflow Manager. To read more about the SLA features in Oozie, please look at the official documentation: https://oozie.apache.org/docs/4.2.0/DG_SLAMonitoring.html

We will begin with a simple shell action that sleeps for a set duration. Create a new file called script.sh and paste in the code below.

echo "start of script execution"
sleep 60
echo "end of script execution"
We're also going to create a workflow HDFS directory and upload this script to it.

hdfs dfs -mkdir oozie/shell-sla
hdfs dfs -put script.sh oozie/shell-sla/
Let's begin by adding a shell action and populating the script name and file attribute. Don't forget to check the capture-output box; we want to see the output of this action. I want to submit the workflow to make sure everything works as expected before configuring SLA features. It's a good idea to preview the XML to make sure the file and exec tags are filled correctly (see the sketch below).
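For reference, the shell action should end up looking roughly like this (hand-written, not the exact XML WFM generates; the HDFS path follows the shell-sla directory created above):

<action name="shell-sla-node">
    <shell xmlns="uri:oozie:shell-action:0.3">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>script.sh</exec>
        <file>/user/aervits/oozie/shell-sla/script.sh#script.sh</file>
        <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>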
Once the job completes, I want to drill down into the job and view the output. Everything looks good; we're ready to enable the SLA features of Oozie via Workflow Manager. Click on the shell action and then the gear icon. At the bottom of the configuration page you will see the SLA section. Expand it and check the enabled box. Each field is described below:

nominal-time: As the name suggests, this is the time relative to which your jobs' SLAs will be calculated. Generally, since Oozie workflows are aligned with synchronous data dependencies, this nominal time can be parameterized to be passed the value of your coordinator nominal time. Nominal time is also required in the case of independent workflows, where you can specify the time at which you expect the workflow to run if you don't have a synchronous dataset associated with it.

should-start: Relative to nominal-time, this is the amount of time (along with a time-unit: MINUTES, HOURS, DAYS) within which your job should start running to meet SLA. This is optional.

should-end: Relative to nominal-time, this is the amount of time (along with a time-unit: MINUTES, HOURS, DAYS) within which your job should finish to meet SLA.

max-duration: This is the maximum amount of time (along with a time-unit: MINUTES, HOURS, DAYS) your job is expected to run. This is optional.

alert-events: Specify the types of events for which email alerts should be sent. Allowable values in this comma-separated list are start_miss, end_miss and duration_miss. *_met events can generally be deemed low priority, so email alerting for these is not necessary. However, note that this setting applies only to alerts via email and not via JMS messages, where all events send out notifications and the user can filter them using desired selectors. This is optional and only applicable when alert-contact is configured.

alert-contact: Specify a comma-separated list of email addresses where you wish your alerts to be sent. This is optional and need not be configured if you just want to view your job SLA history in the UI and do not want to receive email alerts.

I'm going to simulate each one of the SLA patterns: my job started later than scheduled, my job completed outside the SLA threshold, and my job took longer to complete than expected. To fill out the nominal time, feel free to choose the date and use the clock icon below the date picker for the correct time. Click x when ready.
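With these fields filled in, the workflow XML gains an sla:info section; per the Oozie SLA documentation linked above it looks roughly like the sketch below (values illustrative):

<workflow-app name="shell-sla-wf" xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2">
    ...
    <end name="end"/>
    <sla:info>
        <sla:nominal-time>${nominalTime}</sla:nominal-time>
        <sla:should-start>${10 * MINUTES}</sla:should-start>
        <sla:should-end>${30 * MINUTES}</sla:should-end>
        <sla:max-duration>${30 * MINUTES}</sla:max-duration>
        <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
        <sla:alert-contact>email@email.com</sla:alert-contact>
    </sla:info>
</workflow-app>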
Finally, I'd like to change my script to run for 120 seconds instead of 60 to simulate a long duration. My script should look like so:

echo "start of script execution"
sleep 120
echo "end of script execution"
When ready, re-upload the script. At this point, I want to make sure sending mail from the cluster is possible and will test that by sending a sample email. Enabling mail is beyond the scope of this tutorial; I followed the procedure below, adjust as necessary for your environment.

sudo su
yum install postfix
/etc/init.d/postfix restart
exit

Now we're able to send mail from our node. Note that mail needs to work on any of the nodes where Oozie will execute a wf.

mail -s "test" email@email.com
Hit Ctrl-D, and you should get an email shortly.

Finally, there are some changes we need to implement on the Oozie side. I'm not going to enable JMS alerting and will only concentrate on the email piece; please consult the Oozie docs for the JMS part. This is HDP 2.5.3, and things may look/act differently on your Oozie instance. Let's go to Ambari > Configs and filter by the following property:

oozie.services.ext

We're going to add these services to the existing list:

org.apache.oozie.service.EventHandlerService,
org.apache.oozie.sla.service.SLAService

Once ready, add a couple more custom properties in Oozie; in my environment these properties did not exist. Add:

oozie.service.EventHandlerService.event.listeners

and the value should be:

org.apache.oozie.sla.listener.SLAJobEventListener,
org.apache.oozie.sla.listener.SLAEmailEventListener
The Oozie docs also recommend adding the following property to improve performance of event processing; we're going to add it and set its value to 15:

oozie.service.SchedulerService.threads

Once I saved the changes and restarted Oozie, it failed to start. Looking at the logs, I noticed the following in oozie-error.log:

2017-02-15 18:14:32,757 WARN ConfigUtils:523 - SERVER[wfmanager-test-1.openstacklocal] Using a deprecated configuration property [oozie.service.AuthorizationService.security.enabled], should use [oozie.service.AuthorizationService.authorization.enabled]. Please delete the deprecated property in order for the new property to take effect.
I found the property in Ambari > Configs and set it to false; I was not able to delete it. Once done, restart all Oozie services, and you're now able to see a new tab in Oozie called SLA. Remember, we only configured the email service, not JMS. We're ready to test our wf; before that, I'd like to preview the XML for good measure. At this point, I'm ready to submit the workflow and watch my inbox. I'm expecting to miss my job start, job end, and duration. This is the email output of my workflow. Until next time, folks!
02-15-2017
03:26 PM
@Angelo Alexander please refer to the following doc; you can also download the MySQL driver jar from the MySQL website and place it in /usr/hdp/current/sqoop-client/lib: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-movement-and-integration/content/apache_sqoop_connectors.html
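For example, something along these lines on the Sqoop client node (the connector version and download URL will vary; check the MySQL site for the current one):

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.40.tar.gz
tar -xzf mysql-connector-java-5.1.40.tar.gz
cp mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar /usr/hdp/current/sqoop-client/lib/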
02-14-2017
05:37 PM
My question: if I have a job that missed its SLA, I want to terminate further progress rather than send out a JMS alert. Can I tap into the SLA framework, perhaps with an EL function, and control decisions on the job? Can the SLA framework handle that out of the box? In short: if the SLA is missed, terminate the job.
Labels: Apache Oozie
02-14-2017
01:21 PM
2 Kudos
Sharing exam material goes against HCC policy. Please refrain from using this forum for such activity.