01-07-2016 06:40 PM
When an Oozie action fails during workflow processing, the root-cause error is wrapped inside an Oozie error, and you sometimes have to dig a little deeper to find it. Once you look in the right place (screen/log), the error gives a direct indication of the issue, but it is buried a couple of layers below the Oozie UI. Follow these steps to get more details on Oozie issues:

1) Open the Oozie UI.
2) Locate the job instance that failed and double-click on it.
3) Locate the step that failed and double-click on it.
4) In the Action pop-up (which also shows the cryptic error message in the Error Message field), click the small magnifying glass on the right-hand side of the Console URL.
5) On the next screen, click the link that reads "logs" and view the details.

(TBD - Add Screen snapshot)
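The same details can usually be pulled from the command line with the Oozie client instead of clicking through the UI. A minimal sketch, assuming the Oozie server URL and the workflow job ID below are placeholders you replace with your own:

```shell
# Hypothetical Oozie server URL and workflow job ID -- substitute your own.
OOZIE_URL=http://oozie-host:11000/oozie
JOB_ID=0000001-160107000000000-oozie-oozi-W

# List the workflow's actions and their statuses to find the failed step.
oozie job -oozie "$OOZIE_URL" -info "$JOB_ID"

# Fetch the workflow log, which includes each action's error message.
oozie job -oozie "$OOZIE_URL" -log "$JOB_ID"
```

The `-info` output includes each action's Console URL, which points at the same launcher logs the UI steps above lead to.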
12-31-2015 03:20 AM
In general, you have the following options when running R on Hortonworks Data Platform (HDP):

o RHadoop (rmr) - R programs written in the MapReduce paradigm. MapReduce is not a vendor-specific API, and any program written with MapReduce is portable across Hadoop distributions. https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr

o Hadoop Streaming - R programs written to make use of Hadoop Streaming; the program structure still aligns with MapReduce, so the portability benefit above still applies.

o RJDBC - This approach does not require the R programs to be written using MapReduce, and the modeling code remains 100% native R. Here is a tutorial with a video, sample data, and an R script: http://hortonworks.com/hadoop-tutorial/using-revolution-r-enterprise-tutorial-hortonworks-sandbox/ Using RJDBC, the R program can have Hadoop parallelize pre-processing and filtering: R submits a query to Hive or SparkSQL, which runs it with distributed, parallel processing, and the results feed existing R models as-is, without any changes or proprietary APIs. Typically, a data science application involves a great deal of data preparation, often around 75% of the work; RJDBC allows pushing that work to Hive to take advantage of distributed computing.

o SparkR - Lastly, the SparkR interface, a newer component in Spark. SparkR is an R package that provides a lightweight frontend to use Apache Spark from R. It has been available since Spark 1.4.1 (current version 1.5.2). Here are some details on it: https://spark.apache.org/docs/latest/sparkr.html And the available API: https://spark.apache.org/docs/latest/api/R/
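The RJDBC option above can be sketched as follows. This is a minimal illustration, not a definitive setup: the Hive host, port, credentials, driver jar path, and the table/column names are all assumptions to adjust for your cluster.

```r
# Minimal RJDBC sketch: push heavy filtering down to Hive,
# then run native R models on the (much smaller) result set.
library(RJDBC)

# Hypothetical driver jar location -- varies by HDP version.
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = "/usr/hdp/current/hive-client/lib/hive-jdbc.jar")

# Hypothetical HiveServer2 endpoint and credentials.
conn <- dbConnect(drv, "jdbc:hive2://hive-host:10000/default",
                  "username", "password")

# Hive does the distributed pre-processing and filtering...
df <- dbGetQuery(conn, "SELECT col1, col2 FROM my_table WHERE col2 > 100")

# ...and existing native R models are used on the result, unchanged.
model <- lm(col1 ~ col2, data = df)

dbDisconnect(conn)
```

The design point is that only the filtered result crosses the JDBC connection; the 75%-of-the-work data preparation stays in Hive.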