Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4043 | 10-11-2017 09:33 PM |
| | 3565 | 10-11-2017 07:46 PM |
| | 2570 | 08-04-2017 01:37 PM |
| | 2211 | 08-03-2017 03:36 PM |
| | 2238 | 08-03-2017 12:52 PM |
12-16-2016
07:14 AM
@Radhouene EL HADJ EL ARBI This is by design. Atlas, and most governance tools in general, will trace lineage as far back as possible. With Atlas, not only will it go back to the root table(s), it can even go as far back as the Storm or Sqoop job that ingested the data into the original tables. The purpose of having lineage this far back is for a user to be able to effectively trace the origins of data, whether to validate data quality, to meet compliance requirements, or simply to understand how the data has mutated/evolved to its current state.
12-16-2016
06:56 AM
1 Kudo
@vamsi valiveti Oozie can materialize coordinator actions (i.e., start tasks/jobs) based on time-based intervals or triggers. For example, run Job X every day at 12 PM. However, time is not always the only dependency; sometimes we may want to start a job only after all the necessary data is available. So, the Oozie coordinator allows us to use both time and data dependencies to start a workflow. “dataset”, “input-events” and “output-events” are the pillars for configuring data dependencies in coordinator.xml.

Dataset: A “dataset” is essentially an entity that represents data produced by an application and is often defined using its directory location. When the data is ready to be consumed, a file named “_SUCCESS” is added to the folder by default. Alternatively, we can specify the name of the file we want to write instead of “_SUCCESS” by setting the “done-flag”. If we look at the config example you have above, we expect a new dataset instance to be generated every 60 minutes, and the folder will be “tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}”:

<dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
<uri-template>
hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}
</uri-template>
</dataset>
Input-Event: An “input-event” describes the instances of data that must exist before a job can start. For example, when a coordinator runs, the input-event can be used to check whether a “_SUCCESS” file has been posted in the last hour, and then process the data for that instance. If nothing matches this criteria, or all the data is more than an hour old, then the job is not executed. Another example is to wait until several files/data instances have been completed before running. We specify the time window to look for these data instances by using the “start-instance” and the “end-instance”. So, the example you have above specifies that we should process files for the last 24 hours (“current(0)” being the current hour and “current(-23)” being 23 hours ago):

<input-events>
<data-in name="coordInput1" dataset="input1">
<start-instance>${coord:current(-23)}</start-instance>
<end-instance>${coord:current(0)}</end-instance>
</data-in>
</input-events>
Output-Event: An “output-event” is the opposite of an “input-event”. It is the output the job writes once it completes. This output can then be used as the “input-event” for another coordinator.
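For illustration, an output-event tied to an output dataset might look like the sketch below. This is not taken from your config; the dataset name "output1" and the single current-instance reference are assumptions:

<output-events>
<data-out name="coordOutput1" dataset="output1">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>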
12-15-2016
01:24 PM
1 Kudo
@Mohana Murali Gurunathan I'm assuming you want this for Tag-Based Policies. Tag-Based Policies and Atlas-Ranger integration are not available with Atlas 0.5. They are only available with HDP 2.5+, which contains Atlas 0.7.
12-13-2016
05:24 PM
1 Kudo
@Michael Young The example in the second link you provided is good. I would also recommend taking a look at the examples in the link below for the different kinds of operations you can perform on ORC files through Java: http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.hive.ql.io.orc.OrcFile
12-12-2016
11:50 PM
@anand maurya It's best to use the binaries from the repo. Install Ambari first, following the steps in the link below:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-installation/content/ch_Installing_Ambari.html
Once the Ambari Server is up, use it to download and install the HDP binaries as shown in the following link:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-installation/content/ch_Deploy_and_Configure_a_HDP_Cluster.html
11-01-2016
01:14 PM
The new HDP 2.5 Sandbox has been released. If you re-download it, you shouldn't face this issue anymore. http://hortonworks.com/downloads/
10-31-2016
08:51 PM
@Volodymyr Ostapiv This is due to an issue in the Docker HDP 2.5 Sandbox, which will be addressed with the next Sandbox release. In the meantime, try the solution in the link below: https://community.hortonworks.com/questions/62271/unable-to-add-apache-nifi-in-ambari.html
10-31-2016
03:18 PM
"open" is not a spark api command, it is a python command. What language are you using? Replace open("file.hql").read() with the equivalent command/code-block in that language.
10-31-2016
02:38 PM
@Amit Kumar Agarwal If you are looking to do it from a program, then try something like the approach in the link below: http://stackoverflow.com/questions/31313361/sparksql-hql-script-in-file-to-be-loaded-on-python-code
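For reference, here is a minimal PySpark sketch along the lines of that post. The file name file.hql and the use of HiveContext (the Spark 1.x-era entry point) are assumptions for illustration:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="RunHqlFile")
sqlContext = HiveContext(sc)

# Read the whole HQL file into a string. This assumes the file holds a single
# statement; split on ';' first if it contains several.
with open("file.hql") as f:
    query = f.read()

result = sqlContext.sql(query)
result.show()

The same idea carries over to newer Spark versions, where you would call sql() on a SparkSession instead.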
10-28-2016
09:11 PM
@Amit Kumar Agarwal See the link below: https://hadoopist.wordpress.com/2016/03/12/how-to-execute-hive-sql-file-in-spark-engine/