Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4043 | 10-11-2017 09:33 PM |
| | 3565 | 10-11-2017 07:46 PM |
| | 2570 | 08-04-2017 01:37 PM |
| | 2211 | 08-03-2017 03:36 PM |
| | 2238 | 08-03-2017 12:52 PM |
12-16-2016
07:14 AM
@Radhouene EL HADJ EL ARBI This is by design. Atlas, and most governance tools in general, will trace lineage as far back as possible. With Atlas, not only will it go back to the root table(s), it can even go as far back as the Storm or Sqoop job that ingested the data into the original tables. The purpose of having lineage this far back is for a user to be able to effectively trace the origins of data, whether to validate data quality, to meet compliance requirements, or simply to understand how the data has mutated/evolved to its current state.
12-16-2016
06:56 AM
1 Kudo
@vamsi valiveti Oozie can materialize coordinator actions (i.e., start tasks/jobs) based on time-based intervals or triggers. For example, run Job X every day at 12 PM. However, time is not always the only dependency; sometimes we may want to start a job only after all the necessary data is available. So, the Oozie coordinator allows us to use both time and data dependencies to start a workflow. “dataset”, “input-events” and “output-events” are the pillars for configuring data dependencies in coordinator.xml.

Dataset: A “dataset” is essentially an entity that represents data produced by an application and is often defined using its directory location. When the data is ready to be consumed, a file named “_SUCCESS” is added to the folder by default. Alternatively, we can specify the name of the file we want to write instead of “_SUCCESS” by setting the “done-flag”. If we look at the config example you have above, we expect a new dataset instance to be generated every 60 minutes, and the folder will be “tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}”:

<dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
<uri-template>
hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}
</uri-template>
</dataset>
Input-Event: An “input-event” describes the instances of data that must exist before a job can start. For example, when a coordinator runs, the input-event can be used to check whether a “_SUCCESS” file has been posted in the last hour, and then process the data for that instance. If nothing matches this criteria, or all the data is more than an hour old, then the job is not executed. Another example is to wait until several files/data instances have been completed before running. We specify the time window to look for these data instances by using the “start-instance” and the “end-instance”. So, the example you have above specifies that we should process files for the last 24 hours (“current(0)” being the current hour and “current(-23)” being 23 hours ago):

<input-events>
<data-in name="coordInput1" dataset="input1">
<start-instance>${coord:current(-23)}</start-instance>
<end-instance>${coord:current(0)}</end-instance>
</data-in>
</input-events>
Output-Event: An “output-event” is the opposite of an “input-event”. It is the output the job writes once it completes. This output can then be used as the “input-event” for another coordinator.
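For illustration, an output-event tied to an output dataset might look like the sketch below. This is not taken from your config; the dataset name "output1" and the single current-instance reference are assumptions:

<output-events>
<data-out name="coordOutput1" dataset="output1">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>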
12-15-2016
01:24 PM
1 Kudo
@Mohana Murali Gurunathan I'm assuming you want this for Tag-Based Policies. Tag-Based Policies and Atlas-Ranger integration are not available with Atlas 0.5. They are only available with HDP 2.5+, which contains Atlas 0.7.
12-13-2016
05:24 PM
1 Kudo
@Michael Young The example in the second link you provided is good. I would also recommend taking a look at the examples in the link below for the different kinds of operations you can perform on ORC files through Java: http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.hive.ql.io.orc.OrcFile
12-12-2016
11:50 PM
@anand maurya It's best to use the binaries from the repo. Install Ambari first, following the steps in the link below:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-installation/content/ch_Installing_Ambari.html
Once the Ambari Server is up, use it to download and install the HDP binaries as shown in the following link:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-installation/content/ch_Deploy_and_Configure_a_HDP_Cluster.html
11-01-2016
01:14 PM
The new HDP 2.5 Sandbox has been released. If you re-download it, you shouldn't face this issue anymore. http://hortonworks.com/downloads/
10-31-2016
08:51 PM
@Volodymyr Ostapiv This is due to an issue in the Docker HDP 2.5 Sandbox, which will be addressed with the next Sandbox release. In the meantime, try the solution in the link below: https://community.hortonworks.com/questions/62271/unable-to-add-apache-nifi-in-ambari.html
10-31-2016
03:18 PM
"open" is not a spark api command, it is a python command. What language are you using? Replace open("file.hql").read() with the equivalent command/code-block in that language.
10-31-2016
02:38 PM
@Amit Kumar Agarwal If you are looking to do it from a program, then try something like the approach in the link below: http://stackoverflow.com/questions/31313361/sparksql-hql-script-in-file-to-be-loaded-on-python-code
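For reference, here is a minimal PySpark sketch along the lines of that post. The file name file.hql and the use of HiveContext (the Spark 1.x-era entry point) are assumptions for illustration:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="RunHqlFile")
sqlContext = HiveContext(sc)

# Read the whole HQL file into a string. This assumes the file holds a single
# statement; split on ';' first if it contains several.
with open("file.hql") as f:
    query = f.read()

result = sqlContext.sql(query)
result.show()

The same idea carries over to newer Spark versions, where you would call sql() on a SparkSession instead.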
10-28-2016
09:11 PM
@Amit Kumar Agarwal See the link below: https://hadoopist.wordpress.com/2016/03/12/how-to-execute-hive-sql-file-in-spark-engine/