Member since: 10-17-2016
Posts: 93
Kudos Received: 10
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6070 | 09-28-2017 04:38 PM |
| | 9093 | 08-24-2017 06:12 PM |
| | 2403 | 07-03-2017 12:20 PM |
08-08-2017 07:58 AM
Thanks! That fixed it. I had to change the hostname in the file. Now it works!
07-25-2017 11:08 AM
@Arsalan Siddiqi The quick_start sample aims to demonstrate use of the type system, entity creation, lineage creation, and search. Although Atlas provides out-of-the-box types for Hive, Falcon, etc., it also allows you to create your own types. Once types are created, entities of those types can be created; think of entities as instances of types. Specific to quick_start, the entities depend on each other, which shows how lineage can be used (see sales_fact). One key highlight is the use of tags, which allow grouping of entities that are semantically related. See how the PII tag is used, and how all the load operations show up once the ETL tag is selected. Hope this helps!
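For context, here is a rough sketch of what "group by tag" looks like programmatically: Atlas's v2 basic-search REST endpoint can be asked for all entities carrying a classification such as PII or ETL. The host, port, and admin/admin credentials below are assumptions for a local sandbox, not part of the quick_start sample.

```scala
import java.net.{HttpURLConnection, URL}
import java.util.Base64
import scala.io.Source

object SearchByTag {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint and credentials; adjust for your environment.
    val atlasUrl = "http://localhost:21000"
    val auth     = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))
    val tag      = if (args.nonEmpty) args(0) else "ETL"

    // Atlas v2 basic search: return entities carrying the given classification (tag).
    val url  = new URL(s"$atlasUrl/api/atlas/v2/search/basic?classification=$tag&limit=25")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    conn.setRequestProperty("Authorization", s"Basic $auth")
    conn.setRequestProperty("Accept", "application/json")

    // Print the raw JSON result (a list of entities tagged e.g. PII or ETL).
    println(Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}
```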
08-11-2017 06:13 PM
@Hitesh Rajpurohit try: wget https://github.com/hortonworks/data-tutorials/blob/archive-hdp-2.5/tutorials/hdp/hdp-2.5/cross-component-lineage-with-apache-atlas-across-apache-sqoop-hive-kafka-storm/assets/crosscomponent_scripts.zip?raw=true
07-03-2017 12:20 PM
Hi, after a bit of searching I found that I can write each DStream RDD to a specified path using the saveAsTextFile method inside the foreachRDD action. The problem is that this writes the RDD's partitions to that location: if the RDD has 3 partitions, you get files like part-00000, part-00001, part-00002, and these are overwritten when the next batch starts. That means if the following batch has only 1 partition, part-00001 and part-00002 are deleted and part-00000 is overwritten with the new data. I have seen that people have written code to merge these files. Since I wanted the data for each batch and did not want to lose any of it, I specified the path as follows: fileIDs.foreachRDD(rdd => rdd.saveAsTextFile("/home/arsalan/SparkRDDData/" + ssc.sparkContext.applicationId + "/" + System.currentTimeMillis())) This way it creates a new folder for each batch. Later I can get the data for each batch and don't have to worry about finding ways to avoid the files being overwritten.
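For reference, a self-contained sketch of the approach described above. The socket source on localhost:9999 and the 10-second batch interval are stand-ins for the real fileIDs stream, not part of the original setup.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerBatchOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("per-batch-output").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Stand-in source; the original post reads a DStream called fileIDs.
    val fileIDs = ssc.socketTextStream("localhost", 9999)

    // Write every batch into its own timestamped folder so earlier batches are never overwritten.
    fileIDs.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(
          "/home/arsalan/SparkRDDData/" + ssc.sparkContext.applicationId +
            "/" + System.currentTimeMillis())
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```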
06-26-2017 02:43 PM
@jfrazee Thanks for the reply. I am using Spark Streaming, which processes data in batches. I want to know how long it takes to process a batch for a given application (keeping factors like the number of nodes in the cluster constant) at a given data rate (records/batch). I eventually want to check against an SLA to make sure that the end-to-end delay still falls within it, so I want to gather historic data from the application runs and predict the time to process a batch. Before starting a new batch, you could then already predict whether it would violate the SLA. I will have a look at your suggestions. Thanks.
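One possible way to gather that historic per-batch data from inside the application is Spark Streaming's listener API, which reports record counts and delays for every completed batch. A minimal sketch; printing to stdout is only illustrative, in practice the values would go to a file or metrics store.

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Collects per-batch timing so it can later feed an SLA check or prediction model.
class BatchTimingListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val records      = info.numRecords
    val schedulingMs = info.schedulingDelay.getOrElse(-1L)
    val processingMs = info.processingDelay.getOrElse(-1L)
    val totalMs      = info.totalDelay.getOrElse(-1L)
    println(s"batch=${info.batchTime.milliseconds} records=$records " +
      s"schedulingMs=$schedulingMs processingMs=$processingMs totalMs=$totalMs")
  }
}

// Registration, given an existing StreamingContext `ssc`:
// ssc.addStreamingListener(new BatchTimingListener)
```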
06-21-2017 07:36 AM
I am running NiFi locally on my laptop... Breaking the file up in steps works... thanks... I will consider Kafka 🙂
06-21-2017 01:35 PM
There is no difference between Spark and Spark Streaming in terms of stage, job, and task management. In both cases, you have one job per action. In both cases, as you correctly stated, jobs are made of stages, and stages are made of tasks. You have one task per partition of your RDD. The number of stages depends on the number of wide dependencies you encounter in the lineage to perform a given action. The only difference is that in Spark Streaming everything is repeated for each (mini-)batch.
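A small illustration of those rules; the word-count logic and the choice of 4 partitions are arbitrary examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageTaskDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("job-stage-task-demo").setMaster("local[2]"))

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4) // 4 partitions => 4 tasks per stage

    val counts = words
      .map(w => (w, 1))     // narrow dependency: stays in the same stage
      .reduceByKey(_ + _)   // wide dependency (shuffle): boundary of a new stage

    counts.collect()        // one action => one job, here made of two stages

    sc.stop()
  }
}
```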
06-05-2017 11:54 PM
@Arsalan Siddiqi I assumed UpdateRecord had been there since 1.2.0, but it isn't. Sorry about that. I created another template which doesn't use UpdateRecord; instead, I used another QueryRecord to update the CSV. Confirmed it works with NiFi 1.2.0. Hope you will find it useful. https://gist.githubusercontent.com/ijokarumawak/7e20af1cd222fb2adf13acb2b0f46aed/raw/e150884f52ca186dd61433428b38d172aaa7b128/Join_CSV_Files_1.2.0.xml
05-16-2017 04:44 AM
1 Kudo
Hi @Arsalan Siddiqi, as an alternative to the above response, you may take the help of Livy, where you don't need to worry about configuring the NiFi environment to include Spark-specific configuration. Since Livy takes REST requests, this works with the same ExecuteProcess or ExecuteStreamCommand processors; a curl command just needs to be issued. This is very handy when your NiFi and Spark are running on different servers. Please refer to the Livy documentation on that front.
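A rough sketch of such a REST call against Livy's /batches endpoint, written in Scala for consistency with the other snippets (the host, port, jar path, and class name are placeholders). The same request is typically issued as a curl command from ExecuteProcess or ExecuteStreamCommand.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object SubmitViaLivy {
  def main(args: Array[String]): Unit = {
    // Placeholder Livy endpoint and application details.
    val livyUrl = "http://livy-host:8998/batches"
    val payload =
      """{"file": "hdfs:///apps/my-spark-app.jar", "className": "com.example.MyApp"}"""

    val conn = new URL(livyUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes("UTF-8"))

    // Livy answers with a JSON description of the submitted batch session.
    println(Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}
```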
01-16-2017 03:54 PM
6 Kudos
Hello @Arsalan Siddiqi. These are some excellent questions and thoughts regarding provenance. Let me try to answer them in order.

ONE: The Apache NiFi community can definitely help you with questions on the specific timing of releases and what will be included. I do know, though, that there is work underway around Apache NiFi's provenance repository so that it can index even more event data per second than it does today. Exactly when this will end up in a release is subject to the normal community process by which contributions are reviewed and merged. That said, there is a lot of interest in higher provenance indexing rates, so I'd expect it to be in an upcoming release.

TWO: The current limitation we generally see is related to what I mention above in ONE. That is, we see the provenance indexing rate being a bottleneck on overall processing of data, because we apply backpressure to ensure that the backlog of provenance indexing doesn't just grow unbounded while more and more event data is processed. We are first going to make indexing faster. There are other techniques we could try later, such as indexing less data, which would make indexing far faster at the expense of slower queries. But such a tradeoff might make sense.

THREE: Integration with a system such as Apache Atlas has been shown to be a very compelling combination here. The provenance that NiFi generates plays nicely with the types that Atlas ingests. If we get more and more provenance-enabled systems reporting to Apache Atlas, then it can be the central place to view such data and get a view of what other systems are doing, and thus give that nice system-of-systems view that people really need. To truly prove lineage across systems, there would likely need to be some cryptographically verifiable techniques employed.

FOUR: The provenance data at present is prone to manipulation. In Apache NiFi we have flagged future work to adopt privacy-by-design features, such as those which would help detect manipulated data, and we're also looking at solutions to keep distributed copies of the data to help with loss of availability as well.

FIVE: It is designed for extension in parts. You can, for example, create your own implementation of a provenance repository. You can create your own reporting tasks which can harvest data from the provenance repository and send it to other systems as desired (see the sketch below). At the moment we don't have it open for creating additional event types; we're intentionally trying to keep the vocabulary small and succinct.

There are so many things left that we can do with this data in Apache NiFi and beyond to take full advantage of what it offers for the flow manager, the systems architect, the security professional, etc. There is also some great inter- and intra-system timing data that can be gleaned from this. Systems like to brag about how fast they are... provenance is the truth teller. Hope that helps a bit. Joe
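To make the extension point in FIVE more concrete, here is a rough sketch of a custom reporting task that harvests events from the provenance repository. It is written in Scala to match the other snippets (NiFi reporting tasks are conventionally Java), the forwarding logic is left as a comment, and the in-memory offset tracking is an illustrative simplification.

```scala
import org.apache.nifi.reporting.{AbstractReportingTask, ReportingContext}
import scala.collection.JavaConverters._

// Sketch of a custom ReportingTask that pulls provenance events on each trigger.
class ProvenanceExportTask extends AbstractReportingTask {

  // Last event id already exported (in-memory only; a real task would persist this via the StateManager).
  private var lastEventId: Long = 0L

  override def onTrigger(context: ReportingContext): Unit = {
    // Fetch up to 1000 events starting from the last exported id.
    val events = context.getEventAccess.getProvenanceEvents(lastEventId, 1000).asScala
    events.foreach { event =>
      // Send the event to an external system here (HTTP, Kafka, Atlas, ...).
      getLogger.info(
        s"provenance event ${event.getEventId} type=${event.getEventType} component=${event.getComponentId}")
      lastEventId = math.max(lastEventId, event.getEventId + 1)
    }
  }
}
```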