Member since: 10-17-2016
Posts: 93
Kudos Received: 10
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2645 | 09-28-2017 04:38 PM
 | 4419 | 08-24-2017 06:12 PM
 | 933 | 07-03-2017 12:20 PM
12-29-2019
08:42 PM
@ask_bill_brooks Sorry, I am not seeing the "Accept as Solution" option on my screen. Thanks.
12-04-2017
06:05 PM
@Arsalan Siddiqi Thanks for the excellent question. Your observations are valid. While Atlas does help with meeting compliance requirements, it is only part of the solution. To use a traffic analogy, Atlas is the map (hence the name) and does not deal with the cars on the road (the traffic). To complete the picture, there needs to be some monitoring of what data gets ingested into the system and whether all of it conforms with the norms that have been set up. Please take a look at this presentation from Data Summit 2017. It explains how a system can be set up that helps with governance (the realm of Atlas) and also helps with spotting errors within the data itself. To summarize: to spot errors in the flow of data itself, you would need some other mechanism; Atlas will not help you in that respect.

About your 2nd question: Atlas consumes notifications from Kafka by spawning a single thread and processing one notification at a time (see NotificationHookConsumer.java and AtlasKafkaConsumer.java). In systems with high throughput, the notifications will be queued in Kafka and you will see a lag in the consumption of notifications. Kafka guarantees durability of messages, and Atlas ensures that it consumes every message produced to Kafka. If messages are dropped for some reason, you would see that in Atlas' logs. We also test Atlas in high-availability scenarios.

Also, to address the notification message question, I would urge you to use the Atlas V2 client APIs (both on master and branch-0.8). Kafka does not mandate any message format, since all it understands is bytes, so that should not be the determining criterion for choosing the client API version.

I know this is a lot of text; I hope it helps. Please feel free to reach out if you need clarifications.
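To make the single-threaded consumption model easier to picture, here is a minimal Scala sketch (illustrative only, not the actual NotificationHookConsumer.java logic): the broker address and consumer group are placeholders, and it simply polls the hook topic and handles one notification at a time.

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

import scala.jdk.CollectionConverters._

// Illustrative only: one consumer thread, one notification processed at a time.
object SingleThreadedNotificationConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
    props.put("group.id", "atlas")                     // hypothetical consumer group
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("ATLAS_HOOK")) // topic the Atlas hooks publish to

    while (true) {
      // Poll, then handle each notification sequentially; anything not yet
      // processed simply waits in Kafka, which shows up as consumer lag.
      val records = consumer.poll(Duration.ofSeconds(1))
      records.asScala.foreach { record =>
        println(s"offset=${record.offset()} notification=${record.value()}")
      }
    }
  }
}
```

Because consumption is sequential, a burst of notifications shows up as consumer lag rather than lost messages.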
11-13-2017
11:50 PM
@Arsalan Siddiqi Your observations are accurate. In fact, there is an initiative in progress to address this; I just don't have an ETA on when it will get done.
10-23-2017
02:30 AM
@Arsalan Siddiqi I agree with @Vadim Vaks 's suggestion on structuring the types. I will try out your JSONs today and see whether I can get better insight into the lineage behavior with that setup.
10-10-2017
02:45 PM
@anaik @Ashutosh Mestry @Sarath Subramanian @Vadim Vaks any suggestions?
10-05-2017
05:49 PM
That is correct. The ZIP file has the output and the TXT is the input.
09-29-2017
02:27 PM
@Arsalan Siddiqi Your observation about the JSON is accurate. The JSON you see in the sample is represented in the old format. We now use the new format, referred to as V2. The V2 format is easy to understand, as it is a JSON representation of the Java class, and it is much easier to code against than the earlier approach. I am attaching the atlas-application.properties file that I use for development in IntelliJ: atlas-applicationproperties.zip. Hope this helps.
04-27-2018
04:45 PM
I found that this problem was caused by a problem with the JSON: I had a missing {. I am still keen to understand the problem-solving process, though.
08-24-2017
06:12 PM
OK, here are all the steps required to run Apache Atlas natively with Berkeley DB and Elasticsearch:

1. Download and install Kafka from https://kafka.apache.org/downloads. Download the binary and extract it to your preferred location.
2. Kafka and Atlas also require ZooKeeper. By default Kafka ships with a ZooKeeper instance, so if you do not have ZooKeeper installed or running you can use that one. From your Kafka home, run: bin/zookeeper-server-start.sh config/zookeeper.properties
3. Once ZooKeeper has started, you can check it with: netstat -ant | grep :2181. If everything is fine you should see: tcp6 0 0 :::2181 :::* LISTEN
4. Start your Kafka server with: bin/kafka-server-start.sh config/server.properties. To check that Kafka is running, run: netstat -ant | grep :9092. You should see a result similar to the one above.
5. Now you are ready to move on to Atlas. You can either use the link provided on the website or check out a branch or tag directly from GitHub. I used the command from their website: git clone https://git-wip-us.apache.org/repos/asf/atlas.git atlas
6. Navigate into the folder: cd atlas
7. Create a new folder called libext: mkdir libext
8. Download the Berkeley DB JE archive from http://download.oracle.com/otn/berkeley-db/je-5.0.73.zip (you will need an Oracle account; create one to download the zip file) and copy the zip file into the libext folder you just created.
9. Run: export MAVEN_OPTS="-Xmx1536m -XX:MaxPermSize=512m"
10. Run: mvn clean install -DskipTests (make sure to skip the tests)
11. Run: mvn clean package -DskipTests -Pdist,berkeley-elasticsearch
12. Navigate to the following location, depending on which repo you used: incubator-atlas/distro/target/apache-atlas-0.8-incubating-bin/apache-atlas-0.8-incubating/bin/atlas_start.py or /home/arsalan/Development/atlas/distro/target/apache-atlas-0.9-SNAPSHOT-bin/apache-atlas-0.9-SNAPSHOT
13. Run: python atlas_start.py
14. You can now navigate to localhost:21000 to check the Atlas UI.

Hope it helps!
08-08-2017
07:58 AM
Thanks! That fixed it. I had to change the hostname in the file; now it works!
05-29-2018
02:47 PM
I am wondering if this was ever resolved. I am having the exact same issue as @Arsalan Siddiqi: I am trying to set up AmbariReportingTask and get "Connection refused (Connection refused)" for the metrics collector URL http://localhost:6188/ws/v1/timeline/metrics. Thank you.
09-25-2017
04:52 PM
@Arsalan Siddiqi Instead of having four lines with 127.0.0.1, put them all on one line like this: 127.0.0.1 localhost localhost.localdomain arsalan-Lenovo-IdeaPad-Y410P sandbox.hortonworks.com
10-25-2018
07:56 AM
Hello, I want to know whether this problem has been solved; I have encountered the same problem. How can it be solved? Thanks. @Arsalan Siddiqi
07-25-2017
11:08 AM
@Arsalan Siddiqi The quick_start sample aims to demonstrate use of the type system, entity creation, creation of lineage, and then search. Though Atlas provides out-of-the-box types for Hive, Falcon, etc., it also allows you to create your own types. Once types are created, entities of those types can be created; think of entities as instances of types. Specific to quick_start, the entities depend on each other, thereby showing how lineage can be used (see sales_fact). One key highlight is the use of tags, which allows grouping of entities that are semantically related. See how the PII tag is used, and how all the load operations show up once the ETL tag is selected. Hope this helps!
08-11-2017
06:13 PM
@Hitesh Rajpurohit try: wget https://github.com/hortonworks/data-tutorials/blob/archive-hdp-2.5/tutorials/hdp/hdp-2.5/cross-component-lineage-with-apache-atlas-across-apache-sqoop-hive-kafka-storm/assets/crosscomponent_scripts.zip?raw=true
07-09-2017
03:01 AM
I don't think you can submit the code in standalone mode from the IDE. I tried the same and failed as well.
07-03-2017
12:20 PM
Hi. After a bit of searching, I found that I can write each DStream RDD to a specified path using the saveAsTextFile method within the foreachRDD action. The problem is that this writes the RDD's partitions to that location: if the RDD has 3 partitions, you get something like part-00000, part-00001, part-00002, and these are overwritten when the next batch starts. That means that if the following batch has 1 partition, part-00001 and part-00002 are deleted and part-00000 is overwritten with the new data. I have seen that people have written code to merge these files. As I wanted the data for each batch and did not want to lose any of it, I specified the path as follows: fileIDs.foreachRDD(rdd => rdd.saveAsTextFile("/home/arsalan/SparkRDDData/" + ssc.sparkContext.applicationId + "/" + System.currentTimeMillis())) This way a new folder is created for each batch. Later I can get the data for each batch and don't have to worry about finding ways to avoid the files being overwritten.
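For completeness, a minimal self-contained sketch of this approach, assuming a socket text source on localhost:9999 and a writable /tmp output directory (both placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each batch is written into its own timestamped folder, so earlier batches
// are never overwritten by the part files of the next batch.
object PerBatchOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("per-batch-output").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(
          "/tmp/stream-output/" + ssc.sparkContext.applicationId + "/" + System.currentTimeMillis())
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```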
06-26-2017
02:43 PM
@jfrazee Thanks for the reply. I am using Spark Streaming, which processes data in batches. I want to know how long it takes to process a batch for a given application (keeping factors like the number of nodes in the cluster constant) at a given data rate (records per batch). I eventually want to check against an SLA to make sure that the end-to-end delay still falls within it, so I want to gather historical data from application runs and predict the time needed to process a batch; before starting a new batch, you can then already predict whether it would violate the SLA. I will have a look at your suggestions. Thanks.
06-29-2017
11:18 PM
5 Kudos
Yes, you can extract provenance events for a specific flow file. To do this, you need to search provenance by FlowFile UUID. This can be done by logging in to the NiFi UI -> click on the menu in the upper right-hand corner -> select Data Provenance -> select the search button -> enter the flow file's UUID in the "FlowFile UUID" text box -> click Search. The same thing can be done via the REST API. To get to the REST API docs from the NiFi UI, click on the menu in the upper right-hand corner and click the Help option. When the help doc opens, browse to the bottom of the left pane and select Rest Api under the "Developer" section. You can also access the data along with the event by using the REST APIs
GET /provenance-events/{id}/content/input and
GET /provenance-events/{id}/content/output.
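As a rough sketch of calling those two endpoints (an unsecured NiFi instance on localhost:8080 is assumed and the event id is hypothetical; a secured instance would additionally need a token or certificate):

```scala
import scala.io.Source

object ProvenanceEventContent {
  def main(args: Array[String]): Unit = {
    val base    = "http://localhost:8080/nifi-api"   // assumption: local, unsecured NiFi
    val eventId = 1234                               // hypothetical provenance event id

    // Content that entered the component for this event.
    val input  = Source.fromURL(s"$base/provenance-events/$eventId/content/input").mkString
    // Content that left the component for this event.
    val output = Source.fromURL(s"$base/provenance-events/$eventId/content/output").mkString

    println(s"input claim:\n$input")
    println(s"output claim:\n$output")
  }
}
```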
06-21-2017
07:36 AM
I am running NiFi locally on my laptop... Breaking the file up in steps works... thanks... will consider Kafka 🙂
06-21-2017
01:35 PM
There is no difference between Spark and Spark Streaming in terms of stage, job, and task management. In both cases you have one job per action and, as you correctly stated, jobs are made of stages and stages are made of tasks; you have one task per partition of your RDD. The number of stages depends on the number of wide dependencies you encounter in the lineage needed to perform a given action. The only difference is that in Spark Streaming everything is repeated for each (mini-)batch.
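To illustrate with a small sketch (names and the local master are placeholders): the single count action below triggers one job; reduceByKey introduces a wide (shuffle) dependency, so that job has two stages, and each stage has one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageTaskSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-sketch").setMaster("local[2]"))

    // 3 partitions -> 3 tasks per stage.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)
    val counts = words
      .map(w => (w, 1))     // narrow dependency: stays in the first stage
      .reduceByKey(_ + _)   // wide dependency: shuffle boundary starts a second stage

    // One action -> one job; the Spark UI shows 1 job, 2 stages, tasks = partitions.
    println(counts.count())

    sc.stop()
  }
}
```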
06-20-2017
07:59 AM
Hi @bkosaraju, thanks for the reply. I do have the history server configured and running, capturing the events to the specified directory (also, I am not using HDP; I am using Spark standalone and running Spark from IntelliJ). The issue is that the streaming events, i.e. the details for each batch, are not captured in the history server. I have, however, overridden the streaming event listener (onBatchSubmitted) and added code to write to a log file.
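For reference, a sketch of that kind of listener (the CSV path is an assumption): it appends each batch's record count and scheduling/processing/total delay to a plain log file on batch completion, and is registered with ssc.addStreamingListener before the context is started.

```scala
import java.io.{FileWriter, PrintWriter}

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Appends one CSV line per completed batch: batch time, record count and delays (ms).
class BatchTimingListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val line = Seq(
      info.batchTime.milliseconds,
      info.numRecords,
      info.schedulingDelay.getOrElse(-1L),
      info.processingDelay.getOrElse(-1L),
      info.totalDelay.getOrElse(-1L)
    ).mkString(",")

    val out = new PrintWriter(new FileWriter("/tmp/batch-timings.csv", true)) // hypothetical path
    try out.println(line) finally out.close()
  }
}

// Registered on the StreamingContext before ssc.start():
//   ssc.addStreamingListener(new BatchTimingListener())
```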
09-07-2017
11:02 AM
I am also facing the same issue. I found Ambari in my sandbox running on port 9080 rather than 19090; for a change, in my HDF Sandbox, Ambari is running on http://localhost:9080.
06-05-2017
11:54 PM
@Arsalan Siddiqi I assumed UpdateRecord had been there since 1.2.0, but it is not. Sorry about that. I created another template that doesn't use UpdateRecord; instead, I used another QueryRecord to update the CSV, and confirmed it works with NiFi 1.2.0. Hope you will find it useful. https://gist.githubusercontent.com/ijokarumawak/7e20af1cd222fb2adf13acb2b0f46aed/raw/e150884f52ca186dd61433428b38d172aaa7b128/Join_CSV_Files_1.2.0.xml
05-16-2017
04:44 AM
1 Kudo
Hi @Arsalan Siddiqi, as an alternative to the above response, you may take the help of Livy, in which case you don't need to worry about configuring the NiFi environment to include Spark-specific configuration. Since Livy takes REST requests, this works with the same ExecuteProcess or ExecuteStreamCommand processors; a curl command needs to be issued. This is very handy when your NiFi and Spark are running on different servers. Please refer to the Livy documentation on that front.
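As a rough illustration of that route (host, port, jar path, and class name are placeholders; Livy's default port 8998 and its POST /batches endpoint are assumed), a batch submission issued programmatically could look like this:

```scala
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

import scala.io.Source

object SubmitViaLivy {
  def main(args: Array[String]): Unit = {
    val url  = new URL("http://livy-host:8998/batches")   // placeholder Livy endpoint
    val body = """{"file": "hdfs:///jobs/my-spark-job.jar", "className": "com.example.MyJob"}"""

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    val writer = new OutputStreamWriter(conn.getOutputStream)
    try writer.write(body) finally writer.close()

    // Livy replies with the batch id and state, which can be polled afterwards.
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}
```

The same request body works from an ExecuteStreamCommand processor invoking curl.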
05-10-2017
05:40 PM
Attributes are held in memory and follow a flow file around (each flow file gets its own copy of the attributes), so you want to keep the number and size of attributes to a minimum for best performance. However, if the solution works for you and doesn't cause performance or memory issues, then great! 🙂
04-20-2017
05:56 PM
Hi @Bala Vignesh N V, below are the properties set for InferAvroSchema and the properties set for CsvToAvro, along with a sample CSV: header1,header2,header3,header4 value1,value2,value3,value4. My CSV has around 30 columns and more than 100,000 rows. With all of the above configuration I get a "Cannot find schema" error.
03-29-2017
03:22 PM
I think you'd need a custom ExecuteSpark processor or something that could collect some of the provenance information, perhaps as metadata, to become attributes on the resulting flow file(s). There would be no individual provenance event for Spark per se, but you could generate a Receive event, and the lineage would also include the flow file itself, which would carry the Spark provenance metadata as attribute(s).
01-16-2017
03:54 PM
6 Kudos
Hello @Arsalan Siddiqi. These are some excellent questions and thoughts regarding provenance. Let me try to answer them in order.

ONE: The Apache NiFi community can definitely help you with questions on the specific timing of releases and what will be included. I do know, though, that there is work underway around Apache NiFi's provenance repository so that it can index even more event data per second than it does today. Exactly when this will end up in a release is subject to the normal community process of the contribution being submitted, reviewed, and merged. That said, there is a lot of interest in higher provenance indexing rates, so I'd expect it to be in an upcoming release.

TWO: The current limitation we generally see is related to what I mention above in ONE. That is, we see the provenance indexing rate being a bottleneck on overall processing of data, because we apply backpressure to ensure that the backlog of provenance indexing doesn't just grow unbounded while more and more event data is processed. We are first going to make indexing faster. There are other techniques we could try later, such as indexing less data, which would make indexing far faster at the expense of slower queries; such a tradeoff might make sense.

THREE: Integration with a system such as Apache Atlas has been shown to be a very compelling combination here. The provenance that NiFi generates plays nicely with the type that Atlas ingests. If we get more and more provenance-enabled systems reporting to Apache Atlas, then it can be the central place to view such data and get a view of what other systems are doing, and thus give that nice system-of-systems view that people really need. To truly prove lineage across systems, there would likely need to be some cryptographically verifiable techniques employed.

FOUR: The provenance data at present is prone to manipulation. In Apache NiFi we have flagged future work to adopt privacy-by-design features, such as those that would help detect manipulated data, and we're also looking at solutions to keep distributed copies of the data to help with loss of availability as well.

FIVE: It is designed for extension in parts. You can, for example, create your own implementation of a provenance repository. You can create your own reporting tasks, which can harvest data from the provenance repository and send it to other systems as desired. At the moment we don't have it open for creating additional event types; we're intentionally trying to keep the vocabulary small and succinct.

There are so many things left that we can do with this data in Apache NiFi and beyond to take full advantage of what it offers for the flow manager, the systems architect, the security professional, etc. There is also some great inter- and intra-system timing data that can be gleaned from this. Systems like to brag about how fast they are... provenance is the truth teller. Hope that helps a bit. Joe