Member since
10-17-2016
93
Posts
10
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
5007 | 09-28-2017 04:38 PM | |
7470 | 08-24-2017 06:12 PM | |
1939 | 07-03-2017 12:20 PM |
07-20-2017
10:48 AM
Cant find the scripts in section 3.2 of the tutorial. Any help there?
... View more
Labels:
- Labels:
-
Hortonworks Data Platform (HDP)
07-03-2017
12:20 PM
Hi After a bit of search I found that I can write each dstream RDD to specified path using the saveasTextFile method within the foreachRDD action. The problem is that this would write the partitions for the RDD to the location. If you have 3 partitions for the RDD, you will have something like part-0000 part-0001 part 0002 and this would be overwritten when the next batch starts. meaning if the following batch has 1 partition, the file 0001 and 0002 will be deleted and 0000 will be overwritten with the new data. I have seen that people have written code to merge these files. As I wanted the data for each batch and did not want to loose the data, I specified the path as follows fileIDs.foreachRDD(rdd =>rdd.saveAsTextFile("/home/arsalan/SparkRDDData/"+ssc.sparkContext.applicationId+"/"+ System.currentTimeMillis() )) this way it would create a new folder for each batch. Later I can get the data for each batch and dont have to worry about finding ways to avoid overwriting of the files.
... View more
06-30-2017
07:58 AM
Hi I am using NiFi to stream csv files to Spark Streaming. Within Spark I register and override a streaming listeners to get batch (write to file) related information: Spark Streaming Listener. So for each batch I can know the start time, end time, scheduling delay, processing time and number of records etc. What I want is to know is, exactly what files were processed in a batch. so I would want to output the batch info mentioned above with an array of UUIDs for all processed files in that batch (the UUIDs can be the file attribute or if need be can be inside the content of the file aswell). I dont think I can pass the Dtream RDD to the listener. Any suggestions?
Thanks
... View more
Labels:
- Labels:
-
Apache Spark
06-26-2017
02:43 PM
@jfrazee thanks for the reply. I am using spark streaming which processes data in batches. I want to know how long does it take to process a batch for a given application (keeping the factors like number of nodes in the cluster constant) at a given data rate (records/batch). I eventually want to check an SLA to make sure that the end to end delay would still fall within the SLA, therefore I want to gather historic data from the application runs and make predictions for the time to process a batch. before starting a new batch you can already make a prediction whether it would voilate the SLA. I will have a look into your suggestions. Thanks
... View more
06-26-2017
01:42 PM
Hi I am totally new to SparkML. I capture the batch processing information for Spark Streaming and write it to file. I capture the following information per batch (FYI each batch in spark is a jobset which means it is a set of jobs.) BatchTime BatchStarted FirstJobStartTime LastJobCompletionTime FirstJobSchedulingDelay TotalJobProcessingTime (time to process all jobs in a batch) NumberOfRecords SubmissionTime TotalDelay (Total execution time for a batch from the time it is submitted, scheduled and processed.) Lets say I want to make a prediction against what will be the total delay when the number of records are X in a batch. Can anyone suggest what machine learning algorithm will be applicable in this scenario (linear regression, classification etc)? Of course the most important parameters would be scheduling delay, total delay and number of records and Job processing time. Thanks
... View more
Labels:
- Labels:
-
Apache Spark
06-21-2017
01:24 PM
@Marco Gaido Thanks for your answer. I also found the answer on slide 23 here: Deep dive with Spark Streaming. I do agree , you can get the number of blocks which represent the partitions. total tasks per job = number of stages in the job * number of partitions I was also wondering what happens when the data rate varies considerably, will we have uneven blocks? meaning that tasks will have uneven workload? I am a bit confused now. From simple spark used for batch processing Spark Architecture you have jobs where each job has stages and stages have tasks. Here whenever you have an action a new stage is created. Therefore in such a case the number of stages will depend on the number of actions. A stage contains all transformations until an action is performed (or output). In case of spark streaming, we have one job per action. How many stages will be there if you have a separate job when you perform an action? Thanks
... View more
06-21-2017
07:36 AM
i am running NiFi locally on my laptop... The breaking of file in steps works... thanks... will consider kafka 🙂
... View more
06-20-2017
02:11 PM
Hi I am currently running NiFi on my laptop and not in HDP. I have a huge CSV file (2.7 GB). The csv file contains New York Taxi events, containing basic information about a taxi trip. The header is as follows: vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount I ran the follwoing command to get a subset of the file (command line linux): head -n 50000 taxi_sorted.csv > NifiSparkSample50k.csv and the file size turns out to be 5.6 MB. The file contains approximately (2.5 GB * 1024 /5.6) * 50000= 22857142 records What I do is read the file and split it per record and then send the records to Spark Streaming via the nifi spark receiver. Unfortunately (and obviously ) the system gets stuck and NiFi does not respond. I want to have data streamed to Spark. What could be a better way to do this? Instead of splitting the file per record, should i split the file to say 50 000 records and then send these files to Spark?
Regards The following is the log entry 2017-06-20 16:00:11,363 ERROR [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process due to java.lang.NullPointerException; rolling back session: {}
java.lang.NullPointerException: null
2017-06-20 16:00:11,380 ERROR [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process session due to java.lang.NullPointerException: {}
java.lang.NullPointerException: null
2017-06-20 16:00:11,381 WARN [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] Processor Administratively Yielded for 1 sec due to processing failure
2017-06-20 16:00:11,381 WARN [Timer-Driven Process Thread-4] o.a.n.c.t.ContinuallyRunProcessorTask Administratively Yielding SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] due to uncaught Exception: java.lang.NullPointerException
2017-06-20 16:00:11,381 WARN [Timer-Driven Process Thread-4] o.a.n.c.t.ContinuallyRunProcessorTask
java.lang.NullPointerException: null
2017-06-20 16:00:41,385 ERROR [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process due to java.lang.NullPointerException; rolling back session: {}
java.lang.NullPointerException: null
2017-06-20 16:00:41,385 ERROR [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process session due to java.lang.NullPointerException: {}
java.lang.NullPointerException: null
... View more
Labels:
- Labels:
-
Apache NiFi
-
Apache Spark
06-20-2017
09:26 AM
Hi Is there a way to determine how many jobs will eventually be created against a batch in spark Streaming. Spark captures all the events within a window called batch interval. Apart from this we also have a block interval which divides the batch data into blocks. Example: batch interval 5 seconds Block Interval: 1 second This means you will have 5 blocks in a batch. Each block is processed by a task. Therefore you will have a total of 5 tasks. How can I find the number of Jobs that will be there in a batch? In a spark application you have: Jobs which consists of a number of Sequential Stages and each stage consists of a number of Tasks (mentioned above).
... View more
Labels:
- Labels:
-
Apache Spark
06-09-2017
09:00 AM
to run the code in intellij the above code is fine! Only need to add ssc.awaitTermination() after ssc.start().
To run in shell, I need to create a fatJar (uberjar/standalone Jar)The missing import org.apache.nifi.events._ was available in nifi-framework-api-1.2.0.jar .
I used maven to create the fat jar using the maven-assembly-plugin
... View more