Member since: 10-17-2016
Posts: 93
Kudos Received: 10
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4885 | 09-28-2017 04:38 PM |
| | 7332 | 08-24-2017 06:12 PM |
| | 1899 | 07-03-2017 12:20 PM |
07-20-2017
10:48 AM
Can't find the scripts in section 3.2 of the tutorial. Any help there?
Labels:
- Hortonworks Data Platform (HDP)
07-03-2017
12:20 PM
Hi. After a bit of searching I found that I can write each DStream RDD to a specified path using the saveAsTextFile method inside the foreachRDD action. The problem is that this writes the RDD's partitions to that location: if the RDD has 3 partitions, you get files like part-00000, part-00001 and part-00002, and they get overwritten when the next batch starts. That means if the following batch has only 1 partition, part-00001 and part-00002 are deleted and part-00000 is overwritten with the new data. I have seen that people have written code to merge these files. Because I wanted the data for each batch and did not want to lose any of it, I specified the path as follows: fileIDs.foreachRDD(rdd => rdd.saveAsTextFile("/home/arsalan/SparkRDDData/" + ssc.sparkContext.applicationId + "/" + System.currentTimeMillis())) This way a new folder is created for each batch. Later I can get the data for each batch and don't have to worry about finding ways to avoid the files being overwritten.
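For reference, a self-contained sketch of that approach in Scala (the StreamingContext setup and input source here are placeholders; only the foreachRDD call mirrors the snippet above):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder setup; the real application creates ssc and the fileIDs DStream elsewhere.
val conf = new SparkConf().setAppName("PerBatchOutput").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
val fileIDs = ssc.textFileStream("/tmp/input")

// Write each micro-batch into its own directory so a later batch cannot
// overwrite the part-0000x files produced by an earlier one.
fileIDs.foreachRDD { rdd =>
  rdd.saveAsTextFile(
    "/home/arsalan/SparkRDDData/" + ssc.sparkContext.applicationId +
      "/" + System.currentTimeMillis())
}

ssc.start()
ssc.awaitTermination()
```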
06-30-2017
07:58 AM
Hi. I am using NiFi to stream CSV files to Spark Streaming. Within Spark I register a custom streaming listener to get batch-related (write-to-file) information: Spark Streaming Listener. So for each batch I can get the start time, end time, scheduling delay, processing time, number of records, etc. What I want to know is exactly which files were processed in a batch, so I would like to output the batch information mentioned above together with an array of UUIDs for all files processed in that batch (the UUID can be a file attribute or, if need be, part of the file content as well). I don't think I can pass the DStream RDD to the listener. Any suggestions?
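For reference, a sketch of the kind of streaming listener described above (class and field names come from the org.apache.spark.streaming.scheduler API; the output format is illustrative). As noted, the listener only receives aggregate batch metrics, which is why the per-file UUIDs are not visible to it:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Collects batch-level metrics only; individual NiFi FlowFiles (and their
// UUIDs) are not available at this level.
class BatchInfoListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(
      s"batchTime=${info.batchTime} " +
      s"numRecords=${info.numRecords} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingDelayMs=${info.processingDelay.getOrElse(-1L)} " +
      s"totalDelayMs=${info.totalDelay.getOrElse(-1L)}")
  }
}

// Registration, where ssc is the application's StreamingContext:
// ssc.addStreamingListener(new BatchInfoListener)
```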
Thanks
Labels:
- Apache Spark
06-26-2017
02:43 PM
@jfrazee thanks for the reply. I am using Spark Streaming, which processes data in batches. I want to know how long it takes to process a batch for a given application (keeping factors like the number of nodes in the cluster constant) at a given data rate (records per batch). I eventually want to check an SLA to make sure that the end-to-end delay still falls within it, so I want to gather historic data from application runs and predict the time to process a batch; before starting a new batch you can then already predict whether it would violate the SLA. I will have a look at your suggestions. Thanks
06-26-2017
01:42 PM
Hi. I am totally new to Spark ML. I capture the batch processing information for Spark Streaming and write it to file. I capture the following information per batch (FYI, each batch in Spark is a job set, i.e. a set of jobs):

- BatchTime
- BatchStarted
- FirstJobStartTime
- LastJobCompletionTime
- FirstJobSchedulingDelay
- TotalJobProcessingTime (time to process all jobs in a batch)
- NumberOfRecords
- SubmissionTime
- TotalDelay (total execution time for a batch from the time it is submitted, scheduled and processed)

Let's say I want to predict what the total delay will be when the number of records in a batch is X. Can anyone suggest which machine learning algorithm would be applicable in this scenario (linear regression, classification, etc.)? Of course the most important parameters would be the scheduling delay, total delay, number of records and job processing time. Thanks
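If linear regression were the algorithm tried, a minimal Spark ML sketch over the metrics listed above might look like this (the CSV path, column names and feature selection are assumptions, not the actual listener output):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("BatchDelayPrediction").getOrCreate()

// Hypothetical file and column names; the real metrics file may differ.
val batches = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/batch_metrics.csv")

// Predict TotalDelay from NumberOfRecords (more feature columns can be added).
val assembler = new VectorAssembler()
  .setInputCols(Array("NumberOfRecords"))
  .setOutputCol("features")

val training = assembler.transform(batches)
  .withColumn("label", col("TotalDelay").cast("double"))

val model = new LinearRegression().setMaxIter(50).fit(training)
println(s"coefficients=${model.coefficients} intercept=${model.intercept}")
```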
Labels:
- Apache Spark
06-21-2017
01:24 PM
@Marco Gaido Thanks for your answer. I also found the answer on slide 23 here: Deep dive with Spark Streaming. I do agree, you can get the number of blocks, which represent the partitions: total tasks per job = number of stages in the job * number of partitions. I was also wondering what happens when the data rate varies considerably: will we have uneven blocks, meaning tasks with uneven workloads? I am a bit confused now. In plain Spark used for batch processing (Spark Architecture) you have jobs, each job has stages, and stages have tasks. There, whenever you have an action a new stage is created, so the number of stages depends on the number of actions; a stage contains all transformations until an action is performed (or output). In Spark Streaming, we have one job per action. How many stages will there be if you get a separate job whenever you perform an action? Thanks
06-21-2017
07:36 AM
I am running NiFi locally on my laptop... Breaking the file up in steps works... thanks... I will consider Kafka 🙂
06-20-2017
02:11 PM
Hi. I am currently running NiFi on my laptop and not in HDP. I have a huge CSV file (2.7 GB). The CSV file contains New York taxi events with basic information about each taxi trip. The header is as follows: vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount

I ran the following command (Linux command line) to get a subset of the file: head -n 50000 taxi_sorted.csv > NifiSparkSample50k.csv and the resulting file is 5.6 MB. The full file therefore contains approximately (2.5 GB * 1024 / 5.6) * 50000 = 22857142 records.

What I do is read the file, split it per record, and then send the records to Spark Streaming via the NiFi Spark receiver. Unfortunately (and obviously) the system gets stuck and NiFi does not respond. I want the data streamed to Spark. What could be a better way to do this? Instead of splitting the file per record, should I split it into files of, say, 50,000 records and then send those files to Spark?
Regards

The following is the log entry:

2017-06-20 16:00:11,363 ERROR [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process due to java.lang.NullPointerException; rolling back session: {}
java.lang.NullPointerException: null
2017-06-20 16:00:11,380 ERROR [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process session due to java.lang.NullPointerException: {}
java.lang.NullPointerException: null
2017-06-20 16:00:11,381 WARN [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] Processor Administratively Yielded for 1 sec due to processing failure
2017-06-20 16:00:11,381 WARN [Timer-Driven Process Thread-4] o.a.n.c.t.ContinuallyRunProcessorTask Administratively Yielding SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] due to uncaught Exception: java.lang.NullPointerException
2017-06-20 16:00:11,381 WARN [Timer-Driven Process Thread-4] o.a.n.c.t.ContinuallyRunProcessorTask
java.lang.NullPointerException: null
2017-06-20 16:00:41,385 ERROR [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process due to java.lang.NullPointerException; rolling back session: {}
java.lang.NullPointerException: null
2017-06-20 16:00:41,385 ERROR [Timer-Driven Process Thread-10] o.a.nifi.processors.standard.SplitText SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] SplitText[id=d8251eb7-102e-115c-ee77-109159d0b9ea] failed to process session due to java.lang.NullPointerException: {}
java.lang.NullPointerException: null
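For context, this is roughly what the receiving (Spark Streaming) side of the pipeline described above looks like, based on the nifi-spark-receiver site-to-site pattern (the NiFi URL, output-port name and batch interval are placeholders):

```scala
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.{NiFiDataPacket, NiFiReceiver}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("NiFiTaxiStream").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))

// Placeholder URL and output-port name for the local NiFi instance.
val siteToSiteConf = new SiteToSiteClient.Builder()
  .url("http://localhost:8080/nifi")
  .portName("Data For Spark")
  .buildConfig()

// Each NiFiDataPacket carries one FlowFile's content plus its attributes
// (including the uuid attribute).
val packets = ssc.receiverStream(new NiFiReceiver(siteToSiteConf, StorageLevel.MEMORY_ONLY))
val lines = packets.map((p: NiFiDataPacket) => new String(p.getContent))
lines.count().print()

ssc.start()
ssc.awaitTermination()
```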
Labels:
- Apache NiFi
- Apache Spark
06-20-2017
09:26 AM
Hi. Is there a way to determine how many jobs will eventually be created for a batch in Spark Streaming? Spark captures all events within a window called the batch interval. Apart from this there is also a block interval, which divides the batch data into blocks. Example: with a batch interval of 5 seconds and a block interval of 1 second, you get 5 blocks in a batch. Each block is processed by a task, so you will have a total of 5 tasks. How can I find the number of jobs there will be in a batch? In a Spark application you have jobs, each consisting of a number of sequential stages, and each stage consists of a number of tasks (mentioned above).
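For reference, a minimal sketch of how those two intervals are configured for a receiver-based stream (the app name, master and the rest of the job are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 5 seconds and block interval of 1 second, as in the
// example above, giving roughly 5 blocks (and thus 5 tasks) per batch.
val conf = new SparkConf()
  .setAppName("BlockIntervalExample")
  .setMaster("local[2]")
  .set("spark.streaming.blockInterval", "1s") // default is 200ms

val ssc = new StreamingContext(conf, Seconds(5))
```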
Labels:
- Apache Spark
06-09-2017
09:00 AM
To run the code in IntelliJ, the above code is fine! You only need to add ssc.awaitTermination() after ssc.start().
To run it in the shell, I need to create a fat JAR (uber/standalone JAR). The missing import org.apache.nifi.events._ was available in nifi-framework-api-1.2.0.jar.
I used Maven to create the fat JAR with the maven-assembly-plugin.
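For reference, a typical maven-assembly-plugin configuration for such a fat JAR looks roughly like this (the plugin version and main class are placeholders, not taken from the actual project):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>3.0.0</version>
  <configuration>
    <descriptorRefs>
      <!-- Bundles all dependencies into a single runnable JAR. -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <!-- Placeholder main class. -->
        <mainClass>com.example.StreamingApp</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```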