Member since: 07-14-2017
Posts: 99
Kudos Received: 5
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1446 | 09-05-2018 09:58 AM |
| | 1965 | 07-31-2018 12:59 PM |
| | 1440 | 01-15-2018 12:07 PM |
| | 1343 | 11-23-2017 04:19 PM |
08-13-2018
02:14 PM
Hi, I am receiving data from TCP as a JSON stream using PySpark. I want to append messages to minute-based files (named yyyyMMddHHmm, so all messages in one minute go to the corresponding file) and, in parallel, save the JSON to an ORC Hive table. I have two questions.

1. With path = '/folder/file', when I receive data in the DStream I flatMap and split("\n"), then repartition(1).saveAsTextFiles(path, "json"):

```
lines = ssc.socketTextStream("localhost", 9999)
flat_map = lines.flatMap(lambda x: x.split("\n"))
flat_map.repartition(1).saveAsTextFiles(path, "json")
```

This saves to the given path, but instead of one file per minute under a single folder, it creates a new folder for every batch, each containing a _SUCCESS file and a part-00000 file, which is not what I expect. How can I get one folder per day, with one file per minute under that folder?

2. If I want to save the JSON to an ORC Hive table, can I do it directly from a DStream, or do I have to convert the DStream to RDDs and do some processing to save it as ORC?

As I am new to PySpark, please help with the above, ideally with some examples.
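One way to handle both parts with DStreams is foreachRDD: build the output path from the batch time for the per-minute files, and convert each batch to a DataFrame for the ORC write. A minimal sketch, not a tested solution, assuming Spark 2.x with Hive support, a 60-second batch interval (one micro-batch per minute), and the placeholder names /folder and db.json_orc_table:

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = (SparkSession.builder
         .appName("json-stream-to-orc")
         .enableHiveSupport()
         .getOrCreate())
ssc = StreamingContext(spark.sparkContext, 60)  # one batch per minute

lines = ssc.socketTextStream("localhost", 9999)
flat_map = lines.flatMap(lambda x: x.split("\n"))

def save_batch(time, rdd):
    # PySpark passes the batch time as a datetime when the function
    # given to foreachRDD takes two arguments.
    if rdd.isEmpty():
        return
    day = time.strftime("%Y%m%d")
    minute = time.strftime("%Y%m%d%H%M")
    # saveAsTextFile always writes a *directory* of part files, so this
    # yields /folder/<day>/<minute>/part-00000.
    rdd.coalesce(1).saveAsTextFile("/folder/%s/%s" % (day, minute))
    # Parse the JSON lines into a DataFrame and append to an ORC Hive table.
    df = spark.read.json(rdd)
    df.write.mode("append").format("orc").saveAsTable("db.json_orc_table")

flat_map.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()
```

Note that Hadoop's save API always produces a directory of part files; getting a literal single file per minute would need a post-write rename through the Hadoop FileSystem API.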
Labels:
- Apache Spark
08-10-2018
09:12 AM
@Veerendra Nath Jasthi What is the frequency of the files, and how big are the files in the given path? Also, could you please check your JVM heap memory (this is a guess, not a solution)?
08-10-2018
09:08 AM
@Felix Albani Can you help me with the PySpark version of the above, please?
07-31-2018
02:36 PM
@Veerendra Nath Jasthi Possibly you have a complicated computation (maybe a regex) running on GetFile that is taking a long time to complete. Also check how many files it is picking up based on your regex; correcting that should fix it.
07-31-2018
01:51 PM
@Felix Albani Thank you for the quick response; I will go through the given info.
07-31-2018
01:45 PM
@Veerendra Nath Jasthi That should not be the case. Which processors are you using when you hit the issue?
07-31-2018
01:15 PM
@veerendra If you are using NiFi below 1.7, the best way is to restart NiFi.
07-31-2018
01:10 PM
Hi All, I am a beginner with Spark and want to do the following. A port (55500) is sending JSONs as a stream (ex: {"one":"1","two":"2"}{"three":"3","four":"4"}). I have an ORC table in Hive with the columns given below: one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value. I want to load the streaming values into the Hive ORC table. Can you please guide me on how to achieve this? Thank you for your support.
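A minimal sketch of one possible approach, assuming Spark 2.x with Hive support; the host localhost, the table name db.stream_orc, and splitting the back-to-back JSON objects on "}{" are assumptions based on the sample payload:

```python
import json
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType, StructField, StructType
from pyspark.streaming import StreamingContext

spark = (SparkSession.builder
         .appName("socket-json-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())
ssc = StreamingContext(spark.sparkContext, 60)

# Needed so insertInto can write into partitions derived from the data.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

lines = ssc.socketTextStream("localhost", 55500)

# Schema for the four data columns; all strings in the sample payload.
schema = StructType([StructField(c, StringType())
                     for c in ["one", "two", "three", "four"]])

def split_objects(line):
    # '{"one":"1","two":"2"}{"three":"3","four":"4"}' arrives as one
    # line; split the concatenated objects apart.
    return line.replace("}{", "}\n{").split("\n")

def to_tuple(s):
    d = json.loads(s)
    return (d.get("one"), d.get("two"), d.get("three"), d.get("four"))

def save_batch(time, rdd):
    if rdd.isEmpty():
        return
    rows = rdd.flatMap(split_objects).map(to_tuple)
    # insertInto matches columns by position, so this order must mirror
    # the Hive table definition, with the partition column last. The
    # table is assumed to already exist as an ORC table partitioned
    # by partition_value.
    df = (spark.createDataFrame(rows, schema)
          .withColumn("spark_streaming_startingtime",
                      lit(time.strftime("%Y-%m-%d %H:%M:%S")))
          .withColumn("spark_streaming_endingtime",
                      lit(datetime.now().strftime("%Y-%m-%d %H:%M:%S")))
          .withColumn("partition_value", lit(time.strftime("%Y%m%d"))))
    df.write.insertInto("db.stream_orc")

lines.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()
```

The explicit schema plus tuples keeps the column order stable, since Row objects built from keyword arguments are sorted alphabetically in Spark 2.x.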
Labels:
- Apache Spark
07-31-2018
12:59 PM
@Bryan Bende I checked nifi-app.log; the JVM heap usage was at its maximum, which was causing connections to be rejected and the processor to fail. It got resolved once the heap size issue was fixed. Thank you for your support.
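For reference, NiFi's JVM heap is set in conf/bootstrap.conf and takes effect after a restart; the 4g values below are illustrative only, not a recommendation:

```
# conf/bootstrap.conf -- heap sizes are illustrative, tune to your hardware
java.arg.2=-Xms4g
java.arg.3=-Xmx4g
```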
07-25-2018
12:31 PM
I am using the ListenSyslog processor (not on port 514). It was working fine; we upgraded NiFi to 1.5.0.3.1.1.7-2 and it worked fine after the upgrade, but for the last 3 days the processor has been throwing the error "failed to invoke @OnScheduled method". Can you please let me know how to get past this?
Labels:
- Apache NiFi