Member since
07-14-2017
99
Posts
5
Kudos Received
4
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1462 | 09-05-2018 09:58 AM
 | 1986 | 07-31-2018 12:59 PM
 | 1454 | 01-15-2018 12:07 PM
 | 1358 | 11-23-2017 04:19 PM
08-13-2018
02:14 PM
Hi, I am receiving data over TCP as a JSON stream using PySpark. I want to save the messages to files, appending as they arrive, with one file per minute named yyyyMMddHHmm, so that all messages received in a given minute go to the corresponding file. In parallel, I want to save the JSON to an ORC Hive table. I have two questions.

1. (path: '/folder/file') When I receive data in the DStream, I flatMap with split("\n") and then call repartition(1).saveAsTextFiles(path, "json"):

lines = ssc.socketTextStream("localhost", 9999)
flat_map = lines.flatMap(lambda x: x.split("\n"))
flat_map.repartition(1).saveAsTextFiles(path, "json")

This saves to the given path, but instead of writing a single file per minute into the folder, it creates three folders, each containing a _SUCCESS file and a part-00000 file, which is not what I expected. How can I get the layout I want, that is, one folder per day and one file per minute under that folder?

2. Can I save the JSON to an ORC Hive table directly from a DStream, or do I have to convert the DStream to RDDs and then do some processing to save it as ORC? I am new to PySpark, so any help or examples would be appreciated.
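A possible direction for question 1, offered as a sketch rather than a confirmed fix: saveAsTextFiles writes one output directory per batch interval (named from the prefix, the batch time in milliseconds, and the suffix), which is why folders with part files appear. One alternative is to handle each batch yourself in foreachRDD and append to a file whose name is derived from the batch time. The helper names below (minute_file_path, append_batch) are hypothetical, and the sketch assumes a local filesystem and batches small enough to collect on the driver.

```python
import os
from datetime import datetime


def minute_file_path(base_dir, batch_time):
    """Return base_dir/yyyyMMdd/yyyyMMddHHmm.json for the given datetime."""
    day = batch_time.strftime("%Y%m%d")
    minute = batch_time.strftime("%Y%m%d%H%M")
    return os.path.join(base_dir, day, minute + ".json")


def append_batch(base_dir, batch_time, records):
    """Append one line per record to the file for batch_time's minute."""
    path = minute_file_path(base_dir, batch_time)
    # Create the per-day folder on first use.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(record + "\n")
    return path


# Hypothetical wiring into the stream from the question (not executed here);
# foreachRDD can pass the batch time as a datetime along with the RDD:
# flat_map.foreachRDD(
#     lambda time, rdd: append_batch("/folder", time, rdd.collect())
# )
```

For question 2, the same foreachRDD hook is one place where each non-empty batch could be turned into a DataFrame with spark.read.json(rdd) and appended with df.write.format("orc").mode("append").saveAsTable(...); the table name and the column mapping would depend on your Hive schema.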
Labels: Apache Spark
08-10-2018
09:12 AM
@Veerendra Nath Jasthi What is the frequency of the files, and how big are the files in the given path? Also, could you please check your JVM heap memory (this is a guess, not a solution)?
08-10-2018
09:08 AM
@Felix Albani Can you help me with the PySpark version of the above, please?
07-31-2018
02:36 PM
@Veerendra Nath Jasthi Possibly you have a complicated computation (maybe a regex) running in GetFile that is taking a long time to complete. Also check how many files it is picking up based on your regex; it should be fixed.
07-31-2018
01:51 PM
@Felix Albani Thank you for the quick response. I will go through the given info.
07-31-2018
01:45 PM
@Veerendra Nath Jasthi That should not be the case. Which processors are you using when you get the issue?
07-31-2018
01:15 PM
@veerendra If you are using a NiFi version below 1.7, the best way is to restart NiFi.
07-31-2018
01:10 PM
Hi all, I am a beginner with Spark and want to do the following. Port 55500 sends JSON objects as a stream (e.g. {"one":"1","two":"2"}{"three":"3","four":"4"}). I have an ORC table in Hive with these columns: one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value. I want to load the streamed values into the Hive ORC table. Can you please guide me on how to achieve this? Thank you for your support.
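One detail in the example stream is that the JSON objects arrive back to back with no separator, so splitting on newlines will not isolate them. A minimal sketch of one way to separate them, using only the Python standard library, is below; the function name split_concatenated_json is hypothetical, and the sketch assumes each chunk of text contains only complete objects (an object cut off at a chunk boundary would raise json.JSONDecodeError and would need buffering).

```python
import json


def split_concatenated_json(text):
    """Yield each JSON value found back to back in text."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # Skip any whitespace between consecutive objects.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode parses one value and returns the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


sample = '{"one":"1","two":"2"}{"three":"3","four":"4"}'
records = list(split_concatenated_json(sample))
```

Each resulting dict could then be mapped onto the table's columns before the batch is written; how the missing columns (for example spark_streaming_startingtime) are filled is left to the job.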
Labels: Apache Spark
07-31-2018
12:59 PM
@Bryan Bende I checked nifi-app.log: the JVM heap was at its maximum size, which was causing connections to be rejected and the processor to fail. The problem went away once the heap size issue was solved. Thank you for your support.
07-25-2018
12:31 PM
I am using the ListenSyslog processor (not on port 514). It was working fine. We upgraded NiFi to 1.5.0.3.1.1.7-2, and it kept working after the upgrade, but for the last 3 days the processor has been throwing the error "failed to invoke @OnScheduled method". Can you please let me know how to overcome this?
Labels: Apache NiFi