Created 02-22-2017 06:24 PM
I am trying to read the gzip files in a directory in parallel. I followed the steps advised by Matei in the following link,
but I get the exception below. It looks like other people have hit the same exception as well. I just wanted to know whether this is possible in Spark 2.1.0. I am running on a local VM at the moment, using a simple two-line piece of code.
val splitFilesPathList = List("s3n://pathtos3file1", "file2", "file3") // etc.
val lineRDD = sc.parallelize(splitFilesPathList, 4).map(path => sc.textFile(path)).take(10).toList.foreach(println)
The above two lines of code do not work. Any help is really appreciated.
I have checked for closures and moved the code into a new Scala class that extends Serializable, but I still get the same exception.
I have tried almost every approach I could think of.
I checked this as well:
https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html
Created 02-23-2017 04:57 AM
Since we can't see the whole code, I have a quick question: have you tried making "val lineRDD" transient by using "@transient"?
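For reference, this is roughly what I mean (the class and field names below are made up for illustration; the annotation just tells the serializer to skip that field when the enclosing object is shipped to the executors):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class LineReader(@transient val sc: SparkContext) extends Serializable {
  // Marked @transient so this driver-side handle is not serialized with the class.
  @transient lazy val lineRDD: RDD[String] = sc.textFile("s3n://some-bucket/some-file.gz")
}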
Created 02-23-2017 05:14 AM
@mqureshi yes, I tried making it transient as well, no luck. Inside the map, map(path => sc.textFile(path)) is not reading the contents of the file; I think it is just returning the string, which is strange, because if I call sc.textFile outside the map function it returns the data. The above is the whole code, nothing else to it. It looks simple. Probably, rather than calling sc.textFile inside the map, I may need to fetch the data using the S3 API as shown in this link:
http://michaelryanbell.com/processing-whole-files-spark-s3.html
but I am not sure; I might still get the task serialization error. Any ideas on how to better implement this requirement?
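Something along these lines is what I have in mind, based on the approach in that article (the bucket/key values are placeholders and the AWS SDK client setup is only a sketch, relying on the default credentials chain):

import java.util.zip.GZIPInputStream
import scala.io.Source
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Placeholder (bucket, key) pairs; in practice these would be parsed from the path list.
val s3Objects = List(
  ("my-bucket", "path/to/file1.gz"),
  ("my-bucket", "path/to/file2.gz")
)

val lineRDD = sc.parallelize(s3Objects, 4).mapPartitions { iter =>
  // Build the S3 client on the executor, inside the partition, so nothing
  // from the driver (such as the SparkContext) has to be serialized.
  val s3 = AmazonS3ClientBuilder.defaultClient()
  iter.flatMap { case (bucket, key) =>
    val in = new GZIPInputStream(s3.getObject(bucket, key).getObjectContent)
    Source.fromInputStream(in).getLines()
  }
}

lineRDD.take(10).foreach(println)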
Created 03-03-2017 10:07 PM
This is a recurrent error message in Spark. It means that the classes you are trying to use within the map are only available on the driver, not on the workers, and the SparkContext is one of these.
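So any SparkContext call has to stay on the driver side. One common way around this, sketched roughly below (the paths are just placeholders), is to pass all the paths to a single sc.textFile call, which accepts a comma-separated list, or to build one RDD per path on the driver and union them:

// All SparkContext calls happen on the driver; gzip files are decompressed transparently.
val splitFilesPathList = List(
  "s3n://my-bucket/file1.gz",   // placeholder paths
  "s3n://my-bucket/file2.gz",
  "s3n://my-bucket/file3.gz"
)

val lineRDD = sc.textFile(splitFilesPathList.mkString(","))
lineRDD.take(10).foreach(println)

// Alternative: one RDD per path, combined on the driver.
val unioned = sc.union(splitFilesPathList.map(p => sc.textFile(p)))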