
Loading all the gzip files from an S3 directory into Spark in parallel throws a Task not serializable exception


I am trying to read the gzip files in a directory in parallel. I followed the steps advised by Matei in the following link,

but I get the exception below. It looks like other people have run into the same exception as well. I just wanted to know whether it is possible to achieve this in Spark 2.1.0. I am running on a local VM at the moment, using a simple two-line piece of code.

val splitFilesPathList = List("s3n://pathtos3file1", "file2", "file3") // etc.

val lineRDD = sc.parallelize(splitFilesPathList, 4).map(path => sc.textFile(path)).take(10).toList.foreach(println)

The above two lines of code do not work. Any help is really appreciated.

I have checked for closures and moved the code into a new Scala class that extends Serializable, but I still get the Task not serializable exception.

I have tried almost every approach I could think of.

I checked this as well


Super Guru

Since we can't see the whole code, I have a quick question. Have you tried making "val lineRDD" transient by using "@transient"?


@mqureshi Yes, I tried making it transient as well, with no luck. Also, inside the map, map(path => sc.textFile(path)) is not reading the contents of the file; I think it is just returning the string, which is strange, because if I call sc.textFile outside the map function it does return the data. The above is the whole code, nothing else to it. It looks simple. Perhaps, rather than calling sc.textFile inside the map, I need to fetch the data using the S3 API as shown in this link,

but I am not sure; I might still get the task serialization error. Any ideas on how to better implement this requirement?
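For what it's worth, here is a minimal sketch of that S3-API alternative (bucket name and keys are hypothetical; it assumes the AWS SDK for Java v1 is on the classpath). The key point is that the S3 client is constructed inside mapPartitions, so it is created on each executor and nothing non-serializable is captured in the closure:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import java.util.zip.GZIPInputStream
import scala.io.Source

// Hypothetical bucket and keys for illustration only.
val keys = sc.parallelize(Seq("path/to/file1.gz", "path/to/file2.gz"), 4)

val lines = keys.mapPartitions { iter =>
  // Built on the executor, never serialized from the driver.
  val s3 = AmazonS3ClientBuilder.defaultClient()
  iter.flatMap { key =>
    val obj = s3.getObject("my-bucket", key)
    // Decompress the gzip stream and yield its lines.
    Source.fromInputStream(new GZIPInputStream(obj.getObjectContent)).getLines()
  }
}

lines.take(10).foreach(println)
```

This avoids the serialization problem, but it is usually only worth it when you need per-object control; for plain line reading, Spark's built-in readers are simpler.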

This is a recurrent error message in Spark: it means that the classes you are trying to use within the map are only available on the driver, not on the workers. The SparkContext is one of these, so calling sc.textFile inside a map that runs on the executors cannot work.
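In this case there is no need to nest sc.textFile at all: sc.textFile already accepts a comma-separated list of paths, as well as glob patterns, so the whole job can stay on the driver side. A minimal sketch (paths are placeholders):

```scala
// sc.textFile accepts comma-separated paths and glob patterns,
// so there is no need to call it inside a map on the executors.
val splitFilesPathList = List("s3n://bucket/path/file1.gz", "s3n://bucket/path/file2.gz")

val lineRDD = sc.textFile(splitFilesPathList.mkString(","), minPartitions = 4)
lineRDD.take(10).foreach(println)

// Alternatively, read every gzip file in the directory with a glob:
// sc.textFile("s3n://bucket/path/*.gz")
```

Note that each gzip file is not splittable, so it becomes a single partition; the parallelism comes from reading the many files at once.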