Load all the gzip files from an S3 dir into Spark in parallel: Task not serializable exception
Contributor

I am trying to read the gzip files in a directory in parallel. I followed the approach advised by Matei in the following link:

http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td...

but I get the exception below. It looks like other people hit the same exception as well. I just want to know whether it's possible to achieve this in Spark 2.1.0. I am running on a local VM at the moment, using a simple two-line snippet.

val splitFilesPathList = List("s3n://pathtos3file1", "file2", "file3") // etc...

val lineRDD = sc.parallelize(splitFilesPathList, 4).map(path => sc.textFile(path)).take(10).toList.foreach(println)

The above two lines of code don't work. Any help is really appreciated.

I have checked for closures and moved the code into a new Scala class that extends Serializable, but I still get the task not serializable exception. I have tried almost every possible approach.

I checked this as well:

https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html


Re: Load all the gzip files from an S3 dir into Spark in parallel: Task not serializable exception

Super Guru
@BigDataRocks

Since we can't see the whole code, I have a quick question: have you tried making "val lineRDD" transient by annotating it with "@transient"?

Re: Load all the gzip files from an S3 dir into Spark in parallel: Task not serializable exception

Contributor

@mqureshi Yes, I tried making it transient as well, with no luck. Also, inside .map(path => sc.textFile(path)) it's not reading the contents of the file; I think it's just returning an unevaluated reference, which is strange, because if I call sc.textFile outside the map function it returns the data. The above is the whole code, nothing else to it. It looks simple. Perhaps, rather than calling sc.textFile inside the map, I need to fetch the data using the S3 API, as shown in this link:

http://michaelryanbell.com/processing-whole-files-spark-s3.html

but I'm not sure; I might still get the task serialization error. Any ideas on how to better implement this requirement?
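If you do go the route of fetching the objects yourself, here is a rough sketch of that idea. It assumes the AWS SDK for Java (v1) is on the executors' classpath and that credentials are available from the default provider chain; the bucket and key names are hypothetical. The serialization problem is avoided by constructing the S3 client inside mapPartitions, so nothing driver-only (like the SparkContext) is captured by the closure:

```scala
import java.util.zip.GZIPInputStream
import scala.io.Source
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Hypothetical bucket and keys, for illustration only.
val bucket = "my-bucket"
val keys = List("dir/file1.gz", "dir/file2.gz")

val lineRDD = sc.parallelize(keys, 4).mapPartitions { iter =>
  // Build the client on the executor; do not reference sc or any
  // other driver-only object inside this closure.
  val s3 = AmazonS3ClientBuilder.defaultClient()
  iter.flatMap { key =>
    val obj = s3.getObject(bucket, key)
    // Each .gz object is decompressed and read line by line.
    Source.fromInputStream(new GZIPInputStream(obj.getObjectContent)).getLines()
  }
}
```

This is only a sketch: in a real job you would also want to close each stream after consuming it and handle missing or corrupt objects.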

Re: Load all the gzip files from an S3 dir into Spark in parallel: Task not serializable exception

This is a recurring error message in Spark. It means that the classes you are trying to use inside the map are only available on the driver, not on the workers. The SparkContext is one of these: it lives on the driver and cannot be serialized into the closure that gets shipped to the executors, which is why calling sc.textFile inside a map on an RDD fails with this exception.
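A minimal sketch of the usual fix, assuming the files are line-delimited text: sc.textFile accepts a comma-separated list of paths (and glob patterns), so the whole list can be read with a single driver-side call, letting Spark handle the parallelism instead of nesting sc inside a map. The paths below are hypothetical placeholders:

```scala
// Hypothetical paths; substitute your real S3 locations.
val splitFilesPathList = List("s3n://bucket/dir/file1.gz", "s3n://bucket/dir/file2.gz")

// One call on the driver; Hadoop's codec decompresses each .gz file
// automatically, but note a gzip file is not splittable, so each one
// becomes a single partition.
val lineRDD = sc.textFile(splitFilesPathList.mkString(","))

// Or read the whole directory with a glob:
// val lineRDD = sc.textFile("s3n://bucket/dir/*.gz")

lineRDD.take(10).foreach(println)
```

For small files, sc.wholeTextFiles("s3n://bucket/dir") is another option: it returns an RDD of (path, content) pairs, one per file.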
