Spark Scala - Join multiple files using Spark

Rising Star
Hi, every time I run my Pig script it generates multiple files in HDFS (I never know how many). I need to do some analytics on them using Spark. How can I join those multiple files so I can load them as one, like:

val data = sc.textFile("PATH/Filejoined")

Thanks!
1 ACCEPTED SOLUTION

Super Collaborator

@Pedro Rodgers

You can get your multiple files into a Spark RDD with:

val data = sc.textFile("/user/pedro/pig_files/*.txt")

or even

val data = sc.textFile("/user/pedro/pig_files")

From this point onwards the Spark RDD 'data' will have as many partitions as there are Pig files. Spark is perfectly happy with that, since distributing the data across partitions speeds up anything you later do with the RDD.
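You can check the partition count yourself; a minimal sketch, assuming your Pig output really lives under /user/pedro/pig_files:

// Load all Pig output files in the directory into one RDD
val data = sc.textFile("/user/pedro/pig_files")

// Small files typically map to one partition each
println(s"Number of partitions: ${data.partitions.length}")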

Now if you want to merge those files into one and write the result back to HDFS, it is just:

data.repartition(1).saveAsTextFile("/user/pedro/new_file_dir") 
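As a side note, since repartition(1) triggers a full shuffle, coalesce(1) is usually the cheaper way to get a single output file; a minimal alternative sketch:

// coalesce(1) collapses to one partition without a full shuffle
data.coalesce(1).saveAsTextFile("/user/pedro/new_file_dir")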

You cannot (easily) control the name of the output file, only the HDFS output directory; Spark writes the merged data as a part file (e.g. part-00000) inside that directory.
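If you really need a specific file name, you can rename the part file afterwards via the Hadoop FileSystem API; a minimal sketch, assuming the output directory from the snippet above and the usual part-00000 file name (both assumptions, verify on your cluster):

import org.apache.hadoop.fs.{FileSystem, Path}

// FileSystem backed by the SparkContext's Hadoop configuration
val fs = FileSystem.get(sc.hadoopConfiguration)

// Rename the single part file to the desired name (assumed paths)
fs.rename(
  new Path("/user/pedro/new_file_dir/part-00000"),
  new Path("/user/pedro/Filejoined"))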

Hope this helps
