Spark Scala - Join multiple files using Spark
Labels: Apache Spark
Created 09-06-2016 01:03 PM
val data = sc.textFile("PATH/Filejoined")

Thanks!
Created 09-06-2016 10:16 PM
You can read multiple files into a single Spark RDD with:
val data = sc.textFile("/user/pedro/pig_files/*.txt")
or simply point at the whole directory:
val data = sc.textFile("/user/pedro/pig_files")
From this point onwards the Spark RDD 'data' will have (at least) as many partitions as there are input files. Spark is perfectly happy with that: distributing the data across partitions parallelizes whatever you do on that RDD.
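For instance, a minimal sketch to confirm the partitioning (assuming you are in spark-shell, so a SparkContext named sc already exists; the path is the one from above):

// Read the directory of files into an RDD.
val data = sc.textFile("/user/pedro/pig_files")
// Each small input file maps to at least one partition.
println(s"number of partitions: ${data.getNumPartitions}")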
Now if you want to merge those files into a single one and write it back to HDFS, it is just:
data.repartition(1).saveAsTextFile("/user/pedro/new_file_dir")
You cannot (easily) control the name of the output file, only the HDFS directory it is written to; Spark names the part files itself (e.g. part-00000).
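If the full shuffle that repartition(1) performs is a concern, a common variant is coalesce(1), which also merges everything into one partition but avoids the shuffle. A minimal sketch (the output path here is hypothetical):

val data = sc.textFile("/user/pedro/pig_files")
// coalesce(1) narrows to a single partition without a full shuffle;
// the job then writes one part file into the target folder,
// e.g. /user/pedro/merged_file_dir/part-00000
data.coalesce(1).saveAsTextFile("/user/pedro/merged_file_dir")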
Hope this helps
