Spark Scala - Join multiple files using Spark

Rising Star
Hi, every time I run my Pig script it generates multiple files in HDFS (I never know how many). I need to do some analytics on them using Spark. How can I join those multiple files so I can load them as one, like:

val data = sc.textFile("PATH/Filejoined")

Thanks!
1 ACCEPTED SOLUTION

Super Collaborator

@Pedro Rodgers

You can get your multiple files into a Spark RDD with:

val data = sc.textFile("/user/pedro/pig_files/*.txt")

or even

val data = sc.textFile("/user/pedro/pig_files")

From this point onwards the Spark RDD 'data' will have as many partitions as there are Pig files. Spark is perfectly happy with that, since distributing the data across partitions speeds up anything you later do with the RDD.
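You can check the partition count yourself; a minimal sketch, assuming your Pig output really lives under /user/pedro/pig_files:

// Load all Pig output files in the directory into one RDD
val data = sc.textFile("/user/pedro/pig_files")

// Small files typically map to one partition each
println(s"Number of partitions: ${data.partitions.length}")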

Now if you want to merge those files into one and write the result back to HDFS, it is just:

data.repartition(1).saveAsTextFile("/user/pedro/new_file_dir") 
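As a side note, since repartition(1) triggers a full shuffle, coalesce(1) is usually the cheaper way to get a single output file; a minimal alternative sketch:

// coalesce(1) collapses to one partition without a full shuffle
data.coalesce(1).saveAsTextFile("/user/pedro/new_file_dir")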

You cannot (easily) control the name of the output file, only the HDFS output directory; Spark writes the merged data as a part file (e.g. part-00000) inside that directory.
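If you really need a specific file name, you can rename the part file afterwards via the Hadoop FileSystem API; a minimal sketch, assuming the output directory from the snippet above and the usual part-00000 file name (both assumptions, verify on your cluster):

import org.apache.hadoop.fs.{FileSystem, Path}

// FileSystem backed by the SparkContext's Hadoop configuration
val fs = FileSystem.get(sc.hadoopConfiguration)

// Rename the single part file to the desired name (assumed paths)
fs.rename(
  new Path("/user/pedro/new_file_dir/part-00000"),
  new Path("/user/pedro/Filejoined"))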

Hope this helps
