
Combine csv files with one header in a csv file

Contributor

Hi friends. I have CSV files in the local file system, and they all have the same header. I want to loop over them and merge them into a single CSV file with that one header. Is there a solution using spark-csv, or anything else in Spark?

Thanks


4 REPLIES


Assumption: all files have the same columns, and in each file the first line is the header.

Here is a solution in PySpark:

Load every file via the "com.databricks.spark.csv" format, respecting the header and inferring the schema, then use Python's reduce to union them all:

from functools import reduce

files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]

# Read each file with the spark-csv package, treating the first line as the
# header and inferring column types, then union all the DataFrames into one
df = reduce(lambda x, y: x.unionAll(y),
            [sqlContext.read.format("com.databricks.spark.csv")
                       .load(f, header="true", inferSchema="true")
             for f in files])
df.show()
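
To get the final CSV file the question asks for, you can also write the unioned DataFrame back out with a single header. A minimal sketch, assuming the merged data is small enough to fit in one partition; /tmp/merged_csv is a hypothetical output directory:

# Coalesce to a single partition so the output directory holds one part file
# (assumes the merged data fits on one worker); the header is written once
df.coalesce(1) \
  .write \
  .format("com.databricks.spark.csv") \
  .option("header", "true") \
  .save("/tmp/merged_csv")

Note that Spark writes a directory, not a bare file: the merged CSV lands as a single part-00000 file inside /tmp/merged_csv and can be renamed or moved afterwards.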

Contributor

@Bernhard Walter

Thanks a lot! Could you do it in Scala please? That would be so kind of you, thanks.

ACCEPTED SOLUTION

I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// load accepts multiple paths, so all three files are read in one call;
// the header of each file is dropped and the schema is inferred
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv")

df.show()

New Contributor

You can do that by passing a list of CSV files to load:

# load also accepts a list of paths in PySpark
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"])
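
If the files follow a common naming pattern, a wildcard can work too, since paths are resolved by the underlying Hadoop FileSystem, which expands globs. A minimal sketch, assuming the hypothetical pattern /tmp/test_*.csv matches the same files:

# Hypothetical glob covering the same files; the header option still drops
# the first line of every matched file
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/tmp/test_*.csv")
df.show()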