Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Combine csv files with one header in a csv file

New Member

Hi friends, I have CSV files in the local file system. They all have the same header, and I want to merge them into a single CSV file with that header. Is there a solution using spark-csv, or anything else? I want to loop over the files and merge them into one final CSV file, using Spark. Any solution please?

Thanks

1 ACCEPTED SOLUTION


I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// load accepts multiple paths; the header and schema options apply to every file
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv")

df.show()


4 REPLIES


Assumption: all files have the same columns, and in each file the first line is the header.

This is a solution in PySpark:

I load every file via the "com.databricks.spark.csv" format, respecting the header and inferring the schema. Then I use Python's reduce to union them all.

from functools import reduce

files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]

# load each file with its header, then fold the DataFrames together pairwise
df = reduce(lambda x, y: x.unionAll(y),
            [sqlContext.read.format("com.databricks.spark.csv")
                       .load(f, header="true", inferSchema="true")
             for f in files])
df.show()
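As a side note on how the reduce call works: reduce(f, [x, y, z]) computes f(f(x, y), z), so the lambda above chains unionAll pairwise over the list of DataFrames. A minimal sketch of that folding pattern with plain Python lists standing in for DataFrames (no Spark needed, and list concatenation standing in for unionAll):

```python
from functools import reduce

# Each "file" is a list of rows (header already stripped),
# standing in for one loaded DataFrame.
file_1 = [("a", 1), ("b", 2)]
file_2 = [("c", 3)]
file_3 = [("d", 4)]

# reduce folds the list pairwise: ((file_1 + file_2) + file_3),
# exactly how the lambda above folds unionAll over the DataFrames.
merged = reduce(lambda x, y: x + y, [file_1, file_2, file_3])
# merged now contains all rows from all three "files" in order
```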

New Member

@Bernhard Walter

Thanks a lot! But can you do it in Scala, please? It would be so kind of you. Thanks.


New Member

You can do that by passing a list of CSV file paths to load (note it is .read.format(...), not .read.load(...)):

df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"])
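If Spark isn't strictly required, the same merge (write the shared header once, then all data rows from every file) can also be done with Python's standard csv module. A minimal sketch, assuming the files fit on local disk and all share an identical first-line header; the merge_csvs helper name is mine, not from any library:

```python
import csv
import io

def merge_csvs(readables, out):
    """Write the header from the first file once, then the data rows of all files."""
    writer = csv.writer(out)
    header_written = False
    for readable in readables:
        reader = csv.reader(readable)
        header = next(reader)        # first line of every file is the header
        if not header_written:
            writer.writerow(header)  # emit the shared header exactly once
            header_written = True
        writer.writerows(reader)     # remaining lines are data rows

# Usage with in-memory files; swap in open(path, newline="") for real files.
f1 = io.StringIO("id,name\n1,a\n2,b\n")
f2 = io.StringIO("id,name\n3,c\n")
out = io.StringIO()
merge_csvs([f1, f2], out)
```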