question Re: Combine csv files with one header in a csv file in Archives of Support Questions (Read Only)

Combine csv files with one header in a csv file

zoro07500 — Fri, 24 Feb 2017 16:51:00 GMT

Hi friends I have csv files in local file system , they all have the same header i want to get one csv file with this header , is there a solution using spark-csv or any thing else nwant to loop and merge them any solution please and get a final csv file , using spark

Thanks

Re: Combine csv files with one header in a csv file

bwalter1 — Fri, 24 Feb 2017 17:30:56 GMT

Assumption: all files have the same columns and in each file the first line is the header

This is a solution in PySpark

I load every file via "com.databricks.spark.csv" class respecting header and inferring schema

Then I use python reduce to union them all

from functools import reduce
files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]
df = reduce(lambda x,y: x.unionAll(y), 
            [sqlContext.read.format('com.databricks.spark.csv')
                       .load(f, header="true", inferSchema="true") 
             for f in files])
df.show()

Re: Combine csv files with one header in a csv file

zoro07500 — Fri, 24 Feb 2017 17:35:19 GMT

@Bernhard Walter

Thanks a lot but can you do it in scala language please it is so kind of you thanks

Re: Combine csv files with one header in a csv file

dzaratsian — Sat, 25 Feb 2017 01:35:22 GMT

I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:

import org.apache.spark.sql.SQLContext 

val sqlContext = new SQLContext(sc) 

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv") 

df.show()

Re: Combine csv files with one header in a csv file

prakash1 — Thu, 08 Nov 2018 17:05:03 GMT

You can do that by passing a list of csv files in csv

df = sqlContext.read.load("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load(["/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv"])