Created 02-24-2017 08:51 AM
Hi friends, I have CSV files in the local file system, and they all have the same header. I want to loop over them and merge them into one final CSV file with that header, using Spark. Is there a solution using spark-csv or anything else?
Thanks
Created 02-24-2017 09:30 AM
Assumption: all files have the same columns, and in each file the first line is the header.
This is a solution in PySpark.
I load every file via the "com.databricks.spark.csv" format, respecting the header and inferring the schema.
Then I use Python's reduce to union them all.
from functools import reduce

files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]

# Read each file into its own DataFrame, then union them pairwise
df = reduce(lambda x, y: x.unionAll(y),
            [sqlContext.read.format('com.databricks.spark.csv')
                       .load(f, header="true", inferSchema="true")
             for f in files])
df.show()
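To actually end up with one final CSV file on disk (as the question asks), here is a minimal follow-up sketch, assuming the merged data is small enough to coalesce into a single partition; the output path /tmp/merged_csv is just a hypothetical example:

# Coalesce to a single partition so only one part file is written,
# then save with the header using the spark-csv writer.
# Note: Spark writes a directory (/tmp/merged_csv) containing one part-* file.
df.coalesce(1) \
  .write \
  .format('com.databricks.spark.csv') \
  .option('header', 'true') \
  .save('/tmp/merged_csv')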
Created 02-24-2017 09:35 AM
Thanks a lot! But could you do it in Scala, please? It would be so kind of you. Thanks.
Created 02-24-2017 05:35 PM
I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read all three files at once; load() accepts multiple paths
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv")
df.show()
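Side note: if the files share a naming pattern, passing a glob such as "/tmp/test_*.csv" to load() should also pick them all up, since the paths go through Hadoop's file system globbing; worth verifying on your setup.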
Created 11-08-2018 09:05 AM
You can also do this in PySpark by passing a list of CSV file paths to load():
# The format name goes in format(); load() takes the list of paths
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]))