Created 02-24-2017 08:51 AM
Hi friends, I have CSV files in the local file system, and they all have the same header. I want to merge them into one CSV file with that header. Is there a solution using spark-csv, or anything else in Spark, to loop over the files and merge them into a final CSV file?
Thanks
Created 02-24-2017 09:30 AM
Assumption: all files have the same columns, and in each file the first line is the header.
This is a solution in PySpark.
I load every file via the "com.databricks.spark.csv" format, respecting the header and inferring the schema, then use Python's reduce to union them all:
from functools import reduce

files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]

# Read each file with its header and an inferred schema, then union them all
df = reduce(
    lambda x, y: x.unionAll(y),
    [sqlContext.read.format("com.databricks.spark.csv")
               .load(f, header="true", inferSchema="true")
     for f in files]
)
df.show()
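To get from the merged DataFrame to the single CSV file you asked about, you can coalesce to one partition before writing. A minimal sketch, assuming the df from above and a hypothetical output directory /tmp/merged (Spark writes a part-*.csv file inside that directory, not a bare file):

# Collapse to a single partition so Spark emits one part file,
# then write it out with the shared header.
# /tmp/merged is a hypothetical output directory.
(df.coalesce(1)
   .write
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/tmp/merged"))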
Created 02-24-2017 09:35 AM
Thanks a lot! Could you do it in Scala, please? That would be very kind of you. Thanks.
Created 02-24-2017 05:35 PM
I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// load() accepts multiple paths, so all three files are read in one pass
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv")

df.show()
Created 11-08-2018 09:05 AM
You can do that by passing a list of CSV files to load(). Note that the "com.databricks.spark.csv" class goes in format(), not load():

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]))
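On Spark 2.x and later the CSV reader is built in, so the external package is not needed. A minimal sketch, assuming the spark SparkSession that the shell provides:

# Spark 2.x+: the built-in csv reader also accepts a list of paths
df = spark.read.csv(
    ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"],
    header=True,
    inferSchema=True,
)
df.show()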