Support Questions

zoro07500 · ‎02-24-2017

Hi friends, I have several CSV files on the local file system, and they all have the same header. I want to combine them into one final CSV file that contains this header only once. Is there a solution using spark-csv or anything else? I would like to loop over the files and merge them using Spark. Any solution, please?

Thanks


bwalter1 · ‎02-24-2017

Assumption: all files have the same columns, and the first line of each file is the header.

This is a solution in PySpark.

I load every file with the "com.databricks.spark.csv" format, treating the first line as the header and inferring the schema.

Then I use Python's reduce to union them all into one DataFrame.

from functools import reduce
files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]
df = reduce(lambda x,y: x.unionAll(y), 
            [sqlContext.read.format('com.databricks.spark.csv')
                       .load(f, header="true", inferSchema="true") 
             for f in files])
df.show()
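
If the files are small enough to handle on a single machine, the same assumption (every file starts with the identical header line) also allows a merge without Spark, which produces the single output file directly. The following is only a minimal sketch using Python's standard csv module; the function name merge_csv_files and the demo data are hypothetical and not part of the original answer.

```python
import csv
import os
import tempfile


def merge_csv_files(paths, out_path):
    """Concatenate CSV files that share one header into out_path.

    The header is written once, taken from the first non-empty file.
    Any later file whose first line differs raises ValueError.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in " + path)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demo with throwaway files (hypothetical data, not from the thread).
tmp_dir = tempfile.mkdtemp()
inputs = []
for i, body in enumerate(["id,name\n1,a\n2,b\n", "id,name\n3,c\n"]):
    p = os.path.join(tmp_dir, "test_%d.csv" % (i + 1))
    with open(p, "w", newline="") as f:
        f.write(body)
    inputs.append(p)

merged_path = os.path.join(tmp_dir, "merged.csv")
row_count = merge_csv_files(inputs, merged_path)
```

Rows are streamed one at a time, so memory use stays flat regardless of file size. For data that does not fit comfortably on one machine, the Spark approach above is the better fit.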

zoro07500 · ‎02-24-2017

@Bernhard Walter

Thanks a lot, but could you do it in Scala, please? That would be very kind of you, thanks.

dzaratsian · ‎02-24-2017

I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:

import org.apache.spark.sql.SQLContext 

val sqlContext = new SQLContext(sc) 

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv") 

df.show()

prakash1 · ‎11-08-2018

You can do that in PySpark by passing a list of CSV file paths to load:

df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load(["/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv"])
