
How to read the schema of a CSV file and split the data into multiple files according to column values, using Scala


I have a CSV file with the following schema:

test.csv

name,age,state

swathi,23,us

srivani,24,UK

ram,25,London

sravan,30,UK

We need to split it into different files according to state; for example, the US rows should be loaded into their own file (with the schema/header included).

output

/user/data/US.txt

name,age,state

swathi,23,us

/user/data/UK

name,age,state

srivani,24,UK

sravan,30,UK

/user/data/London

name,age,state

ram,25,London

13 REPLIES

@swathi thukkaraju

You can do it without using the CSV package. Use the following code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = new StructType()
  .add(StructField("name", StringType, true))
  .add(StructField("age", IntegerType, true))
  .add(StructField("state", StringType, true))

val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).toInt, x(2)))
val df = sqlContext.createDataFrame(rowsRDD, schema)

After this, run

df.show

and you will be able to see your data in a tabular, relational format.

Now you can run whatever queries you want on your DataFrame, for example filtering based on state and saving the results to HDFS.
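To illustrate the split-by-state idea, here is a plain-Scala sketch over the sample rows (the object and method names are mine, for illustration only). With the DataFrame above, the same split would be one filter per distinct state, e.g. df.filter(df("state") === "UK"):

```scala
object SplitByState {
  val header = "name,age,state"

  val rows = Seq(
    "swathi,23,us",
    "srivani,24,UK",
    "ram,25,London",
    "sravan,30,UK"
  )

  // Group the data lines by the value of the third (state) column.
  def split(lines: Seq[String]): Map[String, Seq[String]] =
    lines.groupBy(_.split(",")(2))

  def main(args: Array[String]): Unit = {
    split(rows).foreach { case (state, lines) =>
      // Each group would be written to /user/data/<state>, header first.
      println((header +: lines).mkString("\n"))
    }
  }
}
```

Each key of the resulting map ("us", "UK", "London") corresponds to one output file in the question.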

PS - If you want to persist your DataFrame as a CSV file, Spark 1.6 does NOT support it out of the box; you either need to convert it to an RDD and then save, or use the CSV package from Databricks.
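For the "convert to RDD, then save" route, each row is joined back into a comma-separated line before writing. In Spark this is roughly df.rdd.map(_.mkString(",")).saveAsTextFile("/user/data/out"); the transformation itself is sketched here on plain tuples (names are mine, for illustration):

```scala
object RowsToCsv {
  val rows = Seq(("swathi", 23, "us"), ("srivani", 24, "UK"))

  // Join each row's fields with commas, mirroring Row.mkString(",").
  def toCsvLines(rs: Seq[(String, Int, String)]): Seq[String] =
    rs.map { case (name, age, state) => s"$name,$age,$state" }
}
```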

Let me know if that helps!


I did, but my data has null values, and while loading the data into an RDD this throws an ArrayIndexOutOfBoundsException: 88.

The file has 142 fields, with some null values inside. How can I handle that?


Have you seen the filter condition in my answer above?

val rdd = data.filter(row => row != header)

Now use a similar filter condition to filter out your null records, if there are any, according to your use case.
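One way to guard against the ArrayIndexOutOfBoundsException is to keep only the lines that split into the expected number of fields and have no empty values, before building Rows. A minimal sketch (object and field count are illustrative; your file would use 142):

```scala
object FilterMalformed {
  val expectedFields = 3

  // Keep only well-formed lines: correct field count, no empty fields.
  // split(",", -1) keeps trailing empty fields so the count is honest.
  def clean(lines: Seq[String]): Seq[String] =
    lines.filter { line =>
      val fields = line.split(",", -1)
      fields.length == expectedFields && fields.forall(_.nonEmpty)
    }
}
```

The same predicate can be dropped into the existing pipeline as rdd.filter(...) before the map that calls x(1).toInt.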

@swathi thukkaraju

Did the answer help in the resolution of your query? Please close the thread by marking the answer as Accepted!