Created 03-27-2018 08:11 AM
I have a CSV file, for example, with the following schema:
test.csv
name,age,state
swathi,23,us
srivani,24,UK
ram,25,London
sravan,30,UK
We need to split it into different files according to state, so that each state's data is written to its own file (keeping the header).
Expected output:
/user/data/US.txt
name,age,state
swathi,23,us
/user/data/UK
name,age,state
srivani,24,UK
sravan,30,UK
/user/data/London
name,age,state
ram,25,London
Created 03-28-2018 03:26 PM
You can do it without using the CSV package. Use the following code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Schema for the three columns
val schema = new StructType()
  .add(StructField("name", StringType, true))
  .add(StructField("age", IntegerType, true))
  .add(StructField("state", StringType, true))

// Read the raw file and strip the header line
val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)

// Split each line into fields, build Rows, and create the DataFrame
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).toInt, x(2)))
val df = sqlContext.createDataFrame(rowsRDD, schema)
After this, do
df.show
and you will be able to see your data in a relational format.
Now you can run whatever queries you want on your DataFrame, for example filtering by state and saving the results to HDFS, as sketched below.
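For instance, a minimal sketch of the state filter, assuming the df built above (note that column values are matched exactly, so mixed-case values like "us" vs "US" would need normalizing first):

// Collect the distinct states, then filter one DataFrame per state
val states = df.select("state").distinct.collect.map(_.getString(0))
val ukDF = df.filter(df("state") === "UK")
ukDF.show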
PS - If you want to persist your DataFrame as a CSV file, Spark 1.6 DOES NOT support it out of the box; you either need to convert it to an RDD and then save, or use the CSV package from Databricks.
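Here is a minimal sketch of the RDD route, assuming the df, header, and states values from the snippets above (the /user/data path is just an example; saveAsTextFile writes a directory of part files, matching the /user/data/UK layout in the question):

// Convert each Row back to a comma-separated line, re-attach the header,
// and write one output directory per state
states.foreach { s =>
  val lines = df.filter(df("state") === s).rdd.map(_.mkString(","))
  (sc.parallelize(Seq(header)) ++ lines).saveAsTextFile(s"/user/data/$s")
}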
Let me know if that helps!
Created 03-29-2018 02:28 AM
I did, but my data has null values, and while loading the data into an RDD this throws ArrayIndexOutOfBoundsException: 88.
The file has 142 fields, some with null values; how can I handle that?
Created 03-29-2018 05:41 AM
Have you seen the filter condition in my answer above?
val rdd = data.filter(row => row != header)
Now use a similar filter condition to filter out your null or malformed records, if there are any, according to your use case; see the sketch below.
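One common cause of that exception with comma-separated data: String.split(",") silently drops trailing empty fields, so a line that ends in nulls/blanks produces fewer than 142 tokens and indexing past the end fails. A minimal sketch (the field count of 142 and the handling are assumptions; adjust to your file):

// A negative limit keeps trailing empty strings, then guard on the length
val parsed = rdd.map(_.split(",", -1))
val clean = parsed.filter(_.length == 142)   // rows with all 142 fields
val bad = parsed.filter(_.length != 142)     // malformed rows, inspect separately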
Created 04-01-2018 04:09 PM
Did the answer help resolve your query? Please close the thread by marking the answer as Accepted!