Created 03-27-2018 08:11 AM
I have a CSV file, for example, with the following schema:
test.csv
name,age,state
swathi,23,us
srivani,24,UK
ram,25,London
sravan,30,UK
We need to split it into different files according to state, so that each state's data is written to its own file (keeping the header).
Expected output:
/user/data/US.txt
name,age,state
swathi,23,us
/user/data/UK
name,age,state
srivani,24,UK
sravan,30,UK
/user/data/London
name,age,state
ram,25,London
Created 03-28-2018 03:26 PM
You can do it without using the CSV package. Use the following code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Schema for the three columns
val schema = new StructType()
  .add(StructField("name", StringType, true))
  .add(StructField("age", IntegerType, true))
  .add(StructField("state", StringType, true))

// Read the raw file and strip the header line
val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)

// Split each line into fields, build Rows, and create the DataFrame
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).toInt, x(2)))
val df = sqlContext.createDataFrame(rowsRDD, schema)
After this, do
df.show
and you will be able to see your data in a relational format.
Now you can run whatever queries you want on your DataFrame, for example filtering by state and saving the results to HDFS, as sketched below.
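For instance, a minimal sketch of the state filter, assuming the df built above (note that column values are matched exactly, so mixed-case values like "us" vs "US" would need normalizing first):

// Collect the distinct states, then filter one DataFrame per state
val states = df.select("state").distinct.collect.map(_.getString(0))
val ukDF = df.filter(df("state") === "UK")
ukDF.show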
PS - If you want to persist your DataFrame as a CSV file, Spark 1.6 DOES NOT support it out of the box; you either need to convert it to an RDD and then save, or use the CSV package from Databricks.
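Here is a minimal sketch of the RDD route, assuming the df, header, and states values from the snippets above (the /user/data path is just an example; saveAsTextFile writes a directory of part files, matching the /user/data/UK layout in the question):

// Convert each Row back to a comma-separated line, re-attach the header,
// and write one output directory per state
states.foreach { s =>
  val lines = df.filter(df("state") === s).rdd.map(_.mkString(","))
  (sc.parallelize(Seq(header)) ++ lines).saveAsTextFile(s"/user/data/$s")
}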
Let me know if that helps!
Created 03-29-2018 02:28 AM
I did, but my data has null values, and while loading the data into an RDD this throws ArrayIndexOutOfBoundsException: 88.
The file has 142 fields, some with null values; how can I handle that?
Created 03-29-2018 05:41 AM
Have you seen the filter condition in my answer above?
val rdd = data.filter(row => row != header)
Now use a similar filter condition to filter out your null or malformed records, if there are any, according to your use case; see the sketch below.
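One common cause of that exception with comma-separated data: String.split(",") silently drops trailing empty fields, so a line that ends in nulls/blanks produces fewer than 142 tokens and indexing past the end fails. A minimal sketch (the field count of 142 and the handling are assumptions; adjust to your file):

// A negative limit keeps trailing empty strings, then guard on the length
val parsed = rdd.map(_.split(",", -1))
val clean = parsed.filter(_.length == 142)   // rows with all 142 fields
val bad = parsed.filter(_.length != 142)     // malformed rows, inspect separately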
Created 04-01-2018 04:09 PM
Did the answer help resolve your query? Please close the thread by marking the answer as Accepted!