Explorer
Posts: 29
Registered: ‎01-20-2017
Accepted Solution

How to define datatype when creating dataframe using sql.types

I am trying to convert a text file to a DataFrame, and I found the following method instead of using a case class. But where is the data type for each field defined if we go by this method?

 

val people = sc.textFile("file:/home/edureka/dmishra/people.txt")
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val schema =
          StructType(
          schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("select name,age from people")
val r = results.map(t => "Name: " + t(0) + ", Age: " + t(1)).collect().foreach(println)

scala> results.dtypes.foreach(println)
(name,StringType)
(age,StringType)

 

Where is the data type assigned for the DataFrame? How do I define age as an integer data type in this case, or if there is a date field, where do I define it?

Thanks

Cloudera Employee
Posts: 30
Registered: ‎04-05-2016

Re: How to define datatype when creating dataframe using sql.types


It is the line below that is setting the data type of both fields to StringType:

 

val schema =
          StructType(
          schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

You can define your custom schema as follows : 

 

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)))

You can add additional fields as well in the above schema definition.
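One caveat worth noting: the values you put into each Row must already match the schema's types, so with IntegerType for age the rowRDD from the original post needs a .toInt conversion. A minimal sketch, assuming the same comma-separated people.txt and the `people` RDD from the question:

```scala
import org.apache.spark.sql.Row

// With IntegerType in the schema, the age value in each Row must be an Int,
// not a String -- the original Row(p(0), p(1).trim) would leave it a String
// and fail when the DataFrame is evaluated:
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
```
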

 

And then you can use this customSchema while creating the dataframe as follows: 

 

val peopleDataFrame = sqlContext.createDataFrame(rowRDD, customSchema)
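A date field follows the same pattern with DateType. This is only a sketch: the third column "birthday" is hypothetical, and it assumes the file carries dates as yyyy-MM-dd strings, which is the format java.sql.Date.valueOf parses:

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
import org.apache.spark.sql.Row

// Hypothetical schema with an added "birthday" date column
val schemaWithDate = StructType(Array(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true),
    StructField("birthday", DateType, true)))

// DateType expects java.sql.Date values in each Row, e.g.:
// people.map(_.split(",")).map(p =>
//   Row(p(0), p(1).trim.toInt, java.sql.Date.valueOf(p(2).trim)))
```
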

 

Also, for more details, please see this page.
