Created on 04-14-2017 06:33 PM - edited 09-16-2022 04:27 AM
I am trying to convert a text file to a DataFrame. I found the following method, which builds the schema programmatically instead of using a case class.
But where is the data type of each field defined if we use this method?
val people = sc.textFile("file:/home/edureka/dmishra/people.txt")
val schemaString = "name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("select name,age from people")
val r = results.map(t => "Name: " + t(0) + "Age : " + t(1)).collect().foreach(println)
scala> results.dtypes.foreach(println)
(name,StringType)
(age,StringType)
Where is the data type assigned for the DataFrame? How can I define age as an integer in this case, and if there is a date field, where do I define its type?
Thanks
Created on 04-14-2017 08:31 PM - edited 04-14-2017 08:39 PM
It is the line below that sets the data type of both fields to StringType:
val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
You can define a custom schema as follows (IntegerType must also be imported from org.apache.spark.sql.types):
val customSchema = StructType(Array( StructField("name", StringType, true), StructField("age", IntegerType, true)))
You can add more fields to the schema definition above in the same way.
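Then use customSchema when creating the DataFrame. The values in each Row must match the declared types, so age has to be converted to an Int first; otherwise the job fails at runtime when the rows are read:
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, customSchema)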
Also for details, please see this page.
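To also cover the date field asked about in the question, here is a minimal sketch of the same approach with an integer and a date column. It assumes a Spark 1.x spark-shell session where sc and sqlContext already exist. The sample lines, the birthdate column, and the names lines, typedSchema, typedRows and typedDF are made up for illustration and are not part of the original people.txt.

```scala
import java.sql.Date
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}

// Assumption: running inside spark-shell, so `sc` and `sqlContext` are predefined.
// Hypothetical sample data in "name,age,birthdate" form (birthdate as yyyy-mm-dd).
val lines = sc.parallelize(Seq("Michael, 29, 1988-01-15", "Andy, 30, 1987-06-02"))

// One StructField per column, each with its own data type.
val typedSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("birthdate", DateType, true)))

// Row values must match the schema: Int for IntegerType, java.sql.Date for DateType.
val typedRows = lines.map(_.split(",")).map(p =>
  Row(p(0).trim, p(1).trim.toInt, Date.valueOf(p(2).trim)))

val typedDF = sqlContext.createDataFrame(typedRows, typedSchema)

// Print the (column, type) pairs to check the schema that was applied.
typedDF.dtypes.foreach(println)
```

Note that toInt and Date.valueOf throw an exception on malformed input, so real data may need validation or filtering before the rows are built.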