
TimestampType format for Spark DataFrames

Contributor

I'm loading a DataFrame with a timestamp column, and I want to extract the month and year from the values in that column.

When I specify a field as TimestampType in the schema, I found that only text in the form "yyyy-MM-dd HH:mm:ss" works without giving an error. Is there a way to specify the format when reading in a CSV file, e.g. "MM/dd/yyyy HH:mm:ss"?

If not, and we have to declare the field as StringType, is there a way to convert the format my time is in to the JDBC timestamp format? Would this be inefficient compared to just substringing the timestamp as a StringType?

Thank you!
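A note for anyone on a newer release: since Spark 2.x the built-in CSV reader takes a timestampFormat option, so a custom pattern can be applied while loading. A minimal sketch, assuming a SparkSession named spark, a schema mySchema that declares the column as TimestampType, and a placeholder file path:

val df = spark.read
  .schema(mySchema)                                    // column declared as TimestampType
  .option("header", "true")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")    // custom input pattern
  .csv("/path/to/file.csv")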

1 ACCEPTED SOLUTION

Master Guru

@jestin ma, I found a similar solution here.

"You can use date processing functions which have been introduced in Spark 1.5. Assuming you have following data:

val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")

You can use unix_timestamp to parse the strings and cast the result to timestamp:

import org.apache.spark.sql.functions.unix_timestamp

val ts = unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss").cast("timestamp")

df.withColumn("ts", ts).show(2,false)// +---+-------------------+---------------------+// |id |dts                |ts                   |// +---+-------------------+---------------------+// |1  |05/26/2016 01:01:01|2016-05-26 01:01:01.0|// |2  |#$@#@#             |null                 |// +---+-------------------+---------------------+

As you can see, it covers both parsing and error handling.

In Spark < 1.6 you'll have to use something like this:

unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")

or

(unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss")*1000).cast("timestamp")

due to SPARK-11724.

In Spark < 1.5 you should be able to use these with expr and HiveContext."
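To finish answering the original question, once the column is a real timestamp the month and year can be pulled out with the built-in month and year functions (available since Spark 1.5). A minimal sketch, reusing the df and ts definitions from the answer above:

import org.apache.spark.sql.functions.{month, year}

df.withColumn("ts", ts)
  .select($"id", year($"ts").as("year"), month($"ts").as("month"))
  .show()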


4 REPLIES


Contributor

I didn't even see this on Stack Overflow! Thank you.


Please note that there are also convenience functions provided in pyspark.sql.functions, such as dayofmonth:

pyspark.sql.functions.dayofmonth(col)

Extract the day of the month of a given date as integer.

Example:

>>> from pyspark.sql.functions import dayofmonth
>>> df = sqlContext.createDataFrame([('2015-04-08',)], ['a'])
>>> df.select(dayofmonth('a').alias('day')).collect()
[Row(day=8)]
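The same helper exists in the Scala API under org.apache.spark.sql.functions; a minimal sketch, assuming a SparkSession named spark:

import org.apache.spark.sql.functions.dayofmonth
import spark.implicits._   // for toDF and the $ column syntax

val df = Seq("2015-04-08").toDF("a")
df.select(dayofmonth($"a".cast("date")).as("day")).show()   // day = 8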

Contributor

Hello people!

Could anyone help me out?

My dataset contains a timestamp field and I need to extract the year, the month, the day and the hour from it.

I typed these lines:

training.createOrReplaceTempView("df")
spark.udf.register("getCurrentHour", getCurrentHour _)
val hour = spark.sql("select getCurrentHour(payload_MeterReading_IntervalBlock_IReading_endTime) as hour from df")

spark.udf.register("assignToDay", assignToDay _)
timestamps_df.createOrReplaceTempView("timestamps")
val tod = spark.sql("select assignToDay(hour) as tod from timestamps")

The problem is I'm not good at Scala, so I couldn't figure out the best solution.

The two functions I use to extract the hour and assign it to a part of the day:

def assignToDay(hr: Integer): String = {
  if (hr >= 7 && hr < 12) {
    "morning"
  } else if (hr >= 12 && hr < 14) {
    "lunch"
  } else if (hr >= 14 && hr < 18) {
    "afternoon"
  } else if (hr >= 18 && hr < 23) {
    "evening"
  } else if (hr >= 23 && hr <= 24) {
    "night"
  } else if (hr < 7) {
    "night"
  } else {
    "error"
  }
}
def getCurrentHour(dateStr: String): Integer = {
  // assumes the field holds epoch milliseconds as a string
  try {
    val date = new java.util.Date(dateStr.toLong)
    date.getHours   // deprecated API, but returns the hour of day (0-23)
  } catch {
    case _: Throwable => 0   // fall back to 0 if the value cannot be parsed
  }
}
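For what it's worth, a minimal sketch of the same extraction done with built-in date functions instead of UDFs, assuming the endTime column holds epoch milliseconds as a string (which is what getCurrentHour above implies) and reusing the training DataFrame from the post:

import org.apache.spark.sql.functions.{col, year, month, dayofmonth, hour, when}

// turn the epoch-millisecond string into a proper timestamp column
val withTs = training.withColumn("ts",
  (col("payload_MeterReading_IntervalBlock_IReading_endTime").cast("long") / 1000).cast("timestamp"))

// extract the parts and assign the time-of-day bucket without any UDFs
val withParts = withTs
  .withColumn("year", year(col("ts")))
  .withColumn("month", month(col("ts")))
  .withColumn("day", dayofmonth(col("ts")))
  .withColumn("hour", hour(col("ts")))
  .withColumn("tod",
    when(col("hour") >= 7 && col("hour") < 12, "morning")
      .when(col("hour") >= 12 && col("hour") < 14, "lunch")
      .when(col("hour") >= 14 && col("hour") < 18, "afternoon")
      .when(col("hour") >= 18 && col("hour") < 23, "evening")
      .otherwise("night"))   // everything else (23:00-06:59), matching assignToDay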