TimestampType format for Spark DataFrames

Explorer

I'm loading a DataFrame with a timestamp column, and I want to extract the month and year from the values in that column.

When specifying a field as TimestampType in the schema, I found that only text in the form "yyyy-MM-dd HH:mm:ss" works without raising an error. Is there a way to specify the format when reading in a CSV file, such as "MM/dd/yyyy HH:mm:ss"?

If not, and we have to read the field as StringType, is there a way to convert my time format to the JDBC timestamp format? Would that be less efficient than just substringing the timestamp as a StringType?

Thank you!

1 ACCEPTED SOLUTION

Super Guru

@jestin ma, I found a similar solution here.

"You can use date processing functions which have been introduced in Spark 1.5. Assuming you have following data:

val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")

You can use unix_timestamp to parse the strings and cast the result to timestamp:

import org.apache.spark.sql.functions.unix_timestamp

val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")

df.withColumn("ts", ts).show(2,false)// +---+-------------------+---------------------+// |id |dts                |ts                   |// +---+-------------------+---------------------+// |1  |05/26/2016 01:01:01|2016-05-26 01:01:01.0|// |2  |#$@#@#             |null                 |// +---+-------------------+---------------------+

As you can see, it covers both parsing and error handling: malformed input becomes null.

In Spark < 1.6 you'll have to use something like this:

unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")

or

(unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss")*1000).cast("timestamp")

due to SPARK-11724.

In Spark < 1.5 you should be able to use these with expr and HiveContext."
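For reference, a minimal sketch of what that might look like on Spark < 1.5, assuming df was created through a HiveContext so the Hive unix_timestamp UDF is available (the selectExpr call below is illustrative, not part of the original answer):

// Sketch only: on Spark < 1.5, unix_timestamp is available as a Hive UDF,
// so the same parse-and-cast can be written as a SQL expression.
val withTs = df.selectExpr(
  "id",
  "dts",
  "cast(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss') as timestamp) as ts")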


4 REPLIES

Explorer

I didn't even see this on Stack Overflow! Thank you.

Please note that there are also convenience functions provided in pyspark.sql.functions, such as dayofmonth:

pyspark.sql.functions.dayofmonth(col)

Extract the day of the month of a given date as integer.

Example:

>>> df = sqlContext.createDataFrame([('2015-04-08',)], ['a'])
>>> df.select(dayofmonth('a').alias('day')).collect()
[Row(day=8)]
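Since the original question asked for the month and year, the matching year and month functions work the same way. A quick Scala sketch, reusing the df and ts values from the accepted solution above:

import org.apache.spark.sql.functions.{month, year}

// Extract calendar fields from the parsed timestamp column.
df.withColumn("ts", ts)
  .select(year($"ts").as("year"), month($"ts").as("month"))
  .show()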

Hello people!

Could anyone help me out?

My dataset contains a timestamp field, and I need to extract the year, the month, the day, and the hour from it.

I typed these lines:

training.createOrReplaceTempView("df")
spark.udf.register("getCurrentHour", getCurrentHour _)
val hour = spark.sql("select getCurrentHour(payload_MeterReading_IntervalBlock_IReading_endTime) as hour from df")

spark.udf.register("assignTod", assignToDay _)
timestamps_df.createOrReplaceTempView("timestamps")
val tod = spark.sql("select assignTod(hour) as tod from timestamps")

The problem is that I'm not good at Scala, so I couldn't figure out the best solution.

These are the two functions I used to extract the hour and assign it to a part of the day:

def assignToDay(hr: Integer): String = {
  if (hr >= 7 && hr < 12) {
    "morning"
  } else if (hr >= 12 && hr < 14) {
    "lunch"
  } else if (hr >= 14 && hr < 18) {
    "afternoon"
  } else if (hr >= 18 && hr < 23) {
    "evening"
  } else if ((hr >= 23 && hr <= 24) || hr < 7) {
    "night"
  } else {
    "error"
  }
}
import java.util.Date

def getCurrentHour(dateStr: String): Integer = {
  // Expects the timestamp as epoch milliseconds in string form;
  // falls back to 0 if the value cannot be parsed.
  try {
    val date = new Date(dateStr.toLong)
    int2Integer(date.getHours) // deprecated API, but matches the original intent
  } catch {
    case _: Throwable => 0
  }
}
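For what it's worth, this can also be done without UDFs. Below is a minimal sketch using the built-in hour and when functions, assuming (as getCurrentHour implies) that the endTime column holds epoch milliseconds as a string; the DataFrame and column names are taken from the post above:

import org.apache.spark.sql.functions.{col, hour, when}

// Sketch only: cast("timestamp") expects seconds, hence the division by 1000.
val endTime = col("payload_MeterReading_IntervalBlock_IReading_endTime")
val withHour = training.withColumn(
  "hour", hour((endTime.cast("long") / 1000).cast("timestamp")))

// Bin the hour into a time-of-day label with when/otherwise instead of a UDF.
val withTod = withHour.withColumn("tod",
  when(col("hour") >= 7 && col("hour") < 12, "morning")
    .when(col("hour") >= 12 && col("hour") < 14, "lunch")
    .when(col("hour") >= 14 && col("hour") < 18, "afternoon")
    .when(col("hour") >= 18 && col("hour") < 23, "evening")
    .otherwise("night"))

The year, month, and dayofmonth functions cover the remaining fields the same way.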