
TimestampType format for Spark DataFrames


Explorer

I'm loading in a DataFrame with a timestamp column and I want to extract the month and year from values in that column.

When specifying a field in the schema as TimestampType, I found that only text in the form "yyyy-MM-dd HH:mm:ss" works without raising an error. Is there a way to specify the format when reading in a CSV file, e.g. "MM/dd/yyyy HH:mm:ss"?

If not, and we have to specify the field as StringType, is there a way to convert the format my time is in to JDBC format? Would this be inefficient compared to just substringing the timestamp as a StringType?

Thank you!

1 ACCEPTED SOLUTION

Re: TimestampType format for Spark DataFrames

Super Guru

@jestin ma found a similar solution here.

"You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:

val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")

You can use unix_timestamp to parse the strings and cast the result to timestamp:

import org.apache.spark.sql.functions.unix_timestamp

val ts = unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss").cast("timestamp")

df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts                |ts                   |
// +---+-------------------+---------------------+
// |1  |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2  |#$@#@#             |null                 |
// +---+-------------------+---------------------+

As you can see it covers both parsing and error handling.

In Spark < 1.6 you'll have to use something like this:

unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")

or

(unix_timestamp($"dts","MM/dd/yyyy HH:mm:ss")*1000).cast("timestamp")

due to SPARK-11724.

In Spark < 1.5 you should be able to use these with expr and HiveContext."
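For reference, the parse-or-null behavior that unix_timestamp provides can be sketched outside Spark in plain Python (a sketch only; the pattern mirrors the "MM/dd/yyyy HH:mm:ss" format from the example above, with malformed rows coming back as None instead of raising):

```python
from datetime import datetime

def parse_ts(raw, fmt="%m/%d/%Y %H:%M:%S"):
    """Parse a timestamp string, returning None on malformed input,
    mirroring how unix_timestamp yields null instead of failing."""
    try:
        return datetime.strptime(raw, fmt)
    except ValueError:
        return None

rows = [(1, "05/26/2016 01:01:01"), (2, "#$@#@#")]
parsed = [(i, parse_ts(s)) for i, s in rows]
# row 1 parses cleanly; row 2 becomes None
```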


4 REPLIES 4


Re: TimestampType format for Spark DataFrames

Explorer

I didn't even see this on Stack Overflow! Thank you.


Re: TimestampType format for Spark DataFrames

Note that there are also convenience functions provided in pyspark.sql.functions, such as dayofmonth:

pyspark.sql.functions.dayofmonth(col)

Extract the day of the month of a given date as integer.

Example:

>>> df = sqlContext.createDataFrame([('2015-04-08',)], ['a'])
>>> df.select(dayofmonth('a').alias('day')).collect()
[Row(day=8)]
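If the values are already parsed outside Spark, the same extraction can be done with plain-Python datetime (a sketch, assuming ISO-formatted input like the example above; year and month come out the same way, which is what the original question asked for):

```python
from datetime import datetime

def date_parts(date_str):
    """Extract (year, month, day) from an ISO date string,
    analogous to pyspark's year/month/dayofmonth functions."""
    d = datetime.strptime(date_str, "%Y-%m-%d")
    return d.year, d.month, d.day

date_parts("2015-04-08")  # (2015, 4, 8)
```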

Re: TimestampType format for Spark DataFrames

Hello people!

Could anyone help me out?

My dataset contains a timestamp field and I need to extract the year, the month, the day, and the hour from it.

I typed these lines:

training.createOrReplaceTempView("df")
spark.udf.register("getCurrentHour", getCurrentHour _)
val hour = spark.sql("select getCurrentHour(payload_MeterReading_IntervalBlock_IReading_endTime) as hour from df")

spark.udf.register("assignTod", assignToDay _)
timestamps_df.createOrReplaceTempView("timestamps")
val tod = spark.sql("select assignTod(hour) as tod from timestamps")

The problem is I'm not good at Scala, so I couldn't figure out the best solution!

The two functions I used to extract the hour and assign it to a part of the day:

def assignToDay(hr: Int): String = {
  if (hr >= 7 && hr < 12) "morning"
  else if (hr >= 12 && hr < 14) "lunch"
  else if (hr >= 14 && hr < 18) "afternoon"
  else if (hr >= 18 && hr < 23) "evening"
  else if (hr >= 23 && hr <= 24) "night"
  else if (hr < 7) "night"
  else "error"
}

def getCurrentHour(dateStr: String): Int = {
  // Assumes the field holds epoch milliseconds; Calendar avoids
  // the deprecated java.util.Date.getHours.
  try {
    val cal = java.util.Calendar.getInstance()
    cal.setTimeInMillis(dateStr.toLong)
    cal.get(java.util.Calendar.HOUR_OF_DAY)
  } catch {
    case _: NumberFormatException => 0
  }
}
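The bucketing logic itself is language-independent; a quick plain-Python sketch with the same bucket names and hour boundaries as the Scala assignToDay above may help check the edge cases before wiring it into a UDF:

```python
def assign_to_day(hr):
    """Map an hour-of-day (0-24) to a named part of the day,
    using the same boundaries as the Scala assignToDay above."""
    if 7 <= hr < 12:
        return "morning"
    elif 12 <= hr < 14:
        return "lunch"
    elif 14 <= hr < 18:
        return "afternoon"
    elif 18 <= hr < 23:
        return "evening"
    elif 23 <= hr <= 24 or hr < 7:
        return "night"
    return "error"

assign_to_day(13)  # "lunch"
```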