How to perform a date range query in Hive on a timestamp column


Hi,

 

My question is about performing timestamp range queries over a Hive table (defined using delimited data from files).

 

What is the proper way to do it?

 

I've seen one example: ts between '2017-03-01 00:00:03.00' and '2017-03-01 00:00:10.0' - does this assume the passed-in dates are in local time? We want to be able to feed in UTC date values.

 

We're doing this via the Cloudera JDBC driver for Hive.

 

I've also seen something about specifying the number of seconds since the Epoch as a timestamp, like this:

 

ts >= cast (? as timestamp) and ts < cast (? as timestamp)

 

where we're trying to convert the input java.util.Date values into something that works, i.e. seconds since the Epoch:

 

// Get # seconds since Epoch
public static BigDecimal dateToTimestamp(Date date) {
    // getTime() returns milliseconds; shifting the decimal point three
    // places left yields seconds with millisecond precision.
    return BigDecimal.valueOf(date.getTime()).scaleByPowerOfTen(-3);
}
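
For context, here's a minimal sketch of the JDBC side we have in mind (connection setup omitted; "mytable" and "ts" are placeholder names, and we're assuming the driver accepts decimal parameters via setBigDecimal):

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Date;

public class TimestampRangeQuery {

  // Placeholder query; "mytable" and "ts" stand in for our real names.
  private static final String SQL =
      "select * from mytable"
      + " where ts >= cast(? as timestamp) and ts < cast(? as timestamp)";

  public static void runRangeQuery(Connection conn, Date from, Date to) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(SQL)) {
      // Bind the range endpoints as epoch seconds (BigDecimal).
      ps.setBigDecimal(1, dateToTimestamp(from));
      ps.setBigDecimal(2, dateToTimestamp(to));
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // process each row here
        }
      }
    }
  }

  // Same conversion as above: milliseconds -> seconds since the Epoch.
  public static BigDecimal dateToTimestamp(Date date) {
    return BigDecimal.valueOf(date.getTime()).scaleByPowerOfTen(-3);
  }
}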

What are some of the recommended approaches?

 

Thanks.


Super Collaborator

Hive has built-in functions for these conversions:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

 

You should give this a try:

unix_timestamp(string date, string pattern)

Converts a time string with the given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to a Unix timestamp in seconds, returning 0 on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.

 

 

This lets you define the pattern of the timestamp you provide (including the timezone info).


Hi GeKas,

 

Thanks for your reply. Yes, we've been looking at unix_timestamp...

 

So basically we'd be doing something like this:

select * from mytable where
mytimestamp >= unix_timestamp("2018-03-19 11:30:00 UTC", "yyyy-MM-dd HH:mm:ss z")
and
mytimestamp < unix_timestamp("2018-03-19 12:30:40 UTC", "yyyy-MM-dd HH:mm:ss z")
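
If we go this route, presumably we'd parameterize it through the JDBC driver along these lines (a sketch; "mytable" and "mytimestamp" are placeholder names):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UnixTimestampRangeQuery {

  // The pattern matches the UTC strings we pass in.
  private static final String SQL =
      "select * from mytable"
      + " where mytimestamp >= unix_timestamp(?, 'yyyy-MM-dd HH:mm:ss z')"
      + " and mytimestamp < unix_timestamp(?, 'yyyy-MM-dd HH:mm:ss z')";

  public static void run(Connection conn, String fromUtc, String toUtc) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(SQL)) {
      ps.setString(1, fromUtc); // e.g. "2018-03-19 11:30:00 UTC"
      ps.setString(2, toUtc);   // e.g. "2018-03-19 12:30:40 UTC"
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // process each row here
        }
      }
    }
  }
}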

Part of the complication is getting the persistence of the timestamp values into the table right; the table happens to be backed by delimited data files.

 

So even if the range query above is correct, improper persistence may break it. I had posted this:

 

http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/How-to-properly-persist-timestamp-values-in-a...

 

It goes hand in hand with this post. Any thoughts on the persistence approach? Thanks.

Super Collaborator

I am not sure I understand your request.

What I understand:

* You have a Java application that writes data to a delimited file.

* This file is somehow loaded into Hive (e.g. load data inpath 'file.csv' into table output_table)

 

And the question is what format to use when writing to the file?

If this is correct, and you have defined the column as timestamp, then the output should be in one of the timestamp formats supported by Hive:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps...

 

If you have defined it as string/varchar, then you can use any format you want and convert it to seconds from the Unix epoch in Hive using the built-in functions.


Hi GeKas,

 

Thank you for the reply.

 

Yes, your re-statement of the issue is exactly right. We have a Java application which generates delimited data files; we then use LOAD DATA INPATH <fname> to load them into the right table.

 

And yes, the question is what format to use when writing to the file, for columns whose type is defined as timestamp in Hive.

 

The language manual states:

 

Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch. Convenience UDFs for conversion to and from timezones are provided (to_utc_timestamp, from_utc_timestamp).
All existing datetime UDFs (month, day, year, hour, etc.) work with the TIMESTAMP data type.

Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format, declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.

On the table level, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats" (as of release 1.2.0 with HIVE-9298). For example, yyyy-MM-dd'T'HH:mm:ss.SSS,yyyy-MM-dd'T'HH:mm:ss.

"Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]". OK. So, this format and none other, unless "timestamp.formats" is specified, right?

 

Does "interpreted to be timezoneless" imply that the value must be a UTC value?

 

Basically, what I've got is the following, which illustrates how we populate timestamp values into delimited files. I want to make sure that a) the format is legitimate, and b) most importantly, that our assumption about using UTC is correct.

 

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateHandling {

  public static final String TS_FORMAT = "yyyy-MM-dd HH:mm:ss.SSS";

  public static String toString(Date date) {
    // Timestamps in text files used for loading data have to use the
    // format yyyy-mm-dd hh:mm:ss[.f...] per Hive language manual.
    //
    // Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch.
    SimpleDateFormat formatter = new SimpleDateFormat(TS_FORMAT);
    formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
    return formatter.format(date);
  }

  public static void main(String[] args) {
    Date d = new Date();
    System.out.println(d);
    System.out.println(DateHandling.toString(d));
  }
}

Then here's an example of output:

 

Wed Mar 28 10:21:22 EDT 2018

2018-03-28 14:21:22.636

 

This timestamp would get written into the delimited file as "2018-03-28 14:21:22.636". Correct or incorrect?

Super Collaborator

Hi. First of all, sorry for the late reply; I was away for some time.

According to this format, "yyyy-mm-dd hh:mm:ss[.f...]", yes, you have to store it in UTC.
In order to store dates in other timezones, the format would have to include a "Z" field, i.e. the offset in hours from UTC.
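
For illustration, a minimal sketch of what producing such a value could look like on the Java side ("Z" is SimpleDateFormat's RFC 822 offset pattern; the table would then also need an alternative pattern, e.g. via the "timestamp.formats" SerDe property quoted earlier, which is an assumption on my part, not something I have tested):

import java.text.SimpleDateFormat;
import java.util.Date;

public class ZonedFormatSketch {

  public static void main(String[] args) {
    // "Z" emits the RFC 822 offset from UTC, e.g. "-0400" when the
    // JVM default timezone is EDT.
    SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z");
    System.out.println(formatter.format(new Date()));
    // e.g. 2018-03-28 10:21:22.636 -0400
  }
}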