Member since: 02-20-2018
Posts: 5
Kudos Received: 0
Solutions: 0
03-28-2018 07:35 AM
Hi GeKas,

Thanks for your response. Yes, we have a Java app writing data to delimited files, then loading into Hive using LOAD DATA INPATH. The question, indeed, is about how to properly format date values when writing them into the delimited data files, for columns designated as "timestamp" in Hive.

The language manual states: "Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch. Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format, declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps. On the table level, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats"."

Does this imply, then, that the data written to the delimited file should (at least in the default interpretation) adhere to the date format yyyy-mm-dd hh:mm:ss[.f...]? And does the "timezoneless" statement in the manual imply that this data must be in UTC? It is this last point that remains quite puzzling.

Here is a code sample which illustrates the approach we think we should be using. I'm trying to get this validated, especially as far as the handling of the time zone is concerned:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class DateThing {

        public static final String TS_FORMAT = "yyyy-MM-dd HH:mm:ss.SSS";

        public static String toString(Date date) {
            // Timestamps in text files used for loading data have to use the
            // format yyyy-mm-dd hh:mm:ss[.f...] per the Hive language manual.
            //
            // Timestamps are interpreted to be timezoneless and stored as an
            // offset from the UNIX epoch.
            SimpleDateFormat formatter = new SimpleDateFormat(TS_FORMAT);
            formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
            return formatter.format(date);
        }

        public static void main(String[] args) {
            Date d = new Date();
            System.out.println(d);
            System.out.println(DateThing.toString(d));
        }
    }

Example output:

    Wed Mar 28 10:21:22 EDT 2018
    2018-03-28 14:21:22.636

Let me know what you think, thanks.
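In case it helps, here is an equivalent take using java.time instead of SimpleDateFormat (which is not thread-safe). This is just a sketch of the same UTC-normalized formatting; the class name and the fixed demo instant are mine:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class HiveTimestampFormat {
    // Same pattern as above; DateTimeFormatter is immutable and thread-safe,
    // so one shared instance is fine.
    private static final DateTimeFormatter TS_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
                             .withZone(ZoneOffset.UTC);

    public static String format(Date date) {
        return TS_FORMAT.format(date.toInstant());
    }

    public static void main(String[] args) {
        // A fixed instant so the output is deterministic:
        Date d = Date.from(Instant.parse("2018-03-28T14:21:22.636Z"));
        System.out.println(format(d)); // prints 2018-03-28 14:21:22.636
    }
}
```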
03-28-2018 07:26 AM
Hi GeKas,

Thank you for the reply. Yes, your re-statement of the issue is exactly right. We have a Java application which generates delimited data files; we then use LOAD DATA INPATH <fname> to load into the right table. And yes, the question is: what format to use when writing to file, for columns whose type is defined as timestamp in Hive.

The language manual states:

    Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch. Convenience UDFs for conversion to and from timezones are provided (to_utc_timestamp, from_utc_timestamp).
    All existing datetime UDFs (month, day, year, hour, etc.) work with the TIMESTAMP data type.
    Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format, declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
    On the table level, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats" (as of release 1.2.0 with HIVE-9298). For example, yyyy-MM-dd'T'HH:mm:ss.SSS,yyyy-MM-dd'T'HH:mm:ss.

"Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]". OK. So, this format and none other, unless "timestamp.formats" is specified, right? And does "interpreted to be timezoneless" imply that the value must be a UTC value?

Basically, what I've got is the following, to illustrate how we populate timestamp values into delimited files. I want to make sure that a) the format is "legitimate" and b) most importantly, that our assumption about using UTC is correct.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class DateHandling {

        public static final String TS_FORMAT = "yyyy-MM-dd HH:mm:ss.SSS";

        public static String toString(Date date) {
            // Timestamps in text files used for loading data have to use the
            // format yyyy-mm-dd hh:mm:ss[.f...] per the Hive language manual.
            //
            // Timestamps are interpreted to be timezoneless and stored as an
            // offset from the UNIX epoch.
            SimpleDateFormat formatter = new SimpleDateFormat(TS_FORMAT);
            formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
            return formatter.format(date);
        }

        public static void main(String[] args) {
            Date d = new Date();
            System.out.println(d);
            System.out.println(DateHandling.toString(d));
        }
    }

Then here's an example of output:

    Wed Mar 28 10:21:22 EDT 2018
    2018-03-28 14:21:22.636

This timestamp would get written into the delimited file as "2018-03-28 14:21:22.636". Correct or incorrect?
03-19-2018 05:31 AM
Hi GeKas,

Thanks for your reply. Yes, we've been looking at unix_timestamp... So basically we'd be doing something like this:

    select * from mytable
    where mytimestamp >= unix_timestamp("2018-03-19 11:30:00 UTC", "yyyy-MM-dd HH:mm:ss z")
      and mytimestamp <  unix_timestamp("2018-03-19 12:30:40 UTC", "yyyy-MM-dd HH:mm:ss z")

Part of the complication is getting the persistence of the timestamp values into the table right; this happens to be a delimited-data-based table. So even if the range query above is the right stuff, improper persistence may break it. I had posted this: http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/How-to-properly-persist-timestamp-values-in-a-Hive-table-based/m-p/64777 It goes hand in hand with this post. Any thoughts on the persistence approach? Thanks.
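On the client side, the UTC boundary strings fed to unix_timestamp in the query above could be produced along these lines. This is only a sketch; the class name and the demo epoch value are mine:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class RangeBounds {
    // Renders a java.util.Date in the "yyyy-MM-dd HH:mm:ss z" form
    // that the range query passes to unix_timestamp.
    public static String toUtcBound(Date date) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(date) + " UTC";
    }

    public static void main(String[] args) {
        // 1521459000000 ms = 2018-03-19 11:30:00 UTC
        Date from = new Date(1521459000000L);
        System.out.println(toUtcBound(from)); // prints 2018-03-19 11:30:00 UTC
    }
}
```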
02-20-2018 04:33 PM
Hi,

My question is about performing timestamp range queries over a Hive table (defined using delimited data from files). What is the proper way to do it? I see one example:

    ts between '2017-03-01 00:00:03.00' and '2017-03-01 00:00:10.0'

Is this assuming the passed-in dates to be in local time? We want to be able to feed UTC date values. We're doing this via the Cloudera JDBC driver for Hive. I'm also seeing something about specifying the number of seconds since Epoch as a timestamp, like this:

    ts >= cast (? as timestamp) and ts < cast (? as timestamp)

where we're trying to convert input java.util.Date's to values that would work, as seconds since Epoch:

    // Get # seconds since Epoch
    public static BigDecimal dateToTimestamp(Date date) {
        return BigDecimal.valueOf(date.getTime()).scaleByPowerOfTen(-3);
    }

What are some of the recommended approaches? Thanks.
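For reference, here is the conversion above made self-contained and runnable, with a fixed input so the result can be checked by eye (the class name and demo value are mine):

```java
import java.math.BigDecimal;
import java.util.Date;

public class EpochSeconds {
    // Get # seconds since Epoch, keeping millisecond precision as a
    // decimal fraction (Date.getTime() is milliseconds since Epoch,
    // so shifting the decimal point three places left yields seconds).
    public static BigDecimal dateToTimestamp(Date date) {
        return BigDecimal.valueOf(date.getTime()).scaleByPowerOfTen(-3);
    }

    public static void main(String[] args) {
        Date d = new Date(1521459000636L);
        System.out.println(dateToTimestamp(d)); // prints 1521459000.636
    }
}
```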
Labels:
- Apache Hive