Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Bug in Hive (HDP 2.1), Timestamp data overwritten during INSERT processing

avatar
Rising Star

We have experienced an issue where (re)processing data in Hive overwrites timestamp data. This occurs with HDP 2.1, but not 2.3.

We are using Hive to run an ad hoc 'reorg' or 'reprocess' on existing Hive tables to reduce the number of files stored - improving query performance and reducing pressure on the cluster (found a nice explanation from @david.streever here https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster....).

The active Hive table is added to daily, creating at least one ORC file per day. The schema contains several timestamp columns (e.g. created_timestamp for when each record was origingally created on the source system).

We then create a reorgTable with an identical schema to activeTable, copy the data from activeTable to the reorgTable which combines many of the smaller daily files reducing the overall number.

However, this process edits/overwrites timestamp data (and does not touch other columns):

1. Contents of activeTable

ID created_timestamp

01 2000-01-01 13:08:21.110

02 1970-01-01 01:02:03.450

03 1990-10-08 03:09:02.780

2. Copy data from activeTable to reorgTable

INSERT INTO TABLE reorgTable SELECT * FROM activeTable;

3. Contents of the reorgTable

ID created_timestamp

01 1990-10-08 03:09:02.780

02 1990-10-08 03:09:02.780

03 1990-10-08 03:09:02.780

Has anyone else experienced this? Is there a solution other than upgrading? Or an alternative way to reprocess the data that might not have the same effect?

Thank you!

1 ACCEPTED SOLUTION

avatar

Just elaborating on my above comment which @Emily Sharpe has already verified as the workaround.

The issue is in the Vectorization code path, see Apache Hive JIRA HIVE-8197, the issue should be fixed in both HDP 2.2.x and HDP 2.3.x.

The workaround is to disable vectorization by setting hive.vectorized.execution.enabled = false.

View solution in original post

6 REPLIES 6

avatar

Hi Emily, what maintenance version are you using for 2.1.x? Also, what's the data type for the "created_timestamp" column? I assume the DDL for the activeTable is identical to the reorgTable?

avatar
Rising Star

Hi Scott, it's HDP 2.1.11 (Hive 0.13.1), and the data type is "timestamp". The DDLs are identical. I am trying to avoid storing the data as a different type but can do this until an upgrade if necessary

avatar

Can you try the same with hive.vectorized.execution.enabled = false?

avatar
Rising Star

Hi Deepesh, gave this a try - worked perfectly! Thank you!

avatar

Just elaborating on my above comment which @Emily Sharpe has already verified as the workaround.

The issue is in the Vectorization code path, see Apache Hive JIRA HIVE-8197, the issue should be fixed in both HDP 2.2.x and HDP 2.3.x.

The workaround is to disable vectorization by setting hive.vectorized.execution.enabled = false.

avatar
Rising Star

Thanks to @Deepesh for the workaround. Also wanted to add (for info) that these steps will not be required after HDP upgrade. We will use

 ALTER TABLE activeTable CONCATENATE;

to combine the many smaller ORC files into fewer larger ones (possible from Hive 0.14+).

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Par...