Created 12-10-2015 04:36 AM
We have experienced an issue where (re)processing data in Hive overwrites timestamp data. This occurs with HDP 2.1, but not 2.3.
We are using Hive to run an ad hoc 'reorg' or 'reprocess' on existing Hive tables to reduce the number of files stored - improving query performance and reducing pressure on the cluster (found a nice explanation from @david.streever here https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster....).
The active Hive table is added to daily, creating at least one ORC file per day. The schema contains several timestamp columns (e.g. created_timestamp for when each record was origingally created on the source system).
We then create a reorgTable with an identical schema to activeTable, copy the data from activeTable to the reorgTable which combines many of the smaller daily files reducing the overall number.
However, this process edits/overwrites timestamp data (and does not touch other columns):
1. Contents of activeTable
ID created_timestamp
01 2000-01-01 13:08:21.110
02 1970-01-01 01:02:03.450
03 1990-10-08 03:09:02.780
2. Copy data from activeTable to reorgTable
INSERT INTO TABLE reorgTable SELECT * FROM activeTable;
3. Contents of the reorgTable
ID created_timestamp
01 1990-10-08 03:09:02.780
02 1990-10-08 03:09:02.780
03 1990-10-08 03:09:02.780
Has anyone else experienced this? Is there a solution other than upgrading? Or an alternative way to reprocess the data that might not have the same effect?
Thank you!
Created 12-11-2015 12:57 AM
Just elaborating on my above comment which @Emily Sharpe has already verified as the workaround.
The issue is in the Vectorization code path, see Apache Hive JIRA HIVE-8197, the issue should be fixed in both HDP 2.2.x and HDP 2.3.x.
The workaround is to disable vectorization by setting hive.vectorized.execution.enabled = false.
Created 12-10-2015 01:18 PM
Hi Emily, what maintenance version are you using for 2.1.x? Also, what's the data type for the "created_timestamp" column? I assume the DDL for the activeTable is identical to the reorgTable?
Created 12-10-2015 11:31 PM
Hi Scott, it's HDP 2.1.11 (Hive 0.13.1), and the data type is "timestamp". The DDLs are identical. I am trying to avoid storing the data as a different type but can do this until an upgrade if necessary
Created 12-10-2015 08:22 PM
Can you try the same with hive.vectorized.execution.enabled = false?
Created 12-11-2015 12:28 AM
Hi Deepesh, gave this a try - worked perfectly! Thank you!
Created 12-11-2015 12:57 AM
Just elaborating on my above comment which @Emily Sharpe has already verified as the workaround.
The issue is in the Vectorization code path, see Apache Hive JIRA HIVE-8197, the issue should be fixed in both HDP 2.2.x and HDP 2.3.x.
The workaround is to disable vectorization by setting hive.vectorized.execution.enabled = false.
Created 12-11-2015 05:37 AM
Thanks to @Deepesh for the workaround. Also wanted to add (for info) that these steps will not be required after HDP upgrade. We will use
ALTER TABLE activeTable CONCATENATE;
to combine the many smaller ORC files into fewer larger ones (possible from Hive 0.14+).