Support Questions

emilysharpe · ‎12-10-2015

We have experienced an issue where (re)processing data in Hive overwrites timestamp data. This occurs with HDP 2.1, but not 2.3.

We are using Hive to run an ad hoc 'reorg' or 'reprocess' on existing Hive tables to reduce the number of files stored - improving query performance and reducing pressure on the cluster (found a nice explanation from @david.streever here https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster....).

The active Hive table is added to daily, creating at least one ORC file per day. The schema contains several timestamp columns (e.g. created_timestamp for when each record was origingally created on the source system).

We then create a reorgTable with an identical schema to activeTable, copy the data from activeTable to the reorgTable which combines many of the smaller daily files reducing the overall number.

However, this process edits/overwrites timestamp data (and does not touch other columns):

1. Contents of activeTable

ID created_timestamp

01 2000-01-01 13:08:21.110

02 1970-01-01 01:02:03.450

03 1990-10-08 03:09:02.780

2. Copy data from activeTable to reorgTable

INSERT INTO TABLE reorgTable SELECT * FROM activeTable;

3. Contents of the reorgTable

ID created_timestamp

01 1990-10-08 03:09:02.780

02 1990-10-08 03:09:02.780

03 1990-10-08 03:09:02.780

Has anyone else experienced this? Is there a solution other than upgrading? Or an alternative way to reprocess the data that might not have the same effect?

Thank you!

deepesh1 · ‎12-11-2015

Just elaborating on my above comment which @Emily Sharpe has already verified as the workaround.

The issue is in the Vectorization code path, see Apache Hive JIRA HIVE-8197, the issue should be fixed in both HDP 2.2.x and HDP 2.3.x.

The workaround is to disable vectorization by setting hive.vectorized.execution.enabled = false.

View solution in original post

SQLShaw · ‎12-10-2015

Hi Emily, what maintenance version are you using for 2.1.x? Also, what's the data type for the "created_timestamp" column? I assume the DDL for the activeTable is identical to the reorgTable?

emilysharpe · ‎12-10-2015

Hi Scott, it's HDP 2.1.11 (Hive 0.13.1), and the data type is "timestamp". The DDLs are identical. I am trying to avoid storing the data as a different type but can do this until an upgrade if necessary

deepesh1 · ‎12-10-2015

Can you try the same with hive.vectorized.execution.enabled = false?

emilysharpe · ‎12-11-2015

Hi Deepesh, gave this a try - worked perfectly! Thank you!

deepesh1 · ‎12-11-2015

Just elaborating on my above comment which @Emily Sharpe has already verified as the workaround.

The issue is in the Vectorization code path, see Apache Hive JIRA HIVE-8197, the issue should be fixed in both HDP 2.2.x and HDP 2.3.x.

The workaround is to disable vectorization by setting hive.vectorized.execution.enabled = false.

emilysharpe · ‎12-11-2015

Thanks to @Deepesh for the workaround. Also wanted to add (for info) that these steps will not be required after HDP upgrade. We will use

 ALTER TABLE activeTable CONCATENATE;

to combine the many smaller ORC files into fewer larger ones (possible from Hive 0.14+).

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Par...

Cloudera Community

Support Questions

Bug in Hive (HDP 2.1), Timestamp data overwritten during INSERT processing