Created 01-11-2018 10:49 AM
After upgrading HDP from 2.3.2.0 to 2.6.2.0, "load data into table" behavior changed.
The input data is hourly, and every file has the same name.
/user/user1/logs/yyyymmdd/00/part-r-00000.gz
/user/user1/logs/yyyymmdd/01/part-r-00000.gz
/user/user1/logs/yyyymmdd/02/part-r-00000.gz
/user/user1/logs/yyyymmdd/03/part-r-00000.gz
...
/user/user1/logs/yyyymmdd/22/part-r-00000.gz
/user/user1/logs/yyyymmdd/23/part-r-00000.gz
Before upgrade (HDP 2.3.2.0)
HQL
hive> load data inpath '/user/user1/logs/yyyymmdd/*/*.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');
Result
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_1.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_10.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_11.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_12.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_13.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_14.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_15.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_16.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_17.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_18.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_19.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_2.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_20.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_21.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_22.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_23.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_3.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_4.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_5.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_6.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_7.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_8.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_9.gz
All files except part-r-00000.gz were renamed to part-r-00000_copy_*.gz.
After upgrade (HDP 2.6.2.0)
HQL
hive> load data inpath '/user/user1/logs/yyyymmdd/*/*.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');
Result
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000.gz
Only part-r-00000.gz is present.
This file is the same as part-r-00000_copy_23.gz in the HDP 2.3.2.0 result.
When I load the files one by one, all of them are loaded, as in the HDP 2.3.2.0 environment.
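For illustration, the one-by-one load could look like the sketch below. It assumes the hourly directory layout listed above; only the first two hours are written out, and the remaining hours would follow the same pattern.

```sql
-- Load each hourly file with its own statement instead of one glob.
load data inpath '/user/user1/logs/yyyymmdd/00/part-r-00000.gz'
  into table sample_db.sample_tbl partition (dt='yyyymmdd');

load data inpath '/user/user1/logs/yyyymmdd/01/part-r-00000.gz'
  into table sample_db.sample_tbl partition (dt='yyyymmdd');

-- Repeat for hours 02 through 23.
```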
Why is the behavior different between 2.3.2.0 and 2.6.2.0 ?
Thanks in advance
Created 01-22-2018 11:59 PM
This looks like a bug (regression). I can reproduce the same behavior on the latest Hive master, though I haven't confirmed that it worked in previous versions. Feel free to open a JIRA to report this.
EDIT: I dug further into the code and found a workaround. Run set hive.mv.files.thread=0 before the load. This disables the parallel loading of directories, and LOAD should then be able to load all the directories by renaming the files.
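As a rough sketch, the workaround applied to the statement from the question would look like this in a Hive CLI session. The paths and table name are the ones from the original post, and the final dfs -ls is only an optional check that all 24 files arrived.

```sql
-- Workaround: disable the parallel file move for this session.
set hive.mv.files.thread=0;

-- Same LOAD statement as in the question.
load data inpath '/user/user1/logs/yyyymmdd/*/*.gz'
  into table sample_db.sample_tbl partition (dt='yyyymmdd');

-- Optional check: list the partition directory from the Hive CLI.
dfs -ls /hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd;
```

The setting applies only to the current session, so it has to be issued again in each new session unless it is set in the Hive configuration.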
This is definitely a bug which needs to be fixed. Please go ahead with the JIRA report if you can. Otherwise let me know and I'll file one.
Created 01-29-2018 09:43 AM
Thanks for checking and for the reply.
I have opened a JIRA to report this:
https://issues.apache.org/jira/browse/HIVE-18563
I can use the workaround for this issue.
Regards,
Jun