"Load data into table" behavior is different between HDP 2.6.2 and HDP 2.3.0
Labels: Apache Hive
Created ‎01-11-2018 10:49 AM
After upgrading HDP from 2.3.2.0 to 2.6.2.0, "load data into table" behavior changed.
The input data is hourly, and every file has the same name.
/user/user1/logs/yyyymmdd/00/part-r-00000.gz
/user/user1/logs/yyyymmdd/01/part-r-00000.gz
/user/user1/logs/yyyymmdd/02/part-r-00000.gz
/user/user1/logs/yyyymmdd/03/part-r-00000.gz
...
/user/user1/logs/yyyymmdd/22/part-r-00000.gz
/user/user1/logs/yyyymmdd/23/part-r-00000.gz
Before upgrade (HDP 2.3.2.0)
HQL:
hive> load data inpath '/user/user1/logs/yyyymmdd/*/*.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');
Result:
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_1.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_10.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_11.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_12.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_13.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_14.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_15.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_16.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_17.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_18.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_19.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_2.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_20.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_21.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_22.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_23.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_3.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_4.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_5.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_6.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_7.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_8.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_9.gz
Every file except the first one, part-r-00000.gz, was renamed to part-r-00000_copy_*.gz.
After upgrade (HDP 2.6.2.0)
HQL:
hive> load data inpath '/user/user1/logs/yyyymmdd/*/*.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');
Result:
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000.gz
Only part-r-00000.gz exists.
Its content is the same as that of part-r-00000_copy_23.gz in the HDP 2.3.2.0 result.
When I load the files one by one, all of them are loaded, just as in the HDP 2.3.2.0 environment.
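As a rough sketch, loading the files one by one means issuing a separate LOAD statement per hourly directory instead of a single wildcard load. The paths, table, and partition are the ones shown above; only the first two hours are written out, and the remaining hours follow the same pattern.

```sql
-- One LOAD per hourly file instead of a single wildcard LOAD.
-- Paths, table, and partition value are the placeholders used in this post.
load data inpath '/user/user1/logs/yyyymmdd/00/part-r-00000.gz'
  into table sample_db.sample_tbl partition (dt='yyyymmdd');

load data inpath '/user/user1/logs/yyyymmdd/01/part-r-00000.gz'
  into table sample_db.sample_tbl partition (dt='yyyymmdd');

-- Repeat for hours 02 through 23.
```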
Why is the behavior different between 2.3.2.0 and 2.6.2.0?
Thanks in advance
- OS : CentOS6
- JDK : 1.8.0_152(Oracle)
- HDP : 2.3.2.0 and 2.6.2.0
- Hive : 1.2.1.2.3.2.0-2950 and 1.2.1000.2.6.2.0-205
Created ‎01-22-2018 11:59 PM
This looks like a bug (a regression). I can reproduce the same behavior on the latest Hive master, though I haven't confirmed that it worked in the previous version. Feel free to open a JIRA to report this.
EDIT: I dug further into the code and found a workaround. Use set hive.mv.files.thread=0. This disables the parallel loading of directories, so LOAD should be able to load all of the directories by renaming the files.
This is definitely a bug that needs to be fixed. Please go ahead with the JIRA report if you can. Otherwise, let me know and I'll file one.
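Applied to the statement from the question, the workaround might look like the sketch below. It assumes the property can be changed at session level in your environment (some setups restrict which properties a session may set); the path, table, and partition are the placeholders from the original post.

```sql
-- Workaround: disable the parallel file-move threads for this session.
set hive.mv.files.thread=0;

-- Same wildcard LOAD as in the question.
load data inpath '/user/user1/logs/yyyymmdd/*/*.gz'
  into table sample_db.sample_tbl partition (dt='yyyymmdd');
```

Afterwards, listing the partition directory should show whether all 24 files arrived with the _copy_N names, as they did on HDP 2.3.2.0.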
Created ‎01-29-2018 09:43 AM
Thanks for checking and for the reply.
I have opened a JIRA issue to report this:
https://issues.apache.org/jira/browse/HIVE-18563
I can use the workaround for this issue.
Regards,
Jun
