"Load data into table" behavior differs between HDP 2.3.2 and HDP 2.6.2

New Contributor

After upgrading HDP from 2.3.2.0 to 2.6.2.0, the behavior of "load data into table" changed.

The input data is hourly. Every hourly directory contains one file, and all of the files have the same name.

/user/user1/logs/yyyymmdd/00/part-r-00000.gz
/user/user1/logs/yyyymmdd/01/part-r-00000.gz
/user/user1/logs/yyyymmdd/02/part-r-00000.gz
/user/user1/logs/yyyymmdd/03/part-r-00000.gz
...
/user/user1/logs/yyyymmdd/22/part-r-00000.gz
/user/user1/logs/yyyymmdd/23/part-r-00000.gz

Before upgrade (HDP 2.3.2.0)

HQL
hive> load data inpath '/user/user1/logs/yyyymmdd/*/*.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');


Result
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_1.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_10.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_11.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_12.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_13.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_14.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_15.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_16.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_17.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_18.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_19.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_2.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_20.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_21.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_22.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_23.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_3.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_4.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_5.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_6.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_7.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_8.gz
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000_copy_9.gz

Every file except part-r-00000.gz was renamed to part-r-00000_copy_*.gz, so all 24 hourly files were kept.

After upgrade (HDP 2.6.2.0)

HQL
hive> load data inpath '/user/user1/logs/yyyymmdd/*/*.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');

Result
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd
/hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd/part-r-00000.gz

Only part-r-00000.gz is present.

This file is identical to the one that was loaded as part-r-00000_copy_23.gz before the upgrade.

When I load the files one by one, all of them are loaded, as in the HDP 2.3.2.0 environment.
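
For illustration, loading the files one by one could look like the sketch below. The exact statements that were run are not shown in this post, so these are assumed from the paths and table name above, and only the first two hourly directories are written out.

```sql
-- Sketch: one LOAD statement per hourly file instead of a single wildcard path.
-- yyyymmdd is the placeholder used throughout this post, not a literal value.
load data inpath '/user/user1/logs/yyyymmdd/00/part-r-00000.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');
load data inpath '/user/user1/logs/yyyymmdd/01/part-r-00000.gz' into table sample_db.sample_tbl partition (dt='yyyymmdd');
-- Repeat for hours 02 through 23.
```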

Why is the behavior different between 2.3.2.0 and 2.6.2.0?

Thanks in advance

  • OS : CentOS6
  • JDK : 1.8.0_152(Oracle)
  • HDP : 2.3.2.0 and 2.6.2.0
  • Hive : 1.2.1.2.3.2.0-2950 and 1.2.1000.2.6.2.0-205

ACCEPTED SOLUTION

Contributor

This looks like a bug (regression). I am able to reproduce the same behavior on the latest Hive master, though I haven't confirmed that it worked in the previous version. Feel free to open a JIRA to report this.

EDIT: I dug further into the code and found a workaround. Run set hive.mv.files.thread=0 before the load. This disables the parallel load of directories, and LOAD should then be able to load all directories by renaming the files.

This is definitely a bug that needs to be fixed. Please go ahead with the JIRA report if you can. Otherwise let me know and I'll file one.
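
As a minimal sketch of the workaround, the session below sets the property and then re-runs the LOAD statement from the question. It assumes the setting is applied per session in the Hive CLI before the load. The final dfs -ls line is an optional check that was not part of the original answer; it uses the warehouse path shown in the question.

```sql
-- Disable multi-threaded file moves for this session (workaround from the answer above).
set hive.mv.files.thread=0;

-- Same LOAD statement as in the question; yyyymmdd is a placeholder, not a literal date.
load data inpath '/user/user1/logs/yyyymmdd/*/*.gz'
into table sample_db.sample_tbl partition (dt='yyyymmdd');

-- Optional check: list the partition directory to see whether all 24 files arrived.
dfs -ls /hive/warehouse/sample_db.db/sample_tbl/dt=yyyymmdd;
```

If the workaround takes effect, the listing should again show part-r-00000.gz together with the part-r-00000_copy_*.gz files, as in the HDP 2.3.2.0 result.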

New Contributor

@vgarg

Thanks for checking and for the reply.

I have opened a JIRA issue to report this:

https://issues.apache.org/jira/browse/HIVE-18563

I can use the workaround for this issue.

Regards,

Jun