
What is the default behavior of INSERT OVERWRITE on an external HDFS table?

New Contributor

Our Hortonworks (HDP) version is 2.6.3.0-235 and our Hive version is 1.2.1000.

We have the following issue at the moment:

Hive INSERT OVERWRITE fails on an external table if the table's folder does not exist.

Here are the details:

We have an external table "config_another_test_output". Here is the table information:

+---------------------------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+---------------------------------------------------------------------------------------------------------------------+--+
| CREATE EXTERNAL TABLE `config_another_test_output`( |
| `stars` string, |
| `desc` string, |
| `pros` string, |
| `cons` string, |
| `advice` string, |
| `summary` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', |
| 'DSS_DEFAUT_HIVE_TABLE_TAG'='eyJtZ3QiOnRydWUsImZtdCI6IjhpeTNOYSIsInBhcnQiOiIxQjJNMlkiLCJsb2MiOiJNNzJVeHAifQ==', |
| 'numFiles'='1', |
| 'numRows'='621', |
| 'orc.compress'='SNAPPY', |
| 'rawDataSize'='1548153', |
| 'totalSize'='233502', |
| 'transient_lastDdlTime'='1542815783') |
+---------------------------------------------------------------------------------------------------------------------+--+

In the LOCATION clause, we have the HDFS path for this table.

Our current problem is: if the above output folder ('hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output') does not exist, the INSERT OVERWRITE will fail:

INFO: Loading data to table dev_tc30_dataiku.config_another_test_output from hdfs://titan/tmp/.hive-staging_hive_2018-11-21_10-45-41_452_43360044430205414-24417/-ext-10000

ERROR : Failed with exception Destination directory hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output has not be cleaned up.

org.apache.hadoop.hive.ql.metadata.HiveException: Destination directory hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output has not be cleaned up.
    at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3380)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1895)
    at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:373)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:162)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1756)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1497)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1294)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1161)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1156)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:197)
    at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:76)
    at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask (state=08S01,code=1)

Our questions are: What is the default behavior of INSERT OVERWRITE if the folder does not exist? Should INSERT OVERWRITE create the folder when it is missing? Is this an issue in Hive 1.2?

Thanks for all the help.

John

3 REPLIES

Expert Contributor

Hi @dahai long,

Have you tried using:

hdfs dfs -chown -R <your_user>:<your_group>  /<related path>

For example:

[hdfs@c3253-node3 ~]$ hdfs dfs -chown -R hive:hive /data1

I tried without the -R and the directory could not be created automatically, but once I changed the owner and added the -R, I did not have any issue. Could you please try that and let me know the results?
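
To double-check, you can look at the owner and group of the directory itself before and after the chown (just a sketch, using the same placeholder path as above):

# -d shows the directory entry itself (owner and group), not its contents
hdfs dfs -ls -d /<related path>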

Regards,

Ariel Q.

New Contributor

Hi Ariel, thanks for your reply.

We are actually using a vendor application that uses Hive. The application first deletes the folder and then does the INSERT OVERWRITE on the external table, which fails, and it is hard to change that process within the vendor application.

In our cluster configuration, Hive should be able to impersonate all users:

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>

The vendor claims that INSERT OVERWRITE should be able to create the folder if it does not exist:

"the behavior you’ve shown with beeline is not the expected behavior for insert/overwrite. Insert/overwrite should create the directory and write to it, or if it already exists, should clobber then remake it."

We are not sure whether this is true, which is why we are asking about the default behavior of INSERT OVERWRITE.

What we are testing is a very simple INSERT OVERWRITE:

INSERT OVERWRITE TABLE `dev_tc30_dataiku`.`config_another_test_output`
SELECT * FROM `dev_tc30_dataiku`.`config_another_test_dev_demo`
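
To be explicit, the sequence the vendor application effectively runs looks roughly like this from the command line (a sketch only; the JDBC URL is a placeholder and the path is the table's LOCATION):

# step 1: the application removes the external table's output folder
hdfs dfs -rm -r hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output
# step 2: it then runs the INSERT OVERWRITE, which fails with the MoveTask error shown earlier
beeline -u "<jdbc-url>" -e "INSERT OVERWRITE TABLE dev_tc30_dataiku.config_another_test_output SELECT * FROM dev_tc30_dataiku.config_another_test_dev_demo;"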

Thanks,

John

New Contributor

Hello,

We are having exactly the same problem.

First of all, be aware that INSERT OVERWRITE does not create the folder of your Hive table.

Your folder was created by the "CREATE TABLE..... LOCATION 'hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output'...." statement.

 

In fact, you are using an old version of DSS (Data Science Studio) from Dataiku. We ourselves are on DSS version 5.0.3, and at the time of writing, 8.0.4 is the latest release.

 

As you said: "the vendor application first delete the folder then do the insert overwrite on the external table". That's true.

 

But you have to know that Hive commands, like INSERT OVERWRITE or ANALYZE TABLE...COMPUTE STATISTICS, create a temporary folder.

 

You have to look at this line of your log:

Loading data to table dev_tc30_dataiku.config_another_test_output from hdfs://titan/tmp/.hive-staging_hive_2018-11-21_10-45-41_452_43360044430205414-24417/-ext-10000

 

A .hive-staging... temporary folder has been created under the /tmp directory.

 

Please go into Ambari and look at the Hive parameter hive.exec.stagingdir.

I am quite sure that its current value is /tmp/.hive-staging.

Please change it back to the default value, .hive-staging, and test again using DSS.
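
If you want to test this before changing the cluster-wide value in Ambari, something like the following should work (a sketch; it assumes your HiveServer2 lets you override hive.exec.stagingdir at the session level, and the JDBC URL is a placeholder):

# check the value currently in effect
beeline -u "<jdbc-url>" -e "SET hive.exec.stagingdir;"
# override it for this session only, then rerun the failing statement
beeline -u "<jdbc-url>" -e "SET hive.exec.stagingdir=.hive-staging; INSERT OVERWRITE TABLE dev_tc30_dataiku.config_another_test_output SELECT * FROM dev_tc30_dataiku.config_another_test_dev_demo;"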

 

Normally, Hive will now create a temporary folder under hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output/.hive-staging........

 

Of course, this .hive-staging... directory is temporary, but it is important to notice that the output folder has been recreated, and so the INSERT OVERWRITE will be OK.
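
You can verify afterwards that the output folder is back (a sketch; the path is your table's LOCATION):

# exits 0 and prints the message only if the directory exists again
hdfs dfs -test -d hdfs://titan/dev/10112/app/TC30/dataiku/CONFIG_ANOTHER_TEST/output && echo "output folder exists"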

 

In our case, we had the same problem because we upgraded our HDP cluster from the 2.6.1 to the 2.6.5 release, and we had to roll the hive.exec.stagingdir parameter back from /tmp/.hive-staging to .hive-staging.

 

Your ticket was created on 11-21-2018, and at that date the latest release of DSS was 5.0.3.

Please note that Dataiku has modified this behaviour in later releases, and I don't think the Hive folder is always deleted any more.

Regards,

Gilles