New Contributor
Posts: 4
Registered: ‎11-30-2017

return code 3 on insertion in hive table


I have to create a partitioned table in Hive. The data covers 100 customers, each with millions of records, so I created a partitioned table.

create table ptable (
foo String,
bar String)
PARTITIONED BY (customer_name String, studio_name String, ack_name String)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/lucy';

 

The table was created successfully. Then I insert the data into the table:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;


INSERT OVERWRITE TABLE ptable PARTITION(customer_name, studio_name, ack_name)
SELECT foo,
bar,
customer_name,
studio_name,
ack_name
FROM stable;


It gives this error:
>ERROR : Status: Failed
>ERROR : FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
>INFO : Completed executing command(queryId=hive_20171130115757_fcbf193e-9b07-4946-b051-61005b7d0cbf); Time taken: 91.358 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3)


I have given more memory to the insert query:

SET mapreduce.map.memory.mb=7000;
SET mapreduce.map.java.opts=-Xmx7200m;
SET mapreduce.reduce.memory.mb=7000;
SET mapreduce.reduce.java.opts=-Xmx7200m;

 

I am still getting the same error.

When I try it with a LIMIT in the query, it works fine:

INSERT OVERWRITE TABLE ptable PARTITION(customer_name, studio_name, ack_name)
SELECT foo,
bar,
customer_name,
studio_name,
ack_name
FROM stable limit 100;

 

Are there any properties I should set in Hive?

Cloudera Employee
Posts: 1
Registered: ‎12-22-2017

Re: return code 3 on insertion in hive table

Hi,

You are using Spark as the execution engine. You can try to tune the Hive on Spark configuration based on https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_hos_tuning.html
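For example, the kinds of settings that page covers can be set per session before the query (the values below are only illustrative placeholders, not recommendations; size them to your executors and YARN containers):

SET spark.executor.cores=4;                    -- cores per Spark executor
SET spark.executor.memory=8g;                  -- executor heap size
SET spark.yarn.executor.memoryOverhead=2048;   -- off-heap overhead, in MB
SET spark.driver.memory=4g;                    -- driver heap size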

Cloudera Employee
Posts: 30
Registered: ‎11-20-2015

Re: return code 3 on insertion in hive table

Please try setting the following configuration before the INSERT query:

 

 

SET hive.optimize.sort.dynamic.partition=true; 

 

Performing an INSERT into a dynamically partitioned Parquet table creates a classic problem. I understand that this is a Spark application, but it's perhaps easier to visualize the issue using the MapReduce framework.

 

For each open file, the Parquet libraries create a large (~128 MB) buffer. As records are written to the file, they first go to the buffer. Once the buffer fills up, its contents are flushed to the file along with a whole bunch of metadata about its contents. This metadata is part of the beauty of Parquet. One piece of metadata, for example, might be the min and max values for the contents of the buffer. If I'm querying for personal data for people older than 45 years of age, the system can quickly look at this min/max metadata to determine whether a file contains any relevant records without having to step through every record in the file. Batching the records also allows for tight compression.
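As an aside (not something you necessarily need here), that ~128 MB figure is the Parquet row-group size, which Hive exposes as parquet.block.size, in bytes. Shrinking it reduces the memory held per open file at the cost of less effective metadata and compression. A rough sketch, with 64 MB purely as an illustrative value:

-- illustrative only: halve the Parquet row-group buffer to ~64 MB for this session
SET parquet.block.size=67108864;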

 

So, as each Mapper reads in data, it routes the data to the correct partition in HDFS. This means that every time a Mapper reads a record for a partition it has not yet encountered, it has to open a new file in a new partition and then add the record to it. These files stay open until the Mapper is done.


So, if we have a very randomized data set and a Mapper is likely to come across records that go into many discrete partitions, then each partition will require its own Parquet file, each with a ~128 MB memory buffer. For this example, if there are 100 partitions and a Mapper finds records destined for each of them, that is 128 MB x 100 = 12800 MB = 12.5 GB per Mapper just for buffers.
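If you want to gauge that worst case for your own data, one rough sketch (using the source table and partition columns from the question above) is to count the distinct partition combinations the INSERT will create:

-- number of dynamic partitions the INSERT will create, i.e. the upper bound
-- on how many Parquet buffers a single Mapper might hold open at once
SELECT COUNT(*) AS num_partitions
FROM (
  SELECT DISTINCT customer_name, studio_name, ack_name
  FROM stable
) p;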

 

The optimization 'hive.optimize.sort.dynamic.partition' will force a Reducer phase to occur. Each Reducer will receive the records from the Mappers for just a small number (1, 2, 3) of partitions. This means that every Mapper simply streams the records to a flat file on local disk, routing the records into files per Reducer, and that each Reducer only has to keep a small number of Parquet buffers open.
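Putting it together with the query from the question, the full sequence would look roughly like this (the same INSERT as above, just with the extra setting in front):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- force the extra Reducer phase so each task only keeps a few Parquet buffers open
SET hive.optimize.sort.dynamic.partition=true;

INSERT OVERWRITE TABLE ptable PARTITION(customer_name, studio_name, ack_name)
SELECT foo,
       bar,
       customer_name,
       studio_name,
       ack_name
FROM stable;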

 

Thanks.
