Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

New Contributor

I have a map-only job that processes SiteCatalyst data and writes SequenceFile output compressed with Snappy. After the job completes, my driver appends the data from each SequenceFile to a Parquet table. Most of the time this works, but it regularly crashes the Impala daemon.

I can’t find a pattern to the crashes. Sometimes I make it through a full day of SiteCatalyst data and sometimes not, but I have never made it through two days. Sometimes it fails in the same place, sometimes not. Sometimes I can restart my process and it continues; other times it keeps crashing Impala. I’ve tried various things, including setting disable_codegen=true, and nothing has helped.

I’m building a prototype for a client and I really want to use Parquet tables, because queries on small sets of columns are remarkably fast, but I’ve hit a brick wall with this.

Process – for each SequenceFile:

  1. Move the SequenceFile from the job output directory to a staging directory
  2. Drop staging table if it exists
  3. Create staging table: CREATE EXTERNAL TABLE db.stageTable is executed with LOCATION pointing to the staging directory.
  4. Query the staging table count(*), display the count in the logs
  5. Create the target table if it doesn’t already exist: CREATE TABLE IF NOT EXISTS db.targetTable LIKE db.stageTable STORED AS PARQUET
  6. Query the target table count(*), display the “Before” count in the logs
  7. INSERT INTO TABLE db.targetTable SELECT * FROM db.stageTable (this is where impalad crashes)
  8. REFRESH db.targetTable
  9. Query the target table count(*), display the “After” count in the logs
  10. DROP TABLE db.stageTable
  11. Delete the SequenceFile from the staging directory
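
The steps above can be sketched as a small statement generator. This is a minimal Python sketch, not the driver's actual code (the driver in this post is Java and is not shown). The function name build_statements and its parameters are hypothetical, and the SQL text follows the statements visible in the job log further down. It only builds the SQL for steps 2 through 10; moving and deleting the SequenceFile (steps 1 and 11) and executing the statements are left out.

```python
def build_statements(db, stage, target, columns_ddl, location):
    """Return the SQL for one SequenceFile, in the order of steps 2-10.

    All names here are illustrative assumptions; columns_ddl is the
    comma-separated column list and location is the staging directory.
    """
    stage_fq = f"{db}.{stage}"
    target_fq = f"{db}.{target}"
    return [
        # Step 2: drop the staging table if it exists
        f"DROP TABLE IF EXISTS {stage_fq}",
        # Step 3: external staging table over the staging directory
        f"CREATE EXTERNAL TABLE {stage_fq} ({columns_ddl}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        f"STORED AS SEQUENCEFILE LOCATION '{location}'",
        # Step 4: staging row count
        f"SELECT COUNT(*) FROM {stage_fq}",
        # Step 5: Parquet target table with the same columns
        f"CREATE TABLE IF NOT EXISTS {target_fq} LIKE {stage_fq} STORED AS PARQUET",
        # Step 6: "Before" count
        f"SELECT COUNT(*) FROM {target_fq}",
        # Step 7: the statement that crashes impalad
        f"INSERT INTO TABLE {target_fq} SELECT * FROM {stage_fq}",
        # Step 8: refresh target metadata
        f"REFRESH {target_fq}",
        # Step 9: "After" count
        f"SELECT COUNT(*) FROM {target_fq}",
        # Step 10: drop the staging table
        f"DROP TABLE {stage_fq}",
    ]
```

Keeping the statements as a plain list makes it easy to log each one before it runs, so the last statement logged before a crash identifies the failing step.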

 

Other Observations

  1. Environment: Cloudera Standard 4.8.2 (#101 built by jenkins on 20140226-1855 git: 8609801079d228f2440a493a3880ab68cad0524b), Impala 1.2.4, Hardware – 3x HP 6005 16GB 1.5TB (home office dev cluster)
  2. The above SQL is executed via Hive JDBC (0.13.0). I thought there might be a JDBC driver issue, so I wrote all of the SQL to .sql files and executed them via impala-shell. I ran into the same impalad crashes, so I don’t think it’s the JDBC driver.
  3. Some tables have relatively few columns and a relatively large number of rows, others just the opposite; I don’t see any pattern to the failures.
  4. I’ve tried running COMPUTE STATS after a crash. Sometimes the process then continues, other times it crashes impalad again; no pattern.
  5. Sometimes, once a failure is encountered, that SequenceFile continues to crash impalad and I can’t get past that point. That’s a big issue for me right now; I can’t find a workaround.
  6. I looked at the logs in /var/log/impalad and didn’t see anything that jumped out at me (i.e. no exceptions), but I don’t know what “good” looks like, so I can’t judge “bad”.
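
Observation 2 describes writing the SQL to .sql files and running them through impala-shell to rule out the JDBC driver. A minimal Python sketch of that path follows, under stated assumptions: the helper names write_sql_file and impala_shell_command are hypothetical, and the sketch only builds the command line rather than running it. The -i (impalad host:port) and -f (query file) options are standard impala-shell options.

```python
import os
import tempfile


def write_sql_file(statements, directory):
    """Write statements to a new .sql file, one per line, each ending in ';'."""
    fd, path = tempfile.mkstemp(suffix=".sql", dir=directory)
    with os.fdopen(fd, "w") as handle:
        for statement in statements:
            # Strip any trailing ';' so every statement ends in exactly one
            handle.write(statement.rstrip(";") + ";\n")
    return path


def impala_shell_command(impalad, sql_path):
    """Build the argv for impala-shell: -i selects the impalad, -f the query file."""
    return ["impala-shell", "-i", impalad, "-f", sql_path]
```

The resulting list can be passed to subprocess.run; a non-zero exit status from impala-shell then marks the file whose statements failed, independent of the JDBC driver.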

 

Below is a snapshot from the job log, showing the CREATE TABLE and the other statements up to the failure.

Thanks in advance

Pete Zybrick

 

14/04/25 14:54:19 INFO hadoop.CatalystDriver: ==============================================================
14/04/25 14:54:19 INFO hadoop.CatalystDriver: Processing event-m-00000
14/04/25 14:54:19 INFO hadoop.CatalystDriver: DROP TABLE IF EXISTS demodb.event_stage
14/04/25 14:54:19 INFO hadoop.CatalystDriver: CREATE EXTERNAL TABLE demodb.event_stage  (  udc_target_date string,  udc_in_file_name string,  udc_in_file_offset bigint,  date_time timestamp,  event_type string,  event_seqn smallint,  event_id string,  event_value string,  is_secondary_lookup smallint  )  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  STORED AS SEQUENCEFILE  LOCATION '/user/ipcdev/catalyststage/event-m-00000'
14/04/25 14:54:19 INFO hadoop.CatalystDriver: REFRESH demodb.event_stage
14/04/25 14:54:21 INFO hadoop.CatalystDriver: Stage table numRows=23634507
14/04/25 14:54:21 INFO hadoop.CatalystDriver: Target table numRows Before=0
14/04/25 14:54:21 INFO hadoop.CatalystDriver: CREATE TABLE IF NOT EXISTS demodb.event LIKE demodb.event_stage STORED AS PARQUET
14/04/25 14:54:21 INFO hadoop.CatalystDriver: INSERT INTO TABLE demodb.event SELECT * FROM demodb.event_stage
14/04/25 14:54:30 ERROR hadoop.CatalystDriver: Insert Into Table Exception
java.sql.SQLException
        at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:275)
        at org.apache.hive.jdbc.HiveStatement.executeUpdate(HiveStatement.java:369)
        at com.ipcglobal.catalyst.hadoop.CatalystDriver.insertIntoParquetTable(CatalystDriver.java:402)
        at com.ipcglobal.catalyst.hadoop.CatalystDriver.processFiles(CatalystDriver.java:290)
        at com.ipcglobal.catalyst.hadoop.CatalystDriver.postProcess(CatalystDriver.java:143)
        at com.ipcglobal.catalyst.hadoop.CatalystDriver.run(CatalystDriver.java:129)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at com.ipcglobal.catalyst.hadoop.CatalystDriver.main(CatalystDriver.java:72)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
14/04/25 14:54:30 ERROR hadoop.CatalystDriver: java.sql.SQLException
14/04/25 14:54:30 ERROR hadoop.CatalystDriver: Cause: null
14/04/25 14:54:30 ERROR hadoop.CatalystDriver: getErrorCode: 0
14/04/25 14:54:30 ERROR hadoop.CatalystDriver: getSQLState: null

6 REPLIES

Re: Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

New Contributor

I've got a SequenceFile that seems to fail consistently. Let me know if you want me to send it to Cloudera.

Thanks

 

(Shell build version: Impala Shell v1.2.4 (ac29ae0) built on Wed Mar 5 07:05:40 PST 2014)
[node1.ipc-global.com:21000] > create database demodb;
Query: create database demodb

Returned 0 row(s) in 0.07s
[node1.ipc-global.com:21000] > CREATE EXTERNAL TABLE demodb.event_stage ( udc_target_date string, udc_in_file_name string, udc_in_file_offset bigint, date_time timestamp, event_type string, event_seqn smallint, event_id string, event_value string, is_secondary_lookup smallint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE LOCATION '/user/ipcdev/catalyststage/event-m-00000';
Query: create EXTERNAL TABLE demodb.event_stage ( udc_target_date string, udc_in_file_name string, udc_in_file_offset bigint, date_time timestamp, event_type string, event_seqn smallint, event_id string, event_value string, is_secondary_lookup smallint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE LOCATION '/user/ipcdev/catalyststage/event-m-00000'

Returned 0 row(s) in 0.05s
[node1.ipc-global.com:21000] > REFRESH demodb.event_stage;
Query: refresh demodb.event_stage

Returned 0 row(s) in 0.11s
[node1.ipc-global.com:21000] > CREATE TABLE IF NOT EXISTS demodb.event LIKE demodb.event_stage STORED AS PARQUET;
Query: create TABLE IF NOT EXISTS demodb.event LIKE demodb.event_stage STORED AS PARQUET

Returned 0 row(s) in 0.07s
[node1.ipc-global.com:21000] > INSERT INTO TABLE demodb.event SELECT * FROM demodb.event_stage;
Query: insert INTO TABLE demodb.event SELECT * FROM demodb.event_stage
Query aborted.
[node1.ipc-global.com:21000] > INSERT INTO TABLE demodb.event SELECT * FROM demodb.event_stage;
Query: insert INTO TABLE demodb.event SELECT * FROM demodb.event_stage
Query aborted.
[node1.ipc-global.com:21000] >

Re: Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

Contributor

Yes, can you please share the file? 

Re: Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

New Contributor
Thanks for the reply. The gzipped file is 168 MB; what is the best way to get it to you?
Thanks
Pete

Re: Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

Contributor

Do you have Box or Dropbox?

Re: Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

New Contributor
Dropbox, but it's customer data. It's OK to send it to you for debugging, but I don't want to post the URL on the community. Can I send it to your email address?

Thanks

Pete



Re: Impala Daemon regularly crashes on “Insert Into Table Select * From” into Parquet table

Contributor

nong@cloudera.com

 

Can you also include the schema?
