Parquet Consolidation to Temp table using python threads

Hey Cloudera Community,

 

We have a production database that tends to write many small Parquet files to most tables. We wrote a consolidation script that creates a temp table for each production table, copies the data into the temp table, runs an INSERT OVERWRITE back into the main table, and then drops the temp table.
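For readers following along, the four steps could be generated roughly like this. This is only a sketch: the function name, the temp-table suffix, and the example table name are hypothetical, and it assumes the temp table can be created with `CREATE TABLE ... LIKE` and that partitioned tables are written with a dynamic `PARTITION` clause.

```python
from typing import List, Sequence


def consolidation_statements(
    table: str,
    partition_cols: Sequence[str] = (),
    tmp_suffix: str = "_consolidate_tmp",  # hypothetical naming convention
) -> List[str]:
    """Build the create / copy / overwrite / drop statements for one table."""
    tmp = f"{table}{tmp_suffix}"
    # Partitioned tables need a PARTITION clause; with dynamic partitioning,
    # SELECT * supplies the partition columns last, in table order.
    part = f" PARTITION ({', '.join(partition_cols)})" if partition_cols else ""
    return [
        f"CREATE TABLE {tmp} LIKE {table}",
        f"INSERT OVERWRITE TABLE {tmp}{part} SELECT * FROM {table}",
        f"INSERT OVERWRITE TABLE {table}{part} SELECT * FROM {tmp}",
        f"DROP TABLE {tmp}",
    ]
```

Calling `consolidation_statements("sales.orders", ("customerid", "year"))` returns the statements in the order they would be executed, so the script can run them one at a time and stop on the first failure.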

 

Unfortunately, one of our production tables was missed by this script and now has far too many Parquet files. The query details for copying the whole table to the temp table at once indicate that Impala wants to allocate significantly more resources than are available. As a result, we decided to iteratively select from the main table and insert into the temp table (partitioned by customerid and year).
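A single iteration of that copy might be built as below. Treat it as a sketch under assumptions: the function name and table names are hypothetical, and it assumes `customerid` and `year` are integer columns and are the last two columns of both tables, as they would be if both are partitioned by them.

```python
def partition_insert_sql(source: str, target: str, customerid: int, year: int) -> str:
    """Build one INSERT that copies a single (customerid, year) slice."""
    # int() rejects non-numeric values, so nothing unexpected is spliced
    # into the SQL text. Assumes both columns are integer-typed.
    cid, yr = int(customerid), int(year)
    return (
        f"INSERT INTO TABLE {target} PARTITION (customerid, year) "
        f"SELECT * FROM {source} "
        f"WHERE customerid = {cid} AND year = {yr}"
    )
```

Each statement touches only one slice of the source table, so each query should ask for far less than the whole-table copy does.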

 

We have a variable number of threads and a queue full of tuples containing the tenantids and years. Ideally, each thread gets a tuple from the queue, runs the insert statement, and repeats while the queue is not empty. When we run our script, we get a generic failure in Impala and the session is closed. Does anybody know what I'm doing incorrectly?

 

1 REPLY

Re: Parquet Consolidation to Temp table using python threads

I discovered my issue: I was missing the join statement for the queued inserts. Very basic issue. Thank you for your time!
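For anyone who lands here with the same symptom, here is a minimal sketch of the worker pattern. It assumes "join" means Python's `Thread.join()`, which is one plausible reading of the fix. Without the join, the main thread can reach its cleanup code and close the session while inserts are still running. The names `drain`, `run_all`, and `run_insert` are hypothetical. `run_insert` stands in for whatever executes one insert, ideally on a cursor that is not shared between threads.

```python
import queue
import threading


def drain(work, run_insert, errors):
    """Worker loop: take (tenantid, year) tuples until the queue is empty."""
    while True:
        try:
            tenantid, year = work.get_nowait()
        except queue.Empty:
            return  # nothing left, so this worker exits
        try:
            run_insert(tenantid, year)  # hypothetical callable, one insert per call
        except Exception as exc:
            # Record the failure and keep going with the remaining tuples.
            errors.append((tenantid, year, exc))


def run_all(pairs, run_insert, num_threads=4):
    """Run every insert on a pool of threads and wait for all of them."""
    work = queue.Queue()
    for pair in pairs:
        work.put(pair)
    errors = []
    threads = [
        threading.Thread(target=drain, args=(work, run_insert, errors))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block here until every worker has finished
    return errors
```

Only after `run_all` returns is it safe to close the connection and move on to the INSERT OVERWRITE step. The returned list shows which (tenantid, year) pairs failed and need a retry.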