
INSERT ... SELECT fails on memory limit exceeded


New Contributor

I'm exploring Impala using a CDH VM.

I'm trying to convert my TSV data to Parquet format, but the query constantly fails with a memory limit error.

There are no other queries running in parallel.

 

 

explain insert overwrite  frecordsParquet select * from frecords;
Query: explain insert overwrite  frecordsParquet select * from frecords
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.05GB VCores=1                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.frecords                                                                   |
|                                                                                    |
| WRITE TO HDFS [default.frecordsparquet, OVERWRITE=true]                            |
| |  partitions=1                                                                    |
| |                                                                                  |
| 00:SCAN HDFS [default.frecords]                                                    |
|    partitions=1/1 files=56123 size=15.43GB                                         |
+------------------------------------------------------------------------------------+

 

I tried setting PARQUET_FILE_SIZE to 16m, hoping that Impala would use less memory, but it didn't help.
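
For reference, this is roughly what I ran in impala-shell (the 16m value was just my attempt at a smaller target file size):

-- cap the size of each Parquet data file the insert writes
set PARQUET_FILE_SIZE=16m;
-- then the same insert as above
insert overwrite frecordsParquet select * from frecords;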

 

Memory Limit Exceeded
Process: memory limit exceeded. Limit=2.99 GB Consumption=3.74 GB
  RequestPool=default-pool: Consumption=86.91 MB
    Query(c64edf272a4e263c:5f481efc69624fba) Limit: Consumption=86.91 MB
      Fragment c64edf272a4e263c:5f481efc69624fbb: Consumption=86.91 MB
        HDFS_SCAN_NODE (id=0): Consumption=37.90 MB
        HdfsTableSink: Consumption=47.93 MB
      Block Manager: Limit=2.39 GB Consumption=0
WARNING: The following tables are missing relevant table and/or column statistics.
default.frecords

All I want is for Impala to read some data, write it to Parquet, and then continue reading more data.

What's the best way to achieve this?

thanks

2 REPLIES

Re: INSERT ... SELECT fails on memory limit exceeded

Master Collaborator

You may be running into https://issues.cloudera.org/browse/IMPALA-2940, which affects queries that scan a lot of Parquet data. We've fixed the issue, and the fix will appear in Impala 2.3.4 and 2.5.0.

 

A workaround is to break up the large query into multiple smaller queries that each process a subset of the insert.
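
For example, something along these lines, assuming frecords has a numeric column you can split on (the column name and boundaries below are just placeholders):

-- first slice: OVERWRITE clears anything left over from earlier failed attempts
insert overwrite frecordsParquet select * from frecords where id < 1000000;
-- remaining slices: INSERT INTO appends to what has already been written
insert into frecordsParquet select * from frecords where id >= 1000000 and id < 2000000;
insert into frecordsParquet select * from frecords where id >= 2000000;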

 

I would also strongly recommend giving more memory to Impala; you'll run into memory limits fairly frequently with only 3 GB allocated to it.


Re: INSERT ... SELECT fails on memory limit exceeded

New Contributor

 

I upgraded Impala, but I still run into the same problem.

 

Any more suggestions?

 

I also tried running COMPUTE STATS, but after a while it fails as well.
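
For completeness, what I'm running is just the plain statement against the source table from the EXPLAIN warning:

-- fails after a while, same as the insert
compute stats default.frecords;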