New Contributor

Impala size of files in a parquet partition


Hi everyone,

I'm new to Cloudera Impala, and I'm working on a benchmark to try to get the best out of it. I have read some performance tips and I'm testing the impact of some of them, such as the size of the Parquet files (I'm not talking about PARQUET_FILE_SIZE, but the actual size of the files created on HDFS), and I'm stuck on a problem that bugs me.

Let me explain the situation:
    The version of Impala I'm working on is 1.3.1, which I believe is the latest stable version. I installed my CDH 4.6 infrastructure with Cloudera Manager 4.8.2.
    The cluster has 6 machines, 5 of which run impalad.

    I have created 3 external tables that point to HDFS directories, to test the performance of Impala on external tables.
    But I've read that to get the best out of Impala, using Parquet tables is a good choice, and that it is best to partition tables when you have large amounts of data.
    That's the case for one of my external tables, which represents ~15 GB of files on HDFS.
    
    So what I did is something like:
        CREATE TABLE transactions(
          id BIGINT,
          customer_id BIGINT,
          book_id BIGINT,
          quantity INT,
          transaction_date TIMESTAMP
        ) PARTITIONED BY (year INT)
        STORED AS PARQUETFILE ;
    
        ALTER TABLE transactions ADD partition (year=2000) ;
        ALTER TABLE transactions partition(year=2000) set fileformat parquetfile ;
        INSERT INTO transactions PARTITION(year=2000) SELECT id, customer_id, book_id, quantity, transaction_date from test_external_part.transactions WHERE year=2000 ;

        -- The test_external_part.transactions table is the external table that is also partitioned by year and contains the 15 GB of data.
        
    But here is the problem: the amount of data stored inside the test_external_part.transactions partition for year=2000 is only about 1 GB, yet the execution of the INSERT statement creates (at least) 5 files inside the Parquet table partition, like so:
        hdfs dfs -ls /path/to/parquet/table/year=2000/
            Found 5 items
            -rw-r--r--   3 user group     85.0 M 2014-06-18 10:22 /path/to/parquet/table/year=2000/754caf76397b19d3-1be6cb02414de49b_1887686105_data.0
            -rw-r--r--   3 user group    126.5 M 2014-06-18 10:22 /path/to/parquet/table/year=2000/754caf76397b19d3-1be6cb02414de49c_300163220_data.0
            -rw-r--r--   3 user group     82.7 M 2014-06-18 10:22 /path/to/parquet/table/year=2000/754caf76397b19d3-1be6cb02414de49d_1843826208_data.0
            -rw-r--r--   3 user group     63.5 M 2014-06-18 10:22 /path/to/parquet/table/year=2000/754caf76397b19d3-1be6cb02414de49e_440384039_data.0
            -rw-r--r--   3 user group     63.5 M 2014-06-18 10:22 /path/to/parquet/table/year=2000/754caf76397b19d3-1be6cb02414de49f_1311481425_data.0
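
    (Side note: the size of the source partition can be double-checked on HDFS with something like the following; the path is hypothetical, like the ones above:)
        hdfs dfs -du -s -h /path/to/external/table/year=2000/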

    I've already noticed that when inserting data with an INSERT ... SELECT statement, Impala uses each node to read a portion of the data from the source table and to write a portion of the data to the target table.
    
    So I thought: "OK, if I use only 1 node to do the job, that single node will both read the data and write it, which should create only 1 file on HDFS."
    Using impala-shell I did the following :
        ALTER TABLE transactions DROP PARTITION (year=2000);
        ALTER TABLE transactions ADD partition (year=2000) ;
        ALTER TABLE transactions partition(year=2000) set fileformat parquetfile ;

        SET NUM_NODES=1; -- to use only 1 node for the operation
        INSERT INTO transactions PARTITION(year=2000) SELECT id, customer_id, book_id, quantity, transaction_date from test_external_part.transactions WHERE year=2000 ;
        
    But the result is not what I had imagined:
        Found 2 items
        -rw-r--r--   3 user group    405.5 M 2014-06-18 10:37 /path/to/parquet/table/year=2000/344e97ee6c720360-cbdd0ff0b6c3149c_918566800_data.0
        -rw-r--r--   3 user group     14.6 M 2014-06-18 10:37 /path/to/parquet/table/year=2000/344e97ee6c720360-cbdd0ff0b6c3149c_918566800_data.1    
    As you can tell, 2 files were created. I don't understand why, since I used only 1 node to read and write the data. My PARQUET_FILE_SIZE is set to 1 GB, so I imagined that Impala would keep writing to the same HDFS file until it reached a size of 1 GB.
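
    For reference, the option would be set in impala-shell before the INSERT with something like the following (a sketch; on 1.x I believe the value is given in bytes, while newer releases may also accept size suffixes):
        SET PARQUET_FILE_SIZE=1073741824;  -- target ~1 GB per Parquet file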
    
    You can find the 2 query profiles here: 1 node insert and 5 nodes insert

    I have read them, but I don't get any clues from them. Maybe someone with better profile-reading skills can point out something I haven't seen.
    
    To conclude, I have two questions:
        - Is there a way to avoid this behavior, so that when doing an INSERT ... SELECT, Impala writes to each file until it reaches PARQUET_FILE_SIZE?
        - Does anyone understand what's going on?
        
Thanks for reading, and sorry for the length of the post and for my English.
       

Cloudera Employee

Re: Impala size of files in a parquet partition

Take a look at these docs, in particular the sections about the SHUFFLE/NOSHUFFLE hints. Using the hints is a more effective way to control the number of resulting files than changing NUM_NODES.

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/Installing-and-Usi...
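
As a sketch (using the hint syntax described in those docs, and the dynamic-partition form of the INSERT so the hint has a partition key to shuffle on), the statement would look something like:

    -- [SHUFFLE] repartitions the rows by the partition key before the write,
    -- so a single node writes each partition and you get fewer, larger files.
    INSERT INTO transactions PARTITION(year) [SHUFFLE]
    SELECT id, customer_id, book_id, quantity, transaction_date, year
    FROM test_external_part.transactions
    WHERE year=2000;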

 

The reason the files are smaller than you expect is due to https://issues.cloudera.org/browse/IMPALA-1072. As noted in the jira, this is unlikely to have a big impact on performance.
