
Parquet memory requirements

I was wondering if anyone has more comprehensive documentation of the memory requirements for Parquet data files. So far my understanding is summarized in the bullet points below; I'd welcome anything that corrects or broadens it.


  • Any INSERT statement for a Parquet table requires enough free space in the HDFS filesystem to write one block. Because Parquet data files use a block size of 1 GB by default, an INSERT might fail (even for a very small amount of data) if your HDFS is running low on space. Could anyone elaborate on this?
  • Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (with a maximum size of 1 GB) is stored in memory until encoded, compressed, and written to disk. (Is there a generalized formula for this? For example, if I have a 1 GB CSV file and want to convert it to Parquet, how much memory will I use?)
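The buffering behavior in the second bullet can be sketched in a few lines. This is an illustrative toy writer, not Impala's or Parquet's actual implementation: the point is only that rows accumulate in memory and nothing reaches disk until the buffer hits the target file size, which is why memory use scales with that target.

```python
# Toy sketch (assumption: not the real Parquet writer) of a columnar
# writer that buffers rows in memory and flushes only when the buffered
# data reaches the target size.

TARGET_SIZE = 8 * 1024  # bytes; real Parquet data files default to ~1 GB

class BufferedColumnarWriter:
    def __init__(self, target_size=TARGET_SIZE):
        self.target_size = target_size
        self.buffer = []           # rows held in memory until flush
        self.buffered_bytes = 0
        self.flushes = 0           # number of "files" written so far

    def write_row(self, row):
        encoded = ",".join(map(str, row)).encode()
        self.buffer.append(encoded)
        self.buffered_bytes += len(encoded)
        if self.buffered_bytes >= self.target_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # A real writer would encode columns, compress, and write here.
        self.flushes += 1
        self.buffer.clear()
        self.buffered_bytes = 0

writer = BufferedColumnarWriter()
for i in range(10_000):
    writer.write_row((i, "value-%d" % i))
writer.flush()
print(writer.flushes)
```

Peak memory per open file tracks `target_size`, so writers that lower the target file size (or write fewer files concurrently) need less memory.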


Please let me know if I am being too vague.


Re: Parquet memory requirements


Hi Charles!


1. The reason this might happen is that when you open a block for writing, HDFS makes sure there is sufficient space for the entire block. This is to avoid a situation where there was enough space when you began writing, but another file being written took up your allotted space before you could finish. So long as you have more than 1 GB of free space, you shouldn't have an issue.
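The reservation rule above can be stated as a one-line check. This is a hypothetical sketch of the behavior described, not HDFS source code: before opening a block, the full block size must be free, even if the data to be written is tiny.

```python
# Hypothetical sketch of the rule described above: opening a block for
# writing requires free space for the WHOLE block, regardless of how
# little data will actually be written.

BLOCK_SIZE = 1 * 1024**3  # 1 GB, the default Parquet block size

def can_open_block(free_space_bytes, block_size=BLOCK_SIZE):
    """Return True if a new block could be opened for writing."""
    return free_space_bytes >= block_size

# A tiny INSERT still fails when less than one full block is free:
print(can_open_block(500 * 1024**2))   # 500 MB free -> False
print(can_open_block(2 * 1024**3))     # 2 GB free   -> True
```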


2. I don't have numbers on how much memory a 1 GB CSV would require when writing to Parquet, but generally Parquet is much more efficient, so I would expect it to be in the range of 400-600 MB buffered, depending on data types.



Re: Parquet memory requirements

Great answer. You mentioned up to 600 MB for buffering; is there a place where I can view this buffered space per file?