Parquet memory requirements


Explorer

I was wondering if anyone has more comprehensive documentation of the memory requirements for Parquet data files. My understanding so far is summarized in the bullet points below; please correct it or, better yet, broaden it.

 

  • Any INSERT statement for a Parquet table requires enough free space in the HDFS filesystem to write one block. Because Parquet data files use a block size of 1 GB by default, an INSERT might fail (even for a very small amount of data) if your HDFS is running low on space. Could anyone elaborate on this?
  • Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (with a maximum size of 1 GB) is stored in memory until encoded, compressed, and written to disk. (Is there a generalized formula for this? For example, if I have a 1 GB CSV file and want to convert it to a Parquet file, how much memory would that use?)

 

Please let me know if I am being too vague.

2 REPLIES

Re: Parquet memory requirements

Contributor

Hi Charles!

 

1. The reason this might happen is that when you open a block for writing, HDFS makes sure there is sufficient space for the entire block. This avoids a situation where there was enough space when you began writing, but another file being written took up your allotted space before you could finish. So long as you have more than 1 GB of free space, you shouldn't have an issue.

 

2. I don't have numbers on how much memory a 1 GB CSV would require when writing to Parquet, but Parquet is generally a much more compact format than CSV, so I would expect somewhere in the range of 400-600 MB to be buffered, depending on data types.
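
To make the two points above concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration, not a documented formula: the 0.4-0.6 size ratio is simply the 400-600 MB guess above expressed as a fraction of a 1 GB CSV, and the function names and the `ratio` parameter are made up for this example. Real numbers depend on data types, encoding, and compression.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Default Parquet block size quoted in the question (1 GB).
PARQUET_BLOCK_BYTES = 1 * GB


def has_room_for_block(hdfs_free_bytes, block_bytes=PARQUET_BLOCK_BYTES):
    """Point 1: a write needs room for a whole block, however small the data."""
    return hdfs_free_bytes > block_bytes


def estimate_buffered_bytes(csv_bytes, ratio, block_bytes=PARQUET_BLOCK_BYTES):
    """Point 2: guess the memory buffered for one data file.

    `ratio` is an assumed Parquet-to-CSV size ratio (hypothetical parameter).
    The result is capped at one block, the maximum size of a data file.
    """
    return min(int(csv_bytes * ratio), block_bytes)


if __name__ == "__main__":
    low = estimate_buffered_bytes(1 * GB, 0.4)
    high = estimate_buffered_bytes(1 * GB, 0.6)
    print(f"1 GB CSV: roughly {low // MB}-{high // MB} MB buffered")
    print("500 MB free is enough:", has_room_for_block(500 * MB))
```

The cap in `estimate_buffered_bytes` reflects the point from the question that each data file holds at most 1 GB, so a much larger CSV would be split across several files rather than buffered all at once.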

 

-Joey

Re: Parquet memory requirements

Explorer
Great answer. You mentioned up to 600 MB for buffering; is there a place where I can view this buffered space per file?