I was wondering if anyone has more comprehensive documentation of the memory requirements for writing Parquet data files; my current understanding is summarized in the bullet points below, and I'd welcome anything that broadens it.
Please let me know if I am being too vague.
1. The reason this might happen is that when you open a block for writing, HDFS makes sure there is sufficient space for the entire block. This avoids a situation where there was enough space when you began writing, but another file being written took up your allotted space before you could finish. As long as you have more than 1GB of free space, you shouldn't have an issue (see the first sketch after this list for one way to check this).
2. I don't have hard numbers on how much memory writing a 1GB CSV to Parquet would require, but Parquet is generally much more efficient, so I would expect something in the range of 400-600MB buffered, depending on the data types (see the second sketch below for a way to keep that buffering bounded).
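
For point 1, here is a minimal sketch of how you could compare the configured block size against the free space before opening a writer. It assumes the `hdfs` CLI is on the PATH, that `dfs.blocksize` is configured as a plain byte count, and the path `/user/me/output` is only a placeholder:

```python
# Hedged sketch: compare HDFS free space with the configured block size,
# since HDFS reserves a full block's worth of space when a writer opens a block.
import subprocess

def hdfs_block_size_bytes() -> int:
    # dfs.blocksize is the per-block size HDFS reserves when a block is opened;
    # this assumes the value is stored as plain bytes (e.g. 134217728)
    out = subprocess.check_output(
        ["hdfs", "getconf", "-confKey", "dfs.blocksize"], text=True
    )
    return int(out.strip())

def hdfs_free_bytes(path: str) -> int:
    # "hdfs dfs -df <path>" prints a header line, then:
    # Filesystem  Size  Used  Available  Use%
    out = subprocess.check_output(["hdfs", "dfs", "-df", path], text=True)
    return int(out.splitlines()[1].split()[3])

if __name__ == "__main__":
    target = "/user/me/output"  # placeholder path
    if hdfs_free_bytes(target) > hdfs_block_size_bytes():
        print("Enough free space to reserve a full block for writing.")
    else:
        print("Opening a new block may fail even if the file itself is small.")
```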
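For point 2, the exact figure depends on the writer, but the buffering can be kept roughly proportional to the row-group size rather than to the whole file. Here is a rough sketch using pyarrow (file names are placeholders) that streams the CSV in record batches and flushes each batch to the Parquet writer instead of materializing the entire table in memory:

```python
# Hedged sketch: stream a CSV into a Parquet file batch by batch so the
# write-side memory is roughly bounded by one batch plus encoding buffers.
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    # open_csv reads the file incrementally instead of loading it all at once
    reader = csv.open_csv(csv_path)
    writer = None
    try:
        for batch in reader:
            if writer is None:
                # Create the writer lazily so it picks up the inferred schema
                writer = pq.ParquetWriter(parquet_path, batch.schema)
            # Each flushed batch becomes its own row group in the Parquet file
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()

if __name__ == "__main__":
    csv_to_parquet("input.csv", "output.parquet")  # placeholder file names
```

This is not a benchmark of the 400-600MB estimate, just an illustration that the buffered working set at write time is governed by batch and row-group sizing rather than by the size of the source CSV.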