Hi all, I developed a flow which
gets CSV files from HDFS, converts them to Parquet and puts them back in HDFS.
For this purpose I am using the flow depicted here. I measured how long
the PutParquet processor alone takes to run: for a 2.4 GB CSV file it took
10 minutes, which is really slow. I already increased the heap space from 4 GB to
16 GB, but it did not change anything.
The bandwidth is 1 Gbit/s.
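To put these numbers in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: it assumes decimal units (1 GB = 10^9 bytes, 1 Gbit/s = 10^9 bits/s) and that the full 10 minutes was spent on this single file. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope throughput check.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 Gbit/s = 10**9 bits/s).

def throughput_stats(size_bytes, duration_s, link_bits_per_s):
    """Return the observed throughput and how it compares to the link speed."""
    bytes_per_s = size_bytes / duration_s
    bits_per_s = bytes_per_s * 8
    return {
        "mb_per_s": bytes_per_s / 1e6,
        "mbit_per_s": bits_per_s / 1e6,
        # Fraction of the link the observed rate would occupy.
        "link_utilization": bits_per_s / link_bits_per_s,
        # Time a plain transfer of the file would take at full link speed.
        "transfer_time_at_line_rate_s": size_bytes * 8 / link_bits_per_s,
    }

stats = throughput_stats(
    size_bytes=2_400_000_000,        # 2.4 GB CSV file
    duration_s=10 * 60,              # 10 minutes in PutParquet
    link_bits_per_s=1_000_000_000,   # 1 Gbit/s
)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the processor handles about 4 MB/s (32 Mbit/s), roughly 3% of the link, while a plain transfer of the file at line rate would take about 19 seconds. So the network by itself does not seem to explain the 10 minutes.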
I also tried converting the CSV file to Avro and then to ORC, but that is
also slow (same order of magnitude as the Parquet conversion). Is there any way to
achieve much better performance?