We are testing the option of running queries against Parquet data stored in S3. The data is partitioned by day, and each Parquet file is about 60 MB. We are running the queries using both AWS Athena and CDH 5.13.1 Impala on AWS, on 2 x1e.2xl nodes (8 cores, 250GB RAM, 220GB NVMe disk for Cloudera only). We ran COMPUTE STATS beforehand.
One day of data is about 4TB.
Athena outperformed Impala by far on a single-day scan query: 108 sec on Athena versus 1100 sec on Impala.
During the query we saw that Impala spends most of its time scanning data on S3. However, S3 read throughput on each node never exceeded 27Mbps. Nothing we tried, including raising the number of scan threads and tuning Impala's S3 parameters, made any difference. We also tried the naive approach of using 20 smaller instances. Performance improved almost linearly: we had 5 times more CPU and memory, and the query ran almost 5 times faster. However, at no point was the network speed on a node higher than 27Mbps.
One more interesting thing: during COMPUTE STATS, S3 read throughput was 4 times higher than during the query.
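As a rough sanity check of the figures above, here is a small back-of-the-envelope sketch. It assumes that "27Mbps" means megabits per second per node, that both nodes held that rate for the entire 1100 sec, and that 1TB = 10^12 bytes; the function names are only illustrative.

```python
# Back-of-the-envelope check using only the figures quoted in this post.
NODES = 2
QUERY_SECONDS = 1100
PEAK_MBIT_PER_SEC_PER_NODE = 27  # assumption: "Mbps" = megabits per second
DAY_SIZE_TB = 4                  # assumption: 1 TB = 10**12 bytes


def bytes_read_upper_bound(nodes, seconds, mbit_per_sec):
    """Upper bound on bytes read from S3 if every node held the peak rate."""
    return nodes * seconds * mbit_per_sec * 1_000_000 / 8


def required_rate_mbit(total_bytes, nodes, seconds):
    """Per-node rate (Mbit/s) needed to read total_bytes in the given time."""
    return total_bytes * 8 / (nodes * seconds) / 1_000_000


read_bytes = bytes_read_upper_bound(NODES, QUERY_SECONDS, PEAK_MBIT_PER_SEC_PER_NODE)
full_day_bytes = DAY_SIZE_TB * 10**12
full_scan_rate = required_rate_mbit(full_day_bytes, NODES, QUERY_SECONDS)

print(f"Max data read from S3 during the query: {read_bytes / 1e9:.1f} GB")
print(f"Rate needed to scan the full day: {full_scan_rate:.0f} Mbit/s per node")
```

If the monitoring tool actually reports megabytes per second, the first figure is 8 times larger.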
Any insights will be appreciated.
Thanks in advance!