01-01-2018 12:39 PM - last edited on 01-02-2018 05:47 AM by cjervis
We are testing running queries against Parquet data stored in S3. The data is partitioned by day, and each Parquet file is about 60 MB. We run the query with both AWS Athena and CDH 5.13.1 Impala on AWS, on 2 x1e.2xl nodes (8 cores, 250 GB RAM, 220 GB NVMe disk for Cloudera only). We ran compute stats beforehand.
One day of data is about 4 TB.
Athena outperformed Impala by far on a query scanning one day: 108 sec on Athena versus 1100 sec on Impala.
During the query we saw that Impala spends most of its time scanning data on S3. However, S3 read throughput on each node never exceeded 27Mbps. Nothing we tried, including raising the number of scan threads and tuning Impala's S3 parameters, changed that. We also tried the naive approach of using 20 smaller instances. Performance improved almost linearly: with 5 times the CPU and memory, the query ran almost 5 times faster. Even then, network speed never went above 27Mbps.
One more interesting thing: during compute stats, S3 throughput was 4 times higher than during the query.
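As a rough sanity check on these numbers, here is a small Python sketch of how much data the 2-node run could have pulled from S3 at most. It assumes the 27Mbps ceiling was sustained on both nodes for the full 1100 sec, which is an upper bound rather than a measurement, and it computes both readings of the unit (megabits and megabytes per second) since the monitoring tool's unit is not certain. The function name is just illustrative.

```python
def implied_bytes_read(rate_per_node, nodes, seconds, bits=True):
    """Upper bound on bytes transferred at a steady per-node rate.

    rate_per_node is in mega-units per second; bits=True treats it as
    megabits/s, bits=False as megabytes/s. Both readings are assumptions.
    """
    bytes_per_sec = rate_per_node * 1_000_000 / (8 if bits else 1)
    return bytes_per_sec * nodes * seconds

# Figures from the post: 27 per node, 2 nodes, 1100 sec query time.
as_bits = implied_bytes_read(27, 2, 1100, bits=True)
as_bytes = implied_bytes_read(27, 2, 1100, bits=False)

print(f"if megabits/s:  {as_bits / 1e9:.1f} GB read at most")
print(f"if megabytes/s: {as_bytes / 1e9:.1f} GB read at most")
```

Either way the result is far below 4 TB, so the query is presumably reading only a fraction of the day's data (selected columns or row groups), and the per-node rate rather than the data volume looks like the limiting factor.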
Any insights will be appreciated.
Thanks in advance!
01-02-2018 01:24 PM - edited 01-02-2018 01:26 PM
Some non-default configuration options are needed to run efficiently on S3; they can be found here.
What EC2 instance type was used for Impala and how big was the cluster?
Also, can you please attach the Impala query profile?