We have an requirement to select 100 rows from a table from a perticular range of partitions. It's a wide table with 800 columns. One of the columns in the table is precise timestamp of the record ( a column of type TIMESTAMP). Also there is a partition predicate day ( yyyymmdd INT ). User often select range of days and try to find the top 100 ( order by exact timestamp ).
If I run query like
select * from table where yyyymmdd between now() and now()-3 months
order by time
This query runs extremely slow.
Reducing the number helps.
I also tried adding partition yyyymmdd to the query's order by ; thinking query planner might use it to only find results by partition and won't wait for results from other partitions if it gets 100 from first one. But I didn't see it working.
Any tricks/tips to make this query faster ? Wide table with 100s of millions of rows.
It is parquet table. But it also has lot of rows. If my query has to scan last 6 months of data ( 2.5 Billion rows) ; I'm using order by time clause ( where time is a column of type timestamp), it takes 2 minutes ( after reduced my columns to retrieve only 5 columns ). With all colums it takes like 5 + minutes.
I also tried reducing the columns to 200 - 300 but stil the performance of the query is still slow.
I was able to finally figure this ...
- Upgrading to CDH 5.12 helped. Earlier we used 5.7 which appeared to have some bug with wide table scan.
- Sorting is an issue.But it has it's own limitations ..
If sorting is removed and limit clause is removed. It takes long time to download the data for obvious reasons. Large data takes time to download. It starts streamign data instantly though.
Removing sorting but with limit obviously makes the result unpredictable and inconsistent.
- When sorting is applies and used with limit and offset clause. It appeared from profile that each node sorts the data that it has scanned and sends top ( 100 , i.e. limit 100 ) to coordinator. Coordinator than collects all the top results and sends top 100 to client.
- Event though I added partition key to sort statement. The query appearts to not use that i.e. it still tries to scan all records from all partitions before sending it to coordinator. in other words adding partition key to sorting or removing it didn't make any difference. I am not sure but Impala could have done something smarter here to make query run faster.
Glad that 5.12 is working better for you. I'm not sure exactly which issue might have caused that but we have made a lot of improvements since 5.7.
I think the optimisation you're describing is interesting. I guess the precondition is that the a prefix of the ordering key can be evaluated from the partition key. Your idea is to process the partitions on each daemon strictly in sorted order and then only sort each partition at a time? I think there are some obstacles currently to doing this but it's interesting to think about.
We don't do it currently because the partitions are processed in an arbitrary order and in parallel, so the table scan may interleave rows from different partitions.