- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Using SORT BY with externally loaded parquet files (ex IMPALA LOAD)
- Labels:
-
Apache Impala
Created on ‎02-11-2019 12:46 PM - edited ‎09-16-2022 07:08 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
I was looking at this BLOG https://blog.cloudera.com/blog/2017/12/faster-performance-for-selective-queries/
where we see that using "SORT BY" during table creation we can improve impala query performance .
As mentioned in the blog this works only if we use "INSERT" or "CREAT table with select " . Our use case is we create parquet file externally and UPLOAD it onto HDFS and then use IMPALA " LOAD DATA" command. Is there a way we can use "SORT BY" mechanism with this model of loading parquet files.
Thanks,
Raju.
Created ‎02-12-2019 08:21 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The external tool that you are using would have to support ordering the data by those columns. E.g. if you're using hive, it supports SORT BY. If you're writing it from some custom code, that code would need to sort the data before writing it to parquet.
Created ‎02-12-2019 09:38 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am using parquet-cpp to write parquet file and the upload it to HDFS using web-hdfs . At the end use "LOAD DATA" command to load iparquet file nto into impla.
Is there any option in parquet-cpp to sort it out.
Created ‎02-12-2019 10:36 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure that parquet-cpp has any builtin way to sort data - your client code might have to do the sorting before feeding it to parquet-cpp
Created ‎02-12-2019 03:52 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If I am using dictionary encoding for the column, do I still need to write data in sorted order in parquet file .
Created ‎02-13-2019 11:47 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created ‎02-13-2019 11:54 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks so much for help , I will try out sorting and validate query performance.
