- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
[HIVE] is there a way to perform a "local limit" in a Hive query ?
- Labels:
-
Apache Hive
Created ‎03-27-2018 05:47 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I was wondering if is there a way to perform a "local limit" in a Hive query.
I explain :
Considering a query that "distribute by" a partition "X".
This partition contains 30 values and I want to have exactly 100 rows per value...
Because, when we perform "limit", generally, this one will break the sink operation at the n-th row, generally only one partition is concerned in that way... And in the aim to build some samples, I think it will be very helpful that reducers (or mappers) can be locally "limited"...
I hope it is clear 🙂
Thanks for your replies.
SF
Created ‎03-30-2018 03:42 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Sebastien F
Maybe you could try this:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
First of all, I create groups of data via the distribute then I sort them by anotherColName, then I use row_number to assign a value to each row of each group as if it was a counter.
Then I select that counter and all the columns of the original table where the value of the local counter is less or equal to my limit.
You could add have you random data in this way:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
Created ‎03-28-2018 01:49 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Sebastien F, are you referring to sampling data https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling? I think I need a little more clarification in order to better help.
Created ‎03-30-2018 03:42 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Sebastien F
Maybe you could try this:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
First of all, I create groups of data via the distribute then I sort them by anotherColName, then I use row_number to assign a value to each row of each group as if it was a counter.
Then I select that counter and all the columns of the original table where the value of the local counter is less or equal to my limit.
You could add have you random data in this way:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
Created ‎04-04-2018 07:26 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Joy Ndjama,
Awesome ! Exactly what I was expecting.
Even if it is quite expensive, it is a elegant way to get a true sample.
Thanks @Scott Shaw as well, TABLESAMPLE is a very interesting functionnality too.
