Created 03-27-2018 05:47 PM
Hi,
I was wondering if is there a way to perform a "local limit" in a Hive query.
I explain :
Considering a query that "distribute by" a partition "X".
This partition contains 30 values and I want to have exactly 100 rows per value...
Because, when we perform "limit", generally, this one will break the sink operation at the n-th row, generally only one partition is concerned in that way... And in the aim to build some samples, I think it will be very helpful that reducers (or mappers) can be locally "limited"...
I hope it is clear 🙂
Thanks for your replies.
SF
Created 03-30-2018 03:42 PM
Hi @Sebastien F
Maybe you could try this:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
First of all, I create groups of data via the distribute then I sort them by anotherColName, then I use row_number to assign a value to each row of each group as if it was a counter.
Then I select that counter and all the columns of the original table where the value of the local counter is less or equal to my limit.
You could add have you random data in this way:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
Created 03-28-2018 01:49 PM
Hi @Sebastien F, are you referring to sampling data https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling? I think I need a little more clarification in order to better help.
Created 03-30-2018 03:42 PM
Hi @Sebastien F
Maybe you could try this:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
First of all, I create groups of data via the distribute then I sort them by anotherColName, then I use row_number to assign a value to each row of each group as if it was a counter.
Then I select that counter and all the columns of the original table where the value of the local counter is less or equal to my limit.
You could add have you random data in this way:
with tablePart as (SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.* FROM myTableName t) SELECT * FROM tablePart WHERE tablePart.counter <= limit;
Created 04-04-2018 07:26 AM
Hi @Joy Ndjama,
Awesome ! Exactly what I was expecting.
Even if it is quite expensive, it is a elegant way to get a true sample.
Thanks @Scott Shaw as well, TABLESAMPLE is a very interesting functionnality too.