[HIVE] Is there a way to perform a "local limit" in a Hive query?


Hi,

I was wondering if there is a way to perform a "local limit" in a Hive query.

Let me explain:

Consider a query that does a "distribute by" on a partition column "X".

This partition column has 30 values, and I want exactly 100 rows per value...

The problem is that when we use "limit", it generally breaks the sink operation at the n-th row, so usually only one partition is concerned... And for the purpose of building samples, I think it would be very helpful if the reducers (or mappers) could be "limited" locally.
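
For example (just a sketch, with my_table as a hypothetical table name), a plain limit behaves globally:

-- Returns 100 rows in total, typically all coming from a single value of X,
-- rather than 100 rows per value of X (my_table is a hypothetical name).
SELECT * FROM my_table LIMIT 100;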

I hope it is clear 🙂

Thanks for your replies.

SF

1 ACCEPTED SOLUTION (see the reply from @Joy Ndjama below)

3 REPLIES


Hi @Sebastien F, are you referring to sampling data (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)? I think I need a little more clarification in order to better help.
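
For reference, the bucket sampling described on that page looks roughly like this (a sketch only, with myTableName as a placeholder table name):

-- Hash each row into 10 buckets using rand() and keep only bucket 1,
-- i.e. roughly a 10% random sample of the table (myTableName is a placeholder).
SELECT *
FROM myTableName TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) t;

Note that this gives an approximate fraction of the whole table rather than a fixed number of rows per partition value.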

New Contributor

Hi @Sebastien F

Maybe you could try this:

-- Number the rows within each group of colName, ordered by anotherColName
-- (in a Hive window spec, DISTRIBUTE BY / SORT BY act as the partitioning/ordering clause).
with tablePart as (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.*
  FROM myTableName t)
-- Keep only the first rows of each group; replace "limit" with the desired per-group count, e.g. 100.
SELECT * FROM tablePart WHERE tablePart.counter <= limit;

First of all, I create groups of data via the DISTRIBUTE BY, then I sort them by anotherColName, and I use ROW_NUMBER to assign a value to each row of each group, as if it were a counter.

Then I select that counter and all the columns of the original table where the value of the local counter is less than or equal to my limit.

You could also get random data in this way:

with tablePart as (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.*
  FROM myTableName t)
-- rand() gives a random order within each group, so the kept rows form a random per-group sample.
SELECT * FROM tablePart WHERE tablePart.counter <= limit;
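
For the original question (100 rows per value of the distribute-by column, assumed here to be called x, with myTableName standing in for the real table), it might look like this:

with tablePart as (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY x SORT BY rand()) AS counter, t.*
  FROM myTableName t)
-- at most 100 randomly chosen rows per distinct value of x
SELECT * FROM tablePart WHERE tablePart.counter <= 100;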


Hi @Joy Ndjama,

Awesome! Exactly what I was expecting.

Even if it is quite expensive, it is an elegant way to get a true sample.

Thanks @Scott Shaw as well, TABLESAMPLE is a very interesting functionality too.