[HIVE] Is there a way to perform a "local limit" in a Hive query?


Hi,

I was wondering if there is a way to perform a "local limit" in a Hive query.

Let me explain:

Consider a query that does a "distribute by" on a partition column "X".

This partition column has 30 values, and I want exactly 100 rows per value...

The problem is that when we use "limit", it generally breaks the sink operation at the n-th row, so usually only one partition is concerned... And for the purpose of building samples, I think it would be very helpful if the reducers (or mappers) could be "limited" locally.
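
For example (just a sketch, with my_table as a hypothetical table name), a plain limit behaves globally:

-- Returns 100 rows in total, typically all coming from a single value of X,
-- rather than 100 rows per value of X (my_table is a hypothetical name).
SELECT * FROM my_table LIMIT 100;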

I hope it is clear 🙂

Thanks for your replies.

SF

1 ACCEPTED SOLUTION (see the reply from @Joy Ndjama below)

3 REPLIES


Hi @Sebastien F, are you referring to sampling data (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)? I think I need a little more clarification in order to better help.
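
For reference, the bucket sampling described on that page looks roughly like this (a sketch only, with myTableName as a placeholder table name):

-- Hash each row into 10 buckets using rand() and keep only bucket 1,
-- i.e. roughly a 10% random sample of the table (myTableName is a placeholder).
SELECT *
FROM myTableName TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) t;

Note that this gives an approximate fraction of the whole table rather than a fixed number of rows per partition value.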

New Contributor

Hi @Sebastien F

Maybe you could try this:

-- Number the rows within each group of colName, ordered by anotherColName
-- (in a Hive window spec, DISTRIBUTE BY / SORT BY act as the partitioning/ordering clause).
with tablePart as (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.*
  FROM myTableName t)
-- Keep only the first rows of each group; replace "limit" with the desired per-group count, e.g. 100.
SELECT * FROM tablePart WHERE tablePart.counter <= limit;

First of all, I create groups of data via the DISTRIBUTE BY, then I sort them by anotherColName, and I use ROW_NUMBER to assign a value to each row of each group, as if it were a counter.

Then I select that counter and all the columns of the original table where the value of the local counter is less than or equal to my limit.

You could also get random data in this way:

with tablePart as (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.*
  FROM myTableName t)
-- rand() gives a random order within each group, so the kept rows form a random per-group sample.
SELECT * FROM tablePart WHERE tablePart.counter <= limit;
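
For the original question (100 rows per value of the distribute-by column, assumed here to be called x, with myTableName standing in for the real table), it might look like this:

with tablePart as (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY x SORT BY rand()) AS counter, t.*
  FROM myTableName t)
-- at most 100 randomly chosen rows per distinct value of x
SELECT * FROM tablePart WHERE tablePart.counter <= 100;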


Hi @Joy Ndjama,

Awesome! Exactly what I was expecting.

Even if it is quite expensive, it is an elegant way to get a true sample.

Thanks @Scott Shaw as well, TABLESAMPLE is a very interesting functionality too.