[HIVE] is there a way to perform a "local limit" in a Hive query ?

Hi,

I was wondering whether there is a way to perform a "local limit" in a Hive query.

Let me explain:

Consider a query that uses "distribute by" on a partition "X".

This partition contains 30 values, and I want exactly 100 rows per value...

A plain "limit" stops the sink operation at the n-th row overall, so usually only one partition contributes rows to the result. For building samples, I think it would be very helpful if the reducers (or mappers) could be "limited" locally...

I hope it is clear :)

Thanks for your replies.

SF

1 ACCEPTED SOLUTION

Re: [HIVE] is there a way to perform a "local limit" in a Hive query ?

Hi @Sebastien F

Maybe you could try this:

WITH tablePart AS (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY anotherColName) AS counter, t.*
  FROM myTableName t
)
-- 100 is the per-group limit from the question; LIMIT is a keyword, so use a literal here
SELECT * FROM tablePart WHERE tablePart.counter <= 100;

First, DISTRIBUTE BY groups the rows by colName and SORT BY orders each group by anotherColName. ROW_NUMBER() then numbers the rows within each group, so it acts as a per-group counter.

The outer query then selects that counter and all the columns of the original table for the rows whose local counter is less than or equal to the limit.
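
The same steps can also be written with the PARTITION BY / ORDER BY keywords, which Hive accepts in a window specification alongside DISTRIBUTE BY / SORT BY. This is only a sketch: colName, anotherColName and myTableName are the placeholder names from the query above, ranked is a hypothetical alias, and 100 is the per-value row count taken from the question.

```sql
-- Sketch: keep at most 100 rows for each distinct value of colName.
-- colName, anotherColName and myTableName are placeholders, not real objects.
SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY colName ORDER BY anotherColName) AS counter
  FROM myTableName t
) ranked
WHERE ranked.counter <= 100;
```

A group that holds fewer than 100 rows simply returns all of its rows, so "exactly 100 per value" only holds when every value has at least 100 rows.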

To get a random sample for each group, you could sort by rand() instead:

WITH tablePart AS (
  SELECT ROW_NUMBER() OVER (DISTRIBUTE BY colName SORT BY rand()) AS counter, t.*
  FROM myTableName t
)
-- 100 is the per-group limit from the question; LIMIT is a keyword, so use a literal here
SELECT * FROM tablePart WHERE tablePart.counter <= 100;

Re: [HIVE] is there a way to perform a "local limit" in a Hive query ?

Hi @Sebastien F, are you referring to sampling data (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)? I think I need a little more clarification in order to help.

Re: [HIVE] is there a way to perform a "local limit" in a Hive query ?

Hi @Joy Ndjama,

Awesome! Exactly what I was expecting.

Even if it is quite expensive, it is an elegant way to get a true sample.

Thanks @Scott Shaw as well, TABLESAMPLE is a very interesting functionality too.
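
For comparison, here is a rough sketch of the TABLESAMPLE forms described on the Hive sampling page linked earlier in the thread. myTableName is the placeholder table name used in the accepted answer, and the bucket and row counts are arbitrary values chosen for illustration.

```sql
-- Bucket sampling: roughly 1/32 of the rows, picked at random.
SELECT * FROM myTableName TABLESAMPLE(BUCKET 1 OUT OF 32 ON rand()) s;

-- Row-count sampling: the count is applied to each input split,
-- so the total depends on the number of splits, not on the values of colName.
SELECT * FROM myTableName TABLESAMPLE(100 ROWS) s;
```

Neither form guarantees a fixed number of rows per value of a column, which is why the ROW_NUMBER() approach fits the original "100 rows per value" requirement better.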