[HIVE] Is there a way to perform a "local limit" in a Hive query?

Hi,

I was wondering whether there is a way to perform a "local limit" in a Hive query.

Let me explain:

Consider a query that uses "distribute by" on a partition column "X".

This column has 30 distinct values, and I want exactly 100 rows per value.

A plain "limit" stops the sink operation at the n-th row overall, so in general only one partition ends up in the result. For building samples, I think it would be very helpful if the reducers (or mappers) could each be "limited" locally.

I hope it is clear 🙂

Thanks for your replies.

SF

1 ACCEPTED SOLUTION

New Contributor

[Solution text is not available here; it is hidden behind the community login.]
3 REPLIES

Hi @Sebastien F, are you referring to sampling data (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)? I think I need a little more clarification in order to help further.
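
For readers who have not used the sampling syntax mentioned in this reply, here is a minimal sketch of two TABLESAMPLE forms described on that wiki page. The table name my_table is hypothetical. Note that neither form guarantees a fixed number of rows per value of "X"; they sample the table as a whole.

```sql
-- Hypothetical table name: my_table.

-- Bucket sampling: keep roughly 1/32 of the rows, assigned to buckets at random.
SELECT * FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 32 ON rand()) s;

-- Block sampling by row count: the count is applied to each input split,
-- so the total number of rows returned depends on how many splits are read.
SELECT * FROM my_table TABLESAMPLE(10 ROWS) s;
```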

Hi @Joy Ndjama,

Awesome! Exactly what I was looking for.

Even though it is quite expensive, it is an elegant way to get a true sample.

Thanks @Scott Shaw as well; TABLESAMPLE is a very interesting feature too.
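
For reference, the "exactly 100 rows per value" requirement from the question is commonly expressed in Hive with a windowed ROW_NUMBER(). The sketch below is an illustration only and is not necessarily what the accepted solution contains; the table name my_table and the column name x are hypothetical.

```sql
-- Hypothetical names: my_table is the source table,
-- x is the column with about 30 distinct values.
SELECT *
FROM (
  SELECT
    t.*,
    -- Number the rows within each value of x; ordering by rand()
    -- makes the numbering, and so the kept rows, random.
    ROW_NUMBER() OVER (PARTITION BY x ORDER BY rand()) AS rn
  FROM my_table t
) ranked
WHERE rn <= 100;  -- keep at most 100 rows per value of x
```

A value of x with fewer than 100 rows simply returns all of its rows. Every row has to be shuffled and sorted within its partition before the filter applies, which is likely where the cost of this kind of approach comes from.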