Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Sequence number generation in Hive

Highlighted

Sequence number generation in Hive

Explorer

I want to generate sequence number for each new record in hive.

1. First load - am loading 100 records with sequence number starting from 1 to 100.

2. Second load - i need to load new 10 records which are not present in first load . So my sequence number has to start from 101 to 110 for these new to records.

I think there is no sequence function here and Next value cannot be used to hold the max value.

Is this possible with any other function or UDFs has to be written ?

9 REPLIES 9

Re: Sequence number generation in Hive

It is almost impossible to get a steadily increasing not interrupted sequence number in a parallel data warehouse like Hive.

There is the following UDF but you would have to restrict the number of mappers to 1 to make this work.

https://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFR...

There are some tricks like Accumulators in Spark for example that can do similar things ( give unique numbers at least if not steadily increasing ) but not in Hive as far as I know.

Re: Sequence number generation in Hive

Explorer

Generating a sequential order in a distributed processing query is not possible with simple UDFs. This is because the approach will require some centralised entity to keep track of the counter, which will also result in severe inefficiency for distributed queries and is not recommended to apply.

Re: Sequence number generation in Hive

New Contributor

I have written an implementation for a distributed sequence in here:

https://github.com/manojkumarvohra/hive-hilo

Re: Sequence number generation in Hive

Dear All,

I need to generate the surrogate keys(which are sequential) in HIVE.I do not want to go and use the method (java.util.UUID) since it generates 33 bytes of lenghth and i do not want that much length(Though it was unique).So predominantly i can do that in 2 ways if i am not wrong:

1st Method:

To restrict the number of mappers to 1 and use the code for random unique generation of UDF in HIVE.The code repository as mentioned in the above post.

https://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFR...

2nd Method:

  • 1)Load data to a temp hive table
  • 2)Lets get the max value in the table
  • 3)Lets use row_number over() to generate row_number+1 and load to actual table.

**** Please Note : I think we cannot restrict the number of mappers to 1 and hence both the methods fail for the creation of unique keys. ****

Please let me know if there is a way to do this?

Thanks,

Lalit

Re: Sequence number generation in Hive

Hi Manoj-I saw your post and are you sure that your code will work?

Re: Sequence number generation in Hive

New Contributor

Yes, it does works. One of our clients made use of it in production.

Re: Sequence number generation in Hive

The number of mappers cannot be restricted to 1 since the number of mappers depend on the data which is number of input splits.

Re: Sequence number generation in Hive

Are you talking about the sequence number generation in HIVE? And you are telling it can be done with large datasets?

By the way can you please send me your email or contact number?

Thanks

Re: Sequence number generation in Hive

New Contributor

Yep, its for hive;)

Give it a try as per instructions mentioned on my github page.

If you face any problems shoot a mail on manojkumarvohra9@gmail.com

Don't have an account?
Coming from Hortonworks? Activate your account here