Support Questions

Find answers, ask questions, and share your expertise

Sequence number generation in Hive

avatar
Rising Star

I want to generate sequence number for each new record in hive.

1. First load - am loading 100 records with sequence number starting from 1 to 100.

2. Second load - i need to load new 10 records which are not present in first load . So my sequence number has to start from 101 to 110 for these new to records.

I think there is no sequence function here and Next value cannot be used to hold the max value.

Is this possible with any other function or UDFs has to be written ?

9 REPLIES 9

avatar
Master Guru

It is almost impossible to get a steadily increasing not interrupted sequence number in a parallel data warehouse like Hive.

There is the following UDF but you would have to restrict the number of mappers to 1 to make this work.

https://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFR...

There are some tricks like Accumulators in Spark for example that can do similar things ( give unique numbers at least if not steadily increasing ) but not in Hive as far as I know.

avatar
Rising Star

Generating a sequential order in a distributed processing query is not possible with simple UDFs. This is because the approach will require some centralised entity to keep track of the counter, which will also result in severe inefficiency for distributed queries and is not recommended to apply.

avatar

I have written an implementation for a distributed sequence in here:

https://github.com/manojkumarvohra/hive-hilo

avatar

Dear All,

I need to generate the surrogate keys(which are sequential) in HIVE.I do not want to go and use the method (java.util.UUID) since it generates 33 bytes of lenghth and i do not want that much length(Though it was unique).So predominantly i can do that in 2 ways if i am not wrong:

1st Method:

To restrict the number of mappers to 1 and use the code for random unique generation of UDF in HIVE.The code repository as mentioned in the above post.

https://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFR...

2nd Method:

  • 1)Load data to a temp hive table
  • 2)Lets get the max value in the table
  • 3)Lets use row_number over() to generate row_number+1 and load to actual table.

**** Please Note : I think we cannot restrict the number of mappers to 1 and hence both the methods fail for the creation of unique keys. ****

Please let me know if there is a way to do this?

Thanks,

Lalit

avatar

Hi Manoj-I saw your post and are you sure that your code will work?

avatar

Yes, it does works. One of our clients made use of it in production.

avatar

The number of mappers cannot be restricted to 1 since the number of mappers depend on the data which is number of input splits.

avatar

Are you talking about the sequence number generation in HIVE? And you are telling it can be done with large datasets?

By the way can you please send me your email or contact number?

Thanks

avatar

Yep, its for hive;)

Give it a try as per instructions mentioned on my github page.

If you face any problems shoot a mail on manojkumarvohra9@gmail.com