
What is the best way to assign a sequence number (surrogate key) in pyspark?

Contributor

What is the best way to assign a sequence number (surrogate key) in pyspark to a Hive table that is continually inserted into from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?

1 ACCEPTED SOLUTION

Super Collaborator

You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to increment the keys.
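
A minimal sketch of that approach, assuming the rows already in the Hive table are loaded as an RDD named existing_rdd with the numeric key in the first field, and the freshly transformed rows as new_rdd; the SparkContext setup and sample data are illustrative only, not part of the original answer:

from pyspark import SparkContext

sc = SparkContext(appName="surrogate-key-sketch")              # illustrative setup
existing_rdd = sc.parallelize([(1, "a"), (2, "b")])            # rows already in the table: (key, value)
new_rdd = sc.parallelize(["c", "d", "e"])                      # freshly transformed rows to insert
max_key = existing_rdd.map(lambda kv: kv[0]).max() if not existing_rdd.isEmpty() else 0
keyed_new = new_rdd.zipWithIndex() \
                   .map(lambda p: (p[1] + max_key + 1, p[0]))  # (surrogate_key, row)
# keyed_new -> [(3, "c"), (4, "d"), (5, "e")]: zipWithIndex is 0-based, so shift past max_key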


6 REPLIES

Expert Contributor

Contributor

I want to make sure all duplicate values in a certain column get the same primary key assigned to them.. the zipWithIndex doesn't guarantee that

Super Collaborator

You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to increment the keys.

Contributor

I want to make sure all duplicate values in a certain column get the same primary key assigned to them.. the zipWithIndex doesn't guarantee that

Super Collaborator

You could pull out the keys, boil them down to distinct values and then index them. Would something like this work?

rddA = ...  # your main dataset, as (key, value) pairs
rddAKeys = rddA.keys()
rddAUniqKeys = rddAKeys.distinct()
rddAKeyed = rddAUniqKeys.zipWithIndex()   # (key, surrogate_id)
rddAWithIds = rddAKeyed.join(rddA)        # join rddAKeyed with rddA -> (key, (surrogate_id, value))
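
For instance, with a small hypothetical dataset (assuming a SparkContext named sc), the joined pairs can be flattened so each row carries its surrogate id, and duplicate keys end up sharing the same id:

rddA = sc.parallelize([("alice", 10), ("bob", 20), ("alice", 30)])   # illustrative sample data
rddAKeyed = rddA.keys().distinct().zipWithIndex()                    # e.g. [("alice", 0), ("bob", 1)]
withIds = rddAKeyed.join(rddA) \
                   .map(lambda kv: (kv[1][0], kv[0], kv[1][1]))      # (id, key, value)
# both "alice" rows get the same id, e.g. (0, "alice", 10) and (0, "alice", 30)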

New Contributor

If the RDD is partitioned, does zipWithIndex still produce unique keys?