What is the best way to assign a sequence number (surrogate key) in pyspark?

Contributor

What is the best way to assign a sequence number (surrogate key) in PySpark for a Hive table that is continually appended to from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?

1 ACCEPTED SOLUTION

Expert Contributor

You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to increment the keys.
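A minimal PySpark sketch of that approach, assuming the target is a Hive table with a numeric surrogate-key column; the table, column, and variable names below (db.target_table, surrogate_key, new_records) are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# hypothetical lookup of the current maximum surrogate key in the Hive table
row = spark.sql("SELECT MAX(surrogate_key) AS max_key FROM db.target_table").collect()[0]
max_key = row["max_key"] or 0  # treat an empty table as max 0

# new_records stands in for the transformed data to be appended
new_records = sc.parallelize(["rec1", "rec2", "rec3"])

# zipWithIndex assigns 0-based indices; shift them past the current maximum
keyed = new_records.zipWithIndex().map(lambda ri: (ri[1] + max_key + 1, ri[0]))

From there the keyed RDD can be turned into a DataFrame and appended to the table however the existing pipeline does its inserts.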

6 REPLIES

Contributor

I want to make sure all duplicate values in a certain column get the same primary key assigned to them; zipWithIndex doesn't guarantee that.

Expert Contributor

You could pull out the keys, boil them down to distinct values and then index them. Would something like this work?

# rddA is your main dataset, assumed to be a pair RDD of (key, value) records
rddAKeys = rddA.keys()                   # pull out the keys
rddAUniqKeys = rddAKeys.distinct()       # boil them down to distinct values
rddAKeyed = rddAUniqKeys.zipWithIndex()  # index the distinct keys: (key, sequence number) pairs
# join rddAKeyed back to rddA so every record picks up its key's sequence number
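A slightly fuller sketch of that join, assuming rddA is a pair RDD of (key, value) records and that the new sequence numbers should start above the table's current maximum key (max_key here is a hypothetical value looked up beforehand):

joined = rddAKeyed.join(rddA)  # (key, (seq, value)): every record picks up its key's sequence number
# shift past the existing maximum so new surrogate keys don't collide with old ones
withKeys = joined.map(lambda kv: (kv[1][0] + max_key + 1, kv[0], kv[1][1]))

Because the indices are assigned to the distinct key values before the join, all records sharing the same key value end up with the same surrogate key.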

New Contributor

If the RDD is partitioned, does zipWithIndex still produce unique keys?
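For what it's worth, zipWithIndex does assign globally unique, consecutive indices even when the RDD has multiple partitions (it triggers an extra job to count the elements per partition first), while zipWithUniqueId avoids that extra job but returns non-consecutive ids. A quick check on a deliberately partitioned RDD:

rdd = sc.parallelize(["a", "b", "c", "d", "e"], 3)  # spread across 3 partitions
print(rdd.zipWithIndex().collect())
# e.g. [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)] -- indices stay unique across partitions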