
What is the best way to assign a sequence number (surrogate key) in pyspark?

Contributor

What is the best way to assign a sequence number (surrogate key) in PySpark to a Hive table that is continually inserted into from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?




Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

Contributor

I want to make sure all duplicate values in a certain column get the same primary key assigned to them; zipWithIndex doesn't guarantee that.

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

Expert Contributor (Accepted Solution)

You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could grab the maximum existing key from the current dataset in a separate RDD and then use the map method on the zipped RDD to shift the new indexes past that maximum, so they continue the sequence.
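
A minimal sketch of that idea (the names existing_keys and new_rows are hypothetical, and it assumes the surrogate keys are plain integers):

# existing_keys: RDD of surrogate keys already assigned in Hive (hypothetical name)
# new_rows: RDD of freshly transformed rows to append (hypothetical name)
offset = 0 if existing_keys.isEmpty() else existing_keys.max() + 1

# zipWithIndex() numbers rows 0..n-1; shifting by the offset keeps the new
# keys from colliding with the ones already in the table
keyed_rows = new_rows.zipWithIndex().map(lambda pair: (pair[1] + offset, pair[0]))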

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

Contributor

I want to make sure all duplicate values in a certain column get the same primary key assigned to them; zipWithIndex doesn't guarantee that.

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

Expert Contributor

You could pull out the keys, boil them down to distinct values and then index them. Would something like this work?

# rddA is your main dataset, as (key, value) pairs
rddAKeys = rddA.keys()                    # just the key column
rddAUniqKeys = rddAKeys.distinct()        # one entry per distinct key
rddAKeyed = rddAUniqKeys.zipWithIndex()   # (key, surrogate id) pairs
rddAJoined = rddA.join(rddAKeyed)         # (key, (value, surrogate id))
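
For example, on a toy dataset (assuming a live SparkContext named sc; which distinct key gets which index depends on partition ordering, so treat the exact numbers as illustrative):

rddA = sc.parallelize([("alice", 10), ("bob", 20), ("alice", 30)])
rddAKeyed = rddA.keys().distinct().zipWithIndex()
print(rddA.join(rddAKeyed).collect())
# e.g. [('alice', (10, 0)), ('alice', (30, 0)), ('bob', (20, 1))]
# both 'alice' rows share surrogate id 0, which is the guarantee asked for above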

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

New Contributor

If the RDD is partitioned, does zipWithIndex still produce unique keys?
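
One way to check this empirically (a standalone sketch, assuming a SparkContext named sc): zipWithIndex numbers elements partition by partition, so the resulting indexes should stay unique across the whole RDD, at the cost of an extra Spark job to count the partitions first.

rdd = sc.parallelize(range(1000), 8)        # spread across 8 partitions
ids = rdd.zipWithIndex().values()           # just the generated indexes
# if any index were repeated across partitions, the distinct count
# would come out smaller than the total count
print(ids.count(), ids.distinct().count())  # expect: 1000 1000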
