What is the best way to assign a sequence number (surrogate key) in pyspark?

doug_mengistu — Mon, 25 Jul 2016 21:40:13 GMT

What is the best way to assign a sequence number (surrogate key) in PySpark to a Hive table that is continuously inserted into from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

mgaido — Mon, 25 Jul 2016 21:57:26 GMT

You could use zipWithUniqueId: https://spark.apache.org/docs/1.6.1/api/python/pyspark.html#pyspark.RDD.zipWithUniqueId
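
For reference, the Spark documentation describes the id scheme behind zipWithUniqueId: with n partitions, items in partition k get the ids k, n+k, 2n+k, and so on, so the ids are unique but may have gaps. Below is a minimal plain-Python sketch of that scheme. It is not Spark code: a list of lists stands in for the RDD's partitions, and the function name and sample data are made up for illustration.

```python
def zip_with_unique_id(partitions):
    """Model of the id scheme: item i of partition k gets id i * n + k."""
    n = len(partitions)
    return [
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]


# Three uneven partitions standing in for an RDD.
parts = [["a", "b", "c"], ["d"], ["e", "f"]]
result = zip_with_unique_id(parts)

# Collect every assigned id to check for uniqueness and gaps.
ids = sorted(uid for part in result for _, uid in part)
print(ids)  # [0, 1, 2, 3, 5, 6]
```

The missing 4 shows the gap: uneven partitions leave holes, so these ids work as unique keys but not as a dense sequence.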

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

clukasik — Mon, 25 Jul 2016 21:57:59 GMT

You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to increment the keys.
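
The increment step might look like the sketch below. It is plain Python with lists standing in for the RDDs; `assign_keys`, `existing_keys` and `new_rows` are made-up names and sample data. With real RDDs the same idea would be roughly `newRdd.zipWithIndex().map(lambda p: (p[1] + start, p[0]))`, with `start` derived from `max()` over the existing keys.

```python
def assign_keys(new_rows, existing_keys):
    """Give each new row a key that continues after the current maximum."""
    # Start at 0 for an empty table, otherwise just past the largest key.
    start = max(existing_keys) + 1 if existing_keys else 0
    # enumerate plays the role of zipWithIndex here.
    return [(start + i, row) for i, row in enumerate(new_rows)]


existing_keys = [0, 1, 2, 7]
new_rows = ["x", "y", "z"]
keyed = assign_keys(new_rows, existing_keys)
print(keyed)  # [(8, 'x'), (9, 'y'), (10, 'z')]
```

Note that this only holds with one writer at a time: if several jobs insert concurrently, each would read the same max value and hand out colliding keys.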

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

doug_mengistu — Mon, 25 Jul 2016 22:18:13 GMT

I want to make sure all duplicate values in a certain column get the same primary key assigned to them. zipWithIndex doesn't guarantee that.

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

clukasik — Mon, 25 Jul 2016 23:38:57 GMT

You could pull out the keys, boil them down to distinct values and then index them. Would something like this work?

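# rddA is your main dataset, an RDD of (key, value) pairs
rddAKeys = rddA.keys()
rddAUniqKeys = rddAKeys.distinct()
# zipWithIndex yields (key, index) pairs
rddAKeyed = rddAUniqKeys.zipWithIndex()
# join on the key: each element becomes (key, (value, index))
rddAJoined = rddA.join(rddAKeyed)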
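
To check that this gives duplicate values the same key, here is a plain-Python walk-through of the same steps on made-up sample rows. A set, a dict and a list comprehension stand in for distinct(), zipWithIndex() and the join; none of this is Spark code.

```python
# (natural key, value) rows; "apple" and "pear" each appear twice.
rows = [("apple", 1), ("pear", 2), ("apple", 3), ("fig", 4), ("pear", 5)]

# keys() + distinct(): the unique natural keys.
# Sorted here only so the result is repeatable in this sketch.
uniq_keys = sorted({k for k, _ in rows})

# zipWithIndex(): map each natural key to a surrogate key.
surrogate = {k: i for i, k in enumerate(uniq_keys)}

# join: attach the surrogate key to every row with that natural key.
joined = [(k, (v, surrogate[k])) for k, v in rows]
print(joined)
# [('apple', (1, 0)), ('pear', (2, 2)), ('apple', (3, 0)), ('fig', (4, 1)), ('pear', (5, 2))]
```

Both "apple" rows get 0 and both "pear" rows get 2. In Spark, distinct() does not guarantee an order, so which key receives which index can differ between runs; the key-to-index mapping would have to be stored if it needs to stay stable across loads.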

Re: What is the best way to assign a sequence number (surrogate key) in pyspark?

Toughdev — Sat, 02 Feb 2019 07:24:02 GMT

If the RDD is partitioned, does zipWithIndex still produce unique keys?
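
According to the Spark documentation, zipWithIndex orders elements first by partition index and then by position within each partition, so the indexes stay unique and contiguous across partitions (it runs an extra Spark job to count partition sizes when the RDD has more than one partition). The following is a plain-Python model of that behaviour, not Spark code: a list of lists stands in for the partitions, and the function name and data are made up for illustration.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model: index = size of all earlier partitions + position in partition."""
    sizes = [len(p) for p in partitions]
    # Starting index of each partition is the running total of earlier sizes.
    starts = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, start + i) for i, item in enumerate(part)]
        for part, start in zip(partitions, starts)
    ]


parts = [["a", "b", "c"], ["d"], ["e", "f"]]
indexed = zip_with_index(parts)
print(indexed)
# [[('a', 0), ('b', 1), ('c', 2)], [('d', 3)], [('e', 4), ('f', 5)]]
```

Each partition continues where the previous one stopped, so no index repeats.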