What is the best way to assign a sequence number (surrogate key) in pyspark?

Contributor

What is the best way to assign a sequence number (surrogate key) in PySpark for a Hive table that is continually appended to from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?

1 ACCEPTED SOLUTION

Expert Contributor

You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to increment the keys.
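A minimal PySpark sketch of that approach, assuming the target is a Hive table with a numeric surrogate-key column; the table, column, and variable names below (db.target_table, surrogate_key, new_records) are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# hypothetical lookup of the current maximum surrogate key in the Hive table
row = spark.sql("SELECT MAX(surrogate_key) AS max_key FROM db.target_table").collect()[0]
max_key = row["max_key"] or 0  # treat an empty table as max 0

# new_records stands in for the transformed data to be appended
new_records = sc.parallelize(["rec1", "rec2", "rec3"])

# zipWithIndex assigns 0-based indices; shift them past the current maximum
keyed = new_records.zipWithIndex().map(lambda ri: (ri[1] + max_key + 1, ri[0]))

From there the keyed RDD can be turned into a DataFrame and appended to the table however the existing pipeline does its inserts.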

6 REPLIES

Contributor

I want to make sure all duplicate values in a certain column get the same primary key assigned to them; zipWithIndex doesn't guarantee that.

Expert Contributor

You could pull out the keys, boil them down to distinct values and then index them. Would something like this work?

# rddA is your main dataset, assumed to be a pair RDD of (key, value) records
rddAKeys = rddA.keys()                   # pull out the keys
rddAUniqKeys = rddAKeys.distinct()       # boil them down to distinct values
rddAKeyed = rddAUniqKeys.zipWithIndex()  # index the distinct keys: (key, sequence number) pairs
# join rddAKeyed back to rddA so every record picks up its key's sequence number
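A slightly fuller sketch of that join, assuming rddA is a pair RDD of (key, value) records and that the new sequence numbers should start above the table's current maximum key (max_key here is a hypothetical value looked up beforehand):

joined = rddAKeyed.join(rddA)  # (key, (seq, value)): every record picks up its key's sequence number
# shift past the existing maximum so new surrogate keys don't collide with old ones
withKeys = joined.map(lambda kv: (kv[1][0] + max_key + 1, kv[0], kv[1][1]))

Because the indices are assigned to the distinct key values before the join, all records sharing the same key value end up with the same surrogate key.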

New Contributor

If the RDD is partitioned, does zipWithIndex still produce unique keys?
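For what it's worth, zipWithIndex does assign globally unique, consecutive indices even when the RDD has multiple partitions (it triggers an extra job to count the elements per partition first), while zipWithUniqueId avoids that extra job but returns non-consecutive ids. A quick check on a deliberately partitioned RDD:

rdd = sc.parallelize(["a", "b", "c", "d", "e"], 3)  # spread across 3 partitions
print(rdd.zipWithIndex().collect())
# e.g. [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)] -- indices stay unique across partitions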