Created 07-25-2016 02:40 PM
What is the best way to assign a sequence number (surrogate key) in PySpark to a Hive table that is continuously loaded from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?
Created 07-25-2016 02:57 PM
You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to increment the keys.
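A minimal sketch of that approach, assuming the data is already available as RDDs (the names existing and incoming are illustrative, not from the original post):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

existing = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])  # rows that already have keys
incoming = sc.parallelize(["d", "e", "f"])                 # new rows that need keys

max_key = existing.keys().max()  # highest surrogate key currently in use

# zipWithIndex assigns each element a 0-based index that is unique across
# all partitions; shifting past the existing max avoids collisions.
keyed = incoming.zipWithIndex().map(lambda kv: (kv[1] + max_key + 1, kv[0]))

print(keyed.collect())  # [(4, 'd'), (5, 'e'), (6, 'f')]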
Created 07-25-2016 03:18 PM
I want to make sure all duplicate values in a certain column get the same primary key assigned to them. zipWithIndex doesn't guarantee that.
Created 07-25-2016 04:38 PM
You could pull out the keys, boil them down to distinct values and then index them. Would something like this work?
rddA = ...  # your main dataset
rddAKeys = rddA.keys()
rddAUniqKeys = rddAKeys.distinct()
rddAKeyed = rddAUniqKeys.zipWithIndex()
# join rddAKeyed with rddA
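Here is a runnable sketch of that idea, under the assumption that rddA is a pair RDD of (business_key, payload); the sample values are illustrative only:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rddA = sc.parallelize([("cust1", "row1"), ("cust2", "row2"), ("cust1", "row3")])

# Index the distinct business keys, then join back so every row with the
# same business key receives the same surrogate id.
rddAKeyed = rddA.keys().distinct().zipWithIndex()  # (business_key, surrogate_id)
joined = rddA.join(rddAKeyed)                      # (business_key, (payload, surrogate_id))
result = joined.map(lambda kv: (kv[1][1], kv[0], kv[1][0]))

print(result.collect())
# e.g. [(0, 'cust1', 'row1'), (0, 'cust1', 'row3'), (1, 'cust2', 'row2')]

Note that the indexes from zipWithIndex start at 0 on every run, so to keep keys stable across incremental loads you would still need to offset by the max key already stored in Hive, as described earlier.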
Created 02-01-2019 11:24 PM
If the RDD is partitioned, does zipWithIndex still produce unique keys?