<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: What is the best way to assign a sequence number (surrogate key)  in pyspark? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146285#M35798</link>
    <description>&lt;P&gt;You could use &lt;A href="https://spark.apache.org/docs/1.6.1/api/python/pyspark.html#pyspark.RDD.zipWithUniqueId" target="_blank"&gt;https://spark.apache.org/docs/1.6.1/api/python/pyspark.html#pyspark.RDD.zipWithUniqueId&lt;/A&gt;. &lt;/P&gt;</description>
    <pubDate>Mon, 25 Jul 2016 21:57:26 GMT</pubDate>
    <dc:creator>mgaido</dc:creator>
    <dc:date>2016-07-25T21:57:26Z</dc:date>
    <item>
      <title>What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146284#M35797</link>
      <description>&lt;P&gt;What is the best way to assign a sequence number (surrogate key) in pyspark to a table in Hive that will be inserted into all the time from various data sources after transformations? This key will be used as a primary key. Can I use an accumulator, or is there a better way?&lt;/P&gt;</description>
      <pubDate>Mon, 25 Jul 2016 21:40:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146284#M35797</guid>
      <dc:creator>doug_mengistu</dc:creator>
      <dc:date>2016-07-25T21:40:13Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146285#M35798</link>
      <description>&lt;P&gt;You could use &lt;A href="https://spark.apache.org/docs/1.6.1/api/python/pyspark.html#pyspark.RDD.zipWithUniqueId" target="_blank"&gt;https://spark.apache.org/docs/1.6.1/api/python/pyspark.html#pyspark.RDD.zipWithUniqueId&lt;/A&gt;. &lt;/P&gt;</description>
      <pubDate>Mon, 25 Jul 2016 21:57:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146285#M35798</guid>
      <dc:creator>mgaido</dc:creator>
      <dc:date>2016-07-25T21:57:26Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146286#M35799</link>
      <description>&lt;P&gt;You can use the zipWithIndex method to get a sequence number. And if you need the key to be a primary key, you could snag the max value for the existing dataset in a separate RDD and then use the map method on the zipped RDD to shift the new indices past it (zipWithIndex starts at 0, so add max + 1 to each index).&lt;/P&gt;</description>
      <pubDate>Mon, 25 Jul 2016 21:57:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146286#M35799</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-07-25T21:57:59Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146287#M35800</link>
      <description>&lt;P&gt;I want to make sure all duplicate values in a certain column get the same primary key assigned to them. zipWithIndex doesn't guarantee that.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Jul 2016 22:18:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146287#M35800</guid>
      <dc:creator>doug_mengistu</dc:creator>
      <dc:date>2016-07-25T22:18:13Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146288#M35801</link>
      <description>&lt;P&gt;I want to make sure all duplicate values in a certain column get the same primary key assigned to them. zipWithIndex doesn't guarantee that.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Jul 2016 22:18:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146288#M35801</guid>
      <dc:creator>doug_mengistu</dc:creator>
      <dc:date>2016-07-25T22:18:19Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146289#M35802</link>
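      <description>&lt;P&gt;You could pull out the keys, boil them down to distinct values, and then index them. Would something like this work?&lt;/P&gt;&lt;PRE&gt;# rddA is your main dataset, as an RDD of (key, value) pairs
rddAKeys = rddA.keys()
rddAUniqKeys = rddAKeys.distinct()
# zipWithIndex yields (key, index) pairs, one index per distinct key
rddAKeyed = rddAUniqKeys.zipWithIndex()
# join on key: each record becomes (key, (index, value))
rddAJoined = rddAKeyed.join(rddA)&lt;/PRE&gt;</description>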
      <pubDate>Mon, 25 Jul 2016 23:38:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146289#M35802</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-07-25T23:38:57Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to assign a sequence number (surrogate key)  in pyspark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146290#M35803</link>
      <description>&lt;P&gt;If the RDD is partitioned, does zipWithIndex still produce unique keys?&lt;/P&gt;</description>
      <pubDate>Sat, 02 Feb 2019 07:24:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-is-the-best-way-to-assign-a-sequence-number-surrogate/m-p/146290#M35803</guid>
      <dc:creator>Toughdev</dc:creator>
      <dc:date>2019-02-02T07:24:02Z</dc:date>
    </item>
  </channel>
</rss>

