<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: reading data from oracle in parallel in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/397378#M249829</link>
    <description>&lt;P&gt;For partitionColumn, can you use a date column, with a date range for the lower and upper bounds, in Spark 2.4.8?&lt;/P&gt;</description>
    <pubDate>Tue, 12 Nov 2024 13:28:00 GMT</pubDate>
    <dc:creator>djunifi</dc:creator>
    <dc:date>2024-11-12T13:28:00Z</dc:date>
    <item>
      <title>reading data from oracle in parallel</title>
      <link>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/368597#M240216</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I feel this community is teaching me lots of things I was not aware of before.&lt;/P&gt;&lt;P&gt;I am working on a project where I need to import data from Oracle and store it in HDFS.&lt;/P&gt;&lt;P&gt;I am using PySpark to load the data into a DataFrame and then store it in HDFS; however, the Oracle table is big, with 420,000,000 records.&lt;/P&gt;&lt;P&gt;Now I want to read those tables in parallel, but there are multiple tables and I am not able to define a partition on them.&lt;/P&gt;&lt;P&gt;Is there any way to read the data in parallel when you don't know the partition column?&lt;/P&gt;</description>
      <pubDate>Sun, 16 Apr 2023 15:15:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/368597#M240216</guid>
      <dc:creator>rdhau</dc:creator>
      <dc:date>2023-04-16T15:15:46Z</dc:date>
    </item>
    <item>
      <title>Re: reading data from oracle in parallel</title>
      <link>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/373907#M241804</link>
      <description>&lt;P&gt;The Spark JDBC reader can read data in parallel by splitting it into several partitions. There are four relevant options.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;partitionColumn is the name of the column used for partitioning. An important condition is that the column must be of numeric (integer or decimal), date, or timestamp type. If the partitionColumn parameter is not specified, Spark will use a single executor and create one non-empty partition, and reading will be neither distributed nor parallelized.&lt;/LI&gt;&lt;LI&gt;numPartitions is the maximum number of partitions that can be used for simultaneous table reading and writing.&lt;/LI&gt;&lt;LI&gt;The lowerBound and upperBound boundaries are used to define the partition width. These boundaries determine how many rows from a given range of partition column values fall within a single partition.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read \
	.format("jdbc") \
	.option("url", "jdbc:postgresql:postgres") \
	.option("dbtable", "db.table") \
	.option("partitionColumn", "id") \  # required alongside the bounds; "id" is a placeholder numeric column
	.option("user", "user")\
	.option("password", "pass") \
	.option("numPartitions", "10") \
	.option("lowerBound", "100") \
	.option("upperBound", "1100") \
	.load()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This method uses the lower bound, the upper bound, and the number of partitions to create WHERE clauses. For example, if the lower bound is set to 100, the upper bound to 1,100, and the number of partitions to 10, then the number of rows read by each task, called the stride in the reference documentation, will be:&lt;/P&gt;&lt;PRE&gt;(upper bound - lower bound) / number of partitions&lt;BR /&gt;(1100 - 100) / 10 = 100&lt;/PRE&gt;&lt;P&gt;And the series of filters applied to the tasks will be:&lt;/P&gt;&lt;PRE&gt;where partitionColumn &amp;lt; 200&lt;BR /&gt;where partitionColumn &amp;gt;= 200 and partitionColumn &amp;lt; 300&lt;BR /&gt;where partitionColumn &amp;gt;= 300 and partitionColumn &amp;lt; 400&lt;BR /&gt;...&lt;BR /&gt;where partitionColumn &amp;gt;= 1000&lt;/PRE&gt;&lt;P&gt;The lowerBound and upperBound values define the partitioning boundaries, but they DO NOT filter the rows of the table. Therefore, Spark partitions and returns ALL the rows of the table. It is important to note that &lt;STRONG&gt;all data will be read whether partitioning is used or not&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;For example, suppose the partitionColumn values range over [0, 10000] and we set numPartitions=10, lowerBound=4000 and upperBound=5000. In that case, the first and last partitions will contain all the data outside the corresponding lower or upper boundary.&lt;/P&gt;&lt;P&gt;Another example: suppose the partitionColumn values range over [2000, 4000] and we set numPartitions=10, lowerBound=0 and upperBound=10000. In this case, only 2 of the 10 queries (one per partition) will return any rows, so those two do all the work, which is not ideal. Here, the best configuration would be numPartitions=10, lowerBound=2000, upperBound=4000.&lt;/P&gt;</description>
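The stride arithmetic above can be sketched in plain Python. This is a minimal illustration of how the Spark JDBC reader splits the bounds into per-task ranges (an assumption drawn from the explanation above, not Spark's actual source); `None` marks an unbounded side, since the first and last partitions also sweep up rows outside the bounds:

```python
def jdbc_partition_ranges(lower, upper, num_partitions):
    """Mimic how the JDBC reader splits [lower, upper) into strides.

    Returns (lo, hi) pairs; None means that side is unbounded, because the
    first and last partitions also catch rows outside the configured bounds.
    """
    stride = (upper - lower) // num_partitions
    ranges = []
    for i in range(num_partitions):
        lo = None if i == 0 else lower + i * stride
        hi = None if i == num_partitions - 1 else lower + (i + 1) * stride
        ranges.append((lo, hi))
    return ranges

# lowerBound=100, upperBound=1100, numPartitions=10 -> stride of 100
ranges = jdbc_partition_ranges(100, 1100, 10)
```

Each `(lo, hi)` pair corresponds to one task's WHERE clause, which is why all rows are always read regardless of the bounds chosen.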
      <pubDate>Fri, 14 Jul 2023 09:53:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/373907#M241804</guid>
      <dc:creator>ggangadharan</dc:creator>
      <dc:date>2023-07-14T09:53:55Z</dc:date>
    </item>
    <item>
      <title>Re: reading data from oracle in parallel</title>
      <link>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/373908#M241805</link>
      <description>&lt;P data-unlink="true"&gt;Cool, thanks. Just what I was looking for.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jul 2023 12:40:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/373908#M241805</guid>
      <dc:creator>Ganesha</dc:creator>
      <dc:date>2023-07-14T12:40:59Z</dc:date>
    </item>
    <item>
      <title>Re: reading data from oracle in parallel</title>
      <link>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/397378#M249829</link>
      <description>&lt;P&gt;For partitionColumn, can you use Date column and date range for lower and upper bound for spark 2.4.8 ?&lt;/P&gt;</description>
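Since Spark 2.4 the JDBC source accepts date and timestamp columns as partitionColumn, with the bounds given as ISO date strings (SPARK-22814), so this should also apply to 2.4.8. A minimal sketch of the options such a read would use; the table name `SALES`, column `SALE_DATE`, and the Oracle URL are hypothetical placeholders:

```python
def date_partitioned_jdbc_options(url, table, date_col,
                                  start_date, end_date, num_partitions):
    """Build the option map for a date-partitioned JDBC read.

    start_date/end_date are ISO "YYYY-MM-DD" strings; since Spark 2.4 the
    JDBC source accepts date/timestamp partition columns with such bounds.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": date_col,      # date or timestamp column
        "lowerBound": start_date,
        "upperBound": end_date,
        "numPartitions": str(num_partitions),
    }

# Hypothetical Oracle connection details; pass the map to the reader like:
#   spark.read.format("jdbc").options(**opts).load()
opts = date_partitioned_jdbc_options(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "SALES", "SALE_DATE",
    "2023-01-01", "2024-01-01", 12)
```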
      <pubDate>Tue, 12 Nov 2024 13:28:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/reading-data-from-oracle-in-parallel/m-p/397378#M249829</guid>
      <dc:creator>djunifi</dc:creator>
      <dc:date>2024-11-12T13:28:00Z</dc:date>
    </item>
  </channel>
</rss>

