<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: write is slow in hdfs using pyspark in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368394#M240165</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/104568"&gt;@rdhau&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you share the spark-submit command used to trigger this application?&lt;/P&gt;&lt;P&gt;Also, as a quick test, can you set the following property to true and share the outcome?&lt;/P&gt;&lt;P class="p1"&gt;spark.dynamicAllocation.enabled&lt;/P&gt;</description>
    <pubDate>Thu, 13 Apr 2023 07:21:51 GMT</pubDate>
    <dc:creator>AsimShaikh</dc:creator>
    <dc:date>2023-04-13T07:21:51Z</dc:date>
    <item>
      <title>write is slow in hdfs using pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368320#M240128</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I am trying to import the data from an Oracle database and write it to HDFS using PySpark.&lt;/P&gt;&lt;P&gt;Oracle has 480 tables; I am looping over the list of tables, but while writing the data into HDFS Spark is taking too much time.&lt;/P&gt;&lt;P&gt;When I check the logs, only 1 executor is running even though I was passing --num-executors 4.&lt;/P&gt;&lt;P&gt;Here is my code:&lt;/P&gt;&lt;P&gt;# oracle-example.py&lt;BR /&gt;from pyspark.sql import SparkSession&lt;BR /&gt;from pyspark.sql import HiveContext&lt;/P&gt;&lt;P&gt;appName = "PySpark Example - Oracle Example"&lt;BR /&gt;master = "yarn"&lt;/P&gt;&lt;P&gt;spark = SparkSession.builder.master(master).appName(appName).enableHiveSupport().getOrCreate()&lt;BR /&gt;spark.sparkContext.getConf().getAll()&lt;BR /&gt;# to get the list of tables present in the schema&lt;BR /&gt;sql = "SELECT table_name FROM all_tables WHERE owner = '**'"&lt;BR /&gt;user = "**"&lt;BR /&gt;password = "**"&lt;BR /&gt;jdbc_url = "jdbc:oracle:thin:@****/**"&lt;BR /&gt;# Change this to your Oracle's details accordingly&lt;BR /&gt;server = "**"&lt;BR /&gt;port = **&lt;BR /&gt;service_name = '**'&lt;BR /&gt;jdbcDriver = "oracle.jdbc.OracleDriver"&lt;/P&gt;&lt;P&gt;# Create a data frame by reading from Oracle via JDBC to get the list of tables present in the schema&lt;BR /&gt;tablelist = spark.read.format("jdbc") \&lt;BR /&gt;.option("url", jdbc_url) \&lt;BR /&gt;.option("query", sql) \&lt;BR /&gt;.option("user", user) \&lt;BR /&gt;.option("password", password) \&lt;BR /&gt;.option("driver", jdbcDriver) \&lt;BR /&gt;.load().select("table_name")&lt;/P&gt;&lt;P&gt;connection_details = { "user": "**", "password": "**", "driver": "oracle.jdbc.OracleDriver", }&lt;BR /&gt;tablelist = [row.table_name for row in tablelist.collect()]&lt;BR /&gt;for i in range(len(tablelist)):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;df = spark.read.jdbc(url=jdbc_url, table='sgms.'+tablelist[i], properties=connection_details)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;df.write.save('hdfs:/rajsampark/sgms/'+tablelist[i], format='csv', mode='overwrite')&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print("Write successful for table "+tablelist[i])&lt;/P&gt;&lt;P&gt;And I am submitting the code using spark-submit.&lt;/P&gt;&lt;P&gt;Please help.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Apr 2023 11:41:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368320#M240128</guid>
      <dc:creator>rdhau</dc:creator>
      <dc:date>2023-04-12T11:41:56Z</dc:date>
    </item>
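The single-busy-executor symptom above usually traces back to the fact that spark.read.jdbc produces one partition when no partitioning options are given, so the whole table is fetched and written by a single task. Passing column, lowerBound, upperBound and numPartitions makes Spark issue one range query per partition. The pure-Python sketch below approximates how that range is split into per-partition WHERE predicates (the column name ID and the bounds are hypothetical, and Spark's exact stride arithmetic differs slightly):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the WHERE clauses Spark's JDBC source generates when
    given column/lowerBound/upperBound/numPartitions.
    Simplified sketch: assumes num_partitions >= 2 and an evenly
    divisible integer range."""
    stride = (upper - lower) // num_partitions
    preds = []
    bound = lower
    for i in range(num_partitions):
        lo = bound
        bound += stride
        if i == 0:
            # First partition also picks up NULLs in the partition column.
            preds.append(f"{column} < {bound} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is unbounded above so no rows are dropped.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {bound}")
    return preds
```

In PySpark this corresponds to something like spark.read.jdbc(url=jdbc_url, table='sgms.'+tablelist[i], column='ID', lowerBound=0, upperBound=100, numPartitions=4, properties=connection_details), assuming the table has a numeric column with roughly uniform distribution; the bounds here are placeholders, not values from the thread.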
    <item>
      <title>Re: write is slow in hdfs using pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368331#M240132</link>
      <description>&lt;P&gt;Increase the number of read partitions: a JDBC read without partitioning options produces a single partition, so only one task (and therefore one executor) does the work. As a quick fix you can use the repartition method after the read. For example:&lt;BR /&gt;df = spark.read.jdbc(url=jdbc_url, table='sgms.'+tablelist[i], properties=connection_details).repartition(4)&lt;/P&gt;&lt;P&gt;This will create 4 partitions and distribute the write across the cluster. Better still, pass column, lowerBound, upperBound and numPartitions to spark.read.jdbc so the read itself runs in parallel.&lt;/P&gt;&lt;P&gt;Increase the executor memory: by default, each executor is allocated 1GB of memory. If your data is large, you can increase the allocation with the --executor-memory flag. For example:&lt;/P&gt;&lt;P&gt;spark-submit --executor-memory 4g oracle-example.py&lt;/P&gt;&lt;P&gt;This will allocate 4GB of memory to each executor.&lt;/P&gt;&lt;P&gt;Use foreachPartition for custom sinks: df.write already writes each partition in parallel, so foreachPartition helps only when you need custom per-partition output logic. For example:&lt;/P&gt;&lt;P&gt;df.foreachPartition(lambda x: write_to_hdfs(x))&lt;/P&gt;&lt;P&gt;Here, write_to_hdfs is a user-defined function that writes the rows of a partition to HDFS.&lt;/P&gt;&lt;P&gt;Increase the number of executors: depending on your cluster defaults, only a few executors may be allocated to the application. You can set the number explicitly with the --num-executors flag. For example:&lt;/P&gt;&lt;P&gt;spark-submit --num-executors 4 oracle-example.py&lt;/P&gt;&lt;P&gt;This will request 4 executors for the application. Note that --num-executors only takes effect when spark.dynamicAllocation.enabled is false.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Apr 2023 12:56:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368331#M240132</guid>
      <dc:creator>HannaJr87</dc:creator>
      <dc:date>2023-04-12T12:56:31Z</dc:date>
    </item>
    <item>
      <title>Re: write is slow in hdfs using pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368385#M240159</link>
      <description>&lt;P&gt;Hi, I have applied the repartition, but still only 1 executor is running at a time.&lt;/P&gt;&lt;P&gt;Could you please help me with this? Also, for the write, can you share the syntax where I can give the path to save the data in HDFS?&lt;/P&gt;</description>
      <pubDate>Thu, 13 Apr 2023 05:13:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368385#M240159</guid>
      <dc:creator>rdhau</dc:creator>
      <dc:date>2023-04-13T05:13:48Z</dc:date>
    </item>
    <item>
      <title>Re: write is slow in hdfs using pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368394#M240165</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/104568"&gt;@rdhau&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you share the spark-submit command used to trigger this application?&lt;/P&gt;&lt;P&gt;Also, as a quick test, can you set the following property to true and share the outcome?&lt;/P&gt;&lt;P class="p1"&gt;spark.dynamicAllocation.enabled&lt;/P&gt;</description>
      <pubDate>Thu, 13 Apr 2023 07:21:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368394#M240165</guid>
      <dc:creator>AsimShaikh</dc:creator>
      <dc:date>2023-04-13T07:21:51Z</dc:date>
    </item>
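Since the thread turns on spark.dynamicAllocation.enabled, here is a sketch of how it might be passed on the command line; all resource sizes are placeholders, not values from the thread. Note that on YARN, dynamic allocation also requires the external shuffle service (or, on Spark 3.x, spark.dynamicAllocation.shuffleTracking.enabled).

```shell
# Hypothetical spark-submit invocation; min/max executor counts are placeholders.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  oracle-example.py
```

Conversely, to pin a fixed number of executors as the original poster eventually did, disable dynamic allocation (--conf spark.dynamicAllocation.enabled=false) and pass --num-executors explicitly.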
    <item>
      <title>Re: write is slow in hdfs using pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368596#M240215</link>
      <description>&lt;P&gt;Thanks, I have disabled dynamic allocation and it is working now.&lt;/P&gt;</description>
      <pubDate>Sun, 16 Apr 2023 15:09:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/write-is-slow-in-hdfs-using-pyspark/m-p/368596#M240215</guid>
      <dc:creator>rdhau</dc:creator>
      <dc:date>2023-04-16T15:09:21Z</dc:date>
    </item>
  </channel>
</rss>

