<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: PySpark Queries in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412330#M253382</link>
    <description>&lt;P&gt;Can we parallelise it?&lt;/P&gt;</description>
    <pubDate>Thu, 11 Sep 2025 10:07:05 GMT</pubDate>
    <dc:creator>Jack_sparrow</dc:creator>
    <dc:date>2025-09-11T10:07:05Z</dc:date>
    <item>
      <title>PySpark Queries</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412308#M253365</link>
      <description>&lt;P&gt;I have a PySpark job that reads an HDFS path in its respective format (text, ORC, Parquet) and writes it in Parquet format to an Ozone path.&lt;/P&gt;&lt;P&gt;The data is huge.&lt;/P&gt;&lt;P&gt;1) How do I do the resource calculations for the PySpark job, i.e. the number of cores, number of executors, and memory allocation?&lt;/P&gt;&lt;P&gt;2) Is there a way to dynamically read the data from HDFS, adjusting according to its file type?&lt;/P&gt;&lt;P&gt;3) What is the optimal approach for the data movement, and what input mappings should that approach use?&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 06:17:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412308#M253365</guid>
      <dc:creator>Jack_sparrow</dc:creator>
      <dc:date>2026-04-21T06:17:11Z</dc:date>
    </item>
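    <!-- Editor's note on question 1: a minimal, hedged sizing sketch. The node shape
         (16 cores / 64 GB per worker, 4 workers) and the common heuristics (5 cores
         per executor, ~10% memory overhead, 1 core and 1 GB reserved per node) are
         illustrative assumptions, not measurements from this cluster.

         Worked arithmetic under those assumptions:
           usable cores per node: 16 minus 1 for OS/daemons          = 15
           executors per node:    15 // 5 cores per executor         = 3
           total executors:       3 * 4 nodes, minus 1 for the AM    = 11
           memory per executor:   (64 GB minus 1 GB) / 3             = ~21 GB
           executor memory:       ~21 GB / 1.10 (overhead headroom)  = ~19 GB

         As PySpark session config (same numbers):

         from pyspark.sql import SparkSession

         spark = (SparkSession.builder
                  .appName("hdfs_to_ozone")  # hypothetical app name
                  .config("spark.executor.instances", "11")
                  .config("spark.executor.cores", "5")
                  .config("spark.executor.memory", "19g")
                  .getOrCreate())
    -->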
    <item>
      <title>Re: PySpark Queries</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412315#M253368</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/131645"&gt;@Jack_sparrow&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Glad to see you again in the forums.&lt;/P&gt;&lt;P&gt;1. Resource allocation is complicated to prescribe because it depends on many factors.&lt;BR /&gt;Keep in mind how big your data is and how much memory your cluster has, and do not forget the overhead.&lt;BR /&gt;There is very useful information here:&amp;nbsp;&lt;A href="https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/tuning-spark/topics/spark-admin-tuning-resource-allocation.html" target="_blank"&gt;https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/tuning-spark/topics/spark-admin-tuning-resource-allocation.html&lt;/A&gt;&lt;BR /&gt;Under that parent section there are more tuning suggestions for each topic.&lt;/P&gt;&lt;P&gt;2. From the second question I understand that you want to read the data according to its file type.&lt;BR /&gt;That should be possible with something like this:&lt;/P&gt;&lt;PRE&gt;if input_path.endswith(".parquet"):&lt;BR /&gt;    df = spark.read.parquet(input_path)&lt;BR /&gt;elif input_path.endswith(".orc"):&lt;BR /&gt;    df = spark.read.orc(input_path)&lt;BR /&gt;elif input_path.endswith(".txt") or input_path.endswith(".csv"):&lt;BR /&gt;    df = spark.read.text(input_path)  # or .csv with options&lt;BR /&gt;else:&lt;BR /&gt;    raise Exception("Unsupported file format")&lt;/PRE&gt;&lt;P&gt;Then you can handle each dataset separately.&lt;/P&gt;&lt;P&gt;3. The data movement should avoid going through the driver, so collect() and .toPandas() are not the best options.&lt;BR /&gt;If you want to move the data without transformations, distcp is a good option.&lt;BR /&gt;To write, you can use: df.write.mode("overwrite").parquet("ofs://ozone/path/out")&lt;BR /&gt;Other suggestions: tune the partition size with "spark.sql.files.maxPartitionBytes" and switch the compression to Snappy using "spark.sql.parquet.compression.codec".&lt;/P&gt;</description>
      <pubDate>Tue, 09 Sep 2025 16:26:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412315#M253368</guid>
      <dc:creator>vafs</dc:creator>
      <dc:date>2025-09-09T16:26:33Z</dc:date>
    </item>
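    <!-- Editor's note on the reply above: a self-contained sketch of the end-to-end
         read/dispatch/write pipeline it describes. The paths and the CSV header
         option are hypothetical; the config keys are the ones named in the post.

         from pyspark.sql import SparkSession

         spark = (SparkSession.builder
                  .appName("hdfs_to_ozone_copy")  # hypothetical app name
                  # smaller input splits mean more parallel read tasks
                  .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
                  # Snappy trades a little size for fast compression, as suggested
                  .config("spark.sql.parquet.compression.codec", "snappy")
                  .getOrCreate())

         input_path = "hdfs:///data/in"        # hypothetical source
         output_path = "ofs://ozone/path/out"  # Ozone path from the reply

         # dispatch on the path suffix, as in the reply's snippet
         if input_path.endswith(".parquet"):
             df = spark.read.parquet(input_path)
         elif input_path.endswith(".orc"):
             df = spark.read.orc(input_path)
         elif input_path.endswith((".txt", ".csv")):
             df = spark.read.option("header", "true").csv(input_path)
         else:
             raise ValueError("Unsupported file format: " + input_path)

         # distributed write; nothing is pulled through the driver
         df.write.mode("overwrite").parquet(output_path)
    -->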
    <item>
      <title>Re: PySpark Queries</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412330#M253382</link>
      <description>&lt;P&gt;Can we parallelise it?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 10:07:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412330#M253382</guid>
      <dc:creator>Jack_sparrow</dc:creator>
      <dc:date>2025-09-11T10:07:05Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark Queries</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412332#M253384</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/131645"&gt;@Jack_sparrow&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Spark should parallelise it automatically; you can control that with these settings:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Input splits are controlled by&lt;BR /&gt;spark.sql.files.maxPartitionBytes (default 128 MB): a smaller value produces more splits and therefore more parallel tasks.&lt;BR /&gt;spark.sql.files.openCostInBytes (default 4 MB), which influences how Spark coalesces small files.&lt;/LI&gt;&lt;LI&gt;Shuffle parallelism&lt;BR /&gt;spark.sql.shuffle.partitions (default 200). Configure it at around 2–3 times the total executor cores.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Also, make sure df.write.parquet() doesn't write everything into just a few files. You can use .repartition(n) to increase the parallelism before writing.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 18:04:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-Queries/m-p/412332#M253384</guid>
      <dc:creator>vafs</dc:creator>
      <dc:date>2025-09-11T18:04:14Z</dc:date>
    </item>
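    <!-- Editor's note on the reply above: a minimal sketch of the parallelism knobs
         it names. The executor count (11 x 5 = 55 cores) and the 64 MB split size
         are illustrative assumptions; tune them against your own cluster.

         from pyspark.sql import SparkSession

         total_cores = 11 * 5  # assumed executors * cores per executor

         spark = (SparkSession.builder
                  # smaller splits create more parallel input tasks
                  .config("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
                  # estimated cost of opening a file; drives small-file coalescing
                  .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))
                  # the reply suggests 2 to 3 times the total executor cores
                  .config("spark.sql.shuffle.partitions", str(total_cores * 3))
                  .getOrCreate())

         # raise the partition count before writing so the output is not
         # squeezed into a handful of files (repartition adds a shuffle)
         df = spark.read.parquet("hdfs:///data/in")  # hypothetical input
         df.repartition(total_cores * 2).write.mode("overwrite").parquet("ofs://ozone/path/out")
    -->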
  </channel>
</rss>