<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Possibility Split Parquet file in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378974#M243744</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/70785"&gt;@Shmoo&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/38161"&gt;@cravani&lt;/a&gt;&amp;nbsp;Do you have any insights here? Thanks!&lt;/P&gt;</description>
    <pubDate>Mon, 13 Nov 2023 23:36:11 GMT</pubDate>
    <dc:creator>DianaTorres</dc:creator>
    <dc:date>2023-11-13T23:36:11Z</dc:date>
    <item>
      <title>Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378726#M243645</link>
      <description>&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;my team, using Tez (in particular Hive), has noticed that during an INSERT with a very simple SELECT, a single Parquet file of 1.5 GB per partition is generated in the output table.&lt;/P&gt;&lt;P&gt;To try to remedy the problem, a number of settings were applied at the session level, but they had no effect.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="HadoopHero_0-1699388299635.png" style="width: 400px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/38861iBDAFBEB75EEF49E9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="HadoopHero_0-1699388299635.png" alt="HadoopHero_0-1699388299635.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Below are the settings used at the session level:&lt;/P&gt;&lt;P&gt;&lt;FONT size="2"&gt;SET hive.execution.engine=tez;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;SET hive.exec.dynamic.partition=true;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;SET hive.exec.dynamic.partition.mode=nonstrict;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;SET hive.optimise.sort.dynamic.partition.threshold=0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET tez.grouping.max-size=268435456;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET hive.exec.reducers.bytes.per.reducer=536870912;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET tez.grouping.split-count=18;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;SET hive.vectorized.execution.reduce.enabled = true;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;SET hive.vectorized.execution.reduce.groupby.enabled = true;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET hive.tez.auto.reducer.parallelism=false;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET mapred.reduce.tasks=12;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET hive.tez.partition.size=104857600;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;--SET hive.tez.partition.num=10;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;SET hive.parquet.output.block.size=104857600;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="4"&gt;I would like to ask whether there is a way to keep Parquet output but have it broken up into smaller files, as shown in the image below.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="HadoopHero_1-1699388442024.png" style="width: 400px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/38862iE3072D7B6907A2CC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="HadoopHero_1-1699388442024.png" alt="HadoopHero_1-1699388442024.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="4"&gt;We cannot understand what the cause might be.&lt;BR /&gt;Files structured this way do not provide sufficient parallelism for other jobs (such as Sqoop).&lt;BR /&gt;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 06:46:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378726#M243645</guid>
      <dc:creator>HadoopHero</dc:creator>
      <dc:date>2026-04-21T06:46:30Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378742#M243651</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105530"&gt;@HadoopHero&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;In Hive, if there is a single reduce task writing the output data, it will not break up the output file into smaller files; that is expected and cannot be configured to behave differently.&lt;/P&gt;&lt;P&gt;With DISTRIBUTE BY you should be able to get multiple reducers (if you have a column by which you can "split" your data reasonably into smaller subsets), see&lt;/P&gt;&lt;P&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;&lt;P&gt;&amp;nbsp;Miklos&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2023 09:27:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378742#M243651</guid>
      <dc:creator>mszurap</dc:creator>
      <dc:date>2023-11-08T09:27:41Z</dc:date>
    </item>
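    The DISTRIBUTE BY approach suggested above can be sketched as follows. This is a minimal, hedged illustration; the table and column names (sales_src, sales_tgt, event_date, customer_id) are hypothetical and not from the thread.

    ```sql
    -- Hypothetical sketch: rows are hashed across reducers by customer_id,
    -- so each partition's data is written by several reducers, producing
    -- several smaller Parquet files instead of one large file per partition.
    SET hive.execution.engine=tez;
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE sales_tgt PARTITION (event_date)
    SELECT customer_id, amount, event_date
    FROM sales_src
    DISTRIBUTE BY customer_id;
    ```

    The choice of distribution column matters: it should have enough distinct, evenly spread values that the rows divide into reasonably sized subsets, otherwise a skewed column funnels most data back to a single reducer and a single file.
    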
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378796#M243679</link>
      <description>&lt;P&gt;Hello Miklos,&lt;/P&gt;&lt;P&gt;unfortunately, what you suggested had no effect. We continue to have the same problem: a single Parquet file is still created.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Nov 2023 08:51:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378796#M243679</guid>
      <dc:creator>HadoopHero</dc:creator>
      <dc:date>2023-11-09T08:51:05Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378800#M243682</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105530"&gt;@HadoopHero&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;If the query involves dynamic partitioning, one potential issue is that 'hive.optimize.sort.dynamic.partition.threshold' may limit the number of open record writers to just one per partition value, resulting in the creation of only one file. To investigate this, could you try disabling 'hive.optimize.sort.dynamic.partition.threshold' entirely?&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;SET hive.optimize.sort.dynamic.partition.threshold=-1;&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Note:&amp;nbsp;the problem statement contains a typo in the config name ('hive.optimise...' instead of 'hive.optimize...'), so the setting shown there would not have taken effect.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Nov 2023 10:56:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378800#M243682</guid>
      <dc:creator>ggangadharan</dc:creator>
      <dc:date>2023-11-09T10:56:47Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378958#M243738</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105530"&gt;@HadoopHero&lt;/a&gt;&amp;nbsp;Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.&amp;nbsp; Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Nov 2023 17:32:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378958#M243738</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2023-11-13T17:32:20Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378969#M243743</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/93628"&gt;@DianaTorres&lt;/a&gt;&amp;nbsp;I'm sorry, but unfortunately the problem still persists even after trying the suggestions in the previous posts.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Nov 2023 22:13:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378969#M243743</guid>
      <dc:creator>HadoopHero</dc:creator>
      <dc:date>2023-11-13T22:13:18Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378974#M243744</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/70785"&gt;@Shmoo&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/38161"&gt;@cravani&lt;/a&gt;&amp;nbsp;Do you have any insights here? Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 13 Nov 2023 23:36:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/378974#M243744</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2023-11-13T23:36:11Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/379251#M243818</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105530"&gt;@HadoopHero&lt;/a&gt;&amp;nbsp;The answer would vary based on the query you are running. Assuming you have a simple "Insert select */cols from Table", it is likely a mapper-only job, and you may want to try tuning the settings below.&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;set tez.grouping.min-size=134217728; -- 128 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split&lt;/LI-CODE&gt;&lt;P&gt;Try setting min-size and max-size to the same value. I would not go below 128 MB.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Nov 2023 04:24:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/379251#M243818</guid>
      <dc:creator>cravani</dc:creator>
      <dc:date>2023-11-21T04:24:36Z</dc:date>
    </item>
    <item>
      <title>Re: Possibility Split Parquet file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/379586#M243888</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105530"&gt;@HadoopHero&lt;/a&gt;&amp;nbsp;Has the reply helped resolve your issue? Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Nov 2023 17:31:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Possibility-Split-Parquet-file/m-p/379586#M243888</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2023-11-24T17:31:35Z</dc:date>
    </item>
  </channel>
</rss>

