<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Parquet files should not be split into multiple hdfs-blocks problem and strange record count issue in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/277729#M24878</link>
    <description>&lt;P&gt;I am not able to open this link:&amp;nbsp;&lt;A href="http://ingest.tips/2015/01/31/parquet-row-group-size/" target="_blank" rel="nofollow noopener noreferrer"&gt;http://ingest.tips/2015/01/31/parquet-row-group-size/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Can you please check and repost it?&lt;/P&gt;</description>
    <pubDate>Thu, 19 Sep 2019 03:53:44 GMT</pubDate>
    <dc:creator>hrishi1dypim</dc:creator>
    <dc:date>2019-09-19T03:53:44Z</dc:date>
    <item>
      <title>Parquet files should not be split into multiple hdfs-blocks problem and strange record count issue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/39572#M24875</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to import Parquet tables from another Cloudera Impala installation into my own Cloudera Impala cluster:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;--&amp;gt; I receive the Parquet tables via sftp.&lt;/P&gt;&lt;P&gt;--&amp;gt; I copy all Parquet files into the proper Impala table directory, e.g. /grid1/hive/warehouse/&amp;lt;database&amp;gt;/&amp;lt;importedTable&amp;gt;, without any error/warning.&lt;/P&gt;&lt;P&gt;--&amp;gt; I create the required partition structure with alter table &amp;lt;importedTable&amp;gt; add partition (..), without any error/warning.&lt;/P&gt;&lt;P&gt;--&amp;gt; I run the refresh &amp;lt;importedTable&amp;gt; command, without any error/warning.&lt;/P&gt;&lt;P&gt;--&amp;gt; I can see the new partitions in the show partitions &amp;lt;importedTable&amp;gt; output, without any error/warning.&lt;/P&gt;&lt;P&gt;--&amp;gt; I apply the above procedure to all tables.&lt;/P&gt;&lt;P&gt;--&amp;gt; When I try to access records in a table, I get the following warning: "WARNINGS: Parquet files should not be split into multiple hdfs-blocks".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I use gzip compression on my tables, but the imported tables have the default settings, so I keep another database with gzipped data and copy the data from the imported table into the gzipped table with the following commands, both of which complete without any error/warning:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;set compression_codec=gzip;&lt;/P&gt;&lt;P&gt;insert into &amp;lt;gzippedTable&amp;gt; partition (part1=value1, part2=value2) select field1, field3, field4 ... from &amp;lt;importedTable&amp;gt; where partitionedColumn1=value1 and partitionedColumn2=value2;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When I compare record counts for the same partition in the gzipped table and the imported table, there is a difference, as in the following output:&lt;/P&gt;&lt;P&gt;[host03:21000] &amp;gt; select count(*) from &lt;STRONG&gt;importedTable&lt;/STRONG&gt; where logdate=20160401;&lt;/P&gt;&lt;P&gt;Query: select count(*) from &lt;STRONG&gt;importedTable&lt;/STRONG&gt; where logdate=20160401&lt;/P&gt;&lt;P&gt;+-----------+&lt;/P&gt;&lt;P&gt;| count(*)&amp;nbsp; |&lt;/P&gt;&lt;P&gt;+-----------+&lt;/P&gt;&lt;P&gt;| &lt;STRONG&gt;101565867&lt;/STRONG&gt; |&lt;/P&gt;&lt;P&gt;+-----------+&lt;/P&gt;&lt;P&gt;WARNINGS: Parquet files should not be split into multiple hdfs-blocks. file=hdfs://host01:8020/grid1/hive/warehouse/&amp;lt;database&amp;gt;/importedTable/partitionedColumn=value1/logdate=20160401/51464233716089fd-295e6694028850a0_1358598818_data.0.parq (1 of 94 similar)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Fetched 1 row(s) in 0.96s&lt;/P&gt;&lt;P&gt;[host03:21000] &amp;gt; select count(*) from &lt;STRONG&gt;gzippedTable&lt;/STRONG&gt; where logdate=20160401;&lt;/P&gt;&lt;P&gt;Query: select count(*) from &lt;STRONG&gt;gzippedTable&lt;/STRONG&gt; where logdate=20160401&lt;/P&gt;&lt;P&gt;+-----------+&lt;/P&gt;&lt;P&gt;| count(*)&amp;nbsp; |&lt;/P&gt;&lt;P&gt;+-----------+&lt;/P&gt;&lt;P&gt;| &lt;STRONG&gt;123736525&lt;/STRONG&gt; |&lt;/P&gt;&lt;P&gt;+-----------+&lt;/P&gt;&lt;P&gt;Fetched 1 row(s) in 0.92s&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So how can I fix the "WARNINGS: Parquet files should not be split into multiple hdfs-blocks" warning, and why am I getting different record counts after applying the above procedure? Is the record count difference related to the multiple hdfs-blocks warning?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:13:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/39572#M24875</guid>
      <dc:creator>msuluhan</dc:creator>
      <dc:date>2022-09-16T10:13:12Z</dc:date>
    </item>
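A minimal shell sketch of the import procedure described in the question above; the database, table, and partition names are placeholders, not values from the original post, and the cluster commands are shown commented out since they require a live HDFS/Impala deployment.

```shell
# Hypothetical sketch of the import steps; names are illustrative.
DB="mydb"
TABLE="importedTable"
WAREHOUSE="/grid1/hive/warehouse"

# 1. Copy the received parquet files into the table's partition directory.
#    (Commented out: needs a running HDFS cluster.)
# hdfs dfs -put ./incoming/*.parq "$WAREHOUSE/$DB/$TABLE/logdate=20160401/"

# 2. Register the partition, then tell Impala to reload file metadata.
ADD_PART="alter table $TABLE add partition (logdate=20160401);"
REFRESH="refresh $TABLE;"

# On a real cluster these would be run as, e.g.:
# impala-shell -i host03:21000 -d "$DB" -q "$ADD_PART $REFRESH"
echo "$ADD_PART"
echo "$REFRESH"
```

The `refresh` step is what makes newly copied files visible to Impala without restarting the daemons; note that it reloads metadata for one table only.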
    <item>
      <title>Re: Parquet files should not be split into multiple hdfs-blocks problem and strange record count issue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/39854#M24876</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Regarding the "WARNINGS: Parquet files should not be split into multiple hdfs-blocks" issue: what is the HDFS block size set to for the application that is inserting the Parquet data into HDFS? If your application uses the default, you can find it in hdfs-site.xml under the&amp;nbsp;dfs.blocksize property.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If this is smaller than the size of the Parquet files, they will be split into multiple HDFS blocks. This can&amp;nbsp;mean Impala has to read one or more blocks remotely to reassemble a&amp;nbsp;Parquet row group, which has a performance impact. This is explained pretty well here:&amp;nbsp;&lt;A href="http://ingest.tips/2015/01/31/parquet-row-group-size/" target="_blank"&gt;http://ingest.tips/2015/01/31/parquet-row-group-size/&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Please note that dfs.blocksize can be set per application, so you should not need to modify the global setting. Try raising it to your largest Parquet file size when inserting into HDFS.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Apr 2016 09:34:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/39854#M24876</guid>
      <dc:creator>TomScott</dc:creator>
      <dc:date>2016-04-18T09:34:09Z</dc:date>
    </item>
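The answer above can be checked with simple arithmetic: a file larger than dfs.blocksize necessarily spans multiple HDFS blocks. The sketch below uses illustrative sizes (the default 128 MB block size and a hypothetical 512 MB parquet file, neither taken from the original post) and shows the commented-out per-application override the answer suggests.

```shell
# Why the warning fires: ceiling-divide file size by block size.
BLOCK_SIZE=$((128 * 1024 * 1024))   # common default dfs.blocksize: 128 MB
FILE_SIZE=$((512 * 1024 * 1024))    # hypothetical parquet file: 512 MB

# Number of HDFS blocks the file occupies (ceiling division).
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
echo "file spans $BLOCKS hdfs blocks"   # more than 1 block triggers the warning

# Per-application override when writing, so each file fits in one block
# (commented out: needs a live HDFS cluster):
# hdfs dfs -D dfs.blocksize=$((512 * 1024 * 1024)) -put file.parq /some/path/
```

Because `-D dfs.blocksize=...` applies only to that one write, the cluster-wide default in hdfs-site.xml stays untouched, which is the point of the per-application advice above.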
    <item>
      <title>Re: Parquet files should not be split into multiple hdfs-blocks problem and strange record count issue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/40627#M24877</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry for the delayed update.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;invalidate metadata;
invalidate metadata tablename;
refresh tablename;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The commands above have solved my problem: the source Parquet tables and the gzipped target tables now have the same record counts in their partitions. I am still getting the "split into multiple hdfs-blocks" warnings, but they appear to have no impact on the record count issue.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;BTW: the link that you provided is very good.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks for your response.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 09 May 2016 15:59:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/40627#M24877</guid>
      <dc:creator>msuluhan</dc:creator>
      <dc:date>2016-05-09T15:59:17Z</dc:date>
    </item>
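The metadata-reset sequence in the reply above can be sketched as a small shell snippet; the table name is a placeholder. The distinction matters: `invalidate metadata` discards Impala's cached catalog entirely (or for one table), while `refresh` reloads only the file and block metadata for an existing table, which is cheaper.

```shell
# Sketch of the fix from the reply above; "importedTable" is illustrative.
TABLE="importedTable"
CMDS="invalidate metadata; invalidate metadata $TABLE; refresh $TABLE;"

# On a real cluster:
# impala-shell -i host03:21000 -q "$CMDS"
echo "$CMDS"
```

A stale catalog after manually copying files into the warehouse directory is consistent with the mismatched counts the original poster saw, which would explain why this sequence resolved the discrepancy.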
    <item>
      <title>Re: Parquet files should not be split into multiple hdfs-blocks problem and strange record count issue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/277729#M24878</link>
      <description>&lt;P&gt;I am not able to open this link:&amp;nbsp;&lt;A href="http://ingest.tips/2015/01/31/parquet-row-group-size/" target="_blank" rel="nofollow noopener noreferrer"&gt;http://ingest.tips/2015/01/31/parquet-row-group-size/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Can you please check and repost it?&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 03:53:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Parquet-files-should-not-be-split-into-multiple-hdfs-blocks/m-p/277729#M24878</guid>
      <dc:creator>hrishi1dypim</dc:creator>
      <dc:date>2019-09-19T03:53:44Z</dc:date>
    </item>
  </channel>
</rss>

