<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Number of files for an ORC table (denormalized and star-schema type) in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/112448#M50818</link>
    <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Let me answer inline:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;1. For
     the fact_rtc_starschema, the no. of files is different for the CTAS copy and
     the CONCAT on the original table, why so?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The files in the original table (44 MB - 76 MB) are smaller than
the HDFS default block size. Both CTAS and CONCAT run as MapReduce-style jobs
(on either the M/R or Tez engine), so in both cases your application will spawn
containers (mappers) according to the tez/mr configuration you have defined.
The number of files can differ because CTAS and CONCAT use different sets of
parameters to control the degree of parallelism.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;2. For
     fact_rtc, irrespective of the CTAS or CONCAT, the no. of files is
     unchanged, why so?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;In this table the original number of files is already 1009,
and all of the files are bigger than the HDFS default block size, so CTAS and
CONCAT won't differ; most probably the resulting files will even hold the same
records as the original files.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;3. Certainly,
     the number of columns has an impact on the size but does a larger size and
     many columns cause a large number of files?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The number of columns affects neither the execution plan nor the
number of files. The number of files is determined by the input split size, the
number of reducers (if the job has a reduce phase), any per-job limit on the
number of files, and the combined-input-split configuration parameters and
related values.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;4. I am
     unsure if 'small no. of large files' should be taken far. Currently, there
     are 4 data nodes, soon there will be 8 and so on. Does this argument hold
     then?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Please refer to the &lt;A href="https://community.hortonworks.com/content/kbentry/75501/orc-creation-best-practices.html"&gt;article&lt;/A&gt;
explaining the ORC creation strategy. The article also references some
additional resources to consider while working with ORC.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;5. I am
     confused about which table is behaving incorrectly. Or are both
     correct?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Both tables are behaving according to your job configuration.
Rather than labelling either correct or incorrect, I would change the
configuration (especially on a cluster of only 4-8 data nodes).&lt;/P&gt;&lt;P&gt;&lt;EM&gt;6. The
     problem started with a query differing drastically in response time on the
     two types of tables. The reduction of the files has improved the
     performance but added more confusion, hence I will post those as a
     separate thread, later.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;That makes sense: fewer files (down to a point) means better
performance. See the article I mentioned above.&lt;/P&gt;&lt;P&gt;Recommendations for your use case:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;If you append data to your
tables over time, avoid CONCAT. After upgrading to a newer Hive/ORC version,
CONCAT will run into ORC files of mixed versions: the job won't fail, but you
will lose some data.&lt;/LI&gt;&lt;LI&gt;If you do decide to use CONCAT, remember that it concatenates files at the stripe level. So if you have small files with small stripes, it won't make the ORC files much more efficient, although it might reduce the number of resources the resource manager has to allocate.&lt;/LI&gt;&lt;LI&gt;Create ORC tables as
described in the article, with the following configuration:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;set hive.hadoop.supports.splittable.combineinputformat=true;
set tez.grouping.min-size=1073741824; -- 1 GB - depends on your container heap size
set tez.grouping.max-size=2147483648; -- 2 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.maxsize=2147483648; -- 2 GB - depends on your container heap size
set hive.merge.smallfiles.avgsize=1073741824;
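-- Optional additions (my own suggestion, not from the article): enable Hive's
-- post-job small-file merge step so each engine compacts its output files:
set hive.merge.mapfiles=true;      -- merge small files from map-only jobs
set hive.merge.mapredfiles=true;   -- merge small files from map-reduce jobs
set hive.merge.tezfiles=true;      -- merge small files when running on Tez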
set hive.exec.reducers.bytes.per.reducer=2147483648;&lt;/PRE&gt;</description>
    <pubDate>Fri, 06 Jan 2017 03:46:05 GMT</pubDate>
    <dc:creator>eberezitsky</dc:creator>
    <dc:date>2017-01-06T03:46:05Z</dc:date>
    <item>
      <title>Number of files for an ORC table (denormalized and star-schema type)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/112447#M50817</link>
      <description>&lt;P&gt;HDP-2.5.0.0 (2.5.0.0-1245) using Ambari 2.4.0.1&lt;/P&gt;&lt;P&gt;There are two variants of a managed ORC table: one takes a denormalized approach and the other is influenced by the traditional star-schema design. At this point in time I don't have the source scripts of the tables, hence cannot share them for clarity. The tables are neither partitioned nor bucketed.&lt;/P&gt;&lt;P&gt;Alarmed by the no. of ORC files the original tables, viz. &lt;STRONG&gt;fact_rtc &lt;/STRONG&gt;and &lt;STRONG&gt;fact_rtc_starschema&lt;/STRONG&gt;, had, I did the following:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Created copies of those tables (CTAS)&lt;/LI&gt;&lt;LI&gt;Executed 'ALTER TABLE ... CONCATENATE;'&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The table below contains the steps and the stats; I am puzzled about the following:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;For the fact_rtc_starschema, the no. of files is different for the CTAS copy and the CONCAT on the original table, why so?&lt;/LI&gt;&lt;LI&gt;For fact_rtc, irrespective of the CTAS or CONCAT, the no. of files is unchanged, why so?&lt;/LI&gt;&lt;LI&gt;Certainly, the number of columns has an impact on the size, but do a larger size and many columns cause a large number of files?&lt;/LI&gt;&lt;LI&gt;I am unsure if 'small no. of large files' should be taken far. Currently there are 4 data nodes, soon there will be 8 and so on. Does this argument hold then?&lt;/LI&gt;&lt;LI&gt;I am confused about which table is behaving incorrectly. Or are both correct?&lt;/LI&gt;&lt;LI&gt;The problem started with a query differing drastically in response time on the two types of tables. The reduction of the files has improved the performance but added more confusion, hence I will post those as a separate thread later.&lt;/LI&gt;&lt;/OL&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Table Name&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Statistics&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Notes&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_rtc&lt;/TD&gt;&lt;TD&gt;Number of columns : 249
Number of rows : 1775017302
Number of files : 1009
Average file size : 245 MB
Raw data size : 6.27 TB
Total Size : 241.05 GB&lt;/TD&gt;&lt;TD&gt;Original table stats&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_rtc_copy&lt;/TD&gt;&lt;TD&gt;Number of columns : 249&lt;P&gt;Number of rows : 1775017302&lt;/P&gt;&lt;P&gt;Number of files : &lt;STRONG&gt;1009&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Average file size : 245 MB&lt;/P&gt;&lt;P&gt;Raw data size : 6.27 TB&lt;/P&gt;&lt;P&gt;Total Size : 241.05 GB&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;CTAS fact_rtc. &lt;P&gt;&lt;STRONG&gt;Note that the number of the files is unchanged&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_RTC_copy&lt;/TD&gt;&lt;TD&gt;Number of columns : 249
Number of rows : 1775017302
Number of files : &lt;STRONG&gt;1009
&lt;/STRONG&gt;Average file size : 245 MB
Raw data size : 6.27 TB
Total Size :  241.05 GB&lt;/TD&gt;&lt;TD&gt;ALTER TABLE fact_RTC_copy CONCATENATE;
&lt;P&gt;&lt;STRONG&gt;Note that the number of the files is unchanged&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_rtc_starschema&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Number of columns : 132&lt;/P&gt;&lt;P&gt;Number of rows : 1775017302&lt;/P&gt;&lt;P&gt;Number of files : 3732&lt;/P&gt;&lt;P&gt;File size range : 44MB - 76MB&lt;/P&gt;&lt;P&gt;Raw data size : 2.31 TB&lt;/P&gt;&lt;P&gt;Total Size : 229.23 GB&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;Original table stats
&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_rtc_starschema_copy&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;Number of columns : 132&lt;/P&gt;&lt;P&gt;Number of rows : 1775017302&lt;/P&gt;&lt;P&gt;Number of files : &lt;STRONG&gt;239&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;File size range : 950MB - 1015MB&lt;/P&gt;&lt;P&gt;Raw data size : 2.31 TB&lt;/P&gt;&lt;P&gt;Total Size : 229.03 GB&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;CTAS fact_rtc_starschema. 
&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Note that the number of the files has changed&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_rtc_starschema&lt;/TD&gt;&lt;TD&gt;Number of columns : 132
Number of rows : 1775017302
Number of files : &lt;STRONG&gt;864
&lt;/STRONG&gt;File size range : 245MB - 335MB
Raw data size : 2.31 TB
Total Size : 239.56 GB
&lt;/TD&gt;&lt;TD&gt;ALTER TABLE fact_RTC_starschema CONCATENATE;

&lt;P&gt;&lt;STRONG&gt;Note that the number of the files has changed&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Wed, 04 Jan 2017 22:50:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/112447#M50817</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2017-01-04T22:50:24Z</dc:date>
    </item>
    <item>
      <title>Re: Number of files for an ORC table (denormalized and star-schema type)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/112448#M50818</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Let me answer inline:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;1. For
     the fact_rtc_starschema, the no. of files is different for the CTAS copy and
     the CONCAT on the original table, why so?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The files in the original table (44 MB - 76 MB) are smaller than
the HDFS default block size. Both CTAS and CONCAT run as MapReduce-style jobs
(on either the M/R or Tez engine), so in both cases your application will spawn
containers (mappers) according to the tez/mr configuration you have defined.
The number of files can differ because CTAS and CONCAT use different sets of
parameters to control the degree of parallelism.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;2. For
     fact_rtc, irrespective of the CTAS or CONCAT, the no. of files is
     unchanged, why so?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;In this table the original number of files is already 1009,
and all of the files are bigger than the HDFS default block size, so CTAS and
CONCAT won't differ; most probably the resulting files will even hold the same
records as the original files.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;3. Certainly,
     the number of columns has an impact on the size but does a larger size and
     many columns cause a large number of files?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The number of columns affects neither the execution plan nor the
number of files. The number of files is determined by the input split size, the
number of reducers (if the job has a reduce phase), any per-job limit on the
number of files, and the combined-input-split configuration parameters and
related values.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;4. I am
     unsure if 'small no. of large files' should be taken far. Currently, there
     are 4 data nodes, soon there will be 8 and so on. Does this argument hold
     then?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Please refer to the &lt;A href="https://community.hortonworks.com/content/kbentry/75501/orc-creation-best-practices.html"&gt;article&lt;/A&gt;
explaining the ORC creation strategy. The article also references some
additional resources to consider while working with ORC.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;5. I am
     confused about which table is behaving incorrectly. Or are both
     correct?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Both tables are behaving according to your job configuration.
Rather than labelling either correct or incorrect, I would change the
configuration (especially on a cluster of only 4-8 data nodes).&lt;/P&gt;&lt;P&gt;&lt;EM&gt;6. The
     problem started with a query differing drastically in response time on the
     two types of tables. The reduction of the files has improved the
     performance but added more confusion, hence I will post those as a
     separate thread, later.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;That makes sense: fewer files (down to a point) means better
performance. See the article I mentioned above.&lt;/P&gt;&lt;P&gt;Recommendations for your use case:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;If you append data to your
tables over time, avoid CONCAT. After upgrading to a newer Hive/ORC version,
CONCAT will run into ORC files of mixed versions: the job won't fail, but you
will lose some data.&lt;/LI&gt;&lt;LI&gt;If you do decide to use CONCAT, remember that it concatenates files at the stripe level. So if you have small files with small stripes, it won't make the ORC files much more efficient, although it might reduce the number of resources the resource manager has to allocate.&lt;/LI&gt;&lt;LI&gt;Create ORC tables as
described in the article, with the following configuration:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;set hive.hadoop.supports.splittable.combineinputformat=true;
set tez.grouping.min-size=1073741824; -- 1 GB - depends on your container heap size
set tez.grouping.max-size=2147483648; -- 2 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.maxsize=2147483648; -- 2 GB - depends on your container heap size
set hive.merge.smallfiles.avgsize=1073741824;
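-- Optional additions (my own suggestion, not from the article): enable Hive's
-- post-job small-file merge step so each engine compacts its output files:
set hive.merge.mapfiles=true;      -- merge small files from map-only jobs
set hive.merge.mapredfiles=true;   -- merge small files from map-reduce jobs
set hive.merge.tezfiles=true;      -- merge small files when running on Tez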
set hive.exec.reducers.bytes.per.reducer=2147483648;&lt;/PRE&gt;</description>
      <pubDate>Fri, 06 Jan 2017 03:46:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/112448#M50818</guid>
      <dc:creator>eberezitsky</dc:creator>
      <dc:date>2017-01-06T03:46:05Z</dc:date>
    </item>
    <item>
      <title>Re: Number of files for an ORC table (denormalized and star-schema type)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/282673#M50819</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Even though we modified the stripe size to a custom value - "orc.stripe.size"="248435456" - there are many files that are still only 5 MB or 9 MB. Any reason for this behavior?&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2019 20:58:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-files-for-an-ORC-table-denormalized-and-star/m-p/282673#M50819</guid>
      <dc:creator>Bharath5</dc:creator>
      <dc:date>2019-11-11T20:58:21Z</dc:date>
    </item>
  </channel>
</rss>

