
Number of files for an ORC table (denormalized and star-schema type)

Super Collaborator

HDP-2.5.0.0 (2.5.0.0-1245) using Ambari 2.4.0.1

There are two variants of a managed ORC table: one follows a denormalized approach and the other is influenced by the traditional star-schema design. At this point I don't have the source scripts for the tables, so I cannot share them for clarity. The tables are neither partitioned nor bucketed.

Alarmed by the number of ORC files the original tables, fact_rtc and fact_rtc_starschema, had, I did the following:

  • Created copies of those tables (CTAS)
  • Executed 'ALTER TABLE ... CONCATENATE;' (both steps are sketched below)
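
For reference, the two steps looked roughly like the following. This is only an illustrative sketch using the table names from the stats below, not the actual scripts (which are not available):

-- Illustrative only; the actual DDL was not shared.
CREATE TABLE fact_rtc_copy STORED AS ORC AS SELECT * FROM fact_rtc;  -- CTAS copy
ALTER TABLE fact_rtc_copy CONCATENATE;  -- merge the ORC files of the copy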

The table below contains the steps and the stats. I am puzzled about the following:

  1. For fact_rtc_starschema, why is the number of files different for the CTAS copy versus CONCATENATE on the original table?
  2. For fact_rtc, why is the number of files unchanged, irrespective of CTAS or CONCATENATE?
  3. Certainly the number of columns has an impact on size, but do a larger size and many columns cause a large number of files?
  4. I am unsure how far the 'small number of large files' argument should be taken. Currently there are 4 data nodes, soon there will be 8, and so on. Does the argument still hold then?
  5. I am confused about which table is behaving incorrectly, or whether both are correct.
  6. The problem started with a query whose response time differed drastically between the two types of tables. Reducing the number of files has improved performance but added more confusion, so I will post those questions in a separate thread later.
Table Name: fact_rtc
Statistics: Number of columns: 249; Number of rows: 1775017302; Number of files: 1009; Average file size: 245 MB; Raw data size: 6.27 TB; Total size: 241.05 GB
Notes: Original table stats

Table Name: fact_rtc_copy
Statistics: Number of columns: 249; Number of rows: 1775017302; Number of files: 1009; Average file size: 245 MB; Raw data size: 6.27 TB; Total size: 241.05 GB
Notes: CTAS of fact_rtc. Note that the number of files is unchanged.

Table Name: fact_RTC_copy
Statistics: Number of columns: 249; Number of rows: 1775017302; Number of files: 1009; Average file size: 245 MB; Raw data size: 6.27 TB; Total size: 241.05 GB
Notes: ALTER TABLE fact_RTC_copy CONCATENATE; Note that the number of files is unchanged.

Table Name: fact_rtc_starschema
Statistics: Number of columns: 132; Number of rows: 1775017302; Number of files: 3732; File size range: 44 MB - 76 MB; Raw data size: 2.31 TB; Total size: 229.23 GB
Notes: Original table stats

Table Name: fact_rtc_starschema_copy
Statistics: Number of columns: 132; Number of rows: 1775017302; Number of files: 239; File size range: 950 MB - 1015 MB; Raw data size: 2.31 TB; Total size: 229.03 GB
Notes: CTAS of fact_rtc_starschema. Note that the number of files has changed.

Table Name: fact_rtc_starschema
Statistics: Number of columns: 132; Number of rows: 1775017302; Number of files: 864; File size range: 245 MB - 335 MB; Raw data size: 2.31 TB; Total size: 239.56 GB
Notes: ALTER TABLE fact_RTC_starschema CONCATENATE; Note that the number of files has changed.

1 ACCEPTED SOLUTION

Super Collaborator

Hi @Kaliyug Antagonist,

Let me answer inline:

1. For fact_rtc_starschema, why is the number of files different for the CTAS copy versus CONCATENATE on the original table?

The files in the original table are smaller than the HDFS default block size (44 MB - 76 MB). Both CTAS and CONCATENATE are MapReduce-style jobs (run with either the MR or Tez engine), so in both cases the application will spawn containers (mappers) according to the Tez/MR configuration you have defined. The number of files can differ because CTAS and CONCATENATE use different sets of parameters to control the degree of parallelism.
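
For example, the resulting file count and total size can be cross-checked from the table's basic statistics; a minimal sketch (any of the tables above can be substituted):

-- Refresh basic statistics, then read numFiles / totalSize
-- from the "Table Parameters" section of the output.
ANALYZE TABLE fact_rtc_starschema_copy COMPUTE STATISTICS;
DESCRIBE FORMATTED fact_rtc_starschema_copy;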

2. For fact_rtc, why is the number of files unchanged, irrespective of CTAS or CONCATENATE?

In this table the original number of files is already 1009, and all files are bigger than the HDFS default block size, so CTAS and CONCATENATE make no difference; most probably the resulting files will even contain the same records as the original files.
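
If you want to confirm the block size that comparison is made against, it can be printed from the Hive session; a small sketch (on HDP 2.5 the default is typically 128 MB):

-- Show the effective HDFS block size visible to the session
SET dfs.blocksize;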

3. Certainly the number of columns has an impact on size, but do a larger size and many columns cause a large number of files?

The number of columns impacts neither the execution plan nor the number of files. The number of files is determined by the input split size, the number of reducers (if the job has a reduce stage), any per-job limit on the number of files (if such a restriction is defined), and the combined-input-split configuration parameter (and related settings).
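
To see what a given session would currently use, these parameters can simply be printed; a short sketch using the parameter names from the recommendations below:

-- Print the settings that drive split grouping and reducer sizing
SET tez.grouping.min-size;
SET tez.grouping.max-size;
SET mapreduce.input.fileinputformat.split.minsize;
SET mapreduce.input.fileinputformat.split.maxsize;
SET hive.exec.reducers.bytes.per.reducer;
SET hive.merge.smallfiles.avgsize;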

4. I am unsure how far the 'small number of large files' argument should be taken. Currently there are 4 data nodes, soon there will be 8, and so on. Does the argument still hold then?

Please refer to the article explaining the ORC creation strategy. The article also references some additional resources to consider while working with ORC.

5. I am confused about which table is behaving incorrectly, or whether both are correct.

Both tables are behaving according to your job configurations. Whether that is correct or incorrect is another matter; I would change the configuration (especially on a cluster of 4-8 data nodes).

6. The problem started with a query whose response time differed drastically between the two types of tables. Reducing the number of files has improved performance but added more confusion, so I will post those questions in a separate thread later.

That makes sense. Fewer files (up to a point) means better performance. See the article I mentioned above.

Recommendations for your use case:

  • If you append data to your tables over time, avoid using CONCATENATE. After upgrading to a new Hive/ORC version you will have ORC files of different versions, and CONCATENATE has issues with that: the job won't fail, but you may lose some data.
  • If you decide to use CONCATENATE, remember that it concatenates files at the stripe level. So if you have small files with small stripes, it won't make the ORC files much more efficient, although it might reduce the number of resources the resource manager has to allocate.
  • Create ORC tables as described in the article with the following configuration:
set hive.hadoop.supports.splittable.combineinputformat=true;
set tez.grouping.min-size=1073741824; -- 1 GB - depends on your container heap size
set tez.grouping.max-size=2147483648; -- 2 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.maxsize=2147483648; -- 2 GB - depends on your container heap size
set hive.merge.smallfiles.avgsize=1073741824;
set hive.exec.reducers.bytes.per.reducer=2147483648;
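
One way to apply these settings to existing data is to rewrite it into a fresh ORC table in the same session; a minimal sketch using a hypothetical target name and the table names from the post (not necessarily the exact approach from the referenced article):

-- Rewrite the star-schema fact table; the new files are sized by the
-- split/reducer settings above rather than by the original layout.
-- fact_rtc_starschema_rewritten is a hypothetical name.
CREATE TABLE fact_rtc_starschema_rewritten STORED AS ORC
AS SELECT * FROM fact_rtc_starschema;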


2 REPLIES


New Contributor

Hi,

Even though we modified the stripe size to a custom value ("orc.stripe.size"="248435456"), there are many files that are still 5 MB or 9 MB. Any reason for this behavior?
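
For reference, that property is usually applied as an ORC table property at creation time; a minimal sketch with a hypothetical table name:

-- Hypothetical table; "orc.stripe.size" sets the target stripe size in bytes
-- within each ORC file (it is not a minimum output file size).
CREATE TABLE my_orc_table (id INT, val STRING)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="248435456");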