Created 01-04-2017 02:50 PM
HDP-2.5.0.0 (2.5.0.0-1245) using Ambari 2.4.0.1
There are two variants of a managed ORC table: one follows a denormalized approach and the other is influenced by a traditional star-schema design. At this point I don't have the source scripts of the tables, so I cannot share them for clarity. The tables are neither partitioned nor bucketed.
Alarmed by the number of ORC files the original tables (fact_rtc and fact_rtc_starschema) had, I did the following.
The table below contains the steps and the stats (the commands are sketched below the table); I am puzzled about the following:
| Table Name | Statistics | Notes |
| --- | --- | --- |
| fact_rtc | Number of columns: 249, Number of rows: 1775017302, Number of files: 1009, Average file size: 245 MB, Raw data size: 6.27 TB, Total size: 241.05 GB | Original table stats |
| fact_rtc_copy | Number of columns: 249, Number of rows: 1775017302, Number of files: 1009, Average file size: 245 MB, Raw data size: 6.27 TB, Total size: 241.05 GB | CTAS of fact_rtc. Note that the number of files is unchanged |
| fact_rtc_copy | Number of columns: 249, Number of rows: 1775017302, Number of files: 1009, Average file size: 245 MB, Raw data size: 6.27 TB, Total size: 241.05 GB | ALTER TABLE fact_rtc_copy CONCATENATE; note that the number of files is unchanged |
| fact_rtc_starschema | Number of columns: 132, Number of rows: 1775017302, Number of files: 3732, File size range: 44 MB - 76 MB, Raw data size: 2.31 TB, Total size: 229.23 GB | Original table stats |
| fact_rtc_starschema_copy | Number of columns: 132, Number of rows: 1775017302, Number of files: 239, File size range: 950 MB - 1015 MB, Raw data size: 2.31 TB, Total size: 229.03 GB | CTAS of fact_rtc_starschema. Note that the number of files changed |
| fact_rtc_starschema | Number of columns: 132, Number of rows: 1775017302, Number of files: 864, File size range: 245 MB - 335 MB, Raw data size: 2.31 TB, Total size: 239.56 GB | ALTER TABLE fact_rtc_starschema CONCATENATE; note that the number of files changed |
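A rough sketch of the commands behind these steps (the exact source DDL is not available, so this is a reconstruction of the pattern, not the original scripts):

-- CTAS copies of the two tables (column lists omitted; SELECT * assumed):
CREATE TABLE fact_rtc_copy STORED AS ORC AS SELECT * FROM fact_rtc;
CREATE TABLE fact_rtc_starschema_copy STORED AS ORC AS SELECT * FROM fact_rtc_starschema;
-- File merge run on the non-partitioned, non-bucketed tables:
ALTER TABLE fact_rtc_copy CONCATENATE;
ALTER TABLE fact_rtc_starschema CONCATENATE;
-- Stats such as numFiles, numRows, rawDataSize and totalSize were read back with:
DESCRIBE FORMATTED fact_rtc_starschema;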
Created 01-05-2017 07:46 PM
Let me answer inline:
1. For fact_rtc_starschema, the number of files differs between the CTAS copy and the CONCATENATE on the original table. Why so?
The files in the original table are smaller than the HDFS default block size (44 MB - 76 MB). Both CTAS and CONCATENATE run as distributed jobs (on either the MR or Tez engine), so in both cases the application spawns containers (mappers) according to the Tez/MR configuration you have defined. The number of files can differ because CTAS and CONCATENATE use different sets of parameters to control the degree of parallelism; see the sketch below.
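As an illustration, on the Tez engine the mapper parallelism of a CTAS (and hence the file count of a map-only insert) broadly follows split grouping, while CONCATENATE runs as a separate merge job governed by its own split settings. The values below are placeholders, not recommendations:

-- Split grouping used by a CTAS running on Tez; larger groups -> fewer tasks -> fewer output files.
set tez.grouping.min-size=268435456;   -- 256 MB (illustrative value)
set tez.grouping.max-size=1073741824;  -- 1 GB (illustrative value)

The equivalent knobs for the MapReduce path (and for the merge job behind CONCATENATE) are the mapreduce.input.fileinputformat.split.* settings shown in the recommendations further down.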
2. For fact_rtc, the number of files is unchanged regardless of CTAS or CONCATENATE. Why so?
In this table, the original number of files is already 1009, and all files are bigger than the HDFS default block size, so CTAS and CONCATENATE behave the same; most probably the resulting files even contain the same records as the original files.
3. Certainly the number of columns has an impact on size, but do a larger size and many columns cause a large number of files?
The number of columns does not affect the execution plan or the number of files. The number of files is determined by the input split size, the number of reducers (if the job has a reduce stage), any per-job limit on the number of files (if such a restriction is defined), and the combined-input-split configuration parameters and related values; the sketch below maps these factors to concrete properties.
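For reference, a hedged mapping of those factors to the usual property names (standard Hive/MapReduce settings; the values shown are illustrative only, not recommendations):

set mapreduce.input.fileinputformat.split.maxsize=1073741824;  -- input split size
set hive.exec.reducers.bytes.per.reducer=1073741824;           -- drives the estimated reducer count on reduce-side jobs
set hive.exec.max.created.files=100000;                        -- per-job cap on created files (exceeding it fails the job rather than repartitioning)
set hive.hadoop.supports.splittable.combineinputformat=true;   -- combined input splits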
4. I am unsure how far the 'small number of large files' principle should be taken. Currently there are 4 data nodes, soon there will be 8, and so on. Does the argument still hold then?
Please refer to the article explaining ORC creation strategy. The article also has references to some additional resources to consider while working with ORC.
5. I am confused about which table is behaving incorrectly. Or are both correct?
Both tables are behaving according to your job configurations. Correct or incorrect, I would change the configuration (especially on a cluster of only 4-8 data nodes).
6. The problem started with a query whose response time differed drastically between the two types of tables. Reducing the number of files has improved performance but added more confusion, so I will post those questions in a separate thread later.
That makes sense. Fewer files (down to a point) means better performance. See the article I mentioned above.
Recommendations for your use case:
set hive.hadoop.supports.splittable.combineinputformat=true;
set tez.grouping.min-size=1073741824; -- 1 GB - depends on your container heap size
set tez.grouping.max-size=2147483648; -- 2 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.maxsize=2147483648; -- 2 GB - depends on your container heap size
set hive.merge.smallfiles.avgsize=1073741824;
set hive.exec.reducers.bytes.per.reducer=2147483648;
Created 11-11-2019 12:58 PM
Hi,
Even though we modified the stripe size to a custom value ("orc.stripe.size"="248435456"), there are still many files of only 5 MB or 9 MB. Any reason for this behavior?
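For reference, a table-level ORC property like this is normally applied through TBLPROPERTIES, either at CREATE TABLE time or afterwards; a minimal sketch with a hypothetical table name (the setting only affects data written after the change, existing files are not rewritten):

-- Hypothetical table name; the property value matches the one quoted above.
ALTER TABLE my_orc_table SET TBLPROPERTIES ("orc.stripe.size"="248435456");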