Created 10-29-2017 03:45 PM
I have loaded around 214 MB of data into a Hive ACID table partitioned by year, month, day, and hour through a merge query. However, when I launch a simple count(*) query it takes 3645 mappers, while the same data in a non-transactional Hive table takes just 12 mappers. Is this expected behavior? I followed the steps from here.
Created 10-29-2017 06:53 PM
Yes, that's expected. With ACID properties enabled on the table, there will be a lot of delta files (3645 in your case) in the HDFS directory.
You can check the file counts by using
bash# hadoop fs -count -v -t <table-location>
Each mapper loads one file, which is why 3645 mappers are launched.
If there are a lot of delta files in the directory, you need to run a major or minor compaction to reduce the number of mappers launched. A compaction takes a set of existing delta files and rewrites them into a single file.
Types of compactions in Hive:-
1. Minor compaction:- A 'minor' compaction takes all the delta files and rewrites them into a single delta file. This compaction won't take many resources.
hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'minor';
Example:-
Here par_buk is the table name, dat is the partition column (with 10 buckets), and the partition has 1 base file and 3 delta files.
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 4 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724389
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724390_314724390
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724391_314724391
hive# alter table par_buk partition(dat='2017-10-09_12') compact 'minor'; // a minor compaction takes all the delta files and rewrites them into a single delta file
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx - hdfs 0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391
As you can see, the minor compaction rewrote all the delta files into a single delta file.
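The name of the compacted directory follows from the inputs: a delta directory is named delta_<start-txn>_<end-txn>, and a minor compaction produces a delta spanning the lowest start ID to the highest end ID of its inputs. A minimal sketch of that naming rule (illustrative only; the real logic lives inside Hive's compactor):

```python
import re

def minor_compact_name(deltas):
    """Given delta directory names like delta_<start>_<end>, return the name
    of the single delta directory a minor compaction would produce: it spans
    the lowest start txn ID to the highest end txn ID among the inputs."""
    ranges = []
    for d in deltas:
        m = re.fullmatch(r"delta_(\d+)_(\d+)", d)
        if not m:
            raise ValueError(f"not a delta directory name: {d}")
        ranges.append((int(m.group(1)), int(m.group(2))))
    lo = min(start for start, _ in ranges)
    hi = max(end for _, end in ranges)
    return f"delta_{lo}_{hi}"

print(minor_compact_name([
    "delta_314724389_314724389",
    "delta_314724390_314724390",
    "delta_314724391_314724391",
]))  # delta_314724389_314724391
```

This matches the listing above, where three single-transaction deltas were rewritten as delta_314724389_314724391.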
2. Major compaction:- A 'major' compaction takes one or more delta files (as in a minor compaction) plus the base file for the bucket and rewrites them into a new base file per bucket. Major compaction is more expensive but more effective.
This compaction can take minutes to hours and can consume a lot of disk, network, memory, and CPU resources, so it should be invoked carefully.
hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'major';
Example:-
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx - hdfs 0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391
hive# alter table par_buk partition(dat='2017-10-09_12') compact 'major'; // a major compaction takes the delta files and the base file and rewrites them into a single new base file
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 1 items
drwxrwxrwx - hdfs 0 2017-10-29 14:34 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724391
As you can see, the major compaction rewrote the base file and the delta file into a new base file per bucket.
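The new base directory's name follows a similar rule: base_<N>, where N is the highest transaction ID covered by the old base and the deltas being compacted. A small sketch of that rule (again illustrative, not Hive's actual code):

```python
import re

def major_compact_name(base, deltas):
    """Return the base directory name a major compaction would produce:
    base_<N>, where N is the highest txn ID among the old base and all
    delta directories (delta_<start>_<end>)."""
    hi = int(re.fullmatch(r"base_(\d+)", base).group(1))
    for d in deltas:
        m = re.fullmatch(r"delta_(\d+)_(\d+)", d)
        hi = max(hi, int(m.group(2)))
    return f"base_{hi}"

print(major_compact_name("base_314724388", ["delta_314724389_314724391"]))
# base_314724391
```

This matches the listing above, where base_314724388 and delta_314724389_314724391 were rewritten as base_314724391.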
If you want to see the status of compactions you can use
hive# show compactions;
So once you run a compaction, all delta files are rewritten into a single file and far fewer mappers are launched. Compactions can significantly improve query performance.
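Note that compactions can also run automatically: the Hive metastore's compactor initiator periodically checks partitions against configurable thresholds and queues minor or major compactions on its own. A hedged hive-site.xml sketch with the relevant properties (property names as documented for Hive's transaction support; check the defaults for your Hive version):

```xml
<!-- Run the compactor initiator thread in the metastore -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<!-- Number of worker threads available to execute compactions -->
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<!-- Queue a minor compaction once this many delta directories accumulate -->
<property>
  <name>hive.compactor.delta.num.threshold</name>
  <value>10</value>
</property>
<!-- Queue a major compaction once delta size exceeds this fraction of base size -->
<property>
  <name>hive.compactor.delta.pct.threshold</name>
  <value>0.1</value>
</property>
```

With the initiator on and worker threads available, partitions that accumulate many deltas (as in your 3645-mapper case) get compacted without manual ALTER TABLE ... COMPACT statements.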
Created 10-30-2017 07:02 AM
Thanks a lot for your time.