Support Questions

sushil416 · ‎10-29-2017

I have loaded just around 214MB worth of data with the year, month, day and hour wise partitioned Hive ACID table through a merge query, however when I launch a simple count(*) query its taking 3645 mappers, which the same data on a non-transactional table on Hive just takes 12 mappers. Is this an expected behavior? I followed the steps from here.

Shu_ashu · ‎10-29-2017

@Sushil Ks

Yes, that's expected because if you are having ACID properties enabled on the table, then there will be lot of delta files(3645) in HDFS directory.

you can check files by using

bash# hadoop fs -count -v -t <table-location>

Each mapper gets will load 1 file so that is the reason why there are 3645 mappers are launched.

If there are lot of delta files in the directory you need to run Major or minor compactions, to reduce number of mappers are launched. These compactions takes a set of existing delta files and rewrites them to a single delta file

Types of Compactions in hive:-

1.Minor Compaction:-A ‘minor’ compaction will takes all the delta files and rewrites them to single delta file. This compaction wont take much resources.

hive#alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'minor';

Example:-

Here par_buk is the table name having dat is the partition column into 10 buckets and having 1 base file and 3 delta files.

bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 4 items
drwxrwxrwx   -  hdfs          0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxr-xr-x   -  hdfs          0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724389
drwxr-xr-x   -  hdfs          0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724390_314724390
drwxr-xr-x   -  hdfs          0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724391_31472439

hive# alter table par_buk partition(dat='2017-10-09_12') compact 'minor'; //minor compaction gets all the delta files and rewrites them to single delta file

bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx   -  hdfs          0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx   -  hdfs          0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391

As you can see all delta files are rewritten to single delta file in minor compaction.

2.Major Compaction:-A ‘major’ compaction will takes one or more delta files(same as minor compaction) and the base file for the bucket and rewrites them into a new base file per bucket. Major compaction is more expensive but is more effective.

This compaction can take minutes to hours and can consume a lot of disk, network, memory and CPU resources, so they should be invoked carefully.

hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'major';

Example:-

bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx   -  hdfs          0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx   -  hdfs          0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391

hive# alter table par_buk partition(dat='2017-10-09_12') compact 'major'; //major compaction gets all the delta files, base files and rewrites them to single new base file.

bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 1 items
drwxrwxrwx   -  hdfs          0 2017-10-29 14:34 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724391

As you can see major compaction has rewritten base file and delta file to new base file per bucket.

If you want to see the status of compactions you can use

hive# show compactions;

So once you run Compactions all delta files are rewritten to single file, then there will be less number of mappers are launched. These Compactions helps you to significantly increase query performance.

View solution in original post

Shu_ashu · ‎10-29-2017