When we use the HCatalog APIs in MapReduce (the HCatalog InputFormat and OutputFormat) to select data from Hive, the CPU time used by the jobs shoots up drastically, by nearly 80% compared with the same job run without HCatalog.
The MapReduce job does the same work as a normal Hive query: select date, count(1) from table group by date;
Are there any specific reasons for this? Please let me know. Thanks!
I would expect a higher CPU load when using HCatalog, but not that much higher. Can you check whether the tasks are spending a lot of time in garbage collection (GC)? If they are, you might need to give the mappers/reducers a little more memory to compensate for the extra work they are doing.
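As a rough way to do that check, you can compare two of the job's task counters, usually shown as "GC time elapsed (ms)" and "CPU time spent (ms)" (the exact counter names may vary with your Hadoop version). The sketch below is plain Java with no Hadoop dependency; the class name `GcShare`, the method `gcFraction`, and the sample numbers are made up for illustration, so substitute the values from your own job's counter page.

```java
public class GcShare {
    /**
     * Ratio of GC time to CPU time for a task or job.
     * Returns 0 when no CPU time was recorded, to avoid dividing by zero.
     */
    public static double gcFraction(long gcMillis, long cpuMillis) {
        if (cpuMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / cpuMillis;
    }

    public static void main(String[] args) {
        // Hypothetical values, copied by hand from the job's counter page.
        long gcMillis = 12_000;   // "GC time elapsed (ms)"
        long cpuMillis = 60_000;  // "CPU time spent (ms)"

        double percent = 100.0 * gcFraction(gcMillis, cpuMillis);
        System.out.printf("GC time relative to CPU time: %.1f%%%n", percent);
    }
}
```

GC time is measured as elapsed time and CPU time is not, so treat the ratio as an indicator only. If it is clearly higher for the HCatalog job than for the plain one, raising the task heap is worth trying, for example through mapred.child.java.opts (or mapreduce.map.java.opts and mapreduce.reduce.java.opts on newer Hadoop releases).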