Hi, I have some Hive SQL code that takes 4 hours to run. We only have the option of MapReduce; any ideas on how to optimise it would be very welcome.
Here is the code...
FROM (SELECT ca_ref
            ,MAX(tech_start_date) AS tech_start_date
            ,MAX(tech_end_date) AS tech_end_date
      WHERE line_item_type_cd = ' '
      GROUP BY ca_ref) V
JOIN analytics_pqacs1.DOCUMENT_DET_f3 SR ON SR.ca_ref = V.ca_ref
  AND SR.Contract_Account_Document_Id = V.Contract_Account_Document_Id
  AND SR.repetition_Item_Num = V.repetition_Item_Num
  AND SR.Contract_Account_Sub_Item_Num = V.Contract_Account_Sub_Item_Num
  AND SR.Contract_Account_Item_Num = V.Contract_Account_Item_Num
  AND SR.tech_start_date = V.tech_start_date
  AND SR.tech_end_date = V.tech_end_date
For starters, we need more details:
- Hive version
- How is the data stored? Text, SequenceFile, ORC, etc?
- Is your table partitioned, bucketed, etc?
- Do you run it over Tez?
- If you run it over Tez, is vectorization enabled?
- What are container characteristics: RAM and cores?
- How much data is in these tables, in rows and bytes? You do full table scans, and size matters not only as rows but also as bytes
- Are the fields used for the join non-string (all of them)? They seem to be.
- What does the Resource Manager UI indicate when you run this query? How many containers are used? How much RAM and how many cores?
- When this query runs, look at the Tez view to see how many mappers and reducers are executed. Please share that info
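Several of the items above can be checked directly from a Hive CLI or Beeline session. The properties below are standard Hive settings, but confirm they exist in your Hive version:

```sql
-- Inspect the current engine and optimization settings
-- (standard Hive properties; availability depends on Hive version).
SET hive.execution.engine;               -- mr or tez
SET hive.vectorized.execution.enabled;   -- vectorization (needs Tez + ORC)
SET hive.auto.convert.join;              -- map-side (broadcast) join conversion
SET hive.exec.parallel;                  -- run independent stages in parallel

-- The execution plan shows the number of MR stages and the join strategy.
EXPLAIN
SELECT ...;   -- your query here
```

With only MapReduce available, the EXPLAIN output is the quickest way to see how many MR jobs the query spawns and whether the join is done map-side or as a shuffle join.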
Agree with Constantin: we need all that information, or the explain plan for the query, to determine the exact nature of the problem. But without knowing any of that, you could try removing the aggregate functions such as MAX from the subquery, getting the entire data set in one step, and then performing the MAX aggregation on the joined result.
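A sketch of that idea is below. The question does not show which table the inner SELECT reads from, so `inner_table` here is a hypothetical placeholder:

```sql
-- Sketch of the suggestion above: join the raw rows first, aggregate once
-- at the end. "inner_table" is a hypothetical placeholder -- the question
-- does not show the inner SELECT's source table.
SELECT V.ca_ref
      ,MAX(V.tech_start_date) AS tech_start_date
      ,MAX(V.tech_end_date)   AS tech_end_date
FROM inner_table V
JOIN analytics_pqacs1.DOCUMENT_DET_f3 SR
  ON SR.ca_ref = V.ca_ref
 AND SR.Contract_Account_Document_Id = V.Contract_Account_Document_Id
 AND SR.repetition_Item_Num = V.repetition_Item_Num
 AND SR.Contract_Account_Sub_Item_Num = V.Contract_Account_Sub_Item_Num
 AND SR.Contract_Account_Item_Num = V.Contract_Account_Item_Num
WHERE V.line_item_type_cd = ' '
GROUP BY V.ca_ref;
```

One caveat: the original query also joins on tech_start_date and tech_end_date against the aggregated maxima, i.e. it appears to select only the rows matching the latest dates per ca_ref. If that filter is required, this rewrite changes the semantics and the aggregation has to stay in the subquery.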