Created 07-21-2016 06:43 PM
I have a few steps in my ETL process. In some of the middle steps I join one large table with another small table. The performance is great (under 2 minutes) when I only select a few columns from the big table. However, once I increase the number of columns (almost all columns from the big table), the performance suffers badly. The tables are compressed using zlib, partitioned, and bucketed. The select statement filters on the partitioned and bucketed fields. Yes, we have already tried map-side join, bucket join, sorted bucket join, etc.
So clearly the number of columns selected from ORC has an impact on performance. Few columns and performance is awesome; more columns and performance is not so good. Any interesting workarounds? Should I be using a different table format... Avro (not a fan) or SequenceFile...?
Created 07-21-2016 07:01 PM
As I understand, you have a small table and a big table. It is a good practice, with and without ORC, to provide a hint that will stream a table in your kind of join. Have you tried that?
Check this:
https://www.linkedin.com/pulse/20141002060036-90038370-hive-join-optimization-stream-table-in-joins
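For reference, a hint like the one the article describes would look like this (the table names `big_t` and `small_t` are hypothetical stand-ins for your tables):

```sql
-- Tell Hive to stream big_t through the reducers instead of
-- buffering it; the other table in the join is buffered.
SELECT /*+ STREAMTABLE(big_t) */ big_t.id, big_t.val, small_t.descr
FROM big_t
JOIN small_t ON (big_t.id = small_t.id);
```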
Created 07-21-2016 07:48 PM
@Constantin Stanca my understanding is that this happens by default when using map-side joins. Is that not the case?
Created 07-21-2016 07:54 PM
OK, I found the documentation. Streaming the last table is the default behavior in a map/reduce join:
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers, whereas the others are buffered.
However, it must be the last table in the sequence, so if it is not, your suggestion would be helpful. In this situation, though, the smallest table is already the last table in the sequence.
Created 07-21-2016 08:00 PM
Yes. You already applied a map-side join, which does pretty much the same thing, and you also seem to have the small table last in the sequence. I assume that your small table fits in memory. My take is that when you select so many columns (almost all), you practically deny yourself the benefit of the columnar format. Have you gathered statistics? Without statistics the CBO will not propose the best plan. What does the EXPLAIN PLAN say?
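Gathering statistics and checking the plan can be sketched as follows (table, column, and partition names are hypothetical):

```sql
-- Table- and column-level statistics feed the cost-based optimizer (CBO)
ANALYZE TABLE big_t PARTITION (ds='2016-07-21') COMPUTE STATISTICS;
ANALYZE TABLE big_t PARTITION (ds='2016-07-21') COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the plan the optimizer chose for the join
EXPLAIN
SELECT big_t.id, small_t.descr
FROM big_t
JOIN small_t ON (big_t.id = small_t.id);
```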
Created 07-25-2016 03:52 AM
@Benjamin Leonhardi well, now we're getting somewhere. I am going to try hive.tez.exec.print.summary to find some additional analysis points.
But this goes back to my main question: does ORC make sense in this use case when I want to fetch all (or almost all) columns?
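For anyone following along, that setting is enabled per session before running the query (tables are hypothetical):

```sql
-- Print per-vertex Tez counters (rows read, spilled records, task
-- timings) after the query finishes, useful for spotting the slow stage
SET hive.tez.exec.print.summary=true;

SELECT big_t.id, small_t.descr
FROM big_t
JOIN small_t ON (big_t.id = small_t.id);
```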
Created 07-25-2016 09:37 AM
Short answer: theoretically ORC ALWAYS makes sense, just less so than when you read only a subset of columns; then it's no question.
- It's stored in a protobuf format, so parsing the data is much faster than deserializing strings.
- It enables vectorization: if you aggregate on a column, ORC can read 10,000 rows at a time and aggregate them all in one go, much better than parsing one row at a time.
- And features like predicate pushdown help if you have WHERE conditions.
Once you read all columns there is no magic anymore; it will take some time. I would focus on the query analysis and try to identify any bottlenecks, but my guess would be that ORC is still your best bet.
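A sketch of the session settings that switch on the ORC features mentioned above (the values shown are the ones to enable; your cluster defaults may differ):

```sql
-- Vectorized execution: operators process a batch of rows per call
-- instead of one row at a time
SET hive.vectorized.execution.enabled=true;

-- Predicate pushdown: let the ORC reader skip stripes and row groups
-- that cannot match the WHERE condition, using its min/max indexes
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;
```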