ORC query on all columns

Master Guru

I have a few steps in my ETL process. In some steps in the middle of the process I do a join between one large table and another, small table. The performance is GREAT (under 2 minutes) when I only use a few columns from the big table. However, once I increase the number of columns (to almost all columns from the big table) the performance suffers BIG TIME. The tables are compressed using zlib, and partitioned and bucketed. The select statement filters on the partitioned and bucketed fields. And yes, we have tried map-side join, bucket join, sorted bucket join, etc.

So clearly the number of columns selected from ORC has an impact on performance. Few columns and performance is awesome; more and performance is not so good. Any interesting workarounds? Should I be using a different table format... Avro (not a fan) or SequenceFile...?
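To make the shape of the problem concrete, the setup looks roughly like this (table names, columns, and bucket count are illustrative, not the real schema):

    CREATE TABLE big_table (
      id BIGINT,
      col1 STRING,
      col2 STRING
      -- ... many more columns
    )
    PARTITIONED BY (ds STRING)
    CLUSTERED BY (id) INTO 32 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB');

    -- fast: only a few columns from the big table
    SELECT b.id, b.col1, s.name
    FROM big_table b
    JOIN small_table s ON (b.id = s.id)
    WHERE b.ds = '2016-01-01';

    -- slow: nearly all columns from the big table
    SELECT b.*, s.name
    FROM big_table b
    JOIN small_table s ON (b.id = s.id)
    WHERE b.ds = '2016-01-01';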


7 REPLIES

Super Guru
@Sunile Manjee

As I understand it, you have a small table and a big table. It is a good practice, with or without ORC, to provide a hint that streams a table in this kind of join. Have you tried that?

Check this:

https://www.linkedin.com/pulse/20141002060036-90038370-hive-join-optimization-stream-table-in-joins
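For illustration, a STREAMTABLE hint looks like this (big_table/small_table are placeholder names, not the real ones):

    -- tell Hive to stream the big table rather than buffer it
    SELECT /*+ STREAMTABLE(b) */ b.id, b.col1, s.name
    FROM small_table s
    JOIN big_table b ON (s.id = b.id);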

Master Guru

@Constantin Stanca my understanding is that this happens by default when using map-side joins. Is that not the case?

Master Guru

Ok, I found the documentation. Streaming is applied by default in joins:

In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered.

However, it must be the last table in the sequence. So if it is not, your suggestion would be helpful. In this situation, though, the smallest table is already the last table in the sequence.
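A minimal sketch of that documented default (placeholder names again): the table you want streamed should come last in the sequence.

    -- big_table, being last in the sequence, is the one streamed
    SELECT s.name, b.col1
    FROM small_table s
    JOIN big_table b ON (s.id = b.id);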

Super Guru

@Sunile Manjee

Yes. You already applied a map-side join, which does pretty much the same thing, and you also seem to have the small table last in the sequence. I assume that your small table fits in memory. My take is that when you use so many columns (almost all) in the join, you practically negate the benefit of the columnar format. Have you gathered statistics? Without statistics the CBO will not propose the best plan. What does the EXPLAIN plan say?
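If statistics have not been gathered yet, that would look something like this (placeholder names; exact syntax varies a little across Hive versions):

    -- basic and column-level statistics for the CBO
    ANALYZE TABLE big_table PARTITION (ds) COMPUTE STATISTICS;
    ANALYZE TABLE big_table PARTITION (ds) COMPUTE STATISTICS FOR COLUMNS;

    -- then check what the optimizer actually plans
    EXPLAIN
    SELECT b.id, s.name
    FROM big_table b
    JOIN small_table s ON (b.id = s.id);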

Master Guru

First regarding ORC:

It is a columnar storage format, so it only reads the columns you need. So yes: fewer columns good, more columns bad. However, it is still better than a flat file, which reads all columns all the time and is not stored as efficiently (protobuf format, vectorized access, ...). But it's not magic.

So the question, if you see big performance hits, is whether the join is correct. Normally the CBO already does a decent job of figuring that out if you have statistics, as Constantin says. So that is the first step. The second is to analyze the explain plan and see if it makes sense. Worst case, you could break the query up into multiple pieces with temp tables/WITH statements to see if a different order results in better performance.
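As a sketch of that break-it-up approach (placeholder names once more):

    -- variant 1: a WITH clause to force the trimmed projection first
    WITH big_slice AS (
      SELECT id, col1, col2
      FROM big_table
      WHERE ds = '2016-01-01'
    )
    SELECT b.col1, s.name
    FROM big_slice b
    JOIN small_table s ON (b.id = s.id);

    -- variant 2: materialize it as a temporary table (Hive 0.14+)
    CREATE TEMPORARY TABLE big_slice STORED AS ORC AS
    SELECT id, col1, col2
    FROM big_table
    WHERE ds = '2016-01-01';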

I am also a fan of checking the execution with hive.tez.exec.print.summary to see if there is a stage that takes a long time and doesn't have enough mappers/reducers, i.e. a bottleneck.
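That is a one-line session setting:

    SET hive.tez.exec.print.summary=true;
    -- after the query finishes, the console prints per-vertex timings and
    -- task counts, which makes a skewed or under-parallelized stage easy to spot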

Master Guru

@Benjamin Leonhardi well, now we're getting somewhere. I am going to try hive.tez.exec.print.summary to find some additional analysis points.

But it goes back to my main question: does ORC make sense in this use case, when I want to fetch all (or almost all) columns?

Master Guru

@Sunile Manjee

Short answer: theoretically, ORC ALWAYS makes sense, just less so than when you read only a subset of columns. Then it's no question:

- It's stored in a protobuf format, so parsing the data is much faster than deserializing strings.

- It enables vectorization: if you aggregate on a column, ORC can read 10,000 rows at a time and aggregate them all in one go. Much better than parsing one row at a time.

- And it has features like predicate pushdown if you have WHERE conditions.

Once you read all columns there is no magic anymore; it will take some time. I would focus on the query analysis and on identifying bottlenecks, but my guess would be that ORC is still your best bet.
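If you want to double-check that vectorization and predicate pushdown actually kick in, these are the usual switches (defaults differ between Hive versions, so treat this as a checklist rather than required settings):

    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
    SET hive.optimize.ppd=true;           -- predicate pushdown in general
    SET hive.optimize.index.filter=true;  -- row-group skipping inside ORC files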