Created 02-11-2018 06:58 PM
Hi,
I had a text table which was taking 7 min to do a 24 hour rollup (with min, max, stddev calculations and group by's) and writing into a text table... but when i tried the same after converting into orc zlib, the ending few mappers are taking more than 20-25 min .... how come text is more faster? should i change orc to snappy compression?
Created 02-12-2018 02:45 PM
Update: i used another table which is just ORC(no compression), then it ran a little faster(10min) but i want to know how this is explained and are their any better ways?
Created 02-13-2018 09:03 AM
Hi,
this makes sense. The data needs to be converted to ORC and compressed before being written, so it is normal that it is slower. How much slower depends on multiple factors. For sure, snappy is faster than zlib, but it takes a bit more disk space. And no compression is even faster, but again you need more disk space. The benefits of using ORC over text are multiple, though, and some of them are:
1. It requires fewer disk space;
2. You need to read less data;
3. Your queries on the resulting table will read only the columns needed (so if you have a lot of columns and you query just few of them in each query on the result table, you have a great performance gain by using ORC).
Created 02-13-2018 02:57 PM
but the source table is already ORC, why does it need to compress it before being written when it is already compressed in the source tbale... i have min, max, avg, stddev done on several columns along with grouping in the query ...
The other benefits you mentioned are understood, but i am not really sure why delay for orc table when it is usually faster than text ... everywhere it claims orc is faster.
Created 02-13-2018 03:42 PM
the format of the input table doesn't matter. When Spark reads a table (whatever format it has) it brings it in its internal memory representation. Then, when Spark writes the new table, it "converts" its internal representation of the result in the output format. Since the ORC format is way more complex than the text one, it will take longer to write, And even longer if it is compressed. It takes longer simply because it has more things to do in order to write the ORC file rather than a simple text file.
ORC is faster than text in reading, but it is slower in writing. The assumption is that you write the table once and then you read it many times, so the critical part is reading: therefore ORC is claimed to be faster, because in general it is more convenient. But if you are concerned about performance (and you don't care disk space) and you write only your data and never read it (not sure it makes sense to write something which is never read), than the text format is surely more performant.
Created 02-13-2018 07:55 PM
thanks for the explanation.... i have one last question ..
So, if my source is orc table and destination is text table, is the write slower because you are writing into text table?
Or, if my destination is orc table, will the write be faster?
Thanks so much!
Created 02-14-2018 07:55 AM
No, writing to text table is faster than writing ORC, independently from what your source is. Reading, instead, is always faster from an ORC table.