I had a text table which was taking 7 min to do a 24 hour rollup (with min, max, stddev calculations and group by's) and writing into a text table... but when i tried the same after converting into orc zlib, the ending few mappers are taking more than 20-25 min .... how come text is more faster? should i change orc to snappy compression?
Update: i used another table which is just ORC(no compression), then it ran a little faster(10min) but i want to know how this is explained and are their any better ways?
this makes sense. The data needs to be converted to ORC and compressed before being written, so it is normal that it is slower. How much slower depends on multiple factors. For sure, snappy is faster than zlib, but it takes a bit more disk space. And no compression is even faster, but again you need more disk space. The benefits of using ORC over text are multiple, though, and some of them are:
1. It requires fewer disk space;
2. You need to read less data;
3. Your queries on the resulting table will read only the columns needed (so if you have a lot of columns and you query just few of them in each query on the result table, you have a great performance gain by using ORC).
but the source table is already ORC, why does it need to compress it before being written when it is already compressed in the source tbale... i have min, max, avg, stddev done on several columns along with grouping in the query ...
The other benefits you mentioned are understood, but i am not really sure why delay for orc table when it is usually faster than text ... everywhere it claims orc is faster.
the format of the input table doesn't matter. When Spark reads a table (whatever format it has) it brings it in its internal memory representation. Then, when Spark writes the new table, it "converts" its internal representation of the result in the output format. Since the ORC format is way more complex than the text one, it will take longer to write, And even longer if it is compressed. It takes longer simply because it has more things to do in order to write the ORC file rather than a simple text file.
ORC is faster than text in reading, but it is slower in writing. The assumption is that you write the table once and then you read it many times, so the critical part is reading: therefore ORC is claimed to be faster, because in general it is more convenient. But if you are concerned about performance (and you don't care disk space) and you write only your data and never read it (not sure it makes sense to write something which is never read), than the text format is surely more performant.
thanks for the explanation.... i have one last question ..
So, if my source is orc table and destination is text table, is the write slower because you are writing into text table?
Or, if my destination is orc table, will the write be faster?
Thanks so much!
No, writing to text table is faster than writing ORC, independently from what your source is. Reading, instead, is always faster from an ORC table.