Created 02-19-2016 08:00 AM
The default value is set to 10,000 and should be > 100, as per the docs.
How should this value be changed or altered? Need some guidance.
If I have a large table of billion rows should we increase the value? Will this be affected by?
I am assuming also that the "orc.bloom.filter.columns" will be the list of columns on which the indexes will be created?
Created 02-19-2016 09:29 AM
So first:
ORC indexes come in two forms, the standard indexes which are created all the time ( min/max values for each stride for each column ) and bloom filters.
Normal indexes are good for range queries and work amazingly well if the data is sorted. This is normally automatic on any date column or increasing columns like ids.
Bloom filters are great for equality queries of things like URLs, names, etc. on data that is not sorted. ( I.e. a customer name can happen sometimes in the data ).
However boom filters take some time to compute, take some space in the indexes and do not work well for most columns in a data warehouse ( number fields like profit, sales, ... ) So they are not created by default and need to be enabled for columns:
orc.bloom.filter.columns
The stride size means the block of data that can be skipped by the ORC reader during a read operation based on these indexes. 10000 is normally a good number and increasing it doesn't help you much. You can play a bit with it but I doubt you will get big performance improvements by changing it. I would expect more impact from block size ( which impacts how many mappers are created ), compression ( zip is normally the best ).
But by far the most impact comes from good data modeling. I.e. Sorting the data during insert, Correct number of ORC files in the folder, data types used, etc.
shameless plug who explains it all a bit:
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
Created 02-19-2016 09:29 AM
So first:
ORC indexes come in two forms, the standard indexes which are created all the time ( min/max values for each stride for each column ) and bloom filters.
Normal indexes are good for range queries and work amazingly well if the data is sorted. This is normally automatic on any date column or increasing columns like ids.
Bloom filters are great for equality queries of things like URLs, names, etc. on data that is not sorted. ( I.e. a customer name can happen sometimes in the data ).
However boom filters take some time to compute, take some space in the indexes and do not work well for most columns in a data warehouse ( number fields like profit, sales, ... ) So they are not created by default and need to be enabled for columns:
orc.bloom.filter.columns
The stride size means the block of data that can be skipped by the ORC reader during a read operation based on these indexes. 10000 is normally a good number and increasing it doesn't help you much. You can play a bit with it but I doubt you will get big performance improvements by changing it. I would expect more impact from block size ( which impacts how many mappers are created ), compression ( zip is normally the best ).
But by far the most impact comes from good data modeling. I.e. Sorting the data during insert, Correct number of ORC files in the folder, data types used, etc.
shameless plug who explains it all a bit:
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data