Created 02-13-2017 07:03 PM
I have a hive managed table stored as TEXTFILE. Will there be any change in the performance between select col1,col2 from hive_tabl than select * from hive_tabl. Consider the table has 300 columns with 20 billion of data. Will there be any performance which impacts based on the select clause. If so can ORC storage overcome it or what will be the best way to store the data which will not affect based on the select clause.
Created 02-14-2017 06:21 PM
A 'SELECT *' behaves differently on Text vs ORC for certain conditions:
1. On both ORC and Text, no MR or Tez job will be launched if you SELECT all columns: "SELECT * FROM my_table". MR or Tez will just read from the file without launching any mappers.
2. On Text, a MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". Mappers will be launched for Text formatted data.
3. On ORC, no MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". No Mappers will be launched for ORC formatted data. Thus, ORC performs better than Text in those cases in which you submit a WHERE condition.
4. On both ORC and Text, no MR or Tez job will be launched if you SELECT a few columns: "SELECT column1, column2 FROM my_table". MR or Tez will just read from the file without launching any mappers.
5. For any aggregations on a few columns, MR or Tez job will be launched and ORC results in less I/O than Text because only those columns are being read
Created 02-13-2017 07:26 PM
Yes, there would be performance difference between select * and select column as 'select *' would bring in all the column data and with indexing not affecting much.
ORC would have better performance compared to textformat irrespective of select query being run. This is because the data is stored in splits and each split header containing the details regarding data within the split. ORC also has predicate pushdown which facilitates the better performance.
Refer to link1 and link2 for details on increasing performance.
Created 02-13-2017 08:56 PM
Is that so? In relational databases I agree that there be a significant difference. But in hadoop i always thought that it will read the entire record if it is stored as TEXTFILE. Doesnt mapper reads entire record to parse it to reducer?
Created 02-14-2017 06:21 PM
A 'SELECT *' behaves differently on Text vs ORC for certain conditions:
1. On both ORC and Text, no MR or Tez job will be launched if you SELECT all columns: "SELECT * FROM my_table". MR or Tez will just read from the file without launching any mappers.
2. On Text, a MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". Mappers will be launched for Text formatted data.
3. On ORC, no MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". No Mappers will be launched for ORC formatted data. Thus, ORC performs better than Text in those cases in which you submit a WHERE condition.
4. On both ORC and Text, no MR or Tez job will be launched if you SELECT a few columns: "SELECT column1, column2 FROM my_table". MR or Tez will just read from the file without launching any mappers.
5. For any aggregations on a few columns, MR or Tez job will be launched and ORC results in less I/O than Text because only those columns are being read
Created 02-14-2017 09:27 PM
Thanks @Binu Mathew. I would like to know one additional information from the point 5. I understand that ORC will read the specific column if we specify the column names in the select clause. But how does the text file work? . Like if i specify the column name in the select clause will it read only the specify columns or will it read the entire record but displays only the selected columns.
Created 02-17-2017 07:35 PM
This is such a precise and clear answer. Awesome.
Created 03-08-2018 11:34 AM
Hey, In my case lot of mappers are launched when I run a select query on ORC file.
Also, are there some particular settings of hive to be turned on so that read operations in ORC use ppd. I have tried a lot but almost all my queries read the same as size(of my ORC table), which means reader is reading the whole ORC file. I run Hive 0.13.
 
					
				
				
			
		
