Support Questions

balavignesh_nag · ‎02-13-2017

I have a hive managed table stored as TEXTFILE. Will there be any change in the performance between select col1,col2 from hive_tabl than select * from hive_tabl. Consider the table has 300 columns with 20 billion of data. Will there be any performance which impacts based on the select clause. If so can ORC storage overcome it or what will be the best way to store the data which will not affect based on the select clause.

bmathew · ‎02-14-2017

@Bala Vignesh N V

A 'SELECT *' behaves differently on Text vs ORC for certain conditions:

1. On both ORC and Text, no MR or Tez job will be launched if you SELECT all columns: "SELECT * FROM my_table". MR or Tez will just read from the file without launching any mappers.

2. On Text, a MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". Mappers will be launched for Text formatted data.

3. On ORC, no MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". No Mappers will be launched for ORC formatted data. Thus, ORC performs better than Text in those cases in which you submit a WHERE condition.

4. On both ORC and Text, no MR or Tez job will be launched if you SELECT a few columns: "SELECT column1, column2 FROM my_table". MR or Tez will just read from the file without launching any mappers.

5. For any aggregations on a few columns, MR or Tez job will be launched and ORC results in less I/O than Text because only those columns are being read

View solution in original post

ssubhas · ‎02-13-2017

@Bala Vignesh N V

Yes, there would be performance difference between select * and select column as 'select *' would bring in all the column data and with indexing not affecting much.

ORC would have better performance compared to textformat irrespective of select query being run. This is because the data is stored in splits and each split header containing the details regarding data within the split. ORC also has predicate pushdown which facilitates the better performance.

Refer to link1 and link2 for details on increasing performance.

balavignesh_nag · ‎02-13-2017

Is that so? In relational databases I agree that there be a significant difference. But in hadoop i always thought that it will read the entire record if it is stored as TEXTFILE. Doesnt mapper reads entire record to parse it to reducer?

bmathew · ‎02-14-2017

@Bala Vignesh N V

A 'SELECT *' behaves differently on Text vs ORC for certain conditions:

1. On both ORC and Text, no MR or Tez job will be launched if you SELECT all columns: "SELECT * FROM my_table". MR or Tez will just read from the file without launching any mappers.

2. On Text, a MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". Mappers will be launched for Text formatted data.

3. On ORC, no MR or Tez job will be launched if you add a WHERE condition: "SELECT * FROM my_table WHERE 'some_condition'". No Mappers will be launched for ORC formatted data. Thus, ORC performs better than Text in those cases in which you submit a WHERE condition.

4. On both ORC and Text, no MR or Tez job will be launched if you SELECT a few columns: "SELECT column1, column2 FROM my_table". MR or Tez will just read from the file without launching any mappers.

5. For any aggregations on a few columns, MR or Tez job will be launched and ORC results in less I/O than Text because only those columns are being read

balavignesh_nag · ‎02-14-2017

Thanks @Binu Mathew. I would like to know one additional information from the point 5. I understand that ORC will read the specific column if we specify the column names in the select clause. But how does the text file work? . Like if i specify the column name in the select clause will it read only the specify columns or will it read the entire record but displays only the selected columns.

dineshc · ‎02-17-2017

This is such a precise and clear answer. Awesome.

Hadoopy · ‎03-08-2018

@Binu Mathew

Hey, In my case lot of mappers are launched when I run a select query on ORC file.

Also, are there some particular settings of hive to be turned on so that read operations in ORC use ppd. I have tried a lot but almost all my queries read the same as size(of my ORC table), which means reader is reading the whole ORC file. I run Hive 0.13.

Cloudera Community

Support Questions

Will there be any performance issues if we select particular column from hive table than selecting all(*) the columns?