Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Will there be any performance issues if we select particular column from hive table than selecting all(*) the columns?

avatar

I have a hive managed table stored as TEXTFILE. Will there be any change in the performance between select col1,col2 from hive_tabl than select * from hive_tabl. Consider the table has 300 columns with 20 billion of data. Will there be any performance which impacts based on the select clause. If so can ORC storage overcome it or what will be the best way to store the data which will not affect based on the select clause.

1 ACCEPTED SOLUTION

avatar
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
6 REPLIES 6

avatar

@Bala Vignesh N V

Yes, there would be performance difference between select * and select column as 'select *' would bring in all the column data and with indexing not affecting much.

ORC would have better performance compared to textformat irrespective of select query being run. This is because the data is stored in splits and each split header containing the details regarding data within the split. ORC also has predicate pushdown which facilitates the better performance.

Refer to link1 and link2 for details on increasing performance.

avatar

Is that so? In relational databases I agree that there be a significant difference. But in hadoop i always thought that it will read the entire record if it is stored as TEXTFILE. Doesnt mapper reads entire record to parse it to reducer?

avatar
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login

avatar

Thanks @Binu Mathew. I would like to know one additional information from the point 5. I understand that ORC will read the specific column if we specify the column names in the select clause. But how does the text file work? . Like if i specify the column name in the select clause will it read only the specify columns or will it read the entire record but displays only the selected columns.

avatar

This is such a precise and clear answer. Awesome.

avatar
Contributor

@Binu Mathew

Hey, In my case lot of mappers are launched when I run a select query on ORC file.

Also, are there some particular settings of hive to be turned on so that read operations in ORC use ppd. I have tried a lot but almost all my queries read the same as size(of my ORC table), which means reader is reading the whole ORC file. I run Hive 0.13.