count(*) in Hive and wc -l of the Hadoop file give different results

The row count from a Hive query differs from the number of lines in the corresponding file. select count(*) from table returns 100, whereas wc -l on the corresponding file returns 30. My table is in ORC format. Is this how ORC behaves?

1 ACCEPTED SOLUTION

@Bala Vignesh N V

Yes, the difference in count is expected. ORC is a binary columnar format: it stores the table data in groups of rows called stripes, along with auxiliary information in a file footer (the Hive ORC documentation gives a default stripe size of 250 MB). wc -l only counts newline bytes in the file, so its result on an ORC file has no relation to the actual number of rows in the table.
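To make the point concrete, here is a minimal Python sketch of what wc -l actually measures. It is an illustration under stated assumptions, not a reproduction of ORC: the 100 rows are made up, the helper wc_l is hypothetical, and zlib compression is used only as a stand-in for "some binary encoding" of the same data.

```python
import zlib


def wc_l(data: bytes) -> int:
    """Mimic `wc -l`: count newline bytes and nothing else."""
    return data.count(b"\n")


# 100 hypothetical rows, one per line, the way a plain text table stores them.
rows = [f"{i},name_{i}" for i in range(100)]
text_bytes = "".join(r + "\n" for r in rows).encode("utf-8")

# Stand-in for a binary, compressed layout (NOT the real ORC encoding).
binary_bytes = zlib.compress(text_bytes)

text_lines = wc_l(text_bytes)      # one newline per row, so this equals the row count
binary_lines = wc_l(binary_bytes)  # just how many bytes happen to equal 0x0A

print("rows:", len(rows))
print("newline count, text layout:", text_lines)
print("newline count, binary layout:", binary_lines)
```

The same 100 rows are present in both byte strings, but only the text layout has one newline per row. For an ORC table, count(*) (or the row count stored in the file's own metadata) is the number to trust.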

REPLIES

Thanks Sindhu. What happens when we store it as TEXTFILE? The count should remain the same, right?

Yes, it would be the same. In an uncompressed TEXTFILE table each line is one row by default, so wc -l matches count(*).