- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
count(*) in hive and the wc of hadoop file is different.
- Labels:
-
Apache Hadoop
-
Apache Hive
Created ‎08-11-2016 10:02 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Value differs when I check the count(*) using hive query and when I check the no of lines in the corresponding file. Seelect count(*) from table gives 100 as output where as wc -l of the corresponding file gives 30. My table is in ORC format. Is it how orc behaves?
Created ‎08-11-2016 10:06 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, the difference in count is expected. ORC converts the table data into groups of rows called stripes along with auxiliary information in a file footer, default size of stripe is 250 MB. Hence, there will be difference in wc -l on orc file compared to actual numbers of rows in the table.
Created ‎08-11-2016 10:06 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, the difference in count is expected. ORC converts the table data into groups of rows called stripes along with auxiliary information in a file footer, default size of stripe is 250 MB. Hence, there will be difference in wc -l on orc file compared to actual numbers of rows in the table.
Created ‎08-11-2016 10:15 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Sindhu. What happens when we store it as TEXTFILE. The count should remain the same right?
Created ‎08-11-2016 10:44 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, it would be same.
