Why is my Hive-on-Tez query slow for 25 million records?
Labels: Apache Hive, Apache Tez
Created 10-11-2017 04:07 AM
My use case is to perform interactive analytics on log data (JSON format) stored in HDFS in a Hive table (TEXTFILE format). We have around 30 million records, and the dataset is around 60 GB. Since Tez is the default query engine for my Hive version, I expected query responses to be reasonably fast, but even a simple count(*) took around 30 seconds. What are the best practices or recommendations for performing interactive log analytics with Hive? Should I use a Hive table in RC/ORC format rather than TEXT? My customer is comparing the query response time with an RDBMS in this case. I'd appreciate suggestions on an approach/solution for my use case. Thanks!
Created 10-11-2017 07:28 AM
Try collecting statistics for the table at both the table level and the column level, then rerun your queries. Using statistics should improve performance.
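As a minimal sketch, the statistics can be gathered like this (assuming a table named `logs`; substitute your own database/table name):

```sql
-- Table-level statistics (row count, file sizes, etc.)
ANALYZE TABLE logs COMPUTE STATISTICS;

-- Column-level statistics, used by the cost-based optimizer
ANALYZE TABLE logs COMPUTE STATISTICS FOR COLUMNS;
```

With `hive.compute.query.using.stats` enabled, simple aggregates such as count(*) can often be answered straight from the metastore statistics instead of scanning the data.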
Created 10-12-2017 10:49 AM
Thanks for your response. Is it possible to reduce the response time by converting my table to ORC or Parquet format?
Created 10-11-2017 09:19 AM
Technically you can't compare an RDBMS with Hive, at least for now. One way to improve performance is to capture table statistics in the Hive metastore. If you want to analyze specific columns, you may have to run:
ANALYZE TABLE dbname.tblname PARTITION(partition_col) COMPUTE STATISTICS FOR COLUMNS column_name;
With column statistics in place you can see better performance. Hive works well for large-scale computation, which is what it was designed for, but comparing it with an RDBMS is like comparing apples with oranges. Hope it helps!
Created 10-12-2017 10:48 AM
Thanks for your reply. True, we cannot compare Hive with an RDBMS, as they are meant for different purposes. Still, it seems evident that Hive remains well suited for batch analytics but not for interactive analytics (at least for now).
Created 10-12-2017 11:36 AM
Please refer to the link below, which gives a performance comparison between ORC and text formats:
https://community.hortonworks.com/articles/69375/orc-columnar-format-vs-row-for-hive-tables.html
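As a sketch of the conversion itself, an existing text-format table can be copied into ORC with a CREATE TABLE AS SELECT; `logs_text` and `logs_orc` are hypothetical names here:

```sql
-- Create an ORC copy of the existing TEXTFILE table (hypothetical names)
CREATE TABLE logs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM logs_text;
```

After loading, gathering statistics on the new ORC table lets the optimizer take full advantage of the columnar format.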
Created 10-12-2017 04:15 PM
Thanks much
Created 10-13-2017 10:20 PM
Hello @Saravanan Ramaraj. In response to an earlier question, I would say "yes" to converting your JSON log data to ORC, provided the logs are not complex, variable data structures. Unlike @Bala Vignesh N V,
I have had positive experiences comparing an ORC table of more than 1 billion rows with its equivalent RDBMS (Teradata) version. ORC's response time of 20-30 seconds was judged to be "competitive", and LLAP would make this even better. I'm a big fan of ORC.