Created 10-11-2017 04:07 AM
My use case is to perform interactive analytics on top of log data (JSON format) stored in HDFS and in a Hive table (TEXTFILE format). We have around 30 million records and the dataset is around 60 GB. Since Tez is the default query engine for my Hive version, I expected the query results to be fast enough, but the response time for even a count(*) was around 30 seconds. What would be the best practice or recommendation for performing interactive log analytics using Hive? Do I need to use a Hive table with RC/ORC format rather than TEXTFILE? My customer is comparing the query response time with an RDBMS in this case. I'd appreciate your suggestions on an approach/solution to satisfy my use case. Thanks!!!
Created 10-11-2017 07:28 AM
Try collecting statistics for the table, both at the table level and at the column level, then run the queries again.
Using statistics should improve performance.
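A minimal sketch of what that looks like, assuming a table named logs in the default database (substitute your own database, table, and column names):

-- Table-level statistics (row count, total size, etc.)
ANALYZE TABLE logs COMPUTE STATISTICS;
-- Column-level statistics for the columns your queries filter or aggregate on
ANALYZE TABLE logs COMPUTE STATISTICS FOR COLUMNS;
-- With this setting enabled, simple queries such as count(*) can be answered from the stats alone
SET hive.compute.query.using.stats=true;

Note that ANALYZE ... FOR COLUMNS without an explicit column list computes statistics for all columns, which can be slow on a wide table; listing only the columns you actually query is an option.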
Created 10-12-2017 10:49 AM
Thanks for your response. Is it possible to minimize the response time by converting my table to ORC or Parquet format?
Created 10-11-2017 09:19 AM
Technically you can't compare an RDBMS with Hive, at least for now. One way of improving performance is to capture table statistics in the Hive metastore, through which you can see an improvement in query times. If you want to do analysis on a particular column, you may have to run:
ANALYZE TABLE dbname.tblname PARTITION(partition_col='value') COMPUTE STATISTICS FOR COLUMNS column_name;
By collecting column statistics you can get better performance. Hive works well for large-scale computation, which is what it is specifically designed for, but comparing it with an RDBMS is like comparing apples with oranges. Hope it helps!
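If it helps, one way to confirm that the statistics were actually collected (a sketch, assuming a table named logs and a column named status; both names are illustrative):

-- Table-level stats show up under Table Parameters (numRows, totalSize, ...)
DESCRIBE FORMATTED logs;
-- Column-level stats (min, max, num_nulls, distinct_count) appear once they have been computed
DESCRIBE FORMATTED logs status;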
Created 10-12-2017 10:48 AM
Thanks for your reply. Yes, true, we cannot compare with an RDBMS, as Hive and an RDBMS are meant for different purposes. However, it is evident that Hive is still well suited for batch analytics but not for interactive analytics (at least for now).
Created 10-12-2017 11:36 AM
Please refer to the link below, which gives a performance comparison between ORC and text:
https://community.hortonworks.com/articles/69375/orc-columnar-format-vs-row-for-hive-tables.html
Created 10-12-2017 04:15 PM
Thanks much
Created 10-13-2017 10:20 PM
Hello @Saravanan Ramaraj. In response to an earlier question, I would say "yes" to converting your JSON log data to ORC, as long as the logs are not complex data structures that can vary. Unlike @Bala Vignesh N V,
I have had positive experiences comparing an ORC table of more than 1 billion rows with its equivalent RDBMS (Teradata) version. ORC's response time of 20-30 seconds was judged to be "competitive", and LLAP would make this even better. I'm a big fan of ORC.
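For anyone following along, a minimal sketch of converting an existing TEXTFILE table to ORC (the table names logs_text and logs_orc are illustrative, not from this thread):

-- Create an ORC copy of the text table in one step
CREATE TABLE logs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM logs_text;

-- Alternatively, load into a pre-defined ORC table
-- INSERT OVERWRITE TABLE logs_orc SELECT * FROM logs_text;

After the conversion, collecting table and column statistics on the new table (as described above) should help the optimizer further.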