My usecase is to perform interactive analytics on top of the log data (json format) stored in HDFS and in HIVE table(TEXTFILE format). We have around 30 million records and the size of the dataset is around 60 GB. Since Tez is the default query engine for my hive version, i expected the query results should be faster enough, but the response time for even count() also took around 30 seconds. What would be best practice or recommendation for performing interactive log analytics using HIVE? Do I need to use HIVE table with RC/ORC format rather than TEXT.? My customer comparing the query response time with RDBMS in this case. Appreciate your suggestion on approach/solution to satisfy my usecase. Thanks!!!
Technically you cant compare RDBMS with Hive atleast for now. Once way of doing it capture stats of the table in the hive properties through you can see improvement in performance. If you wanted to do analysis on column then you may have to run
Analyze table dbname.tblname partition compute statistics for columns column_name;
By performing column statistics you can experience a better performance. Hive works well in terms of large computing for which it is specifically designed. But comparing with RDBMS is like comparing apples with oranges. Hope it helps!
Thanks for your reply. Yes true, we cannot compare with RDBMS as both HIVE and RDBMS meant for different purposes. However, it is evident that HIVE is still handful for batch analytics but not for interactive.(atleast for now)
Please refer to below link which gives some performance comparison between ORC and text:
Hello @Saravanan Ramaraj In response to an earlier question I would say "yes" to converting your JSON log data to ORC if the logs are not complex data structures which can vary. unlike @Bala Vignesh N V
I have had positive experiences comparing an ORC file of > 1 billion rows with it's equivalent RDBMS (Teradata) version. ORC's reponse of 20-30 seconds was judged to be "competitive" LLAP would make this even better. I'm a big fan of ORC.