Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Why my hive over tez query slow for 25 million records?

Why my hive over tez query slow for 25 million records?

New Contributor

My usecase is to perform interactive analytics on top of the log data (json format) stored in HDFS and in HIVE table(TEXTFILE format). We have around 30 million records and the size of the dataset is around 60 GB. Since Tez is the default query engine for my hive version, i expected the query results should be faster enough, but the response time for even count() also took around 30 seconds. What would be best practice or recommendation for performing interactive log analytics using HIVE? Do I need to use HIVE table with RC/ORC format rather than TEXT.? My customer comparing the query response time with RDBMS in this case. Appreciate your suggestion on approach/solution to satisfy my usecase. Thanks!!!

7 REPLIES 7

Re: Why my hive over tez query slow for 25 million records?

@Saravanan Ramaraj

Try collecting the statistics for the table both at table level and column level. Then, try running the queries.
Use of stats should improve the performance.

Also, refer link1 and link2 for Hive performance tuning.

Re: Why my hive over tez query slow for 25 million records?

New Contributor

Thanks for your response. Is it possible to minimize the response time by converting my table in ORC or Parquet format.?

Re: Why my hive over tez query slow for 25 million records?

Hi @Saravanan Ramaraj

Technically you cant compare RDBMS with Hive atleast for now. Once way of doing it capture stats of the table in the hive properties through you can see improvement in performance. If you wanted to do analysis on column then you may have to run

Analyze table dbname.tblname partition compute statistics for columns column_name;

By performing column statistics you can experience a better performance. Hive works well in terms of large computing for which it is specifically designed. But comparing with RDBMS is like comparing apples with oranges. Hope it helps!

Re: Why my hive over tez query slow for 25 million records?

New Contributor

Thanks for your reply. Yes true, we cannot compare with RDBMS as both HIVE and RDBMS meant for different purposes. However, it is evident that HIVE is still handful for batch analytics but not for interactive.(atleast for now)

Re: Why my hive over tez query slow for 25 million records?

Cloudera Employee
@Saravanan Ramaraj

Please refer to below link which gives some performance comparison between ORC and text:

https://community.hortonworks.com/articles/69375/orc-columnar-format-vs-row-for-hive-tables.html

Re: Why my hive over tez query slow for 25 million records?

New Contributor

Thanks much

Re: Why my hive over tez query slow for 25 million records?

New Contributor

Hello @Saravanan Ramaraj In response to an earlier question I would say "yes" to converting your JSON log data to ORC if the logs are not complex data structures which can vary. unlike @Bala Vignesh N V

I have had positive experiences comparing an ORC file of > 1 billion rows with it's equivalent RDBMS (Teradata) version. ORC's reponse of 20-30 seconds was judged to be "competitive" LLAP would make this even better. I'm a big fan of ORC.

Don't have an account?
Coming from Hortonworks? Activate your account here