First of all, I am new to Hadoop/Hive (I only started looking at it recently). What I am trying to do is analyse how our Hadoop/Hive instances are used by their users. Ideally, I want to extract all queries run in a particular timeframe, the tables they used, and how many bytes were read/written. It seems I can get the executed queries reasonably easily from hiveserver2.log. However, I read somewhere that the log should also list the bytes read/written; I could not find that, so the article I read was probably out of date.
My questions are:
1) How would you suggest I get this information out of the instance? Ideally I am looking for something I can run outside the software, i.e. parsing log files or querying a database, so that I can automate this.
2) Would I need to change a log level or something to obtain this kind of information?
3) Are query plans stored in any of the log files, so I can parse those?
4) When using DAS, I noticed that not all of my queries appear on the queries tab. Note: I am only playing around with the sandbox. The create table and load statements from the tutorial appear on the queries tab, but none of the select statements do, which made me think I may need to change a log level or something similar.
Looking forward to any advice, pointers or hints you would be willing to share.
Many thanks in advance,
Hi @Misha Beek. DAS is definitely the best tool for what you are looking for. Also, HDP 3.1 has a sys database and an information schema you can query to get some of the results you need. Finally, if your queries are not showing up, it may be a security issue.
@Scott Shaw Thanks for your answer. DAS seemed to me to be only an interactive interface? Ideally I just want to extract this information for a given period of time and analyse it offline. Regarding the sys database and information schema, where do I find these, and how can I connect to them? Thanks again!
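For the log-parsing route raised in the question, here is a minimal Python sketch of pulling queries for a time window out of hiveserver2.log. It rests on assumptions that should be checked against your own log: that each query appears on a line containing `Executing command(queryId=...): <sql>`, and that lines start with a timestamp such as `2019-03-01T10:15:30,123`. The exact layout depends on the Hive version and log4j configuration. The function name `extract_queries` is hypothetical, and the sketch only captures the first line of a multi-line statement. It does not recover bytes read/written.

```python
import re
from datetime import datetime

# Assumed line shape (verify against your own hiveserver2.log):
#   2019-03-01T10:15:30,123 INFO  [...]: ql.Driver (...) - Executing command(queryId=hive_...): select ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})[,.]\d{3}"
    r".*?Executing command\(queryId=(?P<query_id>[^)]+)\):\s*(?P<sql>.*)$"
)


def extract_queries(lines, start, end):
    """Return (timestamp, query_id, sql) tuples with start <= timestamp < end.

    `lines` is any iterable of log lines, e.g. an open file object.
    Only the first line of a multi-line SQL statement is captured.
    """
    results = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a query-execution line in the assumed format
        ts = datetime.strptime(
            match.group("ts").replace("T", " "), "%Y-%m-%d %H:%M:%S"
        )
        if start <= ts < end:
            results.append((ts, match.group("query_id"), match.group("sql")))
    return results
```

To use it, open the log file and pass the file object plus two `datetime` bounds, for example `extract_queries(open("hiveserver2.log"), datetime(2019, 3, 1), datetime(2019, 3, 2))`. The captured query ID could then serve as a join key if the same ID turns up in other sources of per-query statistics.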