I am a newbie with Hive and I am having tremendous difficulty trying to figure out how to solve the following problem using Hive. I am using the hortonworks VM 2.4 for virtualbox. I have imported some csv data into Hive and having lots of trouble doing the following.
I desperately need assistance in figuring out how to query my data to obtain an aggregate heart rate for n seconds. For example, how can I get the mean heart rate for each participant every 4 seconds?
Consider the following data that represents a simple table I have
Consider the desired output above. For participant 1, there is 3 mean temperatures given. We have 96.5 for the first 4 seconds, 95.75 for the second 4 seconds, and 91.33 as the mean for the last 3 seconds. The last part is tricky because we don't have 4 seconds. Ideally, we would take the mean of what we have. Since we only have 3 seconds, we have the mean of that. It's possible we could have 2 seconds or 1 seconds. I just don't know. In addition, notice that there is a missing second for participant 1, we have 15:30:21 and then it jumps to 15:30:23.
I think the only reasonable way to solve this problem is to first assign a unique primary key to the data that labels each row a number such as 1, 2, 3, 4, 5, and somehow use that unique identifier by modulo n or something to get our mean for every 4 seconds. The problem is that I don't know if that will fall apart when we get to the end of the data for a participant and need to get the mean of the remaining data. Simply put, is Hive not worth the trouble for what I need to do?
This is a very interesting problem. My first thought was using a windowing function and use partition by, but partition by over what? What if you convert timestamp to unix timestamp. Will that make it easier to use windowing function?
If not there are two things that come to mind. One is Apache Crunch. This should enable you to do your time series analysis for your data in HDFS but will require Java code.
I think a better option is OpenTSDB. Do you already know about this? It works on top of HBase. OpenTSDB provides a simple avg function that will do the job.
Thanks for your reply. I have not touched Java in two years and I have given myself a week to solve this, so I am afraid Apache Crunch will not be useful for me. I am primarily a C# and R developer, but I believe python syntax is simple enough I can sorta use it at the moment. Unfortunately, I think this means HBase and OpenTSDB is not viable.
If someone knows a way to get Hive to python and back to Hive using the Horton VM 2.4. I think that might be the way got go for me? I might need to process this data in python and then write it back as a new table in Hive.
PS - I think this use case has demonstrated the need for me to pick up Java again, but the time constraints for this problem means I probably won't be able to use it. Unfortunately. However, I think it makes sense to stick close to Java as possible with Hadoop given its' origin story.
I have personally not used Python with Hive, but have you check this link? this would allow you to fetch data from hive using a select and then you can run your operation in python.
I have been googling, but do you know if I write data to Hive using this service to? It is important to be able to write it back to hive. If not, I can put it into csv and then back into hive, but what a pain right? Again, further evidence to use alternative tools. 🙂