Support Questions
Find answers, ask questions, and share your expertise

Assistance with extremely difficult Hive Question/Query for finding mean for every n seconds by participant?

Explorer

I am a newbie with Hive and I am having tremendous difficulty trying to figure out how to solve the following problem using Hive. I am using the hortonworks VM 2.4 for virtualbox. I have imported some csv data into Hive and having lots of trouble doing the following.

Problem

I desperately need assistance in figuring out how to query my data to obtain an aggregate heart rate for n seconds. For example, how can I get the mean heart rate for each participant every 4 seconds?

Data

Consider the following data that represents a simple table I have

timestamphrparticipant
15:30:12951
15:30:13961
15:30:14971
15:30:15981
15:30:16951
15:30:17961
15:30:18971
15:30:19951
15:30:20921
15:30:21931
15:30:23891
15:30:11802

Desired Output

timestamphrparticipantmean
15:30:1295196.5
15:30:1396196.5
15:30:1497196.5
15:30:1598196.5
15:30:1695195.75
15:30:1796195.75
15:30:1897195.75
15:30:1995195.75
15:30:2092191.33
15:30:2193191.33
15:30:2389191.33
15:30:2080296.3

Consider the desired output above. For participant 1, there is 3 mean temperatures given. We have 96.5 for the first 4 seconds, 95.75 for the second 4 seconds, and 91.33 as the mean for the last 3 seconds. The last part is tricky because we don't have 4 seconds. Ideally, we would take the mean of what we have. Since we only have 3 seconds, we have the mean of that. It's possible we could have 2 seconds or 1 seconds. I just don't know. In addition, notice that there is a missing second for participant 1, we have 15:30:21 and then it jumps to 15:30:23.

Issues

I think the only reasonable way to solve this problem is to first assign a unique primary key to the data that labels each row a number such as 1, 2, 3, 4, 5, and somehow use that unique identifier by modulo n or something to get our mean for every 4 seconds. The problem is that I don't know if that will fall apart when we get to the end of the data for a participant and need to get the mean of the remaining data. Simply put, is Hive not worth the trouble for what I need to do?

  • I have no primary id because the csv data that I was given is missing it
  • AFAIK, there is no HH:MM:SS data type in Hive. My timestamp column is a string type
Is their an easier way to do this outside of Hive and then get the data back into hive?
  • Could I try this with R? The problem there is that RJDBC can read Hive but can't write back to it. So that's out. rHive also is very unstable and I won't use it.
  • Could I use python to do this? If so, what library can I use to read and write to Hive?
  • If I can use Hive, geeze where do I even start? Please help.
Thank you so much for your patience in reading this question. I am grateful for any assistance rendered and will do my best to answer any questions or comments to better qualify my question.
4 REPLIES 4

Re: Assistance with extremely difficult Hive Question/Query for finding mean for every n seconds by participant?

Super Guru

@Heath Yates

This is a very interesting problem. My first thought was using a windowing function and use partition by, but partition by over what? What if you convert timestamp to unix timestamp. Will that make it easier to use windowing function?

If not there are two things that come to mind. One is Apache Crunch. This should enable you to do your time series analysis for your data in HDFS but will require Java code.

https://crunch.apache.org/getting-started.html

I think a better option is OpenTSDB. Do you already know about this? It works on top of HBase. OpenTSDB provides a simple avg function that will do the job.

http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html

Re: Assistance with extremely difficult Hive Question/Query for finding mean for every n seconds by participant?

Explorer

Hi,

Thanks for your reply. I have not touched Java in two years and I have given myself a week to solve this, so I am afraid Apache Crunch will not be useful for me. I am primarily a C# and R developer, but I believe python syntax is simple enough I can sorta use it at the moment. Unfortunately, I think this means HBase and OpenTSDB is not viable.

If someone knows a way to get Hive to python and back to Hive using the Horton VM 2.4. I think that might be the way got go for me? I might need to process this data in python and then write it back as a new table in Hive.

Thanks, hlyates

PS - I think this use case has demonstrated the need for me to pick up Java again, but the time constraints for this problem means I probably won't be able to use it. Unfortunately. However, I think it makes sense to stick close to Java as possible with Hadoop given its' origin story.

Re: Assistance with extremely difficult Hive Question/Query for finding mean for every n seconds by participant?

Super Guru

I have personally not used Python with Hive, but have you check this link? this would allow you to fetch data from hive using a select and then you can run your operation in python.

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python

Re: Assistance with extremely difficult Hive Question/Query for finding mean for every n seconds by participant?

Explorer

I have been googling, but do you know if I write data to Hive using this service to? It is important to be able to write it back to hive. If not, I can put it into csv and then back into hive, but what a pain right? Again, further evidence to use alternative tools. 🙂