I am new to the Hadoop ecosystem and self-learning it through online articles. I am working on a very basic project so that I can get hands-on experience with what I have learnt.
My use-case is extremely simple: I want to show the app admin the location of each user who logs in to the portal. I have a server that continuously generates logs; each log entry has a user ID, IP address, and timestamp, with all fields comma-separated.
My idea is to have a Flume agent stream the live log data and write it to HDFS, have a Hive process in place that reads the incremental data from HDFS and writes it to a Hive table, then use Sqoop to continuously copy data from Hive to an RDBMS SQL table, and work against that SQL table. So far I have successfully configured a Flume agent that reads logs from a given location and writes them to an HDFS location. But after this I am confused about how I should move data from HDFS into a Hive table. One idea that comes to mind is a MapReduce program that reads the files in HDFS and writes to Hive tables programmatically in Java. But I also want to delete files that have already been processed, and make sure no duplicate records are read by the MapReduce job. I searched online and found a command that can copy file data into Hive, but that is a manual, one-off activity. In my use-case I want to push data as soon as it is available in HDFS. Please guide me on how to achieve this. Links will be helpful.
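For reference, a Flume agent of the kind described above (picking up completed log files from a local directory and landing them in HDFS) can be sketched like this; the agent name, directory paths, and channel settings below are hypothetical and would need to match your environment:

```
# Hypothetical agent "a1": spooling-directory source -> memory channel -> HDFS sink
a1.sources  = src1
a1.channels = ch1
a1.sinks    = sink1

# Watch a local directory for completed log files (path is an assumption)
a1.sources.src1.type     = spooldir
a1.sources.src1.spoolDir = /var/log/portal
a1.sources.src1.channels = ch1

a1.channels.ch1.type     = memory
a1.channels.ch1.capacity = 10000

# Write events to HDFS as plain text so Hive can read them as-is
a1.sinks.sink1.type             = hdfs
a1.sinks.sink1.channel          = ch1
a1.sinks.sink1.hdfs.path        = /user/flume/portal-logs
a1.sinks.sink1.hdfs.fileType    = DataStream
a1.sinks.sink1.hdfs.writeFormat = Text
```

The spooldir source renames each file once it has been fully ingested (or deletes it, depending on its `deletePolicy` setting), which also addresses the "delete files which are already processed" concern for the ingestion step.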
I am working with Cloudera Express 5.13.0.
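For the Hive-to-RDBMS step mentioned above, a Sqoop export against the HDFS files backing the Hive table is one option; a minimal sketch, assuming a MySQL target and connection details, table names, and paths that are purely illustrative:

```shell
# Export the comma-separated files backing the Hive table into an RDBMS table
# (all names/paths below are assumptions)
sqoop export \
  --connect jdbc:mysql://dbhost/portal \
  --username dbuser -P \
  --table user_logins \
  --export-dir /user/hive/warehouse/login_logs \
  --input-fields-terminated-by ','
```

Note that `sqoop export` is a batch operation, so "continuously" copying would in practice mean scheduling it, for example with cron or Oozie.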
Update 1: I just created an external Hive table pointing to the HDFS location where Flume is dumping logs. I noticed that as soon as the table is created, I can query the Hive table and fetch data. This is awesome. But what will happen if I stop the Flume agent for the time being and let the app server keep writing logs? If I start Flume again, will it read only the new logs and ignore the logs that have already been processed? Similarly, will Hive read the new logs and ignore the ones it has already processed?
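An external table of the kind described in the update could look roughly like this; the table name, column names, and location are hypothetical, and the `LOCATION` must match the Flume HDFS sink path:

```sql
-- Hive DDL sketch; Hive applies this schema at read time, not at load time
CREATE EXTERNAL TABLE IF NOT EXISTS login_logs (
  user_id    STRING,
  ip_address STRING,
  login_ts   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/flume/portal-logs';
```

Because the table is external, dropping it removes only the metadata, not the files in HDFS, and any new files Flume lands under the `LOCATION` directory become visible to queries automatically.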
I am new to Big Data and have gone through multiple online documents to learn Flume. I am on Cloudera Express 5.13.0.
As per my understanding, the Flume service and the Flume agent are basically the same thing, so we can either add the Flume service in Cloudera Manager or run a Flume agent from the command prompt.
When I run "which flume", I get "no flume found".
When I run "which flume-ng", I get /usr/bin/flume-ng, which means the Cloudera QuickStart VM has the flume-ng agent installed.
Now, when I start the flume-ng agent using the command:
sudo service flume-ng-agent start
the agent starts and the output of the status command is OK, but when I go to Cloudera Manager I do not see the Flume agent running there. Similarly, after I stop it with "sudo service flume-ng-agent stop" at the command prompt and then start the service/agent in Cloudera Manager, running "sudo service flume-ng-agent status" at the command prompt reports that flume-ng is not running. I am confused by this. Is flume-ng from the command prompt not the same as starting the service in Cloudera Manager? Also, when I start flume-ng from the command prompt, I do not see logs rolling in "/var/log/flume-ng". Finally, I do not see a flume-ng service to add in Cloudera Manager, but I do see Flume, so I have added Flume with all its default configuration. Please help.
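For what it's worth, the flume-ng binary can also be run directly in the foreground, which makes the agent's log output visible on the console instead of relying on /var/log/flume-ng; a sketch, where the config directory, config file, and agent name are assumptions that must match your setup:

```shell
# Run the agent in the foreground with console logging
# (paths and the agent name "a1" are assumptions)
flume-ng agent \
  --conf /etc/flume-ng/conf \
  --conf-file /etc/flume-ng/conf/flume.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console
```

The `--name` value must match the agent name used in the configuration file, otherwise the agent starts with no sources or sinks.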