As you may know, when we use Flume to collect web logs and sync them into HBase, the basic steps are to create the table in HBase and then configure the source, channel, and sink in flume.conf.
But how do I define the column mapping if I use org.apache.flume.sink.hbase.RegexHbaseEventSerializer? The documentation doesn't describe the serializer in detail. Where can I find that?
|Property Name|Default|Description|
|serializer|org.apache.flume.sink.hbase.SimpleHbaseEventSerializer|Default increment column = “iCol”, payload column = “pCol”.|
|serializer.*|–|Properties to be passed to the serializer.|
Currently I can only use org.apache.flume.sink.hbase.SimpleHbaseEventSerializer, but it just puts each whole line into a single column, which is not what I want.
Can anyone give some suggestions? Thanks.
I have found this information in the source code.
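A minimal sketch of a sink definition using this serializer might look like the following. The `regex` and `colNames` serializer properties are the ones RegexHbaseEventSerializer reads; everything else here is an assumption for illustration only: the agent, sink, and channel names (a1, k1, c1), the table name, the column family, the three-field log format, and the column names.

```properties
# Hypothetical names: agent a1, sink k1, channel c1, table web_log, family cf
a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = web_log
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# One capture group per column; assumes lines like "10.0.0.1 GET /index.html"
a1.sinks.k1.serializer.regex = ([^ ]+) ([^ ]+) ([^ ]+)
# Column qualifiers, in the same order as the capture groups
a1.sinks.k1.serializer.colNames = ip,method,url
```

With this sketch, the sample line would end up in cf:ip, cf:method, and cf:url. As far as I can tell from the source, the whole event body has to match the regex, and the number of capture groups has to equal the number of names in colNames; otherwise the event produces no columns. The regex above avoids backslashes on purpose, because backslashes in a properties file have to be doubled.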
Most of our logs are text files, and we'd like to use an exec source to sync them into HBase by regex in real time. But I have checked the documentation, and it seems tail -F can't guarantee that no data is lost (I tried restarting Flume and found that tail -F really does lose data). So what's your suggestion for this case?
If you don't need the logs in real time, then I would suggest using the spooldir source to read the log files in after they've been rotated (you'd either want to use a separate directory they've been rotated into, or use an ignorePattern in the spooldir source to exclude the active file). If you do need real time, and are using Apache, you could use a couple of methods:
1. Use Apache's piped log functionality to send logs directly to Flume (via a syslog source, netcat source, etc.)
2. Use a local rsyslog (or syslog-ng) agent to monitor the log files and then forward to a syslog source on the Flume server
Using either of those methods, you could log locally and stream the logs into Flume. They would handle log file rotation and not be susceptible to event loss (syslog daemons will track how far they are in the log file).
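As a rough sketch of the Flume side of these options, the source definitions could look something like the fragment below. The agent, source, and channel names (a1, r1, r2, c1), the spool directory path, the active file name access.log, and the port 5140 are all placeholders chosen for illustration; the channel, the sink, and the `a1.sources = r1 r2` declaration are assumed to be defined elsewhere, and the Apache or rsyslog forwarding side is not shown.

```properties
# Hypothetical names: agent a1, sources r1/r2, channel c1 (defined elsewhere)

# Non-realtime option: pick up rotated files from a spool directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/apache/rotated
# Only needed if the active file sits in the same directory; skips access.log
a1.sources.r1.ignorePattern = ^access[.]log$
a1.sources.r1.channels = c1

# Realtime option: receive lines forwarded by Apache piped logs or rsyslog
a1.sources.r2.type = syslogtcp
a1.sources.r2.host = 0.0.0.0
a1.sources.r2.port = 5140
a1.sources.r2.channels = c1
```

You would normally pick one of the two sources rather than run both against the same logs. The ignorePattern is written with `[.]` instead of an escaped dot so that no backslashes are needed in the properties file.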
Thanks very much, buddy.
This is very helpful for me. Yes, I have tested the spoolDir source, and it's really very good when real time isn't needed.
But most of our requirements need real time. Thanks again, I will go and test the solutions you provided.