Expert Contributor
Posts: 161
Registered: ‎09-29-2014

Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Hi, everyone

 

as you may know, when we use Flume to collect web logs and sync them into HBase, the basic steps are to create a table in HBase and set up flume.conf for the sink, source, and channel.

 

but how do we define the column mapping if we use org.apache.flume.sink.hbase.RegexHbaseEventSerializer? The documentation doesn't describe the serializer details. Where can I find them? All it says is:

 

serializer: org.apache.flume.sink.hbase.SimpleHbaseEventSerializer (default). Default increment column = "iCol", payload column = "pCol".
serializer.*: Properties to be passed to the serializer.

 

currently I can only use org.apache.flume.sink.hbase.SimpleHbaseEventSerializer, but it puts a whole line into a single column, which is not what I want.

 

can anyone give some suggestions? thanks.

Cloudera Employee
Posts: 249
Registered: ‎01-09-2014

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

When using the RegexHbaseEventSerializer, you need to specify the following properties:
regex
table
columnFamily
colNames

Here is an example. Note that to use regex escape sequences, you need to escape the backslashes:

tier1.sinks.hbaseSink.channel = hbaseChannel
tier1.sinks.hbaseSink.type = org.apache.flume.sink.hbase.HBaseSink
tier1.sinks.hbaseSink.table = students
tier1.sinks.hbaseSink.columnFamily = info
tier1.sinks.hbaseSink.serializer=org.apache.flume.sink.hbase.RegexHbaseEventSerializer
tier1.sinks.hbaseSink.serializer.regex=(\\d+)\\s(\\S+)\\s(\\S+)\\s(\\d+)\\s(.+)
tier1.sinks.hbaseSink.serializer.colNames=id,first_name,last_name,age,gpa
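To sanity-check what that regex captures before wiring it into HBase, here is a quick Python sketch (the sample log line is hypothetical; Flume applies the pattern in Java, which is why the properties file needs the doubled backslashes):

```python
import re

# Same pattern as the Flume config, written with single backslashes;
# the doubling in the properties file is only Java-properties escaping.
pattern = re.compile(r"(\d+)\s(\S+)\s(\S+)\s(\d+)\s(.+)")
col_names = ["id", "first_name", "last_name", "age", "gpa"]

line = "1001 John Smith 21 3.75"  # hypothetical event body
match = pattern.match(line)
# Each capture group becomes one column (info:id, info:first_name, ...)
row = dict(zip(col_names, match.groups()))
print(row)
# {'id': '1001', 'first_name': 'John', 'last_name': 'Smith', 'age': '21', 'gpa': '3.75'}
```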
Expert Contributor
Posts: 161
Registered: ‎09-29-2014

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Thanks.

 

I have found this information in the source code.

 

Another question:

 

most of our logs are text files, and we'd like to use exec as the source to sync these log files into HBase via regex in real time. But I have checked the documentation, and it seems tail -F can't guarantee that data won't be lost (I tried restarting Flume and found that tail -F really does lose data). So what's your suggestion for this case?

Cloudera Employee
Posts: 249
Registered: ‎01-09-2014

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

If you don't need the logs in real time, then I would suggest using the spooldir source to read the log files in after they've been rotated (you'd either want to use a separate directory they're rotated into, or use an ignorePattern in the spooldir source to exclude the active file). If you do need real time, and are using Apache, you could use a couple of methods:
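A spooldir source along those lines might look like this sketch (the directory path, source/channel names, and ignorePattern are assumptions for illustration):

```
tier1.sources.spoolSrc.type = spooldir
tier1.sources.spoolSrc.channels = hbaseChannel
# directory the rotated logs are moved into (assumed path)
tier1.sources.spoolSrc.spoolDir = /var/log/webapp/rotated
# or, if spooling the live log directory, skip the file still being written
tier1.sources.spoolSrc.ignorePattern = ^access\.log$
```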

 

1.  Use Apache's piped log functionality to send logs directly to Flume (via a syslog source, netcat source, etc.)

http://www.oreillynet.com/pub/a/sysadmin/2006/10/12/httpd-syslog.html
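For example, an httpd CustomLog directive can pipe access log entries straight to the local syslog daemon, which can then forward them to Flume (a sketch; the tag and facility are assumptions):

```
# httpd.conf: pipe each access log entry to syslog via logger
CustomLog "|/usr/bin/logger -t apache -p local6.info" combined
```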

 

2.  Use a local rsyslog (or syslog-ng) agent to monitor the log files and then forward to a syslog source on the flume server

http://www.rsyslog.com/doc/master/configuration/modules/imfile.html

http://www.balabit.com/sites/default/files/documents/syslog-ng-pe-4.0-guides/en/syslog-ng-pe-v4.0-gu...
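With rsyslog's imfile module, a minimal configuration to tail a log file and forward it over TCP might look like this (file path, tag, host, and port are assumptions; check the imfile docs linked above for your rsyslog version's exact syntax):

```
# load the file-input module and watch the web log
module(load="imfile")
input(type="imfile"
      File="/var/log/webapp/access.log"
      Tag="weblog:"
      Severity="info"
      Facility="local6")
# forward everything to the Flume syslog source over TCP
action(type="omfwd" target="flume-host" port="5140" protocol="tcp")
```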

 

Using either of those methods, you could log locally and stream the logs into Flume; both handle log file rotation and are not susceptible to event loss (syslog daemons track how far they have read into the log file).
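On the Flume side, the receiving end of either method could be a syslog TCP source feeding the same channel (host and port are assumptions and must match whatever the forwarder sends to):

```
tier1.sources.syslogSrc.type = syslogtcp
tier1.sources.syslogSrc.host = 0.0.0.0
tier1.sources.syslogSrc.port = 5140
tier1.sources.syslogSrc.channels = hbaseChannel
```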

 

HTH!

-PD

Expert Contributor
Posts: 161
Registered: ‎09-29-2014

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

thanks very much, buddy.

 

This is very helpful for me. Yes, I have tested the spoolDir source; it's really very good when real time isn't needed.

 

But most of our requirements need real time. Thanks again; I will go test the solutions you provided.
