
Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Expert Contributor

Hi, everyone

 

As you may know, when we use Flume to collect web logs and sync them into HBase, the basic steps are to create a table in HBase and set up flume.conf for the source, channel, and sink.

 

But how do we define the column mapping if we use org.apache.flume.sink.hbase.RegexHbaseEventSerializer? The documentation doesn't describe the serializer details. Where can I find them?

 

serializer: org.apache.flume.sink.hbase.SimpleHbaseEventSerializer (default increment column = "iCol", payload column = "pCol")
serializer.*: Properties to be passed to the serializer.

 

Currently I can only use org.apache.flume.sink.hbase.SimpleHbaseEventSerializer, but that serializer puts the whole line into a single column, which is not what I want.

 

Can anyone give some suggestions? Thanks.

5 REPLIES

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Super Collaborator
When using the RegexHbaseEventSerializer, you need to specify the following properties:
regex
table
columnFamily
colNames

Here is an example. Note that to use escape characters in the regex, you need to escape the backslash:

tier1.sinks.hbaseSink.channel = hbaseChannel
tier1.sinks.hbaseSink.type = org.apache.flume.sink.hbase.HBaseSink
tier1.sinks.hbaseSink.table = students
tier1.sinks.hbaseSink.columnFamily = info
tier1.sinks.hbaseSink.serializer=org.apache.flume.sink.hbase.RegexHbaseEventSerializer
tier1.sinks.hbaseSink.serializer.regex=(\\d+)\\s(\\S+)\\s(\\S+)\\s(\\d+)\\s(.+)
tier1.sinks.hbaseSink.serializer.colNames=id,first_name,last_name,age,gpa
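
To sanity-check the regex before wiring it into Flume, here is a small Python sketch. The sample line and field values are purely illustrative; note that the backslashes are doubled in flume.conf because the properties file consumes one level of escaping, so the effective regex is the single-backslash form below:

```python
import re

# Effective regex after the properties file unescapes \\d to \d, etc.
pattern = re.compile(r"(\d+)\s(\S+)\s(\S+)\s(\d+)\s(.+)")
col_names = ["id", "first_name", "last_name", "age", "gpa"]

# Hypothetical sample log line matching the five capture groups
line = "1 John Doe 25 3.8"

m = pattern.match(line)
# Each capture group maps, in order, to one colNames entry;
# each becomes a cell in the configured column family ("info")
columns = dict(zip(col_names, m.groups()))
print(columns)
```

If the match comes back None here, the serializer would silently drop the event, so this kind of offline check is worth doing before deploying the agent.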

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Expert Contributor

Thanks.

 

I had already gathered this information from the source code.

 

Another question:

 

Most of our logs are text files, and we'd like to use the exec source to sync these log files into HBase via the regex serializer in real time. But I've checked the documentation, and it seems tail -F can't guarantee that no data will be lost (I tried restarting Flume and found that tail -F really does lose data). What's your suggestion for this case?

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Super Collaborator

If you don't need the logs in real time, then I would suggest using the spooldir source to read the log files after they've been rotated (you'd either want to use a separate directory they're rotated into, or use an ignorePattern in the spooldir source to exclude the active file). If you do need real time, and are using Apache httpd, you could use a couple of methods:

 

1.  Use Apache's piped log functionality to send logs directly to Flume (via a syslog source, netcat source, etc.)

http://www.oreillynet.com/pub/a/sysadmin/2006/10/12/httpd-syslog.html
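
For the receiving side, a minimal sketch of a Flume syslog TCP source, assuming the agent and channel names from the earlier example (host and port are illustrative):

```
tier1.sources.syslogSrc.type = syslogtcp
tier1.sources.syslogSrc.host = 0.0.0.0
tier1.sources.syslogSrc.port = 5140
tier1.sources.syslogSrc.channels = hbaseChannel
```

The syslog daemon (or piped-log helper) on the web server would then forward to port 5140 on the Flume host.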

 

2.  Use a local rsyslog (or syslog-ng) agent to monitor the log files and forward them to a syslog source on the Flume server

http://www.rsyslog.com/doc/master/configuration/modules/imfile.html

http://www.balabit.com/sites/default/files/documents/syslog-ng-pe-4.0-guides/en/syslog-ng-pe-v4.0-gu...
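
For option 2, a minimal rsyslog sketch using the imfile module; the log path, tag, Flume hostname, and port are all illustrative and would need to match your environment:

```
# Load the file-input module and watch the web access log
module(load="imfile")
input(type="imfile"
      File="/var/log/httpd/access_log"
      Tag="weblog:")

# Forward everything to the Flume syslog source over TCP (@@ = TCP)
*.* @@flume-host:5140
```

Because imfile records its position in the file, a restart of rsyslog resumes where it left off rather than losing or replaying events, which addresses the tail -F concern.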

 

Using either of those methods, you could log locally and stream the logs into Flume; they handle log file rotation and are not susceptible to event loss (syslog daemons track how far they have read in the log file).

 

HTH!

-PD

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

Expert Contributor

Thanks very much, buddy.

 

This is very helpful for me. Yes, I have tested the spooldir source; it's really very good if there's no need for real time.

 

But most of our requirements need real time. Thanks again, I will go test the solutions you provided.

Re: Flume org.apache.flume.sink.hbase.RegexHbaseEventSerializer

New Contributor
Can a regex help here? That is, to split the line into fields and store each one in its own column?