Created on 09-07-2015 07:53 AM - edited 09-16-2022 02:40 AM
Hello,
Until now, we have used Flume to transfer data once a day from a spool directory to an HDFS sink through a memory channel.
Now we want to do it every 5 minutes, but the Flume channel becomes full at the second import (after 10 minutes).
---
2015-09-07 16:08:04,083 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Last read was never committed - resetting mark position.
2015-09-07 16:08:04,085 WARN org.apache.flume.source.SpoolDirectorySource: The channel is full, and cannot write data now. The source will try again after 4000 milliseconds
---
Flume input: 15-20 files every 5 minutes; each file is 10-600 KB.
Flume configuration:
What should we change in our configuration?
How can we find out whether the bottleneck is the channel size or the sink write speed?
Thank you!
Alina GHERMAN
Created 09-07-2015 09:43 AM
Ideally, if your sinks are delivering fast enough, your channel size should stay near zero. If your channel size is growing, it indicates that your sinks are not delivering fast enough or that there are issues downstream; you can either increase the batchSize or add more sinks. Can you post your Flume configuration? That might give a better indication of where improvements can be made. Are you seeing any errors when delivering to HDFS?
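One way to see which side is the bottleneck is Flume's built-in JSON metrics reporting. A sketch, assuming you can restart the agent with extra Java options; the port is arbitrary and the component names are placeholders for whatever your agent uses:

# start the agent with HTTP metrics reporting enabled, e.g. via the agent's Java options
-Dflume.monitoring.type=http -Dflume.monitoring.port=34545

# then poll the agent and watch the counters over a few minutes
curl http://<agent-host>:34545/metrics
# CHANNEL.<channel-name> -> ChannelFillPercentage / ChannelSize shows whether the channel is backing up
# SINK.<sink-name> -> EventDrainSuccessCount shows how fast the sink is actually writing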
Created 09-08-2015 12:36 AM
Hello,
Thank you. There are no errors when delivering to HDFS.
Note:
- The interceptor only normalizes some inputs.
- I tried adding a thread count setting to the sink, but with no success (it made no difference).
# source definition
projectName.sources.spooldir-source.type = spooldir
projectName.sources.spooldir-source.spoolDir = /var/flume/in
projectName.sources.spooldir-source.basenameHeader = true
projectName.sources.spooldir-source.basenameHeaderKey = basename
projectName.sources.spooldir-source.batchSize = 10
projectName.sources.spooldir-source.deletePolicy = immediate
# Max blob size: 1.5 GB
projectName.sources.spooldir-source.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
projectName.sources.spooldir-source.deserializer.maxBlobLength = 1610000000

# Attach the interceptor to the source
projectName.sources.spooldir-source.interceptors = json-interceptor
projectName.sources.spooldir-source.interceptors.json-interceptor.type = com.company.analytics.flume.interceptor.JsonInterceptor$Builder
# Define the event headers. basenameHeader must be the same as source.basenameHeaderKey (default is basename)
projectName.sources.spooldir-source.interceptors.json-interceptor.basenameHeader = basename
projectName.sources.spooldir-source.interceptors.json-interceptor.resourceHeader = resources
projectName.sources.spooldir-source.interceptors.json-interceptor.ssidHeader = ssid

# channel definition
projectName.channels.mem-channel-1.type = memory
projectName.channels.mem-channel-1.capacity = 100000
projectName.channels.mem-channel-1.transactionCapacity = 1000

# sink definition
projectName.sinks.hdfs-sink-1.type = hdfs
projectName.sinks.hdfs-sink-1.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-1.hdfs.filePrefix = %{resources}_%{ssid}
projectName.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-1.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-1.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-1.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-1.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-1.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-1.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-1.hdfs.idleTimeout = 60

# connect source and sink to channel
projectName.sources.spooldir-source.channels = mem-channel-1
projectName.sinks.hdfs-sink-1.channel = mem-channel-1
Would it help to add several identical sinks on the same machine?
Thank you!
Alina GHERMAN
Created 09-08-2015 09:09 AM
Adding sinks to your configuration will parallelize the delivery of events (i.e., adding a second sink will double your event drain rate, a third will triple it, etc.).
You'll want to be sure to add a unique hdfs.filePrefix to each sink in order to ensure there are no filename collisions. If you have multiple hosts, that uniqueness would need to cover hostnames as well.
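A minimal sketch of what a second sink draining the same channel could look like, reusing the settings from the configuration above (the sink name, the agent-level sinks line, and the "_2" prefix suffix are illustrative, not taken from the original setup):

projectName.sinks = hdfs-sink-1 hdfs-sink-2
projectName.sinks.hdfs-sink-2.type = hdfs
projectName.sinks.hdfs-sink-2.hdfs.path = hdfs://StandbyNameNode/path/to/in
# unique prefix so the two sinks never write to the same file name
projectName.sinks.hdfs-sink-2.hdfs.filePrefix = %{resources}_%{ssid}_2
projectName.sinks.hdfs-sink-2.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-2.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-2.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-2.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-2.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-2.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-2.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-2.hdfs.idleTimeout = 60
projectName.sinks.hdfs-sink-2.channel = mem-channel-1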
Created on 09-09-2015 01:06 AM - edited 09-09-2015 02:03 AM
I wanted to add one more piece of information:
- in Cloudera Manager ==> Charts ==> if we run "select channel_fill_percentage_across_flume_channels", we are at a maximum of 0.0001%...
Note: we have 2 channels, each with one sink and one source, both on the same machine.
This means that the error/warning we see in the logs is not what is actually blocking Flume from working...
Thank you!
Created on 09-29-2015 01:31 AM - edited 09-29-2015 01:33 AM
The problem was solved by changing the source from spooldir to http.
I think there is a problem with the spooldir source.
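For reference, a minimal sketch of what the replacement source definition might look like (the source name and port are illustrative; the JSONHandler shown is Flume's default HTTP handler and expects each POST body to be a JSON array of events):

projectName.sources.http-source.type = http
projectName.sources.http-source.bind = 0.0.0.0
projectName.sources.http-source.port = 44444
projectName.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
projectName.sources.http-source.channels = mem-channel-1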