Flume - Memory Channel Full

Champion Alumni

Hello,

 

Until now, we used Flume to transfer data once a day from a spool directory to an HDFS sink through a memory channel.

Now we want to do it every 5 minutes, but the Flume channel becomes full on the second import (after 10 minutes).

---

2015-09-07 16:08:04,083 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Last read was never committed - resetting mark position.

2015-09-07 16:08:04,085 WARN org.apache.flume.source.SpoolDirectorySource: The channel is full, and cannot write data now. The source will try again after 4000 milliseconds

---

Flume input: 15-20 files every 5 minutes. Each file is 10-600 KB.

Flume configuration:

  • Source: spool dir
  • Source maxBlobLength: 1610000000

  • Channel capacity: 100000 (we tried values up to 1610000000, but there was no change)
  • Channel transaction capacity: 1000

  • Sink batch size: 1000
  • Sink idle timeout: 60
  • Sink roll interval: 3600
  • Sink roll size: 63000000
  • Sink roll count: 0

What should we change in our configuration?

How can we find out whether the blocking part is the channel size or the sink writing speed?

 

Thank you!

 

Alina GHERMAN

 

 

1 ACCEPTED SOLUTION

Champion Alumni

The problem was solved by changing the source from spooldir to http. 

I think there is a problem with the spooldir source.
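For reference, a minimal HTTP source definition (the source name, port and bind address below are only examples) looks roughly like this:

projectName.sources.http-source.type = http
projectName.sources.http-source.bind = 0.0.0.0
projectName.sources.http-source.port = 5140
# the default handler accepts JSON-formatted events
projectName.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
projectName.sources.http-source.channels = mem-channel-1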

GHERMAN Alina


5 REPLIES


Ideally, if your sinks are delivering fast enough, your channel size should stay near zero. If your channel size is growing, it is an indication that your sinks are not delivering fast enough or that there are issues downstream; you can either increase the batchSize or add more sinks. Can you post your Flume configuration? That might give a better indication of where improvements can be made. Are you seeing any errors delivering to HDFS?
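A quick way to see which side is the bottleneck is to enable Flume's built-in JSON monitoring and compare the channel's put and take counters. A minimal sketch (the monitoring port and config file path are just examples):

flume-ng agent -n projectName -c conf -f flume.conf \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

# poll the metrics endpoint while an import is running
curl http://localhost:34545/metrics

In the JSON output, look at CHANNEL.mem-channel-1: if ChannelFillPercentage keeps climbing and EventTakeSuccessCount lags far behind EventPutSuccessCount, the sink is draining too slowly; if the channel stays near empty, the limit is somewhere else.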

Champion Alumni

Hello,

 

Thank you. There are no errors when delivering to HDFS.

 

Note:

- The interceptor only normalizes some inputs.

- I tried adding the thread-count configuration to the sink, but with no success (there was no difference).

 

 

# source definition
projectName.sources.spooldir-source.type = spooldir
projectName.sources.spooldir-source.spoolDir = /var/flume/in
projectName.sources.spooldir-source.basenameHeader = true
projectName.sources.spooldir-source.basenameHeaderKey = basename
projectName.sources.spooldir-source.batchSize = 10
projectName.sources.spooldir-source.deletePolicy = immediate
# Max blob size: 1.5 GB
projectName.sources.spooldir-source.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
projectName.sources.spooldir-source.deserializer.maxBlobLength = 1610000000
# Attach the interceptor to the source
projectName.sources.spooldir-source.interceptors = json-interceptor
projectName.sources.spooldir-source.interceptors.json-interceptor.type = com.company.analytics.flume.interceptor.JsonInterceptor$Builder
# Define event headers. basenameHeader must be the same as source.basenameHeaderKey (the default is basename)
projectName.sources.spooldir-source.interceptors.json-interceptor.basenameHeader = basename
projectName.sources.spooldir-source.interceptors.json-interceptor.resourceHeader = resources
projectName.sources.spooldir-source.interceptors.json-interceptor.ssidHeader = ssid

# channel definition
projectName.channels.mem-channel-1.type = memory
projectName.channels.mem-channel-1.capacity = 100000
projectName.channels.mem-channel-1.transactionCapacity = 1000

# sink definition
projectName.sinks.hdfs-sink-1.type = hdfs
projectName.sinks.hdfs-sink-1.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-1.hdfs.filePrefix = %{resources}_%{ssid}
projectName.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-1.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-1.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-1.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-1.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-1.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-1.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-1.hdfs.idleTimeout = 60

# connect source and sink to channel
projectName.sources.spooldir-source.channels = mem-channel-1
projectName.sinks.hdfs-sink-1.channel = mem-channel-1

Would it help to add several identical sinks on the same machine?

 

Thank you!

 

Alina GHERMAN



Adding sinks to your configuration will parallelize the delivery of events (i.e., adding a second sink will double your event drain rate, a third will triple it, and so on).

 

You'll want to be sure to add a unique hdfs.filePrefix to each sink in order to ensure there are no filename collisions.  If you have multiple hosts, that uniqueness would need to cover hostnames as well.
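As a rough sketch of what that could look like with your configuration (the sink name and the _s2 prefix suffix are just examples), a second HDFS sink draining the same channel would be declared alongside the first:

projectName.sinks = hdfs-sink-1 hdfs-sink-2

# second sink, identical to hdfs-sink-1 except for a unique file prefix
projectName.sinks.hdfs-sink-2.type = hdfs
projectName.sinks.hdfs-sink-2.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-2.hdfs.filePrefix = %{resources}_%{ssid}_s2
projectName.sinks.hdfs-sink-2.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-2.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-2.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-2.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-2.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-2.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-2.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-2.hdfs.idleTimeout = 60

# both sinks take events from the same memory channel
projectName.sinks.hdfs-sink-2.channel = mem-channel-1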

Champion Alumni

I wanted to add one more piece of information:

- in Cloudera Manager ==> Charts, running "select channel_fill_percentage_across_flume_channels" shows at most 0.0001%...

 

Note: we have 2 channels, each with one sink and one source, both on the same machine.

 

This means that the warning we see in the logs is not what is really blocking Flume from working...

 

 

Thank you!

 

GHERMAN Alina
