Flume - Memory Channel Full
Labels: Apache Flume, HDFS
Created on 09-07-2015 07:53 AM - edited 09-16-2022 02:40 AM
Hello,
Until now, we used Flume to transfer data once a day from a spool directory to an HDFS sink through a memory channel.
Now we want to do it every 5 minutes, but the Flume channel becomes full on the second import (after 10 minutes).
---
2015-09-07 16:08:04,083 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Last read was never committed - resetting mark position.
2015-09-07 16:08:04,085 WARN org.apache.flume.source.SpoolDirectorySource: The channel is full, and cannot write data now. The source will try again after 4000 milliseconds
---
Flume input: 15-20 files every 5 minutes, each 10-600 KB in size.
Flume configuration:
- Source: spooldir
- Source maxBlobLength: 1610000000
- Channel capacity: 100000 (we tried values up to 1610000000, but there was no change)
- Channel transaction capacity: 1000
- Sink batch size: 1000
- Sink idle timeout: 60
- Sink roll interval: 3600
- Sink roll size: 63000000
- Sink roll count: 0
What should we change in our configuration?
How can we find out whether the bottleneck is the channel size or the sink's write speed?
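One way to check this (a sketch added for illustration, not from the original thread; the port is an arbitrary choice): start the agent with Flume's built-in HTTP JSON monitoring and watch the channel counters while an import runs. If ChannelFillPercentage stays high and EventTakeSuccessCount grows slowly, the sink is the bottleneck; if the channel stays near empty while the source still logs full-channel warnings, look at the transaction sizing instead.
---
# Sketch: expose Flume's JSON metrics over HTTP (port 34545 is arbitrary).
flume-ng agent -n projectName -c conf -f flume.conf \
    -Dflume.monitoring.type=http \
    -Dflume.monitoring.port=34545

# Poll the metrics while an import runs.
curl http://localhost:34545/metrics
# Under CHANNEL.mem-channel-1, compare:
#   ChannelFillPercentage                        - how full the channel is
#   EventPutAttemptCount vs EventPutSuccessCount - pressure from the source
#   EventTakeSuccessCount                        - how fast the sink drains
---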
Thank you!
Alina GHERMAN
Created 09-07-2015 09:43 AM
Ideally, if your sinks are delivering fast enough, your channel size should stay near zero. A growing channel size indicates that your sinks are not delivering fast enough or that there are issues downstream; you can either increase the batchSize or add more sinks. Can you post your Flume configuration? That might give a better indication of where improvements can be made. Are you seeing any errors delivering to HDFS?
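As a hedged illustration of the batchSize route (the values are ours, not from the thread; the names match the configuration posted below): the HDFS sink commits hdfs.batchSize events per channel transaction, so the channel's transactionCapacity must be at least as large as the sink's batch size, and the two are usually raised together.
---
# Illustrative tuning sketch (hypothetical values, not from the thread).
# transactionCapacity must be >= the sink's hdfs.batchSize.
projectName.channels.mem-channel-1.transactionCapacity = 10000
projectName.sinks.hdfs-sink-1.hdfs.batchSize = 10000
---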
Created 09-08-2015 12:36 AM
Hello,
Thank you. There are no errors when delivering to HDFS.
Note:
- The interceptor only normalizes some inputs.
- I tried adding a thread count setting to the sink, but with no success (there was no difference).
---
# source definition
projectName.sources.spooldir-source.type = spooldir
projectName.sources.spooldir-source.spoolDir = /var/flume/in
projectName.sources.spooldir-source.basenameHeader = true
projectName.sources.spooldir-source.basenameHeaderKey = basename
projectName.sources.spooldir-source.batchSize = 10
projectName.sources.spooldir-source.deletePolicy = immediate
# Max blob size: 1.5 GB
projectName.sources.spooldir-source.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
projectName.sources.spooldir-source.deserializer.maxBlobLength = 1610000000
# Attach the interceptor to the source
projectName.sources.spooldir-source.interceptors = json-interceptor
projectName.sources.spooldir-source.interceptors.json-interceptor.type = com.company.analytics.flume.interceptor.JsonInterceptor$Builder
# Define the event headers. basenameHeader must match source.basenameHeaderKey (default is basename)
projectName.sources.spooldir-source.interceptors.json-interceptor.basenameHeader = basename
projectName.sources.spooldir-source.interceptors.json-interceptor.resourceHeader = resources
projectName.sources.spooldir-source.interceptors.json-interceptor.ssidHeader = ssid
# channel definition
projectName.channels.mem-channel-1.type = memory
projectName.channels.mem-channel-1.capacity = 100000
projectName.channels.mem-channel-1.transactionCapacity = 1000
# sink definition
projectName.sinks.hdfs-sink-1.type = hdfs
projectName.sinks.hdfs-sink-1.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-1.hdfs.filePrefix = %{resources}_%{ssid}
projectName.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-1.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-1.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-1.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-1.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-1.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-1.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-1.hdfs.idleTimeout = 60
# connect source and sink to channel
projectName.sources.spooldir-source.channels = mem-channel-1
projectName.sinks.hdfs-sink-1.channel = mem-channel-1
---
Would it help to add several identical sinks on the same machine?
Thank you!
Alina GHERMAN
Created 09-08-2015 09:09 AM
Adding sinks to your configuration will parallelize the delivery of events (i.e. a second sink will double your event drain rate, a third will triple it, etc.).
You'll want to give each sink a unique hdfs.filePrefix to ensure there are no filename collisions. If you have multiple hosts, that uniqueness needs to cover hostnames as well.
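A minimal sketch of that layout, reusing the names from the configuration posted above (the sink1_/sink2_ prefixes are illustrative). Each sink runs its own delivery thread, so two sinks drain the same channel roughly twice as fast.
---
# Sketch: two identical HDFS sinks draining the same memory channel.
# Unique filePrefix values prevent filename collisions.
projectName.sinks = hdfs-sink-1 hdfs-sink-2
projectName.sinks.hdfs-sink-1.type = hdfs
projectName.sinks.hdfs-sink-1.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-1.hdfs.filePrefix = sink1_%{resources}_%{ssid}
projectName.sinks.hdfs-sink-1.channel = mem-channel-1
projectName.sinks.hdfs-sink-2.type = hdfs
projectName.sinks.hdfs-sink-2.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-2.hdfs.filePrefix = sink2_%{resources}_%{ssid}
projectName.sinks.hdfs-sink-2.channel = mem-channel-1
---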
Created on 09-09-2015 01:06 AM - edited 09-09-2015 02:03 AM
I wanted to add one more piece of information:
- in Cloudera Manager ==> Charts, running "select channel_fill_percentage_across_flume_channels" shows at most 0.0001%...
Note: we have 2 channels, each with one source and one sink, both on the same machine.
This suggests that the warning we see in the logs is not what is actually blocking Flume...
Thank you!
Created on 09-29-2015 01:31 AM - edited 09-29-2015 01:33 AM
The problem was solved by changing the source from spooldir to http.
I think there is a problem with the spooldir source.
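For completeness, a minimal sketch of what the replacement source could look like (entirely an assumption on our side; the thread does not show the final configuration, and the port and handler below are illustrative). Clients POST events to the agent instead of dropping files into the spool directory.
---
# Hypothetical replacement source (port and handler are assumptions).
projectName.sources = http-source
projectName.sources.http-source.type = http
projectName.sources.http-source.port = 5140
projectName.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
projectName.sources.http-source.channels = mem-channel-1
---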
