Created on 09-07-2015 07:53 AM - edited 09-16-2022 02:40 AM
Hello,
Until now, we have used Flume to transfer data once a day from a spool directory to an HDFS sink through a memory channel.
Now we want to do it every 5 minutes, but the Flume channel becomes full at the second import (after 10 minutes).
---
2015-09-07 16:08:04,083 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Last read was never committed - resetting mark position.
2015-09-07 16:08:04,085 WARN org.apache.flume.source.SpoolDirectorySource: The channel is full, and cannot write data now. The source will try again after 4000 milliseconds
---
Flume input: 15-20 files every 5 minutes; each file is 10-600 KB.
Flume configuration:
What should we change in our configuration?
How can we find out whether the bottleneck is the channel size or the sink write speed?
Thank you!
Alina GHERMAN
Created 09-07-2015 09:43 AM
Ideally, if your sinks are delivering fast enough, your channel size should stay near zero. If your channel size is growing, it indicates that your sinks are not delivering fast enough or that there are issues downstream; you can either increase the batchSize or add more sinks. Can you post your Flume configuration? That might give a better indication of where improvements can be made. Are you seeing any errors when delivering to HDFS?
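One way to see which side is the bottleneck is Flume's built-in JSON metrics reporting. A sketch, assuming you can restart the agent with extra Java options; the port is arbitrary and the component names are placeholders for whatever your agent uses:

# start the agent with HTTP metrics reporting enabled, e.g. via the agent's Java options
-Dflume.monitoring.type=http -Dflume.monitoring.port=34545

# then poll the agent and watch the counters over a few minutes
curl http://<agent-host>:34545/metrics
# CHANNEL.<channel-name> -> ChannelFillPercentage / ChannelSize shows whether the channel is backing up
# SINK.<sink-name> -> EventDrainSuccessCount shows how fast the sink is actually writing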
Created 09-08-2015 12:36 AM
Hello,
Thank you. There are no errors when delivering to HDFS.
Note:
- The interceptor only normalizes some inputs.
- I tried adding a thread count setting to the sink, but with no success (it made no difference).
# source definition
projectName.sources.spooldir-source.type = spooldir
projectName.sources.spooldir-source.spoolDir = /var/flume/in
projectName.sources.spooldir-source.basenameHeader = true
projectName.sources.spooldir-source.basenameHeaderKey = basename
projectName.sources.spooldir-source.batchSize = 10
projectName.sources.spooldir-source.deletePolicy = immediate
# Max blob size: 1.5 GB
projectName.sources.spooldir-source.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
projectName.sources.spooldir-source.deserializer.maxBlobLength = 1610000000

# Attach the interceptor to the source
projectName.sources.spooldir-source.interceptors = json-interceptor
projectName.sources.spooldir-source.interceptors.json-interceptor.type = com.company.analytics.flume.interceptor.JsonInterceptor$Builder
# Define the event headers. basenameHeader must be the same as source.basenameHeaderKey (default is basename)
projectName.sources.spooldir-source.interceptors.json-interceptor.basenameHeader = basename
projectName.sources.spooldir-source.interceptors.json-interceptor.resourceHeader = resources
projectName.sources.spooldir-source.interceptors.json-interceptor.ssidHeader = ssid

# channel definition
projectName.channels.mem-channel-1.type = memory
projectName.channels.mem-channel-1.capacity = 100000
projectName.channels.mem-channel-1.transactionCapacity = 1000

# sink definition
projectName.sinks.hdfs-sink-1.type = hdfs
projectName.sinks.hdfs-sink-1.hdfs.path = hdfs://StandbyNameNode/path/to/in
projectName.sinks.hdfs-sink-1.hdfs.filePrefix = %{resources}_%{ssid}
projectName.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-1.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-1.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-1.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-1.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-1.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-1.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-1.hdfs.idleTimeout = 60

# connect source and sink to channel
projectName.sources.spooldir-source.channels = mem-channel-1
projectName.sinks.hdfs-sink-1.channel = mem-channel-1
Would it help to add several identical sinks on the same machine?
Thank you!
Alina GHERMAN
Created 09-08-2015 09:09 AM
Adding sinks to your configuration will parallelize the delivery of events (i.e., adding a second sink will double your event drain rate, a third will triple it, etc.).
You'll want to be sure to add a unique hdfs.filePrefix to each sink in order to ensure there are no filename collisions. If you have multiple hosts, that uniqueness would need to cover hostnames as well.
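A minimal sketch of what a second sink draining the same channel could look like, reusing the settings from the configuration above (the sink name, the agent-level sinks line, and the "_2" prefix suffix are illustrative, not taken from the original setup):

projectName.sinks = hdfs-sink-1 hdfs-sink-2
projectName.sinks.hdfs-sink-2.type = hdfs
projectName.sinks.hdfs-sink-2.hdfs.path = hdfs://StandbyNameNode/path/to/in
# unique prefix so the two sinks never write to the same file name
projectName.sinks.hdfs-sink-2.hdfs.filePrefix = %{resources}_%{ssid}_2
projectName.sinks.hdfs-sink-2.hdfs.fileSuffix = .json
projectName.sinks.hdfs-sink-2.hdfs.fileType = DataStream
projectName.sinks.hdfs-sink-2.hdfs.writeFormat = Text
projectName.sinks.hdfs-sink-2.hdfs.rollInterval = 3600
projectName.sinks.hdfs-sink-2.hdfs.rollSize = 63000000
projectName.sinks.hdfs-sink-2.hdfs.rollCount = 0
projectName.sinks.hdfs-sink-2.hdfs.batchSize = 1000
projectName.sinks.hdfs-sink-2.hdfs.idleTimeout = 60
projectName.sinks.hdfs-sink-2.channel = mem-channel-1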
Created on 09-09-2015 01:06 AM - edited 09-09-2015 02:03 AM
I wanted to add one more piece of information:
- in Cloudera Manager ==> Charts ==> if we run "select channel_fill_percentage_across_flume_channels", we are at a maximum of 0.0001%...
Note: we have 2 channels, each with one sink and one source, both on the same machine.
This means that the error/warning we see in the logs is not what is actually blocking Flume from working...
Thank you!
Created on 09-29-2015 01:31 AM - edited 09-29-2015 01:33 AM
The problem was solved by changing the source from spooldir to http.
I think there is a problem with the spooldir source.
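For reference, a minimal sketch of what the replacement source definition might look like (the source name and port are illustrative; the JSONHandler shown is Flume's default HTTP handler and expects each POST body to be a JSON array of events):

projectName.sources.http-source.type = http
projectName.sources.http-source.bind = 0.0.0.0
projectName.sources.http-source.port = 44444
projectName.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
projectName.sources.http-source.channels = mem-channel-1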