Created on 12-30-2015 08:01 PM - edited 09-16-2022 02:55 AM
Hi,
I am trying to ingest data into HDFS using Flume's spooling directory source (SpoolDir > Memory Channel > HDFS).
I am using CDH 5.4.2.
It works well with smaller files, but it fails with larger files. Please find my test scenarios below:
1. Files from a few KB up to 50-60 MB are processed without issue.
2. For files larger than 50-60 MB, it writes around 50 MB to HDFS and then I find the Flume agent has exited unexpectedly.
3. There is no error message in the Flume log.
I found that it tries to create the ".tmp" file (on HDFS) several times, and each time it writes a couple of megabytes (sometimes 2 MB, sometimes 45 MB) before the unexpected exit.
After some time, the last attempted ".tmp" file is renamed as completed (".tmp" removed) and the file in the source spoolDir is also renamed to ".COMPLETED", although the full file has not been written to HDFS.
In the real scenario, our files will be around 2 GB in size, so I need a robust Flume configuration to handle that workload.
Note:
1. The Flume agent node is part of the Hadoop cluster but is not a DataNode (it is an edge node).
2. The spool directory is on the local filesystem of the same server that runs the Flume agent.
3. All are physical servers (not virtual).
4. In the same cluster, we have a Twitter data feed with Flume running fine (although with a very small amount of data).
5. Please find below the flume.conf file I am using:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader =true
spoolDir.sources.src-1.batchSize = 100000
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
Kindly suggest whether there is any issue with my configuration or whether I am missing something.
Or is it a known issue that the Flume spooling directory source cannot handle bigger files?
Regards,
Obaid
Created 12-31-2015 12:51 AM
I guess the problem is the following configuration:
spoolDir.sources.src-1.batchSize = 100000
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
That happened to me before; the problem is that the channel capacity gets fully loaded.
So I suggest the following edits. Also pay attention to the description of each attribute:
spoolDir.sources.src-1.batchSize = 100000 #Number of messages to consume in one batch
spoolDir.channels.channel-1.transactionCapacity = 60000000 ## EDIT
spoolDir.channels.channel-1.capacity = 60000000 ##EDIT
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000 #Number of events written to file before it is flushed to HDFS
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0 #Number of seconds to wait before rolling current file (0 = never roll based on time interval)
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0 #File size to trigger roll, in bytes (0: never roll based on file size)
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0 #Number of events written to file before it is rolled (0 = never roll based on number of events)
Hope it works fine. Good luck!
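For clarity, this is roughly how the channel section would look with those capacity edits applied; it is only a sketch, and everything else in your flume.conf stays exactly as you posted it:
# channel sized so a full large file's events can sit in memory
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.capacity = 60000000
# transactionCapacity must not exceed capacity
spoolDir.channels.channel-1.transactionCapacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
# byteCapacity is the maximum bytes of event data the channel will hold (~6 GB here)
spoolDir.channels.channel-1.byteCapacity = 6442450944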
Created 12-31-2015 03:02 AM
Hi Tarek,
Thanks for your suggestion. Unfortunately there is no change; it behaves exactly the same.
I think the channel is not choking. If the channel were choking, there should be an explicit error or warning message in the Flume log.
However, in my case there is no indication of an error or warning in the log. I also checked in Cloudera Manager, and the channel was less than 5% utilized.
Please find below the portion of the Flume log from when I tested the scenario:
2015-12-31 18:32:06,917 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:32:06,941 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:33:09,056 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:33:09,238 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451557989057.tmp
2015-12-31 18:34:59,768 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
2015-12-31 18:34:59,783 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/var/run/cloudera-scm-agent/process/29180-flume-AGENT/flume.conf
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Added sinks: sink_to_hdfs1 Agent: spoolDir
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,806 INFO org.apache.flume.conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [spoolDir]
2015-12-31 18:34:59,806 INFO org.apache.flume.node.AbstractConfigurationProvider: Creating channels
2015-12-31 18:34:59,812 INFO org.apache.flume.channel.DefaultChannelFactory: Creating instance of channel channel-1 type memory
2015-12-31 18:34:59,816 INFO org.apache.flume.node.AbstractConfigurationProvider: Created channel channel-1
2015-12-31 18:34:59,817 INFO org.apache.flume.source.DefaultSourceFactory: Creating instance of source src-1, type spooldir
2015-12-31 18:34:59,826 INFO org.apache.flume.sink.DefaultSinkFactory: Creating instance of sink: sink_to_hdfs1, type: hdfs
2015-12-31 18:34:59,835 INFO org.apache.flume.node.AbstractConfigurationProvider: Channel channel-1 connected to [src-1, sink_to_hdfs1]
2015-12-31 18:34:59,843 INFO org.apache.flume.node.Application: Starting new configuration:{ sourceRunners:{src-1=EventDrivenSourceRunner: { source:Spool Directory source src-1: { spoolDir: /stage/AIU/ETL/temp/spool } }} sinkRunners:{sink_to_hdfs1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@2195b77d counterGroup:{ name:null counters:{} } }} channels:{channel-1=org.apache.flume.channel.MemoryChannel{name: channel-1}} }
2015-12-31 18:34:59,853 INFO org.apache.flume.node.Application: Starting Channel channel-1
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: channel-1: Successfully registered new MBean.
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: channel-1 started
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Sink sink_to_hdfs1
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Source src-1
2015-12-31 18:34:59,904 INFO org.apache.flume.source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /stage/AIU/ETL/temp/spool
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: sink_to_hdfs1: Successfully registered new MBean.
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SINK, name: sink_to_hdfs1 started
2015-12-31 18:34:59,928 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2015-12-31 18:34:59,929 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: src-1: Successfully registered new MBean.
2015-12-31 18:34:59,930 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: src-1 started
2015-12-31 18:34:59,962 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:34:59,995 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:35:00,614 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:35:00,875 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:35:06,132 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Preparing to move file /stage/AIU/ETL/temp/spool/bigfile03_2.csv to /stage/AIU/ETL/temp/spool/bigfile03_2.csv.COMPLETED
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing idle bucketWriter hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp at 1451558202947
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:36:42,975 INFO org.apache.flume.sink.hdfs.BucketWriter: Renaming hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp to hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615
2015-12-31 18:36:42,987 INFO org.apache.flume.sink.hdfs.HDFSEventSink: Writer callback called.
"2015-12-31 18:32:06,917" and "2015-12-31 18:35:00,875". After that at "2015-12-31 18:36:42,975 " it is closing the current file, although the file has not fully written.
The file with 1st attempt is 0-2MB in size and never renamed(kept .tmp).
The file with 2nd attempt was renamed (removed .tmp) and it was 50MB in size.
The source file was 100MB in size.
Thanks,
Obaid
Created 12-31-2015 03:57 AM
Created 12-31-2015 05:17 AM
Hi Tarek,
Thanks again for your time.
I tried both. If I set it to 0, it shows "batchSize must be greater than 0".
If I set it to a bigger value, then writing never starts, and it will never guarantee me a complete file (as a bigger batchSize will always combine events from other files).
So I think altering batchSize is not a viable idea, or I have failed to understand your point.
Please correct me if I am wrong.
-Obaid
Created 12-31-2015 05:55 AM
Created 05-27-2016 05:12 AM
Hi,
Sorry guys for the very late reply.
Anyway, I tried different combinations (memory/file channel, etc.) and found that Flume either fails or is too slow when loading larger files (more than 1 GB).
So I conclude that Flume is not a good fit for large files.
Instead, I am now using the HDFS NFS Gateway to dump files directly into HDFS via scp.
Believe me, a correctly configured NFS Gateway and NFS mount point are really cool old boys.
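For anyone who wants to try the same approach, here is a rough sketch of what it looks like; the gateway host, mount point, user, and paths below are example values for illustration, not my actual setup:
# mount the HDFS NFS Gateway on the edge node (the gateway only speaks NFSv3)
sudo mkdir -p /hdfs_nfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync <nfs-gateway-host>:/ /hdfs_nfs
# push a large file from the source server straight onto the mounted HDFS path
scp /stage/ETL/data/bigfile.csv etl@edge-node:/hdfs_nfs/user/etl/temp/spool/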
Thanks,
Obaid