Support Questions

Find answers, ask questions, and share your expertise

Flume Spooling Directory Source: Cannot load files larger files

avatar
Explorer

Hi,

 

I am trying to ingest using flume spooling directory to HDFS(SpoolDir > Memory Channel > HDFS).

I am using CDH 5.4.2.

 

It works well with smaller files, however it fails with larger files. Please find below my testing scenerio:

1. files with size Kbytes to 50-60MBytes, processed without issue.

2. files with greater than 50-60MB, it writes around 50MB to HDFS then I found flume agent unexpected exit.

3. There are no error message on flume log.

I found that it is trying to create the ".tmp" file (HDFS) several times, and each time writes couple of megabytes (some time 2MB, some time 45MB ) before unexpected exit.

After some time, the last tried ".tmp" file renamed as completed(".tmp" removed) and the file in source spoolDir also renamed as ".COMPLETED" although full file is not written to HDFS.

 

In real scenerio, our files will be around 2GB in size. So, need some robust flume configuration to handle workload.

 

Note:

1. Flume agent node is part of hadoop cluster and not a datanode (it is an edge node).

2. Spool directory is local filesystem on the same server running flume agent.

3. All are physical sever (not virtual).

4. In the same cluster, we have twitter datafeeding with flume running fine(although very small about of data).

5. Please find below flume.conf file I am using here:

 

#############start flume.conf####################

 

spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1

 

spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader =true
spoolDir.sources.src-1.batchSize = 100000

 

spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944

 

spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60

 

#############end flume.conf####################

 

Kindly suggest me whether there is any issue with my configuration or am I missing something.

Or is it a known issue that Flume SpoolDir cannot handle with bigger files.

 

Regards,

Obaid

 

 

 

 

 

 

1 ACCEPTED SOLUTION

avatar
Explorer

Hi,

 

Sorry guys for the reply whitch is too late.

 

Anyways, I tried with different combinition(memory/disk channel etc.) and found flume is either failing of too slow to  load larger files (more that 1G).

So, I conclude that flume is not good for lage files.

 

Instead, I am now using HDFS NFS gateways to dump file directly to HDFS using scp. 

Belive me, correctly configured NFS GW and NFS mount point are really cool old boys.

 

Thanks,

Obaid 

View solution in original post

6 REPLIES 6

avatar
Expert Contributor

i guess the problem is the following configuration :

 

spoolDir.sources.src-1.batchSize = 100000

 

spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944

 

 

spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0

 

as that happened with me before , the problem is that the channel capacity got fully loaded

 

 

so i suggest the following edit to be done , also pay attension to the description of each attribute :

 

 

spoolDir.sources.src-1.batchSize = 100000 #Number of messages to consume in one batch

 

spoolDir.channels.channel-1.transactionCapacity = 60000000 ## EDIT

spoolDir.channels.channel-1.capacity                   = 60000000 ##EDIT

 

spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000 #The max number of lines to read and send to the channel at a time 
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0 #Number of seconds to wait before rolling current file (0 = never roll based on time interval)


spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0 #File size to trigger roll, in bytes (0: never roll based on file size)
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0 #Number of events written to file before it rolled (0 = never roll based on number of events)

 

 

hope it works fine , good luck 

 

avatar
Explorer

Hi Tarek,

 

Thanks for your suggestion. Unfortunately ther is no change. It is behaving exactly same.

 

I think channel is not chocking. If channel choking there should be explicit error or warning message on flume log.

However, in my case there is no indication of error or warning on log.  I also checked on Cloudera Manager, the channel was less than 5% utilized.

Please find below portion of flume log while I tested the scenerio:

 

2015-12-31 18:32:06,917 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:32:06,941 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:33:09,056 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:33:09,238 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451557989057.tmp
2015-12-31 18:34:59,768 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
2015-12-31 18:34:59,783 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/var/run/cloudera-scm-agent/process/29180-flume-AGENT/flume.conf
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Added sinks: sink_to_hdfs1 Agent: spoolDir
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,806 INFO org.apache.flume.conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [spoolDir]
2015-12-31 18:34:59,806 INFO org.apache.flume.node.AbstractConfigurationProvider: Creating channels
2015-12-31 18:34:59,812 INFO org.apache.flume.channel.DefaultChannelFactory: Creating instance of channel channel-1 type memory
2015-12-31 18:34:59,816 INFO org.apache.flume.node.AbstractConfigurationProvider: Created channel channel-1
2015-12-31 18:34:59,817 INFO org.apache.flume.source.DefaultSourceFactory: Creating instance of source src-1, type spooldir
2015-12-31 18:34:59,826 INFO org.apache.flume.sink.DefaultSinkFactory: Creating instance of sink: sink_to_hdfs1, type: hdfs
2015-12-31 18:34:59,835 INFO org.apache.flume.node.AbstractConfigurationProvider: Channel channel-1 connected to [src-1, sink_to_hdfs1]
2015-12-31 18:34:59,843 INFO org.apache.flume.node.Application: Starting new configuration:{ sourceRunners:{src-1=EventDrivenSourceRunner: { source:Spool Directory source src-1: { spoolDir: /stage/AIU/ETL/temp/spool } }} sinkRunners:{sink_to_hdfs1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@2195b77d counterGroup:{ name:null counters:{} } }} channels:{channel-1=org.apache.flume.channel.MemoryChannel{name: channel-1}} }
2015-12-31 18:34:59,853 INFO org.apache.flume.node.Application: Starting Channel channel-1
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: channel-1: Successfully registered new MBean.
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: channel-1 started
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Sink sink_to_hdfs1
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Source src-1
2015-12-31 18:34:59,904 INFO org.apache.flume.source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /stage/AIU/ETL/temp/spool
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: sink_to_hdfs1: Successfully registered new MBean.
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SINK, name: sink_to_hdfs1 started
2015-12-31 18:34:59,928 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2015-12-31 18:34:59,929 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: src-1: Successfully registered new MBean.
2015-12-31 18:34:59,930 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: src-1 started
2015-12-31 18:34:59,962 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:34:59,995 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:35:00,614 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:35:00,875 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:35:06,132 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Preparing to move file /stage/AIU/ETL/temp/spool/bigfile03_2.csv to /stage/AIU/ETL/temp/spool/bigfile03_2.csv.COMPLETED
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing idle bucketWriter hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp at 1451558202947
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:36:42,975 INFO org.apache.flume.sink.hdfs.BucketWriter: Renaming hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp to hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615
2015-12-31 18:36:42,987 INFO org.apache.flume.sink.hdfs.HDFSEventSink: Writer callback called.

 

"2015-12-31 18:32:06,917" and "2015-12-31 18:35:00,875". After that at "2015-12-31 18:36:42,975 " it is closing the current file, although the file has not fully written.
The file with 1st attempt is 0-2MB in size and never renamed(kept .tmp).
The file with 2nd attempt was renamed (removed .tmp) and it was 50MB in size.

The source file was 100MB in size.

 

Thanks,

Obaid

avatar
Expert Contributor
i guess the configuration :

spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = X

will write only the first X number of lines of the file to the channel and send it , so you need to decide whether you want flume to send a certain amount of lines to the channel or do you want to send the whole file as it is (trigger to send data to channel : certain number of lines , or a whole file ) i prefer whole file to be considered as a trigger , but when file size is large then channel size will be the bottleneck

avatar
Explorer

Hi Tarek,

 

Thanks again for your time.

 

I tried with both, if i set to 0 it shows "batchSize must be greater than 0".

If i set to a bigger value, then writing never starts and this will never gurantee me a full file(as bigger batchSize will always combine with other files).

 

So, I think altering batchSize is not a viable idea or I failed to understand your point.

Please correct me if I am wrong.

 

-Obaid

 

avatar
Expert Contributor
if your configuration is working with smaller files , then the problem is for sure with the configuration your are using , so i suggest checking this post , it might be helpful for large files

https://community.cloudera.com/t5/Data-Ingestion-Integration/Flume-HDFS-sink-Can-t-write-large-files...

(joey's answer)

hope it helps , and good luck

avatar
Explorer

Hi,

 

Sorry guys for the reply whitch is too late.

 

Anyways, I tried with different combinition(memory/disk channel etc.) and found flume is either failing of too slow to  load larger files (more that 1G).

So, I conclude that flume is not good for lage files.

 

Instead, I am now using HDFS NFS gateways to dump file directly to HDFS using scp. 

Belive me, correctly configured NFS GW and NFS mount point are really cool old boys.

 

Thanks,

Obaid