<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Flume Spooling Directory Source: Cannot load larger files in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35715#M13823</link>
    <description>&lt;P&gt;Hi Tarek,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your suggestion. Unfortunately there is no change; it is behaving exactly the same.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think the channel is not choking. If the channel were choking, there should be an explicit error or warning message in the Flume log.&lt;/P&gt;&lt;P&gt;However, in my case there is no indication of an error or warning in the log. &amp;nbsp;I also checked in Cloudera Manager; the channel was less than 5% utilized.&lt;/P&gt;&lt;P&gt;Please find below the portion of the Flume log from when I tested the scenario:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;2015-12-31 18:32:06,917 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:32:06,941 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:33:09,056 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:33:09,238 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451557989057.tmp
2015-12-31 18:34:59,768 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
2015-12-31 18:34:59,783 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/var/run/cloudera-scm-agent/process/29180-flume-AGENT/flume.conf
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Added sinks: sink_to_hdfs1 Agent: spoolDir
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,806 INFO org.apache.flume.conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [spoolDir]
2015-12-31 18:34:59,806 INFO org.apache.flume.node.AbstractConfigurationProvider: Creating channels
2015-12-31 18:34:59,812 INFO org.apache.flume.channel.DefaultChannelFactory: Creating instance of channel channel-1 type memory
2015-12-31 18:34:59,816 INFO org.apache.flume.node.AbstractConfigurationProvider: Created channel channel-1
2015-12-31 18:34:59,817 INFO org.apache.flume.source.DefaultSourceFactory: Creating instance of source src-1, type spooldir
2015-12-31 18:34:59,826 INFO org.apache.flume.sink.DefaultSinkFactory: Creating instance of sink: sink_to_hdfs1, type: hdfs
2015-12-31 18:34:59,835 INFO org.apache.flume.node.AbstractConfigurationProvider: Channel channel-1 connected to [src-1, sink_to_hdfs1]
2015-12-31 18:34:59,843 INFO org.apache.flume.node.Application: Starting new configuration:{ sourceRunners:{src-1=EventDrivenSourceRunner: { source:Spool Directory source src-1: { spoolDir: /stage/AIU/ETL/temp/spool } }} sinkRunners:{sink_to_hdfs1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@2195b77d counterGroup:{ name:null counters:{} } }} channels:{channel-1=org.apache.flume.channel.MemoryChannel{name: channel-1}} }
2015-12-31 18:34:59,853 INFO org.apache.flume.node.Application: Starting Channel channel-1
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: channel-1: Successfully registered new MBean.
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: channel-1 started
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Sink sink_to_hdfs1
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Source src-1
2015-12-31 18:34:59,904 INFO org.apache.flume.source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /stage/AIU/ETL/temp/spool
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: sink_to_hdfs1: Successfully registered new MBean.
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SINK, name: sink_to_hdfs1 started
2015-12-31 18:34:59,928 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2015-12-31 18:34:59,929 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: src-1: Successfully registered new MBean.
2015-12-31 18:34:59,930 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: src-1 started
2015-12-31 18:34:59,962 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:34:59,995 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:35:00,614 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:35:00,875 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:35:06,132 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Preparing to move file /stage/AIU/ETL/temp/spool/bigfile03_2.csv to /stage/AIU/ETL/temp/spool/bigfile03_2.csv.COMPLETED
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing idle bucketWriter hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp at 1451558202947
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:36:42,975 INFO org.apache.flume.sink.hdfs.BucketWriter: Renaming hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp to hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615
2015-12-31 18:36:42,987 INFO org.apache.flume.sink.hdfs.HDFSEventSink: Writer callback called.&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As the log shows, the file was created twice, at "2015-12-31 18:33:09,238" and "2015-12-31 18:35:00,875". After that, at "2015-12-31 18:36:42,975", it closed the current file, although the file had not been fully written.&lt;BR /&gt;The file from the 1st attempt was 0-2MB in size and was never renamed (it kept the .tmp suffix).&lt;BR /&gt;The file from the 2nd attempt was renamed (.tmp removed) and was 50MB in size.&lt;/P&gt;&lt;P&gt;The source file was 100MB in size.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Obaid&lt;/P&gt;</description>
    <pubDate>Thu, 31 Dec 2015 11:02:06 GMT</pubDate>
    <dc:creator>Obaidul</dc:creator>
    <dc:date>2015-12-31T11:02:06Z</dc:date>
    <item>
      <title>Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35704#M13821</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to ingest data using a Flume spooling directory source to HDFS (SpoolDir &amp;gt; Memory Channel &amp;gt; HDFS).&lt;/P&gt;&lt;P&gt;I am using CDH 5.4.2.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It works well with smaller files; however, it fails with larger files. Please find below my testing scenario:&lt;/P&gt;&lt;P&gt;1. Files ranging from a few KB to 50-60MB are processed without issue.&lt;/P&gt;&lt;P&gt;2. For files greater than 50-60MB, it writes around 50MB to HDFS and then the Flume agent exits unexpectedly.&lt;/P&gt;&lt;P&gt;3. There is no error message in the Flume log.&lt;/P&gt;&lt;P&gt;I found that it tries to create the ".tmp" file (on HDFS)&amp;nbsp;several times, and each time writes a couple of megabytes (sometimes 2MB, sometimes 45MB) before the unexpected exit.&lt;/P&gt;&lt;P&gt;After some time, the last attempted ".tmp" file is renamed as completed (".tmp" removed) and the file in the source spoolDir is also renamed to ".COMPLETED", although the full file has not been written to HDFS.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the real scenario, our files will be around 2GB in size, so I need a robust Flume configuration to handle this workload.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Note:&lt;/P&gt;&lt;P&gt;1. The Flume agent node is part of the Hadoop cluster but is not a datanode (it is an edge node).&lt;/P&gt;&lt;P&gt;2. The spool directory is on the local filesystem of the same server running the Flume agent.&lt;/P&gt;&lt;P&gt;3. All are physical servers (not virtual).&lt;/P&gt;&lt;P&gt;4. In the same cluster, we have a Twitter data feed running fine with Flume&amp;nbsp;(although with a very small amount of data).&lt;/P&gt;&lt;P&gt;5. 
Please find below the flume.conf file I am using:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;#############start flume.conf####################&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;spoolDir.sources = src-1&lt;BR /&gt;spoolDir.channels = channel-1&lt;BR /&gt;spoolDir.sinks = sink_to_hdfs1&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;spoolDir.sources.src-1.type = spooldir&lt;BR /&gt;spoolDir.sources.src-1.channels = channel-1&lt;BR /&gt;spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/&lt;BR /&gt;spoolDir.sources.src-1.fileHeader = true&lt;BR /&gt;spoolDir.sources.src-1.basenameHeader = true&lt;BR /&gt;spoolDir.sources.src-1.batchSize = 100000&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;spoolDir.channels.channel-1.type = memory&lt;BR /&gt;spoolDir.channels.channel-1.transactionCapacity = 50000000&lt;BR /&gt;spoolDir.channels.channel-1.capacity = 60000000&lt;BR /&gt;spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20&lt;BR /&gt;spoolDir.channels.channel-1.byteCapacity = 6442450944&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;spoolDir.sinks.sink_to_hdfs1.type = hdfs&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.channel = channel-1&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;#############end flume.conf####################&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Kindly suggest whether there is an issue with my configuration or whether I am missing something.&lt;/P&gt;&lt;P&gt;Or is it a 
known issue that Flume SpoolDir cannot handle bigger files?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Obaid&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:55:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35704#M13821</guid>
      <dc:creator>Obaidul</dc:creator>
      <dc:date>2022-09-16T09:55:19Z</dc:date>
    </item>
    <item>
      <title>Re: Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35711#M13822</link>
      <description>&lt;P&gt;I guess the problem is the following configuration:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.sources.src-1.batchSize = 100000&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.channels.channel-1.transactionCapacity = 50000000&lt;BR /&gt;&lt;SPAN&gt;spoolDir.channels.channel-1.capacity = 60000000&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.channels.channel-1.byteCapacity = 6442450944&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This happened to me before; the problem is that the channel capacity gets fully loaded.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So I suggest making the following edits; also pay attention to the description of each attribute:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.sources.src-1.batchSize = 100000 # Number of messages to consume in one batch&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.channels.channel-1.transactionCapacity = 60000000 ## EDIT&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.channels.channel-1.capacity = 60000000 ## EDIT&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000 # The max number of lines to read and send to the channel at a time&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0 # Number of seconds to wait before rolling the current file (0 = never roll based on time interval)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0 # File size to trigger a roll, in bytes (0 = never roll based on file size)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0 # Number of events written to the file before it is rolled (0 = never roll based on number of events)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Hope it works fine, good luck.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Dec 2015 08:51:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35711#M13822</guid>
      <dc:creator>tarekabouzeid91</dc:creator>
      <dc:date>2015-12-31T08:51:05Z</dc:date>
    </item>
    <item>
      <title>Re: Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35715#M13823</link>
      <description>&lt;P&gt;Hi Tarek,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your suggestion. Unfortunately there is no change; it is behaving exactly the same.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think the channel is not choking. If the channel were choking, there should be an explicit error or warning message in the Flume log.&lt;/P&gt;&lt;P&gt;However, in my case there is no indication of an error or warning in the log. &amp;nbsp;I also checked in Cloudera Manager; the channel was less than 5% utilized.&lt;/P&gt;&lt;P&gt;Please find below the portion of the Flume log from when I tested the scenario:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;2015-12-31 18:32:06,917 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:32:06,941 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:33:09,056 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:33:09,238 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451557989057.tmp
2015-12-31 18:34:59,768 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
2015-12-31 18:34:59,783 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/var/run/cloudera-scm-agent/process/29180-flume-AGENT/flume.conf
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Added sinks: sink_to_hdfs1 Agent: spoolDir
2015-12-31 18:34:59,788 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,789 INFO org.apache.flume.conf.FlumeConfiguration: Processing:sink_to_hdfs1
2015-12-31 18:34:59,806 INFO org.apache.flume.conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [spoolDir]
2015-12-31 18:34:59,806 INFO org.apache.flume.node.AbstractConfigurationProvider: Creating channels
2015-12-31 18:34:59,812 INFO org.apache.flume.channel.DefaultChannelFactory: Creating instance of channel channel-1 type memory
2015-12-31 18:34:59,816 INFO org.apache.flume.node.AbstractConfigurationProvider: Created channel channel-1
2015-12-31 18:34:59,817 INFO org.apache.flume.source.DefaultSourceFactory: Creating instance of source src-1, type spooldir
2015-12-31 18:34:59,826 INFO org.apache.flume.sink.DefaultSinkFactory: Creating instance of sink: sink_to_hdfs1, type: hdfs
2015-12-31 18:34:59,835 INFO org.apache.flume.node.AbstractConfigurationProvider: Channel channel-1 connected to [src-1, sink_to_hdfs1]
2015-12-31 18:34:59,843 INFO org.apache.flume.node.Application: Starting new configuration:{ sourceRunners:{src-1=EventDrivenSourceRunner: { source:Spool Directory source src-1: { spoolDir: /stage/AIU/ETL/temp/spool } }} sinkRunners:{sink_to_hdfs1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@2195b77d counterGroup:{ name:null counters:{} } }} channels:{channel-1=org.apache.flume.channel.MemoryChannel{name: channel-1}} }
2015-12-31 18:34:59,853 INFO org.apache.flume.node.Application: Starting Channel channel-1
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: channel-1: Successfully registered new MBean.
2015-12-31 18:34:59,903 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: channel-1 started
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Sink sink_to_hdfs1
2015-12-31 18:34:59,903 INFO org.apache.flume.node.Application: Starting Source src-1
2015-12-31 18:34:59,904 INFO org.apache.flume.source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /stage/AIU/ETL/temp/spool
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: sink_to_hdfs1: Successfully registered new MBean.
2015-12-31 18:34:59,905 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SINK, name: sink_to_hdfs1 started
2015-12-31 18:34:59,928 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2015-12-31 18:34:59,929 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: src-1: Successfully registered new MBean.
2015-12-31 18:34:59,930 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: src-1 started
2015-12-31 18:34:59,962 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2015-12-31 18:34:59,995 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41414
2015-12-31 18:35:00,614 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2015-12-31 18:35:00,875 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:35:06,132 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Preparing to move file /stage/AIU/ETL/temp/spool/bigfile03_2.csv to /stage/AIU/ETL/temp/spool/bigfile03_2.csv.COMPLETED
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing idle bucketWriter hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp at 1451558202947
2015-12-31 18:36:42,947 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp
2015-12-31 18:36:42,975 INFO org.apache.flume.sink.hdfs.BucketWriter: Renaming hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615.tmp to hdfs://nameservice1/user/etl/temp/spool/bigfile03_2.csv-.1451558100615
2015-12-31 18:36:42,987 INFO org.apache.flume.sink.hdfs.HDFSEventSink: Writer callback called.&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As the log shows, the file was created twice, at "2015-12-31 18:33:09,238" and "2015-12-31 18:35:00,875". After that, at "2015-12-31 18:36:42,975", it closed the current file, although the file had not been fully written.&lt;BR /&gt;The file from the 1st attempt was 0-2MB in size and was never renamed (it kept the .tmp suffix).&lt;BR /&gt;The file from the 2nd attempt was renamed (.tmp removed) and was 50MB in size.&lt;/P&gt;&lt;P&gt;The source file was 100MB in size.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Obaid&lt;/P&gt;</description>
      <pubDate>Thu, 31 Dec 2015 11:02:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35715#M13823</guid>
      <dc:creator>Obaidul</dc:creator>
      <dc:date>2015-12-31T11:02:06Z</dc:date>
    </item>
    <item>
      <title>Re: Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35716#M13824</link>
      <description>I guess the configuration:&lt;BR /&gt;&lt;BR /&gt;spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = X&lt;BR /&gt;&lt;BR /&gt;will write only the first X lines of the file to the channel and send them, so you need to decide whether you want Flume to send a certain number of lines to the channel or to send the whole file as it is (the trigger for sending data to the channel: a certain number of lines, or a whole file). I prefer the whole file to be the trigger, but when the file size is large, the channel size will be the bottleneck.</description>
      <pubDate>Thu, 31 Dec 2015 11:57:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35716#M13824</guid>
      <dc:creator>tarekabouzeid91</dc:creator>
      <dc:date>2015-12-31T11:57:10Z</dc:date>
    </item>
    <item>
      <title>Re: Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35720#M13825</link>
      <description>&lt;P&gt;Hi Tarek,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks again for your time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried both: if I set it to 0, it shows "batchSize must be greater than 0".&lt;/P&gt;&lt;P&gt;If I set it to a bigger value, writing never starts, and it will never guarantee me a full file (as a bigger batchSize will always combine events from other files).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So I think altering&amp;nbsp;&lt;SPAN&gt;batchSize&lt;/SPAN&gt;&amp;nbsp;is not a viable option, or I have failed to understand your point.&lt;/P&gt;&lt;P&gt;Please correct me if I am wrong.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;-Obaid&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Dec 2015 13:17:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35720#M13825</guid>
      <dc:creator>Obaidul</dc:creator>
      <dc:date>2015-12-31T13:17:35Z</dc:date>
    </item>
    <item>
      <title>Re: Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35723#M13826</link>
      <description>If your configuration works with smaller files, then the problem is surely with the configuration you are using, so I suggest checking this post; it might be helpful for large files:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://community.cloudera.com/t5/Data-Ingestion-Integration/Flume-HDFS-sink-Can-t-write-large-files/td-p/23456" target="_blank"&gt;https://community.cloudera.com/t5/Data-Ingestion-Integration/Flume-HDFS-sink-Can-t-write-large-files/td-p/23456&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;(Joey's answer)&lt;BR /&gt;&lt;BR /&gt;Hope it helps, and good luck.</description>
      <pubDate>Thu, 31 Dec 2015 13:55:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/35723#M13826</guid>
      <dc:creator>tarekabouzeid91</dc:creator>
      <dc:date>2015-12-31T13:55:45Z</dc:date>
    </item>
    <item>
      <title>Re: Flume Spooling Directory Source: Cannot load larger files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/41412#M13827</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry, guys, for the very late reply.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Anyway, I tried different combinations (memory/disk channel, etc.) and found Flume either fails or is too slow when loading larger files (more than 1GB).&lt;/P&gt;&lt;P&gt;So, I conclude that Flume is not a good fit for large files.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Instead, I am now using the HDFS NFS Gateway to dump files directly into HDFS using scp.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Believe me, a correctly configured NFS Gateway and NFS mount point work really well.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Obaid&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 27 May 2016 12:12:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Flume-Spooling-Directory-Source-Cannot-load-files-larger/m-p/41412#M13827</guid>
      <dc:creator>Obaidul</dc:creator>
      <dc:date>2016-05-27T12:12:25Z</dc:date>
    </item>
  </channel>
</rss>

