
Flume hdfs sink is creating multiple tiny files

Contributor

The Flume HDFS sink is creating too many small files in HDFS even though the source file being loaded is as small as 200 KB. I have set rollSize to 256 MB, but even that is not helping. Some files created on HDFS are as small as 1.5 KB.

Please help me get a single file, because this file has to be read by R for keyword filtering.

My flume.conf is:

agent.sources = avro-collection-source
agent.channels = memoryChannel
agent.sinks = hdfs-sink
# For each one of the sources, the type is defined
agent.sources.avro-collection-source.type = avro
agent.sources.avro-collection-source.bind = 10.0.0.6
agent.sources.avro-collection-source.port = 60000
agent.sources.avro-collection-source.interceptors = interceptor1
agent.sources.avro-collection-source.interceptors.interceptor1.type = timestamp
# The channel can be defined as follows.
agent.sources.avro-collection-source.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.hdfs-sink.type = hdfs
#agent.sinks.hdfs-sink.hdfs.path = hdfs://10.0.10.4:8020/flume/events
#agent.sinks.hdfs-sink.hdfs.path = hdfs://40.122.210.251:8020/user/hdfs/flume
agent.sinks.hdfs-sink.hdfs.path    = hdfs://40.122.210.251:8020/user/hdfs/flume/%y-%m-%d/%H%M/%S
agent.sinks.hdfs-sink.useLocalTimeStamp = true
agent.sinks.hdfs-sink.hdfs.callTimeout = 180000
#Specify the channel the sink should use
agent.sinks.hdfs-sink.channel = memoryChannel
# File size to trigger roll, in bytes (256Mb)
agent.sinks.hdfs-sink.rollSize = 268435456
# Number of seconds to wait before rolling current file (in seconds)
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollCount = 0
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

Thanks,

Shilpa


Super Guru
@shilpa kumar

There is a typo in your Flume conf file. Notice that you define your sink name as "hdfs-sink", but for rollInterval and rollCount you use "sink.hdfs". This means the default values for rollInterval and rollCount kick in, which are 30 seconds or 10 events, whichever comes first. These are the offending lines:

agent.sinks.hdfs-sink.rollSize = 268435456
# Number of seconds to wait before rolling current file (in seconds)
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollCount = 0
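For reference, a sketch of those lines corrected to use the "hdfs-sink" name from the rest of your configuration (note that the Flume HDFS sink documentation lists the roll settings under the "hdfs." prefix, e.g. hdfs.rollSize, so the prefix may be needed as well):

# File size to trigger roll, in bytes (0 = never roll based on file size)
agent.sinks.hdfs-sink.hdfs.rollSize = 268435456
# Number of seconds to wait before rolling the current file (0 = never roll based on time)
agent.sinks.hdfs-sink.hdfs.rollInterval = 0
# Number of events written before rolling (0 = never roll based on event count)
agent.sinks.hdfs-sink.hdfs.rollCount = 0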

Contributor

I corrected that typo; however, I still face the same issue. 😞

My flume.conf now has the following entries:

# File size to trigger roll, in bytes (256Mb)
agent.sinks.hdfs-sink.rollSize = 268435456
# Number of seconds to wait before rolling current file (in seconds)
agent.sinks.hdfs-sink.rollInterval = 0
agent.sinks.hdfs-sink.rollCount = 0

The Flume log shows multiple entries like this:

17/01/19 17:13:51 INFO hdfs.BucketWriter: Creating hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/34/FlumeData.1484867616377.tmp
17/01/19 17:13:51 INFO hdfs.BucketWriter: Closing hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/34/FlumeData.1484867616377.tmp
17/01/19 17:13:51 INFO hdfs.BucketWriter: Renaming hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/34/FlumeData.1484867616377.tmp to hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/34/FlumeData.1484867616377
17/01/19 17:13:51 INFO hdfs.BucketWriter: Creating hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/34/FlumeData.1484867616378.tmp
17/01/19 17:13:51 INFO hdfs.HDFSSequenceFile: writeFormat = Writable, UseRawLocalFileSystem = false
17/01/19 17:13:51 INFO hdfs.BucketWriter: Creating hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/37/FlumeData.1484867631947.tmp
17/01/19 17:13:52 INFO hdfs.BucketWriter: Closing hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/37/FlumeData.1484867631947.tmp
17/01/19 17:13:52 INFO hdfs.BucketWriter: Renaming hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/37/FlumeData.1484867631947.tmp to hdfs://40.122.210.251:8020/user/hdfs/flume/17-01-19/1713/37/FlumeData.1484867631947

Super Guru

@shilpa kumar

Did you restart your Flume agents? If yes, please share your complete flume.conf. You basically have to restart all your Flume agents again.

Contributor

Yes, of course I restarted all the agents.

Super Guru
@shilpa kumar

After fixing your initial issue, one more thing that comes to mind is that what you are seeing now may be the result of "hdfs.batchSize". The default value is 100, so the data in your memory channel is flushed to HDFS every 100 records; then, once a file reaches 256 MB, it is rolled into one big 256 MB file. It is possible that what you are seeing is the result of this flushing. You can confirm by reducing rollSize to a value where you know for sure the file would be rolled (I am assuming you currently do not have 256 MB of data).
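For example (a sketch only; the values are placeholders to adjust to your data volume, and the property names follow the Flume HDFS sink documentation):

# hdfs.batchSize: number of events written to the file before it is flushed to HDFS (default 100)
agent.sinks.hdfs-sink.hdfs.batchSize = 100
# Temporarily roll at 64 KB so that, with only ~200 KB of data, you can see whether size-based rolling works at all
agent.sinks.hdfs-sink.hdfs.rollSize = 65536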

Contributor

As of now, the file my exec source is listening to is about 64 KB. I tried setting agent.sinks.hdfs-sink.rollSize to 65536 and also agent.sinks.hdfs-sink.rollInterval to 300 (so that it doesn't roll the file for 5 minutes), but it didn't work, either together or separately (i.e. using only one of the properties). Nothing is working.

My flume.conf:

agent.sources = avro-collection-source
agent.channels = memoryChannel
agent.sinks = hdfs-sink
# For each one of the sources, the type is defined
agent.sources.avro-collection-source.type = avro
agent.sources.avro-collection-source.bind = 10.0.0.6
agent.sources.avro-collection-source.port = 60000
agent.sources.avro-collection-source.interceptors = interceptor1
agent.sources.avro-collection-source.interceptors.interceptor1.type = timestamp
# The channel can be defined as follows.
agent.sources.avro-collection-source.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.hdfs-sink.type = hdfs
#agent.sinks.hdfs-sink.hdfs.path = hdfs://10.0.10.4:8020/flume/events
#agent.sinks.hdfs-sink.hdfs.path = hdfs://40.122.210.251:8020/user/hdfs/flume
agent.sinks.hdfs-sink.hdfs.path    = hdfs://40.122.210.251:8020/user/hdfs/flume/%y-%m-%d/%H%M/%S
agent.sinks.hdfs-sink.useLocalTimeStamp = true
agent.sinks.hdfs-sink.hdfs.callTimeout = 180000
#Specify the channel the sink should use
agent.sinks.hdfs-sink.channel = memoryChannel
# File size to trigger roll, in bytes
agent.sinks.hdfs-sink.rollSize = 65536
# Number of seconds to wait before rolling current file (in seconds)
agent.sinks.hdfs-sink.rollInterval = 300
#agent.sinks.hdfs-sink.rollCount = 0
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000

Contributor

Hi @mqureshi @Deepesh,

Can you please help?

Expert Contributor

With this setup:

agent.sinks.hdfs-sink.hdfs.path    = hdfs://40.122.210.251:8020/user/hdfs/flume/%y-%m-%d/%H%M/%S
agent.sinks.hdfs-sink.useLocalTimeStamp = true

Flume has to roll the files every second. Just set the mask to something like:

%y-%m-%d/%H
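In other words, something like this (a sketch that keeps the rest of your path as-is):

# Bucket by day and hour only, so Flume does not start a new file every second
agent.sinks.hdfs-sink.hdfs.path = hdfs://40.122.210.251:8020/user/hdfs/flume/%y-%m-%d/%H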

Contributor

I did this and it helped a little. After deleting %M/%S, Flume no longer creates the extra sub-directories under the /user/hdfs/flume/17-01-20/ directory that it created earlier:

 hadoop fs -ls /user/hdfs/flume/17-01-20/
Found 6 items
drwxr-xr-x   - hdfs supergroup          0 2017-01-20 12:24 /user/hdfs/flume/17-01-20/1224 <- this and the two directories below it (plus more sub-directories under them) were created earlier
drwxr-xr-x   - hdfs supergroup          0 2017-01-20 12:25 /user/hdfs/flume/17-01-20/1225
drwxr-xr-x   - hdfs supergroup          0 2017-01-20 12:26 /user/hdfs/flume/17-01-20/1226
drwxr-xr-x   - hdfs supergroup          0 2017-01-20 13:23 /user/hdfs/flume/17-01-20/13 <- without %M%S
drwxr-xr-x   - hdfs supergroup          0 2017-01-20 13:21 /user/hdfs/flume/17-01-20/1320 <- without %S
-rw-r--r--   2 hdfs supergroup          0 2017-01-20 13:10 /user/hdfs/flume/17-01-20/jornada

However, multiple tiny files are still being created under the <date>/<hour> directory.

Like these:

[root@LnxMasterNode01 RSS]# hadoop fs -ls /user/hdfs/flume/17-01-20/13
Found 96 items
-rw-r--r--   2 hdfs supergroup       1138 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151434
-rw-r--r--   2 hdfs supergroup       1069 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151435
-rw-r--r--   2 hdfs supergroup       1122 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151436
-rw-r--r--   2 hdfs supergroup        594 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151437
-rw-r--r--   2 hdfs supergroup       1131 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151438
-rw-r--r--   2 hdfs supergroup       1203 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151439
-rw-r--r--   2 hdfs supergroup       1509 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151440
-rw-r--r--   2 hdfs supergroup        963 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151441
-rw-r--r--   2 hdfs supergroup        865 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151442
-rw-r--r--   2 hdfs supergroup       1273 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151443
-rw-r--r--   2 hdfs supergroup        961 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151444
-rw-r--r--   2 hdfs supergroup        915 2017-01-20 13:22 /user/hdfs/flume/17-01-20/13/FlumeData.1484940151445
-rw-r--r--   2 hdfs