
Flume - File Channel getting full

New Contributor

Hi,

 

We are using the Flume file channel and it keeps running out of space within a few days. How do we manage the size of the data directory for a file channel? Is there some property we can use to purge old data?

 

I don't want to perform a manual clean-up of the file channel directories and would prefer something that will be part of the Flume Agent configuration itself.

 

Please advise.

 

Thanks,

Mari 

7 REPLIES

New Contributor
It would be great if someone from Cloudera could provide us with a clear solution for a retention policy...

The Flume file channel will remove old data log files once all of the events in them have been delivered by the sink and are no longer needed (it always keeps at least two log files, but all others are eligible for deletion).

What is the size of your channel? Is it increasing? Are your sinks able to keep up?

You can't purge old data (if Flume hasn't already deleted a log file, it's because it thinks it still needs it), but you can control how much free space must remain on the disk with the minimumRequiredSpace channel property. You can also use the capacity property to restrict how many events can be in the channel.
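For reference, here is roughly what those settings could look like on a file channel; the agent/channel names, paths, and values below are placeholders, not recommendations:

agent1.channels = fileCh
agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /data/flume/checkpoint
agent1.channels.fileCh.dataDirs = /data/flume/data
# Refuse new events when less than ~1 GB (value is in bytes) is free on the data disk
agent1.channels.fileCh.minimumRequiredSpace = 1073741824
# Upper bound on the number of events the channel may hold
agent1.channels.fileCh.capacity = 1000000
agent1.channels.fileCh.transactionCapacity = 10000

Either limit makes the channel refuse new puts (back-pressuring the source) rather than letting the disk fill up completely.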

If your files in the dataDirs are not getting deleted, it's because the sinks haven't delivered events from those files.

-pd

How far back do you have files in the dataDirs?

There have been some instances where Flume holds on to an event that has already been delivered because the checkpoint still references it for some reason. If that is the case, you can stop Flume, delete the files in the checkpoint directory, and force a replay of the events in the channel (note this may take a long time, depending on the size of all the logs in the dataDirs). You can set use-fast-replay=true to make the replay faster, but you'll need to increase the heap size as well if you choose fast replay.
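For what it's worth, fast replay is just a property on the file channel (the agent/channel names here are placeholders):

agent1.channels.fileCh.use-fast-replay = true

The extra heap would go in flume-env.sh via JAVA_OPTS (e.g. -Xmx4g, depending on how much data is sitting in the dataDirs) or through the agent's Java heap setting in Cloudera Manager.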

-pd

New Contributor

Hi,

 

We only need the files from the last month. The rest of the data files should be deleted on a daily basis.

 

What is the proper solution?

 

Flume shouldn't hold on to old log files once all of their events have been delivered to the sinks. If it is doing so, there may be some inconsistency in the checkpoints that is causing it. You could resolve this by regenerating the checkpoints as I noted previously.

The suggestion would be to shut down flume, increase the heap size to a large amount, and then add the use-fast-replay=true property to the channel. Delete the checkpoints and then start up flume. The checkpoints will be recreated and properly record which events were delivered to the sinks, and then any old log files that are no longer needed should be removed.

As a safety measure, you may want to back up the files (data and checkpoints), but regenerating the checkpoints shouldn't negatively affect the Flume channel; it will just take some time to replay.
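A rough sketch of the whole procedure (the paths are placeholders for your checkpointDir/dataDirs, and how you stop/start the agent depends on your setup):

# 1. Stop the Flume agent (via Cloudera Manager or your service tooling)
# 2. Back up the channel directories
cp -a /data/flume/checkpoint /data/flume/checkpoint.bak
cp -a /data/flume/data /data/flume/data.bak
# 3. Remove the checkpoint files so they are rebuilt from the data logs on startup
rm -rf /data/flume/checkpoint/*
# 4. Add use-fast-replay=true to the channel config and raise the agent heap
# 5. Start the agent; the replay recreates the checkpoints, after which log files
#    whose events were all delivered should be cleaned up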

-pd

New Contributor

I'm not sure you got me right...

Let me describe it one more time:

 

1. We configured the Flume agents (Cloudera -> Flume01 -> instance -> configuration):

source.type=avro

source.bind=<our_hostname>

source.port=<our_port>

source.interceptors=timestamp_interceptor

source.interceptors.timestamp_interceptor.type=timestamp

 

channels.type=memory

channels.capacity=10000

channels.transactionCapacity=10000

 

sinks.type=hdfs

hdfs.path=/client/project/log/%Y-%m-%d

hdfs.fileType=DataStream

hdfs.rollSize=0

hdfs.rollCount=0

hdfs.rollInterval=0

hdfs.batchSize=100

 

So our application logs look like:

>hdfs dfs -ls /client/project/log/

 

/client/project/log/2017-07-06

/client/project/log/2017-07-07

/client/project/log/2017-07-08

/client/project/log/2017-07-09

/client/project/log/2017-07-10

/client/project/log/2017-07-11

/client/project/log/2017-07-12

/client/project/log/2017-07-13

/client/project/log/2017-07-14

/client/project/log/2017-07-15

/client/project/log/2017-07-16

...

/client/project/log/2017-09-16

/client/project/log/2017-09-17

/client/project/log/2017-09-18

/client/project/log/2017-09-19

/client/project/log/2017-09-20

 

And we want to keep only the folders from the last month.

So the question is: how do we configure clean-up for the sink?

 

Thanks for the clarification; the original comments said your Flume file channel was running out of space.

With regard to the HDFS sink: once Flume delivers events to it, Flume no longer controls those files. Whatever post-processing you are doing that uses those files should be responsible for cleaning up those folders. There isn't functionality within the Flume sink to clean up old folders or expire data that has already been delivered. You could run a simple cron job that removes directories in HDFS older than a month, or run an Oozie job that does the same.
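As an illustration only (the path matches your listing above, but the 30-day cutoff, GNU date, and bash are assumptions on my side), a daily cron job could look like this:

#!/bin/bash
# Delete the date-named HDFS log directories older than 30 days
cutoff=$(date -d "30 days ago" +%Y-%m-%d)   # GNU date
hdfs dfs -ls -d /client/project/log/????-??-?? | awk '{print $NF}' | while read dir; do
  day=$(basename "$dir")
  # ISO dates compare correctly as plain strings
  if [[ "$day" < "$cutoff" ]]; then
    hdfs dfs -rm -r -skipTrash "$dir"
  fi
done

Scheduled once a day from cron (or wrapped in an Oozie shell action), this keeps only the last month of folders.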

HTH

-pd