<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Move files from a spooling directory to HDFS with flume in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/42181#M32513</link>
    <description>&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;DIV class="vote"&gt;
&lt;DIV class="favoritecount"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;DIV&gt;
&lt;DIV class="post-text"&gt;
&lt;P&gt;Hi I am using flume to copy the files from spooling directory to HDFS using file as the channel.&lt;/P&gt;
&lt;PRE&gt;#Component names
a1.sources = src
a1.channels = c1
a1.sinks = k1

#Source details
a1.sources.src.type = spooldir
a1.sources.src.channels = c1
a1.sources.src.spoolDir = /home/cloudera/onetrail
a1.sources.src.fileHeader = false
a1.sources.src.basenameHeader = true
# a1.sources.src.basenameHeaderKey = basename
a1.sources.src.fileSuffix = .COMPLETED
a1.sources.src.threads = 4
a1.sources.src.interceptors = newint
a1.sources.src.interceptors.newint.type = timestamp

#Sink details
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs:///data/contentProviders/cnet/%Y%m%d/
# a1.sinks.k1.hdfs.round = false
# a1.sinks.k1.hdfs.roundValue = 1
# a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
#a1.sinks.k1.hdfs.file.Type = DataStream
a1.sinks.k1.hdfs.filePrefix = %{basename}
# a1.sinks.k1.hdfs.fileSuffix = .xml
a1.sinks.k1.threadsPoolSize = 4

# use a single file at a time
a1.sinks.k1.hdfs.maxOpenFiles = 1

# rollover file based on maximum size of 10 MB
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.batchSize = 12

# Channel details
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint/
a1.channels.c1.dataDirs = /tmp/flume/data/

# Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channels = c1&lt;/PRE&gt;
&lt;P&gt;With the above configuration it is able to copy the files to HDFS, but the problem I am facing is that one file keeps staying as .tmp and the complete file content is not copied.&lt;/P&gt;
&lt;P&gt;Can someone help me understand what could be the problem?&lt;/P&gt;
&lt;PRE&gt;&amp;nbsp;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;</description>
    <pubDate>Fri, 16 Sep 2022 10:26:29 GMT</pubDate>
    <dc:creator>Raghava9</dc:creator>
    <dc:date>2022-09-16T10:26:29Z</dc:date>
    <item>
      <title>Move files from a spooling directory to HDFS with flume</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/42181#M32513</link>
      <description>&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;DIV class="vote"&gt;
&lt;DIV class="favoritecount"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;DIV&gt;
&lt;DIV class="post-text"&gt;
&lt;P&gt;Hi I am using flume to copy the files from spooling directory to HDFS using file as the channel.&lt;/P&gt;
&lt;PRE&gt;#Component names
a1.sources = src
a1.channels = c1
a1.sinks = k1

#Source details
a1.sources.src.type = spooldir
a1.sources.src.channels = c1
a1.sources.src.spoolDir = /home/cloudera/onetrail
a1.sources.src.fileHeader = false
a1.sources.src.basenameHeader = true
# a1.sources.src.basenameHeaderKey = basename
a1.sources.src.fileSuffix = .COMPLETED
a1.sources.src.threads = 4
a1.sources.src.interceptors = newint
a1.sources.src.interceptors.newint.type = timestamp

#Sink details
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs:///data/contentProviders/cnet/%Y%m%d/
# a1.sinks.k1.hdfs.round = false
# a1.sinks.k1.hdfs.roundValue = 1
# a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
#a1.sinks.k1.hdfs.file.Type = DataStream
a1.sinks.k1.hdfs.filePrefix = %{basename}
# a1.sinks.k1.hdfs.fileSuffix = .xml
a1.sinks.k1.threadsPoolSize = 4

# use a single file at a time
a1.sinks.k1.hdfs.maxOpenFiles = 1

# rollover file based on maximum size of 10 MB
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.batchSize = 12

# Channel details
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint/
a1.channels.c1.dataDirs = /tmp/flume/data/

# Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channels = c1&lt;/PRE&gt;
&lt;P&gt;With the above configuration it is able to copy the files to HDFS, but the problem I am facing is that one file keeps staying as .tmp and the complete file content is not copied.&lt;/P&gt;
&lt;P&gt;Can someone help me understand what could be the problem?&lt;/P&gt;
&lt;PRE&gt;&amp;nbsp;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:26:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/42181#M32513</guid>
      <dc:creator>Raghava9</dc:creator>
      <dc:date>2022-09-16T10:26:29Z</dc:date>
    </item>
    <item>
      <title>Re: Move files from a spooling directory to HDFS with flume</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/42202#M32514</link>
      <description>You have specified that all roll values are zero:&lt;BR /&gt;a1.sinks.k1.hdfs.rollCount = 0&lt;BR /&gt;a1.sinks.k1.hdfs.rollInterval = 0&lt;BR /&gt;a1.sinks.k1.hdfs.rollSize = 0&lt;BR /&gt;&lt;BR /&gt;Which means the latest file will never roll (since you have hdfs.maxOpenFiles=1). I'd suggest adding the hdfs.idleTimeout if you want to make sure they roll after the file has been ingested and sent to hdfs.&lt;BR /&gt;&lt;BR /&gt;-pd</description>
      <pubDate>Tue, 21 Jun 2016 22:49:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/42202#M32514</guid>
      <dc:creator>pdvorak</dc:creator>
      <dc:date>2016-06-21T22:49:21Z</dc:date>
    </item>
    <item>
      <title>Re: Move files from a spooling directory to HDFS with flume</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/68047#M32516</link>
      <description>&lt;P&gt;Please explain how to do the data transfer from the local file system to HDFS using the Taildir&amp;nbsp;Flume source. My use case is to deal with real-time data, so the data in the source directory keeps updating.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jun 2018 11:23:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/68047#M32516</guid>
      <dc:creator>Swechchha</dc:creator>
      <dc:date>2018-06-11T11:23:59Z</dc:date>
    </item>
  </channel>
</rss>

