Support Questions
Find answers, ask questions, and share your expertise

Move files from a spooling directory to HDFS with flume

Solved

Explorer

Hi, I am using Flume to copy files from a spooling directory to HDFS, using a file channel.

#Component names
a1.sources = src
a1.channels = c1
a1.sinks = k1

#Source details
a1.sources.src.type = spooldir
a1.sources.src.channels = c1
a1.sources.src.spoolDir = /home/cloudera/onetrail
a1.sources.src.fileHeader = false
a1.sources.src.basenameHeader = true
# a1.sources.src.basenameHeaderKey = basename
a1.sources.src.fileSuffix = .COMPLETED
a1.sources.src.threads = 4
a1.sources.src.interceptors = newint
a1.sources.src.interceptors.newint.type = timestamp

#Sink details
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs:///data/contentProviders/cnet/%Y%m%d/
# a1.sinks.k1.hdfs.round = false
# a1.sinks.k1.hdfs.roundValue = 1
# a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
#a1.sinks.k1.hdfs.file.Type = DataStream
a1.sinks.k1.hdfs.filePrefix = %{basename}
# a1.sinks.k1.hdfs.fileSuffix = .xml
a1.sinks.k1.hdfs.threadsPoolSize = 4

# use a single file at a time
a1.sinks.k1.hdfs.maxOpenFiles = 1

# rollover file based on maximum size of 10 MB
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.batchSize = 12

# Channel details
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint/
a1.channels.c1.dataDirs = /tmp/flume/data/

# Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channel = c1

With the above configuration it is able to copy the files to HDFS, but the problem I am facing is that one file keeps staying as .tmp and its content is never completely written.

Can someone help me with what the problem could be?

 
1 ACCEPTED SOLUTION

Accepted Solutions

Re: Move files from a spooling directory to HDFS with flume

Super Collaborator
You have specified that all roll values are zero:
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0

This means the latest file will never roll (and since you have hdfs.maxOpenFiles = 1, it stays open as .tmp). I'd suggest adding hdfs.idleTimeout if you want to make sure the file rolls after it has been ingested and sent to HDFS.

-pd
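For reference, a minimal sketch of that change appended to the sink section (the 60-second value is just an illustrative choice; pick something longer than any expected quiet period while a file is still being ingested):

# close (roll) the open HDFS file after 60 seconds with no new events
a1.sinks.k1.hdfs.idleTimeout = 60

With idleTimeout set, the .tmp file is closed and renamed once the source goes idle, even though rollCount, rollInterval, and rollSize are all disabled.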
2 REPLIES 2


Re: Move files from a spooling directory to HDFS with flume

New Contributor

Please explain how to transfer data from the local file system to HDFS using the Taildir Flume source. My use case deals with real-time data, so the files in the source directory keep getting updated.
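A minimal sketch of a Taildir source, assuming Flume 1.7 or later (where TAILDIR was introduced); the paths and group name are illustrative:

a1.sources = r1
a1.sources.r1.type = TAILDIR
# JSON file where Flume records the last read position of each tailed file
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# one or more file groups, each a regex matching the files to tail
a1.sources.r1.filegroups = fg1
a1.sources.r1.filegroups.fg1 = /home/cloudera/logs/.*\.log
a1.sources.r1.channels = c1

Unlike spooldir, Taildir tails files as they grow and resumes from the recorded position after a restart, which fits a directory whose files keep being updated.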
