Only 10 records populating from local to HDFS while running Flume, but I have 500 records in my file


Here is my config file:

 

----- Local config

agent.sources = localsource
agent.channels = memoryChannel
agent.sinks = avro_Sink

 

agent.sources.localsource.type = exec
agent.sources.localsource.shell = /bin/bash -c
agent.sources.localsource.command = tail -F /home/dwh/teja/Flumedata/testfile.csv

# The channel can be defined as follows.
agent.sources.localsource.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.avro_Sink.type = avro
agent.sinks.avro_Sink.hostname = 192.168.44.4
agent.sinks.avro_Sink.port = 8021
agent.sinks.avro_Sink.avro.batchSize = 10000
agent.sinks.avro_Sink.avro.rollCount = 5000
agent.sinks.avro_Sink.avro.rollSize = 500
agent.sinks.avro_Sink.avro.rollInterval = 30
agent.sinks.avro_Sink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 10000

 

----- Remote config

# Please paste flume.conf here. Example:

# Sources, channels, and sinks are defined per
# agent name, in this case 'tier1'.
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

# For each source, channel, and sink, set
tier1.sources.source1.type = avro
tier1.sources.source1.bind = 192.168.44.4
tier1.sources.source1.port = 8021
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = memory
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.hdfs.path = hdfs://192.168.44.4:8020/user/hadoop/flumelogs/
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.hdfs.writeFormat = Text
tier1.sinks.sink1.hdfs.batchSize = 10000
tier1.sinks.sink1.hdfs.rollCount = 5000
tier1.sinks.sink1.hdfs.rollSize = 500
tier1.sinks.sink1.hdfs.rollInterval = 30


# specify the capacity of the memory channel.
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 10000

Please help, I want to populate the full file from local to HDFS.

 


6 REPLIES

You are using a 'tail -F' command on your (I assume) static CSV file, which by default reads only the last 10 lines and would only emit more if new data were written to that CSV file. If this file is in fact no longer being modified and you want to ingest the whole file, then I would recommend using the spooldir source instead: http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#spooling-directory-source

-PD

Hi, thanks a lot for your reply. All my logs are in CSV format, so if I want to transfer a full log to HDFS I should use the spooldir source instead of the exec source, is that what you are saying? If so, can you explain clearly in which scenarios the exec source should be used? Thanks in advance.

The exec source is generally not recommended for production environments, as it does not handle things well if the process it spawns gets killed unexpectedly. With regard to the log files you are transferring, are you trying to stream them, or just transport them into HDFS? You may want to consider simply using an hdfs put command with a cron job, or mounting the HDFS filesystem via NFS, especially if you want to preserve the files in HDFS as-is. Flume is designed for streaming data, not as a file transport mechanism.

If you do want to stream them, the spooldir source is the one to use if the files are not being appended to. If they are being appended to while Flume is reading them, then you would want to use the new taildir source (as of CDH 5.5) [1], as it provides more reliable handling of streaming log files. The spooldir source requires that files are not modified once they are in the spool directory, and that they are removed or renamed with a .COMPLETED suffix when ingestion is finished.

-PD

[1] http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#taildir-source
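
For reference, a minimal taildir configuration for the local agent (applicable only if the CSV files are still being appended to) might look like the sketch below. The source and channel names are reused from the config posted above; the position-file path and the file-name pattern are assumed examples, not values from this thread.

agent.sources.localsource.type = TAILDIR
# The position file records how far each file has been read, so a restart
# resumes instead of re-reading from the start (path is an assumed example)
agent.sources.localsource.positionFile = /home/dwh/teja/taildir_position.json
# One file group covering the CSV files in the directory (pattern is an assumed example)
agent.sources.localsource.filegroups = f1
agent.sources.localsource.filegroups.f1 = /home/dwh/teja/Flumedata/.*csv
agent.sources.localsource.channels = memoryChannel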

Thanks for your quick reply. To answer your question: I want to transfer my log files every morning from server X to HDFS. As you said, I could use the put command for that, but I have several different log files on server X, so I thought put would not be a good way to transfer them; that's why I chose Flume. Another reason is that I want to filter data from each log file, and with the put command we can't do that, right?
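
On the filtering point, Flume can drop or keep events as they are read by attaching an interceptor to the source, something a plain put cannot do. A sketch using the regex_filter interceptor is shown below; the source name is reused from the local config earlier in the thread, and the regex is only an assumed example.

agent.sources.localsource.interceptors = filt1
agent.sources.localsource.interceptors.filt1.type = regex_filter
# With excludeEvents = false, only events whose body matches the regex are kept
agent.sources.localsource.interceptors.filt1.regex = ^.*ERROR.*$
agent.sources.localsource.interceptors.filt1.excludeEvents = false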


Hi, if I use spooldir, is this config fine?

# For each one of the sources, the type is defined
agent.sources.localsource.type = spooldir
#agent.sources.localsource.shell = /bin/bash -c
agent.sources.localsource.command = /home/dwh/teja/Flumedata/
agent.sources.localsource.fileHeader = true

 

Or do I need to add the file name as well in the path?
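
For comparison, the spooldir source takes a spoolDir property (it has no command property), so a corrected sketch of the snippet above might look like this, with the directory taken from the earlier tail command:

agent.sources.localsource.type = spooldir
# Directory whose completed files will be ingested; files must not be
# modified after they land here
agent.sources.localsource.spoolDir = /home/dwh/teja/Flumedata
# Adds a header with the absolute path of the source file to each event
agent.sources.localsource.fileHeader = true
agent.sources.localsource.channels = memoryChannel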

 

Hi, as you said, I'm now using the spooldir source and it's working fine. But one problem is that Flume is generating many files with only a few records each, whereas I want just one or two files. As I said before, I have a 500-record log file that I want to land as a single file; this is just a test case, but in the real scenario I have lakhs of records in one log file. Please help.
My config file is the same as the one I shared above, just with the spooldir source.
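
Regarding the many small files: the number and size of output files is governed by the HDFS sink roll settings on the remote (tier1) agent, and the rollSize = 500 (bytes) in the config above forces a roll almost immediately. A sketch that rolls on size only, at roughly 128 MB per file, is shown below; the exact values are assumptions to be tuned, not recommendations from this thread.

# A value of 0 disables count-based and time-based rolling
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.rollInterval = 0
# Roll when a file reaches about 128 MB (value is in bytes)
tier1.sinks.sink1.hdfs.rollSize = 134217728
# Close a file after 60 idle seconds so small test runs still get flushed
tier1.sinks.sink1.hdfs.idleTimeout = 60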