<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Only 10 records populating from local to HDFS while running Flume, but I have 500 records in my file - in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39202#M24101</link>
    <description>&lt;P&gt;Here is my config file:&lt;/P&gt;&lt;P&gt;-----Local Config&lt;/P&gt;&lt;P&gt;agent.sources = localsource&lt;BR /&gt;agent.channels = memoryChannel&lt;BR /&gt;agent.sinks = avro_Sink&lt;/P&gt;&lt;P&gt;agent.sources.localsource.type = exec&lt;BR /&gt;agent.sources.localsource.shell = /bin/bash -c&lt;BR /&gt;agent.sources.localsource.command = tail -F /home/dwh/teja/Flumedata/testfile.csv&lt;/P&gt;&lt;P&gt;# The channel can be defined as follows.&lt;BR /&gt;agent.sources.localsource.channels = memoryChannel&lt;/P&gt;&lt;P&gt;# Each sink's type must be defined&lt;BR /&gt;agent.sinks.avro_Sink.type = avro&lt;BR /&gt;agent.sinks.avro_Sink.hostname = 192.168.44.4&lt;BR /&gt;agent.sinks.avro_Sink.port = 8021&lt;BR /&gt;agent.sinks.avro_Sink.avro.batchSize = 10000&lt;BR /&gt;agent.sinks.avro_Sink.avro.rollCount = 5000&lt;BR /&gt;agent.sinks.avro_Sink.avro.rollSize = 500&lt;BR /&gt;agent.sinks.avro_Sink.avro.rollInterval = 30&lt;BR /&gt;agent.sinks.avro_Sink.channel = memoryChannel&lt;/P&gt;&lt;P&gt;# Each channel's type is defined.&lt;BR /&gt;agent.channels.memoryChannel.type = memory&lt;BR /&gt;agent.channels.memoryChannel.capacity = 10000&lt;BR /&gt;agent.channels.memoryChannel.transactionCapacity = 10000&lt;/P&gt;&lt;P&gt;------Remote config&lt;/P&gt;&lt;P&gt;# Sources, channels, and sinks are defined per&lt;BR /&gt;# agent name, in this case 'tier1'.&lt;BR /&gt;tier1.sources = source1&lt;BR /&gt;tier1.channels = channel1&lt;BR /&gt;tier1.sinks = sink1&lt;/P&gt;&lt;P&gt;# For each source, channel, and sink, set&lt;BR /&gt;tier1.sources.source1.type = avro&lt;BR /&gt;tier1.sources.source1.bind = 192.168.44.4&lt;BR /&gt;tier1.sources.source1.port = 8021&lt;BR /&gt;tier1.sources.source1.channels = channel1&lt;BR /&gt;tier1.channels.channel1.type = memory&lt;BR /&gt;tier1.sinks.sink1.type = hdfs&lt;BR /&gt;tier1.sinks.sink1.channel = channel1&lt;BR /&gt;tier1.sinks.sink1.hdfs.path = hdfs://192.168.44.4:8020/user/hadoop/flumelogs/&lt;BR /&gt;tier1.sinks.sink1.hdfs.fileType = DataStream&lt;BR /&gt;tier1.sinks.sink1.hdfs.writeFormat = Text&lt;BR /&gt;tier1.sinks.sink1.hdfs.batchSize = 10000&lt;BR /&gt;tier1.sinks.sink1.hdfs.rollCount = 5000&lt;BR /&gt;tier1.sinks.sink1.hdfs.rollSize = 500&lt;BR /&gt;tier1.sinks.sink1.hdfs.rollInterval = 30&lt;/P&gt;&lt;P&gt;# Specify the capacity of the memory channel.&lt;BR /&gt;tier1.channels.channel1.capacity = 10000&lt;BR /&gt;tier1.channels.channel1.transactionCapacity = 10000&lt;/P&gt;&lt;P&gt;Please help, I want to populate the full file from local to HDFS.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 10:11:53 GMT</pubDate>
    <dc:creator>Tejaponnaluru</dc:creator>
    <dc:date>2022-09-16T10:11:53Z</dc:date>
    <item>
      <title>Only 10 records populating from local to HDFS while running Flume, but I have 500 records in my file</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39202#M24101</link>
      <description>&lt;P&gt;Here is my config file:&lt;/P&gt;&lt;P&gt;-----Local Config&lt;/P&gt;&lt;P&gt;agent.sources = localsource&lt;BR /&gt;agent.channels = memoryChannel&lt;BR /&gt;agent.sinks = avro_Sink&lt;/P&gt;&lt;P&gt;agent.sources.localsource.type = exec&lt;BR /&gt;agent.sources.localsource.shell = /bin/bash -c&lt;BR /&gt;agent.sources.localsource.command = tail -F /home/dwh/teja/Flumedata/testfile.csv&lt;/P&gt;&lt;P&gt;# The channel can be defined as follows.&lt;BR /&gt;agent.sources.localsource.channels = memoryChannel&lt;/P&gt;&lt;P&gt;# Each sink's type must be defined&lt;BR /&gt;agent.sinks.avro_Sink.type = avro&lt;BR /&gt;agent.sinks.avro_Sink.hostname = 192.168.44.4&lt;BR /&gt;agent.sinks.avro_Sink.port = 8021&lt;BR /&gt;agent.sinks.avro_Sink.avro.batchSize = 10000&lt;BR /&gt;agent.sinks.avro_Sink.avro.rollCount = 5000&lt;BR /&gt;agent.sinks.avro_Sink.avro.rollSize = 500&lt;BR /&gt;agent.sinks.avro_Sink.avro.rollInterval = 30&lt;BR /&gt;agent.sinks.avro_Sink.channel = memoryChannel&lt;/P&gt;&lt;P&gt;# Each channel's type is defined.&lt;BR /&gt;agent.channels.memoryChannel.type = memory&lt;BR /&gt;agent.channels.memoryChannel.capacity = 10000&lt;BR /&gt;agent.channels.memoryChannel.transactionCapacity = 10000&lt;/P&gt;&lt;P&gt;------Remote config&lt;/P&gt;&lt;P&gt;# Sources, channels, and sinks are defined per&lt;BR /&gt;# agent name, in this case 'tier1'.&lt;BR /&gt;tier1.sources = source1&lt;BR /&gt;tier1.channels = channel1&lt;BR /&gt;tier1.sinks = sink1&lt;/P&gt;&lt;P&gt;# For each source, channel, and sink, set&lt;BR /&gt;tier1.sources.source1.type = avro&lt;BR /&gt;tier1.sources.source1.bind = 192.168.44.4&lt;BR /&gt;tier1.sources.source1.port = 8021&lt;BR /&gt;tier1.sources.source1.channels = channel1&lt;BR /&gt;tier1.channels.channel1.type = memory&lt;BR /&gt;tier1.sinks.sink1.type = hdfs&lt;BR /&gt;tier1.sinks.sink1.channel = channel1&lt;BR /&gt;tier1.sinks.sink1.hdfs.path = hdfs://192.168.44.4:8020/user/hadoop/flumelogs/&lt;BR /&gt;tier1.sinks.sink1.hdfs.fileType = DataStream&lt;BR /&gt;tier1.sinks.sink1.hdfs.writeFormat = Text&lt;BR /&gt;tier1.sinks.sink1.hdfs.batchSize = 10000&lt;BR /&gt;tier1.sinks.sink1.hdfs.rollCount = 5000&lt;BR /&gt;tier1.sinks.sink1.hdfs.rollSize = 500&lt;BR /&gt;tier1.sinks.sink1.hdfs.rollInterval = 30&lt;/P&gt;&lt;P&gt;# Specify the capacity of the memory channel.&lt;BR /&gt;tier1.channels.channel1.capacity = 10000&lt;BR /&gt;tier1.channels.channel1.transactionCapacity = 10000&lt;/P&gt;&lt;P&gt;Please help, I want to populate the full file from local to HDFS.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:11:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39202#M24101</guid>
      <dc:creator>Tejaponnaluru</dc:creator>
      <dc:date>2022-09-16T10:11:53Z</dc:date>
    </item>
    <item>
      <title>Re: Only 10 records populating from local to HDFS while running Flume but I have 500 records in my</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39225#M24102</link>
      <description>You are using a 'tail -F' command on your (I assume) static CSV file, which tails the last 10 lines by default and will only emit further data if more is written to that CSV file. If the file is in fact no longer being modified and you want to ingest the whole file, then I would recommend using the spooldir source instead: &lt;A href="http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#spooling-directory-source" target="_blank"&gt;http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#spooling-directory-source&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;-PD</description>
      <pubDate>Thu, 31 Mar 2016 19:46:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39225#M24102</guid>
      <dc:creator>pdvorak</dc:creator>
      <dc:date>2016-03-31T19:46:43Z</dc:date>
    </item>
    <item>
      <title>Re: Only 10 records populating from local to HDFS while running Flume but I have 500 records in my</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39247#M24103</link>
      <description>Hi, thanks a lot for your reply. All my logs are in CSV format, so if I want to transfer a full log to HDFS I should use spooldir instead of the exec source - is that what you are saying? If so, can you explain clearly in which scenarios the exec source should be used? Thanks in advance.</description>
      <pubDate>Fri, 01 Apr 2016 04:36:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39247#M24103</guid>
      <dc:creator>Tejaponnaluru</dc:creator>
      <dc:date>2016-04-01T04:36:35Z</dc:date>
    </item>
    <item>
      <title>Re: Only 10 records populating from local to HDFS while running Flume but I have 500 records in my</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39248#M24104</link>
      <description>The exec source is generally not recommended for production environments, as it does not recover well if the spawned process is killed unexpectedly. Regarding the log files you are transferring: are you trying to stream them, or just transport them into HDFS? You may want to consider a simple hdfs put command driven by a cron job, or mounting the HDFS filesystem via NFS, especially if you want to preserve the files in HDFS as-is. Flume is designed for streaming data, not as a file transport mechanism.&lt;BR /&gt;&lt;BR /&gt;If you do want to stream them, the spooldir source is the right choice when the files are no longer being appended to. If they are being appended to while Flume is reading them, then you would want the new taildir source (as of CDH 5.5) [1], which handles streaming log files more reliably. The spooldir source requires that files are not modified once they are in the spool directory; they are removed or renamed with a .COMPLETED suffix when ingestion is finished.&lt;BR /&gt;&lt;BR /&gt;-PD&lt;BR /&gt;&lt;BR /&gt;[1] &lt;A href="http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#taildir-source" target="_blank"&gt;http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#taildir-source&lt;/A&gt;</description>
      <pubDate>Fri, 01 Apr 2016 04:58:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39248#M24104</guid>
      <dc:creator>pdvorak</dc:creator>
      <dc:date>2016-04-01T04:58:56Z</dc:date>
    </item>
    <item>
      <title>Re: Only 10 records populating from local to HDFS while running Flume but I have 500 records in my</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39250#M24105</link>
      <description>Thanks for your quick reply. To answer your question: I want to transfer my log files every morning from server X to HDFS. As you said, I could use the put command for that, but I have several different log files on server X, so I thought put would not be a good way to transfer them; that is why I chose Flume. Another reason is that I want to filter data from each log file, and with the put command we can't do that, right?</description>
      <pubDate>Fri, 01 Apr 2016 05:33:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39250#M24105</guid>
      <dc:creator>Tejaponnaluru</dc:creator>
      <dc:date>2016-04-01T05:33:23Z</dc:date>
    </item>
    <item>
      <title>Re: Only 10 records populating from local to HDFS while running Flume but I have 500 records in my</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39262#M24106</link>
      <description>&lt;P&gt;Hi, if I switch to spooldir, is this config fine?&lt;BR /&gt;&lt;BR /&gt;# For each one of the sources, the type is defined&lt;BR /&gt;agent.sources.localsource.type = spooldir&lt;BR /&gt;agent.sources.localsource.spoolDir = /home/dwh/teja/Flumedata/&lt;BR /&gt;agent.sources.localsource.fileHeader = true&lt;/P&gt;&lt;P&gt;Or do I need to include the file name in the path as well?&lt;/P&gt;</description>
      <pubDate>Fri, 01 Apr 2016 09:48:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39262#M24106</guid>
      <dc:creator>Tejaponnaluru</dc:creator>
      <dc:date>2016-04-01T09:48:43Z</dc:date>
    </item>
    <item>
      <title>Re: Only 10 records populating from local to HDFS while running Flume but I have 500 records in my</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39305#M24107</link>
      <description>Hi, as you suggested I am using the spooldir source and it is working fine. But there is one problem: Flume is generating many files with only a few records each, while I want just one or two files. As I said before, I have a 500-record log file that I want to land as a single file. This is just a test case; in the real scenario I have lakhs of records in one log file. Please help.&lt;BR /&gt;My config file is the same one I shared above, with the spooldir source.</description>
      <pubDate>Mon, 04 Apr 2016 05:33:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Only-10-records-populating-from-local-to-hdfs-while-running/m-p/39305#M24107</guid>
      <dc:creator>Tejaponnaluru</dc:creator>
      <dc:date>2016-04-04T05:33:45Z</dc:date>
    </item>
  </channel>
</rss>

