
Multiple tweets with the same ID in Twitter streaming


I collect tweets with the help of this pipeline. I tried to use some of my own scripts to analyse the collected tweets.

I found that I get multiple tweets with the same ID.

I looked in hdfs://user/flume/tweets and saw that these duplicate tweets are already present in the stored files.
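
For reference, here is a minimal sketch of how such duplicates can be counted. It assumes the stored files hold one JSON tweet per line with a top-level "id" field. The function name count_duplicate_ids and the sample records are hypothetical and only for illustration.

```python
import json
from collections import Counter


def count_duplicate_ids(lines):
    """Return {tweet_id: occurrences} for every id seen more than once."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict):
            continue
        tweet_id = tweet.get("id")
        if tweet_id is not None:  # records without a top-level id are ignored
            counts[tweet_id] += 1
    return {tid: n for tid, n in counts.items() if n > 1}


# Hypothetical sample standing in for lines read from a file copied out of HDFS.
sample = [
    '{"id": 1, "text": "a"}',
    '{"id": 2, "text": "b"}',
    '{"id": 1, "text": "a"}',
]
duplicates = count_duplicate_ids(sample)
print(duplicates)
```

Comparing the full text of two records with the same ID should also show whether they are byte-for-byte copies or different deliveries of the same tweet.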

So it isn't a Hive or Oozie problem.
Could it be a Flume problem? I made some edits to the Flume configuration parameters:


# GitHub example: batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
# GitHub example: rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100000

TwitterAgent.channels.MemChannel.type = memory
# GitHub example: capacity = 10000
TwitterAgent.channels.MemChannel.capacity = 100000
# GitHub example: transactionCapacity = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 10000
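
As a sanity check on these numbers, here is a small sketch. It assumes the usual Flume sizing rules as I understand them: the channel's transactionCapacity should not exceed its capacity, and the sink's batchSize should not exceed the channel's transactionCapacity. The function name check_flume_sizes is hypothetical. It only compares the numbers and does not read a real Flume config file.

```python
def check_flume_sizes(batch_size, transaction_capacity, capacity):
    """Return a list of sizing problems; an empty list means none were found."""
    problems = []
    if transaction_capacity > capacity:
        problems.append("transactionCapacity exceeds channel capacity")
    if batch_size > transaction_capacity:
        problems.append("sink batchSize exceeds channel transactionCapacity")
    return problems


# My edited values: batchSize, transactionCapacity, capacity.
print(check_flume_sizes(10000, 10000, 100000))

# Values from the GitHub example, for comparison.
print(check_flume_sizes(1000, 100, 10000))
```

By these rules my edited values look consistent with each other, so a plain size mismatch would not explain the duplicates.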

Or does Twitter itself deliver these duplicate tweets, so that it isn't a Hadoop problem at all?