Spark doesn't read the file properly

I run Flume to ingest Twitter data into HDFS in JSON format, then run Spark to read those files. But somehow Spark doesn't return the correct result... it seems the content of the files is not updated.

Here's my Flume config:

TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS

TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = some_keywords

TwitterAgent01.sinks.HDFS.channel = MemoryChannel01
TwitterAgent01.sinks.HDFS.type = hdfs
TwitterAgent01.sinks.HDFS.hdfs.path = hdfs://hadoop01:8020/warehouse/raw/twitter/provider/m=%Y%m/
TwitterAgent01.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent01.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent01.sinks.HDFS.hdfs.batchSize = 1000
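# size- and count-based rolling are disabled below, so each file stays open
# (with the default ".tmp" in-use suffix) for the full 24-hour rollInterval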
TwitterAgent01.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent01.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent01.sinks.HDFS.hdfs.rollInterval = 86400

TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
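
For reference, I start the agent along these lines (the conf dir and config file name here are placeholders):

flume-ng agent --conf /etc/flume-ng/conf --conf-file twitter-agent.conf --name TwitterAgent01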

After that I check the output with a regular 'hdfs dfs -cat' command; it returns more than 1,000 rows, which means the data was ingested successfully.
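
Concretely, the check looks roughly like this (the partition directory follows the m=%Y%m pattern from the sink config):

hdfs dfs -cat /warehouse/raw/twitter/provider/m=201802/* | wc -l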

But in Spark, that's not the case:

spark.read.json("/warehouse/raw/twitter/provider").filter("m=201802").show()

This only returns 6 rows; it's not updated...
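
In case it's useful, here is a minimal spark-shell (Scala) sketch of what I run; df.inputFiles is the only call beyond the one-liner above, and the paths come straight from the sink config:

val df = spark.read.json("/warehouse/raw/twitter/provider")

// m is the partition column Spark discovers from the m=YYYYMM directories
df.filter("m = 201802").count()   // returns 6, not the 1,000+ rows hdfs dfs -cat shows

// list the files Spark actually resolved for this read; a file Flume still has
// open (default in-use suffix ".tmp") may be missed or read at a stale length
df.inputFiles.foreach(println)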

Did I miss something here?