
Data ingestion using kafka from crawlers



Rising Star



I am trying to use Kafka for data ingestion, but being new to it, I am quite confused.


I have multiple crawlers that extract data for me from web platforms. I want to ingest that extracted data into Hadoop using Kafka, without any intermediate scripts or service files. The main complication is that the platforms are disparate in nature: one web platform provides real-time data, while another is batch-based. Can I somehow integrate my crawlers with Kafka producers so that they keep running on their own? Is that possible? I think it is, but I am not heading in the right direction. Any help would be appreciated.





Re: Data ingestion using kafka from crawlers

Cloudera Employee



Loading data directly into Kafka without any service in between seems unlikely.


However, you can run a simple Kafka console producer to send all your data to the Kafka service. If your requirement is to save the data to HDFS, though, you need to include a few more services along with Kafka.


For example: Crawlers >> Kafka console producer (or) Spark Streaming >> Flume >> HDFS
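If the crawlers run on the JVM, the first hop of that pipeline could also be a producer embedded in the crawler itself, which avoids a separate script. The following is only a rough sketch, assuming the kafka-clients library is on the classpath. The `CrawledItem` type, the one-topic-per-platform naming (`crawl.<platform>`), and the `localhost:9092` broker address are all hypothetical placeholders, not anything from your setup.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical shape of one item emitted by a crawler.
final case class CrawledItem(platform: String, url: String, body: String)

object CrawlerKafkaSink {
  // Assumption: one topic per source platform, e.g. "crawl.news".
  def topicFor(platform: String): String = s"crawl.${platform.toLowerCase}"

  // Key each record by URL so items from the same page go to the same partition.
  def toRecord(item: CrawledItem): ProducerRecord[String, String] =
    new ProducerRecord[String, String](topicFor(item.platform), item.url, item.body)

  // Plain string keys and values; the broker address is supplied by the caller.
  def newProducer(bootstrapServers: String): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  def main(args: Array[String]): Unit = {
    val producer = newProducer("localhost:9092") // placeholder broker address
    try {
      // Stand-in for whatever the crawler actually extracts.
      val items = Seq(CrawledItem("News", "http://example.com/a", "page text"))
      items.foreach(item => producer.send(toRecord(item)))
      producer.flush() // make sure buffered records are sent before exiting
    } finally producer.close()
  }
}
```

The same code path would serve both the real-time and the batch crawler: the real-time one calls `send` per item as it arrives, and the batch one loops over its extracted items. A downstream service is still needed to get the data from Kafka into HDFS.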


Since your requirement is to store the data in HDFS rather than to stream it, I suggest you run a Spark job, which will store your data in HDFS. Use the commands below to run a Spark job that moves the data to HDFS.


Start a spark-shell.


Run the following commands in the Spark shell, in the order given.


val moveFile = sc.textFile("file:///path/to/Sample.log")