
Data ingestion using kafka from crawlers

Expert Contributor

Hello,

 

I am trying to work with Kafka for data ingestion, but being new to this, I am pretty confused.

 

I have multiple crawlers that extract data for me from web platforms. I want to ingest that extracted data into Hadoop using Kafka, without any middle scripts/service files. The main complication is that the platforms are disparate in nature: one web platform provides real-time data, while another is batch based. Can I somehow integrate my crawlers with Kafka producers so that they keep running all by themselves? Is that possible? I think it is, but I am not heading in the right direction. Any help would be appreciated.

 

 

Thanks

1 ACCEPTED SOLUTION

Expert Contributor

Hello,

 

Loading data directly into Kafka without any intermediate service is unlikely to be possible.

 

However, you can run a simple Kafka console producer to send all your data to the Kafka service. But if your requirement is to save the data to HDFS, you need to include a few more services along with Kafka.

 

For example: Crawlers >> Kafka console producer (or) Spark Streaming >> Flume >> HDFS
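
 

If the crawlers themselves can be modified, each one can also embed a Kafka producer client directly rather than piping output through the console producer. Below is a minimal sketch (in Scala, to match the Spark example further down); the broker address localhost:9092, the topic name crawler-data, and the publish helper are all hypothetical and would need to be adapted to your cluster and crawler code.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical broker and topic; adjust for your environment
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// The crawler calls this for every record it extracts, whether the
// source platform delivers data in real time or in batches
def publish(record: String): Unit =
  producer.send(new ProducerRecord[String, String]("crawler-data", record))

// On crawler shutdown, flush pending records and release the connection
producer.close()

With something like this in place the crawlers can keep running on their own and push records into Kafka as they extract them; the downstream services (Flume or Spark) then take care of landing the data in HDFS.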

 

As your requirement is to store the data in HDFS rather than stream it, I suggest you run a Spark job; it will store your data in HDFS. Refer to the commands below to run a Spark job that moves data to HDFS.

 

Start a spark-shell.

 

Then execute the following commands in the Spark shell, in order:

 

// Read the crawler output from the local filesystem into an RDD
val moveFile = sc.textFile("file:///path/to/Sample.log")

// Write it back out to HDFS (produces a directory of part files)
moveFile.saveAsTextFile("hdfs:///tmp/Sample.log")
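
 

If the real-time platform does eventually call for the streaming path mentioned above, a rough Spark Streaming sketch that reads from Kafka and writes to HDFS could look like the following. This assumes the spark-streaming-kafka integration package is available on the spark-shell classpath, and it reuses the hypothetical broker and topic names from the producer example; in practice you would also add checkpointing and offset management.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Hypothetical connection settings; adjust for your environment
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "crawler-ingest")

// Pull from the crawler-data topic in 30-second micro-batches
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("crawler-data"), kafkaParams))

// Write each batch of record values out under an HDFS path prefix
stream.map(_.value).saveAsTextFiles("hdfs:///tmp/crawler-data")

ssc.start()
ssc.awaitTermination()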

 

 
