Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

Data ingestion from crawlers using Kafka

Expert Contributor

Hello,

 

I am trying to use Kafka for data ingestion, but being new to this, I am fairly confused.

 

I have multiple crawlers that extract data for me from web platforms. I want to ingest that extracted data into Hadoop using Kafka, without any middle scripts/service files. The main complication is that the platforms are disparate in nature: one web platform provides real-time data, while another is batch-based. Can I somehow integrate my crawlers with Kafka producers so that they keep running all by themselves? Is that possible? I think it is, but I am not heading in the right direction. Any help would be appreciated.

Thanks

1 ACCEPTED SOLUTION

Expert Contributor

Hello,

 

Loading data into HDFS through Kafka alone, without any additional service, is unlikely to work.

 

However, you can run a simple Kafka console producer to send all your data to the Kafka service. But if your requirement is to save the data to HDFS, you need to include a few more services along with Kafka.

 

For example: Crawlers >> Kafka console producer (or Spark Streaming) >> Flume >> HDFS

 

Since your requirement is to store the data in HDFS rather than to stream it, I suggest you run a Spark job to store your data in HDFS. Use the commands below to run a Spark job that moves data to HDFS.

 

First, start a spark-shell.

 

Then execute the following commands in the Spark shell, in order:

 

// Read the crawler output from the local filesystem
val moveFile = sc.textFile("file:///path/to/Sample.log")

// Write it back out to HDFS
moveFile.saveAsTextFile("hdfs:///tmp/Sample.log")

 

 
