<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Data ingestion using kafka from crawlers in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-ingestion-using-kafka-from-crawlers/m-p/85813#M82241</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Loading data directly into Kafka without any intermediate service is unlikely to work.&lt;/P&gt;&lt;P&gt;However, you can run a simple Kafka console producer to send all your data to the Kafka service. If your requirement is to save the data to HDFS, you need a few more services alongside Kafka.&lt;/P&gt;&lt;P&gt;For example: &lt;SPAN&gt;Crawlers &amp;gt;&amp;gt; Kafka console producer (or Spark Streaming) &amp;gt;&amp;gt; Flume &amp;gt;&amp;gt; HDFS&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Since your requirement is to store the data in HDFS rather than to stream it, I suggest running a Spark job to write the data to HDFS. Use the commands below.&lt;/P&gt;&lt;P&gt;Start a spark-shell, then execute the following commands in order:&lt;/P&gt;&lt;P&gt;val moveFile = sc.textFile("file:///path/to/Sample.log")&lt;/P&gt;&lt;P&gt;moveFile.saveAsTextFile("hdfs:///tmp/Sample.log")&lt;/P&gt;&lt;P&gt;Note that saveAsTextFile writes a directory of part files at the given path.&lt;/P&gt;</description>
    <pubDate>Sun, 03 Feb 2019 13:09:06 GMT</pubDate>
    <dc:creator>TonyStank</dc:creator>
    <dc:date>2019-02-03T13:09:06Z</dc:date>
    <item>
      <title>Data ingestion using kafka from crawlers</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-ingestion-using-kafka-from-crawlers/m-p/78607#M82240</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am trying to use Kafka for data ingestion, but being new to this, I am pretty confused.&lt;/P&gt;&lt;P&gt;I have multiple crawlers that extract data for me from web platforms. I want to ingest that extracted data into Hadoop using Kafka, without any intermediate scripts or service files. The main complication is that the platforms are disparate in nature: one web platform provides real-time data, another is batch-based. Can I somehow integrate my crawlers with Kafka producers, so that they keep running by themselves? Is that possible? I think it is, but I am not heading in the right direction. Any help would be appreciated.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 13:35:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-ingestion-using-kafka-from-crawlers/m-p/78607#M82240</guid>
      <dc:creator>hadoopNoob</dc:creator>
      <dc:date>2022-09-16T13:35:48Z</dc:date>
    </item>
    <item>
      <title>Re: Data ingestion using kafka from crawlers</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-ingestion-using-kafka-from-crawlers/m-p/85813#M82241</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Loading data directly into Kafka without any intermediate service is unlikely to work.&lt;/P&gt;&lt;P&gt;However, you can run a simple Kafka console producer to send all your data to the Kafka service. If your requirement is to save the data to HDFS, you need a few more services alongside Kafka.&lt;/P&gt;&lt;P&gt;For example: &lt;SPAN&gt;Crawlers &amp;gt;&amp;gt; Kafka console producer (or Spark Streaming) &amp;gt;&amp;gt; Flume &amp;gt;&amp;gt; HDFS&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Since your requirement is to store the data in HDFS rather than to stream it, I suggest running a Spark job to write the data to HDFS. Use the commands below.&lt;/P&gt;&lt;P&gt;Start a spark-shell, then execute the following commands in order:&lt;/P&gt;&lt;P&gt;val moveFile = sc.textFile("file:///path/to/Sample.log")&lt;/P&gt;&lt;P&gt;moveFile.saveAsTextFile("hdfs:///tmp/Sample.log")&lt;/P&gt;&lt;P&gt;Note that saveAsTextFile writes a directory of part files at the given path.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2019 13:09:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-ingestion-using-kafka-from-crawlers/m-p/85813#M82241</guid>
      <dc:creator>TonyStank</dc:creator>
      <dc:date>2019-02-03T13:09:06Z</dc:date>
    </item>
  </channel>
</rss>
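<!-- The reply above suggests piping crawler output into a Kafka console producer. A minimal sketch of how a crawler could serialize its records as newline-delimited JSON for that pipeline; the record fields and function name are hypothetical, not from the original thread:

```python
import json


def to_producer_lines(records):
    """Serialize crawler records as newline-delimited JSON,
    one message per line, as kafka-console-producer expects."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


if __name__ == "__main__":
    # Hypothetical crawler output: one dict per crawled page.
    records = [
        {"url": "https://example.com/a", "status": 200},
        {"url": "https://example.com/b", "status": 404},
    ]
    print(to_producer_lines(records))
```

Each printed line becomes one Kafka message when the script's stdout is piped into the console producer, e.g. `python crawl.py | kafka-console-producer --broker-list localhost:9092 --topic crawl-events` (broker address and topic name are placeholders; check the flags of your Kafka version). -->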

