<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Move files from a spooling directory to HDFS with flume in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/23135#M4327</link>
    <description>&lt;P&gt;I'm implementing a small Hadoop cluster for a POC in my company. I'm trying to import files into HDFS with Flume. Each file contains JSON objects like this (one "long" line per file):&lt;/P&gt;&lt;PRE&gt;{ "objectType" : [ { JSON Object } , { JSON Object }, ... ] }&lt;/PRE&gt;&lt;P&gt;"objectType" is the type of the objects in the array (e.g. events, users, ...).&lt;/P&gt;&lt;P&gt;These files will be processed later by several tasks depending on the "objectType".&lt;/P&gt;&lt;P&gt;I'm using the spoolDir source and the HDFS sink.&lt;/P&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Is it possible to keep the source filename when Flume writes into HDFS? (Filenames are unique, as they contain a timestamp and a UUID.)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is there a way to set "deserializer.maxLineLength" to an unlimited value (instead of setting a high value)?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;I really don't want to lose data. Which channel is best, JDBC or File? (I do not have a high-throughput flow.)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;My constraint is that I have to use Flume out-of-the-box (no custom elements) as much as possible.&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 09:16:55 GMT</pubDate>
    <dc:creator>AlinaGHERMAN</dc:creator>
    <dc:date>2022-09-16T09:16:55Z</dc:date>
    <item>
      <title>Move files from a spooling directory to HDFS with flume</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/23135#M4327</link>
      <description>&lt;P&gt;I'm implementing a small Hadoop cluster for a POC in my company. I'm trying to import files into HDFS with Flume. Each file contains JSON objects like this (one "long" line per file):&lt;/P&gt;&lt;PRE&gt;{ "objectType" : [ { JSON Object } , { JSON Object }, ... ] }&lt;/PRE&gt;&lt;P&gt;"objectType" is the type of the objects in the array (e.g. events, users, ...).&lt;/P&gt;&lt;P&gt;These files will be processed later by several tasks depending on the "objectType".&lt;/P&gt;&lt;P&gt;I'm using the spoolDir source and the HDFS sink.&lt;/P&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Is it possible to keep the source filename when Flume writes into HDFS? (Filenames are unique, as they contain a timestamp and a UUID.)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is there a way to set "deserializer.maxLineLength" to an unlimited value (instead of setting a high value)?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;I really don't want to lose data. Which channel is best, JDBC or File? (I do not have a high-throughput flow.)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;My constraint is that I have to use Flume out-of-the-box (no custom elements) as much as possible.&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:16:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/23135#M4327</guid>
      <dc:creator>AlinaGHERMAN</dc:creator>
      <dc:date>2022-09-16T09:16:55Z</dc:date>
    </item>
    <item>
      <title>Re: Move files from a spooling directory to HDFS with flume</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/23143#M4328</link>
      <description>&lt;P&gt;If you want each file to remain whole, you can use the BlobDeserializer[1] for&amp;nbsp;the &lt;FONT face="courier new,courier"&gt;deserializer&lt;/FONT&gt; parameter of the SpoolingDirectorySource[2]:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class="na"&gt;a1.channels&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;c1&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sources&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;src-1&lt;/SPAN&gt;

&lt;SPAN class="na"&gt;a1.sources.src-1.type&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;spooldir&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sources.src-1.channels&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;ch-1&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sources.src-1.spoolDir&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;/var/log/apache/flumeSpool&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sources.src-1.fileHeader&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;true&lt;BR /&gt;&lt;/SPAN&gt;a1.sources.src-1.deserializer = &lt;SPAN&gt;org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;If you need to, set &lt;FONT face="courier new,courier"&gt;deserialzier.maxBlobLength&lt;/FONT&gt; to the maximum file size you'll be picking up. The default is 100 million bytes. This won't work for very large files as the entire file contents will get buffered into RAM.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The File channel is the best option for reliable data flow.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you want the output file to have the same name is the input file, you can set the&amp;nbsp;&lt;SPAN&gt;&lt;FONT face="courier new,courier"&gt;basenameHeader&lt;/FONT&gt; parameter to true. This will set a header in the flume event called&amp;nbsp;&lt;FONT face="courier new,courier"&gt;basename&lt;/FONT&gt;. You can customize the name of the header by setting&amp;nbsp;&lt;FONT face="courier new,courier"&gt;basenameHeaderKey&lt;/FONT&gt;. Then in your sink configuration, you can refer to the header value in the filePrefix with something like this:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class="na"&gt;a1.channels&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;c1&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sinks&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;k1&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sinks.k1.type&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;hdfs&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sinks.k1.channel&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;c1&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sinks.k1.hdfs.path&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="s"&gt;/flume/events/&lt;/SPAN&gt;
&lt;SPAN class="na"&gt;a1.sinks.k1.hdfs.filePrefix&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; %{basename}-&lt;BR /&gt;a1.sinks.k1.hdfs.fileType = DataStream&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;HTH,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;-Joey&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;[1]&amp;nbsp;&lt;A target="_blank" href="http://flume.apache.org/FlumeUserGuide.html#blobdeserializer"&gt;http://flume.apache.org/FlumeUserGuide.html#blobdeserializer&lt;/A&gt;&lt;/P&gt;&lt;P&gt;[2]&amp;nbsp;&lt;A target="_blank" href="http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source"&gt;http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 29 Dec 2014 17:57:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Move-files-from-a-spooling-directory-to-HDFS-with-flume/m-p/23143#M4328</guid>
      <dc:creator>joey</dc:creator>
      <dc:date>2014-12-29T17:57:11Z</dc:date>
    </item>
  </channel>
</rss>