<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Storm HDFS Bolt question (Trident api) in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128573#M51471</link>
    <description></description>
    <pubDate>Tue, 17 Jan 2017 20:03:54 GMT</pubDate>
    <dc:creator>jknulst</dc:creator>
    <dc:date>2017-01-17T20:03:54Z</dc:date>
    <item>
      <title>Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128565#M51463</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I need to know whether the file in HDFS that Storm is currently writing to is recognizable as an 'in flight' file. For instance, Flume marks in-flight files like &amp;lt;filename&amp;gt;.tmp (or something like that). How does Storm do this?&lt;/P&gt;&lt;P&gt;Maybe somebody knows this offhand, so that I don't have to build a test setup myself.&lt;/P&gt;&lt;P&gt;Edit: the final goal is to have a batch-oriented process pick up only completed/closed files.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2017 00:06:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128565#M51463</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2017-01-13T00:06:53Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128566#M51464</link>
      <description>&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;HDFS spout assumes that files visible in the monitored directory are not actively being updated. Only after a file is completely written should it be made visible to the spout. Following are two approaches for ensuring this:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Write the file to another directory. When the write operation is finished, move the file to the monitored directory.&lt;/LI&gt;&lt;LI&gt;Create the file in the monitored directory with an '.ignore' suffix; HDFS spout ignores files with an '.ignore' suffix. When the write operation is finished, rename the file to omit the suffix.&lt;/LI&gt;&lt;/UL&gt;
&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;When the spout is actively consuming a file, it renames the file with an &lt;CODE&gt;.inprogress&lt;/CODE&gt; suffix. After consuming all contents in the file, the file is moved to a configurable done directory and the &lt;CODE&gt;.inprogress&lt;/CODE&gt; suffix is dropped. &lt;A rel="user" href="https://community.cloudera.com/users/12513/jknulst.html" nodeid="12513"&gt;@Jasper&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-ingest-hdfs.html" target="_blank"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-ingest-hdfs.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2017 00:18:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128566#M51464</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2017-01-13T00:18:07Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128567#M51465</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/393/aervits.html" nodeid="393"&gt;@Artem Ervits&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks Artem, but I actually meant the exact opposite: Storm writing to HDFS, not reading from it.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2017 00:34:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128567#M51465</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2017-01-13T00:34:32Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128568#M51466</link>
      <description>&lt;P&gt;I can't find an exact reference to it, but it seems you can handle the finished file with an easy solution like the following.&lt;/P&gt;&lt;P&gt;Both the HDFS bolt and the Trident State implementation allow you to register any number of &lt;CODE&gt;RotationAction&lt;/CODE&gt;s. A &lt;CODE&gt;RotationAction&lt;/CODE&gt; provides a hook that lets you perform some action right after a file is rotated, for example moving the file to a different location or renaming it.&lt;/P&gt;&lt;PRE&gt;public class MoveFileAction implements RotationAction {
    private static final Logger LOG = LoggerFactory.getLogger(MoveFileAction.class);

    private String destination;

    public MoveFileAction withDestination(String destDir){
        destination = destDir;
        return this;
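    }

    // Assumed signature of RotationAction.execute in storm-hdfs; moves the rotated file.
    @Override
    public void execute(FileSystem fileSystem, Path filePath) throws IOException {
        Path destPath = new Path(destination, filePath.getName());
        LOG.info("Moving file {} to {}", filePath, destPath);
        fileSystem.rename(filePath, destPath);
    }
}
&lt;/PRE&gt;&lt;P&gt;If you are using Trident and sequence files, you can do something like this:&lt;/P&gt;&lt;PRE&gt;        HdfsState.Options seqOpts = new HdfsState.SequenceFileOptions()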
                .withFileNameFormat(fileNameFormat)
                .withSequenceFormat(new DefaultSequenceFormat("key", "data"))
                .withRotationPolicy(rotationPolicy)
                .withFsUrl("hdfs://localhost:54310")
                .addRotationAction(new MoveFileAction().withDestination("/dest2/"));

&lt;/PRE&gt;</description>
      <pubDate>Fri, 13 Jan 2017 01:39:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128568#M51466</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2017-01-13T01:39:24Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128569#M51467</link>
      <description>&lt;P&gt;Hi Jasper -- No, you can't tell just by the filename.  When I had to solve that same problem I ended up using the HDFS API to test whether or not the file was open.  This wasn't too painful since the application I wrote was already using the Java API.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2017 06:14:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128569#M51467</guid>
      <dc:creator>Aaron_Dossett</dc:creator>
      <dc:date>2017-01-13T06:14:54Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128570#M51468</link>
      <description>&lt;P&gt;This is right, but I am VERY wary of using `RotationAction`s, because if a worker crashes and restarts before the action has run, the action will never be performed. For that reason I would never use a `RotationAction` in production.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2017 06:16:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128570#M51468</guid>
      <dc:creator>Aaron_Dossett</dc:creator>
      <dc:date>2017-01-13T06:16:01Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128571#M51469</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/755/aarondossett.html" nodeid="755"&gt;@Aaron Dossett&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Hey Aaron, were you using the storm-core HdfsBolt or the Trident API? The Trident one should guarantee the action in the face of failures and crashes. Can you elaborate a bit?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 06:05:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128571#M51469</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2017-01-17T06:05:07Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128572#M51470</link>
      <description>&lt;P&gt;@Jasper right now the HDFS bolt does not mark the current file as "in-progress". A reasonable solution is to use a RotationAction to move the rotated files to a different directory, but it's possible that if the worker crashes in the middle of a rotation, the file will not be moved to the destination (this applies to both core and Trident).&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 14:31:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128572#M51470</guid>
      <dc:creator>aiyer1</dc:creator>
      <dc:date>2017-01-17T14:31:16Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128573#M51471</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2564/aiyer.html" nodeid="2564"&gt;@Arun Mahadevan&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/755/aarondossett.html" nodeid="755"&gt;@Aaron Dossett&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/172/schintalapani.html" nodeid="172"&gt;@Sriharsha Chintalapani&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I am kind of confused right now, so let me rephrase what I have understood so far in my own words:&lt;/P&gt;&lt;P&gt;Whereas Trident can provide strong exactly-once semantics for persisting stream aggregates and for tuples making it into any HDFS file, the action of rotating the file itself is not protected by these same strong guarantees?&lt;/P&gt;&lt;P&gt;Or is the rotation protected by exactly-once, but not the .addRotationAction attached to it?&lt;/P&gt;&lt;P&gt;It is just not clear in the documentation: &lt;A href="https://github.com/apache/storm/tree/master/external/storm-hdfs#hdfs-bolt-support-for-trident-api" target="_blank"&gt;https://github.com/apache/storm/tree/master/external/storm-hdfs#hdfs-bolt-support-for-trident-api&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Suppose the file rotation is exactly-once; then it could work to set the sync policy to the exact same size limit as the size-based rotation policy. That way the files would only become visible to HDFS clients (synced) when that size limit is met.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 20:03:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128573#M51471</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2017-01-17T20:03:54Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128574#M51472</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12513/jknulst.html" nodeid="12513"&gt;@Jasper&lt;/A&gt; I think it's correct to say that actions added by .addRotationAction are AT MOST ONCE, even for Trident.&lt;/P&gt;&lt;P&gt;I'm not sure using a sync policy would work either; I believe those guarantee a minimum sync frequency, but additional syncs are still possible (I could be wrong, I haven't looked at that code in a while).&lt;/P&gt;&lt;P&gt;Said another way, you want to treat the unit of work as an HDFS file, but in Storm itself the unit of work is a tuple. Thus it will be hard to meet your goal in a clean way.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2017 04:44:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128574#M51472</guid>
      <dc:creator>Aaron_Dossett</dc:creator>
      <dc:date>2017-01-18T04:44:12Z</dc:date>
    </item>
    <item>
      <title>Re: Storm HDFS Bolt question (Trident api)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128575#M51473</link>
      <description>&lt;P&gt;With the help of the remarks by &lt;A rel="user" href="https://community.cloudera.com/users/755/aarondossett.html" nodeid="755"&gt;@Aaron Dossett&lt;/A&gt; I found a solution to this.&lt;/P&gt;&lt;P&gt;Knowing that Storm does not mark the HDFS file currently being written to, and that .addRotationAction is not robust enough in extreme cases, I turned to a low-level solution.&lt;/P&gt;&lt;P&gt;HDFS can report the files on a path that are open for write:&lt;/P&gt;&lt;PRE&gt;hdfs fsck &amp;lt;storm_hdfs_state_output_path&amp;gt; -files -openforwrite&lt;/PRE&gt;&lt;P&gt;Alternatively, you can list only the files on a path that are NOT open, since fsck ignores open files by default:&lt;/P&gt;&lt;PRE&gt;hdfs fsck &amp;lt;storm_hdfs_state_output_path&amp;gt; -files&lt;/PRE&gt;&lt;P&gt;The output is quite verbose, but you can use sed or awk to extract the closed/completed files from it.&lt;/P&gt;&lt;P&gt;(The Java HDFS API has similar hooks; this is just a CLI-level solution.)&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2017 20:02:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Storm-HDFS-Bolt-question-Trident-api/m-p/128575#M51473</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2017-01-25T20:02:24Z</dc:date>
    </item>
  </channel>
</rss>

