<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Efficient ways to store many images files in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90663#M55340</link>
    <description>&lt;P&gt;We have ten million image and video files and are looking for efficient ways to store them in Hadoop (HDFS ...) and analyze them with tools available in the Hadoop ecosystem. I understand that HDFS prefers large files, but these image files are small: each is under ten megabytes. Please advise. Thanks very much!&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 14:23:51 GMT</pubDate>
    <dc:creator>vincentD</dc:creator>
    <dc:date>2022-09-16T14:23:51Z</dc:date>
    <item>
      <title>Efficient ways to store many images files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90663#M55340</link>
      <description>&lt;P&gt;We have ten million image and video files and are looking for efficient ways to store them in Hadoop (HDFS ...) and analyze them with tools available in the Hadoop ecosystem. I understand that HDFS prefers large files, but these image files are small: each is under ten megabytes. Please advise. Thanks very much!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 14:23:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90663#M55340</guid>
      <dc:creator>vincentD</dc:creator>
      <dc:date>2022-09-16T14:23:51Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient ways to store many images files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90687#M55341</link>
      <description>You can do this in two ways: container files or HBase MOBs. The right choice depends on the dominant read pattern you eventually expect for this data.&lt;BR /&gt;&lt;BR /&gt;If your analysis needs to load only individual images, or a small range of images out of the total dataset, then HBase is the better fit, with its key-based access model, columnar storage, and caches.&lt;BR /&gt;&lt;BR /&gt;If instead you need to process these images in bulk, then use large container files, such as SequenceFiles (with BytesWritable or an equivalent) or Parquet files (with BINARY/BYTE_ARRAY types). These store multiple images in a single file and allow fast, sequential reads of all images in bulk.&lt;BR /&gt;</description>
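      <!--
      Editorial sketch of the container-file idea described in the reply above:
      many small files are appended to one large file as (name, bytes) records
      and later read back in a single sequential pass. This is a minimal,
      illustrative layout only. It is NOT the real SequenceFile or Parquet
      format, and the names append_record, read_records, pack_images, and the
      8-byte header layout are assumptions made up for this example.

      ```python
      import io
      import struct
      from typing import BinaryIO, Iterator, Tuple

      # Hypothetical record layout (not the SequenceFile format):
      # 4-byte key length, 4-byte value length (big-endian), key bytes, value bytes.
      HEADER = struct.Struct(">II")


      def append_record(out: BinaryIO, key: str, value: bytes) -> None:
          """Append one (file name, file bytes) record to the container."""
          key_bytes = key.encode("utf-8")
          out.write(HEADER.pack(len(key_bytes), len(value)))
          out.write(key_bytes)
          out.write(value)


      def read_records(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
          """Yield every (file name, file bytes) record in one sequential pass."""
          while True:
              header = src.read(HEADER.size)
              if not header:
                  return  # clean end of the container
              if len(header) != HEADER.size:
                  raise ValueError("truncated record header")
              key_len, value_len = HEADER.unpack(header)
              key = src.read(key_len).decode("utf-8")
              value = src.read(value_len)
              if len(value) != value_len:
                  raise ValueError("truncated record body")
              yield key, value


      def pack_images(images: dict) -> bytes:
          """Pack a mapping of name to bytes into one in-memory container."""
          buf = io.BytesIO()
          for name, data in images.items():
              append_record(buf, name, data)
          return buf.getvalue()
      ```

      On a cluster, the same pattern would presumably be written with Hadoop's
      SequenceFile writer (file name as key, BytesWritable as value) so that
      HDFS stores a few large files instead of millions of small ones.
      -->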
      <pubDate>Mon, 20 May 2019 01:17:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90687#M55341</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-05-20T01:17:15Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient ways to store many images files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90862#M55342</link>
      <description>&lt;P&gt;Thanks a lot for your reply, Harsh. These sound great. Can you point me to some learning materials on both methods, e.g. examples, blogs, URLs, or books?&lt;/P&gt;</description>
      <pubDate>Fri, 24 May 2019 01:48:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90862#M55342</guid>
      <dc:creator>vincentD</dc:creator>
      <dc:date>2019-05-24T01:48:30Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient ways to store many images files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90863#M55343</link>
      <description>For HBase MOBs, this can serve as a good starting point as most of the changes are administrative and the writer API remains the same as regular cells: &lt;A href="https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hbase_mob.html" target="_blank"&gt;https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hbase_mob.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;For SequenceFiles, a good short snippet can be found here: &lt;A href="https://github.com/sakserv/sequencefile-examples/blob/master/test/main/java/com/github/sakserv/sequencefile/SequenceFileTest.java#L65-L70" target="_blank"&gt;https://github.com/sakserv/sequencefile-examples/blob/master/test/main/java/com/github/sakserv/sequencefile/SequenceFileTest.java#L65-L70&lt;/A&gt; and for Parquet: &lt;A href="https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/ExampleParquetWriter.java" target="_blank"&gt;https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/ExampleParquetWriter.java&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;More general reading for the file formats: &lt;A href="https://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/" target="_blank"&gt;https://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/&lt;/A&gt; and &lt;A href="https://parquet.apache.org/documentation/latest/" target="_blank"&gt;https://parquet.apache.org/documentation/latest/&lt;/A&gt;</description>
      <pubDate>Fri, 24 May 2019 03:46:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Efficient-ways-to-store-many-images-files/m-p/90863#M55343</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-05-24T03:46:39Z</dc:date>
    </item>
  </channel>
</rss>

