Support Questions

vincentD · ‎05-18-2019

We have ten million image and video files and are looking for efficient ways to store them in Hadoop (HDFS ...) and analyze them with tools available in the Hadoop ecosystem. I understand HDFS prefers big files. These image files are small, under ten megabytes each. Please advise. Thanks very much!

Harsh J · ‎05-19-2019

You can do this via two methods: Container files, or HBase MOBs. Which is
the right path depends on your eventual, dominant read pattern for this
data.

If your analysis will load only a small range of images out of the total
dataset, or individual images, then HBase is a better fit with its
key-based access model, column-family storage and caches.
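
As a rough illustration of the key-based access pattern described above, the sketch below uses a plain Java TreeMap as an in-memory stand-in for a table whose rows are kept sorted by key. It is not HBase client code. The class name ImageIndex, the row key layout (date/camera/file name) and the sample keys are all assumptions made for illustration.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a table whose rows are sorted by row key. */
class ImageIndex {
    private final TreeMap<String, byte[]> rows = new TreeMap<>();

    /** Stores one image under its row key. */
    void put(String rowKey, byte[] image) {
        rows.put(rowKey, image);
    }

    /** Point lookup: fetch a single image by its full row key. */
    byte[] get(String rowKey) {
        return rows.get(rowKey);
    }

    /** Range scan: every row whose key starts with the given prefix. */
    SortedMap<String, byte[]> scanPrefix(String prefix) {
        // Character.MAX_VALUE sorts after any character used in the keys,
        // so prefix + MAX_VALUE is an exclusive upper bound for the prefix range.
        return rows.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        ImageIndex index = new ImageIndex();
        // Assumed key layout: <date>/<camera>/<file name>
        index.put("2019-05-18/cam1/0001.jpg", new byte[] {1});
        index.put("2019-05-18/cam2/0001.jpg", new byte[] {2});
        index.put("2019-05-19/cam1/0001.jpg", new byte[] {3});

        // Only the two rows for 2019-05-18 are touched; the rest is never read.
        System.out.println(index.scanPrefix("2019-05-18/").keySet());
    }
}
```

The point of the sketch is the row key design: if the keys are laid out so that the images you usually read together share a prefix, a small range or a single image can be fetched without scanning the whole dataset.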

If instead you will process these images in bulk, then large container
files are a better fit, such as Sequence Files (with BytesWritable or
equivalent) or Parquet Files (with BINARY/BYTE_ARRAY types). These store
multiple images in a single file and allow for fast, sequential reads of
all images in bulk.
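
To make the container-file idea concrete, here is a minimal sketch that packs many small images into one stream as back-to-back (name, length, bytes) records and reads them back sequentially. It uses only the JDK and is not the SequenceFile or Parquet format, which add features such as compression and splittable layouts. The class name ImageContainer and its record layout are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a container file: [name][length][bytes] records back to back. */
class ImageContainer {

    /** Appends every (name, bytes) pair to the stream as one record. */
    static void writeAll(Map<String, byte[]> images, OutputStream out) {
        try {
            DataOutputStream data = new DataOutputStream(out);
            for (Map.Entry<String, byte[]> e : images.entrySet()) {
                data.writeUTF(e.getKey());          // key: file name
                data.writeInt(e.getValue().length); // length prefix for the value
                data.write(e.getValue());           // value: raw image bytes
            }
            data.flush();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Reads records one after another until the stream ends, keeping their order. */
    static Map<String, byte[]> readAll(InputStream in) {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try {
            DataInputStream data = new DataInputStream(in);
            while (true) {
                String name;
                try {
                    name = data.readUTF();
                } catch (EOFException end) {
                    break; // no more records
                }
                byte[] bytes = new byte[data.readInt()];
                data.readFully(bytes);
                images.put(name, bytes);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return images;
    }

    public static void main(String[] args) {
        Map<String, byte[]> images = new LinkedHashMap<>();
        images.put("cat.jpg", new byte[] {1, 2, 3});
        images.put("dog.png", new byte[] {4, 5});

        ByteArrayOutputStream container = new ByteArrayOutputStream();
        writeAll(images, container);

        Map<String, byte[]> restored =
                readAll(new ByteArrayInputStream(container.toByteArray()));
        System.out.println(restored.keySet());
    }
}
```

With a real SequenceFile the shape is similar: the file name would typically be the key and the image bytes the BytesWritable value, so ten million small files become a much smaller number of large files that can be read front to back.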

vincentD · ‎05-23-2019

Thanks a lot for your reply, Harsh. These sound great. Can you give some pointers to learning materials on both methods, e.g. examples, blogs, URLs or books?

Harsh J · ‎05-23-2019

For HBase MOBs, this is a good starting point. Most of the changes are administrative, and the writer API remains the same as for regular cells: https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hbase_mob.html

For SequenceFiles, a good short snippet can be found here: https://github.com/sakserv/sequencefile-examples/blob/master/test/main/java/com/github/sakserv/seque... and for Parquet: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/had...

More general reading for the file formats: https://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/ and https://parquet.apache.org/documentation/latest/

Cloudera Community

Efficient ways to store many images files