
Efficient ways to store many image files


We have ten million image and video files and are looking for efficient ways to store them in Hadoop (HDFS ...) and analyze them with tools available in the Hadoop ecosystem. I understand HDFS prefers big files, but these image files are small, each under ten megabytes. Please advise. Thanks very much!




Master Guru
You can do this via two methods: container files, or HBase MOBs. Which is
the right path depends on your eventual, dominant read pattern for this data.

If your analysis will require loading only a small range of images out
of the total dataset, or individual images, then HBase is a better fit with
its key-based access model, columnar storage, and caches.
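For the HBase route, MOB (Medium Object) storage is enabled per column family. As a sketch (the table name `images`, family `d`, and row key are hypothetical; `IS_MOB` and `MOB_THRESHOLD` are the actual HBase shell attributes, with the threshold given in bytes):

```shell
# In the HBase shell: values in family 'd' larger than 100 KB
# are stored as MOBs instead of inline in the regular store files.
create 'images', {NAME => 'd', IS_MOB => true, MOB_THRESHOLD => 102400}

# Store one image's bytes under a row key you can later fetch directly.
put 'images', 'img-0000001', 'd:bytes', "<binary image data>"
get 'images', 'img-0000001'
```

Row keys give you the point and range reads described above; choose them so related images (e.g. by date or source) sort near each other.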

If instead you will require processing these images in bulk, then large
container files are the better choice: formats such as SequenceFiles (with
BytesWritable or equivalent) or Parquet files (with BINARY/BYTE_ARRAY types)
can store multiple images in a single file and allow fast, sequential reads
of all images in bulk.
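To make the container-file idea concrete, here is a minimal, framework-free Python sketch that packs many small image payloads into one length-prefixed container and reads them back sequentially. The record layout here is hypothetical, just to show the principle; a real deployment would write a Hadoop SequenceFile with Text keys and BytesWritable values instead.

```python
import io
import struct

def pack(records, out):
    """Append (name, payload) pairs to a writable binary stream.

    Each record is: [4-byte name length][name][4-byte payload length][payload].
    Packing thousands of small images this way yields one large file,
    which is what HDFS handles efficiently.
    """
    for name, payload in records:
        name_bytes = name.encode("utf-8")
        out.write(struct.pack(">I", len(name_bytes)))
        out.write(name_bytes)
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)

def unpack(inp):
    """Yield (name, payload) pairs by reading the stream front to back."""
    while True:
        header = inp.read(4)
        if not header:
            return  # clean end of container
        name_len = struct.unpack(">I", header)[0]
        name = inp.read(name_len).decode("utf-8")
        payload_len = struct.unpack(">I", inp.read(4))[0]
        yield name, inp.read(payload_len)

# Round-trip two toy "images" through an in-memory container.
images = [("a.jpg", b"\xff\xd8fake-jpeg-bytes"), ("b.png", b"\x89PNGfake")]
buf = io.BytesIO()
pack(images, buf)
buf.seek(0)
assert list(unpack(buf)) == images
```

Note that reading is a single sequential scan of the whole file, which is exactly the bulk-processing access pattern this layout is meant to serve.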


Thanks a lot for your reply, Harsh. These sound great. Can you give some pointers to learning materials on both methods, e.g. examples, blogs, URLs, or books?
