
Effective way to store image files, PDF files in HDFS as sequence format using NiFi

Expert Contributor

Currently working on a POC to effectively store image files or PDF files in HDFS, possibly as sequence files. Since HDFS has a block size of 64 MB, if I want to store a couple of images of 2 MB each, I'll be wasting 62 MB of block size per image. So I am trying to come up with a way to effectively store small image files or PDF files in HDFS without wasting block size. Also, please let me know whether we can ingest these files into HDFS using Apache NiFi, and if so, which processors would be best to use. Thanks.

13 REPLIES

Master Mentor

Try passing a property to set the block size to a smaller value when writing the files? Maybe when you use NiFi you can merge content? Or compress a few images into one large zip before writing to HDFS? Interesting question. @surender nath reddy kudumula
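A minimal sketch of the per-file block size idea, assuming the Hadoop Java client is on the classpath; the 4 MB block size and the /data/images directory are hypothetical choices. NiFi's PutHDFS processor exposes a Block Size property for the same purpose, and MergeContent is the usual processor for batching small flowfiles. (Worth noting: an HDFS block only occupies as much disk as the bytes actually written to it, so the main cost of many small files is NameNode metadata rather than wasted disk.)

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            byte[] bytes = Files.readAllBytes(Paths.get(args[0])); // local image/PDF
            String name = Paths.get(args[0]).getFileName().toString();
            Path target = new Path("/data/images/" + name);        // hypothetical HDFS dir
            // create(path, overwrite, bufferSize, replication, blockSize):
            // request a 4 MB block size for this file instead of the cluster default
            try (FSDataOutputStream out =
                     fs.create(target, true, 4096, (short) 3, 4L * 1024 * 1024)) {
                out.write(bytes);
            }
        }
    }
}
```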

Expert Contributor

@Artem Ervits sounds good... how about also using the sequence file format before merging? Or I believe storing in zip or bzip format would be effective storage, I guess, so we could store any file format, not just JPEGs or PNGs, without wasting block size or disk space. Am I correct? I heard MapR has a different file system which is better suited to storing small files than Hortonworks. And how about zip files that are larger than 64 MB: are these split in HDFS, or do we write a processor in NiFi so that the zip files won't exceed 64 MB?
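For the sequence-file approach, a minimal sketch, again assuming the Hadoop client libraries; packing each small file in as a (filename, bytes) record is my own layout choice, and the /data/images.seq path is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class PackFilesIntoSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/images.seq")), // hypothetical target
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            for (String localFile : args) { // local image/PDF paths from the command line
                byte[] bytes = Files.readAllBytes(Paths.get(localFile));
                writer.append(new Text(localFile), new BytesWritable(bytes));
            }
        }
    }
}
```

On the splittability question: HDFS splits any file into blocks regardless of format, but for processing, a plain zip is not a splittable format, whereas bzip2 and block-compressed SequenceFiles are, which is one argument for the layout above.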

Master Mentor

@surender nath reddy kudumula can't comment on MapR, but if you achieve what you're planning, it will be a great candidate for an article on this site. Here are some sample NiFi templates.

Expert Contributor

Thanks for your reply @Artem Ervits, I will implement the POC.

Expert Contributor
@Artem Ervits

Just wondering: once the zip files are in HDFS, streamed in via NiFi in zip format, I believe we need a way to automate the unzip process and analyse the files stored in the zip archives. Any ideas how we can achieve this, please? Thank you.

Master Mentor

@surender nath reddy kudumula you can execute shell commands in a NiFi processor to achieve that.
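As a sketch of what such a command could do, here is a standalone Java example (not a NiFi processor itself) that unpacks a zip already sitting in HDFS into individual HDFS files; the paths are hypothetical, and in a flow this could be invoked from something like ExecuteStreamCommand:

```java
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UnzipHdfsArchive {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path zipPath = new Path(args[0]); // e.g. /data/incoming/images.zip (hypothetical)
        Path outDir  = new Path(args[1]); // e.g. /data/unzipped (hypothetical)
        try (ZipInputStream zin = new ZipInputStream(fs.open(zipPath))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // write each archive member back out as its own HDFS file
                try (FSDataOutputStream out =
                         fs.create(new Path(outDir, entry.getName()), true)) {
                    IOUtils.copyBytes(zin, out, conf, false); // false: keep the zip stream open
                }
            }
        }
    }
}
```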


@surender nath reddy kudumula

This may not answer the image size question, but it may give you some ideas for your POC. The images are stored in HBase and processed.

See A Non Standard Use Case of Hadoop High Scale Image Processing and Analysis by TrueCar

The slides are at Hadoop Image Processing Pipeline
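For reference, a minimal sketch of the HBase side of such a pipeline, assuming the HBase 1.x client API; the images table, the f column family, and using the file name as the row key are my own hypothetical choices, not anything from the talk:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreImageInHbase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("images"))) { // hypothetical table
            byte[] bytes = Files.readAllBytes(Paths.get(args[0]));       // local image file
            Put put = new Put(Bytes.toBytes(args[0]));                   // row key = file name
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), bytes);
            table.put(put);
        }
    }
}
```

Keeping small binaries as cell values sidesteps the HDFS small-files issue entirely, at the cost of staying within HBase's practical cell-size limits (typically a few MB per cell).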

Expert Contributor

Thanks @Ancil McBarnett, will have a look. :)

Contributor

Hi, I want to process images under Hadoop but I do not know which format would be easiest! Is it better to store them in a sequence file, or in HBase, or something else? Note that I will process these images with C++ programs that call OpenCV and ffmpeg. @surender nath reddy kudumula, @Ancil McBarnett