Effective way to store image files and PDF files in HDFS in sequence file format using NiFi

Contributor

I am currently working on a POC to store image files or PDF files efficiently in HDFS, possibly in sequence file format. HDFS has a block size of 64 MB, so if I store a couple of images of 2 MB each, I assume I will be wasting about 60 MB of the block. I am trying to come up with a way to store small image or PDF files in HDFS without wasting block space. Also, please let me know whether we can ingest these files into HDFS using Apache NiFi and, if so, which processors would be best to use. Thanks.


Mentor

You could try passing a property to set a smaller block size when writing the files. With NiFi, maybe you can merge content, or compress a few images into one large zip before writing to HDFS? Interesting question. @surender nath reddy kudumula

Contributor

@Artem Ervits sounds good. How about also using sequence file format before merging? Or I believe storing the files in zip or bzip format would be the most effective option, so we could store any file format, not just JPEGs or PNGs, without wasting block space or disk space. Am I correct? I heard MapR has a different file system that is better suited to storing small files than what Hortonworks offers. What about zip files larger than 64 MB: are they split in HDFS, or should we write a processor in NiFi so that the zip files won't exceed 64 MB?
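To make the "zip files that won't exceed 64 MB" idea concrete, here is a minimal sketch in plain Python of the batching logic only. It is not NiFi code (a processor such as MergeContent would normally do this step), and the function names, file names, and the 5 MB cap used to force a split are all hypothetical. Note that the cap applies to the payload; the zip container adds a small overhead on top.

```python
import io
import zipfile

def batch_by_size(files, max_bytes):
    """Group (name, data) pairs into batches whose total payload stays within max_bytes.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, data in files:
        # Start a new batch when adding this file would exceed the cap.
        if current and current_size + len(data) > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append((name, data))
        current_size += len(data)
    if current:
        batches.append(current)
    return batches

def write_zip(batch):
    """Pack one batch into an in-memory zip archive and return its bytes."""
    buf = io.BytesIO()
    # ZIP_STORED (no compression): image payloads are often already compressed.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, data in batch:
            zf.writestr(name, data)
    return buf.getvalue()

# Hypothetical input: five 2 MB placeholder "images", with a 5 MB cap
# chosen only so that the split is visible in a small example.
files = [(f"img_{i}.jpg", bytes(2 * 1024 * 1024)) for i in range(5)]
batches = batch_by_size(files, max_bytes=5 * 1024 * 1024)
archives = [write_zip(b) for b in batches]
```

With a real 64 MB cap, the same five files would land in a single archive. Each archive could then be written to HDFS as one file.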

Mentor

@surender nath reddy kudumula I can't comment on MapR, but if you achieve what you're planning, it will be a great candidate for an article on this site. Here are some sample NiFi templates.

Contributor

Thanks for your reply, @Artem Ervits. I will implement the POC. Thanks.

Contributor
@Artem Ervits

Just wondering: once the zip files are in HDFS, and if they are being streamed into HDFS by NiFi in zip format, I believe we need a way to automate the unzip process and analyse the files stored in each zip archive. Any ideas how we can achieve this? Thank you.

Mentor

@surender nath reddy kudumula you can execute shell commands from a NiFi processor to achieve that.
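As an alternative illustration of the unzip-and-analyse step, here is a minimal Python sketch that walks the members of a zip archive in memory instead of shelling out. It assumes the archive bytes have already been fetched from HDFS by some other means; the helper name and the tiny test archive are hypothetical and only there so the sketch runs without a cluster.

```python
import io
import zipfile

def iter_zip_members(archive_bytes):
    """Yield (name, data) for every regular file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            yield info.filename, zf.read(info)

# Build a tiny archive in memory with placeholder contents.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.jpg", b"\xff\xd8 fake jpeg")
    zf.writestr("b.pdf", b"%PDF- fake pdf")

# "Analysis" here is just the byte length of each member.
sizes = {name: len(data) for name, data in iter_zip_members(buf.getvalue())}
```

In a real pipeline the per-member step would hand `data` to whatever image or PDF analysis you need.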

@surender nath reddy kudumula

This may not answer the image size question, but it may give you some ideas for your POC. In this example the images are stored in HBase and processed there.

See "A Non Standard Use Case of Hadoop High Scale Image Processing and Analysis" by TrueCar.

The slides are at "Hadoop Image Processing Pipeline".

Contributor

Thanks @Ancil McBarnett, I will have a look. :)

Contributor

Hi, I want to process images on Hadoop, but I do not know which format would make that easiest. Is it better to store them in a sequence file, in HBase, or in something else? Note that I will process these images with C++ programs that call OpenCV and FFmpeg. @surender nath reddy kudumula, @Ancil McBarnett

Mentor

@surender nath reddy kudumula has this been resolved? Can you post your solution or accept the best answer?

Explorer

The parameter you want to pass in is -Ddfs.block.size=<value in bytes>. This sets the block size to the desired value for the transfer.

Hi guys,

I want to clear up a possible confusion about block size and storage usage in this post.

@surender nath reddy kudumula when you say in your question that storing a couple of 2 MB images with a 64 MB block size wastes about 60 MB of the block, I understand you to mean that you lose storage capacity. This is not the case in HDFS: a file smaller than a single block does not occupy a full block's worth of storage, so no storage is wasted.
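A quick back-of-the-envelope check of that point, as a minimal Python sketch. The replication factor of 3 is an assumption (a common default), the function names are hypothetical, and the small per-block checksum overhead is ignored.

```python
MB = 1024 * 1024

def hdfs_disk_usage(file_sizes, replication=3):
    """Raw bytes consumed on DataNodes: actual file bytes times replication."""
    return sum(file_sizes) * replication

def naive_full_block_usage(file_sizes, block_size, replication=3):
    """Usage IF every block were padded to full size (HDFS does not do this)."""
    # -(-a // b) is integer ceiling division; every file needs at least one block.
    blocks = sum(max(1, -(-size // block_size)) for size in file_sizes)
    return blocks * block_size * replication

images = [2 * MB, 2 * MB]                          # two 2 MB images
actual = hdfs_disk_usage(images)                   # 12 MB across all replicas
feared = naive_full_block_usage(images, 64 * MB)   # 384 MB if blocks were padded
```

So the two images cost roughly 12 MB of raw disk with three replicas, not 384 MB.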

The real problem with small files is their impact on processing performance. This is why you should use sequence files, HAR, HBase, or a merging solution. You can read more on this aspect here: https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster....
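To give a rough feel for why merging helps, here is a minimal Python sketch that counts blocks for many small files versus one merged container file. The file count and sizes are made-up illustration values and the function name is hypothetical. The block count is only a proxy: each block is something the NameNode has to track, and with default file-based input formats each small file typically becomes at least one input split.

```python
import math

MB = 1024 * 1024

def block_count(file_sizes, block_size):
    """Each non-empty file occupies at least one block; larger files span several."""
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)

small_files = [2 * MB] * 10_000                      # hypothetical: 10,000 images of 2 MB
separate = block_count(small_files, 64 * MB)         # stored one file per image
merged = block_count([sum(small_files)], 64 * MB)    # stored as one large container
```

Here the same data goes from 10,000 blocks to 313 when merged, which is the kind of reduction sequence files, HAR, or zip batching are meant to achieve.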

New Contributor

Have you figured out the solution yet? Would you mind sharing it with us? I have the same problem in a POC project. Thanks.
