Created on 01-20-2016 10:26 AM - edited 09-16-2022 02:58 AM
Currently working on a POC to efficiently store image files or PDF files in HDFS, maybe as sequence files. Since HDFS has a block size of 64 MB, if I want to store a couple of images of 2 MB each, I'll be wasting most of each 64 MB block. So I am trying to come up with a way to store small image or PDF files in HDFS efficiently without wasting block-size space. Also, please let me know whether we can ingest these files into HDFS using Apache NiFi, and if so, which processors would be best to use. Thanks.
Created 01-20-2016 06:48 PM
Could you pass a property to set a smaller block size when writing the files? Or, when you use NiFi, maybe you can merge content, or compress a few images into one large zip before writing to HDFS? Interesting question. @surender nath reddy kudumula
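If the per-file block size route is of interest, here is a minimal sketch using the Hadoop FileSystem API; the HDFS path and the 8 MB value are made up for illustration, and this is a plain Java client rather than the NiFi route:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path("/data/images/img001.jpg");   // hypothetical HDFS path
        long blockSize = 8L * 1024 * 1024;                    // 8 MB instead of the 64 MB default

        // FileSystem.create accepts a per-file block size that overrides the cluster default
        try (FSDataOutputStream out = fs.create(
                target,
                true,                                         // overwrite if present
                conf.getInt("io.file.buffer.size", 4096),
                fs.getDefaultReplication(target),
                blockSize)) {
            out.write(new byte[]{ /* image bytes would go here */ });
        }
    }
}
```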
Created 01-20-2016 07:16 PM
@Artem Ervits sounds good... how about also using the sequence file format before merging? Or I believe storing in zip or bzip2 format would be the most space-efficient, since we could store any file format, not just JPEGs or PNGs, without wasting block or disk space. Am I correct? I heard MapR has a different file system that is better suited for storing small files compared to Hortonworks. Also, what about zip files larger than 64 MB: are these split in HDFS, or should we write a NiFi processor so that the zip files don't exceed 64 MB?
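For the sequence-file idea, here is a minimal sketch that packs a local directory of small images/PDFs into one block-compressed SequenceFile (keys are file names, values are raw bytes). The paths and the choice of BZip2Codec are assumptions for illustration, not a definitive setup:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;

import java.io.File;
import java.nio.file.Files;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/images/images.seq");           // hypothetical HDFS target

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // BLOCK compression groups many small records before compressing them
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new BZip2Codec()));
        try {
            File[] files = new File("/local/images").listFiles(); // hypothetical local source dir
            if (files != null) {
                for (File f : files) {
                    if (!f.isFile()) continue;
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    // key = original file name, value = raw image/PDF bytes
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```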
Created 01-20-2016 08:20 PM
@surender nath reddy kudumula can't comment on MapR, but if you achieve what you're planning, it will be a great candidate for an article on this site. Here are some sample NiFi templates.
Created 01-20-2016 10:54 PM
Thanks for your reply, @Artem Ervits. Will implement the POC. Thanks!
Created 01-20-2016 07:26 PM
Just wondering: once the zip files are in HDFS, streamed in via NiFi in zip format, I believe we need a way to automate the unzip process and analyse the files stored in the zip archive. Any ideas how we can achieve this, please? Thank you.
Created 01-20-2016 08:18 PM
@surender nath reddy kudumula you can execute shell commands from a NiFi processor to achieve that.
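If the unzip step ends up as a standalone command that a processor (or a scheduled job) invokes, a rough sketch of what it would do, assuming Java and hypothetical paths, is to stream the zip out of HDFS and write its entries back as individual files:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class HdfsUnzip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path zipPath = new Path("/data/incoming/images.zip");  // hypothetical zipped drop location
        Path outDir  = new Path("/data/extracted");            // hypothetical output directory

        try (FSDataInputStream in = fs.open(zipPath);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                Path target = new Path(outDir, entry.getName());
                try (FSDataOutputStream out = fs.create(target, true)) {
                    // copy the current entry; 'false' keeps the zip stream open for the next entry
                    IOUtils.copyBytes(zip, out, conf, false);
                }
            }
        }
    }
}
```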
Created 01-20-2016 07:17 PM
This may not answer the image size question, but it may give you some ideas for your POC. The images are stored in HBase and processed.
See "A Non-Standard Use Case of Hadoop: High Scale Image Processing and Analysis" by TrueCar.
The slides are at "Hadoop Image Processing Pipeline".
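Along the lines of that HBase approach, here is a minimal sketch of landing an image as a cell in HBase. The table name, column family, and local file path are assumptions; note that HBase limits cell sizes to roughly 10 MB by default, which 2 MB images fit under:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.nio.file.Files;
import java.nio.file.Paths;

public class ImageToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("images"))) {   // hypothetical table
            // hypothetical local file; in practice the bytes would come from the ingest flow
            byte[] imageBytes = Files.readAllBytes(Paths.get("/local/images/img001.jpg"));

            Put put = new Put(Bytes.toBytes("img001"));                    // row key = image id
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"), imageBytes); // family "d", qualifier "raw"
            table.put(put);
        }
    }
}
```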
Created 01-20-2016 07:29 PM
thanks @Ancil McBarnett. will have a look..:)
Created 11-16-2016 08:33 PM
Hi, I want to process images under Hadoop but I do not know which format will make it easiest. Is it better to store them in a sequence file, or in HBase, or something else? Note that I will process these images with C++ programs that call OpenCV and ffmpeg. @surender nath reddy kudumula, @Ancil McBarnett
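If the sequence-file route is taken, reading the images back for a native pipeline could look roughly like this. This sketch assumes the Text/BytesWritable layout from the packing example above and a hypothetical HDFS path; how the bytes reach the C++/OpenCV/ffmpeg side (JNI, a pipe, or temp files) is a separate choice:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqFile = new Path("/data/images/images.seq");   // hypothetical SequenceFile written earlier

        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(seqFile))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // copyBytes() returns only the valid bytes (getBytes() may be padded)
                byte[] imageBytes = value.copyBytes();
                System.out.println(key + " -> " + imageBytes.length + " bytes");
                // hand imageBytes to the C++/OpenCV/ffmpeg step here
            }
        }
    }
}
```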