When to choose which storage layer: HBase vs HDFS?
With respect to size
i) Impact of storing small files in HBase vs HDFS
ii) Impact of storing large files in HBase vs HDFS
Note: I consider small files to be a few KB to a few MB in size (up to 5 MB).
With respect to File types
iii) Storing text files, images in HBase vs HDFS
With respect to "disk fragmentation" vs "compaction" at the operating-system level
This doubt came to me after reading Facebook's paper "Finding a needle in Haystack".
iv) At operating system level, which one is a wise storage choice? Small files or large files?
From what I've googled, storing many small files in HDFS increases the size of the fsimage, which the Namenode must hold in main memory - so beyond a certain number of files it can no longer fit. That was the only limitation I could find.
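To make that limitation concrete, here is a back-of-the-envelope estimate. A commonly cited rule of thumb (not an exact figure) is that each namespace object (file or block) costs the Namenode roughly 150 bytes of heap; the function below is just illustrative arithmetic built on that assumption:

```python
# Rough estimate of Namenode heap consumed by small-file metadata.
# The ~150 bytes per namespace object (file or block) is a commonly
# cited rule of thumb, not an exact measurement.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Approximate heap for file + block metadata in the Namenode."""
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * BYTES_PER_OBJECT

# 100 million small files, each fitting in a single block:
heap = namenode_heap_bytes(100_000_000)
print(f"{heap / 1024**3:.1f} GiB")  # → 27.9 GiB
```

So at hundreds of millions of small files, the metadata alone demands tens of GiB of Namenode heap, regardless of how little actual data each file holds.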
With respect to modifying fsimage for storing multiple small files
v) Would HDFS be a better choice for storing small files if I tweaked HDFS's code for storing the fsimage - for example, pointing it to a path or directory that could accommodate multiple small files?
Which one would be a wise choice for storing many small files, where the number of files grows every day? I would write each file once and read it a few times (say twice or thrice) within the next 10 minutes.
I see healthy conversations in the Hadoop community here, so I've posted my queries. I'm sorry if this isn't the right place to ask.
In general, HBase seems like a natural place to store small files - for example, JSON files of a few KB, or images of a few MB.
In your specific use case (writing files and reading them shortly after), the choice seems even clearer, as you should be able to leverage HBase's in-memory capability for new entries.
In general, I would also not recommend tweaking the source code unless you are really blocked. If you really want to use HDFS for tons of files of a few KB, consider zipping them together before loading them in.
"zipping them together" - Yeah lets consider I've tons of small files. zipping them together
i) would create unnecessary unzipping while reading it as individual files
ii) and the rate at which these files enter into the system may vary on time (I do not have a common criteria to zip files).
Note that this way of getting small files into HDFS is a workaround, so it may or may not suit the situation. Based on the additional information in the comments it may not fit your case, but for this kind of use case it could work: make a zip per minute, and whenever a multiple of x minutes has passed, read it all.
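The "zip per minute" idea above can be sketched as follows. This is a minimal local simulation using Python's `zipfile` module: files arriving with a timestamp are grouped into one archive per minute, and a whole archive can be read back in one go. The actual upload of each archive to HDFS (e.g. with `hdfs dfs -put`) is deliberately left out, and the function and data names are illustrative, not an established API:

```python
import io
import zipfile
from collections import defaultdict

def batch_by_minute(files):
    """Group (name, data, epoch_seconds) tuples into one zip archive per minute.

    Returns a dict mapping minute number -> archive bytes. Pushing each
    archive into HDFS is omitted from this sketch.
    """
    per_minute = defaultdict(list)
    for name, data, ts in files:
        per_minute[ts // 60].append((name, data))

    archives = {}
    for minute, members in per_minute.items():
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, data in members:
                zf.writestr(name, data)
        archives[minute] = buf.getvalue()
    return archives

def read_all(archive_bytes):
    """Read every member back out of one archive ('read it all')."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

# Three small files, two arriving in minute 1 and one in minute 2:
files = [("a.json", b'{"x": 1}', 60),
         ("b.json", b'{"y": 2}', 65),
         ("c.json", b'{"z": 3}', 130)]
archives = batch_by_minute(files)
print(sorted(archives))       # → [1, 2]
print(read_all(archives[1]))  # → {'a.json': b'{"x": 1}', 'b.json': b'{"y": 2}'}
```

This trades individual-file access for batched reads: it fits a workload that always consumes a whole time window at once, but not one that needs random access to single files, which is the objection raised in the comment above.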