Created 11-16-2015 04:24 PM
Earlier Hadoop versions had problems with many small files because of the demands they placed on the NameNode. Modern machines and newer NameNode versions seem to have mitigated this somewhat, but by how much? Is there a rule of thumb for how many files is too many? Small files also carry proportionately more overhead per MB of data. Is there a rule of thumb for what is too small?
Created 11-17-2015 10:15 AM
I've seen several systems with 400+ million objects represented in the Namenode without issues. In my opinion, though, that's not the "right" question. Certainly, the classic answer to small files has been the pressure they put on the Namenode, but that's only part of the equation. And with faster hardware/CPUs and larger memory thresholds, that number has certainly climbed over the years since the small file problem was first documented.
The better question is: How do small files "impact" cluster performance? Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the Namenode pressures, is more specifically related to "job" performance.
Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data moving across the backplane and increased cluster I/O chatter. A mapper, in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature, taken up by the construction and management of all those resources.
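To make the "combine" trick concrete, here's a minimal driver sketch (assuming the standard Hadoop 2.x MapReduce API; the paths, split size, and job name are just placeholders) that packs many small files into larger splits so a single mapper reads many of them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into ~128 MB splits so one mapper handles many files.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper/reducer classes omitted; the identity defaults simply pass records through.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even then, a combined split may still have to pull blocks from other nodes, which is the extra back-plane traffic and I/O chatter mentioned above.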
Consider the impact on a cluster when this happens. For example, I once had a client who was trying to get more out of their cluster, but one job was processing 80,000 files, which led to the creation of 80,000 mappers, which led to consuming ALL the cluster resources, several times over.
Follow that path a bit further and you'll find that the impact on the Namenode is exacerbated by all of the intermediate files generated by the mappers for the shuffle/sort phases.
That's the real impact on a cluster. A little work in the beginning can have a dramatic effect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files.
Here's another way to approach it, which is even more evident when dealing with ORC files. Processing a 1 MB file has overhead to it, so processing 128 1 MB files will cost you 128 times more "administrative" overhead than processing a single 128 MB file. In plain text, that 1 MB file may contain 1,000 records, while the 128 MB file might contain 128,000 records. And I've typically seen 85-92% compression ratios with ORC files, so you could safely say that a 128 MB ORC file contains over 1 million records. Sidebar: that may have been why the default stripe size in ORC was changed to 64 MB from 128 MB a few versions back.
The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce the impact on the Namenode, and increase throughput overall, EVERYWHERE.
The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data rather than just "managing" the job.
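As a rough illustration of that tuning (the property names are the standard MapReduce 2 ones; the sizes are placeholders, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public final class ContainerSizing {
    public static Job rightSizedJob() throws Exception {
        Configuration conf = new Configuration();
        // Illustrative numbers only; size containers to the real per-record work.
        conf.set("mapreduce.map.memory.mb", "2048");       // YARN container for each map task
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");  // JVM heap, roughly 80% of the container
        conf.set("mapreduce.reduce.memory.mb", "4096");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        return Job.getInstance(conf, "right-sized-containers");
    }
}
```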
Sometimes small files can't be avoided, but deal with them early to limit the repetitive impact on your cluster. Here's a list of general patterns to reduce the number of small files (a consolidation sketch follows the list):
NiFi - Use a combine processor to consolidate flows and aggregate data before it even gets to your cluster.
Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right" sized HDFS files for further refinement.
Hive - Process the small files regularly and often to produce larger files for "repetitive" processing. And in the classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set to streamline the impact on downstream tasks.
Sqoop - Manage the number of mappers to generate appropriately sized files.
Oh, and if you NEED to keep those small files as "sources"... archive them with Hadoop Archives (the hadoop archive command, producing 'har' files) and save your Namenode the cost of managing all those objects.
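As one concrete version of the "consolidate early" pattern (the sketch the list refers to), here's a minimal example using the Hadoop 2.x FileUtil.copyMerge helper to collapse a directory of small text files into one larger HDFS file; the paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ConsolidateSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path smallFilesDir = new Path("/data/landing/2015-11-17");     // hypothetical source directory
        Path mergedFile    = new Path("/data/refined/2015-11-17.txt"); // hypothetical consolidated output

        // Concatenate every file under the source directory into a single HDFS file.
        // The last argument is appended after each source file (a newline here).
        boolean merged = FileUtil.copyMerge(fs, smallFilesDir, fs, mergedFile,
                false /* keep the originals */, conf, "\n");
        System.out.println(merged ? "merged" : "nothing to merge");
    }
}
```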
Created 11-17-2015 10:20 AM
Almost forgot... another side effect of small files on a cluster shows up while running the Balancer: you'll move LESS data and increase the impact on the Namenode even further.
So YES, small files are bad. But only as bad as you're willing to "pay" for in most cases. If you don't mind doubling the size of the cluster to address some files, then do it. But I wouldn't. I'd do a bit of planning and refinement and bank the extra nodes for more interesting projects. 🙂
Created 11-17-2015 10:38 AM
@David Streever nicely explained!!!! Thanks!
Created 11-17-2015 10:39 AM
Great write-up.Thanks!
Created 11-17-2015 03:03 PM
Fantastic answer. Exactly what I was looking for.
Created 11-17-2015 05:05 PM
Real nice answer 🙂
Created 11-23-2015 05:36 PM
Definitely the response by @David Streever is spot-on, but I had a cluster where the OS-level inode settings came back to bite us, and I ended up blogging about it in "small files and hadoop's hdfs (bonus: an inode formula)", where I ultimately offered up an "inode calculator" in the form of a simple algebraic formula.
NbrOfFilesInodeSettingCanSupport = ( ( InodeSettingPerDisk * NbrDisksPerDN * NbrDNs ) / ( ReplFactor * 2.1 ) ) / AvgNbrBlocksPerFile
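For anyone who'd rather plug numbers in, here's the same formula transcribed as a tiny Java helper (a direct transcription of the algebra above, nothing more):

```java
public final class InodeCalculator {
    /** Rough number of HDFS files an OS-level inode setting can support, per the formula above. */
    public static long nbrOfFilesInodeSettingCanSupport(long inodeSettingPerDisk,
                                                        int nbrDisksPerDN,
                                                        int nbrDNs,
                                                        int replFactor,
                                                        double avgNbrBlocksPerFile) {
        double totalInodes = (double) inodeSettingPerDisk * nbrDisksPerDN * nbrDNs;
        // 2.1 is the constant from the formula above (roughly a block file plus its
        // checksum companion on each DataNode disk, with a little slack).
        return (long) (totalInodes / (replFactor * 2.1) / avgNbrBlocksPerFile);
    }
}
```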
Always interested if anyone else has run across a similar problem, or for anyone to (in)validate my thinking.
As for how we got into thinking along these lines... well... something (as usual) broke!
This scenario surfaced under an extreme corner case. It was a cluster with 8 worker nodes, each having 4 disks. This group's boxes were configured with 953856 inodes per disk (plus 1K block sizes), which actually yields over 190 million files according to my formula, given that their initial ingested files never spanned more than 1 block each.
Yes, they had (very) SMALL files, and despite recommendations to either cat them together prior to loading into HDFS and/or run a Pig script to read a ton of them and collapse them into big bulk files, they tried to load up a massive number of 10K files and then were surprised when their disks were at about 1% utilization, yet they got "-ksh: sample.txt: cannot create [No space left on device]" messages from Linux file operations.
Fortunately, it was early enough in their project that they could rebuild the hosts (you can't change inode values without rebuilding the filesystems) and do a bit of re-architecting to adopt the prior recommendations. Hey, it's a "journey" for almost all of us -- sometimes filled with peril. 😉
Created 11-23-2015 06:10 PM
Your inode article is a great addition to David's answer. I'm puzzled, though, that any machine would run out of inodes before running out of disk space; it would require a strange configuration of the file system, wouldn't it? Was someone trying to save on inode allocation by assuming the average file would be larger? I can't think of any other reason to stray from the defaults. Any idea why?
Created 11-24-2015 06:20 AM
Well, it seems 600 characters weren't enough to supply the back story, @Peter Coates, so I added some high-level details to my original answer. I'd be glad to share more fine-grained info via email if you'd like.