Support Questions

pcoates · ‎11-16-2015

Earlier Hadoop versions had problems with many small files because of the demands it placed on the NameNode. Modern machines and newer versions of NameNode seem to have mitigated this somewhat, but how much? Is there a rule of thumb for what number of files is too many? Small files also have proportionately more overhead per MB of data. Is there a rule of thumb for what is too small?

dstreev · ‎11-17-2015

I've seen several systems with 400+ million objects represented in the Namenode without issues. In my opinion, that's not the "right" question though. Certainly, the classic answer to small files has been the pressure it put's on the Namenode but that's only a part of the equation. And with hardware / cpu and increase memory thresholds, that number has certainly climbed over the years since the small file problem was documented.

The better question is: How do small files "impact" cluster performance? Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the Namenode pressures, is more specifically related to "job" performance.

Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data back planing and increased cluster I/O chatter. A mapper in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature with the construction and management of all those resources.

Consider the impact to a cluster when this happens. For example, I had a client once that was trying to get more from their cluster but there was a job that was processing 80,000 files. Which lead to the creation of 80,000 mappers. Which lead to consuming ALL the cluster resources, several times over.

Follow that path a bit further and you'll find that the impact on the Namenode is exacerbated with all of the intermediate files generated by the mapper for the shuffle/sort phases.

That's the real impact on a cluster. A little work in the beginning can have a dramatic affect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files.

Here's another way to approach it, which is even more evident when dealing with ORC files. Processing a 1Mb file has an overhead to it. So processing 128 1Mb files will cost you 128 times more "administrative" overhead, versus processing 1 128Mb file. In plain text, that 1Mb file may contain 1000 records. The 128 Mb file might contain 128000 records. And I've typically seen 85-92% compression ratio with ORC files, so you could safely say that a 128 Mb ORC file contains over 1 Million records. Sidebar: Which may of been why the default strip size in ORC's was changed to 64Mb, instead of 128Mb a few version back.

The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce impact to the Namenode and increase throughput overall, EVERYWHERE.

The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data and not "manage" the job.

Sometimes small files can't be avoided, but deal with them early, to limit the repetitive impact to your cluster. Here's a lists of general patterns to reduce the number of small files:

Nifi - Use a combine processor to consolidate flows and aggregate data before if even gets to your cluster.

Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right" sized HDFS files for further refinement.

Hive - Process the Small files regularly and often to produce larger files for "repetitive" processing. And in a classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set again to streamline the impact on downstream tasks.

Sqoop - Manager the number of mappers to generate appropriately size files.

Oh, and if you NEED to keep those small files as "sources"... Archive them using hadoop archive resources 'har' and save your Namenode from the cost of managing those resource objects.

View solution in original post

dstreev · ‎11-17-2015