Created 12-13-2015 05:51 PM
I am building an ingest pipeline (a Java program) that pulls from lots of relatively small CSV stats files. In each run cycle, I might find, say, thousands of new files, of which perhaps 100 are the same type of stats and should all land in the same Hive table. Because I am also partitioning them into yyyy/mm/dd Hive partitions, it is likely that all 100 belong in the same day partition. So my logic opens a file stream to the appropriate path, like "/basepath/hivetabledir/year=2015/month=12/day=10/filename_nnnnnnn.avro", and starts writing. This is nice because I end up with one HDFS file containing all the data from 100 smaller source files, which is part of my small-file reduction strategy.

However, I also have some very large files that contain straggler records from back in time - way back in time (possibly straggler data records for every day of the year or more, meaning potentially hundreds of partition directories and files). And, due to policy, I want to retain them, even though one might ask why. If I had to delay ingest (say, for maintenance) and then start up again to catch up, I could end up in a situation where I process lots of different types of stats files x many different days of timestamps = hundreds or even a thousand or more simultaneously open HDFS file streams.

So, my questions are:
(1) What resource limits should I be considering/monitoring in the design so that I can cap the number of simultaneously open HDFS file streams at some practical value and avoid trouble? The trade-off is that the more streams I keep open, the more small files I can pack into one HDFS file. I could certainly modify the logic to close files at some threshold and start new ones to keep the open-file count in check (a rough sketch of that approach follows question 2 below).
(2) A related question is about buffering. I know that HDFS shows a zero-size file for as long as each file is open and being written to; then, when I close the stream, I see a small delay and the file size updates to reflect the bytes written. But I'm writing hundreds of MB to GBs of data to some of these files. That data is buffering somewhere, and I have hundreds of files in this state at once. Maybe this is just not a problem for HDFS - I'm not sure. So, what should I be concerned with regarding that "buffered data" resource load as opposed to the "open file stream" resource load?
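For concreteness, here is a rough sketch of the capped-writer idea I mentioned in question 1 (class, method, and path names are just placeholders; this assumes the standard Hadoop FileSystem API):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Keeps at most maxOpen HDFS output streams open, one per Hive partition directory.
// When the cap is hit, everything is closed and subsequent writes start new files,
// which keeps the open-stream count in check at the cost of slightly more output files.
public class PartitionedWriterCache {

    private final FileSystem fs;
    private final int maxOpen;
    private final Map<String, FSDataOutputStream> open = new HashMap<>();

    public PartitionedWriterCache(Configuration conf, int maxOpen) throws IOException {
        this.fs = FileSystem.get(conf);
        this.maxOpen = maxOpen;
    }

    // partitionPath is something like /basepath/hivetabledir/year=2015/month=12/day=10
    public FSDataOutputStream writerFor(String partitionPath, String fileName) throws IOException {
        FSDataOutputStream out = open.get(partitionPath);
        if (out == null) {
            if (open.size() >= maxOpen) {
                closeAll();                       // close at the threshold, then start new files
            }
            out = fs.create(new Path(partitionPath, fileName));
            open.put(partitionPath, out);
        }
        return out;
    }

    public void closeAll() throws IOException {
        for (FSDataOutputStream out : open.values()) {
            out.close();
        }
        open.clear();
    }
}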
Thanks, Mark
Created 12-14-2015 07:52 PM
In this response, I will assume your ingest application is using the standard HDFS Client library.
If your client is running on a server that is not part of the Hadoop cluster, then there is almost no practical limit to the number of open HDFS files. Each Datanode can have several thousand simultaneous open connections, and the open files will be distributed randomly among all Datanodes. Of course you should also consider if there are other loads on your cluster that might also be enthusiastic readers or writers. BTW, if your client is actually running on a Datanode, which is not unusual in small operations or in laboratory setups, you should be aware that the first copy of all blocks of the files being written will be directed to the local Datanode as an optimization. In this case, you might want to limit the number of open HDFS files to 1000 or so, and/or distribute the ingest among several client instances on multiple Datanodes.
You should probably be more concerned about resources on your client, which will be subject to the local OS limit on the number of simultaneously open files being read for ingestion, and the number of simultaneously open connections to HDFS streams being written.
You mention "the more [streams] I keep open, the more small files I can pack into one HDFS file." This implies you are holding HDFS files open while waiting for ingest data to become available. This isn't really necessary, as you can use the append operation to extend previously closed files, in HDP-2.1 or newer. If you will have multiple client instances potentially trying to simultaneously append to the same HDFS file, however, I recommend you use HDP-2.3, as its version of Apache Hadoop is more efficient and has several bug fixes in append compared to previous versions.
Of course, in all cases (create/write and append), only one client at a time has write access to any given file; any number of simultaneous readers are allowed, including during write.
Regarding buffering: for each output stream, the HDFS client buffers data locally in RAM in 64 KB packets before sending it to the Datanodes, unless you make explicit hflush calls. When a packet buffer fills, it is flushed automatically, much as if hflush had been called. After an hflush call returns successfully, the data is guaranteed to be on the Datanodes, safe from client failures, and available for reading by other clients.
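If you want data durable and visible before the file is closed, you can call hflush yourself at convenient batch boundaries. A minimal sketch (the path is just illustrative):

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hflush-demo.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("batch 1 records\n".getBytes(StandardCharsets.UTF_8));
            out.hflush();   // data is now on the Datanodes and readable by other clients,
                            // even though the file is still open and the Namenode length lags
            out.write("batch 2 records\n".getBytes(StandardCharsets.UTF_8));
        }                   // close() finalizes the last block and updates the Namenode
    }
}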
Created 12-15-2015 06:28 PM
Thanks, Matt. As this is a periodic batch job, I only hold HDFS files open while I process a batch of files - maybe a couple of minutes per batch run. There will be far fewer input files open than output files in my workload. As for buffering, I'm not sure I see it the way you state. If I look at the HDFS files I am writing (via the HDP file browser UI), they always show zero bytes until I close them, and that's after I have written well beyond MBs of data to a file. So, readers during write? I don't see that working in this case. Can you please expound?
Created 12-15-2015 10:42 PM
Hi Mark, good catch. We're both right. It's complicated 🙂
The file browser UI is served by the Namenode. For as long as it takes to fill up a block, the Datanodes handle all the communication with the client, as well as replication; it would be too expensive to update the Namenode on every write operation a Datanode receives. Only when a block fills up and a new block allocation is needed, or when the file is closed and the last block is finalized, is the Namenode updated. And blocks are usually 128 MB or more.
Created 12-15-2015 10:43 PM
In the meantime, the Namenode is aware of all complete blocks and the block in progress for each open file, and will deliver that information to clients that ask for it. The clients can successfully read as much data as has been hflush'ed so far, even if it goes beyond the length the Namenode knows about so far.
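A rough illustration of the difference (assuming the standard client API): while another client still has the file open for write, the length reported by the Namenode can trail what has actually been hflush'ed, but a reader can still stream all of the hflush'ed data:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWhileWriting {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hflush-demo.txt");   // a file another client is still writing

        // Length according to the Namenode -- may trail the data already hflush'ed.
        System.out.println("Namenode-reported length: " + fs.getFileStatus(path).getLen());

        // A reader can still stream everything that has been hflush'ed so far.
        long readable = 0;
        byte[] buf = new byte[8192];
        try (FSDataInputStream in = fs.open(path)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                readable += n;
            }
        }
        System.out.println("Bytes actually readable: " + readable);
    }
}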
Created 12-16-2015 02:58 AM
Hey Matt, I take it back. I've evolved my program a bit more today, and while watching an ingest cycle that pulled just over 1 GB of data from 160 local files into three HDFS files, I did see the file browser UI reflect the file size changes. I kept refreshing the webpage and the file size went from 0.1 Kb to 128 MB to 256 MB, etc. My block size is 128 MB and it behaves exactly as you said. It's cool getting to know how things work under the hood! 🙂 I really appreciate your input. Thank you! Things are getting clearer by the day.
Created 11-21-2016 09:49 AM
FYI: YARN by default limits the number of concurrent applications to 10000. This parameter can be changed through the YARN config in Ambari (yarn.scheduler.capacity.<queue-path>.maximum-applications).
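For reference, in capacity-scheduler.xml the per-queue override looks roughly like this (the queue path root.ingest is just an example):

<property>
  <name>yarn.scheduler.capacity.root.ingest.maximum-applications</name>
  <value>10000</value>
  <description>Maximum number of applications that can be pending and running in this queue.</description>
</property>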