Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

What is the best architecture with significant file dependancy issue?


What is the best architecture with significant file dependancy issue?

New Contributor

The Hadoop cluster is up and running, with 1 namenode, 1 secondary namenode and 5 datanodes sitting Ubuntu 16.04. Replication level is set to 2 and so afar, appears to be working normally with an even distribution of files across all available nodes. The issue is that the the files are mostly tdms and tdms_index files (a National Instruments format) and to use them efficiently, you are required to open the index file to get the locations within the tdms file to then open and extract the actual data. This means that both files should be present on the same node to operate on them as a pair. In the event that the index file is not present it will be automatically regenerated by the file read utility. It would be best to avoid regenerating the index file and equally like to avoid having the overhead to move an index file to the node the tdms is on or move the tdms to the node that the index is on during the processing operation. I am assuming here that the hdfs would appear to be seamless to the application so that when an attempt is made to open the file, it will succeed and the actual move or copy operation occurs in the background (albeit with a penalty).

To illustrate, assume we have nodes A, B and C. Node A has a copy of both the tdms and the index file, node B only has a copy of the tdms file and node C only has a copy of the index file.

In this situation if the processing job is allocated to node A, it will be fastest as both tdms and index are on the same node. If the processing job is allocated to node B, then there will be a penalty to move the index from either A or C. This is the smaller penalty as the index file is typically very small compared to the tdms file. If the processing job is allocated to node C then the penalty is largest as the tdms file would need to be copied across from A or B. Interestingly, this means that increasing the node count on the cluster may be counter-productive since it will increase the chances of the tdms and index files being on different nodes and therefore increase the probability of incurring a penalty.

Is there a way to directly control file allocation to blocks or block allocation to nodes in an automatic (ie scripted way)? How does processing allocation work in this context as we now have a preferred node for the process to run on? How would this interact with any constraints set up in yarn? In short, I'd like to start a discussion around the best architecture for this scenario when a known file dependency exists.