
How many machines to access the dataset from


Explorer

When you run a MapReduce job, a YARN application, or a query against HDFS, what determines how many machines the required dataset is read from?

Suppose the replication factor is set to 5, and the required dataset spans at least three hosts (none of which holds a replica of the data stored on the other two). That gives 15 machines from which the dataset can be accessed.

So what determines the number of machines the required dataset is actually read from?

Appreciate the insights.

4 REPLIES

Re: How many machines to access the dataset from

Master Guru
I'm not sure I follow the question. Data locality is an optimization in MR2, not a requirement. Even if you had 10 DataNodes and ran just 1 NodeManager, that one node could still process all the data (although it would be much slower overall, because more of the work runs serially).

The framework has no barrier of the form "determine the minimum number of machines to run on".
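To illustrate locality as a preference rather than a rule, here is a minimal Python sketch. It is a toy model, not YARN's actual scheduler: the function `assign_tasks`, the `blk_*` block ids, and the `dn*` host names are all hypothetical. Each block gets one task, the task goes to a compute node holding a replica when one exists, and otherwise it runs on any available node and reads remotely.

```python
def assign_tasks(block_replicas, compute_nodes):
    """Assign one task per block, preferring a compute node that holds a replica.

    block_replicas: dict mapping block id -> set of hosts storing a replica
    compute_nodes:  list of hosts able to run tasks (hypothetical names)
    Returns a dict mapping block id -> (chosen host, is_local).
    """
    if not compute_nodes:
        raise ValueError("at least one compute node is required")
    load = {node: 0 for node in compute_nodes}
    assignment = {}
    for block in sorted(block_replicas):
        replicas = block_replicas[block]
        # Prefer nodes that already store the block; locality is only a hint.
        local = [n for n in compute_nodes if n in replicas]
        candidates = local if local else compute_nodes
        # Among the candidates, pick the least loaded node.
        chosen = min(candidates, key=lambda n: load[n])
        load[chosen] += 1
        assignment[block] = (chosen, chosen in replicas)
    return assignment


# Hypothetical cluster: 6 blocks, each replicated on 3 of 10 DataNodes.
blocks = {
    f"blk_{i}": {f"dn{(i + k) % 10 + 1}" for k in range(3)}
    for i in range(6)
}

one_node = assign_tasks(blocks, ["dn1"])
all_nodes = assign_tasks(blocks, [f"dn{i}" for i in range(1, 11)])

for name, result in (("1 compute node", one_node), ("10 compute nodes", all_nodes)):
    local_count = sum(is_local for _, is_local in result.values())
    print(f"{name}: {local_count} of {len(result)} tasks are data-local")
```

With a single compute node every task still gets scheduled; most of them simply read their block over the network and queue up on that one node.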

If you could illustrate the thinking behind the question, perhaps one of us could answer it better.

Re: How many machines to access the dataset from

Explorer

OK, suppose I have a 30 GB file.

The first 10 GB is on host1. With the replication factor at 5, the replicas of that 10 GB are on host2, host3, host4 and host5.

The second 10 GB is on host6. Its replicas are on host7, host8, host9 and host10.

The third 10 GB is on host11. Its replicas are on host12, host13, host14 and host15.

Now suppose we run a query which needs 3 GB of data from the three primary machines (host1, host6 and host11). That in turn means it could also get the data from any of the replicas involved.

So how many machines would the query read from to get that 3 GB?

Based on network topology, would it read only from the three nearest machines?

Or, for the data on host1, would it take part of the data from host1, part from host2 and part from host3?

Appreciate the insights.


Re: How many machines to access the dataset from

Master Guru
If the block replica is available locally, we do not try to read it remotely (local is always preferred and used, unless there's an I/O error reaching it).

There's no "primary" block replica. All replicas are treated and maintained equally, and any one may be preferred for reads (i.e., no preference of one host before the other, locality rules aside).
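The read behaviour described above can be sketched in a few lines of Python. This is a simplified model, not HDFS client code: the names `distance`, `order_replicas` and `read_block`, the host and rack labels, and the `fetch` callback are hypothetical. The distance values (0 for the same host, 2 for the same rack, 4 otherwise) follow the usual Hadoop network-topology convention, and breaking ties randomly is an assumption of this sketch that stands in for "no replica is primary".

```python
import random


def distance(reader, replica):
    """Toy network distance: 0 same host, 2 same rack, 4 different rack."""
    if reader["host"] == replica["host"]:
        return 0
    if reader["rack"] == replica["rack"]:
        return 2
    return 4


def order_replicas(reader, replicas, rng=None):
    """Return replicas nearest-first; equally distant replicas in random order."""
    if rng is None:
        rng = random.Random()
    shuffled = list(replicas)
    rng.shuffle(shuffled)  # no replica is "primary", so ties are random
    # sorted() is stable, so the shuffle decides the order within each distance.
    return sorted(shuffled, key=lambda r: distance(reader, r))


def read_block(reader, replicas, fetch, rng=None):
    """Read a whole block from the nearest replica; try the next one on I/O error."""
    for replica in order_replicas(reader, replicas, rng):
        try:
            return replica["host"], fetch(replica)
        except OSError:
            continue  # this replica is unreachable, fall back to the next nearest
    raise OSError("no replica could be read")


# Hypothetical layout: three replicas of one block, reader runs on host2.
replicas = [
    {"host": "host1", "rack": "/rack1"},
    {"host": "host2", "rack": "/rack1"},
    {"host": "host7", "rack": "/rack2"},
]
reader = {"host": "host2", "rack": "/rack1"}

host, data = read_block(reader, replicas, fetch=lambda r: b"block bytes")
print("block served by", host)
```

In this model a block is served in full by one replica at a time; the other replicas only matter when the preferred one fails.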

Re: How many machines to access the dataset from

Explorer

So if there is 10 GB of data on one DataNode and it has two replicas, a query seeking that 10 GB will read it from only one node?

Why not from all three, say 3.33 GB from each? Or at least from two nodes: the block placement policy places the 2nd and 3rd copies on the same rack, right? So if the network topology allows data access from more than one node, why not use that?

Isn't that the idea behind distributed computing?
