Explorer
Posts: 35
Registered: ‎11-24-2015

How many machines to access the dataset from

When you run a MapReduce job, a YARN application, or a query against HDFS, what determines how many machines the required dataset is read from?

 

Suppose the replication factor is set to 5, and the required dataset spans at least three hosts (each holding data whose replicas are not on the other two hosts). That gives 15 machines from which the dataset can be accessed.

 

So what determines the number of machines from which the required dataset is actually read?

 

Appreciate the insights.

Posts: 1,903
Kudos: 435
Solutions: 307
Registered: ‎07-31-2013

Re: How many machines to access the dataset from

I am not sure I follow the question. Data locality is an optional optimization in MR2, not an enforced requirement. Even if you had 10 DataNodes and ran just 1 NodeManager, it could still process all the data (although it would be much slower overall, because the work would run more serially).

There are no "determine the minimum number of machines to run on" barriers of any kind in the framework.
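
To illustrate locality as a preference rather than a requirement, here is a toy sketch in Python. It is not the actual YARN scheduler; `assign_splits`, the host names, and the least-loaded tie-break are all assumptions made up for this example. Each input split goes to a compute node that holds a replica if one exists, and otherwise to whichever compute node is available.

```python
from collections import Counter

def assign_splits(split_replicas, compute_nodes):
    """Toy model: assign each split to a compute node, preferring a local replica.

    split_replicas: dict of split id -> set of hosts holding a replica (hypothetical)
    compute_nodes:  list of hosts that can run tasks (hypothetical)
    Returns a list of (split_id, chosen_node, is_node_local) tuples.
    """
    load = Counter()  # tasks assigned so far per compute node
    assignments = []
    for split_id, replica_hosts in split_replicas.items():
        # Prefer compute nodes that already store a replica of this split.
        local = [n for n in compute_nodes if n in replica_hosts]
        # Locality is only a preference: fall back to any compute node.
        candidates = local if local else list(compute_nodes)
        # Assumed tie-break: pick the least-loaded candidate.
        chosen = min(candidates, key=lambda n: load[n])
        load[chosen] += 1
        assignments.append((split_id, chosen, chosen in replica_hosts))
    return assignments

splits = {
    0: {"dn1", "dn2", "dn3"},
    1: {"dn4", "dn5", "dn6"},
    2: {"dn7", "dn8", "dn9"},
}
# With a single compute node, every split still gets processed there.
single_node_plan = assign_splits(splits, ["dn1"])
# With more compute nodes, each split can run next to one of its replicas.
spread_plan = assign_splits(splits, ["dn1", "dn4", "dn7"])
```

In the single-node case only the first split is node-local and the other two are read over the network, which matches the "slower, more serial" point above.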

If you could illustrate the thinking behind the question, perhaps one of us could answer it better.
Explorer
Posts: 35
Registered: ‎11-24-2015

Re: How many machines to access the dataset from

OK, let us suppose I have a 30 GB file.

 

The first 10 GB is on host1. With the replication factor at 5, the replicas of that 10 GB are on host2, host3, host4, and host5.

 

The second 10 GB is on host6. Its replicas are on host7, host8, host9, and host10.

 

The third 10 GB is on host11. Its replicas are on host12, host13, host14, and host15.

 

So let us suppose we run a query which needs 3 GB of data from all three primary machines: host1, host6, and host11. That in turn means it can also get the data from any of the replicas involved.

 

So how many machines would the query read from to get the 3 GB?

 

Based on network topology, would it only get it from the three nearest machines?

 

Or for the data from host1, will it take a part of the data from host1, a part from host2 and a part from host3?

 

Appreciate the insights.

Posts: 1,903
Kudos: 435
Solutions: 307
Registered: ‎07-31-2013

Re: How many machines to access the dataset from

If the block replica is available locally, we do not try to read it remotely (local is always preferred and used, unless there's an I/O error reaching it).

There's no "primary" block replica. All replicas are treated and maintained equally, and any one may be preferred for reads (i.e., no preference of one host before the other, locality rules aside).
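
As a rough sketch of "local first, otherwise any replica", the Python below orders the replicas of one block by network distance from the reader. It is not HDFS source code: `distance`, `order_replicas`, the `rack_of` mapping, and the rack names are assumptions for illustration, and the 0 / 2 / 4 weights simply follow the usual two-level (rack/host) convention of same node, same rack, different rack.

```python
import random

def distance(reader, replica, rack_of):
    """Assumed two-level topology distance: 0 same node, 2 same rack, 4 otherwise."""
    if reader == replica:
        return 0
    reader_rack = rack_of.get(reader)
    if reader_rack is not None and reader_rack == rack_of.get(replica):
        return 2
    return 4

def order_replicas(reader, replicas, rack_of, rng=random):
    """Return replica hosts in the order a reader would try them."""
    shuffled = list(replicas)
    # No replica is "primary": shuffle so equally distant replicas have no fixed order.
    rng.shuffle(shuffled)
    # sorted() is stable, so ties keep their shuffled order.
    return sorted(shuffled, key=lambda r: distance(reader, r, rack_of))

# Hypothetical rack layout for the five replicas of one block.
rack_of = {
    "host1": "/r1", "host2": "/r1",
    "host3": "/r2", "host4": "/r2",
    "host5": "/r3", "host20": "/r2",
}
replicas = ["host1", "host2", "host3", "host4", "host5"]

local_order = order_replicas("host2", replicas, rack_of)       # reader holds a replica
rack_order = order_replicas("host20", replicas, rack_of)       # reader shares a rack only
remote_order = order_replicas("edge-client", replicas, rack_of)  # reader outside the cluster
```

The reader uses the first host in the list and moves to the next only on an I/O error; for a reader with no nearby replica, all five hosts are equally acceptable.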
Explorer
Posts: 35
Registered: ‎11-24-2015

Re: How many machines to access the dataset from

So if there is 10 GB of data on one DataNode and it has two replicas, then a query seeking that 10 GB will access the data from only one node?

 

Why not from all three, say 3.33 GB from each? Or at least from two nodes: the block placement policy places the 2nd and 3rd copies on the same rack, right? So if the network topology facilitates data access from more than one node, why not use that?

 

Isn't that the idea behind distributed computing?