Member since 08-26-2017 · 8 Posts · 0 Kudos Received · 0 Solutions
09-06-2017 11:16 AM · 2 Kudos
Disaster Recovery in Hadoop cluster

Disaster recovery in a Hadoop cluster refers to recovering all or most of your important data stored on the cluster after disasters such as hardware failures, data loss, or application errors. There should be minimal or no downtime in the cluster. Disasters can be handled through various techniques:

1) Prevent loss of filesystem metadata by writing the NameNode metadata to a separate NFS mount. NameNode High Availability, introduced in later versions of Hadoop, is the standard technique for surviving a NameNode failure.

2) HDFS snapshots can be used for point-in-time recovery (see the sketch after this list).

3) Enable the Trash feature to protect against accidental deletion: a deleted file first goes to the trash folder in HDFS, from which it can be restored.

4) The Hadoop distcp tool can copy data between clusters, maintaining a mirror cluster to fail over to in case of hardware failure.
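A minimal sketch of technique (2) using the Hadoop FileSystem Java API; the directory and snapshot name are hypothetical, and it assumes an administrator has already allowed snapshots on the directory (hdfs dfsadmin -allowSnapshot /data):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: take a point-in-time HDFS snapshot for recovery.
// The path and snapshot name below are hypothetical.
public class TakeSnapshot {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The snapshot appears under /data/.snapshot/daily-2017-09-06;
        // files can be copied back out of it to undo accidental changes.
        fs.createSnapshot(new Path("/data"), "daily-2017-09-06");
    }
}
```

For the mirror-cluster approach in (4), data is typically pushed with the distcp command line, e.g. hadoop distcp hdfs://nn1:8020/data hdfs://nn2:8020/data (NameNode addresses hypothetical).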
09-01-2017 06:22 PM
Partitioning of the keys of the intermediate map output is controlled by the Partitioner. A hash function over the key (or a subset of the key) is used to derive the partition. Each mapper's output is partitioned by key value: records with the same key go into the same partition (within each mapper), and each partition is then sent to a single reducer. The Partitioner class determines which partition a given (key, value) pair will go to. The partition phase takes place after the map phase and before the reduce phase.

To recap the flow: a MapReduce job takes an input data set, splits it, and each map task processes one split and emits a list of key-value pairs. The output of the map phase is then sent to the reduce tasks, which run the user-defined reduce function on the map outputs. But before the reduce phase, the map output is partitioned on the basis of the key and sorted.

To learn more about partitioning, see: Partition in MapReduce
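As an illustration, here is a minimal custom Partitioner sketch that mirrors the behavior of Hadoop's default HashPartitioner; the class name and the Text/IntWritable key and value types are assumptions for the example:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (key, value) pair to a reducer based on the key's hash,
// so all records with the same key land in the same partition.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then bound it
        // to [0, numPartitions) so every record maps to a valid reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

A job would register it with job.setPartitionerClass(KeyHashPartitioner.class); the number of partitions equals the number of reduce tasks configured for the job.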
08-30-2017 01:17 PM
As mentioned by @Laiba Khan, data locality refers to moving compute to the data, which is typically faster than moving data to the compute. In Hadoop, data is divided into blocks and distributed across multiple servers (nodes), and each block is replicated (typically 3 copies in total) across those nodes. Thus, subsets of a dataset are spread across the cluster. When a MapReduce or Tez job starts, containers holding the code are distributed across the cluster nodes. These containers operate on the data in parallel and usually grab data blocks stored on the same node, achieving parallel processing with data locality. The result is fast overall execution across the full data set. This is key to operating on large volumes of data: parallel processing is one component, and processing locally stored data is another. Processing data that has to move across the network (no data locality) is slower. A sketch of how block locations are exposed to the scheduler follows below.

Note that in cloud computing it is often advantageous NOT to have data locality. Local disks in the cloud are ephemeral: if the (virtual) server is destroyed, all data sitting on it is destroyed. Putting data on local disks therefore means you lose it when you spin down a cluster. One of the advantages of the cloud is paying for servers only when you use them, so it is common to spin up a cluster, do some processing, and then spin it down (e.g., running a report or training a model in data science). In that scenario you would want your data stored on non-local storage like AWS S3 object storage, which is very inexpensive. The data persists separately from your cluster, so only your compute is ephemeral. When you spin up a cluster, it reads from the permanent non-local storage and perhaps writes back to it. You lose data locality but gain the ability to pay for your cluster only when you use it. Compute on non-local data in this scenario is slower than local, but not extremely so, especially when you scale out your cluster (more nodes) to increase the parallel processing.
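To make the locality mechanism concrete, here is a minimal sketch of the Hadoop FileSystem API that schedulers consult to find which nodes hold a file's blocks; the file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which DataNodes hold each block of a file. Task schedulers
// use this information to place map tasks on (or near) the nodes that
// already store the data. The path below is hypothetical.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the hostnames of the DataNodes holding a replica.
            System.out.println(block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
    }
}
```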