Member since: 05-18-2018
50 Posts · 3 Kudos Received · 0 Solutions
01-30-2019
12:12 PM
How can I change / configure the number of Mappers?
Labels: Apache Hadoop, Apache Hive
12-26-2018
11:20 AM
1 Kudo
How to sort intermediate output based on values in MapReduce?
Labels: Apache Hadoop, Apache Hive
12-03-2018
09:05 AM
What is the process of spilling in Hadoop's MapReduce program?
Labels: Apache Hadoop, Apache Hive
10-27-2018
11:14 AM
hdfs-site.xml – This file contains the configuration settings for the HDFS daemons. It also specifies the default block replication and permission checking on HDFS. The three main hdfs-site.xml properties are:
dfs.name.dir gives the location where the NameNode stores its metadata (the FsImage and edit logs), whether on a local disk or in a remote directory. dfs.data.dir gives the location(s) where the DataNodes store their block data. fs.checkpoint.dir is the directory on the local file system where the Secondary NameNode stores temporary copies of the FsImage and edit logs before merging them into a new checkpoint.
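A minimal hdfs-site.xml sketch using these property names (Hadoop 1.x naming; the paths are illustrative placeholders, not defaults):
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name</value> <!-- NameNode metadata: FsImage and edit logs -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/data</value> <!-- DataNode block storage -->
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/hadoop/checkpoint</value> <!-- Secondary NameNode checkpoint directory -->
  </property>
</configuration>
In Hadoop 2.x and later these properties were renamed dfs.namenode.name.dir, dfs.datanode.data.dir, and dfs.namenode.checkpoint.dir.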
10-24-2018
11:53 AM
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (such as Pig and MapReduce) to more easily read and write data. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
10-16-2018
11:26 AM
What is meant by the Safe mode problem, and how does a user come out of Safe mode in HDFS?
08-13-2018
12:16 PM
Directly we cannot change the number of mappers for a MapReduce job, but by changing the block size we can increase or decrease the number of mappers. As we know, Number of input splits = Number of mappers.
Example: if we have a 1 GB input file and the HDFS block size is 128 MB, then the number of input splits is 1024/128 = 8, so 8 mappers are allotted for the job.
If we reduce the block size from 128 MB to 64 MB, then the 1 GB input file will be divided into 1024/64 = 16 input splits, and the number of mappers will also be 16.
The block size can be changed in hdfs-site.xml by changing the value of dfs.block.size (134217728 bytes = 128 MB):
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
07-18-2018
11:58 AM
HDFS Block – A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks; in the same way, HDFS stores each file as blocks. The Hadoop framework is responsible for distributing the data blocks across multiple nodes.
Input Split in Hadoop – The data to be processed by an individual Mapper is represented by an InputSplit. The split is divided into records, and each record (a key-value pair) is processed by the map. The number of map tasks is equal to the number of InputSplits. Initially, the data for a MapReduce task is stored in input files, which typically reside in HDFS. InputFormat defines how these input files are split and read, and it is responsible for creating the InputSplits.
InputSplit vs Block in Hadoop:
• Block – The default size of an HDFS block is 128 MB, which we can configure as per our requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system.
• InputSplit – By default, the split size is approximately equal to the block size. The InputSplit is user defined, and the user can control the split size based on the size of the data in the MapReduce program (see the sketch after this list).
Data representation in Hadoop, Blocks vs InputSplit:
• Block – It is the physical representation of data. It contains the minimum amount of data that can be read or written.
• InputSplit – It is the logical representation of the data present in the block. It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, but a reference to the data.
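One hedged sketch of controlling the split size per job: the Hadoop 2.x configuration properties below bound the computed split size (the byte values are illustrative, not defaults):
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>67108864</value> <!-- lower bound for a split: 64 MB -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value> <!-- upper bound for a split: 128 MB -->
</property>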
06-01-2018
12:07 PM
Each file stored in HDFS is split into a number of blocks, the default block size being 128 MB. Each of these blocks is replicated on different DataNodes, the default replication factor being 3. A DataNode continuously sends heartbeats to the NameNode; when the NameNode stops receiving heartbeats, it concludes that the DataNode is down. Using the metadata in its memory, the NameNode identifies which blocks were stored on that DataNode and which other DataNodes hold replicas of those blocks. It then copies those blocks onto other DataNodes to re-establish the replication factor. This is how the NameNode handles DataNode failure.
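For reference, the hdfs-site.xml properties behind this behavior, shown with their usual defaults (worth verifying against your Hadoop version):
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default replication factor -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- DataNode heartbeat interval, in seconds -->
</property>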