Member since: 05-18-2018
Posts: 50
Kudos Received: 3
Solutions: 0
03-28-2019
12:30 PM
Block: A block is the physical representation of data in HDFS and the minimum amount of data that can be read or written. The default HDFS block size is 128 MB, which we can configure as per our requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop filesystem.
InputSplit: An InputSplit is the logical representation of the data present in a block. It is used during data processing in a MapReduce program or other processing technique. An InputSplit does not contain the actual data, only a reference to it. By default, the split size is approximately equal to the block size, but the split size is user-defined and can be controlled in the MapReduce program based on the size of the data.
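As a small illustration, here is a minimal sketch (class name and input path are placeholders, not from the original post) of how a job written with the new MapReduce API could override the split size instead of relying on the block-size default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Input path is a placeholder for illustration only.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Ask the framework for ~64 MB splits instead of the default of
        // roughly one split per 128 MB block.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}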
03-05-2019
12:22 PM
Hadoop MapReduce processes data as key-value pairs, which makes processing efficient. The MapReduce concept is derived from Google's white papers, which use the same model. Key-value pairs are not part of the input data itself; rather, the input data is split into keys and values to be processed by the mapper (for example, with TextInputFormat the key is the byte offset of a line and the value is the line's contents).
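As an illustration, here is a minimal word-count style mapper sketch (the class name is a placeholder) showing how each input record arrives as a (key, value) pair and how the mapper emits new (key, value) pairs:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The framework hands the mapper one (key, value) pair per record; with
// TextInputFormat the key is the line's byte offset and the value is the
// line text. The mapper emits (word, 1) pairs for the reducer to sum.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}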
02-08-2019
12:42 PM
Once a MapReduce program is written, a driver class has to be created and submitted to the cluster. For this, we create an object of the JobConf class (the old mapred API). One of its methods is setMapperClass: conf.setMapperClass(...) registers the mapper class with the job, so the framework knows which class will read the input records and generate the key-value pairs in the map phase.
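For illustration, a minimal driver sketch using the old JobConf API could look like this (MyMapper and MyReducer are hypothetical classes implementing the old org.apache.hadoop.mapred.Mapper and Reducer interfaces; the paths come from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyDriver.class);
        conf.setJobName("example-job");

        // Register the (hypothetical) mapper and reducer with the job.
        conf.setMapperClass(MyMapper.class);
        conf.setReducerClass(MyReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the configured job to the cluster and wait for completion.
        JobClient.runJob(conf);
    }
}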
02-04-2019
10:10 AM
The NameNode stores only the metadata of the blocks held on the DataNodes. The NameNode uses roughly 150 bytes of memory per block.
Generally, it is recommended to allocate 1 GB of memory (RAM) for every 1 million blocks.
Based on the above recommendation, we can estimate the NameNode's memory requirement when installing the Hadoop system by considering the expected size of the cluster. Since the NameNode stores only metadata, the need to upgrade it rarely arises; when it does, the NameNode can be scaled vertically (by adding more RAM).
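A quick back-of-the-envelope sketch of this sizing rule (the block count and the ~150 bytes/block figure are the assumptions quoted above, not measured values):

public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long blocks = 10_000_000L;    // assumed number of blocks in the cluster
        long bytesPerBlock = 150L;    // metadata footprint per block (rule of thumb)

        double metadataGb = blocks * bytesPerBlock / (1024.0 * 1024 * 1024);
        double recommendedGb = blocks / 1_000_000.0;  // ~1 GB of heap per 1 million blocks

        System.out.printf("Raw block metadata: ~%.2f GB, recommended NameNode heap: ~%.0f GB%n",
                metadataGb, recommendedGb);
    }
}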
01-30-2019
12:12 PM
How can I change / configure the number of mappers?
Labels:
- Apache Hadoop
- Apache Hive
01-23-2019
12:20 PM
Small files are those that are significantly smaller than the default HDFS block size (64 MB in older releases, 128 MB today). HDFS cannot handle these small files efficiently: if we store 1 million small files on HDFS, a large amount of NameNode memory is used just to hold their metadata, and processing becomes very slow because each file occupies at least one block and typically one map task.
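A rough sketch of that overhead (the file sizes and the ~150 bytes per metadata object are assumptions based on the common rule of thumb, not figures from the post):

public class SmallFilesOverhead {
    public static void main(String[] args) {
        long bytesPerObject = 150L;           // per file/block metadata object (rule of thumb)

        long smallFiles = 1_000_000L;         // 1 million files of ~1 MB each
        long smallObjects = smallFiles * 2;   // one file object + one block object apiece

        long largeFiles = 1_000_000L / 128;   // the same data packed into 128 MB files
        long largeObjects = largeFiles * 2;   // one file object + one block object apiece

        System.out.printf("1M small files: ~%d MB of NameNode heap%n",
                smallObjects * bytesPerObject / (1024 * 1024));
        System.out.printf("Same data in 128 MB files: ~%d MB of NameNode heap%n",
                largeObjects * bytesPerObject / (1024 * 1024));
    }
}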
01-16-2019
12:10 PM
If I create a folder, will metadata be created for it in Hadoop?
Labels:
- Apache Hadoop
- Apache Hive
01-03-2019
09:42 AM
1 Kudo
The identity mapper and identity reducer are the default mapper and reducer picked up by the MapReduce framework when no mapper or reducer class is defined in the driver class. They do not perform any processing on the data; they simply write to the output the same key-value pairs they receive from the input.
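As a sketch (paths and the job name are placeholders), a driver that never calls setMapperClass or setReducerClass falls back to these identity defaults; with the new API the base Mapper and Reducer classes themselves act as the identity implementations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityPassThrough {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity-passthrough");
        job.setJarByClass(IdentityPassThrough.class);

        // No setMapperClass()/setReducerClass() calls: the framework uses the
        // base Mapper and Reducer classes, which pass every (key, value)
        // pair through unchanged.
        job.setOutputKeyClass(LongWritable.class);  // key type produced by TextInputFormat
        job.setOutputValueClass(Text.class);        // value type produced by TextInputFormat

        // Paths are placeholders for illustration.
        FileInputFormat.addInputPath(job, new Path("/data/in"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}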
12-26-2018
11:20 AM
1 Kudo
How to sort intermediate output based on values in MapReduce?
Labels:
- Apache Hadoop
- Apache Hive
12-14-2018
09:09 AM
How many combiners are used for a MapReduce job?
Labels:
- Apache Hadoop
- Apache Hive
12-07-2018
08:33 AM
How to reduce the data volume during shuffling between the mapper and reducer nodes?
Labels:
- Apache Hadoop
- Apache Hive
12-03-2018
09:05 AM
What is the process of spilling in Hadoop’s map reduce program?
Labels:
- Apache Hadoop
- Apache Hive
11-28-2018
09:11 AM
Difference between a MapReduce InputSplit and HDFS block?
Labels:
- Apache Hadoop
- Apache Hive
11-20-2018
10:34 AM
To obtain maximum performance from a Hadoop cluster, it needs to be configured correctly. However, finding the ideal configuration for a Hadoop cluster is not easy. The best way to decide on the ideal configuration is to run the Hadoop jobs with the default configuration first to establish a baseline. After that, the job history log files can be analyzed to see whether any resource is a bottleneck or whether the jobs take longer than expected. Repeating this process helps fine-tune the Hadoop cluster configuration so that it best fits the business requirements.
11-15-2018
11:25 AM
How does the NameNode determine which DataNode to write to in HDFS?
Labels:
11-01-2018
12:09 PM
Metrics are statistical information exposed by the Hadoop daemons. The Hadoop framework uses them for monitoring, performance tuning, and debugging, and many metrics are available by default, which makes them very useful for troubleshooting. The Hadoop framework uses the hadoop-metrics.properties file (located under /etc/hadoop) for performance reporting; it controls how Hadoop reports its metrics. The metrics API provides an abstraction, so it can be implemented on top of a variety of metrics client libraries. The choice of client library is a configuration option, and different modules within the same application can use different metrics implementation libraries.
10-30-2018
12:07 PM
How can one copy a file into HDFS with a block size different from the configured default block size?
Labels:
- Apache Hadoop
- Apache Hive
10-27-2018
11:14 AM
hdfs-site.xml – This file contains the configuration settings for the HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS. The three main hdfs-site.xml properties are:
- dfs.name.dir gives the location where the NameNode stores its metadata (FsImage and edit logs), and specifies whether DFS should place it on local disk or in a remote directory.
- dfs.data.dir gives the location(s) where the DataNodes store their block data.
- fs.checkpoint.dir is a directory on the file system where the secondary NameNode stores temporary copies of the edit logs and FsImage; these are then merged to produce a checkpoint (backup) FsImage.
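For illustration, these properties could be inspected programmatically as below (the path to hdfs-site.xml is an assumption and varies by distribution; newer releases use dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir as the replacement property names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class HdfsSitePeek {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Location of hdfs-site.xml is an assumption; adjust for your install.
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        System.out.println("dfs.name.dir      = " + conf.get("dfs.name.dir"));
        System.out.println("dfs.data.dir      = " + conf.get("dfs.data.dir"));
        System.out.println("fs.checkpoint.dir = " + conf.get("fs.checkpoint.dir"));
    }
}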
10-24-2018
11:53 AM
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools, such as Pig and MapReduce, to more easily read and write data. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
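As a rough sketch (the database and table names are placeholders, and this assumes the HCatalog MapReduce integration jars are on the classpath), a job can be pointed at a table by name and let HCatalog resolve the location and format:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatalogReadSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcatalog-read");
        job.setJarByClass(HCatalogReadSketch.class);

        // Database and table names are placeholders; HCatalog resolves where
        // the table lives and which SerDe to use, so the job does not need to
        // know the storage format or the HDFS path.
        HCatInputFormat.setInput(job, "default", "my_table");
        job.setInputFormatClass(HCatInputFormat.class);

        // A real job would also set a mapper that consumes HCatRecord values,
        // an output format, and an output location before submitting.
    }
}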
10-16-2018
11:26 AM
What is meant by the Safe mode problem, and how does a user come out of Safe mode in HDFS?
Labels:
10-11-2018
11:37 AM
Earlier, up to Hadoop 1.0, the NameNode was a single point of failure. From Hadoop 2.0 onwards, HDFS federation was introduced, which allows a cluster to scale by adding NameNodes, each of which manages a portion of the filesystem namespace/metadata. The federated NameNodes are independent of each other.
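A minimal sketch of the federation-related settings this describes (the nameservice IDs and host names are placeholders; in practice these values live in hdfs-site.xml rather than being set in code):

import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Two independent NameNodes, each owning a portion of the namespace.
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2.example.com:8020");

        System.out.println("Configured nameservices: " + conf.get("dfs.nameservices"));
    }
}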
09-17-2018
12:39 PM
HDFS is the storage layer of Hadoop; it stores very large files on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files, and it stores data reliably even in the case of hardware failure. In HDFS, files are broken into blocks that are distributed across the cluster according to the replication factor. The default replication factor is 3, so each block is replicated three times: the first replica is stored on one DataNode, the second is stored on another DataNode within the same rack to minimize network traffic, and the third is stored on a DataNode in a different rack, ensuring that the data is not lost even if an entire rack fails. The NameNode keeps all the metadata, such as the number of blocks, their replicas, their DataNode locations, and the replication factor, while the DataNodes store the actual data and perform operations such as block creation, deletion, and replication according to the NameNode's instructions.
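As a small illustration (the file path is a placeholder), the replication factor recorded by the NameNode for a file can be inspected and changed through the FileSystem API; the DataNodes then add or remove replicas as instructed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Path is a placeholder for illustration.
        Path file = new Path("/data/example.txt");

        // Read the replication factor the NameNode has recorded for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Ask the NameNode to change the target replication to 2.
        fs.setReplication(file, (short) 2);
    }
}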
09-08-2018
12:25 PM
I need to add one more path in HDFS. How can I do that?
Labels:
- Apache Hadoop
08-23-2018
10:01 AM
1 Kudo
The main reason for having large HDFS blocks is to reduce the cost of disk seek time. Disk seeks are generally expensive operations, and since Hadoop is designed to scan entire datasets, it is best to minimize seeks by using large blocks. In general, the seek time is about 10 ms and the disk transfer rate is about 100 MB/s. To keep the seek time at about 1% of the transfer time, a block should take roughly one second to transfer, which at 100 MB/s means a block size of around 100 MB. Hence, to reduce the cost of disk seeks, the default HDFS block size is 64 MB/128 MB.
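A quick sketch of that arithmetic (the 10 ms seek time and 100 MB/s transfer rate are the assumed figures from above):

public class SeekOverhead {
    public static void main(String[] args) {
        double seekMs = 10.0;            // assumed average seek time
        double transferMbPerSec = 100.0; // assumed sequential transfer rate

        for (long blockMb : new long[] {1, 64, 128}) {
            double transferMs = blockMb / transferMbPerSec * 1000.0;
            double overheadPct = seekMs / (seekMs + transferMs) * 100.0;
            System.out.printf("%3d MB block: seek is %.1f%% of the read time%n",
                    blockMb, overheadPct);
        }
    }
}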
08-21-2018
11:34 AM
What do you understand by the Catalyst query optimizer in Apache Spark?
Labels:
08-17-2018
11:55 AM
Yes, a MapReduce job can be submitted from a slave node. Jobs can be run from any machine in the cluster as long as that node has the proper JobTracker location configured. So Hadoop has to be configured with the proper JobTracker and NameNode addresses: mapred.job.tracker should be set on the slave node to the master's host and port, and connectivity between the slave and the master should be verified, for example by running telnet master.com 8021.
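A minimal sketch of the client-side setting described above (MRv1; the host name and port are the placeholders from the post's example):

import org.apache.hadoop.mapred.JobConf;

public class SubmitFromSlaveSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Point the client at the master's JobTracker; with this set, the job
        // can be submitted from any node that can reach the master.
        conf.set("mapred.job.tracker", "master.com:8021");

        System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
    }
}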
08-13-2018
12:16 PM
Directly we cannot change the number of mappers for a MapReduce job, but by changing the block size we can increase or decrease the number of mappers, because the number of input splits equals the number of mappers.
For example, if we have a 1 GB input file and the HDFS block size is 128 MB, the number of input splits is 1024/128 = 8, so 8 mappers are allotted to the job.
If we reduce the block size from 128 MB to 64 MB, the 1 GB input file is divided into 1024/64 = 16 input splits, and the number of mappers also becomes 16.
The block size can be changed in hdfs-site.xml by changing the value of dfs.block.size:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
08-01-2018
12:12 PM
Since the number of mappers depends on the number of InputSplits, no data means no input splits and hence no mappers; without any mappers, the number of reducers is also effectively 0. If we try to run a MapReduce job on a Hadoop cluster without specifying any input path, it throws the following exception: java.io.IOException: No input paths specified in job.
07-24-2018
11:13 AM
Does Hadoop handle storage exhaustion on one of the DataNodes in the cluster? How?
Labels: