Member since: 06-20-2018
Posts: 18
Kudos Received: 1
Solutions: 0
03-27-2019
12:27 PM
Difference between input split and block in Hadoop MapReduce? InputSplit vs Block Size in Hadoop
Labels:
- Apache Hadoop
- Apache Hive
02-21-2019
12:12 PM
How to create a user in Hadoop?
Labels:
- Apache Hadoop
- Apache Hive
01-11-2019
12:07 PM
The request for n raises a NameError because n is local to func and cannot be accessed outside it. It is also true that Python evaluates default parameter values only once, when the function is defined, so every invocation shares the same default object: if one invocation modifies it, the next invocation sees the modified value. This means you should use only immutable values such as numbers, strings, and tuples as default parameters, never mutable objects such as lists or dictionaries.
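A minimal sketch of the shared-default pitfall (the function names and values here are illustrative, not from the original question):

def append_item(value, items=[]):  # the [] is created once, at definition time
    items.append(value)
    return items

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the same default list object was reused

# The common idiom: use None as a sentinel and build a fresh list per call
def append_item_safe(value, items=None):
    if items is None:
        items = []
    items.append(value)
    return items

print(append_item_safe(1))  # [1]
print(append_item_safe(2))  # [2]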
01-02-2019
09:00 AM
What are the identity mapper and reducer?
Labels:
- Apache Hadoop
- Apache Hive
08-13-2018
12:00 PM
How can we change the number of mappers for a MapReduce job?
08-01-2018
11:35 AM
If the input directory is empty, will no mappers or reducers run?
07-28-2018
12:01 PM
The number of RecordReader instances equals the number of input splits (which is also the number of mappers). The mappers run in parallel, and each map task creates its own RecordReader instance for its split, as the sketch below illustrates.
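A minimal sketch of that contract in the org.apache.hadoop.mapreduce API (the class name is hypothetical): the framework calls createRecordReader once for every InputSplit it schedules, so each map task gets a fresh reader.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: createRecordReader is invoked once per
// split, i.e. once per map task, returning a new reader each time.
public class OneReaderPerSplitInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new LineRecordReader(); // a fresh reader for this split
    }
}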
07-23-2018
11:50 AM
How can we set the number of reducers to zero in MapReduce?
07-19-2018
10:53 AM
1 Kudo
A cluster consists of one or more machines working together to provide high availability, reliability, and scalability for the service offered to clients. If one machine fails, its work and resources are distributed among the other machines in the cluster. A single-node (pseudo-distributed) cluster is one in which all the daemons, NameNode, DataNode, JobTracker, and TaskTracker, run on a single machine, and the default replication factor is 1. A multi-node cluster is used in master-slave fashion, with master and slaves running on different machines: the master node runs the NameNode and JobTracker daemons, while each slave machine runs the DataNode and TaskTracker daemons.
Note: YARN is Hadoop's cluster resource-management layer.
07-11-2018
11:10 AM
What is Hadoop cluster hardware planning and provisioning?
07-07-2018
11:07 AM
The main purpose of speculative execution is to reduce job execution time; however, cluster efficiency suffers because duplicate tasks run. Since speculative execution launches redundant tasks, it can reduce overall throughput. For this reason, some cluster administrators prefer to turn speculative execution off in Hadoop.
07-06-2018
04:43 AM
From the Hadoop perspective, a small file is one considerably smaller than the block size (64 MB or 128 MB). Hadoop is designed for a huge amount of data stored as a small number of large files, so working with many small files causes problems:
1. Every file, directory, and block in HDFS is represented as an object in the NameNode's memory (i.e., metadata), and each object occupies roughly 150 bytes. As the number of files grows, so does the memory needed for metadata, and scaling the NameNode's memory to hold that many objects is not feasible.
2. HDFS is not designed for efficient access to small files. Reading a large number of small files causes many seeks and a lot of hopping from DataNode to DataNode, which is an inefficient data access pattern.
3. A mapper usually takes one block of input at a time. If files are much smaller than the block size, the number of map tasks grows while each task processes very little input. This puts a lot of tasks in the queue and the overhead becomes high, decreasing the overall speed and efficiency of map jobs.
Solutions:
1. Hadoop Archive files (HAR): the hadoop archive command runs a MapReduce job that packs many small HDFS files into a single HAR file, keeping file sizes large and the file count low.
2. Sequence files: data is stored so that the file name is the key and the file contents are the value. A MapReduce program can pack a lot of small files into a single sequence file; MapReduce then divides the sequence file into parts and works on each part independently. A sketch follows below.
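A minimal sketch of the sequence-file approach, assuming the small files are readable from the local file system and passed on the command line (class and path names are illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // One sequence file holds many small files:
        // key = original file name, value = file contents.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (String name : args) { // small files given on the command line
                byte[] data = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(data));
            }
        }
    }
}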
07-03-2018
04:15 AM
Speculative execution is a MapReduce job optimization technique in Hadoop that is enabled by default. You can disable speculative execution for mappers and reducers in mapred-site.xml as shown below:
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
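On Hadoop 2.x the same switches are named mapreduce.map.speculative and mapreduce.reduce.speculative (the older mapred.* names remain as deprecated aliases). A minimal sketch of setting them programmatically:

import org.apache.hadoop.conf.Configuration;

public class DisableSpeculation {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 2.x property names for speculative execution
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        // pass conf to Job.getInstance(conf) when constructing the job
    }
}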
06-22-2018
07:15 AM
Hadoop vs RDBMS: Comparison between Hadoop and a database?
06-20-2018
11:09 AM
Can you explain how to specify more than one path for storage in Hadoop?