Member since: 05-18-2018
Posts: 43
Kudos Received: 3
Solutions: 0
03-30-2019
11:50 AM
What is the difference between a map-side join and a reduce-side join in Hadoop?
... View more
Labels:
- Apache Hadoop
- Apache Hive
03-20-2019
08:20 AM
The Partitioner does not run in its own JVM. It uses the JVM of the map task, because partitioning is part of the mapper function. Mapper and reducer tasks run in separate JVMs.
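As an illustration (not part of the original answer), a minimal sketch of a custom Partitioner with hypothetical key/value types, which would execute inside the map task's JVM:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Runs inside the map task's JVM: getPartition() is called for every
// (key, value) pair emitted by the mapper, before the shuffle.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // map-only job: nothing to partition
        }
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : Character.toUpperCase(k.charAt(0));
        return firstChar % numReduceTasks; // route keys by their first character
    }
}

In the driver this would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).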
... View more
03-09-2019
12:05 PM
What are the requirements for passwordless SSH during the installation of Hadoop?
... View more
Labels:
- Apache Hadoop
- Apache Hive
02-27-2019
12:23 PM
Users can be created using the steps below:
a) Get the information from the user as to which machine they are working from.
b) Create the user in the OS first.
c) Create the user in Hadoop by creating their home folder /user/username in HDFS.
d) Make sure the temp directory in HDFS has 777 permissions.
e) Using the chown command, change ownership of the home directory from hadoop to the user, so that they can write only into their own directory and not into other users' directories.
f) Refresh the user-to-group mappings on the NameNode: hdfs dfsadmin -refreshUserToGroupsMappings
g) If needed, set a space quota to limit the amount of data the user can store: hdfs dfsadmin -setSpaceQuota 50g /user/username
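A minimal sketch of steps (c) and (e) done programmatically with the HDFS FileSystem API, assuming HDFS superuser privileges and the default configuration on the classpath (the user name, group, and permission bits are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CreateHdfsHomeDir {
    public static void main(String[] args) throws Exception {
        String user = "username";                 // placeholder user name
        Path home = new Path("/user/" + user);

        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(home);                          // step (c): create /user/username
        fs.setOwner(home, user, user);            // step (e): hand ownership to the user (requires superuser)
        fs.setPermission(home, new FsPermission((short) 0750)); // keep other users out
        fs.close();
    }
}

The space quota from step (g) is still easiest to set with the hdfs dfsadmin command shown above.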
... View more
02-19-2019
11:52 AM
• OLTP systems are software programs capable of supporting transaction-oriented applications, such as insert/delete/update operations.
• OLTP is associated with RDBMSs, which have low latency.
• The Hadoop framework focuses on high throughput rather than low latency.
• Hadoop works well with all kinds of data, whereas OLTP handles only structured data; hence OLTP is not preferred in a Hadoop architecture.
• OLAP is used for data-discovery processes such as report viewing, complex analytical calculations, and predicting "what if" scenarios.
• It provides a Unified Dimensional Model used for faster analysis of large structured data.
• Hadoop provides a base for faster analysis of large data sets and can replace OLAP in providing multidimensional analysis.
... View more
02-05-2019
11:52 AM
Can someone explain what conf.setMapperClass does in MapReduce?
... View more
Labels:
- Apache Hadoop
- Apache Hive
01-31-2019
11:19 AM
The number of mappers always equals the number of input splits. We can control the number of splits by changing mapred.min.split.size, which controls the minimum input split size. Assume the block size is 64 MB and mapred.min.split.size is set to 128 MB: the size of each InputSplit will be 128 MB even though the block size is 64 MB.
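A hedged sketch of how that property could be set in a driver class; the 128 MB value follows the example above, and the commented property name is the newer equivalent of mapred.min.split.size:

import org.apache.hadoop.conf.Configuration;

public class SplitSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Force a minimum split size of 128 MB, so with a 64 MB block size
        // each InputSplit spans two blocks.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        // Newer MapReduce releases use this property name instead:
        // conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
    }
}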
... View more
01-04-2019
11:51 AM
In which location does the NameNode store its metadata, and why?
... View more
Labels:
- Apache Hadoop
- Apache Hive
12-27-2018
12:16 PM
Sorting is carried out on the map side. When all the map outputs have been copied, the reduce task moves into the sort phase, i.e. the merge phase, which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor was 15 (the default, controlled by the mapreduce.task.io.sort.factor property, just as in the map's merge), there would be four rounds. Each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. This is done using key-value pairs.
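A tiny illustration of the round arithmetic described above (the numbers are just the example values from this answer, not from a real job):

public class MergeRounds {
    public static void main(String[] args) {
        int mapOutputs = 60;   // example number of map outputs
        int mergeFactor = 15;  // mapreduce.task.io.sort.factor (default)

        // Each round merges up to mergeFactor files into one.
        int rounds = (int) Math.ceil((double) mapOutputs / mergeFactor);
        System.out.println("rounds = " + rounds);              // 4
        System.out.println("intermediate files = " + rounds);  // 4
    }
}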
... View more
12-17-2018
12:05 PM
There is no specific rule in Hadoop on how many times a combiner should be called. Sometimes it may not be called at all, while at other times it may be used once, twice, or more, depending on the number and size of the output files generated by the mapper.
... View more
12-07-2018
11:27 AM
One way to improve the performance of data transfer between the Mapper and Reducer is by using a Combiner function. The Combiner works as a mini-reducer that operates on the data generated by the Mapper and is used for optimization. A second option is to compress the intermediate output generated by the Mapper, using the configuration sketched below in the driver class.
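A hedged sketch of that second option, enabling map-output compression in the driver configuration (Snappy is used here only as an example codec; any installed codec class can be substituted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress the intermediate (map) output before it is shuffled to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
    }
}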
... View more
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data to memory, where some sorting of the data (quicksort) happens for performance reasons.
Each map task has a circular memory buffer to which it writes its output. By default, this circular buffer is 100 MB. It can be modified via the parameter mapreduce.task.io.sort.mb.
When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has a default value of 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.
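A hedged sketch of the two tuning knobs mentioned above, set in a driver configuration (the values are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;

public class SortBufferTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // raise the circular buffer to 200 MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f); // spill when the buffer is 85% full
    }
}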
... View more
11-30-2018
09:49 AM
Which performs better, a MapReduce job or a map-only job, and why?
... View more
Labels:
- Apache Hadoop
- Apache Hive
11-22-2018
07:25 AM
1 Kudo
The replication factor is a property that can be set in the HDFS configuration file (hdfs-site.xml). This sets the global replication factor for the entire cluster. It applies only to newly created files, not to existing files. The default replication factor is 3; for a cluster in pseudo-distributed mode it is 1. The value is configurable in hdfs-site.xml by changing dfs.replication to the desired value. This file is usually found in the conf folder of the Hadoop installation directory.
<property>
<name>dfs.replication</name>
<value>(desired value)</value>
</property>
To change the replication factor on a per-file basis: hadoop fs -setrep -w 3 /file/filename.xml
The -setrep command changes the replication factor for files that already exist in HDFS. The -R flag recursively changes the replication factor for all files under a directory, e.g.:
hadoop fs -setrep -w 3 -R /directory/dir.xml
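For completeness, a hedged sketch of changing the replication factor of a single existing file programmatically via the FileSystem API (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hadoop fs -setrep 3 /file/filename.xml
        fs.setReplication(new Path("/file/filename.xml"), (short) 3);
        fs.close();
    }
}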
... View more
11-20-2018
10:28 AM
Should all the slaves in a Hadoop cluster have the same configuration?
... View more
Labels:
- Apache Hadoop
10-31-2018
11:25 AM
We can restart the NameNode using the following methods:
1. Stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, then start it again using /sbin/hadoop-daemon.sh start namenode.
2. Use /sbin/stop-all.sh and then /sbin/start-all.sh, which stop all the daemons first and then start them all again.
The sbin directory inside the Hadoop installation directory stores these script files.
... View more
10-29-2018
12:26 PM
Apache Hadoop achieves security by using Kerberos. At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server.
Authentication: The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).
Authorization: The client uses the TGT to request a service ticket from the Ticket-Granting Server.
Service Request: The client uses the service ticket to authenticate itself to the server that hosts the service.
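As an aside (not part of the original answer), a hedged sketch of how a Java client typically logs in to a Kerberized Hadoop cluster from a keytab; the principal and keytab path are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholder principal and keytab path; a real cluster supplies its own.
        UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM",
                "/etc/security/keytabs/user.keytab");
    }
}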
... View more
10-27-2018
11:12 AM
What are the main hdfs-site.xml properties?
... View more
Labels:
- Apache Hadoop
- Apache Hive
10-15-2018
11:48 AM
If we have a small data set, the uber configuration can be used for MapReduce. Uber mode runs the map and reduce tasks within the ApplicationMaster's own process, avoiding the overhead of launching and communicating with remote nodes.
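A hedged sketch of enabling uber mode in the driver configuration; the threshold properties shown gate whether a small job qualifies, and the values are illustrative:

import org.apache.hadoop.conf.Configuration;

public class UberModeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow small jobs to run "uber"
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // job must have at most 9 maps
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // and at most 1 reduce
    }
}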
... View more
10-10-2018
12:40 PM
The client can interact with Hive in the following three ways:
- Hive Thrift Client: The Hive server is exposed as a Thrift service, so it is possible to interact with Hive from any programming language that supports Thrift.
- JDBC Driver: Hive provides a pure Type 4 JDBC driver to connect to the server, defined in the org.apache.hadoop.hive.jdbc.HiveDriver class. Pure Java applications may use this driver to connect to the server using a host and port. The Beeline CLI uses the JDBC driver to connect to the Hive server.
- ODBC Driver: An ODBC driver allows applications that support ODBC to connect to the Hive server. Apache does not ship the ODBC driver by default, but it is freely available from many vendors.
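A hedged sketch of the JDBC option against HiveServer2; the host, port, database, and user are placeholders, and newer Hive releases use the org.apache.hive.jdbc.HiveDriver class with the jdbc:hive2:// URL scheme:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver class
        // Placeholder connection details; substitute the real host, port, and credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}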
... View more
10-06-2018
12:00 PM
What do you mean by SequenceFileInputFormat in Hadoop MapReduce?
... View more
Labels:
10-04-2018
10:26 AM
When should data be put in the Distributed Cache?
... View more
Labels:
08-17-2018
11:27 AM
How can I submit a MapReduce job from a slave node?
... View more
Labels:
07-28-2018
11:05 AM
For a specific MapReduce job, how many instances of the RecordReader will run?
... View more
Labels:
07-24-2018
11:40 AM
Storage exhaustion is an uneven distribution of data across the data nodes in the cluster. It is caused by the addition and removal of data nodes and by multiple write and delete operations. Hadoop provides a tool called the disk balancer, which re-balances the data by moving blocks from over-utilized data nodes to under-utilized ones until a threshold value is maintained. Before moving blocks, the disk balancer plans how much data is to be transferred between the data nodes. It uses round-robin and available-space policies for choosing the destination disk. The disk balancer is not enabled by default. To enable it:
1) Open hdfs-site.xml, which is located in (Hadoop-2.5.0-cdh5.3.2/etc/Hadoop).
2) Set the property dfs.disk.balancer.enabled to true.
We can use the start-balancer.sh command to invoke the balancer, and we can also run it using hdfs balancer.
It is suggested to run the balancer after adding new nodes to the cluster.
... View more
07-21-2018
12:05 PM
How do Flume and Kafka compare?
... View more
Labels:
07-18-2018
11:36 AM
What is the comparison between an input split and a block in Hadoop MapReduce?
... View more
Labels: