Member since: 05-18-2018
Posts: 43
Kudos Received: 3
Solutions: 0
03-30-2019
11:50 AM
Differentiate between a Map-side join and a Reduce-side join in Hadoop.
Labels:
- Apache Hadoop
- Apache Hive
02-27-2019
12:23 PM
Users can be created using the steps below (a shell sketch follows):
a) Find out from the user which machine he is working from.
b) Create the user in the OS first.
c) Create the user in Hadoop by creating his home folder /user/username in HDFS.
d) Make sure the temp directory in HDFS has 777 permissions.
e) Using the chown command, change the ownership of his home directory from hadoop to the user, so that he can write only into his own directory and not into other users' directories.
f) Refresh the user-to-group mappings on the NameNode: hdfs dfsadmin -refreshUserToGroupsMappings
g) If needed, set a space quota to limit the amount of data stored by him: hdfs dfsadmin -setSpaceQuota 50g /user/username
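A minimal shell sketch of these steps, assuming the new user is called "alice" and the HDFS superuser is "hdfs" (both names are placeholders):

# b) create the OS account on the machine the user works from
sudo useradd alice
# c) create the user's home directory in HDFS
sudo -u hdfs hdfs dfs -mkdir -p /user/alice
# d) make sure the shared temp directory is world-writable
sudo -u hdfs hdfs dfs -chmod 777 /tmp
# e) hand ownership of the home directory to the new user
sudo -u hdfs hdfs dfs -chown alice:alice /user/alice
# f) refresh the user-to-group mappings on the NameNode
sudo -u hdfs hdfs dfsadmin -refreshUserToGroupsMappings
# g) optionally cap how much data the user can store
sudo -u hdfs hdfs dfsadmin -setSpaceQuota 50g /user/alice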
01-31-2019
11:19 AM
The number of mappers always equals the number of input splits. We can control the number of splits by changing mapred.min.split.size, which controls the minimum input split size. Assume the block size is 64 MB and mapred.min.split.size is set to 128 MB. The size of each InputSplit will then be 128 MB even though the block size is 64 MB.
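A sketch of the corresponding setting, assuming it is placed in mapred-site.xml (128 MB expressed in bytes; in newer Hadoop releases the equivalent property is mapreduce.input.fileinputformat.split.minsize):

<property>
  <name>mapred.min.split.size</name>
  <value>134217728</value> <!-- 128 MB: each InputSplit will be at least this large -->
</property>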
01-04-2019
11:51 AM
In which location does the NameNode store its metadata, and why?
Labels:
- Apache Hadoop
- Apache Hive
12-27-2018
12:16 PM
Sorting is carried out on the map side. When all the map outputs have been copied, the reduce task moves into the sort phase, i.e. the merge phase, which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor was 15 (controlled by the mapreduce.task.io.sort.factor property, just as in the map's merge), there would be four rounds. Each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. This is done using key-value pairs.
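A sketch of how the merge factor could be set, assuming it goes in mapred-site.xml (the value simply matches the example above and is not a recommendation):

<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>15</value> <!-- streams merged per round: 60 map outputs / 15 = 4 rounds -->
</property>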
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data into memory, where some sorting of the data (Quick Sort) happens for performance reasons.
Each map task has a circular memory buffer which it writes the output to. By default, this circular buffer is 100 MB. Its size can be modified with the parameter mapreduce.task.io.sort.mb.
When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has the default value 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
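A sketch of the two settings, assuming they are placed in mapred-site.xml (the values simply repeat the defaults described above):

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value> <!-- size of the in-memory sort buffer, in MB -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value> <!-- buffer fill fraction that triggers a background spill to disk -->
</property>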
11-22-2018
07:25 AM
1 Kudo
The replication factor is a property that can be set in the HDFS configuration file (hdfs-site.xml). This sets the global replication factor for the entire cluster; it applies only to newly created files, not to existing files. The default replication factor is 3, although for a cluster in pseudo-distributed mode it is 1. To change it, set dfs.replication to the desired value in hdfs-site.xml, which is usually found in the conf folder of the Hadoop installation directory:
<property>
  <name>dfs.replication</name>
  <value>(desired value)</value>
</property>
To change the replication factor on a per-file basis for files that already exist in HDFS, use the -setrep command: hadoop fs -setrep -w 3 /file/filename.xml
The -R flag recursively changes the replication factor of all files under a directory, e.g.: hadoop fs -setrep -R -w 3 /directory
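To verify the change, assuming the same paths as above: the replication factor appears as the second column of hadoop fs -ls output, or can be printed directly with the %r format of hadoop fs -stat in recent Hadoop releases.

hadoop fs -ls /file/filename.xml
hadoop fs -stat %r /file/filename.xml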
10-31-2018
11:25 AM
The NameNode can be restarted by either of the following methods (a shell sketch follows):
You can stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command and then start it again using /sbin/hadoop-daemon.sh start namenode. Alternatively, use /sbin/stop-all.sh followed by /sbin/start-all.sh, which first stops all the daemons and then starts them all again. These script files are stored in the sbin directory inside the Hadoop installation directory.
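A minimal sketch of both approaches, assuming $HADOOP_HOME points at the Hadoop installation directory:

# Option 1: restart only the NameNode daemon
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

# Option 2: stop and then start every daemon in the cluster
$HADOOP_HOME/sbin/stop-all.sh
$HADOOP_HOME/sbin/start-all.sh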
10-27-2018
11:12 AM
What are the main hdfs-site.xml properties?
Labels:
- Apache Hadoop
- Apache Hive