Member since: 05-18-2018
Posts: 43
Kudos Received: 3
Solutions: 0
03-30-2019
11:50 AM
What is the difference between a map-side join and a reduce-side join in Hadoop?
... View more
Labels:
- Apache Hadoop
- Apache Hive
03-20-2019
08:20 AM
The Partitioner does not run in its own JVM. It uses the JVM of the map task, because partitioning is part of the mapper function. Mapper and reducer tasks run in separate JVMs.
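As an illustration (not part of the original answer), a minimal sketch of a custom Partitioner with hypothetical key/value types, which would execute inside the map task's JVM:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Runs inside the map task's JVM: getPartition() is called for every
// (key, value) pair emitted by the mapper, before the shuffle.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // map-only job: nothing to partition
        }
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : Character.toUpperCase(k.charAt(0));
        return firstChar % numReduceTasks; // route keys by their first character
    }
}

In the driver this would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).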
... View more
03-09-2019
12:05 PM
What are the requirements for passwordless SSH during the installation of Hadoop?
... View more
Labels:
- Apache Hadoop
- Apache Hive
02-27-2019
12:23 PM
Users can be created using the steps below:
a) Get the information from the user as to which machine they are working from.
b) Create the user in the OS first.
c) Create the user in Hadoop by creating their home folder /user/username in HDFS.
d) Make sure the temp directory in HDFS has 777 permissions.
e) Using the chown command, change ownership of the home directory from hadoop to the user, so that they can write only into their own directory and not into other users' directories.
f) Refresh the user-to-group mappings on the NameNode: hdfs dfsadmin -refreshUserToGroupsMappings
g) If needed, set a space quota to limit the amount of data the user can store: hdfs dfsadmin -setSpaceQuota 50g /user/username
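A minimal sketch of steps (c) and (e) done programmatically with the HDFS FileSystem API, assuming HDFS superuser privileges and the default configuration on the classpath (the user name, group, and permission bits are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CreateHdfsHomeDir {
    public static void main(String[] args) throws Exception {
        String user = "username";                 // placeholder user name
        Path home = new Path("/user/" + user);

        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(home);                          // step (c): create /user/username
        fs.setOwner(home, user, user);            // step (e): hand ownership to the user (requires superuser)
        fs.setPermission(home, new FsPermission((short) 0750)); // keep other users out
        fs.close();
    }
}

The space quota from step (g) is still easiest to set with the hdfs dfsadmin command shown above.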
... View more
02-19-2019
11:52 AM
• OLTP systems are software programs capable of supporting transaction-oriented applications, such as insert/delete/update operations.
• OLTP is associated with RDBMSs, which have low latency.
• The Hadoop framework focuses on high throughput rather than low latency.
• Hadoop works well with all kinds of data, whereas OLTP handles only structured data; hence OLTP is not preferred in a Hadoop architecture.
• OLAP is used for data-discovery processes such as report viewing, complex analytical calculations, and predicting "what if" scenarios.
• It provides a Unified Dimensional Model used for faster analysis of large structured data.
• Hadoop provides a base for faster analysis of large data sets and can replace OLAP in providing multidimensional analysis.
... View more
02-05-2019
11:52 AM
Can someone explain what conf.setMapperClass does in MapReduce?
... View more
Labels:
- Apache Hadoop
- Apache Hive
01-31-2019
11:19 AM
The number of mappers always equals the number of input splits. We can control the number of splits by changing mapred.min.split.size, which controls the minimum input split size. Assume the block size is 64 MB and mapred.min.split.size is set to 128 MB: the size of each InputSplit will be 128 MB even though the block size is 64 MB.
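A hedged sketch of how that property could be set in a driver class; the 128 MB value follows the example above, and the commented property name is the newer equivalent of mapred.min.split.size:

import org.apache.hadoop.conf.Configuration;

public class SplitSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Force a minimum split size of 128 MB, so with a 64 MB block size
        // each InputSplit spans two blocks.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        // Newer MapReduce releases use this property name instead:
        // conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
    }
}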
... View more
01-04-2019
11:51 AM
In which location does the NameNode store its metadata, and why?
... View more
Labels:
- Apache Hadoop
- Apache Hive
12-27-2018
12:16 PM
Sorting is carried out on the map side. When all the map outputs have been copied, the reduce task moves into the sort phase, i.e. the merge phase, which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor was 15 (the default, controlled by the mapreduce.task.io.sort.factor property, just as in the map's merge), there would be four rounds. Each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. This is done using key-value pairs.
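A tiny illustration of the round arithmetic described above (the numbers are just the example values from this answer, not from a real job):

public class MergeRounds {
    public static void main(String[] args) {
        int mapOutputs = 60;   // example number of map outputs
        int mergeFactor = 15;  // mapreduce.task.io.sort.factor (default)

        // Each round merges up to mergeFactor files into one.
        int rounds = (int) Math.ceil((double) mapOutputs / mergeFactor);
        System.out.println("rounds = " + rounds);              // 4
        System.out.println("intermediate files = " + rounds);  // 4
    }
}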
... View more
12-17-2018
12:05 PM
There is no specific rule in Hadoop on how many times a combiner should be called. Sometimes it may not be called at all, while at other times it may be used once, twice, or more, depending on the number and size of the output files generated by the mapper.
... View more
12-07-2018
11:27 AM
One way to improve the performance of data transfer between the Mapper and Reducer is by using a Combiner function. The Combiner works as a mini-reducer that operates on the data generated by the Mapper and is used for optimization. A second option is to compress the intermediate output generated by the Mapper, using the configuration sketched below in the driver class.
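A hedged sketch of that second option, enabling map-output compression in the driver configuration (Snappy is used here only as an example codec; any installed codec class can be substituted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress the intermediate (map) output before it is shuffled to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
    }
}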
... View more
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data to memory, where some sorting of the data (quicksort) happens for performance reasons.
Each map task has a circular memory buffer to which it writes its output. By default, this circular buffer is 100 MB. It can be modified via the parameter mapreduce.task.io.sort.mb.
When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has a default value of 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.
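A hedged sketch of the two tuning knobs mentioned above, set in a driver configuration (the values are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;

public class SortBufferTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // raise the circular buffer to 200 MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f); // spill when the buffer is 85% full
    }
}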
... View more
11-30-2018
09:49 AM
Which performs better, a MapReduce job or a map-only job, and why?
... View more
Labels:
- Apache Hadoop
- Apache Hive
11-22-2018
07:25 AM
1 Kudo
The replication factor is a property that can be set in the HDFS configuration file (hdfs-site.xml). This sets the global replication factor for the entire cluster. It applies only to newly created files, not to existing files. The default replication factor is 3; for a cluster in pseudo-distributed mode it is 1. The value is configurable in hdfs-site.xml by changing dfs.replication to the desired value. This file is usually found in the conf folder of the Hadoop installation directory.
<property>
<name>dfs.replication</name>
<value>(desired value)</value>
</property>
To change the replication factor on a per-file basis: hadoop fs -setrep -w 3 /file/filename.xml
The -setrep command changes the replication factor for files that already exist in HDFS. The -R flag recursively changes the replication factor for all files under a directory, e.g.:
hadoop fs -setrep -w 3 -R /directory/dir.xml
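For completeness, a hedged sketch of changing the replication factor of a single existing file programmatically via the FileSystem API (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hadoop fs -setrep 3 /file/filename.xml
        fs.setReplication(new Path("/file/filename.xml"), (short) 3);
        fs.close();
    }
}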
... View more
11-20-2018
10:28 AM
Should all the slaves in a Hadoop cluster have the same configuration?
... View more
Labels:
- Apache Hadoop
10-31-2018
11:25 AM
We can restart the NameNode using the following methods:
1. Stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, then start it again using /sbin/hadoop-daemon.sh start namenode.
2. Use /sbin/stop-all.sh and then /sbin/start-all.sh, which stop all the daemons first and then start them all again.
The sbin directory inside the Hadoop installation directory stores these script files.
... View more
10-29-2018
12:26 PM
Apache Hadoop achieves security by using Kerberos. At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server.
Authentication: The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).
Authorization: The client uses the TGT to request a service ticket from the Ticket-Granting Server.
Service Request: The client uses the service ticket to authenticate itself to the server that hosts the service.
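As an aside (not part of the original answer), a hedged sketch of how a Java client typically logs in to a Kerberized Hadoop cluster from a keytab; the principal and keytab path are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholder principal and keytab path; a real cluster supplies its own.
        UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM",
                "/etc/security/keytabs/user.keytab");
    }
}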
... View more
10-27-2018
11:12 AM
What are the main hdfs-site.xml properties?
... View more
Labels:
- Apache Hadoop
- Apache Hive
10-15-2018
11:48 AM
If we have a small data set, the uber configuration can be used for MapReduce. Uber mode runs the map and reduce tasks within the ApplicationMaster's own process, avoiding the overhead of launching and communicating with remote nodes.
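A hedged sketch of enabling uber mode in the driver configuration; the threshold properties shown gate whether a small job qualifies, and the values are illustrative:

import org.apache.hadoop.conf.Configuration;

public class UberModeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow small jobs to run "uber"
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // job must have at most 9 maps
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // and at most 1 reduce
    }
}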
... View more
10-10-2018
12:40 PM
The client can interact with Hive in the following three ways:
- Hive Thrift Client: The Hive server is exposed as a Thrift service, so it is possible to interact with Hive from any programming language that supports Thrift.
- JDBC Driver: Hive provides a pure Type 4 JDBC driver to connect to the server, defined in the org.apache.hadoop.hive.jdbc.HiveDriver class. Pure Java applications may use this driver to connect to the server using a host and port. The Beeline CLI uses the JDBC driver to connect to the Hive server.
- ODBC Driver: An ODBC driver allows applications that support ODBC to connect to the Hive server. Apache does not ship the ODBC driver by default, but it is freely available from many vendors.
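A hedged sketch of the JDBC option against HiveServer2; the host, port, database, and user are placeholders, and newer Hive releases use the org.apache.hive.jdbc.HiveDriver class with the jdbc:hive2:// URL scheme:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver class
        // Placeholder connection details; substitute the real host, port, and credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}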
... View more
10-06-2018
12:00 PM
What do you mean by SequenceFileInputFormat in Hadoop MapReduce?
... View more
Labels:
10-04-2018
10:26 AM
When should data be put in the Distributed Cache?
... View more
Labels:
08-17-2018
11:27 AM
How can I submit a MapReduce job from a slave node?
... View more
Labels:
07-28-2018
11:05 AM
For a specific MapReduce job, how many instances of the RecordReader will run?
... View more
Labels:
07-24-2018
11:40 AM
Storage exhaustion is an uneven distribution of data across the data nodes in the cluster. It is caused by the addition and removal of data nodes and by multiple write and delete operations. Hadoop provides a tool called the disk balancer, which re-balances the data by moving blocks from over-utilized data nodes to under-utilized ones until a threshold value is maintained. Before moving blocks, the disk balancer plans how much data is to be transferred between the data nodes. It uses round-robin and available-space policies for choosing the destination disk. The disk balancer is not enabled by default. To enable it:
1) Open hdfs-site.xml, which is located in (Hadoop-2.5.0-cdh5.3.2/etc/Hadoop).
2) Set the property dfs.disk.balancer.enabled to true.
We can use the start-balancer.sh command to invoke the balancer, and we can also run it using hdfs balancer.
It is suggested to run the balancer after adding new nodes to the cluster.
... View more
07-21-2018
12:05 PM
How do Flume and Kafka compare?
... View more
Labels:
07-18-2018
11:36 AM
What is the comparison between an input split and a block in Hadoop MapReduce?
... View more
Labels: