Member since
05-18-2018
43
Posts
3
Kudos Received
0
Solutions
12-04-2019
10:17 AM
Do I need to put the NameNode in safe mode to execute this command, or can I run it on a live cluster? hadoop fs -setrep -w 3 -R /
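For what it's worth, setrep does not require safe mode; in fact safe mode disallows namespace changes, so the command would fail there. A minimal sketch, using / as the target as in the question:

```shell
# Check whether the NameNode is in safe mode
# (prints "Safe mode is OFF" or "Safe mode is ON")
hdfs dfsadmin -safemode get

# Change replication to 3 for the whole namespace on a live cluster;
# -w waits until re-replication completes, which can take a long time
hadoop fs -setrep -w 3 -R /
```

Note that on recent Hadoop releases -setrep is recursive on directories by default, and -R is accepted only for backwards compatibility.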
04-03-2019
12:08 PM
Hadoop supports two kinds of joins for combining two or more data sets on some column: the map-side join and the reduce-side join. The map-side join is usually used when one data set is large and the other is small, whereas the reduce-side join can join two large data sets. The map-side join is faster because it does not have to wait for the shuffle and for all mappers to complete, as the reducer does; hence the reduce-side join is slower.
Map-side join requirements:
· Both inputs must be sorted by the same key.
· Both inputs must have an equal number of partitions.
· All records for the same key must be in the same partition.
Reduce-side join:
· Much more flexible to implement.
· Requires a custom WritableComparable with the necessary functions overridden.
· Needs a custom partitioner.
· Requires a custom group comparator.
02-27-2019
12:23 PM
The users can be created using the steps below:
a) Get information from the user about which machine he is working from.
b) Create the user in the OS first.
c) Create the user in Hadoop by creating his home folder /user/username in HDFS.
d) Make sure the temp directory in HDFS has 777 permissions.
e) Using the chown command, change ownership of his home directory from hadoop to the user, so that he can write only into his own directory and not into other users' directories.
f) Refresh the user-to-group mappings on the NameNode: hdfs dfsadmin -refreshUserToGroupsMappings
g) If needed, set a space quota to limit the amount of data he can store: hdfs dfsadmin -setSpaceQuota 50g /user/username
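The steps above can be sketched as a shell script. This is a sketch only: the username is a placeholder, and it assumes you have sudo rights on the NameNode host and that hdfs is the HDFS superuser.

```shell
#!/usr/bin/env bash
set -euo pipefail

USER_NAME="alice"   # hypothetical new user

# b) create the user in the OS first
sudo useradd "$USER_NAME"

# c) create the user's home directory in HDFS
sudo -u hdfs hdfs dfs -mkdir -p "/user/$USER_NAME"

# d) make sure the shared temp directory is world-writable
sudo -u hdfs hdfs dfs -chmod 777 /tmp

# e) hand ownership of the home directory to the user
sudo -u hdfs hdfs dfs -chown "$USER_NAME:$USER_NAME" "/user/$USER_NAME"

# f) refresh user-to-group mappings on the NameNode
sudo -u hdfs hdfs dfsadmin -refreshUserToGroupsMappings

# g) optional: cap the user's storage at 50 GB
sudo -u hdfs hdfs dfsadmin -setSpaceQuota 50g "/user/$USER_NAME"
```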
02-10-2019
10:43 PM
@Dukool SHarma Any updates?
01-05-2019
07:23 AM
RAM. The metadata is consulted constantly (DataNodes send a heartbeat every 3 seconds), so it must be accessible very quickly. To keep those lookups fast, the NameNode stores the metadata in RAM.
How can we change the replication factor when data is already stored in HDFS? hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication for all files subsequently placed in HDFS.
For files already in HDFS, use the hadoop fs shell instead: hadoop fs -setrep -w 3
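A short sketch of both approaches (file and directory names are placeholders):

```shell
# New files: override the default replication for one command,
# without editing hdfs-site.xml
hadoop fs -D dfs.replication=2 -put localfile.txt /data/

# Existing files: change replication in place; -w waits until
# the NameNode has finished re-replicating the blocks
hadoop fs -setrep -w 3 /data/localfile.txt

# Verify: prints the file's current replication factor
hadoop fs -stat %r /data/localfile.txt
```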
12-27-2018
12:16 PM
Sorting is carried out on the map side. When all the map outputs have been copied, the reduce task moves into the sort phase (really a merge phase), which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor was 15 (the default, controlled by the mapreduce.task.io.sort.factor property, just as in the map's merge), there would be four rounds. Each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. This is all done on key-value pairs.
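The round count above is just a ceiling division of spill files by merge factor, which can be sanity-checked in the shell:

```shell
# Number of merge rounds for 60 map outputs with merge factor 15
outputs=60
factor=15
# ceiling division: (60 + 15 - 1) / 15 = 4
echo $(( (outputs + factor - 1) / factor ))
```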
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data into memory, where some sorting of the data (quicksort) happens for performance reasons.
Each map task has a circular memory buffer to which it writes its output. By default, this buffer is 100 MB. It can be changed with the mapreduce.task.io.sort.mb property.
When the contents of the buffer reach a certain threshold (mapreduce.map.sort.spill.percent, which defaults to 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.
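Both properties can be tuned per job at submission time. A hedged sketch (the jar, class and paths are hypothetical, and -D only works if the driver uses ToolRunner / GenericOptionsParser):

```shell
# Raise the sort buffer to 256 MB and the spill threshold to 90%
# for a single job, without touching mapred-site.xml
hadoop jar wordcount.jar WordCount \
  -D mapreduce.task.io.sort.mb=256 \
  -D mapreduce.map.sort.spill.percent=0.90 \
  /input /output
```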
10-27-2018
11:14 AM
hdfs-site.xml: This file contains the configuration settings for the HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS. The three main hdfs-site.xml properties are:
dfs.name.dir: the location where the NameNode stores its metadata (FsImage and edit logs), whether on a local disk or in a remote directory.
dfs.data.dir: the location where the DataNodes store their block data.
fs.checkpoint.dir: the directory on the file system where the Secondary NameNode stores temporary copies of the edit logs and FsImage, which are then merged for backup.
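To see what these resolve to on a given cluster, hdfs getconf can read individual configuration keys (shown with the newer property names that replaced the ones above on recent Hadoop releases):

```shell
# Print the configured NameNode metadata, DataNode data,
# and Secondary NameNode checkpoint directories
hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.datanode.data.dir
hdfs getconf -confKey dfs.namenode.checkpoint.dir
```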
12-10-2018
01:58 AM
2 Kudos
Let's start with Hive and then HCatalog.
Hive
⇢ layer for analyzing, querying and managing large datasets that reside in Hadoop's various file systems
⇢ uses HiveQL (HQL) as its query language
⇢ uses SerDes for serialization and deserialization
⇢ works best with huge volumes of data
HCatalog
⇢ table and storage management layer for Hadoop
⇢ basically, the EDW system for Hadoop (it supports several file formats such as RCFile, CSV, JSON, SequenceFile and ORC)
⇢ a sub-component of Hive which enables ETL processes
⇢ tool for accessing metadata that resides in the Hive Metastore
⇢ acts as an API, exposing the metastore as a REST interface to external tools such as Pig
⇢ uses WebHCat, a web server for engaging with the Hive Metastore
I think the focus has to be on how they complement each other rather than on their differences.
This answer from @Scott Shaw is worth checking. This slideshare presents the use cases and features of Hive and HCatalog. This direct graph from IBM shows how they use both layers in a batch job. I hope this helps! 🙂
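As a hedged illustration of the REST access HCatalog provides through WebHCat (the hostname, database and table names are placeholders; WebHCat listens on port 50111 by default):

```shell
# List databases visible in the Hive Metastore via WebHCat's REST API
curl -s "http://webhcat-host:50111/templeton/v1/ddl/database?user.name=hive"

# Describe a table's columns (database/table names are hypothetical)
curl -s "http://webhcat-host:50111/templeton/v1/ddl/database/default/table/mytable?user.name=hive"
```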
10-15-2018
11:48 AM
If we have a small data set, the uber configuration can be used for MapReduce. In uber mode, the ApplicationMaster runs the map and reduce tasks within its own JVM, avoiding the overhead of launching containers on, and communicating with, remote nodes.
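A sketch of enabling it per job (jar, class and paths are hypothetical; the max-maps/max-reduces values shown are the standard knobs a job must fit within to actually be uberized):

```shell
# Run a small job "uberized" inside the ApplicationMaster's JVM
hadoop jar smalljob.jar SmallJob \
  -D mapreduce.job.ubertask.enable=true \
  -D mapreduce.job.ubertask.maxmaps=9 \
  -D mapreduce.job.ubertask.maxreduces=1 \
  /input /output
```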