Member since: 05-18-2018
Posts: 50
Kudos Received: 3
Solutions: 0
02-10-2019
10:43 PM
@Dukool SHarma Any updates?
12-27-2018
12:16 PM
Sorting is carried out on the map side. When all the map outputs have been copied, the reduce task moves into the sort phase (more properly, the merge phase), which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor were 15 (controlled by the mapreduce.task.io.sort.factor property, just as in the map-side merge), there would be four rounds. Each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. The merge operates on key-value pairs.
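The merge factor can be tuned per job. A minimal driver sketch, assuming a standard MapReduce job setup (the class and job names here are only illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MergeFactorDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Allow up to 30 streams to be merged per round; with 60 map outputs this
        // halves the number of merge rounds compared with a factor of 15.
        conf.setInt("mapreduce.task.io.sort.factor", 30);
        Job job = Job.getInstance(conf, "merge-factor-demo");
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}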
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data into memory, where some sorting of the data (a quicksort) happens for performance reasons.
Each map task has a circular memory buffer to which it writes its output. By default, this buffer is 100 MB; it can be resized with the mapreduce.task.io.sort.mb property.
When the contents of the buffer reach a certain threshold (mapreduce.map.sort.spill.percent, which has a default value of 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.
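Both settings can be tuned per job. A rough sketch (the 200 MB buffer and 0.90 threshold are illustrative values, not recommendations):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Enlarge the in-memory sort buffer from the 100 MB default to 200 MB.
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Start spilling to disk when the buffer is 90% full instead of 80%.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        Job job = Job.getInstance(conf, "spill-tuning-demo");
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}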
12-10-2018
01:58 AM
2 Kudos
Let's start with Hive and then HCatalog.

Hive
⇢ Layer for analyzing, querying, and managing large datasets that reside in Hadoop's various file systems
⇢ Uses HiveQL (HQL) as its query language
⇢ Uses SerDes for serialization and deserialization
⇢ Works best with huge volumes of data

HCatalog
⇢ Table and storage management layer for Hadoop; essentially the EDW layer for Hadoop (it supports several file formats, such as RCFile, CSV, JSON, SequenceFile, and ORC)
⇢ A sub-component of Hive that enables ETL processes
⇢ A tool for accessing metadata that resides in the Hive Metastore
⇢ Acts as an API, exposing the metastore as a REST interface to external tools such as Pig
⇢ Uses WebHCat, a web server for engaging with the Hive Metastore

I think the focus should be on how they complement each other rather than on their differences.

Documentation:
This answer from @Scott Shaw is worth checking
This slideshare presents the use cases and features of Hive and HCatalog
This direct graph from IBM shows how they use both layers in a batch job

I hope this helps! 🙂
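For instance, because HCatalog exposes the metastore through WebHCat's REST interface, an external tool can list Hive databases with a plain HTTP call. A minimal Java sketch, assuming WebHCat is running on its default port 50111 and that a user named hive exists (both are assumptions; adjust for your cluster):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHcatListDatabases {
    public static void main(String[] args) throws Exception {
        // WebHCat DDL endpoint that returns the databases known to the Hive Metastore.
        URL url = new URL("http://localhost:50111/templeton/v1/ddl/database?user.name=hive");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // e.g. {"databases":["default",...]}
            }
        }
    }
}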
10-17-2018
12:04 PM
@Dukool SHarma
Safe mode is a NameNode state in which the node doesn’t accept any changes to the HDFS namespace, meaning HDFS will be in a read-only state. Safe mode is entered automatically at NameNode startup, and the NameNode leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition.
When you start up the NameNode, it doesn’t start replicating data to the DataNodes right away. The NameNode first automatically enters a special read-only state of operation called safe mode. In this mode, the NameNode doesn’t honor any requests to make changes to its namespace. Thus, it refrains from replicating, or even deleting, any data blocks until it leaves the safe mode.
The DataNodes continuously send two things to the NameNode: a heartbeat indicating they're alive and well, and a block report listing all the data blocks stored on that DataNode. Hadoop considers a data block "safely" replicated once the NameNode has received enough block reports indicating that the minimum number of replicas of that block exists. Making the NameNode wait for these block reports prevents it from prematurely re-replicating blocks that already have the correct number of replicas on DataNodes that simply haven't reported yet.
When a preconfigured percentage of blocks are reported as safely replicated, the NameNode leaves the safe mode and starts serving block information to clients. It’ll also start replicating all blocks that the DataNodes have reported as being under replicated.
Use the dfsadmin -safemode command to manage safe mode operations for the NameNode. You can check the current safe mode status with the -safemode get command:

$ hdfs dfsadmin -safemode get
Safe mode is OFF in hadoop01.localhost/10.192.2.21:8020
Safe mode is OFF in hadoop02.localhost/10.192.2.22:8020

You can place the NameNode in safe mode with the -safemode enter command:

$ hdfs dfsadmin -safemode enter
Safe mode is ON in hadoop01.localhost/10.192.2.21:8020
Safe mode is ON in hadoop02.localhost/10.192.2.22:8020

Finally, you can take the NameNode out of safe mode with the -safemode leave command:

$ hdfs dfsadmin -safemode leave
Safe mode is OFF in hadoop01.localhost/10.192.2.21:8020
Safe mode is OFF in hadoop02.localhost/10.192.2.22:8020
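Safe mode can also be checked programmatically through the HDFS Java API. Here is a minimal sketch, assuming the client's Configuration picks up the cluster's fs.defaultFS; the class name is only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // SAFEMODE_GET only queries the state; SAFEMODE_ENTER / SAFEMODE_LEAVE would toggle it.
            boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
            System.out.println("Safe mode is " + (inSafeMode ? "ON" : "OFF"));
        }
    }
}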
08-13-2018
12:59 PM
@rinu shrivastav The split size is calculated by the formula: max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
Say the HDFS block size is 64 MB, mapred.min.split.size is set to 128 MB, and mapred.max.split.size to 256 MB. Then split size = max(128, min(256, 64)) = 128 MB. To read 256 MB of data there will be two mappers. To increase the number of mappers, decrease mapred.min.split.size down toward the HDFS block size.
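The same calculation can be written out directly. A tiny standalone sketch using the numbers from the example above (it mirrors the formula only; it is not Hadoop's actual implementation):

public class SplitSizeDemo {
    // max(minSplitSize, min(maxSplitSize, blockSize)), all values in MB here.
    static long computeSplitSize(long minSplitSize, long maxSplitSize, long blockSize) {
        return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128, 256, 64);
        System.out.println("Split size = " + splitSize + " MB");                   // 128
        System.out.println("Mappers for 256 MB of input = " + (256 / splitSize));  // 2
    }
}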
07-18-2018
11:58 AM
HDFS Block- A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks; in the same way, HDFS stores each file as a set of blocks. The Hadoop framework is responsible for distributing the data blocks across multiple nodes.
Input Split in Hadoop- The data to be processed by an individual mapper is represented by an InputSplit. The split is divided into records, and each record (a key-value pair) is processed by the map function. The number of map tasks is equal to the number of InputSplits. The data for a MapReduce job is initially stored in input files, which typically reside in HDFS. The InputFormat defines how these input files are split and read, and it is responsible for creating the InputSplits.
InputSplit vs Block in Hadoop
Size-
• Block – The default size of an HDFS block is 128 MB, which can be configured as per requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system.
• InputSplit – By default, the split size is approximately equal to the block size. The InputSplit is user defined, and the user can control the split size in the MapReduce program based on the size of the data.
Data representation-
• Block – The physical representation of the data. It contains the minimum amount of data that can be read or written.
• InputSplit – The logical representation of the data present in the block. It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, only a reference to it.
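Because the split size is user defined, it can be overridden per job. A short sketch using the new-API FileInputFormat helpers (the sizes are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfigDemo {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Keep splits between 64 MB and 256 MB regardless of the 128 MB block size.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}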
07-13-2018
08:01 AM
@Dukool SHarma Rack Awareness Article Part-2. https://community.hortonworks.com/articles/43164/rack-awareness-series-2.html