Member since
05-18-2018
50
Posts
3
Kudos Received
0
Solutions
02-10-2019
10:43 PM
@Dukool SHarma Any updates?
... View more
01-19-2019
10:49 AM
Yes, there will be metadata in Hadoop. Every change we make, such as file creation or deletion, gets saved in the NameNode.
... View more
12-27-2018
12:16 PM
Sorting is carried out on the map side. When all the map outputs have been copied, the reduce task moves into the sort phase (more properly, the merge phase), which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor was 15 (controlled by the mapreduce.task.io.sort.factor property, just as in the map's merge; the default is 10), there would be four rounds. Each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. The merge operates on key-value pairs.
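For reference, a minimal mapred-site.xml sketch showing where this merge factor is set (the value 15 simply mirrors the example above; it is not a recommendation):
<!-- mapred-site.xml: maximum number of streams merged at once during the sort/merge -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>15</value> <!-- mirrors the example above; default is 10 -->
</property>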
... View more
12-17-2018
12:05 PM
There is no specific rule in Hadoop for how many times a combiner will be called. Sometimes it may not be called at all, while other times it may be called once, twice or more, depending on the number and size of the output files generated by the mapper.
... View more
12-09-2018
08:53 PM
Hello @Harshali Patel, did you see my answer here? I hope this helps.
... View more
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data to memory, where some sorting of the data (quicksort) happens for performance reasons.
Each map task has a circular memory buffer that it writes its output to. By default this circular buffer is 100 MB; it can be modified via the mapreduce.task.io.sort.mb property.
When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has a default value of 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.
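For reference, a sketch of how these two settings appear in mapred-site.xml (the values shown are just the defaults described above):
<!-- mapred-site.xml: in-memory sort buffer used by each map task, in MB -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
</property>
<!-- fraction of the buffer at which a background spill to disk begins -->
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>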
... View more
11-28-2018
03:07 PM
Input Split: It is the logical division of the records, which is to say it doesn't contain any data itself but only a logical reference to the data. It is used only during data processing by MapReduce. The user can control the size of the InputSplit, and each InputSplit is assigned to an individual mapper for processing. It is defined by the InputFormat class.
HDFS Block: It is the physical representation of data. It is the minimum amount of data that can be read or written. The default size of an HDFS block is 128 MB, which we can configure according to our requirements. All the blocks of a file are the same size except the last block, which may be the same size or smaller. Files are divided into 128 MB blocks and then stored in the file system.
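For illustration, a hedged hdfs-site.xml snippet showing where the block size is configured (134217728 bytes is just the 128 MB default written out explicitly):
<!-- hdfs-site.xml: HDFS block size in bytes (128 MB, the default) -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>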
... View more
11-15-2018
11:42 AM
The NameNode contains the metadata, i.e. the number of blocks, the replicas, their locations, and other details. This metadata is kept in memory on the master for faster retrieval. The NameNode also maintains and manages the DataNodes and assigns tasks to them.
... View more
11-08-2018
08:35 AM
@Dukool SHarma If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
... View more
12-10-2018
01:58 AM
2 Kudos
Let's start with Hive and then HCatalog.
Hive
⇢ layer for analyzing, querying and managing large datasets that reside in Hadoop's various file systems
⇢ uses HiveQL (HQL) as its query language
⇢ uses SerDes for serialization and deserialization
⇢ works best with huge volumes of data
HCatalog
⇢ table and storage management layer for Hadoop; basically, the EDW system for Hadoop (it supports several file formats such as RCFile, CSV, JSON, SequenceFile and ORC)
⇢ is a sub-component of Hive, which enables ETL processes
⇢ tool for accessing metadata that resides in the Hive Metastore
⇢ acts as an API that exposes the metastore as a REST interface to external tools such as Pig
⇢ uses WebHCat, a web server for engaging with the Hive Metastore
I think the focus has to be on how they complement each other rather than on their differences.
Documentation
- This answer from @Scott Shaw is worth checking.
- This slideshare presents the use cases and features of Hive and HCatalog.
- This graph from IBM shows how they use both layers in a batch job.
I hope this helps! 🙂
... View more
10-17-2018
12:04 PM
@Dukool SHarma
Safe mode is a NameNode state in which the node doesn’t accept any changes to the HDFS namespace, meaning HDFS will be in a read-only state. Safe mode is entered automatically at NameNode startup, and the NameNode leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition.
When you start up the NameNode, it doesn’t start replicating data to the DataNodes right away. The NameNode first automatically enters a special read-only state of operation called safe mode. In this mode, the NameNode doesn’t honor any requests to make changes to its namespace. Thus, it refrains from replicating, or even deleting, any data blocks until it leaves the safe mode.
The DataNodes continuously send two things to the NameNode: a heartbeat indicating they're alive and well, and a block report listing all the data blocks stored on that DataNode. Hadoop considers a data block "safely" replicated once the NameNode has received enough block reports from the DataNodes to show that the block has its minimum number of replicas. Making the NameNode wait for these block reports keeps it from replicating data prematurely, i.e. from creating new replicas of blocks whose correct number of replicas already exists on DataNodes that simply haven't reported their block information yet.
When a preconfigured percentage of blocks is reported as safely replicated, the NameNode leaves safe mode and starts serving block information to clients. It will also start replicating any blocks that the DataNodes have reported as under-replicated.
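As a hedged sketch, the "preconfigured percentage" above is controlled by a property in hdfs-site.xml along these lines (0.999 is the usual default):
<!-- hdfs-site.xml: fraction of blocks that must meet their minimum replication
     before the NameNode leaves safe mode automatically -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999</value>
</property>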
Use the dfsadmin -safemode command to manage safe mode operations for the NameNode. You can check the current safe mode status with the -safemode get command:
$ hdfs dfsadmin -safemode get
Safe mode is OFF in hadoop01.localhost/10.192.2.21:8020
Safe mode is OFF in hadoop02.localhost/10.192.2.22:8020
$
You can place the NameNode in safe mode with the -safemode enter command:
$ hdfs dfsadmin -safemode enter
Safe mode is ON in hadoop01.localhost/10.192.2.21:8020
Safe mode is ON in hadoop02.localhost/10.192.2.22:8020
$
Finally, you can take the NameNode out of safe mode with the -safemode leave command:
$ hdfs dfsadmin -safemode leave
Safe mode is OFF in hadoop01.localhost/10.192.2.21:8020
Safe mode is OFF in hadoop02.localhost/10.192.2.22:8020
$
... View more
10-11-2018
11:37 AM
Earlier, up to Hadoop 1.0, the NameNode was a single point of failure. From 2.0 onwards, HDFS federation was introduced, which allows the cluster to scale by adding NameNodes, each of which manages a portion of the filesystem namespace/metadata. They are independent of each other.
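As a rough, hedged sketch of what a federated setup looks like in hdfs-site.xml (the nameservice IDs ns1/ns2 and the hostnames are hypothetical placeholders):
<!-- hdfs-site.xml: two independent NameNodes, each owning part of the namespace -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value> <!-- hypothetical nameservice IDs -->
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode1.example.com:8020</value> <!-- placeholder host -->
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode2.example.com:8020</value> <!-- placeholder host -->
</property>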
... View more
10-10-2018
12:40 PM
The client can interact with Hive in the following three ways:
1. Hive Thrift Client: The Hive server is exposed as a Thrift service, so it is possible to interact with Hive from any programming language that supports Thrift.
2. JDBC Driver: Hive provides a pure Type 4 JDBC driver, defined in the org.apache.hadoop.hive.jdbc.HiveDriver class, to connect to the server. Pure Java applications may use this driver to connect to Hive over a separate host and port.
The Beeline CLI uses the JDBC driver to connect to the Hive server.
3. ODBC Driver: An ODBC driver allows applications that support ODBC to connect to the Hive server. Apache does not ship an ODBC driver by default, but one is freely available from several vendors.
... View more
09-20-2018
10:54 AM
Btw, they spam us with granular & top-notch resources. I think it's worth the spam. ^.^
... View more
09-10-2018
06:46 PM
I think you are asking about adding directories to Datanodes.
dfs.datanode.data.dir in the hdfs-site.xml file is a comma-delimited list of directories where the DataNode will store blocks for HDFS. See also https://community.hortonworks.com/questions/89786/file-uri-required-for-dfsdatanodedatadir.html
Property: dfs.datanode.data.dir
Default: file://${hadoop.tmp.dir}/dfs/data
Description: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
Otherwise, I'm afraid your question doesn't make much sense, other than running the HDFS mkdir command to "add a new directory in HDFS".
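For illustration, a hedged hdfs-site.xml sketch with multiple DataNode data directories (the /data1 and /data2 paths are hypothetical placeholders):
<!-- hdfs-site.xml: comma-delimited list of local directories where the DataNode stores its blocks -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data1/hdfs/data,file:///data2/hdfs/data</value> <!-- placeholder paths -->
</property>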
... View more
08-21-2018
11:52 AM
1 Kudo
@Dukool SHarma When working with the DataFrame API, Spark is aware of the structure of the data. Hence it made sense to implement a query optimizer that builds the most efficient query plan given the underlying data structure and the transformations applied. In Spark this optimization is done by the Catalyst optimizer. Catalyst works on the query plan in several phases: analysis, logical planning, physical planning, and code generation. The result is a DAG of RDDs. If you are interested in reading more about it, go over the following link: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
... View more
08-17-2018
05:25 PM
Yes, as long as the appropriate clients are installed on the slave node. If you also have /etc/config populated with the correct details for connecting to your instance, then no connection parameters need to be specified for the clients (this is populated automatically if the slave node is deployed/configured by Ambari). In that case you submit the job exactly as you would on any other node.
... View more
08-13-2018
12:59 PM
@rinu shrivastav The split size is calculated by the formula: max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
Say the HDFS block size is 64 MB, mapred.min.split.size is set to 128 MB, and mapred.max.split.size is 256 MB. Then the split size = max(128, min(256, 64)) = 128 MB. To read 256 MB of data, there will be two mappers. To increase the number of mappers, you could decrease mapred.min.split.size down towards the HDFS block size.
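A hedged sketch of where these knobs can be set (newer Hadoop releases name them mapreduce.input.fileinputformat.split.minsize / split.maxsize; the byte values below are just the 128 MB and 256 MB figures from the example):
<!-- job or mapred-site.xml configuration: lower and upper bounds on split size, in bytes -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value> <!-- 256 MB -->
</property>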
... View more
08-01-2018
12:12 PM
The number of mappers depends on the number of InputSplits: no data means no input splits and hence no mappers. Without any mappers, the number of reducers is also 0. If we try to run a MapReduce job on a Hadoop cluster without specifying any input file, it will throw the following exception: java.io.IOException: No input paths specified in job.
... View more
07-24-2018
06:26 PM
Contrary to the answer by @Harshali Patel, exhaustion is not defined as an uneven distribution; it is rather a cause of it. A DataNode has a property you can set that defines a threshold of disk space that must be reserved for the OS on that server. Once that limit is exceeded, the DataNode process will stop and log an error telling you to delete some files from it. HDFS will continue to function with the other DataNodes. The balancer can be run to keep storage space healthy and even.
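For reference, a hedged hdfs-site.xml sketch of the reserved-space property being described (the 10 GB value is purely illustrative):
<!-- hdfs-site.xml: bytes of disk space per volume that the DataNode leaves for non-HDFS use -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value> <!-- 10 GB, illustrative value only -->
</property>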
... View more
07-18-2018
11:58 AM
HDFS Block: A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks. In the same way, HDFS stores each file as blocks. The Hadoop framework is responsible for distributing the data blocks across multiple nodes.
Input Split in Hadoop: The data to be processed by an individual mapper is represented by an InputSplit. The split is divided into records, and each record (which is a key-value pair) is processed by the map. The number of map tasks is equal to the number of InputSplits. Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. InputFormat is used to define how these input files are split and read, and it is responsible for creating the InputSplits.
InputSplit vs Block Size in Hadoop:
• Block – The default size of an HDFS block is 128 MB, which we can configure as per our requirement. All blocks of the file are the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system.
• InputSplit – By default, split size is approximately equal to block size. InputSplit is user-defined, and the user can control the split size based on the size of data in the MapReduce program.
Data Representation in Hadoop, Blocks vs InputSplit:
• Block – It is the physical representation of data. It is the minimum amount of data that can be read or written.
• InputSplit – It is the logical representation of the data present in the block. It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, but a reference to the data.
... View more
07-13-2018
08:01 AM
@Dukool SHarma Rack Awareness Article Part-2. https://community.hortonworks.com/articles/43164/rack-awareness-series-2.html
... View more
07-07-2018
11:24 AM
@Dukool SHarma I can think of these two cases as reasons why you wouldn't want speculative execution on: resource limitation, and duplicate output results when saving to a database/sink. Speculative execution does not stop the slow executor currently running the task; it launches the same task on a new executor in the hope that it finishes faster, and whichever finishes first wins. This leads to more resource utilization. Also, if you are saving to a database, for example, this could lead to duplicate information on the DB side, since two executors end up doing the same processing. HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
... View more
07-03-2018
04:15 AM
Speculative execution is a MapReduce job optimization technique in Hadoop that is enabled by default. You can disable speculative execution for mappers and reducers in mapred-site.xml as shown below:
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>
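Note that newer Hadoop (MRv2) releases use different names for the same settings; a hedged sketch of the equivalent configuration under those names:
<!-- mapred-site.xml: MRv2 property names for speculative execution -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>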
... View more
06-20-2018
06:55 PM
Please see previous question - https://community.hortonworks.com/questions/167618/how-to-specify-more-than-one-path-for-the-storage.html
... View more
06-18-2018
11:24 AM
1 Kudo
HDFS Block: In Hadoop, HDFS stores each file as blocks and distributes them across the nodes in the cluster. The default size of an HDFS block is 128 MB, which we can configure as per our requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system. A block is the physical representation of data; it is the minimum amount of data that can be read or written.
InputSplit: The data to be processed by a mapper is represented by an InputSplit. Initially, the data for a MapReduce task is present in input files in HDFS. InputFormat is used to define how these input files are split and read, and it is responsible for creating the InputSplits. By default, split size is approximately equal to block size. InputSplit is user-defined, and the user can control the split size based on the size of data in the MapReduce program. It is the logical representation of the data present in the block and is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, but a reference to the data: only the information about block addresses or locations.
... View more
06-10-2018
11:15 PM
@Harshali Patel Any updates? If you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.
... View more
06-06-2018
11:32 AM
@Dukool SHarma The following HCC thread explains it in more detail: https://community.hortonworks.com/questions/193988/what-is-the-small-file-problem.html and this article covers why small files are not suitable for HDFS: https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html
To find the small files, see: 1. https://community.hortonworks.com/articles/46329/analyze-small-file-in-hdfs.html 2. https://community.hortonworks.com/articles/142134/identify-where-most-of-the-small-file-are-located.html
... View more
06-07-2018
09:14 PM
@Dukool Sharma, A reduce-only job is not possible with MapReduce. The reducer requires intermediate data in the form of key-value pairs from the mapper, so it is not possible to run just reducers without mappers.
... View more