Member since: 05-26-2018
Posts: 34
Kudos Received: 2
Solutions: 0
04-03-2019
12:08 PM
Hadoop supports two kinds of joins for combining two or more data sets on a common column: the map-side join and the reduce-side join. A map-side join is usually used when one data set is large and the other is small, whereas a reduce-side join can join two large data sets. The map-side join is faster because it avoids the shuffle and sort of intermediate data and does not have to wait for all mappers to complete, as the reduce phase does; the reduce-side join is therefore slower.
Map-side join requirements:
- Both inputs must be sorted by the same join key.
- Both inputs must have an equal number of partitions.
- All records with the same key must be in the same partition.
Reduce-side join characteristics:
- Much more flexible to implement.
- Needs a custom WritableComparable with the necessary functions overridden.
- Needs a custom partitioner.
- Needs a custom group comparator.
A sketch of a map-side join is shown below.
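The bullet points above describe the sorted-and-partitioned map-side join; as a simpler illustration, here is a minimal sketch (not from the original answer) of the replicated variant, in which the small data set is shipped to every mapper via the distributed cache and loaded into memory in setup(). The comma-separated key,value layout and the cache usage are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The small data set is distributed with job.addCacheFile(...) and is
        // available in the task's working directory under its file name.
        URI[] cacheFiles = context.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2); // assumed layout: key,value
                smallTable.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",", 2); // assumed layout: key,value
        String matched = smallTable.get(parts[0]);
        if (matched != null) {
            // Emit the joined record; no shuffle or reducer is needed.
            context.write(new Text(parts[0]), new Text(parts[1] + "," + matched));
        }
    }
}

With this variant the join completes entirely in the map phase, which is exactly why it outperforms the reduce-side join described above.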
03-19-2019
11:31 AM
Does the Partitioner run in its own JVM, or does it share one with another process?
Labels:
- Apache Hadoop
- Apache Hive
03-11-2019
09:41 AM
A normal SSH gateway requires a password to be entered every time a service tries to connect to a node, which slows things down considerably. That is why passwordless SSH is usually set up in distributed technologies, where node-to-node communication must be fast. Hadoop is a fully distributed technology: all data is stored across multiple commodity machines, so the nodes must be able to communicate with each other quickly. Hadoop works on a master-slave architecture. When a client needs to store or access data in HDFS, it submits the request to the master node, and the master distributes the request across multiple slave nodes. If passwordless SSH were not set up, the master would need to log in to the slaves with credentials for every client request. Is that really feasible for fast data processing? Of course not. That is why we need the passwordless SSH setup in Hadoop: the master does not need an interactive login to the slaves, it can go directly to the slave's address and fetch or store the required data.
03-01-2019
11:29 AM
Why does Hadoop MapReduce use key-value pairs to process data?
Labels:
- Apache Hadoop
- Apache Hive
02-12-2019
12:24 PM
Which systems, OLTP or OLAP, can have a Hadoop architecture?
Labels:
- Apache Hadoop
- Apache Hive
02-02-2019
12:20 PM
If the number of DataNodes increases, do we need to upgrade the NameNode?
Labels:
- Apache Hadoop
- Apache Hive
01-22-2019
12:00 PM
What is the small-file problem? If we store 1 million small files in HDFS, will there be any issue?
Labels:
- Apache Hadoop
- Apache Hive
01-19-2019
10:49 AM
Yes, there will be metadata in Hadoop, as every change we make, such as a file creation or deletion, gets saved in the NameNode.
01-09-2019
12:37 PM
Can anyone explain to me the problem with the following piece of code?
>>> def func(n=[]):  # playing around
...     pass
>>> func([1,2,3])
>>> func()
>>> n
Labels:
01-05-2019
07:23 AM
RAM. The metadata is needed every 3 seconds, after each heartbeat, so it has to be processed very quickly. To keep access to the metadata fast, the NameNode stores it in RAM.
How can we change the replication factor when data is already stored in HDFS? hdfs-site.xml is used to configure HDFS: changing the dfs.replication property in hdfs-site.xml changes the default replication factor for all files subsequently placed in HDFS.
For data that is already stored, use the hadoop fs shell: hadoop fs -setrep -w 3 <path>
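The same change can also be made programmatically. Below is a minimal sketch (not part of the original answer) using the HDFS FileSystem API; the path is a placeholder and the sketch assumes fs.defaultFS points at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/example/data.txt");          // placeholder path
            boolean requested = fs.setReplication(file, (short) 3);  // same effect as hadoop fs -setrep 3
            System.out.println("Replication change requested: " + requested);
        }
    }
}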
12-28-2018
11:59 AM
Ideally, how many map tasks should be configured on a slave node in MapReduce?
Labels:
- Apache Hadoop
- Apache Hive
12-24-2018
11:44 AM
Small-file problem: suppose we have 10 small files in HDFS; then 10 mappers are required to run. If we have thousands of small files, thousands of mappers are required, and this degrades performance. Ideally, one mapper should be able to process many of those small files instead.
To overcome the problem of a large number of small files, Hadoop provides an abstract class, CombineFileInputFormat, which packs many files into a single split, so a single mapper can be used to process multiple small files, as in the sketch below.
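Here is a minimal sketch (not part of the original answer) of a job driver that uses CombineTextInputFormat, the Text-based concrete subclass of CombineFileInputFormat. The identity mapper, the 128 MB split limit, and the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(SmallFilesJob.class);

        // Pack many small files into splits of at most 128 MB each,
        // so one mapper reads many files instead of one file per mapper.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        job.setMapperClass(Mapper.class);   // identity mapper, just for illustration
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/user/example/small-files"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/example/combined-out")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}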
12-12-2018
06:36 AM
Can you explain to me what bucketing is in Hive?
Labels:
- Apache Hadoop
- Apache Hive
12-05-2018
09:24 AM
Can you explain to me how to configure Hadoop to reuse the JVM for mappers?
Labels:
- Apache Hadoop
- Apache Hive
11-30-2018
11:05 AM
There are normally two phases in a MapReduce job: the map phase and the reduce phase. As the name suggests, a map-only job contains just one phase, the map phase. Hence there is no sorting and shuffling of intermediate key-value pairs, no need for a partitioner or combiner, and no aggregation or summation of key-value pairs, so the output of the mapper is written directly to HDFS. Not every job can be run as a map-only job, but jobs such as data parsing can be. As a result, map-only jobs perform better than full MapReduce jobs. A map-only job is configured simply by setting the number of reducers to zero, as in the sketch below.
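A minimal sketch (not part of the original answer) of a map-only job driver; the identity mapper and paths are placeholders, and the only essential line is setNumReduceTasks(0).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-example");
        job.setJarByClass(MapOnlyJob.class);

        job.setMapperClass(Mapper.class);  // identity mapper standing in for a real parsing mapper
        job.setNumReduceTasks(0);          // zero reducers: no shuffle, output goes straight to HDFS
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/user/example/input"));     // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));  // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}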
11-24-2018
11:35 AM
A Backup node acts as a checkpoint node. It keeps an up-to-date copy of the NameNode metadata (FsImage and edit log) in memory, saves it to an FsImage file on its local filesystem, and resets the edits, keeping itself synchronized with the active NameNode. Whenever the NameNode starts up, it uses the FsImage file (the copy backed up on the local filesystem) to learn the latest saved state and then applies the edits to catch up to the current state. One Backup node is managed by one NameNode, and if a Backup node is present there is no need for a Checkpoint node.
11-22-2018
07:18 AM
In Hadoop, how can one increase the replication factor to a desired value?
Labels:
- Apache Hadoop
- Apache Hive
11-15-2018
11:42 AM
The NameNode holds the metadata, i.e., the number of blocks, their replicas and locations, and other details. This metadata is kept in memory on the master for faster retrieval of data. The NameNode also maintains and manages the DataNodes and assigns tasks to them.
10-31-2018
10:40 AM
In Hadoop, how do you restart the NameNode or all of the daemons?
Labels:
- Apache Hadoop
- Apache Hive
10-26-2018
09:14 AM
Data integrity refers to the correctness of the data. It is very important to have assurance that the data stored in HDFS is correct; however, there is always a slight chance that data will get corrupted during I/O operations on the disk. HDFS therefore creates a checksum for all the data written to it and, by default, verifies the data against that checksum during read operations. In addition, each DataNode periodically runs a block scanner, which verifies the correctness of the data blocks stored in HDFS.
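As an illustration only (not part of the original answer), the checksum machinery is also visible through the FileSystem API; the path below is a placeholder and getFileChecksum() may return null on filesystems without checksum support.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/user/example/data.txt"); // placeholder
            fs.setVerifyChecksum(true);                     // verification on read is on by default
            FileChecksum checksum = fs.getFileChecksum(file);
            if (checksum != null) {
                System.out.println(checksum.getAlgorithmName() + " -> " + checksum);
            }
        }
    }
}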
09-13-2018
10:19 AM
What should the replication factor be in a Hadoop cluster?
Labels:
07-21-2018
12:13 PM
Both Flume & Kafka are used for real-time event processing but they are quite different from each other as per below mentioned points: 1. Kafka is a general purpose publish-subscribe model messaging system. It is not specifically designed for Hadoop as hadoop ecosystem just acts as one of its possible consumer. On the other hand flume is a part of Hadoop ecosystem , which is used for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store, such as HDFS or HBase. It is more tightly integrated with Hadoop ecosystem. Ex, the flume HDFS sink integrates with the HDFS security very well. So its common use case is to act as a data pipeline to ingest data into Hadoop. 2. It is very easy to increase the number of consumers in kafka without affecting its performance & without any downtime. Also it does not keep any track of messages in the topic delivered to consumers. Although it is the consumer’s responsibility to do the tracking of data through offset. Hence it is very scalable contrary to flume as adding more consumers in the flume means changing the topology of Flume pipeline design, which requires some downtime also. 3. Kafka is basically working as a pull model. kafka different consumers can pull data from their respective topic at same time as consumer can process their data in real-time as well as batch mode. On the contrary flume supports push model as there may be a chances of getting data loss if consumer does not recover their data expeditly. 4. Kafka supports both synchronous and asynchronous replication based on your durability requirement and it uses commodity hard drive. Flume supports both ephemeral memory-based channel and durable file-based channel. Even when you use a durable file-based channel, any event stored in a channel not yet written to a sink will be unavailable until the agent is recovered. Moreover, the file-based channel does not replicate event data to a different node. It totally depends on the durability of the storage it writes upon. 5. For Kafka we need to write our own producer and consumer but in case of flume, it uses built-in sources and sinks, which can be used out of box. That’s why if flume agent failure occurs then we lose events in the channel. 6. Kafka always needs to integrate with other event processing framework, that’s why it does not provide native support for message processing In contrast, Flume supports different data flow models and interceptors chaining, which makes event filtering and transforming very easy. For example, you can filter out messages that you are not interested in the pipeline first before sending it through the network for obvious performance reason. However, It is not suitable for complex event processing.
07-19-2018
10:33 AM
1 Kudo
What do you mean by cluster, single node cluster, and node?
Labels:
07-05-2018
10:45 AM
How did Spark come into the picture, and why?
Labels:
06-18-2018
11:24 AM
1 Kudo
HDFS block: Hadoop HDFS stores each file as blocks and distributes them across the nodes of the cluster. The default HDFS block size is 128 MB, which we can configure as per our requirements. All blocks of a file are the same size except the last one, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system. A block is the physical representation of data and holds the minimum amount of data that can be read or written.
InputSplit: The data to be processed by a mapper is represented by an InputSplit. Initially, the data for a MapReduce task sits in input files in HDFS. The InputFormat defines how those input files are split and read, and it is responsible for creating the InputSplits. By default, the split size is approximately equal to the block size. The InputSplit is user-controlled: the user can adjust the split size based on the size of the data in the MapReduce program (see the sketch below). It is the logical representation of the data in the blocks and is used during data processing in a MapReduce program or other processing techniques. An InputSplit does not contain the actual data, only a reference to it, i.e., the addresses or locations of the blocks.
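A minimal sketch (not part of the original answer) of controlling the split size independently of the block size through FileInputFormat; the 64 MB and 256 MB limits are placeholders. The framework picks max(minSize, min(maxSize, blockSize)) as the split size.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-example");
        // Lower and upper bounds on the logical split size, independent of the 128 MB block.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        System.out.println("min=" + FileInputFormat.getMinSplitSize(job)
                + " max=" + FileInputFormat.getMaxSplitSize(job));
    }
}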
06-14-2018
11:11 AM
How do you change the replication factor of data that is already stored in HDFS?
Labels:
- Apache Hadoop
06-01-2018
11:19 AM
How is DataNode failure handled in Hadoop?
Tags:
- hadoop
- Hadoop Core
Labels:
- Apache Hadoop
05-30-2018
06:59 AM
How can we write or store a data file in Hadoop HDFS?
Labels:
- Apache Hadoop