Member since
05-18-2018
43
Posts
3
Kudos Received
0
Solutions
12-04-2019
10:17 AM
Do I need to put the NameNode in safe mode to execute this command, or can I execute it on a live cluster? hadoop fs -setrep -w 3 -R /
... View more
04-03-2019
12:08 PM
Hadoop supports two kinds of joins for combining two or more data sets on a common column: the map-side join and the reduce-side join. A map-side join is usually used when one data set is large and the other is small, whereas a reduce-side join can join two large data sets. The map-side join is faster because it avoids the shuffle and reduce phases entirely; the reduce-side join, which must wait for all mappers to finish and for the intermediate data to be shuffled, is slower.
Map-side join requirements:
· Both inputs must be sorted by the same join key.
· Both inputs must have an equal number of partitions.
· All records for the same key must be in the same partition.
Reduce-side join characteristics:
· More flexible to implement.
· Needs a custom WritableComparable with the necessary methods overridden.
· Needs a custom partitioner.
· Needs a custom group comparator.
A sketch of a map-side join mapper is shown below.
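The sorted/equally-partitioned requirements above apply to the CompositeInputFormat style of map-side merge join. A simpler and very common variant for the "one small data set" case is sketched below, as a hedged illustration only: the small data set is shipped to every mapper with job.addCacheFile(...) and loaded into memory, and each record of the large data set is joined in map(). The class names, CSV layout and join-key position are assumptions for illustration.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-side (broadcast) join: the small data set comes from the distributed cache,
    // the large data set flows through map() record by record.
    public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                // The cached file is symlinked into the task's working directory under its base name.
                String localName = new Path(cacheFiles[0].getPath()).getName();
                try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split(",", 2);   // assumed layout: key,value
                        if (parts.length == 2) {
                            lookup.put(parts[0], parts[1]);
                        }
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", 2); // large record, join key assumed in first column
            String match = lookup.get(fields[0]);
            if (match != null) {                              // inner join: emit only matching keys
                context.write(new Text(fields[0]), new Text(fields[1] + "," + match));
            }
        }
    }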
... View more
03-11-2019
09:41 AM
A normal SSH login requires a password to be entered every time one node connects to another, which would slow administration down considerably. Passwordless SSH is therefore commonly set up in distributed systems where node-to-node access must be quick and unattended, and Hadoop is a fully distributed system: data is stored across many commodity machines that have to be managed together. Hadoop follows a master-slave architecture, and the master needs to reach every slave node. In practice, passwordless SSH in Hadoop is used mainly by the cluster management scripts (for example start-dfs.sh and start-yarn.sh), which log in to each slave node to start or stop the daemons. Without passwordless SSH, the administrator would have to type a password for every single node each time the cluster is started or stopped. Is that really feasible on a large cluster? Of course not. That is why we set up passwordless SSH in Hadoop. Note that the actual storing and fetching of data between clients, the NameNode and the DataNodes happens over Hadoop's own RPC and data-transfer protocols, not over SSH.
... View more
02-27-2019
12:23 PM
Users can be created using the steps below:
a) Find out from the user which machine they will be working from.
b) Create the user in the OS first.
c) Create the user in Hadoop by creating their home directory /user/username in HDFS.
d) Make sure the temp directory in HDFS has 777 permissions.
e) Using the chown command, change ownership of the home directory from hadoop to the new user, so that the user can write only into their own directory and not into other users' directories.
f) Refresh the user-to-group mappings on the NameNode: hdfs dfsadmin -refreshUserToGroupsMappings
g) If needed, set a space quota to limit the amount of data the user can store: hdfs dfsadmin -setSpaceQuota 50g /user/username
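For completeness, here is a minimal, hedged sketch of doing steps c, e and g programmatically through the HDFS Java API; the user name, group, permission and quota below are illustrative assumptions, and the command-line steps above remain the usual way to do this.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class CreateHdfsUser {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path home = new Path("/user/username");              // hypothetical user name
            fs.mkdirs(home);                                     // step c: create the home directory
            fs.setOwner(home, "username", "username");           // step e: hand ownership to the user
            fs.setPermission(home, new FsPermission("750"));     // keep other users out

            if (fs instanceof DistributedFileSystem) {           // step g: optional 50 GB space quota
                ((DistributedFileSystem) fs).setQuota(home,
                        org.apache.hadoop.hdfs.protocol.HdfsConstants.QUOTA_DONT_SET,
                        50L * 1024 * 1024 * 1024);
            }
            fs.close();
        }
    }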
... View more
02-08-2019
12:42 PM
Once a MapReduce program is written, a driver class has to be created that submits the job to the cluster. For this we create an object of the JobConf class (or, in the newer MapReduce API, a Job object). One of the methods on this object is setMapperClass: conf.setMapperClass(...) tells the framework which mapper class to use, i.e. which class will read the input records and generate the intermediate key-value pairs. A minimal driver sketch is shown below.
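A hedged sketch of such a driver using the old mapred API's JobConf; MyMapper, MyReducer and the word-count-style output types are hypothetical placeholders, not from the original post.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyDriver.class);
            conf.setJobName("example-job");

            conf.setMapperClass(MyMapper.class);      // hypothetical Mapper implementation
            conf.setReducerClass(MyReducer.class);    // hypothetical Reducer implementation

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);                   // submit the job and wait for completion
        }
    }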
... View more
02-10-2019
10:43 PM
@Dukool SHarma Any updates?
... View more
01-05-2019
07:23 AM
RAM. The metadata has to be served constantly and quickly, for example for every client request and every DataNode heartbeat (sent every 3 seconds), so to keep that access fast the NameNode stores the entire metadata in RAM.
How can we change the replication factor when data is already stored in HDFS? hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication factor for files written to HDFS from then on; it does not change files that are already stored.
For existing files, use the hadoop fs shell: hadoop fs -setrep -w 3 (followed by the path). A programmatic sketch is shown below.
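The same can be done programmatically; a small hedged sketch using the HDFS Java API, where the path is an illustrative assumption.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChangeReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Change the replication factor of an existing file to 3
            // (equivalent to: hadoop fs -setrep 3 /data/example.txt).
            boolean ok = fs.setReplication(new Path("/data/example.txt"), (short) 3);
            System.out.println("Replication changed: " + ok);
            fs.close();
        }
    }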
... View more
12-27-2018
12:16 PM
Sorting itself is carried out on the map side; each map output is already sorted by key when the reducer copies it. When all the map outputs have been copied, the reduce task moves into the sort phase, i.e. the merge phase, which merges the map outputs while maintaining their sort ordering. This is done in rounds. For example, if there were 60 map outputs and the merge factor was 15 (controlled by the mapreduce.task.io.sort.factor property, whose default is 10, just as in the map-side merge), there would be four rounds: each round would merge 15 files into 1, so at the end there would be 4 intermediate files to be processed. The merged, sorted key-value pairs are then fed to the reduce function.
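If you want to raise the merge factor for a particular job, here is a hedged sketch of setting it on the job configuration; the value 15 simply matches the example above and is not a recommendation.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MergeFactorExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Allow up to 15 streams to be merged at once on both the map and reduce side.
            conf.setInt("mapreduce.task.io.sort.factor", 15);
            Job job = Job.getInstance(conf, "merge-factor-example");
            // ... set mapper, reducer and input/output paths as usual, then submit.
        }
    }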
... View more
12-17-2018
12:05 PM
There is no fixed rule in Hadoop for how many times a combiner will be called. Sometimes it may not be called at all, and sometimes it may be called once, twice or more, depending on the number and size of the spill files generated by the mapper. The combiner is therefore only an optimization: the correctness of the job must not depend on it running.
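A minimal hedged sketch of wiring a combiner into a job with the new API; MyMapper and MyReducer are hypothetical classes, and the reducer is assumed to perform a commutative, associative operation (e.g. a sum) so it can double as the combiner.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combiner-example");
            job.setJarByClass(CombinerExample.class);
            job.setMapperClass(MyMapper.class);        // hypothetical mapper
            // The framework may invoke the combiner zero, one or several times per map task.
            job.setCombinerClass(MyReducer.class);     // hypothetical reducer reused as combiner
            job.setReducerClass(MyReducer.class);
            // ... set output types and paths as usual.
        }
    }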
... View more
12-03-2018
11:47 AM
1 Kudo
When the mapper starts producing intermediate output, it does not write the data directly to the local disk. Rather, it writes the data to memory, where some sorting of the data (quicksort) happens for performance reasons.
Each map task has a circular memory buffer to which it writes its output. By default this buffer is 100 MB in size; it can be changed with the mapreduce.task.io.sort.mb parameter.
When the contents of the buffer reach a certain threshold (mapreduce.map.sort.spill.percent, which has a default value of 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.
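A hedged sketch of tuning these two knobs for a single job; the values are illustrative, not recommendations.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.task.io.sort.mb", 200);            // 200 MB circular buffer instead of 100 MB
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // start spilling at 90% instead of 80%
            Job job = Job.getInstance(conf, "spill-tuning-example");
            // ... configure mapper, reducer and paths as usual.
        }
    }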
... View more
11-30-2018
11:05 AM
There are normally two phases in a MapReduce job: the map phase and the reduce phase. As the name suggests, a map-only job contains just the map phase. Hence there is no sorting and shuffling of intermediate key-value pairs, no partitioner or combiner is needed, and no aggregation or summation of key-value pairs is required; the output of the mapper is written directly to HDFS. Not every job can be expressed as a map-only job, but jobs such as data parsing can be. Because the shuffle is skipped, a map-only job usually performs better than a full MapReduce job. It is enabled simply by setting the number of reduce tasks to zero, as in the sketch below.
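A minimal hedged sketch of a map-only job driver; MyMapper and the output types are illustrative assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only-example");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(MyMapper.class);   // hypothetical mapper that parses/filters records
            job.setNumReduceTasks(0);             // zero reducers = map-only job, output goes straight to HDFS
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }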
... View more
11-27-2018
01:57 AM
Yes, if you want to build a Hadoop cluster quickly, you can use Docker.
... View more
10-29-2018
12:26 PM
Apache Hadoop achieves security by using Kerberos. At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:
1. Authentication – The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization – The client uses the TGT to request a service ticket from the Ticket-Granting Server.
3. Service Request – The client uses the service ticket to authenticate itself to the server that provides the service it wants to use.
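On the Hadoop client side, here is a hedged sketch of what this looks like in code once a keytab is available; the principal name and keytab path are illustrative assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            // Obtains Kerberos credentials (steps 1 and 2 happen under the hood via the KDC).
            UserGroupInformation.loginUserFromKeytab(
                    "hdfs-user@EXAMPLE.COM",                        // hypothetical principal
                    "/etc/security/keytabs/hdfs-user.keytab");      // hypothetical keytab path
            // Step 3: the service ticket is used transparently when talking to the NameNode.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/")));
            fs.close();
        }
    }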
... View more
10-27-2018
11:14 AM
hdfs-site.xml – This file contains the configuration settings for the HDFS daemons. hdfs-site.xml also specifies the default block replication and whether permission checking is enabled on HDFS. The three main hdfs-site.xml properties are:
dfs.name.dir (dfs.namenode.name.dir in newer releases) gives the location where the NameNode stores its metadata (FsImage and edit logs), and whether that location is on local disk or on a remote directory.
dfs.data.dir (dfs.datanode.data.dir in newer releases) gives the location(s) where the DataNodes store the block data.
fs.checkpoint.dir (dfs.namenode.checkpoint.dir in newer releases) is the directory on the local file system where the Secondary NameNode stores temporary copies of the edit logs and FsImage; it merges them there to produce a checkpoint that serves as a backup.
... View more
12-10-2018
01:58 AM
2 Kudos
Let's start with Hive and then HCatalog.
Hive
⇢ Layer for analyzing, querying and managing large datasets that reside in the various Hadoop file systems
⇢ uses HiveQL (HQL) as its query language
⇢ uses SerDes for serialization and deserialization
⇢ works best with huge volumes of data
HCatalog
⇢ Table and storage management layer for Hadoop; basically, the EDW-style metadata layer for Hadoop (it supports several file formats such as RCFile, CSV, JSON, SequenceFile and ORC)
⇢ is a sub-component of Hive, which enables ETL processes
⇢ tool for accessing the metadata that resides in the Hive Metastore
⇢ acts as an API that exposes the metastore as a REST interface to external tools such as Pig
⇢ uses WebHCat, a web server for engaging with the Hive Metastore
I think the focus should be on how they complement each other rather than on their differences.
Documentation:
- This answer from @Scott Shaw is worth checking
- This slideshare presents the use cases and features of Hive and HCatalog
- This graph from IBM shows how they use both layers in a batch job
I hope this helps! 🙂
... View more
10-15-2018
11:48 AM
If we have a small data set, uber mode can be used for MapReduce. In uber mode, the ApplicationMaster runs the map and reduce tasks within its own JVM, avoiding the overhead of launching containers on remote nodes and communicating with them. A small configuration sketch is shown below.
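A hedged sketch of enabling uber mode for a job; the thresholds shown are the usual knobs and the values are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UberModeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("mapreduce.job.ubertask.enable", true); // run the whole job inside the AM's JVM
            conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // job qualifies only if small enough
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
            Job job = Job.getInstance(conf, "uber-example");
            // ... configure mapper, reducer and paths as usual.
        }
    }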
... View more
10-10-2018
12:40 PM
The client can interact with Hive in the three ways below:
· Hive Thrift Client: The Hive server is exposed as a Thrift service, so it is possible to interact with Hive from any programming language that supports Thrift.
· JDBC Driver: Hive provides a pure Type 4 JDBC driver, defined in the org.apache.hadoop.hive.jdbc.HiveDriver class (org.apache.hive.jdbc.HiveDriver for HiveServer2). Pure Java applications can use this driver to connect to Hive using a host and port.
The Beeline CLI uses the JDBC driver to connect to the Hive server.
· ODBC Driver: An ODBC driver allows applications that support ODBC to connect to the Hive server. Apache does not ship an ODBC driver by default, but one is freely available from several vendors.
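A minimal hedged sketch of the JDBC route against HiveServer2; the host, port, database, user and query are illustrative assumptions.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 driver class; for the old HiveServer it was org.apache.hadoop.hive.jdbc.HiveDriver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }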
... View more
10-06-2018
06:41 PM
It is a Java class for reading Hadoop SequenceFiles: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html
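A hedged sketch of using it in a job driver; the mapper, reducer and paths would be set as in any other job and are omitted here.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class SequenceFileJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-sequencefile");
            job.setJarByClass(SequenceFileJob.class);
            // Keys and values are read with the types they were originally written into the SequenceFile.
            job.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... set mapper, reducer and output as usual.
        }
    }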
... View more
10-04-2018
03:24 PM
There are many use cases to leverage DS. Typically, when you require reference/lookup data to be available across the NiFi cluster, DS is a good fit, e.g. for enriching a dataflow.
... View more
09-20-2018
10:54 AM
Btw, they spam us with granular & top-notch resources. I think it's worth the spam. ^.^
... View more
08-17-2018
05:25 PM
Yes, as long as the appropriate clients are installed on the slave node. If you also have the /etc/config populated with the correct details for connecting to your instance, then no connection parameters need to be specified for the clients (this is populated automatically if the slave node is deployed/configured by Ambari). In that case you submit the job exactly as you would on any other node.
... View more
07-28-2018
12:01 PM
The number of RecordReader instances will be equal to the number of input splits (which is also the number of mappers).
During parallel processing the mappers run in parallel, and each map task creates its own RecordReader instance for its split.
... View more
07-24-2018
06:26 PM
Contrary to the answer by @Harshali Patel, exhaustion is not defined as an uneven distribution; rather, it is a cause of it. A DataNode has a property you can set (dfs.datanode.du.reserved) that defines how much disk space must be kept reserved for the OS on that server. Once that limit is exceeded, the DataNode process will stop and log an error telling you to delete some files from it. HDFS will continue to function with the other DataNodes. The balancer can be run to keep storage space healthy and evenly distributed.
... View more
07-21-2018
12:13 PM
Both Flume and Kafka are used for real-time event processing, but they are quite different from each other on the following points:
1. Kafka is a general-purpose publish-subscribe messaging system. It is not designed specifically for Hadoop; the Hadoop ecosystem is just one of its possible consumers. Flume, on the other hand, is part of the Hadoop ecosystem and is used for efficiently collecting, aggregating and moving large amounts of data from many different sources to a centralized data store such as HDFS or HBase. It is more tightly integrated with the Hadoop ecosystem; for example, the Flume HDFS sink integrates very well with HDFS security. Its common use case is therefore to act as a data pipeline for ingesting data into Hadoop.
2. It is very easy to add more consumers in Kafka without affecting performance and without any downtime. Kafka also does not track which messages in a topic have been delivered to which consumer; it is each consumer's responsibility to track its own position through offsets. This makes Kafka very scalable, in contrast to Flume, where adding more consumers means changing the topology of the Flume pipeline, which usually requires some downtime as well.
3. Kafka works on a pull model: different consumers can pull data from their respective topics at their own pace and process it in real time or in batch mode. Flume, by contrast, uses a push model, so there is a chance of data loss if a consumer cannot keep up or does not recover quickly.
4. Kafka supports both synchronous and asynchronous replication, depending on your durability requirements, and it uses commodity hard drives. Flume supports both an ephemeral memory-based channel and a durable file-based channel. Even with a durable file-based channel, any event stored in a channel that has not yet been written to a sink is unavailable until the agent recovers. Moreover, the file-based channel does not replicate event data to a different node; it depends entirely on the durability of the storage it writes to.
5. With Kafka we need to write our own producers and consumers (a minimal producer sketch is shown below), whereas Flume ships with built-in sources and sinks that can be used out of the box. On the other hand, if a Flume agent fails, events still sitting in a memory channel are lost.
6. Kafka does not provide native support for message processing, so it usually has to be integrated with another event-processing framework. In contrast, Flume supports different data-flow models and interceptor chaining, which makes event filtering and transformation easy; for example, you can filter out messages you are not interested in early in the pipeline, before sending them over the network, for obvious performance reasons. However, Flume is not suitable for complex event processing.
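To illustrate point 5, here is a minimal hedged sketch of a hand-written Kafka producer in Java; the broker address, topic, key and value are illustrative assumptions.
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleKafkaProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1.example.com:9092");  // hypothetical broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");   // wait for full replication; "1" or "0" trades safety for speed

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one event to a topic; consumers pull it at their own pace.
                producer.send(new ProducerRecord<>("events", "event-key", "event-value"));
            }
        }
    }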
... View more
07-18-2018
11:58 AM
HDFS Block – A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks, and in the same way HDFS stores each file as blocks. HDFS is responsible for distributing the blocks of a file across multiple nodes.
Input Split in Hadoop – The data to be processed by an individual mapper is represented by an InputSplit. The split is divided into records, and each record (a key-value pair) is processed by the map function. The number of map tasks is equal to the number of InputSplits. The data for a MapReduce job is initially stored in input files, which typically reside in HDFS. The InputFormat defines how these input files are split and read, and is responsible for creating the InputSplits.
InputSplit vs Block in Hadoop –
• Block – The default size of an HDFS block is 128 MB, which we can configure as per our requirement. All blocks of a file are the same size except the last one, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system.
• InputSplit – By default, the split size is approximately equal to the block size. The split size is user-definable, and the user can control it in the MapReduce program based on the size of the data (a small sketch is shown below).
Data representation: Block vs InputSplit –
• Block – It is the physical representation of data. It contains the minimum amount of data that can be read or written.
• InputSplit – It is the logical representation of the data present in the blocks. It is used during data processing in a MapReduce program or other processing technique. An InputSplit doesn't contain the actual data, only a reference to the data.
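As a hedged illustration of the "user can control the split size" point, split sizes can be bounded per job through FileInputFormat; the path and byte values below are assumptions for illustration.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-example");
            FileInputFormat.addInputPath(job, new Path("/data/input"));   // hypothetical input path
            // Force splits between 64 MB and 128 MB regardless of the HDFS block size.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            // ... set mapper, reducer and output as usual.
        }
    }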
... View more
07-13-2018
08:01 AM
@Dukool SHarma Rack Awareness Article Part-2. https://community.hortonworks.com/articles/43164/rack-awareness-series-2.html
... View more
02-05-2019
03:58 PM
@Harshali, did you consider RAID levels? Since the replication factor is 3, do you think a RAID level should be considered at all?
... View more
07-06-2018
04:43 AM
From a Hadoop perspective, a small file is a file considerably smaller than the block size (64 MB or 128 MB). Since Hadoop is used for processing huge amounts of data, if we use small files the number of files will obviously be large. Hadoop is really designed for a small number of large files. The issues with small files are:
1. Each file, directory and block in HDFS is represented as an object in the NameNode's memory (i.e. metadata), and each object occupies approximately 150 bytes. Holding that much metadata in the NameNode's memory for a huge number of objects does not scale: as the number of files grows, so does the memory required to store the metadata.
2. HDFS is not designed for efficient access to small files. Handling a large number of small files causes a lot of seeks and a lot of hopping from DataNode to DataNode to retrieve them, which is an inefficient data-access pattern.
3. A mapper usually takes one block of input at a time. If the files are very small (i.e. less than the typical block size), the number of map tasks increases and each task processes very little input. This creates a long queue of tasks with high overhead and decreases the overall speed and efficiency of map jobs.
Solutions:
1. Hadoop Archive files (HAR): the hadoop archive command runs a MapReduce job to pack many small HDFS files into a single HAR file. HAR keeps file sizes large and file counts low.
2. Sequence files: with this approach, data is stored so that the file name becomes the key and the file contents become the value. A MapReduce program can be written to pack a lot of small files into a single sequence file (a small sketch is shown below); sequence files are splittable, so MapReduce can divide them into parts and work on each part independently.
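A hedged, stand-alone sketch of packing small local files into a single sequence file with the SequenceFile API; the output path is a hypothetical example, and a production version would typically do this inside a MapReduce job as described above.
    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path output = new Path("/data/small-files.seq");   // hypothetical output path
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            try {
                // args are local small files; file name becomes the key, file contents the value.
                for (String name : args) {
                    byte[] contents = Files.readAllBytes(new File(name).toPath());
                    writer.append(new Text(name), new BytesWritable(contents));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }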
... View more
06-22-2018
08:38 AM
1) Volume of data: For lower volumes of data, such as a few GBs, an RDBMS is the best choice if it fulfils your requirements; when the data size exceeds that, an RDBMS becomes very slow. In contrast, the Hadoop framework's processing power comes into its own when file sizes are very large and streaming reads and batch processing are what the situation demands.
2) Latency: An RDBMS can respond very quickly when the data size is within its processing limits. Hadoop is very different: it is efficient at batch processing, so results are only available after a large amount of data has been processed. Hadoop is therefore not the ideal platform when immediate results are expected.
3) Throughput: Throughput refers to the amount of data processed in a given period of time, and Hadoop's throughput is higher than that of an RDBMS.
4) ACID properties: ACID properties apply to transaction-based systems, whereas Hadoop has nothing like ACID. In the context of distributed databases, the relevant notion is the BASE model (Basically Available, Soft state, Eventually consistent). You can dig into it for more info, or we can discuss it in a separate thread.
5) Schema: An RDBMS is used to store structured data, or semi-structured data with null values in certain columns of the tables. Hadoop is used to store semi-structured and unstructured data in files. In Hadoop, all processing algorithms are implemented against the files stored in HDFS, whereas in an RDBMS, query languages such as SQL are used to fetch data from tables.
6) Variety of data handled: In an RDBMS, only data that can be represented in a fixed format as rows and columns of a table can be stored. In Hadoop, any kind of data can be stored, but it is only productive if you can process it, for example with a MapReduce job. Two terms are worth mentioning: schema-on-write, used by traditional RDBMSs, where data must be in a specific format before it is written to a table; and schema-on-read, used by Hadoop, where you store data in raw form and impose structure at processing time based on the requirements of the processing application.
7) Response time: Response time for an RDBMS is very low as long as the data is within its processing limits, whereas Hadoop is very fast at processing very large files but executes its jobs in batches from time to time.
... View more