Member since
07-04-2016
40 Posts
5 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 734 | 09-16-2016 05:31 AM
09-22-2021
11:40 PM
How can I enable SSL for the Livy server in EMR? Can we use a KMS certificate for this, or is there another option?
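In stock Apache Livy, HTTPS is driven by the keystore settings in livy.conf; a minimal sketch, assuming a JKS keystore (the key names come from livy.conf.template, the path and passwords are placeholders, and whether a KMS-issued certificate can be imported into such a keystore depends on the setup):

```
# livy.conf - setting livy.keystore switches the Livy server to HTTPS
# (paths and passwords below are placeholder values)
livy.keystore = /etc/livy/conf/livy-keystore.jks
livy.keystore.password = changeit
livy.key-password = changeit
```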
... View more
Labels:
- Apache Zeppelin
09-19-2016
02:07 PM
I have tested this: we can run jobs on a node that has no DataNode daemon running and is configured as an edge node. Correct me if I am wrong.
... View more
09-19-2016
12:05 PM
If I configure a node as an edge node and not as a DataNode, I cannot store data on it. But can I configure a NodeManager on the edge node, and can the data be brought to the edge node to run the task there if all the other nodes are busy?
... View more
Labels:
- Apache Hadoop
- Apache YARN
09-19-2016
05:00 AM
@Rushikesh Deshmukh What is the purpose of merging the tables used in joins? Can you please explain?
... View more
09-16-2016
05:31 AM
1 Kudo
1) Why does the Secondary NameNode explicitly copy the fsimage from the primary NameNode when it already has the same copy of the fsimage as the primary? There is no guarantee that the fsimage on the Secondary NameNode is exactly the same as the one on the primary NameNode. During the checkpoint period, data corruption, crashes, or data loss may occur. It is better to fetch the latest available fsimage from the primary NameNode and then merge the edit logs into it.

2) When a cluster is initially set up, will the primary node have an fsimage, and if yes, will it contain any data? Yes. When a new NameNode is set up in a new cluster, it has an fsimage with no data in it, with a file name like fsimage_000000000, representing zero transactions.

3) It looks like both the primary NameNode and the Secondary NameNode maintain all the transaction logs. Is it required to maintain the same logs in both locations? If yes, how many old transactions do we have to keep in the cluster, and is there a configuration for this? By default, HDFS retains edit logs until the transaction count reaches 1 million; edit files holding transactions beyond that 1 million are removed (see the config sketch below).
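For reference, a minimal hdfs-site.xml sketch of the properties behind this behaviour (the values shown are the stock defaults):

```xml
<!-- hdfs-site.xml: checkpoint and edit-log retention settings (defaults shown) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint at least once an hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- ...or after 1M uncheckpointed transactions -->
</property>
<property>
  <name>dfs.namenode.num.extra.edits.retained</name>
  <value>1000000</value> <!-- extra transactions kept before old edits are purged -->
</property>
```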
... View more
08-04-2016
02:17 PM
I am not familiar with Spark, but it looks like it has functions to meet your requirement: http://stackoverflow.com/questions/36436020/converting-csv-to-orc-with-spark
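Something along these lines, a minimal sketch using the Spark SQL Java API (the application name and the input/output paths are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToOrc {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-to-orc").getOrCreate();
        // Read the CSV (placeholder path), letting Spark infer the schema
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/input.csv");
        // Write the same data back out in ORC format (placeholder path)
        df.write().orc("hdfs:///data/output_orc");
        spark.stop();
    }
}
```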
... View more
08-04-2016
09:21 AM
@Benjamin Leonhardi As per YARN, the ApplicationMaster is mere code, so I am unable to figure out how a new DAG can be submitted to an existing AppMaster that was written to handle some other DAG.
... View more
08-04-2016
09:18 AM
Thank you @Shiv kumar
... View more
08-04-2016
04:58 AM
So the handshake between the client and the AppMaster in YARN (which is decommissioned once the job is done) is kept alive here in a Tez session, and the client submits new DAGs directly to the AppMaster. The ResourceManager thinks it is still the same application running, so the DAGs run under the same application ID. Correct me if I am wrong.
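If that is right, it is what the TezClient session API exposes; a rough sketch (the session name and DAG builders are placeholders, and a real DAG would need vertices wired in before submission):

```java
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezConfiguration;

public class TezSessionSketch {
    public static void main(String[] args) throws Exception {
        TezConfiguration tezConf = new TezConfiguration();
        tezConf.setBoolean(TezConfiguration.TEZ_AM_SESSION_MODE, true); // session mode
        TezClient tezClient = TezClient.create("my-session", tezConf);
        tezClient.start();            // the RM launches one AppMaster for the session
        tezClient.submitDAG(buildFirstDag());  // first DAG
        tezClient.submitDAG(buildSecondDag()); // goes straight to the same AppMaster
        tezClient.stop();
    }
    // Placeholder builders; real DAGs would add vertices and edges here.
    private static DAG buildFirstDag()  { return DAG.create("dag1"); }
    private static DAG buildSecondDag() { return DAG.create("dag2"); }
}
```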
... View more
08-03-2016
01:00 PM
Hi @ARUN The main reason might be that the data blocks needed for the MapReduce job are located on those two nodes themselves. Can you please check the data blocks of the file you are processing and verify whether the data is distributed across all 3 nodes? Speculative execution (the case where your nodes are too busy running tasks, so the data can be moved temporarily to the third node to run the task there) may also not be happening.
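To check the block locations, something like this should show them (replace the path with your input file):

```
hdfs fsck /path/to/your/file -files -blocks -locations
```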
... View more
08-03-2016
12:37 PM
As per Tez sessions, DAGs submitted within a session are handled by the same AppMaster. I am unable to understand how the new application (DAG) is mapped to the already running AppMaster. Who does it, and how? As per YARN, the ResourceManager is responsible for launching AppMasters; how is this functionality eclipsed by Tez? Thanks in advance.
... View more
Tags:
- Hadoop Core
- tez
- YARN
Labels:
- Apache Tez
- Apache YARN
08-03-2016
12:33 PM
I don't feel good saying this, but I am not satisfied with your answer. It is fine that the ApplicationMaster does the job of calling the InputFormat, calculating the input splits, and so on. But I am asking what is meant by the sentence quoted in the Definitive Guide, that the client places the computed input splits in HDFS. I am sorry if I am unable to explain my doubt properly.
... View more
08-01-2016
03:30 PM
@Shiv kumar That is what I am saying. So "where does this happen?" is my question.
... View more
07-26-2016
12:58 PM
Very neatly explained!
... View more
07-26-2016
11:19 AM
So will it read all the data (1 GB) and then split the data into logical splits and assign a map task to each? Then what are the computed input splits placed in HDFS while the job is being submitted? At that point the AppMaster will not even have been launched. And how can a 1 GB file be divided into 10 splits if the block size is 256 MB? The division is based on the split size, which is configurable (as far as I know).
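For what it's worth, FileInputFormat derives the split size from the block size and two configurable bounds; a small sketch of the arithmetic (with the defaults, a 1 GB file with 256 MB blocks gives 4 splits, not 10):

```java
public class SplitSizeDemo {
    public static void main(String[] args) {
        // FileInputFormat's rule: splitSize = max(minSize, min(maxSize, blockSize))
        long blockSize = 256L * 1024 * 1024; // 256 MB HDFS block
        long minSize = 1L;                   // mapreduce.input.fileinputformat.split.minsize (default)
        long maxSize = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize (default)
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        long fileSize = 1024L * 1024 * 1024; // 1 GB file
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
        System.out.println("splits = " + numSplits); // prints 4
    }
}
```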
... View more
07-26-2016
07:16 AM
If it runs in the AppMaster, what exactly are "the computed input splits" that the job client stores in HDFS while submitting the job?
"Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3)."
The above is a line from Hadoop: The Definitive Guide. And how does a map task work if the split spans data blocks on two different DataNodes?
... View more
Labels:
- Apache Hadoop
- HDFS
07-19-2016
01:00 PM
Can maintenance mode be the answer? If yes, what happens when a node is kept in maintenance mode? How does replication work for the data on a node in maintenance mode? What happens when I decommission a DataNode? And what happens when I delete a DataNode?
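On the decommissioning part, the usual flow is roughly this (a sketch; the exclude file is whatever dfs.hosts.exclude points to in your cluster):

```
# 1. Add the DataNode's hostname to the exclude file referenced by dfs.hosts.exclude
# 2. Tell the NameNode to re-read it; HDFS re-replicates the node's blocks first
hdfs dfsadmin -refreshNodes
# 3. Watch the node go from "Decommission in progress" to "Decommissioned"
hdfs dfsadmin -report
```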
... View more
Labels:
- Apache Hadoop
07-11-2016
11:09 AM
YARN has many advantages over MapReduce (MRv1):
1) Scalability - by delegating the work of handling the tasks running on the slaves to a per-application ApplicationMaster, the ResourceManager (RM) can handle more requests than the JobTracker could, facilitating the addition of more nodes.
2) Unlike MRv1, which is tightly coupled to MapReduce, YARN supports many kinds of workloads running on it, such as MRv2, Tez, Storm, and Spark.
3) Optimized resource allocation - there is no fixed number of slots allocated separately to mappers and reducers in YARN, as is the case in MRv1, so the available capacity of a node can be used by any task that needs resources (see the sketch below).
4) When the ResourceManager fails, the jobs running on the cluster need not be restarted after the ResourceManager recovers.
5) The failover mechanism is implemented via ZooKeeper, which is already part of the ResourceManager setup, so we don't need to run another daemon.
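On point 3, a minimal yarn-site.xml sketch of how capacity is declared as a per-node resource pool rather than as fixed map/reduce slots (the values are illustrative):

```xml
<!-- yarn-site.xml: a NodeManager advertises one pool of resources; any container
     (map, reduce, Tez task, Spark executor...) can draw from it -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
```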
... View more
07-05-2016
12:46 PM
@Benjamin Leonhardi Thank you for your quick and explanatory answers. Can you please clarify a few more doubts I have? 1) What is the reason behind storing the output of MapReduce in HDFS? Why can't we send it directly to the client or display it? What happens to the output files: are they stored permanently or flushed after some time, and if so, on what basis? 2) Will MapReduce run when we read data from HDFS?
... View more
07-05-2016
11:04 AM
@Benjamin Leonhardi I am satisfied with your answer. But for the second question, I am talking about each chunk the file is divided into, not about the replicas of a block.
... View more
07-05-2016
04:26 AM
Let's assume we have a file of 300 MB and the HDFS block size is 128 MB, so the file is divided into 128 MB (B1), 128 MB (B2), and 44 MB (B3).
1. Where does this splitting of the huge file take place? Many people say "the client splits the file" - what actually is the client? The HDFS client (if yes, can you give me the flow from an executed command like -put, through the HDFS client, to the NameNode and DataNodes), or some other external tool (if yes, an example)?
2. Does the client form 3 pipelines, one for each block, which run in parallel to replicate?
3. Will DN1, which received B1, start sending the data to DN2 before the full 128 MB of its block has arrived? And if my third point is true, doesn't that contradict the replication principle of "get the complete block of data and then start replicating", rather than replicating the chunks of data as soon as they arrive? Can you also provide the possible reasons why the flow is not the other way around?
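For context on question 1, a minimal sketch of the write path through the HDFS client API (the path is a placeholder; the -put command goes through this same FileSystem client, and the block-by-block chunking happens inside the client's output stream, not in an external tool):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // the HDFS client
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/file300mb"))) {
            byte[] chunk = new byte[64 * 1024];
            // The client-side DFSOutputStream buffers what we write into packets,
            // streams them down a pipeline of DataNodes, and asks the NameNode
            // for a new block (and a new pipeline) each time dfs.blocksize fills up.
            out.write(chunk);
        }
    }
}
```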
... View more
Labels:
- Apache Hadoop