About jagadeesan

jagadeesan · ‎12-06-2018

@Junfeng Chen You can change the temp folder path for each Spark application by setting the spark.local.dir property, as below: SparkConf conf = new SparkConf().setMaster("local").setAppName("test").set("spark.local.dir", "/tmp/spark-temp"); Please accept the answer you found most useful.

jagadeesan · ‎12-04-2018

Please try running the command below. hdp-select sets a given version as the current version by creating symlinks to the folder with the corresponding version number. hdp-select status | grep -i hdfs

jagadeesan · ‎12-04-2018

For 20 Kafka machines, I would suggest going with 3 ZooKeeper servers.

jagadeesan · ‎12-04-2018

@Michael Bronson A ZooKeeper ensemble tolerates some of its servers being down. But yes, if you are planning to scale the cluster, it is always advisable to go with more resources and robust hardware. It is perfectly fine to move the ZooKeeper servers from VMs to physical machines with more resources.

jagadeesan · ‎12-04-2018

@Michael Bronson In a normal small deployment, using 3 ZooKeeper servers is acceptable, but keep in mind that you will only be able to tolerate 1 server down in this case. A ZooKeeper ensemble of 5 or 7 servers tolerates 2 or 3 servers down, respectively. I hope this answers your question. Reference: https://kafka.apache.org/documentation/#zk
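The numbers above follow from ZooKeeper needing a majority of the ensemble to stay up. A minimal sketch of that arithmetic, where the function names are illustrative only and not part of any ZooKeeper or Kafka API:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can be down while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} down")
```

Note that 4 servers tolerate no more failures than 3, which is why odd ensemble sizes are the usual choice.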

jagadeesan · ‎12-04-2018

@Michael Bronson This is a harmless message that you can ignore. It is addressed in Ambari 2.3.0; please see: https://issues.apache.org/jira/browse/AMBARI-12420 The DataNode code was changed in Ambari 2.3.0 so that it stops logging the EOFException when a client connects to the data transfer port and closes the connection immediately, before sending any data.

jagadeesan · ‎12-04-2018

@Michael Bronson The NameNode stores metadata about the data held on the DataNodes, whereas the DataNodes store the actual data. The NameNode requires RAM in direct proportion to the number of data blocks in the cluster. A good rule of thumb is to assume 1 GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64 GB of RAM on the NameNode provides plenty of room to grow the cluster. So a single NameNode can handle thousands of DataNodes, but there are many factors to consider: NameNode memory size, number of blocks to be stored, block replication factor, how the cluster will be used, etc. In short, the number of DataNodes a single NameNode can handle depends on the size of the NameNode (how much metadata it can hold). Please accept the answer you found most useful.
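The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the function names are made up for illustration, and the 1 GB per million blocks ratio is the rough planning figure quoted in the post, not a measured value.

```python
import math

# Assumption from the rule of thumb above: ~1 GB of NameNode memory
# per 1 million blocks. Treat it as a planning estimate only.
GB_PER_MILLION_BLOCKS = 1.0


def estimate_namenode_memory_gb(block_count: int) -> int:
    """Rough NameNode memory (GB, rounded up) for a given block count."""
    if block_count < 0:
        raise ValueError("block_count must not be negative")
    return math.ceil(block_count / 1_000_000 * GB_PER_MILLION_BLOCKS)


def max_blocks_for_memory(memory_gb: float) -> int:
    """Rough number of blocks a NameNode with memory_gb could track."""
    return int(memory_gb / GB_PER_MILLION_BLOCKS * 1_000_000)


if __name__ == "__main__":
    print(estimate_namenode_memory_gb(50_000_000))
    print(max_blocks_for_memory(64))
```

By this estimate the limit is set by the block count rather than by the number of DataNodes, which matches the conclusion in the post.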

jagadeesan · ‎12-01-2018

@Gulshan Agivetova You can force Ambari Server to start by skipping this check with the following option: ambari-server start --skip-database-check

jagadeesan · ‎11-27-2018

@Amit Mishra Knox can be configured with authentication options other than LDAP. Here are links to the supported authentication providers for Knox (such as LDAP, PAM, and Kerberos): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/authentication_providers.html https://knox.apache.org/books/knox-1-1-0/user-guide.html#HadoopAuth+Authentication+Provider Please accept the answer you found most useful.

jagadeesan · ‎11-26-2018

@vamsi valiveti Shuffling is the process of transferring data from the mappers to the reducers, so it is necessary for the reducers; otherwise they would have no input (or would not get input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.

Sorting saves time for the reducer by making it easy to tell when a new reduce() call should start: put simply, a new call begins when the next key in the sorted input differs from the previous one. Each reduce task receives a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values, so it has to group values by key. That is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers). A great source of information for these steps is this Yahoo tutorial.

Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster). Please accept the answer you found most useful.
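The sort-then-group idea can be illustrated with a small in-memory simulation. This is not Hadoop code: it is a plain Python sketch of a word count, and map_phase, reduce_fn, and shuffle_and_reduce are hypothetical names used only for illustration.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (word, 1) pairs, sorted locally by key as a mapper would."""
    return sorted((word, 1) for line in lines for word in line.split())


def reduce_fn(key, values):
    """Stand-in for reduce(): receives one key and all of its values."""
    return sum(values)


def shuffle_and_reduce(mapper_outputs):
    """Merge pre-sorted mapper outputs, then call reduce_fn once per key."""
    # Merging already-sorted inputs keeps equal keys next to each other.
    merged = merge(*mapper_outputs, key=itemgetter(0))
    results = {}
    # A new group (and so a new reduce call) starts when the key changes.
    for key, pairs in groupby(merged, key=itemgetter(0)):
        results[key] = reduce_fn(key, [value for _, value in pairs])
    return results


mapper_outputs = [map_phase(["b a", "a c"]), map_phase(["c b b"])]
counts = shuffle_and_reduce(mapper_outputs)
print(counts)
```

Because each mapper's output is already sorted, the reduce side only needs a merge rather than a full sort, and a change of key is enough to detect the boundary between groups.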

Last Visited	‎12-24-2024 07:17 PM

Member Since	‎11-12-2018 10:00 AM
Posts	189
Kudos received	177

Cloudera Community

Re: Apache Storm support in Cloudera

Re: Complete example for using spark MLlib for twi...

Re: CDP - Zeppeling: Spark + Livy + Hive - HWC

Re: CDP - Zeppelin - Livy Error

Re: Spark3 connection to HIVE ACID Tables

Re: How to change Spark _temporary directory when ...

Re: ERROR datanode.DataNode + error processing WRI...

Re: datanode machine + how many datanode we can ad...

Re: datanode machine + how many datanode we can ad...

Re: datanode machine + how many datanode we can ad...

Re: ERROR datanode.DataNode + error processing WRI...

Re: datanode machine + how many datanode we can ad...

Re: Install and configure new Ambari server for ex...

Re: how to disable LDAP authentication in Knox

Re: Map reduce Flow clarification