Member since: 11-12-2018
Posts: 189
Kudos Received: 177
Solutions: 32
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 511 | 04-26-2024 02:20 AM |
| | 665 | 04-18-2024 12:35 PM |
| | 3219 | 08-05-2022 10:44 PM |
| | 2937 | 07-30-2022 04:37 PM |
| | 6415 | 07-29-2022 07:50 PM |
12-06-2018
04:27 AM
3 Kudos
@Junfeng Chen You can change the path of the temp folder for each Spark application with the spark.local.dir property, as below:
SparkConf conf = new SparkConf().setMaster("local").setAppName("test").set("spark.local.dir", "/tmp/spark-temp");
Please accept the answer you found most useful
12-04-2018
12:03 PM
3 Kudos
Please try running the command below. hdp-select sets a given version as the current version by creating symlinks to the folder with the matching version number.
hdp-select status | grep -i hdfs
12-04-2018
11:45 AM
2 Kudos
For 20 Kafka machines, I would suggest going with 3 ZooKeeper servers.
12-04-2018
09:24 AM
2 Kudos
@Michael Bronson A ZooKeeper ensemble tolerates some of its servers being down. But yes, if you are planning to scale the cluster, it is always advisable to go with more resources and more robust hardware. It is perfectly fine to move the ZooKeeper servers from VMs to physical machines with more resources.
12-04-2018
08:57 AM
3 Kudos
@Michael Bronson In a normal small deployment, using 3 ZooKeeper servers is acceptable, but keep in mind that you will only be able to tolerate 1 server down in this case. A ZooKeeper ensemble of 5 or 7 servers tolerates 2 or 3 servers down, respectively. I hope this answers your question. Reference: https://kafka.apache.org/documentation/#zk
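The tolerance figures above follow from ZooKeeper needing a strict majority of the ensemble to stay up. Here is a minimal sketch of that arithmetic; the function name tolerated_failures is my own for illustration and is not part of any ZooKeeper or Kafka API.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Return how many servers can be down while a majority quorum survives.

    Hypothetical helper for illustration only; not a ZooKeeper API.
    """
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    # A quorum is a strict majority of the ensemble: floor(n / 2) + 1 servers.
    quorum = ensemble_size // 2 + 1
    # Everything beyond the quorum can fail without losing the majority.
    return ensemble_size - quorum


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(f"{n} servers: quorum of {n // 2 + 1}, tolerates {tolerated_failures(n)} down")
```

Note that an even-sized ensemble buys nothing extra: 4 servers still tolerate only 1 failure, which is why odd sizes are the usual choice.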
12-04-2018
07:52 AM
4 Kudos
@Michael Bronson This is a harmless message that you can ignore. It is addressed in version 2.3.0 of Ambari. Please see: https://issues.apache.org/jira/browse/AMBARI-12420 The DataNode code was changed in Ambari 2.3.0 so that it stops logging the EOFException when a client connects to the data transfer port and closes the connection immediately, before sending any data.
12-04-2018
07:26 AM
3 Kudos
@Michael Bronson The NameNode stores metadata about the data held on the DataNodes, whereas the DataNodes store the actual data. The NameNode requires RAM directly proportional to the number of data blocks in the cluster. A good rule of thumb is to assume 1 GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64 GB of RAM on the NameNode provides plenty of room to grow the cluster. So a single NameNode can handle thousands of DataNodes, but there are many factors to consider: NameNode memory size, number of blocks to be stored, block replication factor, how the cluster will be used, etc. In short, the number of DataNodes a single NameNode can handle depends on the size of the NameNode (how much metadata it can hold). Please accept the answer you found most useful
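To make the rule of thumb concrete, here is a rough back-of-the-envelope sketch. The function names are hypothetical, the 128 MB block size is an assumption (check dfs.blocksize on your cluster), and the block count assumes files fill whole blocks, so a workload with many small files will need more memory than this suggests.

```python
import math


def blocks_for_data(data_tb: float, block_size_mb: int = 128) -> int:
    """Estimate the number of HDFS blocks for a given amount of logical data.

    Assumes files fill whole blocks (optimistic); block_size_mb=128 is an
    assumed default, not read from any cluster configuration.
    """
    data_mb = data_tb * 1024 * 1024
    return math.ceil(data_mb / block_size_mb)


def namenode_memory_gb(total_blocks: int, gb_per_million_blocks: float = 1.0) -> float:
    """Apply the rule of thumb: about 1 GB of NameNode memory per 1 million blocks."""
    return total_blocks / 1_000_000 * gb_per_million_blocks


if __name__ == "__main__":
    blocks = blocks_for_data(1024)  # 1 PB of logical data
    print(f"{blocks} blocks -> about {namenode_memory_gb(blocks):.1f} GB of NameNode memory")
```

Read the other way around, a 64 GB NameNode has room for roughly 64 million blocks under the same rule.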
12-01-2018
10:48 AM
3 Kudos
@Gulshan Agivetova You can force Ambari Server to start by skipping this check with the following option: ambari-server start --skip-database-check
11-27-2018
06:03 AM
3 Kudos
@Amit Mishra Knox can be configured with authentication options other than LDAP. Here are the links to the list of supported authentication providers for Knox (such as LDAP, PAM, and Kerberos): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/authentication_providers.html https://knox.apache.org/books/knox-1-1-0/user-guide.html#HadoopAuth+Authentication+Provider Please accept the answer you found most useful
11-26-2018
02:13 PM
2 Kudos
@vamsi valiveti Shuffling is the process of transferring data from the mappers to the reducers, so it is clearly necessary for the reducers: without it they would have no input (or no input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.

Sorting saves time for the reducer by helping it easily distinguish when a new reduce task should start: a new one starts whenever the next key in the sorted input data differs from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key-list(value) input, so it has to group values by key. That is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers). A great source of information for these steps is the Yahoo tutorial.

Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).

Please accept the answer you found most useful
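To illustrate why sorting makes the grouping step cheap, here is a small single-process sketch of a word count in plain Python. It is only an analogy under my own assumptions: the function names are hypothetical, nothing here uses the Hadoop API, and a real job sorts locally on each mapper and merge-sorts on the reducers rather than sorting everything in one place.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a word-count mapper."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle_and_sort(pairs):
    """Stand-in for shuffle and sort: bring all pairs together, ordered by key."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Call the reduce logic once per key.

    Because the input is sorted, a change of key marks the start of the next
    group, so one linear pass is enough to build each key-list(value) input.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def word_count(lines):
    return dict(reduce_phase(shuffle_and_sort(map_phase(lines))))


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

If shuffle_and_sort were skipped, groupby would only merge adjacent equal keys, so the same word could be reduced more than once; that is the reason the sort matters for the reducers.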