Member since: 07-30-2019
Posts: 111
Kudos Received: 181
Solutions: 35
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1854 | 02-07-2018 07:12 PM |
| | 1301 | 10-27-2017 06:16 PM |
| | 1778 | 10-13-2017 10:30 PM |
| | 3355 | 10-12-2017 10:09 PM |
| | 698 | 06-29-2017 10:19 PM |
04-11-2017
08:18 PM
@hardik desai The NameNode appears to be up in your screenshot, so it's difficult to say what went wrong. Check the service logs of the affected service instances for errors, and also look through your NameNode logs.
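As a rough starting point, here is one way to scan for recent errors; the log path assumes the typical HDP default location, so adjust it for your install:

    # Show the last 50 ERROR/FATAL lines from the NameNode log (path is an example)
    grep -E "ERROR|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | tail -n 50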
04-11-2017
08:10 PM
Hi @Sami Ahmad, there is no reliable way to recover data once you have removed a directory with the -skipTrash option. Your best bet is to stop the cluster as soon as you realize the mistake and then try the recovery steps from the linked article. However, even that won't work if the DataNodes have already deleted the block files (the delay between issuing the delete command and the DataNodes deleting the block files can be anywhere from a few seconds to a few minutes). In your case, you don't see the transaction in the edits_inprogress_0000000000010978611 file most likely because the edit logs have rolled over and the delete transaction is in an older edit log file. If your cluster has been up and running for the last 4 days, there is unfortunately little hope of recovering the data now.
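If you want to confirm which edit log segment holds the delete, the Offline Edits Viewer can dump a finalized segment to XML; the file name and directory below are placeholders, assuming the default NameNode metadata layout:

    # Dump a finalized edit log segment to XML (path and segment name are examples)
    hdfs oev -p xml -i /hadoop/hdfs/namenode/current/edits_<start-txid>-<end-txid> -o /tmp/edits.xml
    # Look for the delete transaction and the path it removed
    grep -B2 -A8 OP_DELETE /tmp/edits.xml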
04-06-2017
04:33 PM
1 Kudo
Using RAID can reduce the availability and fault tolerance of HDFS. It certainly reduces the overall performance as compared to JBOD. We strongly recommend configuring your disks as JBOD since HDFS already stores data redundantly by replicating across nodes/racks and can automatically recover from disk and node failures.
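For reference, a typical JBOD layout simply lists each disk's mount point separately in dfs.datanode.data.dir; the mount points below are illustrative:

    <property>
      <name>dfs.datanode.data.dir</name>
      <!-- one entry per physical disk, no RAID underneath -->
      <value>/grid/0/hadoop/hdfs/data,/grid/1/hadoop/hdfs/data,/grid/2/hadoop/hdfs/data</value>
    </property>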
04-05-2017
01:46 PM
I'd also post this question on the Ambari track to check why Ambari didn't detect the DataNodes going down. From your logs it is hard to say why the DataNode went down. I again recommend increasing the DataNode heap allocation via Ambari, and also check that your nodes are provisioned with a sufficient amount of RAM.
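If the process disappeared without a clean shutdown message in its log, it is also worth ruling out the Linux OOM killer; a quick check on a typical Linux host looks like this:

    # Check whether the kernel killed the DataNode JVM for running out of memory
    dmesg | grep -i "killed process"
    # Check overall memory headroom on the node
    free -h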
04-05-2017
01:41 PM
OK, it looks like you have automatic failover enabled. I am not sure why you get the EOFException. Look through your NameNode logs to see if there are any errors.
04-05-2017
01:36 PM
The Mover will move blocks within the same node when possible and thus try to avoid network activity. If that is not possible (e.g. when a node doesn't have SSD or when the local SSDs are full), it will move block replicas across the network to another node that has the target media. I've edited my answer.
04-04-2017
09:06 PM
@Riccardo Iacomini, are you asking about the HDFS move/rename command? Move is purely a metadata operation on the NameNode and does not result in any data movement until the HDFS Mover utility is run. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Mover_-_A_New_Data_Migration_Tool
Edit: The Mover will move blocks within the same node when possible and thus try to avoid network activity. If that is not possible (e.g. when a node doesn't have SSD or when the local SSDs are full), it will move block replicas across the network to another node that has the target media.
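As a rough illustration of the workflow the Mover supports (the path and policy name are examples, not from the original question):

    # Change the storage policy on a directory; existing replicas are not moved yet
    hdfs storagepolicies -setStoragePolicy -path /data/hot -policy ALL_SSD
    # Run the Mover to migrate existing block replicas to the target storage media
    hdfs mover -p /data/hot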
04-03-2017
11:14 PM
1 Kudo
You likely have Kerberos enabled. The DataNode process starts as root so it can bind a privileged port (<1024) for data transfer. Then it launches another process as user hdfs. You should not kill either process. The "refused to connect" error looks like some network connectivity issue in your environment, or you are hitting the wrong port number. See if you can find the correct info port from either configuration or from the DataNodes tab of the NameNode web UI.
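One way to double-check which ports the DataNode is actually configured to use (these are standard HDFS settings; run the commands on a node that has the client configuration):

    # Data transfer port (a privileged port, e.g. 1004, in a Kerberized setup without SASL)
    hdfs getconf -confKey dfs.datanode.address
    # Web/info port to use in the browser or with curl
    hdfs getconf -confKey dfs.datanode.http.address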
04-03-2017
06:23 PM
1 Kudo
Are you using HDP and did you enable NameNode HA using Ambari? If so, then you should have automatic failover configured. Automatic failover requires the ZooKeeper service instances and ZooKeeper FailoverControllers to be up and running. If you set up HA manually, then you may need to transition one of the NNs to active status manually as described here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
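For reference, a manual transition looks roughly like this; nn1 is a placeholder for whatever NameNode ID you defined in dfs.ha.namenodes.<nameservice>:

    # Check which NameNode is currently active/standby
    hdfs haadmin -getServiceState nn1
    # Manually make nn1 active (requires --forcemanual when automatic failover is configured)
    hdfs haadmin -transitionToActive nn1 --forcemanual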
04-03-2017
05:08 PM
2 Kudos
These pages are still present. You can navigate to the jmx servlet on the DataNode web UI. e.g. http://<datanode>:50075/jmx
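For example, you can query the servlet from the command line and filter to a particular bean; the hostname and bean name below are illustrations:

    # Fetch DataNode JVM metrics as JSON from the JMX servlet
    curl 'http://dn1.example.com:50075/jmx?qry=Hadoop:service=DataNode,name=JvmMetrics'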
03-31-2017
06:59 PM
1 Kudo
@Chad Woodhedad, yes `fs -du` is expensive compared to other read operations. Running it every 5 minutes is probably overkill. You can run it less frequently, e.g. once an hour. `hdfs dfsadmin -report` is also expensive compared to typical read operations. We've occasionally seen these calls affect NameNode performance when buggy monitoring scripts invoke them many times per second. Barring that, you should be fine.
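A minimal sketch of an hourly check, assuming you only need the aggregate usage of one tree; the path and log file are placeholders:

    # crontab entry: summarize usage of /data/warehouse once an hour
    0 * * * * hdfs dfs -du -s -h /data/warehouse >> /var/log/hdfs-du.log 2>&1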
03-27-2017
11:46 PM
Just curious - why did you restart the DataNodes? Did they crash?
03-27-2017
06:37 PM
1 Kudo
Hi @Joshua Adeleke, how frequently do you see the errors? These are sometimes seen in busy clusters, and clients/HDFS usually recover from transient failures. If there are no job or task failures around the time of the errors, I would just ignore them. Edit: I took a look at your attached log file. There are a lot of GC pauses, as @Namit Maheshwari pointed out. Try increasing the DataNode heap size and PermGen/NewGen allocations until the GC pauses go away.
2017-03-25 10:10:18,219 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 44122ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=44419ms
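A starting point for the heap change, set via the hadoop-env template in Ambari; the sizes below are examples to tune against your workload and the RAM available on the nodes:

    # Example DataNode JVM sizing (values are illustrative, not a universal recommendation)
    export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g -XX:MaxNewSize=800m ${HADOOP_DATANODE_OPTS}"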
03-27-2017
06:31 PM
Hi @JJ Tsien, 1. We don't support setting quotas via a configuration file. Quotas must be set using the setQuota command issued by an administrator. 2. There is no way to specify quotas as a percentage of total storage either, sorry.
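For completeness, this is what setting quotas from the command line looks like; the path and limits are examples:

    # Limit /user/project1 to 1,000,000 namespace objects (files + directories)
    hdfs dfsadmin -setQuota 1000000 /user/project1
    # Limit /user/project1 to 10 TB of raw space (includes replication)
    hdfs dfsadmin -setSpaceQuota 10t /user/project1
    # Verify the configured quotas and current usage
    hdfs dfs -count -q -h /user/project1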
03-07-2017
06:10 PM
3 Kudos
@Viswa, the Apache documentation provides a good description. https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode I have never seen the Checkpoint/Backup node used in practice, and they should be considered deprecated. I recommend using the Secondary NameNode. Ideally you should use NameNode HA, which eliminates the single point of failure. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html If you use Ambari to install HDP, the SecondaryNameNode is enabled by default and NameNode HA can be enabled using a wizard.
03-07-2017
06:03 PM
3 Kudos
Hi @Akash S, this is a known issue and it is benign. You can safely ignore the alerts. If you want to avoid seeing them you can append the following options to HADOOP_DATANODE_OPTS via Ambari and restart DataNodes: -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:ConcGCThreads=8 -XX:+UseConcMarkSweepGC Also asked here: https://community.hortonworks.com/questions/85961/ambari-alert-datanode-heapsize-alert.html#answer-85964
03-01-2017
10:08 PM
Hi @Kumar Veerappan, Yes this alert seems to be new for Ambari 2.2.2. I don't see it in the 2.2.1 release.
02-27-2017
10:49 PM
This is a benign issue and the alerts can be safely ignored. It was addressed by AMBARI-18936. You can fix it by manually adding the following options to HADOOP_DATANODE_OPTS via Ambari: -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:ConcGCThreads=8 -XX:+UseConcMarkSweepGC
01-20-2017
10:15 PM
3 Kudos
@apappu is correct. These JVM options should be added to HADOOP_DATANODE_OPTS in the hadoop-env template. After making the changes you should restart all the DataNodes (there is no need to restart the NameNodes). I recommend restarting DataNodes two at a time.
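In the Ambari hadoop-env template, the change would look roughly like this; the flags are the ones quoted above, and prepending them to the existing variable is one common way to apply them:

    # Append CMS GC options for DataNodes in the hadoop-env template
    export HADOOP_DATANODE_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:ConcGCThreads=8 -XX:+UseConcMarkSweepGC ${HADOOP_DATANODE_OPTS}"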
11-01-2016
10:28 PM
1 Kudo
This error has nothing to do with the missing user-specific group. The NameNode has no requirement that the group name match the user name. As long as the id command succeeded and returned any groups, the NameNode would not have logged the error. This error was likely due to a temporary infrastructure issue, as pointed out by @Smart Solutions in a later comment. The following article is a good starting point for any issues related to group lookups. https://community.hortonworks.com/articles/38591/hadoop-and-ldap-usage-load-patterns-and-tuning.html
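To sanity-check group resolution, you can compare what the OS and the NameNode each return for the user; the username below is a placeholder:

    # Groups as seen by the operating system on the NameNode host
    id someuser
    # Groups as resolved by the NameNode's group mapping
    hdfs groups someuser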
08-17-2016
06:11 PM
Hi @William Bolton, are these applications accessing HDFS directly? What's the mode of access e.g. WebHDFS REST API, Java APIs or something else?
08-17-2016
06:01 PM
Hi @jovan karamacoski, are you able to share what your overall goal is? The NameNode detects DataNode failures in ~10 minutes and queues re-replication work. Disk failures can take longer and we are planning to make improvements in this area soon. The re-replication logic is complex. If you think your changes will be broadly useful please consider filing a bug in Apache HDFS Jira and submitting the changes as a patch. Best, Arpit.
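For context, the ~10 minute figure comes from the standard heartbeat settings; with the defaults below, a DataNode is marked dead after roughly 2 * recheck-interval + 10 * heartbeat-interval ≈ 10.5 minutes (values shown are the defaults, in the units each property expects):

    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value> <!-- seconds between DataNode heartbeats -->
    </property>
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>300000</value> <!-- milliseconds; recheck window for stale DataNodes -->
    </property>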
08-16-2016
11:58 PM
3 Kudos
Commenting to clarify that some of the advice above is not wrong, but it can be dangerous. In HDP 2.2 and later, the DataNode is more strict about where it expects block files to be. I do not recommend manually moving block files or folders around on DataNodes, unless you really know what you are doing. @jovan karamacoski, to answer your original question - the NameNode drives the re-replication (specifically the BlockManager class within the NameNode). The ReplicationMonitor thread wakes up periodically and computes re-replication work for DataNodes. The re-replication logic has multiple triggers such as block reports, heartbeat timeouts, decommissioning, etc.
08-08-2016
07:43 PM
Thanks for the heads up @Kuldeep Kulkarni. It could be a couple of things: the ephemeral port was in use by another process that has since exited, or there is a process still using the port but running under different user credentials. @vijay kadel were you running the ps/netstat/lsof commands as the root user?
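Running the checks as root matters because non-root users cannot always see which process owns a socket; a typical check looks like this (the port number is an example, substitute the one from the bind error):

    # Show the process, if any, listening on the port (run as root to see all owners)
    sudo netstat -tlnp | grep :50010
    sudo lsof -i :50010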
08-04-2016
09:53 PM
1 Kudo
The question is unclear to me but I recommend reading the following three blog posts carefully as they go into great detail about balancer basics, configuration and best practices: https://community.hortonworks.com/articles/43615/hdfs-balancer-1-100x-performance-improvement.html https://community.hortonworks.com/articles/43849/hdfs-balancer-2-configurations-cli-options.html https://community.hortonworks.com/articles/44148/hdfs-balancer-3-cluster-balancing-algorithm.html
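For quick orientation before reading the articles, a basic balancer run looks like this; the threshold and bandwidth values are examples to adjust per the guidance in those posts:

    # Allow each DataNode to use up to ~100 MB/s for balancing traffic
    hdfs dfsadmin -setBalancerBandwidth 104857600
    # Balance until every DataNode is within 10% of the cluster's average utilization
    hdfs balancer -threshold 10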
08-04-2016
09:47 PM
2 Kudos
Hi @ripunjay godhani, we no longer recommend setting up NameNode HA with NFS. Instead, please use the Quorum Journal Manager setup. The Apache HA with QJM documentation is a good start: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html NameNode image files will be stored on two nodes (active and standby NN) in this setup. The latest edit logs will be on the active NameNode and at least two JournalNodes (usually all three, unless one JournalNode has an extended downtime). The NameNodes can optionally be configured to write their edit logs to separate NFS shares if you really want to, but it is not necessary. You don't need RAID 10. HDFS HA with QJM provides good durability and availability with commodity hardware.
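The QJM wiring is a single setting pointing the NameNodes at the JournalNode quorum; the hostnames and nameservice ID below are placeholders:

    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <!-- the three JournalNodes and the nameservice ID are examples -->
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>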
08-04-2016
09:33 PM
1 Kudo
@sgowda, thanks for confirming you just want to mount volumes at a new location. If you are just remounting, then your existing HDFS metadata and data files will be present but under new Linux paths. In that case decommissioning is not necessary. You just need to update NameNode and DataNode configuration settings like dfs.namenode.name.dir and dfs.datanode.data.dir to point to the new locations (see the sketch below). See this link for a full list of settings; not all may apply to you. Don't reformat the NN or you will lose all your data. The simplest approach is:
1. Take a full cluster downtime and bring down all HDFS services.
2. Remount volumes at the new location on all affected nodes.
3. Update NN and DN configurations via Ambari to point to the new storage roots.
4. Restart services.
If you are not familiar with these settings, I recommend learning more about HDFS first, since it's easy to lose data via administrative mistakes.
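Step 3 amounts to pointing these two properties at the new mount points; the paths below are placeholders for wherever you remount the volumes:

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/newmount/hadoop/hdfs/namenode</value> <!-- example new location -->
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/newmount/hadoop/hdfs/data</value> <!-- example new location -->
    </property>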
08-03-2016
11:13 PM
Are you just mounting volumes at a new location?
08-01-2016
11:33 PM
3 Kudos
Hi @Facundo Bianco, you are using a privileged port number (1004) for data transfer so you cannot enable SASL. Please check your hdfs-site.xml to ensure SASL is not enabled via dfs.data.transfer.protection. The Secure DataNode section from the Apache HDFS documentation describes this. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SecureMode.html#Secure_DataNode Since you are using HDP with Ambari, I recommend using the Ambari Kerberos Wizard especially if you are setting it up for the first time. At the very least it will provide you with a working reference configuration. The Ambari Kerberos Wizard is documented here: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Ambari_Security_Guide/content/_running_the_kerberos_wizard.html
08-01-2016
08:16 PM
4 Kudos
If you have set up automatic failover with ZooKeeper Failover Controllers, then the ZKFC processes will automatically transition the Standby NN to Active status if the current active is unresponsive. The decision about which NN should be made active is taken by the ZKFC instances (coordinating via ZooKeeper). Ambari does not decide which NN should be active. If you wish to perform a manual failover then you can use the hdfs haadmin command as @Sagar Shimpi suggested. Both alternatives are described in the HDP documentation: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hadoop-ha/content/ha-nn-deploy-nn-cluster.html https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hadoop-ha/content/nn-ha-auto-failover.html If you want to better understand the internals of automatic NN failover (recommended if you are administering a Hadoop cluster with HA), I recommend reading the Apache docs, specifically the section on Automatic Failover. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
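A manual, graceful failover is a single command; nn1 and nn2 are placeholders for the NameNode IDs defined in your dfs.ha.namenodes.<nameservice> setting:

    # Gracefully fail over from nn1 (currently active) to nn2
    hdfs haadmin -failover nn1 nn2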