Member since: 03-22-2017
Posts: 54
Kudos Received: 12
Solutions: 11
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1589 | 10-21-2021 12:49 AM |
 | 927 | 04-01-2021 05:31 AM |
 | 952 | 03-30-2021 04:23 AM |
 | 2002 | 03-23-2021 04:30 AM |
 | 2682 | 03-05-2021 04:33 AM |
10-21-2021
12:49 AM
@DA-Ka You can use the HDFS Find tool "org.apache.solr.hadoop.HdfsFindTool" for that purpose. Refer to the link below, which describes a method for finding old files: - http://35.204.180.114/static/help/topics/search_hdfsfindtool.html However, note that the search-based HDFS Find tool has been removed in CDH 6 and is superseded by the native "hdfs dfs -find" command, documented here: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
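As a minimal sketch on CDH 6 / Hadoop 3 (the /data path, the "*.log" pattern and the 2021-01-01 cut-off date are only examples): the native find command matches by name, so filtering by age is usually done by post-processing a recursive listing.
$ hdfs dfs -find /data -name "*.log" -print
$ hdfs dfs -ls -R /data | grep -v "^d" | awk '$6 < "2021-01-01" {print $8}'
The first command lists files matching a name pattern; the second prints the paths of files whose modification date (column 6 of the listing) is older than the chosen date.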
10-21-2021
12:38 AM
@PrernaU Can you provide more details about this: "The objective it to share the data between tow CDP clusters." Are you trying to copy data between two distinct clusters? Are you looking for a solution such as an HDFS replication task? If yes, please have a look at the Replication Manager tool in CDP for that purpose: - https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/replication-manager/topics/rm-dc-configuring-replication-of-hdfs-data.html
10-21-2021
12:24 AM
@PrernaU Just wanted to check whether you got a chance to review our blog post on ViewFS here - https://blog.cloudera.com/global-view-distributed-file-system-with-mount-points/ You may also refer to the community article describing the configuration steps - https://community.cloudera.com/t5/Community-Articles/Enabling-and-configuring-the-ViewHDFS-client-side-mounts-in/ta-p/306752 If you have reviewed the above pages and are still running into issues, let us know.
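For reference, a minimal client-side ViewFS mount-table sketch in core-site.xml. The mount-table name "globalcluster", the /data and /archive paths, and the nameservice URIs are illustrative placeholders, not values from your clusters:
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://globalcluster</value>
</property>
<property>
  <name>fs.viewfs.mounttable.globalcluster.link./data</name>
  <value>hdfs://nameservice1/data</value>
</property>
<property>
  <name>fs.viewfs.mounttable.globalcluster.link./archive</name>
  <value>hdfs://nameservice2/archive</value>
</property>
Each fs.viewfs.mounttable.*.link.* entry maps a path under the unified viewfs:// namespace to a directory on one of the underlying HDFS nameservices.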
06-17-2021
11:45 AM
Hello @sipocootap2 The failover controller log snippet you shared indicates that the HealthMonitor thread on the Active NameNode could not fetch the state of the local NameNode (via the health-check RPC) within the "ha.health-monitor.rpc-timeout.ms" timeout period of 45 sec (45000 ms). Since there was no response from the local NN within the timeout period, the NN service entered the "SERVICE_NOT_RESPONDING" state.
NOTE: "The HealthMonitor is a thread which is responsible for monitoring the local NameNode. It operates in a simple loop, calling the monitorHealth RPC. The HealthMonitor maintains a view of the current state of the NameNode based on the responses to these RPCs. When it transitions between states, it sends a message via a callback interface to the ZKFC."
The condition you cited suggests the local NN (the Active NameNode here) went unresponsive, hung, or was too busy to respond. Hence the local failover controller (active NN ZKFC) triggered a NameNode failover after the monitorHealth RPC timed out, and asked the failover controller on the Standby NameNode host (standby NN ZKFC) to promote/transition the local standby NN to the Active state.
Answers to your queries:
Q) I have no idea why SocketTimeoutException was raised while doing doHealthChecks.
A) It looks like the Active NN was unresponsive or busy, so the RPC call timed out (surfacing as a socket timeout exception).
Q) "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout" - when I look for PORT2 on the namenode, that port doesn't seem to be used.
A) The PORT2 (local=/NAMENODE:PORT2) you see is an ephemeral (random) port used by the HealthMonitor RPC client to communicate with the local NN service port 8022 (remote=NAMENODE/NAMENODE:PORT). Since the health monitor thread is local to the NN (it runs on the same node as the NN), you see the NN hostname appearing as both the local and the remote endpoint.
Ref: https://community.cloudera.com/t5/Support-Questions/Namenode-failover-frequently/td-p/41122
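If the NameNode is merely busy (for example, long GC pauses or heavy RPC load) rather than truly dead, one common mitigation is to raise the health-monitor timeout. A minimal sketch, assuming the property is added to core-site.xml (for example via the CM safety valve); the 90000 ms value is only an example, not a recommendation for your cluster:
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value>
</property>
This only widens the window before the ZKFC declares the NN unhealthy; the root cause of the unresponsive NameNode should still be investigated.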
04-01-2021
05:31 AM
Hello @Amn_468, As you explained, the /data mount point is used for YARN, Kudu and Impala in addition to the DN storage volumes. HDFS counts the disk usage of /data/dfs/dn as HDFS/DFS used, and all remaining disk usage as non-HDFS usage. If the "/data" mount point is used for the YARN local directory (/data/yarn/nm), the Kudu data/WAL directories (/data/kudu/*) or the Impala scratch directory (/data/impala/*), then the usage of those directories is counted as non-DFS usage. In general, the YARN local directory and the Impala scratch directory are emptied after a successful job run. If files remain from a previous job run that was killed/aborted, you need to remove them manually to recover the disk space. Kudu space, however, will remain utilised as long as the mount point is used by the Kudu service. You can calculate the disk usage of each service, and from that how much space you would recover if the YARN local directory and Impala scratch directory data were removed entirely. If you are running on an ext4 file system and are low on available space, also consider lowering the superuser block reservation from 5% to 1% (using "tune2fs -m 1") on the file system, which will give you some more free space on the mount point.
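A quick sketch of how to size this up (directory paths are taken from the example above; the device name /dev/sdb1 is only a placeholder for whatever device actually backs /data):
# Per-directory usage under the shared /data mount
du -sh /data/dfs/dn /data/yarn/nm /data/kudu /data/impala
# Lower the ext4 reserved-block percentage from 5% to 1% on the underlying device
tune2fs -m 1 /dev/sdb1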
03-30-2021
05:35 AM
1 Kudo
Hello @wert_1311 You can balance the disk usage across a DataNode's storage volumes using the intra-DataNode disk balancer feature, available in CDH 5.8.2 and later. You need to enable the feature by adding the "dfs.disk.balancer.enabled" configuration to HDFS via the HDFS safety valve snippet in Cloudera Manager, following the blog here - https://blog.cloudera.com/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/ A typical disk-balancer task involves three phases (implemented via the "hdfs diskbalancer" command): plan, execute, and query. The overall workflow is: 1. Enable the intra-disk balancer config in HDFS 2. "Plan" the intra-disk balancer 3. Execute the created plan 4. Query the running/executed plan 5. Verify the balancer report (a command sketch follows below). For more info, refer to the Apache doc here - https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html Thanks and Regards, Pabitra Das
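To illustrate the plan/execute/query phases, a minimal command sketch; the DataNode hostname dn01.example.com and the generated plan-file path are placeholders (the actual path is printed by the -plan step):
# Generate a balancing plan for one DataNode
hdfs diskbalancer -plan dn01.example.com
# Execute the plan file produced by the previous step
hdfs diskbalancer -execute /system/diskbalancer/2021-Mar-30/dn01.example.com.plan.json
# Check the status of the running or completed plan
hdfs diskbalancer -query dn01.example.com
# Print a volume-usage report to verify the result
hdfs diskbalancer -report -node dn01.example.com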
03-30-2021
04:23 AM
1 Kudo
Hello @Amn_468 Please note that you get the block count alert after hitting the warning/critical threshold value set in the HDFS configuration. It is a monitoring alert and does not impact HDFS operations as such. You may increase the monitoring threshold value in CM (CM > HDFS > Configuration > DataNode Block Count Thresholds). However, CM monitors the block counts on the DataNodes to ensure you are not writing many small files into HDFS; an increase in block counts on DNs is an early warning of small-file accumulation. The simplest way to check whether you are hitting a small-files issue is to check the average block size of HDFS files. Fsck shows the average block size. If it is too low a value (e.g. ~1 MB), you might be hitting the small-files problem and it would be worth looking into; otherwise, there is no need to review the number of blocks.
$ hdfs fsck /
...
 Total blocks (validated): 2899 (avg. block size 11475601 B) <<<<<
Similarly, you can get the average file size in HDFS with a one-liner such as:
$ hdfs dfs -ls -R / | grep -v "^d" | awk '{OFMT="%f"; sum+=$5} END {print "AVG File Size =", sum/NR/1024/1024 " MB"}'
The file size reported by Reports Manager under "HDFS Reports" in Cloudera Manager can differ, as that report is extracted from an FSImage that is more than an hour old (not the latest one). Hope this helps. If you have further questions, feel free to update the thread; otherwise please mark it as solved. Regards, Pabitra Das
03-23-2021
04:30 AM
1 Kudo
Hello @meenzoon It seems the Cloudera Manager service itself is not running. Could you please check the CM Server status (# service cloudera-scm-server status) on the host? If it is not running, please restart the CM service (cloudera-scm-server) and then check the role status. If it still reports unknown health for the management host, check the health alert and share the message here. In case of a CM Server startup failure, please check the CM Server log on the host; it should provide insight into the cause of the failure.
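A minimal sketch of those checks (the log path is the default install location, adjust if yours differs; on systemd-based hosts you can use systemctl instead of service):
# Check and, if needed, restart the Cloudera Manager Server
service cloudera-scm-server status
service cloudera-scm-server restart
# Inspect the most recent server log entries for startup errors
tail -n 200 /var/log/cloudera-scm-server/cloudera-scm-server.log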
03-22-2021
04:50 AM
Hello @pauljoshiva You need to add the new nodes with a new config group: one set of DNs in the default config group (where the storage directories span /hdp/hdfs01 - /hdp/hdfs09) and another set of DNs in the new config group (with directories /hdp/hdfs01, /hdp/hdfs02, /hdp/hdfs03). That way you can have all DNs in the cluster under 2 separate config groups.
03-16-2021
03:08 AM
Hello @Monds You can recover the lease on the file by running the command below: # hdfs debug recoverLease -path <path-of-the-file> [-retries <retry-times>] This command asks the NameNode to try to recover the lease for the file (and successfully close the file if there are still healthy replicas). Ref: https://blog.cloudera.com/understanding-hdfs-recovery-processes-part-1/
03-15-2021
04:40 AM
Hello @Babar Thank you for resolving the issue and marking the thread as solved. Glad to know that you identified the problem and resolved it. Please note that HDFS-14383 (Compute datanode load based on StoragePolicy) has been included in the recent CDP 7.1.5 and 7.2.x releases.
03-13-2021
04:53 AM
1 Kudo
Yes, it is applicable to the CDP 7.x releases, @novice_tester
03-12-2021
11:00 AM
2 Kudos
Hello @novice_tester Cloudera validates and tests against the latest browsers such as Google Chrome, Firefox, Safari and MS Edge. Please refer to the supported-browsers pages here - https://my.cloudera.com/supported-browsers.html and - https://docs.cloudera.com/management-console/cloud/requirements-aws/topics/mc-supported-browsers.html
03-12-2021
10:47 AM
Hello @Babar, It seems the DN disk configuration (dfs.datanode.data.dir) is not appropriate. Could you please configure the disks as described here - https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_heterogeneous_storage_oview.html#admin_heterogeneous_storage_config If your SSD disks are mounted as below:
/dn_vg1/vol1_ssd -----> mounted as ----> /data/1
/dn_vg2/vol2_ssd -----> mounted as ----> /data/2
/dn_vg3/vol3_ssd -----> mounted as ----> /data/3
and the SCSI/SATA disks are mounted as below:
/dn_vg1/vol1_disk -----> mounted as ----> /data/4
/dn_vg2/vol2_disk -----> mounted as ----> /data/5
then configure the DN data directories (dfs.datanode.data.dir) as follows:
- dn-1: "[SSD]/data/1/dfs/dn"
- dn-2: "[SSD]/data/1/dfs/dn,[SSD]/data/2/dfs/dn"
- dn-3: "[DISK]/data/4/dfs/dn,[SSD]/data/3/dfs/dn,[DISK]/data/5/dfs/dn"
You need to create the /dfs/dn directories with ownership hdfs:hadoop and permission 700 on each mount point so that the volume can be used to store blocks. Please check the mount points and reconfigure the data directories.
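A quick sketch of the directory-creation step on one DataNode (mount points taken from the example above; run it only for the mounts that node actually has):
# Create the block directories under each mount point and set ownership/permissions
mkdir -p /data/{1,2,3,4,5}/dfs/dn
chown -R hdfs:hadoop /data/{1,2,3,4,5}/dfs/dn
chmod 700 /data/{1,2,3,4,5}/dfs/dn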
03-05-2021
04:33 AM
1 Kudo
Hello @uxadmin, Thank you for asking a follow-up question. Please note that the NameNode is responsible for keeping the metadata of the files/blocks written into HDFS, so an increase in block count means the NameNode has to keep more metadata and may need more heap memory. As a rule of thumb, we suggest 1 GB of NameNode heap memory for every 1 million blocks in HDFS. Similarly, every 1 million blocks on a DN requires roughly 1 GB of heap memory to operate smoothly. As I said earlier, there is no hard limit on the number of blocks a DN can store, but having too many blocks is an indication of small-file accumulation in HDFS. You need to check the average block size in HDFS to understand whether you are hitting a small-files issue. Fsck shows the average block size. If it is too low a value (e.g. ~1 MB), you might be hitting the small-files problem and it would be worth looking into; otherwise, there is no need to review the number of blocks.
$ hdfs fsck /
...
 Total blocks (validated): 2899 (avg. block size 11475601 B) <<<<<
In short, there is no fixed block count threshold for a DN, but an increase in DN block counts is an early indicator of a small-files issue in the cluster. Of course, more small files means more heap memory for both the NN and the DN. In a perfect world where all files are created with a 128 MiB block size (the HDFS default), 1 TB of DN storage can hold 8192 blocks (1024*1024/128). By that calculation, a DN with 23 TB can hold 188,416 blocks; but realistically not all files are created with 128 MiB blocks, and not all files occupy an entire block. So in a normal CDH cluster installation we keep a minimal value of 500,000 as the warning threshold for DN block counts, although depending on your use case and file-write pattern that threshold may still be hit over time. A more tailored threshold can be derived from the DataNode disk size used for storing blocks: say you have allocated 10 disks of 2 TB each (/data/1/dfs/dn through /data/10/dfs/dn) for block writes on a DataNode, giving 20 TB for blocks; if you are writing files with an average block size of 10 MB, you can accommodate at most 2,097,152 blocks (20 TB / 10 MB) on that DN, so a warning threshold of 1,000,000 would be reasonable (see the short worked calculation below). Hope this helps. If you have further questions, feel free to reply. Cheers! In case your question has been answered, make sure to mark the answer as the accepted solution. If you find a reply useful, say thanks by clicking on the thumbs up button.
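A minimal sketch of that capacity arithmetic (the 20 TB capacity and 10 MB average block size are the assumed numbers from the example above):
# blocks = capacity_in_MB / avg_block_size_in_MB
echo $(( 20 * 1024 * 1024 / 10 ))   # 20 TB / 10 MB = 2097152 blocks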
03-04-2021
10:38 AM
Hi @dv_conan, A similar issue is addressed here - https://community.cloudera.com/t5/Support-Questions/failed-to-execute-command-install-yarn-mapreduce-framework/td-p/301804 Please review it, make the necessary changes to the directory permissions, and let us know if that helped.
03-04-2021
10:30 AM
Hello @samglo, Please note Solr CDCR is not yet supported in CDP. Refer to the Cloudera blog on Solr CDCR (Cross Data Center Replication) support: - https://blog.cloudera.com/backup-and-disaster-recovery-for-cloudera-search/ Quoting the "Solr CDCR" section of that blog: "The future holds the promise of a Solr to Solr replication feature as well, a.k.a. CDCR. This is still maturing upstream and will need some time to further progress before it can be considered for mission critical production environments. Once it matures we will evaluate its value in addition to all our existing options of recovery for Search. The above solutions, presented in this blog, are production-proven and provides a very good coverage along with flexibility for today’s workloads." However, you can refer to the Apache document on Solr CDCR below for information about the setup: - https://solr.apache.org/guide/6_6/cross-data-center-replication-cdcr.html or the Cloudera Community article - https://community.cloudera.com/t5/Community-Articles/How-to-setup-cross-data-center-replication-in-SolrCloud-6/ta-p/247945
03-04-2021
10:06 AM
Hello @nj20200 It seems an already-installed openssl package (openssl-libs-1.0.2k-19.el7.x86_64) conflicts with the openssl-devel package you are trying to install (openssl-devel-1.0.1e-60.el7.x86_64), which is causing the installation failure. So instead of installing that package directly, either update openssl-devel so that it matches the installed openssl-libs version (e.g. "yum update openssl-devel"), or remove the conflicting package first and then install the required version of openssl-devel.
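A minimal sketch of how to verify the conflict (standard RHEL/CentOS commands; the package names come from the error above):
# See which openssl packages and versions are currently installed
rpm -qa | grep -i openssl
# Bring openssl-devel in line with the installed openssl-libs version
yum update openssl-devel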
03-04-2021
07:25 AM
Hello @uxadmin Please note that the block count threshold configuration is intended for DataNodes only. It is a DataNode health test that checks whether the DataNode has too many blocks, because having too many blocks on a DataNode may affect its performance. There is no hard limit on the number of blocks writable to a DN, as block size is merely a logical concept, not a physical layout. However, the block count alert serves as an early warning of a growing small-files issue. While a DN can handle a lot of blocks in general, going too high will cause performance issues; your processing speeds may drop if you keep a lot of tiny files on HDFS (depending on your use case, of course), so it is worth looking into. You can find the block count threshold in the HDFS config by navigating to CM > HDFS > Configuration > DataNode Block Count Thresholds. When the block count on a DN goes above the threshold, CM triggers an alert, so you need to adjust the threshold value based on the block counts on each DN. You can determine the block count on each DN by navigating to CM > HDFS > WebUI > Active NN > DataNodes tab > Block counts column under the Datanode section. Hope this helps.
03-02-2021
09:09 AM
Hello @kolli_sandeep, It seems the failover controllers are down in the cluster. Please follow the steps here [1] and start the Failover Controller roles, which will transition the NameNodes to the Active/Standby states. You need to follow the steps below:
1. Stop the Failover Controller roles under the HDFS > Instances page.
2. Remove the HA state from ZK. On a ZooKeeper server host, run zookeeper-client and execute the following to remove the configured nameservice (this example assumes the nameservice is named nameservice1; you can identify the nameservice from the Federation and High Availability section on the HDFS Instances tab): rmr /hadoop-ha/nameservice1 (If you don't see a /hadoop-ha znode in the ZK znode list, skip this step.)
3. After removing the HA znode in ZK, go to CM and click HDFS > Instances > Federation and High Availability > Actions. Under the Actions menu, select Actions > Initialize High Availability State in ZooKeeper.
4. Start the Failover Controller roles (CM > Instances > select Failover Controllers > Actions for Selected > Start).
5. Verify the NameNode states. If you don't see the Active/Standby states of the NNs, or if anything fails, just restart the HDFS service.
[1] https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_hag_hdfs_ha_enabling.html
03-02-2021
12:06 AM
Hello @raghurok, Could you please check now and see whether you are still getting the timeout error? I believe the timeout was due to a network glitch or a maintenance activity. I hope you will be able to access it now.
03-01-2021
09:06 AM
1 Kudo
Hello @muslihuddin, Please note that while enabling HA, CM puts all 3 JournalNodes into a single group called "Default Group" by default, assuming you are going to use the same config value for the 3 JN directories. Since you are using /app/jn on one node and /data/jn on the other 2 JN nodes, it created two separate JN config groups. However, to prevent the CM alert, you can set /data/jn in the JN default group config so that those 2 JNs are part of the Default config group rather than a separate one, while the 3rd JN continues to operate in a separate config group until you switch it to use the /data/jn directory as its edits directory. Just in case you need to change the JN directory on any JN, refer to the steps here - https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_mc_jn.html
02-01-2021
10:58 AM
1 Kudo
Hi @pauljoshiva Though it is expected to have a uniform disk configuration across the DataNodes in a cluster, you can have two different sets of disk configurations on the DNs. You can have one 2 TB partition on each disk (3 * 2 TB = 6 TB on each new DN) even though the existing DNs have a 1 TB partition on each of their 9 disks (9 * 1 TB = 9 TB per DN). There will be no issue running DNs with such a configuration, but you may see the 6 TB DNs filling up faster than the 9 TB DNs, because the NN does not consider the available free space on a DN before writing blocks to it; the NameNode picks the DN essentially at random after evaluating the network distance of the DN from the client. Hope this helps. Thank you
02-01-2021
10:35 AM
Hello @vvk Please note that while adding/removing JournalNodes in a running cluster, you need to ensure a quorum of JournalNodes remains available for the NameNodes. (As cited in the shared document: NameNode high availability requires that you maintain at least three active JournalNodes in your cluster.) The NameNode requires at least a quorum of JournalNodes (2 of 3) to be available for edit-log writes at any given point; if it fails to write edits to a quorum of JournalNodes, the NameNode is expected to crash (shut itself down). I believe this could be the scenario in your case. So you need to add the new JournalNodes to the cluster first, before removing the old JournalNodes one by one, ensuring a quorum of JournalNodes remains available throughout. If you see the NN crash even though the edit-log write was successful on a quorum of JNs, then we need to check the NN log for other issues. Thank you
11-11-2020
02:06 AM
Hello @Amn_468 Since you reported the DN pause time, I referred to the DN heap only. The block count on most of the DNs appears to be >6 million, hence I would suggest increasing the DN heap to 8 GB (from the current value of 6 GB) and performing a rolling restart to bring the new heap size into effect. There is no straightforward way to say you have hit the small-files problem, but if your average block size is a few MB or less than a MB, it is an indication that you are storing/accumulating small files in HDFS. The simplest way to determine whether there are small files in the cluster is to run fsck. Fsck shows the average block size. If it is too low a value (e.g. ~1 MB), you might be hitting the small-files problem and it would be worth looking into; otherwise, there is no need to review the number of blocks.
$ hdfs fsck /
...
 Total blocks (validated): 2899 (avg. block size 11475601 B) <<<<<
You may refer to the links below for help with dealing with small files: - https://blog.cloudera.com/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/ - https://community.cloudera.com/t5/Community-Articles/Identify-where-most-of-the-small-file-are-located-in-a-large/ta-p/247253
11-09-2020
09:42 AM
Hello @Masood, I believe you are asking for the commands to run to determine the active NN, apart from the CM UI (CM > HDFS > Instances > NameNode). From the CLI you have to run a couple of commands to determine the Active/Standby NNs.
List the namenode hostnames:
# hdfs getconf -namenodes
c2301-node2.coelab.cloudera.com c2301-node3.coelab.cloudera.com
Get the nameservice name:
# hdfs getconf -confKey dfs.nameservices
nameservice1
Get the namenode IDs of the nameservice:
# hdfs getconf -confKey dfs.ha.namenodes.nameservice1
namenode11,namenode20
Get the active and standby namenodes:
# su - hdfs
$ hdfs haadmin -getServiceState namenode11
active
$ hdfs haadmin -getServiceState namenode20
standby
Get the active and standby namenode hostnames:
$ hdfs getconf -confKey dfs.namenode.rpc-address.nameservice1.namenode11
c2301-node2.coelab.cloudera.com:8020
$ hdfs getconf -confKey dfs.namenode.rpc-address.nameservice1.namenode20
c2301-node3.coelab.cloudera.com:8020
If you want to get the active namenode hostname from the hdfs-site.xml file, you can go through the following python script on GitHub – https://github.com/grakala/getActiveNN . Thank you
11-09-2020
09:22 AM
Hello @sace17 It seems your problem is related to the credential cache. Per "https://bugzilla.redhat.com/show_bug.cgi?id=1029110", if the keyring ccache is changed from UID to username as below, it is not possible to get a ticket as a non-root user:
default_ccache_name = KEYRING:persistent:%{username}
We have a KB article that discusses the problem - https://community.cloudera.com/t5/board/article/ta-p/74262 Per the KB article, CDH/Hadoop components do not fully support the advanced Linux KEYRING feature for storing Kerberos credentials. Remove any global profile setting for the environment variable KRB5CCNAME; if no type prefix is present, the FILE type is assumed, which is supported by CDH/Hadoop components. Please remove/comment out the default_ccache_name setting in the /etc/krb5.conf file on all cluster nodes, and that should solve your problem. Ref: a community post on the same problem here - https://community.cloudera.com/t5/Support-Questions/Kerberos-Cache-in-IPA-RedHat-IDM-KEYRING-SOLVED/td-p/108373 Additional reference: - https://web.mit.edu/kerberos/krb5-1.12/doc/basic/ccache_def.html Thank you
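As a small illustration of that change in /etc/krb5.conf (only the default_ccache_name line comes from your setup; the section header is the standard location for it):
[libdefaults]
  # Comment out the keyring ccache so the default FILE-based cache is used instead
  # default_ccache_name = KEYRING:persistent:%{username}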
11-09-2020
09:06 AM
Hello @AlexP Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#setrep Referring to the HDFS document, answers to your questions are inline.
[Q1.] How to estimate how much time would this command take for a single directory (without -w)?
[A1.] It depends on the number of files in the directory. If you run setrep against a path which is a directory, the command recursively changes the replication factor of all files under the directory tree rooted at that path, so the time varies with the file count under the path/directory.
[Q2.] Will it trigger a replication job even if I don't use the '-w' flag?
[A2.] Yes, replication will be triggered without the -w flag. However, it is good practice to use -w to ensure all files have the required replication factor set before the command exits. Please note, the -w flag requests that the command wait for the replication to complete. Using -w can make the command take a long time to finish, but it guarantees that the replication factor has been changed to the specified value.
[Q3.] If yes, does it mean that the NameNode will actually start deleting 'over-replicated' blocks of all existing files under a particular directory?
[A3.] Yes, your understanding is correct. The additional replica of each block marks the block as over-replicated, and it will be deleted from the cluster. This is done for each file under the directory path, keeping only 2 replicas of the file blocks.
Hope this helps.
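A minimal usage sketch for reference (the /data/archive path and the target replication factor of 2 are examples matching the discussion above):
# Recursively set replication factor 2 on everything under /data/archive and wait for completion
hdfs dfs -setrep -w 2 /data/archive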
11-09-2020
08:29 AM
Hello @Amn_468 The DN pause alert you see on 1 of the 9 DataNodes is an indication of a growing block count on it; compared to the other DNs, this DN has possibly stored more blocks than the other nodes. You can compare the block counts of the DNs via CM > HDFS > WebUI > Active NN Web UI > DataNodes > check the Blocks column under the "In operation" section. The log snippet you shared indicates a pause of only 2 sec, which is not a sign of worry. However, with a proper JVM heap size allocated for the DN, you may avoid these frequent pause alerts. As a rule of thumb you need 1 GB of heap for 1 million blocks, and since you have 6 GB allocated for the DN heap, please verify the block counts on the DNs and ensure they are not too high (>6 million), which could explain why there are so many pause alerts. If the block count is much higher than expected, you need to increase the heap size to accommodate the block objects in the JVM heap memory. On a side note, a growing block count is also an early warning/indication of a small-files problem in the cluster; you need to be vigilant about that. Verify the average block size, and that will help you understand whether you have a small-files problem in your cluster. Regards, Pabitra Das
09-30-2020
10:14 PM
Hello @vincentD Please review the stdout and stderr of the DN that is going down frequently. You can navigate to CM > HDFS > Instances > select the DN which went down > Processes > click on stdout/stderr at the bottom of the page. I am asking you to verify stdout/stderr because I suspect an OOM error (the Java heap running out of memory) is causing the DN to exit/shut down abruptly. If the DN exit is due to an OOM error, please increase the DN heap size to an adequate value to get rid of the issue. The DN heap sizing rule of thumb is: 1 GB of heap memory for 1 million blocks. You can verify the block counts on each DN by navigating to CM > HDFS > NN Web UI > Active NN > DataNodes, where the page shows DN stats such as block counts and disk usage.