Member since: 01-19-2017
Posts: 3676
Kudos Received: 632
Solutions: 372
12-07-2020
01:33 PM
@Yuriy_but Absolutely, that's not normal. There are a couple of things to check: a bad disk or corruption on that DataNode. Force a checkpoint and restart the DN to see the behavior. Ensure the NN has enough memory allocated. Remove the node from the cluster and rejoin it afresh after reformatting. Check the NIC and network connectivity. Run the balancer to distribute the data equally across the cluster. A sketch of the relevant commands is below.
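Here is a minimal sketch of the commands involved, assuming HDFS superuser access and a Hadoop 3.x layout (on Hadoop 2.x use hadoop-daemon.sh on the DataNode host instead of "hdfs --daemon"); adjust the balancer threshold to your cluster:
$ hdfs dfsadmin -safemode enter      # enter safe mode before forcing a checkpoint
$ hdfs dfsadmin -saveNamespace       # force the NameNode to write a fresh fsimage
$ hdfs dfsadmin -safemode leave
$ hdfs --daemon stop datanode        # restart the DataNode on the affected host
$ hdfs --daemon start datanode
$ hdfs balancer -threshold 10        # redistribute blocks evenly (10% threshold)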
11-09-2020
10:25 AM
@ni4ni @Masood Unfortunately I have to dispute @Masood's response, with a reference to the hadoop.apache.org documentation (see the link below). In an HA setup, the standby does effectively do the checkpointing. To keep this thread correct as a community reference, please un-accept that answer. www.hadoop.org Here is a quote from the above website: "Note that, in an HA cluster, the Standby NameNodes also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode." Happy hadooping
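If you want to see this on your own cluster, here is a quick sketch; the service IDs nn1/nn2 and the metadata directory are assumptions, so substitute your own values:
$ hdfs haadmin -getServiceState nn1    # reports "active" or "standby"
$ hdfs haadmin -getServiceState nn2
# On the standby host, new fsimage_* files appearing under dfs.namenode.name.dir
# (assumed path below) show that the standby is the one doing the checkpoints
$ ls -lt /dfs/nn/current/ | grep fsimage | head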
11-08-2020
01:35 PM
@jlguti Could you share your /etc/hosts file entries? You can scramble the hostname and domain but leave the format intact. Can you also check the value of yarn.resourcemanager.hostname in the Advanced yarn-site and confirm it has a correct entry for the ResourceManager? See the sketch below for the format and a quick check.
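For reference, this is the kind of format I mean; the IPs and hostnames below are hypothetical placeholders, and /etc/hadoop/conf is an assumed config location:
# /etc/hosts should map each node as: <IP> <FQDN> <shortname>
192.168.1.10   master1.example.com   master1
192.168.1.11   worker1.example.com   worker1
# Quick checks on the node (adjust the config path to your distribution)
$ grep -A1 'yarn.resourcemanager.hostname' /etc/hadoop/conf/yarn-site.xml
$ ping -c 1 master1.example.com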
11-08-2020
03:30 AM
@Amn_468 The NameNode is the brain of the cluster: it holds the footprint of the cluster, the location of the files, the ACLs, the HDFS metadata and the directory tree of all files in the file system, and it tracks the files across the cluster. It does not store the actual data or the dataset; the data itself is stored on the DataNodes.
Your error:
2020-10-27 16:20:05,140 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1821ms GC pool 'ParNew' had collection(s): count=1 time=2075ms
This indicates that the NameNode JVM is pausing for garbage collection. When these pauses pile up, the DataNode does not get a response from the NameNode within the designated 60000ms, which matches what you are seeing. The warning also shows the pause was due to GC, which calls for memory and GC tuning.
The NameNode knows the location and the list of blocks of every file, and with this information it knows how to reconstruct a file from its blocks. The fastest way to serve this information is to keep it in memory; that is why the NN is usually a high-end server configured with a lot of RAM, because the block locations are stored in RAM.
An ideal starter configuration in production would be:
NameNode configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet
DataNode configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
A fundamental parameter for garbage-collector tuning is the number of files, and associated HDFS blocks, stored in the cluster, in your case 23,326,719 files. The NameNode maintains the complete directory structure in memory, so more files mean more objects to manage. Most of the time, Hadoop clusters are configured without knowing the final workload in terms of the number of files that will be stored; keeping the strong connection between these two aspects in mind is crucial to anticipate future turbulence in the HDFS quality of service. You should analyze the output of the garbage collector, the gc.log files found in the NameNode logs directory, to see whether the available heap is filling up before the garbage collector is able to release it. Hope that helps. A hedged sketch of the GC-logging and heap settings follows below.
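As a starting point, this is roughly what that tuning could look like in hadoop-env.sh, assuming a Java 8 JVM with the ParNew/CMS collectors your log mentions; the heap size and GC log path are assumptions you must size for your own hardware and file count:
# hadoop-env.sh (sketch): larger NameNode heap plus GC logging for analysis
export HADOOP_NAMENODE_OPTS="-Xms32g -Xmx32g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-hdfs/namenode-gc.log \
  ${HADOOP_NAMENODE_OPTS}"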
11-08-2020
02:45 AM
@jlguti There can be a barrage of reasons. Can you share the size of your cluster [Master & Slaves] and, most important, your spark-submit config? I see you have a total memory of 2GB. Please also share the logs of the container. A sketch of the kind of spark-submit settings I mean is below.
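For reference, this is the kind of spark-submit invocation I am asking about; it is a hypothetical example only, and the class, jar path, master and memory values are placeholders rather than recommendations:
# Hypothetical example: replace class, jar and resource settings with your own
$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    --num-executors 2 \
    --class com.example.MyApp \
    /path/to/my-app.jar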
11-06-2020
02:32 PM
@AlexP Changing the default replication factor (dfs.replication) doesn't change the replication factor of existing files; it only applies to new files created after the change. You will have to change the replication factor of the old files manually with the "hdfs dfs -setrep" command.
To bulk-change the replication factor of a directory tree:
$ hdfs dfs -setrep -R -w 2 /apps/
To change the replication factor of a single file:
$ hdfs dfs -setrep -w 3 /apps/testfile.txt
Reducing the replication factor speeds up writes, as you are writing to fewer DataNodes, and also reduces NameNode metadata, but it adds overhead on reads because it's harder to find a node that has a replica.
How to estimate how much time this command would take for a single directory (without -w)? That depends on the size of the data and your cluster's processing power.
Will it trigger a replication job even if I don't use the '-w' flag? Yes. Once you change the replication factor, the internal block reporting mechanism kicks in to update the NameNode about the replicas, and the excess replicas are marked as over-replicated and good for deletion.
If yes, does it mean that the NameNode will actually start deleting 'over-replicated' blocks of all existing files under a particular directory? After reducing the replication factor the data blocks become over-replicated; the NameNode detects that from its metadata and chooses a replica to remove. It removes the replica from the DataNode with the least amount of available disk space, which also helps rebalance the load over the cluster.
HDFS fsck is used to check the health of the file system and to find missing, over-replicated, under-replicated and corrupted blocks. Run the commands below.
To get corrupt or under-replicated files:
$ hdfs fsck /
$ hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
To clear the under-replicated blocks by setting those files to a replication factor of 1:
# To turbo-charge, use xargs -n 500 (or --max-args 500)
$ cat /tmp/under_replicated_files | xargs -n 500 hdfs dfs -setrep 1
You can also put the above commands in a crontab.
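If you want to double-check the result afterwards, here is a small sketch; /apps/ is just the example path used above:
# The second column of the listing is the replication factor recorded per file
$ hdfs dfs -ls -R /apps/
# fsck summarises average replication and any over/under-replicated blocks
$ hdfs fsck /apps/ -files -blocks | tail -n 30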
11-04-2020
11:16 PM
@banshidhar_saho Surely that was a typo error. I am happy that it worked for you. If you feel that helped you solve your issue, it would be great to accept it as an answer so the thread can be marked as a solution for other members to reference. Happy hadooping
11-04-2020
12:20 PM
@Masood Can you give some context? Is it an MR or Spark job? Do you have some logs to share? Can you share the application logs? The commands below show how to pull them from YARN.
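If it is a YARN job, you can pull the aggregated container logs like this; the application ID below is a placeholder, so take the real one from the ResourceManager UI or the list command:
$ yarn application -list -appStates ALL          # find the application ID
$ yarn logs -applicationId application_1600000000000_0001 > app_logs.txt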
11-04-2020
11:40 AM
2 Kudos
@banshidhar_saho If you want to restrict [others] you will need to apply POSIX-style ACL permissions on the file/directory. Look at the examples below.
Give read and execute to [others]:
$ hdfs dfs -setfacl -m other::r-x /path/to/dir
Give read only to [others]:
$ hdfs dfs -setfacl -m other::r-- /path/to/dir
Give no permissions to [others]:
$ hdfs dfs -setfacl -m other::--- /path/to/dir
After the changes, validate with the command below, which should show the ACL is set; then, to prove it, try to list files in the directory as one of the [others] users:
$ hdfs dfs -getfacl /path/to/dir
Happy hadooping
11-03-2020
03:37 PM
@NCBank Can you start a new thread and tag me? Please include your logs or error message. The thread you are updating is old.