Member since
10-25-2019
16
Posts
8
Kudos Received
4
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1681 | 05-21-2020 06:39 AM |
| | 1730 | 05-21-2020 05:05 AM |
| | 8766 | 05-17-2020 11:58 AM |
| | 7840 | 05-05-2020 01:22 AM |
01-09-2026
11:52 PM
➤ It sounds like you are encountering a common issue in HDFS where metadata overhead and block minimums cause a large discrepancy between your actual data size and your disk utilization. While 650 files at 4 MB each is only about 2.6 GB of data, the way HDFS manages these files on your physical disks (especially in smaller or test clusters) can lead to unexpected storage consumption.

➤ Root Causes of the 100% Utilization

1. Reserved space and "Non-DFS Used": HDFS does not have access to the entire disk. By default, Hadoop reserves a portion of each disk for the OS and non-Hadoop data (defined by dfs.datanode.du.reserved). If you are running on small disks (e.g., 20 GB–50 GB), the combination of your data, logs, and reserved space can quickly hit the 100% threshold.

2. Local filesystem block overhead: Even though your HDFS block size is 4 MB, the underlying OS filesystem (EXT4 or XFS) uses its own block size (usually 4 KB). The metadata for 650 individual files, their checksum (.meta) files, and the edit logs on the NameNode create a "death by a thousand cuts" scenario on small disks.

3. Log accumulation: Check /var/log/hadoop or your configured log directory. In HDFS 3.3.5, if a cluster is struggling with space, the DataNodes and NameNode generate massive amounts of heartbeat and "Disk Full" logs, which consume the remaining non-DFS space and push the disk to 100%.

➤ How to Tackle the Situation

Step 1: Identify where the space is going. Run the following command to see whether the space is taken by HDFS data or other files:

$ hdfs dfsadmin -report

DFS Used: space taken by your 650 files.
Non-DFS Used: space taken by logs, the OS, and other applications. If this is high, your logs are the culprit.

Step 2: Clear logs and temporary data. If "Non-DFS Used" is high, clear out the Hadoop log directory:

# Example path
rm -rf /var/log/hadoop/hdfs/*.log.*
rm -rf /var/log/hadoop/hdfs/*.out.*

Step 3: Adjust the disk-check thresholds. By default, a DataNode stops working once its disk is about 95% full. If you are in a test environment and need to squeeze out more space, you can lower the reserved space in hdfs-site.xml:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>

Step 4: Combine small files (long-term fix). HDFS is designed for large files; 650 files of 4 MB are considered "small files".
The problem: every file, regardless of size, takes up roughly 150 bytes of RAM on the NameNode and creates separate metadata entries.
The solution: use the getmerge command or a MapReduce/Spark job to combine these 650 files into 2 or 3 larger files (e.g., 1 GB each).
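For Step 4, here is a minimal sketch of one way to merge the small files with getmerge, assuming a hypothetical source directory /data/incoming and target directory /data/merged (neither path comes from the original question):

# Concatenate every file under the (hypothetical) source directory into one local file
hdfs dfs -getmerge /data/incoming /tmp/merged_incoming.dat

# Write the merged result back to HDFS as a single large file
hdfs dfs -put /tmp/merged_incoming.dat /data/merged/merged_incoming.dat

# After verifying the merged copy, remove the original small files to free blocks and NameNode metadata
hdfs dfs -rm -r -skipTrash /data/incoming

# Clean up the local temporary copy
rm /tmp/merged_incoming.dat

Note that getmerge simply concatenates the files, so this approach only makes sense for formats (plain text, CSV, and the like) where concatenation keeps the data valid.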
09-17-2020
11:33 PM
Use hadoop as the root password (you may be asked to change it).
07-22-2020
09:24 PM
The Ambari Files View (same problem with the Hue File Browser) is not the right tool if you want to upload (very) big files. It runs inside a JVM, and uploading big files uses more memory: you will hit the maximum available memory very quickly and cause performance issues for other users while you are uploading. By the way, it's possible to add additional Ambari servers hosting views to improve performance (they can be dedicated to certain teams/projects). For very big files, prefer CLI tools: scp to an edge node with a large filesystem followed by hdfs dfs -put, or distcp, or use an object store accessible from your Hadoop cluster with good network bandwidth.
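As a rough sketch of the CLI approach described above (the hostnames and paths below are invented for illustration, not taken from any real cluster):

# 1. Copy the large file to an edge node that has a big enough local filesystem
scp /local/path/bigfile.dat user@edge-node.example.com:/data/staging/

# 2. From the edge node, push it into HDFS
hdfs dfs -put /data/staging/bigfile.dat /user/myproject/

# 3. Or, for cluster-to-cluster copies, use distcp
hadoop distcp hdfs://source-nn:8020/user/myproject/bigfile.dat hdfs://target-nn:8020/user/myproject/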
05-21-2020
06:39 AM
2 Kudos
apt-get installation doesn't seem to install any bitcoin package, and the same goes for the Python package manager (pip ...). It's probably a mistake in the Dockerfile. In any case, the Docker image is old, and the GitHub repo doesn't seem to exist any more.
05-21-2020
05:05 AM
1 Kudo
Cloudera CDP is based on a Cloudera Runtime version plus a Cloudera Manager version that is compatible with that runtime: https://docs.cloudera.com/cdpdc/7.0/release-guide/topics/cdpdc-release-notes-links.html At the time of writing, CDP DC 1.0 uses Cloudera Runtime 7.0.3 and Cloudera Manager 7.0.3. The Cloudera Runtime component versions are meant to keep a consistent set of Hadoop component versions that work together. It will also make it easier to migrate from CDH/HDP if your service/component versions are the same as, or close to, the runtime component versions of CDP. If I'm not wrong, there is currently only one CDP DC version (1.0), with minor updates of CM and Cloudera Runtime component versions: https://docs.cloudera.com/cloudera-manager/7.0.3/release-notes/topics/cm-release-notes.html https://docs.cloudera.com/runtime/7.0.3/release-notes/topics/rt-runtime-component-versions.html
05-20-2020
06:12 AM
Thanks, enabling G1GC helped, along with reducing the JVM heap settings in the bootstrap conf file. Thanks again.
05-14-2020
03:53 AM
Hello @rvillanueva, You can check how many threads a user is currently using by running ps -L -u <username> | wc -l. If the user's limits are hit (ulimit -u for max processes/threads, ulimit -n for open files), the user can't spawn any more threads. The most likely reasons in this case are: the same user is running other jobs and holding open files/threads on the node where it tries to launch/spawn the container, or system threads were not excluded from the count. Check which applications are running and what their current open file counts are. Also check the application log (application_XXX), if available, to see in which phase the exception is thrown and on which node the issue occurs.
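For illustration, a few commands that can help correlate thread counts with the user's limits (the username yarn below is only an example, not from the original thread):

# Count the threads currently running for a given user
ps -L -u yarn | wc -l

# Run the following as the user in question (e.g., via su or sudo):
ulimit -u   # max user processes, which also bounds the number of threads
ulimit -n   # max open files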
05-13-2020
08:27 AM
@satishjan1 The initial question is asking about setting the hostname. The information you reference is telling you to do that, but for a different operating system. My first response was telling you how to do it for RHEL. For your next question: you do not have to set the hostname in /etc/sysconfig/network; you have to do it the way required for your operating system (see above). The hostname must be set and must persist after a reboot. If you do not set the hostname before installing the cluster, you will have no end of problems with services and components later on down the road.
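For reference, a minimal example of setting a persistent hostname on RHEL/CentOS 7 and later (the FQDN below is only a placeholder):

# Set the hostname so it persists across reboots
hostnamectl set-hostname node1.example.com

# Verify the short name and the FQDN
hostname
hostname -f

# On RHEL/CentOS 6, the equivalent is the HOSTNAME= line in /etc/sysconfig/network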
05-12-2020
02:30 AM
2 Kudos
Hi, I was able to solve the issue by running the ambari-server setup command again and selecting option 4 instead of the embedded DB. That solved it, and now I'm able to start the service without any issue. Thanks, GophalRaj
05-05-2020
01:22 AM
1 Kudo
@Mondi Just update dfs.cluster.administrators with the admin usernames you want in the HDFS config (and restart the HDFS, YARN, MR2, ... services).
Example: dfs.cluster.administrators = hdfs,ops
You can also use an HDFS administrators group (only one administrator group) via dfs.permissions.superusergroup.
Example: dfs.permissions.superusergroup = operations
To verify the config has been updated once the services are restarted:
hdfs getconf -confKey dfs.cluster.administrators
or
hdfs getconf -confKey dfs.permissions.superusergroup
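As a hedged usage example, once the services have been restarted, a user covered by the admin list or superuser group above should be able to run admin-only HDFS commands (the username opsadmin is hypothetical):

# A user listed in dfs.cluster.administrators, or belonging to the
# dfs.permissions.superusergroup group, can now run superuser-only commands, e.g.:
sudo -u opsadmin hdfs dfsadmin -report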