Created 07-21-2020 05:42 AM
I'm facing an issue during the upgrade of HDP 3.1.0.0-78 to 3.1.4.0-315 on Ubuntu 18
The upgrade process is not able to restart the datanodes.
I get the error java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 2147483648 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 16777216 bytes.
I don't understand why this error happens. The datanodes started fine before the upgrade began, and the system setting RLIMIT_MEMLOCK hasn't been changed.
Thanks in advance for your help
Created 07-21-2020 06:50 AM
Can you please check these two values: dfs.datanode.max.locked.memory and the memlock ulimit?
The dfs.datanode.max.locked.memory determines the maximum amount of memory a DataNode will use for caching. The "locked-in-memory size" corresponds to the ulimit (ulimit -l) of the DataNode user, which needs to be increased to match this parameter.
Your current dfs.datanode.max.locked.memory is 2 GB (2147483648 bytes), while the RLIMIT_MEMLOCK is only 16 MB (16777216 bytes).
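To double-check both on the DataNode host, something like this should do (just a sketch, assuming the DataNode runs as the hdfs user):

hdfs getconf -confKey dfs.datanode.max.locked.memory   # configured cache size, in bytes
sudo -u hdfs bash -c 'ulimit -l'                       # memlock ulimit of the hdfs user, in KB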
If you get the error “Cannot start datanode because the configured max locked memory size… is more than the datanode’s available RLIMIT_MEMLOCK ulimit,” that means that the operating system is imposing a lower limit on the amount of memory that you can lock than what you have configured. To fix this, you must adjust the ulimit -l value that the DataNode runs with.
Usually, this value is configured in /etc/security/limits.conf. However, it will vary depending on what operating system and distribution you are using, so adjust the values accordingly. Remember that you will also need memory for other things, such as the DataNode and application JVM heaps and the operating system page cache.
Once adjusted, the datanode should start like a charm 🙂
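As a rough sketch (assuming the DataNode runs as the hdfs user and you want to allow the full 2 GB), the /etc/security/limits.conf entries would look something like this (memlock values are in KB, and 2097152 KB = 2 GB):

# allow the hdfs user to lock up to 2 GB of memory
hdfs  soft  memlock  2097152
hdfs  hard  memlock  2097152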
Hope that helps
Created 07-21-2020 07:39 AM
Thanks for this reply,
but I don't understand why the datanode started correctly before the upgrade process and fails during it, without any change to the OS limit RLIMIT_MEMLOCK.
Created 07-21-2020 07:55 AM
Those are internals to Cloudera, and that confirms the myth that migrations/upgrades are never smooth; we still need humans 🙂 Please make those changes and let me know if your datanodes fire up correctly.
Created 07-22-2020 12:18 AM
In fact, I can't restart the datanode after the upgrade of Ambari from 2.7.3.0 to 2.7.4.0 (not during the upgrade of HDP), whereas the restart worked fine before the upgrade.
Below are the logs of the restart with the error. The operating system limit for max locked memory is set to 2197152 kbytes, which is more than the value of the parameter dfs.datanode.max.locked.memory (2147483648 bytes):
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257446
max locked memory (kbytes, -l) 2197152
max memory size (kbytes, -m) unlimited
open files (-n) 128000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 65536
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
==> /var/log/hadoop/hdfs/hadoop-hdfs-root-datanode-di-dbdne-fe-develophdpwkr-01.log <==
2020-07-22 06:42:20,156 INFO datanode.DataNode (LogAdapter.java:info(51)) - registered UNIX signal handlers for [TERM, HUP, INT]
2020-07-22 06:42:20,422 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1009)) - Login successful for user dn/di-dbdne-fe-develophdpwkr-01.node.fe.sd.diod.tech@DIOD.TECH using keytab file /etc/security/keytabs/dn.service.keytab
2020-07-22 06:42:20,574 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(137)) - Scheduling a check for [DISK]file:/mnt/hdd0/hadoop/hdfs/data
2020-07-22 06:42:20,581 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(137)) - Scheduling a check for [DISK]file:/mnt/hdd1/hadoop/hdfs/data
2020-07-22 06:42:20,582 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(137)) - Scheduling a check for [DISK]file:/mnt/hdd2/hadoop/hdfs/data
2020-07-22 06:42:20,582 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(137)) - Scheduling a check for [DISK]file:/mnt/hdd3/hadoop/hdfs/data
2020-07-22 06:42:20,582 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(137)) - Scheduling a check for [RAM_DISK]file:/mnt/dn-tmpfs
2020-07-22 06:42:20,656 INFO impl.MetricsConfig (MetricsConfig.java:loadFirst(118)) - Loaded properties from hadoop-metrics2.properties
2020-07-22 06:42:20,911 INFO timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:init(85)) - Initializing Timeline metrics sink.
2020-07-22 06:42:20,912 INFO timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:init(105)) - Identified hostname = di-dbdne-fe-develophdpwkr-01.node.fe.sd.diod.tech, serviceName = datanode
2020-07-22 06:42:20,943 INFO availability.MetricSinkWriteShardHostnameHashingStrategy (MetricSinkWriteShardHostnameHashingStrategy.java:findCollectorShard(42)) - Calculated collector shard di-dbdne-fe-develophdpadm-01.node.fe.sd.diod.tech based on hostname: di-dbdne-fe-develophdpwkr-01.node.fe.sd.diod.tech
2020-07-22 06:42:20,943 INFO timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:init(135)) - Collector Uri: http://di-dbdne-fe-develophdpadm-01.node.fe.sd.diod.tech:6188/ws/v1/timeline/metrics
2020-07-22 06:42:20,943 INFO timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:init(136)) - Container Metrics Uri: http://di-dbdne-fe-develophdpadm-01.node.fe.sd.diod.tech:6188/ws/v1/timeline/containermetrics
2020-07-22 06:42:20,948 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:start(204)) - Sink timeline started
2020-07-22 06:42:20,988 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(374)) - Scheduled Metric snapshot period at 10 second(s).
2020-07-22 06:42:20,989 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:start(191)) - DataNode metrics system started
2020-07-22 06:42:21,068 INFO common.Util (Util.java:isDiskStatsEnabled(395)) - dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-07-22 06:42:21,070 INFO datanode.BlockScanner (BlockScanner.java:<init>(184)) - Initialized block scanner with targetBytesPerSec 1048576
2020-07-22 06:42:21,073 INFO datanode.DataNode (DataNode.java:<init>(486)) - File descriptor passing is enabled.
2020-07-22 06:42:21,074 INFO datanode.DataNode (DataNode.java:<init>(499)) - Configured hostname is di-dbdne-fe-develophdpwkr-01.node.fe.sd.diod.tech
2020-07-22 06:42:21,074 INFO common.Util (Util.java:isDiskStatsEnabled(395)) - dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-07-22 06:42:21,076 ERROR datanode.DataNode (DataNode.java:secureMain(2883)) - Exception in secureMain
java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 2147483648 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 16777216 bytes.
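For what it's worth, 2197152 KB × 1024 = 2,249,883,648 bytes, which is indeed above the configured 2,147,483,648 bytes, yet the DataNode reports an RLIMIT_MEMLOCK of only 16,777,216 bytes (16 MB). So the limit shown by ulimit -a in my shell is apparently not the one the DataNode process inherits when it is started through Ambari.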
Created on 07-22-2020 05:33 AM - edited 07-22-2020 05:35 AM
In fact, I can't restart the datanode from the Ambari UI, but I can restart it by executing the following command directly on the server where the datanode should run:
/var/lib/ambari-agent/ambari-sudo.sh -H -E /usr/hdp/3.1.0.0-78/hadoop/bin/hdfs --config /usr/hdp/3.1.0.0-78/hadoop/conf --daemon start datanode
Therefore I think that the operating system limit for max locked memory is set correctly on the server where the datanode should run.
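One way to compare (just a sketch; the pgrep pattern for the agent process is an assumption) is to look at the limits the running ambari-agent process carries versus my interactive shell:

grep "Max locked memory" /proc/$(pgrep -f ambari_agent | head -1)/limits   # limit the agent (and its children) runs with
ulimit -l                                                                  # limit of my interactive shell, in KB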
Created 07-23-2020 11:38 PM
I was able to restart the datanode from the Ambari UI after restarting the ambari-agent on the servers where the datanodes run.
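For reference, restarting the agent is just the standard command on each DataNode host (e.g. sudo ambari-agent restart), followed by retrying the DataNode restart from the Ambari UI.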