Created 01-25-2019 08:52 AM
I've installed CDH-5.16.1-1.cdh5.16.1.p0.3 on SLES 12.3 in "single user mode".
I have some services running
When I try to start HDFS, the Secondary NameNode and DataNode seem OK, but the NameNode fails with this error:
2019-01-25 17:21:52,016 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at org.apache.hadoop.ipc.Server.start(Server.java:2696)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.start(NameNodeRpcServer.java:448)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:713)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:692)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
I immediately suspected a ulimit-related issue, so I raised the limits for the cloudera-scm user:
cloudera-scm soft nofile 32768
cloudera-scm hard nofile 1048576
cloudera-scm soft nproc 127812
cloudera-scm hard nproc unlimited
cloudera-scm soft memlock unlimited
cloudera-scm hard memlock unlimited
Nothing changed.
Someone suggested changing the systemd service-related limits as well. We raised those too, and the last run shows very high values, but the NameNode still fails:
> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127812
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024680
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# prlimit -p 100642
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0 unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space unlimited unlimited bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files             1024680   1024680 files
NPROC      max number of processes              1024360 unlimited processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         127812    127812 signals
STACK      max stack size                       8388608 unlimited bytes
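The systemd-level limits applied to the agent service can also be checked directly (assuming the unit is named cloudera-scm-agent.service):
systemctl show cloudera-scm-agent --property=LimitNOFILE,LimitNPROC,TasksMax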
I also raised the NameNode log level to TRACE, but the logs are pretty clean:
2019-01-25 17:21:52,016 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 21 on 8020: starting
2019-01-25 17:21:52,016 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020: starting
2019-01-25 17:21:52,016 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at org.apache.hadoop.ipc.Server.start(Server.java:2696)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.start(NameNodeRpcServer.java:448)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:713)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:692)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
2019-01-25 17:21:52,017 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 20 on 8020: starting
2019-01-25 17:21:52,019 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
Any suggestions?
Thanks
SUSE Linux Enterprise Server 12 (x86_64)
VERSION = 12
PATCHLEVEL = 3
# This file is deprecated and will be removed in a future service pack or release.
# Please check /etc/os-release for details about this release.

NAME="SLES"
VERSION="12-SP3"
VERSION_ID="12.3"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp3"
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                10
On-line CPU(s) list:   0-9
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             10
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 37
Model name:            Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Stepping:              1
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-4
NUMA node1 CPU(s):     5-9
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm arat retpoline kaiser tsc_adjust
Created 01-26-2019 08:37 PM
Hi @seleoni Can you go to the HDFS configuration and check "Java Heap Size of NameNode in Bytes"?
If it is around 1 GB, try bumping it up, restart the NameNode, and check whether that solves your issue.
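If helpful, one quick way to see the heap the NameNode process was actually started with is to look for the -Xmx flag on the running Java process (the exact command line depends on how CM launches the role, so this pattern is only an example):
ps -ef | grep org.apache.hadoop.hdfs.server.namenode.NameNode | grep -o '\-Xmx[^ ]*'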
Created 01-29-2019 12:31 AM
We solved the issue. It turned out to be a ulimit-related problem.
We raised the user limits under
/etc/security/limits.d/
and then created a file at
/etc/systemd/system/cloudera-scm-agent.service.d/override.conf
to override the service-level limits (a sketch of what such an override might look like is below).
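A minimal sketch of such an override, assuming the limits being raised are the open-file, process and task limits (the values here are only examples; TasksMax needs a reasonably recent systemd, and you have to run systemctl daemon-reload afterwards):
[Service]
LimitNOFILE=1048576
LimitNPROC=65536
TasksMax=65536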
Finally, instead of rebooting, we raised the limit for the already-running service directly in its cgroup:
echo "65536" > /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.max
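If anyone hits the same thing: you can check whether the service is bumping into this limit by comparing the current task count with the cap in the same cgroup directory (paths assume the cgroup v1 pids controller, as on SLES 12):
cat /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.current
cat /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.max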