Support Questions


HDFS, Failed to start namenode, java.lang.OutOfMemoryError, unable to create new native thread

Explorer

I've installed CDH-5.16.1-1.cdh5.16.1.p0.3 on SLES 12.3 in "single user mode".

I have the following services running:

  • cloudera-scm-server
  • cloudera-scm-agent
  • Cloudera Management Service (Alert Publisher, Event Server, Host Monitor, Service Monitor)
  • Zookeeper

When I try to start HDFS, the SecondaryNameNode and DataNode seem OK, but the NameNode fails with this error:

2019-01-25 17:21:52,016 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at org.apache.hadoop.ipc.Server.start(Server.java:2696)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.start(NameNodeRpcServer.java:448)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:713)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:692)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
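
As far as I understand, this error means the JVM could not get a new native thread from the OS (so a process/thread cap or native-memory problem), rather than the Java heap being full. A quick sanity check is counting how many threads already exist on the host, for example:

# total number of threads across all processes on the box
ps -eLf | wc -l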

I immediately thought it was a ulimit-related issue, so I modified the limits for the cloudera-scm user:

cloudera-scm soft nofile 32768
cloudera-scm hard nofile 1048576
cloudera-scm soft nproc 127812
cloudera-scm hard nproc unlimited
cloudera-scm soft memlock unlimited
cloudera-scm hard memlock unlimited
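
Worth noting: entries like these only apply to new PAM sessions, so a quick way to see what the already-running agent process actually got is to read its limits from /proc (the pgrep pattern here is just an example):

# effective limits of the running Cloudera Manager agent
cat /proc/$(pgrep -f cloudera-scm-agent | head -n1)/limits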

Nothing changed.

Someone suggested changing the service-level (systemd) limits as well. We raised those too, and the last run shows very high values, but the NameNode still fails:

> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127812
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024680
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

# prlimit -p 100642
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0 unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space unlimited unlimited bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files             1024680   1024680 files
NPROC      max number of processes              1024360 unlimited processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         127812    127812 signals
STACK      max stack size                       8388608 unlimited bytes
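
Note that ulimit and prlimit only show the classic rlimits; on a systemd-based system the service also has a TasksMax / pids-cgroup cap that can block thread creation even when nproc looks huge. Assuming the NameNode is launched by the agent (single user mode), that cap can be checked roughly like this:

# systemd's per-service task (process + thread) limit
systemctl show cloudera-scm-agent -p TasksMax

# the corresponding cgroup limit and current usage
cat /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.max
cat /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.current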

I also raised the NameNode log level to TRACE, but the logs are clean right up to the failure:

2019-01-25 17:21:52,016 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 21 on 8020: starting
2019-01-25 17:21:52,016 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020: starting
2019-01-25 17:21:52,016 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at org.apache.hadoop.ipc.Server.start(Server.java:2696)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.start(NameNodeRpcServer.java:448)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:713)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:692)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
2019-01-25 17:21:52,017 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 20 on 8020: starting
2019-01-25 17:21:52,019 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

Any suggestions?

Thanks

System details:

SUSE Linux Enterprise Server 12 (x86_64)
VERSION = 12
PATCHLEVEL = 3
# This file is deprecated and will be removed in a future service pack or release.
# Please check /etc/os-release for details about this release.
NAME="SLES"
VERSION="12-SP3"
VERSION_ID="12.3"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp3"
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                10
On-line CPU(s) list:   0-9
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             10
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 37
Model name:            Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Stepping:              1
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-4
NUMA node1 CPU(s):     5-9
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm arat retpoline kaiser tsc_adjust
1 ACCEPTED SOLUTION

Explorer

We solved the issue.

It looks like it was a ulimit-related problem.

We raised user limits under 

/etc/security/limits.d/

Then we created a file at

/etc/systemd/system/cloudera-scm-agent.service.d/override.conf

to override the service-level (systemd) limits.
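
A minimal override.conf along these lines should do it (the exact values here are illustrative, not necessarily the ones we used):

[Service]
LimitNOFILE=1048576
LimitNPROC=65536
TasksMax=65536

After creating the file, run systemctl daemon-reload and restart the agent for it to take effect.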

Finally, instead of rebooting, we raised the limit on the already-running service directly:

echo "65536" > /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.max
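
Reading the same file back should report the new ceiling:

# confirm the raised limit is in place
cat /sys/fs/cgroup/pids/system.slice/cloudera-scm-agent.service/pids.max

Since the NameNode is launched by the agent in single user mode, it ends up under the same service cgroup, so the raised limit applies to it as well.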


2 REPLIES

Master Collaborator

Hi @seleoni, can you go to the HDFS configuration and check "Java Heap Size of NameNode in Bytes"?

If it is around 1 GB, try bumping it up, then restart the NameNode and check whether that solves your issue.
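
For reference, outside of Cloudera Manager the same change would be made through the NameNode's JVM options, roughly like this (the 4 GB figure is only an example):

# hadoop-env.sh (in CDH this is normally managed from Cloudera Manager instead)
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g ${HADOOP_NAMENODE_OPTS}"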
