After enabling TLS, Cloudera agent heartbeat fails

Explorer

Version: Cloudera Express 5.15.0 

Java VM Name: Java HotSpot(TM) 64-Bit Server VM

Java VM Vendor: Oracle Corporation

Java Version: 1.7.0_67

 

System details:

Linux optim-rhel72-uppu.development.unicomglobal.software 3.10.0-327.28.3.el7.x86_64 #1 SMP Fri Aug 12 13:21:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

 

This is a single host, and I am using a self-signed certificate. I am only validating a POC with one of my products, so the cluster is not yet licensed.

 

I followed the steps at these links:

https://www.cloudera.com/documentation/enterprise/5-11-x/topics/how_to_configure_cm_tls.html

https://www.cloudera.com/documentation/enterprise/5-15-x/topics/sg_self_signed_tls.html

 

After enabling TLS, the Cloudera agent heartbeat fails with the following lines in cloudera-scm-agent.log:

 

[27/Dec/2018 20:58:28 +0000] 6869 MainThread agent        ERROR    Heartbeating to optim-rhel72-uppu.development.unicomglobal.software:7182 failed.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 1424, in _send_heartbeat
    self.max_cert_depth)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/https.py", line 138, in __init__
    self.conn.connect()
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/M2Crypto-0.24.0-py2.7-linux-x86_64.egg/M2Crypto/httpslib.py", line 59, in connect
    sock.connect((self.host, self.port))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/M2Crypto-0.24.0-py2.7-linux-x86_64.egg/M2Crypto/SSL/Connection.py", line 195, in connect
    ret = self.connect_ssl()
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/M2Crypto-0.24.0-py2.7-linux-x86_64.egg/M2Crypto/SSL/Connection.py", line 188, in connect_ssl
    return m2.ssl_connect(self.ssl, self._timeout)
SSLError: unexpected eof

 

The following lines appear in cloudera-scm-server.log:

2018-12-27 20:58:13,025 WARN 1320793343@agentServer-16:org.mortbay.log: javax.net.ssl.SSLHandshakeException: null cert chain
2018-12-27 20:58:28,034 WARN 1320793343@agentServer-16:org.mortbay.log: javax.net.ssl.SSLHandshakeException: null cert chain
2018-12-27 20:58:43,447 WARN 1320793343@agentServer-16:org.mortbay.log: javax.net.ssl.SSLHandshakeException: null cert chain
2018-12-27 20:58:58,082 WARN 1320793343@agentServer-16:org.mortbay.log: javax.net.ssl.SSLHandshakeException: null cert chain
2018-12-27 20:59:13,140 WARN 1320793343@agentServer-16:org.mortbay.log: javax.net.ssl.SSLHandshakeException: null cert chain

 

I have tried multiple times, but none of the attempts worked.

 

I didn't find any error while running this command:

openssl s_client -showcerts -connect optim-rhel72-uppu.development.unicomglobal.software:7182

 

Any help would be highly appreciated.

 

Thanks,

Tulasi

 

1 ACCEPTED SOLUTION

Master Guru

@Tulasi,

 

Thank you for providing your config. It appears you have space characters at the beginning of your cert/key configs. Remove the space characters from the beginning of the following lines and then restart the agent:


 verify_cert_file=/opt/cloudera/security/pki/optim-rhel72-uppu.pem
 verify_cert_dir=/opt/cloudera/security/pki
 client_key_file=/opt/cloudera/security/pki/agent.key
 client_keypw_file=/etc/cloudera-scm-agent/agentkey.pw
 client_cert_file=/opt/cloudera/security/pki/agent.pem
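
To spot this kind of problem without reading the file by eye, a small scan for indented key=value lines can help. The following is only a sketch: the function name `find_indented_options` and the sample text are made up for illustration, and it does not reproduce how the agent itself reads config.ini.

```python
import re

# An indented "key=value" line: leading spaces or tabs, then an option name.
# Indented comment lines (starting with '#') are deliberately not matched.
INDENTED_OPTION = re.compile(r"^[ \t]+([A-Za-z_][A-Za-z0-9_]*)\s*=")


def find_indented_options(text):
    """Return (line_number, key) for every key=value line that starts with whitespace."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = INDENTED_OPTION.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical sample modelled on the [Security] section posted in this thread.
SAMPLE = """[Security]
use_tls=1
max_cert_depth=9
 verify_cert_file=/opt/cloudera/security/pki/optim-rhel72-uppu.pem
 client_key_file=/opt/cloudera/security/pki/agent.key
"""

if __name__ == "__main__":
    for number, key in find_indented_options(SAMPLE):
        print("line %d: option '%s' starts with whitespace" % (number, key))
```

To check a real file, pass the contents of /etc/cloudera-scm-agent/config.ini to `find_indented_options` instead of `SAMPLE`.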

20 REPLIES

Super Collaborator
Please note that keystores/truststores need to be in standard JKS format, and the documentation does not state otherwise. PKCS12 is only used for exporting and converting the certificate plus private key to PEM format for the CM agent config. Can you please point to the sentence suggesting PKCS12 format so that we can correct it?

Explorer

Hi,

 

Here are my responses to your questions. Can you please tell me what I am doing wrong? If you need more details, I can share them.

 

Thanks,

Tulasi

 

1.) Ensure that the certificates are in a standard X.509 format for the agent.
Yes, they are standard X.509; see my response to bgooley.

2.) Ensure that the truststores/keystores on the CM host are in JKS format and not PKCS12.
As per the Cloudera documentation, they should be JKS. See the link
https://www.cloudera.com/documentation/enterprise/5-15-x/topics/how_to_configure_cm_tls.html
section "Generate TLS Certificate", point 3.

3.) Make sure that the cloudera-scm user can read the Private Key, Certificates, Truststores, and Password Files.
Yes, see my response to gzigldrum.

4.) Make sure that the certificate on the failing agent contains a proper CN and DNS Alt Name if Alt Names are in use.
Yes, I have verified this as well.

5.) Are you using self-signed certificates or certificates signed by a CA?
I am using a self-signed certificate.

6.) If all else fails you can obtain a tcpdump of attempted communication with the server. The port that we normally heartbeat to is 7182. You can then review the conversation between the server and agent to attempt to identify at what point the error is returned and potentially what error is being observed at the protocol level. You can identify and restrict your tcpdump information by tcp.stream.

[root@optim-rhel72-uppu ~]# tcpdump -i any 'port 7182'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
21:20:03.562131 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [S], seq 3560294529, win 43690, options [mss 65495,sackOK,TS val 1632415805 ecr 0,nop,wscale 7], length 0
21:20:03.562225 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [.], ack 1, win 342, options [nop,nop,TS val 1632415805 ecr 1632415805], length 0
21:20:03.562549 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [P.], seq 1:254, ack 1, win 342, options [nop,nop,TS val 1632415806 ecr 1632415805], length 253
21:20:03.587871 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [.], ack 16390, win 1365, options [nop,nop,TS val 1632415831 ecr 1632415831], length 0
21:20:03.587919 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [.], ack 21184, win 2388, options [nop,nop,TS val 1632415831 ecr 1632415831], length 0
21:20:03.619895 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [P.], seq 254:516, ack 21184, win 2388, options [nop,nop,TS val 1632415863 ecr 1632415831], length 262
21:20:03.628945 IP optim-rhel72-uppu.development.unicomglobal.software.44942 > optim-rhel72-uppu.development.unicomglobal.software.7182: Flags [F.], seq 516, ack 21185, win 2388, options [nop,nop,TS val 1632415872 ecr 1632415864], length 0
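
For check 3 above, one rough way to confirm read access is to compare a file's mode bits against the user's uid and groups. This is a minimal sketch, not a Cloudera tool: the helper names `mode_allows_read` and `user_can_read` are invented here, and the check ignores ACLs, SELinux, and permissions on parent directories.

```python
import grp
import os
import pwd
import stat


def mode_allows_read(mode, owner_uid, owner_gid, uid, gids):
    """Apply the classic owner/group/other rule to decide read access."""
    if uid == 0:
        # root is not restricted by the read bits
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def user_can_read(path, username):
    """Look up `username` and test the mode bits of `path` for that user."""
    user = pwd.getpwnam(username)
    gids = {user.pw_gid}
    gids.update(g.gr_gid for g in grp.getgrall() if username in g.gr_mem)
    info = os.stat(path)
    return mode_allows_read(info.st_mode, info.st_uid, info.st_gid,
                            user.pw_uid, gids)
```

For example, `user_can_read("/opt/cloudera/security/pki/agent.key", "cloudera-scm")` would report whether the mode bits alone let that user read the key file named in this thread.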

Super Collaborator

In addition to that, can you please show us the CM agent configuration with

# egrep -v '^[[:blank:]]*#|^$' /etc/cloudera-scm-agent/config.ini

Explorer

Here is the output:

 

[root@optim-rhel72-uppu ~]# egrep -v '^[[:blank:]]*#|^$' /etc/cloudera-scm-agent/config.ini
[General]
server_host=optim-rhel72-uppu.development.unicomglobal.software
server_port=7182
max_collection_wait_seconds=10.0
metrics_url_timeout_seconds=30.0
task_metrics_timeout_seconds=5.0
monitored_nodev_filesystem_types=nfs,nfs4,tmpfs
local_filesystem_whitelist=ext2,ext3,ext4,xfs
impala_profile_bundle_max_bytes=1073741824
stacks_log_bundle_max_bytes=1073741824
stacks_log_max_uncompressed_file_size_bytes=5242880
orphan_process_dir_staleness_threshold=5184000
orphan_process_dir_refresh_interval=3600
scm_debug=INFO
dns_resolution_collection_interval_seconds=60
dns_resolution_collection_timeout_seconds=30
[Security]
use_tls=1
max_cert_depth=9
 verify_cert_file=/opt/cloudera/security/pki/optim-rhel72-uppu.pem
 verify_cert_dir=/opt/cloudera/security/pki
 client_key_file=/opt/cloudera/security/pki/agent.key
 client_keypw_file=/etc/cloudera-scm-agent/agentkey.pw
 client_cert_file=/opt/cloudera/security/pki/agent.pem
[Hadoop]
[Cloudera]
[JDBC]
[root@optim-rhel72-uppu ~]#

 

Master Guru

NOTE:

 

You should only specify verify_cert_dir OR verify_cert_file, not both

Since you have a pem file, I would suggest using verify_cert_file and commenting out "verify_cert_dir=/opt/cloudera/security/pki"
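
A quick way to check for this overlap is to parse the file and see whether both options are set. The sketch below uses Python 3's standard `configparser`; the function name `verify_options_conflict` is hypothetical, and it assumes the option lines have no leading whitespace.

```python
import configparser


def verify_options_conflict(text):
    """Return True when [Security] sets both verify_cert_file and verify_cert_dir."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("Security"):
        return False
    return (parser.has_option("Security", "verify_cert_file")
            and parser.has_option("Security", "verify_cert_dir"))


# Hypothetical samples modelled on the config posted in this thread.
BOTH_SET = """[Security]
use_tls=1
verify_cert_file=/opt/cloudera/security/pki/optim-rhel72-uppu.pem
verify_cert_dir=/opt/cloudera/security/pki
"""

DIR_COMMENTED_OUT = """[Security]
use_tls=1
verify_cert_file=/opt/cloudera/security/pki/optim-rhel72-uppu.pem
# verify_cert_dir=/opt/cloudera/security/pki
"""
```

With these samples, the first reports a conflict and the second, with verify_cert_dir commented out as suggested, does not.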

Explorer

@bgooley

Space characters at the beginning of the cert/key lines in the agent configuration file caused this problem. After I removed all of those spaces and restarted the agent, the heartbeat worked.

 

I didn't expect a space could cause this sort of problem without any indication of what was going wrong.

 

Thanks for helping me figure out this silly problem.
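
One possible explanation for the silent failure: Python's standard INI parser treats an indented line as a continuation of the previous option's value, not as a new option, and raises no error. The sketch below shows that behavior with Python 3's `configparser`. It assumes the agent's parser behaves similarly, which this thread does not confirm; the sample text is made up.

```python
import configparser

# Hypothetical sample: the last line starts with a single space.
SAMPLE = """[Security]
use_tls=1
max_cert_depth=9
 client_cert_file=/opt/cloudera/security/pki/agent.pem
"""


def parse(text):
    """Parse INI text with the standard library parser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def describe(text):
    """Report whether client_cert_file was seen as an option, and max_cert_depth's raw value."""
    parser = parse(text)
    return (parser.has_option("Security", "client_cert_file"),
            parser.get("Security", "max_cert_depth"))


if __name__ == "__main__":
    print(describe(SAMPLE))                      # indented line
    print(describe(SAMPLE.replace("\n ", "\n")))  # leading space removed
```

In the indented case the parser never sees a client_cert_file option; the text is folded into the value of max_cert_depth. That would be consistent with the server reporting "null cert chain", since a client that has no certificate configured sends none.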

 

 

Explorer

I have followed the steps under "Configuring TLS/SSL for HDFS, YARN and MapReduce"

 

Service did not start successfully; not all of the required roles started: only 0/1 roles started. Reasons : Service has only 0 NodeManager roles running instead of minimum required 1

 

YARN fails to start, and I see the error below in the log:

Can't open /run/cloudera-scm-agent/process/190-yarn-NODEMANAGER/container-executor.cfg: Permission denied

 

These are the permissions:

-rw-r----- 1 yarn hadoop   997 Jan 18 02:22 creds.localjceks
-rw------- 1 yarn hadoop  1746 Jan 18 02:22 yarn.keytab
-r-------- 1 root hadoop   156 Jan 18 02:22 container-executor.cfg
-rw------- 1 root root    3688 Jan 18 02:22 supervisor.conf

 

But after I change the permissions, a restart creates another folder with the same permissions. How can I resolve this problem?

 

Thanks,

Tulasi

 

Master Guru

@Tulasi,

 

Could you start a new thread for your new issue so that we don't mix issues in the same thread? The space character issue is likely to help others, so it would be good to start a new thread for the permission denied issue. I think it is a known one, but it will be easier to discuss if we start fresh.

Explorer

Thanks Ben, will create a new thread.