ATSv2 HBase starts, but on the wrong node

Expert Contributor

Hello,

I have a new HDP 3.0.1 installation with ats-hbase running in embedded mode (with the proper queue configured, as per the documentation).

At the end of every job (seen with the Hive compactor and Oozie steps), I get hundreds of lines like:

org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING

ending with:

org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1542872934100_0068
org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
    at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
    at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Call From null to prod-nl-dpnode3.dmdelivery.local:33602 failed on socket timeout exception

Looking at /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager, I see a lot of lines like:

Call exception, tries=7, retries=7, started=8194 ms ago, cancelled=false, msg=Call to xxxxx/192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: prod-nl-dpnode1.dmdelivery.local/192.168.36.161:17020, details=row 'prod.timelineservice.entity,hive!yarn-cluster!xxxx-34-compactor-vault.contact.license_name=lectiva!<binary>!<binary>!MAPREDUCE_TASK_ATTEMPT!<binary>!attempt_1542205428050_2307_m_000461_0,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1542294270073, seqNum=-1

Looking at /var/log/hadoop-yarn/yarn/hadoop-yarn-timelinereader, I see:

Connection refused: dpnode1/192.168.36.161:17020

Indeed, there is no HBase on dpnode1. HBase runs on dpnode5 (or on another node, depending on YARN restarts), but in any case the timeline reader does not know which server to reach and always goes to the same, seemingly hardcoded, hostname.

How can I tell YARN to use the right node when connecting to HBase?

Thanks,

Super Guru

@Guillaume Roger,

I suspect ATSv2 is running as a service and not in embedded mode. Can you filter for "is_hbase_system_service" in the YARN configs and check its value? If it is set to true, the ATSv2 HBase backend (ats-hbase) runs as a YARN application; otherwise, it runs in embedded mode. When it runs as a YARN application, it can be started on any node that has a NodeManager and enough free resources.

Can you also check the YARN application logs to see whether the HBase master and region servers come up properly?
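As a rough sketch of the two checks suggested above, the shell functions below map the flag value to the mode and pull the application ID of the ats-hbase YARN application out of yarn application -list output, so that its container logs can be fetched. The helper names are hypothetical (not part of HDP), and the parsing assumes the application name is the second whitespace-separated column, which holds for a name without spaces such as ats-hbase.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of HDP or YARN.

# Interpret the value of is_hbase_system_service_launch (from the YARN
# configs in Ambari): true means ats-hbase runs as a YARN application,
# anything else means embedded mode.
ats_hbase_mode() {
  case "$1" in
    true|TRUE|True) echo "yarn-service" ;;
    *) echo "embedded" ;;
  esac
}

# Read "yarn application -list" output on stdin and print the ID of the
# first application whose name (second column) matches $1.
find_app_id() {
  awk -v name="$1" '$1 ~ /^application_/ && $2 == name { print $1; exit }'
}

# Usage:
#   app_id=$(yarn application -list -appStates ALL | find_app_id ats-hbase)
#   yarn logs -applicationId "$app_id" | grep -iE 'master|regionserver'
```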

Expert Contributor

@Aditya Sirna, you are right: HBase runs as a service (is_hbase_system_service_launch is true).

In the examples below, nodeN refers to my data nodes. This is based on what I see right now and makes it easier to follow.

The region server (on node5) tries to report for duty but fails: it tries to connect to node1:17020, but port 17020 is only open on node5.
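To confirm which node actually has the region server port open, a quick probe like the one below may help. It is a minimal sketch that relies on bash's /dev/tcp redirection and the coreutils timeout command; probe_port is a hypothetical helper, and node1/node5 are the placeholder names used in this post.

```shell
#!/usr/bin/env bash
# Sketch: report whether a TCP port accepts connections, using bash's
# /dev/tcp pseudo-device. probe_port is a hypothetical helper name.
probe_port() {
  local host=$1 port=$2
  # Redirecting to /dev/tcp/host/port makes bash attempt a TCP connect;
  # timeout caps the wait for hosts that drop packets instead of refusing.
  if timeout 3 bash -c ">/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Usage (node names are the placeholders from this thread):
#   for h in node1 node5; do echo "$h:17020 $(probe_port "$h" 17020)"; done
```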

On node1, the HBase master tried to start but stopped, apparently because it cannot find the active NameNode:

Failed get of master address: java.io.IOException: Can't get master address from ZooKeeper; znode data == null
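The error above means the HBase master address znode in ZooKeeper is empty. HBase publishes the active master's address under <zookeeper.znode.parent>/master, so a sketch like the following can build that path for inspection with the ZooKeeper CLI. The znode parent used by ats-hbase is an assumption here (for example /atsv2-hbase-unsecure on a non-Kerberized HDP 3 cluster); check the hbase-site.xml that ats-hbase actually uses. The helper name master_znode is hypothetical, and the location of zkCli.sh differs between installations.

```shell
#!/bin/sh
# Sketch: build the znode path where HBase publishes the active master.
# The parent is whatever zookeeper.znode.parent is set to for ats-hbase;
# it is passed in rather than hard-coded because it is an assumption.
master_znode() {
  parent=${1%/}   # drop a trailing slash, if any
  printf '%s/master\n' "$parent"
}

# Usage (ZooKeeper host and znode parent are placeholders):
#   zkCli.sh -server <zk-host>:2181 get "$(master_znode /atsv2-hbase-unsecure)"
# An empty or missing znode matches "znode data == null" in the master log.
```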

I will look into ZooKeeper; that rings a bell.

I have 2 questions, if you don't mind:

- How do you start a YARN service on a specific node?

- How does the timeline reader know where to connect?

In any case, thanks. You gave me some ideas to carry on with.

Explorer

@Guillaume Roger / @Aditya Sirna How do we start ats-hbase?

I tried sudo yarn app -start ats-hbase, but it gives this error:

ERROR client.ApiServiceClient: File does not exist: .yarn/services/ats-hbase/ats-hbase.json

The service was running on my cluster. I stopped it from the ResourceManager Services tab because I had reconfigured the container size and this service was still using the old container size. But now I cannot see any service in that tab to start.

Please help: the absence of this service causes jobs to hang for approximately 5 minutes after the mappers and reducers reach 100%.

Expert Contributor

I got it working by:

- cleaning up the HDFS directories of ats-hbase

- cleaning up the ZooKeeper znodes related to ats-hbase

I hope there are better ways, but that is the only one I found that worked.
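The cleanup described above could be scripted along these lines. This is a hedged sketch: the post does not name the HDFS directory, ZooKeeper host or znode, so they are placeholders you must supply; commands are only printed unless RUN=1 is set; and rmr is the ZooKeeper 3.4 CLI command (newer ZooKeeper releases call it deleteall). Both steps delete ats-hbase data irreversibly, so presumably the service should be stopped first. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: wipe ats-hbase state in HDFS and ZooKeeper. Dry run by default:
# commands are only printed unless the environment variable RUN=1 is set.

run_or_echo() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# $1 = HDFS directory used by ats-hbase, $2 = ZooKeeper host:port,
# $3 = znode used by ats-hbase (all placeholders, not from the post)
cleanup_ats_hbase() {
  if [ "$#" -ne 3 ]; then
    echo "usage: cleanup_ats_hbase <hdfs-dir> <zk-host:port> <znode>" >&2
    return 2
  fi
  # Recursive delete that bypasses the HDFS trash.
  run_or_echo hdfs dfs -rm -r -skipTrash "$1"
  # rmr is the ZooKeeper 3.4 CLI command; ZooKeeper 3.5+ names it deleteall.
  run_or_echo zkCli.sh -server "$2" rmr "$3"
}
```

Running it once without RUN=1 shows exactly what would be deleted before anything is touched.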

Explorer

Thanks @Guillaume Roger

I was facing the above issue because the ats-hbase.json file is stored under the yarn-ats user's home directory in HDFS, so running the command as any other user does not find it.
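Since the relative path .yarn/services/ats-hbase/ats-hbase.json in the error is resolved against the HDFS home directory of whoever runs the command, a small check like this one shows where the spec is expected. It assumes the default /user/<name> home directory layout and yarn-ats as the service owner (the user.name used in the REST calls in this post); service_spec_path is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print the HDFS path where "yarn app" looks for a service spec,
# assuming the default /user/<name> home directory layout.
service_spec_path() {
  user=$1 service=$2
  printf '/user/%s/.yarn/services/%s/%s.json\n' "$user" "$service" "$service"
}

# Usage: start the service as its owner only if the spec is where YARN
# expects it (on a Kerberized cluster that user also needs a valid ticket).
#   spec=$(service_spec_path yarn-ats ats-hbase)
#   sudo -u yarn-ats hdfs dfs -test -e "$spec" && sudo -u yarn-ats yarn app -start ats-hbase
```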

I was able to fix the issue by trying a couple of things:

1. Running the commands below (stop, then start):

   curl -k -u: -H "Content-Type: application/json" -X PUT "http://<ResourceManagerHost>:<ResourceManagerPort>/app/v1/services/ats-hbase?user.name=yarn-ats" -d '{"state": "STOPPED"}'

   curl -k -u: -H "Content-Type: application/json" -X PUT "http://<ResourceManagerHost>:<ResourceManagerPort>/app/v1/services/ats-hbase?user.name=yarn-ats" -d '{"state": "STARTED"}'

2. Restarting Timeline Service V2.0 through Ambari.

I am not sure whether this actually started ats-hbase, because I cannot see any service running in the ResourceManager UI, unlike before. I am also not sure the Ambari restart helped, because I had restarted the cluster a few times and that might have restarted the timeline server anyway.
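The two curl calls above differ only in the state field, so they can be wrapped in a small helper. This is a sketch, not an official client: the function names are hypothetical, the host and port are placeholders, and it keeps the same curl options as the original commands. A GET on the same URL returns the service definition with its current state, which may help answer whether ats-hbase actually started.

```shell
#!/bin/sh
# Sketch: wrap the YARN Services REST calls used above.

service_url() {
  # $1 = ResourceManager host, $2 = port, $3 = service (default ats-hbase),
  # $4 = user.name (default yarn-ats, as in the commands above)
  printf 'http://%s:%s/app/v1/services/%s?user.name=%s\n' \
    "$1" "$2" "${3:-ats-hbase}" "${4:-yarn-ats}"
}

state_payload() {
  # Only the two states used in this thread are accepted.
  case "$1" in
    STOPPED|STARTED) printf '{"state": "%s"}\n' "$1" ;;
    *) echo "state must be STOPPED or STARTED" >&2; return 1 ;;
  esac
}

set_service_state() {
  # $1 = host, $2 = port, $3 = STOPPED or STARTED
  payload=$(state_payload "$3") || return 1
  curl -k -u: -H "Content-Type: application/json" -X PUT \
    "$(service_url "$1" "$2")" -d "$payload"
}

service_status() {
  # GET on the same resource returns the service and its state.
  curl -k -u: "$(service_url "$1" "$2")"
}

# Usage:
#   set_service_state <ResourceManagerHost> <ResourceManagerPort> STOPPED
#   set_service_state <ResourceManagerHost> <ResourceManagerPort> STARTED
#   service_status <ResourceManagerHost> <ResourceManagerPort>
```

Quoting the URL matters here: unquoted, the ? is a glob character and the <...> placeholders would be parsed by the shell as redirections.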

I used the following document for reference:

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/data-operating-system/content/options_to_re...