Created on 05-04-2018 10:37 AM - edited 09-16-2022 06:10 AM
I decommissioned and deleted 2 of my 3 HDFS data nodes. I expected the blocks to have been replicated first, but they had not been.
I started getting under-replication errors on my cluster. I have now started the HDFS balancer, but Hive queries are terribly slow.
Is there some relation between the two? Is it because HDFS has to write to three nodes while files are under-replicated?
Created 05-04-2018 12:19 PM
If you decommissioned 2 out of 3 data nodes, do you only have one node left? In that case everything must execute on this single data node, which will have a performance impact.
If you still have three nodes left and replication is ongoing for almost all your files, you will have massive network load, which will certainly affect any queries as well. During this rebalancing you will experience slow responses.
The relative impact is lower if you have 100 nodes and 2 get decommissioned, but it is still there.
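To put rough numbers on that relative impact, here is a minimal Python sketch. It assumes block replicas are spread evenly across DataNodes and a replication factor of 3 (the HDFS default); both function names are made up for illustration and are not part of any Hadoop tooling.

```python
def replica_share_lost(total_nodes: int, removed_nodes: int) -> float:
    """Fraction of all block replicas that lived on the removed nodes.

    Assumes replicas are spread evenly across DataNodes (an assumption,
    real clusters are rarely perfectly balanced).
    """
    if total_nodes <= 0 or not 0 <= removed_nodes <= total_nodes:
        raise ValueError("need 0 <= removed_nodes <= total_nodes and total_nodes > 0")
    return removed_nodes / total_nodes


def max_replicas_possible(remaining_nodes: int, replication_factor: int = 3) -> int:
    """HDFS keeps at most one replica of a block per DataNode, so the
    number of live DataNodes caps the achievable replica count."""
    return min(remaining_nodes, replication_factor)


# 2 of 3 nodes removed: about two thirds of all replicas are gone.
print(f"{replica_share_lost(3, 2):.0%}")    # 67%
# 2 of 100 nodes removed: a far smaller share has to be copied again.
print(f"{replica_share_lost(100, 2):.0%}")  # 2%
# With a single DataNode left, replication factor 3 cannot be met.
print(max_replicas_possible(1))             # 1
```

Under these assumptions, a single remaining DataNode can never satisfy a replication factor of 3, so the under-replication warnings would persist until more DataNodes are available.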
Created 05-04-2018 02:28 PM
@Harald Berghoff
Thank you.
On my old data nodes, and now even with the new ones, I have been seeing this error in the Hive query logs when a query fails.
It is affecting the entire system and we are clueless about it. Could you give some direction to help fix this error, please?
INFO : Completed executing command(queryId=hive_20180504183030_1d508d09-3ac9-47fe-8d43-fd13804ec1d1); Time taken: 0.165 seconds
ERROR : Job Submission failed with exception 'java.lang.RuntimeException(Error caching map.xml: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.)'
java.lang.RuntimeException: Error caching map.xml: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.
    at org.apache.hadoop.hive.ql.exec.Utilities.setBaseWork(Utilities.java:743)
    at org.apache.hadoop.hive.ql.exec.Utilities.setMapWork(Utilities.java:678)
    at org.apache.hadoop.hive.ql.exec.Utilities.setMapRedWork(Utilities.java:670)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:362)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:142)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:80)
Caused by: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2270)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1701)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1620)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:772)
Created 05-04-2018 02:28 PM
@Harald Berghoff Once I get the above-mentioned error, most of my Hive jobs after it start failing with:
ERROR : Execution failed with exit status: 1
ERROR : Obtaining error information
ERROR : Task failed! Task ID: Stage-7 Logs:
ERROR : /var/log/hive/hadoop-cmf-CD-HIVE-XCVXskZf-HIVESERVER2-ip-172-31-4-192.ap-south-1.compute.internal.lo...
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
INFO : Completed executing command(queryId=hive_20180504194545_4cbaf9cb-9f83-410f-9804-1618f1ee10d1); Time taken: 0.156 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=1)
Closing: 0: jdbc:hive2://ip-172-31-4-192.ap-south-1.compute.internal:10000/default
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]
Created 05-04-2018 03:03 PM
Hive submits MapReduce jobs to process the files, so the sequence is quite clear: the MapReduce jobs get stopped and, as a result, your Hive query fails as well.
The error message indicates that the job was waiting on an I/O channel and got interrupted. What I can't say for sure is whether the MapReduce job got interrupted because of the replication taking place in your cluster or for some other reason.
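For a closer look at which channel the job was waiting on, the two endpoints can be pulled out of the exception text. This is a minimal Python sketch that assumes the message format shown in the pasted log; `parse_endpoints` is a hypothetical helper for illustration, not part of Hadoop or Hive.

```python
import re

# Matches the "local=/ip:port remote=/ip:port" part of the SocketChannel text.
ENDPOINTS = re.compile(
    r"local=/(?P<local_ip>[\d.]+):(?P<local_port>\d+)\s+"
    r"remote=/(?P<remote_ip>[\d.]+):(?P<remote_port>\d+)"
)


def parse_endpoints(message):
    """Return the local and remote (ip, port) pairs, or None if absent."""
    match = ENDPOINTS.search(message)
    if match is None:
        return None
    return {
        "local": (match["local_ip"], int(match["local_port"])),
        "remote": (match["remote_ip"], int(match["remote_port"])),
    }


# Message text copied from the log posted in this thread.
message = (
    "java.io.InterruptedIOException: Interruped while waiting for IO on channel "
    "java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 "
    "remote=/172.31.4.192:50010]. 74865 millis timeout left."
)

endpoints = parse_endpoints(message)
same_host = endpoints["local"][0] == endpoints["remote"][0]
print(endpoints, same_host)
```

In this log both endpoints are on 172.31.4.192, and 50010 is the default DataNode data transfer port in Hadoop 2.x, so the client appears to have been waiting on a DataNode running on its own host rather than on a remote machine.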
Created 05-04-2018 04:38 PM
@Harald Berghoff It also happened before the replication started. What other reasons could there be for it getting interrupted?
Created 05-04-2018 04:54 PM
Also, in the cluster metrics screen, I see:
application_1525427506968_1194 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:24 +0550 2018 | Fri May 4 22:20:25 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1193 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:20 +0550 2018 | Fri May 4 22:20:23 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1192 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:17 +0550 2018 | Fri May 4 22:20:19 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1191 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:13 +0550 2018 | Fri May 4 22:20:15 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1190 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:09 +0550 2018 | Fri May 4 22:20:11 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1189 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:06 +0550 2018 | Fri May 4 22:20:08 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1188 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:02 +0550 2018 | Fri May 4 22:20:05 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1187 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:19:59 +0550 2018 | Fri May 4 22:20:01 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1186 | hue | Create TEMPORARY TABL...year(current_date())(Stage-1) | MAPREDUCE | root.users.hue | Fri May 4 22:19:49 +0550 2018 | Fri May 4 22:20:31 +0550 2018 | FINISHED | SUCCEEDED | N/A
All of the applications that have dr.who as the user have failed, whereas those executed as the hue user have run fine.
@Harald Berghoff
Created 05-05-2018 08:17 AM
Problem
The application log file shows:
74865 millis timeout while waiting for channel to be ready for connecting: java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.
All nodes are connected to each other via an internal switch, which is a subnet of 172.31.4.x. This network is not open to public access.
Cause
Each node in the Hadoop cluster has an internal IP address (through an internal switch) and an external IP address, the latter used to communicate with clients and external applications. The nodes of the Hadoop cluster communicate with each other using the internal IP addresses. According to the description, the error is caused by this multi-homed setup.
Solution
In this case, the property dfs.client.use.datanode.hostname should be set in the hdfs-site.xml file. This parameter should force a client to retrieve a hostname instead of an IP address and perform its own lookup of the hostname to get a routable path to that host.
To solve this, add the following line to the custom hdfs-site properties.
dfs.client.use.datanode.hostname=true
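When hdfs-site.xml is edited by hand rather than through a management UI, the same setting takes the usual Hadoop `<property>` form. The Python sketch below embeds such a fragment as a string and checks that the flag is set; the helper names are hypothetical, and the fragment is an illustration of the file format, not a copy of this cluster's configuration.

```python
import xml.etree.ElementTree as ET

# How the key=value line above looks as an hdfs-site.xml entry.
HDFS_SITE_SNIPPET = """\
<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the value of a Hadoop-style property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def client_uses_datanode_hostname(xml_text):
    """True when dfs.client.use.datanode.hostname is explicitly set to true."""
    value = get_property(xml_text, "dfs.client.use.datanode.hostname")
    return value is not None and value.lower() == "true"


print(client_uses_datanode_hostname(HDFS_SITE_SNIPPET))
```

A check like this could be run against the hdfs-site.xml that the Hive and Oozie clients actually load, since the property is a client-side setting and only has an effect in the configuration the client reads.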
Hope that helps. Please report back.
Created 05-05-2018 10:37 AM
Thanks. Could you please explain what a multi-homed cluster is? For the sake of clarification, I am sharing how my services are distributed across servers:
It's a 6-node cluster
3 of my instances are m4.4xlarge
3 of my instances are m4.2xlarge
1 Master Server has the following services on it:
-> Hue
-> Hive
-> ZooKeeper
-> Oozie
-> Sentry
-> Spark
-> Sqoop 2
-> NN
-> SNN
2 worker nodes with
-> Node Manager
-> HDFS
3. Gateway server with
-> gateways
-> Node Manager
4.1 node with
-> Node Manager on it
Is my architecture a problem, apart from NN and SNN being on the same server? @Geoffrey Shelton Okot
Recently, I added a Node Manager-only node and commissioned two data nodes with more space (since the older ones were running out of space). I had to rebalance the cluster, and since then all my Hive queries have been running very slowly. Would you be able to point to possible reasons? My master node has HDFS on it but not Node Manager. Could this be the reason?
Created 05-05-2018 10:39 AM
Also, it used to work fine until some time ago. I still don't understand why it started happening out of nowhere. My Hive queries running through Oozie still work a lot of the time, and at other times they break with this error.