relationship between Hive query and missing blocks on cluster

I decommissioned and deleted 2 out of my 3 HDFS data nodes. I expected the blocks to have been replicated first, but they had not been.

I started getting under-replication errors on my cluster. I have now started the HDFS balancer, but Hive queries are terribly slow.

Is there some relation between the two? Is it because HDFS has to write to three nodes when files are under-replicated?

Super Collaborator

If you decommissioned 2 out of 3 data nodes, do you only have one node left? In that case everything must execute on this single data node, which will have a performance impact.

If you still have three nodes and re-replication is ongoing for almost all of your files, you will have massive network load, which will certainly also affect any queries. During this rebalancing you will experience slow responses.

The relative impact is lower if you have 100 nodes and 2 get decommissioned, but it is still there.
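
As a rough illustration of that relative impact, here is a back-of-the-envelope sketch in Python. It assumes block replicas are spread evenly across the DataNodes, which a real cluster only approximates, and the function name is made up for this example.

```python
def share_of_replicas_lost(total_nodes: int, removed_nodes: int) -> float:
    """Fraction of all block replicas that lived on the removed DataNodes.

    Assumption (not from the thread): replicas are spread evenly across nodes.
    """
    if total_nodes <= 0 or not 0 <= removed_nodes <= total_nodes:
        raise ValueError("need total_nodes > 0 and 0 <= removed_nodes <= total_nodes")
    return removed_nodes / total_nodes


small = share_of_replicas_lost(3, 2)    # 2 of 3 nodes removed
large = share_of_replicas_lost(100, 2)  # 2 of 100 nodes removed
print(f"3-node cluster:   {small:.0%} of replicas have to be re-created")
print(f"100-node cluster: {large:.0%} of replicas have to be re-created")
```

Under that assumption, roughly two thirds of all replicas are affected in the 3-node case versus about 2% in the 100-node case, which is why the slowdown is so much more visible on a small cluster.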

@Harald Berghoff

Thank you.

On my old data nodes, and now even with the new ones, I have been seeing the error below in the Hive query logs when a query fails. It's bugging the entire system and we find ourselves clueless about it. Could you give some direction to help fix this error, please?

INFO  : Completed executing command(queryId=hive_20180504183030_1d508d09-3ac9-47fe-8d43-fd13804ec1d1); Time taken: 0.165 seconds
ERROR : Job Submission failed with exception 'java.lang.RuntimeException(Error caching map.xml: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.)'
java.lang.RuntimeException: Error caching map.xml: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.
	at org.apache.hadoop.hive.ql.exec.Utilities.setBaseWork(Utilities.java:743)
	at org.apache.hadoop.hive.ql.exec.Utilities.setMapWork(Utilities.java:678)
	at org.apache.hadoop.hive.ql.exec.Utilities.setMapRedWork(Utilities.java:670)
	at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:362)
	at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:142)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:80)
Caused by: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.
	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2270)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1701)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1620)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:772)

@Harald Berghoff Once I get the above-mentioned error, most of my Hive jobs after it start failing with:

ERROR : Execution failed with exit status: 1
ERROR : Obtaining error information
ERROR : 
Task failed!
Task ID:
  Stage-7

Logs:

ERROR : /var/log/hive/hadoop-cmf-CD-HIVE-XCVXskZf-HIVESERVER2-ip-172-31-4-192.ap-south-1.compute.internal.lo...
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
INFO  : Completed executing command(queryId=hive_20180504194545_4cbaf9cb-9f83-410f-9804-1618f1ee10d1); Time taken: 0.156 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=1)
Closing: 0: jdbc:hive2://ip-172-31-4-192.ap-south-1.compute.internal:10000/default
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]

Super Collaborator

Hive submits MapReduce jobs to process the files. So the sequence is quite clear: the MapReduce jobs get stopped and, as a result, your Hive query fails as well.

The error message indicates that the job was waiting on an I/O channel and got interrupted. What I can't say for sure is whether the MapReduce job got interrupted because of the replication taking place in your cluster or for some other reason.

@Harald Berghoff It also happened before replication started taking place. What other reasons could there be for it getting interrupted?

Also, in the cluster metrics screen, I see:

application_1525427506968_1194 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:24 +0550 2018 | Fri May 4 22:20:25 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1193 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:20 +0550 2018 | Fri May 4 22:20:23 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1192 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:17 +0550 2018 | Fri May 4 22:20:19 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1191 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:13 +0550 2018 | Fri May 4 22:20:15 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1190 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:09 +0550 2018 | Fri May 4 22:20:11 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1189 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:06 +0550 2018 | Fri May 4 22:20:08 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1188 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:20:02 +0550 2018 | Fri May 4 22:20:05 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1187 | dr.who | hadoop | YARN | root.users.dr_dot_who | Fri May 4 22:19:59 +0550 2018 | Fri May 4 22:20:01 +0550 2018 | FAILED | FAILED | N/A | N/A | N/A | History
application_1525427506968_1186 | hue | Create TEMPORARY TABL...year(current_date())(Stage-1) | MAPREDUCE | root.users.hue | Fri May 4 22:19:49 +0550 2018 | Fri May 4 22:20:31 +0550 2018 | FINISHED | SUCCEEDED | N/A

All of the applications that have dr.who as the user have failed, whereas those executed as the hue user have run fine.

@Harald Berghoff

Mentor

@Sim kaur

Problem

The application log file shows:

74865 millis timeout while waiting for channel to be ready for connecting: java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010]. 74865 millis timeout left.

All nodes are connected to each other via an internal switch, which is a subnet of 172.31.4.x. This network is not open to public access.

Cause

Each node in the Hadoop cluster has an internal IP address (through an internal switch), used for communication within the Hadoop cluster, and an external IP address, used to communicate with clients and external apps. According to the description, this is caused by the cluster being multi-homed.

Solution

In this case, the relevant hdfs-site.xml property is dfs.client.use.datanode.hostname. This parameter should force a client to retrieve a DataNode's hostname instead of its IP address and to perform its own lookup of that hostname to get a routable path to the host.

To solve this, add the following line into the custom hdfs-site properties.

dfs.client.use.datanode.hostname=true
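
For reference, hdfs-site.xml stores each setting as a property element with name and value children. The Python sketch below only renders the key=value line above in that shape. The helper name is hypothetical, and where the result is pasted depends on how your cluster's configuration is managed.

```python
import xml.etree.ElementTree as ET


def to_property_xml(name: str, value: str) -> str:
    """Render one Hadoop setting as a <property> element (hypothetical helper)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(prop, encoding="unicode")


snippet = to_property_xml("dfs.client.use.datanode.hostname", "true")
print(snippet)
```

The property is read on the client side, so it presumably needs to reach the configuration used by whatever process opens the HDFS connection, not only the DataNodes.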

Hope that helps. Please revert.

Thanks. Could you please explain what a multi-homed cluster is? For the sake of clarification, I am sharing how my services are distributed across servers:

It's a 6-node cluster:

  • 3 of my instances are m4.4xlarge
  • 3 of my instances are m4.2xlarge

1 master server has the following services on it:

  • Hue
  • Hive
  • ZooKeeper
  • Oozie
  • Sentry
  • Spark
  • Sqoop 2
  • NN
  • SNN

2 worker nodes with:

  • NodeManager
  • HDFS

3. Gateway server with:

  • Gateways
  • NodeManager

4. 1 node with:

  • NodeManager

Is my architecture a problem, apart from the NN and SNN being on the same server? @Geoffrey Shelton Okot

Recently, I added a NodeManager-only node and commissioned two data nodes with more space (since the older ones were running out of space). I had to rebalance the cluster, and since then all my Hive queries have been running very slowly. Would you be able to point to possible reasons? My master node has HDFS on it but no NodeManager. Could this be the reason?

Also, it used to work fine until some time back. I still don't understand why it started happening out of nowhere. My Hive queries running through Oozie still work a lot of the time, and at other times they break with this error.

@Geoffrey Shelton Okot: Thanks. Is there something else that I can do to resolve this issue in MapReduce? Spark isn't working all that well for me.

Mentor

@Sim kaur

For sure all was working when you had 3 DataNodes!!! The default replication factor is 3, so if you delete 2 out of your 3 HDFS data nodes, that literally means you have ONLY one copy of each file.
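
The arithmetic behind that can be sketched in a few lines of Python. It relies on the fact that HDFS places at most one replica of a given block on any one DataNode; the function name is only illustrative.

```python
def replicas_possible(replication_factor: int, live_datanodes: int) -> int:
    """Replicas per block that can exist at once.

    HDFS puts at most one replica of a block on each DataNode, so the
    number of live DataNodes caps the replica count.
    """
    return min(replication_factor, live_datanodes)


target = 3  # default replication factor mentioned above
for nodes in (3, 1):
    have = replicas_possible(target, nodes)
    print(f"{nodes} DataNode(s): {have} replica(s) per block, {target - have} short of target")
```

With a single DataNode left, every block is two replicas short of its target, and it stays reported as under-replicated until more DataNodes join.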

With 6 nodes you could have a setup like this:

  • 2 master nodes
  • 3 DataNodes (every DataNode should have a NodeManager by default)
  • 1 edge node (low-end node)

You should have at least 3 ZooKeeper servers running and a client on each node! When you are not running NameNode HA, you will see the NN and SNN running on the same node. The SNN daemon is only an NN helper for merging the edits and fsimage; it offloads the task of merging from the NN. But if you plan to have high availability, then you should configure a real NameNode HA, and the primary and standby NameNodes MUST run on 2 different nodes!!

There is no better document than the HWX one on multi-homed clusters. From your setup you are running CDH, but I don't think there is a big difference.

Please add the following to the custom hdfs-site properties, and check the HDFS configuration parameters in CDH.

dfs.client.use.datanode.hostname=true

In your previous post above, can you explain to me items

  • 3
  • 4

And please revert

Thanks. Let me check with

  1. dfs.client.use.datanode.hostname=true

and I will confirm.

Also, I am adding 3 ZooKeeper servers as well.

Thanks for clarifying that the SNN needs to be on a different node only when the NameNode is running in HA mode.

But what I am not able to understand is whether the error:

Job Submission failed with exception 'java.lang.RuntimeException(Error caching map.xml: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/172.31.4.192:42632 remote=/172.31.4.192:50010].

and

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
INFO  : Completed executing command(queryId=hive_20180504194545_4cbaf9cb-9f83-410f-9804-1618f1ee10d1); Time taken: 0.156 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=1)
Closing: 0: jdbc:hive2://ip-172-31-4-192.ap-south-1.compute.internal:10000/default
Intercepting System.exit(2)

is related to having under-replicated data. My data has now been replicated across 3 nodes and I am still running into this error. I was also running into it before I decommissioned the data nodes that had run out of space and commissioned the new ones. @Geoffrey Shelton Okot

3. Gateway server has following roles:


7 Role(s)
  • HDFS Gateway
  • HDFS HttpFS
  • Hive Gateway
  • YARN (MR2 Included) NodeManager
  • Sentry Gateway
  • Spark Gateway

4. This server only has

  • YARN (MR2 Included) NodeManager

I added it since our vCPUs are sometimes at full usage.

I would also like to clarify whether it is okay to have an HDFS DataNode on a server without a NodeManager on it, and whether it is okay to have a NodeManager without HDFS on the same node.

Mentor

@Sim kaur

I would suggest you install NodeManagers at least on the servers where DataNodes are running; this way the NodeManagers can find the data locally. DataNodes are part of HDFS and NodeManagers are part of YARN.

DataNodes are used to store data on HDFS, whereas NodeManagers are used to start containers on YARN. There is no strict rule that DataNodes and NodeManagers have to be on the same host. If you have NodeManagers on all nodes, the containers running on hosts where no DataNode is installed will still run the application, but by copying data from the DataNodes over the network. That could be the cause of the timeouts you are experiencing.

Mentor

@Sim kaur

Any updates?

@Geoffrey Shelton Okot: I have made the changes, i.e.:

  1. dfs.client.use.datanode.hostname=true
  2. Removed YARN from the nodes that did not have HDFS
  3. I still have a node that only has HDFS, though. (Is there a way that Hadoop can prioritize execution of jobs on worker nodes that have both of these services started on them?)

The error happens randomly throughout the day, and in the morning hours when most of our Hive reports run. I am in the IST time zone. Please give me until tomorrow morning IST to confirm whether this has completely resolved the issue 🙂

Also, I would like to know what exactly happens if ZooKeeper data is lost. I tried installing ZooKeeper server instances on other nodes, but the new instances did not start and kept returning "supervisord failed to start".

Mentor

@Sim kaur

Okay will wait for your feedback!!

How many ZooKeeper servers do you have running now? Anything less than 3 is NOT good! Check the ZooKeeper split-brain documentation.
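
To see why 3 is the practical minimum, here is a small Python sketch of majority-quorum arithmetic: a ZooKeeper ensemble stays available only while a strict majority of its servers is up. The function names are just for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(f"{n} server(s): quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

One or two servers tolerate no failure at all, while three tolerate one. That is also why odd ensemble sizes are the usual choice: a fourth server raises the quorum without raising the number of failures survived.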

For the failed installation, run

ps aux | grep supervisor

If there is an existing supervisor process, kill it by its PID, and then start the agent, to make sure the agent is using the right supervisord.conf: /var/run/cloudera-scm-agent/supervisor/supervisor.conf


@Geoffrey Shelton Okot I am still getting the error.

Also, from the metadata, I see that it is executing on the node that has both a NodeManager and HDFS.

Is there somewhere I can trace the reason for this error?

Note: /var/log/hive/hadoop-cmf-CD-HIVE-XCVXskZf-HIVESERVER2-ip-172-31-4-192.ap-south-1.compute.internal.lo... This path does not exist on any of my nodes.

Once I start getting this error, all my Hive jobs fail for a while, and they are usually fixed if I restart my cluster.

ERROR : Execution failed with exit status: 1
ERROR : Obtaining error information
ERROR : 
Task failed!
Task ID:
  Stage-20

Logs:

ERROR : /var/log/hive/hadoop-cmf-CD-HIVE-XCVXskZf-HIVESERVER2-ip-172-31-4-192.ap-south-1.compute.internal.log.out
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
WARN  : Shutting down task : Stage-1:MAPRED
WARN  : Shutting down task : Stage-7:MAPRED
WARN  : Shutting down task : Stage-11:MAPRED
INFO  : Completed executing command(queryId=hive_20180506020606_5277a480-1de4-4a48-9469-b6a2e7d0b200); Time taken: 0.163 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=1)
Closing: 0: jdbc:hive2://ip-172-31-4-192.ap-south-1.compute.internal:10000/default
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]

Mentor

@Sim kaur

I just answered your YARN memory setup question on another thread, where you were scheduling your Hive queries through Oozie 🙂 Please do use the Excel file to help you set up your vcores and memory correctly.

Please accept and close the previous threads where part of your issues was resolved, because the memory issue is different from the ZooKeeper setup or cluster setup questions, which I had already answered.

Try to keep your threads component-specific, as that makes them easier to resolve, and open a new thread for a new issue 🙂

Mentor

@Sim kaur

Have you tried switching to Beeline and executing your queries there?

set hive.execution.engine=mr;

Take note that the Hive CLI should be considered deprecated, as the Hive community has long recommended using the Beeline plus HiveServer2 configuration.

I believe I am already using Beeline. Yes, I tried switching back to the MapReduce execution engine but still get the same error. @Geoffrey Shelton Okot I do have HiveServer2 up and running on the cluster. Also, I don't find it ideal to have to switch to Hive on Spark because of this unidentified issue. Do you mind pointing out what the other reasons for the interruption of a MapReduce job could be, or whether it is possible to escalate this?