We run a Cloudera-based Hadoop cluster, CDH 5.11 (not Hortonworks), and recently added 5 new Impala daemon nodes (Impala version 2.8). After adding the new nodes, things looked fine, but after 7-8 hours we started getting the errors below when the Impala coordinator tried to connect to the newly added nodes. Please help us resolve this, as we are blocked in production.
Error: Sender timed out waiting for receiver fragment instance
Detailed error:
I0506 20:16:55.660058 72446 coordinator.cc:1417] CancelFragmentInstances() query_id=d6447c0a5ed591c4:47ac776800000000, tried to cancel 35 fragment instances
I0506 20:16:55.663775 72446 coordinator.cc:756] Query id=d6447c0a5ed591c4:47ac776800000000 failed because fragment id=d6447c0a5ed591c4:47ac776800000006 on host=hadoop-slave22.use1.data.ripple.com:22000 failed.
I0506 20:16:55.664430 72093 coordinator.cc:1060] All fragment instances finished due to one or more errors.
Sender timed out waiting for receiver fragment instance: d6447c0a5ed591c4:47ac77680000001b
CDH 5.11 and Impala 2.8 are pretty old. You should try the latest versions.
The logs show two problem fragment instances: one that failed and one that timed out.
The failed one is on host hadoop-slave22.use1.data.ripple.com. Check the impalad logs on that host for more details.
The one that timed out has instance_id=d6447c0a5ed591c4:47ac77680000001b. The first part (d6447c0a5ed591c4) is the same for every fragment instance of this query. The last part (47ac77680000001b) is the id of the fragment instance. Search the earlier coordinator logs for this id to see which host the instance was scheduled on, then check the impalad logs on that host. This error is usually due to network saturation. Later Impala versions include RPC improvements, e.g. KRPC.
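The id matching described above can be sketched in Python. This is only an illustration, not an Impala tool: the function names (`split_unique_id`, `failed_fragments`, `lines_mentioning`) are hypothetical, and the regular expression assumes the "fragment id=... on host=... failed" wording seen in the coordinator log lines quoted in the question.

```python
import re

# Two coordinator log lines copied from the question above.
SAMPLE_LOG = [
    "I0506 20:16:55.663775 72446 coordinator.cc:756] Query id=d6447c0a5ed591c4:47ac776800000000 "
    "failed because fragment id=d6447c0a5ed591c4:47ac776800000006 on "
    "host=hadoop-slave22.use1.data.ripple.com:22000 failed.",
    "Sender timed out waiting for receiver fragment instance: d6447c0a5ed591c4:47ac77680000001b",
]

# Assumed wording, taken from the log line above; other versions may differ.
FAILED_RE = re.compile(r"fragment id=(\S+) on host=(\S+) failed")


def split_unique_id(unique_id):
    """Split 'first:last' into its two hex parts."""
    first, last = unique_id.split(":")
    return first, last


def failed_fragments(log_lines):
    """Return (fragment_instance_id, host) pairs reported as failed."""
    result = []
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            result.append((match.group(1), match.group(2)))
    return result


def lines_mentioning(log_lines, instance_id):
    """Return every log line that contains the given instance id."""
    return [line for line in log_lines if instance_id in line]


if __name__ == "__main__":
    for instance_id, host in failed_fragments(SAMPLE_LOG):
        print("failed instance", instance_id, "on", host)
    for line in lines_mentioning(SAMPLE_LOG, "d6447c0a5ed591c4:47ac77680000001b"):
        print("timed-out instance mentioned in:", line)
```

To use it on a real cluster, read the coordinator's impalad log file into a list of lines and pass that list in place of `SAMPLE_LOG`. Searching for the full instance id rather than just the last part avoids false matches from other queries.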