Created 06-26-2018 10:39 AM
Hi,
We're facing an issue with Spark in our production environment: the Spark workers fail to connect to the Spark master. Please see the logs below and help us resolve this issue.
Master Log:
18/06/25 22:59:12 INFO master.Master: akka.tcp://sparkWorker@spark7:7084 got disassociated, removing it.
18/06/25 22:59:12 INFO master.Master: akka.tcp://sparkWorker@spark7:7084 got disassociated, removing it.
18/06/25 22:59:12 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark7:7084] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 22:59:28 INFO master.Master: akka.tcp://sparkWorker@spark7:7079 got disassociated, removing it.
18/06/25 22:59:28 INFO master.Master: akka.tcp://sparkWorker@spark7:7079 got disassociated, removing it.
18/06/25 22:59:28 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark7:7079] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 22:59:28 INFO master.Master: akka.tcp://sparkWorker@spark7:7082 got disassociated, removing it.
18/06/25 22:59:28 INFO master.Master: akka.tcp://sparkWorker@spark7:7082 got disassociated, removing it.
18/06/25 22:59:28 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark7:7082] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 22:59:35 INFO master.Master: akka.tcp://sparkWorker@spark8:7081 got disassociated, removing it.
18/06/25 22:59:35 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark8:7081] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 22:59:35 INFO master.Master: akka.tcp://sparkWorker@spark8:7081 got disassociated, removing it.
18/06/25 23:00:23 INFO master.Master: akka.tcp://sparkWorker@spark9:7081 got disassociated, removing it.
18/06/25 23:00:23 INFO master.Master: akka.tcp://sparkWorker@spark9:7081 got disassociated, removing it.
18/06/25 23:00:23 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark9:7081] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 23:00:45 INFO master.Master: akka.tcp://sparkWorker@spark8:7085 got disassociated, removing it.
18/06/25 23:00:45 INFO master.Master: akka.tcp://sparkWorker@spark8:7085 got disassociated, removing it.
18/06/25 23:00:45 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark8:7085] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 23:00:48 INFO master.Master: akka.tcp://sparkWorker@spark7:7083 got disassociated, removing it.
18/06/25 23:00:48 INFO master.Master: akka.tcp://sparkWorker@spark7:7083 got disassociated, removing it.
18/06/25 23:00:48 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark7:7083] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
18/06/25 23:01:52 INFO master.Master: akka.tcp://sparkWorker@spark0:7080 got disassociated, removing it.
18/06/25 23:01:52 INFO master.Master: akka.tcp://sparkWorker@spark0:7080 got disassociated, removing it.
18/06/25 23:01:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@spark0:7080] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
Worker Log:
18/06/25 22:43:56 INFO util.Utils: Successfully started service 'sparkWorker' on port 7081.
18/06/25 22:43:56 INFO worker.Worker: Starting Spark worker HKLPADBID09:7081 with 4 cores, 16.0 GB RAM
18/06/25 22:43:56 INFO worker.Worker: Running Spark version 1.4.1-palantir3
18/06/25 22:43:56 INFO worker.Worker: Spark home: /opt/palantir/spark-1.4.1-palantir3-bin-hadoop2.4
18/06/25 22:43:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/06/25 22:43:56 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:8084
18/06/25 22:43:56 INFO util.Utils: Successfully started service 'WorkerUI' on port 8084.
18/06/25 22:43:56 INFO ui.WorkerWebUI: Started WorkerWebUI at http://SPARK:8084
18/06/25 22:43:56 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:44:10 INFO worker.Worker: Retrying connection to master (attempt # 1)
18/06/25 22:44:10 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:44:24 INFO worker.Worker: Retrying connection to master (attempt # 2)
18/06/25 22:44:24 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:44:38 INFO worker.Worker: Retrying connection to master (attempt # 3)
18/06/25 22:44:38 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:44:52 INFO worker.Worker: Retrying connection to master (attempt # 4)
18/06/25 22:44:52 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:45:06 INFO worker.Worker: Retrying connection to master (attempt # 5)
18/06/25 22:45:06 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:45:20 INFO worker.Worker: Retrying connection to master (attempt # 6)
18/06/25 22:45:20 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:46:42 INFO worker.Worker: Retrying connection to master (attempt # 7)
18/06/25 22:46:42 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:48:04 INFO worker.Worker: Retrying connection to master (attempt # 8)
18/06/25 22:48:04 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:49:26 INFO worker.Worker: Retrying connection to master (attempt # 9)
18/06/25 22:49:26 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:50:48 INFO worker.Worker: Retrying connection to master (attempt # 10)
18/06/25 22:50:48 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:52:10 INFO worker.Worker: Retrying connection to master (attempt # 11)
18/06/25 22:52:10 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:53:32 INFO worker.Worker: Retrying connection to master (attempt # 12)
18/06/25 22:53:32 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:54:54 INFO worker.Worker: Retrying connection to master (attempt # 13)
18/06/25 22:54:54 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:56:16 INFO worker.Worker: Retrying connection to master (attempt # 14)
18/06/25 22:56:16 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:57:38 INFO worker.Worker: Retrying connection to master (attempt # 15)
18/06/25 22:57:38 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 22:59:00 INFO worker.Worker: Retrying connection to master (attempt # 16)
18/06/25 22:59:00 INFO worker.Worker: Connecting to master akka.tcp://sparkMaster@spark:7077/user/Master...
18/06/25 23:00:22 ERROR worker.Worker: All masters are unresponsive! Giving up.
18/06/25 23:00:22 INFO util.Utils: Shutdown hook called
Created 06-27-2018 09:56 AM
@Geoffrey Shelton Okot @adash Could you please help us fix the issue described above? Thanks.
Created 06-27-2018 10:58 AM
Can you check that the ports in the Java code and the Akka configuration match?
Created 06-27-2018 01:31 PM
@Geoffrey Shelton Okot The port in the config looks correct. How do I check the port in the Java code?
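Independently of where the port is configured, one quick check is whether the master address shown in the worker log (host `spark`, port 7077) is reachable from a worker machine at all. Below is a minimal sketch in Python; the helper name `can_connect` and the 3-second timeout are my own choices for illustration, not anything from Spark or Akka. It only tests a plain TCP connection, so a `True` result rules out a closed port or firewall but does not prove that worker registration will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. It checks TCP reachability,
    not whether the Spark master accepts the worker's registration.
    """
    try:
        # create_connection resolves the hostname and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Run on a worker host, `can_connect("spark", 7077)` returning `False` would point to a name-resolution, firewall, or wrong-port problem rather than a Spark setting. The host and port there are simply the values from the worker log above; substitute whatever your master actually uses.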