Impala DML frozen on CDH manager frozen - hidden dependency?

We noticed that when the Cloudera management node in our cluster is frozen, our Impala DML queries no longer finish and simply hang.

We did not expect that.

This is reproducible by freezing the management node with lxc-freeze, but we also noticed the same behavior when the node was unavailable for other reasons.

 

We assumed Impala just needs statestore, catalogd and the Hive metastore to be happy. But for some reason, insert statements simply get stuck without any indication of what might be going on.

Select queries are still running fine.

 

We can't see anything in the logs of catalogd, statestore or hive metastore. All of those services and all impalad run on different nodes from the one being frozen.

 

Is there a hidden dependency from Impala on any of the following services that could prevent DMLs from finishing?

 

  • Cloudera Management Service Activity Monitor
  • Cloudera Management Service Alert Publisher
  • Cloudera Management Service Event Server
  • Cloudera Management Service Host Monitor
  • Cloudera Management Service Service Monitor

 

Those are the only services running on the frozen node.

 

Our cluster looks something like this:

 

node H1:

  • Cloudera Management Service Activity Monitor
  • Cloudera Management Service Alert Publisher
  • Cloudera Management Service Event Server
  • Cloudera Management Service Host Monitor
  • Cloudera Management Service Service Monitor

node H2 - H<n>:

  • Catalog Service
  • Statestore
  • HiveServer2
  • HDFS JournalNode
  • HDFS NameService
  • ...

node H<n> - H<m>:

  • impalad
  • HDFS Datanode
  • ...

 

When comparing the impalad query log output between the frozen and unfrozen states of the Cloudera management node, the difference seems to be in the cleanup process.

 

normal insert:

 

not frozen:

I1007 20:00:59.769767 49261 admission-controller.cc:440] Schedule for id=a94b13644229aa92:dd551bc800000000 in pool_name=root.impala-default cluster_mem_needed=6.00 GB PoolConfig:
I1007 20:00:59.769858 49261 admission-controller.cc:451] Admitted query id=a94b13644229aa92:dd551bc800000000
I1007 20:00:59.769892 49261 coordinator.cc:441] Exec() query_id=a94b13644229aa92:dd551bc800000000 stmt=INSERT INTO `sb_sandbox`.`continues_insert_test` VALUES (0)
I1007 20:00:59.770063 49261 coordinator.cc:592] starting 1 fragment instances for query a94b13644229aa92:dd551bc800000000
I1007 20:00:59.771142 17639 fragment-mgr.cc:40] ExecPlanFragment() instance_id=a94b13644229aa92:dd551bc800000000 coord=h9.yieldlab.lan:22000
I1007 20:00:59.771531 48326 plan-fragment-executor.cc:119] Prepare(): query_id=a94b13644229aa92:dd551bc800000000 instance_id=a94b13644229aa92:dd551bc800000000
I1007 20:00:59.772028 49261 coordinator.cc:630] started 1 fragment instances for query a94b13644229aa92:dd551bc800000000
I1007 20:00:59.772033 48326 plan-fragment-executor.cc:175] descriptor table for fragment=a94b13644229aa92:dd551bc800000000
I1007 20:00:59.772241 48326 plan-fragment-executor.cc:300] Open(): instance_id=a94b13644229aa92:dd551bc800000000
I1007 20:00:59.772488 49261 impala-server.cc:895] Query a94b13644229aa92:dd551bc800000000 has timeout of 2m
I1007 20:00:59.955390 18884 coordinator.cc:1536] Fragment instance completed: id=a94b13644229aa92:dd551bc800000000 host=h9.yieldlab.lan:22000 remaining=0
I1007 20:00:59.955576 48328 coordinator.cc:1031] Finalizing query: a94b13644229aa92:dd551bc800000000
I1007 20:00:59.955627 48326 fragment-mgr.cc:99] PlanFragment completed. instance_id=a94b13644229aa92:dd551bc800000000
.b/sb_sandbox/continues_insert_test/_impala_insert_staging/a94b13644229aa92_dd551bc800000000/
I1007 20:01:00.379140 49261 impala-hs2-server.cc:679] CloseOperation(): query_id=a94b13644229aa92:dd551bc800000000
I1007 20:01:00.379181 49261 impala-server.cc:906] UnregisterQuery(): query_id=a94b13644229aa92:dd551bc800000000
I1007 20:01:00.379195 49261 impala-server.cc:992] Cancel(): query_id=a94b13644229aa92:dd551bc800000000
I1007 20:01:00.379209 49261 coordinator.cc:1351] Cancel() query_id=a94b13644229aa92:dd551bc800000000
I1007 20:01:00.379233 49261 coordinator.cc:1417] CancelFragmentInstances() query_id=a94b13644229aa92:dd551bc800000000, tried to cancel 0 fragment instances

 

 

frozen version (insert statement never finishes and just hangs):

 

frozen:

I1007 20:03:32.718174 49317 admission-controller.cc:440] Schedule for id=8541dc1ac1e3542e:e991eaf600000000 in pool_name=root.impala-default cluster_mem_needed=6.00 GB PoolConfig:
I1007 20:03:32.718283 49317 admission-controller.cc:451] Admitted query id=8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.718322 49317 coordinator.cc:441] Exec() query_id=8541dc1ac1e3542e:e991eaf600000000 stmt=INSERT INTO `sb_sandbox`.`continues_insert_test` VALUES (0)
I1007 20:03:32.718523 49317 coordinator.cc:592] starting 1 fragment instances for query 8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.719136 18513 fragment-mgr.cc:40] ExecPlanFragment() instance_id=8541dc1ac1e3542e:e991eaf600000000 coord=h9.yieldlab.lan:22000
I1007 20:03:32.719440 50128 plan-fragment-executor.cc:119] Prepare(): query_id=8541dc1ac1e3542e:e991eaf600000000 instance_id=8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.719676 50128 plan-fragment-executor.cc:175] descriptor table for fragment=8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.719667 49317 coordinator.cc:630] started 1 fragment instances for query 8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.719853 50128 plan-fragment-executor.cc:300] Open(): instance_id=8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.720300 49317 impala-server.cc:895] Query 8541dc1ac1e3542e:e991eaf600000000 has timeout of 2m
I1007 20:03:32.847476 11346 coordinator.cc:1536] Fragment instance completed: id=8541dc1ac1e3542e:e991eaf600000000 host=h9.yieldlab.lan:22000 remaining=0
I1007 20:03:32.847577 50130 coordinator.cc:1031] Finalizing query: 8541dc1ac1e3542e:e991eaf600000000
I1007 20:03:32.847705 50128 fragment-mgr.cc:99] PlanFragment completed. instance_id=8541dc1ac1e3542e:e991eaf600000000
.b/sb_sandbox/continues_insert_test/_impala_insert_staging/8541dc1ac1e3542e_e991eaf600000000/

 Impala 2.7, CDH 5.10
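
The comparison of the two excerpts above can be automated. Below is a minimal Python sketch, assuming that the two markers visible in the logs (`Finalizing query:` and `UnregisterQuery():`) are enough to tell a completed insert from a hung one. The function name `find_unclosed_queries` is made up for illustration and is not part of Impala.

```python
import re

# Log markers copied from the impalad excerpts above.
FINALIZE_RE = re.compile(r"Finalizing query: (\S+)")
UNREGISTER_RE = re.compile(r"UnregisterQuery\(\): query_id=(\S+)")


def find_unclosed_queries(log_lines):
    """Return ids of queries that were finalized but never unregistered."""
    finalized = []        # keeps first-seen order
    unregistered = set()
    for line in log_lines:
        match = FINALIZE_RE.search(line)
        if match:
            if match.group(1) not in finalized:
                finalized.append(match.group(1))
            continue
        match = UNREGISTER_RE.search(line)
        if match:
            unregistered.add(match.group(1))
    return [qid for qid in finalized if qid not in unregistered]


# Three lines taken from the excerpts: one normal insert, one hung insert.
sample = [
    "I1007 20:00:59.955576 48328 coordinator.cc:1031] Finalizing query: a94b13644229aa92:dd551bc800000000",
    "I1007 20:01:00.379181 49261 impala-server.cc:906] UnregisterQuery(): query_id=a94b13644229aa92:dd551bc800000000",
    "I1007 20:03:32.847577 50130 coordinator.cc:1031] Finalizing query: 8541dc1ac1e3542e:e991eaf600000000",
]
print(find_unclosed_queries(sample))
# prints ['8541dc1ac1e3542e:e991eaf600000000']
```

A query that is still legitimately running at the end of the log would also be reported, so the result is only a hint about where to look.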

 

If you have any idea what could cause this, we would be happy to hear about it.

 

3 REPLIES

Re: Impala DML frozen on CDH manager frozen - hidden dependency?

There's no dependency on any of the Cloudera management services.

 

Inserts also depend on the HDFS service being healthy (i.e. NameNodes, DataNodes, etc.).

 

Various other underlying services could also be in play: Kerberos infrastructure such as the KDC, the KMS if you're using certain encryption features, and so on.

 

Those logs look like the client didn't actually close the query, so I'd question whether something disrupted the client's connection to the Impala daemon (e.g. a load balancer was paused, or something happened to the client process).

 

Re: Impala DML frozen on CDH manager frozen - hidden dependency?

I should also say: if you have a chance to upgrade your cluster, I think your experience with Impala would improve quite a lot. The last CDH 5 release, 5.16.2, is a big jump in scalability, performance and reliability from 5.10. CDH 6.3.3 is a big jump beyond that in terms of features, and CDP is another huge step, particularly for metadata performance.

Re: Impala DML frozen on CDH manager frozen - hidden dependency?

Thanks for your input.

 

We are running our staging cluster in "non-production mode" using the embedded PostgreSQL. This PostgreSQL instance is also used by the two Hive metastore servers, and it is hosted on the same node as the Cloudera management services. When this host freezes, the Impala insert queries freeze as well.

 

We were surprised to see that there seems to be no timeout between the Hive metastore servers and their backing database (PostgreSQL), and no error either.

 

This probably also happens with an external PostgreSQL or MySQL database, although we have not tested that.
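
To illustrate why a missing timeout matters here, the following is a rough Python sketch of a liveness probe for the metastore's backing database. It assumes the frozen host may still accept TCP connections at the kernel level while the frozen process never answers, so a plain connect check would not be enough. The probe sends the standard PostgreSQL `SSLRequest` message and waits for the one-byte reply with an explicit timeout. The function name `probe_postgres`, the default timeout and the returned status strings are our own choices for illustration, not anything provided by Hive or Cloudera.

```python
import socket
import struct

# PostgreSQL SSLRequest: Int32 length (8) followed by Int32 code 80877103.
# A live server answers with a single byte, b"S" or b"N".
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_postgres(host, port, timeout=3.0):
    """Return "ok", "timeout" or "error" for a PostgreSQL endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)      # also bounds the send and the read
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
    except socket.timeout:
        # Connected (or tried to) but nobody answered in time.
        return "timeout"
    except OSError:
        # Connection refused, host unreachable, name lookup failure, ...
        return "error"
    return "ok" if reply in (b"S", b"N") else "error"
```

Calling `probe_postgres` with the host and port of the metastore database (both depend on your setup) would be expected to return "timeout" rather than hang when the database process is frozen, which is the behavior we had hoped to get from the metastore itself.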

 

I wonder if this might be solved by a newer CDH version. We are currently looking into upgrading, and would very much like to do so for other reasons as well.

 
