Support Questions

Find answers, ask questions, and share your expertise

How to increase the number of MaintenanceManager threads in Kudu 1.2?

I have set maintenance_manager_num_threads to 6 via CM, but there are only two MaintenanceManager threads in the pstack output. One is "MaintenanceManager::LaunchOp(kudu::MaintenanceOp*)", the other is "kudu::MaintenanceManager::RunSchedulerThread()". I also find there are too many elements in "Non-running operations". How can I improve the performance of the MaintenanceManager?

Rising Star

How exactly are you setting the flag in CM? Did you restart the Kudu service after setting it? Have you verified that it actually took effect? When a Kudu process starts up, it logs its non-default command-line arguments; you can check whether maintenance_manager_num_threads appears in that log output.
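As an illustration of that check, here is a minimal sketch; the log excerpt below is hypothetical (grep your real kudu-tserver log the same way):

```shell
# Hypothetical one-line excerpt of a kudu-tserver startup log; in practice
# you would grep the real log file under your Kudu log directory.
log_line='Flags: --maintenance_manager_num_threads=6 --fs_wal_dir=/data/kudu/wal'

# Check whether the non-default flag actually took effect at startup.
if printf '%s\n' "$log_line" | grep -q 'maintenance_manager_num_threads=6'; then
  flag_status="applied"
else
  flag_status="missing"
fi
echo "$flag_status"   # prints "applied"
```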


Expert Contributor
It also would be helpful to upload a copy of the maintenance manager dashboard. It may be that the "not running" tasks are not running because they are not eligible to run.


The "-maintenance_manager_num_threads=6" flag is in "KUDU_TSERVER/gflagfile". There are about 1275 not-running tasks in the maintenance manager dashboard:

  - 211 CompactRowSetsOp tasks (Runnable: true)
  - 213 FlushDeltaMemStoresOp tasks (Runnable: false)
  - 212 FlushMRSOp tasks (62 Runnable: true, 150 Runnable: false)
  - 213 LogGCOp tasks (Runnable: true)
  - 217 MajorDeltaCompactionOp tasks (Runnable: false)
  - 213 MinorDeltaCompactionOp tasks (Runnable: false)

And there is a File Descriptors warning on the tablet server: "Concerning : Open file descriptors: 17,205. File descriptor limit: 32,768. Percentage in use: 52.51%. Warning threshold: 50.00%."


Thank you for the info.


Did you have a chance to look into the tserver's log?  It might happen that, due to some bug, the settings from the gflagfile were not applied because the tablet server hasn't been restarted with those new flags.  Just trying to narrow it down.


Another thing I'm curious about is what you see on the 'Thread Groups' page (/threadz) of the tablet server's internal webserver.  Looking there would be useful as well.


If there is no activity at all, there will be just one item there, named something like 'maintenance_scheduler-15482927'.  However, that does not mean you have just one maintenance thread.  E.g., in my case the 'Thread Group: maintenance' shows just one item like that, but in reality I have six idle threads waiting on a condition variable (I found that by attaching to the process with gdb; this is under OS X):


   11 "MaintenanceMgr [worker]-154828" 0x00007fff8e7c7db6 in __psynch_cvwait ()

   10 "MaintenanceMgr [worker]-154828" 0x00007fff8e7c7db6 in __psynch_cvwait ()

    9 "MaintenanceMgr [worker]-154828" 0x00007fff8e7c7db6 in __psynch_cvwait ()

    8 "MaintenanceMgr [worker]-154828" 0x00007fff8e7c7db6 in __psynch_cvwait ()

    7 "MaintenanceMgr [worker]-154828" 0x00007fff8e7c7db6 in __psynch_cvwait ()

    6 "MaintenanceMgr [worker]-154828" 0x00007fff8e7c7db6 in __psynch_cvwait ()
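A quick way to count those workers, sketched against the sample listing above (in practice you would pipe in the real output of pstack or the /threadz page instead of this hardcoded sample):

```shell
# Sample thread listing as shown above; replace with real pstack output.
threads='11 "MaintenanceMgr [worker]-154828"
10 "MaintenanceMgr [worker]-154828"
 9 "MaintenanceMgr [worker]-154828"
 8 "MaintenanceMgr [worker]-154828"
 7 "MaintenanceMgr [worker]-154828"
 6 "MaintenanceMgr [worker]-154828"'

# Count the maintenance worker threads, idle or not.
count=$(printf '%s\n' "$threads" | grep -c 'MaintenanceMgr \[worker\]')
echo "$count"   # prints 6
```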


The stack of each worker thread looks like the following:


(gdb) bt

#0  0x00007fff8e7c7db6 in __psynch_cvwait ()

#1  0x00007fff8f673728 in _pthread_cond_wait ()

#2  0x0000000109e8d223 in kudu::ConditionVariable::Wait (this=0x10b610e28) at /Users/aserbin/Projects/kudu/src/kudu/util/

#3  0x000000010a056b76 in kudu::ThreadPool::DispatchThread (this=0x10b610d20, permanent=true) at /Users/aserbin/Projects/kudu/src/kudu/util/

#4  0x000000010a05b2f9 in boost::_mfi::mf1<void, kudu::ThreadPool, bool>::operator() (this=0x10b6328e0, p=0x10b610d20, a1=true) at mem_fn_template.hpp:165

#5  0x000000010a05b257 in boost::_bi::list2<boost::_bi::value<kudu::ThreadPool*>, boost::_bi::value<bool> >::operator()<boost::_mfi::mf1<void, kudu::ThreadPool, bool>, boost::_bi::list0> (this=0x10b6328f0, f=@0x10b6328e0, a=@0x70000030fc50) at bind.hpp:315

#6  0x000000010a05b1da in boost::_bi::bind_t<void, boost::_mfi::mf1<void, kudu::ThreadPool, bool>, boost::_bi::list2<boost::_bi::value<kudu::ThreadPool*>, boost::_bi::value<bool> > >::operator() (this=0x10b6328e0) at bind.hpp:895

#7  0x000000010a05af60 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf1<void, kudu::ThreadPool, bool>, boost::_bi::list2<boost::_bi::value<kudu::ThreadPool*>, boost::_bi::value<bool> > >, void>::invoke (function_obj_ptr=@0x10b697490) at function_template.hpp:159

#8  0x0000000107c27f68 in boost::function0<void>::operator() (this=0x10b697488) at function_template.hpp:770

#9  0x000000010a047851 in kudu::Thread::SuperviseThread (arg=0x10b697440) at /Users/aserbin/Projects/kudu/src/kudu/util/

#10 0x00007fff8f67299d in _pthread_body ()

#11 0x00007fff8f67291a in _pthread_start ()

#12 0x00007fff8f670351 in thread_start ()


As you can see, the stack trace of an idle maintenance thread does not contain anything that looks like MaintenanceManager.  That might be the reason you didn't see the expected number of stack traces in the pstack output.


So, I would recommend working through this step by step.  First, verify that the kudu-tserver process is running with the expected flags.  If it is, the next step would be understanding how much activity your system has, to know whether to expect those threads to be busy with corresponding stack traces.

I can see the six "MaintenanceMgr [worker]" threads in "/threadz?group=thread%20pool". But now my question is why there are File Descriptors warnings on the tablet servers: "Concerning : Open file descriptors: 17,205. File descriptor limit: 32,768. Percentage in use: 52.51%. Warning threshold: 50.00%."

Rising Star

By default, CM will warn when 50% of a process' FDs are in use. Also by default, Kudu's block manager system will use 50% of the FDs available to the process. So, after accounting for some additional FDs for WALs, Kudu ends up using a little over 50% of the available FDs and CM warns about it.


If this bothers you, you can:

  1. Reconfigure Kudu's block_manager_max_open_files to some fixed value below 16384. The default value of -1 means Kudu will use 50% of what's available (16384 in your case).
  2. Reconfigure CM to warn at a higher threshold than 50%.
  3. Wait for CDH 5.11, where Kudu's default percentage was dropped from 50% to 40%.
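Those defaults line up with the numbers in the warning; a quick arithmetic sanity check, using the figures quoted from the warning message above:

```shell
fd_limit=32768   # file descriptor limit from the warning
open_fds=17205   # open FDs reported by CM

# Default block_manager_max_open_files of -1 means 50% of the FD limit.
block_manager_default=$((fd_limit / 2))
# Integer percentage of FDs in use (CM reports 52.51%).
pct_in_use=$((open_fds * 100 / fd_limit))

echo "$block_manager_default $pct_in_use"   # prints "16384 52"
```

So the block manager alone may hold up to 16384 FDs, and the WAL segments on top of that push usage just past CM's 50% warning threshold.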