Why does HBase need a nodemanager when it uses coprocessors?

Node Manager is used to start, execute and monitor containers on YARN.

A coprocessor, on the other hand, is a framework that performs distributed computation directly within the HBase server processes.

I have tables in HBase which I query using Phoenix. My question is: if coprocessors are doing the computation and fetching the results from HBase, why do I need the Node Manager at all?
The reason I ask is that my YARN UI currently shows the three nodes of my cluster, each with 8 GB and 8 cores assigned to it, but none of that is used while I fire queries at HBase. The number of containers created is 0 as well. Can someone please help me understand this concept or point me to a link?


Super Collaborator

During regular operations such as put/get/scan, HBase doesn't need YARN. There are only several utilities that may use MR jobs such as bulkload or rowcount. The cluster components required for regular HBase work are ZooKeeper and HDFS (or, more precisely, a distributed filesystem, not only HDFS).

Thanks for the response, @Sergey. I still have a few questions and would appreciate a reply.

You said, and I quote, 'There are only several utilities that may use MR jobs such as bulkload or rowcount'. In that case, containers should be created on the Node Managers when those utilities run.

The thing is, when I run my bulk upload utility, I still see no containers being created (the UI at port 8088 doesn't show anything happening). Can you please suggest how I can check whether MR jobs are created during a bulk upload or, say, a row count?
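One way to check this from a script rather than the UI, as a minimal sketch: the ResourceManager that serves the UI on port 8088 also exposes a REST endpoint, `/ws/v1/cluster/apps`, which accepts `applicationTypes` and `states` filters. The host name and the function names below are placeholders for illustration, not part of any HBase or YARN tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_host, port=8088, app_type="MAPREDUCE",
             states=("ACCEPTED", "RUNNING", "FINISHED")):
    """Build the ResourceManager REST URL that lists applications."""
    query = urlencode({"applicationTypes": app_type,
                       "states": ",".join(states)})
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload):
    """Return (id, name, state) for each app in a /cluster/apps response.

    The ResourceManager returns "apps": null when nothing matches,
    so both levels are guarded against None.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    return [(a["id"], a["name"], a["state"]) for a in apps]


def list_mapreduce_apps(rm_host):
    """Fetch and summarize MapReduce apps (needs network access to the RM)."""
    with urlopen(apps_url(rm_host), timeout=10) as resp:
        return summarize_apps(json.load(resp))
```

Calling `list_mapreduce_apps("rm.example.com")` (hypothetical host) while the bulk upload or row count runs should list the job if it was submitted to YARN. An empty list suggests the utility did not submit a job to this cluster; one possible cause is `mapreduce.framework.name` being set to `local`, in which case the job runs inside the client process and no containers are created.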

If YARN takes care of running MR jobs in parallel in Hadoop (HDFS + MR), what exactly do coprocessors do to achieve parallel processing in HBase? How do I verify that scans, gets, or puts really are happening in parallel? Any link on this would be a big help.

Super Collaborator

HBase uses in-memory caches for its tables to reduce latency, so, depending on the query, you may not see any disk I/O at all. How your data is distributed across the HBase nodes depends on the 'sharding', which HBase can do automatically or which you can define during table creation. An operation you execute, e.g. a scan, runs in parallel on all RegionServers holding parts of the table. To my understanding, YARN does not balance node resources between HBase queries and MR jobs. In the cases I am aware of, HBase and YARN are each configured to use only a share of the hardware resources available on the node, e.g. the RAM, to avoid issues.
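To illustrate the idea of a scan fanning out over the parts of a table, here is a toy model in plain Python. It is not HBase client code: the split keys, the thread pool standing in for RegionServers, and all function names are assumptions made up for this sketch.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

# Hypothetical split points chosen at table creation: 3 keys give 4 regions.
SPLIT_KEYS = ["g", "n", "t"]


def region_for(row_key, split_keys=SPLIT_KEYS):
    """Index of the region whose key range contains row_key.

    A split key is the inclusive start of the region to its right.
    """
    return bisect_right(split_keys, row_key)


def build_regions(rows, split_keys=SPLIT_KEYS):
    """Distribute rows into per-region dicts according to the split keys."""
    regions = [dict() for _ in range(len(split_keys) + 1)]
    for key, value in rows.items():
        regions[region_for(key, split_keys)][key] = value
    return regions


def scan_region(region, predicate):
    """Work done next to the data: filter one region's rows, in key order."""
    return [(k, v) for k, v in sorted(region.items()) if predicate(v)]


def parallel_scan(regions, predicate):
    """Run the per-region scan concurrently and concatenate in region order."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = pool.map(scan_region, regions, [predicate] * len(regions))
        return [row for part in partials for row in part]
```

For example, `parallel_scan(build_regions({"apple": 1, "grape": 7, "pear": 12}), lambda v: v > 5)` filters each region independently and merges the partial results. No scheduler such as YARN appears anywhere in this model, which mirrors the point above: the parallelism comes from the table being split across long-running server processes, not from containers launched per query.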
