
Spark on YARN fails accessing an external HBase cluster

Rising Star

I'm getting this error when trying to access HBase data in an external cluster while running Spark on YARN in another cluster. But when I run Spark in local mode, it works fine. We are on CDH 5.4.8. I read that YARN can only access HBase on the same cluster because YARN needs to access the underlying HFiles stored in HDFS. Is this true?
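For context, an HBase client (including a Spark job) normally locates the cluster through ZooKeeper rather than by reading HFiles directly, so it needs the external cluster's connection settings on its classpath. A minimal sketch of the relevant hbase-site.xml entries is below; the hostnames and znode path are hypothetical placeholders, not values from this thread:

```xml
<!-- hbase-site.xml fragment pointing a client at an external HBase cluster.
     Hostnames and the znode parent are examples only; use your cluster's values. -->
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
</configuration>
```

When running on YARN, this file must reach the executors as well (for example by shipping it with the job), since in local mode the driver may pick it up from the local classpath while the YARN containers do not.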





Re: Spark on YARN fails accessing an external HBase cluster

Rising Star

Found the root cause of why YARN was hanging and HDFS was being bombarded with data, driving the file descriptor count up: it was log aggregation.


Log aggregation is enabled by default, which means all the Node Managers upload their task logs to a central location in HDFS. This is a good thing for debugging YARN applications. But all this time, the directory was misconfigured and pointed at a location that did not exist, so the nodes just kept emitting error messages while still continuing to process tasks.

Once we noticed these errors, we set the directory back to the HDFS default and the errors went away. In return, we got something even worse: a flood of log entries began overwhelming HDFS, causing a major slowdown, and sometimes a complete halt, of data storage attempts. File descriptors maxed out, and YARN, unable to write its logs, left applications suspended in a Pending state. The only course of action was to turn off log aggregation so that each Node Manager stores its own logs locally.
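The two settings involved in the story above can be sketched as a yarn-site.xml fragment. These property names are the standard Hadoop ones; the directory value shown is the usual default, not necessarily what this cluster used:

```xml
<!-- yarn-site.xml fragment: the switch that was turned off, and the
     remote directory that was originally misconfigured. -->
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>false</value> <!-- set to true to re-enable aggregation -->
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value> <!-- must exist in HDFS with permissions the NodeManagers can write to -->
  </property>
</configuration>
```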


I would like to use this feature, but I don't know how to use it without causing this problem again.
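One commonly cited way to keep aggregation enabled without letting the log directory grow unbounded is to set a retention window, so old aggregated logs are deleted automatically. A sketch, assuming the standard Hadoop retention properties and example values (one week of retention, checked daily):

```xml
<!-- yarn-site.xml fragment: bound the growth of aggregated logs.
     The durations here are illustrative, not recommendations. -->
<configuration>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value> <!-- keep aggregated logs for 7 days -->
  </property>
  <property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>86400</value> <!-- run the deletion check once a day -->
  </property>
</configuration>
```

Retention limits total storage, but it does not by itself throttle the upload burst described above; pre-creating the remote log directory with the correct ownership and permissions before re-enabling the feature would also avoid the original misconfiguration.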