What are Oozie Production Recommendations?

What are our deployment/production recommendations for Oozie?

Such as:

- Database recommendations (e.g. don’t use Derby)

- HA considerations

- Component placement

- Scaling, required resources, ...

1 ACCEPTED SOLUTION

As a general rule we do NOT use the default Ambari databases. Pick one (MySQL, Oracle, or PostgreSQL) and have a separate instance stood up for it. Then use it for all of your repositories.

It should be this way for any environment beyond a sandbox. I wouldn’t even do a POC with the defaults, simply because the defaults are all over the place and POCs can turn into production systems :). Once you’ve committed to a repository, changing database “types” is not really possible; you basically need to start over. I know a lot of people have asked about this in the past, but it’s a mess.

Take 2 minutes in the beginning, set up an “independent” MySQL (or other) database, and use it. If you need to move the MySQL instance around in the future, that’s possible and far more achievable than switching types.

NOTE: Ambari won’t lay down MySQL until it installs the Hive Metastore, so even if you figure out a way to use that Metastore database for Oozie, Ranger, etc., it will be controlled by the Hive service config. So Ambari WILL restart MySQL if you’ve allowed Ambari to install it. If you didn’t catch me saying it earlier: install a separate and independent RDBMS for your metastores.
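For anyone wondering what pointing Oozie at an independent MySQL looks like, here is a minimal sketch that builds the JDBC properties for oozie-site.xml. The `oozie.service.JPAService.jdbc.*` property names are the standard Oozie ones; the host name, database name, user, password, and the two helper functions are made up for illustration. The driver class shown is the older Connector/J name, so check it against the connector version you actually install.

```python
from xml.sax.saxutils import escape


def oozie_jdbc_properties(host, port=3306, database="oozie",
                          user="oozie", password="changeme"):
    """Build oozie-site.xml JDBC settings for an external MySQL instance.

    host, database, user and password are placeholders (assumptions),
    not values taken from any real cluster.
    """
    return {
        # Connector/J 5.x class name; newer connectors use com.mysql.cj.jdbc.Driver
        "oozie.service.JPAService.jdbc.driver": "com.mysql.jdbc.Driver",
        "oozie.service.JPAService.jdbc.url": f"jdbc:mysql://{host}:{port}/{database}",
        "oozie.service.JPAService.jdbc.username": user,
        "oozie.service.JPAService.jdbc.password": password,
    }


def to_site_xml(properties):
    """Render a dict of properties as Hadoop-style <property> elements."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    # metadb.example.com is a hypothetical host for the independent RDBMS
    print(to_site_xml(oozie_jdbc_properties("metadb.example.com")))
```

On an Ambari-managed cluster you would enter the same values through the Oozie service configuration rather than editing the file by hand; the sketch only shows which settings are involved.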

8 REPLIES

Contributor

DB – MySQL worked great for an install at a large customer. There is some work to swap out the default after Ambari has already been configured.

See the following KB article for more details: Moving Oozie to MySQL with Ambari

I haven’t set up HA for Oozie, but I believe @dstreever@hortonworks.com was recently working on this. You’ll need ZooKeeper for HA.

We had over 1000 various bundles/coordinators/workflows running without any noticeable performance impact using the default memory settings.

Expert Contributor

Oozie with Derby as the DB should not be deployed. I have had to get into a couple of escalations with customers who had Derby DBs of several GB.

MySQL is a good choice, and Oracle is also supported. On Windows, we support SQL Server. I’m not sure how common PostgreSQL installations are, even though it is also supported (and we will support SQL Anywhere going forward).

Regarding placement, it is good to have the Oozie server separate from other components (keeping it off the HS2 host, as mentioned, is a good idea). This is one area we want to explore as part of a reference architecture, and any feedback on that would be great.

Oozie supports NameNode (NN) and ResourceManager (RM) HA.

Oozie HA: It is now supported by Ambari. You need ZooKeeper configured, and we have a section describing HA setup. QE has tested Oozie HA with Apache HTTPD with mod_proxy, but other load-balancing solutions should also work (nginx is something I configured myself before the mod_proxy configuration for QE).

As part of rolling upgrade, we fixed a bunch of issues with the Oozie HA configuration: the client now does a client-side retry of hosts, sharelib purge is disabled, and RU orchestration handles the server shutdown/restart.
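To make the ZooKeeper and load-balancer pieces concrete, here is a rough sketch of the oozie-site.xml settings an HA setup typically involves. The property names and the ZooKeeper-backed service classes are as I recall them from the Apache Oozie HA documentation, so verify them against your Oozie version (ZKUUIDService, for example, is not in older releases). The host names, the load balancer URL, and the helper function are hypothetical.

```python
# ZooKeeper-backed services Oozie loads for HA (verify against your version)
HA_SERVICES = [
    "org.apache.oozie.service.ZKLocksService",
    "org.apache.oozie.service.ZKXLogStreamingService",
    "org.apache.oozie.service.ZKJobsConcurrencyService",
    "org.apache.oozie.service.ZKUUIDService",
]


def oozie_ha_properties(zk_hosts, load_balancer_url, namespace="oozie"):
    """Build the oozie-site.xml settings shared by every Oozie server in an HA group.

    zk_hosts is a list of "host:port" strings; load_balancer_url is the
    address clients use (e.g. the HTTPD/mod_proxy or nginx front end).
    """
    if not zk_hosts:
        raise ValueError("Oozie HA needs at least one ZooKeeper host")
    return {
        "oozie.services.ext": ",".join(HA_SERVICES),
        "oozie.zookeeper.connection.string": ",".join(zk_hosts),
        "oozie.zookeeper.namespace": namespace,
        # Clients and callbacks should go through the load balancer,
        # not an individual Oozie server.
        "oozie.base.url": load_balancer_url,
    }


if __name__ == "__main__":
    # All host names below are placeholders
    props = oozie_ha_properties(
        ["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"],
        "http://oozie-lb.example.com:11000/oozie",
    )
    for name, value in props.items():
        print(f"{name}={value}")
```

All Oozie servers in the group also need to point at the same external database, which is one more reason not to run on Derby.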

Super Collaborator

@MCarter@hortonworks.com @ssen@hortonworks.com There is a lot of good content in this thread. Let’s harvest it into a KB article.

Rising Star

In the cluster audits I have been performing, I have noticed that most clients with fewer than 80 nodes use a single database, while clients above 80 nodes (the examples are in the 120+ range) have begun using a dedicated database for Oozie. Of course, this could just be symptomatic of how much Oozie use was taking place at one client versus another.

I agree with everyone above that you should not be using Derby and that you should not co-locate the Oozie server with HS2. By having a dedicated database, as @dstreever@hortonworks.com recommends, you can move it onto its own dedicated physical database server once the metrics show that the load has outgrown the ability of multiple services to share the same physical database.

Expert Contributor

Do we have any rules of thumb on what the total size of the metadata databases would be for, say, a 100-node cluster? I have an initial install with all metadata (Ambari, Hive, Oozie, Hue) using a single MySQL instance. The DBAs are asking what kind of space to expect once the cluster is up to production size.

Super Collaborator

So many good practices in this thread. We need to harvest them into a KB article.

Rising Star

I will create a KB article soon.