Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Embedded vs External database?

avatar
Explorer

We have a cluster with 36 nodes and have initially installed Cloudera Manager with an embedded database. The only services currently installed are HDFS and YARN (MR2). Now, I would like to properly install Hive and Impala on this cluster. 

 

The Question:

What is the difference between an embedded database and an external database? To me, it seems like embedded postgreSQL is just a script that creates all of the databases for you, where as with an external you do it on your own. If this is the only difference, it must be okay for me to use the same postgresql installation to create an "external" database for Hive, right?

 

An Issue I ran into previously:

One time, I followed the documentation to install Hive exactly, which included doing a yum install of postgres. I didn't inspect enough and had just performed the installtion. Little did I know, I had actually just installed a second version of postgreSQL. I had 9.2 from the yum and 8.4 from Cloudera Manager. Everything was working for a while, but then some unrelated error occurred and caused the Cloudera Manager database to stop. When trying to start the database again, Cloudera Manager attempted to connect to the 8.4 db using 9.2 commands. Woops... lesson learned here.

 

My Goal:

Ultimately, I am trying to use postgreSQL for both Cloudera Manager and Hive/Impala. From my previous lesson, I also learned that it would probably be best to keep only one installation of postgreSQL too. What would be the proper way to install?

 

Thank you!!!

1 ACCEPTED SOLUTION

avatar
Hi,

In general, we suggest that you use an external database for production. The embedded database is just handy for getting started.

The embedded database is just a regular postgresql that is started by custom init scripts on a custom port in a custom data directory.

When creating the Hive service, the wizard will prompt you if you'd like to use the embedded database or an external one. If you use the embedded database, a user role and a database will be created with the correct permissions for you automatically. If you use an external one, you must do these steps yourself and provide the host / port / database name / username / password.

The CM documentation does not say to do a yum install of postgres. You just use the Cloudera Manager UI and click the Add Service option in the dropdown menu by your cluster name, same as adding any other service.

For smaller clusters, it's fine to consolidate onto a single database. As your load grows, you'll want to migrate some databases and possibly their roles to different hosts. I wouldn't run two PostgreSQL on the same host as that will just consume more RAM than is really needed. It would be better to consolidate onto the external PostgreSQL, as the embedded one is not intended for production.

Thanks,
Darren

View solution in original post

3 REPLIES 3

avatar
Hi,

In general, we suggest that you use an external database for production. The embedded database is just handy for getting started.

The embedded database is just a regular postgresql that is started by custom init scripts on a custom port in a custom data directory.

When creating the Hive service, the wizard will prompt you if you'd like to use the embedded database or an external one. If you use the embedded database, a user role and a database will be created with the correct permissions for you automatically. If you use an external one, you must do these steps yourself and provide the host / port / database name / username / password.

The CM documentation does not say to do a yum install of postgres. You just use the Cloudera Manager UI and click the Add Service option in the dropdown menu by your cluster name, same as adding any other service.

For smaller clusters, it's fine to consolidate onto a single database. As your load grows, you'll want to migrate some databases and possibly their roles to different hosts. I wouldn't run two PostgreSQL on the same host as that will just consume more RAM than is really needed. It would be better to consolidate onto the external PostgreSQL, as the embedded one is not intended for production.

Thanks,
Darren

avatar
Explorer

That's really helpful information! Thank you!! 

 

So, if I am getting this right, we would use an external database for production because it gives us more control of database options and configurations during install which become imporant especailly for larger clusters?

 

Thank you again

avatar
Yes, that's basically the reason. You also wouldn't have to leave the data directory in the hardcoded location that CM uses, which doesn't always work well for folks as the database grows large.