We have a 36-node cluster and initially installed Cloudera Manager with the embedded database. The only services currently installed are HDFS and YARN (MR2). Now I would like to properly install Hive and Impala on this cluster.
What is the difference between an embedded database and an external database? To me, it seems like the embedded PostgreSQL is just a script that creates all of the databases for you, whereas with an external database you do it on your own. If that is the only difference, it should be okay for me to use the same PostgreSQL installation to create an "external" database for Hive, right?
An issue I ran into previously:
One time, I followed the documentation for installing Hive exactly, which included doing a yum install of PostgreSQL. I didn't inspect things closely enough and just performed the installation. Little did I know, I had actually installed a second version of PostgreSQL: I now had 9.2 from yum and 8.4 from Cloudera Manager. Everything worked for a while, but then some unrelated error occurred and caused the Cloudera Manager database to stop. When trying to start the database again, Cloudera Manager attempted to connect to the 8.4 database using 9.2 commands. Whoops... lesson learned.
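A mismatch like this could probably be caught early by comparing the version each binary reports. Below is a minimal sketch, assuming the version strings have the usual shape printed by `psql --version` and `postgres --version`; the helper names and the patch numbers in the sample strings are made up for illustration, and the comparison only makes sense for pre-10 releases such as 8.4 and 9.2, where the first two numbers identify the release.

```python
import re


def parse_pg_version(output):
    """Extract (major, minor) from text such as 'psql (PostgreSQL) 9.2.24'."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return int(match.group(1)), int(match.group(2))


def same_release(client_output, server_output):
    """Return True when both strings report the same major.minor release."""
    return parse_pg_version(client_output) == parse_pg_version(server_output)


# Illustrative strings only (assumed, not captured from the real cluster).
client = "psql (PostgreSQL) 9.2.24"
server = "postgres (PostgreSQL) 8.4.20"

if not same_release(client, server):
    print("Version mismatch:", parse_pg_version(client), "vs", parse_pg_version(server))
```

In practice the two strings would come from running each binary by its full path, since two installations usually live in different directories.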
Ultimately, I am trying to use PostgreSQL for both Cloudera Manager and Hive/Impala. From my previous lesson, I also learned that it would probably be best to keep only one installation of PostgreSQL. What would be the proper way to install it?
That's really helpful information! Thank you!!
So, if I am getting this right, we would use an external database for production because it gives us more control over database options and configuration during install, which becomes important especially for larger clusters?
Thank you again