Config for external DB and cluster for StreamSets and Kafka

Explorer

Hi

I am planning an architecture that uses StreamSets and Kafka to ingest data into a Cloudera platform running on AWS instances. I have two main concerns:

1- What is a suitable configuration for the cluster subnet that includes StreamSets and Kafka? Should I treat these as gateway nodes with only a boot disk and no data disks? Is there a template or recommended configuration?

2- I also noticed that an external DB is recommended for production. Is there a recommendation for the configuration of this DB and its server?


Re: Config for external DB and cluster for StreamSets and Kafka

Expert Contributor

Hi Armen1365,

 

Take a look at Cloudera's AWS reference architecture for recommendations to get you started. It doesn't cover StreamSets specifically, but Kafka and other parts of CDH are included.

 

http://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_aws.pdf

 

For both Cloudera Director and Cloudera Manager, we do recommend using an external database server. Cloudera Director can use MySQL for its own database, and it can set up MySQL for Cloudera Manager and for cluster services that use a database, such as Hive. The documentation has configuration recommendations, such as using the InnoDB storage engine instead of MyISAM. Overall, though, there aren't strict requirements for the database configuration. You'll just want enough storage, particularly for cluster service databases like the Hive metastore, to cover your future needs.

 

https://www.cloudera.com/documentation/director/latest/topics/director_use_ext_db_for_director_data....

 

https://www.cloudera.com/documentation/director/latest/topics/director_external_db.html
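
As a rough illustration of the InnoDB recommendation, here is a minimal Python sketch that renders a [mysqld] section for my.cnf. The function name render_mysql_config and the max_connections value are assumptions made for this example, not figures from the Cloudera documentation; default-storage-engine is a standard MySQL server option.

```python
import configparser
import io


def render_mysql_config(max_connections: int = 250) -> str:
    """Render a minimal [mysqld] section that makes InnoDB the default engine.

    max_connections is only a placeholder value (an assumption for this
    sketch); size it for the services that will share the database server.
    """
    parser = configparser.ConfigParser()
    parser["mysqld"] = {
        # New tables use InnoDB unless a statement asks for another engine.
        "default-storage-engine": "InnoDB",
        "max_connections": str(max_connections),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_mysql_config())
```

The output can be merged into the server's existing my.cnf. Check the linked documentation for the full list of settings recommended for your release.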

Re: Config for external DB and cluster for StreamSets and Kafka

Explorer

Thanks for the quick reply. 

I checked the documents. So for high availability, the external database should be outside of the cluster, right? Does that mean I need to put the database outside the VPC and connect to it over the internet, or do I just need another instance in the same VPC?

Re: Config for external DB and cluster for StreamSets and Kafka

Expert Contributor

You can use a service like RDS to host the external database server, and let it handle replication and availability. The database server can reside in the same VPC. You would not want to host the server outside AWS or route requests to it over the public internet, for cost and performance reasons.
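
To make that concrete, here is a minimal Python sketch that assembles the arguments for an RDS MySQL instance kept inside the VPC. The helper name build_rds_params, the db.m5.large instance class, and the 100 GiB default are assumptions chosen for illustration. The dictionary keys are parameters of boto3's RDS create_db_instance call.

```python
def build_rds_params(identifier, subnet_group, security_group_ids,
                     username, password, storage_gib=100):
    """Build keyword arguments for boto3's rds.create_db_instance.

    Instance class and storage size are placeholders (assumptions for this
    sketch); pick values that match your expected metastore growth.
    """
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "mysql",
        "DBInstanceClass": "db.m5.large",
        "AllocatedStorage": storage_gib,
        "MasterUsername": username,
        "MasterUserPassword": password,
        # Let RDS keep a standby in another availability zone.
        "MultiAZ": True,
        # No public endpoint: traffic stays inside the VPC.
        "PubliclyAccessible": False,
        # A DB subnet group made of subnets in the cluster's VPC.
        "DBSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": list(security_group_ids),
    }


if __name__ == "__main__":
    params = build_rds_params("cm-external-db", "cluster-db-subnets",
                              ["sg-0123456789abcdef0"], "admin", "change-me")
    print(params["MultiAZ"], params["PubliclyAccessible"])
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("rds").create_db_instance(**params). The identifiers in the usage example are made up.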