I am planning an architecture in which I need StreamSets and Kafka to ingest data into the Cloudera platform running on AWS instances. I have two main concerns:
1- What would be a suitable configuration for the cluster subnet that includes StreamSets and Kafka? Should I treat them as gateway nodes with only a boot disk and no data disk? Is there a template or recommended configuration?
2- I also noticed that an external DB is recommended for production. Is there a recommendation for the configuration of this DB and its server?
Take a look at Cloudera's AWS reference architecture for recommendations to get you started. It doesn't cover StreamSets specifically, but Kafka and other parts of CDH are included.
For both Cloudera Director and Cloudera Manager, we do recommend using an external database server. Cloudera Director can use MySQL for its own database, and it can set up MySQL for Cloudera Manager and for cluster services that use a database such as Hive. The documentation has configuration recommendations, like using InnoDB instead of MyISAM under the hood. Overall, though, there aren't strict requirements for the database configurations. You'll just want enough room, particularly for cluster service databases like the Hive metastore, for your future needs.
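As a rough illustration of the database setup described above, here is a minimal Python sketch that only generates the MySQL statements you might run on the external server. The database names (`scm`, `metastore`), user names, and the `CHANGE_ME` password placeholder are assumptions for the example, not required values; check the Cloudera documentation for the names and options that match your release.

```python
# Hypothetical mapping of database name -> owning user. These names are
# assumptions for illustration; use whatever your deployment requires.
SERVICE_DATABASES = {
    "scm": "scm",         # Cloudera Manager
    "metastore": "hive",  # Hive metastore
}


def build_setup_sql(databases, password_placeholder="CHANGE_ME"):
    """Return the MySQL statements that create each database and its user."""
    statements = []
    for db_name, user in databases.items():
        statements.append(
            f"CREATE DATABASE {db_name} DEFAULT CHARACTER SET utf8;"
        )
        statements.append(
            f"CREATE USER '{user}'@'%' IDENTIFIED BY '{password_placeholder}';"
        )
        statements.append(f"GRANT ALL ON {db_name}.* TO '{user}'@'%';")
    return statements


def engine_check_sql(db_name):
    """Return a query listing tables in db_name that are not using InnoDB."""
    return (
        "SELECT table_name, engine FROM information_schema.tables "
        f"WHERE table_schema = '{db_name}' AND engine <> 'InnoDB';"
    )


if __name__ == "__main__":
    for statement in build_setup_sql(SERVICE_DATABASES):
        print(statement)
    print(engine_check_sql("metastore"))
```

The second helper is there because of the InnoDB recommendation: once the services have created their tables, an empty result from that query suggests nothing fell back to MyISAM.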
Thanks for the quick reply.
I checked the documents. So for high availability, the external database should be outside of the cluster, right? Does that mean I need to put the database outside the VPC and connect to it over the internet, or do I just need another instance in the same VPC?
You can use a service like RDS to host the external database server, and let it handle replication and availability. The database server can reside in the same VPC. You would not want to host the server outside AWS or route requests to it over the public internet, for cost and performance reasons.
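To make the RDS suggestion concrete, here is a hedged Python sketch of the request you might send through boto3's `create_db_instance`. The instance identifier, instance class, storage size, master user name, and region are all placeholder assumptions; the subnet group and security group IDs are values you would supply from your own VPC. The two settings that reflect the advice above are `MultiAZ` (let RDS handle the standby and failover) and `PubliclyAccessible` (keep the server off the public internet).

```python
def build_rds_request(subnet_group, security_group_ids, password):
    """Build keyword arguments for the RDS create_db_instance call.

    All literal values below are illustrative assumptions, not sizing advice.
    """
    return {
        "DBInstanceIdentifier": "cdh-external-db",  # hypothetical name
        "Engine": "mysql",
        "DBInstanceClass": "db.m5.large",           # assumed instance class
        "AllocatedStorage": 100,                    # GiB, assumed
        "MasterUsername": "admin",                  # assumed user name
        "MasterUserPassword": password,
        # Subnet group made of private subnets in the same VPC as the cluster.
        "DBSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": list(security_group_ids),
        # RDS keeps a standby in another availability zone and fails over.
        "MultiAZ": True,
        # No public endpoint; cluster nodes reach it over VPC networking.
        "PubliclyAccessible": False,
    }


def create_instance(params, region="us-east-1"):
    """Send the request. Requires boto3 and valid AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    rds = boto3.client("rds", region_name=region)
    return rds.create_db_instance(**params)
```

Building the parameters separately from the call lets you review or log the request before anything is created. The security group you pass in would typically allow inbound MySQL traffic (port 3306) only from the cluster's own security group.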