We have a requirement to build a Hadoop cluster and are looking for details on cluster sizing and best practices.
To give you some input:
1) Estimated overall data size --> 12 to 15 TB
2) Each year data growth of approx. 1.2 TB
3) Data will be transferred from source ERP, CRM, and DW systems to Hadoop, along with streaming data (e.g., log files)
4) Batch ingestion of data into Hadoop
5) Stream ingestion into Hadoop
6) Analytics and data mart pointing to that Hadoop data
7) Hadoop distribution HDP 2.5.3
What would the sizing be for this kind of scenario? Any guidance would be appreciated.
Another point: we also have a requirement to do CDC (change data capture) into Hadoop, and we are considering PySpark and Kafka Connect. We have successfully done CDC with PySpark, but we need some input on Kafka Connect.
We found that Confluent provides such a utility. Can we use it with HDP 2.5.3? If so, is there an example?
We would really appreciate your guidance.
Thanks and Regards,
Also, we have Salesforce as one of the sources. How can we ingest data into Hadoop from Salesforce? We won't have a direct connection to the Salesforce database, so we probably can't use Sqoop.
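Since Sqoop needs a JDBC connection to the source database, a common alternative is to pull data through the Salesforce REST/Bulk API and then land files in HDFS. Here is a minimal sketch using the third-party `simple-salesforce` Python package; the object and field names (and the `flatten_records` helper) are hypothetical examples, not anything from your environment:

```python
# Sketch: pull Account rows from Salesforce via its REST API and prepare them
# for landing in HDFS. Assumes the third-party `simple_salesforce` package;
# the object/field names below are hypothetical examples.
#
# from simple_salesforce import Salesforce
# sf = Salesforce(username="user@example.com", password="...",
#                 security_token="...")
# result = sf.query_all("SELECT Id, Name FROM Account")

def flatten_records(result):
    """Drop the per-record 'attributes' metadata Salesforce returns,
    keeping only the queried field values."""
    return [{k: v for k, v in rec.items() if k != "attributes"}
            for rec in result["records"]]

# Shape of a query_all() response, illustrated with a tiny sample payload:
sample = {
    "totalSize": 1,
    "done": True,
    "records": [
        {"attributes": {"type": "Account", "url": "/services/data/..."},
         "Id": "001xx0000000001", "Name": "Acme"},
    ],
}
rows = flatten_records(sample)
print(rows)  # [{'Id': '001xx0000000001', 'Name': 'Acme'}]
```

From there the rows can be written out as CSV/Avro and pushed to HDFS (e.g., via WebHDFS), or produced to a Kafka topic to reuse your streaming ingestion path.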
For cluster planning, you can refer to the official cluster planning guide from Hortonworks:
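As a rough illustration of the arithmetic behind HDFS sizing (the replication factor is the HDFS default; the overhead and headroom percentages are assumptions for the sketch, not official Hortonworks numbers):

```python
# Back-of-envelope HDFS raw-capacity estimate for the numbers in the question.
raw_tb = 15.0            # upper estimate of current data (from the question)
yearly_growth_tb = 1.2   # stated yearly growth
years = 3                # assumed planning horizon
replication = 3          # HDFS default replication factor
temp_overhead = 0.25     # assumed scratch space for intermediate job output
headroom = 0.20          # assumed free-space headroom kept on DataNodes

data_tb = raw_tb + years * yearly_growth_tb          # logical data after 3 years
needed_tb = data_tb * replication * (1 + temp_overhead) / (1 - headroom)
print(round(needed_tb, 1))  # ~87.2 TB of raw DataNode capacity
```

With these assumptions, roughly 18.6 TB of logical data translates to on the order of 85–90 TB of raw disk across the DataNodes, which you would then divide by per-node disk capacity to get a node count.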
For using Confluent, you can check out this exhaustive and detailed guide; it also links to many supporting pages and next steps at the end:
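Note that Kafka Connect itself ships with Apache Kafka (since 0.9), and HDP 2.5.3 bundles Kafka 0.10.x, so the Connect runtime is available; Confluent's JDBC source connector is a separate download whose jars you put on the worker classpath, and you should verify the connector version is compatible with Kafka 0.10.x. It does query-based CDC (polling by a timestamp and/or auto-increment column), not log-based CDC. A sketch of a source connector config, with a hypothetical database, tables, and columns:

```properties
name=erp-cdc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
# Hypothetical source database; any JDBC URL works if its driver is on the classpath
connection.url=jdbc:oracle:thin:@//erp-host:1521/ERPDB
connection.user=etl_user
connection.password=etl_password
# Query-based CDC: new rows found via the auto-increment id, updates via the timestamp
mode=timestamp+incrementing
incrementing.column.name=ID
timestamp.column.name=LAST_UPDATED
table.whitelist=ORDERS,CUSTOMERS
topic.prefix=erp-
poll.interval.ms=10000
```

Each whitelisted table is published to its own topic (e.g., `erp-ORDERS`), from which your PySpark job or an HDFS sink can consume.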