Hello everyone! This is my first post in the Hortonworks community.

Use Case: We have two on-premise Hadoop clusters, production (Cluster1) and development (Cluster2), with ~500 TB of active data in each. We are migrating them to the Azure cloud with the IaaS option (procuring Azure VMs and configuring the Hadoop clusters through Hortonworks Cloudbreak provisioning). In Azure we will likewise have both a production and a development cluster.

Problem Statement: We need to migrate the data from the on-premise Hadoop clusters (production and development) to the Azure cloud. We would like to do some unit testing to confirm that the data was copied correctly from on-premise to Azure. For the unit testing, we will take the source table count and source table directory size and compare them with the target table count and target table directory size, respectively.

Solution: We are planning to write custom shell scripts around distcp to automate the data copy to the cloud.

Question: How do we compare the table count and table directory size between the on-premise Hadoop cluster and the Azure cluster for the above requirement? Any suggestions or inputs on the data copy approach above?

PS:
1. We are NOT using Azure WASB, HDInsight, or ADL. The target cluster is just Azure VMs configured as a Hortonworks Hadoop cluster, provisioned with Hortonworks Cloudbreak.
2. We are not going with the Azure export/import or Hive export/import options for the data copy from on-premise to the Azure cloud.

Appreciate your support on this.
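To make the directory-size comparison concrete, here is a minimal sketch of what such a check could look like. It assumes the standard `hdfs dfs -count <path>` command, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME, and it assumes both clusters are reachable from one edge node by full `hdfs://` URI. The function names, hostnames, and paths are hypothetical and only for illustration; this is not a tested migration tool.

```python
import subprocess
from typing import List, NamedTuple


class DirStats(NamedTuple):
    dir_count: int
    file_count: int
    content_size: int  # logical bytes, not multiplied by the replication factor


def parse_count_output(line: str) -> DirStats:
    """Parse one line of `hdfs dfs -count <path>` output.

    Columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    """
    dirs, files, size, _path = line.split(None, 3)
    return DirStats(int(dirs), int(files), int(size))


def hdfs_dir_stats(uri: str) -> DirStats:
    """Run `hdfs dfs -count` on a full URI (requires the hdfs CLI on PATH)."""
    out = subprocess.run(
        ["hdfs", "dfs", "-count", uri],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_count_output(out.strip().splitlines()[-1])


def compare(source: DirStats, target: DirStats) -> List[str]:
    """Return a list of mismatches; an empty list means both sides agree."""
    problems = []
    for field in DirStats._fields:
        s, t = getattr(source, field), getattr(target, field)
        if s != t:
            problems.append(f"{field}: source={s} target={t}")
    return problems


# Hypothetical usage (hostnames and table path are placeholders):
# src = hdfs_dir_stats("hdfs://onprem-nn:8020/apps/hive/warehouse/db.db/t1")
# tgt = hdfs_dir_stats("hdfs://azure-nn:8020/apps/hive/warehouse/db.db/t1")
# for problem in compare(src, tgt):
#     print(problem)
```

Comparing CONTENT_SIZE rather than disk usage avoids false mismatches if the two clusters use different replication factors. The table row count could presumably be checked the same way, by running an identical `SELECT COUNT(*)` against each cluster and comparing the two numbers.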