Let's say I have two different clusters, both authenticated with Kerberos, and I have enabled TDE managed by Ranger-KMS. How can I design a solution that works across both?
From the Kerberos point of view, my understanding is that I would have to provide a centralized KDC for both clusters. But how can a user connect to one of the clusters and run a Spark query when some of the data is stored in cluster 1's HDFS and some in cluster 2's HDFS? How can a user run a Hive aggregation query across both clusters? Even with a single KDC for both clusters, the principals for the same service differ between clusters, so a single user should not be permitted to run one job on data from both clusters at the same time: the service ticket required to access Hive on cluster 1 is different from the one for Hive on cluster 2. And what about Ranger-KMS? Is there any manual way to integrate the two Ranger-KMS instances?
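To make the ticket problem concrete, here is a minimal sketch of why one service ticket cannot span two clusters. All hostnames and the realm are made up for illustration; the point is only that a Kerberos service principal embeds the service host, so Hive on each cluster is a distinct principal even under a single shared KDC.

```python
# Hypothetical sketch: Kerberos service principals follow the
# service/host@REALM form, so the "same" service on two clusters
# is two different principals (hostnames and realm are invented).

def hive_service_principal(host, realm):
    """Build a Hive service principal name in service/host@REALM form."""
    return f"hive/{host}@{realm}"

# Even with one shared KDC (a single realm), the hosts differ:
p1 = hive_service_principal("hive-master.cluster1.example.com", "EXAMPLE.COM")
p2 = hive_service_principal("hive-master.cluster2.example.com", "EXAMPLE.COM")

print(p1)  # hive/hive-master.cluster1.example.com@EXAMPLE.COM
print(p2)  # hive/hive-master.cluster2.example.com@EXAMPLE.COM

# Distinct principals mean the KDC issues distinct service tickets;
# a ticket obtained for p1 is of no use against the server behind p2.
assert p1 != p2
```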
You are right that what you are trying to do is not possible. They are, after all, two different clusters. Just because both use Kerberos for authentication doesn't mean one client can run a query that spans the two clusters and reads data from both. That is not possible by default.
If you would like to do something like that, you need data virtualization software, such as Cisco Information Server or Denodo; an open-source option is Teiid from JBoss. Without one of these tools, it is not possible to run a single query that reads data from two different clusters. Think about it this way: if you had two separate Oracle systems that authenticate in exactly the same way, could you run one query that reads data from both? The answer is no, at least not without a virtualization tool. The same applies to Hadoop.
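To illustrate what such a virtualization layer does conceptually, here is a small sketch. It is not how Denodo or Teiid are actually implemented or invoked; the two lists simply stand in for result sets fetched from two independently authenticated clusters, and the function shows the federation step of merging partial aggregates in the middle tier.

```python
# Hypothetical sketch of data federation: the middle layer queries
# each system separately (each with its own authentication) and then
# merges the partial results itself. The "clusters" here are plain
# lists standing in for per-cluster query results.

cluster1_sales = [("east", 100), ("west", 250)]
cluster2_sales = [("east", 40), ("south", 90)]

def federated_sum_by_region(*sources):
    """Combine per-cluster (region, amount) rows into one aggregate,
    the way a federation engine merges partial results per system."""
    totals = {}
    for rows in sources:
        for region, amount in rows:
            totals[region] = totals.get(region, 0) + amount
    return totals

# One logical "query" spanning both clusters, resolved by the middle layer:
print(federated_sum_by_region(cluster1_sales, cluster2_sales))
# {'east': 140, 'west': 250, 'south': 90}
```

The design point is that neither cluster ever talks to the other: the federation layer holds a separate credential per system, pushes a partial query to each, and does the cross-system join or aggregation itself.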