Let's say I already have a cluster running internally in my lab. Is it possible to add more hosts from AWS to act as DataNodes?
It's technically possible, provided the networking setup allows bi-directional communication between the hosts in your lab and the ones in AWS. DNS also has to work seamlessly across the two networks.
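If it helps, here is a minimal sketch of the kind of sanity check you could run from a lab host against an AWS host, and again in the other direction. It assumes that forward and reverse DNS lookups should agree for every node and that the relevant service ports are reachable both ways. The function names are made up for illustration, and the hostnames and port you pass in are whatever your own cluster uses.

```python
import socket


def dns_round_trip(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, resolve that IP back to a name, and compare.

    Returns (ip, reverse_name, consistent). The resolver functions are
    injectable so the check can be exercised without a live network.
    """
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    consistent = reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return ip, reverse_name, consistent


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_peer(hostname, port):
    """Run both checks for one peer and return a small report dict."""
    ip, reverse_name, consistent = dns_round_trip(hostname)
    return {
        "host": hostname,
        "ip": ip,
        "reverse_name": reverse_name,
        "dns_consistent": consistent,
        "port_open": port_reachable(hostname, port),
    }
```

You would call something like check_peer("datanode1.aws.example.internal", 8020) from the lab side, then run the same check from the AWS side against a lab host. Both the hostname and the port here are placeholders; substitute the names and ports from your own configuration.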
What's your use case? Why not run the entire cluster on AWS? The latency and limited bandwidth between the two sites are likely to have a significant impact on the performance of this hybrid cluster.
With a hybrid topology as described, it's highly unlikely that you will achieve an acceptable level of performance while shuffling data between the local environment and AWS. This is a guess - I don't have performance numbers to share. I would love to know more about your results if you get to try this out.
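To get a feel for the bandwidth side of this, a back-of-the-envelope calculation is enough. The sketch below is illustrative arithmetic only, not a measurement: the data volume, link speeds, and the efficiency factor are all assumed values, and it ignores latency, contention, and protocol behavior.

```python
def transfer_seconds(data_gb, link_gbps, efficiency=0.7):
    """Estimate seconds needed to move data_gb gigabytes over a link.

    data_gb    -- data volume in decimal gigabytes (assumed value)
    link_gbps  -- nominal link speed in gigabits per second (assumed value)
    efficiency -- fraction of nominal bandwidth actually usable (assumed value)
    """
    if link_gbps <= 0 or efficiency <= 0:
        raise ValueError("link_gbps and efficiency must be positive")
    gigabits = data_gb * 8  # 1 byte = 8 bits
    return gigabits / (link_gbps * efficiency)


# Hypothetical comparison: the same 100 GB shuffle over two assumed links.
wan_seconds = transfer_seconds(100, 1, efficiency=1.0)   # 1 Gbps site-to-site link
lan_seconds = transfer_seconds(100, 10, efficiency=1.0)  # 10 Gbps local network
```

Under these assumed numbers the slower link takes ten times as long for the same shuffle, and that is before accounting for the added round-trip latency between sites.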
Does anyone have an update on this use case? We would be interested to know if this "hybrid / bursting to the AWS cloud" architecture is realistic.
Andrei's original take on the idea still holds true today, as far as we've seen. Cloudera's general testing of different cluster configurations has found that even splitting a cluster across availability zones, while having the whole cluster in AWS, still can lead to performance problems. Splitting across regions is worse, and is somewhat close to the hybrid architecture you're thinking about.
Here's Cloudera's reference architecture doc, by the way: http://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_aws.pdf
A different take on the idea is to have a separate cluster in AWS that can take on the additional workload, and set up focused data transfers of workload data from on-prem up to the cloud cluster and of results back from the cloud cluster. Maybe there's a work allocation system fronting both clusters that can send jobs to the local cluster by default, but out to the cloud cluster when the local one is overburdened. This would avoid individual job performance problems and probably reduce the data transfer costs into and out of AWS (if you have a VPN gateway set up, then those costs might be irrelevant anyway).
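The "work allocation system" idea can be sketched roughly as follows. This is a toy illustration, not something Cloudera ships: the ClusterLoad class, the pick_cluster function, and the 0.9 threshold are all hypothetical, and a real dispatcher would also need to account for where the input data lives and what it costs to move.

```python
from dataclasses import dataclass


@dataclass
class ClusterLoad:
    """Snapshot of how busy one cluster is (hypothetical structure)."""
    name: str
    running: int   # slots or containers currently in use
    capacity: int  # total slots or containers available

    @property
    def utilization(self):
        # Treat a cluster with no capacity as fully loaded.
        return self.running / self.capacity if self.capacity else 1.0


def pick_cluster(local, cloud, burst_threshold=0.9):
    """Prefer the local cluster; burst to the cloud only when local is overburdened."""
    if local.utilization < burst_threshold:
        return local.name
    if cloud.utilization < 1.0:
        return cloud.name
    # Both are saturated: queue locally rather than pay for a data transfer.
    return local.name
```

With this policy each job runs entirely inside one cluster, so the only cross-site traffic is the focused transfer of that job's input and results.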
So, hybrid architectures are realistic, but spanning single clusters between on-prem and cloud is not a great implementation.