
Prod. HDP sandbox/clones for organizational users

Super Collaborator

A production cluster is already in place: HDP-2.4.2.0-258, installed using Ambari 2.2.2.0.

Following are the existing and upcoming scenarios:

  1. There are various 'actors' - Hadoop developers, admins, data scientists, enthusiasts, etc. - who currently download and use the HDP sandbox on their local machines
  2. The prod. cluster holds a lot of data, and it is NOT advisable to grant a large number of users access right away
  3. The idea is to have a central system through which a large number of users can 'spawn'/download & install their own sandboxes, each a tiny image of the prod. cluster in terms of data and services
  4. It is indispensable that this system allow the users to decide what subset of data they want to include in their sandbox

I have a few thoughts:

  1. Maybe it makes sense to provide a centralized download of the latest HDP sandbox; however, this may differ from the prod. cluster in version and otherwise (perhaps being far ahead!)
  2. While users would be willing to execute queries/drag-and-drop tables, files, etc. to select the data they want, almost none would be prepared to load this data manually from production into their own sandboxes
  3. Maybe there are existing tools that can be used to do this

Can the community help me assess the viability of this requirement or suggest alternatives?

3 REPLIES


Super Collaborator
  • a defined HDP version - not necessarily the latest

    Yes, that's correct

  • possibly a "built in" data set that is provisioned with the VM

The 'built-in' data set will be different/customized for each spawned VM (as each will be used by a different role)

The tutorial seems informative, but I have a question - can Vagrant connect to the prod. cluster WITHOUT MAJOR changes to the prod. machines and spawn VMs as required with custom data sets? Apologies if this sounds stupid, but I'm unable to visualize how Vagrant will work with the prod. cluster


Vagrant provides a VM that is run by the provider of your choice, for instance VirtualBox or VMware. The network configuration of your VM determines whether you can connect to the outside network. Typically, in your example, you would use one of two configurations:

  1. In a bridged network configuration, the VM has full access to the outside network and can see any machine out there. It also means that your VM is visible from the outside as its very own network device. While this is very convenient, it may be a security issue, and corporate networks may ban you from adding non-approved network devices.
  2. In a NAT configuration, traffic is routed through the host machine. In short, this means the VM can see the outside network but the outside network cannot see the VM. You can however expose some of the VM's services using port forwarding.
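The two configurations above map to a few lines in a Vagrantfile. A minimal sketch (the box name is a placeholder, and you would normally enable only one of the two network options):

```ruby
# Vagrantfile (Ruby DSL) -- sketch; "hypothetical/hdp-sandbox" is a
# placeholder box name, not a real published box.
Vagrant.configure("2") do |config|
  config.vm.box = "hypothetical/hdp-sandbox"

  # Option 1: bridged ("public") network -- the VM appears as its own
  # device on the outside network:
  # config.vm.network "public_network"

  # Option 2: keep the default NAT network and expose selected services
  # via port forwarding, e.g. the Ambari UI on guest port 8080:
  config.vm.network "forwarded_port", guest: 8080, host: 8080
end
```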

If you want to "bake" your data sets into your Vagrant boxes, this can all be scripted. To always get the most recent version of the data set, you might want to create a Vagrant box, based on a plain sandbox, that reaches out to the production system and fetches its data when it is spun up for the first time. Because the Vagrant box acts as a client using standard APIs, generally speaking I believe you would not have to change your production systems. To give you a precise answer, though, I would need to know your case in more detail.
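The "fetch on first boot" idea could be sketched with a shell provisioner in the Vagrantfile, which runs on the first `vagrant up`. This is only an illustration under stated assumptions: the box name, the NameNode host `prod-nn.example.com`, and the HDFS paths are all placeholders you would replace with your own, and the fetch uses the standard WebHDFS REST API so the production side needs no changes.

```ruby
# Vagrantfile -- sketch of a box that pulls its data set from production
# on first boot; host, port, and paths are hypothetical placeholders.
Vagrant.configure("2") do |config|
  config.vm.box = "hypothetical/hdp-sandbox"

  # Shell provisioners run on the first `vagrant up` (and again on
  # `vagrant provision`), so this acts as the one-time data fetch.
  config.vm.provision "shell", inline: <<-SHELL
    # Read a per-user subset from the production cluster over WebHDFS.
    curl -L -o /tmp/sample.csv \
      "http://prod-nn.example.com:50070/webhdfs/v1/data/subset/sample.csv?op=OPEN"
    # Load it into the sandbox's own HDFS (as a suitable hadoop user).
    sudo -u hdfs hdfs dfs -mkdir -p /data/subset
    sudo -u hdfs hdfs dfs -put -f /tmp/sample.csv /data/subset/
  SHELL
end
```

For larger or more selective subsets, a tool such as Apache Sqoop or `hadoop distcp` could replace the `curl` call; the point is only that the fetch runs inside the guest as an ordinary client of the production cluster's existing APIs.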