Created 02-09-2016 10:13 PM
Being new to the big data world, I have a few questions. We are planning to set up a Hadoop cluster, but we have some general queries. Please throw some light on these.
1. In the big data world, what kind of testing is performed, and how? What testing tools / frameworks are available for testing big data applications?
2. Is there any need for separate production, development, deployment and testing clusters? If so, what are the merits?
3. What is the best approach to handling disaster recovery for big data clusters?
4. What general guidelines / best practices should one be aware of when stepping into the Hadoop world and setting up clusters?
Apart from these, are there any special considerations to look into when setting up Hadoop clusters?
Created 02-09-2016 10:19 PM
1) Testing tools: a really good guide is http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-t... (see the smoke-test sketch at the end of this post).
2) Yes, we do need separate clusters for dev, QA and prod (same as in the RDBMS world) so users can test and QA before prod releases.
3) For DR, the options are active-active (third-party tools like WANdisco) and active-passive; Apache Falcon is best for DR.
4) Get familiar with the Hadoop stack and with the 100% open source model of Hadoop. Some vendors sell Hadoop distributions that are not 100% open source; Hortonworks is the only 100% open source, enterprise-ready Hadoop with no vendor lock-in. Once you pick a vendor, learn its technology stack: operations, security, the data operating system, and so on. Please see this blog: https://www.linkedin.com/pulse/20141204175510-28584737-rdbms-to-hadoop
The above blog covers the sandbox and other details.
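To make #1 concrete, here is a minimal smoke-test sketch that drives the standard TeraGen / TeraSort / TeraValidate examples from Python. The examples jar path is an assumption (typical of HDP installs); adjust it for your distribution and version.

```python
import subprocess

# Path to the MapReduce examples jar; this location is an assumption
# (typical of HDP installs) -- adjust for your distribution/version.
EXAMPLES_JAR = "/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar"

def run(cmd):
    """Run a shell command and fail loudly if it returns non-zero."""
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Generate ~1 GB of synthetic rows (10M rows x 100 bytes), sort them,
# then validate the output -- a quick end-to-end MapReduce smoke test.
run(["hadoop", "jar", EXAMPLES_JAR, "teragen", "10000000", "/tmp/teragen-out"])
run(["hadoop", "jar", EXAMPLES_JAR, "terasort", "/tmp/teragen-out", "/tmp/terasort-out"])
run(["hadoop", "jar", EXAMPLES_JAR, "teravalidate", "/tmp/terasort-out", "/tmp/teravalidate-out"])
```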
Created 02-09-2016 10:57 PM
Completing the answer based on your latest questions:
1) Testing: Link. You will see various components listed; Spark is just one component out of 20+. You can run a basic smoke test for each component. Pick a use case, then pick a technology/component, research that piece, and you will find plenty of testing docs.
2) I shared the benchmarking link because I did not realize you were asking about application testing.
3) Falcon can do HDFS and Hive replication from one location to another: on-prem, cloud, or wherever your cluster is running. You have to have servers, not just storage, in DR.
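For the HDFS side of #3, Falcon's mirroring jobs essentially schedule DistCp runs between the two clusters. A rough one-shot equivalent, just to show what gets automated (the NameNode addresses and paths below are hypothetical placeholders):

```python
import subprocess

# Hypothetical NameNode endpoints for the primary and DR clusters --
# substitute the real addresses (or HA nameservice names) for your environment.
SOURCE = "hdfs://nn-primary.example.com:8020/data/warehouse"
TARGET = "hdfs://nn-dr.example.com:8020/data/warehouse"

# DistCp copies only changed files (-update) and removes files deleted at the
# source (-delete), which is roughly what a scheduled mirroring job does.
subprocess.run(
    ["hadoop", "distcp", "-update", "-delete", SOURCE, TARGET],
    check=True,
)
```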
Created 02-09-2016 11:14 PM
Thanks, that helps my understanding.
For DR, apart from storage, how do I ensure that all the job history information, workflow schedules, configuration information, etc. are recovered on the DR cluster when my active cluster is down? For configuration info, backups would probably help, but there would still be missing pieces to recover if the active cluster is not fully covered from a DR perspective.
Created 02-09-2016 11:22 PM
You can back up the Ambari database, as it holds the details of all the configs.
An active-active setup will meet your requirements.
You can back up the server like any other server.
We have built-in HA for the NameNode and ResourceManager (you can enable it once the cluster is installed).
Falcon will be backing up and replicating the data.
HDFS replication will provide fault tolerance.
Again, look into WANdisco 🙂
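For the configuration piece specifically, besides dumping Ambari's backing database, Ambari can also export the cluster layout and configs as a blueprint over its REST API, which is convenient to keep with your other backups. A rough sketch, assuming a default Ambari install; the host, cluster name, and credentials are placeholders:

```python
import json
import requests

# Placeholder Ambari server details -- substitute your own host, cluster
# name, and credentials.
AMBARI_URL = "http://ambari.example.com:8080"
CLUSTER = "mycluster"
AUTH = ("admin", "admin")

# Exporting the cluster as a blueprint captures the service layout plus
# configuration, which can be re-applied when rebuilding a cluster.
resp = requests.get(
    "{0}/api/v1/clusters/{1}?format=blueprint".format(AMBARI_URL, CLUSTER),
    auth=AUTH,
    headers={"X-Requested-By": "ambari"},
)
resp.raise_for_status()

with open("cluster-blueprint.json", "w") as f:
    json.dump(resp.json(), f, indent=2)
```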
Created 02-10-2016 12:17 AM
@Astronaut Bigdatanova Please see this SlideShare portal; it has tons of resources, especially ones that meet your requirement.
Created 02-09-2016 10:48 PM
@Astronaut Bigdatanova It should also be stressed that Hadoop is not a transactional (OLTP) system like most RDBMSs. Hadoop is a massively scalable storage and batch data processing system. Hadoop offloads the particularly difficult problem of simultaneously ingesting, processing, and delivering/exporting large volumes of data, so existing systems can focus on what they were designed to do.
Created 02-09-2016 10:51 PM
1. For testing tools, I am trying to understand, from a unit testing perspective, how to test user programs / applications, for example Spark programs. The linked blog talks about benchmarking the cluster, probably assessing the hardware and some software configuration at the Hadoop level. However, my question is more at the unit testing level for developers.
2. In relation to the above question, what are we going to 'test' in the test environment, and how different will it be if only benchmarks are performed?
3. I have yet to look into WANdisco, but I just glanced at Falcon. Based on my understanding, is this set up at the storage level only, or for the complete cluster? For example, assume my Hadoop cluster is sitting on EC2 and running some cool Spark / Hadoop jobs using data stored in the storage layer. From a DR perspective, how do I ensure the whole cluster is safeguarded? It looks like Falcon can be used to back up my data from the EC2 cluster machines to S3 or some other alternate storage, but what about the rest of the cluster?
Created 02-09-2016 10:51 PM
@Astronaut Bigdatanova At least a partial answer: https://community.hortonworks.com/content/repo/15674/variety-of-hbase-unit-testing-utilities.html
https://community.hortonworks.com/repos/3900/hadoop-mini-clusters.html
https://community.hortonworks.com/questions/8130/some-hive-unit-tests-doesnt-work-in-intellij.html
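For Spark specifically, the usual pattern is to run the logic under test against a local-mode SparkContext inside an ordinary unit test, so no cluster is required. A minimal pytest-style sketch (the word_count function is just a stand-in for your own job logic):

```python
import pytest
from pyspark import SparkContext

def word_count(rdd):
    """Example job logic under test: classic word count."""
    return (
        rdd.flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b)
    )

@pytest.fixture(scope="module")
def sc():
    # local[2] runs Spark entirely in-process, so the test needs no cluster.
    context = SparkContext("local[2]", "unit-tests")
    yield context
    context.stop()

def test_word_count(sc):
    rdd = sc.parallelize(["a b a", "b c"])
    assert dict(word_count(rdd).collect()) == {"a": 2, "b": 2, "c": 1}
```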
Created 02-10-2016 12:04 AM
@Astronaut Bigdatanova Great questions!!!