10-05-2018 01:54 AM - last edited on 10-05-2018 11:10 AM by cjervis
I am hoping someone here can point me to a good resource. I have a project this year at work to stand up a data science / Hadoop / Spark platform. Our initial focus was going to be leveraging AWS or Azure to do this (EMR or HDInsight). Well, we're getting push-back, and the cloud option doesn't seem to be happening quickly. My contingency plan is to purchase some hardware on our internal cloud infrastructure (these nodes would not be physical, they'd be virtual, but on my own dedicated hosts).
Because of that original plan, my focus has been on the data science side: machine learning, Scala, Python, etc. Since a cloud solution would have taken care of the actual configuration of the Hadoop / Spark clusters for me, I have not worked on that part as much. I have worked in IT for over 15 years and have experience from a systems administration and architecture perspective, so setting something up would not be impossible; it's just now a reality I have to deal with.
My question: are there any solid resources (books from Amazon, documentation, etc.) that I can leverage when planning a cluster and doing all of the configuration and integration? This is what I am looking at:
~20 node cluster
Both Spark and Hadoop
Resource management tools like YARN and Mesos
Other tools that will aid in using and monitoring the cluster
I'd like one good resource, or a few solid ones, that will guide me through building, configuring, monitoring, and managing this cluster. It should also be recent enough to cover the newer versions and how compatible they are with each other (Hadoop 2.x with Spark 2.x). I have an affinity for O'Reilly books, but I'm not opposed to other resources.
Any suggestions or recommendations from a systems architecture, admin, etc. role would be great. Thanks so much!