We are at a decision point on selecting the right approach to transform our data, and I'd like your input.
Our case: we use Hive as our main data lake store, and all of our data (so far) is structured. As in a traditional data warehouse, we need to perform transformations (lookups, aggregations, etc.) from source tables to target tables, and we now need to decide which approach to take. I currently lean toward a coding approach (HiveQL, Spark) with our own metadata layer, but others have recommended tools like Talend. So I'd like to hear some ideas here.
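To make the "code plus our own metadata" option concrete, here is a minimal sketch of a metadata-driven lookup-and-aggregation step. It uses Python's built-in sqlite3 as a lightweight stand-in for Hive, and every table, column, and metadata name is invented for illustration; in a real setup the generated SQL would be HiveQL run through beeline or `SparkSession.sql`.

```python
# Illustrative sketch: a hand-rolled metadata table drives a
# lookup (join) + aggregation from source tables to a target table.
# sqlite3 stands in for Hive; all names here are made up.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE sales_src (store_id INTEGER, amount REAL);
    CREATE TABLE store_dim (store_id INTEGER, region TEXT);
    INSERT INTO sales_src VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO store_dim VALUES (1, 'EAST'), (2, 'WEST');
""")

# The "metadata": each target table maps to the pieces of a
# SELECT that the framework assembles and executes.
meta = {
    "sales_by_region": {
        "source": "sales_src s JOIN store_dim d ON s.store_id = d.store_id",
        "group_by": "d.region",
        "measures": "SUM(s.amount) AS total_amount",
    }
}

for target, m in meta.items():
    sql = (f"CREATE TABLE {target} AS "
           f"SELECT {m['group_by']}, {m['measures']} "
           f"FROM {m['source']} GROUP BY {m['group_by']}")
    cur.execute(sql)

print(cur.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# → [('EAST', 15.0), ('WEST', 7.5)]
```

The design point is that the transformation rules live in data (the `meta` dict, or a real metadata table) rather than in hand-written one-off scripts, which is the part a tool like Talend would otherwise provide.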
One driver behind the decision is that I want to build a team with strong technical skills. I have a traditional ETL background (Informatica, DataStage, etc.) and know the pros and cons. I don't want to settle for "a large team of low-skill programmers supporting a single tool," and I believe "today's big data developers are a bit more technical than their data warehousing counterparts. And so, they are even less enamored by clunky frameworks, less intimidated of writing a lot of code if necessary."
Your thoughts?
@Allen Niu This might be a tad late; however, if you want your team to be more experienced developers, I would certainly shoot for Spark and Hive. Both components have libraries and jars that support one another, and the Spark API makes learning to develop in Java, Scala, and Python quite easy. I personally started by learning C# and translated those skills into Python and Scala for Spark MLlib, Spark Core, and Spark SQL. I might be a little biased, as I am a big Spark junkie ;) but the ability to clean data with Spark at scale is absolutely brilliant. Hortonworks also has some great development courses for Spark and Hive. Here is the link: https://hortonworks.com/services/training/certification/hdp-certified-spark-developer/
What was your final decision?