We are working on a landscape where data from SAP BW will be sent to Hadoop HDFS for downstream application needs.
Data from the ADSOs will be pulled and stored in Hadoop at the staging layer first, and then the staging tables (Hive) will be joined to populate the Hadoop core layer.

Hadoop details: Cloudera CDH, Hive, Impala, Spark 1.6

Suggestions needed:
A. What would be the best option for getting the data from about 300 SAP ADSOs into Hadoop on a daily basis?
B. Based on application needs, multiple staging tables (ADSOs) must be joined to form one core table. For example, one core table might be built from 10-15 staging tables (ADSOs), and we have around 80 core tables like this.
Considering the number of joins to be performed and the unavailability of Spark 2.0 or later, what would be the best options to go for? The Hive route would be slower, and the ETL turnaround is expected to be fast.
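
To make the join pattern in B concrete, here is a rough sketch of how one core-table load could be generated. The table names, the column names, the single shared join key (`doc_id`), and the use of LEFT JOIN are all placeholders assumed for illustration; the real ADSOs may have different keys and join types. The script only builds the SQL text and does not connect to Hive, Impala, or Spark.

```python
def build_core_join_sql(core_table, key, stages):
    """Build one INSERT OVERWRITE statement that joins all staging tables.

    stages is a list of (table_name, [columns]) pairs. The first table is
    the driving table; every other table is LEFT JOINed to it on `key`.
    All names passed in are hypothetical placeholders.
    """
    if not stages:
        raise ValueError("at least one staging table is required")

    select_cols = ["t0.{}".format(key)]
    from_and_joins = []
    for i, (table, cols) in enumerate(stages):
        # Qualify every column with its table alias to avoid name clashes.
        select_cols.extend("t{}.{}".format(i, c) for c in cols)
        if i == 0:
            from_and_joins.append("FROM {} t0".format(table))
        else:
            from_and_joins.append(
                "LEFT JOIN {t} t{i} ON t0.{k} = t{i}.{k}".format(t=table, i=i, k=key)
            )

    header = [
        "INSERT OVERWRITE TABLE {}".format(core_table),
        "SELECT " + ", ".join(select_cols),
    ]
    return "\n".join(header + from_and_joins)


# Hypothetical example: one core table fed by 12 staging tables.
stages = [("stg.adso_{:02d}".format(n), ["val_{:02d}".format(n)]) for n in range(1, 13)]
sql = build_core_join_sql("core.sales_fact", "doc_id", stages)
print(sql)
```

With around 80 core tables, each generated statement would contain 9-14 joins like the ones above, which is the workload the question is about.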