I am trying to understand how Hadoop can fit into an enterprise data architecture. If we want to incorporate Hadoop into our enterprise data architecture, does this mean we have to copy all the data in our huge file system into HDFS if we want to have these files indexed? Wouldn't this lead to duplicating huge amounts of data (100s of TBs)?
You can deploy a Hadoop cluster and start with a very basic use case: data archival.
You can move cold data from your warehouse system into Hadoop to reduce the footprint in the EDW. Any data sitting on tape is useless unless you have an automated way to recover and restore it. If you like, you can also keep the backup in a readable, queryable format, for example as Hive tables.
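To make "cold data" concrete, here is a minimal Python sketch of how you might shortlist archival candidates before moving anything. It assumes you can export per-table size and last-access time from your warehouse catalog. The `TableStats` record, the `split_cold` helper, and the 365-day threshold are all hypothetical names and values chosen for illustration, not part of Hadoop or any EDW product.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class TableStats:
    """Hypothetical catalog record for one warehouse table."""
    name: str
    size_bytes: int
    last_access_epoch: float  # seconds since the Unix epoch


def split_cold(tables, now_epoch, cold_after_days=365):
    """Partition tables into (cold, hot) by last access time.

    A table is treated as cold when it has not been accessed within
    the last `cold_after_days` days (an assumed policy, tune as needed).
    """
    cutoff = now_epoch - cold_after_days * SECONDS_PER_DAY
    cold = [t for t in tables if t.last_access_epoch < cutoff]
    hot = [t for t in tables if t.last_access_epoch >= cutoff]
    return cold, hot


def reclaimable_bytes(cold_tables):
    """Total EDW space that archiving the cold tables would free."""
    return sum(t.size_bytes for t in cold_tables)
```

You would feed this with whatever access statistics your warehouse exposes, then archive only the `cold` list into Hadoop and keep the `hot` list where it is.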
Check out this use case
I read the data archival use case, and I think we can use a data architecture of this sort. Is it also possible to archive the SQL database of Microsoft SharePoint into Hadoop?
@Ahmad Debbas there is no requirement to move all your enterprise data into HDFS. Much of your data can and will still reside outside of HDFS. The challenge is that existing data structures were created based on application needs, e.g. ERP, CRM, etc. This application-driven architecture created silos of data.
HDFS and the larger Hadoop ecosystem provide the opportunity for a data-driven architecture. My suggestion is to look at your current data sources and determine which sets of data would provide the highest value and insight if they were consolidated in HDFS. You can then choose to run the analytics directly on the data in HDFS or move the data to other systems.
Don't worry about data duplication. The fear of duplicating data is a by-product of relational systems, where compute performance was an issue, and of SANs and MPPs, where storage cost is a concern. HDFS solves both of these problems.
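If you still want to put a rough number on it, the arithmetic is simple. The sketch below assumes the HDFS default replication factor of 3 (`dfs.replication`). The 25% free-space headroom and the `raw_capacity_tb` helper are my own illustrative assumptions, not a Hadoop sizing rule.

```python
def raw_capacity_tb(logical_tb, replication=3, headroom=0.25):
    """Estimate raw disk (in TB) needed to hold `logical_tb` of data in HDFS.

    replication: copies of each block (3 is the HDFS default).
    headroom: fraction of raw disk kept free for temporary data and
              growth (an assumed planning value, not an HDFS setting).
    """
    if logical_tb < 0:
        raise ValueError("logical_tb must be non-negative")
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    return logical_tb * replication / (1 - headroom)


if __name__ == "__main__":
    # Example: consolidating 100 TB of selected data sets.
    print(raw_capacity_tb(100))
```

The point is that you size only for the data sets you choose to consolidate, not for the entire file system.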
As you mature your Hadoop environment, you will want to think HDFS first. This means any new data coming into your enterprise should land in HDFS first. From there it can go wherever you want it to go. HDFS will provide the most flexible, scalable, and economical solution for storing your data and for starting your journey toward a data-driven enterprise.