01-24-2015 03:09 PM
I can't find any forum on this site where I can ask architecture-level questions, so I am asking them here.
These days lots of big data vendors are talking about implementing an active archive solution on Hadoop. The value-add they show is a multi-tiered storage system where the EDW hosts small quantities of heavily used data, whereas old and unused data is transferred to Hadoop, where it resides on JBOD-type storage.
Although this sounds nice on the surface, it doesn't really sound cost effective in reality.
1. Earlier I had a combination of ETL tools and stored procedures doing the job of data processing. Now the data is processed on Hadoop by MapReduce jobs.
Since big data tooling is still at a nascent stage, this code is hard to write (Pig, or whatever).
So in the end the task of writing ETL becomes much harder and more expensive.
2. With the data being split across Hadoop and the EDW, there is confusion in apps as to where they should query the data from. So the app design becomes really complex, because now each app needs to work out which store, or which combination of stores, holds the data it needs.
This complicates app design and adds extra development cost.
3. There is also plenty of duplicate development if we use Hadoop as an enterprise data hub (EDH) and use an EDW downstream. Since users can choose between querying Hadoop or the EDW, we have to duplicate the effort of creating queryable views.
Note that the data on Hadoop will be heavily denormalized whereas in the EDW it will be normalized, but both will need business query views and would therefore require separate effort.
So even though the active archive concept saves some money on disk, its benefits are quickly eroded by the organization having to write code across 2 silos and do multiple tiers of development.
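
To make point 2 concrete, here is a minimal Python sketch of the kind of routing logic I mean. It is only an illustration under my own assumptions: the tier names, the `plan_query` function and the idea of a single archive cutoff date are all hypothetical, not something any vendor ships.

```python
from datetime import date, timedelta

# Hypothetical tier labels; a real app would map these to actual connections.
EDW = "edw"
HADOOP = "hadoop"


def plan_query(start: date, end: date, cutoff: date):
    """Split an inclusive date range into per-tier sub-ranges.

    Assumption: rows dated before `cutoff` have been archived to Hadoop,
    and rows dated on or after `cutoff` are still in the EDW.
    """
    if start > end:
        raise ValueError("start must not be after end")
    plan = []
    if start < cutoff:
        # Archived part of the range ends the day before the cutoff at the latest.
        plan.append((HADOOP, start, min(end, cutoff - timedelta(days=1))))
    if end >= cutoff:
        # Hot part of the range starts at the cutoff at the earliest.
        plan.append((EDW, max(start, cutoff), end))
    return plan


if __name__ == "__main__":
    # A report spanning the cutoff has to hit both stores.
    for tier, lo, hi in plan_query(date(2014, 1, 1), date(2015, 1, 24), date(2014, 7, 1)):
        print(tier, lo, hi)
```

Even in this toy form, every app that reads across the cutoff needs this split, two different query dialects, and a step to merge the two result sets, which is the extra development cost I am worried about.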