Created 04-25-2017 03:39 PM
I am doing capacity planning for my future cluster (pure data lake). We are going to archive data from MS SQL Server to Hadoop. The current size of the MS SQL Server is close to 100 TB and it is growing at 1 TB/month. Is there a size difference between MS SQL Server data files and HDFS? If so, do you know what the ratio is?
Please let me know if anyone has come across a similar situation.
Thanks
Kumar
Created 04-25-2017 03:54 PM
Roughly speaking, your size will triple in HDFS because data blocks are replicated to produce three copies for high-availability reasons (e.g. if a disk fails). In addition to that, you will need local disk of about ~20% of HDFS capacity for intermediate storage of job results. That is a very high-level starting point. You can compress data in HDFS, and you will likely process data in HDFS and keep the original, but those involve strategic decisions around how you will be using the data (a separate discussion/post). Since you do not want to cut it close with capacity planning, give yourself a margin beyond 3x + 20%. Keep in mind that HDFS storage is very cheap compared to MS SQL, so the redundant data blocks will still not have a large relative cost.
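To make that concrete, here is a minimal back-of-envelope sketch of the math above; the ~20% scratch figure comes from this post, while the 25% margin and the growth numbers are illustrative assumptions you should adjust for your environment:

```python
# Rough raw-capacity estimate: 3x replication + ~20% scratch + safety margin.
def raw_capacity_tb(logical_tb, replication=3, scratch_pct=0.20, margin_pct=0.25):
    replicated = logical_tb * replication              # three copies of every block
    scratch = replicated * scratch_pct                 # intermediate space for job results
    return (replicated + scratch) * (1 + margin_pct)   # headroom so you don't run hot

current_tb = 100          # ~100 TB in MS SQL Server today
growth_tb_per_year = 12   # ~1 TB/month
for year in range(4):
    logical = current_tb + year * growth_tb_per_year
    print(f"Year {year}: ~{logical} TB logical -> ~{raw_capacity_tb(logical):.0f} TB raw HDFS")
```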
Created 05-02-2017 08:50 PM
My capacity planning exercise says I need 5 PB in the next 5 years, but I am definitely not going to get 5 PB next month; capacity may be added as the system grows.
For my hardware refresh project, please provide me some architecture guidance in order to scale to this capacity.
Thanks
Kumar
Created 05-03-2017 07:32 PM
The good news is that it is quite easy to add new data nodes to your cluster. This is especially true if you use Ambari and its workflow steps. The rest of the cluster becomes aware of the new data nodes as members of the existing cluster ... no work needs to be done for clients, jobs, etc. When new data nodes are added, new blocks are placed on them automatically, and you can run the HDFS balancer to spread existing file blocks onto them. So, adding sets of say 4 or 8 nodes to the cluster periodically is operationally easy and has no impact on existing jobs. Purchasing, racking, and stacking the servers that will be added as data nodes is the hardest part (unless you are in the cloud), and enough lead time should be given to this in your capacity planning.
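As a rough sketch of what a phased build-out could look like against the ~5 PB, 5-year target mentioned above (the node size and the assumption of roughly linear growth are illustrative, not a recommendation):

```python
# Phased expansion plan: nodes needed per year if raw capacity grows roughly linearly.
import math

NODE_RAW_TB = 24        # e.g. 12 x 2 TB disks per data node
TARGET_RAW_TB = 5000    # ~5 PB of raw capacity by year 5
YEARS = 5

prev_nodes = 0
for year in range(1, YEARS + 1):
    raw_needed = TARGET_RAW_TB * year / YEARS
    nodes = math.ceil(raw_needed / NODE_RAW_TB)
    print(f"Year {year}: ~{raw_needed:.0f} TB raw -> {nodes} nodes (add {nodes - prev_nodes})")
    prev_nodes = nodes
```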
Created 05-08-2017 07:28 PM
Thanks Greg. This helps
Created 05-02-2017 10:59 PM
You've got all the pointers; you just need to do the math.
Compression also depends on the type of data (i.e., the columns, types, and rows), so it would be ideal to test this out on a sandbox/single-node cluster. That should give you some good estimates on compression for overall planning.
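One simple way to get that estimate, assuming you can export a representative sample of a large table to a local file (the file name below is a placeholder, and stdlib zlib only approximates what the ORC/ZLIB codec will achieve on columnar data):

```python
# Estimate a compression ratio from a sample extract before sizing the cluster.
import zlib

SAMPLE_FILE = "sample_extract.csv"   # hypothetical CSV export of a few million rows

with open(SAMPLE_FILE, "rb") as f:
    raw = f.read()

compressed = zlib.compress(raw, level=6)
print(f"Raw: {len(raw)/1e6:.1f} MB, compressed: {len(compressed)/1e6:.1f} MB, "
      f"ratio ~{len(raw)/len(compressed):.1f}x")
```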
Created 05-03-2017 02:17 AM
I actually disagree that 1 TB on MS SQL equals 3 TB on Hadoop. When you store data in MS SQL Server, what kind of storage layer are you using? If it's a SAN, then the SAN already replicates data for resiliency. Imagine if the MS SQL Server were keeping only one copy of the data; without a doubt, you would have lost some data by now. If you are not storing on a SAN, then you probably have a RAID array. Depending on the type of RAID, you again have multiple copies of data to protect against disk failure. You might be using a technique like erasure coding, which you can also use with Hadoop. Here is a link for more details.
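To put the overhead numbers side by side, here is a small sketch of raw storage per logical byte for a few common schemes (the RAID figures are typical examples rather than a statement about your actual array; HDFS erasure coding with RS(6,3) is available in Hadoop 3.x):

```python
# Raw bytes stored per logical byte for different redundancy schemes.
def overhead(data_units, parity_units):
    return (data_units + parity_units) / data_units

schemes = {
    "HDFS 3-way replication":      overhead(1, 2),    # 3.0x
    "HDFS erasure coding RS(6,3)": overhead(6, 3),    # 1.5x
    "RAID 10 (mirroring)":         overhead(1, 1),    # 2.0x
    "RAID 6 (12-disk group)":      overhead(10, 2),   # 1.2x
}
for name, factor in schemes.items():
    print(f"{name}: {factor:.2f}x raw per logical TB")
```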
Now, my next point: you probably also have a DR MS SQL Server or backups, and you will need the equivalent for Hadoop as well. Here we are simply comparing one MS SQL Server environment against one Hadoop cluster.
In Hadoop you can, and should, compress data. There is a good chance your current SQL Server environment is also compressed; if so, you will not save much relative to today by compressing in Hadoop, because your data is already compressed. If your data is currently not compressed, then you'll be happy to see the results with Snappy or ZLIB in Hadoop.
So, going back to your original question: data is data. If you have 100 TB of data, Hadoop will store three copies of it. If you compress, you might shrink your data by a factor of roughly 3 and make that replication footprint disappear.
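The arithmetic, spelled out (the 3x compression factor is an assumption to validate on your own data, e.g. with the sample test described above):

```python
# With ~3x compression, the 3x replication factor roughly cancels out.
logical_tb = 100          # data as it sits in SQL Server today
replication = 3
compression_factor = 3.0  # assumed; measure on a representative sample

hdfs_raw_tb = logical_tb / compression_factor * replication
print(f"~{hdfs_raw_tb:.0f} TB of raw HDFS for {logical_tb} TB of source data")
```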
As for the size of the cluster, that depends on a number of factors. Are you buying hardware for the next three years (the right thing to do)? Then just plan accordingly. Give each node 12 x 2 TB = 24 TB of raw capacity, or you can go higher these days; a number of factors come into this. See the following link (it ignores the power consumption of additional disks, which can be significant for your infrastructure team):
https://community.hortonworks.com/content/kbentry/48878/hadoop-data-node-density-tradeoff.html
On each node, reserve about 20% of disk for temporary files and space used by running jobs.
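Tying the per-node numbers together, here is a short sketch of usable HDFS space per node after that ~20% reserve and the node count it implies for a 5 PB raw target (the 12 x 2 TB layout is the example from above, not a recommendation):

```python
# Usable HDFS capacity per node after the temporary-space reserve.
import math

disks_per_node = 12
disk_tb = 2
temp_reserve = 0.20                                   # non-DFS space for running jobs

raw_per_node = disks_per_node * disk_tb               # 24 TB raw
hdfs_per_node = raw_per_node * (1 - temp_reserve)     # ~19.2 TB usable for HDFS

target_raw_pb = 5
nodes = math.ceil(target_raw_pb * 1000 / hdfs_per_node)
print(f"{hdfs_per_node:.1f} TB of HDFS per node -> ~{nodes} nodes for {target_raw_pb} PB raw")
```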