Created 04-25-2017 03:39 PM
I am doing capacity planning for my future cluster (pure data lake). We are going to archive data from MS SQL Server to Hadoop. The current size of the MS SQL Server is close to 100 TB and it is growing at 1 TB/month. Is there a size difference between MS SQL Server data files and HDFS? If so, do you know what the ratio is?
Please let me know if anyone has come across a similar situation.
Thanks
Kumar
Created 04-25-2017 03:54 PM
Roughly speaking, your size will triple in HDFS because data blocks are replicated to produce three copies for high-availability reasons (e.g. if a disk fails). In addition to that, you will need local disk of about ~20% of HDFS capacity for intermediate storage of job results. That is a very high-level starting point. You can compress data in HDFS, and you will likely process data in HDFS and keep the original, but those involve strategic decisions around how you will be using the data (a separate discussion/post). Since you do not want to cut it close with capacity planning, give yourself a margin beyond 3x + 20%. Keep in mind that HDFS storage is very cheap compared to MS SQL, so the redundant data blocks will still not have a large relative cost.
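To make that concrete, here is a minimal back-of-envelope sketch of the math above; the ~20% scratch figure comes from this post, while the 25% margin and the growth numbers are illustrative assumptions you should adjust for your environment:

```python
# Rough raw-capacity estimate: 3x replication + ~20% scratch + safety margin.
def raw_capacity_tb(logical_tb, replication=3, scratch_pct=0.20, margin_pct=0.25):
    replicated = logical_tb * replication              # three copies of every block
    scratch = replicated * scratch_pct                 # intermediate space for job results
    return (replicated + scratch) * (1 + margin_pct)   # headroom so you don't run hot

current_tb = 100          # ~100 TB in MS SQL Server today
growth_tb_per_year = 12   # ~1 TB/month
for year in range(4):
    logical = current_tb + year * growth_tb_per_year
    print(f"Year {year}: ~{logical} TB logical -> ~{raw_capacity_tb(logical):.0f} TB raw HDFS")
```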
Created 05-02-2017 08:50 PM
My capacity planning exercise says I need 5 PB in the next 5 years, but I am definitely not going to get 5 PB next month; capacity may be added as the system grows.
For my hardware refresh project, please provide me some architecture guidance in order to scale to this capacity.
Thanks
Kumar
Created 05-03-2017 07:32 PM
The good news is that it is quite easy to add new data nodes to your cluster. This is especially true if you use Ambari and its workflow steps. The rest of the cluster becomes aware of the new data nodes as members of the existing cluster ... no work needs to be done for clients, jobs, etc. When new data nodes are added, new blocks are placed on them automatically, and you can run the HDFS balancer to spread existing file blocks onto them. So, adding sets of say 4 or 8 nodes to the cluster periodically is operationally easy and has no impact on existing jobs. Purchasing, racking, and stacking the servers that will be added as data nodes is the hardest part (unless you are in the cloud), and enough lead time should be given to this in your capacity planning.
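As a rough sketch of what a phased build-out could look like against the ~5 PB, 5-year target mentioned above (the node size and the assumption of roughly linear growth are illustrative, not a recommendation):

```python
# Phased expansion plan: nodes needed per year if raw capacity grows roughly linearly.
import math

NODE_RAW_TB = 24        # e.g. 12 x 2 TB disks per data node
TARGET_RAW_TB = 5000    # ~5 PB of raw capacity by year 5
YEARS = 5

prev_nodes = 0
for year in range(1, YEARS + 1):
    raw_needed = TARGET_RAW_TB * year / YEARS
    nodes = math.ceil(raw_needed / NODE_RAW_TB)
    print(f"Year {year}: ~{raw_needed:.0f} TB raw -> {nodes} nodes (add {nodes - prev_nodes})")
    prev_nodes = nodes
```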
Created 05-08-2017 07:28 PM
Thanks Greg. This helps
Created 05-02-2017 10:59 PM
You've got all the pointers; you just need to do the math.
Compression also depends on the type of data (i.e., the columns, types, and rows), so it would be ideal to test this out on a sandbox/single-node cluster. That should give you some good estimates on compression for overall planning.
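One simple way to get that estimate, assuming you can export a representative sample of a large table to a local file (the file name below is a placeholder, and stdlib zlib only approximates what the ORC/ZLIB codec will achieve on columnar data):

```python
# Estimate a compression ratio from a sample extract before sizing the cluster.
import zlib

SAMPLE_FILE = "sample_extract.csv"   # hypothetical CSV export of a few million rows

with open(SAMPLE_FILE, "rb") as f:
    raw = f.read()

compressed = zlib.compress(raw, level=6)
print(f"Raw: {len(raw)/1e6:.1f} MB, compressed: {len(compressed)/1e6:.1f} MB, "
      f"ratio ~{len(raw)/len(compressed):.1f}x")
```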
Created 05-03-2017 02:17 AM
I actually disagree that 1 TB on MS SQL equals 3 TB on Hadoop. When you store data in MS SQL Server, what kind of storage layer are you using? If it's a SAN, then the SAN already replicates data for resiliency. Imagine if the MS SQL Server were keeping only one copy of the data; without a doubt, you would have lost some data by now. If you are not storing on a SAN, then you probably have a RAID array. Depending on the type of RAID, you again have multiple copies of data to protect against disk failure. You might be using a technique like erasure coding, which you can also use with Hadoop. Here is a link for more details.
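To put the overhead numbers side by side, here is a small sketch of raw storage per logical byte for a few common schemes (the RAID figures are typical examples rather than a statement about your actual array; HDFS erasure coding with RS(6,3) is available in Hadoop 3.x):

```python
# Raw bytes stored per logical byte for different redundancy schemes.
def overhead(data_units, parity_units):
    return (data_units + parity_units) / data_units

schemes = {
    "HDFS 3-way replication":      overhead(1, 2),    # 3.0x
    "HDFS erasure coding RS(6,3)": overhead(6, 3),    # 1.5x
    "RAID 10 (mirroring)":         overhead(1, 1),    # 2.0x
    "RAID 6 (12-disk group)":      overhead(10, 2),   # 1.2x
}
for name, factor in schemes.items():
    print(f"{name}: {factor:.2f}x raw per logical TB")
```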
Now, my next point: you probably also have a DR MS SQL Server or backups, and you will need the equivalent for Hadoop as well. Here we are simply comparing one MS SQL Server environment against one Hadoop cluster.
In Hadoop you can, and should, compress data. There is a good chance your current SQL Server environment is also compressed; if so, you will not save much relative to today by compressing in Hadoop, because your data is already compressed. If your data is currently not compressed, then you'll be happy to see the results with Snappy or ZLIB in Hadoop.
So, going back to your original question: data is data. If you have 100 TB of data, Hadoop will store three copies of it. If you compress, you might shrink your data by a factor of roughly 3 and make that replication footprint disappear.
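The arithmetic, spelled out (the 3x compression factor is an assumption to validate on your own data, e.g. with the sample test described above):

```python
# With ~3x compression, the 3x replication factor roughly cancels out.
logical_tb = 100          # data as it sits in SQL Server today
replication = 3
compression_factor = 3.0  # assumed; measure on a representative sample

hdfs_raw_tb = logical_tb / compression_factor * replication
print(f"~{hdfs_raw_tb:.0f} TB of raw HDFS for {logical_tb} TB of source data")
```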
As for the size of the cluster, that depends on a number of factors. Are you buying hardware for the next three years (the right thing to do)? Then just plan accordingly. Give each node 12 x 2 TB = 24 TB of raw capacity, or you can go higher these days; a number of factors come into this. See the following link (it ignores the power consumption of additional disks, which can be significant for your infrastructure team):
https://community.hortonworks.com/content/kbentry/48878/hadoop-data-node-density-tradeoff.html
On each node, reserve about 20% of disk for temporary files and space used by running jobs.
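Tying the per-node numbers together, here is a short sketch of usable HDFS space per node after that ~20% reserve and the node count it implies for a 5 PB raw target (the 12 x 2 TB layout is the example from above, not a recommendation):

```python
# Usable HDFS capacity per node after the temporary-space reserve.
import math

disks_per_node = 12
disk_tb = 2
temp_reserve = 0.20                                   # non-DFS space for running jobs

raw_per_node = disks_per_node * disk_tb               # 24 TB raw
hdfs_per_node = raw_per_node * (1 - temp_reserve)     # ~19.2 TB usable for HDFS

target_raw_pb = 5
nodes = math.ceil(target_raw_pb * 1000 / hdfs_per_node)
print(f"{hdfs_per_node:.1f} TB of HDFS per node -> ~{nodes} nodes for {target_raw_pb} PB raw")
```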