HBASE and SOLR capacity planning

Super Collaborator

Hi All,

I am looking for a ballpark estimate of the SOLR storage requirement.

Let's say I have 100TB raw data (structured relational data)

if I keep it in HDFS - I would need 200TB+ storage considering compression factor 2.3 and replication factor 3 (100/2.3 * 4)

if I keep in HBASE - I would need around 1000TB storage considering HBASE needs 5 times storage of HDFS

if I keep in SOLR - I would need around 100TB (close to the raw data size) - no hdfs

Can you please validate these assumptions? If you have rough estimates, please share them - that will help with a rough sizing.

Thanks,

Avijeet


3 REPLIES

Super Guru

@Avijeet Dash

Please see my comments inline:

if I keep it in HDFS - I would need 200TB+ storage considering compression factor 2.3 and replication factor 3 (100/2.3 *4)

How do you know your compression will be 2.3? What type of data is it? Is it going to be in a structured format (like columnar data)? Anyway, assuming your replication factor is 3, you need to multiply by 3, not 4, which brings your storage requirement to around 100 x 3 / 2.3 ≈ 130 TB.
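Just to put rough numbers on that, here is a quick Python sketch. The 2.3 compression ratio is carried over from your question as an assumption, not a measured value:

# Back-of-the-envelope HDFS sizing - the 2.3x compression ratio is an
# assumption taken from the question, not something measured.
raw_tb = 100.0           # raw data size in TB
replication = 3          # HDFS replication factor
compression_ratio = 2.3  # assumed compression ratio

hdfs_tb = raw_tb * replication / compression_ratio
print(f"Estimated HDFS footprint: {hdfs_tb:.0f} TB")  # prints ~130 TB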

if I keep in HBASE - I would need around 1000TB storage considering HBASE needs 5 times storage of HDFS

Where did you get this five-times number? HBase uses HDFS for storage, which means a replication factor of 3. Compression can also be enabled in HBase, so considering that, your storage requirement should be very similar to what you have in HDFS (not exactly the same, because ORC stores data differently than HBase, which means your compression will result in different storage requirements).
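To show why this lands in the same ballpark as plain HDFS rather than at 5x, here is the same sketch with a different ratio. The 2.0 compression ratio is purely a hypothetical placeholder for whatever your codec (Snappy, etc.) actually achieves on your column families:

# HBase also sits on HDFS (replication factor 3); only the achievable
# compression ratio differs. The 2.0 figure is a hypothetical placeholder.
raw_tb = 100.0
replication = 3
hbase_compression_ratio = 2.0  # assumed; depends on codec and encoding

hbase_tb = raw_tb * replication / hbase_compression_ratio
print(f"Estimated HBase footprint: {hbase_tb:.0f} TB")  # ~150 TB, nowhere near 1000 TB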

if I keep in SOLR - I would need around 100TB (close to the raw data size) - no hdfs

If you are not storing data in HDFS, then where is it going? A regular file system? If so, are you going to have a RAID array or SAN environment where your data will be stored and replicated for resiliency? Both SAN and RAID make additional copies of data for resiliency. Using something like erasure coding (Reed-Solomon), you can save space, but it is not going to be exactly a 1-to-1 ratio. Imagine you store only one copy and your disk fails - are you okay with losing data? (If yes, then you can set the replication factor in HDFS to 1.) Also, SOLR will require more space because it needs to create indexes, which are stored in addition to your data and can be very large.
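To make the "close to raw size" assumption concrete, here is a hedged sketch. The stored fraction, index overhead, and replica count below are all assumptions you would validate by indexing a representative sample of your data:

# Rough SOLR sizing sketch - every factor below is an assumption to be
# validated with a sample load, not a rule of thumb.
raw_tb = 100.0
stored_fraction = 1.0  # fraction of raw data kept as stored fields (assumed)
index_overhead = 0.4   # index size as a fraction of raw data (assumed)
replicas = 2           # copies kept for resiliency outside HDFS (assumed)

solr_tb = raw_tb * (stored_fraction + index_overhead) * replicas
print(f"Estimated SOLR footprint: {solr_tb:.0f} TB")  # ~280 TB under these assumptions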

One question that's bugging me here: it seems you are choosing one of these systems for your use case. If so, you cannot make that decision based on storage requirements alone. Bringing raw data into HDFS is very different from importing it into HBase, which in turn is very different from using SOLR. Which approach you take depends on your ultimate use case. If you need to index data for fast searches, you'll likely use SOLR (which, by the way, will require more storage); whereas if your use case is ETL offload, you should probably just use HDFS combined with Hive and Spark.

Super Collaborator

Thanks @mqureshi,

Sorry about the wild estimate numbers. Is it possible to share a sample reference from a real-life scenario - proportionately, how much storage is required for one large table?

The solution we are looking at needs both HBASE and SOLR, as there is a need for real-time read/write as well as full-text search. So we need to plan 2 separate clusters.

Also, I was told Solr is better without HDFS:

https://community.hortonworks.com/questions/71032/hdp-search.html#answer-71041

Super Guru

@Avijeet Dash

Here is a link for HBase sizing that you can use:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Sys_Admin_Guides/content/ch_clust_capaci...

If you are using both HBase and SOLR, I am going to assume you are going to index HBase columns in SOLR. There are two concepts in SOLR when it comes to sizing: what you will be indexing and what you will be storing. You need to know what you'll be storing (all of the HBase columns? Probably not, but I am no one to say) and what you'll be indexing (definitely not everything, but whatever you index will take space in addition to what you store).
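Here is a rough sketch of how those two knobs (what you store vs. what you index in SOLR) drive the estimate. Every fraction in it is a hypothetical example, not a recommendation - the real numbers come from your schema design:

# Indexing HBase columns in Solr: stored fields and indexed fields grow the
# footprint independently. All fractions here are hypothetical examples.
hbase_data_tb = 100.0
stored_columns_fraction = 0.2   # share of column data stored in Solr (assumed)
indexed_columns_fraction = 0.1  # share of column data that is indexed (assumed)
index_overhead = 0.5            # index size relative to the indexed data (assumed)

solr_tb = hbase_data_tb * (stored_columns_fraction
                           + indexed_columns_fraction * index_overhead)
print(f"Estimated Solr footprint on top of HBase: {solr_tb:.0f} TB")  # ~25 TB here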

As for SOLR being better without HDFS, that is more of an opinion. I have seen clusters where SolrCloud runs just fine alongside HBase and HDFS. Here is what you should remember: ZooKeeper should have its own dedicated disk (please do not share ZooKeeper disks - I cannot overemphasize this). Size appropriately, meaning have the right amount of CPU and memory resources. If you give SOLR only 4 GB of heap space, there will likely be problems (but do not go to the other extreme either, as that will result in long Java garbage collection pauses - an ideal heap to start with is 8-12 GB). Another thing to consider is what kind of queries your end users will be running. If they start scanning the entire SOLR index, there is no doubt that you will run into issues.