
solr index size vs hadoop hdfs size


I am planning to index HDFS content into Solr, but I am not sure how the Solr index size compares to the HDFS data size. For example, suppose the HDFS data is 30TB (PDFs, PPTs, Word docs). Also note that I only want to extract the textual content and index it into Solr.


Please could you help? Thanks! @james.jones, @ccasano, @Shivaji Dutta

Super Collaborator

@Naveen Keshava this is very difficult to estimate from TB on disk alone for these types of files. For example, your files may be full of formatting and images. It also depends on how the cluster will be used. You really need more specifics, but here's a starting point.

Let's talk about disk and memory. I know this part is fairly basic here...

  1. Calculate approximately how many documents you have: NDOCS
  2. Estimate average number of words per document: NWORDS
  3. Estimate average number of characters per word: NCHARS
  4. Estimate average number of UTF8 bytes per character. I tend to approximate 2 bytes per character, but that is a very rough calculation: NBYTES
  5. Now do the math to calculate approximately how many bytes of text you have: Total_Raw_Text_Bytes = NDOCS * NWORDS * NCHARS * NBYTES.

You will probably have to do some extraction on a sample to get good numbers. And that calculation is NOT the size of your Solr index on disk; it just tells you approximately how much actual text you have.
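The five steps above boil down to one multiplication. A minimal sketch, where every input value is a made-up placeholder you should replace with measurements from a sample of your own documents:

```python
# Back-of-envelope raw-text estimate for a document corpus.
# Every value here is an illustrative placeholder -- measure a
# sample of your own PDFs/PPTs/Word docs to get real numbers.

NDOCS = 10_000_000   # approximate number of documents
NWORDS = 2_000       # average words per document
NCHARS = 6           # average characters per word (incl. separator)
NBYTES = 2           # average UTF-8 bytes per character (rough)

total_raw_text_bytes = NDOCS * NWORDS * NCHARS * NBYTES
print(f"~{total_raw_text_bytes / 1024**4:.2f} TiB of raw text")
```

Notice how even ten million sizable documents yield well under 1 TiB of actual text; most of a 30TB corpus of office files is formatting and embedded media that never reaches the index.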

Now, you have a lot of other things to consider which can drastically change the index size.

  1. Will you store all of the content?
  2. How many fields will you have?
  3. How many fields will you index and how will they be indexed?
  4. Will you store all fields or only index them (or some of them)?
  5. How will you index them? (e.g. what Analyzers and Tokenizers will you use)
  6. How many Solr replicas will you have if you are using local storage or what is your HDFS replication factor? (for many cases local storage is preferred over HDFS, but not all)
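None of these questions have a universal answer, but they can be folded into a rough formula. In this sketch, `index_ratio` (inverted-index overhead as a fraction of raw text) is a guessed placeholder, not a measured constant; build a sample index to calibrate it for your schema and analyzers:

```python
def estimate_index_bytes(raw_text_bytes, index_ratio=0.35,
                         stored=True, replication=2):
    """Very rough Solr index-size estimate.

    index_ratio: assumed inverted-index overhead as a fraction of the
                 raw text (a guess -- calibrate with a sample index).
    stored:      whether field contents are stored as well as indexed.
    replication: Solr replica count or HDFS replication factor.
    """
    one_replica = raw_text_bytes * index_ratio
    if stored:
        # Stored fields keep a copy of the original text on disk.
        one_replica += raw_text_bytes
    return one_replica * replication

raw = 240 * 1024**3  # e.g. 240 GiB of extracted text (placeholder)
print(f"~{estimate_index_bytes(raw) / 1024**3:.0f} GiB on disk")
```

Toggling `stored=False` or changing `replication` in that sketch shows how much questions 1, 4, and 6 alone can swing the final footprint.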

To continue on this track of disk and memory, check out this spreadsheet. It is old but for the most part is a reasonable calculator.

Now, a few other factors that may apply based on use of the cluster:

  1. Average and maximum queries per minute (and parallel queries)
  2. Data ingest rate - updates per minute/second.
  3. Frequency of ingest - constant or batch
  4. Will users be querying it (and hard) while you are ingesting data?
  5. Complexity of queries such as boolean clauses
  6. Number of facets per query
  7. How fast does a new document need to be visible in the index after being added?
  8. What is your tolerance for disaster?

Remember that Solr loves fast disk IO and loves RAM. The higher the load (query or ingest) the more it needs. Ideally you want your entire index to fit into memory across the cluster for fastest performance but that is not always possible. The fact is, you'll eventually just have to test, test, test.
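To make "fit the entire index into memory" concrete, here is a simple per-node sketch. The index size, per-node RAM, and the fraction of RAM left for caching after JVM heap and OS overhead are all assumed placeholder values:

```python
import math

def nodes_for_in_memory_index(index_bytes, ram_per_node_bytes,
                              usable_fraction=0.5):
    """Nodes needed to hold the whole index in OS page cache.

    usable_fraction: assumed share of node RAM available for caching
                     the index after JVM heap and OS overhead (a guess).
    """
    usable = ram_per_node_bytes * usable_fraction
    return math.ceil(index_bytes / usable)

index_size = 648 * 1024**3   # example index size (placeholder)
node_ram = 128 * 1024**3     # 128 GiB RAM per node (placeholder)
print(nodes_for_in_memory_index(index_size, node_ram))
```

If that node count is impractical, that's the signal to fall back on fast disk IO and accept that only the hot portion of the index will be cached, which is why load testing is the final word.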

Super Collaborator

@Michael Young, @Jonas Straub - you guys may also have ideas and may have some clearer guidelines.

Super Collaborator

@Naveen Keshava - one thing you didn't mention is whether 30TB is before or after HDFS replication, so keep that in mind.