I am planning to index HDFS content into Solr, but I am not sure how the Solr index size compares to the HDFS size. For example, the HDFS content is 30 TB (PDFs, PPTs, Word docs). Also note that I only want to extract the textual content and index it into Solr.
@Naveen Keshava this is very difficult to estimate from TB on disk for these file types. For example, your files may be full of formatting and images, which take up space but produce no indexable text. It also depends on how the cluster will be used. You need more specifics, but here's a starting point.
Let's talk about disk and memory. I know this part is fairly basic here...
You will probably have to do some extraction on a sample of files to get good numbers. Note that this calculation is NOT the size of your Solr index on disk; it just tells you how much actual text you have.
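As a rough sketch of that sampling approach: extract text from a random subset of files, compute the text-to-file-size ratio, and extrapolate to the full corpus. The sample numbers and the extraction step are hypothetical placeholders; in practice you might use something like Apache Tika to pull plain text out of PDFs, PPTs, and Word docs.

```python
# Hedged sketch: estimate total extractable text from a random sample of files.
# The (file_bytes, extracted_text_bytes) pairs below are made-up numbers;
# you would produce them by running a text extractor (e.g. Apache Tika)
# over a random sample of your HDFS documents.

def estimate_text_bytes(sample, total_corpus_bytes):
    """sample: list of (file_bytes, extracted_text_bytes) pairs
    from a random subset of the corpus."""
    sampled_file_bytes = sum(f for f, _ in sample)
    sampled_text_bytes = sum(t for _, t in sample)
    text_ratio = sampled_text_bytes / sampled_file_bytes
    return total_corpus_bytes * text_ratio

# Hypothetical sample of three files (sizes in bytes):
sample = [(10_000_000, 400_000), (25_000_000, 1_200_000), (5_000_000, 300_000)]
total = 30 * 1024**4  # 30 TB corpus on HDFS
estimate = estimate_text_bytes(sample, total)
print(f"~{estimate / 1024**4:.1f} TB of raw text")  # → ~1.4 TB of raw text
```

The bigger and more representative the sample, the better the extrapolation; a 30 TB corpus full of scanned images will skew very differently from one full of text-heavy reports.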
Now, you have a lot of other things to consider which can drastically change the index size.
To continue on the disk-and-memory track, check out this spreadsheet. It is old, but for the most part it is still a reasonable calculator.
Now, a few other factors that may apply based on use of the cluster:
Remember that Solr loves fast disk I/O and loves RAM. The higher the load (query or ingest), the more it needs. Ideally you want your entire index to fit into memory across the cluster for the fastest performance, but that is not always possible. The fact is, you'll eventually just have to test, test, test.
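To make the "fit the index into memory" point concrete, here is a hedged back-of-envelope calculation. Every number in it (index size, replication factor, node count, headroom) is a hypothetical placeholder, not a recommendation.

```python
# Hedged back-of-envelope: RAM per node needed to keep the whole index
# (all replicas) in the OS page cache. All numbers are hypothetical.

def ram_per_node_gb(index_size_gb, replication_factor, num_nodes, headroom=1.3):
    """Total bytes of index copies spread evenly across the cluster,
    plus ~30% headroom for the JVM heap and the OS itself."""
    total_index_gb = index_size_gb * replication_factor
    return total_index_gb / num_nodes * headroom

# e.g. a 2 TB index with 2 replicas spread over 8 nodes:
print(ram_per_node_gb(2048, 2, 8))  # → 665.6 (GB per node)
```

If that number is out of reach, you fall back on fast disks (SSD/NVMe) and accept that cold queries will hit storage, which is exactly why load testing on real hardware is the only reliable answer.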