My use case
I receive a 20 GB pipe-delimited text file per day.
I have indexed 90 days of data (20 GB * 90).
Record count: 5.5 billion
Total fields: 30
Indexed fields: called number, calling number, time_key
All other fields are stored (as per schema.xml)
Index size: 300 GB
Number of shards: 4
I used the method below to index (org.apache.solr.hadoop.MapReduceIndexerTool):

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file $path/morphlines.conf --output-dir hdfs://MASTERNODE:8020/$path2 --go-live --zk-host MASTERNODE:2181/solr --collection COLLECTIONNAME --mappers 4 --reducers 12 hdfs://Masternode/path/asd.txt
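In case it matters, my morphlines.conf is roughly as follows. This is only a sketch: the underscore field names are placeholders for my real ones, and the SOLR_LOCATOR values mirror the command above.

SOLR_LOCATOR : {
  collection : COLLECTIONNAME
  zkHost : "MASTERNODE:2181/solr"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        # parse each pipe-delimited line into named fields
        readCSV {
          separator : "|"
          # first three columns are the indexed fields; the remaining
          # 27 stored columns are omitted here for brevity
          columns : [called_number, calling_number, time_key]
          trim : true
          charset : UTF-8
        }
      }
      # hand each record to Solr via the locator defined above
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]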
My test bed has 4 data nodes and 1 name node (running Cloudera 5.4.7).
Each node has 256 GB RAM. Are there any performance tips I should follow in Solr?
It took around 120 seconds to get 3,000 records out of one search (a range query based on time_key). But after that first query the results get cached, and if I execute it again I get a response in under 1 second, even with a larger result set (10,000 records also come back within 1 second).
Note that when retrieving only 10-20 records, performance was good even on the first attempt.
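One thing I'm considering for the first-query slowness is a static warming query in solrconfig.xml, so the caches are filled as soon as a new searcher opens. A minimal sketch (the time_key range below is just a placeholder for a typical query of mine):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- placeholder range query on time_key to pre-fill the caches -->
      <str name="q">time_key:[20150101 TO 20150331]</str>
      <str name="sort">time_key asc</str>
    </lst>
  </arr>
</listener>

As far as I know, the same block can also be registered for the newSearcher event so warming happens again after each commit.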
Thanks for your input, gchanan.
By the way, if I change my solrconfig.xml as per the above input, do I need to re-create the collection and reload all the data again? (The total is 5.5 billion records, so re-uploading would be hard.)
Or is just editing solrconfig.xml and restarting the service enough?
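If I understand the docs correctly, cache and query-handler changes in solrconfig.xml only need a collection reload, not a re-index; a full re-index is only required for changes that affect how documents are written (e.g. schema field changes). Since this is SolrCloud, I assume the updated config first has to be uploaded to ZooKeeper, and then something like this via the Collections API would pick it up (host is from my setup; 8983 is the Solr default port):

curl "http://MASTERNODE:8983/solr/admin/collections?action=RELOAD&name=COLLECTIONNAME"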