04-04-2017 01:18 AM
We are experiencing some performance issues with Solr batch indexing. Our cluster is composed of 4 workers, each equipped with 32 cores and 256 GB of RAM. YARN is configured to use 100 vCores and 785.05 GB of memory. The HDFS storage is managed by an EMC Isilon system connected through a 10 Gb interface. The cluster runs CDH 5.8.0 with Solr 4.10.3 and is Kerberized.
With the current setup we can index about 25 GB of compressed data per day, roughly 500 GB per month, using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15 GB of compressed data: the MorphlineMapper phase lasts approximately 5 hours and the TreeMergeMapper phase about 6 hours. Is this performance normal? Can you suggest any tweaks that could improve our indexing performance?
04-07-2017 01:54 AM
That seems rather slow.
That said, it could be caused by a lot of things.
Do you use a custom MapReduce job for indexing, or are you using the MapReduceIndexerTool?
Does your YARN configuration allow enough memory for the map and reduce tasks? Too small a JVM heap can slow a job down considerably (excessive GC).
Do you see pending containers while the job runs?
Is your network saturated?
Where are you extracting the data from? (Hive? HBase? Something else?)
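For reference, if you are using the MapReduceIndexerTool, the container and JVM heap sizes can be overridden per job as generic Hadoop options on the command line. The sketch below is illustrative only: the jar path, morphline file, ZooKeeper quorum, collection name, HDFS paths, reducer count, and all memory values are placeholders to adapt to your own cluster, not recommendations.

```shell
# Illustrative MapReduceIndexerTool invocation.
# Every path, host, and value below is a placeholder for your environment.
# Rule of thumb: keep the JVM heap (-Xmx) at roughly 80% of the container
# size so there is headroom for off-heap memory.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D mapreduce.map.memory.mb=4096 \
  -D mapreduce.map.java.opts=-Xmx3276m \
  -D mapreduce.reduce.memory.mb=8192 \
  -D mapreduce.reduce.java.opts=-Xmx6553m \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1/tmp/outdir \
  --zk-host zk1.example.com:2181/solr \
  --collection my_collection \
  --reducers 32 \
  --go-live \
  hdfs://nameservice1/path/to/input
```

While the job runs, the ResourceManager web UI (port 8088 by default) shows whether containers are sitting in the pending state, which would indicate the job is starved for cluster resources rather than CPU-bound.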