Reply
New Contributor
Posts: 5
Registered: ‎02-10-2017

Solr indexing performance

Hi,
we are experiencing some performance issues with Solr batch indexing: we have a cluster composed of 4 workers, each equipped with 32 cores and 256GB of RAM. YARN is configured to use 100 vCores and 785.05GB of memory. HDFS storage is provided by an EMC Isilon system connected through a 10Gb interface. Our cluster runs CDH 5.8.0 with Solr 4.10.3 and is Kerberized.
With the current setup, in terms of compressed data, we can index about 25GB per day and 500GB per month using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15GB of compressed data. In particular, MorphlineMapper jobs last approximately 5 hours and TreeMergeMapper jobs about 6 hours. Is this performance normal? Can you suggest some tweaks that could improve our indexing performance?

 

Thank you.

Stefano

Posts: 173
Topics: 8
Kudos: 19
Solutions: 19
Registered: ‎07-16-2015

Re: Solr indexing performance

It seems rather slow.

That said, it could be caused by a lot of things.

 

Do you use a custom map/reduce job for indexing, or are you using the "MapReduceIndexerTool"?

Does your YARN configuration allow enough memory for the map and reduce tasks? A JVM that is too small can hurt a job's performance (too much GC).

Do you see pending containers during the job?

Is your network saturated?

From where are you extracting the data? (Hive? HBase? other?)

 

etc.

New Contributor
Posts: 4
Registered: ‎08-23-2017

Re: Solr indexing performance

Hi,

 

we are using the MapReduceIndexerTool and there are no network problems. We are reading compressed files from HDFS and decompressing them in our morphline. This is how we run our script:

cmd_hdp=$(
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf" hadoop --config /etc/hadoop/conf.cloudera.yarn \
jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D morphlineVariable.ZK_HOST=hostname1:2181/solr \
-D morphlineVariable.COLLECTION=my_collection \
-D mapreduce.map.memory.mb=8192 \
-D mapred.child.java.opts=-Xmx4096m \
-D mapreduce.reduce.java.opts=-Xmx4096m \
-D mapreduce.reduce.memory.mb=8192 \
--output-dir hdfs://isilonhostname:8020/tmp/my_tmp_dir \
--morphline-file morphlines/my_morphline.conf \
--log4j log4j.properties \
--go-live \
--collection my_collection \
--zk-host hostname1:2181/solr \
hdfs://isilonhostname:8020/my_input_dir/
)



The MorphlineMapper phase takes all available resources, the TreeMergeMapper takes only a couple of containers.

 

We don't need to make queries for the moment, we just need to index historical data. We are wondering if there is a way to speed up indexing time and then optimize collections for searching when indexing is complete.

 

Thank you,

Sergio

Posts: 173
Topics: 8
Kudos: 19
Solutions: 19
Registered: ‎07-16-2015

Re: Solr indexing performance

Hi, I assume you are working with the previous poster on the same project?

 

Some questions that might help you:

 

>

-D mapreduce.map.memory.mb=8192 \
-D mapred.child.java.opts=-Xmx4096m \
-D mapreduce.reduce.java.opts=-Xmx4096m \
-D mapreduce.reduce.memory.mb=8192 \

Usually, the "java.opts" parameters are set to 80% of the "memory.mb" ones. Is there a specific reason you set them to only 50%?

 

>

How is your collection configured? How many shards (and how many replicas per shard)?

 

>

In the first post, it is said that the mapper phase takes 5 hours (for roughly 15GB of compressed data).

What is the processing time of a "single mapper task" ?

How many mapper tasks are launched in this phase ?

 

>

What is the compression algorithm used ?

Is it efficient (is the trade-off between compression rate and performance acceptable)?

 

>

There is no information on the disks of the worker nodes.

What types of disks are attached, and how many?

Did you check the I/O? Is there some contention on them?

What is the maximum throughput of the disks?

 

>

Are the CPUs of the worker nodes overused during indexing, or are they idle?

 

>

Is there room to improve the processing time of the morphline script? Is it efficient enough?

Is there a "loadSolr" instruction in the morphline?

 

>

15GB of compressed data : how many lines does it represent ? (how many fields per line ?)

 

>

How does Solr handle the load at the end of the indexing process? (when Solr is loading the data)

 

Best regards,

Mathieu
New Contributor
Posts: 4
Registered: ‎08-23-2017

Re: Solr indexing performance

[ Edited ]

Hi,
yes, I forgot to mention that I'm a colleague of the guy who wrote the first post.
Thanks for your help, I'll try to be as clear as possible.


Usually, the "java.opts" parameters are set to 80% of the "memory.mb" ones. Is there a specific reason you set them to only 50%?
We didn't find good documentation about this; we just set these parameters based on our experience. Are there any resources about this?

We upgraded our cluster since the first post. Now we have 8 workers, with a total of 200 vCores and 1.5TB of memory.


How is your collection configured? How many shards (and how many replicas per shard)?
Our collections have 12 shards and 2 replicas per shard. The workload is balanced across machines, so there are 3 Solr cores per machine.
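Just to make the arithmetic explicit (all numbers taken from the figures above), the per-worker core count falls out like this:

```shell
# 12 shards x 2 replicas spread evenly over 8 workers
shards=12
replicas=2
workers=8
total_cores=$(( shards * replicas ))
echo "${total_cores} Solr cores in total, $(( total_cores / workers )) per worker"
```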


In the first post, it is said that the mapper phase takes 5 hours (for roughly 15GB of compressed data).
What is the processing time of a single mapper task?
How many mapper tasks are launched in this phase?
Are the CPUs of the worker nodes overused during indexing, or are they idle?
How does Solr handle the load at the end of the indexing process? (when Solr is loading the data)
Our source produces about 20GB of compressed data every day, split into about 550 compressed files. The number of map tasks equals the number of input files. We run one indexing job per day.
A single MorphlineMapper map task takes about 20 minutes to complete. Considering the total number of cores, on a fully unloaded cluster the map phase takes about 1 hour to complete. During this phase, the workers' CPUs are loaded at almost 100%.
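That 1-hour figure is consistent with a quick back-of-the-envelope check (task count, per-task time, and concurrency taken from the numbers above):

```shell
# ~550 map tasks at ~20 minutes each, with up to 200 concurrent containers
tasks=550
minutes_per_task=20
concurrent_slots=200
echo "$(( tasks * minutes_per_task / concurrent_slots )) minutes of wall-clock time (lower bound)"
```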

The reduce phase takes almost 4 hours to complete. We tried two different approaches here:
We started indexing without setting the "--reducers" parameter. In this case, this phase takes 24 cores and almost 3 hours. When it ends, the TreeMergeMapper starts, which takes almost 2 hours to complete.
As far as I know, during this phase 24 "virtual shards" are created, which are then merged into the final 12 desired shards.

To avoid the TreeMergeMapper job, we tried setting the number of reducers to 12 (the same as the number of shards). In this case, however, the MorphlineMapper's reduce phase takes 12 cores and almost 5 hours to complete.
So we can't see any significant improvement with this strategy.

When the MorphlineMapper job (and possibly the TreeMergeMapper one) ends, the "Current" indicator in the Solr web UI's Statistics tab turns red, meaning something is going on. We can't track this stage in YARN, and CPU and memory usage are not very high during it. What is it about? After 4-5 hours the indicator turns green again and the collection is fully available.

 

solrui.png


What is the compression algorithm used?
Is it efficient (is the trade-off between compression rate and performance acceptable)?
15GB of compressed data: how many lines does it represent? (how many fields per line?)
Gzip is used to compress records; every compressed file contains a single txt file. This is the way our source sends data.
For 20GB of data we have about (550 compressed files) × (37MB each). Every file contains about 320,000 records. Every record consists of 23 text fields, some of which are dynamic:

<field name="id" stored="true" indexed="true" type="string" multiValued="false" required="true"/>
<field name="_version_" stored="true" indexed="true" type="long"/>
<field name="timestamp" indexed="true" stored="true" type="date"/>
<field name="file_name" stored="true" indexed="true" type="text_general"/>
<field name="cod_auth" default="null" stored="true" indexed="true" type="text_general"/>
<!-- field used for full text search -->
<field name="text" default="null" stored="true" indexed="true" type="text_general"/>
<dynamicField name="file_*" type="text_general" indexed="false" stored="false"/>
<dynamicField name="base_id*" type="text_general" indexed="false" stored="false"/>
<dynamicField name="*" type="text_general" indexed="true" stored="true"/>

 


Is there room to improve the processing time of the morphline script? Is it efficient enough?
Is there a "loadSolr" instruction in the morphline?
Yes, we have a "loadSolr" instruction in our morphline:

{
    loadSolr {
        solrLocator : ${SOLR_LOCATOR}
    }
}

 

Posts: 173
Topics: 8
Kudos: 19
Solutions: 19
Registered: ‎07-16-2015

Re: Solr indexing performance

[ Edited ]

I have three ideas after reading the information you have provided:

1 - Try increasing the parameter "mapreduce.reduce.java.opts" to 80% of 8GB. This might help the reduce phase processing time.
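As a sketch of that rule of thumb (the 80% figure is a common heuristic rather than an official setting; the 8192MB container size is taken from the command shown earlier in the thread):

```shell
# Size the reducer heap (-Xmx) at ~80% of the YARN container,
# leaving the remaining ~20% for off-heap memory and JVM overhead.
container_mb=8192
heap_mb=$(( container_mb * 80 / 100 ))
echo "-D mapreduce.reduce.memory.mb=${container_mb}"
echo "-D mapreduce.reduce.java.opts=-Xmx${heap_mb}m"
```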

 

2 - 550 files for 20GB of data means an average size of 37MB per file. I guess your block size is higher (64 or 128MB). Having fewer, bigger files might help the mapper phase.
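Since the inputs are gzip files, one low-cost way to get fewer, larger files is plain concatenation: a gzip file may contain several members, and tools such as zcat decompress them all in sequence, so no recompression is needed. A sketch with hypothetical file names (whether this fits your pipeline depends on how your morphline reads the files):

```shell
# Build two small gzip files, then combine them without recompressing.
printf 'hello\n' | gzip > part-0001.gz
printf 'world\n' | gzip > part-0002.gz
cat part-0001.gz part-0002.gz > combined.gz
zcat combined.gz   # decompresses both members in order
```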

 

3 - I don't think you need the "loadSolr" instruction inside the morphline.

From my understanding, the HBaseIndexerTool is in charge of loading Solr at the end of the processing. Having this particular instruction inside the morphline means you load Solr twice (as far as I understand, at least this is the case when reading from HBase).

Try removing this instruction from the morphline configuration file.

 

Good luck!