
Properly Size Index

Understanding what to index often requires deep business domain expertise in the data. This yields a better indexing strategy and more accurate search results. Not all data needs to be indexed, but an organization acquiring brand-new data usually has to index all of it until it understands what value the data brings to the business. This means the data will need to be re-indexed later, so it is good practice to store the raw data somewhere cheap, often in HDFS or in cloud object storage.

Tuning for Speed

This starts with monitoring QTime. QTime reports, in milliseconds, how long Solr took to process the request from the time it was received, covering query parsing and the actual search. It does not include the time taken to send the response back to the client, which depends heavily on the size of the payload and the speed of the network I/O.
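As a rough illustration of the point above, the sketch below separates QTime from the latency the client actually observes. It relies on the `responseHeader.QTime` field that Solr returns in JSON responses; the sample response body and the client-side timing are made up for the example.

```python
import json


def split_latency(response_body: str, client_elapsed_ms: float) -> dict:
    """Split client-observed latency into Solr's QTime and everything else."""
    header = json.loads(response_body)["responseHeader"]
    qtime_ms = header["QTime"]  # milliseconds Solr spent processing the request
    # Whatever remains is response writing, network transfer and client parsing.
    overhead_ms = max(client_elapsed_ms - qtime_ms, 0.0)
    return {"qtime_ms": qtime_ms, "overhead_ms": overhead_ms}


# Hypothetical response body and client timing, for illustration only.
sample_body = (
    '{"responseHeader": {"status": 0, "QTime": 12}, '
    '"response": {"numFound": 0, "start": 0, "docs": []}}'
)
print(split_latency(sample_body, client_elapsed_ms=85.0))
```

A large gap between the two numbers usually points at payload size or the network rather than at the query itself.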

Sorting works best on short-valued properties such as price or age, but not on tokenized values such as date, long field type values and others. For range queries, use trie field types; otherwise, avoid them.

For near-real-time search, soft commits are recommended because they make recently indexed data available in memory. A good soft commit interval is 15 seconds. A hard commit, on the other hand, is more about durability: the index goes to disk first, then memory. A hard commit interval of 60 seconds is a good value to keep the transaction logs from getting out of hand. The longer the commit interval, the slower a server restart will be, because more of the transaction log has to be replayed.
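The commit intervals above could be applied in several ways. One option, sketched here, is the Solr Config API. This assumes the Solr version in use exposes the `set-property` command with the `updateHandler.autoSoftCommit.maxTime` and `updateHandler.autoCommit.maxTime` properties, so check the reference guide for your release. The code only builds the JSON body and does not send it anywhere.

```python
import json

SOFT_COMMIT_MS = 15_000  # 15 seconds, the soft commit interval suggested above
HARD_COMMIT_MS = 60_000  # 60 seconds, the hard commit interval suggested above


def commit_settings_payload(soft_ms: int = SOFT_COMMIT_MS,
                            hard_ms: int = HARD_COMMIT_MS) -> str:
    """Build a Config API request body that sets both commit intervals."""
    if soft_ms <= 0 or hard_ms <= 0:
        raise ValueError("commit intervals must be positive")
    body = {
        "set-property": {
            # Property names are assumed from the Solr Config API documentation.
            "updateHandler.autoSoftCommit.maxTime": soft_ms,
            "updateHandler.autoCommit.maxTime": hard_ms,
        }
    }
    return json.dumps(body, indent=2)


print(commit_settings_payload())
```

The resulting body would be POSTed to the collection's config endpoint. The same two values can instead be set in the autoSoftCommit and autoCommit sections of solrconfig.xml.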

Parallel SQL is very slow and should only be used for batch-type searching. It works much like MapReduce does in Hadoop. The only value it brings is querying index data from multiple collections using SQL syntax.

Oversharding can be used for performance reasons, where all machines have shards for a specific replica. This can be very helpful with a very large data set.

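If the size of the index is smaller than the available memory of the Solr cluster, it is possible to load all of it into the OS cache by recursively reading every index file once, for example by running cat on each file and discarding the output. Note that a plain touch only updates timestamps and does not read file contents.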
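A minimal sketch of warming the OS page cache is shown below. It reads each file's contents rather than calling touch, since the kernel caches pages as they are read. The function name and the directory passed to it are hypothetical. Whether the pages stay cached depends on memory pressure on the host.

```python
import os


def warm_os_cache(index_dir: str, chunk_size: int = 1 << 20) -> int:
    """Read every file under index_dir once; return the number of bytes read."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                # Read in chunks so large segment files never sit in Python memory.
                while True:
                    chunk = fh.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total
```

It could be called as warm_os_cache("/path/to/solr/data") after a restart. The returned byte count is a quick check that the index really is smaller than the memory left for the OS cache.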

Sizing Hardware

Several factors influence the hardware configuration.

  1. # of documents
  2. frequency of data updates
  3. # of requests per second
  4. average size of document
  5. # of features that impact heap consumption

# of documents

Storage bound first, then memory. Depending on how many fields are scored, this can consume a lot of memory.

frequency of data updates

CPU bound first, then I/O. The CPU impact is large due to deserialization cost. It affects memory management as well, so it will require a decent heap size.

# of requests per second

CPU bound.

average size of document

Storage bound. The # of terms also incurs heap overhead. The rule of thumb is that the raw-to-index ratio is typically 5:1.
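The rule of thumb can be turned into a back-of-the-envelope estimate. The sketch below reads the 5:1 ratio literally as raw size to index size, which is an assumption about what the ratio means. The function names and sample figures are hypothetical, and real ratios vary with the schema and with stored fields.

```python
def estimate_index_gb(raw_gb: float, raw_to_index_ratio: float = 5.0) -> float:
    """Estimate on-disk index size from raw data size using a raw:index ratio."""
    if raw_to_index_ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_gb / raw_to_index_ratio


def fits_in_memory(index_gb: float, available_memory_gb: float) -> bool:
    """Check whether the estimated index could sit entirely in the OS cache."""
    return index_gb <= available_memory_gb


# Hypothetical figures: 500 GB of raw data, 128 GB of memory left for the OS cache.
index_gb = estimate_index_gb(500.0)
print(index_gb, fits_in_memory(index_gb, available_memory_gb=128.0))
```

An estimate like this is only a starting point. Measuring the index produced from a representative sample of the data gives a far more reliable ratio.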

# of features that impact heap consumption

Memory bound. Heavy use of facets and sorts requires a good amount of memory. A single facet query can bring a cluster down, and facets can drive the cost of the hardware. This requires a deeper understanding of how facets are used so that queries can be designed properly. Having the right data always matters more than scoring algorithms: get hits first, then score.

Comments

Could you explain “Oversharding can be used for performance reasons where all machines has shards for specific replica” in more detail?

Do you mean implicit shards?

Last update: 12-27-2016 05:42 PM