Thursday, May 31, 2012

Solr Query Time Performance Tips

Solr is built on top of the Lucene information retrieval library and implements the Vector Space Model for ranking documents with respect to a particular set of terms called the query string. Solr is written in Java and can be deployed as a web application in any servlet container, such as Tomcat. Though Solr is highly scalable and configurable, when it comes to large data sets (billions of documents) the following steps can be considered to improve query time.

Upgrading the Hardware

Hardware Requirements : A lot of computation takes place when indexes are created and when the indexed documents are queried, so it is better to run Solr instances on dedicated machines. The hardware requirements of a machine depend on the number of requests the server is expected to handle. For querying over 10 million documents and serving 1.2 million hits per day, we used a dedicated VM with a quad-core Intel CPU (2.8 GHz) and 12 GB of RAM.

Changing the Software Configurations

Increasing the Java Virtual Machine Heap Size : The Java Virtual Machine manages object life cycles in the heap area, and increasing the heap size gives the garbage collector more room to work, which generally results in better performance. This can be done by adding the following option to the Tomcat configuration file (catalina.sh):
 
JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx4096m -XX:MaxPermSize=1536m"

Use Solr's Caching Feature : Solr caches query result sets using a least-recently-used (LRU) eviction policy: when the cache is full and entries must be removed, the result sets that have not been served for the longest time are evicted first. More can be read at http://wiki.apache.org/solr/SolrCaching
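
The caches are configured per cache type in solrconfig.xml. A minimal sketch of the relevant section (the sizes and autowarm counts below are illustrative starting points, not tuned recommendations):

<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
</query>

The autowarmCount values control how many entries are copied from the old cache into the new one after a commit, which keeps frequent queries fast right after an index update.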

Using the Distributed Search Feature

Use a different Solr instance for auto-suggestions : Auto-suggest fires a huge number of requests for a given query string, and it is an add-on functionality. Moreover, auto-suggest fields generate many options for each input field value, which greatly increases the set of values under consideration when suggestions are generated for a query string. For the same reason, the field used for auto-suggestion should not be used for the general query string of a normal search over the indexed documents. It is therefore a good idea to run a separate Solr instance/stack for the auto-suggest functionality.

Use replication or shards : When there is a huge number of requests, it is a better idea to distribute them over a set of Solr instances that are replicas of each other. The master-slave replication model consists of one master that is used exclusively for index creation and updates; the same indexes are pulled by the slaves, which are dedicated to serving requests and are generally put behind a load balancer. This model separates reads from writes: indexes are created by the master only, while read requests are served by the slaves, which never perform write operations. This also prevents the indexes from being corrupted.
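
This master-slave setup is configured through the ReplicationHandler in solrconfig.xml. A minimal sketch, where the master host name is a placeholder. On the master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On each slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

Here the slaves poll the master every 60 seconds and pull a new copy of the index after each commit on the master.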

If there is a huge number of similar documents (say, information about millions of books), then instead of creating one huge index on a single machine, the index can be distributed over a set of Solr instances that are queried collectively. This setup is called sharding. Each shard then holds only a small subset of the values to be searched. (More can be read about distributed search at http://wiki.apache.org/solr/DistributedSearch.)
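
A sharded query simply lists the participating shards in the shards parameter; the instance that receives the request aggregates the partial results. A sketch with placeholder host names:

http://solr1:8983/solr/select?q=title:lucene&shards=solr1:8983/solr,solr2:8983/solr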

Tuning Search Queries

Query over minimum fields : The number of fields used in a query should be kept to a minimum, so that a smaller set of field values has to be considered at query time. Moreover, queries should be written following the short-circuit (minimal) evaluation strategy of Boolean algebra.
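
With the dismax query parser, for example, the set of searched fields is controlled by the qf parameter, so keeping it short directly limits the per-query work. The field names and boosts below are illustrative:

http://localhost:8983/solr/select?defType=dismax&q=lucene&qf=title^2.0+description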

Use filter queries : Filter queries (fq) specify a particular set of documents to be considered for the main query. They reduce the set of documents over which scores are calculated and work much like the WHERE clause in MySQL.
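
For example, the request below calculates scores only for documents matching both filters, and each fq clause is cached separately in the filter cache so it can be reused across queries (field names are illustrative):

http://localhost:8983/solr/select?q=solr+performance&fq=category:books&fq=price:[10+TO+100]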
