Improve Indexing Speed

The biggest bottlenecks to indexing speed are typically on the application’s side. The largest is simply fetching data from the database, followed by serializing that data into XML. Posting the data over the network to Solr adds little overhead if your application runs in the same EC2 region as our servers, but it can become a factor if it does not.
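To check which stage dominates in your own application, you can time each stage separately. The following is a minimal sketch using Ruby's standard Benchmark module. The time_stages helper and the three lambdas are hypothetical placeholders (the sleep calls stand in for real work), so substitute your own fetch, serialize, and post steps:

```ruby
require "benchmark"

# Hypothetical helper: runs each named stage once and records its wall-clock time.
def time_stages(stages)
  stages.each_with_object({}) do |(name, work), timings|
    timings[name] = Benchmark.realtime { work.call }
  end
end

# Placeholder stages; the sleep calls only simulate work.
timings = time_stages(
  :fetch     => lambda { sleep 0.03 },  # e.g. loading a batch of records
  :serialize => lambda { sleep 0.02 },  # e.g. building the XML update document
  :post      => lambda { sleep 0.01 }   # e.g. the HTTP request to Solr
)

# Print the slowest stage first.
timings.sort_by { |_, seconds| -seconds }.each do |name, seconds|
  puts format("%-10s %.3fs", name, seconds)
end
```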

For the database bottlenecks, the worst offenders typically involve indexing associated objects stored in other tables. Consider writing your own reindexing task that uses joins for eager loading, so that you avoid making multiple trips to the database per record. Also make sure that you have the correct database indices set up for the relevant foreign key joins.
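The cost of per-record lookups can be illustrated without a real database. The sketch below is plain Ruby, not Rails: FakeDatabase is a hypothetical stand-in that only counts round trips, and the sample posts and authors are invented for illustration. It compares looking up each post's author individually with fetching every needed author in one batched lookup:

```ruby
# Hypothetical stand-in for a database; it only counts round trips.
class FakeDatabase
  attr_reader :round_trips

  def initialize(authors_by_id)
    @authors_by_id = authors_by_id
    @round_trips = 0
  end

  # One round trip, however many ids are requested.
  def find_authors(ids)
    @round_trips += 1
    @authors_by_id.select { |id, _| ids.include?(id) }
  end
end

posts = [
  { :title => "First",  :author_id => 1 },
  { :title => "Second", :author_id => 2 },
  { :title => "Third",  :author_id => 1 }
]
authors = { 1 => "Ada", 2 => "Grace" }

# Lazy loading: one author lookup per post.
lazy_db = FakeDatabase.new(authors)
lazy_docs = posts.map do |post|
  author = lazy_db.find_authors([post[:author_id]])[post[:author_id]]
  { :title => post[:title], :author => author }
end

# Eager loading: one lookup for every author the batch needs.
eager_db = FakeDatabase.new(authors)
loaded = eager_db.find_authors(posts.map { |post| post[:author_id] }.uniq)
eager_docs = posts.map do |post|
  { :title => post[:title], :author => loaded[post[:author_id]] }
end

puts "lazy: #{lazy_db.round_trips} round trips, eager: #{eager_db.round_trips}"
```

Both approaches build the same documents, but the lazy version makes one round trip per post while the eager version makes one per batch. The gap widens as the number of records grows.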

For Ruby on Rails applications, something like the following approximates Sunspot’s rake sunspot:reindex task and can be tuned for your application:

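# find_in_batches yields arrays of records; find_each would yield one record at a time
Post.includes(:author).find_in_batches(:batch_size => 10) do |posts|
  # Index the whole batch with one call
  Sunspot.index(posts)
end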

If you’re using a background job processor like DelayedJob or Resque, it helps to queue up individual indexing jobs and run multiple workers to index in parallel. This also speeds up your application, because your users no longer have to wait for Solr updates during normal usage.

With DelayedJob, a simple approach could look like this:

class Post < ActiveRecord::Base
  searchable do
    # …
  end
  # Run solr_index in a DelayedJob worker instead of inline
  handle_asynchronously :solr_index
end
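
With Resque, a comparable approach is a small job class that indexes one record per job. This is only a sketch: the SolrIndexJob class name and the :solr_index queue name are hypothetical, and it assumes a Post model with Sunspot set up as above:

```ruby
# Hypothetical Resque job; the class and queue names are illustrative.
class SolrIndexJob
  @queue = :solr_index

  # Resque workers call this with the arguments given to Resque.enqueue.
  def self.perform(post_id)
    post = Post.find(post_id)
    Sunspot.index(post)
  end
end
```

A job would then be enqueued with Resque.enqueue(SolrIndexJob, post.id), for example from an after_save callback. In that case you would likely also want to turn off Sunspot's automatic indexing for the model (searchable :auto_index => false) so that records are not indexed twice.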

Anecdotally, we know of setups using dozens of background job processors to reindex many millions of records in an hour or two.