HTTP 503: Service Unavailable

Occasionally you may see a “503 Service Unavailable” error in your logs. These errors are intermittent and represent less than 0.05% of all requests to Solr. They can have a number of causes.

One relatively mundane cause is normal system maintenance, during which we may make an index read-only for roughly 1–2 minutes to restart an instance of Solr. This can happen once or twice a month, and should not affect searches.

Other causes are trickier to pin down. They usually stem from a combination of unpredictable factors such as network packet loss and JVM garbage collection pauses.

We have a few recommendations to harden your application in the event of these errors:

Upgrade your index to a more recent version of Solr. Some users are running on older versions of Solr, which can contribute to these kinds of 503 errors. We strongly recommend that any Solr 3.x index which experiences problems be replaced with a newer index on Solr 4.x.
Retry your requests. A 503 error in our systems is almost always intermittent, so the request may be retried immediately, or multiple times with an exponential backoff. In particular, we recommend that incremental updates be processed in a queue, which lends itself to easier automatic retries.
Upgrade to a Business class subscription. Having dedicated resources available can provide more consistency by mitigating some classes of Solr memory management issues experienced in multitenant shared clusters.
Report the problem. If your application has experienced a high rate of 503s sustained for more than a few minutes, and we haven’t announced a larger outage on @websolrstatus, it may indicate a larger problem that we need to know about. Email your index URL and some example requests to support@websolr.com (please don't tweet your URL to us!).

Implementing retries

The implementation of retries will vary based on the platform and Solr client. As an example, recent versions of Sunspot include an optional session proxy which can automatically retry these kinds of errors. You could add something like this to a Rails initializer:

Sunspot.session = Sunspot::SessionProxy::Retry5xxSessionProxy.new(Sunspot.session)
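For clients without a built-in retry proxy, the "retry with exponential backoff" recommendation could be sketched roughly as follows. This is a minimal illustration, not part of Sunspot or any Solr client: the `with_retries` helper, its parameters, and the `ServiceUnavailable` error class are all hypothetical names. The exception your client actually raises for a 503 will differ, so substitute it in `retry_on`.

```ruby
# Hypothetical stand-in for whatever error your Solr client raises on a 503.
class ServiceUnavailable < StandardError; end

# Hypothetical helper: runs the block, and if it raises one of the errors in
# retry_on, waits base_delay * 2**(attempt - 1) seconds and tries again.
# After max_attempts failed attempts the last error is re-raised.
# The sleeper argument exists only so the waiting can be replaced in tests.
def with_retries(max_attempts: 4, base_delay: 0.5,
                 retry_on: [StandardError], sleeper: ->(secs) { sleep(secs) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulated request that fails twice with a 503 and then succeeds.
calls = 0
result = with_retries(retry_on: [ServiceUnavailable], base_delay: 0.01) do
  calls += 1
  raise ServiceUnavailable, "503 Service Unavailable" if calls < 3
  :ok
end
puts "succeeded after #{calls} attempts: #{result.inspect}"
```

Capping the number of attempts keeps a sustained outage from tying up your application indefinitely. Once the attempts are exhausted, the error propagates so that it can be logged or reported.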

Queued updates

User activity that gradually creates or updates single records over time should have its index updates queued with a system such as Resque. That way, temporary errors such as a 503 are isolated from the everyday operation of the rest of your application, and failed jobs can be more easily retried.
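As a rough sketch of what a queued update might look like: a Resque job is a plain Ruby class with a `@queue` and a `self.perform` method. The class name `SolrIndexJob`, the queue name, and the `indexer` hook below are assumptions made for illustration. In a Sunspot application the hook would typically load the record and pass it to Sunspot for indexing.

```ruby
# Hypothetical job class. Resque calls SolrIndexJob.perform with the
# arguments that were given when the job was enqueued.
class SolrIndexJob
  @queue = :solr_index # hypothetical queue name

  class << self
    # Hypothetical hook: a callable that indexes a single record, given its
    # class name and id. Configure it once at boot, for example:
    #   SolrIndexJob.indexer = ->(class_name, id) {
    #     Sunspot.index(Object.const_get(class_name).find(id))
    #   }
    attr_accessor :indexer
  end

  def self.perform(class_name, id)
    raise "SolrIndexJob.indexer is not configured" unless indexer
    # Any exception raised here (such as a 503 from Solr) marks the job as
    # failed in Resque, leaving the rest of the application unaffected.
    indexer.call(class_name, id)
  end
end
```

With Resque loaded, the update would then be enqueued from your application with something like `Resque.enqueue(SolrIndexJob, "Post", post.id)` rather than indexing inline. The arguments are kept to a class name and an id because Resque serializes job arguments as JSON.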