Using WordNet with Websolr
WordNet is a huge lexical database that collects and orders English words into groups of synonyms. It can offer major improvements in relevancy, but many use cases do not need it at all. Make sure you understand the trade-offs (discussed below) before setting it up.
What Are Synonyms in Solr?
Let’s say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say that your user searches for “bowtie pasta.” You may have a product called “Funky Farfalle” which is related to their search term but which would not be returned in the results because the title has “farfalle” instead of “bowtie pasta”. How do you address this issue?
Solr has a mechanism for defining custom synonyms, through the SynonymFilterFactory. This lets search administrators define groups of related terms and even corrections to commonly misspelled terms. A typical synonyms.txt file might look like this:
i-pod, i pod => ipod
feline,kitten,cat,kitty
This is great for solving the proximate issue, but it can get extremely tedious to define every group of related words in your index.
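To make the two rule styles in that file concrete, here is a minimal Python sketch that reads Solr-style synonym rules into input/output term lists. It is not Solr's own parser; the function name `parse_synonym_rules` is hypothetical, and it assumes the usual behavior where a comma-separated group with no `=>` means every term in the group maps to every other term.

```python
def parse_synonym_rules(text):
    """Parse Solr-style synonym rules into (inputs, outputs) pairs.

    Illustrative only: ignores escaping and other edge cases that
    Solr's real parser handles.
    """
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comments
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by terms on the right
            left, right = line.split("=>", 1)
            inputs = [t.strip() for t in left.split(",") if t.strip()]
            outputs = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent group: every term stands for the whole group
            inputs = [t.strip() for t in line.split(",") if t.strip()]
            outputs = list(inputs)
        rules.append((inputs, outputs))
    return rules


SAMPLE = """\
i-pod, i pod => ipod
feline,kitten,cat,kitty
"""

rules = parse_synonym_rules(SAMPLE)
for inputs, outputs in rules:
    print(inputs, "->", outputs)
```

The first rule collapses two spellings into one canonical term, while the second treats all four words as interchangeable. Every such group has to be written by hand, which is where the tedium comes from.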
How Does WordNet Improve Synonyms?
WordNet is essentially a text database which places English words into synsets - groups of synonyms - and can be considered as something of a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:
Let’s break it down:
This line expresses that the word ‘kitty’ is a noun, and the first word in synset 102122298 (which includes other terms like “kitty-cat,” “pussycat,” and so on). The line also indicates ‘kitty’ is the fourth most commonly used term according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the documentation.
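As a rough illustration of that structure, the following Python sketch parses one entry of the form used in WordNet's Prolog synonym file, `s(synset_id, w_num, 'word', ss_type, sense_number, tag_count).` The sample line is reconstructed from the description above and is an assumption: in particular, its final field (the tag count) is a placeholder value, and the names `SenseFact` and `parse_s_fact` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SenseFact(NamedTuple):
    synset_id: int     # identifies the synset the word belongs to
    w_num: int         # position of the word within the synset
    word: str
    ss_type: str       # n, v, a, s or r (noun, verb, adjective, satellite, adverb)
    sense_number: int  # frequency rank taken from semantic concordance texts
    tag_count: int


S_FACT = re.compile(r"^s\((\d+),(\d+),'(.*)',([nvasr]),(\d+),(\d+)\)\.$")


def parse_s_fact(line: str) -> Optional[SenseFact]:
    """Return the parsed fact, or None if the line is not an s(...) fact."""
    match = S_FACT.match(line.strip())
    if match is None:
        return None
    synset_id, w_num, word, ss_type, sense_number, tag_count = match.groups()
    # Prolog escapes a single quote inside a quoted atom by doubling it
    word = word.replace("''", "'")
    return SenseFact(int(synset_id), int(w_num), word, ss_type,
                     int(sense_number), int(tag_count))


# Assumed example entry; the trailing 0 is a placeholder, not real data
example = "s(102122298,1,'kitty',n,4,0)."
fact = parse_s_fact(example)
print(fact)
```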
WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.
Why Wouldn’t Everyone Want WordNet?
Relevancy tuning can be a deeply complex subject, and WordNet – especially when the complete file is used – has trade-offs, just like any other strategy. Synonym expansion is tricky to get right and can result in unexpected sorting, lower performance and more disk use. WordNet can introduce all of these issues with varying severity.
When synonyms are expanded at index time, Solr uses WordNet to generate all tokens related to a given token, and writes everything out to disk. This has several consequences: slower indexing speed, higher load during indexing, and significantly more disk use. Larger index sizes often correspond to memory issues as well.
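The growth in index size is easy to see with a toy model. The Python sketch below is not how Solr implements expansion internally; it simply replaces each token with its whole synonym group, using hypothetical helper names, to show how the number of stored tokens climbs.

```python
def build_synonym_map(groups):
    """Map every term to the full group of terms it belongs to."""
    mapping = {}
    for group in groups:
        for term in group:
            mapping[term] = list(group)
    return mapping


def expand_tokens(tokens, mapping):
    """Replace each token with its synonym group (or itself if it has none)."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token, [token]))
    return expanded


groups = [["feline", "kitten", "cat", "kitty"]]
mapping = build_synonym_map(groups)

original = "my cat likes farfalle".split()
indexed = expand_tokens(original, mapping)

print(len(original), "tokens before expansion")
print(len(indexed), "tokens after expansion")
```

Here a single four-word group turns 4 tokens into 7. With the full WordNet list, many more tokens in each document would match a group, so the effect compounds across the whole index.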
There is also the problem of updating. If you ever want to change your synonym list, you’ll need to reindex everything from scratch. And WordNet includes multi-term synonyms in its database, which can break phrase queries.
Expanding synonyms at query time resolves some of those issues, but introduces others. Namely, performing expansion and matching at query time adds overhead to your queries in terms of server load and latency. And it still doesn’t really address the problem of multi-word synonyms.
There are some really great examples of what this means here (this is documentation for Elasticsearch, but the same principles are true for Solr). The takeaway is that WordNet is not a panacea for relevancy tuning, and it may introduce unexpected results unless you’re doing a lot of preprocessing or additional configuration.
tl;dr: Do not simply assume that chucking a massive synset collection at your index will make it faster with more relevant results.
How Do I Use the WordNet list with Websolr?
Websolr allows users to maintain their own synonyms list for each index via the dashboard. Unfortunately, the raw WordNet synonyms are maintained in a Prolog file called wn_s.pl, the format of which is not compatible with Solr’s SynonymFilterFactory. Recall the format of a Solr synonym file:
i-pod, i pod => ipod
feline,kitten,cat,kitty
Contrast this with the standard format of WordNet lists:
In short, the raw WordNet synonyms list cannot simply be copied and pasted into Websolr.
It is possible to generate a flat file compatible with Solr’s synonym handling, but it involves compiling a Java program to read the source file and produce a Solr-compatible file. Users have reported some success with this approach, but it is not officially supported. If this is something you’re comfortable doing, then we encourage you to try it out.
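To give a sense of what such a conversion involves, here is a minimal Python sketch of the same idea. It is not the Java program mentioned above and is equally unsupported. It assumes entries of the form `s(synset_id, w_num, 'word', ss_type, sense_number, tag_count).`, groups words by synset ID, and writes each group as one comma-separated line. The function name, the choice to skip multi-word terms, and every sample entry (including all numeric fields) are made up for illustration.

```python
import re

# Captures the synset ID and the quoted word from an s(...) fact
S_FACT = re.compile(r"^s\((\d+),\d+,'(.*)',[nvasr],\d+,\d+\)\.$")


def wordnet_to_solr_lines(prolog_lines, skip_multiword=True):
    """Group words by synset and return Solr-style synonym lines."""
    synsets = {}
    for line in prolog_lines:
        match = S_FACT.match(line.strip())
        if match is None:
            continue  # not an s(...) fact
        synset_id, word = match.groups()
        # Undo Prolog's doubled-quote escaping and normalize case
        word = word.replace("''", "'").lower()
        # Commas and '=>' have special meaning in a Solr synonym file
        if "," in word or "=>" in word:
            continue
        # Multi-word synonyms can break phrase queries, so drop them by default
        if skip_multiword and " " in word:
            continue
        words = synsets.setdefault(synset_id, [])
        if word not in words:
            words.append(word)
    # A synset with only one remaining word carries no synonym information
    return [",".join(words) for words in synsets.values() if len(words) > 1]


# Made-up sample entries, for illustration only
sample = [
    "s(102122298,1,'kitty',n,4,0).",
    "s(102122298,2,'kitty-cat',n,1,0).",
    "s(102122298,3,'pussycat',n,1,0).",
    "s(100000001,1,'solo',n,1,0).",
    "s(100000002,1,'bow tie',n,1,0).",
    "s(100000002,2,'bowtie',n,1,0).",
]

solr_lines = wordnet_to_solr_lines(sample)
print("\n".join(solr_lines))
```

Against a real copy of wn_s.pl you would pass the open file object instead of the sample list, then review the output before adding it to your synonyms list in the dashboard. Expect the result to be very large, with all of the trade-offs described earlier.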
If generating a flat file seems tedious and overly complex, then you may want to consider forgoing WordNet expansion and simply creating your own synonyms list based on the common tokens in your index. For a majority of purposes, WordNet expansion will only yield marginal gains in relevance, whereas a more narrowly defined set of synonyms based on your corpus will have a far greater impact.
WordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading: