Search Smith

ColdFusion, SQL queries, and, of course, searching

Solr: Showing faceted search stems in human-readable terms

Posted by David Faber on March 12, 2012

A fascinating question came up on Stack Overflow. Suppose you have a Solr core (collection for you ColdFusion peeps) and you want to return the most common terms found in the index. If you facet on a field that has stemming enabled, Solr will return the stems rather than the original terms, so you will see values like the following: associ, studi, signific, increas. That is generally not the sort of thing you want to show to your end users. However, if you use highlighting as well as faceting, fragments (snippets) from the matching fields are returned along with the search results and the facet results, and you can then examine those snippets for the matching terms in human-readable form. For example, suppose you run the following query:

?q=keyword&facet=true&facet.field=description&hl=true&hl.fl=description&hl.fragsize=0&hl.simple.pre=[&hl.simple.post=]

The matching terms will then be returned in the highlighting structure, wrapped in square brackets, and you can use regular expressions on those results to pull out the friendly matching terms. One caveat is that unless your index is very small, you will likely only be able to retrieve a sampling of the terms matching the stems. This is because highlighting covers only the documents (or records) returned by the query, so it is limited by the number of rows you request.
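As a rough illustration of the regular-expression step, here is a minimal Python sketch. It assumes the response was requested as JSON and that the highlighting section has already been parsed into a dictionary mapping document IDs to field names to lists of snippets. The function name, the sample documents, and the lowercasing are my own assumptions for the example, not anything Solr provides.

```python
import re
from collections import Counter

# Matches text wrapped in the [ and ] markers set via hl.simple.pre / hl.simple.post.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")

def extract_highlighted_terms(highlighting, field="description"):
    """Count the human-readable terms found between the highlight markers."""
    counts = Counter()
    for doc_fields in highlighting.values():
        for snippet in doc_fields.get(field, []):
            for term in BRACKETED.findall(snippet):
                counts[term.lower()] += 1
    return counts

# Hypothetical highlighting section, shaped like the parsed JSON response.
sample_highlighting = {
    "doc1": {"description": ["The study was [associated] with better outcomes."]},
    "doc2": {"description": ["Trade [associations] and an [associated] group."]},
}

terms = extract_highlighted_terms(sample_highlighting)
print(terms.most_common())
```

Note that if the indexed text itself contains square brackets, this pattern will pick those up too, so a less common pair of markers may be a safer choice.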

Update: Well, it appears that I was trying to do too much here. The approach above won’t work as written, because it can’t be done in a single query. Instead, first use faceting to get the top indexed terms:

?q=*:*&facet=true&facet.field=description&facet.limit=20&rows=20

This will return the top 20 indexed terms (the number is set by the facet.limit parameter), which will be stems. You can then query Solr once per stem with highlighting enabled, substituting each stem for stem below, to retrieve the words that actually match it:

?q=stem&hl=true&hl.fl=description&hl.fragsize=0&hl.simple.pre=[&hl.simple.post=]&rows=20

Twenty rows should be a good number to find a decent sampling of matched terms.
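The two-step approach might be wired together along these lines. This is only a sketch under some assumptions: the facet response is JSON in Solr's default flat layout (a list alternating term, count, term, count), the sample counts are made up, and the helper names and the added wt=json parameter are mine. It builds the highlighting query strings but does not send them anywhere.

```python
from urllib.parse import urlencode

def parse_facet_field(response, field="description"):
    """Turn the flat [term, count, term, count, ...] list into (term, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

def build_highlight_query(stem, field="description", rows=20):
    """Build the URL-encoded query string for the highlighting request for one stem."""
    params = {
        "q": stem,
        "hl": "true",
        "hl.fl": field,
        "hl.fragsize": 0,
        "hl.simple.pre": "[",
        "hl.simple.post": "]",
        "rows": rows,
        "wt": "json",  # assumption: ask for JSON so the result is easy to parse
    }
    return urlencode(params)

# Hypothetical facet response with invented counts, for illustration only.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "description": ["associ", 42, "studi", 37, "signific", 21, "increas", 18]
        }
    }
}

top_stems = parse_facet_field(sample_response)
queries = [build_highlight_query(stem) for stem, _count in top_stems]
print(queries[0])
```

Encoding the parameters this way also takes care of the square brackets, which would otherwise need to be escaped by hand in the URL.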
