Search Smith

ColdFusion, SQL queries, and, of course, searching

Posts Tagged ‘Solr’

More Statistics

Posted by David Faber on March 15, 2013

Odd that my lack of blogging does not seem to have affected traffic on this site! I posted in April of last year that the site had reached 1,000 page views from 400 unique visitors; that was over approximately 3 months of operation. In the (almost a full) year since then, there have been over 7,000 page views from 3,400-plus unique visitors, despite the fact that I have written exactly two blog posts in that time span (one of them yesterday).

Unsurprisingly, my most popular post — with 2,000 page views! — is still my ColdFusion Solr Tutorial, written in February 2012. Hopefully it is still relevant even with the release of ColdFusion 10. On the surprising side, my second most popular post is my earlier post about SHA hashing in Oracle 10g. I wish I could take even some credit for the content of that post, but all I really did was bring the word of another developer (Jakub Wartak) to the attention of this blog’s readers. I wish I could say that my posts on SQL have been as warmly received!

Posted in ColdFusion, Miscellany, Oracle, Solr

Levenshtein Distance in ColdFusion

Posted by David Faber on March 14, 2013

It’s been a good long time since I’ve posted here — to be honest, apart from personal and family issues (new baby, moving, etc.) which took away a good chunk of the time that I had formerly set aside for blogging, there just haven’t been any “blog-worthy” technical issues that have come up.

We were tasked with the following: Given a particular document, search Solr using the original document’s keywords to find unique related documents of another type. The related documents are already indexed, so it practically writes itself, right?

In the course of developing a solution for this, we discovered that the related documents had been copied many times, so we would need to filter for uniqueness (these are not large records, so string comparison is not terribly expensive). However, and this is the kicker, not only had they been copied, but their wording had also been slightly tweaked — and these tweaks could be found anywhere in the copied document. It would not be as easy as comparing the first 100 characters, or even the first 30 characters; the changes could be found literally anywhere, yet to the human eye the copied document would look practically identical to its source. For various reasons, we did not want to display these near-duplicates (yes, SEO was one of our concerns).

A quick Google search led me to the Levenshtein distance, or edit distance, between two strings. Further searching also turned up a couple of ColdFusion solutions: one given by Brad Wood (with whose blogging I had previously been unfamiliar) and another (CFLib) cited by Ray Camden. I am not at all eager to copy large blocks of code or to rely on external CF libraries (my first attempt at using such a library, CFSolr, turned out poorly — but that’s a topic for another time), so I continued searching. It turns out that there is a method for computing the Levenshtein distance between two strings in the Apache Commons Java library — specifically, in the StringUtils class of the Commons Lang library. This library appears to be available in ColdFusion 9 by default (perhaps because ColdFusion 9 bundles Apache Solr?); I could not say whether it is also available in any other version of ColdFusion (7, 8, or even 10). However, loading an external Java library into ColdFusion for use by developers is not difficult, and in this case I think it is worth it.

Step 1: Create a StringUtils object

<cfset string_utils_obj = createObject("java", "org.apache.commons.lang.StringUtils") />

That’s all! There isn’t really a step 2. ;-)
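As a quick sanity check (test_distance below is just an illustrative variable name), the classic “kitten”/“sitting” pair should come back with an edit distance of 3:

<!--- kitten -> sitten -> sittin -> sitting: three edits --->
<cfset test_distance = string_utils_obj.getLevenshteinDistance("kitten", "sitting") />
<cfoutput>#test_distance#</cfoutput> <!--- should print 3 --->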

More seriously, step 2 in our case involved executing a Solr query via <CFHTTP>, looping over the results, computing the Levenshtein “ratio” (the Levenshtein distance divided by the length of the longer of the two compared strings) between each result and every previously saved unique result, and storing the result only if its minimum ratio was 25% or greater (a lower ratio means less difference, i.e., a closer match).

Step 2

<cfloop array="#result_array#" index="current_result">
    <cfset min_levenshtein_ratio = 1 />
    <cfloop array="#doc_array#" index="current_doc">
        <cfset temp_levenshtein_distance = string_utils_obj.getLevenshteinDistance(the_result, current_doc) />

        <!--- Levenshtein ratio = Levenshtein distance / max(length of strings compared) --->
        <cfset temp_levenstein_ratio = temp_levenshtein_distance / max( len(the_result), len(current_doc) ) />
        <cfif temp_levenstein_ratio LT min_levenshtein_ratio>
            <cfset min_levenshtein_ratio = temp_levenstein_ratio />
        </cfif>
        <cfif min_levenshtein_ratio LT 0.25>
            <cfbreak />
        </cfif>
    </cfloop>
    <cfif min_levenshtein_ratio LT 0.25>
        <cfcontinue />
    </cfif>
    <!--- This is a unique result, let's save it! --->
    <cfset arrayAppend(doc_array, the_result) />
</cfloop>
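For reference, the <CFHTTP> call that populates result_array is roughly sketched below; the Solr URL, core name, the keyword_list variable, and the description field are placeholders rather than the actual values we used:

<!--- keyword_list is a hypothetical variable holding the source document's keywords --->
<cfhttp url="http://localhost:8983/solr/related_docs/select" method="get" result="solr_response">
    <cfhttpparam type="url" name="q" value="#keyword_list#" />
    <cfhttpparam type="url" name="fl" value="description" />
    <cfhttpparam type="url" name="rows" value="100" />
    <cfhttpparam type="url" name="wt" value="json" />
</cfhttp>

<!--- Build result_array from the description field of each returned document --->
<cfset solr_data = deserializeJSON(solr_response.fileContent) />
<cfset result_array = [] />
<cfloop array="#solr_data.response.docs#" index="solr_doc">
    <cfset arrayAppend(result_array, solr_doc.description) />
</cfloop>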

Posted in ColdFusion, Solr

Solr: Showing faceted search stems in human-readable terms

Posted by David Faber on March 12, 2012

A fascinating question came up on StackOverflow. Suppose you have a Solr core (collection for you ColdFusion peeps) and you want to return the most common terms found in the index. If you facet on a field that has stemming enabled, Solr will return the stems rather than the full matching terms, so you will see entries like the following: associ, studi, signific, increas – generally not the sort of thing you want to show to your end users. However, if you use highlighting as well as faceting, fragments or snippets from the matching fields will be returned along with the search results (and along with the facet results), and you can then examine those snippets for the matching terms in human-readable form. For example, if you do the following –

?q=keyword&facet=true&facet.field=description&hl=true&hl.fl=description&hl.fragsize=0&hl.simple.pre=[&hl.simple.post=]

– then the matching terms will be returned in the highlighting structure wrapped in square brackets. You can then examine those results with a regular expression to pull out the friendly matching terms. One caveat is that unless your index is very small, you will likely only be able to retrieve a sampling of the terms matching each stem. The reason for this is that highlighting returns snippets only for the documents (or records) actually returned by the query, so what you get depends on the number of rows specified.
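As a rough sketch of the regular-expression step (the snippet below is made-up sample data, and the variable names are just for illustration), pulling the bracketed terms out of a highlighting snippet might look like this in ColdFusion:

<!--- A made-up highlighting snippet, with matches wrapped by hl.simple.pre/hl.simple.post --->
<cfset snippet = "Several [studies] have shown a [significant] [increase] in traffic" />

<!--- Grab every [bracketed] match, then strip the brackets to get the human-readable terms --->
<cfset bracketed_terms = reMatch("\[[^\]]+\]", snippet) />
<cfset friendly_terms = [] />
<cfloop array="#bracketed_terms#" index="term">
    <cfset arrayAppend(friendly_terms, reReplace(term, "[\[\]]", "", "all")) />
</cfloop>
<!--- friendly_terms now contains: studies, significant, increase --->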

Update: Well, it appears that I was trying to do too much here. This won’t work as written. It can’t be done in a single query. Rather, what you would need to do is to use faceting to get the top indexed terms:

?q=*:*&facet=true&facet.field=description&facet.limit=20&rows=20

This will return the top 20 (parameter facet.limit) indexed terms. You can then query Solr with highlighting to retrieve the terms that actually match the stemmed terms:

?q=stem&hl=true&hl.fl=description&hl.fragsize=0&hl.simple.pre=[&hl.simple.post=]&rows=20

Twenty rows should be a good number to find a decent sampling of matched terms.
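Put together in ColdFusion, a rough sketch of the two queries might look like the following; the Solr URL, core name, and the description field are placeholders, and the parsing assumes Solr’s default flat term/count layout for facet fields:

<!--- Step 1: facet on the description field to get the top 20 indexed (stemmed) terms --->
<cfhttp url="http://localhost:8983/solr/mycore/select" method="get" result="facet_response">
    <cfhttpparam type="url" name="q" value="*:*" />
    <cfhttpparam type="url" name="facet" value="true" />
    <cfhttpparam type="url" name="facet.field" value="description" />
    <cfhttpparam type="url" name="facet.limit" value="20" />
    <cfhttpparam type="url" name="rows" value="20" />
    <cfhttpparam type="url" name="wt" value="json" />
</cfhttp>
<cfset facet_data = deserializeJSON(facet_response.fileContent) />

<!--- Solr returns facet counts as a flat array: term, count, term, count, ... --->
<cfset facet_list = facet_data.facet_counts.facet_fields.description />
<cfloop from="1" to="#arrayLen(facet_list)#" step="2" index="i">
    <cfset stem = facet_list[i] />

    <!--- Step 2: query each stem with highlighting to recover the human-readable terms --->
    <cfhttp url="http://localhost:8983/solr/mycore/select" method="get" result="hl_response">
        <cfhttpparam type="url" name="q" value="#stem#" />
        <cfhttpparam type="url" name="hl" value="true" />
        <cfhttpparam type="url" name="hl.fl" value="description" />
        <cfhttpparam type="url" name="hl.fragsize" value="0" />
        <cfhttpparam type="url" name="hl.simple.pre" value="[" />
        <cfhttpparam type="url" name="hl.simple.post" value="]" />
        <cfhttpparam type="url" name="rows" value="20" />
        <cfhttpparam type="url" name="wt" value="json" />
    </cfhttp>
    <!--- Parse hl_response.fileContent for the [bracketed] terms as shown earlier --->
</cfloop>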

Posted in Solr