Search Smith

ColdFusion, SQL queries, and, of course, searching

Levenshtein Distance in ColdFusion

Posted by David Faber on March 14, 2013

It’s been a good long time since I’ve posted here — to be honest, apart from personal and family issues (new baby, moving, etc.) which took away a good chunk of the time that I had formerly set aside for blogging, there just haven’t been any “blog-worthy” technical issues that have come up.

We were tasked with the following: Given a particular document, search Solr with using the original document’s keywords to find unique related documents of another type. The related documents are already indexed, so it practically writes itself, right?

In the course of developing a solution for this, we discovered that the related documents had been copied many times, so we would need to filter for uniqueness (these are not large records to string comparison is not terribly expensive). However, and this is the kicker, not only had they been copied, but also their wording slightly tweaked — and these tweaks could be found anywhere in the copied document. It would not be as easy as comparing the first 100 characters, or even the first 30 characters; the changes could be found literally anywhere, but to the human eye the copied document would look practically identical to its source. For various reasons, we did not want to display these (yes, SEO was one of our concerns).

A quick Google search led me to the Levenshtein distance, or edit distance, between two strings. Further searching also turned up a couple of ColdFusion solutions: one given by Brad Wood (with whose blogging I had previously been unfamiliar) and another (CFLib) cited by Ray Camden. I am not at all eager to copy large blocks of code or to rely on external CF libraries (my first attempt at using such a library, CFSolr, turned out poorly — but that’s a topic for another time), and continued searching. It turns out that there is a method for computing the Levenshtein distance between two strings in the Apache Commons Java library — specifically, the StringUtils object in the Commons Lang library. This library appears to be available in ColdFusion 9 by default (perhaps because of its inclusion of Apache Solr?); I could not say whether it is also available in any other version of ColdFusion (7, 8, or even 10). However loading an external Java library into ColdFusion for usage by developers is not difficult, and in this case I think it is worth it.

Step 1: Create a StringUtils object

<cfset string_utils_obj = createObject("java", "org.apache.commons.lang.StringUtils") />

That’s all! There isn’t really a step 2. 😉

More seriously, step 2 in our case involved executing a Solr query via <CFHTTP>, looping over the results, comparing each result’s Levenshtein “ratio” (the ratio of the result’s Levenshtein distance to the length of the larger of the two compared strings) against all previous unique results, and storing the result if its minimum ratio was 25% or greater (lower = less difference).

Step 2

<cfloop array="#result_array#" index="current_result">
    <cfset min_levenshtein_ratio = 1 />
    <cfloop array="#doc_array#" index="current_doc">
        <cfset temp_levenshtein_distance = string_utils_obj.getLevenshteinDistance(the_result, current_doc) />

        <!--- Levenshtein ratio = Levenshtein distance / max(length of strings compared) --->
        <cfset temp_levenstein_ratio = temp_levenshtein_distance / max( len(the_result), len(current_doc) ) />
        <cfif temp_levenstein_ratio LT min_levenshtein_ratio>
            <cfset min_levenshtein_ratio = temp_levenstein_ratio />
        <cfif min_levenshtein_ratio LT 0.25>
            <cfbreak />
    <cfif min_levenshtein_ratio LT 0.25>
        <cfcontinue />
    <!--- This is a unique result, let's save it! --->
    <cfset arrayAppend(doc_array, the_result) />

Posted in ColdFusion, Solr | Tagged: , , , , , | Leave a Comment »

SHA-256 Hash in Oracle 10g

Posted by David Faber on August 24, 2012

It’s been a while since I’ve posted, and I’m not certain when I will get back to posting regularly (sorry!), but I came across an issue yesterday and found a blog post that was too good not to bring to others’ attention. I was looking to hash a column in Oracle with the SHA-2 256-bit hash function but that is not available in Oracle 10g.

The key to accomplishing this is found in the ability to load Java classes into Oracle and create Oracle functions based on the methods of those classes.

vnull – SHA1, SHA256, SHA512 in Oracle for free without using DBMS_CRYPTO

Incidentally, according to this post on the Oracle forums, the DBMS_CRYPTO package is certainly included in all editions of Oracle 11g, not just Enterprise Edition. There is nothing in the Oracle licensing information that would suggest that DBMS_CRYPTO is not available on Standard Edition, Standard Edition One, etc.

Having these encryption algorithms available within the Oracle DBMS is certainly helpful as it allows data to be encrypted or hashed via database triggers and the like.

To test your SHA-256 hash results, try the following link: SHA-256 Cryptographic Hash Algorithm.

If SHA-256 is not strong enough for you, SHA-512 is also included in the GNU Crypto project.

Posted in Oracle, SQL | Tagged: , , , , , , , , , , | 2 Comments »

1000 Page Views

Posted by David Faber on April 3, 2012

Search Smith hit 1,000 page views this morning. So far the site has had over 500 visitors and 400 unique visitors. Not a big deal for the three-plus months the site has been in operation, but perhaps more impressive is the fact that we’ve gotten over 500 page views and 200 unique visitors since the last time I reported on site statistics (less than one month ago).

I will return to semi-regular posting at some point soon after Easter.

Posted in Off-Topic | Leave a Comment »