Search Smith

ColdFusion, SQL queries, and, of course, searching

Posts Tagged ‘SEO’

Oracle: SEO-Friendly URLs II

Posted by David Faber on March 15, 2013

I posted over a year ago about how one might use regular expressions in Oracle to return SEO-friendly URLs in queries. The query I came up with at the time looked something like this:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
     , REGEXP_REPLACE(REGEXP_REPLACE(LOWER(title), '[^-a-z0-9/ ]', ''), '[/ ]', '-') AS url_keyword
  FROM articles

Since then, I’ve been looking into regular expressions quite a bit. One difficulty with the expression above is that multiple characters will be replaced with multiple hyphens, plus other punctuation will be ignored — they will simply be replaced with nothing. Another, perhaps more obvious difficulty is its complexity! A better approach would be the following:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
     , REGEXP_REPLACE(LOWER(title), '[^a-z0-9]+', '-') AS url_keyword
  FROM articles

This simply replaces consecutive non-alphanumeric characters with a single hyphen. Since Oracle supports the POSIX character classes in regular expressions, the above could also be written:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
     , REGEXP_REPLACE(LOWER(title), '[^[:alnum:]]+', '-') AS url_keyword
  FROM articles

where [[:alnum:]] is the POSIX character class for alphanumeric characters and [^[:alnum:]] is its negation. A third option, if you know your data has no underscores or want them to be preserved for some reason (and not transformed into hyphens) would be the following:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
     , REGEXP_REPLACE(LOWER(title), '\W+', '-') AS url_keyword
  FROM articles

Posted in Oracle, SQL | Tagged: , , , , , | Leave a Comment »

Levenshtein Distance in ColdFusion

Posted by David Faber on March 14, 2013

It’s been a good long time since I’ve posted here — to be honest, apart from personal and family issues (new baby, moving, etc.) which took away a good chunk of the time that I had formerly set aside for blogging, there just haven’t been any “blog-worthy” technical issues that have come up.

We were tasked with the following: Given a particular document, search Solr with using the original document’s keywords to find unique related documents of another type. The related documents are already indexed, so it practically writes itself, right?

In the course of developing a solution for this, we discovered that the related documents had been copied many times, so we would need to filter for uniqueness (these are not large records to string comparison is not terribly expensive). However, and this is the kicker, not only had they been copied, but also their wording slightly tweaked — and these tweaks could be found anywhere in the copied document. It would not be as easy as comparing the first 100 characters, or even the first 30 characters; the changes could be found literally anywhere, but to the human eye the copied document would look practically identical to its source. For various reasons, we did not want to display these (yes, SEO was one of our concerns).

A quick Google search led me to the Levenshtein distance, or edit distance, between two strings. Further searching also turned up a couple of ColdFusion solutions: one given by Brad Wood (with whose blogging I had previously been unfamiliar) and another (CFLib) cited by Ray Camden. I am not at all eager to copy large blocks of code or to rely on external CF libraries (my first attempt at using such a library, CFSolr, turned out poorly — but that’s a topic for another time), and continued searching. It turns out that there is a method for computing the Levenshtein distance between two strings in the Apache Commons Java library — specifically, the StringUtils object in the Commons Lang library. This library appears to be available in ColdFusion 9 by default (perhaps because of its inclusion of Apache Solr?); I could not say whether it is also available in any other version of ColdFusion (7, 8, or even 10). However loading an external Java library into ColdFusion for usage by developers is not difficult, and in this case I think it is worth it.

Step 1: Create a StringUtils object

<cfset string_utils_obj = createObject("java", "org.apache.commons.lang.StringUtils") />

That’s all! There isn’t really a step 2. 😉

More seriously, step 2 in our case involved executing a Solr query via <CFHTTP>, looping over the results, comparing each result’s Levenshtein “ratio” (the ratio of the result’s Levenshtein distance to the length of the larger of the two compared strings) against all previous unique results, and storing the result if its minimum ratio was 25% or greater (lower = less difference).

Step 2

<cfloop array="#result_array#" index="current_result">
    <cfset min_levenshtein_ratio = 1 />
    <cfloop array="#doc_array#" index="current_doc">
        <cfset temp_levenshtein_distance = string_utils_obj.getLevenshteinDistance(the_result, current_doc) />

        <!--- Levenshtein ratio = Levenshtein distance / max(length of strings compared) --->
        <cfset temp_levenstein_ratio = temp_levenshtein_distance / max( len(the_result), len(current_doc) ) />
        <cfif temp_levenstein_ratio LT min_levenshtein_ratio>
            <cfset min_levenshtein_ratio = temp_levenstein_ratio />
        </cfif>
        <cfif min_levenshtein_ratio LT 0.25>
            <cfbreak />
        </cfif>
    </cfloop>
    <cfif min_levenshtein_ratio LT 0.25>
        <cfcontinue />
    </cfif>
    <!--- This is a unique result, let's save it! --->
    <cfset arrayAppend(doc_array, the_result) />
</cfloop>

Posted in ColdFusion, Solr | Tagged: , , , , , | Leave a Comment »

Oracle: Friendly URLs

Posted by David Faber on January 3, 2012

On the one hand, we want to avoid hitting the database if we don’t have to. On the other hand, if we can use built-in functions in our queries, this can be preferable to manipulating the data after it’s been returned from the database. For example, writing this:

SELECT emp_id, INITCAP(fname) AS fname, INITCAP(lname) AS lname
  FROM emp

is certainly* better than writing this:

SELECT emp_id, fname, lname
  FROM emp

and then using your development tool to capitalize the names after the data have already been returned (in CF it would look something like this: ucase(left(fname, 1)) & lcase(mid(fname, 2, len(fname) – 1)) – clunky!!).

*As long as your database queries need not be portable.

A better example still would be the following:

SELECT emp_id, COALESCE(dept_id, 0) AS dept_id
  FROM emp

instead of:

SELECT emp_id, dept_id
  FROM emp

With the latter I have to check to see if dept_id is non-null before displaying, while with the former I can simply display as-is.

What does this have to do with friendly URLs? Friendly as in SEO-friendly. The URL is very important in search engine optimization. Suppose we have a table of articles as in a previous post:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
  FROM articles

We would like to have the article’s title in the URL. The conventional wisdom is to use hyphens (-) as word separators. We also don’t want any weird characters that are going to be url-encoded. The solution (or, at least, “a” solution) is to use regular expressions to get rid of unusual characters and replace spaces with hyphens. And here Oracle is truly our friend:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
     , REGEXP_REPLACE(REGEXP_REPLACE(LOWER(title), '[^-a-z0-9/ ]', ''), '[/ ]', '-') AS url_keyword
  FROM articles

The inner REGEXP_REPLACE replaces every character that is not (a) alphanumeric, (b) a space, or (c) a hyphen or forward slash with a null string. The outer one replaces forward slashes and spaces with hyphens. What is left over is bunches of alphanumeric characters (hopefully, words) separated by hyphens – in other words, an SEO-friendly URL string.

Posted in ColdFusion, Oracle, SQL | Tagged: , , , , | 1 Comment »