Search Smith

ColdFusion, SQL queries, and, of course, searching

Posts Tagged ‘SQL’

Oracle: Friendly URLs

Posted by David Faber on January 3, 2012

On the one hand, we want to avoid hitting the database if we don’t have to. On the other hand, if we can use built-in functions in our queries, this can be preferable to manipulating the data after it’s been returned from the database. For example, writing this:

SELECT emp_id, INITCAP(fname) AS fname, INITCAP(lname) AS lname
  FROM emp

is certainly* better than writing this:

SELECT emp_id, fname, lname
  FROM emp

and then using your development tool to capitalize the names after the data have already been returned (in CF it would look something like this: ucase(left(fname, 1)) & lcase(mid(fname, 2, len(fname) – 1)) – clunky!!).

*As long as your database queries need not be portable.

A better example still would be the following:

SELECT emp_id, COALESCE(dept_id, 0) AS dept_id
  FROM emp

instead of:

SELECT emp_id, dept_id
  FROM emp

With the latter I have to check to see if dept_id is non-null before displaying, while with the former I can simply display as-is.

What does this have to do with friendly URLs? Friendly as in SEO-friendly. The URL is very important in search engine optimization. Suppose we have a table of articles as in a previous post:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
  FROM articles

We would like to have the article’s title in the URL. The conventional wisdom is to use hyphens (-) as word separators. We also don’t want any weird characters that are going to be url-encoded. The solution (or, at least, “a” solution) is to use regular expressions to get rid of unusual characters and replace spaces with hyphens. And here Oracle is truly our friend:

SELECT id, title, description, pubdate, journal_name, author_name, num_reads
     , REGEXP_REPLACE(REGEXP_REPLACE(LOWER(title), '[^-a-z0-9/ ]', ''), '[/ ]', '-') AS url_keyword
  FROM articles

The inner REGEXP_REPLACE replaces every character that is not (a) alphanumeric, (b) a space, or (c) a hyphen or forward slash with a null string. The outer one replaces forward slashes and spaces with hyphens. What is left over is bunches of alphanumeric characters (hopefully, words) separated by hyphens – in other words, an SEO-friendly URL string.

Posted in ColdFusion, Oracle, SQL | Tagged: , , , , | 1 Comment »

ColdFusion: Queries inside query loops

Posted by David Faber on January 3, 2012

One of the first places I typically look to improve the performance of ColdFusion templates is the query loop. Sometimes developers, in their haste to get things done, don’t take the time to think about how they might optimize the performance of their templates. In some instances this can be justified. Maybe there aren’t enough rows returned by the database query to worry about the time it takes to process each record. But even in these cases there could be issues with scalability in the future.

How many times have we seen something like this:

<cfquery name="get_employees" datasource="#the_dsn#">
    SELECT emp_id, fname, lname, dept_id
      FROM employees
</cfquery>
<cfloop query="get_employees">
    <cfquery name="get_dept_info" datasource="#the_dsn#">
        SELECT dname
          FROM dept
         WHERE dept_id = <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#dept_id#" />
    </cfquery>
    ...
</cfloop>

?

Now admittedly this is an elementary example, which could be solved by using a simple SQL join in the first query:

SELECT e.emp_id, e.fname, e.lname, e.dept_id, d.dname
  FROM emp e, dept d
 WHERE e.dept_id = d.dept_id(+)

(Yes, I did use Oracle’s ancient syntax for an outer join above. I can’t get used to the ANSI standard syntax.)

What if we’ve already run the department query for some reason, however? Maybe we’d rather keep our employee query trim. In that case, we can do the following:

<cfquery name="get_depts" datasource="#the_dsn#">
    SELECT dept_id, dname
      FROM dept
</cfquery>
<cfset dept_names = {} /> <!--- Use structNew() for CF8 or less --->
<cfloop query="get_depts">
    <cfset dept_names[dept_id] = dname />
</cfloop>

Now when I loop over my employee query I can do the following:

<cfloop query="get_employees">
    <cfset dname = "" />
    <cfif structKeyExists(dept_names, dept_id)>
        <cfset dname = dept_names[dept_id] />
    </cfif>
    ...
</cfloop>

One area in which this can be very helpful is when one is using aggregate functions. Suppose we want to get some aggregate data (number of employees, total of salaries, etc.) for each of our departments. However, we also want to get information about each department. We could use a single query with multiple columns in the GROUP BY clause:

SELECT d.dept_id, d.dname, d.budget, d.location, d.supervisor_name
     , COUNT(e.emp_id) AS employee_cnt, SUM(e.salary) AS total_salary
  FROM dept d, emp e
 WHERE d.dept_id = e.dept_id(+)
 GROUP BY d.dept_id, d.dname, d.budget, d.location, d.supervisor_name

Now the performance of a query with a GROUP BY clause can be improved by adding index(es) on the column(s) mentioned in the GROUP BY clause (at least in Oracle). I would rather not add indexes where I don’t absolutely have to, however – I’d rather not have to think about the possible impact they may have. (Depending on the company for which you work, you might not be permitted to create indexes on your own.) What I can do in this case is break the query into two parts, one:

<cfquery name="get_dept_info" datasource="#the_dsn#">
    SELECT dept_id, dname, budget, location, supervisor_name FROM dept
</cfquery>

and two:

<cfquery name="get_dept_data" datasource="#the_dsn#">
    SELECT dept_id, COUNT(emp_id) AS employee_cnt, SUM(salary) AS total_salary
      FROM emp
     GROUP BY dept_id
</cfquery>
<cfset dept_counts = {} />
<cfset dept_salary = {} />
<cfloop query="get_dept_data">
    <cfset dept_counts[dept_id] = employee_cnt />
    <cfset dept_salary[dept_id] = total_salary />
</cfloop>

Then when I loop over my “get_dept_data” query I have the data in those structures available, without needing to query the database further.

Posted in ColdFusion, Oracle, SQL | Tagged: , , , , | 2 Comments »

ColdFusion 9: Indexing custom fields in Solr

Posted by David Faber on December 23, 2011

Our first step is to create an object in ColdFusion that we can use to communicate with the Solr server:

<cfset the_server = createObject("java", "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer").init("http://localhost:8983/solr/arts_solr") />

One very nice feature of Solr is that you can use its query syntax to delete records. For example, if you wanted to delete all records with the word “cancer” in the title, you would do the following:

<cfset the_server.deleteByQuery( "title:cancer" ) />

However, in this case we want to delete everything, so we’ll use wildcards:

<cfset the_server.deleteByQuery( "*:*" ) /> <!--- Delete everything --->

Now that the collection has been completely purged, we can add some records. Let’s grab some data from a query:

<cfquery name="get_all_articles" datasource="#the_datasource#">
    SELECT id, title, description, pubdate, journal_name, author_name, num_reads
      FROM articles
</cfquery>

We’re going to index all of the articles we currently have in our database. Here we’ll create an array to store the results of the query.

<cfset the_articles = arrayNew(1) />

Let’s put the results of the query into the array:

<cfloop query="get_all_articles">
    <!--- Strip out HTML from the description --->
    <cfset the_summary = REReplace(description, "<[^>]+>", "", "all") />
    <cfset temp_article = createObject("java", "org.apache.solr.common.SolrInputDocument") />
    <cfset temp_article.addField("uid", id) />
    <cfset temp_article.addField("key", id) />
    <cfset temp_article.addField("size", len(description)) />
    <cfset temp_article.addField("summary", the_summary) />
    <cfset temp_article.addField("title", title) />
    <cfset temp_article.addField("description", description) />
    <cfset temp_article.addField("contents", description & " " & title) />
    <cfset temp_article.addField("pubdate", pubdate) />
    <cfset temp_article.addField("journal_name", journal_name) />
    <cfset temp_article.addField("author_name", author_name) />
    <cfset temp_article.addField("num_reads", num_reads) />
    <cfset temp_article.addField("modified", now()) />
    <cfset arrayAppend(the_articles, temp_article) />
</cfloop>

I am assuming that all of the above fields will already have been defined in the collection’s schema.xml file. The rest is easy:

<cfset the_server.add(the_articles) /> <!--- Add the articles to the index --->
<cfset the_server.commit() /> <!--- Commit the changes --->
<cfset the_server.optimize() /> <!--- Optimize the index --->

And that is really all there is to it, at least for indexes where you don’t expect to have hundreds of thousands of records. If you have many records, you would want to segment the indexing OR partition the collection horizontally (so, for example, you could have one collection for articles from 2011, another for articles from 2010, etc.). Searching on more than one collection at a time is not much more difficult than searching on a single collection, but it is fodder for a future post.

Posted in ColdFusion, Solr, SQL | Tagged: , , , , , , | 6 Comments »