Search Smith

ColdFusion, SQL queries, and, of course, searching

Posts Tagged ‘query’

ColdFusion and Solr: Dealing with the results

Posted by David Faber on March 8, 2012

The Solr web service allows you to return results in either XML format or JSON (the format is XML by default; to return results in JSON format, add the parameter “wt=json” to your Solr query string). Suppose we run a query on the Solr collection myindex. There are three ways we can search this collection: (1) we could use the <cfsearch> tag; (2) we could use the Solr web service and return XML; or (3) we could use the Solr web service and return JSON. We’ve talked about <cfsearch> and its limitations before so I’ll limit this discussion to the second and third options.

XML:

<cfhttp url="http://localhost:8983/solr/myindex?q=*:*&fl=*,score" result="myresult" />

JSON:

<cfhttp url="http://localhost:8983/solr/myindex?q=*:*&fl=*,score&wt=json" result="myresult" />

In each case (XML or JSON), ColdFusion has powerful functions that help you manage the data without needing to parse it manually. Typically I create a friendly query that I can return to the front-end developer that he can use in place of a SQL query or <cfsearch> result.

<cfset search_results = queryNew("id,title,description,pubdate,journal_name,author_name,num_reads,score"
                               , "CF_SQL_INTEGER,CF_SQL_VARCHAR,CF_SQL_CLOB,CF_SQL_DATE,CF_SQL_VARCHAR,CF_SQL_VARCHAR,CF_SQL_INTEGER,CF_SQL_DOUBLE")
/>
<!--- We're using our articles index mentioned in a previous post --->

XML:

<cfif isXML(myresult)>
    <cfset xml_result = XMLParse(myresult.fileContent) />
    <!--- continue processing --->
</cfif>

JSON:

<cfif isJSON(myresult)>
    <cfset json_result = deserializeJSON(myresult.fileContent) />
    <!--- continue processing --->
</cfif>

In either case we will iterate over our results structure and put the results in our friendly query.

JSON:
The JSON structure is particularly friendly for our purposes. First, let’s get the number of results and the maximum score:

<cfset result_cnt = json_result.response.numFound />
<cfset max_score = json_result.response.maxScore />

These fields aren’t absolutely necessary, of course. In particular, maxScore seems superfluous but it might be useful if you’re converting from Verity to Solr and need scores in a % format. The Verity search engine scores results between 0 and 1, which is easily converted to a percentage; while Solr has no maximum score. So to get a Verity-style score, simply divide each result’s score by maxScore.

Now we simply loop over the docs array of the response:

<cfloop array="#json_result.response.docs#" index="doc">
    <cfif structKeyExists(doc, "id") AND structKeyExists(doc, "title")>
        <!--- Both id and title are absolutely required --->
        <cfset queryAddRow(search_results) />
        <cfset querySetCell(search_results, "id", doc[id]) />
        <cfset querySetCell(search_results, "title", doc[title]) />
        <cfset querySetCell(search_results, "score", doc[score] / max_score) />
        <cfif structKeyExists(doc, "description")>
            <cfset querySetCell(search_results, "description", doc[description]) />
        </cfif>
        <cfif structKeyExists(doc, "pub_date")>
            <cfset querySetCell(search_results, "pub_date", doc[pub_date]) />
        </cfif>
        <cfif structKeyExists(doc, "journal_name")>
            <cfset querySetCell(search_results, "journal_name", doc[journal_name]) />
        </cfif>
        <cfif structKeyExists(doc, "author_name")>
            <cfset querySetCell(search_results, "author_name", doc[author_name]) />
        </cfif>
        <cfif structKeyExists(doc, "read_cnt")>
            <cfset querySetCell(search_results, "read_cnt", doc[read_cnt]) />
        </cfif>
    </cfif>
</cfloop>

Dealing with XML results is a bit more complicated so I will discuss that in a subsequent post or in an update to this one. Suffice it to say that the approach will be very similar – we will iterate over the results and store them in a friendly query.

Posted in ColdFusion, Solr | Tagged: , , , , , , , | 7 Comments »

ColdFusion: Queries inside query loops

Posted by David Faber on January 3, 2012

One of the first places I typically look to improve the performance of ColdFusion templates is the query loop. Sometimes developers, in their haste to get things done, don’t take the time to think about how they might optimize the performance of their templates. In some instances this can be justified. Maybe there aren’t enough rows returned by the database query to worry about the time it takes to process each record. But even in these cases there could be issues with scalability in the future.

How many times have we seen something like this:

<cfquery name="get_employees" datasource="#the_dsn#">
    SELECT emp_id, fname, lname, dept_id
      FROM employees
</cfquery>
<cfloop query="get_employees">
    <cfquery name="get_dept_info" datasource="#the_dsn#">
        SELECT dname
          FROM dept
         WHERE dept_id = <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#dept_id#" />
    </cfquery>
    ...
</cfloop>

?

Now admittedly this is an elementary example, which could be solved by using a simple SQL join in the first query:

SELECT e.emp_id, e.fname, e.lname, e.dept_id, d.dname
  FROM emp e, dept d
 WHERE e.dept_id = d.dept_id(+)

(Yes, I did use Oracle’s ancient syntax for an outer join above. I can’t get used to the ANSI standard syntax.)

What if we’ve already run the department query for some reason, however? Maybe we’d rather keep our employee query trim. In that case, we can do the following:

<cfquery name="get_depts" datasource="#the_dsn#">
    SELECT dept_id, dname
      FROM dept
</cfquery>
<cfset dept_names = {} /> <!--- Use structNew() for CF8 or less --->
<cfloop query="get_depts">
    <cfset dept_names[dept_id] = dname />
</cfloop>

Now when I loop over my employee query I can do the following:

<cfloop query="get_employees">
    <cfset dname = "" />
    <cfif structKeyExists(dept_names, dept_id)>
        <cfset dname = dept_names[dept_id] />
    </cfif>
    ...
</cfloop>

One area in which this can be very helpful is when one is using aggregate functions. Suppose we want to get some aggregate data (number of employees, total of salaries, etc.) for each of our departments. However, we also want to get information about each department. We could use a single query with multiple columns in the GROUP BY clause:

SELECT d.dept_id, d.dname, d.budget, d.location, d.supervisor_name
     , COUNT(e.emp_id) AS employee_cnt, SUM(e.salary) AS total_salary
  FROM dept d, emp e
 WHERE d.dept_id = e.dept_id(+)
 GROUP BY d.dept_id, d.dname, d.budget, d.location, d.supervisor_name

Now the performance of a query with a GROUP BY clause can be improved by adding index(es) on the column(s) mentioned in the GROUP BY clause (at least in Oracle). I would rather not add indexes where I don’t absolutely have to, however – I’d rather not have to think about the possible impact they may have. (Depending on the company for which you work, you might not be permitted to create indexes on your own.) What I can do in this case is break the query into two parts, one:

<cfquery name="get_dept_info" datasource="#the_dsn#">
    SELECT dept_id, dname, budget, location, supervisor_name FROM dept
</cfquery>

and two:

<cfquery name="get_dept_data" datasource="#the_dsn#">
    SELECT dept_id, COUNT(emp_id) AS employee_cnt, SUM(salary) AS total_salary
      FROM emp
     GROUP BY dept_id
</cfquery>
<cfset dept_counts = {} />
<cfset dept_salary = {} />
<cfloop query="get_dept_data">
    <cfset dept_counts[dept_id] = employee_cnt />
    <cfset dept_salary[dept_id] = total_salary />
</cfloop>

Then when I loop over my “get_dept_data” query I have the data in those structures available, without needing to query the database further.

Posted in ColdFusion, Oracle, SQL | Tagged: , , , , | 2 Comments »

ColdFusion 9: Indexing custom fields in Solr

Posted by David Faber on December 23, 2011

Our first step is to create an object in ColdFusion that we can use to communicate with the Solr server:

<cfset the_server = createObject("java", "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer").init("http://localhost:8983/solr/arts_solr") />

One very nice feature of Solr is that you can use its query syntax to delete records. For example, if you wanted to delete all records with the word “cancer” in the title, you would do the following:

<cfset the_server.deleteByQuery( "title:cancer" ) />

However, in this case we want to delete everything, so we’ll use wildcards:

<cfset the_server.deleteByQuery( "*:*" ) /> <!--- Delete everything --->

Now that the collection has been completely purged, we can add some records. Let’s grab some data from a query:

<cfquery name="get_all_articles" datasource="#the_datasource#">
    SELECT id, title, description, pubdate, journal_name, author_name, num_reads
      FROM articles
</cfquery>

We’re going to index all of the articles we currently have in our database. Here we’ll create an array to store the results of the query.

<cfset the_articles = arrayNew(1) />

Let’s put the results of the query into the array:

<cfloop query="get_all_articles">
    <!--- Strip out HTML from the description --->
    <cfset the_summary = REReplace(description, "<[^>]+>", "", "all") />
    <cfset temp_article = createObject("java", "org.apache.solr.common.SolrInputDocument") />
    <cfset temp_article.addField("uid", id) />
    <cfset temp_article.addField("key", id) />
    <cfset temp_article.addField("size", len(description)) />
    <cfset temp_article.addField("summary", the_summary) />
    <cfset temp_article.addField("title", title) />
    <cfset temp_article.addField("description", description) />
    <cfset temp_article.addField("contents", description & " " & title) />
    <cfset temp_article.addField("pubdate", pubdate) />
    <cfset temp_article.addField("journal_name", journal_name) />
    <cfset temp_article.addField("author_name", author_name) />
    <cfset temp_article.addField("num_reads", num_reads) />
    <cfset temp_article.addField("modified", now()) />
    <cfset arrayAppend(the_articles, temp_article) />
</cfloop>

I am assuming that all of the above fields will already have been defined in the collection’s schema.xml file. The rest is easy:

<cfset the_server.add(the_articles) /> <!--- Add the articles to the index --->
<cfset the_server.commit() /> <!--- Commit the changes --->
<cfset the_server.optimize() /> <!--- Optimize the index --->

And that is really all there is to it, at least for indexes where you don’t expect to have hundreds of thousands of records. If you have many records, you would want to segment the indexing OR partition the collection horizontally (so, for example, you could have one collection for articles from 2011, another for articles from 2010, etc.). Searching on more than one collection at a time is not much more difficult than searching on a single collection, but it is fodder for a future post.

Posted in ColdFusion, Solr, SQL | Tagged: , , , , , , | 6 Comments »