Search Smith

ColdFusion, SQL queries, and, of course, searching

Archive for December, 2011

ColdFusion 9: Indexing custom fields in Solr

Posted by David Faber on December 23, 2011

Our first step is to create an object in ColdFusion that we can use to communicate with the Solr server:

<cfset the_server = createObject("java", "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer").init("http://localhost:8983/solr/arts_solr") />

One very nice feature of Solr is that you can use its query syntax to delete records. For example, if you wanted to delete all records with the word “cancer” in the title, you would do the following:

<cfset the_server.deleteByQuery( "title:cancer" ) />

However, in this case we want to delete everything, so we’ll use wildcards:

<cfset the_server.deleteByQuery( "*:*" ) /> <!--- Delete everything --->

Now that the collection has been completely purged, we can add some records. Let’s grab some data from a query:

<cfquery name="get_all_articles" datasource="#the_datasource#">
    SELECT id, title, description, pubdate, journal_name, author_name, num_reads
      FROM articles
</cfquery>

We’re going to index all of the articles we currently have in our database. Here we’ll create an array to store the results of the query.

<cfset the_articles = arrayNew(1) />

Let’s put the results of the query into the array:

<cfloop query="get_all_articles">
    <!--- Strip out HTML from the description --->
    <cfset the_summary = REReplace(description, "<[^>]+>", "", "all") />
    <cfset temp_article = createObject("java", "org.apache.solr.common.SolrInputDocument").init() />
    <cfset temp_article.addField("uid", id) />
    <cfset temp_article.addField("key", id) />
    <cfset temp_article.addField("size", len(description)) />
    <cfset temp_article.addField("summary", the_summary) />
    <cfset temp_article.addField("title", title) />
    <cfset temp_article.addField("description", description) />
    <cfset temp_article.addField("contents", description & " " & title) />
    <cfset temp_article.addField("pubdate", pubdate) />
    <cfset temp_article.addField("journal_name", journal_name) />
    <cfset temp_article.addField("author_name", author_name) />
    <cfset temp_article.addField("num_reads", num_reads) />
    <cfset temp_article.addField("modified", now()) />
    <cfset arrayAppend(the_articles, temp_article) />
</cfloop>

I am assuming that all of the above fields will already have been defined in the collection’s schema.xml file. The rest is easy:

<cfset the_server.add(the_articles) /> <!--- Add the articles to the index --->
<cfset the_server.commit() /> <!--- Commit the changes --->
<cfset the_server.optimize() /> <!--- Optimize the index --->

And that is really all there is to it, at least for indexes where you don’t expect to have hundreds of thousands of records. If you have many records, you would want to segment the indexing OR partition the collection horizontally (so, for example, you could have one collection for articles from 2011, another for articles from 2010, etc.). Searching on more than one collection at a time is not much more difficult than searching on a single collection, but it is fodder for a future post.
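To make "segment the indexing" a little more concrete, here is a minimal sketch of sending the documents to Solr in batches rather than in one large array. It reuses the get_all_articles query and the collection URL from above; the batch_size value of 1000 is an arbitrary assumption, and only two fields are added to keep the example short.

```cfm
<!--- Sketch only: batch_size is an assumed value, tune it for your own data --->
<cfset batch_size = 1000 />
<cfset the_server = createObject("java", "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer").init("http://localhost:8983/solr/arts_solr") />
<cfset the_batch = arrayNew(1) />

<cfloop query="get_all_articles">
    <cfset temp_article = createObject("java", "org.apache.solr.common.SolrInputDocument").init() />
    <cfset temp_article.addField("uid", id) />
    <cfset temp_article.addField("title", title) />
    <cfset arrayAppend(the_batch, temp_article) />

    <!--- Send the batch when it is full, or when we reach the last row --->
    <cfif arrayLen(the_batch) EQ batch_size OR get_all_articles.currentRow EQ get_all_articles.recordCount>
        <cfset the_server.add(the_batch) />
        <cfset arrayClear(the_batch) />
    </cfif>
</cfloop>

<!--- Commit once at the end so the new documents become searchable --->
<cfset the_server.commit() />
```

Committing once at the end, rather than after every batch, is a design choice here; committing per batch would make documents searchable sooner at the cost of more work for Solr.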

Posted in ColdFusion, Solr, SQL

Verity to Solr Addendum

Posted by David Faber on December 22, 2011

One day into this blog and I’m already lying about upcoming posts! Seriously, I noticed reading over my first post that I had unintentionally left a couple of things out.

The first is a postscript to the blog post I cited by ColdFusion Muse. This blog post dealt with some issues caused by using Solr with the default (as installed under ColdFusion) options for the JVM. One of these options is

-XX:+AggressiveOpts

The Apache Lucene project had the following to say on this subject when Oracle, the maintainer of Java since its takeover of Sun, released a fix (kudos to a colleague for turning this up):

On the same day, Oracle released Java 6u29 fixing the same problems occurring with Java 6, if the JVM switches -XX:+AggressiveOpts or -XX:+OptimizeStringConcat were used. Of course, you should not use experimental JVM options like -XX:+AggressiveOpts in production environments! We recommend everybody to upgrade to this latest version 6u29.

Now I can’t speak for anyone else, but I do find it strange that Solr would be configured by default (at least in its ColdFusion edition) to use an experimental JVM option. It’s not as if this is a new option, or that it was originally used for some other purpose. Sun mentioned that it was an experimental option in a 2005 white paper on Java Tuning:

-XX:+AggressiveOpts
Turns on point performance optimizations that are expected to be on by default in upcoming releases. The changes grouped by this flag are minor changes to JVM runtime compiled code and not distinct performance features (such as BiasedLocking and ParallelOldGC). This is a good flag to try the JVM engineering team’s latest performance tweaks for upcoming releases. Note: this option is experimental! The specific optimizations enabled by this option can change from release to release and even build to build. You should reevaluate the effects of this option with prior to deploying a new release of Java.

The other thing that I omitted was a short discussion on how to search a Solr collection on a custom field, or how to filter a search. Let’s consider the Solr query from the previous post:

http://localhost:8983/solr/core0/select?q=cancer&fl=*,score&wt=json

This is just a plain-vanilla keyword search — there are no custom fields involved at all (well, there could be — but more on that later unless I am lying about future posts again). If we want to search on a custom field, we can do the following:

http://localhost:8983/solr/core0/select?q=custom1:cancer&fl=*,score&wt=json

If we want to filter the search on a custom field (a filter restricts the results to documents that match on that field, but it does not affect the results’ scores), we can do the following:

http://localhost:8983/solr/core0/select?q=cancer&fl=*,score&wt=json&fq=custom1:cancer

The “fq” parameter is used for filter queries and I believe the same syntax is used for this parameter as is used for the “q” parameter.

What if you want to search or filter on TWO (or more) custom fields? That is easy enough.

http://localhost:8983/solr/core0/select?q=cancer&fl=*,score&wt=json&fq=%2Bcustom1:cancer%20%2Bcustom2:mesothelioma

The “%2B” codes are URL-encoded plus (+) signs; they have to be encoded because a literal “+” in a URL is itself read as an encoded space. “%20” is the other code for a URL-encoded space. I suppose one might use a + sign instead of “%20” here, but that might get confusing. (As an aside, I am old enough that some of the oldest code I used was written out on a blackboard for me, and we used an uppercase delta character Δ to denote a space, as spaces are not always obvious when written.) I believe (but I am not entirely certain without checking) that the + signs are needed in front of both terms.
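Rather than encoding the filter by hand, one could let ColdFusion do it. The following is a minimal sketch, assuming the same core0 collection and the custom1 and custom2 fields used above; the variable names are made up for illustration.

```cfm
<!--- The filter query as you would type it, before any URL encoding --->
<cfset filter_query = "+custom1:cancer +custom2:mesothelioma" />

<!--- urlEncodedFormat() takes care of encoding the plus signs and the space --->
<cfset search_url = "http://localhost:8983/solr/core0/select?q=cancer&fl=*,score&wt=json&fq=" & urlEncodedFormat(filter_query) />

<cfoutput>#search_url#</cfoutput>
```

Solr also accepts the “fq” parameter more than once in the same request, with each filter applied in turn, so a second option is to pass fq=custom1:cancer and fq=custom2:mesothelioma separately and avoid the plus signs altogether.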

Posted in ColdFusion, Oracle, Solr

ColdFusion 9: Upgrading from Verity to Solr

Posted by David Faber on December 22, 2011

I started working with ColdFusion around 1998-9. Since that time my primary interests have been (1) database queries and how they can be optimized, and (2) keyword searches. I’ll consider the latter in this post.

When CF9 was first released I was skeptical of making the upgrade (and it is an upgrade, as I can see in hindsight) from Verity to Solr. While Adobe had implied, at the time, that support for Verity might be dropped in the future, it was not entirely clear that anything would be gained from the change. The tags used (<CFINDEX> and <CFSEARCH>) were the same, while the syntax was slightly different (to search the title field, one used “title:” instead of “<CF_TITLE>” in the search criteria). However, there were supposed to be significant speed gains, so I pushed for my company to make the switch.

Two problems presented themselves immediately. One was that the Solr service ran out of memory very quickly. This was resolved thanks to a very helpful post on ColdFusion Muse. The other problem was that searches on the title field were now case-sensitive where they had not been before (as an aside, Verity searches — at least those using <CFSEARCH> — are case-insensitive unless the search criteria are mixed-case. For example, searching for “cancer” or “CANCER” will return the same results, but searching for “cANcEr” will not). Since I was under a deadline, I did not take the time to look into the underlying issue — instead, I used the CUSTOM1 field to store the title, while putting everything I had been jamming into the CUSTOM1 field into the TITLE field! These were fields that I wanted returned in a search, so as to avoid querying the database again, but which I did not expect to use to filter the search. So whether or not a search of that field was case-insensitive was irrelevant for my purpose — I was using it to store data, not to search. My search criteria went from this: “logic AND <CF_TITLE>fuzzy” to this: “logic +title:fuzzy”.

That solution was clunky, but it worked fine. The search index was not flexible, but it worked, until I needed the flexibility.

The <CFINDEX> tag is very limited. There aren’t many attributes available for custom fields (either for searching or merely for storing) — just CUSTOM1, CUSTOM2, CUSTOM3, and CUSTOM4. (Of course this is a big upgrade over versions of CF in which only two custom fields were available.) I wanted to return more than just four custom fields, so I jammed them into the CUSTOMx fields as delimited lists. This of course makes it impossible (or, at least, difficult) to filter on those custom fields; they store data and nothing more. I wish I could remember where I read this (I certainly didn’t discover it on my own — credit where credit is due), but the key to unlocking some of the power of Solr is to edit the collection’s schema, then break out of the <CFINDEX>/<CFSEARCH> paradigm and use Solr as a web service.

To configure the collection’s schema, go to the collection’s home directory (on Windows, this will likely be something like C:\ColdFusion9\collections\<collection-name>), then to the conf directory, and open schema.xml. There is a bunch of stuff at the top that is beyond the scope of this blog post. However, if you scroll down to line 440 or thereabouts, you see the fields defined for this search collection (these are the default fields that ColdFusion creates when a collection is created):

On lines 444 and 445 we see some fields that are probably familiar, then again on ll. 478-481 and 483. The first thing that sticks out (to me, at least) is that the “title” field does not have type “text” but something else: “text_ws”. This is a special type, defined earlier in the schema file, that tokenizes the field only on whitespace and is case-sensitive. For our purposes here we can just replace this with “text”:
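As a rough sketch of that change: only the field name and the switch from “text_ws” to “text” come from the discussion above. The remaining attributes are assumptions, so keep whatever attributes your own schema.xml already has on this line.

```xml
<!-- Assumed attributes; only name="title" and type="text" are taken from the post -->
<field name="title" type="text" indexed="true" stored="true" required="false" />
```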

N.B.: When you make any changes to the schema.xml file, you need to restart the Solr service before you can re-index the collection. I think, but I have not tested it yet, that you can make this change and still use <CFINDEX> and <CFSEARCH> as you did before, only now with the ability to make case-insensitive searches on the title. (If you’re wondering why not simply store the title in all-lowercase or all-uppercase, it’s because I want the title stored exactly, and I do not want to “waste” one of my custom fields on it.)

A second question comes to mind. Can you simply add new custom fields here, in addition to custom1 .. custom4? The answer is yes. The problem then becomes one of indexing and searching: how do you index the new custom fields, and how do you filter searches on them?
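For illustration, a new field is added by putting another field definition alongside the existing ones in schema.xml. The field names below are hypothetical, and the types and attributes are assumptions; check the type definitions near the top of your own schema.xml to see which types are actually available.

```xml
<!-- Hypothetical custom fields; names, types and attributes are assumptions -->
<field name="journal_name" type="string" indexed="true" stored="true" required="false" />
<field name="author_name" type="text" indexed="true" stored="true" required="false" />
```

A “string” field is matched as a single exact value, while a “text” field is tokenized for keyword searching, so the choice depends on whether the field is meant for filtering or for searching.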

Searching on them is the easier part. You can call a Solr collection as a web service and it will return data in XML or JSON format:

http://localhost:8983/solr/core0/select?q=cancer&fl=*,score&wt=json

Here “core0” is the name of the Solr collection (this is the default collection that ships with CF 9), the “q” parameter is where we submit our criteria, the “fl” parameter lists the fields we want returned (in this case, all of them plus the pseudo-field “score”), and the “wt” parameter tells the server the format in which the data ought to be returned. A value of “json” will return JSON, while omitting the parameter will return XML, which is the default. (There are other formats available, but, again, this is beyond the scope of this post.)
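From within ColdFusion, the same request can be made with <CFHTTP>. This is a minimal sketch, assuming Solr is running locally on port 8983 with the core0 collection; the variable names are made up for illustration, and there is no error handling.

```cfm
<!--- Assumed: Solr is reachable at this URL and has a collection named core0 --->
<cfset solr_url = "http://localhost:8983/solr/core0/select" />

<!--- Each cfhttpparam of type "url" is appended to the query string --->
<cfhttp url="#solr_url#" method="get" result="solr_response">
    <cfhttpparam type="url" name="q" value="cancer" />
    <cfhttpparam type="url" name="fl" value="*,score" />
    <cfhttpparam type="url" name="wt" value="json" />
</cfhttp>

<!--- Turn the JSON text into ColdFusion structs and arrays --->
<cfset search_results = deserializeJSON(solr_response.fileContent) />

<cfoutput>Found #search_results.response.numFound# documents.</cfoutput>
```

The matching documents themselves are in search_results.response.docs, an array of structs with one key per field named in the “fl” parameter.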

Now comes the hard part — actually indexing your data. There are a couple of tools available on the internet for this, but personally I found them unacceptable for one reason or another. But I also think that it’s important to know what is going on under the hood, so to speak, in the event you need something for your search that isn’t in one of those tools. I’ll go over that in my next post.

Posted in ColdFusion, Solr