Search Smith

ColdFusion, SQL queries, and, of course, searching

ColdFusion 9: Indexing custom fields in Solr

Posted by David Faber on December 23, 2011

Our first step is to create an object in ColdFusion that we can use to communicate with the Solr server:

<cfset the_server = createObject("java", "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer").init("http://localhost:8983/solr/arts_solr") />

One very nice feature of Solr is that you can use its query syntax to delete records. For example, if you wanted to delete all records with the word “cancer” in the title, you would do the following:

<cfset the_server.deleteByQuery( "title:cancer" ) />

However, in this case we want to delete everything, so we’ll use wildcards:

<cfset the_server.deleteByQuery( "*:*" ) /> <!--- Delete everything --->

Now that the collection has been completely purged, we can add some records. Let’s grab some data from a query:

<cfquery name="get_all_articles" datasource="#the_datasource#">
    SELECT id, title, description, pubdate, journal_name, author_name, num_reads
      FROM articles

We’re going to index all of the articles we currently have in our database. Here we’ll create an array to store the results of the query.

<cfset the_articles = arrayNew(1) />

Let’s put the results of the query into the array:

<cfloop query="get_all_articles">
    <!--- Strip out HTML from the description --->
    <cfset the_summary = REReplace(description, "<[^>]+>", "", "all") />
    <cfset temp_article = createObject("java", "org.apache.solr.common.SolrInputDocument") />
    <cfset temp_article.addField("uid", id) />
    <cfset temp_article.addField("key", id) />
    <cfset temp_article.addField("size", len(description)) />
    <cfset temp_article.addField("summary", the_summary) />
    <cfset temp_article.addField("title", title) />
    <cfset temp_article.addField("description", description) />
    <cfset temp_article.addField("contents", description & " " & title) />
    <cfset temp_article.addField("pubdate", pubdate) />
    <cfset temp_article.addField("journal_name", journal_name) />
    <cfset temp_article.addField("author_name", author_name) />
    <cfset temp_article.addField("num_reads", num_reads) />
    <cfset temp_article.addField("modified", now()) />
    <cfset arrayAppend(the_articles, temp_article) />

I am assuming that all of the above fields will already have been defined in the collection’s schema.xml file. The rest is easy:

<cfset the_server.add(the_articles) /> <!--- Add the articles to the index --->
<cfset the_server.commit() /> <!--- Commit the changes --->
<cfset the_server.optimize() /> <!--- Optimize the index --->

And that is really all there is to it, at least for indexes where you don’t expect to have hundreds of thousands of records. If you have many records, you would want to segment the indexing OR partition the collection horizontally (so, for example, you could have one collection for articles from 2011, another for articles from 2010, etc.). Searching on more than one collection at a time is not much more difficult than searching on a single collection, but it is fodder for a future post.

6 Responses to “ColdFusion 9: Indexing custom fields in Solr”

  1. […] /> <!— We're using our articles index mentioned in a previous post […]

  2. […] do you know which fields match your query? For example, if we search our articles index for “malaria,” and we want to know whether we matched the term in title, description, […]

  3. Rhys Jones said

    Hi David,

    I love your postings with regard to Solr/CF9. I’m in the middle of migrating our Verity Collections and I’d love to take advantage of some of the Solr features you demonstrate in this posting. My question is: is it possible to extend your example to a hybrid index that indexes both the database-related meta information listed above AND perhaps indexing a PDF of the entire article (extending your example in this posting)? We have a vast number of digital “articles” along with a lot of meta data stored in the database that we’d love to incorporate into the collection.

  4. David Faber said

    Yes, that certainly is possible. I’m not at all familiar with importing documents and data into Solr using something other than ColdFusion; but what you could do is store the file path along with metadata in a database table, then use the CFPDF tag to extract the text of the PDF document and store it in a custom field:

    <cfpdf action="extracttext" source="#file_path#.pdf" pages="*" type="string" name="file_text" />

    <cfset temp.addField("contents_file_t", file_text) />

    Or something like that.

  5. Charlie F said

    David and Rhys,

    This is exactly the functionality I’m trying to build. I also have a “vast number of digital “articles” along with a lot of meta data stored in the database that we’d love to incorporate into the collection.” Users submit their manuscripts to the system, and others can search/download them.

    What I was planning on doing was using BLOB full text search in SQL2008, which works great, but I can’t convince my corp. host to allow it. The beauty of that system was that the metadata and the document (PDF, DOC, DOCX, etc.) were indexed as one entity, and returned as one record.

    I think I understand your suggestion — parse the text of the docs into a SQL field that can be indexed, and then just point the user to a download URL pointing the filesystem. But what if I have a mix of filetypes, namely MS Word docs?

    I don’t think there’s any big reason I couldn’t convert incoming uploads to PDFs and then use your CFPDF idea. Is that what you would suggest?


  6. David Faber said

    @Charlie, as a matter of fact I have a project in which I am doing just that. We accept the documents as Word docs (DOC, DOCX) or PDFs. If a Word doc is uploaded then I convert it to a PDF in a batch process. I then use CFPDF to extract text from PDFs and store that text in the database to be indexed. It is not a perfect solution but it is workable for our purposes.

    Hope this helps.

Leave a Comment

Your email address will not be published. Required fields are marked *