Monday, April 20, 2009

A Better Default Search Field

When you search for documents containing the word "chocolate" with Google, you enter "chocolate" as the search term. When you use Google to find documents containing the word "chocolate" on a particular site, say http://eeecooks.com, you enter "site:eeecooks.com chocolate".

Because this is how Google works, this is how search works.

But this is not how the current search in eee-code works. To search for a recipe with "chocolate" in it and a title that contains "pancake", I currently have to query couchdb-lucene with a search of "title:pancake all:chocolate". Yesterday, I started down the path of trying to pre-process the search query. Today, I think better of it.

Lucene's QueryParser accepts a default field as a constructor argument. In couchdb-lucene, that argument can be set to "all" in src/main/java/com/github/rnewson/couchdb/lucene/Config.java:
    static final QueryParser QP = new QueryParser("all", ANALYZER);
The QueryParser then interprets "title:pancake chocolate" as identical to "title:pancake all:chocolate".
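
To make the default-field behavior concrete, here is a toy Ruby sketch. It is not Lucene's QueryParser and ignores phrases, boolean operators, and everything else the real parser handles; the apply_default_field helper is a hypothetical name used only for illustration, and it assumes simple whitespace-separated terms.

```ruby
# Toy model of a default search field. This is NOT Lucene's QueryParser;
# it only mimics the one behavior described above for simple,
# whitespace-separated terms.
def apply_default_field(query, default_field = "all")
  query.split.map do |term|
    # A term that already names a field (e.g. "title:pancake") is kept as-is;
    # a bare term is assigned to the default field.
    term.include?(":") ? term : "#{default_field}:#{term}"
  end.join(" ")
end

puts apply_default_field("title:pancake chocolate")
# prints "title:pancake all:chocolate"
```

The point of setting the default in the QueryParser is that none of this string handling has to live in the application.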

Just to be sure, give curl a try with the old standby of "wheatberries" (and "all:wheatberries"):
cstrom@jaynestown:~/repos/eee-code$ curl http://localhost:5984/eee/_fti?q=all:wheatberries
{"q":"+_db:eee +all:wheatberri",
"etag":"120c60536a7",
"skip":0,
"limit":25,
"total_rows":1,
"search_duration":1,
"fetch_duration":1,
"rows":[
{"_id":"2008-07-19-oatmeal",
"date":"2008/07/19",
"title":"Multi-grain Oatmeal",
"score":0.5710114240646362
}]
}
cstrom@jaynestown:~/repos/eee-code$ curl http://localhost:5984/eee/_fti?q=wheatberries
{"q":"+_db:eee +all:wheatberri",
"etag":"120c60536a7",
"skip":0,"limit":25,
"total_rows":1,
"search_duration":0,
"fetch_duration":1,
"rows":[
{"_id":"2008-07-19-oatmeal",
"date":"2008/07/19",
"title":"Multi-grain Oatmeal",
"score":0.5710114240646362
}]
}
Note that both queries are interpreted as "+_db:eee +all:wheatberri". Both use the "all" field to scope the search, even though the second does not explicitly include it.

Also of note is that "wheatberri" is the Porter stem of "wheatberries" (this stemming was explicitly set a few days ago). The "_db" field is how couchdb-lucene works with multiple databases. Documents from all databases (e.g. the recipe documents in the development and test databases) are stored in the same index. Couchdb-lucene automatically infers the db parameter from the database being queried ("eee" in the above examples). Using this parameter, couchdb-lucene only searches for documents in the current database, effectively limiting the search even though the search index is not similarly limited.
(commit)
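
The shared-index idea can be sketched with a small in-memory stand-in. This is only an illustration of filtering on a "_db" field, not how couchdb-lucene is implemented; the INDEX constant, the search helper, the "eee-test" database name, and the second document are all hypothetical.

```ruby
# Hypothetical in-memory stand-in for one index shared by several databases.
# Each entry records the database it came from in a "_db" field.
INDEX = [
  { "_db" => "eee",      "_id" => "2008-07-19-oatmeal", "all" => %w[oatmeal wheatberri] },
  { "_db" => "eee-test", "_id" => "test-oatmeal",       "all" => %w[oatmeal wheatberri] }
].freeze

# Return the ids of documents in the given database whose field contains
# the (already stemmed) term. The "_db" check plays the role of "+_db:eee".
def search(db, field, term)
  INDEX.select { |doc| doc["_db"] == db && doc[field].include?(term) }
       .map { |doc| doc["_id"] }
end

p search("eee", "all", "wheatberri")
# prints ["2008-07-19-oatmeal"]
```

Both documents match the term, but only the one from the queried database comes back.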

With that in place, I can back out the workaround from yesterday, leaving the search action much simpler:
get '/recipes/search' do
  data = RestClient.get "#{@@db}/_fti?q=#{params[:q]}"
  @results = JSON.parse(data)

  haml :search
end
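
One thing the action above does not do is encode the user's query before placing it in the URL. A minimal sketch of one way to handle that, using Ruby's standard uri library, follows. The search_url helper is a hypothetical name, and the sketch assumes the server decodes form-encoded query strings in the usual way (with "+" read as a space).

```ruby
require 'uri'

# Hypothetical helper: builds the couchdb-lucene search URL with the
# user's query form-encoded, so spaces and colons travel safely.
def search_url(db_url, query)
  "#{db_url}/_fti?#{URI.encode_www_form(q: query)}"
end

puts search_url("http://localhost:5984/eee", "title:pancake chocolate")
# prints "http://localhost:5984/eee/_fti?q=title%3Apancake+chocolate"
```

In the action, the result of such a helper could be passed to RestClient.get in place of the interpolated string.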
Next up: searching ingredients, then on to pagination and sorting (which couchdb-lucene supports out of the box).
(commit)

2 comments:

  1. I'll be providing clear semantics for a default field in 0.3.

  2. Ooh! Thanks for pointing that out. From the 0.3 TODO there will be a "defaults" attribute on the design document that will be able to do this—and more!

    Already looking forward to it. And thanks so much for your work. It's made things *much* easier for me!