Thursday, March 19, 2009

Full Text Indexing of CouchDB with Lucene


Having gotten couchdb-lucene and edge CouchDB installed and running, I'll keep my chain going by trying to get indexing and searching to work.

I am running in a local development environment (./utils/run), so I need to edit etc/couchdb/local_dev.ini to include:
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-SNAPSHOT-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-SNAPSHOT-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
The next step is to start up the CouchDB server:
cstrom@jaynestown:~/repos/couchdb$ ./utils/run 
Apache CouchDB 0.9.0a756286 (LogLevel=info) is starting.
Apache CouchDB has started. Time to relax.
[info] [<0.58.0>] 127.0.0.1 - - 'GET' /_all_dbs 200
[info] [<0.58.0>] 127.0.0.1 - - 'GET' /eee/_design/lucene 404
[info] [<0.58.0>] 127.0.0.1 - - 'GET' /eee 200
To verify that the index is working, you can access the _fti resource of the database:
cstrom@jaynestown:~/repos/couchdb-lucene/target$ curl http://localhost:5984/eee/_fti
{"doc_count":7,"doc_del_count":2,"last_modified":1237514082000,"current":true,"optimized":false,"disk_size":13669}
Nice! I do have 7 documents in there, so we look to be in good shape.
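If I wanted to script that check rather than eyeball it, a minimal sketch might look like the following. It only parses the status text shown above. The helper name summarize_status is my own, and I am assuming (not confirmed by couchdb-lucene) that last_modified is milliseconds since the Unix epoch.

```python
import json
from datetime import datetime, timezone

# Status document copied verbatim from the curl output above.
status_text = (
    '{"doc_count":7,"doc_del_count":2,"last_modified":1237514082000,'
    '"current":true,"optimized":false,"disk_size":13669}'
)

def summarize_status(text):
    """Return (doc_count, current, last_modified as a UTC datetime).

    Hypothetical helper: assumes last_modified is in milliseconds
    since the Unix epoch.
    """
    status = json.loads(text)
    modified = datetime.fromtimestamp(status["last_modified"] / 1000, tz=timezone.utc)
    return status["doc_count"], status["current"], modified

doc_count, current, modified = summarize_status(status_text)
print(doc_count, current, modified.isoformat())
```

Under that assumption the timestamp lands on the evening I wrote this (in US Eastern time), which suggests the millisecond reading is a reasonable one.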

To search, append a q query parameter to the request with a value in the form attribute_name:search term. We like our greens, so, to search for all recipes (in our limited sample) that include a word starting with "green" in the summary, you would supply the search term: q=summary:green*.
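As a rough sketch of how that URL could be assembled in code, here is a small Python helper. The function name fti_search_url and the default base URL are my own choices for illustration. The only thing taken from couchdb-lucene is the /<db>/_fti?q=<field>:<term> shape described above.

```python
from urllib.parse import quote, urlencode

def fti_search_url(db, field, term, base="http://localhost:5984"):
    """Build a search URL of the form <base>/<db>/_fti?q=<field>:<term>.

    Hypothetical helper. The colon and the trailing * wildcard are left
    unescaped so the result matches what I would type for curl.
    """
    query = urlencode({"q": f"{field}:{term}"}, safe=":*", quote_via=quote)
    return f"{base}/{quote(db, safe='')}/_fti?{query}"

print(fti_search_url("eee", "summary", "green*"))
```

Anything else that needs escaping (a space in the term, say) is percent-encoded. I have not checked how couchdb-lucene treats multi-word terms, so take that part as untested.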

Giving it a try, I find that we do indeed have 2 recipes mentioning "greens":
cstrom@jaynestown:~/repos/couchdb-lucene/target$ curl http://localhost:5984/eee/_fti?q=summary:green*
{"q":"+_db:eee+summary:green*","etag":"1202191c377","skip":0,"limit":25,"total_rows":2,"search_duration":1,"fetch_duration":1,
"rows":[{"_id":"2006-10-08-dressing", "score":0.9224791526794434},
{"_id":"2006-08-01-beansgreens","score":0.8661506175994873}]}
Aside from the yak shaving needed to get edge CouchDB running, this was by far the easiest experience I have ever had in getting Lucene indexing running.

I would ultimately like to be able to search an entire document, not just individual fields, but this will do for now.
