Add Full-Text Search to your Django project with Whoosh
Whoosh is a pure-Python full-text indexing and searching library. Whoosh was open-sourced recently and makes it easy to add full-text search to your site without any external services such as Lucene or Solr.
Whoosh is pretty flexible, but to keep it simple let's assume that the index is stored in settings.WHOOSH_INDEX (which should be a path on the filesystem), that our application is a wiki, and that we want to search the wiki pages.
Indexing Documents
Before we can query the index we have to make sure that the documents are indexed properly and automatically. It doesn't matter whether you put the following code into a new app or an existing one. You only have to make sure that it lives in a file which Django loads when the process starts. Reasonable places would be the __init__.py or the models.py file of any app (or, of course, any file imported by those).
The following code listing is interrupted by short explanations but should be saved into one file:
import os

from django.db.models import signals
from django.conf import settings
from whoosh import store, fields, index

from rcs.wiki.models import WikiPage

WHOOSH_SCHEMA = fields.Schema(title=fields.TEXT(stored=True),
                              content=fields.TEXT,
                              url=fields.ID(stored=True, unique=True))
At the top of the file a Schema is defined. The Schema tells Whoosh what data should go into the index and how it should be organized. In this example Whoosh stores every indexed document as three fields:
title
: The title of the document is stored in the index.
content
: The content of the document. Whoosh will not store the whole content in the index but will only use the data to build its index.
url
: To make it easier to uniquely identify a document in the index, and to link to the document on the search result pages without fetching it from the database, I decided to store the URL of the document in the index.
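The difference between a stored field and a field that is only indexed can be illustrated with a toy index in plain Python. The sketch below is not part of the file being assembled here and is not how Whoosh works internally; ToyIndex and its methods are hypothetical names that merely mirror the roles of title, content and url in the Schema above.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of stored vs. indexed-only fields (not Whoosh's code)."""

    def __init__(self):
        self.stored = {}                  # url -> values of the stored fields
        self.postings = defaultdict(set)  # (field, term) -> urls containing it

    def add_document(self, title, content, url):
        # title and url are stored, so they can be shown in search results.
        self.stored[url] = {"title": title, "url": url}
        # title and content are split into terms for searching. The content
        # text itself is not kept anywhere once its terms are recorded.
        for field, text in (("title", title), ("content", content)):
            for term in text.lower().split():
                self.postings[(field, term)].add(url)

    def search(self, term, field="content"):
        # Look up matching urls and return only their stored fields.
        urls = sorted(self.postings.get((field, term.lower()), ()))
        return [self.stored[url] for url in urls]


ix = ToyIndex()
ix.add_document("Django", "Django is a web framework", "/wiki/django/")
hits = ix.search("framework")
```

Each hit carries the title and the url, which is all the result page needs in order to render a link; the content can be searched but cannot be read back from the index.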
def create_index(sender=None, **kwargs):
    if not os.path.exists(settings.WHOOSH_INDEX):
        os.mkdir(settings.WHOOSH_INDEX)
        storage = store.FileStorage(settings.WHOOSH_INDEX)
        ix = index.Index(storage, schema=WHOOSH_SCHEMA, create=True)

signals.post_syncdb.connect(create_index)
To make sure the index, which is stored on the filesystem, is available, the function create_index is called by Django's post_syncdb signal and creates the index if it is not already present. The function uses the Schema defined earlier.
def update_index(sender, instance, created, **kwargs):
    storage = store.FileStorage(settings.WHOOSH_INDEX)
    ix = index.Index(storage, schema=WHOOSH_SCHEMA)
    writer = ix.writer()
    if created:
        writer.add_document(title=unicode(instance),
                            content=instance.content,
                            url=unicode(instance.get_absolute_url()))
        writer.commit()
    else:
        writer.update_document(title=unicode(instance),
                               content=instance.content,
                               url=unicode(instance.get_absolute_url()))
        writer.commit()

signals.post_save.connect(update_index, sender=WikiPage)
To make sure the index is automatically updated every time a page on the wiki changes, the function update_index is called whenever a WikiPage object sends the post_save signal via Django's signal framework.
If the instance was newly created, it is added to the index as a new document; if it was edited (but existed before), its entry in the index is updated. The document is identified in the index by its unique URL.
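Why the unique URL matters can be seen in a small stand-in for the index writer. This is a toy model in plain Python, not Whoosh code: ToyWriter is a hypothetical class, and its update_document assumes the behaviour described above, namely that a document with the same unique url is replaced rather than duplicated.

```python
class ToyWriter:
    """Toy stand-in for an index writer with a unique 'url' field."""

    def __init__(self):
        self.documents = []  # every indexed document as a plain dict

    def add_document(self, **fields):
        # Blind append: calling this twice for one page would duplicate it.
        self.documents.append(fields)

    def update_document(self, **fields):
        # Drop any document sharing the unique url, then add the new one.
        self.documents = [doc for doc in self.documents
                          if doc["url"] != fields["url"]]
        self.documents.append(fields)


writer = ToyWriter()
# A new page is saved for the first time (created is True).
writer.add_document(title="Start", content="first draft", url="/wiki/start/")
# The same page is edited later (created is False).
writer.update_document(title="Start", content="second draft", url="/wiki/start/")
# Updating a url that is not indexed yet simply adds the document.
writer.update_document(title="Help", content="new page", url="/wiki/help/")
```

In this toy model update_document also covers pages that are not indexed yet, because there is nothing to replace. If your Whoosh version behaves the same way (worth checking in its documentation), both branches of update_index could use update_document.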
Query the Index
At this point we have made sure that Whoosh will always keep an up-to-date index of our WikiPage pages. The next step is to create a view which allows querying the index.
A single view and a template are all we need to let users search the index. The template contains a simple form:
<form action="" method="get">
    <input type="text" id="id_q" name="q" value="{{ query|default_if_none:"" }}" />
    <input type="submit" value="{% trans "Search" %}"/>
</form>
By setting method to GET and action to an empty string we tell the browser to submit the form to the current URL with the value of the input field (named q) appended to the URL as a query string. A search for the term "Django" will result in a request like this:
http://server/somwhere/?q=Django
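How the form input ends up in the query string can be reproduced with the standard library. This is a minimal sketch, assuming Python 3 (urllib.parse); the server name and path are the placeholders from the example URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Build the URL the browser requests when the user searches for "Django".
url = "http://server/somwhere/?" + urlencode({"q": "Django"})
# url == "http://server/somwhere/?q=Django"

# On the server side the query string is parsed back into parameters.
params = parse_qs(urlsplit(url).query)
# params == {"q": ["Django"]}

# Spaces in a search term are encoded as '+' ...
encoded = urlencode({"q": "full text search"})
# encoded == "q=full+text+search"

# ... and decoded back into spaces when the query string is parsed.
decoded = parse_qs(encoded)["q"][0]
# decoded == "full text search"
```

Because a bare + in a URL stands for a space, a plus sign that the user actually typed into the form is sent percent-encoded as %2B.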
I've also added the parsed query back to the search form when displaying the results. This improves the user experience further, because the user can easily edit the query and submit it again.
If you have a dedicated search page (instead of a search box on every page), you might also consider giving focus to the input field to save the user an extra click. If you don't use a JavaScript framework, a very simple solution is to use the onload attribute of the body tag:
<body onload="document.getElementById('id_q').focus();">
Now let's have a look at the view code which handles the requests:
from django.conf import settings
from django.views.generic.simple import direct_to_template
from whoosh import index, store, fields
from whoosh.qparser import QueryParser

from somwhere import WHOOSH_SCHEMA


def search(request):
    """
    Simple search view, which accepts search queries via url, like google.
    Use something like ?q=this+is+the+search+term
    """
    storage = store.FileStorage(settings.WHOOSH_INDEX)
    ix = index.Index(storage, schema=WHOOSH_SCHEMA)
    hits = []
    query = request.GET.get('q', None)
    if query is not None and query != u"":
        # Whoosh doesn't understand '+' or '-' but we can replace
        # them with 'AND' and 'NOT'.
        query = query.replace('+', ' AND ').replace(' -', ' NOT ')
        parser = QueryParser("content", schema=ix.schema)
        try:
            qry = parser.parse(query)
        except Exception:
            # Don't show the user weird errors only because we don't
            # understand the query.
            # parser.parse("") would return None
            qry = None
        if qry is not None:
            searcher = ix.searcher()
            hits = searcher.search(qry)

    return direct_to_template(request, 'search.html',
                              {'query': query, 'hits': hits})
The view imports the previously defined WHOOSH_SCHEMA and gets the index location from the settings. Most of the clutter is only there to improve the user experience by transforming some characters found in search queries into their Whoosh equivalents and by catching all exceptions raised by the Whoosh QueryParser.
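The character replacement can be pulled out of the view into a small helper, which makes it easy to try in isolation. This is a sketch only: to_whoosh_operators is a hypothetical name, and the two replacements are copied from the view above.

```python
def to_whoosh_operators(query):
    """Rewrite '+' and ' -' into the AND / NOT keywords, as the view does."""
    return query.replace('+', ' AND ').replace(' -', ' NOT ')


rewritten = to_whoosh_operators("django +whoosh -solr")
# Split on whitespace to ignore the extra spaces the replacement leaves.
tokens = rewritten.split()   # ['django', 'AND', 'whoosh', 'NOT', 'solr']

# A hyphen inside a word has no leading space, so it is left alone.
unchanged = to_whoosh_operators("full-text")   # 'full-text'
```

Only a minus sign that follows a space is rewritten, so a query that begins with a minus sign reaches the parser unchanged.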
Displaying the search results in the template is pretty straightforward:
{% if hits %}
<ul>
    {% for hit in hits %}
    <li><a href="{{ hit.url }}">{{ hit.title }}</a></li>
    {% endfor %}
</ul>
{% endif %}
Conclusion
With Whoosh and no more than 100 lines of code (including the template) it is possible to add full-text search capabilities to your Django project. I've already added the code above to two projects and I'm pretty impressed by the ease of use and the performance of Whoosh.
The result is that I can now make my Django-powered sites a bit more awesome by adding full-text search (where applicable), and the best part is: at ~100 LOC it comes almost for free.
Related Projects
For a different approach to adding Whoosh to your Django project you might also want to have a look at django-whoosh by Eric Florenzano, which is available on GitHub. Django-Whoosh is basically a Manager which is added to your models; it takes care of indexing and lets you fetch objects by querying the Whoosh index. The idea is clever but only works if you are willing to edit the Model classes to add the manager. My approach is completely based on signals and will therefore work with any reusable app without editing the app itself.
Another app which combines Whoosh and Django is djoosh, also available on GitHub, but it does not seem to be finished at the moment. Djoosh aims to provide a mechanism for registering Models with the indexing infrastructure, similar to the way contrib.admin does.
Hi Arne,
nice article. Thanks for this!
Kai
Posted by Kai Diefenbach 8 minutes after the blog entry was published, on March 17, 2009, 10:11.
Very cool stuff! I really like your approach, as it's explicit and exact. My approach was literally just a few hours of experimentation, so I wouldn't recommend it for anyone to use in production :)
Posted by Eric Florenzano 9 minutes after the blog entry was published, on March 17, 2009, 10:12.
Thanks a lot, that is exactly what I was looking for!
Do you know if there are options to handle accents and upper/lowercase?
I mean, I'd like to search "héhé" and retrieve documents with "héhe", "hehe", "Hehe" and so on.
Posted by David, biologeek 2 hours, 20 minutes after the blog entry was published, on March 17, 2009, 12:23.
Hi David,
I think this is currently not implemented in Whoosh, but you may want to ask on the official mailing list:
http://groups.google.com/group/whoosh/
Additionally, this thread might be related to your question:
http://groups.google.com/group/whoosh/browse_thread/thread/565fbe0d65fa85fe
Posted by Arne Brodowski 2 hours, 24 minutes after the blog entry was published, on March 17, 2009, 12:27.
Arne, thanks for this guide! I like this approach so much that I'm going to be implementing this on my own blog soon. I really appreciate the info, and I've passed this post along through Twitter under my 'adoleo' account. Thanks!
Posted by Brandon Konkle 7 hours, 47 minutes after the blog entry was published, on March 17, 2009, 17:50.
Lots of neat stuff being done with Whoosh nowadays, it was just added as a backend for django-haystack (http://haystacksearch.org/) which is a fork of djangosearch.
Posted by huxley 1 month after the blog entry was published, on April 19, 2009, 00:58.
Can someone help with the proper url pattern for the search view? I've tried a couple with no luck.
Posted by Nate 3 months, 1 week after the blog entry was published, on June 24, 2009, 04:19.
And what if I want the search to be available on all pages in the masthead?
Posted by nate 3 months, 1 week after the blog entry was published, on June 24, 2009, 04:34.
That's right. Just point the form at /search/ and you can search from every page. The form can be plain HTML, no need to include a Django form object in your master template, just make sure that the field names are the same as on the form class.
Posted by Arne 3 months, 1 week after the blog entry was published, on June 24, 2009, 13:55.
Thank you very much, Arne!
Posted by Sergej 1 year, 3 months after the blog entry was published, on July 5, 2010, 18:53.
Hello Arne, thanks! I've read your article. I like it and I'm going to be implementing it. I'll be sure our webmaster can do this for us. Thanks for this!
Posted by Lena Smirnova 1 year, 11 months after the blog entry was published, on Feb. 20, 2011, 18:30.
Thanks, Whoosh and Django rock
Posted by programy 2 years, 1 month after the blog entry was published, on May 11, 2011, 19:17.