Snow sprint report #1 : indexing

So here we are in Austria, sprinting on Zope and Plone (thanks to Lovely Systems). I proposed a task on building an alternative indexing system for Plone, and I worked on it with Dokai and Tom. Those guys rock, really!

Our goal was to create a Plone 3 buildout that provides an out-of-the-box solution.

Background

Let me give you some background about indexing in Zope before presenting our work. The default indexing system is quite effective, as long as your instance does not get too big. Some years ago at Nuxeo, we had to create an alternative indexer for CPS that would externalize the catalog, because we figured out that:
- 50% of the size of the ZODB was the catalog (I am talking about gigabytes here)
- 50% of the time spent on object creation was taken by indexing tasks, and it was getting quite slow as the instance grew.

Those values are approximate, but quite close to reality back then (I know some people have worked on making indexing better in Zope lately).

Julien then wrote an XML-RPC server that would take care of the indexing tasks and reply to queries. The software behind it was Lucene, together with PyLucene. The overall solution was quite good, apart from the pain of installing it on some specific Linux distributions back then.
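To make the idea of an external indexer concrete, here is a minimal sketch of the architecture: one process owns the index and exposes "index" and "search" calls over XML-RPC. This is not Julien's code. The TinyIndexer class and the start_server helper are hypothetical names, a toy in-memory inverted index stands in for Lucene, and the sketch is written for current Python 3 rather than the Python of that era.

```python
import threading
from collections import defaultdict
from xmlrpc.server import SimpleXMLRPCServer


class TinyIndexer:
    """Toy in-memory inverted index (hypothetical; stands in for Lucene)."""

    def __init__(self):
        # Maps each word to the set of document ids that contain it.
        self._postings = defaultdict(set)

    def index(self, doc_id, text):
        """Record every lowercased word of `text` under `doc_id`."""
        for word in text.lower().split():
            self._postings[word].add(doc_id)
        return True  # XML-RPC cannot marshal None by default

    def search(self, word):
        """Return the sorted ids of the documents containing `word`."""
        return sorted(self._postings.get(word.lower(), ()))


def start_server(host="127.0.0.1", port=0):
    """Serve a TinyIndexer in a background thread.

    Port 0 asks the operating system for any free port; the port actually
    bound is available afterwards as server.server_address[1].
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(TinyIndexer())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client would then connect with xmlrpc.client.ServerProxy("http://127.0.0.1:<port>") and call index("doc-1", "some text") or search("text"). The point of the design is that the index lives outside the ZODB, so the application database does not grow with the catalog.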

Anyway, what Julien did some years ago now exists and is called Solr. I also had some experience a while ago with Xapian (as Sidnei did too), which is quite efficient as well, and easier to use from Python (see here).

Solr, Xapian

So the first task to do was to decide what to use. I called Alan from Enfold Systems because the guys over there have been working on the topic for years.

As a matter of fact, they have created a Python package that binds to a Solr server.
They also have a Plone integration that provides a utility to index content in Solr.
Since the guys are releasing all of this very soon as open source, we decided
to go with this solution for the sprint.

This does not lock us into one technology (Lucene), because Alan and some guys from
Lemur are actually considering a drop-in replacement for Solr based on Xapian.

In other words, the work done will be compatible with both Lucene and Xapian technologies. Xapian is pretty interesting since it avoids deploying Java ;)

The sprint task

The task was quite "simple" since the Enfold guys did all the hard work :)
So we worked on:
1. a buildout that builds a Solr server and launches it
2. a Plone integration to use Solr seamlessly

The buildout

The buildout is done and usable (we tried it under Windows, Mac OS X and Debian).
It uses new recipes we wrote:

collective.recipe.ant : builds Java software using Ant
collective.recipe.solrinstance : builds a Solr instance and provides a script to launch it

If you want to try it, here's (roughly) how (comment on this blog entry if you run into a problem):

$ svn co https://svn.enfoldsystems.com/public/enfold.solr/branches/snowsprint08-buildout buildout

$ cd buildout/plone-3.0.5/

$ python2.4 bootstrap.py

$ bin/buildout -v

$ bin/solr-instance &     <-- launches Solr (python bin\solr-instance under Windows)

$ bin/instance fg         <-- launches Zope

Then, in Zope, install SolrIntegration with the quick_installer. The next document you publish will be indexed on the Solr side, and searchable with the search box.

The portal_catalog remains though, so content is indexed twice ;) You can empty it to check
that Solr is doing its job.
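Another way to check is to ask Solr directly over HTTP. The sketch below is only an illustration under assumptions: that Solr listens on its usual default of localhost:8983 under /solr, that the JSON response writer is available, and that the schema has a SearchableText field (the field name is a guess based on the Plone index). The helper names build_select_url and count_hits are made up for this example, and the code targets current Python 3.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Solr on its usual default port and path.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, rows=10, base=SOLR_BASE):
    """Build the URL of a query against Solr's standard select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "%s/select?%s" % (base.rstrip("/"), params)


def count_hits(query, base=SOLR_BASE):
    """Return how many documents match `query` (needs a running Solr)."""
    # rows=0 asks for the match count only, without the documents.
    with urlopen(build_select_url(query, rows=0, base=base)) as response:
        payload = json.load(response)
    return payload["response"]["numFound"]
```

With the Solr instance running, count_hits("SearchableText:snow") should go up by one after you publish a document containing the word "snow", even if the portal_catalog has been emptied.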

Plone integration

The last part we need to work on is making the SearchableText index 100% Solr-based. Whit advised us to create a storage for TextIndexNG, so that's where we are heading (it should be done tomorrow, hopefully).

We would also like to run some benchmarks to compare speed and ZODB size. We will
probably use JMeter for this.

I would like to thank Alan, Leonardo and Sidnei for their work in this area, and for releasing it as open source: I really believe that it will become a great indexing solution for Plone in the coming months. I have been waiting for this kind of momentum around indexing in the Plone community.