Jason Deacon Team : Web Development Tags : Web Development

Performance-oriented search system for websites

When you let users search your website, you need to consider the system that performs the actual search in the background. Many clients choose an integrated Google search, which is perfectly adequate for a large public-facing content website. But as soon as you want users to search content that is protected behind registration, or to search using very specific criteria (value X > 30, value Y contains "blue"), you need to look elsewhere.

Enter Lucene.NET. Lucene.NET is a .NET port of the Java-based search system, Lucene. It is used in tens of thousands of applications in production environments worldwide and is very robust, performant and flexible. In fact, if you have used Umbraco then you've indirectly used Lucene.NET, as Umbraco uses Examine, which internally uses Lucene.NET. The integration is typical of a search system and very straightforward: you index content made up of one or more fields of data, and then you query that index.

Typically your web application updates content through Lucene.NET so that it can add or update the data in its internal index (which is stored on disk, rather than in a SQL database). To update the index, you create a 'Document' and add fields of data to it, telling Lucene.NET for each field whether it should be indexed, stored, or both. This is important because when you query Lucene.NET and it returns results, you only get back the fields that you told Lucene.NET to store.

For example, if you were indexing a News item which had the following internal fields:

Id
Title
Date Created
Short Description
Full Content
Author Name
Last Modified By User
Created By User
Previous Version

You would typically tell Lucene.NET:

Id: Store only
Title: Store only
Date Created: Store only
Short Description: Store only
Full Content: Store only
Author Name: Store only
Last Modified By User: Don't add to the document
Created By User: Don't add to the document
Previous Version: Don't add to the document
IndexContent (Title + Short Description + Full Content + Author Name): Index only

Excluding fields that are neither shown to the user nor used to identify the content (its Id) keeps the Lucene.NET indexes small and fast. The combined field is there purely to simplify querying. Instead of telling Lucene.NET to index multiple fields and then query each of them with the user's keywords, you specify a single field and know exactly which content will be searched. It's more of a time saver than a required approach when integrating Lucene.NET.
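The field plan above can be sketched in a few lines. This is a toy model in Python, not Lucene.NET's API: the `Document` class, the `build_document` helper and the field names without spaces are all assumptions made for illustration. It only shows the idea of keeping some fields for display, leaving others out entirely, and building one combined field for searching.

```python
from dataclasses import dataclass, field

# Fields kept for display in search results (hypothetical names, no spaces).
STORE_ONLY = ["Id", "Title", "DateCreated", "ShortDescription",
              "FullContent", "AuthorName"]

# Fields whose text is merged into the single searchable field.
COMBINED_SOURCES = ["Title", "ShortDescription", "FullContent", "AuthorName"]


@dataclass
class Document:
    """Toy stand-in for a search document, not a Lucene.NET type."""
    stored: dict = field(default_factory=dict)   # returned with hits, not searched
    indexed: dict = field(default_factory=dict)  # searched, not returned with hits


def build_document(news_item: dict) -> Document:
    """Apply the field plan: store some fields, index one combined field."""
    doc = Document()
    for name in STORE_ONLY:
        doc.stored[name] = news_item[name]
    # Anything not listed above (e.g. PreviousVersion) never reaches the document.
    doc.indexed["IndexContent"] = " ".join(
        str(news_item[name]) for name in COMBINED_SOURCES
    )
    return doc
```

Keeping the plan in two small lists makes it easy to see at a glance which data ends up in the index and which does not.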

Placing 'store only' fields in the Lucene documents also means that when you perform a query and get back a set of results, you should have all the information you need to display each item in a search result listing, and maybe even in its entirety. That is a massive performance increase over performing a search query first and then retrieving all matching data from your data store based on the results you get back.
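To make the benefit concrete, here is a minimal sketch of a search that hands back the stored fields directly, so a result listing can be rendered without a second trip to the data store. Again this is a toy in-memory index in Python, not how Lucene.NET is implemented or called; `ToyIndex` and its methods are hypothetical, and it only matches documents containing every keyword.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Tiny inverted index: searches one text field, returns stored fields."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> document numbers
        self._stored = []                  # document number -> stored fields

    def add(self, stored: dict, index_text: str) -> None:
        doc_no = len(self._stored)
        self._stored.append(dict(stored))
        # The indexed text is only used to build postings; it is not kept.
        for term in re.findall(r"\w+", index_text.lower()):
            self._postings[term].add(doc_no)

    def search(self, keywords: str) -> list:
        terms = re.findall(r"\w+", keywords.lower())
        if not terms:
            return []
        # A document matches only if it contains every keyword.
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return [self._stored[n] for n in sorted(hits)]


index = ToyIndex()
index.add({"Id": 1, "Title": "Blue whales return"}, "Blue whales return to the bay")
index.add({"Id": 2, "Title": "Budget update"}, "Council budget update")

# Each hit already carries what the listing needs; no database lookup by Id.
for hit in index.search("blue whales"):
    print(hit["Id"], hit["Title"])
```

The point of the sketch is the return value of `search`: it contains the display data itself rather than a list of Ids to look up afterwards.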

Using this approach we have been able to index over one hundred thousand documents (a process which takes mere seconds when done correctly) and perform queries against them in less than 10ms.

Also, when working with Lucene (especially when developing the search queries) it helps to be able to visualise what Lucene is actually doing with your query. For that, you can use a tool called Luke which you can get from here (http://www.getopt.org/luke/). You will be able to see exactly what data you are getting back when you perform a specific query, and also be able to play around with query syntax to ensure that your code is generating the correct query when users perform searches.

Of course there's no such thing as a silver bullet, and with Lucene.NET there are a few things to keep in mind. First, the first application that opens a shared Lucene.NET index location on disk for writing holds a write lock on it, meaning that multiple applications cannot write to the same index at the same time. This is only really a problem if you plan to have your CMS index the content that users create and the CMS is a separate application (Umbraco is one instance for front end and CMS, so there are no locking issues there).
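The locking behaviour can be pictured with a simple lock file. The sketch below is a Python analogy under stated assumptions, not Lucene.NET's actual locking code: the `IndexWriteLock` class is hypothetical, and it uses exclusive file creation so that whichever application creates the lock file first wins and the second one is refused until the lock is released.

```python
import os


class IndexWriteLock:
    """Toy exclusive lock on an index directory, based on a lock file."""

    def __init__(self, index_dir: str):
        self._path = os.path.join(index_dir, "write.lock")
        self._held = False

    def acquire(self) -> bool:
        """Return True if this caller now owns the lock, False if it is taken."""
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(self._path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        os.close(fd)
        self._held = True
        return True

    def release(self) -> None:
        """Remove the lock file, but only if this instance created it."""
        if self._held:
            os.remove(self._path)
            self._held = False
```

In this picture a separate CMS application and the front end would each call `acquire` on the same directory, and only one of them could write at any given moment.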