I recently designed an enterprise search solution for a public-sector organization with more than 50,000 employees. I laid out the challenges that we aimed to handle in a previous post (http://tomaselfving.blogspot.com/2007/09/challenges-of-enterprise-search.html). One additional design objective was to minimize the number of synchronous calls to the base systems.
The solution consists of a five-layer architecture:
- Web application layer – any number of web sites and client applications using the search web services of layer 2.
- Web service layer publishing web services with customized search functionality.
- Data warehouse layer – the index database that the web services run their search queries against.
- Integration layer with components managing the import and the processing of base system data.
- Base systems connected to the solution.

Web application layer
The web application consists of a single search field on the intranet portal home page and a set of underlying pages that display the search result sets and handle further user interactions with them. It uses the three web services in the web service layer that expose the enterprise search functionality. The result set can easily be merged and sorted together with results from searches on web sites and document stores.
Web service layer
The web services are designed to expose custom search functionality based on the structure of the data in the base systems. The first web service (WS) is the search service, which returns a result set sorted by relevance and ready to display. As the base systems that we have integrated are document management systems, we need a way to retrieve the physical documents. The second and third WS fetch documents from the two base systems, one service per system. Each takes the document ID as an input parameter and returns the corresponding document from its base system.
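To make the division of work between the three services concrete, here is a minimal sketch in Python. It is not the code we ran in production: the class names, the field names and the term-count scoring are all assumptions made for illustration, and a plain list and dictionary stand in for the index database and a base system.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    doc_id: str
    source_system: str
    title: str
    score: float


class SearchService:
    """Hypothetical first service: it only queries the index, never a base system."""

    def __init__(self, index):
        # `index` is an illustrative stand-in for the index database:
        # a list of dicts with doc_id, source_system, title and text.
        self._index = index

    def search(self, query):
        terms = query.lower().split()
        hits = []
        for entry in self._index:
            words = entry["text"].lower().split()
            # Toy relevance: how often the query terms occur in the text.
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append(SearchHit(entry["doc_id"], entry["source_system"],
                                      entry["title"], float(score)))
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits


class DocumentFetchService:
    """Hypothetical second/third service: one instance per base system."""

    def __init__(self, base_system_store):
        # A dict of doc_id -> bytes stands in for the real base system.
        self._store = base_system_store

    def fetch(self, doc_id):
        return self._store[doc_id]
```

A client would call `search` first and then, using the `source_system` of the chosen hit, ask the matching fetch service for the document itself.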
Intermediate database layer and the integrations
The idea is to gather as much information as possible from the base systems in an intermediate database, in order to make it available for high-performance search queries. No round-trips to the base systems are required for searching. This way, the base systems are protected against the hard-to-estimate performance load from web applications using the web services. The only disadvantage is that data in the intermediate database will not be as fresh as the data in the base systems. In this case, this is acceptable. Synchronizations may be done as often as you like; we use an interval of 1 hour. The length of the interval depends on:
- How long the synchronization process takes. The interval obviously shouldn’t be shorter than that, as the previous synchronization needs to finish before the next one starts!
- How much the base data changes. The more changes, the shorter the recommended interval.
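The first rule can be written down as a small scheduling function. This is only a sketch of the reasoning, not our actual scheduler; the function name and its parameters are made up for the example, with the 1 hour interval from above as the default.

```python
from datetime import datetime, timedelta


def next_sync_start(previous_start, previous_duration, interval=timedelta(hours=1)):
    """Return when the next synchronization may start.

    The next run starts one interval after the previous start, but never
    before the previous run has finished.
    """
    scheduled = previous_start + interval
    finished = previous_start + previous_duration
    return max(scheduled, finished)
```

If a run regularly takes longer than the interval, the function simply pushes the next start back, which is a sign that the interval should be lengthened or the synchronization made faster.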
The information is stored in the index database, and free-text indexing is performed immediately on the incoming data. The index database may also store the entire source document content, if you want to make it searchable. This is a figure showing the connection between the web services, the index database and the base systems.

The communication with the base systems works according to the principles of loosely coupled systems. Dependencies regarding system up-time are minimized. With this solution, the responsibility for producing and delivering the incoming XML file lies with each base system, including:
- filtering out information that the base system doesn’t want to publish and make available for search
- optimizing the data for improved searchability
One risk is that the FTP file transfer fails for some reason, but from a search user's perspective the solution still works. The latest data will not be in the index database, but the search function will work. The next time the FTP transfer works, the index database will be up to date again.
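The import step can be made tolerant of a failed transfer along these lines. This is a sketch under assumptions: the XML layout (a `documents` root with `document` elements carrying an `id` attribute), the function name and the dictionary used as index are all hypothetical, since the real file format is specific to each base system.

```python
import os
import xml.etree.ElementTree as ET


def import_delivery(xml_path, index):
    """Merge one delivered XML file into `index` (dict: doc id -> field dict).

    Returns the number of documents imported. A missing or truncated file,
    for example after a failed FTP transfer, leaves `index` untouched.
    """
    if not os.path.exists(xml_path):
        return 0  # nothing delivered; keep serving the existing index
    try:
        # The whole file is parsed before the index is changed.
        root = ET.parse(xml_path).getroot()
    except ET.ParseError:
        return 0  # partial or corrupt file; wait for the next delivery
    imported = 0
    for doc in root.findall("document"):
        # Each child element of <document> becomes one named field.
        index[doc.get("id")] = {child.tag: (child.text or "") for child in doc}
        imported += 1
    return imported
```

Because the file is parsed completely before anything is written, a broken delivery never leaves the index half updated.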
One requirement on the base systems and their transfer programs is that they set timestamps on changes in the base systems. Using that timestamp, a batch program can extract only the data that is new or changed since the last run.
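The timestamp-based extraction could look roughly like this. The record layout (a `modified` field on each record) and the function name are assumptions for the example; a real transfer program would run the same comparison as a query inside the base system.

```python
from datetime import datetime


def extract_changes(records, last_run):
    """Return the records changed after `last_run` and the new high-water mark.

    `records` is an iterable of dicts that each carry a `modified` timestamp
    (a hypothetical field name). The returned timestamp is what the next
    batch run should pass in as `last_run`.
    """
    changed = [record for record in records if record["modified"] > last_run]
    # If nothing changed, keep the old mark so that no change is skipped later.
    new_mark = max((record["modified"] for record in changed), default=last_run)
    return changed, new_mark
```

Using the latest timestamp actually seen, rather than the clock time of the batch run, avoids missing records that change while the extraction is running.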
What is this magic Lucene index database then? Lucene is a free/open source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License. Lucene has been ported to programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP. Our solution used the C# version known as Lucene.NET.
While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality. The Apache project Nutch is based on Lucene and provides this functionality; the Apache project Solr is a fully-featured search server based on Lucene.
At the core of Lucene's logical architecture is a notion of a document containing fields of text. This flexibility allows Lucene's API to be agnostic of file format. Text from PDFs, HTML, Microsoft Word documents, as well as many others can all be indexed so long as their textual information can be extracted.
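To illustrate the idea of documents made up of named text fields, here is a toy inverted index in Python. It is deliberately not Lucene's API and does no analysis, scoring or persistence; `TinyIndex` and its methods are invented for this example only.

```python
from collections import defaultdict


class TinyIndex:
    """Toy model of a fielded full-text index (not Lucene's API)."""

    def __init__(self):
        # Maps (field name, term) to the set of document ids containing it.
        self._postings = defaultdict(set)

    def add_document(self, doc_id, fields):
        # `fields` maps a field name to already extracted plain text, so the
        # index never needs to know whether the source was PDF, HTML or Word.
        for field, text in fields.items():
            for term in text.lower().split():
                self._postings[(field, term)].add(doc_id)

    def search(self, field, term):
        # Returns the matching document ids in sorted order.
        return sorted(self._postings.get((field, term.lower()), set()))
```

Searching is then always scoped to a field, for example `index.search("title", "budget")`, which is what lets one index serve documents of very different origins.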
This is just one way to implement this functionality. It proved to work very well for us: the quality of the search result sets is good, and the performance in terms of search response times is fantastic.
Feel free to comment on this or let me know if you have had other experiences in the field of enterprise search!
© Copyright 2007, Tomas Elfving

