Thursday, September 20, 2007

An Enterprise Search solution using Lucene

I recently designed an enterprise search solution for a public sector organization with more than 50,000 employees. I laid out the challenges that we aimed to handle in a previous posting (http://tomaselfving.blogspot.com/2007/09/challenges-of-enterprise-search.html). One additional design objective was to minimize the number of synchronous calls to the base systems.

The solution consists of a five-layer architecture:

  1. Web application layer – any number of web sites and client applications using the search web services of layer 2.
  2. Web service layer – web services with customized search functionality.
  3. Data warehouse layer – the index database that the web services run their search queries against.
  4. Integration layer – components managing the import and the processing of base system data.
  5. Base systems connected to the solution.


Web application layer
The web application consists of a single search field on the intranet portal home page and a set of underlying pages that display the search result sets and handle the user interactions that start from those pages. It uses the three web services in the web service layer that expose the enterprise search functionality. The result set can easily be merged and sorted together with results from searches on web sites and document stores.

Web service layer
The web services are designed to expose custom search functionality based on the structure of the data in the base systems. The first web service is the search service, which returns a result set sorted by relevance and ready to display. As the base systems that we have integrated are document management systems, we need a way to retrieve the physical documents. The second and third web services fetch documents from the two base systems, one service per system. Each takes the document id as an input parameter and returns the corresponding document from its base system.

Intermediate database layer and the integrations
The idea is to gather as much information as possible from the base systems in an intermediate database, in order to make it available for high-performance search queries. No round-trips to the base systems are required. This way, the base systems are protected against the hard-to-estimate performance load from web applications using the web services. The only disadvantage is that data in the intermediate database will not be as fresh as the data in the base systems. In this case, this is acceptable. Synchronizations may be done as often as you like; we use an interval of 1 hour. The length of the interval depends on:
  1. How long the synchronization process takes. The interval obviously shouldn’t be shorter than that, as the previous synchronization needs to finish before the next one starts!
  2. How much the base data changes. The more changes, the shorter the recommended interval.
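The first point can be turned into a simple guard. The following is a minimal sketch in Python, not part of the actual solution: the function name and the safety_factor margin are assumptions made for illustration.

```python
from datetime import timedelta


def choose_sync_interval(last_run_duration, preferred_interval, safety_factor=1.5):
    """Return an interval that is never shorter than the last run plus headroom.

    safety_factor is a hypothetical margin, not a value from the project.
    """
    minimum = last_run_duration * safety_factor
    return max(preferred_interval, minimum)


# A 10-minute run fits comfortably inside a preferred 1-hour interval.
print(choose_sync_interval(timedelta(minutes=10), timedelta(hours=1)))
# A 50-minute run pushes the interval beyond the preferred hour.
print(choose_sync_interval(timedelta(minutes=50), timedelta(hours=1)))
```

The preferred interval would be driven by the second point: the more the base data changes, the lower you would set it, as long as the guard still holds.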
We populate the index using a Microsoft BizTalk orchestration that reads incoming files from a folder. The two participating base systems send their XML files, each in its own XML format, to this folder using FTP. The files contain both metadata and document information for new and changed documents. This mechanism ensures that documents are only sent once. We use a different XML mapping schema for each system to translate terms that have different meanings in the two systems and to perform simple processing such as formatting.
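To illustrate the idea of per-system mappings, here is a minimal sketch in Python. It is not the BizTalk maps from the project: the element names, the field names and both sample files are hypothetical and only show how two different XML formats can be translated into one common record format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings: source element name -> common field name.
MAPPINGS = {
    "system_a": {"DocId": "id", "Title": "title", "Changed": "modified"},
    "system_b": {"identifier": "id", "heading": "title", "lastUpdate": "modified"},
}


def normalize(xml_text, system):
    """Translate one incoming XML file into a list of common-format records."""
    mapping = MAPPINGS[system]
    records = []
    root = ET.fromstring(xml_text)
    for doc in root:  # each child of the root is assumed to be one document
        record = {"source": system}
        for child in doc:
            field = mapping.get(child.tag)
            if field is not None:
                # Simple formatting step: trim surrounding whitespace.
                record[field] = (child.text or "").strip()
        records.append(record)
    return records


sample_a = (
    "<Documents><Document><DocId>A-1</DocId><Title> Annual report </Title>"
    "<Changed>2007-09-01T10:00:00</Changed></Document></Documents>"
)
sample_b = (
    "<export><item><identifier>B-7</identifier><heading>Meeting notes</heading>"
    "<lastUpdate>2007-09-02T08:30:00</lastUpdate></item></export>"
)

print(normalize(sample_a, "system_a"))
print(normalize(sample_b, "system_b"))
```

After this step, records from both systems share the same field names, so the indexing code does not need to know which base system a document came from.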
The information is stored in the index database and free-text indexing is immediately performed on the incoming data. The index database may also store the entire source document content, if you want to make it searchable.

[Figure: the connection between the web services, the index database and the base systems.]


The communication with the base systems works according to the principles of loosely coupled systems. Dependencies regarding system up-time are minimized. With this solution, the responsibility for producing and delivering the incoming XML file, including…
  • filtering of information that the base system doesn’t want to publish and make available for search
  • and optimization of the data for improved searchability
…lies in the respective base system.

One risk is that the FTP file transfer fails for some reason, but from a search user's perspective the solution still works. The latest data will not be in the index database, but the search function will work. The next time the FTP transfer works, the index database will be up to date again.
One requirement on the base systems and their transfer programs is that they set timestamps on changes in the base systems. Using that timestamp, a batch program can extract only the data that is new or changed since the last run.
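The timestamp-based extraction can be sketched as follows. This is a minimal Python illustration with made-up records and field names, not the transfer programs of the base systems.

```python
from datetime import datetime


def extract_changes(records, last_run):
    """Return records modified after last_run, plus the new high-water mark."""
    changed = [r for r in records if r["modified"] > last_run]
    # If nothing changed, the mark stays where it was.
    new_mark = max((r["modified"] for r in changed), default=last_run)
    return changed, new_mark


# Hypothetical base system data.
records = [
    {"id": "A-1", "modified": datetime(2007, 9, 19, 8, 0)},
    {"id": "A-2", "modified": datetime(2007, 9, 20, 9, 30)},
    {"id": "A-3", "modified": datetime(2007, 9, 20, 10, 15)},
]
last_run = datetime(2007, 9, 20, 9, 0)

changed, last_run = extract_changes(records, last_run)
print([r["id"] for r in changed], last_run)
```

If the mark is only advanced after a successful transfer, a failed run simply means that the next run picks up a larger batch, which matches the recovery behaviour described above.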

What is this magic Lucene index database then? Lucene is a free/open source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License. Lucene has been ported to programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP. Our solution used the C# version known as Lucene.Net.
While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality. The Apache project Nutch is based on Lucene and provides this functionality; the Apache project Solr is a fully-featured search server based on Lucene.
At the core of Lucene's logical architecture is a notion of a document containing fields of text. This flexibility allows Lucene's API to be agnostic of file format. Text from PDFs, HTML, Microsoft Word documents, as well as many others can all be indexed so long as their textual information can be extracted.
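The "document containing fields of text" idea can be illustrated with a toy inverted index. The sketch below is plain Python and deliberately does not use the Lucene or Lucene.Net API; the class and its methods are made up for illustration, and real Lucene adds analysis, scoring and much more.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps (field, term) to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of document ids
        self.docs = {}                    # document id -> stored fields

    def add(self, doc_id, fields):
        """Index a document given as a dict of field name -> extracted text."""
        self.docs[doc_id] = fields
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return the ids of documents whose field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))


index = TinyIndex()
# The text could come from a PDF, an HTML page or a Word file; the index only sees fields.
index.add("doc1", {"title": "Budget report", "body": "The annual budget for 2007"})
index.add("doc2", {"title": "Meeting notes", "body": "Notes from the budget meeting"})

print(index.search("body", "budget"))
print(index.search("title", "budget"))
```

Because the index only ever sees field names and text, the original file format does not matter once the text has been extracted.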

This is just one way to implement this functionality. It proved to work very well for us: the quality of the search result sets is good, and the performance in terms of search response times is fantastic.

Feel free to comment on this or let me know if you have had other experiences in the field of Enterprise search!


© Copyright 2007, Tomas Elfving

Saturday, September 8, 2007

Challenges of an Enterprise Search implementation

Enterprise Search is, to quote Wikipedia, "the practice of identifying and enabling specific content across the enterprise to be indexed, searched, and displayed to authorized users". The goal is to give the users a single search field while still searching all kinds of content from all kinds of data sources. The challenge is that "content" comes in many formats and from different kinds of data sources. It may for instance be:
- other internal web sites
- your own extranets & internet sites
- file shares
- customer records held in a CRM system
- business information in an internal database
- letters and reports in Document Management Systems
- people/contact information in a telephony system

Another challenge is that the information often has various levels of security classification, so that only authorized users should get hits on a search. Sounds fairly simple, right? But that means firstly that the user making the search needs to identify himself, and secondly that all the systems need to be able to identify that user correctly. Not an easy task when different base systems have their own user database with their own user identification solution. Across an enterprise, numerous user databases may be in use.

Information in base systems is not structured in a "search-friendly" format. A relational database is pretty useless as it is when it comes to extracting relevance-sorted search results from a free text search. You probably need to do some work to make these systems available as good data sources in an enterprise search solution.

How do you create a search result when hits come from many different data sources? In what order should the search hits be sorted? You need a way to understand the relevance weight of each search hit from each data source when assembling the total, final search result to be presented to the user.
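One possible approach, sketched below in Python, is to normalize the scores within each data source and then apply a weight per source before merging. This is only an illustration under stated assumptions: the source names, the scores and the weights are made up, and dividing by the top score is just one of several ways to make scores comparable.

```python
def merge_results(result_sets, weights):
    """Merge per-source hits into one list sorted by weighted, normalized score.

    result_sets: source name -> list of (doc_id, raw_score)
    weights:     source name -> relative importance (hypothetical values)
    """
    merged = []
    for source, hits in result_sets.items():
        if not hits:
            continue
        top = max(score for _, score in hits)
        for doc_id, score in hits:
            # Scale each source to the range 0..1 so raw scores become comparable.
            normalized = score / top if top > 0 else 0.0
            merged.append((doc_id, source, normalized * weights.get(source, 1.0)))
    return sorted(merged, key=lambda hit: hit[2], reverse=True)


# Two sources with completely different raw score scales.
result_sets = {
    "dms": [("d1", 8.0), ("d2", 4.0)],
    "web": [("w1", 0.9), ("w2", 0.3)],
}
weights = {"dms": 1.0, "web": 0.8}

for hit in merge_results(result_sets, weights):
    print(hit)
```

The weights are where the business decision lives: they express how much a top hit from one source should count compared with a top hit from another.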

Lastly, performance issues need to be addressed. The users of an enterprise search expect response times of no more than a few seconds; they are spoiled by the Google performance of "0,047 seconds" ;-). The base systems participating in the search solution are rarely prepared for this scenario.

In an upcoming post I'll describe an enterprise search solution architecture that I recently implemented for a large organisation, addressing the challenges described above.


Monday, September 3, 2007

Scoping content with Audience targeting in MOSS 2007

I'm getting comments & questions about different "group problems" on a previous post on Audience targeting ("Audience targeting and User profiles in MOSS 2007"), and it seems to be a good idea to clarify the different options available with audiences. These are some additional thoughts on scoping content in MOSS 2007.
  1. SharePoint Groups - SharePoint Groups are a valid Target Audience mechanism. This is especially useful when the site administrators may not have access to Active Directory. With SharePoint Groups, the site administrator has full control over the membership of the audience. SharePoint Groups have the added benefit of allowing self-enrollment, if the site administrator wants to set up a site that has different levels of information and let the users themselves subscribe to the components they'd like.
  2. Active Directory Domain Groups or Roles - Active Directory domain groups are also a valid Target Audience. The nice thing about this option is that many organizations already have groups set up for internal use that are suitable for targeting specific areas of the organization. The SharePoint site administrator, however, has little or no control over the membership of the group. This might, on the other hand, be just what you want!
  3. Audience Rules - These are very powerful and perhaps the least understood option. They can be used to do a number of convenient things. An audience can be set up with multiple rules and then be configured to require a match on all rules or on any rule. This allows a SharePoint Shared Services administrator to define and scope very flexible audiences that update automatically as user information is changed and synchronized into SharePoint. The rules might be as simple as belonging to a security group or being part of a specific area of the Active Directory organizational hierarchy. Group, list and organizational rules have the operators "Reports Under" and "Member Of". User profile property rules have the operators "Contains" and "Not Contains". You could for instance make a basic rule to match all users with:
     • the word Sales rep in their Title
     • SharePoint under their Skills and Manager in their Title
     • membership of the New Employee Orientation Team distribution list
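The "all rules or any rule" logic can be sketched in a few lines of Python. This is not the SharePoint object model: the functions, the user dictionary and the "groups" property are hypothetical, and only three of the operators mentioned above are modelled.

```python
def rule_matches(user, rule):
    """Evaluate one (property, operator, value) rule against a user profile."""
    prop, operator, value = rule
    if operator == "Member Of":
        return value in user.get(prop, [])
    text = str(user.get(prop, "")).lower()
    if operator == "Contains":
        return value.lower() in text
    if operator == "Not Contains":
        return value.lower() not in text
    raise ValueError(f"unsupported operator: {operator}")


def in_audience(user, rules, require_all=True):
    """Return True if the user satisfies all rules, or any rule if require_all is False."""
    results = (rule_matches(user, rule) for rule in rules)
    return all(results) if require_all else any(results)


# Hypothetical user profile.
user = {
    "Title": "Senior Sales rep",
    "Skills": "SharePoint, SQL",
    "groups": ["New Employee Orientation Team"],
}
rules = [
    ("Title", "Contains", "Sales rep"),
    ("Skills", "Contains", "SharePoint"),
    ("Title", "Contains", "Manager"),
]

print(in_audience(user, rules, require_all=True))
print(in_audience(user, rules, require_all=False))
```

In the real product the audience is compiled on a schedule from the synchronized profile data, so membership changes when the underlying user information changes rather than at request time.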