Wednesday, November 01, 2006

Full Text Search

In this article I evaluate some of the options for integrating full-text search into Java applications.

MySQL’s built-in Full Text Search engine

My initial research suggests that MySQL’s built-in full-text search engine is surprisingly effective when the dataset is small. It is also the cheapest option to implement, since the search criteria can be specified as part of the query itself. As the dataset grows, however, its efficiency becomes dependent on system resources such as CPU and RAM.
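To illustrate what "part of the query itself" means, here is a minimal JDBC sketch using MySQL's MATCH ... AGAINST syntax. The table name articles, its columns title and body, the FULLTEXT index on them, and the class and method names are all assumptions made for this example, not something taken from a real schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class MySqlFullTextSearch {
    // Assumed schema: articles(id, title, body) with a FULLTEXT index on (title, body).
    // The search criteria live inside the SQL statement; no external engine is involved.
    static final String SEARCH_SQL =
        "SELECT title FROM articles WHERE MATCH(title, body) AGAINST (?)";

    // Runs the full-text query on an already opened connection and collects the titles.
    static List<String> searchTitles(Connection conn, String terms) throws SQLException {
        List<String> titles = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SEARCH_SQL)) {
            ps.setString(1, terms);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    titles.add(rs.getString("title"));
                }
            }
        }
        return titles;
    }
}
```

The columns listed in MATCH(...) have to be the same columns the FULLTEXT index was created on, so the index definition and the query need to be kept in step.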

Open Source Full Text Search engines

Most external full-text search engines keep a separate index of the table data, which is updated at frequent intervals (possibly with some caching), so that less time is spent on the database server searching for information. This approach lessens the load on the database server.
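The idea of a separate index can be sketched in a few lines of plain Java. This is only a toy inverted index (a map from each word to the ids of the rows containing it) and is not how Sphinx or Lucene are actually implemented; the class name TinyInvertedIndex and its methods are made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyInvertedIndex {
    // word -> ids of the rows whose text contains that word
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Called whenever a row is loaded or refreshed from the database.
    void add(int rowId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(rowId);
            }
        }
    }

    // Returns the ids of rows containing every word of the query,
    // without touching the database at all.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<Integer> ids = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Searches are answered from the map alone, which is why the database server sees less load; the price is that the index has to be refreshed whenever the table data changes.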

A complete list of popular full-text search engines is available on the WikiMedia site.

1. Sphinx

Sphinx is a full-text search engine, distributed under GPL version 2. It is a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Its built-in data source drivers currently support fetching data either via a direct connection to MySQL or PostgreSQL, or from a pipe in a custom XML format.

2. Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java, with features such as scalable, high-performance indexing and powerful, accurate and efficient search algorithms.

As a full-text search engine, Lucene needs little introduction. Lucene, an open source project hosted by Apache, aims to produce high-performance full-text indexing and search software. The Java version of Lucene is a high-performance, high-capacity full-text search tool used by many popular websites, such as the Wikipedia online encyclopedia and TheServerSide.com, as well as in many Java applications. It is a fast, reliable tool that has proved its value in countless demanding production environments.

Although Lucene is well known for its full-text indexing, many developers are less aware that it can also provide powerful complementary searching, filtering, and sorting functionality. Indeed, many searches combine full-text searches with filters on different fields or criteria. For example, you may want to search a database of books or articles using a full-text search, but with the option of limiting the results to certain types of books. Traditionally, this type of criteria-based searching has been the realm of the relational database. However, Lucene offers numerous powerful features that let you efficiently combine full-text searches with criteria-based searches and sorts.
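To make the books example concrete, here is a library-free sketch of what such a combined search does: a text match, a filter on a field, and a sort. It deliberately does not use Lucene's API, and the text match is a plain case-insensitive substring check rather than real full-text scoring. The Book class, its fields, and the search method are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class FilteredTextSearch {
    // Hypothetical record type for the example.
    static final class Book {
        final String title;
        final String type;
        final int year;

        Book(String title, String type, int year) {
            this.title = title;
            this.type = type;
            this.year = year;
        }
    }

    // Keeps books whose title contains the term AND whose type matches,
    // then sorts the hits by year, newest first.
    static List<Book> search(List<Book> books, String term, String type) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Book> hits = new ArrayList<>();
        for (Book b : books) {
            boolean textMatch = b.title.toLowerCase(Locale.ROOT).contains(needle);
            boolean typeMatch = b.type.equals(type);
            if (textMatch && typeMatch) {
                hits.add(b);
            }
        }
        hits.sort(Comparator.comparingInt((Book b) -> b.year).reversed());
        return hits;
    }
}
```

In a relational database the type filter and the sort would be a WHERE clause and an ORDER BY; the point of the paragraph above is that a search library can evaluate the same kind of criteria together with the text match in a single pass over its index.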

Benchmarks

The results of benchmarking the most popular full-text search engines (MySQL’s built-in full-text search engine, the Sphinx full-text search engine plug-in for MySQL, and Lucene) are published on the PlanetMySQL site.

Conclusion

Lucene is currently the most popular full-text search solution for running efficient full-text searches over database content, compared with MySQL’s built-in full-text search engine and the Sphinx plug-in for MySQL. Because Lucene is a Java API, it integrates seamlessly with other Java programs, whereas Sphinx is a standalone engine written in C++. Lucene is a set of tools for creating an index and then searching it, so index creation, index updates, and searches have to be handled manually through the API. The good news is that Spring and Hibernate support Lucene integration through various support classes. So go ahead and use it in your Java projects. Have a great time searching inside your applications with Lucene!

Reference:

Java World Article on Integrating Lucene

1 comment:

Ajit Kumar L said...

Roshan,
Done well. Good indeed is your sharing of experiences.