Sunday, May 31, 2015

Introduction to Elasticsearch

A look through my previous posts makes it clear that I am very fond of writing about full-text search. This time, too, I would like to discuss a topic related to searching and indexing: Elasticsearch, a document-based NoSQL database that uses an underlying full-text search engine to store data as indexes and query it efficiently. I recently had the opportunity to work with this product and was fascinated by its concepts and features.

I don't wish to get into the theory of NoSQL databases and polyglot persistence here. For that, you can refer to the book NoSQL Distilled by Pramod J. Sadalage and Martin Fowler, which gives a good introduction to the subject. Here I would like to give a brief introduction to Elasticsearch and explain how it relates to the concept of full-text search.

A little bit of history (as per Wikipedia):

Shay Banon created the Compass framework in 2004. (You can read about Compass in my previous article here.) While thinking about the third version of Compass, he realized that it would be necessary to rewrite big parts of it to "create a scalable search solution". So he created "a solution built from the ground up to be distributed" and used a common interface, JSON over HTTP, suitable for programming languages other than Java as well. Shay Banon released the first version of Elasticsearch in February 2010.

By the way, what is Elasticsearch?:

Elasticsearch is a distributed RESTful search engine built for the cloud. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License. You can download Elasticsearch from the official distribution site.

How is Elasticsearch related to Full Text Search?:

Under the hood, Elasticsearch is powered by Apache Lucene, a powerful open-source full-text search library. Elasticsearch uses Lucene to create and manage an inverted index, a structure that maps each term to the documents containing it. Instead of scanning the text directly, a query looks up its terms in this index, which is how Elasticsearch achieves fast search responses. This is like finding the pages of a book related to a keyword by consulting the index at the back, as opposed to reading every word of every page.
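The book-index analogy can be made concrete with a toy inverted index. This is only a minimal sketch of the idea in Python, not how Lucene is actually implemented: the function names, the sample documents and the naive whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene is a full text search library",
    3: "Kibana visualizes data",
}
index = build_inverted_index(docs)
print(search(index, "lucene is"))
```

The search never touches the original text: it looks up each query term in the index and intersects the resulting sets of document ids, which is why the cost depends on the number of matching documents rather than on the total amount of text.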

My Experiences with Elasticsearch:

Recently, I was involved in migrating a data warehouse from a relational star schema to an Elasticsearch NoSQL data store. This rewrite significantly simplified the schema definition for analytics and gave impressive performance benefits for storing (indexing) and querying (searching) data. Elasticsearch also gave us the benefits of horizontal scaling (through the concept of shards), which is well suited to handling big data.
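To give a rough idea of how shards enable horizontal scaling: each document is assigned to one primary shard by hashing a routing value (the document id by default) and taking the remainder modulo the number of primary shards. The Python sketch below only illustrates that idea. The function name and sample ids are hypothetical, and CRC32 is used as a stand-in hash, so the shard numbers it produces will not match those of a real cluster.

```python
import zlib
from collections import Counter


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Choose a shard as hash(routing value) modulo the number of primary shards.

    CRC32 is only a stand-in hash for this illustration (assumption).
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Hypothetical document ids spread over 5 primary shards.
doc_ids = [f"order-{n}" for n in range(1000)]
distribution = Counter(pick_shard(doc_id, 5) for doc_id in doc_ids)
for shard, count in sorted(distribution.items()):
    print(f"shard {shard}: {count} documents")
```

Because the same id always hashes to the same shard, any node can work out where a document lives without a central lookup table, and the shards themselves can be spread across as many machines as needed.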

One challenge with a NoSQL database, however, is dealing with its non-conventional style of queries and result sets (usually a JSON-based notation). This can seem unintuitive to developers who are far more familiar with the standard notation of Structured Query Language (SQL) for relational databases. It is a clear disadvantage for projects using NoSQL data stores, because developers spend a lot of time writing code that translates SQL statements into their NoSQL counterparts.
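As an illustration of that translation effort, here is a minimal Python sketch that turns one very simple SQL-like request into a JSON query body. The helper name, the index name `blog` and the field name `title` are assumptions made for this example, and real SQL-to-DSL translation involves far more than this. The `match` query and the `size` parameter are part of the Elasticsearch Query DSL, but check the documentation of your version for details.

```python
import json


def simple_select_to_dsl(field: str, value: str, limit: int) -> dict:
    """Build a request body roughly equivalent to:

        SELECT * FROM <index> WHERE <field> matches <value> LIMIT <limit>

    Hypothetical helper for illustration only.
    """
    return {
        "query": {"match": {field: value}},  # full-text match on a single field
        "size": limit,                       # maximum number of hits to return
    }


# Roughly: SELECT * FROM blog WHERE title matches 'elasticsearch' LIMIT 10
body = simple_select_to_dsl("title", "elasticsearch", 10)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request against the index. Even for a one-condition query, the nested structure is noticeably more verbose than its one-line SQL counterpart.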

Because of the complexity and learning curve associated with this non-conventional style of querying, firing even simple queries against a NoSQL database becomes cumbersome and time-consuming. While working on the migration project, I looked for an open-source tool that could provide a SQL interface to the underlying NoSQL data store, so that I could verify data easily outside the application. I couldn't find one, so I had to use the web-based Marvel plugin for Elasticsearch and spend a lot of time learning the syntax for writing queries in the format Elasticsearch understands.

So, after finishing the migration project, I started working on an open-source project to develop a web-based tool for querying Elasticsearch using standard SQL syntax. The project is publicly hosted on GitHub under the name sql-query-browser-for-elasticsearch. It is still under development (with daily check-ins), but in its current state it can be used as a thin web client to execute simple queries against an Elasticsearch server running on any machine. The project can be downloaded using a Git client and run locally.

Emerging visualization trends with Elasticsearch:

Elasticsearch, Logstash and Kibana, together known as the ELK Stack, are emerging as a leading technology stack for collecting, managing and visualizing big data. The stack provides good visualization tools and relieves developers of the need to write custom queries to fetch data.

Conclusion:

The world of NoSQL is still in its infancy and has a long way to go before it reaches the position that relational databases hold today. Perhaps in the near future we can expect a standardized language for retrieving and updating information in NoSQL databases, similar to what SQL provides for relational databases. Until then, we need community involvement in producing more open-source, user-friendly query tools to interface with these feature-rich databases.

References:

1. Wikipedia entry for Elasticsearch
2. GitHub project for Elasticsearch