Elasticsearch for Java Developers: Introduction
This article is part of our Academy Course titled Elasticsearch Tutorial for Java Developers.
In this course, we provide a series of tutorials so that you can develop your own Elasticsearch based applications. We cover a wide range of topics, from installation and operations, to Java API Integration and reporting. With our straightforward tutorials, you will be able to get your own projects up and running in minimum time. Check it out here!
1. Introduction
Effective, fast and accurate search functionality is an integral part of the vast majority of modern applications and software platforms. Whether you are running a small e-commerce web site and need to offer your customers search over product catalogs, or you are a service provider and need to expose an API that lets developers filter over users and companies, or you are building any kind of messaging application where finding a conversation in the history is a must-have feature from day one … What really matters is that delivering results as relevant and as fast as possible could be yet another competitive advantage of the product or platform you are developing.
Indeed, search can have many faces, purposes, goals and scales. It could be as simple as looking for an exact word match, or as complex as trying to understand the intent and the contextual meaning of the words one is looking for (semantic search engines). In terms of scale, it could be as trivial as querying a single database table, or as complex as crunching over billions and billions of web pages in order to deliver the desired results. It is a very interesting and flourishing area of research, with many algorithms and papers published over the years.
In case you are a Java / JVM developer, you may have heard about the Apache Lucene project, a high-performance, full-featured indexing and search library. It is the first and best-in-class choice for unleashing the power of full-text search and embedding it into your applications. Although it is a terrific library by all means, many developers have found Apache Lucene too low-level and not easy to use. That is one of the reasons why two other great projects, Elasticsearch and Apache Solr, were born.
In this tutorial, we are going to talk about Elasticsearch, placing the emphasis on the development side of things rather than the operational one. We are going to learn the basics of Elasticsearch, get familiar with the terminology, and discuss different ways to run it and communicate with it from within Java / JVM applications or the command line. At the very end of the tutorial we are going to talk about the Elastic Stack, to showcase the ecosystem around Elasticsearch and its amazing capabilities.
If you are a junior or seasoned Java / JVM developer interested in learning about Elasticsearch, this tutorial is definitely for you.
2. Elasticsearch Basics
To get started, it would be great to answer the question: so, what is Elasticsearch, how can it help me, and why should I use it?
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements. – https://www.elastic.co/
Elasticsearch is built on top of Apache Lucene but favors communication over RESTful APIs and advanced in-depth analytics features. The RESTful part makes Elasticsearch particularly easy to learn and use. At the moment of this writing, the latest stable release branch of Elasticsearch is 5.2, with the latest released version being 5.2.0. We should definitely give the Elasticsearch team credit for keeping up such a pace of delivering new releases; the 5.0.x / 5.1.x branches are just a few months old.
From the perspective of Elasticsearch, exposing RESTful APIs has another advantage: every single piece of data sent to or received from Elasticsearch is itself a human-readable JSON document (although this is not the only protocol Elasticsearch supports, as we are going to see later on).
To keep the discussion relevant and practical, we are going to pretend that we are developing an application to manage a catalog of books. The data model will include categories, authors, publisher, book details (like publishing date, ISBN, rating) and a brief description.
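To make the data model concrete, a single book entry might look like the following JSON document (the field names and values here are purely illustrative, chosen to match the mapping we will build later in this section):

```json
{
  "title": "Elasticsearch: The Definitive Guide",
  "categories": [
    { "name": "analytics" },
    { "name": "search" }
  ],
  "publisher": "O'Reilly",
  "description": "A distributed real-time search and analytics engine",
  "published_date": "2015-02-07",
  "isbn": "978-1449358549",
  "rating": 5
}
```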
Let us see how we could leverage Elasticsearch to make our book catalog easily searchable, but before that we need to get familiar with the terminology. Although in the next couple of sections we are going to go over most of the concepts behind Elasticsearch, please do not hesitate to consult the official Elasticsearch documentation at any time.
2.1. Documents
To put it simply, in the context of Elasticsearch a document is just an arbitrary piece of (usually structured) data. It could be absolutely anything that makes sense to your applications (users, logs, blog posts, articles, products, …), and it is the basic unit of information which Elasticsearch can manipulate.
2.2. Indices
Elasticsearch stores documents inside indices, and as such, an index is simply a collection of documents. To be fair, persisting completely different kinds of documents in the same index would be possible but quite difficult to work with, so every index may have one or more types. The types group documents logically by defining a set of common properties (or fields) that every document of that type should have. Types serve as metadata about documents and are very useful for exploring the structure of the data and constructing meaningful queries and aggregations.
2.3. Index Settings
Each index in Elasticsearch can have specific settings associated with it at the time of its creation. The most important ones are the number of shards and the replication factor. Let us talk about these for a moment.
Elasticsearch has been built from the ground up to operate over massive amounts of indexed data, which will very likely exceed the memory and/or storage capabilities of a single physical (or virtual) machine. As such, Elasticsearch uses sharding as a mechanism to split an index into several smaller pieces, called shards, and distribute them among many nodes. Please note that, once set, the number of shards cannot be altered (although this is no longer entirely true, as an index can be shrunk into fewer shards).
Indeed, sharding solves a real problem, but it is vulnerable to data loss in the case of individual node failures. To address this problem, Elasticsearch supports high availability by leveraging replication. Depending on the replication factor, Elasticsearch maintains one or more copies of each shard and makes sure that each shard's replica is placed on a different node.
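The routing of a document to a shard can be sketched in plain Java. Note that this is a simplified illustration only: Elasticsearch actually hashes the routing value (by default, the document id) with the Murmur3 algorithm, while here we use `String.hashCode()` purely for demonstration. The important property is the same: the shard is derived deterministically from the routing value modulo the number of primary shards, which is why that number cannot change after index creation.

```java
// Simplified sketch of how Elasticsearch assigns a document to a primary
// shard: hash the routing value, then take it modulo the shard count.
// (Elasticsearch uses Murmur3; String.hashCode() is used here only for
// illustration.)
public class ShardRouting {
    static int shardFor(String routing, int numberOfPrimaryShards) {
        int hash = routing.hashCode();
        // Math.floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(hash, numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        // With 5 primary shards, every routing value deterministically
        // lands on one shard between 0 and 4.
        int shard = shardFor("book-978-1449358549", 5);
        System.out.println("Document routed to shard " + shard);
    }
}
```

Because the formula is deterministic, the same document id always resolves to the same shard, and any node can compute where a document lives without a central lookup.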
2.4. Mappings
The process of defining the type of the documents and assigning it to a particular index is called index mapping, mapping type or just mapping. Coming up with a proper type mapping is probably one of the most important design exercises you will have to undertake in order to get the most out of Elasticsearch. Let us take some time and talk about mappings in detail.
Each mapping consists of optional meta-fields (they usually start with the underscore ‘_’ character, like _index, _id, _parent) and regular document fields (or properties). Each field (or property) has a data type, which in Elasticsearch falls into one of these categories:
- Simple data types
- text – indexes full-text values
- keyword – indexes structured values
- date – indexes date/time values
- long – indexes signed 64-bit integer values
- integer – indexes signed 32-bit integer values
- short – indexes signed 16-bit integer values
- byte – indexes signed 8-bit integer values
- double – indexes double-precision 64-bit IEEE 754 floating point values
- float – indexes single-precision 32-bit IEEE 754 floating point values
- half_float – indexes half-precision 16-bit IEEE 754 floating point values
- scaled_float – indexes floating point values that are backed by a long and a fixed scaling factor
- boolean – indexes boolean values (for example, true/false, on/off, yes/no, 1/0)
- ip – indexes either IPv4 or IPv6 address values
- binary – indexes any binary value encoded as a Base64 string
- Composite data types
- object – indexes inner JSON objects as flattened fields of the containing document
- nested – indexes arrays of inner objects as separate (hidden) documents
- Specialized data types
- geo_point – indexes latitude-longitude pairs
- geo_shape – indexes arbitrary geo shapes (such as rectangles and polygons)
- completion – dedicated data type to back auto-complete/search-as-you-type functionality
- token_count – dedicated data type to count the number of tokens in a string
- percolator – a specialized data type to store a query, to be used by the percolate query to match documents
- Range data types
- integer_range – indexes a range of signed 32-bit integers
- float_range – indexes a range of single-precision 32-bit IEEE 754 floating point values
- long_range – indexes a range of signed 64-bit integers
- double_range – indexes a range of double-precision 64-bit IEEE 754 floating point values
- date_range – indexes a range of date values, represented as unsigned 64-bit integers (milliseconds elapsed since the epoch)
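To give a feeling for one of the less obvious types above, the following sketch illustrates the idea behind scaled_float: the floating point value is multiplied by the scaling factor and rounded to a long for storage, trading precision for compactness. The on-disk details belong to Lucene; this is only a conceptual illustration.

```java
// Conceptual illustration of scaled_float storage: a double is encoded
// as a long using a fixed scaling factor, and decoded back on read.
public class ScaledFloatDemo {
    static long encode(double value, int scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    static double decode(long stored, int scalingFactor) {
        return (double) stored / scalingFactor;
    }

    public static void main(String[] args) {
        // A rating of 4.25 with a scaling factor of 100 is stored as 425
        long stored = encode(4.25, 100);
        System.out.println(stored);               // prints 425
        System.out.println(decode(stored, 100));  // prints 4.25
    }
}
```

The catch is visible in the code: any precision beyond the scaling factor is lost at encode time, which is exactly the trade-off scaled_float makes.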
It cannot be stressed enough: choosing the proper data type for the fields (properties) of your documents is key to fast, effective search that delivers really relevant results. There is one catch though: the fields in each mapping type are not entirely independent of each other. Fields with the same name, within the same index but in different mapping types, must have the same mapping definition. The reason is that internally those fields are mapped to the same field.
Getting back to our application data model, let us try to define the simplest mapping type for the books collection, utilizing our just-acquired knowledge about data types:

- title: text
- categories: ?
- authors: ?
- publisher: keyword
- description: text
- published_date: date
- isbn: keyword
- rating: byte
For most of the book properties the mapping data types are pretty straightforward, but what about authors and categories? Those properties essentially contain collections of values, for which Elasticsearch seems to have no direct data type … or has it?
2.5. Advanced Mappings
Interestingly, Elasticsearch indeed has no dedicated array or collection type, but by default any field may contain zero or more values (of its data type).
In the case of complex data structures, Elasticsearch supports mapping via the object and nested data types, as well as establishing parent/child relationships between documents within the same index. There are pros and cons to each approach, but in order to learn how to use those techniques, let us store categories as a nested property of the books mapping type, while authors are going to be represented as a dedicated mapping type which refers to books as its parent.

These are our close-to-final mapping types for the catalog index. As we already know, JSON is a first-class citizen in Elasticsearch, so let us get a feeling for how a typical index mapping looks in the format Elasticsearch actually understands.
{
  "mappings": {
    "books": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": { "type": "text" },
        "categories": {
          "type": "nested",
          "properties": {
            "name": { "type": "text" }
          }
        },
        "publisher": { "type": "keyword" },
        "description": { "type": "text" },
        "published_date": { "type": "date" },
        "isbn": { "type": "keyword" },
        "rating": { "type": "byte" }
      }
    },
    "authors": {
      "properties": {
        "first_name": { "type": "keyword" },
        "last_name": { "type": "keyword" }
      },
      "_parent": {
        "type": "books"
      }
    }
  }
}
You may be surprised, but the explicit definition of fields and mapping types can be omitted. Elasticsearch supports dynamic mapping, whereby new mapping types and new field names are added automatically when a document is indexed (in this case, Elasticsearch decides what the field data types should be).
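For example, indexing a document like the following into an empty index would cause Elasticsearch to infer the mapping on the fly (a sketch; the exact inferred types depend on the dynamic mapping rules and settings such as date detection):

```json
{
  "title": "Some New Book",
  "published_date": "2017-02-01",
  "rating": 4
}
```

Here the string would typically be mapped as a text field, the date-like string as a date, and the number as a long, without any mapping having been declared up front.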
Another important detail to mention is that each mapping type can have custom metadata associated with it by using the special _meta property. It is an exceptionally useful technique which we will use later on in the tutorial.
2.6. Indexing
Once Elasticsearch has all your indices and their mapping types defined (or inferred using dynamic mapping), it is ready to analyze and index the documents. It is a quite complex but interesting process which involves at least analyzers, tokenizers, token filters and character filters.
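Conceptually, the analysis chain can be sketched in plain Java. This is a toy illustration, not Elasticsearch's actual implementation: a character filter normalizes the raw text, a tokenizer splits it into tokens, and token filters transform or drop individual tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// A toy analysis pipeline mirroring the stages Elasticsearch applies at
// index time: character filter -> tokenizer -> token filters.
public class ToyAnalyzer {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and");

    static List<String> analyze(String text) {
        // Character filter: strip HTML-like tags from the raw input
        String filtered = text.replaceAll("<[^>]*>", " ");
        // Tokenizer: split on any run of non-letter characters
        String[] tokens = filtered.split("[^\\p{L}]+");
        // Token filters: lowercase, then drop empty tokens and stop words
        return Arrays.stream(tokens)
                .map(String::toLowerCase)
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("<b>The Catalog</b> of Books")); // prints [catalog, books]
    }
}
```

Real analyzers add much more (stemming, synonyms, language rules), but the staged shape of the pipeline is the same.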
Elasticsearch supports quite a rich set of mapping parameters which let you tailor the indexing, analysis and search phases precisely to your needs. For example, every single field (or property) can be configured to use its own index-time and search-time analyzers, support synonyms, apply stemming, filter out stop words and much, much more. By carefully crafting these parameters you may end up with superior search capabilities; however, the opposite also holds true: leave them loose, and a lot of irrelevant and noisy results may be returned every time.
If you don’t need all that, you are good to go with the defaults, as we did in the previous section by omitting the parameters altogether. However, that is rarely the case. To give a realistic example, most of the time our applications have to support multiple languages (and locales). Luckily, Elasticsearch shines here as well.
Before we move on to the next topic, there is an important constraint you have to be aware of. Once the mapping types are configured, in the majority of cases they cannot be updated, because changing them would imply that all the documents in the corresponding collections are no longer up to date and should be re-indexed.
2.7. Internationalization (i18n)
The process of indexing and analyzing documents is very sensitive to the native language of the document. By default, Elasticsearch uses the standard analyzer if none is specified in the mapping types. It works well for most languages, but Elasticsearch also supplies dedicated analyzers for Arabic, Armenian, Basque, Brazilian, Bulgarian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, Thai and a few more.
There are a couple of ways to approach indexing the same document in multiple languages, depending on your data model and business case. For example, if document instances physically exist (translated) in multiple languages, then it probably makes sense to have one index per language.
In the case when documents are only partially translated, Elasticsearch has another interesting option up its sleeve called multi-fields. Multi-fields allow indexing the same document field (property) in different ways, to be used for different purposes (such as supporting multiple languages). Getting back to our books mapping type, we may have defined the title property as a multi-field, for example:
"title": {
"type": "text",
"fields": {
"en": { "type": "text", "analyzer": "english" },
"fr": { "type": "text", "analyzer": "french" },
"de": { "type": "text", "analyzer": "german" },
...
}
}Those are not the only options available but they illustrate well enough the flexibility and maturity of the Elasticsearch in fulfilling quite sophisticated demands.
3. Running Elasticsearch
Elasticsearch embraces simplicity in many ways, and one of those is the exceptionally easy way to get started on almost any platform in just two steps: download and run. In the next couple of sections we are going to talk about quite a few different ways to get your Elasticsearch up and running.
3.1. Standalone Instance
Running Elasticsearch as a standalone application (or instance) is the fastest and simplest route to take. Just download the package of your choice and run the shell script on Linux/Unix/Mac operating systems:
bin/elasticsearch
Or from the batch file on Windows operating system:
bin\elasticsearch.bat
And that is it, pretty straightforward, isn’t it? However, before we go ahead and talk about more advanced options, it would be useful to get a taste of what it actually means to run an instance of Elasticsearch.