Elasticsearch for Java Developers: Elasticsearch from the command line
This article is part of our Academy Course titled Elasticsearch Tutorial for Java Developers.
In this course, we provide a series of tutorials so that you can develop your own Elasticsearch based applications. We cover a wide range of topics, from installation and operations, to Java API Integration and reporting. With our straightforward tutorials, you will be able to get your own projects up and running in minimum time. Check it out here!
1. Introduction
From the previous part of the tutorial we have a pretty good understanding of what Elasticsearch is, its basic concepts, and the power of the search capabilities it can bring to our applications. In this section we are jumping right into battle and applying that knowledge in practice. Throughout this section, curl and/or http (the HTTPie command-line client, https://httpie.org/, which pretty-prints and colors JSON output out of the box) will be the only tools we use to make friends with Elasticsearch.
To sum up, we have already finalized our book catalog index and mapping types, so we are going to pick things up from there. In order to keep things as close to reality as possible, we are going to use an Elasticsearch cluster with three nodes (all running as Docker containers), while the catalog index is going to be configured with a replication factor of two.
As we are going to see, working with an Elasticsearch cluster has quite a few subtleties compared to a standalone instance, and it is better to be prepared to deal with them. Hopefully you still remember from the previous part of the tutorial how to start Elasticsearch, as this is going to be the only prerequisite: having the cluster up and running. With that, let us get started!
2. Is My Cluster Healthy?
The first thing you need to know about your Elasticsearch cluster before doing anything with it is its health. There are a couple of ways to gather this information, but arguably the easiest and most convenient one is the Cluster APIs, particularly the cluster health endpoint.
$ http http://localhost:9200/_cluster/health
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"active_primary_shards": 0,
"active_shards": 0,
"active_shards_percent_as_number": 100.0,
"cluster_name": "es-catalog",
"delayed_unassigned_shards": 0,
"initializing_shards": 0,
"number_of_data_nodes": 3,
"number_of_in_flight_fetch": 0,
"number_of_nodes": 3,
"number_of_pending_tasks": 0,
"relocating_shards": 0,
"status": "green",
"task_max_waiting_in_queue_millis": 0,
"timed_out": false,
"unassigned_shards": 0
}
Among those details we are looking for the status indicator, which should be set to green, meaning that all shards are allocated and the cluster is in good operational shape.
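It is worth mentioning that the cluster health endpoint also accepts the wait_for_status and timeout parameters, which make the call block until the desired status is reached. That is a handy trick for scripts that need to wait for a freshly started cluster to become fully operational, for example:
$ http 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30s'
The same call works equally well with curl: curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30s'.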
3. All About Indices
Our Elasticsearch cluster is all green and ready to rock. The next logical step would be to create the catalog index, with the mapping types and settings we outlined before. But before doing that, let us check if there are any indices already created, this time using the Indices APIs.
$ http http://localhost:9200/_stats
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"_all": {
"primaries": {},
"total": {}
},
"_shards": {
"failed": 0,
"successful": 0,
"total": 0
},
"indices": {}
}
As expected, our cluster has nothing in it yet, and we are good to go with creating the index for our book catalog. As we know, Elasticsearch speaks JSON, but manipulating a more or less complex JSON document from the command line is somewhat cumbersome. Let us instead store the catalog settings and mappings in a catalog-index.json document.
{
"settings": {
"index" : {
"number_of_shards" : 5,
"number_of_replicas" : 2
}
},
"mappings": {
"books": {
"_source" : {
"enabled": true
},
"properties": {
"title": { "type": "text" },
"categories" : {
"type": "nested",
"properties" : {
"name": { "type": "text" }
}
},
"publisher": { "type": "keyword" },
"description": { "type": "text" },
"published_date": { "type": "date" },
"isbn": { "type": "keyword" },
"rating": { "type": "byte" }
}
},
"authors": {
"properties": {
"first_name": { "type": "keyword" },
"last_name": { "type": "keyword" }
},
"_parent": {
"type": "books"
}
}
}
}
And use this document as the input to the create index API.
$ http PUT http://localhost:9200/catalog < catalog-index.json
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"acknowledged": true,
"shards_acknowledged": true
}
A few words should be said about the usage of the acknowledged response property across most of the Elasticsearch APIs, especially the ones which apply mutations. In general, this value simply indicates whether the operation completed before the timeout (“true”) or may take effect sometime soon (“false”). We are going to see more examples of its usage in different contexts later on.
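On a related note, the quickest way to confirm that the index actually exists is an existence check using a HEAD request, which should return 200 OK for an existing index and 404 Not Found otherwise:
$ http HEAD http://localhost:9200/catalog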
That is it, our catalog index is live. To see the actual configuration it ended up with, we could ask Elasticsearch to return the catalog index settings.
$ http http://localhost:9200/catalog/_settings
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"catalog": {
"settings": {
"index": {
"creation_date": "1487428863824",
"number_of_replicas": "2",
"number_of_shards": "5",
"provided_name": "catalog",
"uuid": "-b63dCesROC5UawbHz8IYw",
"version": {
"created": "5020099"
}
}
}
}
}
Awesome, exactly what we ordered. You might wonder how Elasticsearch would react if we tried to update the index settings by increasing the number of shards (as we know, not all index settings can be updated once the index has been created).
$ echo '{"index":{"number_of_shards":6}}' | http PUT http://localhost:9200/catalog/_settings
HTTP/1.1 400 Bad Request
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"error": {
"reason": "can't change the number of shards for an index",
"root_cause": [
...
],
"type": "illegal_argument_exception"
},
"status": 400
}
The error response comes as no surprise (please notice that the response details have been reduced for illustrative purposes only).
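Dynamic settings, on the other hand, can be updated in place at any time. As a small sketch (assuming we decided to lower the replication factor to one), the very same endpoint accepts such a change without complaints and replies with "acknowledged": true:
$ echo '{"index":{"number_of_replicas":1}}' | http PUT http://localhost:9200/catalog/_settings
Along with settings, it is very easy to get the mapping types for a particular index, for example: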
$ http http://localhost:9200/catalog/_mapping
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"catalog": {
"mappings": {
"authors": {
...
},
"books": {
...
}
}
}
}
By and large, the index mappings for existing fields cannot be updated; however, there are some exceptions to the rule. Most notably, new fields can be added to an existing mapping type at any time.
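For example, here is a minimal sketch of adding a hypothetical language field (not part of our original mapping, introduced here purely for illustration) to the books type:
$ echo '{"properties": {"language": {"type": "keyword"}}}' | http PUT http://localhost:9200/catalog/_mapping/books
One of the greatest features of the indices APIs is the ability to perform the analysis process against a particular index mapping type and field without actually sending any documents.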
$ http http://localhost:9200/catalog/_analyze field=books.title text="Elasticsearch: The Definitive Guide. A Distributed Real-Time Search and Analytics Engine"
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"tokens": [
{
"end_offset": 13,
"position": 0,
"start_offset": 0,
"token": "elasticsearch",
"type": "<ALPHANUM>"
},
{
"end_offset": 18,
"position": 1,
"start_offset": 15,
"token": "the",
"type": "<ALPHANUM>"
},
...
{
"end_offset": 88,
"position": 11,
"start_offset": 82,
"token": "engine",
"type": "<ALPHANUM>"
}
]
}
It is exceptionally useful when you would like to validate your mapping types’ parameters before throwing a bunch of data into Elasticsearch for indexing.
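The analyze API can also be pointed at an arbitrary analyzer instead of a concrete field, which makes it easy to compare how different analyzers would tokenize the same text, a quick sketch:
$ http http://localhost:9200/_analyze analyzer=standard text="Elasticsearch: The Definitive Guide"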
And last but not least, there is one important detail about index states. Any particular index can be in the open (fully operational) or closed (blocked for read/write operations; archived would be a good analogy) state. As for everything else, Elasticsearch provides APIs for that.
$ http POST http://localhost:9200/catalog/_open
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"acknowledged": true
}
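Conversely, the index can be sent into the closed, archive-like state through the sibling _close endpoint (a small sketch; note that a closed index is blocked for reads and writes until it is opened again):
$ http POST http://localhost:9200/catalog/_close
The response carries the familiar "acknowledged": true property.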
4. Documents, More Documents, …
An empty index without documents is not very useful, so let us switch gears from the indices APIs to another great one, the document APIs. We are going to start exploring them using the simplest single-document operations, relying on the following book.json document:
{
"title": "Elasticsearch: The Definitive Guide. A Distributed Real-Time Search and Analytics Engine",
"categories": [
{ "name": "analytics" },
{ "name": "search" },
{ "name": "database store" }
],
"publisher": "O'Reilly",
"description": "Whether you need full-text search or real-time analytics of structured data—or both—the Elasticsearch distributed search engine is an ideal way to put your data to work. This practical guide not only shows you how to search, analyze, and explore data with Elasticsearch, but also helps you deal with the complexities of human language, geolocation, and relationships.",
"published_date": "2015-02-07",
"isbn": "978-1449358549",
"rating": 4
}
Before sending this JSON to Elasticsearch, it would be great to talk a little bit about document identification. Each document in Elasticsearch has a unique identifier, stored in the special _id field. You may provide one while uploading the document to Elasticsearch (like we do in the example below, using isbn since it is a great example of a natural identifier), or it will be generated and assigned by Elasticsearch.
$ http PUT http://localhost:9200/catalog/books/978-1449358549 < book.json
HTTP/1.1 201 Created
Location: /catalog/books/978-1449358549
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"_id": "978-1449358549",
"_index": "catalog",
"_shards": {
"failed": 0,
"successful": 3,
"total": 3
},
"_type": "books",
"_version": 1,
"created": true,
"result": "created"
}
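As a side note, had we not provided the identifier ourselves, we could have used a POST request against the type endpoint instead (a sketch only, not something our catalog needs since isbn serves us so well), and Elasticsearch would have generated a unique _id on our behalf:
$ http POST http://localhost:9200/catalog/books < book.json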
Our first document just made its way into the catalog index, under the books type. But we also have the authors type, which is in a parent/child relationship with books. Let us complement the book with its authors from the authors.json document.
[
{
"first_name": "Clinton",
"last_name": "Gormley",
"_parent": "978-1449358549"
},
{
"first_name": "Zachary",
"last_name": "Tong",
"_parent": "978-1449358549"
}
]
The book has more than one author, so we could still use the single-document API by indexing each author document one by one. However, let us not do that, but switch over to the bulk document API instead and transform our authors.json document a bit to be compatible with the bulk document API format.
{ "index" : { "_index" : "catalog", "_type" : "authors", "_id": "1", "_parent": "978-1449358549" } }
{ "first_name": "Clinton", "last_name": "Gormley" }
{ "index" : { "_index" : "catalog", "_type" : "authors", "_id": "2", "_parent": "978-1449358549" } }
{ "first_name": "Zachary", "last_name": "Tong" }
Done deal, let us save this document as authors-bulk.json and feed it directly into bulk document API endpoint.
$ http POST http://localhost:9200/_bulk < authors-bulk.json
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"errors": false,
"items": [
{
"index": {
"_id": "1",
"_index": "catalog",
"_shards": {
"failed": 0,
"successful": 3,
"total": 3
},
"_type": "authors",
"_version": 5,
"created": false,
"result": "updated",
"status": 200
}
},
{
"index": {
"_id": "2",
"_index": "catalog",
"_shards": {
"failed": 0,
"successful": 3,
"total": 3
},
"_type": "authors",
"_version": 2,
"created": true,
"result": "created",
"status": 201
}
}
],
"took": 105
}
And we have book and author documents as the first citizens of the catalog index! It is time to fetch those documents back.
$ http http://localhost:9200/catalog/books/978-1449358549
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"_id": "978-1449358549",
"_index": "catalog",
"_source": {
"categories": [
{ "name": "analytics" },
{ "name": "search"},
{ "name": "database store" }
],
"description": "...",
"isbn": "978-1449358549",
"published_date": "2015-02-07",
"publisher": "O'Reilly",
"rating": 4,
"title": "Elasticsearch: The Definitive Guide. A Distributed Real-Time Search and Analytics Engine"
},
"_type": "books",
"_version": 1,
"found": true
}
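Several documents can also be retrieved in one round trip using the multi get API, a minimal sketch with our book (child documents would additionally need their routing supplied per entry):
$ echo '{"docs": [{"_index": "catalog", "_type": "books", "_id": "978-1449358549"}]}' | http POST http://localhost:9200/_mget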
Easy! However, to fetch documents from the authors collection, which are children of their respective documents in the books collection, we have to supply the parent identifier along with the document's own one, for example:
$ http http://localhost:9200/catalog/authors/1?parent=978-1449358549
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"_id": "1",
"_index": "catalog",
"_parent": "978-1449358549",
"_routing": "978-1449358549",
"_source": {
"first_name": "Clinton",
"last_name": "Gormley"
},
"_type": "authors",
"_version": 1,
"found": true
}
This is one of the specifics of working with parent/child relations in Elasticsearch. As has already been mentioned, you may model such relationships in simpler ways, but our goal is to learn how to deal with them if you choose to go this route in your applications.
The delete and update APIs are pretty straightforward, so we will just leaf through them quickly; please notice that the same rules regarding identifying child documents apply. You may be surprised, but deleting a parent document does not automatically delete its children, so keep that in mind. We are going to see how to work around that a bit later.
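For reference, here are small sketches of both operations (assuming, purely for illustration, that we wanted to remove the second author and then fix the book rating through a partial document update):
$ http DELETE http://localhost:9200/catalog/authors/2?parent=978-1449358549
$ echo '{"doc": {"rating": 5}}' | http POST http://localhost:9200/catalog/books/978-1449358549/_update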
To finish up, let us take a look at the term vectors API, which returns all the details and statistics about terms in the fields of a document, for example (only a small part of the response has been pasted):
$ http http://localhost:9200/catalog/books/978-1449358549/_termvectors?fields=description
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"_id": "978-1449358549",
"_index": "catalog",
"_type": "books",
"_version": 1,
"found": true,
"term_vectors": {
"description": {
"field_statistics": {
"doc_count": 1,
"sum_doc_freq": 46,
"sum_ttf": 60
},
"terms": {
"analyze": {
"term_freq": 1,
"tokens": [ ... ]
},
"and": {
"term_freq": 2,
"tokens": [ ... ]
},
"complexities": {
"term_freq": 1,
"tokens": [ ... ]
},
"data": {
"term_freq": 3,
"tokens": [ ... ]
},
...
}
}
},
"took": 5
}
You may not find yourself using the term vectors API often; however, it is a terrific tool for troubleshooting why certain documents may not pop up in search results.
5. What if My Mapping Types Are Suboptimal?
Very often, over time you may discover that your mapping types are not optimal and could be made better. However, Elasticsearch supports only limited modifications to existing mapping types. Luckily, it provides a dedicated reindex API, for example:
$ echo '{"source": {"index": "catalog"}, "dest": {"index": "catalog-v2"}}' | http POST http://localhost:9200/_reindex
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
"batches": 0,
"created": 200,
"deleted": 0,
"failures": [],
"noops": 0,
"requests_per_second": -1.0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"throttled_until_millis": 0,
"timed_out": false,
"took": 265,
"total": 200,
"updated": 0,
"version_conflicts": 0
}
The trick here is to create a new index with the updated mapping types, catalog-v2, and then just ask Elasticsearch to reindex all the documents from the old index into the new one.
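Once catalog-v2 has been verified, a common follow-up (a sketch only; make sure no clients are still writing to the old index) is to drop the original index and point an alias with the old name at the new one, so applications keep using the catalog name unchanged:
$ http DELETE http://localhost:9200/catalog
$ echo '{"actions": [{"add": {"index": "catalog-v2", "alias": "catalog"}}]}' | http POST http://localhost:9200/_aliases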