Introduction to Elasticsearch
By Melvyn Peignon
What will I cover?
- Company and products presentation
- Elasticsearch architecture
- Presentation of Kibana
- Presentation of the search API
- Analyzer
- TF/IDF and relevance
- Elasticsearch use case
- Conclusion
Elastic
- Founded in 2012
- The company behind:
  - Kibana
  - Elasticsearch
  - Logstash
  - Beats
What is Elasticsearch?
- Full text search engine
- Based on Lucene
- Highly available
- Distributed
- Scalable
- RESTful
- Open Source
Shay Banon (creator of Elasticsearch)
[Chart: search engine trend comparison; Elasticsearch is shown in blue]
How do they make money?
CRUD
CREATE
READ
UPDATE
DELETE
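Since Elasticsearch is RESTful, each CRUD operation maps onto an HTTP verb. Here is a minimal sketch that only builds the requests without sending them. The `talks` index, the document body and the `localhost:9200` address are hypothetical, and the paths follow the `_doc` / `_update` convention of recent versions (older, type-based versions use different paths).

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port


def crud_request(action, index, doc_id, body=None):
    """Build (but do not send) the HTTP request for one CRUD action."""
    routes = {
        "create": ("PUT", f"/{index}/_doc/{doc_id}"),
        "read": ("GET", f"/{index}/_doc/{doc_id}"),
        "update": ("POST", f"/{index}/_update/{doc_id}"),
        "delete": ("DELETE", f"/{index}/_doc/{doc_id}"),
    }
    method, path = routes[action]
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Hypothetical index name and document, for illustration only.
examples = [
    crud_request("create", "talks", 1, {"title": "Introduction to Elasticsearch"}),
    crud_request("read", "talks", 1),
    crud_request("update", "talks", 1, {"doc": {"title": "Intro to Elasticsearch"}}),
    crud_request("delete", "talks", 1),
]
for req in examples:
    print(req.get_method(), req.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would actually run it against a node.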
Some concepts to know
- Near real time (NRT)
- Cluster
- Node
- Index
- Document
- Shards and Replicas
Documents, types and indexes
- An index is a collection of documents that share similar
properties.
- A document is the basic piece of information that can be
indexed.
- A type is a logical partition of the data in your index.
Cluster, Nodes, Shards and Replicas
[Diagram sequence]
- One node: Node 1 holds shards S1, S2, S3 and S4.
- Two nodes: the four shards are split between Node 1 and Node 2.
- Four nodes: Node 1 to Node 4 hold shards S1 to S4 and replicas R1 to R4.
- The nodes exchange Ping / Pong messages.
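To make the shard and replica layout concrete, here is a toy allocation sketch. It is not the real Elasticsearch allocator: the round-robin rule is an assumption chosen for illustration, so the layout it prints differs from the diagram. It only shows the one constraint that matters here, namely that a replica never sits on the same node as its primary.

```python
def allocate(num_shards, nodes):
    """Place primary S<i> round-robin and its replica R<i> on the following node."""
    if len(nodes) < 2:
        raise ValueError("a replica needs a second node to live on")
    layout = {node: [] for node in nodes}
    for i in range(num_shards):
        primary_node = nodes[i % len(nodes)]
        replica_node = nodes[(i + 1) % len(nodes)]  # always a different node
        layout[primary_node].append(f"S{i + 1}")
        layout[replica_node].append(f"R{i + 1}")
    return layout


layout = allocate(4, ["Node 1", "Node 2", "Node 3", "Node 4"])
for node, shards in layout.items():
    print(node, shards)
```

If any single node disappears, every shard still has a copy (primary or replica) on another node.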
Responsibilities of the master
- Cluster health
- Index creation
- Allocation of the shards
- Allocation of the replicas
Cluster recommendations
- Keep your servers in the same data center
- Put your machines on different racks
- Keep at least 3 master-eligible nodes (with only 2, the quorum is still 2,
so losing a single node prevents a master election)
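The quorum remark can be checked with a few lines of arithmetic. This sketch assumes the usual majority rule, quorum = floor(n / 2) + 1; the function names are mine, not an Elasticsearch API.

```python
def quorum(master_eligible):
    """Smallest majority among the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while keeping a quorum."""
    return master_eligible - quorum(master_eligible)


for n in (1, 2, 3, 5):
    print(n, "nodes: quorum", quorum(n), "tolerates", tolerated_failures(n), "failure(s)")
```

Two nodes tolerate no failure at all, which is why three is the practical minimum.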
What’s Kibana?
- Another Elastic product
- A tool that lets you talk to your Elasticsearch cluster in a
more “human” way
- A product for building dashboards and data
visualizations
Let’s go for a demonstration
Demonstration done in Kibana
The queries can be found on GitHub:
The analyzer
Standard Analyzer, inverted index after indexing the first document:
{“a”: [id_0], “walk”: [id_0], “in”: [id_0], “the”: [id_0], “wood”: [id_0]}
Standard Analyzer, inverted index after indexing the second document:
{“a”: [id_0, id_1], “walk”: [id_0], “in”: [id_0], “the”: [id_0],
“wood”: [id_0], “probability”: [id_1], “complete”: [id_1],
“guide”: [id_1]}
Search results against this index:
[id_0, id_1]
[]
The English analyzer
English Analyzer, inverted index:
{“walk”: [id_0], “wood”: [id_0]}
Search result against this index:
[]
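The inverted indexes above can be reproduced with a toy analyzer. This is only a sketch: the two document texts are my guess from the token lists, the stop-word list is a tiny subset, and the real English analyzer also stems tokens, which is left out here.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "in", "the"}  # tiny illustrative subset, not the real list


def standard_analyze(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def english_like_analyze(text):
    """Same as above, minus stop words (no stemming in this sketch)."""
    return [t for t in standard_analyze(text) if t not in STOP_WORDS]


def build_index(docs, analyze):
    """Map each token to the ordered list of document ids containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for token in analyze(text):
            if doc_id not in index[token]:
                index[token].append(doc_id)
    return dict(index)


def search(index, query, analyze):
    """Return ids of documents matching any analyzed query token."""
    hits = []
    for token in analyze(query):
        for doc_id in index.get(token, []):
            if doc_id not in hits:
                hits.append(doc_id)
    return hits


# Assumed document texts, inferred from the token lists on the slides.
docs = {"id_0": "A walk in the wood", "id_1": "Probability: a complete guide"}

standard_index = build_index(docs, standard_analyze)
english_index = build_index({"id_0": docs["id_0"]}, english_like_analyze)

print(search(standard_index, "a", standard_analyze))
print(english_index)
print(search(english_index, "a", english_like_analyze))
```

With the standard analyzer a search for "a" matches both documents. With stop words removed, the same search finds nothing because "a" never reaches the index.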
What is relevance?
Two models to know:
- Boolean model
- Vector space model
Boolean model
O0 = “Eric is ... always feeding”
O1 = “Jherez is ... with the friends”
….
O6 = “Manage Idea… to Melvyn)”
QT = {“lab”, “manager”}, QO = “OR”
T = {t1: “lab”, t2: “manager”, t3: “Idea”, …, t4: “feeding”}
D = {D0, D1, …, D6}
D0 = {Eric, is, …, feeding}
D1 = {Jherez, is, …, friends}
D6 = {Manage, idea, …, Melvyn}
S1 = {D0, D1, D6} (documents containing “lab”)
S2 = {D0, D6} (documents containing “manager”)
SF = S1 ∪ S2 = S1
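The Boolean model is plain set algebra, so it fits in a few lines. In this sketch the token sets are hypothetical stand-ins (the real documents are elided on the slide), chosen only so that S1 and S2 come out as above; `D3` is an invented non-matching document.

```python
# Hypothetical token sets: only the presence of "lab" and "manager" matters here.
docs = {
    "D0": {"lab", "manager", "feeding"},
    "D1": {"lab", "friends"},
    "D3": {"coffee"},
    "D6": {"lab", "manager", "idea"},
}


def boolean_search(docs, query_terms, operator="OR"):
    """Return the ids of documents matching the query under the Boolean model."""
    per_term = [
        {doc_id for doc_id, tokens in docs.items() if term in tokens}
        for term in query_terms
    ]
    if not per_term:
        return set()
    if operator == "OR":
        return set.union(*per_term)
    if operator == "AND":
        return set.intersection(*per_term)
    raise ValueError(f"unsupported operator: {operator}")


print(sorted(boolean_search(docs, ["lab", "manager"], "OR")))
print(sorted(boolean_search(docs, ["lab", "manager"], "AND")))
```

The model only says whether a document matches, not how well, which is what the vector space model adds.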
Vector space model
S1 = {D0, D1, D6}
T0 = D0 ∩ QT = {“lab”, “manager”} ⇒ V0 = (L0, M0)
T1 = D1 ∩ QT = {“lab”} ⇒ V1 = (L1, 0)
T6 = D6 ∩ QT = {“lab”, “manager”} ⇒ V6 = (L6, M6)
Lx and Mx are the weights of “lab” and “manager” in document Dx.
Weight of a token in a document
- Term frequency
TF = √frequency
- Inverse document frequency
IDF = 1 + log(numDocs / (docFrequency + 1))
- Field length
FL = 1 / √tokensInField
Weight = TF × IDF × FL
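Here is the weight formula as a small function. It is a sketch that assumes the IDF numerator is the total number of documents in the index, as in Lucene's classic TF/IDF similarity, and a natural logarithm; the parameter names and the sample values are mine.

```python
import math


def token_weight(frequency, doc_frequency, num_docs, tokens_in_field):
    """Weight = TF x IDF x FL for one token in one document field."""
    tf = math.sqrt(frequency)                           # term frequency
    idf = 1 + math.log(num_docs / (doc_frequency + 1))  # inverse document frequency
    fl = 1 / math.sqrt(tokens_in_field)                 # field-length norm
    return tf * idf * fl


# Hypothetical numbers: a token appearing 4 times in a 16-token field,
# present in 4 of the 10 documents of the index.
print(token_weight(frequency=4, doc_frequency=4, num_docs=10, tokens_in_field=16))
```

A token weighs more when it is frequent in the field, rare across the index, and sits in a short field.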
Relevance
Vq = [1, 1.47]
V0 = [0.81, 0.85]
V1 = [0.37, 0]
V6 = [0.8, 1.2]
Relevance(Vq, Vx) = cos(Vq, Vx) =
(Vq · Vx) / (‖Vq‖ ‖Vx‖)
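The cosine formula can be applied directly to the vectors above to rank the three documents. A minimal sketch, with the vectors copied from the slide and helper names of my own:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vq = [1, 1.47]
vectors = {"D0": [0.81, 0.85], "D1": [0.37, 0], "D6": [0.8, 1.2]}

scores = {name: cosine(vq, vec) for name, vec in vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

D6 comes out first, then D0, then D1: the document that only contains “lab” points in a direction further away from the query vector.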
Let’s Kaggle with Elasticsearch
https://www.kaggle.com/c/whats-cooking
Results of our “Classifier”
Explanation of the methodology:
http://melvyn.pythonanywhere.com/posts/1/
Final advice
- Mapping: I highly recommend defining an explicit mapping. You cannot change
the type of a field once it is defined in the mapping.
- Elasticsearch alongside a database: I prefer having both. It makes reindexing
easier and gives me a backup; I run my search and analytics on ES and use my
database for identification, etc.
- Elasticsearch as a NoSQL database: I wouldn’t do it on a serious project, but
it is nice to have if you want a quick implementation for a POC.
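For the mapping advice, this is roughly what an explicit mapping body looks like. The field names are hypothetical (loosely modeled on a recipe dataset), and the layout follows the typeless mapping format of recent versions; older versions nest the properties under a type name.

```python
import json

# Hypothetical fields; the body would be sent when creating the index.
mapping = {
    "mappings": {
        "properties": {
            "cuisine": {"type": "keyword"},  # exact values, good for aggregations
            "ingredients": {"type": "text", "analyzer": "english"},  # full-text search
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Since a field type cannot be changed afterwards, fixing a wrong type means creating a new index with the right mapping and reindexing into it.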
Hope you enjoyed the presentation!
Thank you for your attention!
Questions?
