Applying AI for Log Analysis
July 2017
Confidential and Proprietary July 2017
Hi!
Ronny Lehmann
CTO & Founder – Loom Systems
Formerly 8200, BioCatch
Machine-Learning | High-performance Cloud-Computing
@ronnyle_mann
Founded in April 2015
30 people (5 in San Francisco)
Bootstrapped for the first 2 years, recently funded
Hiring very much
Today’s Big-Data Bottleneck:
You are.
2000s Big-Data Bottlenecks:
✓ Storing
✓ Querying
✓ Real-time processing
Good dev(ops) people are hard to find
Employee tenure is very low (< 3 years; source: PayScale)
Operations is Tribal Knowledge
Machines are very loyal, never ask for a raise and have excellent memory.
Can (some) of this be done with machines?
➜“I’ve been hearing this for 20 years”
Total Recall, a movie based on a short story from 1966, featured
a self-driving car as science fiction.
If Artificial Intelligence has matured enough to drive your
car, it can probably also help with your IT.
Skeptic?!
HUMANS
Good at top-down tasks
• Deep reasoning
• Contextual thinking
• Tired
• Bored
• Lazy
• Frustrated
• Married
BOTS
Superior at bottom-up tasks
• Real-time trend detection
• Pattern Recognition
• Large Dimensionality
• Complex State
• Strict Methodology
That’s what we do @ Loom Systems
AIOps - Algorithmic IT operations
Use Big Data and Machine Learning Technologies to Achieve a Data-Centric Approach to Availability and Performance Monitoring.
Extend the Data-Centric Approach to Other ITOM (IT Operations Management) Disciplines, and Seek to Exploit the Linkages It Allows Between ITOM, SIEM and Business Analytics.
Action
• Remedy
• Recommendation
• Insight
• Knowledge
Root-Cause Analysis
• Aggregation
• Correlation
• Causality
Data Modelling
• Visualizations
• Define KPIs
• Reporting
• Rules & Thresholds
Data Preparation
• Collection
• Normalization
• Sanitizing
• Preprocessing
Cracking the science behind data-science
Loom Ops – real-time AIOps
Processing
• Semi-structured -> Structured Data
• MLP & Pattern Recognition
• Measure-All
Analysis
• Behavior Tracking
• Anomaly Detection & Trend Prediction
• Correlation Engine
Alerting
• Incident Enrichment
• Insights Engine
• Routing
Three layers of context
Generic Context
Something is mentioned more than normal, or appears after a long absence
Something stopped/started happening
Common Business Context
Semantic words (timeout, Trojan, failure)
Common Software
Proprietary Business Context
Names of business products, servers, applications, ...
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port 48278 ssh2
Processing
Generic context – rate of this pattern in the logs
Common Business Context –
➜ Contextual words (Warn, Failed)
➜ Common Entities (User, IP, ssh)
Proprietary Business Context –
➜ Server Name
Real-time Structuring, Clustering
Token & Entity Extraction and Classification
Histogram   megatron        Server
Meter       sshd            Application
Meter       ronny           User
Meter       192.168.118.1   source_IP
Random      48278           source_port
Failed password for user [user] from [source_IP] port [source_port] ssh2
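The structuring step on this slide can be illustrated with a minimal sketch. It assumes a hand-written regular expression for this one sshd line; the slide describes automatic structuring, so the hard-coded pattern, the `structure` function, and the group names are hypothetical stand-ins rather than how Loom Ops does it.

```python
import re

# Hypothetical, hand-written pattern for the sshd line on the slide.
# A real system would discover such templates automatically.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<server>\S+) "
    r"(?P<application>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<level>\w+) - "
    r"Failed password for user (?P<user>\S+) "
    r"from (?P<source_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"port (?P<source_port>\d+) ssh2$"
)

# The template shown on the slide, with entities replaced by placeholders.
TEMPLATE = "Failed password for user [user] from [source_IP] port [source_port] ssh2"


def structure(line):
    """Return (template, entities) for a matching line, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return TEMPLATE, match.groupdict()


sample = ("Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password "
          "for user ronny from 192.168.118.1 port 48278 ssh2")
template, entities = structure(sample)
print(template)
print(entities)
```

Each extracted entity (server, application, user, source IP) can then be counted as its own signal, while a high-cardinality field such as the source port is treated as random.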
Automatic Structuring
Loom Ops – real-time AIOps
Processing
• Semi-structured -> Structured Data
• MLP & Pattern Recognition
• Measure-All
Analysis
• Behavior Tracking
• Anomaly Detection & Trend Prediction
• Correlation Engine
Alerting
• Incident Enrichment
• Insights Engine
• Routing
- This is not (only) anomaly-detection (!)
Algorithms
3σ
Baseline
ARIMA
Feature extraction
Detection & Alerting
History
Scoring
Self Feedback
User Direct and Indirect Feedback
Detection
When tracking up to 1M signals, the system must
automatically determine which kinds of
detections are interesting for each signal
(examples: website response time, ad-click rate)
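Of the algorithms named on this slide, the 3σ baseline is the simplest to illustrate. The sketch below assumes a rolling window of recent values per signal; the class name, window size, and warm-up length are hypothetical choices, and the feature extraction, scoring, and feedback stages from the slide are not shown.

```python
from collections import deque
from statistics import mean, stdev


class ThreeSigmaDetector:
    """Flag values more than 3 standard deviations from a rolling baseline."""

    def __init__(self, window=60, min_points=10):
        self.history = deque(maxlen=window)  # most recent observations
        self.min_points = min_points         # warm-up before any alerting

    def observe(self, value):
        """Record a value; return True if it deviates from the baseline."""
        anomalous = False
        if len(self.history) >= self.min_points:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # A perfectly flat history (sigma == 0) never alerts in this sketch.
            anomalous = sigma > 0 and abs(value - mu) > 3 * sigma
        self.history.append(value)
        return anomalous


detector = ThreeSigmaDetector()
normal_traffic = [100, 102, 98, 101, 99] * 4
flags = [detector.observe(v) for v in normal_traffic]
spike_flagged = detector.observe(150)
print(any(flags), spike_flagged)
```

A fixed rule like this suits a signal such as website response time but not necessarily an ad-click rate with strong seasonality, which is one reason the detection type has to be chosen per signal.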
Root-Cause Analysis
When something breaks, anomalies are everywhere. How do you know what to fix?
Root-Cause Analysis
When something breaks, everything starts complaining. How do you know what to fix?
Automated Root-Cause Analysis: aggregating the detections, correlating
them, and determining causality between them.
How?
➜ Time-based causality
➜ Relationship-based analysis
➜ Graph-based analysis
Root-Cause Analysis
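The time-based approach can be sketched as follows, under a strong simplifying assumption: detections that occur close together belong to one incident, and the earliest detection in an incident is a candidate root cause. Temporal precedence is only a heuristic for causality; the `Detection` type, the signal names, and the 30-second gap are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    timestamp: float  # seconds since epoch
    signal: str       # hypothetical signal name


def group_incidents(detections, max_gap=30.0):
    """Group detections separated by at most max_gap seconds."""
    incidents = []
    for det in sorted(detections, key=lambda d: d.timestamp):
        if incidents and det.timestamp - incidents[-1][-1].timestamp <= max_gap:
            incidents[-1].append(det)   # continue the current incident
        else:
            incidents.append([det])     # start a new incident
    return incidents


def root_cause_candidate(incident):
    """Return the earliest detection in a time-ordered incident."""
    return incident[0]


detections = [
    Detection(125.0, "web.response_time"),
    Detection(100.0, "db.latency"),
    Detection(500.0, "disk.usage"),
    Detection(110.0, "api.errors"),
]
incidents = group_incidents(detections)
candidates = [root_cause_candidate(i).signal for i in incidents]
print(candidates)
```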
Examples
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:57 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.16 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user dror from 192.168.118.4 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user john from 192.168.118.14 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user dan from 192.168.118.121 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user gab from 192.168.118.51 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user anna from 192.168.118.66 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user dan from 192.168.118.123 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user jim from 192.168.118.133 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user nate from 192.168.118.201 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user stan from 192.168.118.194 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user paul from 192.168.118.144 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user avi from 192.168.118.81 port…
Sep 27 14:25:57 megatron sshd[7498]: WARN - Failed password for user stas from 192.168.118.54 port…
ronny is mentioned more than normal in the context of ssh failures
The context of ssh failures is mentioned more than normal
Root-Cause Analysis – Relationship Based
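The difference between the two scenarios can be sketched by checking whether one entity dominates the failures. This is a simplified stand-in: it uses a fixed share threshold instead of comparing each entity against its own learned baseline, and the `describe_failures` function and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def describe_failures(users, dominance=0.5):
    """Say whether one user dominates the failures or the failures are spread out."""
    counts = Counter(users)
    top_user, top_count = counts.most_common(1)[0]
    if top_count / len(users) >= dominance:
        return f"{top_user} is mentioned more than normal in the context of ssh failures"
    return "The context of ssh failures is mentioned more than normal"


# Users taken from the two groups of log lines on the slide.
first_scenario = ["ronny"] * 13
second_scenario = ["ronny", "dror", "john", "dan", "gab", "anna", "dan",
                   "jim", "nate", "stan", "paul", "avi", "stas"]

print(describe_failures(first_scenario))
print(describe_failures(second_scenario))
```

The first result points at a single account, the second at the authentication service as a whole, even though the individual log lines look almost identical.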
Root-Cause Analysis – Graph Based
Root-Cause Analysis – Graph Based
Correlated Incidents
Real-Time AIOps
Processing
• Semi-structured -> Structured Data
• MLP & Pattern Recognition
• Measure-All
Analysis
• Behavior Tracking
• Anomaly Detection & Trend Prediction
• Correlation Engine
Alerting
• Incident Enrichment
• Insights Engine
• Routing
Countering Alert Flooding / Alert Fatigue
➜ Overall rate of incidents
➜ Quality of an incident
An incident report:
➜ Root-Cause Analysis
➜ History of similar incidents
➜ Insights & Recommendations
Incident Enrichments
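Controlling the overall rate of incidents can be sketched as a budget: only the highest-scoring incidents are surfaced, and the budget scales with team size. The scores are assumed to come from an upstream prioritization step that is not shown; `select_incidents`, the incident names, and the per-person capacity are hypothetical.

```python
import heapq


def select_incidents(scored_incidents, team_size, per_person=3):
    """Keep the highest-scoring incidents, up to team_size * per_person."""
    budget = team_size * per_person
    # nlargest returns the kept incidents sorted from highest to lowest score.
    return heapq.nlargest(budget, scored_incidents, key=lambda item: item[1])


# (incident name, score) pairs; both are illustrative values.
scored = [
    ("ssh failures on megatron", 0.9),
    ("disk almost full", 0.1),
    ("slow checkout page", 0.5),
    ("db latency spike", 0.7),
]
surfaced = select_incidents(scored, team_size=1, per_person=2)
print(surfaced)
```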
Incident Enrichments
Thank you!
(still hiring very much)

Applying ML for Log Analysis

Editor's Notes

  • #5 They’re called data scientists, but these are analysts, SREs, DevOps and others
  • #7 ASK: Who here believes that self-driving cars will be successful? This story was published about 50 years ago. Indeed, science fiction sometimes takes too long to become reality. Seriously, let’s get skepticism out of the way: I’m going to be talking about a working concept. It’s not the car, or the street, or the stoplight. It’s the AI. AI is mature; it’s ready.
  • #8 Humans are better at top-down, open-ended questions, such as “where should I open my next branch?” Machines are superior at rigorous and exhausting tasks, such as “keep track of our sales in every state, sliced by affiliates and browsers; let me know if something happens.” Can we split responsibility?
  • #11 Analysis consists of processing, analyzing, understanding, then acting. Loom Ops does the processing, the analyzing, and, to some extent, the understanding. We must have automated processing if we want to: track much more, ingest many sources, …
  • #12 Loom covers the generic and common contexts, and will be able to inter-connect them with proprietary contexts
  • #13 The single log line will automatically be processed and translated to 8 different metrics! This is without going into sequence analysis. Can you see how hard it is to extract value from machine data?
  • #16 We suppress “always-broken” alerts. The Machine-Learning based prioritization and filtering is self-adjusting, so that the incident rate fits the size of the team
  • #17 Detection is very hard and usually ends with a vague lead, such as user complaints or high CPU. You then go to the logs (the single source of truth), but there’s all this noise. You find many unusual things in different log streams. This is RCA: the process of understanding that the kids are fighting not because Silvia pushed John and he pushed back, but because they’re hungry
  • #18 Can you see how hard it is to extract value from machine data?
  • #21 The ops guy gets an alert: high CPU on the authentication server. He starts searching the logs for errors, and after a serious amount of work, he narrows it down to this log line. Can you tell the difference in meaning between the two scenarios?
  • #23 When things go wrong, it’s hard to tell the chain of causality
  • #26 We have fewer alerts because we suppress “always-broken” alerts and apply ML-based prioritization and filtering. This reaches a much better result than a human-built rule engine. Can you see how hard it is to extract value from machine data? Then, it’s the quality of the incident, which translates to MTTR. Fuzzy matching is crucial because no one uses ticketing systems. You need to get it in “push”. And you need to be able to provide simple, fast feedback
  • #27 BTW, anomaly detection makes it possible for us to suppress “always-broken” alerts. We also use Machine-Learning based prioritization and filtering: we adjust the incident rate to the size of the team. Fuzzy matching is crucial because no one uses ticketing systems. You need to get it in “push”. And you need to be able to provide simple, fast feedback