PROBABILISTIC DATA STRUCTURES
INTRO
What is a Probabilistic Data Structure?
TRADEOFFS
Trade accuracy for speed
meaning: they do not give 100% accurate results
Need less space, also called sublinear
meaning: to count N distinct items, the required space is less than N
Bonus: they are associative (partial results can be merged)
GOOD ENOUGH
A query may return a wrong answer
But the answer is good enough (ex: count = 1355, real count = 1299)
Usually for BigData(tm), whatever that is
BLOOM FILTERS
WHAT'S IT FOR?
Membership tests
Does this SET contain a particular ELEMENT?
FALSE POSITIVES ARE POSSIBLE
FALSE NEGATIVES NEVER HAPPEN
An element is either MAYBE in the set, or DEFINITELY NOT in it
HOW DOES IT WORK?
IT'S A BIT SET
ADDING
Note that a very long string still occupies the same couple of bits.
QUERYING
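A minimal sketch of adding and querying, assuming a plain Python bit array. The class name `BloomFilter`, the method names, and the use of seeded SHA-256 from the standard library as the k hash functions are all assumptions made for illustration (a real implementation would likely use a faster non-cryptographic hash).

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m = m                              # bit field size
        self.k = k                              # number of hash functions
        self.bits = bytearray((m + 7) // 8)     # all bits start at 0

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different seeds.
        data = str(item).encode("utf-8")
        for seed in range(self.k):
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # ADDING: set the k bits for this item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # QUERYING: any unset bit means "definitely not in the set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
bf.add("a very long string " * 1000)   # still only sets k bits
print(bf.might_contain("a very long string " * 1000))
```

Adding never clears a bit, which is why a previously added item can never be reported as missing.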
PARAMETERS
Bit field size (m)
Number of hash functions (k)
insertion and membership are O(k)
RULES
One byte per item in the input set gives about a 2% false positive rate
1024 elements in a 1 KB Bloom filter give about 2% false positives
The optimal number of hash functions is about 0.7 times the number of bits per item
3 at a 10% false positive rate
13 at a 0.01% false positive rate
The number of hashes dominates performance
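These rules of thumb follow from the standard Bloom filter sizing formulas. The sketch below checks them; the function names `bloom_parameters` and `false_positive_rate` are made up for this example.

```python
import math


def bloom_parameters(n_items, fp_rate):
    """Return (m bits, k hashes) for n_items at a target false positive rate."""
    bits_per_item = -math.log(fp_rate) / (math.log(2) ** 2)
    m = math.ceil(n_items * bits_per_item)
    k = max(1, round(bits_per_item * math.log(2)))   # about 0.7 * bits per item
    return m, k


def false_positive_rate(m, k, n_items):
    """Approximate false positive rate once n_items have been added."""
    return (1 - math.exp(-k * n_items / m)) ** k


print(bloom_parameters(1000, 0.10))           # k comes out as 3
print(bloom_parameters(1000, 0.0001))         # k comes out as 13
print(false_positive_rate(8 * 1024, 6, 1024)) # 1 KB filter, 1024 items: about 2%
```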
CASSANDRA
DIATRIBE: HASHING
JUST USE MURMUR3
Fraction of keys hashed without collision (64 bits)
V{n,i} = Number of items of namespace n hashed to
the i-th bin
Variance vs Mean for random distribution
BLACK = 50% FLIP-PROBABILITY, BRIGHT GREEN = OUTPUT BIT IS "STUCK" (DOESN'T EVER VARY)
COUNT MIN SKETCH
WHAT'S IT FOR?
Top-K frequencies/Heavy hitters
How many times have you seen X?
Leaderboards
Stats
Rate limiting, packet stats, etc
HOW DOES IT WORK?
IT'S A 2D ARRAY
ADDING
QUERYING
Take the minimum
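A minimal sketch of the 2D array with adding and take-the-minimum querying. The class name `CountMinSketch`, the `depth`/`width` parameter names, and the seeded SHA-256 hashing are assumptions for illustration, not any particular library's API.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch: one row of counters per hash function."""

    def __init__(self, depth, width):
        self.depth = depth      # number of hash functions (rows)
        self.width = width      # counters per row (columns)
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One (row, column) cell per hash function.
        data = str(item).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.sha256(row.to_bytes(4, "big") + data).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # ADDING: increment one counter in every row.
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # QUERYING: collisions only inflate counters, so the minimum is
        # the least contaminated estimate. It never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch(depth=4, width=256)
for word in ["x", "y", "x", "x"]:
    cms.add(word)
print(cms.estimate("x"))   # at least 3
```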
PARAMETERS
Number of hash functions
Size of matrix
T-DIGEST
WHAT'S IT FOR?
Quantiles
What's the 90th percentile for GET /my/service? And the 99th?
anomaly detection: trigger at some percentile threshold
quantiles per metric per user/location/etc
Normally you need the full data set for a given quantile
You cannot calculate a quantile of quantiles, which makes streaming hard
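A small illustration of why a quantile of quantiles is wrong, using made-up latency numbers for two hypothetical servers. The median of the per-server medians is not the median of the combined data.

```python
import statistics

# Made-up latencies (ms) from two servers with different traffic volumes.
fast_server = [10, 12, 14]
slow_server = [100, 110, 120, 130, 140, 150, 160]

median_of_medians = statistics.median(
    [statistics.median(fast_server), statistics.median(slow_server)]
)
true_median = statistics.median(fast_server + slow_server)

print(median_of_medians)  # 71.0
print(true_median)        # 115.0
```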
HOW DOES IT WORK?
SPARSE REPRESENTATION OF THE CUMULATIVE DISTRIBUTION FUNCTION
After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids
HOW DOES IT WORK?
SOME DATA
HOW DOES IT WORK?
EMPIRICAL CDF
HOW DOES IT WORK?
"INTERESTING" POINTS
COMBINING
Create a new t-Digest and treat the internal centroids of the two left-hand side digests as incoming data
The resulting t-Digest is only slightly larger, but more accurate
tDigest1 + tDigest2  =  tDigest3
-------------------     --------
   incoming data    =>  new tDigest
QUERYING
8 MB of Pareto-distributed data into a t-Digest
Resulting size was 5 KB
any percentile or quantile desired can be queried
accuracy was on the order of 0.002%
PARAMETERS
Compression
tradeoff of size vs accuracy
depends on the implementation, some expose more params than others
the same parameter doesn't always mean the same thing across implementations
HYPERLOGLOG
WHAT'S IT FOR?
Cardinality Estimation
How many distinct ITEMS are there today? And yesterday? And across the two days?
ex: unique visitors
group-by/count without keeping all the data
HOW DOES IT WORK?
IT'S COMPLICATED
...The observation that the cardinality of
a multiset of uniformly-distributed
random numbers can be estimated by
calculating the maximum number of
leading zeros in the binary representation
of each number in the set.
If the maximum number of leading zeros
observed is n, an estimate for the number
of distinct elements in the set is 2^n.
REAL SLOW NOW
In a random stream of integers
~50% of the numbers (in binary) start with "1"
25% start with "01"
12.5% start with "001"
If you observe a random stream and see a "001", the stream likely has a cardinality of around 8, since that prefix shows up about once in every 8 values.
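A rough sketch of the single-estimator idea: hash every item to get uniformly distributed bits, track the maximum number of leading zeros n, and answer 2^n. The function names and the choice of a 32-bit slice of SHA-256 are assumptions for illustration. On its own this estimate is very noisy (it can only return powers of two), which is what the bucketing on the following slides addresses.

```python
import hashlib


def leading_zeros(value, width=32):
    """Number of leading zero bits of value in a fixed-width binary form."""
    return width - value.bit_length()


def rough_cardinality(items, width=32):
    """Estimate distinct items as 2^(max leading zeros seen in the hashes)."""
    max_zeros = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[: width // 8], "big")
        max_zeros = max(max_zeros, leading_zeros(hashed, width))
    return 2 ** max_zeros


# Duplicates hash to the same value, so they never change the estimate.
print(rough_cardinality(["a", "b", "c"] * 100) == rough_cardinality(["a", "b", "c"]))
```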
BUCKETING
number: 13,200,393
hash: 2,005,620,294
bits: [100010110101011001000110]
value: number of zeros + 1, counted right to left (rtl)
index: the lowest b bits determine the index of the register whose value is to be updated (m = 2^b)
each bucket will serve as an "estimator"
100010110101011001 000110
------------------|------
value              index
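The split above can be sketched on the slide's own bit string, assuming b = 6 index bits as in the diagram. The function name `split_hash_bits` is made up for this example.

```python
def split_hash_bits(bits, b):
    """Split a binary string into (register index, register value).

    index: the lowest b bits, read as an integer
    value: number of zeros counted from the right of the remaining bits, plus 1
    """
    index = int(bits[-b:], 2)
    value_bits = bits[:-b]
    zeros_from_right = len(value_bits) - len(value_bits.rstrip("0"))
    return index, zeros_from_right + 1


index, value = split_hash_bits("100010110101011001000110", b=6)
print(index, value)  # 6 1
```

The register at `index` is then updated to the maximum of its current content and `value`.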
ESTIMATING
LogLog:
In order to compute the number of distinct values in the stream you would just take the average of all of the m buckets
distinct vals = constant * m * 2^(avg R)
HyperLogLog uses
large range correction (??)
Harmonic Mean which tends to behave better for extreme values
HARMONIC WUT?
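A quick illustration with made-up values: one extreme bucket drags the arithmetic mean far away, while the harmonic mean barely moves.

```python
import statistics

# Four buckets agree, one is an extreme outlier (made-up numbers).
bucket_estimates = [4, 4, 4, 4, 1024]

arithmetic = statistics.mean(bucket_estimates)
harmonic = statistics.harmonic_mean(bucket_estimates)

print(arithmetic)  # 208
print(harmonic)    # just under 5
```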
EXAMPLE
http://content.research.neustar.biz/blog/hll.html
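Putting bucketing and the harmonic mean together, a minimal HyperLogLog sketch could look like the following. It is a simplification under stated assumptions: the class and method names are made up, hashing uses 64 bits of SHA-1 from the standard library, the constant is the usual approximation for m >= 128, and the small and large range corrections are left out.

```python
import hashlib


class HyperLogLog:
    """Illustrative HyperLogLog without range corrections."""

    def __init__(self, b=12):
        self.b = b                  # number of index bits
        self.m = 1 << b             # number of registers, m = 2^b
        self.registers = [0] * self.m
        # Bias-correction constant, approximation valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")      # 64-bit hash
        index = hashed & (self.m - 1)                   # lowest b bits
        rest = hashed >> self.b                         # remaining 64 - b bits
        if rest:
            value = (rest & -rest).bit_length()         # zeros from the right + 1
        else:
            value = 64 - self.b + 1
        self.registers[index] = max(self.registers[index], value)

    def count(self):
        # Harmonic mean of the per-register estimates 2^R, scaled by alpha * m.
        inverse_sum = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / inverse_sum


hll = HyperLogLog(b=12)
for i in range(100_000):
    hll.add(f"visitor-{i}")
print(round(hll.count()))   # close to 100000, not exact
```

The whole state is the `registers` list, so memory stays fixed at m small integers no matter how many items are added.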
UNIONS/INTERSECTIONS
Unions are lossless (for same HLL size)
Intersections are derived from unions (inclusion-exclusion), so they carry extra error
Some people have tried to combine HLLs of different sizes
How many distinct visitors did we have on Monday AND Tuesday?
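A sketch of both operations on raw register arrays, assuming two HLLs of the same size. The function names and the example numbers are made up. The union keeps the larger value per register, which gives exactly the registers you would get from feeding both streams into one HLL. The intersection is only an estimate built from three other estimates.

```python
def union_registers(a, b):
    """Merge two HLL register arrays of equal size: element-wise maximum."""
    if len(a) != len(b):
        raise ValueError("HLLs must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(count_a, count_b, count_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union


monday = [1, 3, 0, 2]
tuesday = [2, 1, 0, 5]
print(union_registers(monday, tuesday))          # [2, 3, 0, 5]

# Made-up estimates: 1000 on Monday, 1200 on Tuesday, 1900 across both days.
print(intersection_estimate(1000, 1200, 1900))   # 300
```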
PARAMETERS
number of buckets/registers
theoretical HLL error bounds (1.04 / sqrt(m))
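The error bound turns into simple arithmetic for sizing. The helper names below are made up for this example, and rounding up to a power of two assumes m = 2^b registers as on the bucketing slide.

```python
import math


def hll_standard_error(m):
    """Theoretical relative error for m registers: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(m)


def registers_for_error(target_error):
    """Smallest power-of-two register count that meets the target error."""
    needed = math.ceil((1.04 / target_error) ** 2)
    return 1 << (needed - 1).bit_length()


print(hll_standard_error(16384))    # 0.008125, about 0.8%
print(registers_for_error(0.01))    # 16384
```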
THIS IS HUGE
Who's using?
Node, Java, C, etc etc
Postgres
Redis
Twitter Algebird, Scalding
Druid (MPP)
Basically anyone who needs to count
distinct/group-by
CONCLUSION
evaluate the scenario to see if approximation is
useful
test the npm packages, some are just crap
I would write them in C and use f
END

An introduction to probabilistic data structures