An introduction to probabilistic data structures

PROBABILISTIC DATA STRUCTURES

INTRO
What is a Probabilistic Data Structure?

TRADEOFFS
Trade accuracy for speed
meaning: they do not give 100% accurate results
Have lower space requirements, also called sublinear
meaning: to count N distinct items, the required space is less than N
Bonus: they are associative

GOOD ENOUGH
A query may return a wrong answer
But the answer is good enough (ex: count = 1355, real count = 1299)
Usually for BigData(tm), whatever that is

WHAT'S IT FOR?
Membership tests
Does this SET contain a particular ELEMENT?

An element is either MAYBE in the set, or IS NOT

FALSE POSITIVES ARE POSSIBLE
FALSE NEGATIVES NEVER HAPPEN

HOW DOES IT WORK?
IT'S A BIT SET

ADDING
Note that a very long string still occupies the same
couple of bits.

PARAMETERS
Bit field size (m)
Number of hash functions (k)
insertion and membership are O(k)
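
The bit set, the two parameters and the O(k) operations can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, the method names, and the use of SHA-256 with double hashing to derive the k positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # bit field size
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # the bit set, all zeros

    def _positions(self, item: str):
        # Assumption: one SHA-256 digest plus double hashing stands in for
        # k separate hash functions (a fast hash would be used in practice).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Sets k bits, however long the item is: O(k).
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True means "maybe".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage: `bf = BloomFilter(m=8192, k=6)`, then `bf.add("some key")` and `bf.might_contain("some key")`. The filter occupies 1 KB no matter how long the inserted strings are.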

RULES
One byte per item in the input set gives about a 2% false positive rate
1024 elements in a 1 KB Bloom filter: about 2% false positives
The optimal number of hash functions is about 0.7 times the number of bits per item
3 at a 10% false positive rate
13 at a 0.01% false positive rate
The number of hashes dominates performance
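
These rules of thumb follow from the standard Bloom filter sizing formulas, and a small calculator makes them easy to check. The function names below are made up for this sketch; the 0.7 factor is ln 2.

```python
import math


def bits_per_item(p: float) -> float:
    """Bits per stored item needed for false positive rate p (with optimal k)."""
    return -math.log(p) / (math.log(2) ** 2)


def optimal_hashes(bits: float) -> float:
    """Optimal number of hash functions: ln 2 (about 0.7) times bits per item."""
    return bits * math.log(2)


def false_positive_rate(bits: float) -> float:
    """False positive rate with `bits` bits per item and the optimal k."""
    return math.exp(-bits * math.log(2) ** 2)
```

For example, `false_positive_rate(8)` is a little above 0.02 (one byte per item), and `round(optimal_hashes(bits_per_item(0.10)))` gives 3.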

DIATRIBE: HASHING
JUST USE MURMUR3

Fraction of keys hashed without collision (64 bits)

V{n,i} = Number of items of namespace n hashed to
the i-th bin
Variance vs Mean for random distribution

BLACK = 50% FLIP PROBABILITY, BRIGHT GREEN = OUTPUT BIT IS "STUCK" (DOESN'T EVER VARY)

COUNT MIN SKETCH

Top-K frequencies/Heavy hitters
How many times have you seen X?
Leaderboards
Stats
Rate limiting, packet stats, etc

IT'S A 2D ARRAY

QUERYING
Take the minimum

Number of hash functions
Size of matrix
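
A count-min sketch along these lines might look like the following: a 2D array of counters, one hash function per row, and a query that takes the minimum. The class and method names are invented for the example, and salting SHA-256 with the row number is an assumption standing in for independent hash functions.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: a depth x width matrix of counters."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width   # columns per row (size of matrix)
        self.depth = depth   # rows, one hash function per row
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, item: str, row: int) -> int:
        # Assumption: SHA-256 salted with the row number plays the role
        # of the row's hash function.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._column(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available estimate (never an undercount).
        return min(self.rows[row][self._column(item, row)]
                   for row in range(self.depth))
```

A wider matrix lowers the chance of collisions in each row; more rows lower the chance that every row collides at once.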

Quantiles
What's the 90th percentile for GET /my/service? And the 99th?
anomaly detection: trigger at some percentile threshold
quantiles per metric per user/location/etc
Normally you need the full data set for a given quantile
You cannot calculate a quantile of quantiles, which makes it hard to do streaming
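
A tiny example shows why a quantile of quantiles goes wrong. The two shards and their values are hypothetical, chosen only to make the effect obvious.

```python
import statistics

# Two hypothetical shards of response times in milliseconds.
shard_a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
shard_b = [500]

# Each shard reports only its own median...
median_a = statistics.median(shard_a)
median_b = statistics.median(shard_b)

# ...and combining those medians ignores how many samples each shard saw.
median_of_medians = statistics.median([median_a, median_b])

# The true median needs the full data set.
true_median = statistics.median(shard_a + shard_b)
```

Here `median_of_medians` is 257.25 while `true_median` is 15: the single outlier shard counts as much as the shard with ten samples.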

SPARSE REPRESENTATION OF THE CUMULATIVE DISTRIBUTION FUNCTION
After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids

SOME DATA

EMPIRICAL CDF

"INTERESTING" POINTS

COMBINING
Create a new t-Digest and treat the internal centroids of the two left-hand side digests as incoming data
The resulting t-Digest is only slightly larger, but more accurate
tDigest1 + tDigest2 = tDigest3
-------------------   -----------
   incoming data   => new tDigest

QUERYING
8 MB of Pareto-distributed data into a t-Digest
Resulting size was 5 KB
Can answer any percentile or quantile desired
Accuracy was on the order of 0.002%

Compression
tradeoff of size vs accuracy
depends on the implementation, some expose
more params than others
doesn't always mean the same thing

Cardinality Estimation

How many distinct ITEMS are there today? And yesterday? And over the two days combined?
ex: unique visitors
group-by/count without keeping all the data

IT'S COMPLICATED
...The observation that the cardinality of
a multiset of uniformly-distributed
random numbers can be estimated by
calculating the maximum number of
leading zeros in the binary representation
of each number in the set.
If the maximum number of leading zeros
observed is n, an estimate for the number
of distinct elements in the set is 2^n.

REAL SLOW NOW
In a random stream of integers
~50% of the numbers (in binary) start with "1"
25% start with "01"
12.5% start with "001"
If you observe a random stream and see a number starting with "001", there is a good chance that the stream has a cardinality of about 8.
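
The percentages can be checked by brute force. The sketch below enumerates every 8-bit value instead of sampling a random stream, so the fractions come out exact; the 8-bit width and the helper name are choices made for the example.

```python
from collections import Counter

BITS = 8  # small enough to enumerate every possible value


def prefix_up_to_first_one(value: int, bits: int = BITS) -> str:
    """Return the leading zeros plus the first 1, e.g. '001' for 00101100."""
    text = format(value, f"0{bits}b")
    first_one = text.find("1")
    return text if first_one == -1 else text[: first_one + 1]


counts = Counter(prefix_up_to_first_one(v) for v in range(2 ** BITS))
fractions = {prefix: n / 2 ** BITS for prefix, n in counts.items()}

# A prefix seen with probability p suggests roughly 1 / p distinct values.
implied_cardinality = 1 / fractions["001"]
```

`fractions["1"]`, `fractions["01"]` and `fractions["001"]` are 0.5, 0.25 and 0.125, and `implied_cardinality` is 8.0.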

BUCKETING
number: 13,200,393
hash: 2,005,620,294
bits: [100010110101011001000110]
value: number of zeros + 1, counted right to left (rtl)
index: the lowest b bits, used to determine the index of the register whose value is to be updated (m = 2^b)
each bucket will serve as an "estimator"
100010110101011001 000110
------------------|------
      value        index

ESTIMATING
LogLog:
In order to compute the number of distinct values in the stream you would just take the average of all of the m buckets
distinct vals = constant * m * 2^(avg R)
HyperLogLog uses:
large range correction (??)
harmonic mean, which tends to behave better for extreme values
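
Both estimators can be compared on the same set of registers. This is a rough sketch under stated assumptions: SHA-1 stands in for a fast hash, the constants 0.39701 (LogLog) and 0.7213 / (1 + 1.079 / m) (HyperLogLog, for m >= 128) are the usual published bias constants, and the small and large range corrections are left out.

```python
import hashlib


def registers_for(items, b: int) -> list[int]:
    """Fill m = 2**b registers with the largest value seen per bucket."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        # Assumption: the first 64 bits of SHA-1 stand in for a fast hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h & (m - 1)   # lowest b bits
        rest = h >> b         # remaining 64 - b bits
        # value = number of trailing zeros + 1
        value = (rest & -rest).bit_length() if rest else 64 - b + 1
        registers[index] = max(registers[index], value)
    return registers


def loglog_estimate(registers) -> float:
    """constant * m * 2^(average register value)."""
    m = len(registers)
    return 0.39701 * m * 2 ** (sum(registers) / m)


def hyperloglog_estimate(registers) -> float:
    """Same idea, but with a harmonic mean of 2^register."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With `regs = registers_for(range(50000), b=10)`, both `loglog_estimate(regs)` and `hyperloglog_estimate(regs)` should land within a few percent of 50,000 while storing only 1,024 small registers.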

EXAMPLE
http://content.research.neustar.biz/blog/hll.html

UNIONS/INTERSECTIONS
Unions are lossless (for same HLL size)
Intersections are estimated from unions (inclusion-exclusion), so their error is larger
Some people have tried to combine HLLs of different sizes
How many distinct visitors did we have on Monday AND Tuesday?
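
For two HLLs of the same size, the usual way to take a union is register by register, keeping the maximum. An intersection count can then be derived with inclusion-exclusion, at the cost of compounding the errors of three estimates. Both function names below are made up for the sketch.

```python
def merge_registers(a: list[int], b: list[int]) -> list[int]:
    """Union of two same-size HLL sketches: register-wise maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(count_a: float, count_b: float,
                          count_union: float) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union
```

The merged registers are exactly what a single HLL would hold after seeing both streams, which is why the union loses nothing. For Monday AND Tuesday, estimate each day, estimate the merged sketch, and pass the three counts to `intersection_estimate`.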

number of buckets/registers
theoretical HLL error bounds (1.04 / sqrt(m))
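
The error bound makes the size versus accuracy tradeoff easy to tabulate. The helper name is hypothetical; the formula is the one quoted above.

```python
import math


def hll_standard_error(m: int) -> float:
    """Theoretical relative standard error of an HLL with m registers."""
    return 1.04 / math.sqrt(m)


# Relative error for a few register counts (m = 2**b).
error_by_registers = {m: hll_standard_error(m) for m in (64, 1024, 16384)}
```

Going from 1,024 to 16,384 registers (16 times the space) cuts the error from about 3.25% to about 0.81%, a factor of 4.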

THIS IS HUGE
Who's using it?
Node, Java, C, etc etc
Postgres
Redis
Twitter Algebird, Scalding
Druid (MPP)
Basically anyone who needs to count
distinct/group-by

CONCLUSION
evaluate the scenario to see if approximation is useful
test the npm packages, some are just crap
I would write them in C and use f
