Parsing JSON Really Quickly: Lessons Learned
Daniel Lemire
blog: https://lemire.me
twitter: @lemire
GitHub: https://github.com/lemire/
professor (Computer Science) at Université du Québec (TÉLUQ)
Montreal

How fast can you read a large file?
Are you limited by your disk, or
are you limited by your CPU?

An iMac disk: 2.2 GB/s. Faster SSDs (e.g., 5 GB/s) are available.

Reading text lines (CPU only)
~0.6 GB/s on 3.4 GHz Skylake in Java
void parseLine(String s) {
volume += s.length();
}
void readString(StringReader data) {
BufferedReader bf = new BufferedReader(data);
bf.lines().forEach(s -> parseLine(s));
}
Source available.
Improved by JDK-8229022

Reading text lines (CPU only)
~1.5 GB/s on 3.4 GHz Skylake in C++ (GNU GCC 8.3)
size_t sum_line_lengths(char * data, size_t length) {
std::stringstream is;
is.rdbuf()->pubsetbuf(data, length);
std::string line;
size_t sumofalllinelengths{0};
while(getline(is, line)) {
sumofalllinelengths += line.size();
}
return sumofalllinelengths;
}
Source available.

JSON
Specified by Douglas Crockford
RFC 7159 by Tim Bray in 2013
Ubiquitous format to exchange data
{"Image": {"Width": 800, "Height": 600,
    "Title": "View from 15th Floor",
    "Thumbnail": {
        "Url": "http://www.example.com/81989943",
        "Height": 125, "Width": 100}
    }
}

"Our backend spends half its time serializing and deserializing json"

JSON parsing
Read all of the content
Check that it is valid JSON
Check Unicode encoding
Parse numbers
Build DOM (document-object-model)
Harder than parsing lines?

Jackson JSON speed (Java)
twitter.json: 0.35 GB/s on 3.4 GHz Skylake
Source code available.
                  speed
Jackson (Java)    0.35 GB/s
readLines C++     1.5 GB/s
disk              2.2 GB/s

RapidJSON speed (C++)
twitter.json: 0.65 GB/s on 3.4 GHz Skylake
                  speed
RapidJSON (C++)   0.65 GB/s
Jackson (Java)    0.35 GB/s
readLines C++     1.5 GB/s
disk              2.2 GB/s

simdjson speed (C++)
twitter.json: 2.4 GB/s on 3.4 GHz Skylake
                  speed
simdjson (C++)    2.4 GB/s
RapidJSON (C++)   0.65 GB/s
Jackson (Java)    0.35 GB/s
readLines C++     1.5 GB/s
disk              2.2 GB/s

2.4 GB/s on a 3.4 GHz (+turbo) processor is ~1.5 cycles per input byte

Trick #1: avoid hard-to-predict branches

Write random numbers into an array.
while (howmany != 0) {
out[index] = random();
index += 1;
howmany--;
}
e.g., ~3 cycles per iteration

Write only odd random numbers:
while (howmany != 0) {
val = random();
if (val & 1) { // <=== new: keep only odd values
out[index] = val;
index += 1;
}
howmany--;
}

From 3 cycles to 15 cycles per value!

Go branchless!
while (howmany != 0) {
val = random();
out[index] = val;
index += (val bitand 1);
howmany--;
}
Back to under 4 cycles!
Details and code available

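The two loops above can be put side by side as a self-contained sketch. The function names and the use of std::vector for the input are assumptions made for this illustration; the slides only show the loop bodies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Branchy version: the `if` depends on random data, so it is hard to predict.
// `out` must have room for in.size() values.
std::size_t keep_odd_branchy(const std::vector<std::uint32_t>& in, std::uint32_t* out) {
    std::size_t index = 0;
    for (std::uint32_t val : in) {
        if (val & 1) {
            out[index] = val;
            index += 1;
        }
    }
    return index;
}

// Branchless version: always store, but only advance the index for odd values.
// An even value is written and then overwritten by the next store.
std::size_t keep_odd_branchless(const std::vector<std::uint32_t>& in, std::uint32_t* out) {
    std::size_t index = 0;
    for (std::uint32_t val : in) {
        out[index] = val;
        index += (val & 1);
    }
    return index;
}
```

Both functions return the number of odd values kept, and the first `index` entries of `out` are the same in either case. The branchless version does extra stores, which is the price paid for removing the data-dependent branch.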
What if I keep running the same benchmark?
(same pseudo-random integers from run to run)

Trick #2: Use wide "words"
Don't process byte by byte

When possible, use SIMD
Available on most commodity processors (ARM, x64)
Originally added (Pentium) for multimedia (sound)
Adds wider (128-bit, 256-bit, 512-bit) registers
Adds new fun instructions: do 32 table lookups at once.

ISA                   where                         max. register width
ARM NEON (AArch64)    mobile phones, tablets        128-bit
SSE2 ... SSE4.2       legacy x64 (Intel, AMD)       128-bit
AVX, AVX2             mainstream x64 (Intel, AMD)   256-bit
AVX-512               latest x64 (Intel)            512-bit

"Intrinsic" functions (C, C++, Rust, ...) mapping to specific instructions on specific instruction sets
Higher-level functions (Swift, C++, ...): Java Vector API
Autovectorization ("compiler magic") (Java, C, C++, ...)
Optimized functions (some in Java)
Assembly (e.g., in crypto)

Trick #3: avoid memory/object allocation

In simdjson, the DOM (document-object-model) is stored on one contiguous tape.

Trick #4: measure the performance!
benchmark-driven development

Continuous Integration Performance tests
A performance regression is a bug that should be spotted early

Processor frequencies are not constant
Especially on laptops
CPU cycles are different from time
Time can be noisier than CPU cycles

Specific examples

Example 1. UTF-8
Strings are ASCII (1 byte per code point)
Otherwise multiple bytes (2, 3 or 4)
Only 1.1 M valid UTF-8 code points

Validating UTF-8 with if/else/while
if (byte1 < 0x80) {
return true; // ASCII
}
if (byte1 < 0xE0) {
if (byte1 < 0xC2 || byte2 > 0xBF) {
return false;
}
} else if (byte1 < 0xF0) {
// Three-byte form.
if (byte2 > 0xBF
|| (byte1 == 0xE0 && byte2 < 0xA0)
|| (byte1 == 0xED && 0xA0 <= byte2)
blablabla
) blablabla
} else {
// Four-byte form.
.... blabla
}

Using SIMD
Load 32-byte registers
Use ~20 instructions
No branch, no branch misprediction

Example: Verify that all byte values are no larger than 244
Saturated subtraction: x - 244 is non-zero if and only if x > 244.
_mm256_subs_epu8(current_bytes, _mm256_set1_epi8((char)244));
One instruction, checks 32 bytes at once!

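To make the saturated-subtraction check concrete without requiring AVX2, here is a portable scalar model of it. The helper names are assumptions for this sketch; a vectorized version would apply _mm256_subs_epu8 to 32 bytes at once instead of looping byte by byte.

```cpp
#include <cstddef>
#include <cstdint>

// Saturating unsigned subtraction on one byte: x - y, clamped at 0.
inline std::uint8_t subs_u8(std::uint8_t x, std::uint8_t y) {
    return x > y ? static_cast<std::uint8_t>(x - y) : static_cast<std::uint8_t>(0);
}

// Scalar model of the vector check: OR together all saturated differences.
// The accumulated value stays zero if and only if every byte is <= 244.
bool all_bytes_at_most_244(const std::uint8_t* data, std::size_t length) {
    std::uint8_t error = 0;
    for (std::size_t i = 0; i < length; i++) {
        error = static_cast<std::uint8_t>(error | subs_u8(data[i], 244));
    }
    return error == 0;
}
```

The point of accumulating into `error` rather than returning early is that the loop body has no data-dependent exit, which mirrors how the SIMD validator defers the error check to the end.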
processing random UTF-8    cycles/byte
branching                  11
simdjson                   0.5
20x faster!
Source code available.

Example 2. Classifying characters
comma (0x2c): ,
colon (0x3a): :
brackets (0x5b, 0x5d, 0x7b, 0x7d): [, ], {, }
white-space (0x09, 0x0a, 0x0d, 0x20)
others
Classify 16, 32 or 64 characters at once!

Divide values into two 'nibbles'
0x2c is 2 (high nibble) and c (low nibble)
There are 16 possible low nibbles.
There are 16 possible high nibbles.

ARM NEON and x64 processors have instructions to look up 16-byte tables in a vectorized manner (16 values at a time): pshufb, tbl

Start with an array of 4-bit values
[1, 1, 0, 2, 0, 5, 10, 15, 7, 8, 13, 9, 0, 13, 5, 1]
Create a lookup table
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215]
0 -> 200, 1 -> 201, 2 -> 202, ...
Result:
[201, 201, 200, 202, 200, 205, 210, 215, 207, 208, 213, 209, 200, 213, 205, 201]

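The lookup above can be modeled in scalar code as follows. This is only an illustration of what a single pshufb or tbl instruction computes for indexes in the range 0..15; the Vec16 alias and the function name are assumptions of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::uint8_t, 16>;

// Scalar model of a vectorized 16-entry table lookup: each of the 16 index
// bytes selects one byte from the table. Indexes are masked to 4 bits here,
// which matches the hardware instructions only for indexes in 0..15.
Vec16 lookup16(const Vec16& table, const Vec16& indexes) {
    Vec16 result{};
    for (std::size_t i = 0; i < 16; i++) {
        result[i] = table[indexes[i] & 0x0F];
    }
    return result;
}
```

With the table [200, ..., 215] and the indexes from the slide, the function returns the result array shown above. On real hardware the 16 lookups happen in one instruction rather than in a loop.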
Find two tables H1 and H2 such that the bitwise AND of the two lookups classifies the characters.
H1(low(c)) & H2(high(c))
comma (0x2c): 1
colon (0x3a): 2
brackets (0x5b, 0x5d, 0x7b, 0x7d): 4
most white-space (0x09, 0x0a, 0x0d): 8
white space (0x20): 16
others: 0

const uint8x16_t low_nibble_mask =
(uint8x16_t){16, 0, 0, 0, 0, 0, 0, 0, 0, 8, 12, 1, 2, 9, 0, 0};
const uint8x16_t high_nibble_mask =
(uint8x16_t){8, 0, 18, 4, 0, 1, 0, 1, 0, 0, 0, 3, 2, 1, 0, 0};
const uint8x16_t low_nib_and_mask = vmovq_n_u8(0xf);
Five instructions:
uint8x16_t nib_lo = vandq_u8(chunk, low_nib_and_mask);
uint8x16_t nib_hi = vshrq_n_u8(chunk, 4);
uint8x16_t shuf_lo = vqtbl1q_u8(low_nibble_mask, nib_lo);
uint8x16_t shuf_hi = vqtbl1q_u8(high_nibble_mask, nib_hi);
return vandq_u8(shuf_lo, shuf_hi);
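For readers without an ARM machine at hand, the same classification can be checked one byte at a time in portable code. The two tables are copied from the NEON snippet above; the function name and table names are assumptions of this sketch. Note that with these particular table values the class bits come out as brackets 1, comma 2, colon 4, white-space 8 and space 16, so the numbering differs from the illustrative list given earlier.

```cpp
#include <cstdint>

// Tables copied from the NEON code: one entry per low nibble and per high nibble.
static const std::uint8_t kLowNibbleTable[16] =
    {16, 0, 0, 0, 0, 0, 0, 0, 0, 8, 12, 1, 2, 9, 0, 0};
static const std::uint8_t kHighNibbleTable[16] =
    {8, 0, 18, 4, 0, 1, 0, 1, 0, 0, 0, 3, 2, 1, 0, 0};

// Scalar model of the five vector instructions, applied to a single byte:
// split into nibbles, look each one up, AND the two results.
std::uint8_t classify(std::uint8_t c) {
    const std::uint8_t lo = c & 0x0F;
    const std::uint8_t hi = static_cast<std::uint8_t>(c >> 4);
    return static_cast<std::uint8_t>(kLowNibbleTable[lo] & kHighNibbleTable[hi]);
}
```

A byte gets a non-zero class only when both its low nibble and its high nibble agree on a common bit, which is why ordinary letters, digits and quotes map to 0.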

Example 3. Detecting escaped characters
" "
 
" "

Can you tell where the strings start and end?
{ "\\\"Nam[{": [ 116,"\\\\" ...
Without branching?

Escaped characters follow an odd-length sequence of backslashes!

Identify backslashes:
{ "\\\"Nam[{": [ 116,"\\\\"
___111________________1111_ : B
Odd and even positions
1_1_1_1_1_1_1_1_1_1_1_1_1_1 : E (constant)
_1_1_1_1_1_1_1_1_1_1_1_1_1_ : O (constant)

Do a bunch of arithmetic and logical operations...
(((B + ((B & ~(B << 1)) & E)) & ~B) & ~E) | (((B + ((B & ~(B << 1)) & O)) & ~B) & E)
Result:
{ "\\\"Nam[{": [ 116,"\\\\" ...
______1____________________
No branch!
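The expression above can be tried on a 64-bit word as follows. This is a sketch under the assumption that bit i of each mask corresponds to byte i of the input (so the leftmost column of the diagrams is bit 0); the function names and the string-to-mask helper are hypothetical and only serve the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bit i of the result is set when text[i] is a backslash (at most 64 bytes).
std::uint64_t backslash_mask(const std::string& text) {
    assert(text.size() <= 64);
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < text.size(); i++) {
        if (text[i] == '\\') { mask |= std::uint64_t{1} << i; }
    }
    return mask;
}

// Marks every position that directly follows an odd-length run of backslashes.
std::uint64_t follows_odd_backslash_run(std::uint64_t B) {
    const std::uint64_t E = 0x5555555555555555ULL;  // even positions 0, 2, 4, ...
    const std::uint64_t O = 0xAAAAAAAAAAAAAAAAULL;  // odd positions 1, 3, 5, ...
    const std::uint64_t starts = B & ~(B << 1);     // first backslash of each run
    // Adding a 1 at the start of a run carries through the run and sets the
    // bit just past its end; "& ~B" keeps only those end markers.
    const std::uint64_t ends_of_even_starts = (B + (starts & E)) & ~B;
    const std::uint64_t ends_of_odd_starts = (B + (starts & O)) & ~B;
    // A run has odd length when its start and its end marker have different parity.
    return (ends_of_even_starts & ~E) | (ends_of_odd_starts & E);
}
```

On the sample input, the run of three backslashes starts at an odd position and ends before position 6, which is even, so bit 6 is reported. The run of four backslashes starts and ends on even positions and is therefore ignored.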

Remove the escaped quotes, and the remaining quotes tell you where the strings are!

{ "\\\"Nam[{": [ 116,"\\\\"
__1___1_____1________1____1 : all quotes
______1____________________ : escaped quotes
__1_________1________1____1 : string-delimiter quotes

Find the span of the string
mask = quote xor (quote << 1);
mask = mask xor (mask << 2);
mask = mask xor (mask << 4);
mask = mask xor (mask << 8);
mask = mask xor (mask << 16);
...
__1_________1________1____1 (quotes)
becomes
__1111111111_________11111_ (string region)

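Written out in full for a 64-bit word, the shift-and-xor ladder computes a prefix XOR: bit i of the result is the parity of the number of quotes at positions 0 through i. The function name below is an assumption of this sketch.

```cpp
#include <cstdint>

// Prefix XOR over a 64-bit word. After the last step, bit i is set exactly
// when an odd number of quote bits occur at positions 0..i, that is, when
// position i lies inside a string (opening quote included, closing excluded).
std::uint64_t prefix_xor(std::uint64_t quote) {
    std::uint64_t mask = quote ^ (quote << 1);
    mask ^= mask << 2;
    mask ^= mask << 4;
    mask ^= mask << 8;
    mask ^= mask << 16;
    mask ^= mask << 32;
    return mask;
}
```

Each step doubles the distance over which the parity has been accumulated, so six steps are enough to cover all 64 positions.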
Entire structure of the JSON document can be identified (as a bitset) without any branch!

Example 4. Decode bit indexes
Given the bitset 1000100010001, we want the location of the 1s (e.g., 0, 4, 8, 12)

while (word != 0) {
result[i] = trailingzeroes(word);
word = word & (word - 1);
i++;
}
If the number of 1s per 64-bit word is hard to predict: lots of mispredictions!!!

Instead of predicting the number of 1s per 64-bit word, predict whether it is in
{1,2,3,4}
{5,6,7,8}
{9,10,11,12}
Easier!

Reduce the number of mispredictions by doing more work per iteration:
while (word != 0) {
result[i] = trailingzeroes(word);
word = word & (word - 1);
result[i+1] = trailingzeroes(word);
word = word & (word - 1);
result[i+2] = trailingzeroes(word);
word = word & (word - 1);
result[i+3] = trailingzeroes(word);
word = word & (word - 1);
i+=4;
}
Discard bogus indexes by counting the number of 1s in the word directly (e.g., bitCount)

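A compilable version of the unrolled loop might look like the sketch below. It assumes a GCC- or Clang-style compiler for __builtin_ctzll and __builtin_popcountll; the wrapper that returns 64 for a zero word is an addition of this sketch, needed because __builtin_ctzll is undefined for 0 and the unrolled body can reach that case.

```cpp
#include <cstddef>
#include <cstdint>

// Trailing zero count that is well defined for 0 (returns 64 in that case).
inline std::uint32_t trailing_zeroes(std::uint64_t word) {
    return word == 0 ? 64u : static_cast<std::uint32_t>(__builtin_ctzll(word));
}

// Writes the positions of the set bits of `word` to `out`, four per iteration,
// and returns how many of them are valid. `out` must have room for the number
// of set bits rounded up to a multiple of 4 (64 entries always suffice).
std::size_t decode_bit_indexes(std::uint64_t word, std::uint32_t* out) {
    const std::size_t count = static_cast<std::size_t>(__builtin_popcountll(word));
    std::size_t i = 0;
    while (word != 0) {
        out[i] = trailing_zeroes(word);
        word &= word - 1;  // clear the lowest set bit (0 stays 0)
        out[i + 1] = trailing_zeroes(word);
        word &= word - 1;
        out[i + 2] = trailing_zeroes(word);
        word &= word - 1;
        out[i + 3] = trailing_zeroes(word);
        word &= word - 1;
        i += 4;
    }
    return count;  // entries past `count` are bogus and must be ignored
}
```

The loop condition is now evaluated once per four bits, so the branch predictor only has to guess which group of four the population count falls into.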
Example 5. Number parsing is expensive
strtod:
90 MB/s
38 cycles per byte
10 branch misses per floating-point number

Check whether we have 8 consecutive digits
bool is_made_of_eight_digits_fast(const char *chars) {
uint64_t val;
memcpy(&val, chars, 8);
return (((val & 0xF0F0F0F0F0F0F0F0) |
(((val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0) >> 4))
== 0x3333333333333333);
}
56
Thenconstructthecorrespondinginteger
Usingonlythreemultiplications(insteadof7):
uint32_t parse_eight_digits_unrolled(const char *chars) {
uint64_t val;
memcpy(&val, chars, sizeof(uint64_t));
val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;
return (val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32;
}
Can do even better with SIMD

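The two routines can be combined and exercised as shown below. They are repeated here so that the block compiles on its own; parse_if_eight_digits is a hypothetical helper added for the example, not part of the slides. The arithmetic assumes a little-endian machine, where the first character ends up in the least significant byte.

```cpp
#include <cstdint>
#include <cstring>

bool is_made_of_eight_digits_fast(const char* chars) {
    std::uint64_t val;
    std::memcpy(&val, chars, 8);
    // A digit byte 0x30..0x39 has high nibble 3, and still has it after adding 6.
    return (((val & 0xF0F0F0F0F0F0F0F0) |
             (((val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0) >> 4)) ==
            0x3333333333333333);
}

std::uint32_t parse_eight_digits_unrolled(const char* chars) {
    std::uint64_t val;
    std::memcpy(&val, chars, sizeof(std::uint64_t));
    val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;      // pairs of digits: 10*a + b
    val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;  // groups of four digits
    return static_cast<std::uint32_t>(
        (val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32);  // all eight digits
}

// Hypothetical helper: parse exactly eight ASCII digits, or report failure.
bool parse_if_eight_digits(const char* chars, std::uint32_t* out) {
    if (!is_made_of_eight_digits_fast(chars)) { return false; }
    *out = parse_eight_digits_unrolled(chars);
    return true;
}
```

The multipliers are 1 + 10·2^8, 1 + 100·2^16 and 1 + 10000·2^32, so each multiplication merges neighbouring lanes into a value twice as wide.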
Runtime dispatch
On first call, pointer checks CPU, and reassigns itself. No language support.

int json_parse_dispatch(...) {
Architecture best_implementation = find_best_supported_implementation();
// Selecting the best implementation
switch (best_implementation) {
case Architecture::HASWELL:
json_parse_ptr = &json_parse_implementation<Architecture::HASWELL>;
break;
case Architecture::WESTMERE:
json_parse_ptr= &json_parse_implementation<Architecture::WESTMERE>;
break;
default:
return UNEXPECTED_ERROR;
}
return json_parse_ptr(....);
}
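A complete, compilable toy version of this self-reassigning pointer could look like the following. Everything in it is a stand-in: the CPU detection is a placeholder that always picks a fallback, the three "implementations" just return the input length, and the dispatch_calls counter exists only to show that the dispatcher runs once.

```cpp
#include <cstddef>

enum class Architecture { HASWELL, WESTMERE, FALLBACK };

using parse_fn = int (*)(const char* buf, std::size_t len);

// Placeholder implementations (hypothetical): a real library would provide
// one parser per instruction set.
int parse_haswell(const char*, std::size_t len) { return static_cast<int>(len); }
int parse_westmere(const char*, std::size_t len) { return static_cast<int>(len); }
int parse_fallback(const char*, std::size_t len) { return static_cast<int>(len); }

// Placeholder detection: a real version would query the CPU (e.g., CPUID).
Architecture find_best_supported_implementation() { return Architecture::FALLBACK; }

int parse_dispatch(const char* buf, std::size_t len);

// The pointer initially targets the dispatcher itself.
static parse_fn json_parse_ptr = &parse_dispatch;
static int dispatch_calls = 0;  // instrumentation for this example only

int parse_dispatch(const char* buf, std::size_t len) {
    dispatch_calls++;
    switch (find_best_supported_implementation()) {
        case Architecture::HASWELL:  json_parse_ptr = &parse_haswell;  break;
        case Architecture::WESTMERE: json_parse_ptr = &parse_westmere; break;
        default:                     json_parse_ptr = &parse_fallback; break;
    }
    return json_parse_ptr(buf, len);  // forward the first call
}

// Public entry point: after the first call this is a plain indirect call.
int json_parse(const char* buf, std::size_t len) { return json_parse_ptr(buf, len); }
```

After the first call, json_parse_ptr no longer points at the dispatcher, so later calls pay only for one indirect call. This sketch is not thread-safe; concurrent first calls would likely need an atomic pointer.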

Where to get it?
GitHub: https://github.com/lemire/simdjson/
Modern C++, single-header (easy integration)
ARM (e.g., iPhone), x64 (going back 10 years)
Apache 2.0 (no hidden patents)
Used by Microsoft FishStore and Yandex ClickHouse
wrappers in Python, PHP, C#, Rust, JavaScript (node), Ruby
ports to Rust, Go and C#

Reference
Geoff Langdale, Daniel Lemire, Parsing Gigabytes of JSON per Second, VLDB Journal, https://arxiv.org/abs/1902.08318

Credit
Geoff Langdale (algorithmic architect and wizard)
Contributors:
Thomas Navennec, Kai Wolf, Tyler Kennedy, Frank Wessels, George Fotopoulos, Heinz N. Gies, Emil Gedda, Wojciech Muła, Georgios Floros, Dong Xie, Nan Xiao, Egor Bogatov, Jinxi Wang, Luiz Fernando Peres, Wouter Bolsterlee, Anish Karandikar, Reini Urban, Tom Dyson, Ihor Dotsenko, Alexey Milovidov, Chang Liu, Sunny Gleason, John Keiser, Zach Bjornson, Vitaly Baranov, Juho Lauri, Michael Eisel, Io Daza Dillon, Paul Dreik, Jérémie Piotte and others