Commit 317fc6b

accurate number parsing (simdjson#558)
1 parent 1aaad22 commit 317fc6b

7 files changed, with 1382 additions and 254 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -402,7 +402,7 @@ _We do not aim to provide a general-purpose JSON library._ A library like RapidJ
 - The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
 - We parse integers and floating-point numbers as separate types which allows us to support large signed 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long` and large unsigned integers up to the value 18446744073709551615. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent an integer exactly as a signed or unsigned 64-bit value, we reject the JSON document.
 - We support the full range of 64-bit floating-point numbers (binary64). The values range from `std::numeric_limits<double>::lowest()` to `std::numeric_limits<double>::max()`, so from -1.7976e308 all the way to 1.7976e308. Extreme values (less than or equal to -1e308, greater than or equal to 1e308) are rejected: we refuse to parse the input document.
-- We test for accurate float parsing with a bound on the [unit of least precision (ULP)](https://en.wikipedia.org/wiki/Unit_in_the_last_place) of one. Practically speaking, this implies 15 digits of accuracy or better.
+- We test for accurate float parsing with perfect (ULP 0) accuracy. Many parsers offer only approximate floating-point parsing. RapidJSON also offers the option of accurate float parsing (`kParseFullPrecisionFlag`) but it comes at a significant performance penalty compared to the default settings.
 - We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation. The sajson parser does incomplete UTF-8 validation, accepting code point
 sequences like 0xb1 0x87.)
 - We fully validate the numbers. (Parsers like gason and ultranjson will accept `[0e+]` as valid JSON.)
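
The "ULP 0" claim in the README change above can be checked by comparing a parsed `double` against a correctly rounded reference and counting how many representable values lie between them. The following is a minimal sketch of such a check, not the test code from this commit. The names `ordered_bits` and `ulp_distance` are hypothetical, and the sketch assumes IEEE 754 binary64 doubles and non-NaN inputs.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: map a double to an integer whose ordering matches the
// ordering of the doubles themselves (assumes IEEE 754 binary64, no NaN).
static int64_t ordered_bits(double x) {
  int64_t bits;
  std::memcpy(&bits, &x, sizeof(bits)); // well-defined type punning
  // Negative doubles are stored as sign + magnitude; turn them into the
  // negated magnitude so that -0.0 and +0.0 both map to 0.
  return bits < 0 ? INT64_MIN - bits : bits;
}

// Hypothetical helper: number of representable doubles separating a and b.
// A result of 0 means the two values are bit-for-bit the same number
// (treating -0.0 and +0.0 as equal).
static uint64_t ulp_distance(double a, double b) {
  int64_t ia = ordered_bits(a);
  int64_t ib = ordered_bits(b);
  // Unsigned subtraction avoids signed overflow for operands of opposite sign.
  return ia >= ib ? uint64_t(ia) - uint64_t(ib) : uint64_t(ib) - uint64_t(ia);
}
```

A test in this style would parse each number with the parser under test, parse the same text with a reference such as `std::strtod`, and require `ulp_distance(parsed, reference) == 0`. The previous README wording corresponds to allowing a distance of at most 1.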

benchmark/parseandstatcompetition.cpp

Lines changed: 31 additions & 0 deletions
@@ -238,6 +238,32 @@ rapid_compute_stats(const simdjson::padded_string &p) {
   return answer;
 }
 
+__attribute__((noinline)) stat_t
+rapid_accurate_compute_stats(const simdjson::padded_string &p) {
+  stat_t answer;
+  char *buffer = (char *)malloc(p.size() + 1);
+  if(buffer == nullptr) {
+    return answer;
+  }
+  memcpy(buffer, p.data(), p.size());
+  buffer[p.size()] = '\0';
+  rapidjson::Document d;
+  d.ParseInsitu<kParseValidateEncodingFlag|kParseFullPrecisionFlag>(buffer);
+  answer.valid = !d.HasParseError();
+  if (!answer.valid) {
+    free(buffer);
+    return answer;
+  }
+  answer.number_count = 0;
+  answer.object_count = 0;
+  answer.array_count = 0;
+  answer.null_count = 0;
+  answer.true_count = 0;
+  answer.false_count = 0;
+  rapid_traverse(answer, d);
+  free(buffer);
+  return answer;
+}
 int main(int argc, char *argv[]) {
   bool verbose = false;
   bool just_data = false;
@@ -294,6 +320,11 @@ int main(int argc, char *argv[]) {
     printf("rapid: ");
     print_stat(s2);
   }
+  stat_t s2a = rapid_accurate_compute_stats(p);
+  if (verbose) {
+    printf("rapid full: ");
+    print_stat(s2a);
+  }
   stat_t s3 = sasjon_compute_stats(p);
   if (verbose) {
     printf("sasjon: ");
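
`rapid_accurate_compute_stats` above copies the input into a null-terminated `malloc` buffer because in-situ parsing modifies its input, and it has to call `free` on every return path. As an illustration only, the same copy can be expressed with an owning pointer so that the buffer is released automatically. The helper name `make_terminated_copy` is hypothetical and is not part of the benchmark.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper: return a heap-allocated, null-terminated copy of
// data[0..size). The unique_ptr frees the buffer on every exit path, so
// early returns need no explicit cleanup.
static std::unique_ptr<char[]> make_terminated_copy(const char *data,
                                                    std::size_t size) {
  std::unique_ptr<char[]> buffer(new char[size + 1]);
  std::memcpy(buffer.get(), data, size);
  buffer[size] = '\0'; // in-situ parsers expect a terminated, mutable string
  return buffer;
}
```

With such a helper, the parse call would receive `buffer.get()`. One difference in behavior: `new` reports allocation failure by throwing `std::bad_alloc`, whereas the committed code checks `malloc` for `nullptr`.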

benchmark/parsingcompetition.cpp

Lines changed: 13 additions & 0 deletions
@@ -130,11 +130,24 @@ bool bench(const char *filename, bool verbose, bool just_data, int repeat_multip
                 .HasParseError(),
             false, memcpy(buffer, p.data(), p.size()), repeat, volume,
             !just_data);
+#ifndef ALLPARSER
+  if (!just_data)
+#endif
+    BEST_TIME("RapidJSON (accurate number parsing) ",
+              d.Parse<kParseValidateEncodingFlag|kParseFullPrecisionFlag>((const char *)buffer)
+                  .HasParseError(),
+              false, memcpy(buffer, p.data(), p.size()), repeat, volume,
+              !just_data);
   BEST_TIME("RapidJSON (insitu)",
             d.ParseInsitu<kParseValidateEncodingFlag>(buffer).HasParseError(),
             false,
             memcpy(buffer, p.data(), p.size()) && (buffer[p.size()] = '\0'),
             repeat, volume, !just_data);
+  BEST_TIME("RapidJSON (insitu, accurate number parsing)",
+            d.ParseInsitu<kParseValidateEncodingFlag|kParseFullPrecisionFlag>(buffer).HasParseError(),
+            false,
+            memcpy(buffer, p.data(), p.size()) && (buffer[p.size()] = '\0'),
+            repeat, volume, !just_data);
 #ifndef ALLPARSER
   if (!just_data)
 #endif
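
Each `BEST_TIME` call above passes a `memcpy` expression that restores the buffer before the parse, which matters for the in-situ variants because they overwrite their input. The sketch below illustrates that measurement pattern under the assumption that the macro runs the restore step untimed before every repetition and reports the fastest run. It is not the macro's actual implementation, and `best_time_seconds` is a hypothetical name.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the benchmark macro: run `pre` (untimed) to reset
// state, time `test`, and keep the minimum elapsed time over `repeat` runs.
template <typename Pre, typename Test>
double best_time_seconds(Pre pre, Test test, std::size_t repeat) {
  double best = std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < repeat; i++) {
    pre(); // e.g. copy the pristine JSON back into the mutable buffer
    auto start = std::chrono::steady_clock::now();
    test(); // e.g. parse the buffer
    auto stop = std::chrono::steady_clock::now();
    double elapsed = std::chrono::duration<double>(stop - start).count();
    if (elapsed < best) {
      best = elapsed;
    }
  }
  return best;
}
```

Taking the minimum rather than the mean reduces the influence of scheduling noise, and keeping the restore step outside the timed region keeps the copy cost out of the accurate-parsing comparison.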
