Support lenient parsing mode to ignore trailing garbage (like RapidJSON's kParseStopWhenDoneFlag or Spark's PERMISSIVE mode) #2502
@zhanglistar I have converted your issue to a discussion under 'ideas'. The simdjson library is a community-based project. We do support non-standard JSON under some conditions with deliberately undocumented flags. I have personally expressed my own stance:
I would encourage you to prepare a pull request. However, please note that we will not let broken JSON pass through by default. (See the rest of my message to understand why.)
That's an interesting example. The user is trying to parse the following as a JSON document:
{"id": -2, "ipAddress": "100.100.100.100", "howYouHearAboutUs": null, "isInterestedInOffer": true, "incomeRange": 60000, "isEmailConfirmed": false}
{"id": -1, "firstName": "John", "lastName": "Smith", "email": "john.smith@gmail.com", "city": "Smalltown", "incomeRange": 1, "birthDate": "1999-12-10T05:00:00Z", "password": "*********", "agreeToTermsOfUse": true, "howYouHearAboutUs": "Radio", "isInterestedInOffer": false}
{"id": -3, "visitUrl": "https://www.website.com/?purpose=X", "ipAddress": "100.200.300.400", "howYouHearAboutUs": null, "isInterestedInOffer": true, "incomeRange": 100000, "isEmailConfirmed": true, "isIdentityConfirmed": false, "agreeToTermsOfUse": true, "validationResults": null}
And they get an error. Should this input parse as a JSON document, silently ignoring that it is not ONE document but rather a stream of JSON documents? Of course, the correct behaviour is to get an error. It would be totally counterproductive to just silently ignore the issue. And the first answer on this StackOverflow question explains how the user misunderstands what they are doing, and how fixing the issue involves thinking about what one is trying to do.
We do support streams of JSON documents in simdjson. So we do have a direct solution to the problem encountered by this issue, directly in simdjson. And it does not involve silently ignoring the error. We would want our users to do the following...
auto json = R"({ "foo": 1 } { "foo": 2 } { "foo": 3 } )"_padded;
ondemand::parser parser;
ondemand::document_stream docs = parser.iterate_many(json);
for (auto doc : docs) {
//...
}
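To make the "stream of documents" idea concrete for input like the three-record example above, here is a minimal standard-library sketch that treats each non-blank line as its own document. It assumes exactly one JSON document per line, which is an assumption about the input and not something simdjson requires. The function name `split_json_lines` is hypothetical and is not part of simdjson.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of simdjson): splits newline-delimited input
// into one string_view per non-blank line. The views point into `input`, so
// `input` must outlive the returned vector.
std::vector<std::string_view> split_json_lines(std::string_view input) {
    std::vector<std::string_view> records;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t eol = input.find('\n', pos);
        if (eol == std::string_view::npos) eol = input.size();
        assert(eol >= pos);  // find() never returns a position before `pos`
        std::string_view line = input.substr(pos, eol - pos);
        // Drop the carriage return left over from CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        // Keep only lines that contain something other than spaces or tabs.
        if (line.find_first_not_of(" \t") != std::string_view::npos) {
            records.push_back(line);
        }
        pos = eol + 1;
    }
    return records;
}
```

Each returned view can then be parsed on its own as a single document, so an error in one record stays confined to that record. With simdjson this step is unnecessary, since `iterate_many` accepts the whole buffer of whitespace-separated documents directly.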
Currently, simdjson fails to parse JSON strings with trailing garbage characters, returning an error. For example, the string {"a":"b"}abde fails entirely due to the abde suffix, preventing extraction of the valid {"a":"b"} part.
This is common in "dirty" data scenarios like logs, user inputs, or big data pipelines (e.g., Apache Spark), where JSON may be mixed with invalid characters. Many libraries support a lenient mode to ignore such garbage for robustness.
Expected Behavior
Add an optional parsing flag (e.g., parse_flags::stop_when_done or parse_flags::lenient_trailing) that allows the parser to stop immediately after the valid JSON root object/array, ignoring subsequent characters. On success, return simdjson::SUCCESS, and optionally report the garbage position (e.g., via a byte offset in padded_json).
This would mirror:
RapidJSON's kParseStopWhenDoneFlag: Stops after parsing the root element, ignoring trailing content.
Apache Spark's PERMISSIVE mode: Leniently parses malformed records, extracting valid parts into normal fields and placing invalid parts in a _corrupt_record-like field (optional similar mechanism).
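Desired Behavior (Succeeds, Ignores Garbage):
#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  dom::parser parser;
  auto json = R"({"a":"b"}abde)"_padded;
  // Proposed flag, does not exist in simdjson today: parse_flags::stop_when_done
  auto doc = parser.parse(json, simdjson::parse_flags::stop_when_done);
  if (doc.error()) {
    std::cerr << "Error: " << doc.error() << std::endl;
    return 1;
  }
  // Now doc["a"] == "b"; optionally: json.trailing_offset() == 9 (position of 'a' in "abde")
  std::cout << doc["a"] << std::endl; // Outputs: "b" (elements print as JSON, quotes included)
  return 0;
}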
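Until such a flag exists, a similar effect can be approximated by trimming the input before it reaches the strict parser. The following is a minimal standard-library sketch of that pre-processing step, assuming the root value is an object or array. The name `root_end_offset` is hypothetical and is not part of simdjson. The helper only tracks brackets and string literals, so it does not validate anything; the trimmed prefix still has to go through the real parser.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not part of simdjson): returns the byte offset one past
// the end of the first top-level JSON object or array in `text`, scanning from
// `start`. Returns std::nullopt when the input does not begin with '{' or '['
// (after whitespace) or when the root value is never closed.
std::optional<std::size_t> root_end_offset(std::string_view text,
                                           std::size_t start = 0) {
    assert(start <= text.size());
    std::size_t i = start;
    // Skip leading JSON whitespace.
    while (i < text.size() && (text[i] == ' ' || text[i] == '\t' ||
                               text[i] == '\n' || text[i] == '\r')) {
        ++i;
    }
    if (i == text.size() || (text[i] != '{' && text[i] != '[')) {
        return std::nullopt;
    }
    std::size_t depth = 0;   // current bracket nesting level
    bool in_string = false;  // inside a "..." literal
    bool escaped = false;    // previous character was a backslash in a string
    for (; i < text.size(); ++i) {
        const char c = text[i];
        if (in_string) {
            if (escaped) escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"') in_string = false;
            continue;  // brackets inside strings do not count
        }
        if (c == '"') {
            in_string = true;
        } else if (c == '{' || c == '[') {
            ++depth;
        } else if (c == '}' || c == ']') {
            if (--depth == 0) return i + 1;  // root value just closed
        }
    }
    return std::nullopt;  // ran out of input before the root closed
}
```

For the input {"a":"b"}abde this returns 9, so `text.substr(0, 9)` is the candidate document and `text.substr(9)` is the trailing garbage, which a caller could log or keep in a field similar to Spark's _corrupt_record. The extra scan costs one more pass over the data, which is the main reason a built-in flag would be preferable.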
Benefits
Compatibility: Eases migration from libraries like RapidJSON without pre-processing inputs.
Performance: Minimal overhead with simdjson's SIMD optimizations; the change amounts to tweaking the parsing loop's end condition.
Use Cases: Log aggregation (ELK Stack), ETL tools, real-time streams (Kafka + Spark), avoiding pipeline interruptions from imperfect JSON.
Related References
RapidJSON docs: https://rapidjson.org/md_doc_parsing.html (kParseStopWhenDoneFlag)
Spark JSON modes: https://spark.apache.org/docs/latest/sql-data-sources-json.html (PERMISSIVE mode)
Similar issues in other libs: StackOverflow on trailing garbage (e.g., https://stackoverflow.com/questions/38858345/parse-error-trailing-garbage-while-trying-to-parse-json-column-in-data-frame)