Support lenient parsing mode to ignore trailing garbage (like RapidJSON's kParseStopWhenDoneFlag or Spark's PERMISSIVE mode) #2502

zhanglistar · 2025-10-10T07:15:47Z

Currently, simdjson fails to parse JSON strings with trailing garbage characters, returning an error. For example, the string {"a":"b"}abde fails entirely due to the abde suffix, preventing extraction of the valid {"a":"b"} part.

This is common in "dirty" data scenarios like logs, user inputs, or big data pipelines (e.g., Apache Spark), where JSON may be mixed with invalid characters. Many libraries support a lenient mode to ignore such garbage for robustness.

Expected Behavior

Add an optional parsing flag (e.g., parse_flags::stop_when_done or parse_flags::lenient_trailing) that allows the parser to stop immediately after the valid JSON root object/array, ignoring subsequent characters. On success, return simdjson::SUCCESS, and optionally report the garbage position (e.g., via a byte offset in padded_json).

This would mirror:

RapidJSON's kParseStopWhenDoneFlag: Stops after parsing the root element, ignoring trailing content.
Apache Spark's PERMISSIVE mode: Leniently parses malformed records, extracting valid parts into normal fields and placing invalid parts in a _corrupt_record-like field (a similar mechanism would be optional here).

Desired Behavior (Succeeds, Ignores Garbage):

// Assuming new flag: parse_flags::stop_when_done (proposed, does not exist today)
auto doc = parser.parse(json, simdjson::parse_flags::stop_when_done);
if (doc.error()) {
  std::cerr << "Error: " << doc.error() << std::endl;
  return 1;
}
// Now doc["a"] == "b"; optionally: json.trailing_offset() == 9 (offset of 'a' in "abde")
std::cout << doc["a"] << std::endl; // Outputs: b
return 0;
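
Until such a flag exists, one possible workaround is to trim the input before handing it to a strict parser. The sketch below is an illustration only: `root_end_offset` is a hypothetical helper, not part of simdjson or RapidJSON. It merely balances brackets while skipping string contents and does not validate the JSON, so the trimmed prefix should still go through a real parser.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not a simdjson API): returns the offset one past the
// end of the first complete root object or array. Returns std::nullopt if the
// input does not start with '{' or '[' (after whitespace) or if the root is
// never closed. It counts brackets only; it does NOT validate the JSON.
std::optional<std::size_t> root_end_offset(std::string_view text) {
  std::size_t i = 0;
  // Skip leading JSON whitespace.
  while (i < text.size() && (text[i] == ' ' || text[i] == '\t' ||
                             text[i] == '\n' || text[i] == '\r')) {
    ++i;
  }
  if (i == text.size() || (text[i] != '{' && text[i] != '[')) {
    return std::nullopt;
  }
  std::size_t depth = 0;
  bool in_string = false;
  for (; i < text.size(); ++i) {
    const char c = text[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character so \" does not end the string
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      ++depth;
    } else if (c == '}' || c == ']') {
      if (--depth == 0) {
        return i + 1;  // one past the closing bracket of the root
      }
    }
  }
  return std::nullopt;  // root value was never closed
}
```

For the input {"a":"b"}abde the helper returns 9, so text.substr(0, 9) is {"a":"b"} and the trailing garbage starts at byte offset 9.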

Benefits

Compatibility: Eases migration from libraries like RapidJSON without pre-processing inputs.
Performance: Minimal overhead with simdjson's SIMD optimizations: just tweak the parsing loop's end condition.
Use Cases: Log aggregation (ELK Stack), ETL tools, real-time streams (Kafka + Spark), avoiding pipeline interruptions from imperfect JSON.

Related References

RapidJSON docs: https://rapidjson.org/md_doc_parsing.html (kParseStopWhenDoneFlag)
Spark JSON modes: https://spark.apache.org/docs/latest/sql-data-sources-json.html (PERMISSIVE mode)
Similar issues in other libs: StackOverflow on trailing garbage (e.g., https://stackoverflow.com/questions/38858345/parse-error-trailing-garbage-while-trying-to-parse-json-column-in-data-frame)

lemire · 2025-10-10T14:18:07Z

Maintainer

@zhanglistar I have converted your issue to a discussion under 'ideas'.

The simdjson library is a community-based project. We do support non-standard JSON under some conditions with deliberately undocumented flags.

I have personally expressed my own stance:

Daniel Lemire, "Just say no to broken JSON," in Daniel Lemire's blog, July 4, 2025, https://lemire.me/blog/2025/07/04/just-say-no-to-broken-json/.

> Minimal overhead with simdjson's SIMD optimizations: just tweak the parsing loop's end condition.

I would encourage you to prepare a pull request. However, please note that we will not allow broken JSON to pass through by default. (See the rest of my message to understand why.)

> Similar issues in other libs: StackOverflow on trailing garbage (e.g., https://stackoverflow.com/questions/38858345/parse-error-trailing-garbage-while-trying-to-parse-json-column-in-data-frame)

That's an interesting example. The user is trying to parse as a JSON document the following...

{"id": -2, "ipAddress": "100.100.100.100", "howYouHearAboutUs": null, "isInterestedInOffer": true, "incomeRange": 60000, "isEmailConfirmed": false}
{"id": -1, "firstName": "John", "lastName": "Smith", "email": "john.smith@gmail.com", "city": "Smalltown", "incomeRange": 1, "birthDate": "1999-12-10T05:00:00Z", "password": "*********", "agreeToTermsOfUse": true, "howYouHearAboutUs": "Radio", "isInterestedInOffer": false}
{"id": -3, "visitUrl": "https://www.website.com/?purpose=X", "ipAddress": "100.200.300.400", "howYouHearAboutUs": null, "isInterestedInOffer": true, "incomeRange": 100000, "isEmailConfirmed": true, "isIdentityConfirmed": false, "agreeToTermsOfUse": true, "validationResults": null}

And they get an error.

Should this input parse as a JSON document, silently ignoring that it is not ONE document but rather a stream of JSON documents?

Of course, the correct behaviour is to get an error. It would be totally counterproductive to just silently ignore the issue. The first answer on this StackOverflow question explains how the user misunderstands what he is doing, and that fixing the issue involves thinking about what one is trying to do.

We do support streams of JSON documents in simdjson. So we do have a direct solution to the problem encountered by this issue, directly in simdjson. And it does not involve silently ignoring the error.

We would want our users to do the following...

auto json = R"({ "foo": 1 } { "foo": 2 } { "foo": 3 } )"_padded;
ondemand::parser parser;
ondemand::document_stream docs = parser.iterate_many(json);
for (auto doc : docs) {
 //...
}
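
simdjson's iterate_many is the right tool for this input. To make the difference between one document and a stream of documents concrete without depending on the library, the following standard-library-only sketch slices such an input into its top-level objects and arrays, and throws on anything else found between them instead of silently ignoring it. `split_documents` is a hypothetical name used for illustration; this is not how simdjson is implemented, and a bracket counter like this does not validate the JSON inside each slice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <vector>

// Hypothetical helper (not a simdjson API): returns one string_view per
// top-level object or array in a whitespace-separated stream of documents.
// Throws std::invalid_argument on stray content or a truncated document.
std::vector<std::string_view> split_documents(std::string_view text) {
  std::vector<std::string_view> docs;
  std::size_t depth = 0;  // current bracket nesting level
  std::size_t start = 0;  // offset where the current document began
  bool in_string = false;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const char c = text[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
    } else if (depth == 0 && c != '{' && c != '[') {
      // Between documents only JSON whitespace is acceptable.
      if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
        throw std::invalid_argument("unexpected content between documents");
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      if (depth++ == 0) {
        start = i;  // a new top-level document begins here
      }
    } else if (c == '}' || c == ']') {
      if (--depth == 0) {
        docs.push_back(text.substr(start, i - start + 1));
      }
    }
  }
  if (depth != 0) {
    throw std::invalid_argument("truncated document at end of input");
  }
  return docs;
}
```

Applied to the three-document string above, this yields three slices, while an input such as {"a":"b"}abde raises an error at the first stray character, which matches the position taken in this reply: report the problem rather than hide it.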

1 reply

zhanglistar Oct 11, 2025
Author

Thanks for your kind reply, @lemire. In the real world, users can submit malformed JSON data to big data systems; that is a common case. Some JSON libraries, such as RapidJSON and jsoncpp, already support a lenient parsing mode to tolerate this. Since simdjson is a widely used JSON parsing library, I think it should also support this feature. I may open a PR for this later if I have time. Thanks again.
