-
Notifications
You must be signed in to change notification settings - Fork 261
feat: add CJK friendly emphasis extension #1059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat: add CJK friendly emphasis extension #1059
Conversation
pulldown-cmark/src/firstpass.rs
Outdated
| #[inline] | ||
| fn previous_two_chars(s: &str, ix: usize) -> (Option<char>, Option<char>) { | ||
| let mut iter = s[..ix].chars(); | ||
| let mut prev_prev = None; | ||
| let mut prev = None; | ||
| while let Some(ch) = iter.next() { | ||
| prev_prev = prev; | ||
| prev = Some(ch); | ||
| } | ||
| (prev, prev_prev) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This iterates through the full string, which makes emphasis parsing O(n^2), as caught by CI.
The previous implementation uses .chars().last(), which takes advantage of DoubleEndedIterator. Also, I wouldn't put #[inline] on an internal function unless a benchmark indicates it helps.
| #[inline] | |
| fn previous_two_chars(s: &str, ix: usize) -> (Option<char>, Option<char>) { | |
| let mut iter = s[..ix].chars(); | |
| let mut prev_prev = None; | |
| let mut prev = None; | |
| while let Some(ch) = iter.next() { | |
| prev_prev = prev; | |
| prev = Some(ch); | |
| } | |
| (prev, prev_prev) | |
| } | |
| fn previous_two_chars(s: &str, ix: usize) -> (Option<char>, Option<char>) { | |
| let mut iter = s[..ix].chars().rev(); | |
| let prev = iter.next(); | |
| let prev_prev = iter.next(); | |
| (prev, prev_prev) | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related, maybe this should not take ix, so that it's the caller's responsibility to slice the string.
CommonMark has a problem that the following emphasis marks
**are not recognized as emphasis marks in CJK.This pull request introduces support for CJK-friendly emphasis handling in the Markdown parser, aligning with the CommonMark CJK-friendly amendments specification.
It adds a new option to enable CJK-friendly emphasis parsing, updates the delimiter run logic to properly handle CJK characters and punctuation, and includes comprehensive tests to verify the new behavior. By default, the feature is disabled to maintain backward compatibility.
In addition to the specification, I also refer to the Tips for Implementers and Concrete ranges of each terms in tats-u/markdown-cjk-friendly for implementation.