This repository was archived by the owner on Sep 18, 2025. It is now read-only.

Commit dd5dc12

Merge pull request #77 from geohci/refactor-expand-node-differ
* Support detailed diffs for HTMLEntities
* Unrecognized HTML tags are just categorized as `Other Tag`
* Improve text formatting handling:
  * Treat text formatting like templates -- i.e. changes to type and/or contents trigger a change (as opposed to just type)
  * HTML tags are now treated case-insensitively in line with official parser so e.g., `<Small>` is the same as `<small>` -> Text Formatting
  * Formatting is context-dependent so preserve original tag to avoid parsing errors later on in processing
* Begin incorporating mwconstants external library in place of edit-types-maintained constants

Closes #75 and addresses parts of #74.
2 parents 4e88f4e + 87b6534 commit dd5dc12
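
The text formatting behavior described in the commit message can be illustrated with a minimal sketch. This is not the mwedittypes implementation: `ASSUMED_FORMATTING_TAGS`, `categorize_tag`, `FormattingNode`, and `formatting_changed` are hypothetical names invented for illustration, and the tag set is an assumed subset. The sketch also assumes that a case-only difference in the tag does not count as a change, which the commit message does not state.

```python
from dataclasses import dataclass

# Assumed, illustrative subset of formatting tags; the real library
# draws its constants from elsewhere and recognizes more categories.
ASSUMED_FORMATTING_TAGS = {"b", "i", "small", "big", "sub", "sup", "u", "s"}


def categorize_tag(tag_name: str) -> str:
    """Categorize an HTML tag name case-insensitively (hypothetical helper)."""
    if tag_name.lower() in ASSUMED_FORMATTING_TAGS:
        return "Text Formatting"
    # Anything this sketch does not recognize falls through to a catch-all.
    return "Other Tag"


@dataclass(frozen=True)
class FormattingNode:
    """Keeps the tag exactly as written so later processing can re-parse it."""
    original_tag: str
    contents: str

    @property
    def node_type(self) -> str:
        return categorize_tag(self.original_tag)


def formatting_changed(prev: FormattingNode, curr: FormattingNode) -> bool:
    """Report a change when the tag (ignoring case) or the contents differ."""
    prev_key = (prev.original_tag.lower(), prev.contents)
    curr_key = (curr.original_tag.lower(), curr.contents)
    return prev_key != curr_key
```

Under these assumptions, `categorize_tag("Small")` and `categorize_tag("small")` land in the same category, while editing only the contents of a `<small>` node is still reported as a change.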

11 files changed: +55 -8391 lines changed

.github/workflows/test.yml

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ jobs:
 pip install flake8 pytest
 pip install anytree
 pip install mwparserfromhell
+pip install mwconstants
 if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
 - name: Lint with flake8
 run: |

README.md

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ Wikitext/language is verrrrrrry complicated and so there are certain things we c
 * Sentences: full stop punctuation is used for [many things](https://en.wikipedia.org/wiki/Full_stop#Usage). Abbreviations are particularly challenging and will falsely split up sentences. On the other hand, Thai has no sentence punctuation so each paragraph is (incorrectly) considered the equivalent of a single sentence.
 * Words: we have done our best to extract words for whitespace-delimited languages but some languages use special spacing characters that may falsely split up words -- e.g., Bengali. We have done our best to detect account for these languages but may have missed some.
 * Media: images/audio/video can be included in articles via bracketed links, templates, and galleries. Each have their own syntax, and, in particular templates separate the image name from its formatting options. For galleries/bracket-links, we associate the formatting/caption options with the media and changes to them will trigger as media changes. For templates, we cannot do this.
+* Text Formatting: parsing text formatting is quite complicated and context-dependent. We parse the wikitext section-by-section so text formatting split up between sections might parse unexpectedly.

 For links, we assume that if the prefix is not for media or a category, the link is a wikilink to namespace 0. This is generally reasonable for current versions of Wikipedia articles
 but would overload the `Wikilink` class with e.g., user page links on talk pages or interwiki links for older versions of articles.
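
The new README caveat about section-by-section parsing can be made concrete with a rough sketch. It uses only the standard library rather than the project's actual parser, and `split_sections` and `has_unbalanced_italics` are hypothetical helpers written for illustration. The idea is that italic markup (`''`) opened in one section and closed in the next looks unbalanced once each section is examined on its own.

```python
import re

# Simplified heading pattern, e.g. "== History =="; real wikitext
# headings have more edge cases than this sketch handles.
HEADING = re.compile(r"^==+[^=\n].*?==+\s*$", re.MULTILINE)


def split_sections(wikitext: str) -> list[str]:
    """Split wikitext into chunks that each start at a heading (or at offset 0)."""
    starts = sorted({0, *(m.start() for m in HEADING.finditer(wikitext))})
    ends = starts[1:] + [len(wikitext)]
    return [wikitext[s:e] for s, e in zip(starts, ends) if wikitext[s:e]]


def has_unbalanced_italics(section: str) -> bool:
    """True when a chunk contains an odd number of '' italic markers."""
    markers = re.findall(r"(?<!')''(?!')", section)
    return len(markers) % 2 == 1


text = "Intro ''italic starts\n== History ==\nand ends'' here.\n"
sections = split_sections(text)
# Each section alone sees a single '' marker, so both look unbalanced,
# even though the markers pair up across the full text.
flags = [has_unbalanced_italics(s) for s in sections]
```

A formatting span that crosses a heading like this is the kind of input where per-section results may differ from what a whole-page parse would give.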

mwedittypes/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 __summary__ = "mwedittypes is a package that supports edit diffs and action detection for Wikipedia"
 __url__ = "https://github.com/geohci/edit-types"

-__version__ = "2.0.0"
+__version__ = "2.0.1"

 __license__ = "MIT License"
