Commit f80ab69

Merge pull request GeneralNewsExtractor#88 from kingname/develop
useless_attr must now match exactly before a node is removed
2 parents a1ccab2 + 26f4934 commit f80ab69

5 files changed: +1852 -7 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # General News Extractor Changelog

+## 0.2.3 (2020-09-15)
+
+### Bug fix
+
+1. A node checked against `USELESS_ATTR` is now removed only when its class matches an entry exactly. The previous substring-based matching caused the ifeng article body to be removed.
+
 ## 0.2.2 (2020-08-02)

 ### Features

gne/defaults.py

Lines changed: 3 additions & 2 deletions
@@ -57,7 +57,8 @@
 # if one tag in the follow list does not contain any child node nor content, it could be removed
 TAGS_CAN_BE_REMOVE_IF_EMPTY = ['section', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'span']

-USELESS_ATTR = ['share',
+USELESS_ATTR = {
+    'share',
     'contribution',
     'copyright',
     'copy-right',
@@ -69,7 +70,7 @@
     'social',
     'submeta',
     'report-infor'
-]
+}


 HIGH_WEIGHT_ARRT_KEYWORD = ['content',
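
The switch from a list to a set pairs with the exact-match check introduced in gne/utils.py: `class_name in USELESS_ATTR` on a set is a hash lookup that succeeds only when the whole string equals an entry. A minimal sketch, assuming a shortened copy of the constant (the real set has more entries than these hunks show) and a hypothetical class name `share-box`:

```python
# Shortened, illustrative copy of the constant; the real set in
# gne/defaults.py contains more entries than are visible in the diff.
USELESS_ATTR = {'share', 'contribution', 'copyright', 'copy-right'}

# Membership in a set compares whole strings.
exact_hit = 'share' in USELESS_ATTR          # identical to an entry
partial_hit = 'share-box' in USELESS_ATTR    # merely contains an entry

# Membership in a string is a substring search, which is what the old
# per-attribute loop relied on.
substring_hit = 'share' in 'share-box'

print(exact_hit, partial_hit, substring_hit)
```

Only the first and third checks succeed, which is the distinction the bug fix depends on.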

gne/utils.py

Lines changed: 4 additions & 4 deletions
@@ -33,10 +33,9 @@ def normalize_node(element: HtmlElement):

         class_name = node.get('class')
         if class_name:
-            for attribute in USELESS_ATTR:
-                if attribute in class_name:
-                    remove_node(node)
-                    break
+            if class_name in USELESS_ATTR:
+                remove_node(node)
+                break


 def html2element(html):
@@ -47,6 +46,7 @@ def html2element(html):

 def pre_parse(element):
     normalize_node(element)
+
     return element

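
The behavioural difference between the removed loop and the new check can be sketched with standalone predicates. The function names, the shortened `USELESS_ATTR`, and the sample class names are all hypothetical (the actual ifeng class that triggered the bug is not shown in this diff); only the first two matching rules come from the hunk above.

```python
USELESS_ATTR = {'share', 'copyright', 'social'}  # shortened, illustrative


def matches_by_substring(class_name):
    """Old rule: remove if any entry occurs anywhere in the class string."""
    return any(attribute in class_name for attribute in USELESS_ATTR)


def matches_exactly(class_name):
    """New rule: remove only if the whole class string equals an entry."""
    return class_name in USELESS_ATTR


def matches_any_token(class_name):
    """Not part of this commit: a possible variant that compares each
    whitespace-separated class token exactly."""
    return any(token in USELESS_ATTR for token in class_name.split())


# Hypothetical class attribute values.
samples = ['share', 'text-shared-body', 'share clearfix']
for name in samples:
    print(name, matches_by_substring(name), matches_exactly(name),
          matches_any_token(name))
```

Under these assumptions, a content wrapper such as `text-shared-body` is matched by the old rule but not by the new one. A multi-valued attribute such as `share clearfix` is also no longer matched by the exact rule; the token-based variant is one way to cover that case if it turns out to matter.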

setup.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
     name='gne',
     packages=find_packages(exclude=[]),
     install_requires=['lxml', 'numpy', 'pyyaml'],
-    version='0.2.2',
+    version='0.2.3',
     description='General extractor of news pages.',
     long_description=readme,
     long_description_content_type='text/markdown',

tests/ifeng/3.html

Lines changed: 1838 additions & 0 deletions
Large diffs are not rendered by default.
