@ninniuz (Contributor) commented Nov 28, 2013

The HomepageExtractor could not reliably detect official websites when they were
given through templates placed in the External links section.
Only Template:Official was matched, and URL extraction failed
when the template property contained an ExternalLinkNode.

With this change:

  • A map stores the official website templates for each language
  • Template redirects are resolved when looking for official website templates
  • The URL is also extracted when the template property is an ExternalLinkNode
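
The three changes listed above can be sketched roughly as follows. This is a Python illustration, not the framework's Scala code: every name in it (`OFFICIAL_TEMPLATES`, `REDIRECTS`, `ExternalLink`, `resolve_redirect`, `extract_homepage`) and the sample template titles are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-language map of templates that mark an official website.
# The entries are illustrative, not the values used in the patch.
OFFICIAL_TEMPLATES = {
    "en": {"official website"},
    "it": {"sito ufficiale"},
}

# Hypothetical redirect table: redirecting template title -> target title.
REDIRECTS = {
    "en": {"official": "official website", "official site": "official website"},
}


@dataclass
class ExternalLink:
    """Stand-in for a parsed external link node inside a template property."""
    url: str
    label: str = ""


def resolve_redirect(lang: str, title: str, max_hops: int = 5) -> str:
    """Follow template redirects for a bounded number of hops."""
    table = REDIRECTS.get(lang, {})
    title = title.strip().lower()
    for _ in range(max_hops):
        if title not in table:
            break
        title = table[title]
    return title


def extract_homepage(lang, template_title, prop) -> Optional[str]:
    """Return the URL if the template is a known official-website template."""
    target = resolve_redirect(lang, template_title)
    if target not in OFFICIAL_TEMPLATES.get(lang, set()):
        return None
    # The property may be plain text or an already parsed external link.
    url = prop.url if isinstance(prop, ExternalLink) else str(prop).strip()
    return url or None
```

In this sketch, a call such as `extract_homepage("en", "Official", ExternalLink("http://example.org/"))` resolves the redirect first and then reads the URL from the link node instead of expecting plain text.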

I was able to extract about 550k homepages from enwiki-20130604, compared to about 505k
without this patch (roughly 9% more).

Member

cleanProperty may return "", which then gets turned into "http://". Maybe we can save some computation by skipping empty strings at this point.
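
A minimal sketch of the suggested guard, written in Python for illustration. `clean_property` here is a hypothetical stand-in whose behaviour is assumed (the real cleanProperty is not shown in this thread), and `to_url` is an invented helper name.

```python
from typing import Optional


def clean_property(raw: str) -> str:
    """Hypothetical stand-in for cleanProperty: trims whitespace and outer brackets."""
    return raw.strip().strip("[]").strip()


def to_url(raw: str) -> Optional[str]:
    """Prefix a scheme only when there is something left to prefix."""
    cleaned = clean_property(raw)
    if not cleaned:
        # Skip early so an empty value never becomes the bare string "http://".
        return None
    if "://" in cleaned:
        return cleaned
    return "http://" + cleaned
```

Returning early on the empty string avoids both the bogus "http://" value and any later URL validation work on it.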

jimkont added a commit that referenced this pull request Dec 17, 2013
jimkont merged commit 0089903 into dbpedia:master Dec 17, 2013
ninniuz deleted the homepage_extractor_fixes branch December 17, 2013 14:29
jimkont modified the milestone: pastReleases Mar 19, 2015
jimkont added a commit to jimkont/extraction-framework that referenced this pull request Mar 26, 2015