forked from dbpedia/extraction-framework
final Live merge #4
Merged
The HomepageExtractor was not able to correctly detect references to official webpages when conveyed through templates placed in the External links section. Only matches to Template:Official were considered and the URL extraction was failing when the template property contained an ExternalLinkNode.

With this change:
- A map is provided to store info about official website templates, for different languages
- Template redirects are checked when looking for official websites
- The URL is also extracted when the template property is an ExternalLinkNode

I was able to extract 550k homepages from enwiki-20130604 compared to about 505k items without this patch (about 10% more).
Fix HomepageExtractor issues
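To make the redirect check concrete, here is a minimal sketch of how a template title could be resolved through redirects before being compared with a per-language list of official website templates. The object name, the map contents and the redirect table are all hypothetical and only illustrate the idea; the patch itself works on the framework's own template and redirect types.

```scala
object OfficialWebsiteTemplates {
  // Hypothetical per-language config: titles of templates that mark an official website.
  val templatesByLanguage: Map[String, Set[String]] = Map(
    "en" -> Set("Official website")
  )

  // Hypothetical redirect table: template title -> title it redirects to.
  val redirects: Map[String, String] = Map(
    "Official" -> "Official website",
    "Official site" -> "Official website"
  )

  // Follow redirects until a title without a redirect is reached.
  // `seen` guards against redirect cycles.
  @annotation.tailrec
  def resolve(title: String, seen: Set[String] = Set.empty): String =
    redirects.get(title) match {
      case Some(target) if !seen.contains(title) => resolve(target, seen + title)
      case _ => title
    }

  // A template counts as "official" if its resolved title is configured for the language.
  def isOfficialTemplate(lang: String, title: String): Boolean =
    templatesByLanguage.getOrElse(lang, Set.empty[String]).contains(resolve(title))
}
```

With this table, `isOfficialTemplate("en", "Official")` holds because `Official` redirects to `Official website`, while a language without an entry never matches.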
When extracting quads from the extraction server, we should deduplicate quads so that we get a similar result to that produced by the extraction framework
Issue dbpedia#144: Deduplicate quads in the extraction server
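As a rough illustration of the deduplication step, the sketch below uses a simplified stand-in for a quad (the `Quad` case class here is an assumption, not the framework's own class) and drops repeated quads while keeping the order of first occurrence.

```scala
// Simplified stand-in for the framework's quad type (assumption).
final case class Quad(subject: String, predicate: String, value: String, context: String)

object QuadDedup {
  // Case classes compare structurally, so `distinct` removes quads whose
  // fields are all equal, keeping the first occurrence and the original order.
  def deduplicate(quads: Seq[Quad]): Seq[Quad] = quads.distinct
}
```

Whether the context should take part in the comparison is a design choice; in this sketch two quads that differ only in context are both kept.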
Various enhancements for Live
…perty
- Decide whether to generate statistics by reading a JVM system property. By doing that it is possible to create stats without having to edit the code and recompile the framework.
- Add a Scala launcher in pom to start stats creation by setting the appropriate JVM system property
Issue dbpedia#149: Read stats generation property from JVM system property
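A minimal sketch of reading such a switch from a JVM system property follows. The property name `generate.stats` is made up for the example, since the actual key is not named in this conversation; the `props` parameter exists only so the check can be exercised without touching global state.

```scala
object StatsConfig {
  // Hypothetical property name; the real key used by the framework is not given here.
  val PropertyName = "generate.stats"

  // True only when the property is present and equals "true" (case-insensitive).
  // A missing property means statistics are not generated.
  def statsEnabled(props: java.util.Properties = System.getProperties): Boolean =
    Option(props.getProperty(PropertyName)).exists(_.trim.equalsIgnoreCase("true"))
}
```

Under that assumption, passing `-Dgenerate.stats=true` to the JVM would turn stats generation on without a code change or recompilation.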
Set default values in case any of the config required by the HomepageExtractor does not exist for a specific language (e.g. there is no official website template in itwiki, but there is an official keyword used in the External links section). English is not used as the default because that looks like a strong assumption which is likely to be either wrong or disruptive.
HomepageExtractor config: return default empty values for undefined settings in a language
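The idea of per-setting empty defaults can be sketched as below. The object, the setting names and the sample values are illustrative assumptions; the point is only that each lookup falls back to an empty value for its own setting rather than to the English entry.

```scala
object HomepageSettings {
  // Illustrative values only: a language may define one setting and not the other.
  private val officialKeywords: Map[String, Set[String]] = Map(
    "en" -> Set("official"),
    "it" -> Set("ufficiale")
  )
  private val officialTemplates: Map[String, Set[String]] = Map(
    "en" -> Set("Official website")
  )

  // Each lookup defaults to an empty set, never to another language's values.
  def keywordsFor(lang: String): Set[String] =
    officialKeywords.getOrElse(lang, Set.empty[String])

  def templatesFor(lang: String): Set[String] =
    officialTemplates.getOrElse(lang, Set.empty[String])
}
```

With these sample values, `it` gets a keyword but an empty template set, so template matching simply finds nothing for that language instead of applying English template names.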
…ersand
Issue dbpedia#151: URL encode wikititle in sample page extraction
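For illustration, a wiki title can be encoded with the standard `java.net.URLEncoder` before it is placed in a query string, so that reserved characters such as `&` are not read as parameter separators. The object and method names below are hypothetical and not taken from the patch.

```scala
import java.net.URLEncoder

object WikiTitleEncoding {
  // Percent-encodes reserved characters using UTF-8.
  // Note that URLEncoder targets form encoding, so a space becomes '+'.
  def encodeTitle(title: String): String = URLEncoder.encode(title, "UTF-8")
}
```

For example, `WikiTitleEncoding.encodeTitle("AT&T")` yields `AT%26T`.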
With this request the refactoring is almost complete. What you should do before (or after) you make a pull request to the main repo:
- `withFilter` method does not yet exist on `net.liftweb.json.JsonAST.JValue`, using `filter` method instead; try `mvn clean install` to see them