@jimkont jimkont commented Jan 20, 2014

With this pull request the refactoring is almost complete. Here is what you should do before (or after) you open a pull request against the main repo:

  • two warnings remain for JsonLift: "`withFilter' method does not yet exist on net.liftweb.json.JsonAST.JValue, using `filter' method instead"; run `mvn clean install` to see them
  • JsonNode should hold a JsonObject, not a list of Nodes (we discussed this and agreed it was a temporary hack)
  • move all the extra code from JsonParser (that generates AST) into the individual extractors
    • Every extractor should define, at the beginning, variables for the paths that are of interest (e.g. json \ "claims" \ ...) and reuse these variables later. This will make it easy to adapt the code if the Wikidata format changes
    • Create org.dbpedia.extraction.util.JSONUtils to hold all the JSON-related custom functions you create
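
The last two points could look roughly like the sketch below. To stay runnable without lift-json on the classpath, it uses nested Scala Maps as a stand-in for the parsed JSON. `JSONUtils.lookup`, `ExampleExtractor` and the "labels" / "en" / "value" keys are assumptions made for illustration, not the framework's actual code; only "claims" comes from the comment above.

```scala
// Stand-in for a parsed JSON document: nested Maps instead of a lift-json JValue.
object JSONUtils {
  type Path = List[String]

  // Follow `path` key by key; returns None as soon as a key is missing
  // or an intermediate value is not an object.
  def lookup(root: Any, path: Path): Option[Any] =
    path.foldLeft(Option(root)) {
      case (Some(m: Map[_, _]), key) => m.asInstanceOf[Map[String, Any]].get(key)
      case _                         => None
    }
}

// Hypothetical extractor: every path it reads is declared once, at the top.
object ExampleExtractor {
  private val claimsPath: JSONUtils.Path = List("claims")
  private val labelPath: JSONUtils.Path  = List("labels", "en", "value") // assumed layout

  def label(doc: Any): Option[String] =
    JSONUtils.lookup(doc, labelPath).collect { case s: String => s }

  def hasClaims(doc: Any): Boolean =
    JSONUtils.lookup(doc, claimsPath).isDefined
}
```

If the Wikidata format changes, only the path values at the top of the extractor need to be touched; the methods that use them stay the same.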

jimkont and others added 28 commits December 6, 2013 09:52
The HomepageExtractor was not able to correctly detect references to official webpages
when conveyed through templates placed in the External links section.
Only matches to Template:Official were considered and the URL extraction was failing
when the template property contained an ExternalLinkNode.

With this change:
- A map is provided to store info about official website templates for different languages
- Template redirects are checked when looking for official websites
- URL is extracted also when the template property is an ExternalLinkNode

I was able to extract 550k homepages from enwiki-20130604 compared to about 505k items
without this patch (about 10% more).

When extracting quads from the extraction server, we should
deduplicate quads so that we get a similar result to that
produced by the extraction framework
Issue dbpedia#144: Deduplicate quads in the extraction server
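
A minimal sketch of the deduplication idea, assuming quads can be compared field by field. The `Quad` case class here is a simplified, hypothetical stand-in, not the framework's own class, and `QuadDedup` is an invented name.

```scala
// Hypothetical, simplified quad; the framework's own Quad class carries more fields.
final case class Quad(subject: String, predicate: String, value: String, context: String)

object QuadDedup {
  // `distinct` keeps the first occurrence of each element and preserves order;
  // case-class equality compares all four fields.
  def deduplicate(quads: Seq[Quad]): Seq[Quad] = quads.distinct
}
```

Keeping the original order means the server output stays stable from one request to the next, which makes it easier to compare against the framework's dump output.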
…perty

- Decide whether to generate statistics by reading a JVM system property.
By doing that it is possible to create stats without having to edit the code
and recompile the framework.
- Add a Scala launcher in pom to start stats creation by setting the
appropriate JVM system property
Issue dbpedia#149: Read stats generation property from JVM system property
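
One way to read such a switch, as a sketch only: the property name `extraction.generate.stats` and the `StatsConfig` object are assumptions for illustration, not the names used in the commit.

```scala
object StatsConfig {
  // Hypothetical property name, e.g. passed as -Dextraction.generate.stats=true
  val PropertyName = "extraction.generate.stats"

  // True only if the property is set to "true" (case-insensitive);
  // unset or any other value means "do not generate statistics".
  def generateStats: Boolean =
    sys.props.get(PropertyName).exists(_.trim.equalsIgnoreCase("true"))
}
```

Because the value is read at call time, the same compiled jar can be launched with or without the `-D` flag.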
Set default values in case any of the HomepageExtractor required config does not exist
for a specific language (e.g. there is no official website template in itwiki, but
there is an official keyword used in external links section).

Not using English as default because it looks like a strong assumption which is likely
to be either wrong or disruptive.
HomepageExtractor config: return default empty values for undefined settings in a language
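
The "empty default instead of English" behaviour can be sketched as follows. `HomepageConfig` and `templatesFor` are hypothetical names, and only the English entry (Template:Official) is taken from the commit messages above.

```scala
object HomepageConfig {
  // Only the English entry comes from the discussion above (Template:Official);
  // a real map would hold one entry per supported language.
  private val officialTemplates: Map[String, Set[String]] =
    Map("en" -> Set("Official"))

  // No English fallback: an unconfigured language gets an empty set.
  def templatesFor(lang: String): Set[String] =
    officialTemplates.getOrElse(lang, Set.empty)
}
```

With an empty set as the default, a language without an official website template simply matches nothing, rather than being matched against English template names.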
…ersand

Issue dbpedia#151: URL encode wikititle in sample page extraction
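
A rough sketch of encoding a title before it is placed in a URL, using only `java.net.URLEncoder` from the JDK. `WikiTitleEncoding` is a hypothetical name, and the space-to-underscore step is an assumption about how titles are normalised; the framework may well use its own helper for this.

```scala
import java.net.URLEncoder

object WikiTitleEncoding {
  // Wiki titles use underscores for spaces; URLEncoder would turn spaces into '+',
  // so replace them first, then percent-encode the rest as UTF-8.
  def encode(title: String): String =
    URLEncoder.encode(title.replace(' ', '_'), "UTF-8")
}
```

For example, `WikiTitleEncoding.encode("AT&T")` yields `AT%26T`, so the `&` can no longer be mistaken for a query-string separator.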
hadyelsahar added a commit that referenced this pull request Jan 23, 2014
@hadyelsahar hadyelsahar merged commit 563287b into hadyelsahar:wikidata Jan 23, 2014
4 participants