
DBpedia infrastructure

A quick overview of the DBpedia servers, processes and modules. Not complete.

There are several components involved: the mappings wiki (http://mappings.dbpedia.org/), the DBpedia server (for lack of a better name, http://mappings.dbpedia.org/server/), the dump extraction (run on demand to produce a release), the live extraction (running continuously), the continuous build system, and possibly others.

The mappings wiki is just a MediaWiki instance that holds configuration data: the ontology and the mappings. Not much code here. We extended MediaWiki a bit: when a user updates an ontology or mapping page, the new page is sent to the DBpedia server, which tries to load it and reports any errors back to the wiki, which in turn displays them to the user.
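As a rough illustration of that round-trip, here is a minimal Scala sketch of a validation endpoint. The "/validate" path and the trivial parser are assumptions made for this sketch only; the real server uses the extraction framework's own mapping loader.

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import scala.io.Source

object ValidationServerSketch {

  // Stand-in for the framework's real mapping parser.
  def tryParseMapping(wikiText: String): Seq[String] =
    if (wikiText.contains("{{TemplateMapping")) Seq.empty
    else Seq("page does not contain a TemplateMapping")

  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(9999), 0)
    server.createContext("/validate", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        val page = Source.fromInputStream(exchange.getRequestBody, "UTF-8").mkString
        val errors = tryParseMapping(page)
        val response = if (errors.isEmpty) "OK" else errors.mkString("\n")
        val bytes = response.getBytes("UTF-8")
        // 200 if the mapping loads, 400 with error messages otherwise
        exchange.sendResponseHeaders(if (errors.isEmpty) 200 else 400, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.close()
      }
    })
    server.start()
  }
}
```

The MediaWiki side would then POST the edited page to such an endpoint and display the returned messages to the editor.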

The DBpedia server is a little HTTP server written in Scala. It's basically an HTTP wrapper around the extraction framework: it provides HTTP APIs that allow a client to execute certain parts of the framework. It also serves the mapping statistics and handles a few other things.
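From a client's point of view, that means a plain HTTP request is enough. The endpoint path and parameter below are assumptions made for this sketch, not documented API; check the server itself for the actual routes.

```scala
import scala.io.Source

object ServerClientSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample-extraction call: ask the server to extract one page.
    val url = "http://mappings.dbpedia.org/server/extraction/en/?title=Berlin"
    println(Source.fromURL(url, "UTF-8").mkString)
  }
}
```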

When we say "extraction framework", we usually mean the code in the core module. The DBpedia server and the other modules - like the dump and live extraction - use this code. The core module contains code to download the ontology and the mappings from the wiki and turn them into object structures that guide the extraction. It also contains many other extractors that are mostly independent of the ontology and mappings, and a lot of utility code, for example for serializing RDF triples.
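To make "serializing RDF triples" concrete, here is a toy sketch in the common N-Triples syntax. The case class and the example triple are only illustrative; the framework's own quad and serializer classes are not shown here.

```scala
object NTriplesSketch {

  // Simplified triple of three URIs (real data also has literals, datatypes, languages).
  case class Triple(subject: String, predicate: String, objectUri: String)

  def toNTriples(t: Triple): String =
    s"<${t.subject}> <${t.predicate}> <${t.objectUri}> ."

  def main(args: Array[String]): Unit = {
    val t = Triple(
      "http://dbpedia.org/resource/Berlin",
      "http://dbpedia.org/ontology/country",
      "http://dbpedia.org/resource/Germany")
    println(toNTriples(t))
  }
}
```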

The dump extraction is another module that allows users to configure and run an extraction from a Wikipedia dump file. That's what we use to make a DBpedia release.
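Conceptually, the dump extraction streams over the pages of a dump file. The sketch below only prints page titles with the JDK's StAX parser and assumes an already-decompressed dump; the real module has its own parser, configuration, and extractors.

```scala
import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object DumpTitlesSketch {
  def main(args: Array[String]): Unit = {
    // args(0): path to a decompressed dump, e.g. enwiki-latest-pages-articles.xml
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(args(0)))
    var inTitle = false
    while (reader.hasNext) {
      val event = reader.next()
      if (event == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "title") inTitle = true
      else if (event == XMLStreamConstants.CHARACTERS && inTitle) print(reader.getText)
      else if (event == XMLStreamConstants.END_ELEMENT && reader.getLocalName == "title") {
        inTitle = false
        println()
      }
    }
    reader.close()
  }
}
```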

The live extraction is another module with Scala and Java code, but it is also a continuously running process that extracts data from the latest Wikipedia pages. Its extraction results are available at http://live.dbpedia.org.
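The basic idea can be illustrated with a small polling loop against the MediaWiki recent-changes API. This is only a sketch: the actual live extraction consumes its own Wikipedia update feed and then re-extracts the changed pages.

```scala
import scala.io.Source

object RecentChangesSketch {
  def main(args: Array[String]): Unit = {
    val url = "https://en.wikipedia.org/w/api.php" +
      "?action=query&list=recentchanges&rcprop=title%7Ctimestamp&rclimit=10&format=json"
    while (true) {
      // In a real process we would parse the titles and re-extract those pages.
      println(Source.fromURL(url, "UTF-8").mkString)
      Thread.sleep(30000) // poll every 30 seconds
    }
  }
}
```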

The continuous build system is hosted by the good people at Travis CI. We just provide a configuration script for it. Whenever there is a change in the DBpedia code on GitHub, Travis downloads the latest version, builds and tests it. (Well, it should run tests, but we don't really have any.)

And then of course there's Wikipedia, which we use in a variety of ways. The core module downloads settings from the different Wikipedia language editions so we can correctly parse Wikipedia pages. The dump extraction downloads Wikipedia dumps from http://dumps.wikimedia.org/ and extracts RDF data. The live extraction receives live updates about the latest changes on Wikipedia. And finally, the DBpedia server pulls some Wikipedia pages when it runs a sample extraction (see the download sketch below).
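For completeness, fetching a dump is just an HTTP download. The sketch below uses the usual <lang>wiki-latest-pages-articles.xml.bz2 file name on dumps.wikimedia.org, which you may need to adjust for a particular language or release.

```scala
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

object DumpDownloadSketch {
  def main(args: Array[String]): Unit = {
    val lang = "simple" // Simple English Wikipedia: small, good for testing
    val file = s"${lang}wiki-latest-pages-articles.xml.bz2"
    val in = new URL(s"https://dumps.wikimedia.org/${lang}wiki/latest/$file").openStream()
    try Files.copy(in, Paths.get(file), StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}
```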
