Sisterproject extractor #743
base: master
Conversation
A question for my reviewers:

That said, I'm not sure what the relation you were trying to describe actually is. Perhaps you can describe it in English? That will help your readers guide you to an existing predicate that describes that relation, or if necessary, suggest how to handle the lack of such.

Thank you @TallTed for your feedback.

Another solution could also be to use skos:related.
Kudos, SonarCloud Quality Gate passed!
Review comment on:

    * Extracted information is represented using properties in the http://xx.dbpedia.org/property/
    * namespace (where xx is the language code).

Suggested change:

    * Extracted information is represented using properties in the
    * <http://xx.dbpedia.org/property/> namespace (where xx is the language code).
Review comment on:

    * namespace (where xx is the language code).
    * The names of the these properties directly reflect the name of the Wikipedia infobox property.
    * Property names are not cleaned or merged.
    * Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.

Suggested change:

    - * Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
    + * Property types are not part of a subsumption hierarchy, and there is no consistent ontology for the infobox dataset.
@coderabbitai full review

✅ Actions performed: Full review triggered.
Walkthrough

Adds a new public dataset "sisterprojects_page_links", exposes it via DBpediaDatasets as SisterProjectLink, and introduces a SisterProjectExtractor that emits owl:sameAs quads for sister project links extracted from infobox templates.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant WikiPage as PageNode
participant Extractor as SisterProjectExtractor
participant Parsers as Link/Object/String Parsers
participant Datasets as DBpediaDatasets
participant Output as Quad Sink
WikiPage->>Extractor: extract(node, subjectUri)
Extractor->>Extractor: Validate namespace/redirects/ignore lists
Extractor->>Extractor: Iterate infobox templates & properties
Extractor->>Extractor: Normalize key → property URI (camelCase, cleanup)
Extractor->>Parsers: Try link/object parsing for value
alt Link parsed
Parsers-->>Extractor: Target IRI
else Fallback
Parsers-->>Extractor: Language-tagged literal
end
Extractor->>Extractor: Map property to sister project URI (abbr/project maps)
Extractor->>Datasets: Select SisterProjectLink
Extractor->>Output: Emit Quad(subject, owl:sameAs, target, dataset)
note over Extractor,Output: On errors (e.g., invalid URI), skip quad
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 6
🧹 Nitpick comments (3)
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala (3)
162-165: Use proper logging instead of println.

The exception handler on line 164 uses println to output errors, which won't appear in standard application logs and makes debugging difficult in production. Replace with proper logging:

    catch {
    - case ex : IllegalArgumentException => println(ex)
    + case ex : IllegalArgumentException =>
    +   // Log the error with context
    +   context.logger.warn(s"Failed to create quad for property ${property.key} on page ${node.title}: ${ex.getMessage}")
    }

Note: Verify that context has a logger available, or use an appropriate logging mechanism for this codebase.
175-184: Refactor to avoid explicit return statements.

The use of return on line 179 is not idiomatic Scala. The method can be written more functionally without explicit returns.

    private def extractValue(node: PropertyNode): List[ParseResult[String]] = {
    -   extractLinks(node) match {
    -     case links if links.nonEmpty => {
    -       return links
    -     }
    -     case _ =>
    -   }
    -   StringParser.parse(node).map(value => ParseResult(value.value, None, Some(rdfLangStrDt))).toList
    +   val links = extractLinks(node)
    +   if (links.nonEmpty) links
    +   else StringParser.parse(node).map(value => ParseResult(value.value, None, Some(rdfLangStrDt))).toList
    }
80-116: Consider externalizing configuration.

The regex patterns (lines 80-85), abbreviation mappings (lines 94-104), and project URL mappings (lines 105-116) are hardcoded. This makes it difficult to add support for new languages or sister projects without code changes.

Consider moving these configurations to:

- a separate configuration file (JSON/YAML)
- the existing InfoboxExtractorConfig or a new SisterProjectExtractorConfig
- the mappings wiki for easier community maintenance

This would allow language-specific configurations to be added without recompiling the extractor.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- core/src/main/resources/datasetdefinitions.json (1 hunks)
- core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1 hunks)
- core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala (1 hunks)
🔇 Additional comments (2)
core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1)
233-233: LGTM! The new dataset alias follows the established pattern for similar link datasets (e.g., CommonsLink on line 232) and correctly references the dataset defined in datasetdefinitions.json.
465-470: LGTM! The new dataset entry follows the established structure for link datasets and includes all necessary metadata (name, traits, description, and default graph configuration).
    /**
     * This extractor extracts all properties from all infoboxes.
     * Extracted information is represented using properties in the http://xx.dbpedia.org/property/
     * namespace (where xx is the language code).
     * The names of the these properties directly reflect the name of the Wikipedia infobox property.
     * Property names are not cleaned or merged.
     * Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
     * The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like “June 2009” to the XML Schema format “2009–06”.
     * You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipeda properties and you are prepared to accept relatively noisy data.
     */
Update class documentation to reflect the extractor's actual purpose.
The documentation appears to be copied from InfoboxExtractor and describes infobox property extraction rather than sister project link extraction. It should be updated to explain that this extractor retrieves links to sister projects (Commons, Wiktionary, Wikiquote, etc.) from infobox templates.
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 20 to 29, the class Scaladoc currently describes infobox property
extraction (copied from InfoboxExtractor) but this extractor actually extracts
links to sister projects (Commons, Wiktionary, Wikiquote, etc.) from infobox
templates; update the class documentation to accurately describe that the
extractor locates and converts sister-project interwiki/infobox links into
DBpedia properties (naming the target sister projects, expected input
templates/fields, output property namespace, and any value-normalization
performed), replacing the erroneous infobox-focused text with a concise
explanation of purpose and behavior.
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // Parsers
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    private val sameAsProperty = context.ontology.properties("owl:sameAs")
Replace owl:sameAs with an appropriate predicate.
As discussed in the PR comments, owl:sameAs is semantically incorrect for sister project links because it denotes strict co-reference (subject and object identify the same entity). A Wiktionary page is not the same entity as a Wikipedia article.
Consider using skos:related or creating dedicated properties for each sister project type (e.g., dbo:wiktionaryPage, dbo:commonsPage), as suggested by datalogism in the PR discussion.
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around line 76, the code currently uses
context.ontology.properties("owl:sameAs") which is semantically incorrect for
sister project links; replace this with an appropriate predicate such as
"skos:related" or, preferably, add and use dedicated properties per project
(e.g., "dbo:wiktionaryPage", "dbo:commonsPage") in the ontology and then
reference them via context.ontology.properties("<predicate>"); update the
variable name to reflect the new predicate (e.g., relatedProperty or
wiktionaryProperty), change any downstream uses to emit the new predicate, and
add/update tests/fixtures and ontology configuration to include the chosen
predicates.
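To make the suggested change concrete, here is a minimal sketch of predicate selection per sister project. Note that the dbo:* property IRIs below are hypothetical (they would need to be added to the DBpedia ontology first), and the skos:related fallback follows the alternative raised in the PR discussion; this is not the merged implementation.

```scala
// Sketch: choose a per-project predicate instead of owl:sameAs.
// The dbo:* IRIs are hypothetical placeholders, not existing ontology properties.
object SisterProjectPredicates {
  private val predicateByProject: Map[String, String] = Map(
    "commons"    -> "http://dbpedia.org/ontology/commonsPage",
    "wiktionary" -> "http://dbpedia.org/ontology/wiktionaryPage",
    "wikiquote"  -> "http://dbpedia.org/ontology/wikiquotePage"
  )

  // Projects without a dedicated property fall back to skos:related,
  // which asserts an associative link rather than strict identity.
  private val fallback = "http://www.w3.org/2004/02/skos/core#related"

  def predicateFor(project: String): String =
    predicateByProject.getOrElse(project, fallback)
}
```

A call site would then emit `new Quad(language, dataset, subjectUri, SisterProjectPredicates.predicateFor(key), value, sourceIri, null)` in place of the single sameAsProperty.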
    private val regexMap = Map(
      "fr"->List("autres projet.*".r),
      "en"->List("sister project.*".r),
      "de"->List("schwesterprojekte.*".r)
    )

    private val currentRegexList = regexMap(wikiCode)
Guard against unsupported language codes.

Line 88 accesses regexMap(wikiCode) directly, which will throw NoSuchElementException if the language code is not in the map. Since only three languages are currently supported (fr, en, de), this extractor will fail for all other languages.

Apply this diff to fail gracefully:

    - private val currentRegexList = regexMap(wikiCode)
    + private val currentRegexList = regexMap.getOrElse(wikiCode, {
    +   throw new IllegalArgumentException(s"SisterProjectExtractor does not support language code: $wikiCode. Supported languages: ${regexMap.keys.mkString(", ")}")
    + })

Alternatively, return an empty list to skip processing for unsupported languages:

    - private val currentRegexList = regexMap(wikiCode)
    + private val currentRegexList = regexMap.getOrElse(wikiCode, List.empty)

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 80 to 88, the code directly indexes regexMap with wikiCode which
throws NoSuchElementException for unsupported languages; update it to guard
lookup (use Map.get or contains) and either default to an empty List when
wikiCode is not present or handle the missing case by logging/warning and
skipping processing, ensuring currentRegexList is assigned via a safe getOrElse
so the extractor fails gracefully for unsupported languages.
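The skip-on-unsupported-language variant can be sketched as a small self-contained object (the object and method names here are illustrative, not taken from the extractor):

```scala
import scala.util.matching.Regex

// Sketch: safe lookup of per-language template-name regexes.
// An unknown wiki code yields an empty list, so downstream iteration
// simply matches nothing instead of throwing NoSuchElementException.
object SisterProjectConfig {
  private val regexMap: Map[String, List[Regex]] = Map(
    "fr" -> List("autres projet.*".r),
    "en" -> List("sister project.*".r),
    "de" -> List("schwesterprojekte.*".r)
  )

  def regexesFor(wikiCode: String): List[Regex] =
    regexMap.getOrElse(wikiCode, List.empty)
}
```

With this shape, enabling the extractor for an unconfigured language is harmless: it produces no quads rather than crashing the extraction run.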
    //USELESS ?
    private val linkParser = new LinkParser(true)
Remove outdated comment.
The comment "USELESS ?" is incorrect since linkParser is used in the extractLinks method at line 197.
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 91 to 92, remove the outdated "USELESS ?" comment that incorrectly
claims linkParser is useless—linkParser is used by extractLinks at line 197;
delete that comment (or replace it with a short accurate comment describing
linkParser’s purpose) so the code comment no longer misleads.
    private val mapProjects = Map(
      "commons" -> "http://commons.dbpedia.org/resource/",
      "wiktionary" -> "https://wiktionary.org/wiki/",
      "wikinews" -> "https://wikinews.org/wiki/",
      "wikivoyage" -> "https://wikivoyage.org/wiki/",
      "wikiquote" -> "https://wikiquote.org/wiki/",
      "wikisource" -> "https://wikisource.org/wiki/",
      "wikibooks" -> "https://wikibooks.org/wiki/",
      "wikiversity" -> "https://wikibooks.org/wiki/",
      "species" -> "https://species.wikimedia.org/wiki/"
    )
Fix incorrect URL for Wikiversity.
Line 113 maps "wikiversity" to the Wikibooks URL (https://wikibooks.org/wiki/) instead of the correct Wikiversity URL.
Apply this diff:
"wikisource" -> "https://wikisource.org/wiki/",
"wikibooks" -> "https://wikibooks.org/wiki/",
- "wikiversity" -> "https://wikibooks.org/wiki/",
+ "wikiversity" -> "https://wikiversity.org/wiki/",
"species" -> "https://species.wikimedia.org/wiki/"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    private val mapProjects = Map(
      "commons" -> "http://commons.dbpedia.org/resource/",
      "wiktionary" -> "https://wiktionary.org/wiki/",
      "wikinews" -> "https://wikinews.org/wiki/",
      "wikivoyage" -> "https://wikivoyage.org/wiki/",
      "wikiquote" -> "https://wikiquote.org/wiki/",
      "wikisource" -> "https://wikisource.org/wiki/",
      "wikibooks" -> "https://wikibooks.org/wiki/",
      "wikiversity" -> "https://wikiversity.org/wiki/",
      "species" -> "https://species.wikimedia.org/wiki/"
    )
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 105 to 116, the mapProjects entry for "wikiversity" incorrectly
points to the Wikibooks URL; update the value for the "wikiversity" key to the
correct Wikiversity base URL "https://wikiversity.org/wiki/". Ensure only that
map entry is changed and the rest of the Map remains intact.
    if (mapProjects.contains(property.key)) {
      val value = mapProjects(property.key).replace(property.key, language.wikiCode + "." + property.key) + WikiUtil.wikiEncode(pr.value)
      quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
    }
    if (mapAbrev.contains(property.key)) {
      val keyProj=mapAbrev(property.key)
      val value = mapProjects(keyProj).replace(keyProj, language.wikiCode + "." + keyProj) + WikiUtil.wikiEncode(pr.value)
      quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
    }
Verify and fix URL construction logic.

The URL construction logic appears problematic:

- Line 153: mapProjects(property.key).replace(property.key, language.wikiCode + "." + property.key)
  - For the "commons" property with URL "http://commons.dbpedia.org/resource/", replacing "commons" with "en.commons" would produce "http://en.commons.dbpedia.org/resource/", which may not be the intended URL structure.
  - For "wiktionary" with URL "https://wiktionary.org/wiki/", the same replace would produce "https://en.wiktionary.org/wiki/", which seems correct but is fragile.
- The logic assumes the property key appears exactly once in the URL string at the right position, which is fragile and error-prone.

Consider building the URLs explicitly rather than using string replacement:
if (mapProjects.contains(property.key)) {
val baseProject = property.key
val value = baseProject match {
case "commons" => s"http://commons.dbpedia.org/resource/${WikiUtil.wikiEncode(pr.value)}"
case _ => s"https://${language.wikiCode}.${baseProject}.org/wiki/${WikiUtil.wikiEncode(pr.value)}"
}
quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}🤖 Prompt for AI Agents
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 152 to 160: the current URL construction uses string.replace on
mapProjects(...) which is fragile and can produce incorrect hostnames (e.g.
replacing "commons" produces "en.commons.dbpedia.org"); instead construct the
target URL explicitly based on the project key and language: treat commons (and
any other special-case projects) with their correct fixed base (e.g.
"http://commons.dbpedia.org/resource/") and for standard wiki projects build
"https://{language.wikiCode}.{project}.org/wiki/{WikiUtil.wikiEncode(pr.value)}";
apply the same explicit-construction logic to the mapAbrev branch (use the
mapped keyProj then build the URL), remove the string.replace usage, and ensure
WikiUtil.wikiEncode(pr.value) is appended to the constructed base.
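The explicit construction suggested above can be sketched as a standalone helper. This is an illustration, not the extractor's code: wikiEncode is stubbed with a simple space-to-underscore substitution in place of WikiUtil.wikiEncode, and the "species" case is added because Wikispecies has no per-language subdomain.

```scala
// Sketch: build sister-project target URLs explicitly per project key,
// instead of string-replacing the key inside a base-URL template.
object SisterProjectUrls {
  // Stand-in for WikiUtil.wikiEncode (illustration only).
  private def wikiEncode(title: String): String = title.replace(' ', '_')

  def targetUrl(project: String, wikiCode: String, title: String): String =
    project match {
      // Commons links point at the DBpedia Commons dataset, not the wiki.
      case "commons" => s"http://commons.dbpedia.org/resource/${wikiEncode(title)}"
      // Wikispecies is a single multilingual wiki without language subdomains.
      case "species" => s"https://species.wikimedia.org/wiki/${wikiEncode(title)}"
      // Standard sister wikis use a language-code subdomain.
      case other     => s"https://$wikiCode.$other.org/wiki/${wikiEncode(title)}"
    }
}
```

Because each project gets its own explicit pattern, adding a new special case is a one-line change and cannot corrupt the hostnames of the other projects.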
Review comment on:

    * This extractor extracts all properties from all infoboxes.
    * Extracted information is represented using properties in the http://xx.dbpedia.org/property/
    * namespace (where xx is the language code).
    * The names of the these properties directly reflect the name of the Wikipedia infobox property.

Suggested change:

    - * The names of the these properties directly reflect the name of the Wikipedia infobox property.
    + * The names of the these properties directly reflect the names of the Wikipedia infobox properties.
Review comment on:

    private val sameAsProperty = context.ontology.properties("owl:sameAs")

    /// NEED TO BE EXTENDED TO OTHERS LANGUAGES

Suggested change:

    - /// NEED TO BE EXTENDED TO OTHERS LANGUAGES
    + /// NEED TO BE EXTENDED TO OTHER LANGUAGES








Following the @jlareck template in #719, I finished developing a multilingual sister project extractor (it needs to be configured for each new language).