Conversation

@datalogism
Member

@datalogism datalogism commented Nov 23, 2022

Following @jlareck's template in
#719
I finished developing a multilingual sister project extractor (it needs to be configured for each new language).

Summary by CodeRabbit

  • New Features
    • Introduced a new “Sister Projects Links” dataset, published as Linked Data and available alongside existing page link datasets.
    • Automatically extracts sister project links from infoboxes (with language-aware filtering and normalization) and publishes owl:sameAs connections to sister project resources.
    • Expands cross-project linkage coverage for each language-locale DBpedia instance, improving navigability and data integration across related Wikimedia projects.

@datalogism
Member Author

A question for my reviewers:
This extractor creates triples with the owl:sameAs relation. Using this relation may have been a mistake, because a Wiktionary page is not really equivalent to a Wikipedia article... What do you think? Should I create a new relation in the DBpedia ontology for each sister project?

@TallTed
Contributor

TallTed commented Nov 23, 2022

owl:sameAs is definitely the wrong predicate, as it describes co-reference — i.e., owl:sameAs says that the subject and object URIs identify the same entity — far beyond the relation you appear to have been trying to describe.

That said, I'm not sure what the relation you were trying to describe actually is. Perhaps you can describe it in English? That will help your readers guide you to an existing predicate that describes that relation, or if necessary, suggest how to handle the lack of such.

@datalogism
Member Author

Thank you @TallTed for your feedback,
you're totally right, a sameAs relationship is not the best way to describe it.
The aim of this extractor is to retrieve the sister project links related to a given Wikipedia article, for example Commons, Wiktionary, Wikiquote...
A good practice could be to follow the example of:

@datalogism
Member Author

Another solution could be to use skos:related.
@jlareck @Vehnem @kurzum, what is your opinion about that ?
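
For illustration, a minimal sketch of the dedicated-property alternative discussed above — the dbo:* property names here are hypothetical and would need to be added to the DBpedia ontology before use, with skos:related as a fallback:

```scala
// Hypothetical per-project predicates; these IRIs are illustrative only
// and do not yet exist in the DBpedia ontology.
object SisterProjectProperties {
  private val dbo = "http://dbpedia.org/ontology/"

  val byProject: Map[String, String] = Map(
    "commons"    -> (dbo + "commonsPage"),
    "wiktionary" -> (dbo + "wiktionaryPage"),
    "wikiquote"  -> (dbo + "wikiquotePage")
  )

  // Fall back to skos:related when no dedicated predicate exists.
  val fallback = "http://www.w3.org/2004/02/skos/core#related"

  def predicateFor(project: String): String =
    byProject.getOrElse(project, fallback)
}
```

This keeps the semantics honest: a dedicated property says "this article has an associated page on project X" rather than asserting identity.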

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 6 Code Smells

No Coverage information
0.0% Duplication

Comment on lines +22 to +23
* Extracted information is represented using properties in the http://xx.dbpedia.org/property/
* namespace (where xx is the language code).
Contributor

Suggested change
* Extracted information is represented using properties in the http://xx.dbpedia.org/property/
* namespace (where xx is the language code).
* Extracted information is represented using properties in the
* <http://xx.dbpedia.org/property/> namespace (where xx is the language code).

* namespace (where xx is the language code).
* The names of the these properties directly reflect the name of the Wikipedia infobox property.
* Property names are not cleaned or merged.
* Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
Contributor

Suggested change
* Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
* Property types are not part of a subsumption hierarchy, and there is no consistent ontology for the infobox dataset.

@JJ-Author
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Oct 9, 2025

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai bot commented Oct 9, 2025

Walkthrough

Adds a new public dataset “sisterprojects_page_links”, exposes it via DBpediaDatasets as SisterProjectLink, and introduces a SisterProjectExtractor that emits owl:sameAs quads for sister project links extracted from infobox templates.

Changes

Cohort / File(s) Summary
Dataset declaration (links)
core/src/main/resources/datasetdefinitions.json
Adds sisterprojects_page_links with name, traits, description, and defaultgraph under the links section.
Dataset alias export
core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala
Adds val SisterProjectLink = datasets("sisterprojects_page_links") for public access to the new dataset.
Extractor implementation
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
New extractor: parses infobox templates, filters properties, maps to sister project URIs, parses values (links/strings), and emits owl:sameAs quads to DBpediaDatasets.SisterProjectLink.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant WikiPage as PageNode
  participant Extractor as SisterProjectExtractor
  participant Parsers as Link/Object/String Parsers
  participant Datasets as DBpediaDatasets
  participant Output as Quad Sink

  WikiPage->>Extractor: extract(node, subjectUri)
  Extractor->>Extractor: Validate namespace/redirects/ignore lists
  Extractor->>Extractor: Iterate infobox templates & properties
  Extractor->>Extractor: Normalize key → property URI (camelCase, cleanup)
  Extractor->>Parsers: Try link/object parsing for value
  alt Link parsed
    Parsers-->>Extractor: Target IRI
  else Fallback
    Parsers-->>Extractor: Language-tagged literal
  end
  Extractor->>Extractor: Map property to sister project URI (abbr/project maps)
  Extractor->>Datasets: Select SisterProjectLink
  Extractor->>Output: Emit Quad(subject, owl:sameAs, target, dataset)
  note over Extractor,Output: On errors (e.g., invalid URI), skip quad
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my ears at sister ties,
Across the wikis’ starry skies—
A hop, a link, an owl’s word,
sameAs echoes, neatly heard.
New trails mapped with gentle clicks,
DBpedia’s burrow grows—so slick!
Carrots cached in page-link sticks.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title directly names the main feature introduced, the sister project extractor, succinctly conveying the purpose of the changeset. It avoids vague or generic language and remains concise. This phrasing is clear enough for teammates to understand the primary addition at a glance.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (3)
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala (3)

162-165: Use proper logging instead of println.

The exception handler on line 164 uses println to output errors, which won't appear in standard application logs and makes debugging difficult in production.

Replace with proper logging:

             catch
             {
-              case ex : IllegalArgumentException => println(ex)
+              case ex : IllegalArgumentException => 
+                // Log the error with context
+                context.logger.warn(s"Failed to create quad for property ${property.key} on page ${node.title}: ${ex.getMessage}")
             }

Note: Verify that context has a logger available, or use an appropriate logging mechanism for this codebase.


175-184: Refactor to avoid explicit return statements.

The use of return on line 179 is not idiomatic Scala. The method can be written more functionally without explicit returns.

   private def extractValue(node: PropertyNode): List[ParseResult[String]] = {
-
-    extractLinks(node) match {
-      case links if links.nonEmpty => {
-        return links
-      }
-      case _ =>
-    }
-    StringParser.parse(node).map(value => ParseResult(value.value, None, Some(rdfLangStrDt))).toList
+    val links = extractLinks(node)
+    if (links.nonEmpty) links
+    else StringParser.parse(node).map(value => ParseResult(value.value, None, Some(rdfLangStrDt))).toList
   }

80-116: Consider externalizing configuration.

The regex patterns (lines 80-85), abbreviation mappings (lines 94-104), and project URL mappings (lines 105-116) are hardcoded. This makes it difficult to add support for new languages or sister projects without code changes.

Consider moving these configurations to:

  1. A separate configuration file (JSON/YAML)
  2. The existing InfoboxExtractorConfig or a new SisterProjectExtractorConfig
  3. The mappings wiki for easier community maintenance

This would allow language-specific configurations to be added without recompiling the extractor.
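
One possible shape for such an externalized configuration, sketched with a plain line-based key-value format rather than the codebase's actual config loader (the SisterProjectExtractorConfig object and the file format are assumptions for illustration):

```scala
import scala.util.matching.Regex

// Hypothetical externalized configuration: one line per language,
// "lang=pattern1,pattern2", so new languages need no recompilation.
object SisterProjectExtractorConfig {
  def parseRegexConfig(config: String): Map[String, List[Regex]] =
    config.split("\n").iterator
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val Array(lang, patterns) = line.split("=", 2)
        // Compile each comma-separated pattern into a Regex.
        lang.trim -> patterns.split(",").toList.map(_.trim.r)
      }
      .toMap
}
```

With a config file holding `fr=autres projet.*`, `en=sister project.*`, `de=schwesterprojekte.*`, the extractor could look up `currentRegexList` via `getOrElse(wikiCode, List.empty)` and skip unsupported languages gracefully.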

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb0e463 and 73a0992.

📒 Files selected for processing (3)
  • core/src/main/resources/datasetdefinitions.json (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala (1 hunks)
🔇 Additional comments (2)
core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1)

233-233: LGTM!

The new dataset alias follows the established pattern for similar link datasets (e.g., CommonsLink on line 232) and correctly references the dataset defined in datasetdefinitions.json.

core/src/main/resources/datasetdefinitions.json (1)

465-470: LGTM!

The new dataset entry follows the established structure for link datasets and includes all necessary metadata (name, traits, description, and default graph configuration).

Comment on lines +20 to +29
/**
* This extractor extracts all properties from all infoboxes.
* Extracted information is represented using properties in the http://xx.dbpedia.org/property/
* namespace (where xx is the language code).
* The names of the these properties directly reflect the name of the Wikipedia infobox property.
* Property names are not cleaned or merged.
* Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
* The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like “June 2009” to the XML Schema format “2009–06”.
* You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipeda properties and you are prepared to accept relatively noisy data.
*/

⚠️ Potential issue | 🟡 Minor

Update class documentation to reflect the extractor's actual purpose.

The documentation appears to be copied from InfoboxExtractor and describes infobox property extraction rather than sister project link extraction. It should be updated to explain that this extractor retrieves links to sister projects (Commons, Wiktionary, Wikiquote, etc.) from infobox templates.

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 20 to 29, the class Scaladoc currently describes infobox property
extraction (copied from InfoboxExtractor) but this extractor actually extracts
links to sister projects (Commons, Wiktionary, Wikiquote, etc.) from infobox
templates; update the class documentation to accurately describe that the
extractor locates and converts sister-project interwiki/infobox links into
DBpedia properties (naming the target sister projects, expected input
templates/fields, output property namespace, and any value-normalization
performed), replacing the erroneous infobox-focused text with a concise
explanation of purpose and behavior.

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Parsers
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
private val sameAsProperty = context.ontology.properties("owl:sameAs")

⚠️ Potential issue | 🔴 Critical

Replace owl:sameAs with an appropriate predicate.

As discussed in the PR comments, owl:sameAs is semantically incorrect for sister project links because it denotes strict co-reference (subject and object identify the same entity). A Wiktionary page is not the same entity as a Wikipedia article.

Consider using skos:related or creating dedicated properties for each sister project type (e.g., dbo:wiktionaryPage, dbo:commonsPage), as suggested by datalogism in the PR discussion.

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around line 76, the code currently uses
context.ontology.properties("owl:sameAs") which is semantically incorrect for
sister project links; replace this with an appropriate predicate such as
"skos:related" or, preferably, add and use dedicated properties per project
(e.g., "dbo:wiktionaryPage", "dbo:commonsPage") in the ontology and then
reference them via context.ontology.properties("<predicate>"); update the
variable name to reflect the new predicate (e.g., relatedProperty or
wiktionaryProperty), change any downstream uses to emit the new predicate, and
add/update tests/fixtures and ontology configuration to include the chosen
predicates.

Comment on lines +80 to +88
private val regexMap = Map(
"fr"->List("autres projet.*".r),
"en"->List("sister project.*".r),
"de"->List("schwesterprojekte.*".r)

)


private val currentRegexList = regexMap(wikiCode)

⚠️ Potential issue | 🟠 Major

Guard against unsupported language codes.

Line 88 accesses regexMap(wikiCode) directly, which will throw NoSuchElementException if the language code is not in the map. Since only three languages are currently supported (fr, en, de), this extractor will fail for all other languages.

Apply this diff to fail gracefully:

-  private val currentRegexList = regexMap(wikiCode)
+  private val currentRegexList = regexMap.getOrElse(wikiCode, {
+    throw new IllegalArgumentException(s"SisterProjectExtractor does not support language code: $wikiCode. Supported languages: ${regexMap.keys.mkString(", ")}")
+  })

Alternatively, return an empty list to skip processing for unsupported languages:

-  private val currentRegexList = regexMap(wikiCode)
+  private val currentRegexList = regexMap.getOrElse(wikiCode, List.empty)
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 80 to 88, the code directly indexes regexMap with wikiCode which
throws NoSuchElementException for unsupported languages; update it to guard
lookup (use Map.get or contains) and either default to an empty List when
wikiCode is not present or handle the missing case by logging/warning and
skipping processing, ensuring currentRegexList is assigned via a safe getOrElse
so the extractor fails gracefully for unsupported languages.

Comment on lines +91 to +92
//USELESS ?
private val linkParser = new LinkParser(true)

⚠️ Potential issue | 🟡 Minor

Remove outdated comment.

The comment "USELESS ?" is incorrect since linkParser is used in the extractLinks method at line 197.

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 91 to 92, remove the outdated "USELESS ?" comment that incorrectly
claims linkParser is useless—linkParser is used by extractLinks at line 197;
delete that comment (or replace it with a short accurate comment describing
linkParser’s purpose) so the code comment no longer misleads.

Comment on lines +105 to +116
private val mapProjects = Map(
"commons" -> "http://commons.dbpedia.org/resource/",
"wiktionary" -> "https://wiktionary.org/wiki/",
"wikinews" -> "https://wikinews.org/wiki/",
"wikivoyage" -> "https://wikivoyage.org/wiki/",
"wikiquote" -> "https://wikiquote.org/wiki/",
"wikisource" -> "https://wikisource.org/wiki/",
"wikibooks" -> "https://wikibooks.org/wiki/",
"wikiversity" -> "https://wikibooks.org/wiki/",
"species" -> "https://species.wikimedia.org/wiki/"

)

⚠️ Potential issue | 🔴 Critical

Fix incorrect URL for Wikiversity.

Line 113 maps "wikiversity" to the Wikibooks URL (https://wikibooks.org/wiki/) instead of the correct Wikiversity URL.

Apply this diff:

     "wikisource" -> "https://wikisource.org/wiki/",
     "wikibooks" -> "https://wikibooks.org/wiki/",
-    "wikiversity" -> "https://wikibooks.org/wiki/",
+    "wikiversity" -> "https://wikiversity.org/wiki/",
     "species" -> "https://species.wikimedia.org/wiki/"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private val mapProjects = Map(
"commons" -> "http://commons.dbpedia.org/resource/",
"wiktionary" -> "https://wiktionary.org/wiki/",
"wikinews" -> "https://wikinews.org/wiki/",
"wikivoyage" -> "https://wikivoyage.org/wiki/",
"wikiquote" -> "https://wikiquote.org/wiki/",
"wikisource" -> "https://wikisource.org/wiki/",
"wikibooks" -> "https://wikibooks.org/wiki/",
"wikiversity" -> "https://wikibooks.org/wiki/",
"species" -> "https://species.wikimedia.org/wiki/"
)
private val mapProjects = Map(
"commons" -> "http://commons.dbpedia.org/resource/",
"wiktionary" -> "https://wiktionary.org/wiki/",
"wikinews" -> "https://wikinews.org/wiki/",
"wikivoyage" -> "https://wikivoyage.org/wiki/",
"wikiquote" -> "https://wikiquote.org/wiki/",
"wikisource" -> "https://wikisource.org/wiki/",
"wikibooks" -> "https://wikibooks.org/wiki/",
"wikiversity" -> "https://wikiversity.org/wiki/",
"species" -> "https://species.wikimedia.org/wiki/"
)
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 105 to 116, the mapProjects entry for "wikiversity" incorrectly
points to the Wikibooks URL; update the value for the "wikiversity" key to the
correct Wikiversity base URL "https://wikiversity.org/wiki/". Ensure only that
map entry is changed and the rest of the Map remains intact.

Comment on lines +152 to +160
if (mapProjects.contains(property.key)) {
val value = mapProjects(property.key).replace(property.key, language.wikiCode + "." + property.key) + WikiUtil.wikiEncode(pr.value)
quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}
if (mapAbrev.contains(property.key)) {
val keyProj=mapAbrev(property.key)
val value = mapProjects(keyProj).replace(keyProj, language.wikiCode + "." + keyProj) + WikiUtil.wikiEncode(pr.value)
quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}

⚠️ Potential issue | 🟠 Major

Verify and fix URL construction logic.

The URL construction logic appears problematic:

  1. Line 153: mapProjects(property.key).replace(property.key, language.wikiCode + "." + property.key)

    • For "commons" property with URL "http://commons.dbpedia.org/resource/", replacing "commons" with "en.commons" would produce "http://en.commons.dbpedia.org/resource/", which may not be the intended URL structure.
    • For "wiktionary" with URL "https://wiktionary.org/wiki/", the same replace would produce "https://en.wiktionary.org/wiki/", which seems correct but is fragile.
  2. The logic assumes the property key appears exactly once in the URL string at the right position, which is fragile and error-prone.

Consider building the URLs explicitly rather than using string replacement:

if (mapProjects.contains(property.key)) {
  val baseProject = property.key
  val value = baseProject match {
    case "commons" => s"http://commons.dbpedia.org/resource/${WikiUtil.wikiEncode(pr.value)}"
    case _ => s"https://${language.wikiCode}.${baseProject}.org/wiki/${WikiUtil.wikiEncode(pr.value)}"
  }
  quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}
🤖 Prompt for AI Agents
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 152 to 160: the current URL construction uses string.replace on
mapProjects(...) which is fragile and can produce incorrect hostnames (e.g.
replacing "commons" produces "en.commons.dbpedia.org"); instead construct the
target URL explicitly based on the project key and language: treat commons (and
any other special-case projects) with their correct fixed base (e.g.
"http://commons.dbpedia.org/resource/") and for standard wiki projects build
"https://{language.wikiCode}.{project}.org/wiki/{WikiUtil.wikiEncode(pr.value)}";
apply the same explicit-construction logic to the mapAbrev branch (use the
mapped keyProj then build the URL), remove the string.replace usage, and ensure
WikiUtil.wikiEncode(pr.value) is appended to the constructed base.

* This extractor extracts all properties from all infoboxes.
* Extracted information is represented using properties in the http://xx.dbpedia.org/property/
* namespace (where xx is the language code).
* The names of the these properties directly reflect the name of the Wikipedia infobox property.
Contributor

Suggested change
* The names of the these properties directly reflect the name of the Wikipedia infobox property.
* The names of the these properties directly reflect the names of the Wikipedia infobox properties.

private val sameAsProperty = context.ontology.properties("owl:sameAs")


/// NEED TO BE EXTENDED TO OTHERS LANGUAGES
Contributor

Suggested change
/// NEED TO BE EXTENDED TO OTHERS LANGUAGES
/// NEED TO BE EXTENDED TO OTHER LANGUAGES
