Add Data Quality Monitoring for Extractors #808
base: master
📝 Walkthrough

Adds a centralized Data Quality Monitoring system, integrates it into HomepageExtractor, exposes metrics/errors via a new /quality REST resource and dashboard, adds tests and documentation, and updates several Eclipse .project files to include Maven builders/natures.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant Server as REST Server\n(DataQuality resource)
    participant Monitor as DataQualityMonitor (global)
    participant Extractor as HomepageExtractor
    participant Pipeline as Extraction Pipeline

    Pipeline->>Extractor: process page
    alt page skipped (non-main ns)
        Extractor->>Monitor: logSkipped(pageTitle, reason)
        Monitor->>Monitor: record skip (no error count)
    else success (clean URI)
        Extractor->>Monitor: logSuccess(pageTitle, triplesCount)
        Monitor->>Monitor: increment success metric
    else invalid data / exception
        Extractor->>Monitor: logInvalidData(pageTitle, reason, exception?, data?)
        Monitor->>Monitor: increment error count & store capped detail
    end

    User->>Browser: Navigate to /quality/
    Browser->>Server: GET /quality/
    Server->>Monitor: getGlobalMetrics()
    Monitor-->>Server: metrics & totals
    Server->>Server: render dashboard HTML
    Server-->>Browser: HTML dashboard
    Browser->>Server: GET /quality/errors?type=...&limit=...
    Server->>Monitor: getErrorDetails(type, limit)
    Monitor-->>Server: List[ExtractionError]
    Server-->>Browser: HTML error list / CSV (if export)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
✨ Finishing touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 6
🧹 Nitpick comments (5)
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (1)
127-136: Consider consistent page identifiers for monitoring.

The `logSuccess` and `logInvalidData` calls use `subjectUri` (e.g., `http://dbpedia.org/resource/Example`) while `logSkipped` on line 55 uses `page.title.encoded`. This inconsistency may complicate metrics aggregation and dashboard readability. Consider passing the page title consistently, or document that different identifiers are intentional for different use cases.
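To make this nitpick concrete, here is a small hedged sketch of one way to normalize the identifier before every monitoring call. `PageIds` and `pageId` are hypothetical names for illustration only, not part of the PR:

```scala
// Hypothetical helper (not in the PR): derive one canonical identifier from a
// page title, mirroring the underscore encoding of page.title.encoded, so that
// logSuccess, logInvalidData and logSkipped all report the same key.
object PageIds {
  def pageId(title: String): String = title.trim.replace(' ', '_')
}
```

Each monitoring call would then pass `PageIds.pageId(title)` instead of mixing `subjectUri` and `page.title.encoded`.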
documentation/data-quality-monitoring.md (1)
64-66: Add a language identifier to the fenced code block.

The code block should specify a language for proper rendering and linting compliance. Since this is a URL, use `text` or `plaintext`.

````diff
-```
+```text
 http://localhost:9999/server/quality/
````

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (2)

76-82: Extract repeated test literals to constants.

Per SonarCloud analysis, `"Invalid IRI"` and `"TestExtractor:InvalidData"` are duplicated multiple times. Consider extracting these to constants for maintainability.

```diff
+  private val InvalidIriReason = "Invalid IRI"
+  private val TestExtractorInvalidData = "TestExtractor:InvalidData"
+
   it should "retrieve error details" in {
     val monitor = DataQualityMonitor.forExtractor("TestExtractor")
-    monitor.logInvalidData("Einstein", "Invalid IRI", data = Some("malformed://url"))
-    monitor.logInvalidData("Tesla", "Invalid IRI", data = Some("another://bad"))
+    monitor.logInvalidData("Einstein", InvalidIriReason, data = Some("malformed://url"))
+    monitor.logInvalidData("Tesla", InvalidIriReason, data = Some("another://bad"))

-    val details = DataQualityMonitor.getErrorDetails("TestExtractor:InvalidData", limit = 10)
+    val details = DataQualityMonitor.getErrorDetails(TestExtractorInvalidData, limit = 10)
```
97-112: Consider reducing iteration count or marking as slow test.

Logging 11,000 errors in a loop may slow down test execution. Consider reducing to a smaller count (e.g., 100 over the limit) or annotating with a slow test marker if the framework supports it.

```diff
   it should "limit stored error details to prevent memory issues" in {
     val monitor = DataQualityMonitor.forExtractor("TestExtractor")
-    // Try to log more than the limit (10000)
-    for (i <- 1 to 11000) {
+    // Try to log more than the limit (10000) - use 10100 to test boundary
+    for (i <- 1 to 10100) {
       monitor.logInvalidData(s"Page$i", "Error")
     }

     // Should still work without memory issues
     val metrics = monitor.getMetrics()
-    monitor.getTotalErrors() should be(11000)
+    monitor.getTotalErrors() should be(10100)

     // But details should be limited
     val details = DataQualityMonitor.getErrorDetails("TestExtractor:InvalidData", limit = 20000)
     details.size should be <= 10000
   }
```

core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1)
58-65: CSV export doesn't properly escape all special characters.

The current implementation only replaces commas with semicolons but doesn't handle quotes, newlines, or other characters that could break CSV parsing. Consider proper CSV escaping.

```diff
   def exportToCsv(errorType: String, limit: Int = 1000): String = {
     val errors = getErrorDetails(errorType, limit)
     val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
-    val rows = errors.map(e =>
-      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
-    ).mkString("\n")
+    val rows = errors.map(e =>
+      s"${escapeCsv(e.extractorName)},${escapeCsv(e.pageTitle)},${escapeCsv(e.message)},${e.timestamp}"
+    ).mkString("\n")
     header + rows
   }
+
+  private def escapeCsv(value: String): String = {
+    if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
+      "\"" + value.replace("\"", "\"\"") + "\""
+    } else {
+      value
+    }
+  }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (11)
- `.gitignore` (1 hunks)
- `core/.project` (1 hunks)
- `core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala` (4 hunks)
- `core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala` (1 hunks)
- `core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala` (1 hunks)
- `documentation/data-quality-monitoring.md` (1 hunks)
- `dump/.project` (1 hunks)
- `scripts/.project` (1 hunks)
- `server/.project` (1 hunks)
- `server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala` (1 hunks)
- `server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (6)
`DataQualityMonitor` (23-74), `DataQualityMonitor` (79-188), `forExtractor` (34-36), `logSkipped` (112-114), `logSuccess` (119-121), `logInvalidData` (91-107)
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (12)
`DataQualityMonitor` (23-74), `DataQualityMonitor` (79-188), `reset` (70-73), `forExtractor` (34-36), `logInvalidData` (91-107), `getMetrics` (126-131), `getTotalErrors` (136-138), `getGlobalMetrics` (41-43), `getErrorDetails` (48-53), `exportToCsv` (58-65), `logSkipped` (112-114), `logSuccess` (119-121)
🪛 GitHub Check: SonarCloud Code Analysis
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala
[failure] 76-76: Define a constant instead of duplicating this literal "Invalid IRI" 3 times.
[failure] 79-79: Define a constant instead of duplicating this literal "TestExtractor:InvalidData" 3 times.
🪛 markdownlint-cli2 (0.18.1)
documentation/data-quality-monitoring.md
64-64: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
- GitHub Check: build
🔇 Additional comments (14)
.gitignore (1)
10-10: LGTM! Adding `.settings` to gitignore is appropriate alongside the Eclipse `.project` file updates in this PR.

dump/.project (1)

1-36: LGTM! The Maven-enabled Eclipse project descriptor follows the same pattern as the other module `.project` files in this PR.

scripts/.project (1)

1-36: LGTM! Consistent Maven-enabled Eclipse configuration matching other modules.
server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala (1)
28-28: LGTM! The navigation link integrates cleanly with existing links and follows the established pattern.
core/.project (1)
1-35: LGTM! Consistent Maven-enabled Eclipse configuration for the core module.
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (3)
29-30: LGTM! Clean initialization using the factory method. The monitor is appropriately scoped as a private instance field.

54-57: LGTM! Good addition of skip logging with a descriptive reason that includes the namespace.

138-155: LGTM! Good categorization of exceptions with separate handling for `IRISyntaxException` versus other errors. Including both the exception and URL in the log provides useful debugging context.

documentation/data-quality-monitoring.md (1)

1-129: Good documentation coverage. The documentation clearly explains the purpose, usage patterns, API methods, and integration examples. This will help other developers adopt the monitoring system in their extractors.
server/.project (1)
1-36: LGTM! The Eclipse project configuration correctly adds Maven builder support alongside Scala, and the resource filters appropriately exclude common non-source directories.
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
134-153: Good concurrency test coverage. The test properly exercises thread safety with 10 threads logging 100 errors each and verifies the total count. This validates the thread-safe design using `TrieMap` and `AtomicLong`.

server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala (1)

32-111: Well-structured HTML dashboard. The dashboard implementation is clean, with proper grouping by extractor, sorting, and informative empty-state messaging. The use of Bootstrap panels provides a good UX.
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (2)
23-74: Good design for the companion object API. The factory method pattern with `forExtractor`, global metrics access, and CSV export provides a clean public API. Thread-safe data structures (`TrieMap`, `AtomicLong`) are appropriate choices.

79-138: Well-structured per-extractor monitor class. The class provides a clean interface for logging invalid data, skipped pages, and successes. The categorization logic is sensible, and metrics retrieval is properly scoped to the extractor.
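The prefix-scoped metrics idea praised here can be illustrated with a minimal, self-contained sketch. `PrefixMetrics` and its method names are assumptions for illustration, not the PR's actual API; only the general pattern (a shared thread-safe map, filtered by an `"extractor:"` key prefix) comes from the review:

```scala
import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicLong

// Illustrative sketch (not the PR's code): per-extractor metrics isolated by
// key prefix over one shared thread-safe map.
object PrefixMetrics {
  private val counters = TrieMap.empty[String, AtomicLong]

  // Thread-safe increment: getOrElseUpdate on TrieMap plus atomic counter.
  def increment(key: String): Long =
    counters.getOrElseUpdate(key, new AtomicLong(0)).incrementAndGet()

  // Per-extractor view: keep only keys that start with "<extractor>:"
  def metricsFor(extractor: String): Map[String, Long] =
    counters.iterator
      .filter { case (k, _) => k.startsWith(extractor + ":") }
      .map { case (k, v) => k -> v.get() }
      .toMap
}
```

A global view is just the unfiltered map, which is how a companion object can serve both per-extractor and dashboard-wide queries from the same store.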
```scala
  def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
    errorDetails.get(errorType) match {
      case Some(errors) => errors.take(limit).toList
      case None => List.empty
    }
  }
```
Potential thread-safety issue when reading error details.
`errors.take(limit)` operates on a mutable `ListBuffer` that may be concurrently modified by `recordError`. Consider synchronizing the read or returning a snapshot.
```diff
 def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
   errorDetails.get(errorType) match {
-    case Some(errors) => errors.take(limit).toList
+    case Some(errors) => errors.synchronized { errors.take(limit).toList }
     case None => List.empty
   }
 }
```
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 48 to 53, the method getErrorDetails reads from a mutable
ListBuffer without synchronization which can race with concurrent recordError
updates; change the implementation to obtain a thread-safe snapshot by
synchronizing on the ListBuffer (or the map entry) while copying its contents
and then return the copy (e.g., synchronize around errors and call
errors.take(limit).toList or errors.toList.take(limit)), or replace the mutable
ListBuffer with a concurrent/immutable collection and return its immutable
snapshot.
```scala
    DataQualityMonitor.errorDetails
      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
      .synchronized {
        val buffer = DataQualityMonitor.errorDetails(errorType)
        if (buffer.size < 10000) { // Limit storage
          buffer += errorDetail
        }
      }
  }
```
🛠️ Refactor suggestion | 🟠 Major
Simplify synchronization and fix redundant buffer lookup.
The buffer is fetched via `getOrElseUpdate`, then re-fetched inside the synchronized block. Use the buffer returned by `getOrElseUpdate` directly.
```diff
-    DataQualityMonitor.errorDetails
-      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
-      .synchronized {
-        val buffer = DataQualityMonitor.errorDetails(errorType)
-        if (buffer.size < 10000) { // Limit storage
-          buffer += errorDetail
-        }
-      }
+    val buffer = DataQualityMonitor.errorDetails
+      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
+    buffer.synchronized {
+      if (buffer.size < 10000) { // Limit storage
+        buffer += errorDetail
+      }
+    }
```
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 164 to 172, the code calls getOrElseUpdate to obtain a buffer but
then re-reads the map inside the synchronized block; replace that by storing the
result of getOrElseUpdate in a local val (e.g. val buffer =
DataQualityMonitor.errorDetails.getOrElseUpdate(errorType,
collection.mutable.ListBuffer.empty)) and synchronize on that buffer
(buffer.synchronized { if (buffer.size < 10000) buffer += errorDetail }) so you
use the retrieved buffer directly and avoid the redundant map lookup.
```scala
  def getMetricsJson: String = {
    val metrics = DataQualityMonitor.getGlobalMetrics()

    val json = new StringBuilder()
    json.append("{\n")
    json.append(s"""  "totalErrors": ${metrics.values.sum},\n""")
    json.append(s"""  "errorTypes": ${metrics.size},\n""")
    json.append("""  "metrics": {""").append("\n")

    val entries = metrics.toSeq
    entries.zipWithIndex.foreach { case ((errorType, count), idx) =>
      json.append(s"""    "$errorType": $count""")
      if (idx < entries.size - 1) json.append(",")
      json.append("\n")
    }

    json.append("  }\n")
    json.append("}")
    json.toString()
  }
```
Manual JSON construction lacks proper escaping.
If error type keys contain special characters (quotes, backslashes), the generated JSON will be malformed. Consider using a JSON library or at minimum escape the `errorType` string.
```diff
     val entries = metrics.toSeq
     entries.zipWithIndex.foreach { case ((errorType, count), idx) =>
-      json.append(s"""    "$errorType": $count""")
+      val escapedType = errorType.replace("\\", "\\\\").replace("\"", "\\\"")
+      json.append(s"""    "$escapedType": $count""")
       if (idx < entries.size - 1) json.append(",")
       json.append("\n")
     }
```

🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 119-138 the code manually concatenates JSON and does not escape
errorType keys, which can break JSON if keys contain quotes or backslashes;
replace the hand-rolled string building with a proper JSON serialization
approach (e.g., construct a Scala Map/JsObject or POJO representing totalErrors,
errorTypes and metrics and serialize it with your project's JSON library such as
play-json, Jackson, or json4s), or at minimum run each errorType through a JSON
string-escaping routine before inserting into the output; ensure you return the
serialized JSON string from the serializer rather than concatenating pieces.
```scala
  def getErrors(
    @QueryParam("type") errorType: String,
    @QueryParam("limit") limitParam: String
  ): Elem = {
    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
    val errors = DataQualityMonitor.getErrorDetails(errorType, limit)
```
🛠️ Refactor suggestion | 🟠 Major
Add null check for errorType parameter.
If the `type` query parameter is missing, `errorType` will be null, which could cause issues when passed to `getErrorDetails`. Return a user-friendly error or default behavior.
```diff
 def getErrors(
   @QueryParam("type") errorType: String,
   @QueryParam("limit") limitParam: String
 ): Elem = {
+  if (errorType == null || errorType.isEmpty) {
+    return <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
+      {ServerHeader.getHeader("Error", true)}
+      <body>
+        <div class="container">
+          <div class="alert alert-danger">
+            <strong>Missing parameter:</strong> Please specify an error type using ?type=ErrorType
+          </div>
+          <a href="/quality/" class="btn btn-default">← Back to Overview</a>
+        </div>
+      </body>
+    </html>
+  }
   val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 146 to 151, the QueryParam "type" (errorType) can be null when
omitted; add a null/empty check and handle it before calling
DataQualityMonitor.getErrorDetails: either supply a sensible default (e.g. treat
missing type as "all" or empty string) or return a user-friendly error XML
response with a 400 status indicating the missing parameter; ensure you do not
pass a raw null into getErrorDetails and construct the returned Elem
accordingly.
```scala
    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
    val errors = DataQualityMonitor.getErrorDetails(errorType, limit)
```
Handle invalid limit parameter to prevent `NumberFormatException`.

If `limitParam` contains a non-numeric value, `toInt` will throw `NumberFormatException`, returning a 500 error to the client. Add error handling or use `toIntOption`.
```diff
-    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
+    val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)
```
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 150 to 151, the current use of
Option(limitParam).map(_.toInt).getOrElse(50) will throw NumberFormatException
for non-numeric input; change it to safely parse the parameter (e.g., use
Option(limitParam).flatMap(s => s.toIntOption).getOrElse(50) or catch
NumberFormatException) and, if parsing fails, either fall back to the default 50
or return a 400 Bad Request with a clear message; update the surrounding code to
use the parsed Int variable when calling
DataQualityMonitor.getErrorDetails(errorType, limit).
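The safe-parsing pattern recommended in these comments can be distilled into a standalone sketch. `parseLimit` is a hypothetical name used only for illustration; the `Option`/`Try` combination is the same one the suggested diffs use:

```scala
import scala.util.Try

// Parse an optional numeric query parameter, falling back to a default
// instead of throwing NumberFormatException on null or non-numeric input.
def parseLimit(param: String, default: Int = 50): Int =
  Option(param).flatMap(s => Try(s.toInt).toOption).getOrElse(default)
```

The same helper works for both endpoints; only the default (50 for the error list, 1000 for CSV export) differs.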
```scala
      val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
      DataQualityMonitor.exportToCsv(errorType, limit)
```
Same `NumberFormatException` risk in `exportCsv`.
Apply the same fix here for consistent error handling.
```diff
-    val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
+    val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(1000)
```
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 216 to 217, the current conversion
Option(limitParam).map(_.toInt).getOrElse(1000) can throw NumberFormatException
for bad input; change it to a safe parse that falls back to 1000 on invalid
input (e.g. use scala.util.Try(limitParam.toInt).toOption.getOrElse(1000) or
limitParam.toIntOption.getOrElse(1000) depending on Scala version) and
optionally log or ignore the invalid value, then call
DataQualityMonitor.exportToCsv(errorType, limit) with the safely parsed limit.
@JJ-Author can you please review this
Actionable comments posted: 1
♻️ Duplicate comments (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1)
25-30: Thread-safety issue persists in `getErrorDetails`.

The mutable `ListBuffer` may be concurrently modified by `recordError` while this method reads from it. This was flagged in a previous review and remains unaddressed.

```diff
 def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
   errorDetails.get(errorType) match {
-    case Some(errors) => errors.take(limit).toList
+    case Some(errors) => errors.synchronized { errors.take(limit).toList }
     case None => List.empty
   }
 }
```
🧹 Nitpick comments (1)
server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1)
44-59: Consider using JSON parsing for more robust assertions.

String-based assertions like `should include("\"totalErrors\": 5")` are fragile and sensitive to formatting changes. A JSON parser would make tests more resilient. Example with a JSON library:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

val json = parse(dataQuality.getMetricsJson)
(json \ "totalErrors").extract[Int] should be(5)
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala` (1 hunks)
- `core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala` (1 hunks)
- `server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (5)
`DataQualityMonitor` (11-45), `DataQualityMonitor` (47-123), `reset` (41-44), `forExtractor` (17-19), `logInvalidData` (51-63)
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (12)
`DataQualityMonitor` (11-45), `DataQualityMonitor` (47-123), `reset` (41-44), `forExtractor` (17-19), `logInvalidData` (51-63), `getMetrics` (73-78), `getTotalErrors` (80-80), `getGlobalMetrics` (21-23), `getErrorDetails` (25-30), `exportToCsv` (32-39), `logSkipped` (65-67), `logSuccess` (69-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
🔇 Additional comments (13)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (5)
1-6: LGTM: Appropriate concurrency primitives selected. Using `TrieMap` and `AtomicLong` from `java.util.concurrent` provides thread-safe foundations for the concurrent access patterns this monitor will encounter.

47-78: LGTM: Per-extractor monitor implementation is well-structured. The separation between logging methods (`logInvalidData`, `logSkipped`, `logSuccess`) provides clear semantics for different extraction outcomes. The `getMetrics()` filtering by prefix cleanly isolates per-extractor data.

102-106: Synchronization approach differs from past suggestion but is valid. The current implementation synchronizes on the entire `errorDetails` TrieMap rather than on individual buffers as previously suggested. While this is coarse-grained and could cause contention under high concurrency with different error types, it is thread-safe and acceptable for a monitoring utility.

109-122: LGTM: Helper methods are clean and functional. Error categorization logic is sensible, prioritizing exception type when available and falling back to keyword-based categorization. The `truncate` helper safely handles string length.

125-131: LGTM: Clean data model for error details. The `ExtractionError` case class captures the essential error metadata with appropriate use of `Option` for the nullable exception type.

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (4)
1-9: LGTM: Test setup is appropriate. Using `BeforeAndAfter` with `DataQualityMonitor.reset()` ensures test isolation, preventing state leakage between tests.

17-39: LGTM: Error recording and categorization tests are well-structured. Tests validate both exception-based and keyword-based categorization paths with appropriate assertions on metrics.

107-123: LGTM: Concurrency test validates thread-safe counting. The test correctly spawns multiple threads, waits for completion with `join()`, and verifies the expected total count. This validates the `AtomicLong` increment behavior under contention.

125-144: LGTM: Thorough field validation for error details. The test comprehensively verifies that all `ExtractionError` fields are correctly populated, including the optional `exceptionType`.

server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (4)
1-12: LGTM: Server test setup is appropriate. Direct instantiation of `DataQuality` for testing allows isolation from the full server stack while still exercising the resource logic.

14-42: LGTM: HTML rendering tests cover key scenarios. Tests validate the empty state, the populated state with multiple extractors, and proper grouping, covering the main rendering paths.

93-99: Test validates current escaping behavior but not full CSV safety. This test correctly expects commas to be replaced with semicolons as currently implemented. If the CSV escaping in `DataQualityMonitor.exportToCsv` is updated to use RFC 4180 quoting, this test will need to be updated accordingly.

131-149: LGTM: Concurrent access test validates thread safety under mixed read/write load. Interleaving `getMetricsJson` calls during concurrent logging effectively tests the monitor's behavior under realistic concurrent access patterns.
```scala
  def exportToCsv(errorType: String, limit: Int = 1000): String = {
    val errors = getErrorDetails(errorType, limit)
    val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
    val rows = errors.map(e =>
      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
    ).mkString("\n")
    header + rows
  }
```
CSV export is vulnerable to malformed output.
Only `message` is escaped (commas replaced with semicolons), but `extractorName` and `pageTitle` could also contain commas, newlines, or quotes that would corrupt the CSV structure.
Consider proper RFC 4180 CSV escaping:
```diff
 def exportToCsv(errorType: String, limit: Int = 1000): String = {
   val errors = getErrorDetails(errorType, limit)
   val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
+  def escapeCsvField(field: String): String = {
+    if (field.contains(",") || field.contains("\"") || field.contains("\n"))
+      "\"" + field.replace("\"", "\"\"") + "\""
+    else field
+  }
   val rows = errors.map(e =>
-    s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
+    s"${escapeCsvField(e.extractorName)},${escapeCsvField(e.pageTitle)},${escapeCsvField(e.message)},${e.timestamp}"
   ).mkString("\n")
   header + rows
 }
```

🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 32 to 39, the CSV export currently only sanitizes the message field
and so extractorName and pageTitle (and message newlines/quotes) can break CSV
structure; change the implementation to produce RFC4180-compliant CSV by either:
1) using a CSV library (e.g., OpenCSV or Apache Commons CSV) to build rows
safely, or 2) implementing RFC4180 escaping for each field — wrap every field in
double quotes, double any internal double quotes, and convert newlines to \r\n
(or preserve them inside quoted fields) before joining with commas; ensure the
header remains and apply the escaping to extractorName, pageTitle, and message
(and timestamp if necessary).



Adding a monitoring system to track extraction errors and data quality issues. Currently integrated with HomepageExtractor as a proof of concept.
Problem
Many extractors silently discard invalid data without any logging, which makes it impossible to debug extraction issues or measure data quality.
Solution
Created `DataQualityMonitor` - a simple utility to log and track extraction errors.

Usage:
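The original usage snippet did not survive the copy here. The following is a hedged, self-contained sketch of the logging pattern the PR describes, using a minimal stand-in monitor; `MiniQualityMonitor` and its signatures are assumptions for illustration, not the actual `DataQualityMonitor` API:

```scala
import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicLong

// Stand-in for DataQualityMonitor, built on the thread-safe primitives the PR
// mentions (TrieMap + AtomicLong). Real names and signatures may differ.
object MiniQualityMonitor {
  private val counters = TrieMap.empty[String, AtomicLong]

  // Record one invalid-data event under a "<extractor>:InvalidData" key.
  def logInvalidData(extractor: String, pageTitle: String, reason: String): Unit =
    counters.getOrElseUpdate(s"$extractor:InvalidData", new AtomicLong(0)).incrementAndGet()

  def count(key: String): Long = counters.get(key).map(_.get()).getOrElse(0L)
}

// An extractor would log instead of silently dropping invalid data, e.g.:
// MiniQualityMonitor.logInvalidData("HomepageExtractor", title, "Invalid IRI")
```

The dashboard then only needs to read the counter map; no extractor ever blocks on reporting.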
Changes
New files:
- `core/.../util/DataQualityMonitor.scala` - Core monitoring logic
- `core/.../util/DataQualityMonitorTest.scala` - Tests
- `server/.../resources/DataQuality.scala` - Web dashboard at `/quality/`
- `documentation/data-quality-monitoring.md` - Docs

Modified:
- `core/.../mappings/HomepageExtractor.scala` - Added monitoring calls

Web Dashboard
Access at `http://localhost:9999/server/quality/` when running the server.

Shows:
Testing
Summary by CodeRabbit
New Features
Enhancements
Documentation
Tests
Chores