
Conversation


@PritamP20 PritamP20 commented Dec 12, 2025

This PR adds a monitoring system to track extraction errors and data quality issues. It is currently integrated with HomepageExtractor as a proof of concept.

Problem

Many extractors silently discard invalid data without any logging:

// HomepageExtractor.scala line 125 (before)
case _ : IRISyntaxException => Seq()  //   TODO: log

This makes it impossible to debug extraction issues or measure data quality.

Solution

Created DataQualityMonitor - a simple utility to log and track extraction errors.

Usage:

val monitor = DataQualityMonitor.forExtractor("HomepageExtractor")

// Log errors with context
monitor.logInvalidData(pageTitle, "Malformed IRI", exception = Some(ex), data = Some(url))

// Get metrics
monitor.getMetrics()  // Map("HomepageExtractor:IRISyntaxException" -> 42)
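With this in place, the catch clause from the Problem section can report the failure instead of silently discarding it. A minimal sketch (subjectUri, url and ex stand in for the values already available at that point in the extractor; the exact surrounding code differs):

// HomepageExtractor.scala line 125 (after, sketch)
case ex : IRISyntaxException =>
  monitor.logInvalidData(subjectUri, "Malformed IRI", exception = Some(ex), data = Some(url))
  Seq()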

Changes

New files:

  • core/.../util/DataQualityMonitor.scala - Core monitoring logic
  • core/.../util/DataQualityMonitorTest.scala - Tests
  • server/.../resources/DataQuality.scala - Web dashboard at /quality/
  • documentation/data-quality-monitoring.md - Docs

Modified:

  • core/.../mappings/HomepageExtractor.scala - Added monitoring calls

Web Dashboard

Access at http://localhost:9999/server/quality/ when running the server.

Shows:

  • Error counts by extractor and type
  • Detailed error logs
  • CSV export for analysis

Testing

cd core && mvn test -Dtest=DataQualityMonitorTest -DskipTests=false

cd server && ../run server
# Open http://localhost:9999/server/quality/
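Once the server is up, the metrics endpoint can also be queried directly, for example from a Scala REPL (a quick sanity check; the endpoint path follows from the dashboard URL above):

// Fetch the aggregated metrics as JSON and print them
import scala.io.Source
val metricsJson = Source.fromURL("http://localhost:9999/server/quality/metrics").mkString
println(metricsJson)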
Screenshots: two dashboard captures from Dec 12, 2025 (11:19 PM and 1:26 PM) are attached.

Summary by CodeRabbit

  • New Features

    • Added a Data Quality Monitoring system for extraction metrics and a web dashboard with metrics, per-extractor views, error lists, and CSV export.
  • Enhancements

    • Extraction flow now reports outcomes (success/skip/invalid) for better visibility.
  • Documentation

    • New guide describing the monitoring utility, usage, endpoints, and integration examples.
  • Tests

    • New test suites validating monitoring, dashboard endpoints, CSV export, limits, and concurrency.
  • Chores

    • Project configuration updated to include Maven integration and resource filtering.



coderabbitai bot commented Dec 12, 2025

📝 Walkthrough


Adds a centralized Data Quality Monitoring system, integrates it into HomepageExtractor, exposes metrics/errors via a new /quality REST resource and dashboard, adds tests and documentation, and updates several Eclipse .project files to include Maven builders/natures.

Changes

  • Version control: .gitignore. No functional change: identical ignore rule(s) were removed and re-added; final semantics unchanged.
  • Eclipse project descriptors: core/.project, dump/.project, scripts/.project, server/.project. Replaced minimal Eclipse Scala project descriptors with expanded Maven-aware descriptors: added Maven builder/nature (org.eclipse.m2e.core.maven2Builder, org.eclipse.m2e.core.maven2Nature), retained Scala/Java natures, added filteredResources (regex excluding node_modules, .git, __CREATED_BY_JAVA_LANGUAGE_SERVER__), and adjusted buildSpec/projects.
  • Data Quality monitoring core: core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala. New global and per-extractor DataQualityMonitor plus ExtractionError: thread-safe counters, capped error-detail storage, APIs to log invalid data/skips/successes, retrieve metrics, export CSV, and reset.
  • Extractor integration: core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala. Integrated per-extractor monitor: created monitor instance; logSkipped for non-main namespaces; logSuccess for cleaned URIs; logInvalidData on URI cleaning failures and exceptions with contextual messages.
  • Tests (core): core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala. New ScalaTest suite validating monitor creation, logging (with/without exceptions and data), metrics aggregation, error-detail limits, CSV export, concurrency safety, and record fields.
  • Server REST resource & nav: server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala, server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala. New DataQuality REST resource exposing an HTML dashboard, /quality/metrics JSON, /quality/errors HTML with type/limit params, and /quality/export CSV; Root.scala adds a "Data Quality" navigation link.
  • Tests (server): server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala. New test suite covering dashboard rendering, metrics JSON, error grouping, HTML/CSV outputs, escaping/limits, concurrency, and multi-extractor scenarios.
  • Documentation: documentation/data-quality-monitoring.md. New doc describing monitor purpose, API usage, integration example (HomepageExtractor), error categorization, thread-safety and memory bounds, testing utilities, and web dashboard endpoints.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Browser
    participant Server as REST Server\n(DataQuality resource)
    participant Monitor as DataQualityMonitor (global)
    participant Extractor as HomepageExtractor
    participant Pipeline as Extraction Pipeline

    Pipeline->>Extractor: process page
    alt page skipped (non-main ns)
        Extractor->>Monitor: logSkipped(pageTitle, reason)
        Monitor->>Monitor: record skip (no error count)
    else success (clean URI)
        Extractor->>Monitor: logSuccess(pageTitle, triplesCount)
        Monitor->>Monitor: increment success metric
    else invalid data / exception
        Extractor->>Monitor: logInvalidData(pageTitle, reason, exception?, data?)
        Monitor->>Monitor: increment error count & store capped detail
    end

    User->>Browser: Navigate to /quality/
    Browser->>Server: GET /quality/
    Server->>Monitor: getGlobalMetrics()
    Monitor-->>Server: metrics & totals
    Server->>Server: render dashboard HTML
    Server-->>Browser: HTML dashboard

    Browser->>Server: GET /quality/errors?type=...&limit=...
    Server->>Monitor: getErrorDetails(type, limit)
    Monitor-->>Server: List[ExtractionError]
    Server-->>Browser: HTML error list / CSV (if export)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas needing careful review:
    • core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala — thread-safety, atomic counters, capped detail storage, CSV export formatting.
    • core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala — placement and semantics of monitor calls, exception handling messages.
    • server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala — query parameter parsing, HTML/JSON/CSV formatting and escaping, sorting/grouping logic.
    • core/src/test/.../DataQualityMonitorTest.scala and server/src/test/.../DataQualityTest.scala — concurrency test stability, reset/teardown behavior, and assumptions about global state.
    • .project files — consistency of Maven builder/nature additions across modules and correctness of the resource filter regex.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add Data Quality Monitoring for Extractors' accurately summarizes the main change: introducing a data quality monitoring system with extractor integration.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (5)
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (1)

127-136: Consider consistent page identifiers for monitoring.

The logSuccess and logInvalidData calls use subjectUri (e.g., http://dbpedia.org/resource/Example) while logSkipped on line 55 uses page.title.encoded. This inconsistency may complicate metrics aggregation and dashboard readability.

Consider passing the page title consistently, or document that different identifiers are intentional for different use cases.
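For example, all three calls could use the encoded page title (a sketch only; argument names and the skip reason follow the walkthrough above, not the exact code):

monitor.logSkipped(page.title.encoded, s"Non-main namespace: ${page.title.namespace}")
monitor.logSuccess(page.title.encoded, 1)
monitor.logInvalidData(page.title.encoded, "Malformed IRI", exception = Some(ex), data = Some(url))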

documentation/data-quality-monitoring.md (1)

64-66: Add a language identifier to the fenced code block.

The code block should specify a language for proper rendering and linting compliance. Since this is a URL, use text or plaintext.

-```
+```text
 http://localhost:9999/server/quality/

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (2)

76-82: Extract repeated test literals to constants.

Per SonarCloud analysis, "Invalid IRI" and "TestExtractor:InvalidData" are duplicated multiple times. Consider extracting these to constants for maintainability.
+  private val InvalidIriReason = "Invalid IRI"
+  private val TestExtractorInvalidData = "TestExtractor:InvalidData"
+
   it should "retrieve error details" in {
     val monitor = DataQualityMonitor.forExtractor("TestExtractor")
 
-    monitor.logInvalidData("Einstein", "Invalid IRI", data = Some("malformed://url"))
-    monitor.logInvalidData("Tesla", "Invalid IRI", data = Some("another://bad"))
+    monitor.logInvalidData("Einstein", InvalidIriReason, data = Some("malformed://url"))
+    monitor.logInvalidData("Tesla", InvalidIriReason, data = Some("another://bad"))
 
-    val details = DataQualityMonitor.getErrorDetails("TestExtractor:InvalidData", limit = 10)
+    val details = DataQualityMonitor.getErrorDetails(TestExtractorInvalidData, limit = 10)

97-112: Consider reducing iteration count or marking as slow test.

Logging 11,000 errors in a loop may slow down test execution. Consider reducing to a smaller count (e.g., 100 over the limit) or annotating with a slow test marker if the framework supports it.

   it should "limit stored error details to prevent memory issues" in {
     val monitor = DataQualityMonitor.forExtractor("TestExtractor")
 
-    // Try to log more than the limit (10000)
-    for (i <- 1 to 11000) {
+    // Try to log more than the limit (10000) - use 10100 to test boundary
+    for (i <- 1 to 10100) {
       monitor.logInvalidData(s"Page$i", "Error")
     }
 
     // Should still work without memory issues
     val metrics = monitor.getMetrics()
-    monitor.getTotalErrors() should be(11000)
+    monitor.getTotalErrors() should be(10100)
 
     // But details should be limited
     val details = DataQualityMonitor.getErrorDetails("TestExtractor:InvalidData", limit = 20000)
     details.size should be <= 10000
   }
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1)

58-65: CSV export doesn't properly escape all special characters.

The current implementation only replaces commas with semicolons but doesn't handle quotes, newlines, or other characters that could break CSV parsing. Consider proper CSV escaping.

   def exportToCsv(errorType: String, limit: Int = 1000): String = {
     val errors = getErrorDetails(errorType, limit)
     val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
-    val rows = errors.map(e =>
-      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
-    ).mkString("\n")
+    val rows = errors.map(e =>
+      s"${escapeCsv(e.extractorName)},${escapeCsv(e.pageTitle)},${escapeCsv(e.message)},${e.timestamp}"
+    ).mkString("\n")
     header + rows
   }
+
+  private def escapeCsv(value: String): String = {
+    if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
+      "\"" + value.replace("\"", "\"\"") + "\""
+    } else {
+      value
+    }
+  }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cca4b and 4ceb888.

⛔ Files ignored due to path filters (1)
  • .DS_Store is excluded by !**/.DS_Store
📒 Files selected for processing (11)
  • .gitignore (1 hunks)
  • core/.project (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (4 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1 hunks)
  • core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1 hunks)
  • documentation/data-quality-monitoring.md (1 hunks)
  • dump/.project (1 hunks)
  • scripts/.project (1 hunks)
  • server/.project (1 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala (1 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (6)
  • DataQualityMonitor (23-74)
  • DataQualityMonitor (79-188)
  • forExtractor (34-36)
  • logSkipped (112-114)
  • logSuccess (119-121)
  • logInvalidData (91-107)
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (12)
  • DataQualityMonitor (23-74)
  • DataQualityMonitor (79-188)
  • reset (70-73)
  • forExtractor (34-36)
  • logInvalidData (91-107)
  • getMetrics (126-131)
  • getTotalErrors (136-138)
  • getGlobalMetrics (41-43)
  • getErrorDetails (48-53)
  • exportToCsv (58-65)
  • logSkipped (112-114)
  • logSuccess (119-121)
🪛 GitHub Check: SonarCloud Code Analysis
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala

[failure] 76-76: Define a constant instead of duplicating this literal "Invalid IRI" 3 times.

See more on https://sonarcloud.io/project/issues?id=dbpedia_extraction-framework&issues=AZsTwP6Eypzs4FBPiKIV&open=AZsTwP6Eypzs4FBPiKIV&pullRequest=808


[failure] 79-79: Define a constant instead of duplicating this literal "TestExtractor:InvalidData" 3 times.

See more on https://sonarcloud.io/project/issues?id=dbpedia_extraction-framework&issues=AZsTwP6Eypzs4FBPiKIU&open=AZsTwP6Eypzs4FBPiKIU&pullRequest=808

🪛 markdownlint-cli2 (0.18.1)
documentation/data-quality-monitoring.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: deploy
  • GitHub Check: build
🔇 Additional comments (14)
.gitignore (1)

10-10: LGTM!

Adding .settings to gitignore is appropriate alongside the Eclipse .project file updates in this PR.

dump/.project (1)

1-36: LGTM!

The Maven-enabled Eclipse project descriptor follows the same pattern as the other module .project files in this PR.

scripts/.project (1)

1-36: LGTM!

Consistent Maven-enabled Eclipse configuration matching other modules.

server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala (1)

28-28: LGTM!

The navigation link integrates cleanly with existing links and follows the established pattern.

core/.project (1)

1-35: LGTM!

Consistent Maven-enabled Eclipse configuration for the core module.

core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (3)

29-30: LGTM!

Clean initialization using the factory method. The monitor is appropriately scoped as a private instance field.


54-57: LGTM!

Good addition of skip logging with descriptive reason including the namespace.


138-155: LGTM!

Good categorization of exceptions with separate handling for IRISyntaxException versus other errors. Including both the exception and URL in the log provides useful debugging context.

documentation/data-quality-monitoring.md (1)

1-129: Good documentation coverage.

The documentation clearly explains the purpose, usage patterns, API methods, and integration examples. This will help other developers adopt the monitoring system in their extractors.

server/.project (1)

1-36: LGTM!

The Eclipse project configuration correctly adds Maven builder support alongside Scala, and the resource filters appropriately exclude common non-source directories.

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)

134-153: Good concurrency test coverage.

The test properly exercises thread safety with 10 threads logging 100 errors each and verifies the total count. This validates the thread-safe design using TrieMap and AtomicLong.

server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala (1)

32-111: Well-structured HTML dashboard.

The dashboard implementation is clean with proper grouping by extractor, sorting, and informative empty-state messaging. The use of Bootstrap panels provides a good UX.

core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (2)

23-74: Good design for the companion object API.

The factory method pattern with forExtractor, global metrics access, and CSV export provide a clean public API. Thread-safe data structures (TrieMap, AtomicLong) are appropriate choices.


79-138: Well-structured per-extractor monitor class.

The class provides a clean interface for logging invalid data, skipped pages, and successes. The categorization logic is sensible, and metrics retrieval is properly scoped to the extractor.

Comment on lines +48 to +53
  def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
    errorDetails.get(errorType) match {
      case Some(errors) => errors.take(limit).toList
      case None => List.empty
    }
  }

⚠️ Potential issue | 🟠 Major

Potential thread-safety issue when reading error details.

errors.take(limit) operates on a mutable ListBuffer that may be concurrently modified by recordError. Consider synchronizing the read or returning a snapshot.

   def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
     errorDetails.get(errorType) match {
-      case Some(errors) => errors.take(limit).toList
+      case Some(errors) => errors.synchronized { errors.take(limit).toList }
       case None => List.empty
     }
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
    errorDetails.get(errorType) match {
      case Some(errors) => errors.take(limit).toList
      case None => List.empty
    }
  }

  def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
    errorDetails.get(errorType) match {
      case Some(errors) => errors.synchronized { errors.take(limit).toList }
      case None => List.empty
    }
  }
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 48 to 53, the method getErrorDetails reads from a mutable
ListBuffer without synchronization which can race with concurrent recordError
updates; change the implementation to obtain a thread-safe snapshot by
synchronizing on the ListBuffer (or the map entry) while copying its contents
and then return the copy (e.g., synchronize around errors and call
errors.take(limit).toList or errors.toList.take(limit)), or replace the mutable
ListBuffer with a concurrent/immutable collection and return its immutable
snapshot.

Comment on lines 164 to 172
    DataQualityMonitor.errorDetails
      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
      .synchronized {
        val buffer = DataQualityMonitor.errorDetails(errorType)
        if (buffer.size < 10000) { // Limit storage
          buffer += errorDetail
        }
      }
  }

🛠️ Refactor suggestion | 🟠 Major

Simplify synchronization and fix redundant buffer lookup.

The buffer is fetched via getOrElseUpdate, then re-fetched inside the synchronized block. Use the buffer returned by getOrElseUpdate directly.

-    DataQualityMonitor.errorDetails
-      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
-      .synchronized {
-        val buffer = DataQualityMonitor.errorDetails(errorType)
-        if (buffer.size < 10000) { // Limit storage
-          buffer += errorDetail
-        }
-      }
+    val buffer = DataQualityMonitor.errorDetails
+      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
+    buffer.synchronized {
+      if (buffer.size < 10000) { // Limit storage
+        buffer += errorDetail
+      }
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    DataQualityMonitor.errorDetails
      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
      .synchronized {
        val buffer = DataQualityMonitor.errorDetails(errorType)
        if (buffer.size < 10000) { // Limit storage
          buffer += errorDetail
        }
      }
  }

    val buffer = DataQualityMonitor.errorDetails
      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
    buffer.synchronized {
      if (buffer.size < 10000) { // Limit storage
        buffer += errorDetail
      }
    }
  }
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 164 to 172, the code calls getOrElseUpdate to obtain a buffer but
then re-reads the map inside the synchronized block; replace that by storing the
result of getOrElseUpdate in a local val (e.g. val buffer =
DataQualityMonitor.errorDetails.getOrElseUpdate(errorType,
collection.mutable.ListBuffer.empty)) and synchronize on that buffer
(buffer.synchronized { if (buffer.size < 10000) buffer += errorDetail }) so you
use the retrieved buffer directly and avoid the redundant map lookup.

Comment on lines +119 to +138
  def getMetricsJson: String = {
    val metrics = DataQualityMonitor.getGlobalMetrics()

    val json = new StringBuilder()
    json.append("{\n")
    json.append(s"""  "totalErrors": ${metrics.values.sum},\n""")
    json.append(s"""  "errorTypes": ${metrics.size},\n""")
    json.append("""  "metrics": {""").append("\n")

    val entries = metrics.toSeq
    entries.zipWithIndex.foreach { case ((errorType, count), idx) =>
      json.append(s"""    "$errorType": $count""")
      if (idx < entries.size - 1) json.append(",")
      json.append("\n")
    }

    json.append("  }\n")
    json.append("}")
    json.toString()
  }

⚠️ Potential issue | 🟠 Major

Manual JSON construction lacks proper escaping.

If error type keys contain special characters (quotes, backslashes), the generated JSON will be malformed. Consider using a JSON library or at minimum escape the errorType string.

     val entries = metrics.toSeq
     entries.zipWithIndex.foreach { case ((errorType, count), idx) =>
-      json.append(s"""    "$errorType": $count""")
+      val escapedType = errorType.replace("\\", "\\\\").replace("\"", "\\\"")
+      json.append(s"""    "$escapedType": $count""")
       if (idx < entries.size - 1) json.append(",")
       json.append("\n")
     }
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 119-138 the code manually concatenates JSON and does not escape
errorType keys, which can break JSON if keys contain quotes or backslashes;
replace the hand-rolled string building with a proper JSON serialization
approach (e.g., construct a Scala Map/JsObject or POJO representing totalErrors,
errorTypes and metrics and serialize it with your project's JSON library such as
play-json, Jackson, or json4s), or at minimum run each errorType through a JSON
string-escaping routine before inserting into the output; ensure you return the
serialized JSON string from the serializer rather than concatenating pieces.
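A minimal sketch of the library approach with json4s (assuming json4s is available on the server classpath; adapt to whichever JSON library the project already uses):

import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

def getMetricsJson: String = {
  val metrics = DataQualityMonitor.getGlobalMetrics()
  // The library handles escaping of keys and values
  val json = ("totalErrors" -> metrics.values.sum) ~
             ("errorTypes" -> metrics.size) ~
             ("metrics" -> metrics)
  pretty(render(json))
}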

Comment on lines +146 to +151
  def getErrors(
    @QueryParam("type") errorType: String,
    @QueryParam("limit") limitParam: String
  ): Elem = {
    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
    val errors = DataQualityMonitor.getErrorDetails(errorType, limit)

🛠️ Refactor suggestion | 🟠 Major

Add null check for errorType parameter.

If the type query parameter is missing, errorType will be null, which could cause issues when passed to getErrorDetails. Return a user-friendly error or default behavior.

   def getErrors(
     @QueryParam("type") errorType: String,
     @QueryParam("limit") limitParam: String
   ): Elem = {
+    if (errorType == null || errorType.isEmpty) {
+      return <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
+        {ServerHeader.getHeader("Error", true)}
+        <body>
+          <div class="container">
+            <div class="alert alert-danger">
+              <strong>Missing parameter:</strong> Please specify an error type using ?type=ErrorType
+            </div>
+            <a href="/quality/" class="btn btn-default">&larr; Back to Overview</a>
+          </div>
+        </body>
+      </html>
+    }
     val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 146 to 151, the QueryParam "type" (errorType) can be null when
omitted; add a null/empty check and handle it before calling
DataQualityMonitor.getErrorDetails: either supply a sensible default (e.g. treat
missing type as "all" or empty string) or return a user-friendly error XML
response with a 400 status indicating the missing parameter; ensure you do not
pass a raw null into getErrorDetails and construct the returned Elem
accordingly.

Comment on lines +150 to +151
val limit = Option(limitParam).map(_.toInt).getOrElse(50)
val errors = DataQualityMonitor.getErrorDetails(errorType, limit)

⚠️ Potential issue | 🟠 Major

Handle invalid limit parameter to prevent NumberFormatException.

If limitParam contains a non-numeric value, toInt will throw NumberFormatException, returning a 500 error to the client. Add error handling or use toIntOption.

-    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
+    val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
val limit = Option(limitParam).map(_.toInt).getOrElse(50)
val errors = DataQualityMonitor.getErrorDetails(errorType, limit)
val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)
val errors = DataQualityMonitor.getErrorDetails(errorType, limit)
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 150 to 151, the current use of
Option(limitParam).map(_.toInt).getOrElse(50) will throw NumberFormatException
for non-numeric input; change it to safely parse the parameter (e.g., use
Option(limitParam).flatMap(s => s.toIntOption).getOrElse(50) or catch
NumberFormatException) and, if parsing fails, either fall back to the default 50
or return a 400 Bad Request with a clear message; update the surrounding code to
use the parsed Int variable when calling
DataQualityMonitor.getErrorDetails(errorType, limit).

Comment on lines +216 to +217
val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
DataQualityMonitor.exportToCsv(errorType, limit)

⚠️ Potential issue | 🟠 Major

Same NumberFormatException risk in exportCsv.

Apply the same fix here for consistent error handling.

-    val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
+    val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(1000)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
DataQualityMonitor.exportToCsv(errorType, limit)
val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(1000)
DataQualityMonitor.exportToCsv(errorType, limit)
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 216 to 217, the current conversion
Option(limitParam).map(_.toInt).getOrElse(1000) can throw NumberFormatException
for bad input; change it to a safe parse that falls back to 1000 on invalid
input (e.g. use scala.util.Try(limitParam.toInt).toOption.getOrElse(1000) or
limitParam.toIntOption.getOrElse(1000) depending on Scala version) and
optionally log or ignore the invalid value, then call
DataQualityMonitor.exportToCsv(errorType, limit) with the safely parsed limit.

@PritamP20
Author

@JJ-Author can you please review this

@sonarqubecloud

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1)

25-30: Thread-safety issue persists in getErrorDetails.

The mutable ListBuffer may be concurrently modified by recordError while this method reads from it. This was flagged in a previous review and remains unaddressed.

   def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
     errorDetails.get(errorType) match {
-      case Some(errors) => errors.take(limit).toList
+      case Some(errors) => errors.synchronized { errors.take(limit).toList }
       case None => List.empty
     }
   }
🧹 Nitpick comments (1)
server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1)

44-59: Consider using JSON parsing for more robust assertions.

String-based assertions like should include("\"totalErrors\": 5") are fragile and sensitive to formatting changes. A JSON parser would make tests more resilient.

Example with a JSON library:

import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats: Formats = DefaultFormats  // required by extract[Int]

val json = parse(dataQuality.getMetricsJson)
(json \ "totalErrors").extract[Int] should be(5)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b5188f and 35de582.

📒 Files selected for processing (3)
  • core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1 hunks)
  • core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1 hunks)
  • server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (5)
  • DataQualityMonitor (11-45)
  • DataQualityMonitor (47-123)
  • reset (41-44)
  • forExtractor (17-19)
  • logInvalidData (51-63)
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (12)
  • DataQualityMonitor (11-45)
  • DataQualityMonitor (47-123)
  • reset (41-44)
  • forExtractor (17-19)
  • logInvalidData (51-63)
  • getMetrics (73-78)
  • getTotalErrors (80-80)
  • getGlobalMetrics (21-23)
  • getErrorDetails (25-30)
  • exportToCsv (32-39)
  • logSkipped (65-67)
  • logSuccess (69-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: deploy
🔇 Additional comments (13)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (5)

1-6: LGTM: Appropriate concurrency primitives selected.

Using TrieMap and AtomicLong from java.util.concurrent provides thread-safe foundations for the concurrent access patterns this monitor will encounter.


47-78: LGTM: Per-extractor monitor implementation is well-structured.

The separation between logging methods (logInvalidData, logSkipped, logSuccess) provides clear semantics for different extraction outcomes. The getMetrics() filtering by prefix cleanly isolates per-extractor data.


102-106: Synchronization approach differs from past suggestion but is valid.

The current implementation synchronizes on the entire errorDetails TrieMap rather than on individual buffers as previously suggested. While this is coarse-grained and could cause contention under high concurrency with different error types, it is thread-safe and acceptable for a monitoring utility.


109-122: LGTM: Helper methods are clean and functional.

Error categorization logic is sensible, prioritizing exception type when available and falling back to keyword-based categorization. The truncate helper safely handles string length.
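For readers without the diff open, that categorization amounts to roughly the following (illustrative names only, not the actual implementation):

// Prefer the exception class name; otherwise fall back to a keyword-derived bucket
private def categorize(reason: String, exception: Option[Throwable]): String =
  exception.map(_.getClass.getSimpleName) // e.g. "IRISyntaxException"
    .getOrElse(if (reason.toLowerCase.contains("iri")) "InvalidIRI" else "InvalidData")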


125-131: LGTM: Clean data model for error details.

The ExtractionError case class captures the essential error metadata with appropriate use of Option for the nullable exception type.

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (4)

1-9: LGTM: Test setup is appropriate.

Using BeforeAndAfter with DataQualityMonitor.reset() ensures test isolation, preventing state leakage between tests.


17-39: LGTM: Error recording and categorization tests are well-structured.

Tests validate both exception-based and keyword-based categorization paths with appropriate assertions on metrics.


107-123: LGTM: Concurrency test validates thread-safe counting.

The test correctly spawns multiple threads, waits for completion with join(), and verifies the expected total count. This validates the AtomicLong increment behavior under contention.


125-144: LGTM: Thorough field validation for error details.

The test comprehensively verifies all ExtractionError fields are correctly populated, including the optional exceptionType.

server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (4)

1-12: LGTM: Server test setup is appropriate.

Direct instantiation of DataQuality for testing allows isolation from the full server stack while still exercising the resource logic.


14-42: LGTM: HTML rendering tests cover key scenarios.

Tests validate empty state, populated state with multiple extractors, and proper grouping—covering the main rendering paths.


93-99: Test validates current escaping behavior but not full CSV safety.

This test correctly expects commas to be replaced with semicolons as currently implemented. If the CSV escaping in DataQualityMonitor.exportToCsv is updated to use RFC 4180 quoting, this test will need to be updated accordingly.


131-149: LGTM: Concurrent access test validates thread safety under mixed read/write load.

Interleaving getMetricsJson calls during concurrent logging effectively tests the monitor's behavior under realistic concurrent access patterns.

Comment on lines +32 to +39
  def exportToCsv(errorType: String, limit: Int = 1000): String = {
    val errors = getErrorDetails(errorType, limit)
    val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
    val rows = errors.map(e =>
      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
    ).mkString("\n")
    header + rows
  }

⚠️ Potential issue | 🟡 Minor

CSV export is vulnerable to malformed output.

Only message is escaped (commas replaced with semicolons), but extractorName and pageTitle could also contain commas, newlines, or quotes that would corrupt the CSV structure.

Consider proper RFC 4180 CSV escaping:

   def exportToCsv(errorType: String, limit: Int = 1000): String = {
     val errors = getErrorDetails(errorType, limit)
     val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
+    def escapeCsvField(field: String): String = {
+      if (field.contains(",") || field.contains("\"") || field.contains("\n"))
+        "\"" + field.replace("\"", "\"\"") + "\""
+      else field
+    }
     val rows = errors.map(e =>
-      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
+      s"${escapeCsvField(e.extractorName)},${escapeCsvField(e.pageTitle)},${escapeCsvField(e.message)},${e.timestamp}"
     ).mkString("\n")
     header + rows
   }
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 32 to 39, the CSV export currently only sanitizes the message field
and so extractorName and pageTitle (and message newlines/quotes) can break CSV
structure; change the implementation to produce RFC4180-compliant CSV by either:
1) using a CSV library (e.g., OpenCSV or Apache Commons CSV) to build rows
safely, or 2) implementing RFC4180 escaping for each field — wrap every field in
double quotes, double any internal double quotes, and convert newlines to \r\n
(or preserve them inside quoted fields) before joining with commas; ensure the
header remains and apply the escaping to extractorName, pageTitle, and message
(and timestamp if necessary).
