Add Data Quality Monitoring for Extractors #808
base: master
📝 Walkthrough

Adds a centralized Data Quality Monitoring system, integrates it into HomepageExtractor, exposes metrics/errors via a new /quality REST resource and dashboard, adds tests and documentation, and updates several Eclipse .project files to include Maven builders/natures.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant Server as REST Server\n(DataQuality resource)
    participant Monitor as DataQualityMonitor (global)
    participant Extractor as HomepageExtractor
    participant Pipeline as Extraction Pipeline

    Pipeline->>Extractor: process page
    alt page skipped (non-main ns)
        Extractor->>Monitor: logSkipped(pageTitle, reason)
        Monitor->>Monitor: record skip (no error count)
    else success (clean URI)
        Extractor->>Monitor: logSuccess(pageTitle, triplesCount)
        Monitor->>Monitor: increment success metric
    else invalid data / exception
        Extractor->>Monitor: logInvalidData(pageTitle, reason, exception?, data?)
        Monitor->>Monitor: increment error count & store capped detail
    end

    User->>Browser: Navigate to /quality/
    Browser->>Server: GET /quality/
    Server->>Monitor: getGlobalMetrics()
    Monitor-->>Server: metrics & totals
    Server->>Server: render dashboard HTML
    Server-->>Browser: HTML dashboard
    Browser->>Server: GET /quality/errors?type=...&limit=...
    Server->>Monitor: getErrorDetails(type, limit)
    Monitor-->>Server: List[ExtractionError]
    Server-->>Browser: HTML error list / CSV (if export)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
✨ Finishing touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 6
🧹 Nitpick comments (5)
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (1)
127-136: Consider consistent page identifiers for monitoring.

The `logSuccess` and `logInvalidData` calls use `subjectUri` (e.g., `http://dbpedia.org/resource/Example`) while `logSkipped` on line 55 uses `page.title.encoded`. This inconsistency may complicate metrics aggregation and dashboard readability. Consider passing the page title consistently, or document that different identifiers are intentional for different use cases.
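To make this nitpick concrete, here is a small hedged sketch of one way to normalize the identifier before every monitoring call. `PageIds` and `pageId` are hypothetical names for illustration only, not part of the PR:

```scala
// Hypothetical helper (not in the PR): derive one canonical identifier from a
// page title, mirroring the underscore encoding of page.title.encoded, so that
// logSuccess, logInvalidData and logSkipped all report the same key.
object PageIds {
  def pageId(title: String): String = title.trim.replace(' ', '_')
}
```

Each monitoring call would then pass `PageIds.pageId(title)` instead of mixing `subjectUri` and `page.title.encoded`.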
documentation/data-quality-monitoring.md (1)
64-66: Add a language identifier to the fenced code block.

The code block should specify a language for proper rendering and linting compliance. Since this is a URL, use `text` or `plaintext`.

````diff
-```
+```text
 http://localhost:9999/server/quality/
````

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (2)

76-82: Extract repeated test literals to constants.

Per SonarCloud analysis, `"Invalid IRI"` and `"TestExtractor:InvalidData"` are duplicated multiple times. Consider extracting these to constants for maintainability.

```diff
+  private val InvalidIriReason = "Invalid IRI"
+  private val TestExtractorInvalidData = "TestExtractor:InvalidData"
+
   it should "retrieve error details" in {
     val monitor = DataQualityMonitor.forExtractor("TestExtractor")
-    monitor.logInvalidData("Einstein", "Invalid IRI", data = Some("malformed://url"))
-    monitor.logInvalidData("Tesla", "Invalid IRI", data = Some("another://bad"))
+    monitor.logInvalidData("Einstein", InvalidIriReason, data = Some("malformed://url"))
+    monitor.logInvalidData("Tesla", InvalidIriReason, data = Some("another://bad"))

-    val details = DataQualityMonitor.getErrorDetails("TestExtractor:InvalidData", limit = 10)
+    val details = DataQualityMonitor.getErrorDetails(TestExtractorInvalidData, limit = 10)
```
97-112: Consider reducing iteration count or marking as slow test.

Logging 11,000 errors in a loop may slow down test execution. Consider reducing to a smaller count (e.g., 100 over the limit) or annotating with a slow test marker if the framework supports it.

```diff
   it should "limit stored error details to prevent memory issues" in {
     val monitor = DataQualityMonitor.forExtractor("TestExtractor")
-    // Try to log more than the limit (10000)
-    for (i <- 1 to 11000) {
+    // Try to log more than the limit (10000) - use 10100 to test boundary
+    for (i <- 1 to 10100) {
       monitor.logInvalidData(s"Page$i", "Error")
     }

     // Should still work without memory issues
     val metrics = monitor.getMetrics()
-    monitor.getTotalErrors() should be(11000)
+    monitor.getTotalErrors() should be(10100)

     // But details should be limited
     val details = DataQualityMonitor.getErrorDetails("TestExtractor:InvalidData", limit = 20000)
     details.size should be <= 10000
   }
```

core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1)
58-65: CSV export doesn't properly escape all special characters.

The current implementation only replaces commas with semicolons but doesn't handle quotes, newlines, or other characters that could break CSV parsing. Consider proper CSV escaping.

```diff
   def exportToCsv(errorType: String, limit: Int = 1000): String = {
     val errors = getErrorDetails(errorType, limit)
     val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
-    val rows = errors.map(e =>
-      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
-    ).mkString("\n")
+    val rows = errors.map(e =>
+      s"${escapeCsv(e.extractorName)},${escapeCsv(e.pageTitle)},${escapeCsv(e.message)},${e.timestamp}"
+    ).mkString("\n")
     header + rows
   }
+
+  private def escapeCsv(value: String): String = {
+    if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
+      "\"" + value.replace("\"", "\"\"") + "\""
+    } else {
+      value
+    }
+  }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (11)
- `.gitignore` (1 hunks)
- `core/.project` (1 hunks)
- `core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala` (4 hunks)
- `core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala` (1 hunks)
- `core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala` (1 hunks)
- `documentation/data-quality-monitoring.md` (1 hunks)
- `dump/.project` (1 hunks)
- `scripts/.project` (1 hunks)
- `server/.project` (1 hunks)
- `server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala` (1 hunks)
- `server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (6)
`DataQualityMonitor` (23-74), `DataQualityMonitor` (79-188), `forExtractor` (34-36), `logSkipped` (112-114), `logSuccess` (119-121), `logInvalidData` (91-107)
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (12)
`DataQualityMonitor` (23-74), `DataQualityMonitor` (79-188), `reset` (70-73), `forExtractor` (34-36), `logInvalidData` (91-107), `getMetrics` (126-131), `getTotalErrors` (136-138), `getGlobalMetrics` (41-43), `getErrorDetails` (48-53), `exportToCsv` (58-65), `logSkipped` (112-114), `logSuccess` (119-121)
🪛 GitHub Check: SonarCloud Code Analysis
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala
[failure] 76-76: Define a constant instead of duplicating this literal "Invalid IRI" 3 times.
[failure] 79-79: Define a constant instead of duplicating this literal "TestExtractor:InvalidData" 3 times.
🪛 markdownlint-cli2 (0.18.1)
documentation/data-quality-monitoring.md
64-64: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
- GitHub Check: build
🔇 Additional comments (14)
.gitignore (1)
10-10: LGTM! Adding `.settings` to gitignore is appropriate alongside the Eclipse `.project` file updates in this PR.

dump/.project (1)

1-36: LGTM! The Maven-enabled Eclipse project descriptor follows the same pattern as the other module `.project` files in this PR.

scripts/.project (1)

1-36: LGTM! Consistent Maven-enabled Eclipse configuration matching other modules.
server/src/main/scala/org/dbpedia/extraction/server/resources/Root.scala (1)
28-28: LGTM! The navigation link integrates cleanly with existing links and follows the established pattern.
core/.project (1)
1-35: LGTM! Consistent Maven-enabled Eclipse configuration for the core module.
core/src/main/scala/org/dbpedia/extraction/mappings/HomepageExtractor.scala (3)
29-30: LGTM! Clean initialization using the factory method. The monitor is appropriately scoped as a private instance field.

54-57: LGTM! Good addition of skip logging with a descriptive reason that includes the namespace.

138-155: LGTM! Good categorization of exceptions with separate handling for `IRISyntaxException` versus other errors. Including both the exception and URL in the log provides useful debugging context.

documentation/data-quality-monitoring.md (1)

1-129: Good documentation coverage. The documentation clearly explains the purpose, usage patterns, API methods, and integration examples. This will help other developers adopt the monitoring system in their extractors.
server/.project (1)
1-36: LGTM! The Eclipse project configuration correctly adds Maven builder support alongside Scala, and the resource filters appropriately exclude common non-source directories.
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
134-153: Good concurrency test coverage. The test properly exercises thread safety with 10 threads logging 100 errors each and verifies the total count. This validates the thread-safe design using `TrieMap` and `AtomicLong`.

server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala (1)

32-111: Well-structured HTML dashboard. The dashboard implementation is clean, with proper grouping by extractor, sorting, and informative empty-state messaging. The use of Bootstrap panels provides a good UX.
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (2)
23-74: Good design for the companion object API. The factory method pattern with `forExtractor`, global metrics access, and CSV export provides a clean public API. Thread-safe data structures (`TrieMap`, `AtomicLong`) are appropriate choices.

79-138: Well-structured per-extractor monitor class. The class provides a clean interface for logging invalid data, skipped pages, and successes. The categorization logic is sensible, and metrics retrieval is properly scoped to the extractor.
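The prefix-scoped metrics idea praised here can be illustrated with a minimal, self-contained sketch. `PrefixMetrics` and its method names are assumptions for illustration, not the PR's actual API; only the general pattern (a shared thread-safe map, filtered by an `"extractor:"` key prefix) comes from the review:

```scala
import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicLong

// Illustrative sketch (not the PR's code): per-extractor metrics isolated by
// key prefix over one shared thread-safe map.
object PrefixMetrics {
  private val counters = TrieMap.empty[String, AtomicLong]

  // Thread-safe increment: getOrElseUpdate on TrieMap plus atomic counter.
  def increment(key: String): Long =
    counters.getOrElseUpdate(key, new AtomicLong(0)).incrementAndGet()

  // Per-extractor view: keep only keys that start with "<extractor>:"
  def metricsFor(extractor: String): Map[String, Long] =
    counters.iterator
      .filter { case (k, _) => k.startsWith(extractor + ":") }
      .map { case (k, v) => k -> v.get() }
      .toMap
}
```

A global view is just the unfiltered map, which is how a companion object can serve both per-extractor and dashboard-wide queries from the same store.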
```scala
  def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
    errorDetails.get(errorType) match {
      case Some(errors) => errors.take(limit).toList
      case None => List.empty
    }
  }
```
Potential thread-safety issue when reading error details.
`errors.take(limit)` operates on a mutable `ListBuffer` that may be concurrently modified by `recordError`. Consider synchronizing the read or returning a snapshot.
```diff
 def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
   errorDetails.get(errorType) match {
-    case Some(errors) => errors.take(limit).toList
+    case Some(errors) => errors.synchronized { errors.take(limit).toList }
     case None => List.empty
   }
 }
```
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 48 to 53, the method getErrorDetails reads from a mutable
ListBuffer without synchronization which can race with concurrent recordError
updates; change the implementation to obtain a thread-safe snapshot by
synchronizing on the ListBuffer (or the map entry) while copying its contents
and then return the copy (e.g., synchronize around errors and call
errors.take(limit).toList or errors.toList.take(limit)), or replace the mutable
ListBuffer with a concurrent/immutable collection and return its immutable
snapshot.
```scala
    DataQualityMonitor.errorDetails
      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
      .synchronized {
        val buffer = DataQualityMonitor.errorDetails(errorType)
        if (buffer.size < 10000) { // Limit storage
          buffer += errorDetail
        }
      }
  }
```
🛠️ Refactor suggestion | 🟠 Major
Simplify synchronization and fix redundant buffer lookup.
The buffer is fetched via `getOrElseUpdate`, then re-fetched inside the synchronized block. Use the buffer returned by `getOrElseUpdate` directly.
```diff
-    DataQualityMonitor.errorDetails
-      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
-      .synchronized {
-        val buffer = DataQualityMonitor.errorDetails(errorType)
-        if (buffer.size < 10000) { // Limit storage
-          buffer += errorDetail
-        }
-      }
+    val buffer = DataQualityMonitor.errorDetails
+      .getOrElseUpdate(errorType, collection.mutable.ListBuffer.empty)
+    buffer.synchronized {
+      if (buffer.size < 10000) { // Limit storage
+        buffer += errorDetail
+      }
+    }
```
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 164 to 172, the code calls getOrElseUpdate to obtain a buffer but
then re-reads the map inside the synchronized block; replace that by storing the
result of getOrElseUpdate in a local val (e.g. val buffer =
DataQualityMonitor.errorDetails.getOrElseUpdate(errorType,
collection.mutable.ListBuffer.empty)) and synchronize on that buffer
(buffer.synchronized { if (buffer.size < 10000) buffer += errorDetail }) so you
use the retrieved buffer directly and avoid the redundant map lookup.
```scala
  def getMetricsJson: String = {
    val metrics = DataQualityMonitor.getGlobalMetrics()

    val json = new StringBuilder()
    json.append("{\n")
    json.append(s"""  "totalErrors": ${metrics.values.sum},\n""")
    json.append(s"""  "errorTypes": ${metrics.size},\n""")
    json.append("""  "metrics": {""").append("\n")

    val entries = metrics.toSeq
    entries.zipWithIndex.foreach { case ((errorType, count), idx) =>
      json.append(s"""    "$errorType": $count""")
      if (idx < entries.size - 1) json.append(",")
      json.append("\n")
    }

    json.append("  }\n")
    json.append("}")
    json.toString()
  }
```
Manual JSON construction lacks proper escaping.
If error type keys contain special characters (quotes, backslashes), the generated JSON will be malformed. Consider using a JSON library or at minimum escape the `errorType` string.
```diff
     val entries = metrics.toSeq
     entries.zipWithIndex.foreach { case ((errorType, count), idx) =>
-      json.append(s"""    "$errorType": $count""")
+      val escapedType = errorType.replace("\\", "\\\\").replace("\"", "\\\"")
+      json.append(s"""    "$escapedType": $count""")
       if (idx < entries.size - 1) json.append(",")
       json.append("\n")
     }
```

🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 119-138 the code manually concatenates JSON and does not escape
errorType keys, which can break JSON if keys contain quotes or backslashes;
replace the hand-rolled string building with a proper JSON serialization
approach (e.g., construct a Scala Map/JsObject or POJO representing totalErrors,
errorTypes and metrics and serialize it with your project's JSON library such as
play-json, Jackson, or json4s), or at minimum run each errorType through a JSON
string-escaping routine before inserting into the output; ensure you return the
serialized JSON string from the serializer rather than concatenating pieces.
```scala
  def getErrors(
    @QueryParam("type") errorType: String,
    @QueryParam("limit") limitParam: String
  ): Elem = {
    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
    val errors = DataQualityMonitor.getErrorDetails(errorType, limit)
```
🛠️ Refactor suggestion | 🟠 Major
Add null check for errorType parameter.
If the `type` query parameter is missing, `errorType` will be null, which could cause issues when passed to `getErrorDetails`. Return a user-friendly error or default behavior.
```diff
 def getErrors(
   @QueryParam("type") errorType: String,
   @QueryParam("limit") limitParam: String
 ): Elem = {
+  if (errorType == null || errorType.isEmpty) {
+    return <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
+      {ServerHeader.getHeader("Error", true)}
+      <body>
+        <div class="container">
+          <div class="alert alert-danger">
+            <strong>Missing parameter:</strong> Please specify an error type using ?type=ErrorType
+          </div>
+          <a href="/quality/" class="btn btn-default">← Back to Overview</a>
+        </div>
+      </body>
+    </html>
+  }
   val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 146 to 151, the QueryParam "type" (errorType) can be null when
omitted; add a null/empty check and handle it before calling
DataQualityMonitor.getErrorDetails: either supply a sensible default (e.g. treat
missing type as "all" or empty string) or return a user-friendly error XML
response with a 400 status indicating the missing parameter; ensure you do not
pass a raw null into getErrorDetails and construct the returned Elem
accordingly.
```scala
    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
    val errors = DataQualityMonitor.getErrorDetails(errorType, limit)
```
Handle invalid limit parameter to prevent `NumberFormatException`.

If `limitParam` contains a non-numeric value, `toInt` will throw `NumberFormatException`, returning a 500 error to the client. Add error handling or use `toIntOption`.
```diff
-    val limit = Option(limitParam).map(_.toInt).getOrElse(50)
+    val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(50)
```
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 150 to 151, the current use of
Option(limitParam).map(_.toInt).getOrElse(50) will throw NumberFormatException
for non-numeric input; change it to safely parse the parameter (e.g., use
Option(limitParam).flatMap(s => s.toIntOption).getOrElse(50) or catch
NumberFormatException) and, if parsing fails, either fall back to the default 50
or return a 400 Bad Request with a clear message; update the surrounding code to
use the parsed Int variable when calling
DataQualityMonitor.getErrorDetails(errorType, limit).
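The safe-parsing pattern recommended in these comments can be distilled into a standalone sketch. `parseLimit` is a hypothetical name used only for illustration; the `Option`/`Try` combination is the same one the suggested diffs use:

```scala
import scala.util.Try

// Parse an optional numeric query parameter, falling back to a default
// instead of throwing NumberFormatException on null or non-numeric input.
def parseLimit(param: String, default: Int = 50): Int =
  Option(param).flatMap(s => Try(s.toInt).toOption).getOrElse(default)
```

The same helper works for both endpoints; only the default (50 for the error list, 1000 for CSV export) differs.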
```scala
      val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
      DataQualityMonitor.exportToCsv(errorType, limit)
```
Same `NumberFormatException` risk in `exportCsv`.
Apply the same fix here for consistent error handling.
```diff
-    val limit = Option(limitParam).map(_.toInt).getOrElse(1000)
+    val limit = Option(limitParam).flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(1000)
```
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/DataQuality.scala
around lines 216 to 217, the current conversion
Option(limitParam).map(_.toInt).getOrElse(1000) can throw NumberFormatException
for bad input; change it to a safe parse that falls back to 1000 on invalid
input (e.g. use scala.util.Try(limitParam.toInt).toOption.getOrElse(1000) or
limitParam.toIntOption.getOrElse(1000) depending on Scala version) and
optionally log or ignore the invalid value, then call
DataQualityMonitor.exportToCsv(errorType, limit) with the safely parsed limit.
@JJ-Author can you please review this
Actionable comments posted: 1
♻️ Duplicate comments (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (1)
25-30: Thread-safety issue persists in `getErrorDetails`.

The mutable `ListBuffer` may be concurrently modified by `recordError` while this method reads from it. This was flagged in a previous review and remains unaddressed.

```diff
 def getErrorDetails(errorType: String, limit: Int = 100): List[ExtractionError] = {
   errorDetails.get(errorType) match {
-    case Some(errors) => errors.take(limit).toList
+    case Some(errors) => errors.synchronized { errors.take(limit).toList }
     case None => List.empty
   }
 }
```
🧹 Nitpick comments (1)
server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1)
44-59: Consider using JSON parsing for more robust assertions.

String-based assertions like `should include("\"totalErrors\": 5")` are fragile and sensitive to formatting changes. A JSON parser would make tests more resilient. Example with a JSON library:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

val json = parse(dataQuality.getMetricsJson)
(json \ "totalErrors").extract[Int] should be(5)
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala` (1 hunks)
- `core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala` (1 hunks)
- `server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (5)
`DataQualityMonitor` (11-45), `DataQualityMonitor` (47-123), `reset` (41-44), `forExtractor` (17-19), `logInvalidData` (51-63)
core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (1)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (12)
`DataQualityMonitor` (11-45), `DataQualityMonitor` (47-123), `reset` (41-44), `forExtractor` (17-19), `logInvalidData` (51-63), `getMetrics` (73-78), `getTotalErrors` (80-80), `getGlobalMetrics` (21-23), `getErrorDetails` (25-30), `exportToCsv` (32-39), `logSkipped` (65-67), `logSuccess` (69-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
🔇 Additional comments (13)
core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala (5)
1-6: LGTM: Appropriate concurrency primitives selected. Using `TrieMap` and `AtomicLong` from `java.util.concurrent` provides thread-safe foundations for the concurrent access patterns this monitor will encounter.

47-78: LGTM: Per-extractor monitor implementation is well-structured. The separation between logging methods (`logInvalidData`, `logSkipped`, `logSuccess`) provides clear semantics for different extraction outcomes. The `getMetrics()` filtering by prefix cleanly isolates per-extractor data.

102-106: Synchronization approach differs from past suggestion but is valid. The current implementation synchronizes on the entire `errorDetails` TrieMap rather than on individual buffers as previously suggested. While this is coarse-grained and could cause contention under high concurrency with different error types, it is thread-safe and acceptable for a monitoring utility.

109-122: LGTM: Helper methods are clean and functional. Error categorization logic is sensible, prioritizing exception type when available and falling back to keyword-based categorization. The `truncate` helper safely handles string length.

125-131: LGTM: Clean data model for error details. The `ExtractionError` case class captures the essential error metadata with appropriate use of `Option` for the nullable exception type.

core/src/test/scala/org/dbpedia/extraction/util/DataQualityMonitorTest.scala (4)
1-9: LGTM: Test setup is appropriate. Using `BeforeAndAfter` with `DataQualityMonitor.reset()` ensures test isolation, preventing state leakage between tests.

17-39: LGTM: Error recording and categorization tests are well-structured. Tests validate both exception-based and keyword-based categorization paths with appropriate assertions on metrics.

107-123: LGTM: Concurrency test validates thread-safe counting. The test correctly spawns multiple threads, waits for completion with `join()`, and verifies the expected total count. This validates the `AtomicLong` increment behavior under contention.

125-144: LGTM: Thorough field validation for error details. The test comprehensively verifies that all `ExtractionError` fields are correctly populated, including the optional `exceptionType`.

server/src/test/scala/org/dbpedia/extraction/server/resources/DataQualityTest.scala (4)
1-12: LGTM: Server test setup is appropriate. Direct instantiation of `DataQuality` for testing allows isolation from the full server stack while still exercising the resource logic.

14-42: LGTM: HTML rendering tests cover key scenarios. Tests validate the empty state, the populated state with multiple extractors, and proper grouping, covering the main rendering paths.

93-99: Test validates current escaping behavior but not full CSV safety. This test correctly expects commas to be replaced with semicolons as currently implemented. If the CSV escaping in `DataQualityMonitor.exportToCsv` is updated to use RFC 4180 quoting, this test will need to be updated accordingly.

131-149: LGTM: Concurrent access test validates thread safety under mixed read/write load. Interleaving `getMetricsJson` calls during concurrent logging effectively tests the monitor's behavior under realistic concurrent access patterns.
```scala
  def exportToCsv(errorType: String, limit: Int = 1000): String = {
    val errors = getErrorDetails(errorType, limit)
    val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
    val rows = errors.map(e =>
      s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
    ).mkString("\n")
    header + rows
  }
```
CSV export is vulnerable to malformed output.
Only `message` is escaped (commas replaced with semicolons), but `extractorName` and `pageTitle` could also contain commas, newlines, or quotes that would corrupt the CSV structure.
Consider proper RFC 4180 CSV escaping:
```diff
 def exportToCsv(errorType: String, limit: Int = 1000): String = {
   val errors = getErrorDetails(errorType, limit)
   val header = "Extractor,PageTitle,ErrorMessage,Timestamp\n"
+  def escapeCsvField(field: String): String = {
+    if (field.contains(",") || field.contains("\"") || field.contains("\n"))
+      "\"" + field.replace("\"", "\"\"") + "\""
+    else field
+  }
   val rows = errors.map(e =>
-    s"${e.extractorName},${e.pageTitle},${e.message.replaceAll(",", ";")},${e.timestamp}"
+    s"${escapeCsvField(e.extractorName)},${escapeCsvField(e.pageTitle)},${escapeCsvField(e.message)},${e.timestamp}"
   ).mkString("\n")
   header + rows
 }
```

🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/DataQualityMonitor.scala
around lines 32 to 39, the CSV export currently only sanitizes the message field
and so extractorName and pageTitle (and message newlines/quotes) can break CSV
structure; change the implementation to produce RFC4180-compliant CSV by either:
1) using a CSV library (e.g., OpenCSV or Apache Commons CSV) to build rows
safely, or 2) implementing RFC4180 escaping for each field — wrap every field in
double quotes, double any internal double quotes, and convert newlines to \r\n
(or preserve them inside quoted fields) before joining with commas; ensure the
header remains and apply the escaping to extractorName, pageTitle, and message
(and timestamp if necessary).



Adding a monitoring system to track extraction errors and data quality issues. Currently integrated with HomepageExtractor as a proof of concept.
Problem
Many extractors silently discard invalid data without any logging, which makes it impossible to debug extraction issues or measure data quality.
Solution
Created `DataQualityMonitor` - a simple utility to log and track extraction errors.

Usage:
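The original usage snippet did not survive the copy here. The following is a hedged, self-contained sketch of the logging pattern the PR describes, using a minimal stand-in monitor; `MiniQualityMonitor` and its signatures are assumptions for illustration, not the actual `DataQualityMonitor` API:

```scala
import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicLong

// Stand-in for DataQualityMonitor, built on the thread-safe primitives the PR
// mentions (TrieMap + AtomicLong). Real names and signatures may differ.
object MiniQualityMonitor {
  private val counters = TrieMap.empty[String, AtomicLong]

  // Record one invalid-data event under a "<extractor>:InvalidData" key.
  def logInvalidData(extractor: String, pageTitle: String, reason: String): Unit =
    counters.getOrElseUpdate(s"$extractor:InvalidData", new AtomicLong(0)).incrementAndGet()

  def count(key: String): Long = counters.get(key).map(_.get()).getOrElse(0L)
}

// An extractor would log instead of silently dropping invalid data, e.g.:
// MiniQualityMonitor.logInvalidData("HomepageExtractor", title, "Invalid IRI")
```

The dashboard then only needs to read the counter map; no extractor ever blocks on reporting.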
Changes
New files:
- `core/.../util/DataQualityMonitor.scala` - Core monitoring logic
- `core/.../util/DataQualityMonitorTest.scala` - Tests
- `server/.../resources/DataQuality.scala` - Web dashboard at `/quality/`
- `documentation/data-quality-monitoring.md` - Docs

Modified:
- `core/.../mappings/HomepageExtractor.scala` - Added monitoring calls

Web Dashboard
Access at `http://localhost:9999/server/quality/` when running the server.

Shows:
Testing
Summary by CodeRabbit
New Features
Enhancements
Documentation
Tests
Chores