Skip to content

feat(low-code cdk): Allow for multiple schema_loaders to be defined for a stream#536

Merged
tolik0 merged 2 commits intomainfrom
brian/multiple_schema_loaders
May 8, 2025
Merged

feat(low-code cdk): Allow for multiple schema_loaders to be defined for a stream#536
tolik0 merged 2 commits intomainfrom
brian/multiple_schema_loaders

Conversation

@brianjlai
Copy link
Contributor

@brianjlai brianjlai commented May 8, 2025

What

While migrating airbytehq/airbyte#58105, @tolik0 identified that it was very tedious and limiting to introduce static schema fields into a dynamic schema loader because it required everything to be defined in a transformation.

This PR allows for the definition of multiple schema_loaders on a single stream that are then merged together so that we can use the more intuitive and readable InlineSchemaLoader in combination with DynamicSchemaLoader

How

Modifies the low code manifest interface to accept either a list or a single loader to retain backwards compatibility

Under the hood we have a CompositeSchemaLoader which combines the properties of each schema loader together with earlier defined schema loaders taking priority. This is more intuitive so users don't need to learn or understand another component compared to just a regular list of schema loaders.

Summary by CodeRabbit

  • New Features
    • Added support for specifying multiple schema loaders in declarative streams, enabling combined schema definitions with precedence for earlier loaders.
  • Bug Fixes
    • Ensured properties from earlier schema loaders are preserved during schema merging and not overwritten by later loaders.
  • Tests
    • Added unit tests validating the merging behavior of multiple schema loaders and the correct creation of composite schema loaders in declarative streams.

@github-actions github-actions bot added the enhancement New feature or request label May 8, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented May 8, 2025

📝 Walkthrough

Walkthrough

This change introduces support for specifying multiple schema loaders in declarative streams by allowing the schema_loader property to accept a list of loaders. It adds a new CompositeSchemaLoader class to merge schemas, updates model annotations and the factory logic, and adds unit tests for the new functionality.

Changes

File(s) Change Summary
airbyte_cdk/sources/declarative/declarative_component_schema.yaml Expanded the schema_loader property description and schema to allow a single loader or an array of loaders, clarifying precedence and merging behavior.
airbyte_cdk/sources/declarative/models/declarative_component_schema.py Updated the type annotation for schema_loader in DeclarativeStream to support a single loader or a list of loaders.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py Modified the factory to handle lists of schema loaders by wrapping them in a new CompositeSchemaLoader.
airbyte_cdk/sources/declarative/schema/composite_schema_loader.py Added the new CompositeSchemaLoader class, which merges schemas from multiple loaders, preserving property precedence.
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py Added a test to verify that multiple schema loaders are correctly composed into a CompositeSchemaLoader.
unit_tests/sources/declarative/schema/test_composite_schema_loader.py Introduced new tests for CompositeSchemaLoader, checking correct schema merging and property precedence.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DeclarativeStream
    participant ModelToComponentFactory
    participant CompositeSchemaLoader
    participant SchemaLoader(s)

    User->>DeclarativeStream: Define stream with schema_loader (single or list)
    DeclarativeStream->>ModelToComponentFactory: Pass model with schema_loader(s)
    ModelToComponentFactory->>CompositeSchemaLoader: If list, wrap loaders
    CompositeSchemaLoader->>SchemaLoader(s): Call get_json_schema() on each
    CompositeSchemaLoader->>ModelToComponentFactory: Merge schemas, return combined
    ModelToComponentFactory->>DeclarativeStream: Return stream with composite loader
Loading

Suggested reviewers

  • darynaishchenko
  • maxi297

Would you like to add more integration tests for edge cases, such as deeply nested property conflicts, to further ensure the robustness of the schema merging logic? Wdyt?

Tip

⚡️ Faster reviews with caching
  • CodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure Review - Disable Cache at either the organization or repository level. If you prefer to disable all data retention across your organization, simply turn off the Data Retention setting under your Organization Settings.

Enjoy the performance boost—your workflow just got faster.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between deccb76 and 6b2eb13.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Analyze (python)
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1928-1934: ⚠️ Potential issue

Fix type check errors in the implementation

There are mypy errors in this section:

  1. Line 1928: Type error when passing model.schema_loader to _create_component_from_model
  2. Line 1934: Incompatible types in assignment (DefaultSchemaLoader vs CompositeSchemaLoader)

The first error is because mypy expects a BaseModel but gets a list. The second might be related to how the else branch assigns a DefaultSchemaLoader to a variable that previously held a CompositeSchemaLoader.

To fix the type issues, consider using a type annotation with Union for the schema_loader variable:

+        schema_loader: Union[CompositeSchemaLoader, DefaultSchemaLoader, JsonFileSchemaLoader, InlineSchemaLoader, DynamicSchemaLoader]
         if model.schema_loader and isinstance(model.schema_loader, list):
             nested_schema_loaders = [
                 self._create_component_from_model(model=schema_loader, config=config)
                 for schema_loader in model.schema_loader
             ]
             schema_loader = CompositeSchemaLoader(
                 schema_loaders=nested_schema_loaders, parameters=model.parameters or {}
             )
         elif model.schema_loader:
             schema_loader = self._create_component_from_model(
                 model=model.schema_loader, config=config
             )
         else:
             options = model.parameters or {}
             if "name" not in options:
                 options["name"] = model.name
             schema_loader = DefaultSchemaLoader(config=config, parameters=options)
🧰 Tools
🪛 GitHub Actions: Linters

[error] 1928-1928: mypy error: Argument "model" to "_create_component_from_model" of "ModelToComponentFactory" has incompatible type "InlineSchemaLoader | DynamicSchemaLoader | JsonFileSchemaLoader | CustomSchemaLoader | list[InlineSchemaLoader | DynamicSchemaLoader | JsonFileSchemaLoader | CustomSchemaLoader]"; expected "BaseModel" [arg-type]


[error] 1934-1934: mypy error: Incompatible types in assignment (expression has type "DefaultSchemaLoader", variable has type "CompositeSchemaLoader") [assignment]

🧹 Nitpick comments (10)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1424-1427: Polish description grammar?
The new description is clear, but maybe we could tighten up the phrasing—for example, “Schema loaders defined first taking precedence…” could read better as “Loaders listed first take precedence in case of conflict.” Would you be up for a quick tweak here? wdyt?


1428-1439: Consider tightening the array schema constraints?
The addition of - type: array with items.anyOf nicely enables multiple loaders. A couple of ideas to guard against edge cases:

  • Add minItems: 1 to ensure the array isn’t empty.
  • Add uniqueItems: true to prevent duplicate loader definitions.
  • Switch anyOf to oneOf if you want each value to match exactly one branch (single loader vs array) rather than any matching subset.

Any of these sound useful to you? wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1918-1926: Good implementation of multiple schema loader support, but consider passing model parameters

The implementation correctly handles both list and single schema loader cases, maintaining backward compatibility. However, you're creating the CompositeSchemaLoader with an empty parameters dictionary. Should you pass model.parameters or {} instead to be consistent with other component creations? wdyt?

-            schema_loader = CompositeSchemaLoader(
-                schema_loaders=nested_schema_loaders, parameters={}
-            )
+            schema_loader = CompositeSchemaLoader(
+                schema_loaders=nested_schema_loaders, parameters=model.parameters or {}
+            )

1918-1926: Consider adding a brief comment explaining the new functionality

It would be helpful to add a comment explaining the purpose of allowing multiple schema loaders and how they're combined. This makes the code more maintainable.

         if model.schema_loader and isinstance(model.schema_loader, list):
+            # Handle multiple schema loaders by creating a CompositeSchemaLoader
+            # which merges the properties from each schema loader, with earlier loaders taking priority
             nested_schema_loaders = [
                 self._create_component_from_model(model=schema_loader, config=config)
                 for schema_loader in model.schema_loader
             ]
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

4567-4633: Good test case for multiple schema loaders functionality.

This test verifies that the factory correctly handles declarative streams defined with multiple schema loaders specified as a list. The test properly:

  1. Defines a stream with two different inline schemas
  2. Processes the manifest through parsing and transformation
  3. Verifies the resulting schema loader is a CompositeSchemaLoader containing two InlineSchemaLoader instances

Have you considered adding a test case that mixes different schema loader types (like JsonFileSchemaLoader with InlineSchemaLoader) to verify that works too? Also, would it be worth adding an assertion to verify the actual schema content after merging? Just thinking about edge cases, wdyt?

airbyte_cdk/sources/declarative/schema/composite_schema_loader.py (2)

11-32: Clean implementation of the CompositeSchemaLoader!

The implementation is straightforward and the docstring clearly explains the precedence behavior. The merging of schemas is handled efficiently.

One thing to consider: the current implementation only merges top-level properties, but doesn't perform deep merging of nested properties. In most cases this is probably fine, but would a deep merge be beneficial for some use cases? For example:

def get_json_schema(self) -> Mapping[str, Any]:
    combined_schema: Dict[str, Any] = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": ["null", "object"],
        "properties": {},
    }
    for schema_loader in self.schema_loaders:
        schema_properties = schema_loader.get_json_schema()["properties"]
-        combined_schema["properties"] = {**schema_properties, **combined_schema["properties"]}
+        # Deep merge properties
+        self._deep_merge_properties(schema_properties, combined_schema["properties"])
    return combined_schema

+ def _deep_merge_properties(self, source: Dict[str, Any], target: Dict[str, Any]) -> None:
+     """Deep merge source properties into target, with target taking precedence."""
+     for key, value in source.items():
+         if key not in target:
+             target[key] = value
+         elif isinstance(value, dict) and isinstance(target[key], dict):
+             if "properties" in value and "properties" in target[key]:
+                 if "properties" not in target[key]:
+                     target[key]["properties"] = {}
+                 self._deep_merge_properties(value["properties"], target[key]["properties"])

Would that be useful? Or perhaps overkill for the current use cases?


29-30: Schema merging logic works but could be clearer!

The current implementation has properties from earlier loaders taking precedence, which is mentioned in the docstring. However, the dictionary unpacking order might be confusing at first glance.

Consider making the precedence more explicit with a comment or by rearranging the expression:

- combined_schema["properties"] = {**schema_properties, **combined_schema["properties"]}
+ # Ensure properties from earlier loaders take precedence
+ combined_schema["properties"] = {**schema_properties, **combined_schema["properties"]}

Or alternatively:

- combined_schema["properties"] = {**schema_properties, **combined_schema["properties"]}
+ # First apply new properties, then keep existing ones (which take precedence)
+ new_properties = dict(schema_properties)  # Make a copy
+ new_properties.update(combined_schema["properties"])  # Existing properties override new ones
+ combined_schema["properties"] = new_properties

This makes the intention clearer even to readers who aren't familiar with dictionary unpacking behavior. What do you think?

unit_tests/sources/declarative/schema/test_composite_schema_loader.py (3)

55-151: Great test for the dynamic/static schema combination!

This test thoroughly verifies that the CompositeSchemaLoader correctly combines properties from both static and dynamic schema loaders.

One suggestion - it might be helpful to add assertions that specifically verify the schema structure beyond just comparing the entire expected schema. For example:

# Verify specific aspects of the schema
assert "dynamic_property_one" in schema["properties"]
assert "id" in schema["properties"]
assert schema["properties"]["nestedDisplayOptions"]["properties"]["theme"]["type"] == ["null", "string"]

This would make it clearer what aspects of the schema are most important to validate and provide more specific failure messages if something goes wrong. What do you think?


153-211: Good test for property precedence behavior!

This test properly verifies that properties from earlier schema loaders take precedence over later ones.

Consider adding a comment explaining why certain properties appear the way they do in the expected schema. For example, a comment noting that "updatedAt" from the first loader takes precedence over the version in the second loader, and "id" from the first loader takes precedence over the version in the third loader. This would make the test's intent clearer to future readers.

# Test specifically verifies:
# - "updatedAt" from first loader (format: date-time) overrides second loader (format: date)
# - "id" from first loader (type: string) overrides third loader (type: integer) 
# - "organization" from second loader is included
expected_schema = {
    ...
}

Would that help clarify the test's purpose?


1-211: Consider adding tests for edge cases!

The tests cover the main functionality well, but there are a few edge cases that might be worth testing.

Consider adding tests for:

  1. Empty schema_loaders list
  2. Schema loader that returns a schema without properties
  3. Schema loader that returns None or raises an exception

These would help ensure the CompositeSchemaLoader is robust against unexpected inputs. Would these additional tests be valuable?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8906ed2 and deccb76.

📒 Files selected for processing (6)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2 hunks)
  • airbyte_cdk/sources/declarative/schema/composite_schema_loader.py (1 hunks)
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (2 hunks)
  • unit_tests/sources/declarative/schema/test_composite_schema_loader.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1)
  • DynamicSchemaLoader (119-285)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

[error] 1928-1928: mypy error: Argument "model" to "_create_component_from_model" of "ModelToComponentFactory" has incompatible type "InlineSchemaLoader | DynamicSchemaLoader | JsonFileSchemaLoader | CustomSchemaLoader | list[InlineSchemaLoader | DynamicSchemaLoader | JsonFileSchemaLoader | CustomSchemaLoader]"; expected "BaseModel" [arg-type]


[error] 1934-1934: mypy error: Incompatible types in assignment (expression has type "DefaultSchemaLoader", variable has type "CompositeSchemaLoader") [assignment]

⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

507-507: Clean import addition for CompositeSchemaLoader

The import is correctly placed in the schema section, maintaining the organizational structure of the imports.

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

155-156: Well-structured import changes.

Adding the necessary imports for InlineSchemaLoader and CompositeSchemaLoader to support testing the new functionality.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

2150-2164: Nice extension of the schema_loader field!

This change allows the flexibility to use multiple schema loaders while maintaining backward compatibility. The Union type correctly handles both single schema loaders and lists of loaders.

Just a thought - does the implementation handle any possible conflicts or overlaps between schemas from different loaders? Looking at the CompositeSchemaLoader implementation, it seems earlier loaders take precedence, which is likely the expected behavior, but might be worth documenting this behavior in the field description, wdyt?

@tolik0 tolik0 merged commit bcfcf04 into main May 8, 2025
30 of 32 checks passed
@tolik0 tolik0 deleted the brian/multiple_schema_loaders branch May 8, 2025 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants