Skip to content

feat(source-declarative-manifest): add support for custom Python components from dynamic text input#174

Merged
aaronsteers merged 43 commits intomainfrom
aj/feat/accept-components-text-input
Jan 22, 2025
Merged

feat(source-declarative-manifest): add support for custom Python components from dynamic text input#174
aaronsteers merged 43 commits intomainfrom
aj/feat/accept-components-text-input

Conversation

@aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Dec 14, 2024

Summary

Adds ability to parse components.py text passed in the config. The feature is disabled-by-default in all of our current runtimes, but it unblocks the next stage of development and testing of the Platform. See below security section for more info.

New feature

The source-declarative-manifest connector will now accept these config args:

  • __injected_manifest: (already existing config) this is how we receive the manifest yaml.
  • __injected_components_py: (new) the Python components.py text passed from the Builder
  • __injected_components_py_checksums: (new) one or more checksum to ensure code is not tampered with in flight:
    • md5
    • sha256

Notes on security

  • The connector will not parse or accept any custom code unless the AIRBYTE_ALLOW_CUSTOM_CODE env var is set to true.
  • This feature will only be turned on when we know we are in a sandboxed/untrusted execution context (or when the user is ourselves).
  • For now, at least one checksum is required for tamper-detection. Note that this check doesn't attempt to check if the code is "safe" or "approved", only that it hasn't been modified in flight.

Summary by CodeRabbit

  • New Features

    • Enhanced validation for configuration and checksum handling.
    • Implemented a new pagination strategy for API interactions.
    • Added a new configuration for integrating with The Guardian API.
    • New start_date configuration option added for relevant time periods.
    • Introduced a custom exception for testing error handling scenarios.
    • Added support for dynamic module loading and registration based on configuration.
    • Added a new task for locking dependencies without updates.
  • Bug Fixes

    • Improved error handling for custom code execution.
    • Enhanced module loading and component registration mechanisms.
  • Documentation

    • Added README for Guardian API test resources.
    • Updated docstrings for improved clarity.
  • Chores

    • Updated type hinting in various modules.
    • Updated .gitignore to exclude sensitive files.

@aaronsteers aaronsteers changed the title [draft] skeleton: components module from dynamic text input feat: add support for custom Python components from dynamic text input Jan 13, 2025
@aaronsteers aaronsteers changed the title feat: add support for custom Python components from dynamic text input feat(source-declarative-manifest): add support for custom Python components from dynamic text input Jan 13, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jan 13, 2025
@aaronsteers aaronsteers marked this pull request as ready for review January 13, 2025 23:33
@coderabbitai

This comment was marked as off-topic.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (7)
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1)

21-24: Consider using snake_case for variable names to follow Python conventions

Variable names currPage and totalPages are using camelCase, which is less common in Python. Could we rename them to current_page and total_pages to match PEP 8 naming conventions? WDYT?

unit_tests/source_declarative_manifest/conftest.py (1)

13-20: Should we handle invalid hash_type inputs gracefully?

Currently, if an invalid hash_type is provided to hash_text, it raises a KeyError. Would it be better to handle this case and raise a more informative exception or provide a default behavior? WDYT?

airbyte_cdk/cli/source_declarative_manifest/_run.py (1)

174-178: LGTM! Consider enhancing the error message with the actual value?

The type validation is a great addition! Would you consider adding the actual value to the error message to make debugging easier? Something like:

-                f"but got type: {type(config['__injected_declarative_manifest'])}"
+                f"but got type: {type(config['__injected_declarative_manifest'])} with value: {config['__injected_declarative_manifest']!r}"

wdyt?

unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (2)

42-60: Consider adding edge cases to the test?

The basic functionality test looks good! Would you consider adding some edge cases like:

  • Empty string input
  • Invalid Python syntax
  • Module with syntax errors
    wdyt?

89-134: Comprehensive integration test, but might need error cases

The happy path testing is thorough! Would you consider adding test cases for:

  1. Missing or invalid checksums
  2. Invalid manifest structure
  3. Error handling during source creation

Also, the start_date truncation comment could be more descriptive about why 2 days is chosen:

-    # Truncate the start_date to speed up tests
+    # Truncate the start_date to 2 days ago to reduce API calls while still testing recent data
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (2)

63-156: Consider reducing duplication in stream definitions?

I notice that base_stream and content_stream have identical configurations for incremental_sync, retriever, and their sub-components. Would it make sense to use YAML anchors and aliases to reduce this duplication? For example:

  base_stream: &base_stream_config
    incremental_sync:
      # ... existing config ...
    retriever:
      # ... existing config ...

  content_stream:
    <<: *base_stream_config
    schema_loader:
      # ... specific schema config ...

This would make the configuration more maintainable and less prone to drift. wdyt?


327-327: Consider tightening the date pattern regex?

The current pattern ^([1-9][0-9]{3})\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])$ allows some invalid dates (like 2023-02-31). Would it be helpful to use a more restrictive pattern that validates actual calendar dates? wdyt?

Also applies to: 374-374

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 216cd43 and c837745.

📒 Files selected for processing (11)
  • airbyte_cdk/cli/source_declarative_manifest/_run.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • airbyte_cdk/test/utils/manifest_only_fixtures.py (3 hunks)
  • pyproject.toml (1 hunks)
  • unit_tests/source_declarative_manifest/conftest.py (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/.gitignore (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/README.md (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/valid_config.yaml (1 hunks)
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/.gitignore
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/valid_config.yaml
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/README.md
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (9)
unit_tests/source_declarative_manifest/conftest.py (1)

23-23: Adding return type annotation looks good!

The addition of the return type annotation -> str improves code readability and type checking.

airbyte_cdk/test/utils/manifest_only_fixtures.py (1)

54-68: ⚠️ Potential issue

Security consideration: Using exec() on user input

The implementation looks clean, but executing arbitrary Python code could be dangerous. Should we add a warning in the docstring about the security implications? Also, consider adding input validation or sandboxing if this will be used with untrusted input.

Let's check if this is used with user input:

unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1)

28-35: LGTM! Simple and clear test fixture.

The sample components are well-structured and demonstrate both function and class implementations.

pyproject.toml (1)

129-129: LGTM! Useful addition for dependency management.

The new lock task provides a clean way to update the lockfile without updating dependencies.

unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (5)

312-312: Consider restricting additional properties?

additionalProperties: true allows any unknown properties in the configuration. Given this is an API connector, would it be safer to set this to false to catch potential typos in property names? wdyt?


297-300: LGTM! Simple and effective check configuration.

The check configuration properly validates the essential "content" stream.


304-304: LGTM! Well-documented with proper security considerations.

The documentation URLs and descriptions are clear, and the API key is properly marked as airbyte_secret: true.

Also applies to: 317-319


1-1: Consider using a more specific version format?

The version "4.3.2" seems to be using semantic versioning, but it might be helpful to align it with the actual Guardian API version if one exists. wdyt?

Let me check if there's a specific version format used by The Guardian API:


45-45: Verify the availability of the custom pagination class?

The CustomPageIncrement class is referenced in multiple places. Let's verify its existence and implementation:

Also applies to: 91-92, 139-139, 231-231

✅ Verification successful

The CustomPageIncrement class is properly implemented and available! 👍

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the custom pagination class exists and examine its implementation
rg -l "class CustomPageIncrement" 
ast-grep --pattern 'class CustomPageIncrement { $$$ }'

Length of output: 179


Script:

#!/bin/bash
# Examine the implementation of CustomPageIncrement class
rg -A 10 "class CustomPageIncrement" unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py

Length of output: 577

coderabbitai bot added a commit that referenced this pull request Jan 13, 2025
Docstrings generation was requested by @aaronsteers.

* #174 (comment)

The following files were modified:

* `airbyte_cdk/cli/source_declarative_manifest/_run.py`
* `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`
* `airbyte_cdk/test/utils/manifest_only_fixtures.py`
* `unit_tests/source_declarative_manifest/conftest.py`
* `unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py`
* `unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py`
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 13, 2025

Note

We have generated docstrings for this pull request, at #218

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

990-993: Consider adding error handling for missing components module.

The code assumes the components module will be available. Should we add more descriptive error messages for common failure scenarios? WDYT?

     custom_component_class = self._get_class_from_fully_qualified_class_name(
         full_qualified_class_name=model.class_name,
-        components_module=self._get_components_module_object(config=config),
+        components_module=self._get_components_module_object(config=config),
+    ) if hasattr(model, 'class_name') else None
+    
+    if not custom_component_class:
+        raise ValueError(f"Could not create custom component. Missing class_name in {model}")

1084-1107: Consider documenting security implications of executing code from configuration.

Based on the retrieved learnings, custom Python components are intentionally unrestricted to support unpredictable use cases. Should we add a docstring warning about the security implications? WDYT?

     def _get_components_module_object(
         config: Config,
     ) -> types.ModuleType:
-        """Get a components module object based on the provided config.
+        """Get a components module object based on the provided config and execute the code.
 
         If custom python components is provided, this will be loaded. Otherwise, we will
         attempt to load from the `components` module already imported.
+        
+        Warning:
+            This method executes Python code from configuration without restrictions.
+            Ensure that the configuration is from a trusted source as it has full access
+            to Python's capabilities including file operations and network access.
         """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c837745 and c54a73d.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
🧰 Additional context used
📓 Learnings (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1063-1067
Timestamp: 2025-01-13T23:49:48.658Z
Learning: When implementing code execution from configuration, use a combination of security measures:
1. AST validation to detect dangerous operations
2. RestrictedPython or custom restricted execution environment
3. Limited builtins to prevent access to dangerous functions
4. Import hooks and global restrictions to prevent file/network access
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1063-1067
Timestamp: 2025-01-13T23:38:28.436Z
Learning: When executing Python code from configuration, implement security measures like AST validation, sandboxing, or restricted execution environments to prevent security vulnerabilities.
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1093-1102
Timestamp: 2025-01-14T00:20:32.310Z
Learning: In the `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` file, the strict module name checks in `_get_class_from_fully_qualified_class_name` (requiring `module_name` to be "components" and `module_name_full` to be "source_declarative_manifest.components") are intentionally designed to provide early, clear feedback when class declarations won't be found later in execution. These restrictions may be loosened in the future if the requirements for class definition locations change.
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

10-11: LGTM! The new imports support the custom components functionality.

The addition of sys and types modules is appropriate for dynamic module creation and management.


1046-1083: The module name validation aligns with the documented requirements.

Based on the retrieved learnings, the strict module name checks are intentionally designed to provide early, clear feedback when class declarations won't be found later in execution.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/declarative/test_manifest_declarative_source.py (2)

19-19: Would you consider expanding the comment to be more specific about why it's needed?

The current comment indicates it's needed for dynamic imports, but it might be helpful to explain which specific dynamic imports rely on this module, wdyt?

-import unit_tests.sources.declarative.external_component  # Needed for dynamic imports to work
+import unit_tests.sources.declarative.external_component  # Required for custom component dynamic imports in test_valid_manifest

268-271: LGTM! Consider extracting these assertions into a helper method?

The assertions are well-structured and verify the module loading hierarchy. To improve maintainability and reusability, would you consider extracting them into a helper method, wdyt?

+def assert_modules_loaded():
+    """Verify that all required modules are properly loaded."""
+    assert "unit_tests" in sys.modules
+    assert "unit_tests.sources" in sys.modules
+    assert "unit_tests.sources.declarative" in sys.modules
+    assert "unit_tests.sources.declarative.external_component" in sys.modules
+
 def test_valid_manifest():
     manifest = {
         # ... manifest configuration ...
     }
+    assert_modules_loaded()
-    assert "unit_tests" in sys.modules
-    assert "unit_tests.sources" in sys.modules
-    assert "unit_tests.sources.declarative" in sys.modules
-    assert "unit_tests.sources.declarative.external_component" in sys.modules
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c35577 and e6b28b6.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/test_manifest_declarative_source.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/declarative/test_manifest_declarative_source.py (1)

Line range hint 272-1207: No changes in this section.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (3)

140-148: Consider using context manager for temporary file cleanup.

The temporary files created with delete=False aren't automatically cleaned up. Should we ensure cleanup using a try-finally block? Wdyt?

-    with NamedTemporaryFile(delete=False, suffix=".json") as temp_config_file:
+    temp_config_file = NamedTemporaryFile(delete=False, suffix=".json")
+    try:
         json_str = json.dumps(py_components_config_dict)
         Path(temp_config_file.name).write_text(json_str)
         temp_config_file.flush()
         source = create_declarative_source(
             ["check", "--config", temp_config_file.name],
         )
+    finally:
+        os.unlink(temp_config_file.name)

Also applies to: 174-182, 267-274


298-301: Specify exception type in except block.

The bare except could catch and mask important exceptions like KeyboardInterrupt. Should we specify the expected exception type? Wdyt?

-            with pytest.raises(Exception):
+            with pytest.raises((ValueError, RuntimeError)):  # specify expected exceptions
                 for msg in msg_iterator:
                     assert msg
             return

237-241: Consider creating a mock source for testing.

The TODO comment suggests creating a test source that doesn't require credentials. This would make the tests more reliable and faster. Would you like me to help create a mock source implementation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6b28b6 and f29f616.

📒 Files selected for processing (1)
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (2)

1-45: LGTM! Well-organized imports and clear test data.

The imports are comprehensive and properly typed, and the sample component code provides a good test case.


102-120: Consider adding error handling for YAML parsing.

The YAML parsing could fail with malformed files. Should we add try-except blocks to provide more informative error messages? Wdyt?

-    manifest_dict = yaml.safe_load(manifest_yml_path.read_text())
+    try:
+        manifest_dict = yaml.safe_load(manifest_yml_path.read_text())
+    except yaml.YAMLError as e:
+        raise ValueError(f"Failed to parse manifest file: {e}")

@aaronsteers aaronsteers merged commit 3a9ab87 into main Jan 22, 2025
19 of 20 checks passed
@aaronsteers aaronsteers deleted the aj/feat/accept-components-text-input branch January 22, 2025 02:35
@aaronsteers aaronsteers linked an issue Feb 4, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(SDM): Ability to accept Python code with config.json

3 participants