feat(source-declarative-manifest): add support for custom Python components from dynamic text input#174
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (7)
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1)
21-24: Consider using snake_case for variable names to follow Python conventionsVariable names
currPageandtotalPagesare using camelCase, which is less common in Python. Could we rename them tocurrent_pageandtotal_pagesto match PEP 8 naming conventions? WDYT?unit_tests/source_declarative_manifest/conftest.py (1)
13-20: Should we handle invalidhash_typeinputs gracefully?Currently, if an invalid
hash_typeis provided tohash_text, it raises aKeyError. Would it be better to handle this case and raise a more informative exception or provide a default behavior? WDYT?airbyte_cdk/cli/source_declarative_manifest/_run.py (1)
174-178: LGTM! Consider enhancing the error message with the actual value?The type validation is a great addition! Would you consider adding the actual value to the error message to make debugging easier? Something like:
- f"but got type: {type(config['__injected_declarative_manifest'])}" + f"but got type: {type(config['__injected_declarative_manifest'])} with value: {config['__injected_declarative_manifest']!r}"wdyt?
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (2)
42-60: Consider adding edge cases to the test?The basic functionality test looks good! Would you consider adding some edge cases like:
- Empty string input
- Invalid Python syntax
- Module with syntax errors
wdyt?
89-134: Comprehensive integration test, but might need error casesThe happy path testing is thorough! Would you consider adding test cases for:
- Missing or invalid checksums
- Invalid manifest structure
- Error handling during source creation
Also, the start_date truncation comment could be more descriptive about why 2 days is chosen:
- # Truncate the start_date to speed up tests + # Truncate the start_date to 2 days ago to reduce API calls while still testing recent dataunit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (2)
63-156: Consider reducing duplication in stream definitions?I notice that
base_streamandcontent_streamhave identical configurations forincremental_sync,retriever, and their sub-components. Would it make sense to use YAML anchors and aliases to reduce this duplication? For example:base_stream: &base_stream_config incremental_sync: # ... existing config ... retriever: # ... existing config ... content_stream: <<: *base_stream_config schema_loader: # ... specific schema config ...This would make the configuration more maintainable and less prone to drift. wdyt?
327-327: Consider tightening the date pattern regex?The current pattern
^([1-9][0-9]{3})\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])$allows some invalid dates (like 2023-02-31). Would it be helpful to use a more restrictive pattern that validates actual calendar dates? wdyt?Also applies to: 374-374
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
airbyte_cdk/cli/source_declarative_manifest/_run.py(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(3 hunks)airbyte_cdk/test/utils/manifest_only_fixtures.py(3 hunks)pyproject.toml(1 hunks)unit_tests/source_declarative_manifest/conftest.py(1 hunks)unit_tests/source_declarative_manifest/resources/source_the_guardian_api/.gitignore(1 hunks)unit_tests/source_declarative_manifest/resources/source_the_guardian_api/README.md(1 hunks)unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py(1 hunks)unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml(1 hunks)unit_tests/source_declarative_manifest/resources/source_the_guardian_api/valid_config.yaml(1 hunks)unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py(1 hunks)
✅ Files skipped from review due to trivial changes (3)
- unit_tests/source_declarative_manifest/resources/source_the_guardian_api/.gitignore
- unit_tests/source_declarative_manifest/resources/source_the_guardian_api/valid_config.yaml
- unit_tests/source_declarative_manifest/resources/source_the_guardian_api/README.md
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (9)
unit_tests/source_declarative_manifest/conftest.py (1)
23-23: Adding return type annotation looks good!The addition of the return type annotation
-> strimproves code readability and type checking.airbyte_cdk/test/utils/manifest_only_fixtures.py (1)
54-68:⚠️ Potential issueSecurity consideration: Using exec() on user input
The implementation looks clean, but executing arbitrary Python code could be dangerous. Should we add a warning in the docstring about the security implications? Also, consider adding input validation or sandboxing if this will be used with untrusted input.
Let's check if this is used with user input:
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1)
28-35: LGTM! Simple and clear test fixture.The sample components are well-structured and demonstrate both function and class implementations.
pyproject.toml (1)
129-129: LGTM! Useful addition for dependency management.The new lock task provides a clean way to update the lockfile without updating dependencies.
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (5)
312-312: Consider restricting additional properties?
additionalProperties: trueallows any unknown properties in the configuration. Given this is an API connector, would it be safer to set this tofalseto catch potential typos in property names? wdyt?
297-300: LGTM! Simple and effective check configuration.The check configuration properly validates the essential "content" stream.
304-304: LGTM! Well-documented with proper security considerations.The documentation URLs and descriptions are clear, and the API key is properly marked as
airbyte_secret: true.Also applies to: 317-319
1-1: Consider using a more specific version format?The version "4.3.2" seems to be using semantic versioning, but it might be helpful to align it with the actual Guardian API version if one exists. wdyt?
Let me check if there's a specific version format used by The Guardian API:
45-45: Verify the availability of the custom pagination class?The
CustomPageIncrementclass is referenced in multiple places. Let's verify its existence and implementation:Also applies to: 91-92, 139-139, 231-231
✅ Verification successful
The CustomPageIncrement class is properly implemented and available! 👍
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check if the custom pagination class exists and examine its implementation rg -l "class CustomPageIncrement" ast-grep --pattern 'class CustomPageIncrement { $$$ }'Length of output: 179
Script:
#!/bin/bash # Examine the implementation of CustomPageIncrement class rg -A 10 "class CustomPageIncrement" unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.pyLength of output: 577
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py
Outdated
Show resolved
Hide resolved
Docstrings generation was requested by @aaronsteers. * #174 (comment) The following files were modified: * `airbyte_cdk/cli/source_declarative_manifest/_run.py` * `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` * `airbyte_cdk/test/utils/manifest_only_fixtures.py` * `unit_tests/source_declarative_manifest/conftest.py` * `unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py` * `unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py`
|
Note We have generated docstrings for this pull request, at #218 |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
990-993: Consider adding error handling for missing components module.The code assumes the components module will be available. Should we add more descriptive error messages for common failure scenarios? WDYT?
custom_component_class = self._get_class_from_fully_qualified_class_name( full_qualified_class_name=model.class_name, - components_module=self._get_components_module_object(config=config), + components_module=self._get_components_module_object(config=config), + ) if hasattr(model, 'class_name') else None + + if not custom_component_class: + raise ValueError(f"Could not create custom component. Missing class_name in {model}")
1084-1107: Consider documenting security implications of executing code from configuration.Based on the retrieved learnings, custom Python components are intentionally unrestricted to support unpredictable use cases. Should we add a docstring warning about the security implications? WDYT?
def _get_components_module_object( config: Config, ) -> types.ModuleType: - """Get a components module object based on the provided config. + """Get a components module object based on the provided config and execute the code. If custom python components is provided, this will be loaded. Otherwise, we will attempt to load from the `components` module already imported. + + Warning: + This method executes Python code from configuration without restrictions. + Ensure that the configuration is from a trusted source as it has full access + to Python's capabilities including file operations and network access. """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(3 hunks)
🧰 Additional context used
📓 Learnings (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1063-1067
Timestamp: 2025-01-13T23:49:48.658Z
Learning: When implementing code execution from configuration, use a combination of security measures:
1. AST validation to detect dangerous operations
2. RestrictedPython or custom restricted execution environment
3. Limited builtins to prevent access to dangerous functions
4. Import hooks and global restrictions to prevent file/network access
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1063-1067
Timestamp: 2025-01-13T23:38:28.436Z
Learning: When executing Python code from configuration, implement security measures like AST validation, sandboxing, or restricted execution environments to prevent security vulnerabilities.
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1093-1102
Timestamp: 2025-01-14T00:20:32.310Z
Learning: In the `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` file, the strict module name checks in `_get_class_from_fully_qualified_class_name` (requiring `module_name` to be "components" and `module_name_full` to be "source_declarative_manifest.components") are intentionally designed to provide early, clear feedback when class declarations won't be found later in execution. These restrictions may be loosened in the future if the requirements for class definition locations change.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
10-11: LGTM! The new imports support the custom components functionality.The addition of
sysandtypesmodules is appropriate for dynamic module creation and management.
1046-1083: The module name validation aligns with the documented requirements.Based on the retrieved learnings, the strict module name checks are intentionally designed to provide early, clear feedback when class declarations won't be found later in execution.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
unit_tests/sources/declarative/test_manifest_declarative_source.py (2)
19-19: Would you consider expanding the comment to be more specific about why it's needed?The current comment indicates it's needed for dynamic imports, but it might be helpful to explain which specific dynamic imports rely on this module, wdyt?
-import unit_tests.sources.declarative.external_component # Needed for dynamic imports to work +import unit_tests.sources.declarative.external_component # Required for custom component dynamic imports in test_valid_manifest
268-271: LGTM! Consider extracting these assertions into a helper method?The assertions are well-structured and verify the module loading hierarchy. To improve maintainability and reusability, would you consider extracting them into a helper method, wdyt?
+def assert_modules_loaded(): + """Verify that all required modules are properly loaded.""" + assert "unit_tests" in sys.modules + assert "unit_tests.sources" in sys.modules + assert "unit_tests.sources.declarative" in sys.modules + assert "unit_tests.sources.declarative.external_component" in sys.modules + def test_valid_manifest(): manifest = { # ... manifest configuration ... } + assert_modules_loaded() - assert "unit_tests" in sys.modules - assert "unit_tests.sources" in sys.modules - assert "unit_tests.sources.declarative" in sys.modules - assert "unit_tests.sources.declarative.external_component" in sys.modules
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/declarative/test_manifest_declarative_source.py(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/declarative/test_manifest_declarative_source.py (1)
Line range hint
272-1207: No changes in this section.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (3)
140-148: Consider using context manager for temporary file cleanup.The temporary files created with
delete=Falsearen't automatically cleaned up. Should we ensure cleanup using a try-finally block? Wdyt?- with NamedTemporaryFile(delete=False, suffix=".json") as temp_config_file: + temp_config_file = NamedTemporaryFile(delete=False, suffix=".json") + try: json_str = json.dumps(py_components_config_dict) Path(temp_config_file.name).write_text(json_str) temp_config_file.flush() source = create_declarative_source( ["check", "--config", temp_config_file.name], ) + finally: + os.unlink(temp_config_file.name)Also applies to: 174-182, 267-274
298-301: Specify exception type inexceptblock.The bare
exceptcould catch and mask important exceptions likeKeyboardInterrupt. Should we specify the expected exception type? Wdyt?- with pytest.raises(Exception): + with pytest.raises((ValueError, RuntimeError)): # specify expected exceptions for msg in msg_iterator: assert msg return
237-241: Consider creating a mock source for testing.The TODO comment suggests creating a test source that doesn't require credentials. This would make the tests more reliable and faster. Would you like me to help create a mock source implementation?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (2)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (2)
1-45: LGTM! Well-organized imports and clear test data.The imports are comprehensive and properly typed, and the sample component code provides a good test case.
102-120: Consider adding error handling for YAML parsing.The YAML parsing could fail with malformed files. Should we add try-except blocks to provide more informative error messages? Wdyt?
- manifest_dict = yaml.safe_load(manifest_yml_path.read_text()) + try: + manifest_dict = yaml.safe_load(manifest_yml_path.read_text()) + except yaml.YAMLError as e: + raise ValueError(f"Failed to parse manifest file: {e}")
Summary
Adds ability to parse
components.pytext passed in the config. The feature is disabled-by-default in all of our current runtimes, but it unblocks the next stage of development and testing of the Platform. See below security section for more info.New feature
The
source-declarative-manifestconnector will now accept these config args:__injected_manifest: (already existing config) this is how we receive the manifest yaml.__injected_components_py: (new) the Pythoncomponents.pytext passed from the Builder__injected_components_py_checksums: (new) one or more checksum to ensure code is not tampered with in flight:md5sha256Notes on security
AIRBYTE_ALLOW_CUSTOM_CODEenv var is set totrue.Summary by CodeRabbit
New Features
start_dateconfiguration option added for relevant time periods.Bug Fixes
Documentation
Chores
.gitignoreto exclude sensitive files.