Skip to content

Add paramstyle transformer#276

Merged
susodapop merged 17 commits intonative-query-params-stagingfrom
add-paramstyle-transformer
Nov 10, 2023
Merged

Add paramstyle transformer#276
susodapop merged 17 commits intonative-query-params-stagingfrom
add-paramstyle-transformer

Conversation

@susodapop
Copy link
Contributor

Background

The goal of this PR is to make it so that queries that include parameter markers in the older pyformat paramstyle as defined in PEP-249 can still run with connector >= 3.0.0 when use_inline_params=False. Without this change, any existing parameterised queries would not execute at all because they use a parameter marker syntax that DBR doesn't understand.

Thefore it's desirable that query like this:

SELECT * FROM table WHERE field = %(param)s

can be executed with the native parameter approach. We do this by automatically converting pyformat parameter markers into named format markers. So the above query would be executed as:

SELECT * FROM table WHERE field = :param

Description

There are a few significant changes in this PR:

  1. It implements a method called transform_paramstyle which performs a regex looking for named pyformat parameter markers in a query string and replaces them with named markers. If there are no pyformat markers, the query text is returned as-is. There are unit tests for this behaviour.
  2. It makes it so that transform_paramstyle is called on the query text whenever parameters are passed to cursor.execute() and use_inline_params=False.
  3. It updates all of our parameterised query tests to cope with this new dimension.

Most of the changes are to the e2e tests. Those changes are in short:

Define a 3-dimension test matrix for the roundtrip parameterisation tests

The components of the matrix are:

  • The type of Python value being parameterised
  • The approach to parameterisation being used
  • The paramstyle to be used in the query.

When running with the inline approach, only pyformat paramstyle is used.

Refactor roundtrip tests to use pytest.mark.parametrize

This makes the test-code easier to read. And also protects against unintended differences between assertions for different types.

This means that these tests:

  • test_dbsqlparam_inferred_none
  • test_dbsqlparam_inferred_bool
  • test_dbsqlparam_inferred_integer
  • test_dbsqlparam_inferred_double
  • test_dbsqlparam_inferred_date
  • test_dbsqlparam_inferred_timestamp
  • test_dbsqlparam_inferred_string
  • test_dbsqlparam_inferred_decimal

Are replaced with a single, parameterised test:

  • test_dbsql_param_with_inferrence

Ditto for the primitive and custom decimal type tests.

Up next

We'll switch the default parameter approach from inline to native and document all of this in README.parameters.md.

Jesse Whitehouse added 11 commits November 9, 2023 12:45
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

squash with two commits back

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
This commit updates the test fixtures so that we automatically run the
NATIVE tests with both paramstyles.

For parameter features that are new in version 3.0.0 (DbsqlParameter)
we don't run with both parameter styles because there is no need for
syntax backwards-compatibility when using the new DbsqlParameter class.

i.e. Nobody should ever use DbsqlParameter with a ParamStyle.PYFORMAT
query

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
We no longer need a separate 2 tests for doubles, as they are handled
in the new _eq() function.

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
@susodapop susodapop marked this pull request as ready for review November 9, 2023 23:18
…ormer

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
@benc-db
Copy link
Collaborator

benc-db commented Nov 9, 2023

Question: why %(var)s? Will it always be s?


_output_operation = operation

PYFORMAT_PARAMSTYLE_REGEX = r"%\((\w+)\)s"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this too restrictive? Can you link to pep or whatever that describes this? In particular I'm wondering if %( thing )s is allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the specification: https://peps.python.org/pep-0249/#paramstyle

It's ambiguous as to what format codes are expected to work. But in reality, since we're doing string interpolation under the covers, it doesn't matter so much what PEP-249 requires. Because if customers have queries that depend on this behaviour their queries will break if we don't do something to cover the use case.

@susodapop
Copy link
Contributor Author

Oh good question. I picked that final s because it's consistent with the only other e2e test we have that tests this behaviour. But since inject_parameters is just a convenience wrapper around python's string interpolation, it's very possible we have users with invocations like:

query = "SELECT * FROM users WHERE user_id = %(user_id)d"
parameters = {"user_id": 1234}
cursor.execute(query, parameters)

I can add an e2e test for this and revise the regex to capture-and-replace any valid format specifier. I wonder just how complex could be the format specifiers customers are using...depending on the edge-cases it could make sense to use a different approach for this.

What I'm thinking is that rather than regexing for the parameter marker, we could simply render it inline and then do str.replace() where we replace the rendered value with the new parameter marker. Thoughts?


_output_operation = operation

PYFORMAT_PARAMSTYLE_REGEX = r"%\((\w+)\)s"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have an existing regex for inline replacement today, can we reuse that logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. We're just doing string interpolation so Python handles it for us, in C.

Have a look at inject_parameters to see exactly what it does:

def inject_parameters(operation: str, parameters: Dict[str, str]):
return operation % parameters

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call to inject_parameters which replace parameters with the native format of parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what transform_paramstyle() already does...

@susodapop
Copy link
Contributor Author

This is the complete format Mini-Language specification.. For simplicity, I like the approach of simply rendering the pyformat marker with its parameter value and then replacing that with our new marker.

@benc-db
Copy link
Collaborator

benc-db commented Nov 9, 2023

Oh good question. I picked that final s because it's consistent with the only other e2e test we have that tests this behaviour. But since inject_parameters is just a convenience wrapper around python's string interpolation, it's very possible we have users with invocations like:

query = "SELECT * FROM users WHERE user_id = %(user_id)d"
parameters = {"user_id": 1234}
cursor.execute(query, parameters)

I can add an e2e test for this and revise the regex to capture-and-replace any valid format specifier. I wonder just how complex could be the format specifiers customers are using...depending on the edge-cases it could make sense to use a different approach for this.

What I'm thinking is that rather than regexing for the parameter marker, we could simply render it inline and then do str.replace() where we replace the rendered value with the new parameter marker. Thoughts?

So, the only examples I could find all use %()s. Even for things like dates. Now technically, because you were just doing python string interpolation before, people might have used crazier things than that, but I don't think we have any obligation to support things outside of the norm. If they are doing crazy things, the rendered value could be unexpected (consider if they passed you date formatting for example). I think the simple regex approach is fine.

@jadewang-db
Copy link

after this change if user use :param style query, it will work right?

@susodapop
Copy link
Contributor Author

@jadewang-db

after this change if user use :param style query, it will work right?

Yes, but only when the native parameter approach is used. It will fail at DBR when use_inline_params=True because DBR will see a parameter marker and expect that a TSparkParameter was passed in the request. But pysql will not have sent a TSparkParameter because the inline approach doesn't do that.

@susodapop
Copy link
Contributor Author

@benc-db

So, the only examples I could find all use %()s. Even for things like dates. Now technically, because you were just doing python string interpolation before, people might have used crazier things than that, but I don't think we have any obligation to support things outside of the norm. If they are doing crazy things, the rendered value could be unexpected (consider if they passed you date formatting for example). I think the simple regex approach is fine.

Works for me!

@benc-db
Copy link
Collaborator

benc-db commented Nov 10, 2023

This is the complete format Mini-Language specification.. For simplicity, I like the approach of simply rendering the pyformat marker with its parameter value and then replacing that with our new marker.

I'm skeptical that this is possible without having exactly as complicated a regex. Consider that a date format could lead to the word August being rendered. In order to know that August is what you're looking for, you would need to extract their formatting code. But if you have a regex that can extract their formatting code, you may as well just use that for finding and replacing without rendering the pyformat first. Willing to be proven wrong :P.

@susodapop
Copy link
Contributor Author

I'm skeptical that this is possible without having exactly as complicated a regex. Consider that a date format could lead to the word August being rendered. In order to know that August is what you're looking for, you would need to extract their formatting code. But if you have a regex that can extract their formatting code, you may as well just use that for finding and replacing without rendering the pyformat first. Willing to be proven wrong :P.

Well you can't ask me to prove you wrong and not expect me to try 😛

def transform_paramstyle(operation: str, parameters: Dict[str, Any]) -> str:

    COMPLETE_PYFORMAT_REGEX = r'%\((\w+)\)([diouxXeEfFgGcrsa%])'
    INNER_NAME_REGEX = r'\((.*?)\)'
    NAMED_PARAMSTYLE_FMT = ":{}"

    pat = re.compile(COMPLETE_PYFORMAT_REGEX)
    inner_pat = re.compile(INNER_NAME_REGEX)

    pyformat_markers = pat.findall(operation)

    for marker in pyformat_markers:
        _name = re.findall(inner_pat, marker)[0]
        rendered_value = inject_parameters(_name, parameters.get(_name))
        operation = operation.replace(rendered_value, NAMED_PARAMSTYLE_FMT.format(_name))

    return operation

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Copy link
Contributor

@nithinkdb nithinkdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

"Consider using native parameters." in caplog.text
), "Log message should not be supressed"

@both_approaches
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing all of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just replaced with an extra dimension to the matrix, as described in the PR description.

approach_paramstyle_combinations = [
(ParameterApproach.INLINE, ParamStyle.PYFORMAT),
(ParameterApproach.NATIVE, ParamStyle.PYFORMAT),
(ParameterApproach.NATIVE, ParamStyle.NAMED),
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha

Jesse Whitehouse added 2 commits November 9, 2023 20:26
- Tests that use DbsqlParam will not run under both paramstyles
If a user is writing code with DbsqlParam then they *must* use the named
param style

- Likewise, if parameters are passed as a sequence then transform_paramstyle
will not be called.

- Adds tests for the inline ordinal parameter approach

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
@susodapop
Copy link
Contributor Author

I pushed some new e2e tests of the inline ordinal parameter approach. And updated the logic that calls transform_paramstyle. Using @jadewang-db 's approach means that we now must pass a dictionary of parameters to transform_parameters. Whereas the regex approach didn't look at the parameters whatsoever.

As a consequence, transform_paramstyle won't work if cursor.execute() is passed a list of parameters. So I've revised the logic to make sure we don't do that.

One side-effect of this is that users cannot pass a List[DbsqlParameter] and also use the pyformat paramstyle. But we wouldn't expect users to do that, because DbsqlParameter itself is brand new for the forthcoming 3.0.0 release. i.e. this doesn't break backwards compatibility.

Jesse Whitehouse added 2 commits November 9, 2023 21:55
transform_paramstyle now needs a dictionary of parameters since it doesn't
use regex anymore

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
… not

recognised by DBR.

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>
@susodapop susodapop merged commit f56defc into native-query-params-staging Nov 10, 2023
@susodapop susodapop deleted the add-paramstyle-transformer branch November 10, 2023 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants