Skip to content

Conversation

@rogiervd
Copy link
Contributor

What does this PR do?

This PR introduces type-checking in CI. It uses Pyright. It adds failing tests at first, and we can decide in this PR whether we want to continue, which will involve a number of more PRs to resolve issues.

We previously discussed this at #2749. I have since gone away and tried Pyre and Pyright, and prepared lots of little changes locally.

Advantages of type annotations in code

  • Hints while coding, so you need fewer keystrokes.

  • Error checking while coding. This image is for the bug in Make Scores.__repr__ actually return something #2897
    image

  • Clearer code, since type annotations explain interfaces at a slightly higher level than the code itself does. SpeechBrain maintainers may have a high-level overview of the code which compensates, but people with smaller brains and not that familiar with SpeechBrain (e.g. me) will find type annotations more useful.

Disadvantages of type annotations in code

  • More visual noise. From the earlier discussion, not everyone likes looking at : Union[float, int, None]. Though it turns out that from Python 3.10 (I used to think from 3.9), you can say : float | int | None. But note that docstrings currently already have type annotations; just not all correct or grammatical.

Advantages of type checking

  • The type annotations are more likely to correspond to the actual code. There are many instances in the current code where type annotations are inconsistent, and therefore misleading. This especially goes for type annotations in docstrings. E.g. in a few places arguments have type enumerable in docstrings.
  • Types of code that trip up the type checker also tend to trip up people. Again, maybe mostly small-brained people like me.

I have some experience with type checking in a serious Python project (https://github.com/apple/pfl-research), and have found it useful in making code correctly modular and clear to adopters.

Disadvantages of type checking

Why Pyright?

I started off using Pyre, but Pyright is better.

  • Good integration in VS Code. The Pylance extension uses Pyright. Try it out! (You may need the pyproject.toml from this PR.)
  • Faster. Roughly a second for a single file. Though at 10-30 seconds for the whole repo, not fast enough to run as pre-commit check, I think.
  • Pyright is better at inferring type changes, for example "if f_max is None: f_max = 1".
  • Clearer error messages.
  • Errors can be ignored with type: ignore, which generalises to other type checkers.
  • More consistent errors across set-ups / computers.
  • Easier to switch off checks for files.
  • It is possible to set which aspects are checked. This will e.g. allow us to switch on more checks, e.g. after errors are found that this check would have prevented.

Proposal for getting this in

  1. See all errors (thousands) in the CI.
  2. Me to put in multiple PRs to fix/silence errors. I've got the changes prepared locally.
  3. Me to put in extra PRs for errors that I didn't get locally.
  4. Once those PRs are all merged, merge this PR.
  5. Reap the benefits of type-checking: correctness, and help while coding.
  6. [Optionally] Run type checks for recipes/ too, and fix errors.
  7. [Optionally] Move type annotations from docstrings into function signatures. This seems a good idea to me, so that type annotations can be checked. However, the pre-commit checks currently include a check that type annotations go in docstrings.

Types of changes that need to be made

I've got a branch locally which fixes all errors I am seeing locally on a six-month-old version of SpeechBrain. But I believe I understand what sorts of PRs I'll put in, in units that should be as easy as possible to review.

  1. Around 15 small PRs where I'm fairly sure there's a bug and I can fix it. The first one is Make Scores.__repr__ actually return something #2897. I should put these in either way.
  2. Around 20 cases where type annotations are wrong, in one big PR. I should put these in either way.
  3. Around 30 cases where I believe there's a bug but I will need to check with people.
  4. Around 100 cases where type annotations are needed to help the type checker, ass well as humans, in 1-3 PRs.
  5. Around 60 cases where the type checker needs assumptions to be made explicit, in one big PR. The majority are of the form assert argument is not None. I would argue that these are useful to make explicit. Moreover, going forward I believe for code reviewers to see these sorts of explicit assumptions.
  6. Around 50 cases where the type checker believes that a variable can be unbound. This will be one big PR. See below for an example. I find this style of code confusing, but the code usually seems correct. See below for an example.
  7. Around 15 cases where overrides are not technically correct, which I imagine usually doesn't create a problem. This will be one big PR. See below for an example.
  8. Around 50 cases (around 100-150 lines) where the code seems fine, though maybe not a style that I would have used, so we need to add # type: ignore (or a file-wide ignores in a few cases, like speechbrain/inference/ASR.py, which is quite dynamic). This will be a few PRs.

I can show any of these in a PR before the decision whether introducing type checking is a good idea.

Cases where an assumption needs to be made explicit

def function(bool use_joint, joint=None):
    if use_joint:
        assert joint is not None  # Needs to be added...
        joint(data ...)  # ... otherwise error: Object of type "None" cannot be called

In my mind, this would be good to add this for readability and understandability if the caller violates the contract.

a: Optional[int]
b: Optional[int]

if a is None and b is None:
    raise ValueError("...")

if a is None:
    # Assert must be added or type checker thinks that "b" could be None.
    assert b is not None
    c = b[1]  # Error: None is not subscriptable.
def function(bool return_two) -> int | Tuple[int, int]:
    if return_two:
        return 1, 2
    else:
        return 1

a, b = function(True)  # Error: "int" is not Iterable.
# Requires asserting that the return value is a tuple:
a, b = cast(tuple, function(True))

In this case, I would personally have preferred to use two separate functions for the separate types of return values.

Case where a variable looks possibly unbound to the type checker
E.g.

if condition:
    variable = 3

...

if condition:
    variable + 2  # Error: possibly unbound.

In this case, I say variable = None at the top and then assert variable is not None within the second condition.

Case where an override is not technically correct

Just one example of an error:

speechbrain/nnet/diffusion.py:252:9 - error: Method "distort" overrides class "Diffuser" in an incompatible manner
    Parameter 3 name mismatch: base parameter is named "timesteps", override parameter is named "noise" (reportIncompatibleMethodOverride)

I have assumed that in these cases, the code has been tested and runs correctly, i.e. no one calls the method with the base class's signature.

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

This adds a GitHub Workflow a bit like standard CI (pythonapp.yml) but for type checking.
@TParcollet
Copy link
Collaborator

@pplantinga will most likely be as interested as I am into that. It will take a good number of PR (Rogier is on it) to fix all warnings but he already spotted many bugs.

@pplantinga
Copy link
Collaborator

Perhaps it is time for a deeper look at types, as they have matured in the Python ecosystem. It will take quite some work to get there but it should improve the quality of the code quite a bit.

A few comments:

Faster. Roughly a second for a single file. Though at 10-30 seconds for the whole repo, not fast enough to run as pre-commit check, I think.

Pre-commit checks are typically run only on files touched by the commit, so this could probably still work.

It is possible to set which aspects are checked. This will e.g. allow us to switch on more checks, e.g. after errors are found that this check would have prevented.

Starting with fewer checks makes good sense to me, as even a few checks will catch a lot of low-hanging fruit and fewer checks makes it more feasible to complete. I'm sure each change will require quite some effort to fix all the errors that are revealed, so waiting til it becomes clear what is necessary can help prioritize effort.

I've got a branch locally which fixes all errors I am seeing locally on a six-month-old version of SpeechBrain. But I believe I understand what sorts of PRs I'll put in, in units that should be as easy as possible to review.

This is tremendous, thanks! I'm happy to review any PRs put through this way.

@rogiervd
Copy link
Contributor Author

Perhaps it is time for a deeper look at types, as they have matured in the Python ecosystem. It will take quite some work to get there but it should improve the quality of the code quite a bit.

This is tremendous, thanks! I'm happy to review any PRs put through this way.

Thanks!

Faster. Roughly a second for a single file. Though at 10-30 seconds for the whole repo, not fast enough to run as pre-commit check, I think.

Pre-commit checks are typically run only on files touched by the commit, so this could probably still work.

I just tried it, and it took 6 seconds by my count, which might be OK. However, it does turn out that the pre-commit check runs in its own environment (normally without e.g. Torch installed). It is possible to pass in a name for the environment, but that won't generalise to everyone's local environment.
https://github.com/RobertCraigie/pyright-python?tab=readme-ov-file#pre-commit

So for the reason of environments, I think a pre-commit hook won't work. Though it's certainly possible that there is a trick that I'm missing.

@pplantinga
Copy link
Collaborator

I just tried it, and it took 6 seconds by my count, which might be OK. However, it does turn out that the pre-commit check runs in its own environment (normally without e.g. Torch installed). It is possible to pass in a name for the environment, but that won't generalise to everyone's local environment.

Ach too bad. Maybe we could include an optional pre-commit file that could be used to install it for anyone who wants to go through the effort of manually specifying the name.

@rogiervd
Copy link
Contributor Author

I just tried it, and it took 6 seconds by my count, which might be OK. However, it does turn out that the pre-commit check runs in its own environment (normally without e.g. Torch installed). It is possible to pass in a name for the environment, but that won't generalise to everyone's local environment.

Ach too bad. Maybe we could include an optional pre-commit file that could be used to install it for anyone who wants to go through the effort of manually specifying the name.

I've added a commented-out section in the pre-commit file, I hope that's what you mean.

@pplantinga
Copy link
Collaborator

I just tried it, and it took 6 seconds by my count, which might be OK. However, it does turn out that the pre-commit check runs in its own environment (normally without e.g. Torch installed). It is possible to pass in a name for the environment, but that won't generalise to everyone's local environment.

Ach too bad. Maybe we could include an optional pre-commit file that could be used to install it for anyone who wants to go through the effort of manually specifying the name.

I've added a commented-out section in the pre-commit file, I hope that's what you mean.

Looks good! Perhaps worth mentioning here that one would also probably want to stop git from tracking the file:
https://stackoverflow.com/questions/54912094/how-do-i-keep-a-local-change-outside-of-commit

@pplantinga
Copy link
Collaborator

pplantinga commented May 23, 2025

A bit of news here that may be relevant: Astral.sh (who is behind uv and ruff) has been working since January on a static type checker for Python written in rust called ty which should be incredibly fast (like their other tools). Pre-release version is available here:

https://github.com/astral-sh/ty

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants