Conversation

@da-woods da-woods commented Aug 9, 2025

When using multi-phase init and module state, use a hash table instead of bisection into an array. This should hopefully perform better on average (especially when there are many modules available).

Special case the 0th module so this one's always available quickly.

Make it so the implementation can be tested independently on subinterpreters/module state.

@da-woods da-woods force-pushed the module-state-hash-table branch from 81c754e to df5108e on August 9, 2025 18:30

@scoder scoder left a comment

Do we have any way of benchmarking the lookup difference?

I didn't have time to follow the atomics usage yet, but since you're chasing quite a number of pointers and counters, it's definitely worth a proper look.

Comment on lines 2806 to 2809
#if !CYTHON_COMPILING_IN_LIMITED_API && PY_VERSION_HEX >= 0x030E0000
int64_t v = Py_HashBuffer((void*)&id, sizeof(id));
return v >= 0 ? v : -(v+1); // make sure we return something +ve
#else

Contributor

What speaks against a directly inlined hash function, given that we need to provide our own implementation below either way? That CityHash implementation is probably much faster in practice than the generic hash implementation that is indirectly called by Py_HashBuffer.
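
(For illustration only: a directly inlined hash over the 64-bit interpreter id can be a handful of instructions, e.g. a splitmix64-style finalizer. This sketch is not the CityHash code referenced here; the function name and the non-negative masking are assumptions matching the excerpt above.)

    #include <stdint.h>

    /* Sketch: splitmix64-style finalizer over the interpreter id.
     * Cheap, branch-free, good avalanche for small integer keys. */
    static inline int64_t hash_interp_id(uint64_t id) {
        uint64_t x = id;
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        /* mask off the sign bit so the result is non-negative, as in the excerpt */
        return (int64_t)(x & 0x7fffffffffffffffULL);
    }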

Comment on lines 2978 to 2984
// but shrink if less than 10% filled
while (target < count/10) {
target /= 2;
if (target < 10) {
return 10;
}
}

Contributor

Do you think that we'd have a reason to shrink the table in the real world? I'd expect the number of interpreters to stay somewhat small, and this is just a table, probably dwarfed by the actual module state data. IIUC, we spend 16 bytes per module. Even if an application ends up running 1000 interpreters at a time, that'd just be a table of 16KiB, probably less than 30KiB as a hash table. And it seems unlikely that such a peak usage only appears once in a longer running application.

You know your code better than me – if it's not a big deal to allow shrinking then why not, but it can sometimes simplify the implementation to allow only growth. And it avoids excessive rehashing on larger fluctuations, e.g. when several interpreters are started and terminated repeatedly in batches (as you could expect in a long-running server setting).

Contributor Author

I think in principle we want to be able to remove modules from the table, since it should be possible to unload and reload a module. I suspect it wouldn't work right now for reference counting reasons, but I don't think this table should block it.

Being able to shrink the allocation is less important but doesn't add a lot of complication. I think I'll remove it just to avoid fluctuations, but it really only avoids the 4 lines above.

Comment on lines +82 to +84
import random

generator = random.Random()

Contributor

Randomisation makes test failures difficult to debug. Can we use a deterministic series of operations? E.g. create a few modules, then do a couple of add-remove cycles, maybe nest some add-remove cycles, maybe nest some in increasingly growing depths, then clean up? You could use the permutation generators in itertools to create varying degrees of add-remove nesting.

Contributor Author

That's why I added print(f"Failed: state {initial_state}") on failure - it should make it possible to debug a failure retrospectively. For now I'll add the small bit of code needed to help with that.

I'm happy to make it deterministic though, if that's still an issue.

da-woods commented Aug 13, 2025

Do we have any way of benchmarking the lookup difference?

For the uncontended case when only one interpreter is reading at once (that's python3 -c "import modulestatelookup; modulestatelookup.run_many_benchmarks(None, 1, 1000000)"):

[benchmark plot: lookup time against number of interpreters, single reader]

The new version starts off faster (largely I think due to the special-case for module 0), then is slightly slower for medium numbers of interpreters, and then ends up slightly faster again for high numbers.

For the case where 4 interpreters are reading at once (python3 -c "import modulestatelookup; modulestatelookup.run_many_benchmarks(None, 4, 1000000)"), it's all much the same:

[benchmark plot: lookup time against number of interpreters, 4 concurrent readers]

A bit of profiling suggests that the counting of active readers dominates in this case (and that part is basically unchanged), so exactly how the state is looked up is largely irrelevant.


So not a hugely dramatic difference, although the special case for interpreter 0 is almost certainly worthwhile.

@da-woods
Contributor Author

Some well-positioned blank space gives a decent speed-up to the "multiple threads" case (although it's pretty tangential to the hash-table change):

[benchmark plot: multi-threaded lookup times with added cache-line padding]

@scoder scoder left a comment

So, looking at the code and your benchmark results, I wonder if this is worth the complexity. If it's really not much faster than the previous implementation, but adds complexity to algorithmic synchronisation and packaging constraints, maybe the current code with an added special case for low interpreter counts or the main interpreter would be enough?

Comment on lines +2772 to +2773
// Leave at enough space to keep the read_counter read_counter the
// rest of the structure in different cache lines to prevent false sharing.

Contributor

Suggested change
// Leave at enough space to keep the read_counter read_counter the
// rest of the structure in different cache lines to prevent false sharing.
// Leave at enough space to keep the rest of the structure
// in different cache lines to prevent false sharing.

// fine because readers should be much faster than memory allocation.
// This is called while also being a reader, so allow one reader.
while (__pyx_atomic_load(&data->read_counter) > 1);
while (__pyx_atomic_load(&data->read_counter) > 0);

Contributor

Ok, it's an atomic load, but that doesn't guarantee that after the loop the read counter is still 0. Doesn't that matter?

Contributor Author

Yes, this is essentially the same as the existing version. When waiting to rewrite, I swap the table out and set it to NULL, which means that nobody new starts reading it.
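
(For context, a minimal sketch of that detach-then-drain pattern, using C11 atomics and made-up names; the real code uses Cython's __pyx_atomic_* helpers and waits for the counter to reach 1 rather than 0 because the caller is itself a reader.)

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Hypothetical layout loosely mirroring the discussion above. */
    typedef struct {
        _Atomic(void *) table;        /* current lookup table; NULL while detached */
        atomic_size_t read_counter;   /* number of in-flight readers */
    } shared_lookup_t;

    /* Reader: announce itself, then only touch the table if it is still published. */
    static void *reader_lookup(shared_lookup_t *s) {
        void *result = NULL;
        atomic_fetch_add(&s->read_counter, 1);
        void *table = atomic_load(&s->table);
        if (table != NULL) {
            /* ... search `table` for this interpreter's module state ... */
        }
        atomic_fetch_sub(&s->read_counter, 1);
        return result;
    }

    /* Writer: detach the table so nobody new starts reading the old copy,
     * spin until in-flight readers have drained, then free and republish. */
    static void writer_replace(shared_lookup_t *s, void *new_table) {
        void *old = atomic_exchange(&s->table, NULL);
        while (atomic_load(&s->read_counter) > 0)
            ;  /* spin: readers should be much faster than (re)allocation */
        free(old);
        atomic_store(&s->table, new_table);
    }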

Comment on lines 2969 to 2972
// keep it less than half filled
while (target <= count*2) {
target *= 2;
}

Contributor

Doubling might become excessive at some point, especially if we don't shrink. Here's the overallocation scheme for Python lists, which might be worth adopting or adapting (they've done the thinking already, so…).

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * Add padding to make the allocated size multiple of 4.
     * The growth pattern is:  0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
     * Note: new_allocated won't overflow because the largest possible value
     *       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
     */
    new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
    /* Do not overallocate if the new size is closer to overallocated size
     * than to the old size.
     */
    if (newsize - Py_SIZE(self) > (Py_ssize_t)(new_allocated - newsize))
        new_allocated = ((size_t)newsize + 3) & ~(size_t)3;
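
(A rough illustration of how that scheme could be adapted here, not part of the PR: keep the table below ~50% load but grow it with the milder list-style headroom instead of doubling. The helper name is made up and the minimum size of 10 is taken from the excerpt above; a caller would still only rehash when the current allocation drifts outside some band around this target.)

    #include <stddef.h>

    /* Sketch: target table size for `count` live modules, adapted from the
     * CPython list over-allocation above rather than plain doubling. */
    static size_t target_table_size(size_t count) {
        size_t needed = count * 2;                   /* keep the load factor below 1/2 */
        size_t target = needed + (needed >> 3) + 6;  /* ~12.5% headroom, like lists */
        target &= ~(size_t)3;                        /* round down to a multiple of 4 */
        return target < 10 ? 10 : target;
    }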

Comment on lines +2774 to +2782
// On modern C++ the required size is available. Otherwise use an arbitrary value
// of "twice what most x64 chips are in 2025".
char blank_space[
#if defined(__cplusplus) && defined(__cpp_lib_hardware_interference_size)
std::hardware_destructive_interference_size
#else
128
#endif
];

Contributor

IIUC, the size is sometimes available in C++, specifically, when users compile their own code. Binary wheels for distribution usually target some old, generally supported CPU that may or may not match the configuration of the CPU that will eventually run the code. We could argue that users with high performance requirements would have an incentive to compile the code on their side, but many still won't do it for all of their (relevant) dependencies.

That written, the L1 cache line size seems to be consistent at least across Intel CPUs and it's an improvement according to your benchmark. It'll probably be ok in practice.

BTW, we should make sure that this code doesn't get shared across modules as the struct size might end up varying at compile time. This adds at least a difference between C++ and C modules.

Contributor Author

Yes, what C++ includes is definitely a generic compile-time estimate for the platform rather than an exact match for the user's CPU.

As far as I can tell, anything Intel or ARM is 64, but I can find a few less popular platforms that report either 32 (RISC-V) or 128 (POWER64).

I think the padding is definitely worthwhile but I don't feel strongly about using the C++ library number or hardcoding our own number.

BTW, we should make sure that this code doesn't get shared across modules as the struct size might end up varying at compile time. This adds at least a difference between C++ and C modules.

I think the mutex already makes that true.
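
(If we did hardcode our own number, one option, sketched here as an assumption rather than what the PR does, is a fixed pad that covers both 64- and 128-byte lines, so C and C++ builds agree on the struct layout.)

    /* Hypothetical fixed padding so C and C++ translation units agree on the
     * layout: 128 bytes covers the 64-byte lines on x86-64 and most ARM chips
     * as well as the 128-byte lines reported on e.g. POWER. */
    #ifndef MODULE_LOOKUP_PAD
    #define MODULE_LOOKUP_PAD 128
    #endif

    typedef struct {
        /* ... read_counter and the other frequently written fields ... */
        char blank_space[MODULE_LOOKUP_PAD];
        /* ... the read-mostly rest of the structure ... */
    } padded_lookup_t;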

@da-woods
Contributor Author

So, looking at the code and your benchmark results, I wonder if this is worth the complexity. If it's really not much faster than the previous implementation, but adds complexity to algorithmic synchronisation and packaging constraints, maybe the current code with an added special case for low interpreter counts or the main interpreter would be enough?

I kind of agree. I think this has ended up doing about 5 things:

  • Adding a benchmark tool - it was useful so we might as well keep that
  • Refactoring the interface so it can be tested independently without needing to create subinterpreters - I think that's worthwhile
  • Special-casing either low interpreter counts or just the main interpreter - that seems like a very worthwhile optimization.
  • Padding between the table and the counter to reduce contention - again it seems worthwhile to me (with some debate about where the number comes from).
  • Binary search -> hash table - possibly not worthwhile performance-wise, although it feels like the right choice as a matter of principle.

packaging constraints

I don't think it actually changes this. The choice of mutex type already makes it dubious to share. Module state is disabled by default so it probably isn't a good candidate to try to share.


I think I'll put this in draft for now and pick out the individual parts.

@da-woods da-woods marked this pull request as draft August 14, 2025 17:08
@da-woods da-woods changed the title from Replace module state lookup with hash table to [Draft for now, going to pick out independent parts] Replace module state lookup with hash table on Aug 14, 2025

@da-woods
Contributor Author

My current thought on this is:

  • The contention on the atomic use counter is really bad for performance and so it needs to go - we should assume that anyone who wants subinterpreters probably wants to use them concurrently.
  • This could be done by either
    • leaking the memory when the lookup table is reallocated (so old copies of the table are never invalidated)
    • A read-copy-update type scheme where the memory is freed later at some safe time. I think that needs to be "when all interpreters have been through at least one GC cycle". I have a rough plan for how to do this but it's really complicated so may not be worth it.
  • I think both of those would work better with the hash table rather than the ordered list, because the hash table can be updated in place lock-free (until it needs resizing) - see the sketch below.

I've not done anything beyond a bit of thinking about it though.
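
(To make the last point concrete, here is a rough sketch, using C11 atomics and hypothetical names, of a lock-free in-place insert into a fixed-size, linearly probed table: an empty slot is claimed with a compare-and-swap and the state pointer is published afterwards, so readers that see the id but a still-NULL state would simply treat the entry as not yet visible. Resizing would still need the detach-and-drain path discussed above.)

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        _Atomic int64_t interp_id;   /* 0 = empty slot (interpreter ids assumed non-zero here) */
        _Atomic(void *) state;       /* module state pointer; NULL until published */
    } slot_t;

    /* Claim the first free slot by CAS-ing the id, then publish the state.
     * Returns 0 if the table is full, i.e. the caller must resize instead. */
    static int lockfree_insert(slot_t *slots, size_t nslots,
                               int64_t interp_id, void *state) {
        size_t i = (size_t)interp_id % nslots;       /* or the hash of the id */
        for (size_t probes = 0; probes < nslots; ++probes) {
            int64_t expected = 0;
            if (atomic_compare_exchange_strong(&slots[i].interp_id, &expected,
                                               interp_id) ||
                    expected == interp_id) {         /* claimed, or already present */
                atomic_store(&slots[i].state, state);
                return 1;
            }
            i = (i + 1) % nslots;
        }
        return 0;                                    /* full: fall back to resizing */
    }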
