Conversation

@da-woods da-woods commented Aug 9, 2025

When using multi-phase init and module state, use a hash table instead of bisection into an array. This should hopefully perform better on average (especially when there are many modules available).

Special case the 0th module so this one's always available quickly.

Make it so the implementation can be tested independently on subinterpreters/module state.

@da-woods da-woods force-pushed the module-state-hash-table branch from 81c754e to df5108e on August 9, 2025 18:30

@scoder scoder left a comment

Do we have any way of benchmarking the lookup difference?

I didn't have time to follow the atomics usage yet, but since you're chasing quite a number of pointers and counters, it's definitely worth a proper look.

Comment on lines 2806 to 2809
#if !CYTHON_COMPILING_IN_LIMITED_API && PY_VERSION_HEX >= 0x030E0000
int64_t v = Py_HashBuffer((void*)&id, sizeof(id));
return v >= 0 ? v : -(v+1); // make sure we return something +ve
#else

Contributor

What speaks against a directly inlined hash function, given that we need to provide our own implementation below either way? That CityHash implementation is probably much faster in practice than the generic hash implementation that is indirectly called by Py_HashBuffer.
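
(For illustration only: a directly inlined hash over the 64-bit interpreter id can be a handful of instructions, e.g. a splitmix64-style finalizer. This sketch is not the CityHash code referenced here; the function name and the non-negative masking are assumptions matching the excerpt above.)

    #include <stdint.h>

    /* Sketch: splitmix64-style finalizer over the interpreter id.
     * Cheap, branch-free, good avalanche for small integer keys. */
    static inline int64_t hash_interp_id(uint64_t id) {
        uint64_t x = id;
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        /* mask off the sign bit so the result is non-negative, as in the excerpt */
        return (int64_t)(x & 0x7fffffffffffffffULL);
    }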

Comment on lines 2978 to 2984
// but shrink if less than 10% filled
while (target < count/10) {
target /= 2;
if (target < 10) {
return 10;
}
}

Contributor

Do you think that we'd have a reason to shrink the table in the real world? I'd expect the number of interpreters to stay somewhat small, and this is just a table, probably dwarfed by the actual module state data. IIUC, we spend 16 bytes per module. Even if an application ends up running 1000 interpreters at a time, that'd just be a table of 16KiB, probably less than 30KiB as a hash table. And it seems unlikely that such a peak usage only appears once in a longer running application.

You know your code better than me – if it's not a big deal to allow shrinking then why not, but it can sometimes simplify the implementation to allow only growth. And it avoids excessive rehashing on larger fluctuations, e.g. when several interpreters are started and terminated repeatedly in batches (as you could expect in a long-running server setting).

Contributor Author

I think in principle we want to be able to remove modules from the table, since it should be possible to unload and reload a module. I suspect it wouldn't work right now for reference counting reasons, but I don't think this table should block it.

Being able to shrink the allocation is less important but doesn't add a lot of complication. I think I'll remove it just to avoid fluctuations, but it really only avoids the 4 lines above.

Comment on lines +82 to +84
import random

generator = random.Random()

Contributor

Randomisation makes test failures difficult to debug. Can we use a deterministic series of operations? E.g. create a few modules, then do a couple of add-remove cycles, maybe nest some add-remove cycles, maybe nest some in increasingly growing depths, then clean up? You could use the permutation generators in itertools to create varying degrees of add-remove nesting.

Contributor Author

That's why I added print(f"Failed: state {initial_state}") on failure - it should make it possible to debug a failure retrospectively. For now I'll add the small bit of code needed to help with that.

I'm happy to make it deterministic though, if that's still an issue.

da-woods commented Aug 13, 2025

Do we have any way of benchmarking the lookup difference?

For the uncontended case when only one interpreter is reading at once (that's python3 -c "import modulestatelookup; modulestatelookup.run_many_benchmarks(None, 1, 1000000)"):

[benchmark plot: lookup time against number of interpreters, single reader]

The new version starts off faster (largely I think due to the special-case for module 0), then is slightly slower for medium numbers of interpreters, and then ends up slightly faster again for high numbers.

For the case where 4 interpreters are reading at once (python3 -c "import modulestatelookup; modulestatelookup.run_many_benchmarks(None, 4, 1000000)"), it's all much the same:

[benchmark plot: lookup time against number of interpreters, 4 concurrent readers]

A bit of profiling suggests that the counting of active readers dominates in this case (and that part is basically unchanged), so exactly how the state is looked up is largely irrelevant.


So not a hugely dramatic difference, although the special case for interpreter 0 is almost certainly worthwhile.

@da-woods
Contributor Author

Some well-positioned blank space gives a decent speed-up to the "multiple threads" case (although it's pretty tangential to the hash-table change):

[benchmark plot: multi-threaded lookup times with added cache-line padding]

@scoder scoder left a comment

So, looking at the code and your benchmark results, I wonder if this is worth the complexity. If it's really not much faster than the previous implementation, but adds complexity to algorithmic synchronisation and packaging constraints, maybe the current code with an added special case for low interpreter counts or the main interpreter would be enough?

Comment on lines +2772 to +2773
// Leave at enough space to keep the read_counter read_counter the
// rest of the structure in different cache lines to prevent false sharing.

Contributor

Suggested change
// Leave at enough space to keep the read_counter read_counter the
// rest of the structure in different cache lines to prevent false sharing.
// Leave at enough space to keep the rest of the structure
// in different cache lines to prevent false sharing.

// fine because readers should be much faster than memory allocation.
// This is called while also being a reader, so allow one reader.
while (__pyx_atomic_load(&data->read_counter) > 1);
while (__pyx_atomic_load(&data->read_counter) > 0);

Contributor

Ok, it's an atomic load, but that doesn't guarantee that after the loop the read counter is still 0. Doesn't that matter?

Contributor Author

Yes, this is essentially the same as the existing version. When waiting to rewrite, I swap the table out and set it to NULL, which means that nobody new starts reading it.
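
(For context, a minimal sketch of that detach-then-drain pattern, using C11 atomics and made-up names; the real code uses Cython's __pyx_atomic_* helpers and waits for the counter to reach 1 rather than 0 because the caller is itself a reader.)

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Hypothetical layout loosely mirroring the discussion above. */
    typedef struct {
        _Atomic(void *) table;        /* current lookup table; NULL while detached */
        atomic_size_t read_counter;   /* number of in-flight readers */
    } shared_lookup_t;

    /* Reader: announce itself, then only touch the table if it is still published. */
    static void *reader_lookup(shared_lookup_t *s) {
        void *result = NULL;
        atomic_fetch_add(&s->read_counter, 1);
        void *table = atomic_load(&s->table);
        if (table != NULL) {
            /* ... search `table` for this interpreter's module state ... */
        }
        atomic_fetch_sub(&s->read_counter, 1);
        return result;
    }

    /* Writer: detach the table so nobody new starts reading the old copy,
     * spin until in-flight readers have drained, then free and republish. */
    static void writer_replace(shared_lookup_t *s, void *new_table) {
        void *old = atomic_exchange(&s->table, NULL);
        while (atomic_load(&s->read_counter) > 0)
            ;  /* spin: readers should be much faster than (re)allocation */
        free(old);
        atomic_store(&s->table, new_table);
    }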

Comment on lines 2969 to 2972
// keep it less than half filled
while (target <= count*2) {
target *= 2;
}

Contributor

Doubling might become excessive at some point, especially if we don't shrink. Here's the overallocation scheme for Python lists, which might be worth adopting or adapting (they've done the thinking already, so…).

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * Add padding to make the allocated size multiple of 4.
     * The growth pattern is:  0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
     * Note: new_allocated won't overflow because the largest possible value
     *       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
     */
    new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
    /* Do not overallocate if the new size is closer to overallocated size
     * than to the old size.
     */
    if (newsize - Py_SIZE(self) > (Py_ssize_t)(new_allocated - newsize))
        new_allocated = ((size_t)newsize + 3) & ~(size_t)3;
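
(A rough illustration of how that scheme could be adapted here, not part of the PR: keep the table below ~50% load but grow it with the milder list-style headroom instead of doubling. The helper name is made up and the minimum size of 10 is taken from the excerpt above; a caller would still only rehash when the current allocation drifts outside some band around this target.)

    #include <stddef.h>

    /* Sketch: target table size for `count` live modules, adapted from the
     * CPython list over-allocation above rather than plain doubling. */
    static size_t target_table_size(size_t count) {
        size_t needed = count * 2;                   /* keep the load factor below 1/2 */
        size_t target = needed + (needed >> 3) + 6;  /* ~12.5% headroom, like lists */
        target &= ~(size_t)3;                        /* round down to a multiple of 4 */
        return target < 10 ? 10 : target;
    }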

Comment on lines +2774 to +2782
// On modern C++ the required size is available. Otherwise use an arbitrary value
// of "twice what most x64 chips are in 2025".
char blank_space[
#if defined(__cplusplus) && defined(__cpp_lib_hardware_interference_size)
std::hardware_destructive_interference_size
#else
128
#endif
];

Contributor

IIUC, the size is sometimes available in C++, specifically, when users compile their own code. Binary wheels for distribution usually target some old, generally supported CPU that may or may not match the configuration of the CPU that will eventually run the code. We could argue that users with high performance requirements would have an incentive to compile the code on their side, but many still won't do it for all of their (relevant) dependencies.

That written, the L1 cache line size seems to be consistent at least across Intel CPUs and it's an improvement according to your benchmark. It'll probably be ok in practice.

BTW, we should make sure that this code doesn't get shared across modules as the struct size might end up varying at compile time. This adds at least a difference between C++ and C modules.

Contributor Author

Yes, what C++ includes is definitely a generic compile-time estimate for the platform rather than an exact match for the user's CPU.

As far as I can tell, anything Intel or ARM is 64, but I can find a few less popular platforms that report either 32 (RISC-V) or 128 (POWER64).

I think the padding is definitely worthwhile but I don't feel strongly about using the C++ library number or hardcoding our own number.

BTW, we should make sure that this code doesn't get shared across modules as the struct size might end up varying at compile time. This adds at least a difference between C++ and C modules.

I think the mutex already makes that true.
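
(If we did hardcode our own number, one option, sketched here as an assumption rather than what the PR does, is a fixed pad that covers both 64- and 128-byte lines, so C and C++ builds agree on the struct layout.)

    /* Hypothetical fixed padding so C and C++ translation units agree on the
     * layout: 128 bytes covers the 64-byte lines on x86-64 and most ARM chips
     * as well as the 128-byte lines reported on e.g. POWER. */
    #ifndef MODULE_LOOKUP_PAD
    #define MODULE_LOOKUP_PAD 128
    #endif

    typedef struct {
        /* ... read_counter and the other frequently written fields ... */
        char blank_space[MODULE_LOOKUP_PAD];
        /* ... the read-mostly rest of the structure ... */
    } padded_lookup_t;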

@da-woods
Contributor Author

So, looking at the code and your benchmark results, I wonder if this is worth the complexity. If it's really not much faster than the previous implementation, but adds complexity to algorithmic synchronisation and packaging constraints, maybe the current code with an added special case for low interpreter counts or the main interpreter would be enough?

I kind of agree. I think this has ended up doing about 5 things:

  • Adding a benchmark tool - it was useful so we might as well keep that
  • Refactoring the interface so it can be tested independently without needing to create subinterpreters - I think that's worthwhile
  • Special-casing either low interpreter counts or just the main interpreter - that seems like a very worthwhile optimization.
  • Padding between the table and the counter to reduce contention - again it seems worthwhile to me (with some debate about where the number comes from).
  • Binary search -> hash table - possibly not worthwhile performance-wise, although it feels like the right choice as a matter of principle.

packaging constraints

I don't think it actually changes this. The choice of mutex type already makes it dubious to share. Module state is disabled by default so it probably isn't a good candidate to try to share.


I think I'll put this in draft for now and pick out the individual parts.

@da-woods da-woods marked this pull request as draft August 14, 2025 17:08
@da-woods da-woods changed the title from Replace module state lookup with hash table to [Draft for now, going to pick out independent parts] Replace module state lookup with hash table on Aug 14, 2025

@da-woods
Contributor Author

My current thought on this is:

  • The contention on the atomic use counter is really bad for performance and so it needs to go - we should assume that anyone who wants subinterpreters probably wants to use them concurrently.
  • This could be done by either
    • leaking the memory when the lookup table is reallocated (so old copies of the table are never invalidated)
    • A read-copy-update type scheme where the memory is freed later at some safe time. I think that needs to be "when all interpreters have been through at least one GC cycle". I have a rough plan for how to do this but it's really complicated so may not be worth it.
  • I think both of those would work better with the hash table rather than the ordered list, because the hash table can be updated in place lock-free (until it needs resizing) - see the sketch below.

I've not done anything beyond a bit of thinking about it though.
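
(To make the last point concrete, here is a rough sketch, using C11 atomics and hypothetical names, of a lock-free in-place insert into a fixed-size, linearly probed table: an empty slot is claimed with a compare-and-swap and the state pointer is published afterwards, so readers that see the id but a still-NULL state would simply treat the entry as not yet visible. Resizing would still need the detach-and-drain path discussed above.)

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        _Atomic int64_t interp_id;   /* 0 = empty slot (interpreter ids assumed non-zero here) */
        _Atomic(void *) state;       /* module state pointer; NULL until published */
    } slot_t;

    /* Claim the first free slot by CAS-ing the id, then publish the state.
     * Returns 0 if the table is full, i.e. the caller must resize instead. */
    static int lockfree_insert(slot_t *slots, size_t nslots,
                               int64_t interp_id, void *state) {
        size_t i = (size_t)interp_id % nslots;       /* or the hash of the id */
        for (size_t probes = 0; probes < nslots; ++probes) {
            int64_t expected = 0;
            if (atomic_compare_exchange_strong(&slots[i].interp_id, &expected,
                                               interp_id) ||
                    expected == interp_id) {         /* claimed, or already present */
                atomic_store(&slots[i].state, state);
                return 1;
            }
            i = (i + 1) % nslots;
        }
        return 0;                                    /* full: fall back to resizing */
    }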
