[Draft for now, going to pick out independent parts] Replace module state lookup with hash table #7080
base: master
Conversation
When using multi-phase init and module state, use a hash table instead of bisection into an array. This should hopefully perform better on average (especially when there are many modules available). Special case the 0th module so this one's always available quickly. Make it so the implementation can be tested independently on subinterpreters/module state.
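For orientation, here is a minimal sketch of the idea (illustrative only; the names, the multiplicative hash and the probing scheme are assumptions, not the PR's actual code):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t id;                 /* 0 is treated as "empty slot" in this sketch */
    void *module_state;
} sketch_entry;

typedef struct {
    sketch_entry first;          /* special-cased 0th module: checked before hashing */
    sketch_entry *entries;       /* open-addressing table, power-of-two capacity */
    size_t mask;                 /* capacity - 1 */
} sketch_table;

static void *sketch_lookup(const sketch_table *t, uint64_t id) {
    if (t->first.id == id)       /* fast path for the 0th module */
        return t->first.module_state;
    size_t i = (size_t)(id * 0x9E3779B97F4A7C15ULL) & t->mask;  /* cheap multiplicative hash */
    while (t->entries[i].id != 0) {          /* linear probing; table kept well under full */
        if (t->entries[i].id == id)
            return t->entries[i].module_state;
        i = (i + 1) & t->mask;
    }
    return NULL;                 /* module not registered in this interpreter */
}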
Do we have any way of benchmarking the lookup difference?
I didn't have time to follow the atomics usage yet, but since you're chasing quite a number of pointers and counters, it's definitely worth a proper look.
Cython/Utility/ModuleSetupCode.c
Outdated
#if !CYTHON_COMPILING_IN_LIMITED_API && PY_VERSION_HEX >= 0x030E0000
    int64_t v = Py_HashBuffer((void*)&id, sizeof(id));
    return v >= 0 ? v : -(v+1); // make sure we return something +ve
#else
What speaks against a directly inlined hash function, given that we need to provide our own implementation below either way? That CityHash is probably much faster in practice than the generic hash implementation that is indirectly called by the call to Py_HashBuffer.
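For comparison, here is roughly what a directly inlined integer hash could look like, using the 64-bit finalizer from MurmurHash3 (the names are made up and this is not necessarily what the PR or CityHash does; it only illustrates that hashing a single id can be a handful of inlined instructions rather than a call through Py_HashBuffer):

#include <stdint.h>

static uint64_t sketch_fmix64(uint64_t k) {
    /* MurmurHash3 64-bit finalizer: mixes all input bits into all output bits. */
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

static int64_t sketch_hash_id(uint64_t id) {
    int64_t v = (int64_t)sketch_fmix64(id);
    return v >= 0 ? v : -(v + 1);   /* same non-negative clamp as the code above */
}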
Cython/Utility/ModuleSetupCode.c
Outdated
// but shrink if less than 10% filled
while (target < count/10) {
    target /= 2;
    if (target < 10) {
        return 10;
    }
}
Do you think that we'd have a reason to shrink the table in the real world? I'd expect the number of interpreters to stay somewhat small, and this is just a table, probably dwarfed by the actual module state data. IIUC, we spend 16 bytes per module. Even if an application ends up running 1000 interpreters at a time, that'd just be a table of 16KiB, probably less than 30KiB as a hash table. And it seems unlikely that such a peak usage only appears once in a longer running application.
You know your code better than me – if it's not a big deal to allow shrinking then why not, but it can sometimes simplify the implementation to allow only growth. And it avoids excessive rehashing on larger fluctuations, e.g. when several interpreters are started and terminated repeatedly in batches (as you could expect in a long-running server setting).
I think in principle we want to be able to remove modules from the table - I think it should be possible to unload and reload a module. I suspect it wouldn't work right now for reference counting reasons, but I don't think this table should block it.
Being able to shrink the allocation is less important but doesn't add a lot of complication. I think I'll remove it just to avoid fluctuations, but it really only avoids the 4 lines above.
import random

generator = random.Random()
Randomisation makes test failures difficult to debug. Can we use a deterministic series of operations? E.g. create a few modules, then do a couple of add-remove cycles, maybe nest some add-remove cycles, maybe nest some in increasingly growing depths, then clean up? You could use the permutation generators in itertools to create varying degrees of add-remove nesting.
That's why I added print(f"Failed: state {initial_state}") on failure - I think it should make it possible to debug a failure retrospectively. For now I'll add in the small bit of code to help with that.
I'm happy to make it deterministic though if that's still an issue.
Co-authored-by: scoder <stefan_ml@behnel.de>
So, looking at the code and your benchmark results, I wonder if this is worth the complexity. If it's really not much faster than the previous implementation, but adds complexity to algorithmic synchronisation and packaging constraints, maybe the current code with an added special case for low interpreter counts and the main interpreter would be enough?
// Leave at enough space to keep the read_counter read_counter the
// rest of the structure in different cache lines to prevent false sharing.
Suggested change:
- // Leave at enough space to keep the read_counter read_counter the
- // rest of the structure in different cache lines to prevent false sharing.
+ // Leave at enough space to keep the rest of the structure
+ // in different cache lines to prevent false sharing.
// fine because readers should be much faster than memory allocation.
// This is called while also being a reader, so allow one reader.
while (__pyx_atomic_load(&data->read_counter) > 1);
while (__pyx_atomic_load(&data->read_counter) > 0);
Ok, it's an atomic load, but that doesn't guarantee that after the loop the read counter is still 0. Doesn't that matter?
Yes, this is essentially the same as the existing version. When waiting to rewrite, I swap the table out and set it to NULL, which means that nobody new starts reading it.
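A standalone sketch of that protocol, using C11 atomics instead of the __pyx_atomic_* macros (names and structure are hypothetical; it only illustrates the ordering described above and assumes a table is currently published):

#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_int read_counter;     /* incremented by readers while they use the table */
    /* ... entries ... */
} sketch_table;

static _Atomic(sketch_table *) published_table;

static sketch_table *sketch_begin_rewrite(void) {
    /* Unpublish first: lookups starting after this see NULL and fall back,
     * so read_counter can only decrease from here on. */
    sketch_table *old = atomic_exchange(&published_table, NULL);
    /* The rewriter is itself counted as a reader in this sketch, hence "> 1".
     * Spinning is acceptable because readers are much faster than allocation. */
    while (atomic_load(&old->read_counter) > 1)
        ;
    return old;   /* now safe to rebuild and publish a replacement table */
}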
// keep it less than half filled
while (target <= count*2) {
    target *= 2;
}
Doubling might become excessive at some point, especially if we don't shrink. Here's the overallocation scheme for Python lists, which might be worth adopting or adapting (they've done the thinking already, so…).
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* Add padding to make the allocated size multiple of 4.
* The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
* Note: new_allocated won't overflow because the largest possible value
* is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
*/
new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
/* Do not overallocate if the new size is closer to overallocated size
* than to the old size.
*/
if (newsize - Py_SIZE(self) > (Py_ssize_t)(new_allocated - newsize))
new_allocated = ((size_t)newsize + 3) & ~(size_t)3;
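A hypothetical adaptation of that scheme to the target-size helper discussed above (sketch only; it assumes the table does not need a power-of-two capacity, which the real implementation may rely on for cheap index masking):

#include <stddef.h>

static size_t sketch_target_size(size_t count) {
    /* Keep the table at most half full, then add mild headroom in the style
     * of CPython's list over-allocation instead of doubling all the way. */
    size_t target = count * 2;
    target = target + (target >> 3) + 6;
    return (target + 3) & ~(size_t)3;   /* round up to a multiple of 4 */
}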
// On modern C++ the required size is available. Otherwise use an arbitrary value
// of "twice what most x64 chips are in 2025".
char blank_space[
#if defined(__cplusplus) && defined(__cpp_lib_hardware_interference_size)
    std::hardware_destructive_interference_size
#else
    128
#endif
];
IIUC, the size is sometimes available in C++, specifically, when users compile their own code. Binary wheels for distribution usually target some old, generally supported CPU that may or may not match the configuration of the CPU that will eventually run the code. We could argue that users with high performance requirements would have an incentive to compile the code on their side, but many still won't do it for all of their (relevant) dependencies.
That written, the L1 cache line size seems to be consistent at least across Intel CPUs and it's an improvement according to your benchmark. It'll probably be ok in practice.
BTW, we should make sure that this code doesn't get shared across modules as the struct size might end up varying at compile time. This adds at least a difference between C++ and C modules.
Yes, what C++ includes is definitely a generic compile-time estimate for the platform rather than an exact match for the user's CPU.
As far as I can tell, anything Intel or ARM is 64, but I can find a few less popular platforms that report either 32 (RISC-V) or 128 (POWER64).
I think the padding is definitely worthwhile, but I don't feel strongly about using the C++ library number or hardcoding our own number.
> BTW, we should make sure that this code doesn't get shared across modules as the struct size might end up varying at compile time. This adds at least a difference between C++ and C modules.
I think the mutex already makes that true.
I kind of agree. I think this has ended up doing about 5 things:
I don't think it actually changes this. The choice of mutex type already makes it dubious to share. Module state is disabled by default, so it probably isn't a good candidate to try to share. I think I'll put this in draft for now and pick out the individual parts.
My current thought on this is:
I've not done anything beyond a bit of thinking about it though.



When using multi-phase init and module state, use a hash table instead of bisection into an array. This should hopefully perform better on average (especially when there are many modules available).
Special case the 0th module so this one's always available quickly.
Make it so the implementation can be tested independently on subinterpreters/module state.