
ConcurrentLocalContextProvider leaks memory per thread #8422

@matthias-fratz-bsz

Description


Environment Information

Provide at least:

  • JRuby version: commit 6f9df83 (pom.xml says 9.4.10.0-SNAPSHOT)
  • Operating system: Linux 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

Test Case

ThreadMemoryLeak.java

This code creates 10k threads, each of them executing a trivial amount of JRuby code, to simulate a heavily multi-threaded server application churning through threads. It then drops a heap dump for analysis.
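For reference, the attached file boils down to roughly the following (a minimal sketch, not the exact attachment; the class name and the omitted heap-dump step are placeholders):

```java
import org.jruby.embed.LocalContextScope;
import org.jruby.embed.ScriptingContainer;

public class ThreadMemoryLeakSketch {
    public static void main(String[] args) throws InterruptedException {
        // CONCURRENT scope means the container uses ConcurrentLocalContextProvider,
        // which keeps one LocalContext per calling thread.
        ScriptingContainer container = new ScriptingContainer(LocalContextScope.CONCURRENT);

        for (int i = 0; i < 10_000; i++) {
            Thread worker = new Thread(() -> container.runScriptlet("1 + 1"));
            worker.start();
            worker.join(); // threads run strictly one after another, no real concurrency
        }

        // heap dump would be taken here for analysis (omitted);
        // container.terminate() is deliberately never called, as in a long-running server
    }
}
```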

Note: This simulates a use case observed in a production Cantaloupe server, which for some reason creates and terminates roughly 4 threads per minute in our case (I don't understand Jetty's automatic thread pool management there). Because it keeps running for weeks at a time, it churns through tens of thousands of threads, accumulating tens of thousands of ~100 kB LocalContext objects waiting to be cleaned up in terminate(). It eventually slows to a crawl due to memory starvation, or runs out of memory entirely and crashes.

Expected Behavior

JRuby creates a LocalContext for each thread, holds it in a ThreadLocal, and cleans it up when that particular thread has terminated.
The heap dump is a few MB and its size does not scale with the number of terminated threads.

The synthetic example doesn't actually have any concurrent threads; they all run more or less sequentially. It should therefore be able to run with a very small heap, no matter how high you crank the total number of threads that are created and destroyed (line 36).

Actual Behavior

JRuby doesn't remove the LocalContext for each thread until the ScriptingContainer itself is disposed. In a long-running server application, that would be "never". In the example, it doesn't happen before the heap dump is written, at which point the JVM is essentially terminating.

The heap dump is hundreds of MB and scales with the number of threads that have been terminated. It is completely dominated by thousands of stale local contexts held by the ScriptingContainer's ConcurrentLocalContextProvider.

(Heap dump screenshots: biggest-object, accumulated-objects)

Setting the number of threads to a high value causes massive heap consumption, and even the synthetic example will eventually run out of memory.

Probable Cause

ConcurrentLocalContextProvider creates a LocalContext for each thread and only disposes of it in terminate(). I do not know the lifecycles of most of the objects involved, but evidently terminate() doesn't get called before JVM termination here. Because these objects remain reachable (via ConcurrentLocalContextProvider's contextRefs member), they cannot be garbage-collected.

Suggested Fix

Hold a Reference (a PhantomReference should do, actually) on each thread that a LocalContext is created for. When the thread terminates and becomes unreachable, that Reference will show up in its associated ReferenceQueue. A background service / cleaner thread can then watch that ReferenceQueue and call remove() on the relevant LocalContext objects. terminate() would obviously need to shut down that service thread and remove() the remaining LocalContexts. A sketch follows.
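Roughly what such a cleaner could look like (ThreadContextReaper, register() and the Runnable cleanup hook are invented names for illustration; the real fix would live inside ConcurrentLocalContextProvider and run the existing remove() logic):

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Watches a ReferenceQueue and runs a cleanup action (e.g. removing the
// thread's LocalContext) once the owning thread has become unreachable.
class ThreadContextReaper {
    private final ReferenceQueue<Thread> queue = new ReferenceQueue<>();
    private final Map<Reference<Thread>, Runnable> cleanups = new ConcurrentHashMap<>();
    private final Thread reaper;

    ThreadContextReaper() {
        reaper = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Blocks until some registered thread has died and been collected.
                    Reference<? extends Thread> ref = queue.remove();
                    Runnable cleanup = cleanups.remove(ref);
                    if (cleanup != null) cleanup.run();
                    ref.clear();
                }
            } catch (InterruptedException e) {
                // shutdown() interrupts us; fall through and let the thread exit
            }
        }, "LocalContext-Reaper");
        reaper.setDaemon(true);
        reaper.start();
    }

    // Called whenever a LocalContext is created for the given thread.
    void register(Thread owner, Runnable cleanup) {
        cleanups.put(new PhantomReference<>(owner, queue), cleanup);
    }

    // Called from terminate(): stop the reaper and clean up everything left.
    void shutdown() {
        reaper.interrupt();
        cleanups.values().forEach(Runnable::run);
        cleanups.clear();
    }
}
```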

Alternatively, the service thread could periodically scan the contextRefs array and remove() any contexts whose thread has died. This should clean up the context more quickly if something is still holding a reference to the terminated thread, keeping it from being garbage collected. A sketch of that variant is below.
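Again only a sketch, with an invented ContextScanner and a Thread-to-cleanup map standing in for contextRefs; the scan interval is arbitrary:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically scans the registered threads and cleans up contexts
// belonging to threads that are no longer alive.
class ContextScanner {
    private final Map<Thread, Runnable> contexts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "LocalContext-Scanner");
                t.setDaemon(true);
                return t;
            });

    ContextScanner() {
        scheduler.scheduleAtFixedRate(this::scan, 1, 1, TimeUnit.MINUTES);
    }

    void register(Thread owner, Runnable cleanup) {
        contexts.put(owner, cleanup);
    }

    private void scan() {
        for (Iterator<Map.Entry<Thread, Runnable>> it = contexts.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<Thread, Runnable> entry = it.next();
            if (!entry.getKey().isAlive()) { // thread has terminated; its context is stale
                entry.getValue().run();      // e.g. remove() the LocalContext
                it.remove();
            }
        }
    }

    // Called from terminate(): stop scanning and clean up everything left.
    void shutdown() {
        scheduler.shutdownNow();
        contexts.values().forEach(Runnable::run);
        contexts.clear();
    }
}
```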

Maybe it is also possible to piggy-back scanning onto some other operation, but I'm not sure what the performance impact of that would be, or what to do if that operation never happens.
