Fixes and improvements to reference cache. #666

Merged
nickl- merged 2 commits into beanshell:master from nickl-:reference-cache
Dec 11, 2022
Conversation

nickl- (Member) commented Dec 8, 2022

I was getting annoyed with the intermittent failure of the InterpreterConcurrencyTest, so I fixed it. Failing tests, intermittent or otherwise, should be considered show stoppers and take precedence over anything else; I am sure you will agree. The problem was that the value references from the WeakHashMap were finalized but the key entries never got cleaned up, which was one motivation for developing the reference cache, see below. I am not sure whether this resolves #655, lacking a root cause analysis or unit test, since that issue specifically pertains to the dot notation in references. We also use WeakHashMap in Reflection and Capabilities, so the issue may be related.

I also took this time to address the issues raised about the asynchronicity of ReferenceCache. @opeongo please review this PR and confirm that you agree with the changes.

I went with option 3 from the discussion: since we may or may not actually use the reference, it pays not to block on initialization (use async instead), while still populating the reference cache as expected (instead of opting for lazy loading). To mitigate the concern about spawning hundreds of threads, I opted for a cached thread pool, which reuses previous threads if available but also clears them when idle so as not to consume unnecessary resources, which seems the best of both worlds.
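The approach can be sketched roughly like this. The class and method names below are illustrative, not the actual ReferenceCache API, and the daemon-thread factory is my own addition so idle pool threads do not keep the JVM alive:

```java
import java.util.Map;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

class AsyncCacheSketch {
    // Cached pool: reuses idle threads, reaps them after 60s of inactivity.
    private static final ExecutorService ASYNC = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // assumption: don't let idle pool threads pin the JVM
        return t;
    });

    private final Map<String, Future<String>> cache = new ConcurrentHashMap<>();

    /** Start initialization without blocking the caller. */
    void init(String key) {
        FutureTask<String> task = new FutureTask<>(() -> expensiveLoad(key));
        cache.put(key, task);
        ASYNC.submit(task); // asynchronous, unlike calling task.run() inline
    }

    /** Block only when the value is actually needed. */
    String get(String key) {
        try {
            return cache.get(key).get();
        } catch (Exception e) {
            throw new CompletionException(e);
        }
    }

    private String expensiveLoad(String key) {
        return "value-for-" + key; // stand-in for real reference initialization
    }
}
```

The cache is populated eagerly, as before, but the caller only pays the initialization cost if and when the value is dereferenced.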

It is worth mentioning that the concern raised in option 2, about the values not being loaded as weak or soft references, is invalid. The reference cache uses a self-monitoring reference queue process registered on the cache keys, which removes cache entries when the keys are invalidated, something not implemented for the standard WeakHashMap. That makes lazy loading probably the best option, but it changes the original intentions of the reference cache design, so I opted for option 3 as it has the least impact (talk about lazy). We should probably replace all the WeakHashMaps with ReferenceCache, but that is a bigger challenge in itself, because it is not a drop-in replacement and takes a completely different approach to populating the collection, one which enables automatic re-initialization of expired values removed from the cache when they are requested.
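The queue-driven cleanup can be sketched as follows. Names are illustrative; the real ReferenceCache also re-initializes expired values, which this sketch omits, showing only the ReferenceQueue mechanism that the standard WeakHashMap does not expose:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SelfCleaningCache<K, V> {
    private final ReferenceQueue<K> queue = new ReferenceQueue<>();
    // Keyed by the weak reference itself, so a collected key can be removed
    // by the exact reference object the queue delivers.
    private final Map<Reference<? extends K>, V> entries = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        // Register the key on the queue; GC enqueues the reference when the
        // key becomes unreachable.
        entries.put(new WeakReference<>(key, queue), value);
    }

    /** Drain the queue and drop entries whose keys were collected. */
    public void expungeStale() {
        Reference<? extends K> ref;
        while ((ref = queue.poll()) != null)
            entries.remove(ref);
    }

    public int size() {
        expungeStale();
        return entries.size();
    }
}
```

In the real implementation this draining would run in a dedicated monitor thread rather than piggybacking on size().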

Using the Maven build and test completion time as a benchmark (not a scientific bench, but still a ballpark), the cached thread pool is a slight improvement over the current non-asynchronous implementation and significantly faster than simply spawning a new thread for each request, which quickly takes a toll. I was unable to see a speed difference compared with a fixed pool, but with a long-running test (more than a minute) the memory footprint is much smaller thanks to the timed-out threads.

There was another annoying bug I picked up: exceptions were being obscured by the reference cache and not rolled up to the client. This also allowed null values, which the cache specifically disallows because it needs the null state to test whether a reference has expired. This has been rectified, and an appropriate null pointer exception message was added to distinguish the problem from other NPEs.
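For illustration, the guard amounts to something like this (the message text here is hypothetical, not the actual wording in the code):

```java
import java.util.Objects;

class NullGuardExample {
    // Reject null values up front with a descriptive message, so the failure
    // is distinguishable from an unrelated NPE deeper inside the cache.
    static String cacheValue(String value) {
        return Objects.requireNonNull(value,
            "cache values may not be null; null marks an expired reference");
    }
}
```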

Added several more tests for the reference queue, also proving cache expiry and auto-reinitialization of expired entries, reaching 99% test coverage, which seems like the best we are going to get unless someone can do better.

Changes in this commit include:

  • Fix issue with InterpreterConcurrencyTest failing periodically
  • Expose lazy loaded exceptions obscured by dereferenceValue's catch block
  • Fixed future task asynchronous process via static final cached thread pool
  • Removed debug output from ReferenceQueueMonitor's empty catch block
  • Add message to requireNonNull as specified for values
  • Fix key tests assigning null values and added tests for NPE checks
  • Fixed deprecated assertThat and ExpectedException on ReferenceCacheTest
  • Removed redundant throws Exception from test cases
  • Added tests to prove removal of finalized references
  • Added tests to prove auto reinitialize of expired references
  • Reached 99% code coverage on ReferenceCache and appears to be the limit

codecov bot commented Dec 8, 2022

Codecov Report

Merging #666 (1e08b15) into master (a6687e0) will increase coverage by 0.30%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #666      +/-   ##
============================================
+ Coverage     70.59%   70.89%   +0.30%     
- Complexity     2719     2756      +37     
============================================
  Files           106      106              
  Lines          9029     9057      +28     
  Branches       1772     1775       +3     
============================================
+ Hits           6374     6421      +47     
+ Misses         2232     2220      -12     
+ Partials        423      416       -7     
Flag Coverage Δ
unittests 70.89% <100.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/main/java/bsh/BSHAllocationExpression.java 78.26% <100.00%> (ø)
src/main/java/bsh/BSHAmbiguousName.java 88.23% <100.00%> (+11.76%) ⬆️
src/main/java/bsh/BSHArrayDimensions.java 100.00% <100.00%> (ø)
src/main/java/bsh/BSHArrayInitializer.java 95.19% <100.00%> (+0.04%) ⬆️
src/main/java/bsh/BSHBlock.java 100.00% <100.00%> (ø)
src/main/java/bsh/BSHClassDeclaration.java 96.87% <100.00%> (+3.12%) ⬆️
src/main/java/bsh/BSHEnhancedForStatement.java 82.35% <100.00%> (+0.35%) ⬆️
src/main/java/bsh/BSHEnumConstant.java 100.00% <100.00%> (+4.76%) ⬆️
src/main/java/bsh/BSHForStatement.java 87.17% <100.00%> (+0.33%) ⬆️
src/main/java/bsh/BSHFormalComment.java 100.00% <100.00%> (ø)
... and 24 more


});
cache.put(refKey, task);
task.run();
async.submit(task);
@sonatype-lift bot commented:

FutureReturnValueIgnored: Return value of methods returning Future must be checked. Ignoring returned Futures suppresses exceptions thrown from the code that completes the Future.



nickl- (Member, Author) commented Dec 8, 2022

It is being checked when the return value is actually used, hence asynchronous, and the exceptions are rolled up:

private V dereferenceValue(Future<CacheReference<V>> futureValue) {
    try {
        return dereferenceValue(futureValue.get());
    } catch (final Throwable e) {
        throw new CompletionException(e.getCause());
    }
}

There is no way to do otherwise, since handling the method's throws is required, so why raise this concern about calling the run method (which does not throw any exceptions and can be called without concern)? Also, if the return value is never used, why must it be checked?

Parts parts = PARTSCACHE.get(value);
if (null != parts)
    return parts;
PARTSCACHE.remove(value);
nickl- (Member, Author) commented Dec 8, 2022

It is tested, because this change actually fixes an existing failing test; check Codecov again.

nickl- (Member, Author) commented Dec 8, 2022

Added ignore warning for lift.

nickl- (Member, Author) commented Dec 8, 2022

Codecov is losing the report; nothing shown for this pull request:

https://app.codecov.io/gh/beanshell/beanshell/pull/666/diff

nickl- (Member, Author) commented Dec 8, 2022

After a re-run Codecov now displays; I will attempt to hit the missing items from the report.

nickl- force-pushed the reference-cache branch 6 times, most recently from 5ec1a8b to 3d5ea7f on December 8, 2022 at 15:14
nickl- (Member, Author) commented Dec 8, 2022

Found a typo in the toString of BSHAmbiguousName, then had to write a test for it, thanks to Codecov. Then had to do all the nodes for completeness.

nickl- (Member, Author) commented Dec 8, 2022

All the PR checks are happy now as well, so @opeongo if you are happy with the ReferenceCache changes then it is RTBC.

nickl- (Member, Author) commented Dec 11, 2022

Committing this now because builds are failing due to the InterpreterConcurrencyTest, which this fixes.

nickl- merged commit 60493f1 into beanshell:master on Dec 11, 2022
opeongo (Contributor) commented Dec 11, 2022

Hi @nickl-, I have been away for a few days due to surgery and a resulting limited ability to do keyboarding, so sorry for taking a while to respond. I am only looking at the ReferenceCache changes that I am familiar with. At first glance this PR does not close the race in init or do anything about the value references that are queued but not cleared, so it seems like only a partial fix. I recommended option 1 but you went with 3, which adds an extra layer of concurrency that really should have been left as a potential optimization for later, once the correctness fix is proven. Since we are hunting an intermittent bug (#655) and a namespace lookup problem (#659), I would have preferred this to be simplified to make the reasoning and debugging easier. Debugging under concurrent operations is a pita, so this footgun is not necessary here.

But maybe this fixes #655 so I will pull and build and test and report back.

opeongo (Contributor) commented Dec 11, 2022

I tried running my problematic workload with the 60493f1 version from the master branch and there is a severe performance regression. After 12 hours of running, the task had not proceeded very far, so I was not able to get an indication whether these changes fix #655. I ran the same workload again using bsh-2.0b5, and my task accomplished as much as current master in only 2 minutes of clock time. 12 hours vs 2 minutes for the same forward progress; something is wrong.

I ran again with my master branch from a few weeks ago and the workload ran substantially faster, although subjectively slower than bsh-2.0b5. This indicates that a recent change is responsible for the regression.

Unless you have a better idea, I think the steps forward at this stage are to bisect the recent changes to see what caused the performance regression.

Also, I should implement a more objective performance metric than "it seemed faster". What is best here: elapsed clock time, CPU time? Is there a regression test framework that we should look at?

opeongo (Contributor) commented Dec 11, 2022

After bisecting, it seems the performance problem shows up in 1e08b15. The only suspicious change is 1e08b15#r92725995. Maybe this has to do with the ConcurrentLinkedQueue implementation. The API docs say it needs to do a full traversal for some methods, like size(), so it may be susceptible to O(n) performance degradation:

Beware that, unlike in most collections, the size method is NOT a constant-time operation. Because of the asynchronous nature of these queues, determining the current number of elements requires a traversal of the elements
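If size() on a ConcurrentLinkedQueue turns out to be the hot spot, one possible workaround (a sketch of the general technique, not what this PR does) is to maintain an explicit counter alongside the queue, trading an O(n) traversal for an O(1) atomic read:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class CountedQueue<E> {
    private final ConcurrentLinkedQueue<E> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger count = new AtomicInteger();

    public void add(E e) {
        queue.add(e);
        count.incrementAndGet();
    }

    public E poll() {
        E e = queue.poll();
        if (e != null)
            count.decrementAndGet();
        return e;
    }

    /** O(1) read, versus queue.size() which traverses every node. */
    public int size() {
        return count.get();
    }
}
```

Under concurrent updates the counter is a momentary approximation, but so is ConcurrentLinkedQueue.size() itself, which the javadoc also warns may be inaccurate.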

opeongo (Contributor) commented Dec 11, 2022

Also note that ff28429 does not fix #655.

opeongo (Contributor) commented Dec 12, 2022

See 1e08b15#r92749798 for a fix to the performance regression.

This brings the performance back, but I wonder if it is just covering up the problem. If the ConcurrentLinkedQueue container was running slow, then perhaps too many references are building up and not getting cleared. Something seems fishy.

opeongo (Contributor) commented Dec 12, 2022

I added a print statement to ClassManagerImpl to have a look at the number of listeners that are registered. Using my test workload the number pretty quickly shoots up to hundreds of thousands of entries in the listeners container. The value is mostly increasing but in some cases gets a bit smaller, so I guess some entries are getting reaped but many are not.

nickl- (Member, Author) commented Dec 12, 2022

Unless you have a better idea, I think the steps forward at this stage are to bisect the recent changes to see what caused the performance regression.

Thought you were going to say revert lol =)

Also I should implement a more objective performance metric than "it seemed faster". What is best here? Elapsed clock time, cpu time? Is there a regression test framework that we should look at?

We need something to give us proper bench stats, with load testing et al. Is JMeter the go-to? Using the test execution time proves very little.
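Short of adopting a full harness like JMH or JMeter, even a crude elapsed-time helper beats eyeballing test run times. A rough sketch, where warm-up and best-of-n are naive mitigations for JIT and scheduler noise, not a substitute for a real benchmark framework:

```java
class MicroBench {
    /** Return the best elapsed time in nanoseconds over several runs. */
    static long timeNanos(Runnable work, int warmup, int runs) {
        for (int i = 0; i < warmup; i++)
            work.run();                           // let the JIT settle first
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            work.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;                              // best-of-n damps scheduler jitter
    }
}
```

Comparing the cached pool, fixed pool, and thread-per-request variants through one consistent helper like this would at least make "it seemed faster" reproducible.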

I added a print statement to ClassManagerImpl to have a look at the number of listeners that are registered. Using my test workload the number pretty quickly shoots up to hundreds of thousands of entries in the listeners container. The value is mostly increasing but in some cases gets a bit smaller, so I guess some entries are getting reaped but many are not.

Something is seriously dodgy with that code and it needs a revamp, as also mentioned in 1e08b15#r92812487.

Thanx for the review, hope the hand is bearable.

opeongo (Contributor) commented Dec 14, 2022

Though you were going to say revert lol =)

It was no big deal to chase this one down because it was obvious it crept in during a very recent commit and there were very few lines to look at.

However, since you poked me about it, I will give you my reaction.

I think reverts are appropriate when:

  • the problem is occurring on the trunk, and
  • it is causing a showstopper bug, and
  • the feature worked before the commit, but does not afterwards.

Reverts are especially appropriate when the change has to do with a non-essential feature or optimization attempt that is not core to getting to a stable release candidate. If it is something that can be triaged to fix later then do so and quickly get it out of the way so that others can keep on working.

The greatest thing about using git for the software development workflow is that it is so easy to make a revert and put the change into a feature branch, where the proponent of the change can spend time reworking and testing. The code is not thrown away, but passed back to the person who made the change (or another interested party) so that they can have another look at it and get it right. I don't understand why you won't accept working this way when it is what just about every other development shop on the planet is doing. Blocking bugs on the trunk is a big no-no, and you do it all the time.

The development style that you are imposing on this project is like hostage taking. You are essentially doing development on the trunk, and you are doing it without sufficient testing on real workloads (I say this because I run real-world workloads and bugs crop up quickly). When problems show up, instead of reverting and taking the change into a feature branch, you insist that it should stay in the trunk and that volunteers should step up and fix it.

When bugs like this are introduced it impacts everyone. No one else can work on the stuff that is important to them, because someone else introduced a bug into the trunk that messes up work in other areas, since standard core features that used to work are now broken (like basic math, I'm looking at you #590, or scripted objects #659). Now we all have to switch gears and redirect effort to become experts on a part of the code base that we previously had nothing to do with. OK, so we are a team and we can help each other out this way, for sure. But it happens so often that it seems like a bad habit you have got.

nickl- (Member, Author) commented Dec 15, 2022

However, since you poked me about it, I will give you my reaction.

Was just kidding, hoping to break the ice; guess it is still a sensitive issue.

I think reverts are appropriate when:

  • the problem is occurring on the trunk, and

  • it is causing a showstopper bug, and

  • the feature worked before the commit, but does not afterwards.

I might agree with you on new changes, but we are not talking about new changes; these are commits that result from merging in forks. They should be seen as historic changes already part of the trunk; they were all enhancements that were thought necessary, and we no longer have the original committers present to resubmit their changes.

It is just a fact of the matter which we have to live with, and we can get the issue resolved with a little elbow grease.

You are essentially doing development on the trunk. And you are doing it without sufficient testing on real work loads (I say this because I run real world work loads and bugs crop up quickly).

We just need more unit tests, and we can only ensure that changes do not break the available unit tests; so to improve this, we need more unit tests. All changes go back to master, and releases are made from master, so there is not much that can be done about that.

pgiffuni pushed a commit to pgiffuni/beanshell that referenced this pull request Jan 8, 2024
Fixes and improvements to reference cache.

Development

Successfully merging this pull request may close these issues.

Variables disappearing in long running scripts

2 participants