Skip to content

Conversation

@AlexaXs
Copy link
Contributor

@AlexaXs AlexaXs commented Dec 14, 2021

After a Threadpool rewrite for context-awareness, LFS checkouts started suffering long delays. We found the reason is that libgit2 callbacks that were being run in parallel started being run in sequence. But this is actually the safest way to deal with the callbacks, preventing deadlocks while being executed, and keeping the non-callback work thread-safe.

The changes here will just be an UNSAFE temporary workaround until LFS is properly integrated into NodeGit. The changes will be limited to the processing of callbacks from Workers that leverage threaded libgit2 functions (at the moment, only checkout functionality), and they basically allow the callbacks from the current Worker to be queued to the main JS thread without waiting for the current callback to end.

It is UNSAFE because with threaded libgit2 functions there will be a potential risk of deadlock if the callbacks need to lock an object that is already locked by the current Worker.

The Unsafe commit in this PR will be reverted once LFS is integrated into nodegit.

MikeJerred added a commit to LogicOverSnacks/glint-nodegit that referenced this pull request Dec 17, 2021
Copy link
Member

@implausible implausible left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexaXs This is not a good move. While it makes sense that we want to fix the regression - this cannot be a permanent solution as it throws lock safety out the window. Basically this allows nodegit to enter a state where the callback calling thread may perform work with a "locked" object, but the lock itself has been released by the callback runner. This is incredibly unsafe and was one of the aims that I intended to fix when updating the library in this way.

This may be a temporary workaround however, we have no idea what the consequences of this change really are other than "it seemed to work before". There definitely existed segfaults around LFS before, and it's very possible that this fix improved stability.

The appropriate change for fixing LFS is not to remove this patch, but to integrate nodegit-lfs into nodegit directly and with a managed solution for working with locks. Frankly, this fixes a performance regression to reintroduce a safety regression... We should aim instead to fix 2 performance regressions without introducing any safety regressions. Nodegit-lfs is poorly designed right now, because it does not use a streaming interface, it loads the entire blob into memory and then sends it to lfs for processing. If we integrated nodegit-lfs directly into nodegit, we could theoretically switch the implementation to:

  1. streaming interface
  2. no need for callbacks to JS simply for streaming data
  3. only using callbacks for authentication

I strongly suggest taking this approach for fixing LFS and looking at a longer term approach that will improve the value of the library. That this is an issue at all should provide enough friction with the team and customers to provide the support you need to build this out at Axosoft.

That all said, if the timeline for fixing lfs properly isn't manageable, we can discuss merging this; however, this PR and associated commits need to be updated to comment out the code and properly document that this was removed as a temporary workaround and not as a "permanent fix". We should label this as an unsafe workaround before it is merged, and we should not remove the code that works as intended.

@AlexaXs
Copy link
Contributor Author

AlexaXs commented Jan 3, 2022

@implausible, first of all than you so much for reviewing this PR, and for the tremendous work you did to rewrite the threadpool. Also thanks a lot for your comments on the approach to fix LFS properly.

Obviously I hadn't understood all the flow involved in the threadpool with thread-safety, so apologies for the unsafe changes in the PR.

I've checked previous PRs related with the threadpool and thread-safety to understand better the present code:

I'd appreciate if you could confirm what I understand about the present code:

  1. This temporaryUnlock is here to prevent a deadlock in the main JS thread in this situation: a Worker execution (asynchronous) has locked some libgit2 objects (only of type git_repository and git_index at the moment), and while locked, the main JS thread runs a synchrounous NodeGit API that could potentially also lock the same objects.

  2. This waiting is what makes NodeGit remain thread-safe while the JS callback is running and the locked objects are temporary unlocked.

If my understanding on those points are correct, it'd be great if you could review this change to the PR consisting on commenting out the temporaryUnlock in the first point above. Let me know if my thoughts are correct concerning this possible update to the PR:

  1. LockMaster locks all the objects at once, and that already prevents deadlocks between async executions.
  2. At the moment, synchronous NodeGit APIs never lock any object.

Unless there's something I'm not seeing, by commenting out the temporaryUnlock, the objects initially locked by the async Worker will remain locked until the COMPLETED event is received (when the Worker finishes with the async execution), keeping NodeGit thread-safe.

As I see it, with this change, although we open the possibility to suffer from deadlocks in the future, with the present code NodeGit remains thread-safe and there is no possibility of deadlocks.

If you think these thoughts are correct and commenting out the temporaryUnlock is a valid temporal workaround, it'd be great if you could merge this PR until LFS is fixed properly. In that case I'd update this PR this way:

  • I'll re-label the PR as Unsafe temporal workaround: LFS checkout performance regression.
  • I'll add a new commit to put back the removed code and will comment it out instead, documenting the reason for the temporal workaround.

Let me know please if everything makes sense. And happy 2022 by the way! 😄

@AlexaXs AlexaXs marked this pull request as draft January 14, 2022 15:17
@AlexaXs AlexaXs changed the title LFS checkout performance regression UNSAFE TEMPORARY WORKAROUND for LFS checkout performance regression Jan 24, 2022
@AlexaXs AlexaXs changed the title UNSAFE TEMPORARY WORKAROUND for LFS checkout performance regression UNSAFE Temporary workaround for LFS checkout performance regression Jan 24, 2022
@AlexaXs AlexaXs force-pushed the fix/LFS-checkout-performance-regression branch from 8af36d5 to b4ccac5 Compare January 24, 2022 14:52
@AlexaXs
Copy link
Contributor Author

AlexaXs commented Jan 24, 2022

@implausible, I've modified the description and the commits of this PR to see if they could work as an UNSAFE temporary workaround until LFS is integrated into NodeGit:

  • The changes will be limited to callbacks from Workers leveraging threaded libgit2 functions (at the moment only checkout functionality). The rest of the functionality won't be affected.
  • The changes will allow callbacks to be queued to the main JS thread without waiting for the current one to end, and will not apply a temporaryUnlock, since it would lock back again before the callback is executed. This way we make sure that the callback calling thread will perform work with objects locked.
  • The changes will be UNSAFE because the callbacks might need to lock objects already locked by the current Worker, ending on a deadlock. Tests have been added to check this situation in two different scenarios, with threaded and non-threaded libgit2 functions. The test can run async callback on checkout without deadlocking have been added and skipped, since it will fail with these changes, pointing to why these changes are UNSAFE. The behaviour here is the same as with the threadpool before the rewrite.
  • There are comments starting with Temporary workaround for LFS checkout in the code related to the changes.

We'll be working on the integration of LFS into nodegit, that will revert this Unsafe commit if you think it can be merged as a temporary workaround.

@AlexaXs AlexaXs marked this pull request as ready for review January 24, 2022 15:27
@AlexaXs AlexaXs requested a review from implausible January 24, 2022 15:28
@implausible
Copy link
Member

I think that generally covers my major concerns. I would be open to this workaround since it is targeted and addresses real customer impact. I am excited to see LFS become nodegit native in the future!

@implausible
Copy link
Member

Would you kindly add a test in https://github.com/nodegit/nodegit/blob/master/test/tests/worker.js to demonstrate that this code is safe during shutdowns?

- ThreadPoolImpl doesn't need to keep a pointer of context.
- Methods RunJSThreadCallbacksFromOrchestrator not used.
We want to test two scenarios:
- When libgit2 spawns threads to do the work (when doing a checkout).
- When libigt2 leverages a single thread to do the work (for example when working with submodules).

In each scenario, we'll run synchronous work inside the callbacks, where no locking is applied, so they should succeed.
We'll also run asynchronous work inside the callbacks that lock the same objects already locked. These tests should be able to run by temporary unlocking those objects until the callback ends.
This is a temporary workaround in order to avoid the lost of performance with LFS checkout.

The change is limited to the processing of callbacks from Workers that leverage threaded libgit2 functions. Basically what it does is allowing the callbacks from executorEventsQueue to be queued in jsThreadCallbackQueue without waiting for the current one to end.

It is unsafe because with threaded libgit2 functions there is a potential risk of deadlock if the callbacks need to lock an object.

This commit will be reverted when nodegit-lfs is integrated into nodegit.
@AlexaXs AlexaXs force-pushed the fix/LFS-checkout-performance-regression branch from b4ccac5 to 7b248d9 Compare February 11, 2022 14:20
Checkout leverages libgit2 threads and when applying filters it runs JS callbacks. These tests check that when running checkout on a worker thread and this is terminated, it exists gracefully without memory leaks.
@AlexaXs
Copy link
Contributor Author

AlexaXs commented Feb 17, 2022

I would be open to this workaround since it is targeted and addresses real customer impact. I am excited to see LFS become nodegit native in the future!

Thank you so much @implausible, and yes, I am excited too about the LFS integration!

Sorry for the delay in pushing the tests. I`ve added a couple of tests, the first one terminates a worker thread that is running checkouts in a loop where a git filter is used, to demonstrate that it is terminated gracefully. The second one checks that the objects involved in multiple checkouts are tracked and will be freed when the worker thread is terminated, preventing memory leaks.

@ianhattendorf ianhattendorf merged commit 4602913 into nodegit:master Mar 17, 2022
@AlexaXs AlexaXs deleted the fix/LFS-checkout-performance-regression branch March 29, 2022 14:03
ianhattendorf added a commit to ianhattendorf/nodegit that referenced this pull request Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants