Skip to content

WIP: Reftables support#5462

Closed
pks-t wants to merge 33 commits intolibgit2:mainfrom
pks-t:pks/reftables-support
Closed

WIP: Reftables support#5462
pks-t wants to merge 33 commits intolibgit2:mainfrom
pks-t:pks/reftables-support

Conversation

@pks-t
Copy link
Member

@pks-t pks-t commented Mar 25, 2020

So this is a very rough reftables refdb implementation for libgit2. It's main intent is to start a discussion, see what prerequisites we have in libgit2 to properly implement refdbs other than refdb_fs and whether I'm using the reftable API correctly. Thoughts:

  • While we theoretically support reference databases other than the filesystem-based one, our support for that is... lacking. We have dozens of places where we assume $GITDIR/HEAD exists, we write references like MERGE_HEAD, ORIG_HEAD manually without consulting the refdb, lack support to create new repos with a different refdb as we create the filesystem structures like $GITDIR/refs and $GITDIR/HEAD directly in repository.c without consulting the refdb at all. We'll have to carefully vet our code to remove all accesses to anything refish outside of the refdb and instead call into the refdb layer.

  • It seems to me that reftables currently don't support a few operations that we expect from any given refdb implementation. Most importantly I couldn't figure out how to delete and rename a reflog and how to implement explicit locking (e.g. what the file backend does when creating .git/refs/heads/master.lock).

  • I found it quite hard to get into reftables as the interface exposes about quite a lot of the low-level details. This is probably my own fault as I usually tend to dive heads-deep into anything new without looking left and right first. But one resource that would've helped quite a lot is something like high-level examples that represent common git operations. I feel like I've got it right by now, but I'd be happy to get feedback on this to see how wrong I am.

  • Are there any testvectors around? Like a repository that one may embed into libgit2 itself that uses reftables to verify we behave as expected. I couldn't get a hold of any and as it's currently non-trivial to use reftables anywhere I wasn't able to come up with tests yet. As a result, my current code will probably fail very quickly.

  • At some places I feel like there's high potential for namespacing issues. E.g. struct iterator is highly generic and may trivially clash with any preexisting types. It would be nice to use namespace prefixes for all exported symbols.

  • I think it would help if the C files were layed out in a more "standard" library way. Most importantly, separate out the "reftable.h" header file from the rest into its own folder so that one can add it to the include path without having to care about accidentally overriding "tree.h" that's part of libgit2 itself.

  • I didn't find any way to override memory allocation functions. I think we've discussed that at the Contributor's summit already.

Anyway, thanks a lot for working on reftables @hanwen, this is really exciting! I'd be happy to get feedback on my current implementation, which basically represents how I think it should work. Chances are high I've got it completely wrong, though ;)

@pks-t pks-t force-pushed the pks/reftables-support branch from 8f53ad6 to 83ceed9 Compare March 25, 2020 14:24
&entry->oid_old, &entry->oid_cur,
NULL, entry->msg)) < 0)
goto out;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably need to deduplicate pre-existing reflog entries. With filesystem-based reflog, we're rewriting the complete reflog every time, I think, but with reftables I'll need to figure out which entries are new.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think you are right. (I don't understand why you get duplicate entries, though.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly about git_reflog semantics in libgit2. When you pass it to the filesystem-based reflog, you need to have the complete reflog history in order to re-write it in full. So with reftable I'll need to see which entry is the last one that we've already written to disk and only start writing after that one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. In reftable, the log entries are stored newest-first, so a simple seek to the refname will tell you the last entry written.

Copy link
Contributor

@hanwen hanwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filesystem-based one, our support for that is... lacking. We have
dozens of places where we assume $GITDIR/HEAD exists, we write
references like MERGE_HEAD, ORIG_HEAD manually without consulting the
refdb, lack support to create new repos with a different refdb as we
create the filesystem structures like $GITDIR/refs and $GITDIR/HEAD
directly in repository.c without consulting the refdb at all. We'll
have to carefully vet our code to remove all accesses to anything
refish outside of the refdb and instead call into the refdb layer.

Yes, I think git-core also suffers from this.

It seems to me that reftables currently don't support a few operations
that we expect from any given refdb implementation. Most importantly I
couldn't figure out how to delete and rename a reflog and how to
implement explicit locking (e.g. what the file backend does when
creating .git/refs/heads/master.lock).

For deleting a complete reflog, you'd have to delete all the entries
individually. For renaming a reflog, the suggestion is to record a
deletion + creation. This would let you retractively understand that
the rename took place.

Why do you want explicit locking? The API is written as if you'd be
doing transactions, i.e. you try to make the change, and if you get a
LOCK_ERROR, you'll have to try again.

I found it quite hard to get into reftables as the interface exposes
about quite a lot of the low-level details. This is probably my own
fault as I usually tend to dive heads-deep into anything new without
looking left and right first. But one resource that would've helped
quite a lot is something like high-level examples that represent
common git operations. I feel like I've got it right by now, but I'd
be happy to get feedback on this to see how wrong I am.

the best resource is probably code
gitgitgadget/git#539, which adds support to
git-core.

Are there any testvectors around? Like a repository that one may embed
into libgit2 itself that uses reftables to verify we behave as
expected. I couldn't get a hold of any and as it's currently
non-trivial to use reftables anywhere I wasn't able to come up with
tests yet. As a result, my current code will probably fail very
quickly.

The easiest is to compile JGit,

bazel build org.eclipse.jgit.pgm:pgm

this leaves a binary in

bazel-bin/org.eclipse.jgit.pgm/jgit

then you can do

jgit init
jgit convert-ref-storage --format reftable

There currently aren't test vectors, but it's a good suggestion. I
could add some to the repository.

At some places I feel like there's high potential for namespacing
issues. E.g. struct iterator is highly generic and may trivially clash
with any preexisting types. It would be nice to use namespace prefixes
for all exported symbols.

I want to do this, but I've been putting it off, because I don't have
an IDE that will help me with renaming, and doing it by hand is
somewhat labor intensive.

I think it would help if the C files were layed out in a more
"standard" library way. Most importantly, separate out the
"reftable.h" header file from the rest into its own folder so that one
can add it to the include path without having to care about
accidentally overriding "tree.h" that's part of libgit2 itself.

I'll get to that.

I didn't find any way to override memory allocation functions. I think
we've discussed that at the Contributor's summit already.
Anyway, thanks a lot for working on reftables @hanwen, this is really
exciting! I'd be happy to get feedback on my current implementation,
which basically represents how I think it should work. Chances are
high I've got it completely wrong, though ;)

you're welcome! I am glad to see it get uptake :)

&entry->oid_old, &entry->oid_cur,
NULL, entry->msg)) < 0)
goto out;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think you are right. (I don't understand why you get duplicate entries, though.)

@pks-t
Copy link
Member Author

pks-t commented Mar 26, 2020

Why do you want explicit locking? The API is written as if you'd be
doing transactions, i.e. you try to make the change, and if you get a
LOCK_ERROR, you'll have to try again.

Because we're kind of in a special place with libgit2. With filesystem based refs, you can explicitly lock references to keep other applications from updating them by themselves. As libgit2 may be part of a long-running application (e.g. an IDE, but it could be anything) where the author of that application may have the need to lock references for a longer time, we support a transactional API that keeps other Git applications from updating them. While it's true that you would get an error anyway with the reftable API, this kind of opportunistic locking wouldn't help in such a usecase.

Thanks for your answers, I'll make sure to update the PR soonish with your feedback!

@pks-t pks-t force-pushed the pks/reftables-support branch 2 times, most recently from ee17225 to 0bfe212 Compare March 26, 2020 09:10
@hanwen
Copy link
Contributor

hanwen commented Mar 26, 2020 via email

@pks-t pks-t force-pushed the pks/reftables-support branch from 5944c49 to 65c0d65 Compare March 27, 2020 13:30
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include <zlib.h>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently doesn't compile on Windows due to <sys/time.h> not being found. I think it should work just fine if changing this to <time.h>. Instead, we could probably have the equivalent of REFTABLE_IN_GITCORE, but I think I'd rather prefer the header to be agnostic of libgit2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On linux, this would yield:

c/stack.c:226:12: error: implicit declaration of function 'gettimeofday' [-Werror=implicit-function-declaration]
226 | int err = gettimeofday(&deadline, NULL);
| ^~~~~~~~~~~~

@pks-t
Copy link
Member Author

pks-t commented Mar 27, 2020

Thanks a lot for addressing my concerns, @hanwen! I've updated the PR to include all of them.

@pks-t
Copy link
Member Author

pks-t commented Mar 28, 2020 via email

@hanwen
Copy link
Contributor

hanwen commented Mar 28, 2020

For the ref-like objects, this is what JGit does
https://eclipse.googlesource.com/jgit/jgit/+/459fc28cf6a22992b7d168cfd52286cd76ca68bc/org.eclipse.jgit/src/org/eclipse/jgit/lib/RefDatabase.java#511

(https://eclipse.googlesource.com/jgit/jgit/+/459fc28cf6a22992b7d168cfd52286cd76ca68bc/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/RefDirectory.java#113)

In JGit, HEAD is expected to be managed by RefDatabase. Unfortunately, nobody has been able to articulate to me what the expectations/abstractions for this in git-core should do. Can you do transactional updates with HEAD and another branch together in libgit2?

@pks-t
Copy link
Member Author

pks-t commented Mar 28, 2020 via email

@hanwen
Copy link
Contributor

hanwen commented Mar 30, 2020

Hey Patrick,

can you check out

https://godoc.org/github.com/google/reftable#Addition

and

https://godoc.org/github.com/google/reftable#Stack.NewAddition

would that work as an API for you?

@pks-t
Copy link
Member Author

pks-t commented Mar 31, 2020

Just to make sure I understand it, this is how I'd imagine it to look like in C:

/* Create a new transactional addition for the stack */
reftable_stack_new_addition(&addition, stack);
/* Write a set of ref/log records via the usual writer callback */
reftable_addition_add(addition, reftable_backend_write_records, &args1);
/* Writing multiple times is okay for a single addition */
reftable_addition_add(addition, reftable_backend_write_records, &args2);
/* Now commit changes to persist it to disk and update the stack */
reftable_addition_commit(addition);

If I understand it correctly, then Close() will not commit any data but just release the Addition, right? So it's kind of a rollback/abort function.

If my above assumptions are correct then the API should work for us, thanks a lot!

/* 4-byte identifier ("sha1", "s256") of the hash.
* Defaults to SHA1 if unset
*/
uint32_t hash_id;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't quite work as advertised. I've set up reftable_write_options opts ={0}, where I'd expect it to just default to "sha1". But reftable_new_merged_table() fails when comparing r->hash_id != hash_id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be helpful to have some macros available so that one may set up the expected values without having to define them locally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't repro your complaint, see

google/reftable@8a04e7f

can you provide a repro testcase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot to check in for comments. The following fails if I do not set up options.hash_id:

const char *dir, *table;
struct reftable_write_options options = {0};
struct reftable_ref_record ref = {0};
struct reftable_stack *stack;

dir = "/path/to/testrepo-reftable/.git/reftable";
table = "/path/to/testrepo-reftable/.git/reftable/tables.list";

/* This line is required, otherwise `reftable_new_stack` fails. */
options.hash_id = 0x73686131;

cl_must_pass(reftable_new_stack(&stack, dir, table, options));
cl_must_pass(reftable_stack_reload(stack));
cl_must_pass(reftable_stack_read_ref(stack, "refs/heads/master", &ref));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm currently not using an up to date version of the reftable library as I first want to try to get what I have completely functioning (which it mostly is already, but there's still some issues with writing for preexisting branches).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which version are you using?

btw, if you are writing existing branches, make sure you sure you set an update_index that is beyond the update_index of the existing branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed in google/reftable@f781daf

@pks-t pks-t force-pushed the pks/reftables-support branch from 65c0d65 to 7b53cfc Compare March 31, 2020 09:13
@pks-t
Copy link
Member Author

pks-t commented Mar 31, 2020

I've added a test vector and was able to get a first test to pass with the reftable backend, which reads a reference and then renames it. More tests and bug fixes will follow.

I've fixed the refdb detection to now consult "extensions.refStorage" as mandated by the documentation.

I've noticed that another requirement is to get repository format version 1 implemented. There's a temporary commit in here that just removes the error check.

@hanwen
Copy link
Contributor

hanwen commented Mar 31, 2020

google/reftable@7aabfe0 for the transaction api in C.

@hanwen
Copy link
Contributor

hanwen commented Mar 31, 2020

BTW, - I would be grateful for any feedback about reftable on windows. I don't have windows, and I'd like to keep it that way.

@pks-t pks-t force-pushed the pks/reftables-support branch from 7b53cfc to 4c2ab8f Compare April 2, 2020 11:38
@pks-t
Copy link
Member Author

pks-t commented Apr 2, 2020

Rebased on current master, which now includes support for repository format v1 and extensions.

BTW, - I would be grateful for any feedback about reftable on windows. I don't have windows, and I'd like to keep it that way.

I may dig up my Windows VM as soon as I've got the backend working on my own machine. Until then there's currently only the CI that provides any kind of feedback on Windows.

@pks-t pks-t force-pushed the pks/reftables-support branch from 4c2ab8f to 4ab39b1 Compare April 3, 2020 13:34
@pks-t pks-t force-pushed the pks/reftables-support branch 2 times, most recently from 65fc25f to 70fd8b1 Compare April 19, 2020 12:04
@hanwen
Copy link
Contributor

hanwen commented Apr 27, 2020

@pks-t - does libgit2 already have code to detect file / directory conflicts? Looks like I may need to write and/or refactor that to check transactions, and wondering if that makes sense to put into the reftable library.

@pks-t
Copy link
Member Author

pks-t commented Apr 27, 2020 via email

@pks-t pks-t force-pushed the pks/reftables-support branch from 70fd8b1 to 6276b81 Compare April 30, 2020 12:49
@pks-t
Copy link
Member Author

pks-t commented Apr 30, 2020

Rebased on latest master, updated the reftable library to 66d38fe (C: move block_iter_next comment to header, 2020-04-28), fixed a set of bugs in the reflog implementation and improved error handling in most places. All in all the PR is close to being finished, but I've stumbled over one weird bug where compressing of the reftable stack seems to fail after writing symbolic references. No idea yet whether that's a fault of mine or in the reftable library, but I'll investigate.

pks-t added 8 commits August 1, 2025 08:25
There are a bunch of tests where we read or write references via the
filesystem directly. This only works with the "files" backend, but
naturally breaks if we supported any other reference format.

Refactor these tests to instead use the refdb to access those.
When testing conditional includes we overwrite the repository's config
file with the relevant conditions. This causes us to fully overwrite all
repository configuration, including the repository format version and
any extensions. While the test repository used in this test does not
have any extensions, we will add a reftable-enabled repository that does
rely on the "refStorage" extension eventually.

Fix this by only modifying the relevant config keys.
We have a bunch of checks for properties of the "files" reference
backend:

  - Whether a specific reference has been packed or whether it still
    exists as a loose reference.

  - Whether empty ref directories get pruned.

  - Whether we properly fsync data to disk.

These checks continue to be sensible for that backend, but for any other
backend they plain don't work. Adapt the tests so that we only run them
in case the repository uses the "files" backend.
When initializing the "files" refdb we read a couple of values from the
Git configuration. Unfortunately, this causes a chicken-and-egg problem
when reading configuration with "includeIf.onbranch" conditionals: we
need to instantiate the refdb to evaluate the condition, but we need to
read the configuration to initialize the refdb.

We currently work around the issue by reading the "HEAD" file directly
when evaluating the conditional include. But while that works with the
"files" backend, any other backends that store "HEAD" anywhere else will
break.

Prepare for a fix by deferring reading the configuration. We really only
need to be able to execute `git_refdb_lookup()`, so all we need to
ensure is that we can look up a branch without triggering any config
reads.
With the preceding commit we have refactored the "files" backend so that
it can be both instantiated and used to look up a reference without
reading any configuration. With this change in place we don't cause
infinite recursion anymore when using the refdb to evaluate "onbranch"
conditions.

Refactor the code to use the refdb to look up "HEAD". Note that we
cannot use `git_reference_lookup()` here, as that function itself tries
to normalize reference names, which in turn involves reading the Git
configuration. So instead, we use the lower-level `git_refdb_lookup()`
function, as we don't need the normalization anyway.
Expose a function to read a loose reference. This function will be used
in a subsequent commit to read pseudo-refs on the generic refdb layer.
Regardless of which reference storage format is used, pseudorefs will
always be looked up via the filesystem as loose refs. This is because
pseudorefs do not strictly follow the reference format and may contain
additional metadata that is not present in a normal reference.

We don't honor this in `git_reference_lookup()` though but instead defer
to the refdb to read such references. This obviously works just fine
with the "files" backend, but any other backend would have to grow
custom logic to handle reading pseudorefs.

Refactor `git_reference_lookup_resolved()` so that it knows to always
read pseudorefs as loose references. This allows refdb implementations
to not care about pseudoref handling at all.
With a recent upgrade to a newer version of MSVC we now get a bunch of
warnings when two operands use different enum types. While sensible in
theory, in practice we have a couple of non-public enums that extend
public enums, like for example with `GIT_SUBMODULE_STATUS`.

Let's for now disable this warning to unblock our builds. The
alternative would be to add casts all over the place, but that feels
rather cumbersome.
@pks-t pks-t force-pushed the pks/reftables-support branch from 392cffb to 4884111 Compare August 1, 2025 14:51
@pks-t
Copy link
Member Author

pks-t commented Aug 1, 2025

In the current state all tests now pass on Linux. There's a couple patches I've sent upstream to fix issues with the reftable library in Git, but those are rather minor. Still needed is an adjustment of "deps/reftable/system.c" to work with non-Linux systems as well as CI changes to actually run tests with reftables enabled. I should be able to wrap that up next week, I guess.

@pks-t
Copy link
Member Author

pks-t commented Aug 1, 2025

There are also a bunch of PRs that are dependencies in libgit2:

pks-t added 10 commits August 3, 2025 17:55
Import the reftable library from commit f1228cd12c1 (reftable: make
REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of
the reftable library.

The library will be wired into libgit2 over the next couple of commits.
While the reftable library is mostly decoupled from the Git codebase,
some specific functionality intentionally taps into Git subsystems. To
make porting of the reftable library easier all of these are contained
in "system.h" and "system.c".

Reimplement those compatibility shims so that they work for libgit2.
The reftable writer accidentally uses the Git-specific `QSORT()` macro.
This macro removes the need for the caller to provide the element size,
but other than that it's mostly equivalent to `qsort()`.

Replace the macro accordingly. This incompatibility is a bug in Git that
needs to be fixed upstream.
While perfectly legal, older compiler toolchains complain when
zero-initializing structs that contain nested structs with `{0}`:

	/home/libgit2/source/deps/reftable/stack.c:862:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
		struct reftable_addition empty = REFTABLE_ADDITION_INIT;
						 ^~~~~~~~~~~~~~~~~~~~~~
	/home/libgit2/source/deps/reftable/stack.c:707:33: note: expanded from macro 'REFTABLE_ADDITION_INIT'
	#define REFTABLE_ADDITION_INIT {0}
					^

Silence this warning by using `{{0}}` instead.
Wire up the reftable library with CMake. At the current point in time
the library is not yet plugged into libgit2 itself.
While it is possible to pass flags to `reftable_stack_new_addition()` to
alter the behaviour of a stack addition, it is not possible to do so via
`reftable_stack_add()`. Plug this gap.
@pks-t pks-t force-pushed the pks/reftables-support branch from 4884111 to eb9ed40 Compare August 3, 2025 17:01
pks-t added 4 commits August 3, 2025 19:06
Implement the reftable backend that is used to read and write reftables.
The backend is not yet used anywhere after this commit.
Wire up the "reftable" storage format via the newly implemented reftable
backend.
Generate test resources for reftables. These resources are basically the
Git repositories we already have, but converted to use the "reftable"
format. For most of the part, this conversion is done by executing `git
refs migrate`.

A couple notes:

  - This require a recent Git upstream version with not-yet-upstreamed
    patches due to a bug in `git refs migrate` with reflogs.

  - The migration command does not yet support repositories with
    worktrees. Those were converted by first removing the worktrees,
    migrating the refs and then recreating them.

  - The HEAD_TRACKER reference in testrepo.git is not recognized as a
    root ref and is thus not automatically migrated.

  - testrepo.git has an empty reflog for refs/heads/with-empty-log that
    does not get migrated.
Wire up reftable tests so that they can be executed by setting the
`CLAR_REF_FORMAT` environment variable. This only catches tests that use
`cl_git_sandbox_init()`, but that should cover most of our tests. So
this infrastructure isn't perfect, but for now it's good enough. We may
want to iterate on it in the future.
@pks-t pks-t force-pushed the pks/reftables-support branch from eb9ed40 to 22b7bd2 Compare August 3, 2025 17:09
@pks-gitlab pks-gitlab mentioned this pull request Aug 4, 2025
@pks-t
Copy link
Member Author

pks-t commented Aug 4, 2025

Closing in favor of #7117, as I'm working on this with my GitLab account right now.

@pks-t pks-t closed this Aug 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants