Add support for cloning SHA256 repositories #1825
Conversation
Pull request overview
This PR progresses SHA256 support by enabling cloning of SHA256 repositories. It introduces ObjectFormatGetter and ObjectFormatSetter interfaces to allow storage backends to report and configure their hash algorithm. The implementation updates transport protocols, packfile handling, and index file encoding to support both SHA1 and SHA256.
Changes:
- Added `ObjectFormatGetter` and `ObjectFormatSetter` interfaces, implemented by the in-tree storage backends (memory and filesystem)
- Updated the transport layer to detect and propagate the object format from capabilities or hash sizes
- Modified the packfile parser/scanner to handle SHA256 hashes and updated the idxfile writer to skip unmaterialized delta objects
- Removed the `ErrSHA256NotSupported` error, as SHA256 is now supported
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| storage/storer.go | Defines new ObjectFormatGetter/Setter interfaces |
| storage/memory/storage.go | Implements object format interfaces for memory storage |
| storage/filesystem/storage.go | Implements object format interfaces for filesystem storage |
| plumbing/transport/*.go | Updates transport to detect and pass object format to packfile parser |
| plumbing/format/packfile/*.go | Updates packfile parsing to support SHA256 hashes |
| plumbing/format/idxfile/*.go | Refactors encoder and adds defensive checks for unmaterialized objects |
| plumbing/protocol/packp/*.go | Updates protocol parsing to handle variable hash sizes |
| repository.go | Removes ErrSHA256NotSupported (breaking change) |
| repository_test.go | Adds comprehensive tests for SHA256 cloning |
| worktree_commit_test.go | Converts assertions to require for better test reliability |
| internal/server/* | Adds internal HTTP server infrastructure for testing |
storage/memory/storage.go
Outdated
```go
s.config.Extensions.ObjectFormat = of
s.config.Core.RepositoryFormatVersion = formatcfg.Version1
```
Direct access to s.config on lines 88-89 could cause a nil pointer dereference if the config has not been initialized yet. While Config() initializes it lazily, SetObjectFormat doesn't call Config() first. It should either call s.Config() to ensure initialization, or handle the nil case explicitly.
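A minimal sketch of the defensive variant this comment suggests, assuming the config is lazily initialised by Config(); the receiver and field names mirror the reviewed snippet and may differ from the final code:

```go
// Sketch only: go through Config() so the lazily-initialised config is
// guaranteed to exist before we mutate it, avoiding the nil pointer dereference.
func (s *ConfigStorage) SetObjectFormat(of formatcfg.ObjectFormat) error {
	cfg, err := s.Config()
	if err != nil {
		return err
	}
	cfg.Extensions.ObjectFormat = of
	cfg.Core.RepositoryFormatVersion = formatcfg.Version1
	return s.SetConfig(cfg)
}
```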
storage/storer.go
Outdated
```go
// ObjectFormatGetter is implemented by storage backends that can report
// their object format (hash algorithm), such as SHA-1 or SHA-256.
type ObjectFormatGetter interface {
	// ObjectFormat returns the object format (hash algorithm) used by this storage.
	ObjectFormat() formatcfg.ObjectFormat
}

// ObjectFormatSetter is implemented by storage backends that support
// configuring the object format (hash algorithm) used for the repository.
// This is typically called during repository initialization or when the
// first packfile is fetched.
type ObjectFormatSetter interface {
	// SetObjectFormat configures the object format (hash algorithm) for this storage.
	SetObjectFormat(formatcfg.ObjectFormat) error
}
```
Why do we want a custom interface for this? It makes the repository object format look like a special object in Git. Instead, I would reuse the config.ConfigStorer interface to get/set the object format and come up with something like storage.ReadObjectFormat(config.ConfigStorer) and storage.SetObjectFormat(config.ConfigStorer, formatcfg.ObjectFormat), or similar.
This was a compromise, the reasoning being:
- When we init a new repository, we don't know what `ObjectFormat` it will have. That information is acquired by talking to the server, and the new format is then set using `UpdateObjectStorage`. Alternatively, we could hold the repository initialisation until after the fetching process, but that would be a larger change.
- A key motivation here was backwards compatibility with off-tree storage implementations. This approach means they continue to compile and work when interacting with SHA1 repositories, but fail safely when interacting with a SHA256 repo.
You need to specify the object format for new repositories; otherwise, it defaults to SHA1. That's how git works by default. Migrating a repository from one format to another requires creating a new repository with the new format and importing/exporting the history from the old format to the new one via git fast-export and git fast-import.
I think there should be an option to specify the object format for new repositories. UpdateObjectStorage and others should always use the repository format. The transport layer should validate that new requests match the storage format.
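A minimal sketch of the transport-side validation suggested above; this is not the PR's actual code, and the helper name is hypothetical, while the UnsetObjectFormat sentinel is taken from the suggestion later in this thread:

```go
// validateObjectFormat rejects a request whose advertised format does not
// match the format already configured in the local storage.
func validateObjectFormat(local, remote formatcfg.ObjectFormat) error {
	if local != formatcfg.UnsetObjectFormat && local != remote {
		return fmt.Errorf("object format mismatch: storage uses %q, remote advertises %q", local, remote)
	}
	return nil
}
```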
Empty repos are kind of a funny place where you'd think the hash function is undefined, but it is actually defined and stored (implicitly or explicitly) as part of the config. The main reason is what happens when you have an empty repository and then you clone it?
This is a valid operation that must work predictably. An empty repository is initialized by git forges like Gitea, then you clone it, add commits and push them. Waiting until you create the first objects to define the repository ObjectFormat does not work.
The data sent when you transfer a repo includes this information, even if there are no objects, to make sure that the ObjectFormat is unambiguous.
Thank you both for the feedback in such a timely manner. This approach was taken to lean on existing primitives while cloning new repositories, as at that point in time you don't know what ObjectFormat your server has. As per upstream:
> We have a chicken-and-egg situation between initializing the refdb and spawning transport helpers:
>
> - Initializing the refdb requires us to know about the object format. We thus have to spawn the transport helper to learn about it.
>
> - The transport helper may want to access the Git repository. But because the refdb has not been initialized, we don't have "HEAD" or "refs/". Thus, the helper cannot find the Git repository.
Upstream worked around this by introducing a lazy refdb init: instead of initialising a valid refdb, it creates an invalid one, only finishing the init process once the target object format is agreed with the remote. Let me take that approach into consideration and update this PR accordingly.
Following-up from this discussion, I made a few changes to better align with upstream:
- Added the partial init and moved the setting of the object format to the pack negotiation logic. That only takes place on the first negotiation during a clone operation; subsequent negotiations won't trigger an object format change in the storage.
- Tests were added to ensure that the object format can't be changed on initialised repositories, regardless of whether it was set explicitly or not.
- The go-git Storage abstraction for filesystem ignores any `ObjectFormat` passed in to `NewStorageWithOptions` when the underlying fs contains an initialised repository. In such cases, the repository config is taken as the single source of truth.
Are there any additional scenarios we may need to cater for on the test cases?
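To make the flow above concrete, here is a rough sketch of the clone sequence described; doPartialInit, negotiate and finishInit are hypothetical names, not the PR's actual API:

```go
// Sketch of the clone flow: partial init, capability negotiation, and only
// then committing the object format to the storage.
func cloneSketch(ctx context.Context, st storage.Storer, url string) error {
	// 1. Partially initialise the repository; no object format is committed yet.
	if err := doPartialInit(st); err != nil {
		return err
	}

	// 2. Negotiate capabilities with the remote; only now do we learn whether
	//    the server repository uses SHA1 or SHA256.
	serverFormat, err := negotiate(ctx, url)
	if err != nil {
		return err
	}

	// 3. Complete the init with the negotiated format. Later negotiations must
	//    not change the format of an already initialised repository.
	return finishInit(st, serverFormat)
}
```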
plumbing/format/packfile/common.go
Outdated
```go
if setter, ok := s.(storage.ObjectFormatSetter); ok {
	err := setter.SetObjectFormat(of)
	if err != nil {
		return err
	}
}
```
What happens if the repository has a different object format from the incoming packfile? I think in that case this should fail with an error. I think we should always use the storage object format and pass that down to the parser
This PR makes both the memory and filesystem implementations fail if they are not empty when a change of object format is requested.
This is a fail-safe mechanism to enable single object format repositories to work, while erroring when a repo is trying to handle multiple formats. This should be reviewed as part of the translation table implementation. I'm copying this thread to the issue so that this discussion can be discoverable.
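A minimal sketch of that fail-safe; the method body and the isEmpty helper are illustrative rather than the PR's exact implementation:

```go
// Illustrative only: the in-tree storages accept an object format change
// while they hold no objects, and error otherwise.
func (s *Storage) SetObjectFormat(of formatcfg.ObjectFormat) error {
	if s.ObjectFormat() == of {
		return nil // no-op when the format is unchanged
	}
	if !s.isEmpty() {
		return fmt.Errorf("cannot change the object format of a non-empty repository")
	}
	s.objectFormat = of
	return nil
}
```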
```go
	}
}

func WithObjectFormat(of config.ObjectFormat) ParserOption {
```
Shouldn't we infer the object format from the repository storage?
Overall, I don't think we can:
- A new repository likely won't have the `ObjectFormat` set. Even if it does, it may not align with the target pack being fetched. For example, a new repo (defaulting to SHA1) then fetches a SHA256 packfile.
- An existing repository may be set with a different `ObjectFormat` than the one of the packfile being downloaded.

These changes would set the repository object format to the target packfile format, provided this is a new/empty repository. This feels like a safe approach until we have multi-`ObjectFormat` support. What do you think?
For reference, in the filesystem storage the Parser will be called here with the ObjectFormat coming either from the repository config, when there is an existing underlying repo, or from the storage options otherwise.
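A small sketch of that precedence, under assumed names: an existing repository's configured format wins over the value passed through the storage options.

```go
// resolveObjectFormat is a hypothetical helper illustrating the precedence
// described above.
func resolveObjectFormat(fromRepoConfig, fromOptions formatcfg.ObjectFormat) formatcfg.ObjectFormat {
	if fromRepoConfig != formatcfg.UnsetObjectFormat {
		return fromRepoConfig // the underlying repository already defines its format
	}
	return fromOptions // fall back to the value given to NewStorageWithOptions
}
```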
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
This error was created when go-git needed to be built for sha256 use, which is no longer the case. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
The fetching of memory objects will now ensure the correct hash format. The same applies to the pack hash size, which previously returned a SHA1 hash for SHA256 packfiles. Additionally, the dual hash (both SHA1 and SHA256) was removed. This is largely for practical reasons, as it is likely that this part of the logic may not be required in the future. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
The idxfile.encoder lacked defensive programming and was quite light on testing. The refactor aligns the code structure with the recent implementation of rev.Encoder. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Force-pushed from 42687fd to d959bb5
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
…res' A bug in other areas of plumbing/format was causing the revfile package to generate invalid files at times. These additional checks improve the test harness to avoid future regressions. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
The SHA256 enablement requires the ability to assess whether the object format is the default because it is unset, or whether it was explicitly picked. Another motivation for the change from int to string is that both the Git protocol and the Git config store this information as a string, so the change better aligns with upstream. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
There are two interfaces that represent git Storers: storer.Storer and storage.Storer. The former is a light-weight version focusing only on encoded objects and references; the latter is a full representation of a git repository. The latter provides access to the ObjectFormat via storer.ConfigStorer. Although the interface is implemented by the in-tree storages, it now lives together with the former so that it becomes clearer that it is only needed when handling storer.Storer. Alternatively, ObjectFormatGetter could be removed in favour of casting storer.Storer into storage.Storer; the problem with this approach is that any off-tree implementation would have a greater burden in order to implement sha256 support. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Fully initialised repositories must not have their object format changed. However, when cloning a repository we first init the repository and then negotiate capabilities with the remote. Only at that point are we aware of what object format the local storage must be created with. To support the clone workflow, while also blocking the changing of the object format for empty but fully initialised repositories, we:
- Implement partial inits which are validated at pack negotiation time.
- Keep the logic private so that it is only used internally for clones.
- Add tests to avoid future regression on empty repositories.

Additional info upstream:
- https://github.com/git/git/blob/ab380cb80b0727f7f2d7f6b17592ae6783e9820c/builtin/clone.c#L1216C60-L1216C68

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Ensures that existing repositories always take precedence, and that whatever ObjectFormat is passed in to NewStorageWithOptions is ignored. Signed-off-by: Paulo Gomes <pjbgf@linux.com>
aymanbagabas
left a comment
Looking great! I wonder if we should keep the ObjectFormatGetter 🤔
```go
type ObjectFormatGetter interface {
	// ObjectFormat returns the object format (hash algorithm) used by the Storer.
	ObjectFormat() formatcfg.ObjectFormat
}
```
Do we need this interface still? If we're storing the object format in config and config is the single source of truth, this seems redundant. On this note, I always found the storer.Storer and storage.Storer interfaces confusing, and we should deprecate one of them.
Long-term we may not need this interface - I will add some comments to that effect. I also agree that storer.Storer and storage.Storer could be reviewed, so I created an issue to handle both subjects at once - tracked against the v7 milestone.
For now, I'd rather we keep ObjectFormatGetter so that we don't break the interface for existing external implementations, but also have a safe way to detect the lack of SHA256 support on off-tree storage implementations. As the lack of this awareness could result in data corruption, I will add a test to highlight this point.
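A short sketch of the safety check described above (the helper name is illustrative; the interface names are the ones introduced by this PR): before ingesting SHA256 data, the core logic can probe the Storer and fail early when an off-tree implementation predates the new interfaces.

```go
// ensureSHA256Support shows how the interface doubles as a capability probe,
// so off-tree storages fail early rather than corrupt their data.
func ensureSHA256Support(st storer.Storer, of formatcfg.ObjectFormat) error {
	if of != formatcfg.SHA256 {
		return nil // SHA1 repositories keep working with unchanged storages
	}
	if _, ok := st.(storer.ObjectFormatGetter); !ok {
		return fmt.Errorf("storage %T does not support SHA256 repositories", st)
	}
	return nil
}
```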
```go
if setter, ok := st.(storage.ObjectFormatSetter); ok {
	err := setter.SetObjectFormat(serverFormat)
	if err != nil {
		return nil, fmt.Errorf("unable to set object format: %w", err)
	}

	clientFormat = serverFormat
}
```
Suggested change:

```diff
-if setter, ok := st.(storage.ObjectFormatSetter); ok {
-	err := setter.SetObjectFormat(serverFormat)
-	if err != nil {
-		return nil, fmt.Errorf("unable to set object format: %w", err)
-	}
-	clientFormat = serverFormat
-}
+clientFormat = serverFormat
+cfg.Extensions.ObjectFormat = serverFormat
+st.SetConfig(cfg)
```
We can't just change the config, as the storage is already initialised at this point, so we need to reinitialise the (object) hashers with the target format.
If we were to add this additional logic onto SetConfig, that would mean we'd need to verify on each call whether it is OK to change the object format and whether that should trigger a reinitialisation of the (object) hashers. As this would be storage.Storer specific, it could lead to a corrupted state on off-tree implementations, as the core logic would be unaware of such support.
```go
	// SetObjectFormat configures the object format (hash algorithm) for this storage.
	SetObjectFormat(formatcfg.ObjectFormat) error
}
```
Suggested change:

```go
// ObjectFormat finds the object format for the given [config.ConfigStorer].
// If the config storer doesn't define an object format, it will return
// [config.UnsetObjectFormat].
func ObjectFormat(st config.ConfigStorer) config.ObjectFormat {
	cfg, err := st.Config()
	if err == nil {
		return cfg.Extensions.ObjectFormat
	}
	return config.UnsetObjectFormat
}

// SetObjectFormat sets the object format for the given [config.ConfigStorer].
func SetObjectFormat(st config.ConfigStorer, f config.ObjectFormat) error {
	cfg, err := st.Config()
	if err != nil {
		return err
	}
	cfg.Extensions.ObjectFormat = f
	return st.SetConfig(cfg)
}
```
This PR progresses the SHA256 support journey by:

- Handling the `object-format` capability from the server side when it is supported, for both SHA1 and SHA256.
- Adding the `storer.ObjectFormatGetter` and `storage.ObjectFormatSetter` interfaces and implementing them on the in-tree storage options. External storage will continue to work, however it won't be able to support SHA256 until it implements the new interfaces.

Running the `clone` example against a sha256 repository:

During the implementation I stumbled onto other issues which I fixed as part of this PR, as they were either blocking the way or the change was small enough to make sense to do so:

- `format/idxfile` lacked defensive programming to remove (or error on) objects that were not fully materialised, which could lead to invalid index files. This resulted in `object not found` errors, which may be related to existing reports of that symptom still happening on v6 (e.g. "Pull returns object not found" #207, "go-git sometimes returns object not found when querying for the commit object for the HEAD" #827).
- `format/idxfile` was growing the backing arrays for `names`, `offset` and `crc32` as it iterated through objects. This is rather inefficient and has a performance impact when processing very large pack files.
- Tests were added for the `filesystem` and `memory` storage, to avoid future regressions when handling `ObjectFormat`.

Relates to #706.
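To illustrate the clone scenario mentioned above, here is a minimal, self-contained sketch of cloning a SHA256 repository with go-git after this change; the URL is a placeholder and the v6 Clone signature is assumed to match the existing API:

```go
package main

import (
	"fmt"

	git "github.com/go-git/go-git/v6"
	"github.com/go-git/go-git/v6/storage/memory"
)

func main() {
	// Clone into in-memory storage; the object format (SHA1 or SHA256) is
	// negotiated with the server and applied to the storage during the clone.
	r, err := git.Clone(memory.NewStorage(), nil, &git.CloneOptions{
		URL: "https://example.com/sha256-repo.git", // placeholder URL
	})
	if err != nil {
		panic(err)
	}

	head, err := r.Head()
	if err != nil {
		panic(err)
	}
	fmt.Println(head.Hash()) // 64 hex characters for a SHA256 repository
}
```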
ℹ️ Some of this work was done as part of openSUSE's Hack Week 25. Thanks SUSE for sponsoring the go-git project. 🙌