
Add support for cloning SHA256 repositories#1825

Open
pjbgf wants to merge 29 commits into go-git:main from pjbgf:sha256-open

Conversation

Member

@pjbgf pjbgf commented Jan 26, 2026

This PR progresses the SHA256 support journey by:

  • Enabling the cloning of SHA1 and SHA256 repositories.
  • Setting the object-format capability from the server side when it is supported for both SHA1 and SHA256.
  • Adding new storer.ObjectFormatGetter and storage.ObjectFormatSetter interfaces and implementing them on the in-tree storage options. External storage implementations will continue to work; however, they won't be able to support SHA256 until they implement the new interfaces.

Running the clone example against a sha256 repository:

$ go run _examples/clone/main.go https://gitlab.com/pjbgf/sha256.git /tmp/test4
git clone https://gitlab.com/pjbgf/sha256.git /tmp/test4 --recursive
commit b2755b63e92d79410f28a33bef90107789c63a139121a457920c668d5805e974
Author: Paulo Gomes <pjbgf@linux.com>
Date:   Thu Sep 25 22:44:22 2025 +0100

    Add dirs

    Signed-off-by: Paulo Gomes <pjbgf@linux.com>

During the implementation I stumbled onto other issues, which I fixed as part of this PR because they were either blocking progress or small enough to make sense to include:

  • format/idxfile lacked defensive programming to remove (or error on) objects that were not fully materialised, which could lead to invalid index files. This resulted in object not found errors, which may be related to existing reports of this symptom still occurring on v6 (e.g. "Pull returns object not found" #207, "go-git sometimes returns object not found when querying for the commit object for the HEAD" #827).
  • format/idxfile was growing the backing arrays for names, offsets and crc32 as it iterated through objects. This is rather inefficient and has a performance impact when processing very large pack files.
  • Expanded test coverage for the filesystem and memory storage, to avoid future regressions when handling ObjectFormat.

Relates to #706.

ℹ️ Some of this work was done as part of openSUSE's Hack Week 25. Thanks SUSE for sponsoring the go-git project. 🙌

@pjbgf pjbgf added this to the v6.0.0 milestone Jan 26, 2026
Copilot AI review requested due to automatic review settings January 26, 2026 07:30
Contributor

Copilot AI left a comment

Pull request overview

This PR progresses SHA256 support by enabling cloning of SHA256 repositories. It introduces ObjectFormatGetter and ObjectFormatSetter interfaces to allow storage backends to report and configure their hash algorithm. The implementation updates transport protocols, packfile handling, and index file encoding to support both SHA1 and SHA256.

Changes:

  • Added ObjectFormatGetter and ObjectFormatSetter interfaces implemented by in-tree storage backends (memory and filesystem)
  • Updated transport layer to detect and propagate object format from capabilities or hash sizes
  • Modified packfile parser/scanner to handle SHA256 hashes and updated idxfile writer to skip unmaterialized delta objects
  • Removed ErrSHA256NotSupported error as SHA256 is now supported

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
storage/storer.go Defines new ObjectFormatGetter/Setter interfaces
storage/memory/storage.go Implements object format interfaces for memory storage
storage/filesystem/storage.go Implements object format interfaces for filesystem storage
plumbing/transport/*.go Updates transport to detect and pass object format to packfile parser
plumbing/format/packfile/*.go Updates packfile parsing to support SHA256 hashes
plumbing/format/idxfile/*.go Refactors encoder and adds defensive checks for unmaterialized objects
plumbing/protocol/packp/*.go Updates protocol parsing to handle variable hash sizes
repository.go Removes ErrSHA256NotSupported (breaking change)
repository_test.go Adds comprehensive tests for SHA256 cloning
worktree_commit_test.go Converts assertions to require for better test reliability
internal/server/* Adds internal HTTP server infrastructure for testing


Comment on lines 88 to 89
s.config.Extensions.ObjectFormat = of
s.config.Core.RepositoryFormatVersion = formatcfg.Version1

Copilot AI Jan 26, 2026


Direct access to s.config on lines 88-89 could cause a nil pointer dereference if config has not been initialized yet. While Config() initializes it lazily, SetObjectFormat doesn't call Config() first. This should either call s.Config() first to ensure initialization, or handle the nil case explicitly.

@pjbgf pjbgf requested a review from aymanbagabas January 26, 2026 10:09
Comment on lines 30 to 45
// ObjectFormatGetter is implemented by storage backends that can report
// their object format (hash algorithm), such as SHA-1 or SHA-256.
type ObjectFormatGetter interface {
// ObjectFormat returns the object format (hash algorithm) used by this storage.
ObjectFormat() formatcfg.ObjectFormat
}

// ObjectFormatSetter is implemented by storage backends that support
// configuring the object format (hash algorithm) used for the repository.
// This is typically called during repository initialization or when the
// first packfile is fetched.
type ObjectFormatSetter interface {
// SetObjectFormat configures the object format (hash algorithm) for this storage.
SetObjectFormat(formatcfg.ObjectFormat) error
}

Member

@aymanbagabas aymanbagabas Jan 26, 2026

Why do we want a custom interface for this? This makes it look like a repository object format is a special object in Git. Instead, I would reuse the config.ConfigStorer interface to get/set the object format and come up with a storage.ReadObjectFormat(config.ConfigStorer) and storage.SetObjectFormat(config.ConfigStorer, formatcfg.ObjectFormat) or something similar.

Member Author

This was a compromise; the reasoning being:

  1. When we init a new repository we don't know what ObjectFormat it will have. That information is acquired by talking to the server, and the new format is then set using UpdateObjectStorage. Alternatively, we could delay the repository initialisation until after the fetching process - but that would be a larger change.
  2. A key motivation here was backwards compatibility with off-tree storage implementations. This approach means they continue to compile and work when interacting with SHA1 repositories, but safely fail when interacting with a SHA256 repo.

Member

You need to specify the object format for new repositories; otherwise it defaults to SHA1. That's how git works by default. Migrating a repository from one format to another requires creating a new repository with the new format and importing/exporting the repository from the old format to the new one via git fast-export and git fast-import.

I think there should be an option to specify the object format for new repositories. UpdateObjectStorage and others should always use the repository format. The transport layer should validate that new requests match the storage format.


Empty repos are in a kind of funny place where you'd think the hash function is undefined, but it is actually defined and stored (implicitly or explicitly) as part of the config. The main reason is: what happens when you have an empty repository, and then you clone it?

This is a valid operation that must work predictably. An empty repository is initialized by git forges like Gitea, then you clone it, add commits and push them. Waiting until you create the first objects to define the repository ObjectFormat does not work.

The data sent when you transfer a repo includes this type, even if there are no objects, to make sure that the ObjectFormat is unambiguous.

Member Author

Thank you both for the feedback in such a timely manner. This approach was taken to lean on existing primitives while cloning new repositories, as at that point in time you don't know what ObjectFormat your server has. As per upstream:

We have a chicken-and-egg situation between initializing the refdb and spawning transport helpers:

- Initializing the refdb requires us to know about the object format. We thus have to spawn the transport helper to learn about it.

- The transport helper may want to access the Git repository. But because the refdb has not been initialized, we don't have "HEAD" or "refs/". Thus, the helper cannot find the Git repository.

Upstream worked around this by introducing a lazy refdb init, which instead of initialising a valid refdb actually creates an invalid one, only finishing off the init process once the target object format is agreed with the remote. Let me take that approach into consideration and update this PR accordingly.

Member Author

Following up on this discussion, I made a few changes to better align with upstream:

  • Added the partial init and moved the setting of the object format to the pack negotiation logic. That only takes place on the first negotiation during a clone operation; subsequent negotiations won't trigger an object format change in the storage.
  • Tests were added to ensure that the object format can't be changed on initialised repositories - regardless of whether it was set explicitly or not.
  • The go-git Storage abstraction for filesystem ignores any ObjectFormat passed in to NewStorageWithOptions when the underlying fs contains an initialised repository. In such cases, the repository config is taken as the single source of truth.

Are there any additional scenarios we may need to cater for on the test cases?

Comment on lines 35 to 40
if setter, ok := s.(storage.ObjectFormatSetter); ok {
err := setter.SetObjectFormat(of)
if err != nil {
return err
}
}
Member

What happens if the repository has a different object format from the incoming packfile? I think in that case this should fail with an error. I think we should always use the storage object format and pass that down to the parser.

Member Author

This PR makes both the memory and filesystem implementations fail if they are not empty when a change of object format is requested.

This is a fail-safe mechanism to enable single object format repositories to work, while erroring when a repo tries to handle multiple formats. This should be revisited as part of the translation table implementation. I'm copying this thread to the issue so that this discussion remains discoverable.

}
}

func WithObjectFormat(of config.ObjectFormat) ParserOption {
Member

Shouldn't we infer the object format from the repository storage?

Member Author

Overall, I don't think we can:

  1. A new repository likely won't have the ObjectFormat set. Even if it does, it may not align with the target pack being fetched. For example, new repo (defaults to SHA1) then fetches a SHA256 packfile.
  2. An existing repository set with a different ObjectFormat than the one for the packfile being downloaded.

These changes set the repository object format to the target packfile format, provided this is a new/empty repository. This feels like a safe approach until we have multi-ObjectFormat support. What do you think?

Member Author

For reference, in the filesystem storage, the Parser will be called here with the ObjectFormat coming either from the repository config, for an existing underlying repo, or from the storage options otherwise.

@pjbgf pjbgf mentioned this pull request Jan 27, 2026
pjbgf added 11 commits January 31, 2026 23:16
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
This error was created when go-git needed to be built for sha256 use, which
is no longer the case.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
The fetching of memory objects will now ensure the correct hash format. The same applies to the pack hash size, which previously returned a SHA1 hash for SHA256 packfiles.

Additionally, the dual hash (both SHA1 and SHA256) was removed. This is largely for practical reasons, as it is likely that this part of the logic may not be required in the future.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
The idxfile.encoder lacked defensive programming and was quite light
on testing. The refactor aligns the code structure with the recent
implementation of rev.Encoder.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
@pjbgf pjbgf force-pushed the sha256-open branch 2 times, most recently from 42687fd to d959bb5 on January 31, 2026 23:52
pjbgf added 10 commits February 1, 2026 00:29
…res'

A bug in other areas of plumbing/format was causing the revfile package to generate invalid files at times. These additional checks improve the test harness to avoid future regressions.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
pjbgf added 8 commits February 1, 2026 00:30
The SHA256 enablement requires the ability to assess whether the object format is the default because it is unset, or whether it was explicitly picked.

Another motivation for the change from int to string is that both the Git protocol and the Git config store this information as a string, so this change better aligns with upstream.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
There are two interfaces to represent git Storers: storer.Storer and storage.Storer. The former is a light-weight version focusing only on encoded objects and references; the latter is a full-on representation of a git repository.

The latter provides access to the ObjectFormat via storer.ConfigStorer. Although the interface is implemented by the in-tree storages, it now lives together with the former so that it becomes clearer that it is only needed when handling storer.Storer.

Alternatively, ObjectFormatGetter could be removed in favour of casting storer.Storer into storage.Storer; the problem with that approach is that any off-tree implementation would have a greater burden in order to implement sha256 support.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Fully initialised repositories must not have their object format changed. However, when cloning a repository we first init the repository and then negotiate capabilities with the remote. Only at that point are we aware of what object format the local storage must be created with.

To support the clone workflow, while also blocking changes to the object format of empty but fully initialised repositories, we:
- Implement partial inits which are validated at pack negotiation time.
- Keep the logic private so that it is only used internally for clones.
- Add tests to avoid future regression on empty repositories.

Additional info upstream:
- https://github.com/git/git/blob/ab380cb80b0727f7f2d7f6b17592ae6783e9820c/builtin/clone.c#L1216C60-L1216C68

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Ensures that existing repositories will always take precedence, and that whatever ObjectFormat is passed in to NewStorageWithOptions is ignored.

Signed-off-by: Paulo Gomes <pjbgf@linux.com>
Member

@aymanbagabas aymanbagabas left a comment

Looking great! I wonder if we should keep the ObjectFormatGetter 🤔

Comment on lines +34 to +37
type ObjectFormatGetter interface {
// ObjectFormat returns the object format (hash algorithm) used by the Storer.
ObjectFormat() formatcfg.ObjectFormat
}
Member

Do we need this interface still? If we're storing the object format in config and config is the single source of truth, this seems redundant. On this note, I always found the storer.Storer and storage.Storer interfaces confusing, and we should deprecate one of them.

Member Author

Long-term we may not need this interface - I will add some comments to that effect. I also agree that storer.Storer and storage.Storer could be reviewed, so I created an issue to handle both subjects at once - tracking to the v7 milestone.

For now, I'd rather we keep ObjectFormatGetter in so that we don't break the interface for existing external implementations, while also having a safe way to detect the lack of SHA256 support in off-tree storage implementations. A lack of this awareness could result in data corruption - I will add a test to highlight this point.

Comment on lines +90 to +97
if setter, ok := st.(storage.ObjectFormatSetter); ok {
err := setter.SetObjectFormat(serverFormat)
if err != nil {
return nil, fmt.Errorf("unable to set object format: %w", err)
}

clientFormat = serverFormat
}
Member

Suggested change
if setter, ok := st.(storage.ObjectFormatSetter); ok {
err := setter.SetObjectFormat(serverFormat)
if err != nil {
return nil, fmt.Errorf("unable to set object format: %w", err)
}
clientFormat = serverFormat
}
clientFormat = serverFormat
cfg.Extensions.ObjectFormat = serverFormat
st.SetConfig(cfg)

Member Author
The reason will be displayed to describe this comment to others. Learn more.

We can't just change the config, as the storage is already initialised at this point, so we need to reinitialise the (object) hashers with the target format.

If we were to add this additional logic into SetConfig, that would mean we'd need to verify on each call whether it is OK to change the object format and whether that should trigger a reinitialisation of the (object) hashers. As this would be storage.Storer specific, it could lead to a corrupted state on off-tree implementations - as the core logic would be unaware of such support.

// SetObjectFormat configures the object format (hash algorithm) for this storage.
SetObjectFormat(formatcfg.ObjectFormat) error
}

Member

@aymanbagabas aymanbagabas Feb 2, 2026

Suggested change
// ObjectFormat finds the object format for the given [config.ConfigStorer].
// If the config storer doesn't define an object format, it will return
// [config.UnsetObjectFormat].
func ObjectFormat(st config.ConfigStorer) config.ObjectFormat {
cfg, err := st.Config()
if err == nil {
return cfg.Extensions.ObjectFormat
}
return config.UnsetObjectFormat
}
// SetObjectFormat sets the object format for the given [config.ConfigStorer].
func SetObjectFormat(st config.ConfigStorer, f config.ObjectFormat) error {
cfg, err := st.Config()
if err != nil {
return err
}
cfg.Extensions.ObjectFormat = f
return st.SetConfig(cfg)
}
