Skip to content

reftable library#847

Closed
hanwen wants to merge 26 commits intogit:masterfrom
hanwen:libreftable
Closed

reftable library#847
hanwen wants to merge 26 commits intogit:masterfrom
hanwen:libreftable

Conversation

@hanwen
Copy link
Contributor

@hanwen hanwen commented Sep 16, 2020

This splits the giant commit from gitgitgadget#539 into a series of smaller commits, which build and have unittests.

Changes relative to last series: version 19 Apr 2021 (tip: b8729fe)

  • drop EINVAL for broken refs.
  • fix 2 UBSAN warnings. Thanks to Andrzej Hunt for reporting.
  • make tests pass against master (use oidread)
  • handle worktree/xx/refname refs.
  • handle main-worktree/ref
  • fix double free in branch renaming
  • handle relative paths (used in git upload-pack invocation) for ref backend creation.

cc: Han-Wen Nienhuys hanwen@google.com
cc: Jeff King peff@peff.net
cc: Ramsay Jones ramsay@ramsayjones.plus.com
cc: Jonathan Nieder jrnieder@gmail.com
cc: Johannes Schindelin Johannes.Schindelin@gmx.de
cc: Jonathan Tan jonathantanmy@google.com
cc: Josh Steadmon steadmon@google.com
cc: Emily Shaffer emilyshaffer@google.com
cc: Patrick Steinhardt ps@pks.im
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Felipe Contreras felipe.contreras@gmail.com
cc: Derrick Stolee stolee@gmail.com
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Han-Wen Nienhuys hanwen@google.comg
cc: "brian m. carlson" sandals@crustytoothpaste.net

@hanwen
Copy link
Contributor Author

hanwen commented Sep 16, 2020

/submit

@gitgitgadget-git
Copy link

Submitted as pull.847.git.git.1600283416.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-git-847/hanwen/libreftable-v1

To fetch this version to local tag pr-git-847/hanwen/libreftable-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-git-847/hanwen/libreftable-v1

@gitgitgadget-git
Copy link

This branch is now known as hn/reftable.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via c00808e.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 4f1956c.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via f9f83c1.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 488ad6a.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via ab283b9.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via a0bd498.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 0417227.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via e227917.

@@ -720,6 +720,7 @@ TEST_BUILTINS_OBJS += test-read-cache.o
TEST_BUILTINS_OBJS += test-read-graph.o

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static void reftable_log_record_key(const void *r, struct strbuf *dest)
> +{
> +	const struct reftable_log_record *rec =
> +		(const struct reftable_log_record *)r;
> +	int len = strlen(rec->refname);
> +	uint8_t i64[8];
> +	uint64_t ts = 0;
> +	strbuf_reset(dest);
> +	strbuf_add(dest, (uint8_t *)rec->refname, len + 1);
> +
> +	ts = (~ts) - rec->update_index;
> +	put_be64(&i64[0], ts);
> +	strbuf_add(dest, i64, sizeof(i64));
> +}

We seem to be getting

reftable/record.c: In function 'reftable_log_record_key':
reftable/record.c:578:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
  put_be64(&i64[0], ts);
    ^
        CC reftable/refname.o

when this series is merged to 'seen'.

cf. e.g. https://travis-ci.org/github/git/git/jobs/728655368


Thanks.



Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Sun, Sep 20, 2020 at 3:00 AM Junio C Hamano <gitster@pobox.com> wrote:
> > +static void reftable_log_record_key(const void *r, struct strbuf *dest)
> > +{
> > +     const struct reftable_log_record *rec =
> > +             (const struct reftable_log_record *)r;
> > +     int len = strlen(rec->refname);
> > +     uint8_t i64[8];
> > +     uint64_t ts = 0;
> > +     strbuf_reset(dest);
> > +     strbuf_add(dest, (uint8_t *)rec->refname, len + 1);
> > +
> > +     ts = (~ts) - rec->update_index;
> > +     put_be64(&i64[0], ts);
> > +     strbuf_add(dest, i64, sizeof(i64));
> > +}
>
> We seem to be getting
>
> reftable/record.c: In function 'reftable_log_record_key':
> reftable/record.c:578:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
>   put_be64(&i64[0], ts);
>     ^
>         CC reftable/refname.o
>
> when this series is merged to 'seen'.

Thanks for bringing this up. I did see this before, but I've been
unable to reproduce this locally, and I forgot about it.

The problem is actually triggered by the Git-provided put_be64()
(which appears unused in the Git source code). The definition

  #define put_be64(p, v) do { *(uint64_t *)(p) = htonll(v); } while (0)

is trying to reinterpret a char* as a uint64_t* , which is illegal in
strict aliasing rules? I originally had

+void put_be64(uint8_t *out, uint64_t v)
+{
+       int i = sizeof(uint64_t);
+        while (i--) {
+               out[i] = (uint8_t)(v & 0xff);
+               v >>= 8;
+       }
+}

in my reftable library, which is portable. Is there a reason for the
magic with htonll and friends?

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Sep 21, 2020 at 03:13:05PM +0200, Han-Wen Nienhuys wrote:

> The problem is actually triggered by the Git-provided put_be64()
> (which appears unused in the Git source code). The definition
> 
>   #define put_be64(p, v) do { *(uint64_t *)(p) = htonll(v); } while (0)
> 
> is trying to reinterpret a char* as a uint64_t* , which is illegal in
> strict aliasing rules?

Ah, thanks for finding that. I think this resolves the mystery I had
around aliasing warnings with put_be32() in f39ad38410 (fast-export: use
local array to store anonymized oid, 2020-06-25). I got confused because
when I looked at the implementation of put_be32(), I stupidly looked at
the NO_ALIGNED_LOADS fallback code, which indeed does not alias. But the
one we use on x86 does.

I thought this would be OK because C allows aliasing through "char *"
pointers. But I think it may only go the other way. I.e., you can
type-pun a uint64_t as a char, but not the other way around. But then
why doesn't basically every use of put_be32() and put_be64() complain?

> I originally had
> 
> +void put_be64(uint8_t *out, uint64_t v)
> +{
> +       int i = sizeof(uint64_t);
> +        while (i--) {
> +               out[i] = (uint8_t)(v & 0xff);
> +               v >>= 8;
> +       }
> +}
> 
> in my reftable library, which is portable. Is there a reason for the
> magic with htonll and friends?

Presumably it was thought to be faster. This comes originally from the
block-sha1 code in 660231aa97 (block-sha1: support for architectures
with memory alignment restrictions, 2009-08-12). I don't know how it
compares in practice, and especially these days.

Our fallback routines are similar to an unrolled version of what you
wrote above.

-Peff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Sep 24, 2020 at 03:21:51AM -0400, Jeff King wrote:

> > I originally had
> > 
> > +void put_be64(uint8_t *out, uint64_t v)
> > +{
> > +       int i = sizeof(uint64_t);
> > +        while (i--) {
> > +               out[i] = (uint8_t)(v & 0xff);
> > +               v >>= 8;
> > +       }
> > +}
> > 
> > in my reftable library, which is portable. Is there a reason for the
> > magic with htonll and friends?
> 
> Presumably it was thought to be faster. This comes originally from the
> block-sha1 code in 660231aa97 (block-sha1: support for architectures
> with memory alignment restrictions, 2009-08-12). I don't know how it
> compares in practice, and especially these days.
> 
> Our fallback routines are similar to an unrolled version of what you
> wrote above.

We should be able to measure it pretty easily, since block-sha1 uses a
lot of get_be32/put_be32. I generated a 4GB random file, built with
BLK_SHA1=Yes and -O2, and timed:

  t/helper/test-tool sha1 <foo.rand

Then I did the same, but building with -DNO_UNALIGNED_LOADS. The latter
actually ran faster, by a small margin. Here are the hyperfine results:

  [stock]
  Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
  Range (min … max):    6.550 s …  6.841 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
  Range (min … max):    6.394 s …  6.447 s    10 runs

For casual use as in reftables I doubt the difference is even
measurable. But this result implies that perhaps we ought to just be
using the fallback version all the time.

-Peff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> Then I did the same, but building with -DNO_UNALIGNED_LOADS. The latter
> actually ran faster, by a small margin. Here are the hyperfine results:
>
>   [stock]
>   Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>   Range (min … max):    6.550 s …  6.841 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>   Range (min … max):    6.394 s …  6.447 s    10 runs
>
> For casual use as in reftables I doubt the difference is even
> measurable. But this result implies that perhaps we ought to just be
> using the fallback version all the time.

I like that one.  One less configurable knob that makes us execute
different codepaths is one less thing to be worried about.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 7b0ab69.

@gitgitgadget-git
Copy link

User Han-Wen Nienhuys <hanwen@google.com> has been added to the cc: list.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via d1726fb.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 0c5022d.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 3b8fedc.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via f5586d7.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 75d9f0d.

@gitgitgadget-git
Copy link

User Jeff King <peff@peff.net> has been added to the cc: list.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 4ec7939.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 0875b0a.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 8bb79dc.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 503b919.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 9312a58.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 4d8ca35.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 8cb3f1f.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 98d584a.

hanwen and others added 17 commits July 6, 2021 15:18
The reftable format is structured as a sequence of block. Within a block,
records are prefix compressed, with an index of offsets for fully expand keys to
enable binary search within blocks.

This commit provides the logic to read and write these blocks.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
The reftable format includes support for an (OID => ref) map. This map can speed
up visibility and reachability checks. In particular, various operations along
the fetch/push path within Gerrit have ben sped up by using this structure.

The map is constructed with help of a binary tree. Object IDs are hashes, so
they are uniformly distributed. Hence, the tree does not attempt forced
rebalancing.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
This supports reading a single reftable file.

The commit introduces an abstract iterator type, which captures the usecases
both of reading individual refs, and iterating over a segment of the ref
namespace.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
With support for reading and writing files in place, we can construct files (in
memory) and attempt to read them back.

Because some sections of the format are optional (eg. indices, log entries), we
have to exercise this code using multiple sizes of input data

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
This is needed to create a merged view multiple reftables

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
This adds an abstract, read-only interface to the ref database.

This primitive is used to construct the read view of the ref database
(the read view is constructed by merging several *.ref files). It also
provides the mechanism to provide a unified view of the refs in the main
repository and the per-worktree refs.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
The packed/loose format has restrictions on refnames: a and a/b cannot
coexist. This limitation does not apply to reftable per se, but must be
maintained for interoperability. This code adds validation routines to
abort transactions that are trying to add invalid names.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
provide a command-line utility for inspecting individual tables, and
inspecting a complete ref database

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
For background, see Documentation/technical/reftable.txt.

This introduces the file refs/reftable-backend.c containing a reftable-powered
ref storage backend.

It can be activated by setting GIT_TEST_REFTABLE in the environment. When
GIT_TEST_REFTABLE is set, the test prerequisite !REFFILES is set.

There is no option to git-init for now, as the test suite still shows failures
with GIT_TEST_REFTABLE=1.

Example use: see t/t0031-reftable.sh

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Junio Hamano <gitster@pobox.com>
Helped-by: Patrick Steinhardt <patrick.steinhardt@elego.de>
Co-authored-by: Jeff King <peff@peff.net>
In our git-prompt script we strive to use Bash builtins wherever
possible, because fork()-ing subshells for command substitutions and
fork()+exec()-ing Git commands are expensive on some platforms.  We
even read and parse '.git/HEAD' using Bash builtins to get the name of
the current branch [1].  However, the upcoming reftable refs backend
won't use '.git/HEAD' at all, but will write an invalid refname as
placeholder for backwards compatibility instead, which will break our
git-prompt script.

Update the git-prompt script to recognize the placeholder '.git/HEAD'
written by the reftable backend (its content is specified in the
reftable specs), and then fall back to use 'git symbolic-ref' to get
the name of the current branch.

[1] 3a43c4b (bash prompt: use bash builtins to find out current
    branch, 2011-03-31)

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
This command dumps individual tables or a stack of of tables.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
* Reftable for now lacks detailed error messages for directory/file conflicts.
  Skip message comparisons.

* Mark tests that muck with .git directly as REFFILES.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
@gitgitgadget-git
Copy link

There was a status update in the "Stalled" section about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 92fd260.

@gitgitgadget-git
Copy link

This patch series was integrated into next via 0d6f79d.

@gitgitgadget-git gitgitgadget-git bot added the next label Jul 7, 2021
@gitgitgadget-git
Copy link

This patch series was integrated into seen via 171a6b6.

@gitgitgadget-git
Copy link

There was a status update in the "Stalled" section about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via c9780bb.

@gitgitgadget-git
Copy link

This patch series was integrated into next via c9780bb.

@gitgitgadget-git
Copy link

This patch series was integrated into master via c9780bb.

@gitgitgadget-git
Copy link

Closed via c9780bb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants