Optimization batch 6: make full use of exact renames #842

Closed

newren wants to merge 2 commits into gitgitgadget:temporary/ort-perf-measurements from newren:ort-perf-batch-6

Conversation

@newren

@newren newren commented Jan 24, 2021

This series makes full use of exact renames: when an exact rename is found and copy detection is not turned on, it removes not only the destination pair but the source pair as well.

Changes since v2:

  • Fix a comment typo, and fix a multi-line comment that didn't need to be a multi-line comment

CC: Derrick Stolee dstolee@microsoft.com
CC: Jonathan Tan jonathantanmy@google.com
CC: Taylor Blau me@ttaylorr.com
CC: Junio C Hamano gitster@pobox.com
CC: Jeff King peff@peff.net
CC: Karsten Blees blees@dcon.de
cc: Derrick Stolee stolee@gmail.com
cc: Elijah Newren newren@gmail.com

@newren

newren commented Feb 3, 2021

/submit

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

Submitted as pull.842.git.1612331345.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v1

To fetch this version to local tag pr-842/newren/ort-perf-batch-6-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-842/newren/ort-perf-batch-6-v1

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

@newren

newren commented Feb 3, 2021

/submit

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

Submitted as pull.842.v2.git.1612382628.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v2

To fetch this version to local tag pr-842/newren/ort-perf-batch-6-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-842/newren/ort-perf-batch-6-v2

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on en/merge-ort-perf and makes full use of exact
> renames; see commit messages for details.
>
> Thanks to Stolee and Junio for reviewing v1.
>
> Changes since v1:
>
>  * Update rename_src_nr when updating rename_src
>  * Introduce want_copies in the first patch and use it in a few more places
>  * Move a comment below a few exit-early if-checks.
>
> Elijah Newren (2):
>   diffcore-rename: no point trying to find a match better than exact
>   diffcore-rename: filter rename_src list when possible
>
>  diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 61 insertions(+), 8 deletions(-)

Thanks, these look better.

With these changes, I guess there are only two things I find
somewhat embarrassing in the rename machinery that are still there
since I invented it.

 - We still need to go full matrix while finding the "best"
   pairing.  I cannot think of a way to avoid it (that is what makes
   it embarrassing) but wish there were some way to.

   In an early attempt, I tried to retire rename_src[j], once
   rename_dst[i] has been found to be a "good enough" match for it,
   from the pool of rename src candidates to find a good match for
   rename_dst[k] for i < k, but naive implementation of it would not
   work well for obvious reasons---rename_src[j] may match a lot
   better with rename_dst[k] than rename_dst[i] but we do not know
   that until we try to estimate similarity with rename_dst[k].


 - The .cnt_data member was designed to be a concise summary of the
   blob characteristics so that two .cnt_data can be "compared"
   fairly cheaply to see how "similar" two blobs are [*], but (1) it
   is rather big to be called a "concise summary", and (2) it was
   not chosen after real performance measurement, and we've been
   using it for the past 15 years without revisiting its design.

   Side note: In a very early prototype, the approach to assess
   similarity between two blobs was very different---there was no
   attempt to compute "concise summary" for each blob, but we just
   attempted to create delta (as in the pack data) between src and
   dst blobs and measured how small a delta we can use to transform
   from src to dst.
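
For readers following along, the "full matrix" above refers to the
nested similarity loop at the heart of diffcore-rename.c; schematically
(a simplified fragment, not the verbatim code):

    /*
     * Every remaining source is compared against every remaining
     * destination, so the work is O(rename_src_nr * rename_dst_nr);
     * the scored pairs are then sorted and assigned greedily.
     */
    for (i = 0; i < rename_dst_nr; i++) {
            for (j = 0; j < rename_src_nr; j++) {
                    /* compute a similarity score for (src[j], dst[i]) */
            }
    }
    /* then sort all scored candidate pairs and pick renames greedily */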

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Feb 3, 2021 at 1:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series depends on en/merge-ort-perf and makes full use of exact
> > renames; see commit messages for details.
> >
> > Thanks to Stolee and Junio for reviewing v1.
> >
> > Changes since v1:
> >
> >  * Update rename_src_nr when updating rename_src
> >  * Introduce want_copies in the first patch and use it in a few more places
> >  * Move a comment below a few exit-early if-checks.
> >
> > Elijah Newren (2):
> >   diffcore-rename: no point trying to find a match better than exact
> >   diffcore-rename: filter rename_src list when possible
> >
> >  diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 61 insertions(+), 8 deletions(-)
>
> Thanks, these look better.
>
> With these changes, I guess there are only two things I find
> somewhat embarrassing in the rename machinery that are still there
> since I invented it.
>
>  - We still need to go full matrix while finding the "best"
>    pairing.  I cannot think of a way to avoid it (that is what makes
>    it embarrassing) but wish there were some way to.

It turns out that exact renames aren't the only thing that can reduce
the size of the matrix.  My next few series will add some more.

The matrix does remain, even at the end of all my performance work,
but multiple ways to shrink it certainly help a lot.

>    In an early attempt, I tried to retire rename_src[j], once
>    rename_dst[i] has been found to be a "good enough" match for it,
>    from the pool of rename src candidates to find a good match for
>    rename_dst[k] for i < k, but naive implementation of it would not
>    work well for obvious reasons---rename_src[j] may match a lot
>    better with rename_dst[k] than rename_dst[i] but we do not know
>    that until we try to estimate similarity with rename_dst[k].

You may really like the next two series I submit.  I have a smarter
way to find a "good enough" match (comparing to exactly one other file
and often finding sufficient similarity), and one that'll make
intuitive sense to users.

Then I have other series after those that optimize in different ways.

>  - The .cnt_data member was designed to be a concise summary of the
>    blob characteristics so that two .cnt_data can be "compared"
>    fairly cheaply to see how "similar" two blobs are [*], but (1) it
>    is rather big to be called a "concise summary", and (2) it was
>    not chosen after real performance measurement, and we've been
>    using it for the past 15 years without revisiting its design.

.cnt_data seemed kind of clever to me, though it did have a few things
that seemed surprising:

1) Small "lines".  The similarity works by mapping each "line" to an
integer using some simple hash, and keeps track of the list of hashed
integers for each file.  Once you have a list of hashed integers for
each file, similarity is determined by looking through the two lists
for two files and seeing what percentage of integers is found in both
(but see item #3 for how percentage is defined).  One special
consideration here was "lines".

I think because of a desire to handle binary files, there was a
decision to split at 64 bytes or LF, whichever comes first, so each
real line of code might be broken into multiple "lines".  The 64-byte
thing might make things a little weird, though.  If you have a lot of
lines that are just over 64 characters long, they'll be split into two
"lines", with the second ones being very short.  Really short lines
are much more likely to look like other really short lines, which
might pad your similarity count.  I'm not sure what to do differently,
though.
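
For illustration, here is a self-contained sketch of the
split-at-64-bytes-or-LF scheme described above.  The hash function and
the record_hash callback are made up for this example; the real code in
diffcore-delta.c differs in detail:

    #include <stddef.h>

    #define HASHBASE 107927 /* the table size mentioned in item 2 below */

    /* Toy per-"line" hash; git's actual hash is different. */
    static unsigned int hash_line(const unsigned char *buf, size_t len)
    {
            unsigned int h = 5381;
            while (len--)
                    h = (h * 33) ^ *buf++;
            return h % HASHBASE;
    }

    /*
     * Split the buffer at LF or after 64 bytes, whichever comes first,
     * and hash each resulting "line"; record_hash() stands in for
     * adding the value to the file's multiset of line hashes.
     */
    static void hash_lines(const unsigned char *buf, size_t sz,
                           void (*record_hash)(unsigned int))
    {
            size_t start = 0, i;

            for (i = 0; i < sz; i++) {
                    size_t len = i - start + 1;
                    if (buf[i] == '\n' || len == 64) {
                            record_hash(hash_line(buf + start, len));
                            start = i + 1;
                    }
            }
            if (start < sz) /* trailing partial line */
                    record_hash(hash_line(buf + start, sz - start));
    }

Note how a line just over 64 characters comes out as one 64-byte "line"
plus a very short one; that short tail is the padding effect described
above.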

2) Size-agnostic hashing.  The integer that each "line" is mapped to
is strictly between 0 and HASHBASE-1, where HASHBASE is #define'd to
107927.

By the pigeon-hole principle, since this hash size is fixed, any two
sufficiently large files will be marked by this algorithm as similar.
I created some testcases and verified that this was indeed true.  I
created two files at random using disjoint "alphabets" (e.g. one file
only contained letters, the other file only contained digits), and
found that at a certain size the algorithm would always return >50%
similarity.  The size was somewhere between 1MB and 8MB; I can't find
the table where I recorded that anymore.  So, basically, sufficiently
large files that are sufficiently close in size (since the algorithm
checks for size similarity as an optimization) will always be marked
as a rename or copy.  I think we haven't gotten reports of this "bug"
because files that large are for all intents and purposes binaries
that people likely won't modify.  And if they do modify the files,
then they'll conflict, but people won't look at the files to resolve
the conflict; instead, they'll generate a new "good" binary externally
and then stick it into place.  I investigated this just a little bit,
and even considered turning off rename detection for sufficiently
large files, but sizes aren't available until the tight inner loop and
adding any extra checking there actually costs enough to offset
almost any benefit.  There might have been a way to avoid that, but
ultimately I just dropped the issue and did nothing.
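
A back-of-the-envelope check on that pigeonhole effect (my own numbers,
not from the thread): if file B's line hashes were spread uniformly
over H = 107927 buckets, the chance that any given hash from file A
also occurs somewhere in B is

    1 - (1 - 1/H)^n  ~=  1 - exp(-n/H)

where n is the number of "lines" in B.  That crosses 50% once
n > H * ln(2) ~= 74,800 lines, i.e. a few megabytes at up to 64 bytes
per "line", which is consistent with the 1MB-8MB threshold observed
above.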

3) It uses a similarity measure that diverges from what researchers
use for MinHash and other fancy algorithms.  In particular,

   size(A intersect B) / size(A union B)  !=  size(A intersect B) / max(size(A), size(B))

The formula on the right hand side would mean that if file A is a
subset of file B, say the first 10% of file B, then it will be treated
as 100% similar when most humans would look at it and say it is only
10% similar.  The MinHash definition for similarity measure seems like
it makes more sense...though if people have been passing numbers like
"60%" to -M, then this change of definition will force them to
recalibrate what numbers they pass.  I think it'd be really nice to
use the MinHash definition and that it might even fix bugs to do so,
but I was worried folks might perceive it as a backward incompatible
change if they need to recalibrate their numbers.

So, I ultimately didn't do anything here either...yet.  But this is
the one that I'd most like to do something about.  Are others
concerned about the possible need to recalibrate, or is that okay?
Maybe the performance gains I'm adding elsewhere will offset possible
grumpy users?

>    Side note: In a very early prototype, the approach to assess
>    similarity between two blobs was very different---there was no
>    attempt to compute "concise summary" for each blob, but we just
>    attempted to create delta (as in the pack data) between src and
>    dst blobs and measured how small a delta we can use to transform
>    from src to dst.

Interesting; I didn't know that.

@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

> 3) It uses a similarity measure that diverges from what researchers
> use for MinHash and other fancy algorithms.  In particular,
>
>    size(A intersect B) / size(A union B)  !=  size(A intersect B) / max(size(A), size(B))
>
> The formula on the right hand side would mean that if file A is a
> subset of file B, say the first 10% of file B, then it will be treated
> as 100% similar when most humans would look at it and say it is only
> 10% similar.

If you are talking about "you started from a 100-line file and
appended 900 lines of your own, so you still have 100% of the
original material remaining in the file", it is quite deliberate
that we used it as an indication that the original "100 lines" file
is a good candidate to have been renamed to the resulting "1000
lines" file.  It is a "what you have kept from the original" measure.

Of course, taken to the extreme, this means that rename does not
have to be symmetrical.  "diff A B" may find that the original
100-line file in A has grown into a 1000-line file in B elsewhere, but
"diff B A" or "diff -R A B" would not necessarily pair these two
blobs as matching.
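
A small worked example of that asymmetry, reading "what you have kept
from the original" as size(A intersect B) / size(A) (my paraphrase of
the prose above, not the exact code): let A be 100 lines and B be A
plus 900 new lines, so size(A intersect B) = 100:

    kept, A -> B:   100 / size(A) = 100 / 100  = 100%
    kept, B -> A:   100 / size(B) = 100 / 1000 =  10%
    Jaccard(A, B):  100 / size(A union B) = 100 / 1000 = 10%

So "diff A B" sees a strong rename candidate while "diff B A" need not,
exactly as described.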

> Maybe the performance gains I'm adding elsewhere will offset possible
> grumpy users?

Users being what they are, that would never happen.  When they have
something to complain about, they will, regardless of what else you do.


@gitgitgadget

gitgitgadget bot commented Feb 3, 2021

On the Git mailing list, Jeff King wrote (reply to this):

On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote:

> >    In an early attempt, I tried to retire rename_src[j], once
> >    rename_dst[i] has been found to be a "good enough" match for it,
> >    from the pool of rename src candidates to find a good match for
> >    rename_dst[k] for i < k, but naive implementation of it would not
> >    work well for obvious reasons---rename_src[j] may match a lot
> >    better with rename_dst[k] than rename_dst[i] but we do not know
> >    that until we try to estimate similarity with rename_dst[k].
> 
> You may really like the next two series I submit.  I have a smarter
> way to find a "good enough" match (comparing to exactly one other file
> and often finding sufficient similarity), and one that'll make
> intuitive sense to users.

Here's a really old thread with an approach that may or may not be
similar to what you're thinking of:

  https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com/

Though maybe start with this summary message:

  https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@mail.gmail.com/

I remember experimenting some with it at the time, but never making
things faster. It's entirely possible (likely, even) that I was simply
not doing it well enough.

It's also been long enough since I looked at the rename code that I'm
not sure how different it is in practice. It still has a quadratic
matrix in the end. We basically do a similar strategy of
rolling-hash-fingerprint-and-see-where-things-collide, but I think we
may end up with more work during the quadratic part (again, it's been
a while, so I may just be totally off-base).

I've also wondered if something similar might be helpful for delta
compression (which is now done with an O(m*n) sliding window).

-Peff

@gitgitgadget

gitgitgadget bot commented Feb 4, 2021

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Feb 3, 2021 at 3:37 PM Jeff King <peff@peff.net> wrote:
>
> On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote:
>
> > >    In an early attempt, I tried to retire rename_src[j], once
> > >    rename_dst[i] has been found to be a "good enough" match for it,
> > >    from the pool of rename src candidates to find a good match for
> > >    rename_dst[k] for i < k, but naive implementation of it would not
> > >    work well for obvious reasons---rename_src[j] may match a lot
> > >    better with rename_dst[k] than rename_dst[i] but we do not know
> > >    that until we try to estimate similarity with rename_dst[k].
> >
> > You may really like the next two series I submit.  I have a smarter
> > way to find a "good enough" match (comparing to exactly one other file
> > and often finding sufficient similarity), and one that'll make
> > intuitive sense to users.
>
> Here's a really old thread with an approach that may or may not be
> similar to what you're thinking of:
>
>   https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com/
>
> Though maybe start with this summary message:
>
>   https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@mail.gmail.com/

Interesting thread; thanks for the link.  It's not remotely similar to
what I have done, but a brief glance through it reminds me of the
ideas at https://github.com/gitgitgadget/git/issues/519.

> I remember experimenting some with it at the time, but never making
> things faster. It's entirely possible (likely, even) that I was simply
> not doing it well enough.
>
> It's also been long enough since I looked at the rename code that I'm
> not sure how different it is in practice. It still has a quadratic
> matrix in the end. We basically do a similar strategy of
> rolling-hash-fingerprint-and-see-where-things-collide, but I think we
> may end up with more work during the quadratic part (again, it's been
> a while, so I may just be totally off-base).

I'm not sure if I should spoil the surprise for the patchsets I
haven't yet submitted but... when I get done, for the testcases I have
looked at, rename detection is no longer the slowest part of
merging/rebasing/cherry-picking -- not even when there are tens of
thousands of renames on one side of history.  And I didn't achieve
that by making other parts of the code slower.

If someone can come up with a real-world testcase where rename
detection is still really slow in a merge/rebase/cherry-pick with my
implemented optimizations, I've got at least one more optimization
that I hadn't bothered implementing because everything was fast enough
for me already.  And the idea you linked above and the ideas in the
gitgitgadget issue linked above all have more possibilities that are
complementary to what I have done that might help further.

> I've also wondered if something similar might be helpful for delta
> compression (which is now done with an O(m*n) sliding window).

@gitgitgadget

gitgitgadget bot commented Feb 12, 2021

This patch series was integrated into seen via git@72384d7.

@gitgitgadget gitgitgadget bot added the seen label Feb 12, 2021
Commit 1/2: diffcore-rename: no point trying to find a match better than exact

diffcore_rename() had some code to avoid having destination paths that
already had an exact rename detected from being re-checked for other
renames.  Source paths, however, were re-checked because we wanted to
allow the possibility of detecting copies.  But if copy detection isn't
turned on, then this merely amounts to attempting to find a
better-than-exact match, which naturally ends up being an expensive
no-op.  In particular, copy detection is never turned on by the merge
machinery.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.263 s ±  0.053 s    14.119 s ±  0.101 s
    mega-renames:   5504.231 s ±  5.150 s  1802.044 s ±  0.828 s
    just-one-mega:   158.534 s ±  0.498 s    51.391 s ±  0.028 s

Signed-off-by: Elijah Newren <newren@gmail.com>
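
A minimal sketch of the change being described, as I read it (a
simplified fragment assuming the surrounding context of
diffcore-rename.c; the real patch differs in detail):

    /* copy detection is what lets one source match many destinations */
    int want_copies = (options->detect_rename == DIFF_DETECT_COPY);

    ...

    for (j = 0; j < rename_src_nr; j++) {
            struct diff_filespec *one = rename_src[j].p->one;

            /*
             * Exact renames were already detected; without copy
             * detection this source cannot match anything better
             * than exactly, so skip re-checking it.
             */
            if (!want_copies && one->rename_used)
                    continue;

            /* ... estimate similarity against the destinations ... */
    }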
Commit 2/2: diffcore-rename: filter rename_src list when possible

We have to look at each entry in rename_src a total of rename_dst_nr
times.  When we're not detecting copies, any exact renames or ignorable
rename paths will just be skipped over.  While checking that these can
be skipped over is a relatively cheap check, it's still a waste of time
to do that check more than once, let alone rename_dst_nr times.  When
rename_src_nr is a few thousand times bigger than the number of relevant
sources (such as when cherry-picking a commit that only touched a
handful of files, but from a side of history that has different names
for some high level directories), this time can add up.

First make an initial pass over the rename_src array and move all the
relevant entries to the front, so that we can iterate over just those
relevant entries.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
    mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
    just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
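
And a sketch of the filtering pass itself (again a simplified fragment
under the same assumptions; the "relevant" test here is just the
exact-rename check, while the real patch covers more cases):

    if (!want_copies) {
            int new_num_src = 0;

            /* Move the still-relevant sources to the front... */
            for (i = 0; i < rename_src_nr; i++) {
                    if (rename_src[i].p->one->rename_used)
                            continue; /* claimed by an exact rename */
                    if (new_num_src < i)
                            memcpy(&rename_src[new_num_src],
                                   &rename_src[i],
                                   sizeof(rename_src[i]));
                    new_num_src++;
            }
            /* ...and keep the count in sync, per the v1 changelog. */
            rename_src_nr = new_num_src;
    }

After this, the quadratic loop only ever iterates over sources that can
still participate in a rename.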
@gitgitgadget

gitgitgadget bot commented Feb 14, 2021

The pull request has 411 commits. The max allowed is 30. Please split the patch series into multiple pull requests. Also consider squashing related commits.

@newren

newren commented Feb 14, 2021

/submit

@gitgitgadget

gitgitgadget bot commented Feb 14, 2021

Submitted as pull.842.v3.git.1613288101.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v3

To fetch this version to local tag pr-842/newren/ort-perf-batch-6-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-842/newren/ort-perf-batch-6-v3

@newren newren closed this Mar 5, 2021
@newren newren deleted the ort-perf-batch-6 branch March 5, 2021 05:59