Conversation
Trying to use Rust's Vec in C, or git's ALLOC_GROW() macros (via
wrapper functions) in Rust is painful because:
* C doesn't define its own vector type, and even though Rust does
have Vec its painful to use on the C side (more on that below).
However its still not viable to use Rust's Vec type because Git
needs to be able to compile without Rust. So ivec was created
expressley to be interoperable between C and Rust without needing
Rust.
* C doing vector things the Rust way would require wrapper functions,
and Rust doing vector things the C way would require wrapper
functions, so ivec was created to ensure a consistent contract
between the 2 languages for how to manipulate a vector.
* Currently, Rust defines its own 'Vec' type that is generic, but its
memory allocator and struct layout weren't designed for
interoperability with C (or any language for that matter), meaning
that the C side cannot push to or expand a 'Vec' without defining
wrapper functions in Rust that C can call. Without special care,
the two languages might use different allocators (malloc/free on
the C side, and possibly something else in Rust), which would make
it difficult for a function in one language to free elements
allocated by a call from a function in the other language.
* Similarly, git defines ALLOC_GROW() and related macros in
git-compat-util.h. While we could add functions allowing Rust to
invoke something similar to those macros, passing three variables
(pointer, length, allocated_size) instead of a single variable
(vector) across the language boundary requires more cognitive
overhead for readers to keep track of and makes it easier to make
mistakes. Further, for low-level components that we want to
eventually convert to pure Rust, such triplets would feel very out
of place.
To address these issue, introduce a new type, ivec -- short for
interoperable vector. (We refer to it as 'ivec' generally, though on
the Rust side the struct is called IVec to match Rust style.) This new
type is specifically designed for FFI purposes, so that both languages
handle the vector in the same way, though it could be used on either
side independently. This type is designed such that it can easily be
replaced by a Rust 'Vec' once interoperability is no longer a concern.
One particular item to note is that Git's macros to handle vec
operations infer the amount that a vec needs to grow from the size of
a pointer, but that makes it somewhat specific to the macros used in C.
To avoid defining every ivec function as a macro I opted to also
include an element_size field that allows concrete functions like
push() to know how much to grow the memory. This element_size also
helps in verifying that the ivec is correct when passing from C to
Rust.
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Later patches will prepare xdl_cleanup_records() to be moved into xdiffi.c since only the classic diff uses that function. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
All lines must be read anyway, so classify them after they're read in. Also move the memset() into xdl_init_classifier(). Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
The patience diff is set up the exact same way as histogram, see xdl_do_historgram_diff() in xhistogram.c. xdl_optimize_ctxs() is redundant now, delete it. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
View with --color-words. Prepare these functions to use the fields: delta_start, delta_end. A future patch will add these fields to xdfenv_t. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
This patch is best viewed with a before and after of the whole function. Rather than using 2 pointers and walking them. Use direct indexing with local variables of what is being compared to make it easier to follow along. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Placing delta_start in xdfenv_t instead of xdfile_t provides a more appropriate context since this variable only makes sense with a pair of files. View with --color-words. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
View with --color-words. Same argument as delta_start. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Disentangle xdl_cleanup_records() from the classifier so that it can be moved from xprepare.c into xdiffi.c. The classic diff is the only algorithm that needs to count the number of times each line occurs in each file. Make xdl_cleanup_records() count the number of lines instead of the classifier so it won't slow down patience or histogram. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Only the classic diff uses xdl_cleanup_records(). Move it, xdl_clean_mmatch(), and the macros to xdiffi.c and call xdl_cleanup_records() inside of xdl_do_classic_diff(). This better organizes the code related to the classic diff. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
|
/submit |
|
Submitted as pull.2156.git.git.1767379944.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> compat/ivec.c | 113 ++++++++++++++++++
> compat/ivec.h | 52 +++++++++
I very much like the general direction, but I wonder if we expect
many more "rust-to-C interface layer" files to come, which I suspect
is generally true, and in which case I think it is a good idea to
rethink the use of "compat/" for this purpose from early days, as
"compat/" is not about "compat between C and something else", but is
about "compat between platform peculiarity and (idealized) POSIX
environment our code assumes".
|
| @@ -1107,6 +1107,7 @@ LIB_OBJS += commit-reach.o | |||
| LIB_OBJS += commit.o | |||
There was a problem hiding this comment.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> + if (new_capacity == 0) {
> + free(self->ptr);
> + self->ptr = NULL;
if (!new_capacity)
FREE_AND_NULL(self->ptr);
else
...;
> +void ivec_free(void *self_)
> +{
> + struct IVec_c_void *self = self_;
> +
> + free(self->ptr);
> + self->ptr = NULL;
Likewise. Otherwise the code will fail coccicheck.
> + self->length = 0;
> + self->capacity = 0;
> + // DO NOT MODIFY element_size!!!
/* A single-liner comment in our codebase looks like this */
There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Sat, Jan 3, 2026 at 10:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > + if (new_capacity == 0) {
> > + free(self->ptr);
> > + self->ptr = NULL;
>
> if (!new_capacity)
> FREE_AND_NULL(self->ptr);
> else
> ...;
>
> > +void ivec_free(void *self_)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + free(self->ptr);
> > + self->ptr = NULL;
>
> Likewise. Otherwise the code will fail coccicheck.
>
> > + self->length = 0;
> > + self->capacity = 0;
> > + // DO NOT MODIFY element_size!!!
>
> /* A single-liner comment in our codebase looks like this */
>
I will make these changes.|
On the Git mailing list, Yee Cheng Chin wrote (reply to this): Hi Ezekiel, I wonder if you saw my proposed patch "xdiff: fix outdated
xpatience comments referring to "ha" member var"?
(https://lore.kernel.org/pull.2139.git.git.1766464905719.gitgitgadget@gmail.com)
from 2 weeks ago? It simply cleans up a stale comment after a previous
xdiff cleanup when the "ha" member variable was split. I don't think
it conflicts with this part 3 (it's a small comments clean up) but I
wonder if you could take a look? Just to avoid future conflicts.
On Fri, Jan 2, 2026 at 10:52 AM Ezekiel Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Patch series summary:
>
> * patch 1: Introduce the ivec type
> * patch 2: Create the function xdl_do_classic_diff()
> * patches 3-4: generic cleanup
> * patches 5-8: convert from dstart/dend (in xdfile_t) to
> delta_start/delta_end (in xdfenv_t)
> * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
> xdiffi.c
>
> Things that will be addressed in future patch series:
>
> * Make xdl_cleanup_records() easier to read
> * convert recs/nrec into an ivec
> * convert changed to an ivec
> * remove reference_index/nreff from xdfile_t and turn it into an ivec
> * splitting minimal_perfect_hash out as its own ivec
> * improve the performance of the classifier and parsing/hashing lines
>
> === before this patch series typedef struct s_xdfile { xrecord_t *recs;
> size_t nrec; ptrdiff_t dstart, dend; bool *changed; size_t *reference_index;
> size_t nreff; } xdfile_t;
>
> typedef struct s_xdfenv { xdfile_t xdf1, xdf2; } xdfenv_t;
>
> === after this patch series typedef struct s_xdfile { xrecord_t *recs;
> size_t nrec; bool *changed; size_t *reference_index; size_t nreff; }
> xdfile_t;
>
> typedef struct s_xdfenv { xdfile_t xdf1, xdf2; size_t delta_start,
> delta_end; size_t mph_size; } xdfenv_t;
>
> Ezekiel Newren (10):
> ivec: introduce the C side of ivec
> xdiff: make classic diff explicit by creating xdl_do_classic_diff()
> xdiff: don't waste time guessing the number of lines
> xdiff: let patience and histogram benefit from xdl_trim_ends()
> xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
> xdiff: cleanup xdl_trim_ends()
> xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
> xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
> xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
> xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
>
> Makefile | 1 +
> compat/ivec.c | 113 ++++++++++++++++++
> compat/ivec.h | 52 +++++++++
> meson.build | 1 +
> xdiff/xdiffi.c | 221 +++++++++++++++++++++++++++++++++---
> xdiff/xdiffi.h | 1 +
> xdiff/xhistogram.c | 7 +-
> xdiff/xpatience.c | 7 +-
> xdiff/xprepare.c | 277 ++++++++-------------------------------------
> xdiff/xtypes.h | 3 +-
> xdiff/xutils.c | 20 ----
> xdiff/xutils.h | 1 -
> 12 files changed, 432 insertions(+), 272 deletions(-)
> create mode 100644 compat/ivec.c
> create mode 100644 compat/ivec.h
>
>
> base-commit: 66ce5f8e8872f0183bb137911c52b07f1f242d13
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v1
> Pull-Request: https://github.com/git/git/pull/2156
> --
> gitgitgadget
> |
|
User |
| @@ -1107,6 +1107,7 @@ LIB_OBJS += commit-reach.o | |||
| LIB_OBJS += commit.o | |||
There was a problem hiding this comment.
On the Git mailing list, Phillip Wood wrote (reply to this):
Hi Ezekiel
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > Trying to use Rust's Vec in C, or git's ALLOC_GROW() macros (via
> wrapper functions) in Rust is painful because:
> > * C doesn't define its own vector type, and even though Rust does
> have Vec its painful to use on the C side (more on that below).
> However its still not viable to use Rust's Vec type because Git
> needs to be able to compile without Rust. So ivec was created
> expressley to be interoperable between C and Rust without needing
> Rust.
> * C doing vector things the Rust way would require wrapper functions,
> and Rust doing vector things the C way would require wrapper
> functions, so ivec was created to ensure a consistent contract
> between the 2 languages for how to manipulate a vector.
> * Currently, Rust defines its own 'Vec' type that is generic, but its
> memory allocator and struct layout weren't designed for
> interoperability with C (or any language for that matter), meaning
> that the C side cannot push to or expand a 'Vec' without defining
> wrapper functions in Rust that C can call. Without special care,
> the two languages might use different allocators (malloc/free on
> the C side, and possibly something else in Rust), which would make
> it difficult for a function in one language to free elements
> allocated by a call from a function in the other language.
> * Similarly, git defines ALLOC_GROW() and related macros in
> git-compat-util.h. While we could add functions allowing Rust to
> invoke something similar to those macros, passing three variables
> (pointer, length, allocated_size) instead of a single variable
> (vector) across the language boundary requires more cognitive
> overhead for readers to keep track of and makes it easier to make
> mistakes. Further, for low-level components that we want to
> eventually convert to pure Rust, such triplets would feel very out
> of place.
> > To address these issue, introduce a new type, ivec -- short for
> interoperable vector. (We refer to it as 'ivec' generally, though on
> the Rust side the struct is called IVec to match Rust style.) This new
> type is specifically designed for FFI purposes, so that both languages
> handle the vector in the same way, though it could be used on either
> side independently. This type is designed such that it can easily be
> replaced by a Rust 'Vec' once interoperability is no longer a concern.
> > One particular item to note is that Git's macros to handle vec
> operations infer the amount that a vec needs to grow from the size of
> a pointer, but that makes it somewhat specific to the macros used in C.
> To avoid defining every ivec function as a macro I opted to also
> include an element_size field that allows concrete functions like
> push() to know how much to grow the memory. This element_size also
> helps in verifying that the ivec is correct when passing from C to
> Rust.
I've left some comments below but I think this is a sensible direction.
> diff --git a/compat/ivec.c b/compat/ivec.c
> new file mode 100644
> index 0000000000..0a777e78dc
> --- /dev/null
> +++ b/compat/ivec.c
> @@ -0,0 +1,113 @@
> +#include "ivec.h"
> +
> +struct IVec_c_void {
We normally use all lower case names for structs but as this is shared with rust it maybe makes sense to use CamelCase so the names are the same in both languages.
> + void *ptr;
> + size_t length;
> + size_t capacity;
> + size_t element_size;
> +};
> +
> +static void _set_capacity(void *self_, size_t new_capacity)
> +{
> + struct IVec_c_void *self = self_;
Passing any of the ivec variants defined below to this function invokes undefined behavior because we're not casting the pointer back to the orginal type. However I think on the platforms we care about sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
> +
> + if (new_capacity == self->capacity) {
> + return;
> + }
> + if (new_capacity == 0) {
> + free(self->ptr);
> + self->ptr = NULL;
> + } else {
> + self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> + }
> + self->capacity = new_capacity;
Not if realloc() returns NULL. We should check for that, probably by using xrealloc().
> +void ivec_zero(void *self_, size_t capacity)
> +{
> + struct IVec_c_void *self = self_;
> +
> + self->ptr = calloc(capacity, self->element_size);
We should be handling allocation failures here probably by using xcalloc().
> +void ivec_reserve(void *self_, size_t additional)
> +{
> + struct IVec_c_void *self = self_;
> +
> + size_t growby = 128;
> + if (self->capacity > growby)
> + growby = self->capacity;
> + if (additional > growby)
> + growby = additional;
This growth strategy differs from both ALLOC_GROW() and XDL_ALLOC_GROW(), if there isn't a good reason for that we should perhaps just use ALLOC_GROW() here.
> +void ivec_push(void *self_, const void *value)
> +{
> + struct IVec_c_void *self = self_;
> + void *dst = NULL;
> +
> + if (self->length == self->capacity)
> + ivec_reserve(self, 1);
> +
> + dst = (uint8_t*)self->ptr + self->length * self->element_size;
> + memcpy(dst, value, self->element_size);
If self->element_size was a compile time constant the compiler could easily optimize this call away. I'm not sure that is easy to achieve though.
> + self->length++;
> +}
> +
> +void ivec_free(void *self_)
Normally we'd call a like this that free the allocations and re-initializes the members ivec_clear()
> +{
> + struct IVec_c_void *self = self_;
> +
> + free(self->ptr);
> + self->ptr = NULL;
> + self->length = 0;
> + self->capacity = 0;
> + // DO NOT MODIFY element_size!!!
> +}
> +
> +void ivec_move(void *src_, void *dst_)
> +{
> + struct IVec_c_void *src = src_;
> + struct IVec_c_void *dst = dst_;
Maybe we should add
if (src->element_size != dst->element_size)
BUG("moving incompatible arrays");
> +
> + ivec_free(dst);
> + dst->ptr = src->ptr;
> + dst->length = src->length;
> + dst->capacity = src->capacity;
> + // DO NOT MODIFY element_size!!!
As the element sizes must match maybe *dst = *src would be clearer?
> +
> + src->ptr = NULL;
> + src->length = 0;
> + src->capacity = 0;
> + // DO NOT MODIFY element_size!!!
> +}
> diff --git a/compat/ivec.h b/compat/ivec.h
> new file mode 100644
> index 0000000000..654a05c506
> --- /dev/null
> +++ b/compat/ivec.h
> @@ -0,0 +1,52 @@
> +#ifndef IVEC_H
> +#define IVEC_H
> +
> +#include <git-compat-util.h>
It would be nice to have some documentation in this header, see the examples in strvec.h and hashmap.h
> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
This is a bit cumbersome to use compared to our usual *_INIT macros. I'm struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE cannot define a per-type initializer macro and I we cannot initialize the element size without knowing the type.
> +
> +#ifndef CBINDGEN
> +#define DEFINE_IVEC_TYPE(type, suffix) \
> +struct IVec_##suffix { \
> + type* ptr; \
> + size_t length; \
> + size_t capacity; \
> + size_t element_size; \
> +}
I wonder if we want to define type safe inline safe wrappers for the ivec_* functions here. I think the only functions where the element type matters are ivec_move() and ivec_push(), for the others like ivec_zero(), ivec_reserve() and ivec_free() the element type does not matter. ivec_push() would certainly be easier to use with a wrapper as means we can avoid forcing the caller to take the address of the value.
static inline ivec_##suffix##_push(struct IVec_##suffix *self, type value) { \
const void *ptr = &value; \
ivec_push(self, ptr); \
}
I'll try and take a look at the rest of this series next week
Thanks
Phillip
> +
> +DEFINE_IVEC_TYPE(bool, bool);
> +
> +DEFINE_IVEC_TYPE(uint8_t, u8);
> +DEFINE_IVEC_TYPE(uint16_t, u16);
> +DEFINE_IVEC_TYPE(uint32_t, u32);
> +DEFINE_IVEC_TYPE(uint64_t, u64);
> +
> +DEFINE_IVEC_TYPE(int8_t, i8);
> +DEFINE_IVEC_TYPE(int16_t, i16);
> +DEFINE_IVEC_TYPE(int32_t, i32);
> +DEFINE_IVEC_TYPE(int64_t, i64);
> +
> +DEFINE_IVEC_TYPE(float, f32);
> +DEFINE_IVEC_TYPE(double, f64);
> +
> +DEFINE_IVEC_TYPE(size_t, usize);
> +DEFINE_IVEC_TYPE(ssize_t, isize);
> +#endif
> +
> +void ivec_init(void *self_, size_t element_size);
> +
> +void ivec_zero(void *self_, size_t capacity);
> +
> +void ivec_reserve_exact(void *self_, size_t additional);
> +
> +void ivec_reserve(void *self_, size_t additional);
> +
> +void ivec_shrink_to_fit(void *self_);
> +
> +void ivec_push(void *self_, const void *value);
> +
> +void ivec_free(void *self_);
> +
> +void ivec_move(void *src, void *dst);
> +
> +#endif /* IVEC_H */
> diff --git a/meson.build b/meson.build
> index dd52efd1c8..42ac0c8c42 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -302,6 +302,7 @@ libgit_sources = [
> 'commit.c',
> 'common-exit.c',
> 'common-init.c',
> + 'compat/ivec.c',
> 'compat/nonblock.c',
> 'compat/obstack.c',
> 'compat/open.c',There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> > diff --git a/compat/ivec.c b/compat/ivec.c
> > new file mode 100644
> > index 0000000000..0a777e78dc
> > --- /dev/null
> > +++ b/compat/ivec.c
> > @@ -0,0 +1,113 @@
> > +#include "ivec.h"
> > +
> > +struct IVec_c_void {
>
> We normally use all lower case names for structs but as this is shared
> with rust it maybe makes sense to use CamelCase so the names are the
> same in both languages.
My preference would be all lowercase, but cbindgen insists on using
the same casing as was used in Rust. I don't think there's a way to
make cbindgen use all lowercase for structs.
> > + void *ptr;
> > + size_t length;
> > + size_t capacity;
> > + size_t element_size;
> > +};
> > +
> > +static void _set_capacity(void *self_, size_t new_capacity)
> > +{
> > + struct IVec_c_void *self = self_;
>
> Passing any of the ivec variants defined below to this function invokes
> undefined behavior because we're not casting the pointer back to the
> orginal type. However I think on the platforms we care about
> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
If someone finds that this code does not work because of this
assumption I'd like to know. But I can't fathom a case where it
wouldn't work.
> > +
> > + if (new_capacity == self->capacity) {
> > + return;
> > + }
> > + if (new_capacity == 0) {
> > + free(self->ptr);
> > + self->ptr = NULL;
> > + } else {
> > + self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> > + }
> > + self->capacity = new_capacity;
>
> Not if realloc() returns NULL. We should check for that, probably by
> using xrealloc().
>
> > +void ivec_zero(void *self_, size_t capacity)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + self->ptr = calloc(capacity, self->element_size);
>
> We should be handling allocation failures here probably by using xcalloc().
I've changed it to xrealloc() similar for the calloc() call.
> > +void ivec_reserve(void *self_, size_t additional)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + size_t growby = 128;
> > + if (self->capacity > growby)
> > + growby = self->capacity;
> > + if (additional > growby)
> > + growby = additional;
>
> This growth strategy differs from both ALLOC_GROW() and
> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
> perhaps just use ALLOC_GROW() here.
XDL_ALLOW_GROW() can't be used because the pointer is always a void*
in this function.
> > +void ivec_push(void *self_, const void *value)
> > +{
> > + struct IVec_c_void *self = self_;
> > + void *dst = NULL;
> > +
> > + if (self->length == self->capacity)
> > + ivec_reserve(self, 1);
> > +
> > + dst = (uint8_t*)self->ptr + self->length * self->element_size;
> > + memcpy(dst, value, self->element_size);
>
> If self->element_size was a compile time constant the compiler could
> easily optimize this call away. I'm not sure that is easy to achieve though.
The problem is that I didn't want all of ivec to be macros that looked
like function calls. I wanted to minimize use of macros so that it was
easier to port and verify that the Rust implementation matches the
behavior of the C implementation.
> > +void ivec_free(void *self_)
>
> Normally we'd call a like this that free the allocations and
> re-initializes the members ivec_clear()
In Rust Vec.clear() means to set length to zero, but leaves the
allocation alone. The reason why I'm zeroing the struct is to help
avoid FFI issues. If not zero then what should the members be set to,
to indicate that using the struct is not valid anymore? In Rust an
object is freed when it goes out of scope and _cannot_ be accessed
afterward.
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + free(self->ptr);
> > + self->ptr = NULL;
> > + self->length = 0;
> > + self->capacity = 0;
> > + // DO NOT MODIFY element_size!!!
> > +}
> > +
> > +void ivec_move(void *src_, void *dst_)
> > +{
> > + struct IVec_c_void *src = src_;
> > + struct IVec_c_void *dst = dst_;
>
> Maybe we should add
>
> if (src->element_size != dst->element_size)
> BUG("moving incompatible arrays");
I'll do that.
> > +
> > + ivec_free(dst);
> > + dst->ptr = src->ptr;
> > + dst->length = src->length;
> > + dst->capacity = src->capacity;
> > + // DO NOT MODIFY element_size!!!
>
> As the element sizes must match maybe *dst = *src would be clearer?
That seems fine.
> > +
> > + src->ptr = NULL;
> > + src->length = 0;
> > + src->capacity = 0;
> > + // DO NOT MODIFY element_size!!!
> > +}
> > diff --git a/compat/ivec.h b/compat/ivec.h
> > new file mode 100644
> > index 0000000000..654a05c506
> > --- /dev/null
> > +++ b/compat/ivec.h
> > @@ -0,0 +1,52 @@
> > +#ifndef IVEC_H
> > +#define IVEC_H
> > +
> > +#include <git-compat-util.h>
>
> It would be nice to have some documentation in this header, see the
> examples in strvec.h and hashmap.h
>
> > +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
>
> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
> cannot define a per-type initializer macro and I we cannot initialize
> the element size without knowing the type.
I don't see what's cumbersome about it. Maybe an example use case
would clarify things.
```
DEFINE_IVEC_TYPE(xrecord_t, xrecord);
void some_function() {
struct IVec_xrecord rec;
IVEC_INIT(rec); // i.e. ivec_init(&rec, sizeof(*rec.ptr);
// use concrete functions to manipulate vector or access the array
directly via ptr
}
```
IVEC_INIT() should be used on the concrete type.
> > +
> > +#ifndef CBINDGEN
> > +#define DEFINE_IVEC_TYPE(type, suffix) \
> > +struct IVec_##suffix { \
> > + type* ptr; \
> > + size_t length; \
> > + size_t capacity; \
> > + size_t element_size; \
> > +}
>
> I wonder if we want to define type safe inline safe wrappers for the
> ivec_* functions here. I think the only functions where the element type
> matters are ivec_move() and ivec_push(), for the others like
> ivec_zero(), ivec_reserve() and ivec_free() the element type does not
> matter. ivec_push() would certainly be easier to use with a wrapper as
> means we can avoid forcing the caller to take the address of the value.
>
> static inline ivec_##suffix##_push(struct IVec_##suffix *self, type
> value) { \
> const void *ptr = &value; \
> ivec_push(self, ptr); \
> }
I turned ivec_push() into a macro, but the rest will remain as
concrete functions.There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
I've Cc'd Peff and René for a second opinion if you have time please.
On 15/01/2026 15:55, Ezekiel Newren wrote:
> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
>>> +static void _set_capacity(void *self_, size_t new_capacity)
>>> +{
>>> + struct IVec_c_void *self = self_;
>>
>> Passing any of the ivec variants defined below to this function invokes
>> undefined behavior because we're not casting the pointer back to the
>> orginal type. However I think on the platforms we care about
>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
> > If someone finds that this code does not work because of this
> assumption I'd like to know. But I can't fathom a case where it
> wouldn't work.
So we have two different structs
struct IVec_c_void {
void *ptr;
size_t length;
size_t capacity;
size_t element_size;
}
and
struct Ivec_u8 {
uint8_t *ptr;
size_t length;
size_t capacity;
size_t element_size;
}
One the platforms we care about they will have the same memory layout as all pointers have the same representation. However I don't think they are "compatible types" in the language of the C standard because the type of the "ptr" member differs. That means casting IVec_u8* to IVec_c_void* either directly or via void* is undefined and so
struct IVec_u8 vec;
ivec_init(&vec, sizeof(*vec.ptr));
is undefined. For the compiler to see the undefined cast it needs to look across translation units because the implementation of ivec_init() will be in a separate file to where it is called. Maybe that and the fact they have the same memory layout saves us from having to worry too much though I'm always nervous of undefined behavior.
An alternative would be to pass the individual struct members as function parameters
void ivec_init(void **vec, size_t &length, size_t &capacity,
size_t &element_size_, size_t element_size)
{
*vec = NULL;
*length = 0;
*capacity = 0;
*element_size_ = element_size;
}
and have DEFINE_IVEC_TYPE create typesafe wrappers
static inline void ivec_u8_init(struct IVec_u8 *vec)
{
void *ptr = vec->ptr;
ivec_init(&ptr, &v->length, &v->capacity,
&v->element_size, sizeof(*(v->ptr));
vec->ptr = ptr;
}
That's safe because we cast the "ptr" member to "void*" and then back to the original type. On the rust side the implementation of IVec<T> would also need to split out the individual struct members when it calls ivec_init() etc. It's all a bit more effort but the benefit is that we don't have any undefined behavior and we have a nice typesafe C interface to 'struct IVec_*'.
Thanks
PhillipThere was a problem hiding this comment.
René Scharfe wrote on the Git mailing list (how to reply to this email):
On 1/16/26 11:39 AM, Phillip Wood wrote:
> I've Cc'd Peff and René for a second opinion if you have time please.
>
> On 15/01/2026 15:55, Ezekiel Newren wrote:
>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>>>> +static void _set_capacity(void *self_, size_t new_capacity)
>>>> +{
>>>> + struct IVec_c_void *self = self_;
>>>
>>> Passing any of the ivec variants defined below to this function invokes
>>> undefined behavior because we're not casting the pointer back to the
>>> orginal type. However I think on the platforms we care about
>>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
>>
>> If someone finds that this code does not work because of this
>> assumption I'd like to know. But I can't fathom a case where it
>> wouldn't work.
>
> So we have two different structs
>
> struct IVec_c_void {
> void *ptr;
> size_t length;
> size_t capacity;
> size_t element_size;
> }
>
> and
>
> struct Ivec_u8 {
> uint8_t *ptr;
> size_t length;
> size_t capacity;
> size_t element_size;
> }
>
> One the platforms we care about they will have the same memory
> layout as all pointers have the same representation. However I don't
> think they are "compatible types" in the language of the C standard
> because the type of the "ptr" member differs. That means casting
> IVec_u8* to IVec_c_void* either directly or via void* is undefined
> and so
>
> struct IVec_u8 vec;
> ivec_init(&vec, sizeof(*vec.ptr));
>
> is undefined. For the compiler to see the undefined cast it needs to
> look across translation units because the implementation of
> ivec_init() will be in a separate file to where it is called. Maybe
> that and the fact they have the same memory layout saves us from
> having to worry too much though I'm always nervous of undefined
> behavior.
True. The GCC docs give a fun example of what a compiler might do
when using different struct types to access the same memory:
https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Aliasing-Type-Rules.html
Not sure it applies to this case, but the point is that compilers
can and will do terrifying things when they smell UB, with little
concern for safety or original intent.
> An alternative would be to pass the individual struct members as function parameters
>
> void ivec_init(void **vec, size_t &length, size_t &capacity,
> size_t &element_size_, size_t element_size)
> {
> *vec = NULL;
> *length = 0;
> *capacity = 0;
> *element_size_ = element_size;
> }
The ampersands (&) should be asterisks (*), right?
> and have DEFINE_IVEC_TYPE create typesafe wrappers
>
> static inline void ivec_u8_init(struct IVec_u8 *vec)
> {
> void *ptr = vec->ptr;
> ivec_init(&ptr, &v->length, &v->capacity,
> &v->element_size, sizeof(*(v->ptr));
> vec->ptr = ptr;
> }
Mixes "v" and "vec", misses a closing parenthesis. Looks viable,
though, and this method should be applicable to the rest of the
functions as well (on the C side).
I guess this doesn't require an element_size member anymore as
each wrapper can pass in the sizeof value.
> That's safe because we cast the "ptr" member to "void*" and then
> back to the original type. On the rust side the implementation of
> IVec<T> would also need to split out the individual struct members
> when it calls ivec_init() etc. It's all a bit more effort but the
> benefit is that we don't have any undefined behavior and we have a
> nice typesafe C interface to 'struct IVec_*'.
Right. No idea how ugly this would be on the Rust side, though.
René
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 16/01/2026 20:19, René Scharfe wrote:
> On 1/16/26 11:39 AM, Phillip Wood wrote:
>> I've Cc'd Peff and René for a second opinion if you have time please.
>>
>> On 15/01/2026 15:55, Ezekiel Newren wrote:
>>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>
>>>>> +static void _set_capacity(void *self_, size_t new_capacity)
>>>>> +{
>>>>> + struct IVec_c_void *self = self_;
>>>>
>>>> Passing any of the ivec variants defined below to this function invokes
>>>> undefined behavior because we're not casting the pointer back to the
>>>> orginal type. However I think on the platforms we care about
>>>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
>>>
>>> If someone finds that this code does not work because of this
>>> assumption I'd like to know. But I can't fathom a case where it
>>> wouldn't work.
>>
>> So we have two different structs
>>
>> struct IVec_c_void {
>> void *ptr;
>> size_t length;
>> size_t capacity;
>> size_t element_size;
>> }
>>
>> and
>>
>> struct Ivec_u8 {
>> uint8_t *ptr;
>> size_t length;
>> size_t capacity;
>> size_t element_size;
>> }
>>
>> One the platforms we care about they will have the same memory
>> layout as all pointers have the same representation. However I don't
>> think they are "compatible types" in the language of the C standard
>> because the type of the "ptr" member differs. That means casting
>> IVec_u8* to IVec_c_void* either directly or via void* is undefined
>> and so
>>
>> struct IVec_u8 vec;
>> ivec_init(&vec, sizeof(*vec.ptr));
>>
>> is undefined. For the compiler to see the undefined cast it needs to
>> look across translation units because the implementation of
>> ivec_init() will be in a separate file to where it is called. Maybe
>> that and the fact they have the same memory layout saves us from
>> having to worry too much though I'm always nervous of undefined
>> behavior.
> > True. The GCC docs give a fun example of what a compiler might do
> when using different struct types to access the same memory:
> > https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Aliasing-Type-Rules.html
Thanks for the link
> Not sure it applies to this case, but the point is that compilers
> can and will do terrifying things when they smell UB, with little
> concern for safety or original intent.
> >> An alternative would be to pass the individual struct members as function parameters
>>
>> void ivec_init(void **vec, size_t &length, size_t &capacity,
>> size_t &element_size_, size_t element_size)
>> {
>> *vec = NULL;
>> *length = 0;
>> *capacity = 0;
>> *element_size_ = element_size;
>> }
> > The ampersands (&) should be asterisks (*), right?
Indeed, that's embarrassing - I must have been thinking of the caller.
>> and have DEFINE_IVEC_TYPE create typesafe wrappers
>>
>> static inline void ivec_u8_init(struct IVec_u8 *vec)
>> {
>> void *ptr = vec->ptr;
>> ivec_init(&ptr, &v->length, &v->capacity,
>> &v->element_size, sizeof(*(v->ptr));
>> vec->ptr = ptr;
>> }
> > Mixes "v" and "vec", misses a closing parenthesis. Looks viable,
> though, and this method should be applicable to the rest of the
> functions as well (on the C side).
> > I guess this doesn't require an element_size member anymore as
> each wrapper can pass in the sizeof value.
Good point
>> That's safe because we cast the "ptr" member to "void*" and then
>> back to the original type. On the rust side the implementation of
>> IVec<T> would also need to split out the individual struct members
>> when it calls ivec_init() etc. It's all a bit more effort but the
>> benefit is that we don't have any undefined behavior and we have a
>> nice typesafe C interface to 'struct IVec_*'.
> Right. No idea how ugly this would be on the Rust side, though.
I'm hoping it's not too bad and `impl IVec<T>` just contains the equivalent of the wrappers generated by DEFINE_IVEC_TYPE()
Thanks
Phillip
> > René
> There was a problem hiding this comment.
Jeff King wrote on the Git mailing list (how to reply to this email):
On Wed, Jan 21, 2026 at 02:00:15PM -0700, Ezekiel Newren wrote:
> What about adding clar unit tests to make sure that different ivec
> types have the same size and layout? e.g. sizeof(IVec_c_void) ==
> sizeof(IVec_u8);
> sizeof(IVec_c_void) == sizeof(IVec_u16);
> sizeof(IVec_c_void) == sizeof(IVec_u32);
> sizeof(IVec_c_void) == sizeof(IVec_u64);
> ...
>
> As well as other tests for ivec.
I'm a little hesitant in general to have run-time tests for properties
around undefined behavior, just because the compiler is allowed to do a
lot of tricky things when we get into that territory. Plus it is not
really _solving_ the problem, but perhaps just alerting us slightly
sooner than the production code itself crashing and burning.
You'd also need to check the pointer field sizes directly due to
padding. I don't think it's sufficient, due to padding. If one pointer
is 4 bytes and another is 8 (for example), but the element afterwards
requires 8-byte alignment, then the compiler will have to insert 4 bytes
of padding. And the resulting struct size will be the same. You'd have
to more directly check that sizeof(uint_t*) == sizeof(void *), I think.
So I dunno. I am not a compiler expert, nor a rust expert, nor really
know anything about rust/C ABI boundaries. There might be no problem at
all here, and I'm only commenting on what I know is possible (albeit
unlikely) from the C side.
-PeffThere was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
Jeff King <peff@peff.net> writes:
> On Wed, Jan 21, 2026 at 02:00:15PM -0700, Ezekiel Newren wrote:
>
>> What about adding clar unit tests to make sure that different ivec
>> types have the same size and layout? e.g. sizeof(IVec_c_void) ==
>> sizeof(IVec_u8);
>> sizeof(IVec_c_void) == sizeof(IVec_u16);
>> sizeof(IVec_c_void) == sizeof(IVec_u32);
>> sizeof(IVec_c_void) == sizeof(IVec_u64);
>> ...
>>
>> As well as other tests for ivec.
>
> I'm a little hesitant in general to have run-time tests for properties
> around undefined behavior, just because the compiler is allowed to do a
> lot of tricky things when we get into that territory. Plus it is not
> really _solving_ the problem, but perhaps just alerting us slightly
> sooner than the production code itself crashing and burning.
Yup, by definition, testing undefined behaviour with code is more or
less pointless. Implementation defined behaviour, maybe, but not
undefined ones, please.
I thought you already gave them that having different possibilities
in a union would work correctly, but perhaps I was reading a
different thread? I dunno...
There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Tue, Jan 20, 2026 at 7:06 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 15/01/2026 15:55, Ezekiel Newren wrote:
> > On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >>> +void ivec_reserve(void *self_, size_t additional)
> >>> +{
> >>> + struct IVec_c_void *self = self_;
> >>> +
> >>> + size_t growby = 128;
> >>> + if (self->capacity > growby)
> >>> + growby = self->capacity;
> >>> + if (additional > growby)
> >>> + growby = additional;
> >>
> >> This growth strategy differs from both ALLOC_GROW() and
> >> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
> >> perhaps just use ALLOC_GROW() here.
> >
> > XDL_ALLOW_GROW() can't be used because the pointer is always a void*
> > in this function.
>
> Oh right. I'm not sure that's not a reason to use a different growth
> strategy though. The minimum size of 128 elements is probably good for
> the xdiff code that creates arrays with one element per line but if this
> is supposed to be for general use it is going to waste space when we're
> allocating a lot of small arrays. ALLOC_GROW() uses alloc_nr() to
> calculate the new side so perhaps we could use that here?
If ivec_reserve() isn't suitable then ivec_reserve_exact() should be
used instead.
> >>> +void ivec_push(void *self_, const void *value)
> >>> +{
> >>> + struct IVec_c_void *self = self_;
> >>> + void *dst = NULL;
> >>> +
> >>> + if (self->length == self->capacity)
> >>> + ivec_reserve(self, 1);
> >>> +
> >>> + dst = (uint8_t*)self->ptr + self->length * self->element_size;
> >>> + memcpy(dst, value, self->element_size);
> >>
> >> If self->element_size was a compile time constant the compiler could
> >> easily optimize this call away. I'm not sure that is easy to achieve though.
> >
> > The problem is that I didn't want all of ivec to be macros that looked
> > like function calls. I wanted to minimize use of macros so that it was
> > easier to port and verify that the Rust implementation matches the
> > behavior of the C implementation.
>
> I think that's a reasonable concern. So is the plan to have a parallel
> rust implementation of these functions rather than call the C
> implementation from rust?
Yes, the Rust implementation will be independent of the C
implementation, but will behave the same way. That's why I'm calling
it an interoperable vec as opposed to a compatible vec. Rust can't
call the C ivec functions and C can't call the Rust ivec functions,
but they'll behave the same way.
> >>> +void ivec_free(void *self_)
> >>
> >> Normally we'd call a like this that free the allocations and
> >> re-initializes the members ivec_clear()
> >
> > In Rust Vec.clear() means to set length to zero, but leaves the
> > allocation alone. The reason why I'm zeroing the struct is to help
> > avoid FFI issues. If not zero then what should the members be set to,
> > to indicate that using the struct is not valid anymore? In Rust an
> > object is freed when it goes out of scope and _cannot_ be accessed
> > afterward.
Maybe I should call this ivec_drop(). Though the notion of explicitly
freeing an object in Rust is _almost_ nonsense. The way you free
something in Rust is to let it go out of scope.
> I'm aware that Vec::clear() has different semantics (it does what
> strbuf_reset() does). That's unfortunate but this function has different
> semantics to all the other *_free() functions in git. Our coding
> guidelines say
>
> - There are several common idiomatic names for functions performing
> specific tasks on a structure `S`:
>
> - `S_init()` initializes a structure without allocating the
> structure itself.
>
> - `S_release()` releases a structure's contents without freeing the
> structure.
>
> - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
> such that the structure is directly usable after clearing it. When
> `S_clear()` is provided, `S_init()` shall not allocate resources
> that need to be released again.
>
> - `S_free()` releases a structure's contents and frees the
> structure.
>
> As we write more rust code and so wrap more of our existing structs
> we're going to be wrapping C code that uses the definitions above so I
> think we should do the same with struct IVec_*.
I disagree. IVec isn't a wrapper around an existing struct. ivec is
meant to very closely mimic Rust's Vec while guaranteeing
interoperability. For things like strbuf I haven't conceived of a
solution for that yet. Making ivec diverge from Rust's Vec will result
in POLA violations due to different behavior when refactoring an
IVec<your_type_here> to Vec<your_type_here>.
> >>> diff --git a/compat/ivec.h b/compat/ivec.h
> >>> new file mode 100644
> >>> index 0000000000..654a05c506
> >>> --- /dev/null
> >>> +++ b/compat/ivec.h
> >>> @@ -0,0 +1,52 @@
> >>> +#ifndef IVEC_H
> >>> +#define IVEC_H
> >>> +
> >>> +#include <git-compat-util.h>
> >>
> >> It would be nice to have some documentation in this header, see the
> >> examples in strvec.h and hashmap.h
> >>
> >>> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
> >>
> >> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
> >> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
> >> cannot define a per-type initializer macro and I we cannot initialize
> >> the element size without knowing the type.
> >
> > I don't see what's cumbersome about it. Maybe an example use case
> > would clarify things.
>
> It is cumbersome because it separates the initialization from the
> declaration. Normally our *_INIT macros are initializer lists so we can
> write
>
> struct strbuf = STRBUF_INIT;
>
> which keeps the declaration and initialization together. Although
> they're on adjacent lines in your example in real code the
> initialization likely to be separated from the declaration by other
> variable declarations.
Ah I see what you mean now. I'll experiment with making IVEC_INIT()
work like that. One wrinkle is that STRBUF_INIT is a single concrete
type whereas IVEC_INIT() is meant for generic types.There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Wed, Jan 21, 2026 at 2:31 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > On Wed, Jan 21, 2026 at 02:00:15PM -0700, Ezekiel Newren wrote:
> >
> >> What about adding clar unit tests to make sure that different ivec
> >> types have the same size and layout? e.g. sizeof(IVec_c_void) ==
> >> sizeof(IVec_u8);
> >> sizeof(IVec_c_void) == sizeof(IVec_u16);
> >> sizeof(IVec_c_void) == sizeof(IVec_u32);
> >> sizeof(IVec_c_void) == sizeof(IVec_u64);
> >> ...
> >>
> >> As well as other tests for ivec.
> >
> > I'm a little hesitant in general to have run-time tests for properties
> > around undefined behavior, just because the compiler is allowed to do a
> > lot of tricky things when we get into that territory. Plus it is not
> > really _solving_ the problem, but perhaps just alerting us slightly
> > sooner than the production code itself crashing and burning.
>
> Yup, by definition, testing undefined behaviour with code is more or
> less pointless. Implementation defined behaviour, maybe, but not
> undefined ones, please.
>
> I thought you already gave them that having different possibilities
> in a union would work correctly, but perhaps I was reading a
> different thread? I dunno...
>
In my opinion the proper solution to this is to document that any
platform with different size pointers for different types is not
supported by Git. Which would make using Git on those platforms "use
at your own risk".There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 21/01/2026 21:39, Ezekiel Newren wrote:
> On Tue, Jan 20, 2026 at 7:06 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> Hi Ezekiel
>>
>> On 15/01/2026 15:55, Ezekiel Newren wrote:
>>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>>> +void ivec_reserve(void *self_, size_t additional)
>>>>> +{
>>>>> + struct IVec_c_void *self = self_;
>>>>> +
>>>>> + size_t growby = 128;
>>>>> + if (self->capacity > growby)
>>>>> + growby = self->capacity;
>>>>> + if (additional > growby)
>>>>> + growby = additional;
>>>>
>>>> This growth strategy differs from both ALLOC_GROW() and
>>>> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
>>>> perhaps just use ALLOC_GROW() here.
>>>
>>> XDL_ALLOW_GROW() can't be used because the pointer is always a void*
>>> in this function.
>>
>> Oh right. I'm not sure that's not a reason to use a different growth
>> strategy though. The minimum size of 128 elements is probably good for
>> the xdiff code that creates arrays with one element per line but if this
>> is supposed to be for general use it is going to waste space when we're
>> allocating a lot of small arrays. ALLOC_GROW() uses alloc_nr() to
>> calculate the new side so perhaps we could use that here?
> > If ivec_reserve() isn't suitable then ivec_reserve_exact() should be
> used instead.
If some C code that pushes one element at a time to an array using ALLOC_GROW() is converted to use an ivec then we don't want to change the code behaves - that means it should grow the array in the same way. I don't see how the suggestion to use ivec_reserve_exact() helps in that situation. What is the advantage in having a different growth characteristic?
>>>>> +void ivec_push(void *self_, const void *value)
>>>>> +{
>>>>> + struct IVec_c_void *self = self_;
>>>>> + void *dst = NULL;
>>>>> +
>>>>> + if (self->length == self->capacity)
>>>>> + ivec_reserve(self, 1);
>>>>> +
>>>>> + dst = (uint8_t*)self->ptr + self->length * self->element_size;
>>>>> + memcpy(dst, value, self->element_size);
>>>>
>>>> If self->element_size was a compile time constant the compiler could
>>>> easily optimize this call away. I'm not sure that is easy to achieve though.
>>>
>>> The problem is that I didn't want all of ivec to be macros that looked
>>> like function calls. I wanted to minimize use of macros so that it was
>>> easier to port and verify that the Rust implementation matches the
>>> behavior of the C implementation.
>>
>> I think that's a reasonable concern. So is the plan to have a parallel
>> rust implementation of these functions rather than call the C
>> implementation from rust?
> > Yes, the Rust implementation will be independent of the C
> implementation, but will behave the same way. That's why I'm calling
> it an interoperable vec as opposed to a compatible vec. Rust can't
> call the C ivec functions and C can't call the Rust ivec functions,
> but they'll behave the same way.
Interesting - I'm curious what the advantage of that is over having rust call the C implementation? I can see you wouldn't want to be calling into C for each ivec.push() call, but checking if there is room to push the new element in rust and calling into C to extend the vector if not should be reasonable and then you don't have to re-implement everything in rust.
>>>>> +void ivec_free(void *self_)
>>>>
>>>> Normally we'd call a like this that free the allocations and
>>>> re-initializes the members ivec_clear()
>>>
>>> In Rust Vec.clear() means to set length to zero, but leaves the
>>> allocation alone. The reason why I'm zeroing the struct is to help
>>> avoid FFI issues. If not zero then what should the members be set to,
>>> to indicate that using the struct is not valid anymore? In Rust an
>>> object is freed when it goes out of scope and _cannot_ be accessed
>>> afterward.
> > Maybe I should call this ivec_drop(). Though the notion of explicitly
> freeing an object in Rust is _almost_ nonsense. The way you free
> something in Rust is to let it go out of scope.
Indeed - which means this wont be a public function in rust and so why do we worry about naming it ivec_clear()? At least ivec_drop() does not conflict with any of the standard function suffixes that we're already using in git.
>> I'm aware that Vec::clear() has different semantics (it does what
>> strbuf_reset() does). That's unfortunate but this function has different
>> semantics to all the other *_free() functions in git. Our coding
>> guidelines say
>>
>> - There are several common idiomatic names for functions performing
>> specific tasks on a structure `S`:
>>
>> - `S_init()` initializes a structure without allocating the
>> structure itself.
>>
>> - `S_release()` releases a structure's contents without freeing the
>> structure.
>>
>> - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
>> such that the structure is directly usable after clearing it. When
>> `S_clear()` is provided, `S_init()` shall not allocate resources
>> that need to be released again.
>>
>> - `S_free()` releases a structure's contents and frees the
>> structure.
>>
>> As we write more rust code and so wrap more of our existing structs
>> we're going to be wrapping C code that uses the definitions above so I
>> think we should do the same with struct IVec_*.
> > I disagree. IVec isn't a wrapper around an existing struct.
So just because it is a new stuct it shouldn't have to follow the existing naming conventions?
> ivec is
> meant to very closely mimic Rust's Vec while guaranteeing
> interoperability. For things like strbuf I haven't conceived of a
> solution for that yet. Making ivec diverge from Rust's Vec will result
> in POLA violations due to different behavior when refactoring an
> IVec<your_type_here> to Vec<your_type_here>.
On the other hand, vec.reset() does not exist so you'd get a compiler error if you forgot to rename those calls when changing from IVec to Vec and the rust code wouldn't be calling ivec.clear(). I'm not sure citing POLA concerns is very convincing as ivec_free() in C is a POLA violation for anyone familiar with git's code base so it's not like there's a choice that avoids that concern.
Thanks
Phillip
>>>>> diff --git a/compat/ivec.h b/compat/ivec.h
>>>>> new file mode 100644
>>>>> index 0000000000..654a05c506
>>>>> --- /dev/null
>>>>> +++ b/compat/ivec.h
>>>>> @@ -0,0 +1,52 @@
>>>>> +#ifndef IVEC_H
>>>>> +#define IVEC_H
>>>>> +
>>>>> +#include <git-compat-util.h>
>>>>
>>>> It would be nice to have some documentation in this header, see the
>>>> examples in strvec.h and hashmap.h
>>>>
>>>>> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
>>>>
>>>> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
>>>> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
>>>> cannot define a per-type initializer macro and I we cannot initialize
>>>> the element size without knowing the type.
>>>
>>> I don't see what's cumbersome about it. Maybe an example use case
>>> would clarify things.
>>
>> It is cumbersome because it separates the initialization from the
>> declaration. Normally our *_INIT macros are initializer lists so we can
>> write
>>
>> struct strbuf = STRBUF_INIT;
>>
>> which keeps the declaration and initialization together. Although
>> they're on adjacent lines in your example in real code the
>> initialization likely to be separated from the declaration by other
>> variable declarations.
> > Ah I see what you mean now. I'll experiment with making IVEC_INIT()
> work like that. One wrinkle is that STRBUF_INIT is a single concrete
> type whereas IVEC_INIT() is meant for generic types.
If you can get it to work that would be great, but I can't think of a way of getting it to work for a generic type.
Thanks
Phillip|
User |
|
This patch series was integrated into seen via 54f9c88. |
|
This patch series was integrated into seen via a3a8fdd. |
| @@ -1107,6 +1107,7 @@ LIB_OBJS += commit-reach.o | |||
| LIB_OBJS += commit.o | |||
There was a problem hiding this comment.
René Scharfe wrote on the Git mailing list (how to reply to this email):
On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> diff --git a/compat/ivec.c b/compat/ivec.c
> new file mode 100644
> index 0000000000..0a777e78dc
> --- /dev/null
> +++ b/compat/ivec.c
> @@ -0,0 +1,113 @@
> +#include "ivec.h"
> +
> +struct IVec_c_void {
> + void *ptr;
> + size_t length;
> + size_t capacity;
> + size_t element_size;
> +};
> +
> +static void _set_capacity(void *self_, size_t new_capacity)
> +{
> + struct IVec_c_void *self = self_;
> +
> + if (new_capacity == self->capacity) {
> + return;
> + }
> + if (new_capacity == 0) {
> + free(self->ptr);
> + self->ptr = NULL;
> + } else {
> + self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> + }
> + self->capacity = new_capacity;
> +}
> +
> +
> +void ivec_init(void *self_, size_t element_size)
> +{
> + struct IVec_c_void *self = self_;
> +
> + self->ptr = NULL;
> + self->length = 0;
> + self->capacity = 0;
> + self->element_size = element_size;
> +}
> +
> +void ivec_zero(void *self_, size_t capacity)
> +{
> + struct IVec_c_void *self = self_;
> +
> + self->ptr = calloc(capacity, self->element_size);
> + self->length = capacity;
> + self->capacity = capacity;
> + // DO NOT MODIFY element_size!!!
> +}
> +
> +void ivec_reserve_exact(void *self_, size_t additional)
> +{
> + struct IVec_c_void *self = self_;
> +
> + _set_capacity(self, self->capacity + additional);
> +}
> +
> +void ivec_reserve(void *self_, size_t additional)
> +{
> + struct IVec_c_void *self = self_;
> +
> + size_t growby = 128;
> + if (self->capacity > growby)
> + growby = self->capacity;
> + if (additional > growby)
> + growby = additional;
> +
> + _set_capacity(self, self->capacity + growby);
> +}
Constant growth steps like these cause linear growth and quadratic
complexity. ALLOC_GROW does exponential growth with factor 1.5 to
get linear complexity. Here's an old plea to do the same:
https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/
René
There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> > diff --git a/compat/ivec.c b/compat/ivec.c
> > new file mode 100644
> > index 0000000000..0a777e78dc
> > --- /dev/null
> > +++ b/compat/ivec.c
> > @@ -0,0 +1,113 @@
> > +#include "ivec.h"
> > +
> > +struct IVec_c_void {
> > + void *ptr;
> > + size_t length;
> > + size_t capacity;
> > + size_t element_size;
> > +};
> > +
> > +static void _set_capacity(void *self_, size_t new_capacity)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + if (new_capacity == self->capacity) {
> > + return;
> > + }
> > + if (new_capacity == 0) {
> > + free(self->ptr);
> > + self->ptr = NULL;
> > + } else {
> > + self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> > + }
> > + self->capacity = new_capacity;
> > +}
> > +
> > +
> > +void ivec_init(void *self_, size_t element_size)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + self->ptr = NULL;
> > + self->length = 0;
> > + self->capacity = 0;
> > + self->element_size = element_size;
> > +}
> > +
> > +void ivec_zero(void *self_, size_t capacity)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + self->ptr = calloc(capacity, self->element_size);
> > + self->length = capacity;
> > + self->capacity = capacity;
> > + // DO NOT MODIFY element_size!!!
> > +}
> > +
> > +void ivec_reserve_exact(void *self_, size_t additional)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + _set_capacity(self, self->capacity + additional);
> > +}
> > +
> > +void ivec_reserve(void *self_, size_t additional)
> > +{
> > + struct IVec_c_void *self = self_;
> > +
> > + size_t growby = 128;
> > + if (self->capacity > growby)
> > + growby = self->capacity;
> > + if (additional > growby)
> > + growby = additional;
> > +
> > + _set_capacity(self, self->capacity + growby);
> > +}
>
> Constant growth steps like these cause linear growth and quadratic
> complexity. ALLOC_GROW does exponential growth with factor 1.5 to
> get linear complexity. Here's an old plea to do the same:
> https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/
>
> René
It _is_ exponential. ivec_reserve(&vec, 1) means grow by _at least_ 1.
I'm not using typical memory management as defined in
git-compat-util.h because I'm trying to get ivec to behave very
similarly to Rust's Vec so that when Rust is introduced into the code,
C programmers will already be familiar with how Vec operates _and_ so
that converting from IVec to Vec is as simple as refactoring IVec
declarations to Vec.
Since C does not support generics there is no _proper_ solution. What
I have come up with on the C side for ivec is my best effort
compromise.There was a problem hiding this comment.
René Scharfe wrote on the Git mailing list (how to reply to this email):
On 1/17/26 4:58 PM, Ezekiel Newren wrote:
> On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>>
>>> +void ivec_reserve(void *self_, size_t additional)
>>> +{
>>> + struct IVec_c_void *self = self_;
>>> +
>>> + size_t growby = 128;
>>> + if (self->capacity > growby)
>>> + growby = self->capacity;
>>> + if (additional > growby)
>>> + growby = additional;
>>> +
>>> + _set_capacity(self, self->capacity + growby);
>>> +}
>>
>> Constant growth steps like these cause linear growth and quadratic
>> complexity. ALLOC_GROW does exponential growth with factor 1.5 to
>> get linear complexity. Here's an old plea to do the same:
>> https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/
>>
>> René
>
> It _is_ exponential. ivec_reserve(&vec, 1) means grow by _at least_ 1.
D'oh! Right, it grows with factor 2, as growby is at least as big as
->capacity. I can't read.
René
|
User |
| @@ -21,13 +21,12 @@ | |||
| */ | |||
There was a problem hiding this comment.
René Scharfe wrote on the Git mailing list (how to reply to this email):
On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> }
>
> +struct xoccurrence
> +{
> + size_t file1, file2;
> +};
> +
> +
> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> +
>
> /*
> * Try to reduce the problem complexity, discard records that have no
> * matches on the other file. Also, lines that have multiple matches
> * might be potentially discarded if they appear in a run of discardable.
> */
> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> - long i, nm, mlim;
> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> + long i;
> + size_t nm, mlim;
> xrecord_t *recs;
> - xdlclass_t *rcrec;
> uint8_t *action1 = NULL, *action2 = NULL;
> - bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> + struct IVec_xoccurrence occ;
> + bool need_min = !!(flags & XDF_NEED_MINIMAL);
> int ret = 0;
> ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>
> + IVEC_INIT(occ);
> + ivec_zero(&occ, xe->mph_size);
This array is presized here. It is neither grown nor shrunken.
CALLOC_ARRAY would work just as well, at least at this point, no?
> +
> + for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> + size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> + occ.ptr[mph1].file1 += 1;
> + }
> +
> + for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> + size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> + occ.ptr[mph2].file2 += 1;
> + }
> +
> /*
> * Create temporary arrays that will help us decide if
> * changed[i] should remain false, or become true.
> @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> - nm = rcrec ? rcrec->len2 : 0;
> + nm = occ.ptr[recs->minimal_perfect_hash].file2;
> action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> }
>
> if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> - nm = rcrec ? rcrec->len1 : 0;
> + nm = occ.ptr[recs->minimal_perfect_hash].file1;
> action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> }
>
> @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> cleanup:
> xdl_free(action1);
> xdl_free(action2);
> + ivec_free(&occ);
>
> return ret;
> }There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> > @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> > return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> > }
> >
> > +struct xoccurrence
> > +{
> > + size_t file1, file2;
> > +};
> > +
> > +
> > +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> > +
> >
> > /*
> > * Try to reduce the problem complexity, discard records that have no
> > * matches on the other file. Also, lines that have multiple matches
> > * might be potentially discarded if they appear in a run of discardable.
> > */
> > -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> > - long i, nm, mlim;
> > +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> > + long i;
> > + size_t nm, mlim;
> > xrecord_t *recs;
> > - xdlclass_t *rcrec;
> > uint8_t *action1 = NULL, *action2 = NULL;
> > - bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> > + struct IVec_xoccurrence occ;
> > + bool need_min = !!(flags & XDF_NEED_MINIMAL);
> > int ret = 0;
> > ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> > ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> >
> > + IVEC_INIT(occ);
> > + ivec_zero(&occ, xe->mph_size);
>
> This array is presized here. It is neither grown nor shrunken.
> CALLOC_ARRAY would work just as well, at least at this point, no?
>
> > +
> > + for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> > + size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> > + occ.ptr[mph1].file1 += 1;
> > + }
> > +
> > + for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> > + size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> > + occ.ptr[mph2].file2 += 1;
> > + }
> > +
> > /*
> > * Create temporary arrays that will help us decide if
> > * changed[i] should remain false, or become true.
> > @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> > if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> > mlim = XDL_MAX_EQLIMIT;
> > for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> > - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> > - nm = rcrec ? rcrec->len2 : 0;
> > + nm = occ.ptr[recs->minimal_perfect_hash].file2;
> > action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> > }
> >
> > if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> > mlim = XDL_MAX_EQLIMIT;
> > for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> > - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> > - nm = rcrec ? rcrec->len1 : 0;
> > + nm = occ.ptr[recs->minimal_perfect_hash].file1;
> > action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> > }
> >
> > @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> > cleanup:
> > xdl_free(action1);
> > xdl_free(action2);
> > + ivec_free(&occ);
> >
> > return ret;
> > }
In Rust the memory management macros defined in git-compat-util.h will
not be available. ivec was built expressly to bridge the gap between C
and Rust. I'm avoiding using those macros because I'm trying to get C
programmers familiar with how Rust's Vec operates without forcing them
to read and write in Rust. Also, it makes converting from IVec to Vec
super easy.
ivec_zero() also sets length and capacity. Also CALLOC_ARRAY needs to
know the type of the pointer which ivec_zero() does not have access
to. This is one of the few ivec functions that does not have a direct
equivalent in Rust's Vec, but is faster than what is logically
equivalent in Rust.
In Rust the closest safe equivalent would look like:
let size = 35;
let mut vec = Vec::<u64>::new();
vec.reserve_exact(size);
vec.fill(0); // requires that T implements the `Copy` trait
The unsafe version would look like:
let size = 35;
let mut vec = Vec::<u64>::new();
vec.reserve_exact(size);
unsafe {
std::ptr::write_bytes(vec.as_mut_ptr(), 0, size * size_of::<u64>());
}There was a problem hiding this comment.
René Scharfe wrote on the Git mailing list (how to reply to this email):
On 1/17/26 5:34 PM, Ezekiel Newren wrote:
> On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>>
>> On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
>>> @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
>>> return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>>> }
>>>
>>> +struct xoccurrence
>>> +{
>>> + size_t file1, file2;
>>> +};
>>> +
>>> +
>>> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
>>> +
>>>
>>> /*
>>> * Try to reduce the problem complexity, discard records that have no
>>> * matches on the other file. Also, lines that have multiple matches
>>> * might be potentially discarded if they appear in a run of discardable.
>>> */
>>> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>>> - long i, nm, mlim;
>>> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
>>> + long i;
>>> + size_t nm, mlim;
>>> xrecord_t *recs;
>>> - xdlclass_t *rcrec;
>>> uint8_t *action1 = NULL, *action2 = NULL;
>>> - bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
>>> + struct IVec_xoccurrence occ;
>>> + bool need_min = !!(flags & XDF_NEED_MINIMAL);
>>> int ret = 0;
>>> ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>>> ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>>>
>>> + IVEC_INIT(occ);
>>> + ivec_zero(&occ, xe->mph_size);
>>
>> This array is presized here. It is neither grown nor shrunken.
>> CALLOC_ARRAY would work just as well, at least at this point, no?
>>
>>> +
>>> + for (size_t j = 0; j < xe->xdf1.nrec; j++) {
>>> + size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
>>> + occ.ptr[mph1].file1 += 1;
>>> + }
>>> +
>>> + for (size_t j = 0; j < xe->xdf2.nrec; j++) {
>>> + size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
>>> + occ.ptr[mph2].file2 += 1;
>>> + }
>>> +
>>> /*
>>> * Create temporary arrays that will help us decide if
>>> * changed[i] should remain false, or become true.
>>> @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>>> if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>>> mlim = XDL_MAX_EQLIMIT;
>>> for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
>>> - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>>> - nm = rcrec ? rcrec->len2 : 0;
>>> + nm = occ.ptr[recs->minimal_perfect_hash].file2;
>>> action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>>> }
>>>
>>> if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>>> mlim = XDL_MAX_EQLIMIT;
>>> for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
>>> - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>>> - nm = rcrec ? rcrec->len1 : 0;
>>> + nm = occ.ptr[recs->minimal_perfect_hash].file1;
>>> action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>>> }
>>>
>>> @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>>> cleanup:
>>> xdl_free(action1);
>>> xdl_free(action2);
>>> + ivec_free(&occ);
>>>
>>> return ret;
>>> }
>
> In Rust the memory management macros defined in git-compat-util.h will
> not be available. ivec was built expressly to bridge the gap between C
> and Rust. I'm avoiding using those macros because I'm trying to get C
> programmers familiar with how Rust's Vec operates without forcing them
> to read and write in Rust. Also, it makes converting from IVec to Vec
> super easy.
>
> ivec_zero() also sets length and capacity. Also CALLOC_ARRAY needs to
> know the type of the pointer which ivec_zero() does not have access
> to. This is one of the few ivec functions that does not have a direct
> equivalent in Rust's Vec, but is faster than what is logically
> equivalent in Rust.
>
> In Rust the closest safe equivalent would look like:
>
> let size = 35;
> let mut vec = Vec::<u64>::new();
> vec.reserve_exact(size);
> vec.fill(0); // requires that T implements the `Copy` trait
>
> The unsafe version would look like:
> let size = 35;
> let mut vec = Vec::<u64>::new();
> vec.reserve_exact(size);
> unsafe {
> std::ptr::write_bytes(vec.as_mut_ptr(), 0, size * size_of::<u64>());
> }
I was being unclear and made a few assumptions here. My point was just
that this is a fixed-size array and doesn't need to be stored in a
variable-sized container. This is the first Ivec user, and I would have
expected it to exercise the push function. I assume accessing a
fixed-size array via FFI would be a lot easier since allocation and
growth are out of the picture.
René
|
This patch series was integrated into seen via a2374e1. |
|
This patch series was integrated into seen via 10fe540. |
|
User |
|
This patch series was integrated into seen via 8caa923. |
|
User |
| @@ -311,26 +311,13 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1, | |||
| } | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > Later patches will prepare xdl_cleanup_records() to be moved into xdiffi.c
> since only the classic diff uses that function.
I assume that's to make it easier to covert the myers implementation to rust without affecting the rest of the code? If so it would be nice to say that.
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> +int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> + xdfenv_t *xe) {
> + int res;
> +
> + if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0)
> + return -1;
> +
> + if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF) {
> + res = xdl_do_patience_diff(xpp, xe);
> + goto out;
> + }
> +
> + if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) {
> + res = xdl_do_histogram_diff(xpp, xe);
> + goto out;
> + }
> +
> + res = xdl_do_classic_diff(xe, xpp->flags);
This might be clearer that we're calling only one of the three functions if we wrote this as
if (XDF_DIFF_ALG(xpp->flags) == XDIF_PATIENCE_DIFF)
res = xdl_do_patience_diff(xpp, xe);
else if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
res = xdl_do_histogram_diff(xpp, xe);
else
res = xdl_do_classic_diff(xe, xpp->flags);
and then we can drop the out: label
Thanks
Phillip
> out:
> if (res < 0)
> xdl_free_env(xe);
> diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
> index 49e52c67f9..8bf4c20373 100644
> --- a/xdiff/xdiffi.h
> +++ b/xdiff/xdiffi.h
> @@ -42,6 +42,7 @@ typedef struct s_xdchange {
> int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
> xdfile_t *xdf2, long off2, long lim2,
> long *kvdf, long *kvdb, int need_min, xdalgoenv_t *xenv);
> +int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags);
> int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdfenv_t *xe);
> int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags);There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Tue, Jan 20, 2026 at 8:01 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > Later patches will prepare xdl_cleanup_records() to be moved into xdiffi.c
> > since only the classic diff uses that function.
>
> I assume that's to make it easier to covert the myers implementation to
> rust without affecting the rest of the code? If so it would be nice to
> say that.
Making it easier to port to Rust is a side effect. The primary goal is
to simplify the job of xprepare to only parsing and hashing lines in a
file. xdl_cleanup_records() is only used by classic diff
(myers/minimal) which means it doesn't belong in xprepare because it's
part of a diff algorithm and isn't relevant to preparing the file for
a diff algorithm. Perhaps xdl_trim_ends() should be moved into
xdl_do_diff() too...
> > Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>
> > +int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> > + xdfenv_t *xe) {
> > + int res;
> > +
> > + if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0)
> > + return -1;
> > +
> > + if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF) {
> > + res = xdl_do_patience_diff(xpp, xe);
> > + goto out;
> > + }
> > +
> > + if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) {
> > + res = xdl_do_histogram_diff(xpp, xe);
> > + goto out;
> > + }
> > +
> > + res = xdl_do_classic_diff(xe, xpp->flags);
>
> This might be clearer that we're calling only one of the three functions
> if we wrote this as
>
> if (XDF_DIFF_ALG(xpp->flags) == XDIF_PATIENCE_DIFF)
> res = xdl_do_patience_diff(xpp, xe);
> else if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
> res = xdl_do_histogram_diff(xpp, xe);
> else
> res = xdl_do_classic_diff(xe, xpp->flags);
>
> and then we can drop the out: label
In a later cleanup, I make this exact change :)
xdiff/xprepare.c
Outdated
| @@ -26,8 +26,6 @@ | |||
| #define XDL_KPDIS_RUN 4 | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > All lines must be read anyway, so classify them after they're read in.
> Also move the memset() into xdl_init_classifier().
So instead of looping over the input lines one and a bit times (the bit being from xdl_guess_lines) we now loop over them twice as we split them first and then classify them in a separate loop. It does save some work not to call xdl_guess_lines but it is unclear if that offsets classifying them in a separate loop.
> + for (size_t i = 0; i < xe->xdf1.nrec; i++) {
> + xrecord_t *rec = &xe->xdf1.recs[i];
> + xdl_classify_record(1, &cf, rec);
We seem to have lost the error handling if xdl_classify_record() fails.
Thanks
Phillip
> + }
> +
> + for (size_t i = 0; i < xe->xdf2.nrec; i++) {
> + xrecord_t *rec = &xe->xdf2.recs[i];
> + xdl_classify_record(2, &cf, rec);
> }
> > if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> diff --git a/xdiff/xutils.c b/xdiff/xutils.c
> index 77ee1ad9c8..b3d51197c1 100644
> --- a/xdiff/xutils.c
> +++ b/xdiff/xutils.c
> @@ -118,26 +118,6 @@ void *xdl_cha_alloc(chastore_t *cha) {
> return data;
> }
> > -long xdl_guess_lines(mmfile_t *mf, long sample) {
> - long nl = 0, size, tsize = 0;
> - char const *data, *cur, *top;
> -
> - if ((cur = data = xdl_mmfile_first(mf, &size))) {
> - for (top = data + size; nl < sample && cur < top; ) {
> - nl++;
> - if (!(cur = memchr(cur, '\n', top - cur)))
> - cur = top;
> - else
> - cur++;
> - }
> - tsize += (long) (cur - data);
> - }
> -
> - if (nl && tsize)
> - nl = xdl_mmfile_size(mf) / (tsize / nl);
> -
> - return nl + 1;
> -}
> > int xdl_blankline(const char *line, long size, long flags)
> {
> diff --git a/xdiff/xutils.h b/xdiff/xutils.h
> index 615b4a9d35..d800840dd0 100644
> --- a/xdiff/xutils.h
> +++ b/xdiff/xutils.h
> @@ -31,7 +31,6 @@ int xdl_emit_diffrec(char const *rec, long size, char const *pre, long psize,
> int xdl_cha_init(chastore_t *cha, long isize, long icount);
> void xdl_cha_free(chastore_t *cha);
> void *xdl_cha_alloc(chastore_t *cha);
> -long xdl_guess_lines(mmfile_t *mf, long sample);
> int xdl_blankline(const char *line, long size, long flags);
> int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
> uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);There was a problem hiding this comment.
Ezekiel Newren wrote on the Git mailing list (how to reply to this email):
On Tue, Jan 20, 2026 at 8:02 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > All lines must be read anyway, so classify them after they're read in.
> > Also move the memset() into xdl_init_classifier().
>
> So instead of looping over the input lines one and a bit times (the bit
> being from xdl_guess_lines) we now loop over them twice as we split them
> first and then classify them in a separate loop. It does save some work
> not to call xdl_guess_lines but it is unclear if that offsets
> classifying them in a separate loop.
>
> > + for (size_t i = 0; i < xe->xdf1.nrec; i++) {
> > + xrecord_t *rec = &xe->xdf1.recs[i];
> > + xdl_classify_record(1, &cf, rec);
>
> We seem to have lost the error handling if xdl_classify_record() fails.
The error handling was not "lost" it was deliberately removed. The
only way in which xdl_classify_record() could fail is by a failed
memory allocation. On the Rust side this would result in a panic
(panic means something different in Rust vs C) in which case C could
not possibly recover. Also for operations like Vec.push() in Rust it's
assumed that memory management functions will never fail and if they
do they crash the program with no chance of recovery (unless you
account for panic unwinding which is really ugly). It seems a lot of
arguments about ivec and my xdiff cleanups are "We don't do things
this way in Git/C" I'm aware of many of these arguments and I'm trying
to address them with a more specific answer of "Yes, but that's not
how things are done in Rust and all of this is to prepare the code for
conversion to Rust and some things shouldn't, or even, cannot be done
the C way in Rust."There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 21/01/2026 21:12, Ezekiel Newren wrote:
> On Tue, Jan 20, 2026 at 8:02 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>>
>>> All lines must be read anyway, so classify them after they're read in.
>>> Also move the memset() into xdl_init_classifier().
>>
>> So instead of looping over the input lines one and a bit times (the bit
>> being from xdl_guess_lines) we now loop over them twice as we split them
>> first and then classify them in a separate loop. It does save some work
>> not to call xdl_guess_lines but it is unclear if that offsets
>> classifying them in a separate loop.
>>
>>> + for (size_t i = 0; i < xe->xdf1.nrec; i++) {
>>> + xrecord_t *rec = &xe->xdf1.recs[i];
>>> + xdl_classify_record(1, &cf, rec);
>>
>> We seem to have lost the error handling if xdl_classify_record() fails.
> > The error handling was not "lost" it was deliberately removed. That's the sort of thing that needs to be explained in the commit message.
> The
> only way in which xdl_classify_record() could fail is by a failed
> memory allocation. On the Rust side this would result in a panic
> (panic means something different in Rust vs C) in which case C could
> not possibly recover.
There is no rust code in xdiff at the moment so we don't panic on failure. In git we'll die() because xdl_malloc() and friends are defined as xmalloc() etc. which die on allocation failure. However anyone else picking up this code and using a different allocator that does not die on allocation failure will expect the error to be propagated.
If you want to stop supporting other allocators then you should propose a patch to do so, not silently slip the change into this patch.
Thanks
Phillip
> Also for operations like Vec.push() in Rust it's
> assumed that memory management functions will never fail and if they
> do they crash the program with no chance of recovery (unless you
> account for panic unwinding which is really ugly). It seems a lot of
> arguments about ivec and my xdiff cleanups are "We don't do things
> this way in Git/C" I'm aware of many of these arguments and I'm trying
> to address them with a more specific answer of "Yes, but that's not
> how things are done in Rust and all of this is to prepare the code for
> conversion to Rust and some things shouldn't, or even, cannot be done
> the C way in Rust."| @@ -373,5 +373,7 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env, | |||
|
|
|||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > The patience diff is set up the exact same way as histogram, see
> xdl_do_historgram_diff() in xhistogram.c. xdl_optimize_ctxs() is
> redundant now, delete it.
Does this change the output? The patience diff looks for unique context lines and builds the context out from those. For files that look like
Old New
A A
B B
C A
B B
A C
B
A
That will give a hunk
@@ -1,3 +0,5 @@
+A
+B
A
B
C
but trimming the common prefix first would give
@@ -1,5 +1,7
A
B
+A
+B
C
B
A
Though it seems like the diff silder causes us to output the same diff in both cases for that simple test so maybe it is not an issue. It would certainly be helpful to comment on any possible changes in the commit message as it could have been a deliberate choice not to trim the ends for those algorithms.
> -static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> -
> - if (xdl_trim_ends(xdf1, xdf2) < 0 ||
> - xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
> -
> - return -1;
> - }
> -
> - return 0;
> -}
> -
> int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdfenv_t *xe) {
> xdlclassifier_t cf;
> @@ -404,9 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdl_classify_record(2, &cf, rec);
> }
> > + xdl_trim_ends(&xe->xdf1, &xe->xdf2);
It would be clear that this was safe if you changed the function signature to return void as the way it is called in xdl_optimize_ctxs() makes it look like it can return an error.
Thanks
Phillip
> if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> - xdl_optimize_ctxs(&cf, &xe->xdf1, &xe->xdf2) < 0) {
> + xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
> > xdl_free_ctx(&xe->xdf2);
> xdl_free_ctx(&xe->xdf1);There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 20/01/2026 15:02, Phillip Wood wrote:
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> The patience diff is set up the exact same way as histogram, see
>> xdl_do_historgram_diff() in xhistogram.c. xdl_optimize_ctxs() is
>> redundant now, delete it.
> > Does this change the output? The patience diff looks for unique context > lines and builds the context out from those. For files that look like
> > Old New
> A A
> B B
> C A
> B B
> A C
> B
> A
> > That will give a hunk
> > @@ -1,3 +0,5 @@
> +A
> +B
> A
> B
> C
> > but trimming the common prefix first would give
> > @@ -1,5 +1,7
> A
> B
> +A
> +B
> C
> B
> A
> > Though it seems like the diff silder causes us to output the same diff > in both cases for that simple test so maybe it is not an issue.
It does change larger diffs. If you run
git show --diff-algorithm=patience --diff-merges=first-parent f406b89552
You get a different diff with this series applied.
Thanks
Phillip
> It would > certainly be helpful to comment on any possible changes in the commit > message as it could have been a deliberate choice not to trim the ends > for those algorithms.
> >> -static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, >> xdfile_t *xdf2) {
>> -
>> - if (xdl_trim_ends(xdf1, xdf2) < 0 ||
>> - xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
>> -
>> - return -1;
>> - }
>> -
>> - return 0;
>> -}
>> -
>> int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>> xdfenv_t *xe) {
>> xdlclassifier_t cf;
>> @@ -404,9 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, >> xpparam_t const *xpp,
>> xdl_classify_record(2, &cf, rec);
>> }
>> + xdl_trim_ends(&xe->xdf1, &xe->xdf2);
> > It would be clear that this was safe if you changed the function > signature to return void as the way it is called in xdl_optimize_ctxs() > makes it look like it can return an error.
> > Thanks
> > Phillip
> >> if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>> (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
>> - xdl_optimize_ctxs(&cf, &xe->xdf1, &xe->xdf2) < 0) {
>> + xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
>> xdl_free_ctx(&xe->xdf2);
>> xdl_free_ctx(&xe->xdf1);
> >
xdiff/xprepare.c
Outdated
| @@ -263,7 +261,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) { | |||
| * matches on the other file. Also, lines that have multiple matches | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > View with --color-words. Prepare these functions to use the fields:
> delta_start, delta_end. A future patch will add these fields to
> xdfenv_t.
I'm afraid this message doesn't make much sense to me. What are these new fields? I think it would help to explain what this up comming change is going to do and why.
Oh, having read patch 7 we're removing dstart and dend from xdfile_t and replacing them with delta_start and delta_end in xdfenv_t. It would be useful to say that here.
> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> +static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
Maybe we could add xdf1 and xdf2 as local variables to avoid having to change the code that accesses the members of xdfile_t that are not going to be moved to xdfenv_t.
Thanks
Phillip
> long i, nm, mlim;
> xrecord_t *recs;
> xdlclass_t *rcrec;
> @@ -273,11 +273,11 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
> * Create temporary arrays that will help us decide if
> * changed[i] should remain false, or become true.
> */
> - if (!XDL_CALLOC_ARRAY(action1, xdf1->nrec + 1)) {
> + if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
> ret = -1;
> goto cleanup;
> }
> - if (!XDL_CALLOC_ARRAY(action2, xdf2->nrec + 1)) {
> + if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
> ret = -1;
> goto cleanup;
> }
> @@ -285,17 +285,17 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
> /*
> * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
> */
> - if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
> + if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> - for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
> + for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
> rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> nm = rcrec ? rcrec->len2 : 0;
> action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> }
> > - if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
> + if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> - for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
> + for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
> rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> nm = rcrec ? rcrec->len1 : 0;
> action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> @@ -305,27 +305,27 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
> * Use temporary arrays to decide if changed[i] should remain
> * false, or become true.
> */
> - xdf1->nreff = 0;
> - for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
> - i <= xdf1->dend; i++, recs++) {
> + xe->xdf1.nreff = 0;
> + for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
> + i <= xe->xdf1.dend; i++, recs++) {
> if (action1[i] == KEEP ||
> - (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
> - xdf1->reference_index[xdf1->nreff++] = i;
> + (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
> + xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
> /* changed[i] remains false, i.e. keep */
> } else
> - xdf1->changed[i] = true;
> + xe->xdf1.changed[i] = true;
> /* i.e. discard */
> }
> > - xdf2->nreff = 0;
> - for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
> - i <= xdf2->dend; i++, recs++) {
> + xe->xdf2.nreff = 0;
> + for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
> + i <= xe->xdf2.dend; i++, recs++) {
> if (action2[i] == KEEP ||
> - (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
> - xdf2->reference_index[xdf2->nreff++] = i;
> + (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
> + xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
> /* changed[i] remains false, i.e. keep */
> } else
> - xdf2->changed[i] = true;
> + xe->xdf2.changed[i] = true;
> /* i.e. discard */
> }
> > @@ -340,27 +340,27 @@ cleanup:
> /*
> * Early trim initial and terminal matching records.
> */
> -static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
> +static int xdl_trim_ends(xdfenv_t *xe) {
> long i, lim;
> xrecord_t *recs1, *recs2;
> > - recs1 = xdf1->recs;
> - recs2 = xdf2->recs;
> - for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
> + recs1 = xe->xdf1.recs;
> + recs2 = xe->xdf2.recs;
> + for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
> i++, recs1++, recs2++)
> if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
> break;
> > - xdf1->dstart = xdf2->dstart = i;
> + xe->xdf1.dstart = xe->xdf2.dstart = i;
> > - recs1 = xdf1->recs + xdf1->nrec - 1;
> - recs2 = xdf2->recs + xdf2->nrec - 1;
> + recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
> + recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
> for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
> if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
> break;
> > - xdf1->dend = (long)xdf1->nrec - i - 1;
> - xdf2->dend = (long)xdf2->nrec - i - 1;
> + xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
> + xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
> > return 0;
> }
> @@ -393,10 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdl_classify_record(2, &cf, rec);
> }
> > - xdl_trim_ends(&xe->xdf1, &xe->xdf2);
> + xdl_trim_ends(xe);
> if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> - xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
> + xdl_cleanup_records(&cf, xe) < 0) {
> > xdl_free_ctx(&xe->xdf2);
> xdl_free_ctx(&xe->xdf1);| @@ -342,81 +340,63 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd | |||
| /* | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > This patch is best viewed with a before and after of the whole
> function.
> > Rather than using 2 pointers and walking them. Use direct indexing with
> local variables of what is being compared to make it easier to follow
> along.
I think using direct indexing makes things clearer, but I'm not sure this is a faithful conversion (see below).
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 0acb3437d4..06b6a6f804 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -340,29 +340,29 @@ cleanup:
> /*
> * Early trim initial and terminal matching records.
> */
> -static int xdl_trim_ends(xdfenv_t *xe) {
> - long i, lim;
> - xrecord_t *recs1, *recs2;
> -
> - recs1 = xe->xdf1.recs;
> - recs2 = xe->xdf2.recs;
> - for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
> - i++, recs1++, recs2++)
> - if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
> +static void xdl_trim_ends(xdfenv_t *xe)
> +{
> + size_t lim = XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec);
> +
> + for (size_t i = 0; i < lim; i++) {
> + size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
> + size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
> + if (mph1 != mph2) {
> + xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
The type of dstart is ptrdiff_t, not ssize_t.
The original set dstart and dend unconditionally but here they are not set if all the lines match.
Thanks
Phillip
> + lim -= i;
> break;
> + }
> + }
> > - xe->xdf1.dstart = xe->xdf2.dstart = i;
> -
> - recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
> - recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
> - for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
> - if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
> + for (size_t i = 0; i < lim; i++) {
> + size_t mph1 = xe->xdf1.recs[xe->xdf1.nrec - 1 - i].minimal_perfect_hash;
> + size_t mph2 = xe->xdf2.recs[xe->xdf2.nrec - 1 - i].minimal_perfect_hash;
> + if (mph1 != mph2) {
> + xe->xdf1.dend = xe->xdf1.nrec - 1 - i;
> + xe->xdf2.dend = xe->xdf2.nrec - 1 - i;
> break;
> -
> - xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
> - xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
> -
> - return 0;
> + }
> + }
> }
> > | @@ -365,6 +365,6 @@ static int histogram_diff(xpparam_t const *xpp, xdfenv_t *env, | |||
| int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env) | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > Placing delta_start in xdfenv_t instead of xdfile_t provides a more
> appropriate context since this variable only makes sense with a pair
> of files. View with --color-words.
So as dstart and dend must be the same for both files we now store the values once in xdfenv_t. That explains why we start passing xdfenv_t around rather than xdfile_t in patch 5.
Thanks
Phillip
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
> xdiff/xhistogram.c | 4 ++--
> xdiff/xpatience.c | 4 ++--
> xdiff/xprepare.c | 17 +++++++++--------
> xdiff/xtypes.h | 3 ++-
> 4 files changed, 15 insertions(+), 13 deletions(-)
> > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> index 5ae1282c27..eb6a52d9ba 100644
> --- a/xdiff/xhistogram.c
> +++ b/xdiff/xhistogram.c
> @@ -365,6 +365,6 @@ out:
> int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env)
> {
> return histogram_diff(xpp, env,
> - env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
> - env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
> + env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
> + env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
> }
> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
> index 2bce07cf48..bd0ffbb417 100644
> --- a/xdiff/xpatience.c
> +++ b/xdiff/xpatience.c
> @@ -374,6 +374,6 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
> int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
> {
> return patience_diff(xpp, env,
> - env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
> - env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
> + env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
> + env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
> }
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 06b6a6f804..e88468e74c 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -173,7 +173,6 @@ static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t *xdf, uint64_t flags) {
> > xdf->changed += 1;
> xdf->nreff = 0;
> - xdf->dstart = 0;
> xdf->dend = xdf->nrec - 1;
> > return 0;
> @@ -287,7 +286,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> */
> if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> - for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= xe->xdf1.dend; i++, recs++) {
> rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> nm = rcrec ? rcrec->len2 : 0;
> action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> @@ -295,7 +294,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> > if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> - for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= xe->xdf2.dend; i++, recs++) {
> rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> nm = rcrec ? rcrec->len1 : 0;
> action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> @@ -306,10 +305,10 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> * false, or become true.
> */
> xe->xdf1.nreff = 0;
> - for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
> i <= xe->xdf1.dend; i++, recs++) {
> if (action1[i] == KEEP ||
> - (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
> + (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, xe->xdf1.dend))) {
> xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
> /* changed[i] remains false, i.e. keep */
> } else
> @@ -318,10 +317,10 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> }
> > xe->xdf2.nreff = 0;
> - for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
> i <= xe->xdf2.dend; i++, recs++) {
> if (action2[i] == KEEP ||
> - (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
> + (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, xe->xdf2.dend))) {
> xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
> /* changed[i] remains false, i.e. keep */
> } else
> @@ -348,7 +347,7 @@ static void xdl_trim_ends(xdfenv_t *xe)
> size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
> size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
> if (mph1 != mph2) {
> - xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
> + xe->delta_start = (ssize_t)i;
> lim -= i;
> break;
> }
> @@ -370,6 +369,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdfenv_t *xe) {
> xdlclassifier_t cf;
> > + xe->delta_start = 0;
> +
> if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
> > return -1;
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index 979586f20a..bda1f85eb0 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -48,7 +48,7 @@ typedef struct s_xrecord {
> typedef struct s_xdfile {
> xrecord_t *recs;
> size_t nrec;
> - ptrdiff_t dstart, dend;
> + ptrdiff_t dend;
> bool *changed;
> size_t *reference_index;
> size_t nreff;
> @@ -56,6 +56,7 @@ typedef struct s_xdfile {
> > typedef struct s_xdfenv {
> xdfile_t xdf1, xdf2;
> + size_t delta_start;
> } xdfenv_t;
> > There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 20/01/2026 16:32, Phillip Wood wrote:
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> Placing delta_start in xdfenv_t instead of xdfile_t provides a more
>> appropriate context since this variable only makes sense with a pair
>> of files. View with --color-words.
> > So as dstart and dend must be the same for both files we now store the > values once in xdfenv_t. Except it's only dstart that's the same, dend is different because it convinently stores an index, not an offset from the end. Having realized that, moving them to xdfenv_t makes less sense as having to calculate the dend index from an offset from the end of the array each time is a pain and sooner or later we'll make a mistake.
Thanks
Phillip
> That explains why we start passing xdfenv_t > around rather than xdfile_t in patch 5.
> > Thanks
> > Phillip
> >> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>> ---
>> xdiff/xhistogram.c | 4 ++--
>> xdiff/xpatience.c | 4 ++--
>> xdiff/xprepare.c | 17 +++++++++--------
>> xdiff/xtypes.h | 3 ++-
>> 4 files changed, 15 insertions(+), 13 deletions(-)
>>
>> diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
>> index 5ae1282c27..eb6a52d9ba 100644
>> --- a/xdiff/xhistogram.c
>> +++ b/xdiff/xhistogram.c
>> @@ -365,6 +365,6 @@ out:
>> int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env)
>> {
>> return histogram_diff(xpp, env,
>> - env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
>> - env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
>> + env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
>> + env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
>> }
>> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
>> index 2bce07cf48..bd0ffbb417 100644
>> --- a/xdiff/xpatience.c
>> +++ b/xdiff/xpatience.c
>> @@ -374,6 +374,6 @@ static int patience_diff(xpparam_t const *xpp, >> xdfenv_t *env,
>> int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
>> {
>> return patience_diff(xpp, env,
>> - env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
>> - env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
>> + env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
>> + env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
>> }
>> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
>> index 06b6a6f804..e88468e74c 100644
>> --- a/xdiff/xprepare.c
>> +++ b/xdiff/xprepare.c
>> @@ -173,7 +173,6 @@ static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t >> *xdf, uint64_t flags) {
>> xdf->changed += 1;
>> xdf->nreff = 0;
>> - xdf->dstart = 0;
>> xdf->dend = xdf->nrec - 1;
>> return 0;
>> @@ -287,7 +286,7 @@ static int xdl_cleanup_records(xdlclassifier_t >> *cf, xdfenv_t *xe) {
>> */
>> if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>> mlim = XDL_MAX_EQLIMIT;
>> - for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; >> i <= xe->xdf1.dend; i++, recs++) {
>> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; >> i <= xe->xdf1.dend; i++, recs++) {
>> rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>> nm = rcrec ? rcrec->len2 : 0;
>> action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && ! >> need_min) ? INVESTIGATE: KEEP;
>> @@ -295,7 +294,7 @@ static int xdl_cleanup_records(xdlclassifier_t >> *cf, xdfenv_t *xe) {
>> if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>> mlim = XDL_MAX_EQLIMIT;
>> - for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; >> i <= xe->xdf2.dend; i++, recs++) {
>> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; >> i <= xe->xdf2.dend; i++, recs++) {
>> rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>> nm = rcrec ? rcrec->len1 : 0;
>> action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && ! >> need_min) ? INVESTIGATE: KEEP;
>> @@ -306,10 +305,10 @@ static int xdl_cleanup_records(xdlclassifier_t >> *cf, xdfenv_t *xe) {
>> * false, or become true.
>> */
>> xe->xdf1.nreff = 0;
>> - for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
>> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>> i <= xe->xdf1.dend; i++, recs++) {
>> if (action1[i] == KEEP ||
>> - (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, >> i, xe->xdf1.dstart, xe->xdf1.dend))) {
>> + (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, >> i, xe->delta_start, xe->xdf1.dend))) {
>> xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>> /* changed[i] remains false, i.e. keep */
>> } else
>> @@ -318,10 +317,10 @@ static int xdl_cleanup_records(xdlclassifier_t >> *cf, xdfenv_t *xe) {
>> }
>> xe->xdf2.nreff = 0;
>> - for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
>> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>> i <= xe->xdf2.dend; i++, recs++) {
>> if (action2[i] == KEEP ||
>> - (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, >> i, xe->xdf2.dstart, xe->xdf2.dend))) {
>> + (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, >> i, xe->delta_start, xe->xdf2.dend))) {
>> xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>> /* changed[i] remains false, i.e. keep */
>> } else
>> @@ -348,7 +347,7 @@ static void xdl_trim_ends(xdfenv_t *xe)
>> size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
>> size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
>> if (mph1 != mph2) {
>> - xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
>> + xe->delta_start = (ssize_t)i;
>> lim -= i;
>> break;
>> }
>> @@ -370,6 +369,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, >> xpparam_t const *xpp,
>> xdfenv_t *xe) {
>> xdlclassifier_t cf;
>> + xe->delta_start = 0;
>> +
>> if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
>> return -1;
>> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
>> index 979586f20a..bda1f85eb0 100644
>> --- a/xdiff/xtypes.h
>> +++ b/xdiff/xtypes.h
>> @@ -48,7 +48,7 @@ typedef struct s_xrecord {
>> typedef struct s_xdfile {
>> xrecord_t *recs;
>> size_t nrec;
>> - ptrdiff_t dstart, dend;
>> + ptrdiff_t dend;
>> bool *changed;
>> size_t *reference_index;
>> size_t nreff;
>> @@ -56,6 +56,7 @@ typedef struct s_xdfile {
>> typedef struct s_xdfenv {
>> xdfile_t xdf1, xdf2;
>> + size_t delta_start;
>> } xdfenv_t;
> > |
This patch series was integrated into seen via 9717e99. |
| @@ -21,13 +21,12 @@ | |||
| */ | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
Hi Ezekiel
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > Disentangle xdl_cleanup_records() from the classifier so that it can be
> moved from xprepare.c into xdiffi.c.
> > The classic diff is the only algorithm that needs to count the number
> of times each line occurs in each file. Make xdl_cleanup_records()
> count the number of lines instead of the classifier so it won't slow
> down patience or histogram.
Have you measured the speed up that this gives? It looks like it saves very little work for the patience or histogram algorithms and means we now make a second pass over the data in the myers case. If there is a reason to do this related to the rust conversion then that might be a more convincing argument. As Rene has said already this isn't a particularly interesting demonstration of struct IVec - it would be nice to see more of the API exercised.
Thanks
Phillip
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
> xdiff/xprepare.c | 52 +++++++++++++++++++++++++++++++++---------------
> xdiff/xtypes.h | 1 +
> 2 files changed, 37 insertions(+), 16 deletions(-)
> > diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index d3cdb6ac02..b53a3b80c4 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -21,6 +21,7 @@
> */
> > #include "xinclude.h"
> +#include "compat/ivec.h"
> > > #define XDL_KPDIS_RUN 4
> @@ -35,7 +36,6 @@ typedef struct s_xdlclass {
> struct s_xdlclass *next;
> xrecord_t rec;
> long idx;
> - long len1, len2;
> } xdlclass_t;
> > typedef struct s_xdlclassifier {
> @@ -92,7 +92,7 @@ static void xdl_free_classifier(xdlclassifier_t *cf) {
> }
> > > -static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
> +static int xdl_classify_record(xdlclassifier_t *cf, xrecord_t *rec) {
> size_t hi;
> xdlclass_t *rcrec;
> > @@ -113,13 +113,10 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
> return -1;
> cf->rcrecs[rcrec->idx] = rcrec;
> rcrec->rec = *rec;
> - rcrec->len1 = rcrec->len2 = 0;
> rcrec->next = cf->rchash[hi];
> cf->rchash[hi] = rcrec;
> }
> > - (pass == 1) ? rcrec->len1++ : rcrec->len2++;
> -
> rec->minimal_perfect_hash = (size_t)rcrec->idx;
> > return 0;
> @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> }
> > +struct xoccurrence
> +{
> + size_t file1, file2;
> +};
> +
> +
> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> +
> > /*
> * Try to reduce the problem complexity, discard records that have no
> * matches on the other file. Also, lines that have multiple matches
> * might be potentially discarded if they appear in a run of discardable.
> */
> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> - long i, nm, mlim;
> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> + long i;
> + size_t nm, mlim;
> xrecord_t *recs;
> - xdlclass_t *rcrec;
> uint8_t *action1 = NULL, *action2 = NULL;
> - bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> + struct IVec_xoccurrence occ;
> + bool need_min = !!(flags & XDF_NEED_MINIMAL);
> int ret = 0;
> ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> > + IVEC_INIT(occ);
> + ivec_zero(&occ, xe->mph_size);
> +
> + for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> + size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> + occ.ptr[mph1].file1 += 1;
> + }
> +
> + for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> + size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> + occ.ptr[mph2].file2 += 1;
> + }
> +
> /*
> * Create temporary arrays that will help us decide if
> * changed[i] should remain false, or become true.
> @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> - nm = rcrec ? rcrec->len2 : 0;
> + nm = occ.ptr[recs->minimal_perfect_hash].file2;
> action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> }
> > if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> mlim = XDL_MAX_EQLIMIT;
> for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> - rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> - nm = rcrec ? rcrec->len1 : 0;
> + nm = occ.ptr[recs->minimal_perfect_hash].file1;
> action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> }
> > @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> cleanup:
> xdl_free(action1);
> xdl_free(action2);
> + ivec_free(&occ);
> > return ret;
> }
> @@ -387,18 +405,20 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> > for (size_t i = 0; i < xe->xdf1.nrec; i++) {
> xrecord_t *rec = &xe->xdf1.recs[i];
> - xdl_classify_record(1, &cf, rec);
> + xdl_classify_record(&cf, rec);
> }
> > for (size_t i = 0; i < xe->xdf2.nrec; i++) {
> xrecord_t *rec = &xe->xdf2.recs[i];
> - xdl_classify_record(2, &cf, rec);
> + xdl_classify_record(&cf, rec);
> }
> > + xe->mph_size = cf.count;
> +
> xdl_trim_ends(xe);
> if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> - xdl_cleanup_records(&cf, xe) < 0) {
> + xdl_cleanup_records(xe, xpp->flags) < 0) {
> > xdl_free_ctx(&xe->xdf2);
> xdl_free_ctx(&xe->xdf1);
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index a939396064..2528bd37e8 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -56,6 +56,7 @@ typedef struct s_xdfile {
> typedef struct s_xdfenv {
> xdfile_t xdf1, xdf2;
> size_t delta_start, delta_end;
> + size_t mph_size;
> } xdfenv_t;
> > | @@ -21,6 +21,7 @@ | |||
| */ | |||
There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
Hi Ezekiel
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> > Only the classic diff uses xdl_cleanup_records(). Move it,
> xdl_clean_mmatch(), and the macros to xdiffi.c and call
> xdl_cleanup_records() inside of xdl_do_classic_diff(). This better
> organizes the code related to the classic diff.
I think calling xdl_cleanup_records() from inside xdl_do_classic_diff() makes sense. I don't have a strong opinion either way on the code movement. You should remove '#include "compat/ivec.h"' from xprepare.c if you're moving the only code that uses it out of that file.
Thanks
Phillip
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
> xdiff/xdiffi.c | 180 ++++++++++++++++++++++++++++++++++++++++++++
> xdiff/xprepare.c | 191 +----------------------------------------------
> 2 files changed, 181 insertions(+), 190 deletions(-)
> > diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index e3196c7245..0f1fd7cf80 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -21,6 +21,7 @@
> */
> > #include "xinclude.h"
> +#include "compat/ivec.h"
> > static size_t get_hash(xdfile_t *xdf, long index)
> {
> @@ -33,6 +34,14 @@ static size_t get_hash(xdfile_t *xdf, long index)
> #define XDL_SNAKE_CNT 20
> #define XDL_K_HEUR 4
> > +#define XDL_KPDIS_RUN 4
> +#define XDL_MAX_EQLIMIT 1024
> +#define XDL_SIMSCAN_WINDOW 100
> +
> +#define DISCARD 0
> +#define KEEP 1
> +#define INVESTIGATE 2
> +
> typedef struct s_xdpsplit {
> long i1, i2;
> int min_lo, min_hi;
> @@ -311,6 +320,175 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
> }
> > > +static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> + long r, rdis0, rpdis0, rdis1, rpdis1;
> +
> + /*
> + * Limits the window that is examined during the similar-lines
> + * scan. The loops below stops when action[i - r] == KEEP
> + * (line that has no match), but there are corner cases where
> + * the loop proceed all the way to the extremities by causing
> + * huge performance penalties in case of big files.
> + */
> + if (i - s > XDL_SIMSCAN_WINDOW)
> + s = i - XDL_SIMSCAN_WINDOW;
> + if (e - i > XDL_SIMSCAN_WINDOW)
> + e = i + XDL_SIMSCAN_WINDOW;
> +
> + /*
> + * Scans the lines before 'i' to find a run of lines that either
> + * have no match (action[j] == DISCARD) or have multiple matches
> + * (action[j] == INVESTIGATE). Note that we always call this
> + * function with action[i] == INVESTIGATE, so the current line
> + * (i) is already a multimatch line.
> + */
> + for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
> + if (action[i - r] == DISCARD)
> + rdis0++;
> + else if (action[i - r] == INVESTIGATE)
> + rpdis0++;
> + else if (action[i - r] == KEEP)
> + break;
> + else
> + BUG("Illegal value for action[i - r]");
> + }
> + /*
> + * If the run before the line 'i' found only multimatch lines,
> + * we return false and hence we don't make the current line (i)
> + * discarded. We want to discard multimatch lines only when
> + * they appear in the middle of runs with nomatch lines
> + * (action[j] == DISCARD).
> + */
> + if (rdis0 == 0)
> + return 0;
> + for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
> + if (action[i + r] == DISCARD)
> + rdis1++;
> + else if (action[i + r] == INVESTIGATE)
> + rpdis1++;
> + else if (action[i + r] == KEEP)
> + break;
> + else
> + BUG("Illegal value for action[i + r]");
> + }
> + /*
> + * If the run after the line 'i' found only multimatch lines,
> + * we return false and hence we don't make the current line (i)
> + * discarded.
> + */
> + if (rdis1 == 0)
> + return false;
> + rdis1 += rdis0;
> + rpdis1 += rpdis0;
> +
> + return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> +}
> +
> +struct xoccurrence
> +{
> + size_t file1, file2;
> +};
> +
> +
> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> +
> +
> +/*
> + * Try to reduce the problem complexity, discard records that have no
> + * matches on the other file. Also, lines that have multiple matches
> + * might be potentially discarded if they appear in a run of discardable.
> + */
> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> + long i;
> + size_t nm, mlim;
> + xrecord_t *recs;
> + uint8_t *action1 = NULL, *action2 = NULL;
> + struct IVec_xoccurrence occ;
> + bool need_min = !!(flags & XDF_NEED_MINIMAL);
> + int ret = 0;
> + ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> + ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> +
> + IVEC_INIT(occ);
> + ivec_zero(&occ, xe->mph_size);
> +
> + for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> + size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> + occ.ptr[mph1].file1 += 1;
> + }
> +
> + for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> + size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> + occ.ptr[mph2].file2 += 1;
> + }
> +
> + /*
> + * Create temporary arrays that will help us decide if
> + * changed[i] should remain false, or become true.
> + */
> + if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
> + ret = -1;
> + goto cleanup;
> + }
> + if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
> + ret = -1;
> + goto cleanup;
> + }
> +
> + /*
> + * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
> + */
> + if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> + mlim = XDL_MAX_EQLIMIT;
> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> + nm = occ.ptr[recs->minimal_perfect_hash].file2;
> + action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> + }
> +
> + if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> + mlim = XDL_MAX_EQLIMIT;
> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> + nm = occ.ptr[recs->minimal_perfect_hash].file1;
> + action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> + }
> +
> + /*
> + * Use temporary arrays to decide if changed[i] should remain
> + * false, or become true.
> + */
> + xe->xdf1.nreff = 0;
> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
> + i <= dend1; i++, recs++) {
> + if (action1[i] == KEEP ||
> + (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
> + xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
> + /* changed[i] remains false, i.e. keep */
> + } else
> + xe->xdf1.changed[i] = true;
> + /* i.e. discard */
> + }
> +
> + xe->xdf2.nreff = 0;
> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
> + i <= dend2; i++, recs++) {
> + if (action2[i] == KEEP ||
> + (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
> + xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
> + /* changed[i] remains false, i.e. keep */
> + } else
> + xe->xdf2.changed[i] = true;
> + /* i.e. discard */
> + }
> +
> +cleanup:
> + xdl_free(action1);
> + xdl_free(action2);
> + ivec_free(&occ);
> +
> + return ret;
> +}
> +
> +
> int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
> {
> long ndiags;
> @@ -318,6 +496,8 @@ int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
> xdalgoenv_t xenv;
> int res;
> > + xdl_cleanup_records(xe, flags);
> +
> /*
> * Allocate and setup K vectors to be used by the differential
> * algorithm.
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index b53a3b80c4..3f555e29f4 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -24,14 +24,6 @@
> #include "compat/ivec.h"
> > > -#define XDL_KPDIS_RUN 4
> -#define XDL_MAX_EQLIMIT 1024
> -#define XDL_SIMSCAN_WINDOW 100
> -
> -#define DISCARD 0
> -#define KEEP 1
> -#define INVESTIGATE 2
> -
> typedef struct s_xdlclass {
> struct s_xdlclass *next;
> xrecord_t rec;
> @@ -50,8 +42,6 @@ typedef struct s_xdlclassifier {
> } xdlclassifier_t;
> > > -
> -
> static int xdl_init_classifier(xdlclassifier_t *cf, long size, long flags) {
> memset(cf, 0, sizeof(xdlclassifier_t));
> > @@ -186,175 +176,6 @@ void xdl_free_env(xdfenv_t *xe) {
> }
> > > -static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> - long r, rdis0, rpdis0, rdis1, rpdis1;
> -
> - /*
> - * Limits the window that is examined during the similar-lines
> - * scan. The loops below stops when action[i - r] == KEEP
> - * (line that has no match), but there are corner cases where
> - * the loop proceed all the way to the extremities by causing
> - * huge performance penalties in case of big files.
> - */
> - if (i - s > XDL_SIMSCAN_WINDOW)
> - s = i - XDL_SIMSCAN_WINDOW;
> - if (e - i > XDL_SIMSCAN_WINDOW)
> - e = i + XDL_SIMSCAN_WINDOW;
> -
> - /*
> - * Scans the lines before 'i' to find a run of lines that either
> - * have no match (action[j] == DISCARD) or have multiple matches
> - * (action[j] == INVESTIGATE). Note that we always call this
> - * function with action[i] == INVESTIGATE, so the current line
> - * (i) is already a multimatch line.
> - */
> - for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
> - if (action[i - r] == DISCARD)
> - rdis0++;
> - else if (action[i - r] == INVESTIGATE)
> - rpdis0++;
> - else if (action[i - r] == KEEP)
> - break;
> - else
> - BUG("Illegal value for action[i - r]");
> - }
> - /*
> - * If the run before the line 'i' found only multimatch lines,
> - * we return false and hence we don't make the current line (i)
> - * discarded. We want to discard multimatch lines only when
> - * they appear in the middle of runs with nomatch lines
> - * (action[j] == DISCARD).
> - */
> - if (rdis0 == 0)
> - return 0;
> - for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
> - if (action[i + r] == DISCARD)
> - rdis1++;
> - else if (action[i + r] == INVESTIGATE)
> - rpdis1++;
> - else if (action[i + r] == KEEP)
> - break;
> - else
> - BUG("Illegal value for action[i + r]");
> - }
> - /*
> - * If the run after the line 'i' found only multimatch lines,
> - * we return false and hence we don't make the current line (i)
> - * discarded.
> - */
> - if (rdis1 == 0)
> - return false;
> - rdis1 += rdis0;
> - rpdis1 += rpdis0;
> -
> - return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> -}
> -
> -struct xoccurrence
> -{
> - size_t file1, file2;
> -};
> -
> -
> -DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> -
> -
> -/*
> - * Try to reduce the problem complexity, discard records that have no
> - * matches on the other file. Also, lines that have multiple matches
> - * might be potentially discarded if they appear in a run of discardable.
> - */
> -static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> - long i;
> - size_t nm, mlim;
> - xrecord_t *recs;
> - uint8_t *action1 = NULL, *action2 = NULL;
> - struct IVec_xoccurrence occ;
> - bool need_min = !!(flags & XDF_NEED_MINIMAL);
> - int ret = 0;
> - ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> - ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> -
> - IVEC_INIT(occ);
> - ivec_zero(&occ, xe->mph_size);
> -
> - for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> - size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> - occ.ptr[mph1].file1 += 1;
> - }
> -
> - for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> - size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> - occ.ptr[mph2].file2 += 1;
> - }
> -
> - /*
> - * Create temporary arrays that will help us decide if
> - * changed[i] should remain false, or become true.
> - */
> - if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
> - ret = -1;
> - goto cleanup;
> - }
> - if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
> - ret = -1;
> - goto cleanup;
> - }
> -
> - /*
> - * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
> - */
> - if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> - mlim = XDL_MAX_EQLIMIT;
> - for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> - nm = occ.ptr[recs->minimal_perfect_hash].file2;
> - action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> - }
> -
> - if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> - mlim = XDL_MAX_EQLIMIT;
> - for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> - nm = occ.ptr[recs->minimal_perfect_hash].file1;
> - action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> - }
> -
> - /*
> - * Use temporary arrays to decide if changed[i] should remain
> - * false, or become true.
> - */
> - xe->xdf1.nreff = 0;
> - for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
> - i <= dend1; i++, recs++) {
> - if (action1[i] == KEEP ||
> - (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
> - xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
> - /* changed[i] remains false, i.e. keep */
> - } else
> - xe->xdf1.changed[i] = true;
> - /* i.e. discard */
> - }
> -
> - xe->xdf2.nreff = 0;
> - for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
> - i <= dend2; i++, recs++) {
> - if (action2[i] == KEEP ||
> - (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
> - xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
> - /* changed[i] remains false, i.e. keep */
> - } else
> - xe->xdf2.changed[i] = true;
> - /* i.e. discard */
> - }
> -
> -cleanup:
> - xdl_free(action1);
> - xdl_free(action2);
> - ivec_free(&occ);
> -
> - return ret;
> -}
> -
> -
> /*
> * Early trim initial and terminal matching records.
> */
> @@ -414,19 +235,9 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> }
> > xe->mph_size = cf.count;
> + xdl_free_classifier(&cf);
> > xdl_trim_ends(xe);
> - if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> - (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> - xdl_cleanup_records(xe, xpp->flags) < 0) {
> -
> - xdl_free_ctx(&xe->xdf2);
> - xdl_free_ctx(&xe->xdf1);
> - xdl_free_classifier(&cf);
> - return -1;
> - }
> -
> - xdl_free_classifier(&cf);
> > return 0;
> }There was a problem hiding this comment.
Phillip Wood wrote on the Git mailing list (how to reply to this email):
On 21/01/2026 15:01, Phillip Wood wrote:
> Hi Ezekiel
> > On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> Only the classic diff uses xdl_cleanup_records(). Move it,
>> xdl_clean_mmatch(), and the macros to xdiffi.c and call
>> xdl_cleanup_records() inside of xdl_do_classic_diff(). This better
>> organizes the code related to the classic diff.
> > I think calling xdl_cleanup_records() from inside xdl_do_classic_diff() > makes sense. I don't have a strong opinion either way on the code > movement.
Having thought about it I'm not so sure the code movement here makes sense. Having utility functions in a separate file is perfectly reasonable (afterall xprepare.c existed before the histogram and patientce algorithms were added). It's not like the code xdiffi.c is only about the myers diff there is generic code for diff sliders in there as well.
Thanks
Phillip
You should remove '#include "compat/ivec.h"' from xprepare.c
> if you're moving the only code that uses it out of that file.
> > Thanks
> > Phillip
> >> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>> ---
>> xdiff/xdiffi.c | 180 ++++++++++++++++++++++++++++++++++++++++++++
>> xdiff/xprepare.c | 191 +----------------------------------------------
>> 2 files changed, 181 insertions(+), 190 deletions(-)
>>
>> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
>> index e3196c7245..0f1fd7cf80 100644
>> --- a/xdiff/xdiffi.c
>> +++ b/xdiff/xdiffi.c
>> @@ -21,6 +21,7 @@
>> */
>> #include "xinclude.h"
>> +#include "compat/ivec.h"
>> static size_t get_hash(xdfile_t *xdf, long index)
>> {
>> @@ -33,6 +34,14 @@ static size_t get_hash(xdfile_t *xdf, long index)
>> #define XDL_SNAKE_CNT 20
>> #define XDL_K_HEUR 4
>> +#define XDL_KPDIS_RUN 4
>> +#define XDL_MAX_EQLIMIT 1024
>> +#define XDL_SIMSCAN_WINDOW 100
>> +
>> +#define DISCARD 0
>> +#define KEEP 1
>> +#define INVESTIGATE 2
>> +
>> typedef struct s_xdpsplit {
>> long i1, i2;
>> int min_lo, min_hi;
>> @@ -311,6 +320,175 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long >> lim1,
>> }
>> +static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, >> long e) {
>> + long r, rdis0, rpdis0, rdis1, rpdis1;
>> +
>> + /*
>> + * Limits the window that is examined during the similar-lines
>> + * scan. The loops below stops when action[i - r] == KEEP
>> + * (line that has no match), but there are corner cases where
>> + * the loop proceed all the way to the extremities by causing
>> + * huge performance penalties in case of big files.
>> + */
>> + if (i - s > XDL_SIMSCAN_WINDOW)
>> + s = i - XDL_SIMSCAN_WINDOW;
>> + if (e - i > XDL_SIMSCAN_WINDOW)
>> + e = i + XDL_SIMSCAN_WINDOW;
>> +
>> + /*
>> + * Scans the lines before 'i' to find a run of lines that either
>> + * have no match (action[j] == DISCARD) or have multiple matches
>> + * (action[j] == INVESTIGATE). Note that we always call this
>> + * function with action[i] == INVESTIGATE, so the current line
>> + * (i) is already a multimatch line.
>> + */
>> + for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
>> + if (action[i - r] == DISCARD)
>> + rdis0++;
>> + else if (action[i - r] == INVESTIGATE)
>> + rpdis0++;
>> + else if (action[i - r] == KEEP)
>> + break;
>> + else
>> + BUG("Illegal value for action[i - r]");
>> + }
>> + /*
>> + * If the run before the line 'i' found only multimatch lines,
>> + * we return false and hence we don't make the current line (i)
>> + * discarded. We want to discard multimatch lines only when
>> + * they appear in the middle of runs with nomatch lines
>> + * (action[j] == DISCARD).
>> + */
>> + if (rdis0 == 0)
>> + return 0;
>> + for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
>> + if (action[i + r] == DISCARD)
>> + rdis1++;
>> + else if (action[i + r] == INVESTIGATE)
>> + rpdis1++;
>> + else if (action[i + r] == KEEP)
>> + break;
>> + else
>> + BUG("Illegal value for action[i + r]");
>> + }
>> + /*
>> + * If the run after the line 'i' found only multimatch lines,
>> + * we return false and hence we don't make the current line (i)
>> + * discarded.
>> + */
>> + if (rdis1 == 0)
>> + return false;
>> + rdis1 += rdis0;
>> + rpdis1 += rpdis0;
>> +
>> + return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>> +}
>> +
>> +struct xoccurrence
>> +{
>> + size_t file1, file2;
>> +};
>> +
>> +
>> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
>> +
>> +
>> +/*
>> + * Try to reduce the problem complexity, discard records that have no
>> + * matches on the other file. Also, lines that have multiple matches
>> + * might be potentially discarded if they appear in a run of >> discardable.
>> + */
>> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
>> + long i;
>> + size_t nm, mlim;
>> + xrecord_t *recs;
>> + uint8_t *action1 = NULL, *action2 = NULL;
>> + struct IVec_xoccurrence occ;
>> + bool need_min = !!(flags & XDF_NEED_MINIMAL);
>> + int ret = 0;
>> + ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>> + ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>> +
>> + IVEC_INIT(occ);
>> + ivec_zero(&occ, xe->mph_size);
>> +
>> + for (size_t j = 0; j < xe->xdf1.nrec; j++) {
>> + size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
>> + occ.ptr[mph1].file1 += 1;
>> + }
>> +
>> + for (size_t j = 0; j < xe->xdf2.nrec; j++) {
>> + size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
>> + occ.ptr[mph2].file2 += 1;
>> + }
>> +
>> + /*
>> + * Create temporary arrays that will help us decide if
>> + * changed[i] should remain false, or become true.
>> + */
>> + if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
>> + ret = -1;
>> + goto cleanup;
>> + }
>> + if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
>> + ret = -1;
>> + goto cleanup;
>> + }
>> +
>> + /*
>> + * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>> + */
>> + if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>> + mlim = XDL_MAX_EQLIMIT;
>> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; >> i <= dend1; i++, recs++) {
>> + nm = occ.ptr[recs->minimal_perfect_hash].file2;
>> + action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? >> INVESTIGATE: KEEP;
>> + }
>> +
>> + if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>> + mlim = XDL_MAX_EQLIMIT;
>> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; >> i <= dend2; i++, recs++) {
>> + nm = occ.ptr[recs->minimal_perfect_hash].file1;
>> + action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? >> INVESTIGATE: KEEP;
>> + }
>> +
>> + /*
>> + * Use temporary arrays to decide if changed[i] should remain
>> + * false, or become true.
>> + */
>> + xe->xdf1.nreff = 0;
>> + for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>> + i <= dend1; i++, recs++) {
>> + if (action1[i] == KEEP ||
>> + (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, >> i, xe->delta_start, dend1))) {
>> + xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>> + /* changed[i] remains false, i.e. keep */
>> + } else
>> + xe->xdf1.changed[i] = true;
>> + /* i.e. discard */
>> + }
>> +
>> + xe->xdf2.nreff = 0;
>> + for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>> + i <= dend2; i++, recs++) {
>> + if (action2[i] == KEEP ||
>> + (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, >> i, xe->delta_start, dend2))) {
>> + xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>> + /* changed[i] remains false, i.e. keep */
>> + } else
>> + xe->xdf2.changed[i] = true;
>> + /* i.e. discard */
>> + }
>> +
>> +cleanup:
>> + xdl_free(action1);
>> + xdl_free(action2);
>> + ivec_free(&occ);
>> +
>> + return ret;
>> +}
>> +
>> +
>> int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
>> {
>> long ndiags;
>> @@ -318,6 +496,8 @@ int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
>> xdalgoenv_t xenv;
>> int res;
>> + xdl_cleanup_records(xe, flags);
>> +
>> /*
>> * Allocate and setup K vectors to be used by the differential
>> * algorithm.
>> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
>> index b53a3b80c4..3f555e29f4 100644
>> --- a/xdiff/xprepare.c
>> +++ b/xdiff/xprepare.c
>> @@ -24,14 +24,6 @@
>> #include "compat/ivec.h"
>> -#define XDL_KPDIS_RUN 4
>> -#define XDL_MAX_EQLIMIT 1024
>> -#define XDL_SIMSCAN_WINDOW 100
>> -
>> -#define DISCARD 0
>> -#define KEEP 1
>> -#define INVESTIGATE 2
>> -
>> typedef struct s_xdlclass {
>> struct s_xdlclass *next;
>> xrecord_t rec;
>> @@ -50,8 +42,6 @@ typedef struct s_xdlclassifier {
>> } xdlclassifier_t;
>> -
>> -
>> static int xdl_init_classifier(xdlclassifier_t *cf, long size, long >> flags) {
>> memset(cf, 0, sizeof(xdlclassifier_t));
>> @@ -186,175 +176,6 @@ void xdl_free_env(xdfenv_t *xe) {
>> }
>> -static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, >> long e) {
>> - long r, rdis0, rpdis0, rdis1, rpdis1;
>> -
>> - /*
>> - * Limits the window that is examined during the similar-lines
>> - * scan. The loops below stops when action[i - r] == KEEP
>> - * (line that has no match), but there are corner cases where
>> - * the loop proceed all the way to the extremities by causing
>> - * huge performance penalties in case of big files.
>> - */
>> - if (i - s > XDL_SIMSCAN_WINDOW)
>> - s = i - XDL_SIMSCAN_WINDOW;
>> - if (e - i > XDL_SIMSCAN_WINDOW)
>> - e = i + XDL_SIMSCAN_WINDOW;
>> -
>> - /*
>> - * Scans the lines before 'i' to find a run of lines that either
>> - * have no match (action[j] == DISCARD) or have multiple matches
>> - * (action[j] == INVESTIGATE). Note that we always call this
>> - * function with action[i] == INVESTIGATE, so the current line
>> - * (i) is already a multimatch line.
>> - */
>> - for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
>> - if (action[i - r] == DISCARD)
>> - rdis0++;
>> - else if (action[i - r] == INVESTIGATE)
>> - rpdis0++;
>> - else if (action[i - r] == KEEP)
>> - break;
>> - else
>> - BUG("Illegal value for action[i - r]");
>> - }
>> - /*
>> - * If the run before the line 'i' found only multimatch lines,
>> - * we return false and hence we don't make the current line (i)
>> - * discarded. We want to discard multimatch lines only when
>> - * they appear in the middle of runs with nomatch lines
>> - * (action[j] == DISCARD).
>> - */
>> - if (rdis0 == 0)
>> - return 0;
>> - for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
>> - if (action[i + r] == DISCARD)
>> - rdis1++;
>> - else if (action[i + r] == INVESTIGATE)
>> - rpdis1++;
>> - else if (action[i + r] == KEEP)
>> - break;
>> - else
>> - BUG("Illegal value for action[i + r]");
>> - }
>> - /*
>> - * If the run after the line 'i' found only multimatch lines,
>> - * we return false and hence we don't make the current line (i)
>> - * discarded.
>> - */
>> - if (rdis1 == 0)
>> - return false;
>> - rdis1 += rdis0;
>> - rpdis1 += rpdis0;
>> -
>> - return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>> -}
>> -
>> -struct xoccurrence
>> -{
>> - size_t file1, file2;
>> -};
>> -
>> -
>> -DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
>> -
>> -
>> -/*
>> - * Try to reduce the problem complexity, discard records that have no
>> - * matches on the other file. Also, lines that have multiple matches
>> - * might be potentially discarded if they appear in a run of >> discardable.
>> - */
>> -static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
>> - long i;
>> - size_t nm, mlim;
>> - xrecord_t *recs;
>> - uint8_t *action1 = NULL, *action2 = NULL;
>> - struct IVec_xoccurrence occ;
>> - bool need_min = !!(flags & XDF_NEED_MINIMAL);
>> - int ret = 0;
>> - ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>> - ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>> -
>> - IVEC_INIT(occ);
>> - ivec_zero(&occ, xe->mph_size);
>> -
>> - for (size_t j = 0; j < xe->xdf1.nrec; j++) {
>> - size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
>> - occ.ptr[mph1].file1 += 1;
>> - }
>> -
>> - for (size_t j = 0; j < xe->xdf2.nrec; j++) {
>> - size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
>> - occ.ptr[mph2].file2 += 1;
>> - }
>> -
>> - /*
>> - * Create temporary arrays that will help us decide if
>> - * changed[i] should remain false, or become true.
>> - */
>> - if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
>> - ret = -1;
>> - goto cleanup;
>> - }
>> - if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
>> - ret = -1;
>> - goto cleanup;
>> - }
>> -
>> - /*
>> - * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>> - */
>> - if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>> - mlim = XDL_MAX_EQLIMIT;
>> - for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; >> i <= dend1; i++, recs++) {
>> - nm = occ.ptr[recs->minimal_perfect_hash].file2;
>> - action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? >> INVESTIGATE: KEEP;
>> - }
>> -
>> - if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>> - mlim = XDL_MAX_EQLIMIT;
>> - for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; >> i <= dend2; i++, recs++) {
>> - nm = occ.ptr[recs->minimal_perfect_hash].file1;
>> - action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? >> INVESTIGATE: KEEP;
>> - }
>> -
>> - /*
>> - * Use temporary arrays to decide if changed[i] should remain
>> - * false, or become true.
>> - */
>> - xe->xdf1.nreff = 0;
>> - for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>> - i <= dend1; i++, recs++) {
>> - if (action1[i] == KEEP ||
>> - (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, >> i, xe->delta_start, dend1))) {
>> - xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>> - /* changed[i] remains false, i.e. keep */
>> - } else
>> - xe->xdf1.changed[i] = true;
>> - /* i.e. discard */
>> - }
>> -
>> - xe->xdf2.nreff = 0;
>> - for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>> - i <= dend2; i++, recs++) {
>> - if (action2[i] == KEEP ||
>> - (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, >> i, xe->delta_start, dend2))) {
>> - xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>> - /* changed[i] remains false, i.e. keep */
>> - } else
>> - xe->xdf2.changed[i] = true;
>> - /* i.e. discard */
>> - }
>> -
>> -cleanup:
>> - xdl_free(action1);
>> - xdl_free(action2);
>> - ivec_free(&occ);
>> -
>> - return ret;
>> -}
>> -
>> -
>> /*
>> * Early trim initial and terminal matching records.
>> */
>> @@ -414,19 +235,9 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, >> xpparam_t const *xpp,
>> }
>> xe->mph_size = cf.count;
>> + xdl_free_classifier(&cf);
>> xdl_trim_ends(xe);
>> - if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>> - (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
>> - xdl_cleanup_records(xe, xpp->flags) < 0) {
>> -
>> - xdl_free_ctx(&xe->xdf2);
>> - xdl_free_ctx(&xe->xdf1);
>> - xdl_free_classifier(&cf);
>> - return -1;
>> - }
>> -
>> - xdl_free_classifier(&cf);
>> return 0;
>> }
> > |
This patch series was integrated into seen via c078c58. |
|
This patch series was integrated into seen via ce8edf1. |
|
Phillip Wood wrote on the Git mailing list (how to reply to this email): The discussion of this series has got rather spread out so I thought it might be helpful to write a summary of my thoughts here.
On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> Patch series summary:
> > * patch 1: Introduce the ivec type
I agree this is a good idea to allow rust and C code to operate on the some data structure. The implementation needs a bit of work to avoid undefined behavior.
> * patch 2: Create the function xdl_do_classic_diff()
This is sensible
> * patches 3-4: generic cleanup
Patch 3 claims to "stop wasting time" but it introduces an extra pass over the input records without any explanation of why that is more efficient.
Patch 4 removes the common lines from the beginning and end of the input files before passing them on to the patience or histogram algorithms. That should speed things up (though we should measure by how much). It changes the output because excluding the common lines at beginning and end of the file changes the longest sequence of unique context lines in the lines that remain. If the different output is easier to read then that's clearly a good thing but you would need to do some analysis to show that.
> * patches 5-8: convert from dstart/dend (in xdfile_t) to
> delta_start/delta_end (in xdfenv_t)
dstart a dend in xdfile_t are the index of the first and last line after removing any common lines from the beginning and end. The proposal is to store the offset from the beginning and end instead in xdfenv_t. Looking at where dstart and dend are used I think storing the indices is more convenient - if we store offsets we end up calculating the indices from them which is a pain and introduces an opportunity to make an error.
> * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
> xdiffi.c
Here we finally get to use the ivec data structre introduced in patch 1. However it is just replacing a fixed size array and so does not demonstrate the more interesting parts of the API which concern growing the array as we push more elements to it. I'm also not convinced by the claim that this change saves time as it introduces an extra pass over the input records.
Overall I struggled to see how the cleanups proposed here linked to the introduction of the ivec data structure.
Thanks
Phillip
> Things that will be addressed in future patch series:
> > * Make xdl_cleanup_records() easier to read
> * convert recs/nrec into an ivec
> * convert changed to an ivec
> * remove reference_index/nreff from xdfile_t and turn it into an ivec
> * splitting minimal_perfect_hash out as its own ivec
> * improve the performance of the classifier and parsing/hashing lines
> > === before this patch series typedef struct s_xdfile { xrecord_t *recs;
> size_t nrec; ptrdiff_t dstart, dend; bool *changed; size_t *reference_index;
> size_t nreff; } xdfile_t;
> > typedef struct s_xdfenv { xdfile_t xdf1, xdf2; } xdfenv_t;
> > === after this patch series typedef struct s_xdfile { xrecord_t *recs;
> size_t nrec; bool *changed; size_t *reference_index; size_t nreff; }
> xdfile_t;
> > typedef struct s_xdfenv { xdfile_t xdf1, xdf2; size_t delta_start,
> delta_end; size_t mph_size; } xdfenv_t;
> > Ezekiel Newren (10):
> ivec: introduce the C side of ivec
> xdiff: make classic diff explicit by creating xdl_do_classic_diff()
> xdiff: don't waste time guessing the number of lines
> xdiff: let patience and histogram benefit from xdl_trim_ends()
> xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
> xdiff: cleanup xdl_trim_ends()
> xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
> xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
> xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
> xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
> > Makefile | 1 +
> compat/ivec.c | 113 ++++++++++++++++++
> compat/ivec.h | 52 +++++++++
> meson.build | 1 +
> xdiff/xdiffi.c | 221 +++++++++++++++++++++++++++++++++---
> xdiff/xdiffi.h | 1 +
> xdiff/xhistogram.c | 7 +-
> xdiff/xpatience.c | 7 +-
> xdiff/xprepare.c | 277 ++++++++-------------------------------------
> xdiff/xtypes.h | 3 +-
> xdiff/xutils.c | 20 ----
> xdiff/xutils.h | 1 -
> 12 files changed, 432 insertions(+), 272 deletions(-)
> create mode 100644 compat/ivec.c
> create mode 100644 compat/ivec.h
> > > base-commit: 66ce5f8e8872f0183bb137911c52b07f1f242d13
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v1
> Pull-Request: https://github.com/git/git/pull/2156 |
Patch series summary:
Things that will be addressed in future patch series:
=== before this patch series
typedef struct s_xdfile {
xrecord_t *recs;
size_t nrec;
ptrdiff_t dstart, dend;
bool *changed;
size_t *reference_index;
size_t nreff;
} xdfile_t;
typedef struct s_xdfenv {
xdfile_t xdf1, xdf2;
} xdfenv_t;
=== after this patch series
typedef struct s_xdfile {
xrecord_t *recs;
size_t nrec;
bool *changed;
size_t *reference_index;
size_t nreff;
} xdfile_t;
typedef struct s_xdfenv {
xdfile_t xdf1, xdf2;
size_t delta_start, delta_end;
size_t mph_size;
} xdfenv_t;
cc: Yee Cheng Chin ychin.git@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com
cc: René Scharfe l.s.r@web.de
cc: Jeff King peff@peff.net
cc: "D. Ben Knoble" ben.knoble@gmail.com