Skip to content

Commit 0750bb5

Browse files
peffgitster
authored andcommitted
cat-file: support "unordered" output for --batch-all-objects
If you're going to access the contents of every object in a packfile, it's generally much more efficient to do so in pack order, rather than in hash order. That increases the locality of access within the packfile, which in turn is friendlier to the delta base cache, since the packfile puts related deltas next to each other. By contrast, hash order is effectively random, since the sha1 has no discernible relationship to the content. This patch introduces an "--unordered" option to cat-file which iterates over packs in pack-order under the hood. You can see the results when dumping all of the file content: $ time ./git cat-file --batch-all-objects --buffer --batch | wc -c 6883195596 real 0m44.491s user 0m42.902s sys 0m5.230s $ time ./git cat-file --unordered \ --batch-all-objects --buffer --batch | wc -c 6883195596 real 0m6.075s user 0m4.774s sys 0m3.548s Same output, different order, way faster. The same speed-up applies even if you end up accessing the object content in a different process, like: git cat-file --batch-all-objects --buffer --batch-check | grep blob | git cat-file --batch='%(objectname) %(rest)' | wc -c Adding "--unordered" to the first command drops the runtime in git.git from 24s to 3.5s. Side note: there are actually further speedups available for doing it all in-process now. Since we are outputting the object content during the actual pack iteration, we know where to find the object and could skip the extra lookup done by oid_object_info(). This patch stops short of that optimization since the underlying API isn't ready for us to make those sorts of direct requests. So if --unordered is so much better, why not make it the default? Two reasons: 1. We've promised in the documentation that --batch-all-objects outputs in hash order. Since cat-file is plumbing, people may be relying on that default, and we can't change it. 2. It's actually _slower_ for some cases. We have to compute the pack revindex to walk in pack order. And our de-duplication step uses an oidset, rather than a sort-and-dedup, which can end up being more expensive. If we're just accessing the type and size of each object, for example, like: git cat-file --batch-all-objects --buffer --batch-check my best-of-five warm cache timings go from 900ms to 1100ms using --unordered. Though it's possible in a cold-cache or under memory pressure that we could do better, since we'd have better locality within the packfile. And one final question: why is it "--unordered" and not "--pack-order"? The answer is again two-fold: 1. "pack order" isn't a well-defined thing across the whole set of objects. We're hitting loose objects, as well as objects in multiple packs, and the only ordering we're promising is _within_ a single pack. The rest is apparently random. 2. The point here is optimization. So we don't want to promise any particular ordering, but only to say that we will choose an ordering which is likely to be efficient for accessing the object content. That leaves the door open for further changes in the future without having to add another compatibility option. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 parent b1adb38 commit 0750bb5

File tree

3 files changed

+72
-5
lines changed

3 files changed

+72
-5
lines changed

Documentation/git-cat-file.txt

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,16 @@ OPTIONS
104104
buffering; this is much more efficient when invoking
105105
`--batch-check` on a large number of objects.
106106

107+
--unordered::
108+
When `--batch-all-objects` is in use, visit objects in an
109+
order which may be more efficient for accessing the object
110+
contents than hash order. The exact details of the order are
111+
unspecified, but if you do not require a specific order, this
112+
should generally result in faster output, especially with
113+
`--batch`. Note that `cat-file` will still show each object
114+
only once, even if it is stored multiple times in the
115+
repository.
116+
107117
--allow-unknown-type::
108118
Allow -s or -t to query broken/corrupt objects of unknown type.
109119

builtin/cat-file.c

Lines changed: 51 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ struct batch_options {
2121
int print_contents;
2222
int buffer_output;
2323
int all_objects;
24+
int unordered;
2425
int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
2526
const char *format;
2627
};
@@ -410,6 +411,7 @@ static void batch_one_object(const char *obj_name, struct batch_options *opt,
410411
struct object_cb_data {
411412
struct batch_options *opt;
412413
struct expand_data *expand;
414+
struct oidset *seen;
413415
};
414416

415417
static int batch_object_cb(const struct object_id *oid, void *vdata)
@@ -437,6 +439,32 @@ static int collect_packed_object(const struct object_id *oid,
437439
return 0;
438440
}
439441

442+
static int batch_unordered_object(const struct object_id *oid, void *vdata)
443+
{
444+
struct object_cb_data *data = vdata;
445+
446+
if (oidset_contains(data->seen, oid))
447+
return 0;
448+
oidset_insert(data->seen, oid);
449+
450+
return batch_object_cb(oid, data);
451+
}
452+
453+
static int batch_unordered_loose(const struct object_id *oid,
454+
const char *path,
455+
void *data)
456+
{
457+
return batch_unordered_object(oid, data);
458+
}
459+
460+
static int batch_unordered_packed(const struct object_id *oid,
461+
struct packed_git *pack,
462+
uint32_t pos,
463+
void *data)
464+
{
465+
return batch_unordered_object(oid, data);
466+
}
467+
440468
static int batch_objects(struct batch_options *opt)
441469
{
442470
struct strbuf buf = STRBUF_INIT;
@@ -473,19 +501,35 @@ static int batch_objects(struct batch_options *opt)
473501
data.info.typep = &data.type;
474502

475503
if (opt->all_objects) {
476-
struct oid_array sa = OID_ARRAY_INIT;
477504
struct object_cb_data cb;
478505

479-
for_each_loose_object(collect_loose_object, &sa, 0);
480-
for_each_packed_object(collect_packed_object, &sa, 0);
481506
if (repository_format_partial_clone)
482507
warning("This repository has extensions.partialClone set. Some objects may not be loaded.");
483508

484509
cb.opt = opt;
485510
cb.expand = &data;
486-
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
487511

488-
oid_array_clear(&sa);
512+
if (opt->unordered) {
513+
struct oidset seen = OIDSET_INIT;
514+
515+
cb.seen = &seen;
516+
517+
for_each_loose_object(batch_unordered_loose, &cb, 0);
518+
for_each_packed_object(batch_unordered_packed, &cb,
519+
FOR_EACH_OBJECT_PACK_ORDER);
520+
521+
oidset_clear(&seen);
522+
} else {
523+
struct oid_array sa = OID_ARRAY_INIT;
524+
525+
for_each_loose_object(collect_loose_object, &sa, 0);
526+
for_each_packed_object(collect_packed_object, &sa, 0);
527+
528+
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
529+
530+
oid_array_clear(&sa);
531+
}
532+
489533
return 0;
490534
}
491535

@@ -586,6 +630,8 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
586630
N_("follow in-tree symlinks (used with --batch or --batch-check)")),
587631
OPT_BOOL(0, "batch-all-objects", &batch.all_objects,
588632
N_("show all objects with --batch or --batch-check")),
633+
OPT_BOOL(0, "unordered", &batch.unordered,
634+
N_("do not order --batch-all-objects output")),
589635
OPT_END()
590636
};
591637

t/t1006-cat-file.sh

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -575,4 +575,15 @@ test_expect_success 'cat-file --batch-all-objects shows all objects' '
575575
test_cmp expect actual
576576
'
577577

578+
# The only user-visible difference is that the objects are no longer sorted,
579+
# and the resulting sort order is undefined. So we can only check that it
580+
# produces the same objects as the ordered case, but that at least exercises
581+
# the code.
582+
test_expect_success 'cat-file --unordered works' '
583+
git -C all-two cat-file --batch-all-objects --unordered \
584+
--batch-check="%(objectname)" >actual.unsorted &&
585+
sort <actual.unsorted >actual &&
586+
test_cmp expect actual
587+
'
588+
578589
test_done

0 commit comments

Comments
 (0)