Commit 0ad6090

derrickstolee authored and gitster committed
sparse-index: design doc and format update
This begins a long effort to update the index format to allow sparse directory entries. This should result in a significant improvement to Git commands when HEAD contains millions of files, but the user has selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of extensions.sparseIndex instead of increasing a file format version number. This is temporary, and index v5 is part of the plan for future work in this area.

The design document details many of the reasons for embarking on this work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 parent 4795748 commit 0ad6090

2 files changed, 182 additions and 0 deletions

Documentation/technical/index-format.txt

Lines changed: 7 additions & 0 deletions
@@ -44,6 +44,13 @@ Git index format
 localization, no special casing of directory separator '/'). Entries
 with the same name are sorted by their stage field.

+An index entry typically represents a file. However, if sparse-checkout
+is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+`extensions.sparseIndex` extension is enabled, then the index may
+contain entries for directories outside of the sparse-checkout definition.
+These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and
+have a path that ends in a directory separator.
+
 32-bit ctime seconds, the last time a file's metadata changed
 this is stat(2) data

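The three properties of a sparse-directory entry listed in the hunk above can be checked mechanically. The following is a minimal sketch in Python, not Git's C implementation: the `IndexEntry` class and the `is_sparse_directory` helper are hypothetical names introduced only for illustration, and the entry is reduced to the three fields the text mentions.

```python
from dataclasses import dataclass

S_IFDIR = 0o040000  # mode that the text gives for a sparse-directory entry


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical, simplified view of one index entry (illustration only)."""
    path: str
    mode: int
    skip_worktree: bool


def is_sparse_directory(entry: IndexEntry) -> bool:
    """True when the entry has all three properties named in the text."""
    return (
        entry.mode == S_IFDIR
        and entry.skip_worktree
        and entry.path.endswith("/")
    )


entries = [
    IndexEntry("README.md", 0o100644, False),    # ordinary populated file
    IndexEntry("docs/", S_IFDIR, True),          # sparse-directory entry
    IndexEntry("vendor/lib.c", 0o100644, True),  # skipped file, not a directory
]
sparse_dirs = [e.path for e in entries if is_sparse_directory(e)]
```

A file entry that merely has `SKIP_WORKTREE` set is not a sparse directory; all three conditions have to hold together.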

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git working directory are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) work instead of
+O(Populated). The cost is mainly in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
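To make the collapsing step concrete, here is a toy Python model of how a full list of paths at `HEAD` could shrink to a sparse one. It is a sketch under stated assumptions, not Git's algorithm: `in_or_above_cone` and `to_sparse` are hypothetical helpers, the cone is given as a plain list of directories, and the tree object ID that a real sparse-directory entry references is omitted. The sketch assumes cone-mode semantics in which files directly inside an ancestor of a cone directory stay populated.

```python
def in_or_above_cone(directory, cone_dirs):
    """True if `directory` is inside a cone directory or is an ancestor of one."""
    for cone in cone_dirs:
        if directory == cone or directory.startswith(cone + "/"):
            return True  # the directory lies within the cone
        if cone.startswith(directory + "/"):
            return True  # the directory leads towards the cone
    return False


def to_sparse(paths, cone_dirs):
    """Replace every directory wholly outside the cone with one 'dir/' entry."""
    result = []
    seen = set()
    for path in paths:
        parents = path.split("/")[:-1]
        collapsed = None
        # Find the shallowest parent directory that is entirely outside the cone.
        for depth in range(1, len(parents) + 1):
            directory = "/".join(parents[:depth])
            if not in_or_above_cone(directory, cone_dirs):
                collapsed = directory + "/"
                break
        if collapsed is None:
            result.append(path)       # the file stays as an individual entry
        elif collapsed not in seen:
            seen.add(collapsed)
            result.append(collapsed)  # one entry stands in for the whole tree
    return result


head_paths = [
    "README",
    "docs/a.txt",
    "docs/img/b.png",
    "src/app/main.c",
    "src/lib/x.c",
    "src/lib/y.c",
    "src/top.c",
]
sparse_paths = to_sparse(head_paths, ["src/app"])
```

In this example seven file entries become five entries, two of which (`docs/` and `src/lib/`) are sparse directories. With millions of paths outside the cone, the entry count would track the populated set rather than `HEAD`.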
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In fact, these loops expect to see a reference to every
+staged file. One way to handle this is to parse trees to replace a
+sparse-directory entry with all of the files within that tree as the index
+is loaded. However, parsing trees is slower than parsing the index format,
+so this expansion is slower than leaving the index alone. The plan is
+to make all of these integrations "sparse aware" so this expansion through
+tree parsing is unnecessary and they use fewer resources than when using a
+full index.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will call a helper method,
+`ensure_full_index()`, which scans the index for sparse-directory entries
+(pointing to trees) and replaces them with the full list of paths (with
+blob contents) by parsing tree objects. This will be slower in all cases.
+The only noticeable change in behavior will be that the serialized index
+file contains sparse-directory entries.
+
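The expansion that `ensure_full_index()` is described as performing can be modeled in a few lines of Python. This is only a sketch of the idea, not the C function: the `TREES` dictionary is a made-up stand-in for the object database, object IDs are placeholder strings, and `expand_tree` and `ensure_full` are hypothetical names.

```python
# Toy object store (assumption): tree ID -> list of (name, kind, object ID).
TREES = {
    "tree-docs": [("a.txt", "blob", "blob-1"), ("img", "tree", "tree-img")],
    "tree-img": [("b.png", "blob", "blob-2")],
}


def expand_tree(prefix, tree_id, trees):
    """Yield (path, blob ID) for every blob reachable from `tree_id`."""
    for name, kind, oid in trees[tree_id]:
        if kind == "tree":
            # Recurse into subtrees, extending the path prefix.
            yield from expand_tree(prefix + name + "/", oid, trees)
        else:
            yield (prefix + name, oid)


def ensure_full(entries, trees):
    """Replace each sparse-directory entry with the blobs under its tree.

    `entries` is a list of (path, object ID); a path ending in '/' marks a
    sparse directory whose object ID names a tree.
    """
    full = []
    for path, oid in entries:
        if path.endswith("/"):
            full.extend(expand_tree(path, oid, trees))
        else:
            full.append((path, oid))
    return full


sparse_index = [("README", "blob-0"), ("docs/", "tree-docs"), ("src/main.c", "blob-3")]
full_index = ensure_full(sparse_index, TREES)
```

Each sparse directory costs one tree parse per nested directory, which is why the text expects this path to be slower than reading a full index directly.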
+To start, we use a new required index extension, `sdir`, to allow
+inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, while allowing tools that do not
+understand the sparse-index to operate on repositories as long as they do
+not interact with the index. A new format, index v5, will be introduced
+that includes sparse-directory entries by default. It might also
+introduce other features that have been considered for improving the
+index.
+
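The reason a lowercase signature such as `sdir` blocks older readers follows from the index format's rule that an extension whose signature starts with 'A'..'Z' is optional and may be ignored, while any other extension must be understood. A minimal Python sketch of that rule follows; `is_optional_extension` and `can_read` are hypothetical helper names, and the set of understood signatures is an assumption chosen for the example.

```python
def is_optional_extension(signature: bytes) -> bool:
    """An extension is optional when its first signature byte is 'A'..'Z'."""
    if len(signature) != 4:
        raise ValueError("index extension signatures are 4 bytes long")
    return ord("A") <= signature[0] <= ord("Z")


def can_read(signatures, understood):
    """A reader may proceed only if every unknown extension is optional."""
    return all(
        sig in understood or is_optional_extension(sig)
        for sig in signatures
    )


# Assumption: a reader that predates the sparse-index and knows only TREE.
old_reader = {b"TREE"}

reads_untracked_cache = can_read([b"TREE", b"UNTR"], old_reader)
reads_sparse_index = can_read([b"TREE", b"sdir"], old_reader)
```

The older reader can skip the uppercase `UNTR` extension but must refuse an index carrying `sdir`, which is the protection the paragraph above describes.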
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
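The text names `expand_index_to_path()` without describing its behavior. The Python sketch below assumes one plausible reading: expand only when the requested path is hidden inside a sparse directory, and leave the index untouched otherwise. `covering_sparse_dir`, `guarded_lookup`, `toy_expand`, and the `CONTENTS` table are all hypothetical and exist only for this illustration.

```python
# Assumption: the files hidden behind each sparse directory, for the toy expander.
CONTENTS = {"docs/": ["docs/a.txt", "docs/img/b.png"]}


def toy_expand(entries):
    """Stand-in for full expansion: swap each sparse directory for its files."""
    expanded = []
    for path in entries:
        expanded.extend(CONTENTS.get(path, [path]))
    return expanded


def covering_sparse_dir(entries, path):
    """Return the sparse-directory entry that contains `path`, or None."""
    for entry_path in entries:
        if entry_path.endswith("/") and path.startswith(entry_path):
            return entry_path
    return None


def guarded_lookup(entries, path, expand):
    """Look up `path`, expanding first only if a sparse directory hides it."""
    if covering_sparse_dir(entries, path) is not None:
        entries = expand(entries)
    return path in entries, entries


sparse_index = ["README", "docs/", "src/main.c"]
found_inside, after_inside = guarded_lookup(sparse_index, "docs/a.txt", toy_expand)
found_outside, after_outside = guarded_lookup(sparse_index, "src/main.c", toy_expand)
```

A lookup of a populated path leaves the sparse directory in place, so only callers that reach outside the cone pay for tree parsing.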
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
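The opt-out rollout described above can be pictured as a default-on flag that individual builtins clear before the index is read. In this Python sketch only the name `command_requires_full_index` comes from the text; the `Repository` class, `load_index`, `run_builtin`, and the set of sparse-aware builtins are assumptions made for illustration.

```python
class Repository:
    """Hypothetical repository settings holder."""

    def __init__(self):
        # On by default: a builtin gets a full index unless it opts out.
        self.command_requires_full_index = True


def toy_expand(entries):
    """Stand-in for expansion: 'docs/' hides two files in this toy example."""
    hidden = {"docs/": ["docs/a.txt", "docs/b.txt"]}
    expanded = []
    for path in entries:
        expanded.extend(hidden.get(path, [path]))
    return expanded


def load_index(repo, on_disk, expand):
    """Expand on load unless the running builtin has declared itself safe."""
    if repo.command_requires_full_index:
        return expand(on_disk)
    return list(on_disk)


# Assumption: the builtins that have been audited so far.
SPARSE_AWARE_BUILTINS = {"status", "add"}


def run_builtin(name, on_disk, expand):
    repo = Repository()
    if name in SPARSE_AWARE_BUILTINS:
        repo.command_requires_full_index = False
    return load_index(repo, on_disk, expand)


on_disk = ["README", "docs/"]
seen_by_status = run_builtin("status", on_disk, toy_expand)
seen_by_stash = run_builtin("stash", on_disk, toy_expand)
```

Commands outside the audited set keep seeing only file entries, so converting one builtin at a time cannot change the behavior of the others.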
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These commands have been proposed as ones whose behavior could change in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. The switch in behavior
+needs to be made clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
