| 1 | +Git Sparse-Index Design Document |
| 2 | +================================ |
| 3 | + |
| 4 | +The sparse-checkout feature allows users to focus a working directory on |
| 5 | +a subset of the files at HEAD. The cone mode patterns, enabled by |
| 6 | +`core.sparseCheckoutCone`, allow for very fast pattern matching to |
| 7 | +discover which files at HEAD belong in the sparse-checkout cone. |
| 8 | + |
| 9 | +Three important scale dimensions for a Git working directory are: |
| 10 | + |
| 11 | +* `HEAD`: How many files are present at `HEAD`? |
| 12 | + |
| 13 | +* Populated: How many files are within the sparse-checkout cone? |
| 14 | + |
| 15 | +* Modified: How many files has the user modified in the working directory? |
| 16 | + |
| 17 | +We will use big-O notation -- O(X) -- to denote how expensive certain |
| 18 | +operations are in terms of these dimensions. |
| 19 | + |
| 20 | +These dimensions are ordered by their magnitude: users (typically) modify |
| 21 | +fewer files than are populated, and we can only populate files at `HEAD`. |
| 22 | + |
| 23 | +Problems occur if there is an extreme imbalance in these dimensions. For |
| 24 | +example, if `HEAD` contains millions of paths but the populated set has |
| 25 | +only tens of thousands, then commands like `git status` and `git add` can |
| 26 | +be dominated by work that scales as O(`HEAD`) instead of O(Populated). |
| 27 | +The cost lies mainly in parsing and rewriting the index, which is |
| 28 | +filled mostly with files at `HEAD` that are marked with the |
| 29 | +`SKIP_WORKTREE` bit. |
| 30 | + |
| 31 | +The sparse-index intends to take these commands that read and modify the |
| 32 | +index from O(`HEAD`) to O(Populated). To do this, we need to modify the |
| 33 | +index format in a significant way: add "sparse directory" entries. |
| 34 | + |
| 35 | +With cone mode patterns, it is possible to detect when an entire |
| 36 | +directory will have its contents outside of the sparse-checkout definition. |
| 37 | +Instead of listing all of the files it contains as individual entries, a |
| 38 | +sparse-index contains an entry with the directory name, referencing the |
| 39 | +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit. |
| 40 | +If we need to discover the details for paths within that directory, we |
| 41 | +can parse trees to find that list. |
| 42 | + |
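As a rough illustration of the sparse-directory idea described above, the following Python sketch collapses a full list of index paths into a sparse one. It is a simplified model and not Git's implementation: the tuple layout, the `outside_cone` and `collapse` helpers, the sample object IDs, and the `tree_oids` lookup table are all assumptions made for this example.

```python
def outside_cone(directory, cone):
    """Return True if `directory` (ending in "/") is neither an ancestor of,
    equal to, nor inside any recursive cone directory."""
    return not any(c.startswith(directory) or directory.startswith(c) for c in cone)


def collapse(entries, cone, tree_oids):
    """Replace every file under a directory that lies wholly outside the cone
    with a single (dir, tree_oid, skip_worktree=True) entry.

    entries   -- sorted list of (path, blob_oid) pairs, one per file at HEAD
    cone      -- set of recursive cone directories, each ending in "/"
    tree_oids -- hypothetical lookup from directory path to its tree OID
    """
    result = []
    emitted = set()
    for path, oid in entries:
        sparse_dir = None
        prefix = ""
        # Walk from the shallowest parent directory downwards and stop at
        # the first one that is entirely outside the cone.
        for part in path.split("/")[:-1]:
            prefix += part + "/"
            if outside_cone(prefix, cone):
                sparse_dir = prefix
                break
        if sparse_dir is None:
            result.append((path, oid, False))  # file stays as its own entry
        elif sparse_dir not in emitted:
            emitted.add(sparse_dir)
            result.append((sparse_dir, tree_oids[sparse_dir], True))
    return result


full = [
    ("README", "b1"),
    ("docs/a.txt", "b2"),
    ("docs/guide/b.txt", "b3"),
    ("src/Makefile", "b4"),
    ("src/app/main.c", "b5"),
    ("src/lib/util.c", "b6"),
]
cone = {"src/lib/"}
tree_oids = {"docs/": "t1", "src/app/": "t2"}
sparse = collapse(full, cone, tree_oids)
```

In this model, six file entries shrink to five entries, two of which stand in for whole directories. The saving grows with the number of files outside the cone, which is the O(`HEAD`) to O(Populated) reduction the text aims for.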
| 43 | +At time of writing, sparse-directory entries violate expectations about the |
| 44 | +index format and its in-memory data structure. There are many consumers in |
| 45 | +the codebase that expect to iterate through all of the index entries and |
| 46 | +see only files. In fact, these loops expect to see a reference to every |
| 47 | +staged file. One way to handle this is to parse trees to replace a |
| 48 | +sparse-directory entry with all of the files within that tree as the index |
| 49 | +is loaded. However, parsing trees is slower than parsing the index format, |
| 50 | +so this expansion costs more than reading a full index directly. The plan |
| 51 | +is to make all of these consumers "sparse aware" so that expansion through |
| 52 | +tree parsing is unnecessary and they use fewer resources than they would |
| 53 | +with a full index. |
| 54 | + |
| 55 | +The implementation plan below follows four phases to gradually integrate with |
| 56 | +the sparse-index. The intention is to incrementally update Git commands to |
| 57 | +interact safely with the sparse-index without significant slowdowns. This |
| 58 | +may not always be possible, but the hope is that the primary commands that |
| 59 | +users need in their daily work are dramatically improved. |
| 60 | + |
| 61 | +Phase I: Format and initial speedups |
| 62 | +------------------------------------ |
| 63 | + |
| 64 | +During this phase, Git learns to enable the sparse-index and safely parse |
| 65 | +one. Protections are put in place so that every consumer of the in-memory |
| 66 | +data structure can operate with its current assumption of every file at |
| 67 | +`HEAD`. |
| 68 | + |
| 69 | +At first, every index parse will call a helper method, |
| 70 | +`ensure_full_index()`, which scans the index for sparse-directory entries |
| 71 | +(pointing to trees) and replaces them with the full list of paths (with |
| 72 | +blob contents) by parsing tree objects. This will be slower in all cases. |
| 73 | +The only noticeable change in behavior will be that the serialized index |
| 74 | +file contains sparse-directory entries. |
| 75 | + |
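The expansion that `ensure_full_index()` is described as performing can be sketched in Python as a recursive tree walk. This is only a model under stated assumptions: the in-memory `trees` dictionary stands in for the object database, the tuple layouts and object IDs are invented for the example, and tree entries are assumed to be stored already in index order.

```python
def walk_tree(prefix, tree_oid, trees):
    """Yield one file entry for every blob reachable from `tree_oid`."""
    for name, kind, oid in trees[tree_oid]:
        if kind == "tree":
            yield from walk_tree(prefix + name + "/", oid, trees)
        else:
            # Files restored from a sparse directory are outside the cone,
            # so they carry the skip-worktree flag.
            yield (prefix + name, oid, True)


def expand(index, trees):
    """Replace each sparse-directory entry (path ending in "/") with the
    full list of files found by parsing its tree."""
    full = []
    for path, oid, skip_worktree in index:
        if path.endswith("/"):
            full.extend(walk_tree(path, oid, trees))
        else:
            full.append((path, oid, skip_worktree))
    return full


# Hypothetical object store: tree OID -> list of (name, kind, oid).
trees = {
    "t1": [("a.txt", "blob", "b2"), ("guide", "tree", "t3")],
    "t3": [("b.txt", "blob", "b3")],
}
sparse_index = [
    ("README", "b1", False),
    ("docs/", "t1", True),
    ("src/lib/util.c", "b6", False),
]
full_index = expand(sparse_index, trees)
```

Every sparse directory costs at least one tree lookup, plus one more per nested subdirectory, which is why the text expects this first step to be slower than reading a full index.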
| 76 | +To start, we use a new required index extension, `sdir`, to allow |
| 77 | +inserting sparse-directory entries into indexes with file format |
| 78 | +versions 2, 3, and 4. This prevents Git versions that do not understand |
| 79 | +the sparse-index from operating on one, while allowing tools that do not |
| 80 | +understand the sparse-index to operate on repositories as long as they do |
| 81 | +not interact with the index. A new format, index v5, will be introduced |
| 82 | +that includes sparse-directory entries by default. It might also |
| 83 | +introduce other features that have been considered for improving the |
| 84 | +index. |
| 85 | + |
| 86 | +Next, consumers of the index will be guarded against operating on a |
| 87 | +sparse-index by inserting calls to `ensure_full_index()` or |
| 88 | +`expand_index_to_path()`. After these guards are in place, we can begin |
| 89 | +leaving sparse-directory entries in the in-memory index structure. |
| 90 | + |
| 91 | +Even after inserting these guards, we will keep expanding sparse-indexes |
| 92 | +for most Git commands using the `command_requires_full_index` repository |
| 93 | +setting. This setting will be on by default and disabled one builtin at a |
| 94 | +time until we have sufficient confidence that all of the index operations |
| 95 | +are properly guarded. |
| 96 | + |
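The per-command opt-out described above might be modeled as follows. This Python sketch is illustrative only: the `Repo` class, `run_builtin`, and `fake_expand` are hypothetical names, and `fake_expand` is a stand-in for the real expansion rather than a description of it. Only the `command_requires_full_index` name and its on-by-default behavior come from the text.

```python
from typing import Callable, List

# Simplified index: just paths, where a trailing "/" marks a sparse directory.
Index = List[str]


class Repo:
    def __init__(self, on_disk: Index, expand: Callable[[Index], Index]):
        self.on_disk = on_disk
        self.expand = expand
        # On by default: a command sees a full index unless it opts out.
        self.command_requires_full_index = True

    def read_index(self) -> Index:
        if self.command_requires_full_index:
            return self.expand(self.on_disk)
        return list(self.on_disk)


def fake_expand(index: Index) -> Index:
    # Hypothetical stand-in for the real expansion: pretend that each
    # sparse directory contains exactly one file named "file".
    return [p + "file" if p.endswith("/") else p for p in index]


def run_builtin(repo: Repo, sparse_aware: bool) -> Index:
    # A builtin that has been audited clears the setting before reading.
    repo.command_requires_full_index = not sparse_aware
    return repo.read_index()


repo = Repo(["README", "docs/", "src/lib/util.c"], fake_expand)
legacy_view = run_builtin(repo, sparse_aware=False)
status_view = run_builtin(repo, sparse_aware=True)
```

Keeping the default conservative means an unaudited command can never observe a sparse-directory entry by accident; it only pays the expansion cost.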
| 97 | +To complete this phase, the commands `git status` and `git add` will be |
| 98 | +integrated with the sparse-index so that they operate with O(Populated) |
| 99 | +performance. They will be carefully tested for operations within and |
| 100 | +outside the sparse-checkout definition. |
| 101 | + |
| 102 | +Phase II: Careful integrations |
| 103 | +------------------------------ |
| 104 | + |
| 105 | +This phase focuses on ensuring that all index extensions and APIs work |
| 106 | +well with a sparse-index. This requires significant increases to our test |
| 107 | +coverage, especially for operations that interact with the working |
| 108 | +directory outside of the sparse-checkout definition. Some of the current |
| 109 | +behaviors may not be desirable, as shown by some tests already marked |
| 110 | +for failure in `t1092-sparse-checkout-compatibility.sh`. |
| 111 | + |
| 112 | +The index extensions that may require special integrations are: |
| 113 | + |
| 114 | +* FS Monitor |
| 115 | +* Untracked cache |
| 116 | + |
| 117 | +While integrating with these features, we should look for patterns that |
| 118 | +might lead to better APIs for interacting with the index. Coalescing |
| 119 | +common usage patterns into an API call can reduce the number of places |
| 120 | +where sparse-directories need to be handled carefully. |
| 121 | + |
| 122 | +Phase III: Important command speedups |
| 123 | +------------------------------------- |
| 124 | + |
| 125 | +At this point, the patterns for testing and implementing sparse-directory |
| 126 | +logic should be relatively stable. This phase focuses on updating some of |
| 127 | +the most common builtins that use the index to operate as O(Populated). |
| 128 | +Here is a potential list of commands that could be valuable to integrate |
| 129 | +at this point: |
| 130 | + |
| 131 | +* `git commit` |
| 132 | +* `git checkout` |
| 133 | +* `git merge` |
| 134 | +* `git rebase` |
| 135 | + |
| 136 | +Hopefully, commands such as `git merge` and `git rebase` can benefit |
| 137 | +instead from merge algorithms that do not use the index as a data |
| 138 | +structure, such as the merge-ORT strategy. As these topics mature, we |
| 139 | +may enable the ORT strategy by default for repositories using the |
| 140 | +sparse-index feature. |
| 141 | + |
| 142 | +Along with `git status` and `git add`, these commands cover the majority |
| 143 | +of users' interactions with the working directory. In addition, we can |
| 144 | +integrate with these commands: |
| 145 | + |
| 146 | +* `git grep` |
| 147 | +* `git rm` |
| 148 | + |
| 149 | +These have been proposed as commands whose behavior could change in a |
| 150 | +repo with a sparse-checkout definition. It would be good to include this |
| 151 | +behavior automatically when using a sparse-index. Care is needed to |
| 152 | +make the behavior switch clear to the user. |
| 153 | + |
| 154 | +This phase is the first where parallel work might be possible without too |
| 155 | +many conflicts between topics. |
| 156 | + |
| 157 | +Phase IV: The long tail |
| 158 | +----------------------- |
| 159 | + |
| 160 | +This last phase is less a "phase" and more "the new normal" after all of |
| 161 | +the previous work. |
| 162 | + |
| 163 | +To start, the `command_requires_full_index` option could be removed in |
| 164 | +favor of expanding only when hitting an API guard. |
| 165 | + |
| 166 | +There are many Git commands that could use special attention to operate as |
| 167 | +O(Populated), while some might be so rare that it is acceptable to leave |
| 168 | +them with additional overhead when a sparse-index is present. |
| 169 | + |
| 170 | +Here are some commands that might be useful to update: |
| 171 | + |
| 172 | +* `git sparse-checkout set` |
| 173 | +* `git am` |
| 174 | +* `git clean` |
| 175 | +* `git stash` |