[RFC] Sparse Index by derrickstolee · Pull Request #847 · gitgitgadget/git

derrickstolee · 2021-01-25T14:47:17Z

This is based on ds/more-index-cleanups also available as GitGitGadget PR #839.

The sparse checkout feature allows users to specify a "populated set" that is smaller than the full list of files at HEAD. Files outside the sparse checkout definition are not present in the working directory, but are still present in the index (and marked with the CE_SKIP_WORKTREE bit).

This means that the working directory has size O(populated), and commands like "git status" or "git checkout" operate using an O(populated) number of filesystem operations. However, parsing the index still operates on the scale of O(HEAD).

This can be particularly jarring if you are merging a small repository with a large monorepo for the purpose of simplifying dependency management. Even if users have nothing more in their working directory than they had before, they suddenly see a significant increase in their "git status" or "git add" times. In these cases, simply parsing the index can be a huge portion of the command time.

This RFC proposes an update to the index formats to allow "sparse directory entries". These entries correspond to directories that are completely excluded from the sparse checkout definition. We can detect that a directory is excluded when using "cone mode" patterns.

Since having directory entries is a radical departure from the existing index format, a new extension "extensions.sparseIndex" is added. Using a sparse index should cause incompatible tools to fail because they do not understand this extension.

The index is a critical data structure, so making such a drastic change must be handled carefully. This RFC does only enough adjustments to demonstrate performance improvements for "git status" and "git add." Other commands should operate identically to before, since the other commands will expand a sparse index into a full index by parsing trees.

WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout tests. I'm not sure what is causing these failures, but I will explore while we discuss the possibility of the feature as a whole.

Here is an outline for this RFC:

Patches 1-14: create and test the sparse index format. This is just enough to start writing the format, but all Git commands become slower for using it. This is because everything is guarded to expand to a full index before actually operating on the cache entries.
Patch 15: This massive patch is actually a bunch of patches squashed together. I have a branch that adds "ensure_full_index()" guards in each necessary file along with some commentary about how the index is being used. This patch is presented here as one big dump because that commentary isn't particularly interesting if the RFC leads to a very different approach.
Patches 16-27: These changes make enough code "sparse aware" such that "git status" and "git add" start operating in time O(populated) instead of O(HEAD).

Performance numbers are given in patch 27, but repeated somewhat here. The test environment I use has ~2.1 million paths at HEAD, but only 68,000 populated paths given the sparse-checkout I'm using. The sparse index has about 2,000 sparse directory entries.

Use the full index. The index size is ~186 MB.
Use the sparse index. The index size is ~5.5 MB.
Use a commit where HEAD matches the populated set. The full index
size is ~5.3MB.

The third benchmark is included as a theoretical optimum for a repository of the same object database.

First, a clean 'git status' improves from 3.1s to 240ms.

Benchmark #1: full index (git status)
Time (mean ± σ): 3.167 s ± 0.036 s [User: 2.006 s, System: 1.078 s]
Range (min … max): 3.100 s … 3.208 s 10 runs

Benchmark #2: sparse index (git status)
Time (mean ± σ): 239.5 ms ± 8.1 ms [User: 189.4 ms, System: 226.8 ms]
Range (min … max): 226.0 ms … 251.9 ms 13 runs

Benchmark #3: small tree (git status)
Time (mean ± σ): 195.3 ms ± 4.5 ms [User: 116.5 ms, System: 84.4 ms]
Range (min … max): 188.8 ms … 202.8 ms 15 runs

The performance numbers for 'git add .' are much closer to optimal:

Benchmark #1: full index (git add .)
Time (mean ± σ): 3.076 s ± 0.022 s [User: 2.065 s, System: 0.943 s]
Range (min … max): 3.044 s … 3.116 s 10 runs

Benchmark #2: sparse index (git add .)
Time (mean ± σ): 218.0 ms ± 6.6 ms [User: 195.7 ms, System: 206.6 ms]
Range (min … max): 209.8 ms … 228.2 ms 13 runs

Benchmark #3: small tree (git add .)
Time (mean ± σ): 217.6 ms ± 5.4 ms [User: 131.9 ms, System: 86.7 ms]
Range (min … max): 212.1 ms … 228.4 ms 14 runs

I expect that making a sparse index work optimally through the most common Git commands will take a year of effort. During this process, I expect to add a lot of testing infrastructure around the sparse-checkout feature, especially in corner cases. (This RFC focuses on the happy paths of operating only within the sparse cone, but that will change in the future.)

If this general approach is acceptable, then I would follow it with a sequence of patch submissions that follow this approach:

Basics of the format. (Patches 1-14)
Add additional guards around index interactions (Patch 15, but split appropriately.)
Speed up "git status" and "git add" (Patches 16-27)

After those three items that are represented in this RFC, the work starts to parallelize a bit. My basic ideas for moving forward from this point are to do these basic steps:

Add new index API abstractions where appropriate, make them sparse-aware.
Add new tests around sparse-checkout corner cases. Ensure the sparse index works correctly.
For a given builtin, add extra testing for sparse-checkouts then it them sparse-aware.

Here are some specific questions I'm hoping to answer in this RFC period:

Are these sparse directory entries an appropriate way to extend the index format?
Is extensions.sparseIndex a good way to signal that these entries can exist?
Is git sparse-checkout init --cone --sparse-index an appropriate way to toggle the format?
Are there specific areas that I should target to harden the index API before I submit this work?
Does anyone have a good idea how to test a large portion of the test suite with sparse-index enabled? The problem I see is that most tests don't use sparse-checkout, so the sparse index is identical to the full index. Would it be interesting to enable the test setup to add such "extra" directories during the test setup?

Thanks,
-Stolee

Cc: gitster@pobox.com
Cc: newren@gmail.com
Cc: peff@peff.net
Cc: jrnieder@gmail.com
Cc: sunshine@sunshineco.com
Cc: pclouds@gmail.com
cc: Derrick Stolee stolee@gmail.com

Upcoming changes will introduce modifications to the index format that allow sparse directories. It will be useful to have a mechanism for converting those sparse index files into full indexes by walking the tree at those sparse directories. Name this method ensure_full_index() as it will guarantee that the index is fully expanded. This method is not implemented yet, and instead we focus on the scaffolding to declare it and call it at the appropriate time. Add a 'command_requires_full_index' member to struct repo_settings. This will be an indicator that we need the index in full mode to do certain index operations. This starts as being true for every command, then we will set it to false as some commands integrate with sparse indexes. If 'command_requires_full_index' is true, then we will immediately expand a sparse index to a full one upon reading from disk. This suffices for now, but we will want to add more callers to ensure_full_index() later. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries with the sparse_index bit. These currently cannot exist, but we will add a mechanism for collapsing a full index to a sparse one in a later change. That will happen at write time, so we must first allow parsing the format before writing it. Commands or methods that require a full index in order to operate can call ensure_full_index() to expand that index in-memory. This requires parsing trees using that index's repository. Sparse directory entries have a specific 'ce_mode' value. The macro S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This ce_mode is not possible with the existing index formats, so we don't also verify all properties of a sparse-directory entry, which are: 1. ce->ce_mode == 01000755 2. ce->flags & CE_SKIP_WORKTREE is true 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) 4. ce->oid references a tree object. These are all semi-enforced in ensure_full_index() to some extent. Any deviation will cause a warning at minimum or a failure in the worst case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also add run_on_sparse and test_sparse_match helpers. These helpers will be used when the sparse index is implemented. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is being written correctly, especially as we build and test the sparse-index. To make the option parsing slightly more robust, wrap the string comparisons in a loop adapted from test-dir-iterator.c. Care must be taken with the final check for the 'cnt' variable. We continue the expectation that the numerical value is the final argument. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The 'test-tool read-cache --table' output is helpful to understand the full contents of the index entries on-disk. This is particularly helpful when trying to diagnose issues with a real repository example. However, for test cases we might want to compare the index contents of two repositories that were updated in similar ways, but will not actually share the same stat data. Add the '--no-stat' option to remove the timestamps and other stat data from the output. This allows us to compare index contents directly. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse index is written as part of init_repos. Since we will no longer always expand a sparse index into a full index, add an '--expand' parameter that adds a call to ensure_full_index() so we can compare a sparse index directly against a full index, or at least what the in-memory index looks like when expanded in this way. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write time. The existing logic provides a way for every sparse index to be expanded to a full index at read time. However, there are cases where an index is written and then continues to be used in-memory to perform further updates. unpack_trees() is frequently called after such a write. In particular, commands like 'git reset' do this double-update of the index. Ensure that we have a full index when entering unpack_trees(), but only when command_requires_full_index is true. This is always true at the moment, but we will later relax that after unpack_trees() is updated to handle sparse directory entries. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations on a pattern_list that only exists in-memory. This allows easy backing out in case the index update fails. However, if the index write itself cares about the sparse-checkout pattern set, we need access to that in-memory copy. Place a pointer to a 'struct pattern_list' in the index so we can access this on-demand. This will be used in the next change which uses the sparse-checkout definition to filter out directories that are outsie the sparse cone. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by replacing directories outside of the sparse cone with sparse directory entries. The convert_to_sparse() method does this, when the situation is appropriate. For now, we avoid converting the index to a sparse index if: 1. the index is split. 2. the index is already sparse. 3. sparse-checkout is disabled. 4. sparse-checkout does not use cone mode. Finally, we currently limit the conversion to when the GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git config will be added in a later change. The trickiest thing about this conversion is that we might not be able to mark a directory as a sparse directory just because it is outside the sparse cone. There might be unmerged files within that directory, so we need to look for those. Also, if there is some strange reason why a file is not marked with CE_SKIP_WORKTREE, then we should give up on converting that directory. There is still hope that some of its subdirectories might be able to convert to sparse, so we keep looking deeper. The conversion process is assisted by the cache-tree extension. This is calculated from the full index if it does not already exist. We then abandon the cache-tree as it no longer applies to the newly-sparse index. Thus, this cache-tree will be recalculated in every sparse-full-sparse round-trip until we integrate the cache-tree extension with the sparse index. We can compare the behavior of the sparse-index in t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1 when operating on the 'sparse-index' repo. We can also compare the two sparse repos directly, such as comparing their indexes (when expanded to full in the case of the 'sparse-index' repo). We also verify that the index is actually populated with sparse directory entries. The 'checkout and reset (mixed)' test is marked for failure when comparing a sparse repo to a full repo, but we can compare the two sparse-checkout cases directly to ensure that we are not changing the behavior when using a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit within a submodule. Submodules are populated or not depending on submodule configuration, not sparse-checkout. To ensure that the sparse-index feature integrates correctly with submodules, we should not collapse a directory if there is a Git link within its range. This allows us to remove ensure_full_index() from die_path_inside_submodule() because a sparse-index will not remove the entries for Git links. The loop already 'continue's if the cache entry is not a Git link. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() currently throws a BUG() when a directory entry exists exactly in the index. We need to consider that it is possible to have a directory in a sparse index as long as that entry is itself marked with the skip-worktree bit. The negation of the 'pos' variable must be conditioned to only when it starts as negative. This is identical behavior as before when the index is full. The starts_with() condition matches because our name.buf terminates with a directory separator, just like our sparse directory entries. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly expanding a sparse index to a full one, then converting back to sparse when writing the index. As we integrate more Git commands with the sparse index, we will convert these commands to check that we do _not_ convert the sparse index to a full index and instead stay sparse the entire time. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to actually select this mode. Further, sparse directory entries are not understood by the index formats as advertised. We _could_ add a new index version that explicitly adds these capabilities, but there are nuances to index formats 2, 3, and 4 that are still valuable to select as options. For now, create a repo extension, "extensions.sparseIndex", that specifies that the tool reading this repository must understand sparse directory entries. This change only encodes the extension and enables it when GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI mechanism. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1. Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that specifies if the sparse index should be used. It also updates the index to use the correct format, either way. Add a warning in the documentation that the use of a repository extension might reduce compatibility with third-party tools. 'git sparse-checkout init' already sets extension.worktreeConfig, which places most sparse-checkout users outside of the scope of most third-party tools. Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of GIT_TEST_SPARSE_INDEX=1. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

This giant patch is not intended for actual review. I have a branch that has these changes split out in a sane way with some commentary in each file that is modified. The idea here is to guard certain portions of the codebase that do not know how to handle sparse indexes by ensuring that the index is expanded to a full index before proceeding with the logic. This also provides a good mechanism for testing which code needs updating to enable the sparse index in a Git builtin. The builtin can set the_repository->settings.command_requires_full_index to zero and then we can debug the command with a breakpoint on ensure_full_index(). That identifies the portion of code that needs adjusting before enabling sparse indexes for that command. Some index operations must be changed to operate on a non-const pointer, since ensuring a full index will modify the index itself. There are likely some gaps to these protections, which is why it will be important to carefully test each scenario as we relax the requirements. I expect that to be a long effort. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

As a first step to integrate 'git status' and 'git add' with the sparse index, we must start integrating unpack_trees() with sparse directory entries. These changes are currently impossible to trigger because unpack_trees() calls ensure_full_index() if command_requires_full_index is true. This is the case for all commands at the moment. As we expand more commands to be sparse-aware, we might find that more changes are required to unpack_trees(). The current changes will suffice for 'status' and 'add'. unpack_trees() calls the traverse_trees() API using unpack_callback() to decide if we should recurse into a subtree. We must add new abilities to skip a subtree if it corresponds to a sparse directory entry. It is important to be careful about the trailing directory separator that exists in the sparse directory entries but not in the subtree paths. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

When we have sparse directory entries in the index, we want to compare that directory against sparse-checkout patterns. Those pattern matching algorithms are built expecting a file path, not a directory path. This is especially important in the "cone mode" patterns which will match files that exist within the "parent directories" as well as the recursive directory matches. If path_matches_pattern_list() is given a directory, we can add a bogus filename ("-") to the directory and get the same results as before, assuming we are in cone mode. Since sparse index requires cone mode patterns, this is an acceptable assumption. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

By testing 'git -c core.fsmonitor= status -uno', we can check for the simplest index operations that can be made sparse-aware. The necessary implementation details are already integrated with sparse-checkout, so modify command_requires_full_index to be zero for cmd_status(). By running the debugger for 'git status -uno' after that change, we find two instances of ensure_full_index() that were added for extra safety, but can be removed without issue. In refresh_index(), we loop through the index entries. The refresh_cache_ent() method copies the sparse directories into the refreshed index without issue. The loop within run_diff_files() skips things that are in stage 0 and have skip-worktree enabled, so seems safe to disable ensure_full_index() here. While this change avoids calling ensure_full_index(), it actually slows 'git status' because we do not have the cache-tree extension to help us. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

'git status' began reporting a percentage of populated paths when sparse-checkout is enabled in 051df3c (wt-status: show sparse checkout status as well, 2020-07-18). This percentage is incorrect when the index has sparse directories. It would also be expensive to calculate as we would need to parse trees to count the total number of possible paths. Avoid the expensive computation by simplifying the output to only report that a sparse checkout exists, without the percentage. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Before we check if a specific file or directory exists in the index, it would be good to see if a leading directory is a sparse-directory. If so, we will want to expand the index _just enough_ to be sure that the paths we are interested in are in the index. The actually-interesting implementation will follow in a later change. For now, simply call ensure_full_index() to expand every directory simultaneously. Calls like index_dir_exists(), adjust_dirname_case(), and index_file_exists() in name-hash.c can trust the name hash if the index was properly expanded for the requested names. These methods can transition from ensure_full_index() to expand_to_path(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

We need to check the file hashmap first, then look to see if the directory signals a non-sparse directory entry. In such a case, we can rely on the contents of the sparse-index. We still use ensure_full_index() in the case that we hit a path that is within a sparse-directory entry. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Replace enough callers to ensure_full_index() to instead call expand_to_path() to reduce how often 'git add' expands a sparse index in memory (before writing a sparse index again). One non-obvious case is index_name_pos_also_unmerged() which is only hit on the Windows platform (in my tests). Use expand_to_path() instead of ensure_full_index(). Add a test to check that 'git add -A' and 'git add <file>' does not expand the index at all, as long as <file> is not within a sparse directory. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Since we already do not collapse a sparse directory if it contains a submodule, we don't need to expand to a full index in die_path_inside_submodule(). A simple scan of the index entries is sufficient. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The add_patterns() method has a way to extract a pattern file from the index. If this pattern file is sparse and within a sparse directory entry, then we need to expand the index before looking for that entry as a file path. For now, convert ensure_full_index() into expand_to_path() to only expand this way when absolutely necessary. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The FS Monitor feature uses a bitmap over the index entries. This currently interacts poorly with a sparse index. We will revisit this interaction in the future, but for now protect the index by refusing to use the FS Monitor extension at all if the index is sparse. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The add_pathspec_matches_against_index() focuses on matching a pathspec to file entries in the index. It is possible that this already works correctly for its only use: checking if untracked files exist in the index. It is likely that this causes a behavior issue when adding a directory that exists at HEAD but is outside the sparse cone. I'm marking this as a place to pursue with future tests. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes. However, the cache-tree is an important performance feature for commands like 'git status' and 'git add'. Integrate it with sparse directory entries. When writing a sparse index, completely clear and recalculate the cache tree. By starting from scratch, the only integration necessary is to check if we hit a sparse directory entry and create a leaf of the cache-tree that has an entry_count of one and no subtrees. Once the cache-tree exists within a sparse index, we finally get improved performance. I test the sparse index performance using a private monorepo with over 2.1 million files at HEAD, but with a sparse-checkout definition that has only 68,000 paths in the populated cone. The sparse index has about 2,000 sparse directory entries. I compare three scenarios: 1. Use the full index. The index size is ~186 MB. 2. Use the sparse index. The index size is ~5.5 MB. 3. Use a commit where HEAD matches the populated set. The full index size is ~5.3MB. The third benchmark is included as a theoretical optimium for a repository of the same object database. First, a clean 'git status' improves from 3.1s to 240ms. Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10 runs Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13 runs Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15 runs The optimimum is still 45ms faster. This is due in part to the 2,000+ sparse directory entries, but there might be other optimizations to make in the sparse-index case. In particular, I find that this performance difference disappears when I disable FS Monitor, which is somewhat disabled in the sparse-index case, but might still be adding overhead. The performance numbers for 'git add .' are much closer to optimal: Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10 runs Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13 runs Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14 runs In this test, I also used "echo >>README.md" to append a line to the README.md file, so the 'git add .' command is doing _something_ other than a no-op. Without this edit (and FS Monitor enabled) the small tree case again gains about 30ms on the sparse index case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

derrickstolee · 2021-01-25T17:41:14Z

/submit

gitgitgadget · 2021-01-25T17:43:15Z

Submitted as pull.847.git.1611596533.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1

To fetch this version to local tag pr-847/derrickstolee/sparse-index/rfc-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-847/derrickstolee/sparse-index/rfc-v1

gitgitgadget · 2021-01-25T20:13:15Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.

Yay.

> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.

Safety is good, but because the index is purely a local matter, we
do not have to be so careful as updating the network protocols or
pack/object formats.

I think the use of "extensions.*" mechanism to render the repository
that uses the new feature unusable by older Git is safe enough, but
it may be too draconian.  For example, when things go wrong, don't
you want to "fetch"/"clone" from it into another repository to first
save the objects and refs?  You do not need a version of the index
file you understand in order to do that.

The index format has a mechanism to make older versions of Git bail
when it encounters a file that uses newer feature that they do not
understand.  Perhaps using it is sufficient instead?

gitgitgadget · 2021-01-25T21:31:28Z

On the Git mailing list, Derrick Stolee wrote (reply to this):



On 1/25/21 3:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This RFC proposes an update to the index formats to allow "sparse directory
>> entries". These entries correspond to directories that are completely
>> excluded from the sparse checkout definition. We can detect that a directory
>> is excluded when using "cone mode" patterns.
> 
> Yay.
> 
>> Since having directory entries is a radical departure from the existing
>> index format, a new extension "extensions.sparseIndex" is added. Using a
>> sparse index should cause incompatible tools to fail because they do not
>> understand this extension.
> 
> Safety is good, but because the index is purely a local matter, we
> do not have to be so careful as updating the network protocols or
> pack/object formats.
> 
> I think the use of "extensions.*" mechanism to render the repository
> that uses the new feature unusable by older Git is safe enough, but
> it may be too draconian.  For example, when things go wrong, don't
> you want to "fetch"/"clone" from it into another repository to first
> save the objects and refs?  You do not need a version of the index
> file you understand in order to do that.
> 
> The index format has a mechanism to make older versions of Git bail
> when it encounters a file that uses newer feature that they do not
> understand.  Perhaps using it is sufficient instead?

There are interesting subtleties with the differences between index
formats 2, 3, and 4 that are worth keeping around. Perhaps the
extension could be a mechanism for allowing sparse directories in
those versions, but then a future "index version 5" includes sparse
directories without the extension.

I could spend time working on such an index v5 in parallel with
the updates to Git commands to make them sparse aware. The logic
from patches 1-14 in this series will be required before that could
begin.

Thanks,
-Stolee

gitgitgadget · 2021-01-25T21:31:31Z

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

gitgitgadget · 2021-01-27T04:02:58Z

cache.h

@@ -204,6 +204,10 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > We will mark an in-memory index_state as having sparse directory entries > with the sparse_index bit. These currently cannot exist, but we will add > a mechanism for collapsing a full index to a sparse one in a later > change. That will happen at write time, so we must first allow parsing > the format before writing it. > > Commands or methods that require a full index in order to operate can > call ensure_full_index() to expand that index in-memory. This requires > parsing trees using that index's repository. > > Sparse directory entries have a specific 'ce_mode' value. The macro > S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This > ce_mode is not possible with the existing index formats, so we don't > also verify all properties of a sparse-directory entry, which are: > > 1. ce->ce_mode == 01000755 This is a weird number. What's the reason for choosing it? It looks deceptively close to 0100755, normal executable files, but has the extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no file type. Since it's a directory, why not use S_IFDIR (040000)? (GITLINK does use the weird 0160000 value, but it happens to be S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a directory and a symlink") > 2. ce->flags & CE_SKIP_WORKTREE is true Makes sense. > 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) Is there a particular reason for this? I'm used to seeing names without the trailing slash, both in the index and in tree objects. I don't know enough to be for or against this idea; just curious at this point. > 4. ce->oid references a tree object. Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000? Also, as a bit of a side comment: There have been other requests in the past to support directory objects in the index. The only use I remember for them requested from others was to allow tracking empty directories. However, I've long wanted to introduce a new "blobtree" object to git, so that a user can "git add" some big binary file, but internally git splits the binary file up and stores it as multiple blobs within git plus a new "blobtree" object that references all the individual blobs with enough information about how to stitch all the blobs together to get the original file. I had a few different forms of "blobtree" things that I was interested in. I think brian once suggested some similar idea. > These are all semi-enforced in ensure_full_index() to some extent. Any > deviation will cause a warning at minimum or a failure in the worst > case. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > cache.h | 11 +++++- > read-cache.c | 9 +++++ > sparse-index.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++- > sparse-index.h | 1 + > 4 files changed, 113 insertions(+), 2 deletions(-) > > diff --git a/cache.h b/cache.h > index f9c7a603841..884046ca5b8 100644 > --- a/cache.h > +++ b/cache.h > @@ -204,6 +204,10 @@ struct cache_entry { > #error "CE_EXTENDED_FLAGS out of range" > #endif > > +#define CE_MODE_SPARSE_DIRECTORY 01000755 > +#define SPARSE_DIR_MODE 0100 Another magic value. Feels like the commit message should reference this one and why it was picked. Seems odd to me, and possibly problematic to re-use file permission bits that might collide with files recorded by really old versions of git. Maybe that's not a concern, though. > +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) Should the special sauce apply to ce_flags rather than ce_mode? Thus, instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar to CE_SKIP_WORKTREE)? Or, alternatively, do we need a single special state here? Could we check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)? > + > /* Forward structure decls */ > struct pathspec; > struct child_process; > @@ -249,6 +253,8 @@ static inline unsigned int create_ce_mode(unsigned int mode) > { > if (S_ISLNK(mode)) > return S_IFLNK; > + if (mode == SPARSE_DIR_MODE) > + return CE_MODE_SPARSE_DIRECTORY; > if (S_ISDIR(mode) || S_ISGITLINK(mode)) > return S_IFGITLINK; > return S_IFREG | ce_permissions(mode); > @@ -319,7 +325,8 @@ struct index_state { > drop_cache_tree : 1, > updated_workdir : 1, > updated_skipworktree : 1, > - fsmonitor_has_run_once : 1; > + fsmonitor_has_run_once : 1, > + sparse_index : 1; > struct hashmap name_hash; > struct hashmap dir_hash; > struct object_id oid; > @@ -721,6 +728,8 @@ int read_index_from(struct index_state *, const char *path, > const char *gitdir); > int is_index_unborn(struct index_state *); > > +void ensure_full_index(struct index_state *istate); > + > /* For use with `write_locked_index()`. */ > #define COMMIT_LOCK (1 << 0) > #define SKIP_IF_UNCHANGED (1 << 1) > diff --git a/read-cache.c b/read-cache.c > index ecf6f689940..1097ecbf132 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -101,6 +101,9 @@ static const char *alternate_index_output; > > static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) > { > + if (S_ISSPARSEDIR(ce)) > + istate->sparse_index = 1; > + > istate->cache[nr] = ce; > add_name_hash(istate, ce); > } > @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) > trace2_data_intmax("index", the_repository, "read/cache_nr", > istate->cache_nr); > > + if (!istate->repo) > + istate->repo = the_repository; > + prepare_repo_settings(istate->repo); > + if (istate->repo->settings.command_requires_full_index) > + ensure_full_index(istate); > + > return istate->cache_nr; > > unmap: > diff --git a/sparse-index.c b/sparse-index.c > index 82183ead563..1e70244dc13 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -1,8 +1,100 @@ > #include "cache.h" > #include "repository.h" > #include "sparse-index.h" > +#include "tree.h" > +#include "pathspec.h" > +#include "trace2.h" > + > +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) > +{ > + ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc); > + > + istate->cache[nr] = ce; > + add_name_hash(istate, ce); > +} > + > +static int add_path_to_index(const struct object_id *oid, > + struct strbuf *base, const char *path, > + unsigned int mode, int stage, void *context) > +{ > + struct index_state *istate = (struct index_state *)context; > + struct cache_entry *ce; > + size_t len = base->len; > + > + if (S_ISDIR(mode)) > + return READ_TREE_RECURSIVE; > + > + strbuf_addstr(base, path); > + > + ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0); > + ce->ce_flags |= CE_SKIP_WORKTREE; > + set_index_entry(istate, istate->cache_nr++, ce); > + > + strbuf_setlen(base, len); > + return 0; > +} > > void ensure_full_index(struct index_state *istate) > { > - /* intentionally left blank */ > + int i; > + struct index_state *full; > + > + if (!istate || !istate->sparse_index) > + return; > + > + if (!istate->repo) > + istate->repo = the_repository; > + > + trace2_region_enter("index", "ensure_full_index", istate->repo); > + > + /* initialize basics of new index */ > + full = xcalloc(1, sizeof(struct index_state)); > + memcpy(full, istate, sizeof(struct index_state)); > + > + /* then change the necessary things */ > + full->sparse_index = 0; > + full->cache_alloc = (3 * istate->cache_alloc) / 2; > + full->cache_nr = 0; > + ALLOC_ARRAY(full->cache, full->cache_alloc); > + > + for (i = 0; i < istate->cache_nr; i++) { > + struct cache_entry *ce = istate->cache[i]; > + struct tree *tree; > + struct pathspec ps; > + > + if (!S_ISSPARSEDIR(ce)) { > + set_index_entry(full, full->cache_nr++, ce); > + continue; > + } > + if (!(ce->ce_flags & CE_SKIP_WORKTREE)) > + warning(_("index entry is a directory, but not sparse (%08x)"), > + ce->ce_flags); > + > + /* recursively walk into cd->name */ > + tree = lookup_tree(istate->repo, &ce->oid); > + > + memset(&ps, 0, sizeof(ps)); > + ps.recursive = 1; > + ps.has_wildcard = 1; > + ps.max_depth = -1; > + > + read_tree_recursive(istate->repo, tree, > + ce->name, strlen(ce->name), > + 0, &ps, > + add_path_to_index, full); > + > + /* free directory entries. full entries are re-used */ > + discard_cache_entry(ce); > + } > + > + /* Copy back into original index. */ > + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); > + istate->sparse_index = 0; > + istate->cache = full->cache; Haven't you leaked the original istate->cache here? > + istate->cache_nr = full->cache_nr; > + istate->cache_alloc = full->cache_alloc; > + > + free(full); > + > + trace2_region_leave("index", "ensure_full_index", istate->repo); > } > diff --git a/sparse-index.h b/sparse-index.h > index 8dda92032e2..a2777dcac59 100644 > --- a/sparse-index.h > +++ b/sparse-index.h > @@ -3,5 +3,6 @@ > > struct index_state; > void ensure_full_index(struct index_state *istate); > +int convert_to_sparse(struct index_state *istate); Seems logically that you'd add one, but was this meant to be included in a later patch? > > #endif > \ No newline at end of file > -- > gitgitgadget >

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2021 10:05 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: ... >> Sparse directory entries have a specific 'ce_mode' value. The macro >> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This >> ce_mode is not possible with the existing index formats, so we don't >> also verify all properties of a sparse-directory entry, which are: >> >> 1. ce->ce_mode == 01000755 > > This is a weird number. What's the reason for choosing it? It looks > deceptively close to 0100755, normal executable files, but has the > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no > file type. > > Since it's a directory, why not use S_IFDIR (040000)? > > (GITLINK does use the weird 0160000 value, but it happens to be > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a > directory and a symlink") I forget how exactly I came up with these magic constants, but then completely forgot to think of them critically because I haven't had to look at them in a while. They _are_ important, especially because these values affect the file format itself. I'll think harder on this before submitting a series intended for merging. >> 2. ce->flags & CE_SKIP_WORKTREE is true > > Makes sense. > >> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) > > Is there a particular reason for this? I'm used to seeing names > without the trailing slash, both in the index and in tree objects. I > don't know enough to be for or against this idea; just curious at this > point. It's yet another way to distinguish directories from files, but there are cases where we do string searches up to a prefix, and having these directory separators did help, IIRC. >> 4. ce->oid references a tree object. > > Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000? ... >> +#define CE_MODE_SPARSE_DIRECTORY 01000755 >> +#define SPARSE_DIR_MODE 0100 > > Another magic value. Feels like the commit message should reference > this one and why it was picked. Seems odd to me, and possibly > problematic to re-use file permission bits that might collide with > files recorded by really old versions of git. Maybe that's not a > concern, though. > >> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) > > Should the special sauce apply to ce_flags rather than ce_mode? Thus, > instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro > (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar > to CE_SKIP_WORKTREE)? > > Or, alternatively, do we need a single special state here? Could we > check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)? The intention was that ce_mode be a unique value that could only be assigned to a directory entry, which would then by necessity be sparse. Checking both ce_mode and ce_flags seemed wasteful with the given assumptions ... >> + /* Copy back into original index. */ >> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); >> + istate->sparse_index = 0; >> + istate->cache = full->cache; > > Haven't you leaked the original istate->cache here? Yes, seems so. Will fix. Thanks, -Stolee

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Jan 27, 2021 at 5:43 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 1/26/2021 10:05 PM, Elijah Newren wrote: > > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > ... > >> Sparse directory entries have a specific 'ce_mode' value. The macro > >> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This > >> ce_mode is not possible with the existing index formats, so we don't > >> also verify all properties of a sparse-directory entry, which are: > >> > >> 1. ce->ce_mode == 01000755 > > > > This is a weird number. What's the reason for choosing it? It looks > > deceptively close to 0100755, normal executable files, but has the > > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no > > file type. > > > > Since it's a directory, why not use S_IFDIR (040000)? > > > > (GITLINK does use the weird 0160000 value, but it happens to be > > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a > > directory and a symlink") > > I forget how exactly I came up with these magic constants, but then > completely forgot to think of them critically because I haven't had > to look at them in a while. They _are_ important, especially because > these values affect the file format itself. > > I'll think harder on this before submitting a series intended for > merging. > > >> 2. ce->flags & CE_SKIP_WORKTREE is true > > > > Makes sense. > > > >> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) > > > > Is there a particular reason for this? I'm used to seeing names > > without the trailing slash, both in the index and in tree objects. I > > don't know enough to be for or against this idea; just curious at this > > point. > > It's yet another way to distinguish directories from files, but > there are cases where we do string searches up to a prefix, and > having these directory separators did help, IIRC. > > >> 4. ce->oid references a tree object. > > > > Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000? > > ... > > >> +#define CE_MODE_SPARSE_DIRECTORY 01000755 > >> +#define SPARSE_DIR_MODE 0100 > > > > Another magic value. Feels like the commit message should reference > > this one and why it was picked. Seems odd to me, and possibly > > problematic to re-use file permission bits that might collide with > > files recorded by really old versions of git. Maybe that's not a > > concern, though. > > > >> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) > > > > Should the special sauce apply to ce_flags rather than ce_mode? Thus, > > instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro > > (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar > > to CE_SKIP_WORKTREE)? > > > > Or, alternatively, do we need a single special state here? Could we > > check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)? > > The intention was that ce_mode be a unique value that could only > be assigned to a directory entry, which would then by necessity be > sparse. Checking both ce_mode and ce_flags seemed wasteful with the > given assumptions 040000 is a unique value that could only be assigned to a directory entry. Since we have no other uses of directories within the index, you are right, we wouldn't need to check ce_skip_worktree(ce) as well; just a check for the 040000 mode would be enough. > ... > > >> + /* Copy back into original index. */ > >> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); > >> + istate->sparse_index = 0; > >> + istate->cache = full->cache; > > > > Haven't you leaked the original istate->cache here? > > Yes, seems so. Will fix. > > Thanks, > -Stolee

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes: >> 1. ce->ce_mode == 01000755 > > This is a weird number. What's the reason for choosing it? It looks > deceptively close to 0100755, normal executable files, but has the > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no > file type. > > Since it's a directory, why not use S_IFDIR (040000)? > > (GITLINK does use the weird 0160000 value, but it happens to be > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a > directory and a symlink") Yes, that combination of IFLNK/IFDIR was the reason why we use the value. I tend to think IFDIR is the best thing to use here.

gitgitgadget · 2021-01-27T05:45:17Z

t/helper/test-read-cache.c

@@ -2,18 +2,55 @@
 #include "cache.h"


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > This table is helpful for discovering data in the index to ensure it is > being written correctly, especially as we build and test the > sparse-index. > > To make the option parsing slightly more robust, wrap the string > comparisons in a loop adapted from test-dir-iterator.c. > > Care must be taken with the final check for the 'cnt' variable. We > continue the expectation that the numerical value is the final argument. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > t/helper/test-read-cache.c | 49 ++++++++++++++++++++++++++++++++++---- > 1 file changed, 44 insertions(+), 5 deletions(-) > > diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c > index 244977a29bd..cd7d106a675 100644 > --- a/t/helper/test-read-cache.c > +++ b/t/helper/test-read-cache.c > @@ -2,18 +2,55 @@ > #include "cache.h" > #include "config.h" > > +static void print_cache_entry(struct cache_entry *ce) > +{ > + /* stat info */ > + printf("%08x %08x %08x %08x %08x %08x ", > + ce->ce_stat_data.sd_ctime.sec, > + ce->ce_stat_data.sd_ctime.nsec, > + ce->ce_stat_data.sd_mtime.sec, > + ce->ce_stat_data.sd_mtime.nsec, > + ce->ce_stat_data.sd_dev, > + ce->ce_stat_data.sd_ino); Printing sec & nsec in hexidecimal? Why? Also, if they'll be displayed in hex, do you want to format them as 0x%08x, similar to what you do with binary below? > + > + /* mode in binary */ This comment feels misleading; I think this is the "S_IFMT portion of mode in binary" not "mode in binary". > + printf("0b%d%d%d%d ", > + (ce->ce_mode >> 15) & 1, > + (ce->ce_mode >> 14) & 1, > + (ce->ce_mode >> 13) & 1, > + (ce->ce_mode >> 12) & 1); Why binary? Also, since you defined a special magic constant of 01000755 which utilizes bit 18; how come you aren't including any bits higher than 15? > + /* output permissions? */ > + printf("%04o ", ce->ce_mode & 01777); 01777 instead of 07777 just because we don't have anything using the setuid or setgid bits? But if it's based on non-use, then we don't use the sticky bit (01000) either, so this could be just 0777. Also, if you're using 0b for binary to distinguish and you're clearly using multiple bases in this code, perhaps use a print format of 0o%04o (or 0o%03o if you only use a mask of 0777). > + printf("%s ", oid_to_hex(&ce->oid)); > + > + printf("%s\n", ce->name); > +} > + > +static void print_cache(struct index_state *cache) > +{ > + int i; > + for (i = 0; i < the_index.cache_nr; i++) > + print_cache_entry(the_index.cache[i]); > +} > + > int cmd__read_cache(int argc, const char **argv) > { > int i, cnt = 1; > const char *name = NULL; > + int table = 0; > > - if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) { > - argc--; > - argv++; > + for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) { > + if (skip_prefix(*argv, "--print-and-refresh=", &name)) > + continue; > + if (!strcmp(*argv, "--table")) { > + table = 1; > + } > } > > - if (argc == 2) > - cnt = strtol(argv[1], NULL, 0); > + if (argc == 1) > + cnt = strtol(argv[0], NULL, 0); > setup_git_directory(); > git_config(git_default_config, NULL); > for (i = 0; i < cnt; i++) { > @@ -30,6 +67,8 @@ int cmd__read_cache(int argc, const char **argv) > ce_uptodate(the_index.cache[pos]) ? "" : " not"); > write_file(name, "%d\n", i); > } > + if (table) > + print_cache(&the_index); > discard_cache(); > } > return 0; > -- > gitgitgadget >

gitgitgadget · 2021-01-27T05:45:19Z

unpack-trees.c

@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
 */


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > The next change will translate full indexes into sparse indexes at write > time. The existing logic provides a way for every sparse index to be > expanded to a full index at read time. However, there are cases where an > index is written and then continues to be used in-memory to perform > further updates. > > unpack_trees() is frequently called after such a write. In particular, > commands like 'git reset' do this double-update of the index. > > Ensure that we have a full index when entering unpack_trees(), but only > when command_requires_full_index is true. This is always true at the > moment, but we will later relax that after unpack_trees() is updated to > handle sparse directory entries. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > unpack-trees.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/unpack-trees.c b/unpack-trees.c > index f5f668f532d..4dd99219073 100644 > --- a/unpack-trees.c > +++ b/unpack-trees.c > @@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *, > */ > int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o) > { > + struct repository *repo = the_repository; > int i, ret; > static struct cache_entry *dfc; > struct pattern_list pl; > @@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options > trace_performance_enter(); > trace2_region_enter("unpack_trees", "unpack_trees", the_repository); > > + prepare_repo_settings(repo); > + if (repo->settings.command_requires_full_index) { > + ensure_full_index(o->src_index); > + ensure_full_index(o->dst_index); I was worried about o->result as well, since there is a memset(&o->result, 0, sizeof(o->result)); followed by manually initializing the relevant fields of the index_state. However, the relevant field here is your new sparse_index bit, and you want that to be 0, i.e. full. I also checked ensure_full_index() since it is often the case that o->src_index == o->dst_index, but it'll be safe to be called twice on the same index state -- at least as currently written. So, this patch seems good. > + } > + > if (!core_apply_sparse_checkout || !o->update) > o->skip_sparse_checkout = 1; > if (!o->skip_sparse_checkout && !o->pl) { > -- > gitgitgadget

gitgitgadget · 2021-01-27T06:33:10Z

t/t1092-sparse-checkout-compatibility.sh

@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > Add a new 'sparse-index' repo alongside the 'full-checkout' and > 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also > add run_on_sparse and test_sparse_match helpers. These helpers will be > used when the sparse index is implemented. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++---- > 1 file changed, 25 insertions(+), 4 deletions(-) > > diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh > index 8cd3e5a8d22..8876eae0fe3 100755 > --- a/t/t1092-sparse-checkout-compatibility.sh > +++ b/t/t1092-sparse-checkout-compatibility.sh > @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir' > test_expect_success 'setup' ' > git init initial-repo && > ( > + (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) && I thought parentheses started a subshell; once the subshell ends, wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away? > cd initial-repo && > echo a >a && > echo "after deep" >e && > @@ -87,23 +88,32 @@ init_repos () { > > cp -r initial-repo sparse-checkout && > git -C sparse-checkout reset --hard && > - git -C sparse-checkout sparse-checkout init --cone && > + > + cp -r initial-repo sparse-index && > + git -C sparse-index reset --hard && > > # initialize sparse-checkout definitions > - git -C sparse-checkout sparse-checkout set deep > + git -C sparse-checkout sparse-checkout init --cone && > + git -C sparse-checkout sparse-checkout set deep && > + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone && > + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep > } > > run_on_sparse () { > ( > cd sparse-checkout && > - $* >../sparse-checkout-out 2>../sparse-checkout-err > + GIT_TEST_SPARSE_INDEX=0 $* >../sparse-checkout-out 2>../sparse-checkout-err > + ) && > + ( > + cd sparse-index && > + $* >../sparse-index-out 2>../sparse-index-err > ) > } > > run_on_all () { > ( > cd full-checkout && > - $* >../full-checkout-out 2>../full-checkout-err > + GIT_TEST_SPARSE_INDEX=0 $* >../full-checkout-out 2>../full-checkout-err > ) && > run_on_sparse $* > } > @@ -114,6 +124,17 @@ test_all_match () { > test_cmp full-checkout-err sparse-checkout-err > } > > +test_sparse_match () { > + run_on_sparse $* && > + test_cmp sparse-checkout-out sparse-index-out && > + test_cmp sparse-checkout-err sparse-index-err > +} > + > +test_expect_success 'expanded in-memory index matches full index' ' > + init_repos && > + test_sparse_match test-tool read-cache --expand --table-no-stat > +' > + > test_expect_success 'status with options' ' > init_repos && > test_all_match git status --porcelain=v2 && > -- > gitgitgadget >

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2021 10:08 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> From: Derrick Stolee <dstolee@microsoft.com> >> >> Add a new 'sparse-index' repo alongside the 'full-checkout' and >> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also >> add run_on_sparse and test_sparse_match helpers. These helpers will be >> used when the sparse index is implemented. >> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++---- >> 1 file changed, 25 insertions(+), 4 deletions(-) >> >> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh >> index 8cd3e5a8d22..8876eae0fe3 100755 >> --- a/t/t1092-sparse-checkout-compatibility.sh >> +++ b/t/t1092-sparse-checkout-compatibility.sh >> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir' >> test_expect_success 'setup' ' >> git init initial-repo && >> ( >> + (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) && > > I thought parentheses started a subshell; once the subshell ends, > wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away? I think the "export" specifically pushes the setting out of the first level of subshell. This is the recommendation that comes up if one runs export GIT_TEST_SPARSE_INDEX=1 && inside a test on macOS, since this isn't completely portable. Thanks, -Stolee

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Jan 27, 2021 at 5:30 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 1/26/2021 10:08 PM, Elijah Newren wrote: > > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > >> > >> From: Derrick Stolee <dstolee@microsoft.com> > >> > >> Add a new 'sparse-index' repo alongside the 'full-checkout' and > >> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also > >> add run_on_sparse and test_sparse_match helpers. These helpers will be > >> used when the sparse index is implemented. > >> > >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > >> --- > >> t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++---- > >> 1 file changed, 25 insertions(+), 4 deletions(-) > >> > >> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh > >> index 8cd3e5a8d22..8876eae0fe3 100755 > >> --- a/t/t1092-sparse-checkout-compatibility.sh > >> +++ b/t/t1092-sparse-checkout-compatibility.sh > >> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir' > >> test_expect_success 'setup' ' > >> git init initial-repo && > >> ( > >> + (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) && > > > > I thought parentheses started a subshell; once the subshell ends, > > wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away? > > I think the "export" specifically pushes the setting out of the > first level of subshell. This is the recommendation that comes up You're having a child process change the environment variables of a parent process? ...without some kind of gdb or other debugger wizardry? > if one runs > > export GIT_TEST_SPARSE_INDEX=1 && > > inside a test on macOS, since this isn't completely portable. Um, I think you meant to run GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX && in order to avoid the unportable export GIT_TEST_SPARSE_INDEX=0 && because (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) && looks like a useless no-op. At least it would be in normal bash; is the test harness doing some special magic with it? In normal bash, the value definitely does NOT survive the subshell; (export just means that subprocesses of the subshell where the environment variable is set will see the value): $ echo Before: $GIT_TEST_SPARSE_INDEX && (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) && echo After: $GIT_TEST_SPARSE_INDEX Before: After: But in contrast, without the parentheses: $ echo Before: $GIT_TEST_SPARSE_INDEX && GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX && echo After: $GIT_TEST_SPARSE_INDEX Before: After: 0

gitgitgadget · 2021-01-27T17:05:06Z

builtin/sparse-checkout.c

@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > As we modify the sparse-checkout definition, we perform index operations > on a pattern_list that only exists in-memory. This allows easy backing > out in case the index update fails. > > However, if the index write itself cares about the sparse-checkout > pattern set, we need access to that in-memory copy. Place a pointer to > a 'struct pattern_list' in the index so we can access this on-demand. > This will be used in the next change which uses the sparse-checkout > definition to filter out directories that are outsie the sparse cone. s/outsie/outside/ > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > builtin/sparse-checkout.c | 17 ++++++++++------- > cache.h | 2 ++ > 2 files changed, 12 insertions(+), 7 deletions(-) > > diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c > index 2306a9ad98e..e00b82af727 100644 > --- a/builtin/sparse-checkout.c > +++ b/builtin/sparse-checkout.c > @@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl) > if (is_index_unborn(r->index)) > return UPDATE_SPARSITY_SUCCESS; > > + r->index->sparse_checkout_patterns = pl; > + > memset(&o, 0, sizeof(o)); > o.verbose_update = isatty(2); > o.update = 1; > @@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl) > else > rollback_lock_file(&lock_file); > > + r->index->sparse_checkout_patterns = NULL; > return result; > } > > @@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m) > { > int result; > int changed_config = 0; > - struct pattern_list pl; > - memset(&pl, 0, sizeof(pl)); > + struct pattern_list *pl = xcalloc(1, sizeof(*pl)); > > switch (m) { > case ADD: > if (core_sparse_checkout_cone) > - add_patterns_cone_mode(argc, argv, &pl); > + add_patterns_cone_mode(argc, argv, pl); > else > - add_patterns_literal(argc, argv, &pl); > + add_patterns_literal(argc, argv, pl); > break; > > case REPLACE: > - add_patterns_from_input(&pl, argc, argv); > + add_patterns_from_input(pl, argc, argv); > break; > } > > @@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m) > changed_config = 1; > } > > - result = write_patterns_and_update(&pl); > + result = write_patterns_and_update(pl); > > if (result && changed_config) > set_config(MODE_NO_PATTERNS); > > - clear_pattern_list(&pl); > + clear_pattern_list(pl); > + free(pl); > return result; > } > > diff --git a/cache.h b/cache.h > index 884046ca5b8..b05341cc687 100644 > --- a/cache.h > +++ b/cache.h > @@ -311,6 +311,7 @@ static inline unsigned int canon_mode(unsigned int mode) > struct split_index; > struct untracked_cache; > struct progress; > +struct pattern_list; > > struct index_state { > struct cache_entry **cache; > @@ -336,6 +337,7 @@ struct index_state { > struct mem_pool *ce_mem_pool; > struct progress *progress; > struct repository *repo; > + struct pattern_list *sparse_checkout_patterns; > }; > > /* Name hashing */ > -- > gitgitgadget Isn't this the same patch you put in your index cleanup series, or am I getting confused? It looks very familiar.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/27/2021 12:00 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> From: Derrick Stolee <dstolee@microsoft.com> >> >> As we modify the sparse-checkout definition, we perform index operations >> on a pattern_list that only exists in-memory. This allows easy backing >> out in case the index update fails. >> >> However, if the index write itself cares about the sparse-checkout >> pattern set, we need access to that in-memory copy. Place a pointer to >> a 'struct pattern_list' in the index so we can access this on-demand. >> This will be used in the next change which uses the sparse-checkout >> definition to filter out directories that are outsie the sparse cone. > > s/outsie/outside/ Thanks! > Isn't this the same patch you put in your index cleanup series, or am > I getting confused? It looks very familiar. I removed it from v2 of that series because it didn't do anything of value until we start using the sparse_checkout_patterns member in the next patch of _this_ series. Thanks, -Stolee

gitgitgadget · 2021-01-27T17:44:11Z

unpack-trees.c

@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > The index_pos_by_traverse_info() currently throws a BUG() when a > directory entry exists exactly in the index. We need to consider that it > is possible to have a directory in a sparse index as long as that entry > is itself marked with the skip-worktree bit. > > The negation of the 'pos' variable must be conditioned to only when it > starts as negative. This is identical behavior as before when the index > is full. The first sentence of this paragraph was really hard for me to parse. Reading the code and then reading the sentence I could make sense of it, but I struggled to do so the other way around. Is there some way to reword this? Or is this just a reading comprehension issue on my part? (It might be...) > The starts_with() condition matches because our name.buf terminates with > a directory separator, just like our sparse directory entries. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > unpack-trees.c | 9 ++++++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/unpack-trees.c b/unpack-trees.c > index 4dd99219073..b324eec2a5d 100644 > --- a/unpack-trees.c > +++ b/unpack-trees.c > @@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names, > strbuf_make_traverse_path(&name, info, names->path, names->pathlen); > strbuf_addch(&name, '/'); > pos = index_name_pos(o->src_index, name.buf, name.len); > - if (pos >= 0) > - BUG("This is a directory and should not exist in index"); > - pos = -pos - 1; > + if (pos >= 0) { > + if (!o->src_index->sparse_index || > + !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE)) > + BUG("This is a directory and should not exist in index"); > + } else > + pos = -pos - 1; > if (pos >= o->src_index->cache_nr || > !starts_with(o->src_index->cache[pos]->name, name.buf) || > (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf))) > -- > gitgitgadget The patch looks pretty straightforward to me.

gitgitgadget · 2021-01-27T17:44:14Z

cache-tree.c

@@ -6,6 +6,7 @@
 #include "object-store.h"


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > If we have a full index, then we can convert it to a sparse index by > replacing directories outside of the sparse cone with sparse directory > entries. The convert_to_sparse() method does this, when the situation is > appropriate. > > For now, we avoid converting the index to a sparse index if: > > 1. the index is split. > 2. the index is already sparse. > 3. sparse-checkout is disabled. > 4. sparse-checkout does not use cone mode. > > Finally, we currently limit the conversion to when the > GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git > config will be added in a later change. > > The trickiest thing about this conversion is that we might not be able > to mark a directory as a sparse directory just because it is outside the > sparse cone. There might be unmerged files within that directory, so we > need to look for those. Also, if there is some strange reason why a file > is not marked with CE_SKIP_WORKTREE, then we should give up on > converting that directory. There is still hope that some of its > subdirectories might be able to convert to sparse, so we keep looking > deeper. Oh good, you check for *both* unmerged entries and !CE_SKIP_WORKTREE ones. Very nice. > > The conversion process is assisted by the cache-tree extension. This is > calculated from the full index if it does not already exist. We then > abandon the cache-tree as it no longer applies to the newly-sparse > index. Thus, this cache-tree will be recalculated in every > sparse-full-sparse round-trip until we integrate the cache-tree > extension with the sparse index. When going from full to sparse, won't the parts of the cache-tree for paths outside of sparsified directories still be valid? Can't we use those? Also, when going from sparse to full, can't we just populate the cache-tree as well since we have to read trees to get the individual file entries and we will get all the directory (tree) values at the same time? > We can compare the behavior of the sparse-index in > t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1 > when operating on the 'sparse-index' repo. We can also compare the two > sparse repos directly, such as comparing their indexes (when expanded to > full in the case of the 'sparse-index' repo). We also verify that the > index is actually populated with sparse directory entries. > > The 'checkout and reset (mixed)' test is marked for failure when > comparing a sparse repo to a full repo, but we can compare the two > sparse-checkout cases directly to ensure that we are not changing the > behavior when using a sparse index. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > cache-tree.c | 3 + > read-cache.c | 18 ++- > sparse-index.c | 139 +++++++++++++++++++++++ > t/t1092-sparse-checkout-compatibility.sh | 63 +++++++++- > 4 files changed, 218 insertions(+), 5 deletions(-) > > diff --git a/cache-tree.c b/cache-tree.c > index 2fb483d3c08..5f07a39e501 100644 > --- a/cache-tree.c > +++ b/cache-tree.c > @@ -6,6 +6,7 @@ > #include "object-store.h" > #include "replace-object.h" > #include "promisor-remote.h" > +#include "sparse-index.h" > > #ifndef DEBUG_CACHE_TREE > #define DEBUG_CACHE_TREE 0 > @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags) > if (i) > return i; > > + ensure_full_index(istate); > + > if (!istate->cache_tree) > istate->cache_tree = cache_tree(); > > diff --git a/read-cache.c b/read-cache.c > index 1097ecbf132..0522260416e 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -25,6 +25,7 @@ > #include "fsmonitor.h" > #include "thread-utils.h" > #include "progress.h" > +#include "sparse-index.h" > > /* Mask for the name length in ce_flags in the on-disk index */ > > @@ -1002,8 +1003,15 @@ int verify_path(const char *path, unsigned mode) > > c = *path++; > if ((c == '.' && !verify_dotfile(path, mode)) || > - is_dir_sep(c) || c == '\0') > + is_dir_sep(c)) > return 0; > + /* > + * allow terminating directory separators for > + * sparse directory enries. s/enries/entries/ > + */ > + if (c == '\0') > + return mode == CE_MODE_SPARSE_DIRECTORY || > + mode == SPARSE_DIR_MODE; Why two values here? I get confused why you have both (which isn't new to this patch; I'm just still confused from when I saw SPARSE_DIR_MODE). > } else if (c == '\\' && protect_ntfs) { > if (is_ntfs_dotgit(path)) > return 0; > @@ -3062,6 +3070,13 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l > { > int ret; > > + ret = convert_to_sparse(istate); > + > + if (ret) { > + warning(_("failed to convert to a sparse-index")); > + return ret; > + } > + > /* > * TODO trace2: replace "the_repository" with the actual repo instance > * that is associated with the given "istate". > @@ -3165,6 +3180,7 @@ static int write_shared_index(struct index_state *istate, > int ret; > > move_cache_to_base_index(istate); > + convert_to_sparse(istate); > > trace2_region_enter_printf("index", "shared/do_write_index", > the_repository, "%s", (*temp)->filename.buf); > diff --git a/sparse-index.c b/sparse-index.c > index 1e70244dc13..d8f1a5a13d7 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -4,6 +4,145 @@ > #include "tree.h" > #include "pathspec.h" > #include "trace2.h" > +#include "cache-tree.h" > +#include "config.h" > +#include "dir.h" > +#include "fsmonitor.h" > + > +static struct cache_entry *construct_sparse_dir_entry( > + struct index_state *istate, > + const char *sparse_dir, > + struct cache_tree *tree) > +{ > + struct cache_entry *de; > + > + de = make_cache_entry(istate, SPARSE_DIR_MODE, &tree->oid, sparse_dir, 0, 0); > + > + de->ce_flags |= CE_SKIP_WORKTREE; > + return de; > +} > + > +/* > + * Returns the number of entries "inserted" into the index. > + */ > +static int convert_to_sparse_rec(struct index_state *istate, > + int num_converted, > + int start, int end, > + const char *ct_path, size_t ct_pathlen, > + struct cache_tree *ct) > +{ > + int i, can_convert = 1; > + int start_converted = num_converted; > + enum pattern_match_result match; > + int dtype; > + struct strbuf child_path = STRBUF_INIT; > + struct pattern_list *pl = istate->sparse_checkout_patterns; > + > + /* > + * Is the current path outside of the sparse cone? > + * Then check if the region can be replaced by a sparse > + * directory entry (everything is sparse and merged). > + */ > + match = path_matches_pattern_list(ct_path, ct_pathlen, > + NULL, &dtype, pl, istate); > + if (match != NOT_MATCHED) > + can_convert = 0; I know some people hate gotos, but this seems like one of the cases where a goto jumping after the following for & if would be clearer then setting can_convert to 0. > + > + for (i = start; can_convert && i < end; i++) { Instead of checking can_convert here... > + struct cache_entry *ce = istate->cache[i]; > + > + if (ce_stage(ce) || > + !(ce->ce_flags & CE_SKIP_WORKTREE)) > + can_convert = 0; ...could you just insert a break here? > + } > + > + if (can_convert) { > + struct cache_entry *se; > + se = construct_sparse_dir_entry(istate, ct_path, ct); > + > + istate->cache[num_converted++] = se; > + return 1; > + } > + > + for (i = start; i < end; ) { > + int count, span, pos = -1; > + const char *base, *slash; > + struct cache_entry *ce = istate->cache[i]; > + > + /* > + * Detect if this is a normal entry oustide of any subtree s/oustide/outside/ > + * entry. > + */ > + base = ce->name + ct_pathlen; > + slash = strchr(base, '/'); > + > + if (slash) > + pos = cache_tree_subtree_pos(ct, base, slash - base); > + > + if (pos < 0) { > + istate->cache[num_converted++] = ce; > + i++; > + continue; > + } > + > + strbuf_setlen(&child_path, 0); > + strbuf_add(&child_path, ce->name, slash - ce->name + 1); > + > + span = ct->down[pos]->cache_tree->entry_count; > + count = convert_to_sparse_rec(istate, > + num_converted, i, i + span, > + child_path.buf, child_path.len, > + ct->down[pos]->cache_tree); > + num_converted += count; > + i += span; And there's no i++ in the loop itself, so this is good. > + } > + > + strbuf_release(&child_path); > + return num_converted - start_converted; > +} > + > +int convert_to_sparse(struct index_state *istate) > +{ > + if (istate->split_index || istate->sparse_index || > + !core_apply_sparse_checkout || !core_sparse_checkout_cone) > + return 0; > + > + /* > + * For now, only create a sparse index with the > + * GIT_TEST_SPARSE_INDEX environment variable. We will relax > + * this once we have a proper way to opt-in (and later still, > + * opt-out). > + */ > + if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) > + return 0; > + > + if (!istate->sparse_checkout_patterns) { > + istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list)); > + if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0) > + return 0; > + } > + > + if (!istate->sparse_checkout_patterns->use_cone_patterns) { > + warning(_("attempting to use sparse-index without cone mode")); > + return -1; > + } > + > + if (cache_tree_update(istate, 0)) { > + warning(_("unable to update cache-tree, staying full")); > + return -1; > + } > + > + remove_fsmonitor(istate); > + > + trace2_region_enter("index", "convert_to_sparse", istate->repo); > + istate->cache_nr = convert_to_sparse_rec(istate, > + 0, 0, istate->cache_nr, > + "", 0, istate->cache_tree); > + istate->drop_cache_tree = 1; > + istate->sparse_index = 1; > + trace2_region_leave("index", "convert_to_sparse", istate->repo); > + return 0; > +} > > static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) > { > diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh > index 3aa9b0d21b4..22becbaca2e 100755 > --- a/t/t1092-sparse-checkout-compatibility.sh > +++ b/t/t1092-sparse-checkout-compatibility.sh > @@ -2,6 +2,9 @@ > > test_description='compare full workdir to sparse workdir' > > +GIT_TEST_CHECK_CACHE_TREE=0 Why do you need to set this? I vaguely remember needing to mess with this when working with sparse checkouts because it did weird stuff but I don't remember details. But since you patch touches cache_trees, it seems weird to show up without explanation. > +GIT_TEST_SPLIT_INDEX=0 > + > . ./test-lib.sh > > test_expect_success 'setup' ' > @@ -106,7 +109,7 @@ run_on_sparse () { > ) && > ( > cd sparse-index && > - $* >../sparse-index-out 2>../sparse-index-err > + GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err > ) > } > > @@ -121,7 +124,9 @@ run_on_all () { > test_all_match () { > run_on_all $* && > test_cmp full-checkout-out sparse-checkout-out && > - test_cmp full-checkout-err sparse-checkout-err > + test_cmp full-checkout-out sparse-index-out && > + test_cmp full-checkout-err sparse-checkout-err && > + test_cmp full-checkout-err sparse-index-err > } > > test_sparse_match () { > @@ -130,6 +135,38 @@ test_sparse_match () { > test_cmp sparse-checkout-err sparse-index-err > } > > +test_expect_success 'sparse-index contents' ' > + init_repos && > + > + test-tool -C sparse-index read-cache --table --no-stat >cache && > + for dir in folder1 folder2 x > + do > + TREE=$(git -C sparse-index rev-parse HEAD:$dir) && > + grep "0b0000 0755 $TREE $dir/" cache \ > + || return 1 > + done && > + > + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 && > + > + test-tool -C sparse-index read-cache --table --no-stat >cache && > + for dir in deep folder2 x > + do > + TREE=$(git -C sparse-index rev-parse HEAD:$dir) && > + grep "0b0000 0755 $TREE $dir/" cache \ It would seem clearer to me if this output were to better match `git ls-tree -rt ...` output (or at least the ' tree ' lines from such output). > + || return 1 > + done && > + > + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 && > + > + test-tool -C sparse-index read-cache --table --no-stat >cache && > + for dir in deep/deeper2 folder1 folder2 x > + do > + TREE=$(git -C sparse-index rev-parse HEAD:$dir) && > + grep "0b0000 0755 $TREE $dir/" cache \ > + || return 1 > + done > +' > + > test_expect_success 'expanded in-memory index matches full index' ' > init_repos && > test_sparse_match test-tool read-cache --expand --table --no-stat > @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' ' > > test_expect_success 'status with options' ' > init_repos && > + test_sparse_match ls && > test_all_match git status --porcelain=v2 && > test_all_match git status --porcelain=v2 -z -u && > test_all_match git status --porcelain=v2 -uno && > @@ -169,7 +207,7 @@ test_expect_success 'add, commit, checkout' ' > > test_all_match git add -A && > test_all_match git status --porcelain=v2 && > - test_all_match git commit -m "Extend README.md" && > + test_all_match git commit -m "Extend-README.md" && Why this change? > > test_all_match git checkout HEAD~1 && > test_all_match git checkout - && > @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' ' > test_all_match git reset update-folder2 > ' > > +# Ensure that sparse-index behaves identically to > +# sparse-checkout with a full index. > +test_expect_success 'checkout and reset (mixed) [sparse]' ' > + init_repos && > + > + test_sparse_match git checkout -b reset-test update-deep && > + test_sparse_match git reset deepest && > + test_sparse_match git reset update-folder1 && > + test_sparse_match git reset update-folder2 > +' > + > test_expect_success 'merge' ' > init_repos && > > @@ -309,14 +358,20 @@ test_expect_success 'clean' ' > test_all_match git status --porcelain=v2 && > test_all_match git clean -f && > test_all_match git status --porcelain=v2 && > + test_sparse_match ls && > + test_sparse_match ls folder1 && > > test_all_match git clean -xf && > test_all_match git status --porcelain=v2 && > + test_sparse_match ls && > + test_sparse_match ls folder1 && > > test_all_match git clean -xdf && > test_all_match git status --porcelain=v2 && > + test_sparse_match ls && > + test_sparse_match ls folder1 && > > - test_path_is_dir sparse-checkout/folder1 > + test_sparse_match test_path_is_dir folder1 > ' > > test_done > -- > gitgitgadget

gitgitgadget · 2021-01-27T17:49:57Z

t/t1092-sparse-checkout-compatibility.sh

@@ -2,11 +2,15 @@



On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > Add a test case that uses test_region to ensure that we are truly > expanding a sparse index to a full one, then converting back to sparse > when writing the index. As we integrate more Git commands with the > sparse index, we will convert these commands to check that we do _not_ > convert the sparse index to a full index and instead stay sparse the > entire time. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh > index 22becbaca2e..a22def89e37 100755 > --- a/t/t1092-sparse-checkout-compatibility.sh > +++ b/t/t1092-sparse-checkout-compatibility.sh > @@ -374,4 +374,21 @@ test_expect_success 'clean' ' > test_sparse_match test_path_is_dir folder1 > ' > > +test_expect_success 'sparse-index is expanded and converted back' ' > + init_repos && > + > + ( > + (GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) && Drop the parentheses. What system are you running on that this test passed for you with those parentheses there? I checked out this particular commit and ran the test -- and it fails for me. Removing the parentheses makes the test pass. Is there some shell where parentheses only function as grouping, similar to bash's {...}, rather than as a subshell, the way bash handles (...) ? > + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > + git -C sparse-index -c core.fsmonitor="" reset --hard && > + test_region index convert_to_sparse trace2.txt && > + test_region index ensure_full_index trace2.txt && > + > + rm trace2.txt && > + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > + git -C sparse-index -c core.fsmonitor="" status -uno && > + test_region index ensure_full_index trace2.txt > + ) > +' > + > test_done > -- > gitgitgadget Otherwise, I like the test and this commit.

gitgitgadget · 2021-01-27T18:07:58Z

Documentation/config/extensions.txt

@@ -6,3 +6,10 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > Previously, we enabled the sparse index format only using > GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to > actually select this mode. Further, sparse directory entries are not > understood by the index formats as advertised. > > We _could_ add a new index version that explicitly adds these > capabilities, but there are nuances to index formats 2, 3, and 4 that > are still valuable to select as options. For now, create a repo > extension, "extensions.sparseIndex", that specifies that the tool > reading this repository must understand sparse directory entries. > > This change only encodes the extension and enables it when > GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI > mechanism. One other interesting thing to note is that last I checked, jgit doesn't support index format v4, which makes us unable to use it. Making a v5 would force jgit to support all previous index formats in order to support your new feature. However, the jgit thing is going to make it hard for me to find other users willing to test out this feature at $DAYJOB. But I don't think there's anyway around that; you need to change the index format. And you at least have jrnieder cc'ed. :-) > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > Documentation/config/extensions.txt | 7 ++++++ > cache.h | 1 + > repo-settings.c | 7 ++++++ > repository.h | 3 ++- > setup.c | 3 +++ > sparse-index.c | 38 +++++++++++++++++++++++++---- > 6 files changed, 53 insertions(+), 6 deletions(-) > > diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt > index 4e23d73cdca..5c86b364873 100644 > --- a/Documentation/config/extensions.txt > +++ b/Documentation/config/extensions.txt > @@ -6,3 +6,10 @@ extensions.objectFormat:: > Note that this setting should only be set by linkgit:git-init[1] or > linkgit:git-clone[1]. Trying to change it after initialization will not > work and will produce hard-to-diagnose issues. > + > +extensions.sparseIndex:: > + When combined with `core.sparseCheckout=true` and > + `core.sparseCheckoutCone=true`, the index may contain entries > + corresponding to directories outside of the sparse-checkout > + definition. Versions of Git that do not understand this extension > + do not expect directory entries in the index. Perhaps to make this slightly more explicit ("corresponding to" can be fuzzy and be read to assume you are talking about file entries underneath a directory rather than directory entries, so add an extra phrase to rule that out): ...the index may contain entries corresponding to directories outside of the sparse-checkout definition in lieu of containing each path under such directories... > diff --git a/cache.h b/cache.h > index b05341cc687..dcf089b7006 100644 > --- a/cache.h > +++ b/cache.h > @@ -1054,6 +1054,7 @@ struct repository_format { > int worktree_config; > int is_bare; > int hash_algo; > + int sparse_index; > char *work_tree; > struct string_list unknown_extensions; > struct string_list v1_only_extensions; > diff --git a/repo-settings.c b/repo-settings.c > index d63569e4041..9677d50f923 100644 > --- a/repo-settings.c > +++ b/repo-settings.c > @@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r) > * removed. > */ > r->settings.command_requires_full_index = 1; > + > + /* > + * Initialize this as off. > + */ > + r->settings.sparse_index = 0; > + if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value) > + r->settings.sparse_index = 1; > } > diff --git a/repository.h b/repository.h > index e06a2301569..a45f7520fd9 100644 > --- a/repository.h > +++ b/repository.h > @@ -42,7 +42,8 @@ struct repo_settings { > > int core_multi_pack_index; > > - unsigned command_requires_full_index:1; > + unsigned command_requires_full_index:1, > + sparse_index:1; > }; > > struct repository { > diff --git a/setup.c b/setup.c > index c04cd25a30d..cd839456461 100644 > --- a/setup.c > +++ b/setup.c > @@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var, > return error("invalid value for 'extensions.objectformat'"); > data->hash_algo = format; > return EXTENSION_OK; > + } else if (!strcmp(ext, "sparseindex")) { > + data->sparse_index = 1; > + return EXTENSION_OK; > } > return EXTENSION_UNKNOWN; > } > diff --git a/sparse-index.c b/sparse-index.c > index 5dd0b835b9d..71544095267 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate, > return num_converted - start_converted; > } > > +static int enable_sparse_index(struct repository *repo) > +{ > + const char *config_path = repo_git_path(repo, "config.worktree"); > + > + if (upgrade_repository_format(1) < 0) { > + warning(_("unable to upgrade repository format to enable sparse-index")); > + return -1; > + } > + git_config_set_in_file_gently(config_path, > + "extensions.sparseIndex", > + "true"); > + > + prepare_repo_settings(repo); > + repo->settings.sparse_index = 1; > + return 0; > +} > + > int convert_to_sparse(struct index_state *istate) > { > if (istate->split_index || istate->sparse_index || > !core_apply_sparse_checkout || !core_sparse_checkout_cone) > return 0; > > + if (!istate->repo) > + istate->repo = the_repository; > + > + /* > + * The GIT_TEST_SPARSE_INDEX environment variable triggers the > + * extensions.sparseIndex config variable to be on. > + */ > + if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) { > + int err = enable_sparse_index(istate->repo); > + if (err < 0) > + return err; > + } > + > /* > - * For now, only create a sparse index with the > - * GIT_TEST_SPARSE_INDEX environment variable. We will relax > - * this once we have a proper way to opt-in (and later still, > - * opt-out). > + * Only convert to sparse if extensions.sparseIndex is set. > */ > - if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) > + prepare_repo_settings(istate->repo); > + if (!istate->repo->settings.sparse_index) > return 0; > > if (!istate->sparse_checkout_patterns) { > -- > gitgitgadget

gitgitgadget · 2021-01-27T18:23:00Z

Documentation/git-sparse-checkout.txt

@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > The sparse index extension is used to signal that index writes should be > in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1. > > Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that > specifies if the sparse index should be used. It also updates the index > to use the correct format, either way. Add a warning in the > documentation that the use of a repository extension might reduce > compatibility with third-party tools. 'git sparse-checkout init' already > sets extension.worktreeConfig, which places most sparse-checkout users > outside of the scope of most third-party tools. Heh, looks like you're addressing my comments on the last patch about jgit. If I would have just read on... One side question, though -- I thought I remembered seeing that we record index versions or extension information directly in the index, so that third party tools have a way of noting that the index has something they won't understand, rather than just reading values that appear to be corrupt to them. Perhaps I missed it, but have you done anything like that with this series? > > Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of > GIT_TEST_SPARSE_INDEX=1. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > Documentation/git-sparse-checkout.txt | 14 +++++++++ > builtin/sparse-checkout.c | 17 ++++++++++- > sparse-index.c | 38 ++++++++++++++++-------- > sparse-index.h | 3 ++ > t/t1092-sparse-checkout-compatibility.sh | 33 ++++++++++---------- > 5 files changed, 75 insertions(+), 30 deletions(-) > > diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt > index a0eeaeb02ee..b51b8450cfd 100644 > --- a/Documentation/git-sparse-checkout.txt > +++ b/Documentation/git-sparse-checkout.txt > @@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the > When `--cone` is provided, the `core.sparseCheckoutCone` setting is > also set, allowing for better performance with a limited set of > patterns (see 'CONE PATTERN SET' below). > ++ > +Use the `--[no-]sparse-index` option to toggle the use of the sparse > +index format. This reduces the size of the index to be more closely > +aligned with your sparse-checkout definition. This can have significant > +performance advantages for commands such as `git status` or `git add`. > +This feature is still experimental. Some commands might be slower with > +a sparse index until they are properly integrated with the feature. > ++ > +**WARNING:** Using a sparse index requires modifying the index in a way > +that is not completely understood by other tools. Enabling sparse index > +enables the `extensions.spareseIndex` config value, which might cause extensions.sparseIndex; you have an extra 'e' in there. > +other tools to stop working with your repository. If you have trouble with > +this compatibility, then run `git sparse-checkout sparse-index disable` to > +remove this config and rewrite your index to not be sparse. > > 'set':: > Write a set of patterns to the sparse-checkout file, as given as > diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c > index e00b82af727..ca63e2c64e9 100644 > --- a/builtin/sparse-checkout.c > +++ b/builtin/sparse-checkout.c > @@ -14,6 +14,7 @@ > #include "unpack-trees.h" > #include "wt-status.h" > #include "quote.h" > +#include "sparse-index.h" > > static const char *empty_base = ""; > > @@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode) > } > > static char const * const builtin_sparse_checkout_init_usage[] = { > - N_("git sparse-checkout init [--cone]"), > + N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"), This all makes sense, but between partial clones, sparse-checkouts and sparse-indexes, I wonder if we're overloading users with terms and conditions. Perhaps that's inevitable in the short-term due to the various caveats that exist, but I'd just like to put out a fuzzy high-level goal of allowing users in the future to just specify "I want a sparse clone of this stuff" with as few special knobs and flags as possible. I don't want them to have to specify all of the individual things that means, such as they want (a) the history to be sparse (i.e. partial clone), (b) the checkout to be sparse, (c) the index to be sparse, (d) several commands to operate in a sparse manner, limiting their output based on the sparsity paths (hopefully they aren't required to list each one), and (e) several other commands shouldn't be limited by the sparsity paths. I guess it might be nice to _allow_ them to specify all the things it means for users who want control, but it'd be nice to avoid requiring it of all users. > NULL > }; > > static struct sparse_checkout_init_opts { > int cone_mode; > + int sparse_index; > } init_opts; > > static int sparse_checkout_init(int argc, const char **argv) > @@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv) > static struct option builtin_sparse_checkout_init_options[] = { > OPT_BOOL(0, "cone", &init_opts.cone_mode, > N_("initialize the sparse-checkout in cone mode")), > + OPT_BOOL(0, "sparse-index", &init_opts.sparse_index, > + N_("toggle the use of a sparse index")), > OPT_END(), > }; > > repo_read_index(the_repository); > > + init_opts.sparse_index = -1; > + > argc = parse_options(argc, argv, NULL, > builtin_sparse_checkout_init_options, > builtin_sparse_checkout_init_usage, 0); > @@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv) > sparse_filename = get_sparse_checkout_filename(); > res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL); > > + if (init_opts.sparse_index >= 0) { > + if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0) > + die(_("failed to modify sparse-index config")); > + > + /* force an index rewrite */ > + repo_read_index(the_repository); > + the_repository->index->updated_workdir = 1; > + } > + > /* If we already have a sparse-checkout file, use it. */ > if (res >= 0) { > free(sparse_filename); > diff --git a/sparse-index.c b/sparse-index.c > index 71544095267..3552f88fb03 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -104,23 +104,38 @@ static int convert_to_sparse_rec(struct index_state *istate, > > static int enable_sparse_index(struct repository *repo) > { > - const char *config_path = repo_git_path(repo, "config.worktree"); > + int res; > > if (upgrade_repository_format(1) < 0) { > warning(_("unable to upgrade repository format to enable sparse-index")); > return -1; > } > - git_config_set_in_file_gently(config_path, > - "extensions.sparseIndex", > - "true"); > + res = git_config_set_gently("extensions.sparseindex", "true"); > > prepare_repo_settings(repo); > repo->settings.sparse_index = 1; > - return 0; > + return res; > +} > + > +int set_sparse_index_config(struct repository *repo, int enable) > +{ > + int res; > + > + if (enable) > + return enable_sparse_index(repo); > + > + /* Don't downgrade repository format, just remove the extension. */ > + res = git_config_set_multivar_gently("extensions.sparseindex", NULL, "", > + CONFIG_FLAGS_MULTI_REPLACE); > + > + prepare_repo_settings(repo); > + repo->settings.sparse_index = 0; > + return res; > } > > int convert_to_sparse(struct index_state *istate) > { > + int test_env; > if (istate->split_index || istate->sparse_index || > !core_apply_sparse_checkout || !core_sparse_checkout_cone) > return 0; > @@ -129,14 +144,13 @@ int convert_to_sparse(struct index_state *istate) > istate->repo = the_repository; > > /* > - * The GIT_TEST_SPARSE_INDEX environment variable triggers the > - * extensions.sparseIndex config variable to be on. > + * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex > + * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly), > + * then purposefully disable the setting. > */ > - if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) { > - int err = enable_sparse_index(istate->repo); > - if (err < 0) > - return err; > - } > + test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1); > + if (test_env >= 0) > + set_sparse_index_config(istate->repo, test_env); > > /* > * Only convert to sparse if extensions.sparseIndex is set. > diff --git a/sparse-index.h b/sparse-index.h > index a2777dcac59..ca936e95d11 100644 > --- a/sparse-index.h > +++ b/sparse-index.h > @@ -5,4 +5,7 @@ struct index_state; > void ensure_full_index(struct index_state *istate); > int convert_to_sparse(struct index_state *istate); > > +struct repository; > +int set_sparse_index_config(struct repository *repo, int enable); > + > #endif > \ No newline at end of file > diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh > index a22def89e37..c6b7e8b8891 100755 > --- a/t/t1092-sparse-checkout-compatibility.sh > +++ b/t/t1092-sparse-checkout-compatibility.sh > @@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir' > > GIT_TEST_CHECK_CACHE_TREE=0 > GIT_TEST_SPLIT_INDEX=0 > +GIT_TEST_SPARSE_INDEX= > > . ./test-lib.sh > > @@ -98,8 +99,9 @@ init_repos () { > # initialize sparse-checkout definitions > git -C sparse-checkout sparse-checkout init --cone && > git -C sparse-checkout sparse-checkout set deep && > - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone && > - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep > + git -C sparse-index sparse-checkout init --cone --sparse-index && > + test_cmp_config -C sparse-index true extensions.sparseindex && > + git -C sparse-index sparse-checkout set deep > } > > run_on_sparse () { > @@ -109,7 +111,7 @@ run_on_sparse () { > ) && > ( > cd sparse-index && > - GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err > + $* >../sparse-index-out 2>../sparse-index-err > ) > } > > @@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' ' > || return 1 > done && > > - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 && > + git -C sparse-index sparse-checkout set folder1 && > > test-tool -C sparse-index read-cache --table --no-stat >cache && > for dir in deep folder2 x > @@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' ' > || return 1 > done && > > - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 && > + git -C sparse-index sparse-checkout set deep/deeper1 && > > test-tool -C sparse-index read-cache --table --no-stat >cache && > for dir in deep/deeper2 folder1 folder2 x > @@ -377,18 +379,15 @@ test_expect_success 'clean' ' > test_expect_success 'sparse-index is expanded and converted back' ' > init_repos && > > - ( > - (GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) && > - GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > - git -C sparse-index -c core.fsmonitor="" reset --hard && > - test_region index convert_to_sparse trace2.txt && > - test_region index ensure_full_index trace2.txt && > - > - rm trace2.txt && > - GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > - git -C sparse-index -c core.fsmonitor="" status -uno && > - test_region index ensure_full_index trace2.txt > - ) > + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > + git -C sparse-index -c core.fsmonitor="" reset --hard && > + test_region index convert_to_sparse trace2.txt && > + test_region index ensure_full_index trace2.txt && > + > + rm trace2.txt && > + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > + git -C sparse-index -c core.fsmonitor="" status -uno && > + test_region index ensure_full_index trace2.txt > ' > > test_done > -- > gitgitgadget I need to take a break from reviewing again at this point and work on some other tasks. I'll resume reviewing the series later, perhaps tomorrow afternoon.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/27/2021 1:18 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: > > I need to take a break from reviewing again at this point and work on > some other tasks. I'll resume reviewing the series later, perhaps > tomorrow afternoon. I appreciate your efforts here! I'm delaying detailed responses until enough time has passed for interested parties to comment. If it helps, then focusing on the "big things" is most important for now. I'll be more careful about typos and things when I submit patches for full review. For my part, I'm taking time to look around at other things that need my attention after being heads-down on this prototype. The sparse index will need proper care to implement and I'm not expecting it to move very quickly. Thanks, -Stolee

gitgitgadget · 2021-02-01T20:24:19Z

apply.c

@@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state,
 	if (!patch->is_new)


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > This giant patch is not intended for actual review. I have a branch that > has these changes split out in a sane way with some commentary in each > file that is modified. > > The idea here is to guard certain portions of the codebase that do not > know how to handle sparse indexes by ensuring that the index is expanded > to a full index before proceeding with the logic. > > This also provides a good mechanism for testing which code needs > updating to enable the sparse index in a Git builtin. The builtin can > set the_repository->settings.command_requires_full_index to zero and > then we can debug the command with a breakpoint on ensure_full_index(). > That identifies the portion of code that needs adjusting before enabling > sparse indexes for that command. > > Some index operations must be changed to operate on a non-const pointer, > since ensuring a full index will modify the index itself. > > There are likely some gaps to these protections, which is why it will be > important to carefully test each scenario as we relax the requirements. > I expect that to be a long effort. I think the idea makes sense; it provides a way for us to incrementally build support for this new feature. I skimmed over the code and noticed various interesting places that had the ensure_full_index() call (e.g. read_skip_worktree_file_from_index() -- whose existence comes from sparsity; what irony...). Better breakouts would be great, so I'll defer commenting much until then. But, just to verify I'm understanding: the primary defence is the command_requires_full_index setting, and you have added several ensure_full_index() calls throughout the code in places you believe would need to be fixed up in case someone switches the command_requires_full_index setting. Is that correct? And your comment on the gaps is just that there may be other places that are missing the secondary protection (as opposed to my first reading of that paragraph as suggesting we aren't sure if we have enough protections yet and need to add more before this moves out of RFC); is that right? > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > apply.c | 10 +++++++++- > blame.c | 7 ++++++- > builtin/checkout-index.c | 5 ++++- > builtin/grep.c | 2 ++ > builtin/ls-files.c | 9 ++++++++- > builtin/merge-index.c | 2 ++ > builtin/mv.c | 2 ++ > builtin/rm.c | 2 ++ > builtin/sparse-checkout.c | 1 + > builtin/update-index.c | 2 ++ > cache.h | 1 + > diff-lib.c | 2 ++ > diff.c | 2 ++ > dir.c | 14 +++++++++++++- > entry.c | 2 ++ > fsmonitor.c | 11 ++++++++++- > merge-recursive.c | 22 +++++++++++++++++++--- > name-hash.c | 6 ++++++ > pathspec.c | 5 +++-- > pathspec.h | 4 ++-- > read-cache.c | 19 +++++++++++++++++-- > rerere.c | 2 ++ > resolve-undo.c | 6 ++++++ > sha1-name.c | 3 +++ > split-index.c | 2 ++ > submodule.c | 24 +++++++++++++++++++----- > submodule.h | 6 +++--- > tree.c | 2 ++ > wt-status.c | 7 +++++++ > 29 files changed, 159 insertions(+), 23 deletions(-) > > diff --git a/apply.c b/apply.c > index 668b16e9893..5bfbd928b38 100644 > --- a/apply.c > +++ b/apply.c > @@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state, > if (!patch->is_new) > BUG("patch to %s is not a creation", patch->old_name); > > + ensure_full_index(state->repo->index); > + > pos = index_name_pos(state->repo->index, name, strlen(name)); > if (pos < 0) > return error(_("%s: does not exist in index"), name); > @@ -3692,7 +3694,11 @@ static int check_preimage(struct apply_state *state, > } > > if (state->check_index && !previous) { > - int pos = index_name_pos(state->repo->index, old_name, > + int pos; > + > + ensure_full_index(state->repo->index); > + > + pos = index_name_pos(state->repo->index, old_name, > strlen(old_name)); > if (pos < 0) { > if (patch->is_new < 0) > @@ -3751,6 +3757,8 @@ static int check_to_create(struct apply_state *state, > if (state->check_index && (!ok_if_exists || !state->cached)) { > int pos; > > + ensure_full_index(state->repo->index); > + > pos = index_name_pos(state->repo->index, new_name, strlen(new_name)); > if (pos >= 0) { > struct cache_entry *ce = state->repo->index->cache[pos]; > diff --git a/blame.c b/blame.c > index a5044fcfaa6..0aa368a35cf 100644 > --- a/blame.c > +++ b/blame.c > @@ -108,6 +108,7 @@ static void verify_working_tree_path(struct repository *r, > return; > } > > + ensure_full_index(r->index); > pos = index_name_pos(r->index, path, strlen(path)); > if (pos >= 0) > ; /* path is in the index */ > @@ -277,7 +278,11 @@ static struct commit *fake_working_tree_commit(struct repository *r, > > len = strlen(path); > if (!mode) { > - int pos = index_name_pos(r->index, path, len); > + int pos; > + > + ensure_full_index(r->index); > + > + pos = index_name_pos(r->index, path, len); > if (0 <= pos) > mode = r->index->cache[pos]->ce_mode; > else > diff --git a/builtin/checkout-index.c b/builtin/checkout-index.c > index 4bbfc92dce5..24c85b1c125 100644 > --- a/builtin/checkout-index.c > +++ b/builtin/checkout-index.c > @@ -48,11 +48,14 @@ static void write_tempfile_record(const char *name, const char *prefix) > static int checkout_file(const char *name, const char *prefix) > { > int namelen = strlen(name); > - int pos = cache_name_pos(name, namelen); > + int pos; > int has_same_name = 0; > int did_checkout = 0; > int errs = 0; > > + ensure_full_index(the_repository->index); > + pos = index_name_pos(the_repository->index, name, namelen); > + > if (pos < 0) > pos = -pos - 1; > > diff --git a/builtin/grep.c b/builtin/grep.c > index ca259af4416..e53cf817204 100644 > --- a/builtin/grep.c > +++ b/builtin/grep.c > @@ -506,6 +506,8 @@ static int grep_cache(struct grep_opt *opt, > if (repo_read_index(repo) < 0) > die(_("index file corrupt")); > > + ensure_full_index(repo->index); > + > for (nr = 0; nr < repo->index->cache_nr; nr++) { > const struct cache_entry *ce = repo->index->cache[nr]; > strbuf_setlen(&name, name_base_len); > diff --git a/builtin/ls-files.c b/builtin/ls-files.c > index c8eae899b82..933e259cdbe 100644 > --- a/builtin/ls-files.c > +++ b/builtin/ls-files.c > @@ -150,7 +150,7 @@ static void show_other_files(const struct index_state *istate, > } > } > > -static void show_killed_files(const struct index_state *istate, > +static void show_killed_files(struct index_state *istate, > const struct dir_struct *dir) > { > int i; > @@ -159,6 +159,8 @@ static void show_killed_files(const struct index_state *istate, > char *cp, *sp; > int pos, len, killed = 0; > > + ensure_full_index(istate); > + > for (cp = ent->name; cp - ent->name < ent->len; cp = sp + 1) { > sp = strchr(cp, '/'); > if (!sp) { > @@ -313,6 +315,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir) > show_killed_files(repo->index, dir); > } > if (show_cached || show_stage) { > + ensure_full_index(repo->index); > for (i = 0; i < repo->index->cache_nr; i++) { > const struct cache_entry *ce = repo->index->cache[i]; > > @@ -332,6 +335,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir) > } > } > if (show_deleted || show_modified) { > + ensure_full_index(repo->index); > for (i = 0; i < repo->index->cache_nr; i++) { > const struct cache_entry *ce = repo->index->cache[i]; > struct stat st; > @@ -368,6 +372,7 @@ static void prune_index(struct index_state *istate, > > if (!prefix || !istate->cache_nr) > return; > + ensure_full_index(istate); > pos = index_name_pos(istate, prefix, prefixlen); > if (pos < 0) > pos = -pos-1; > @@ -428,6 +433,8 @@ void overlay_tree_on_index(struct index_state *istate, > if (!tree) > die("bad tree-ish %s", tree_name); > > + ensure_full_index(istate); > + > /* Hoist the unmerged entries up to stage #3 to make room */ > for (i = 0; i < istate->cache_nr; i++) { > struct cache_entry *ce = istate->cache[i]; > diff --git a/builtin/merge-index.c b/builtin/merge-index.c > index 38ea6ad6ca2..3e1ddabd650 100644 > --- a/builtin/merge-index.c > +++ b/builtin/merge-index.c > @@ -80,6 +80,8 @@ int cmd_merge_index(int argc, const char **argv, const char *prefix) > > read_cache(); > > + ensure_full_index(&the_index); > + > i = 1; > if (!strcmp(argv[i], "-o")) { > one_shot = 1; > diff --git a/builtin/mv.c b/builtin/mv.c > index 7dac714af90..2ab6416fce9 100644 > --- a/builtin/mv.c > +++ b/builtin/mv.c > @@ -145,6 +145,8 @@ int cmd_mv(int argc, const char **argv, const char *prefix) > if (read_cache() < 0) > die(_("index file corrupt")); > > + ensure_full_index(&the_index); > + > source = internal_prefix_pathspec(prefix, argv, argc, 0); > modes = xcalloc(argc, sizeof(enum update_mode)); > /* > diff --git a/builtin/rm.c b/builtin/rm.c > index 4858631e0f0..2db4fcd22d9 100644 > --- a/builtin/rm.c > +++ b/builtin/rm.c > @@ -291,6 +291,8 @@ int cmd_rm(int argc, const char **argv, const char *prefix) > > refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, &pathspec, NULL, NULL); > > + ensure_full_index(&the_index); > + > seen = xcalloc(pathspec.nr, 1); > > for (i = 0; i < active_nr; i++) { > diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c > index ca63e2c64e9..14022b5e182 100644 > --- a/builtin/sparse-checkout.c > +++ b/builtin/sparse-checkout.c > @@ -123,6 +123,7 @@ static int update_working_directory(struct pattern_list *pl) > o.pl = pl; > > setup_work_tree(); > + ensure_full_index(r->index); > > repo_hold_locked_index(r, &lock_file, LOCK_DIE_ON_ERROR); > > diff --git a/builtin/update-index.c b/builtin/update-index.c > index 79087bccea4..521a6c23c75 100644 > --- a/builtin/update-index.c > +++ b/builtin/update-index.c > @@ -1088,6 +1088,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > the_index.updated_skipworktree = 1; > > + ensure_full_index(&the_index); > + > /* > * Custom copy of parse_options() because we want to handle > * filename arguments as they come. > diff --git a/cache.h b/cache.h > index dcf089b7006..306eab444b9 100644 > --- a/cache.h > +++ b/cache.h > @@ -346,6 +346,7 @@ void add_name_hash(struct index_state *istate, struct cache_entry *ce); > void remove_name_hash(struct index_state *istate, struct cache_entry *ce); > void free_name_hash(struct index_state *istate); > > +void ensure_full_index(struct index_state *istate); > > /* Cache entry creation and cleanup */ > > diff --git a/diff-lib.c b/diff-lib.c > index b73cc1859a4..3743e4463b4 100644 > --- a/diff-lib.c > +++ b/diff-lib.c > @@ -96,6 +96,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option) > uint64_t start = getnanotime(); > struct index_state *istate = revs->diffopt.repo->index; > > + ensure_full_index(istate); > + > diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/"); > > refresh_fsmonitor(istate); > diff --git a/diff.c b/diff.c > index 2253ec88029..02fafee8587 100644 > --- a/diff.c > +++ b/diff.c > @@ -3901,6 +3901,8 @@ static int reuse_worktree_file(struct index_state *istate, > if (!want_file && would_convert_to_git(istate, name)) > return 0; > > + ensure_full_index(istate); > + > len = strlen(name); > pos = index_name_pos(istate, name, len); > if (pos < 0) > diff --git a/dir.c b/dir.c > index d153a63bbd1..ad6eb033cb1 100644 > --- a/dir.c > +++ b/dir.c > @@ -892,13 +892,15 @@ void add_pattern(const char *string, const char *base, > add_pattern_to_hashsets(pl, pattern); > } > > -static int read_skip_worktree_file_from_index(const struct index_state *istate, > +static int read_skip_worktree_file_from_index(struct index_state *istate, > const char *path, > size_t *size_out, char **data_out, > struct oid_stat *oid_stat) > { > int pos, len; > > + ensure_full_index(istate); > + > len = strlen(path); > pos = index_name_pos(istate, path, len); > if (pos < 0) > @@ -1088,6 +1090,10 @@ static int add_patterns(const char *fname, const char *base, int baselen, > close(fd); > if (oid_stat) { > int pos; > + > + if (istate) > + ensure_full_index(istate); > + > if (oid_stat->valid && > !match_stat_data_racy(istate, &oid_stat->stat, &st)) > ; /* no content change, oid_stat->oid still good */ > @@ -1696,6 +1702,8 @@ static enum exist_status directory_exists_in_index(struct index_state *istate, > if (ignore_case) > return directory_exists_in_index_icase(istate, dirname, len); > > + ensure_full_index(istate); > + > pos = index_name_pos(istate, dirname, len); > if (pos < 0) > pos = -pos-1; > @@ -2050,6 +2058,8 @@ static int get_index_dtype(struct index_state *istate, > int pos; > const struct cache_entry *ce; > > + ensure_full_index(istate); > + > ce = index_file_exists(istate, path, len, 0); > if (ce) { > if (!ce_uptodate(ce)) > @@ -3536,6 +3546,8 @@ static void connect_wt_gitdir_in_nested(const char *sub_worktree, > if (repo_read_index(&subrepo) < 0) > die(_("index file corrupt in repo %s"), subrepo.gitdir); > > + ensure_full_index(subrepo.index); > + > for (i = 0; i < subrepo.index->cache_nr; i++) { > const struct cache_entry *ce = subrepo.index->cache[i]; > > diff --git a/entry.c b/entry.c > index a0532f1f000..d505e6f2c6e 100644 > --- a/entry.c > +++ b/entry.c > @@ -412,6 +412,8 @@ static void mark_colliding_entries(const struct checkout *state, > > ce->ce_flags |= CE_MATCHED; > > + ensure_full_index(state->istate); > + > for (i = 0; i < state->istate->cache_nr; i++) { > struct cache_entry *dup = state->istate->cache[i]; > > diff --git a/fsmonitor.c b/fsmonitor.c > index fe9e9d7baf4..7b8cd3975b9 100644 > --- a/fsmonitor.c > +++ b/fsmonitor.c > @@ -97,6 +97,9 @@ int read_fsmonitor_extension(struct index_state *istate, const void *data, > void fill_fsmonitor_bitmap(struct index_state *istate) > { > unsigned int i, skipped = 0; > + > + ensure_full_index(istate); > + > istate->fsmonitor_dirty = ewah_new(); > for (i = 0; i < istate->cache_nr; i++) { > if (istate->cache[i]->ce_flags & CE_REMOVE) > @@ -158,7 +161,11 @@ static int query_fsmonitor(int version, const char *last_update, struct strbuf * > > static void fsmonitor_refresh_callback(struct index_state *istate, const char *name) > { > - int pos = index_name_pos(istate, name, strlen(name)); > + int pos; > + > + ensure_full_index(istate); > + > + pos = index_name_pos(istate, name, strlen(name)); > > if (pos >= 0) { > struct cache_entry *ce = istate->cache[pos]; > @@ -330,6 +337,8 @@ void tweak_fsmonitor(struct index_state *istate) > > if (istate->fsmonitor_dirty) { > if (fsmonitor_enabled) { > + ensure_full_index(istate); > + > /* Mark all entries valid */ > for (i = 0; i < istate->cache_nr; i++) { > istate->cache[i]->ce_flags |= CE_FSMONITOR_VALID; > diff --git a/merge-recursive.c b/merge-recursive.c > index f736a0f6323..12109f37723 100644 > --- a/merge-recursive.c > +++ b/merge-recursive.c > @@ -522,6 +522,8 @@ static struct string_list *get_unmerged(struct index_state *istate) > > unmerged->strdup_strings = 1; > > + ensure_full_index(istate); > + > for (i = 0; i < istate->cache_nr; i++) { > struct string_list_item *item; > struct stage_data *e; > @@ -762,6 +764,8 @@ static int dir_in_way(struct index_state *istate, const char *path, > strbuf_addstr(&dirpath, path); > strbuf_addch(&dirpath, '/'); > > + ensure_full_index(istate); > + > pos = index_name_pos(istate, dirpath.buf, dirpath.len); > > if (pos < 0) > @@ -785,9 +789,13 @@ static int dir_in_way(struct index_state *istate, const char *path, > static int was_tracked_and_matches(struct merge_options *opt, const char *path, > const struct diff_filespec *blob) > { > - int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path)); > + int pos; > struct cache_entry *ce; > > + ensure_full_index(&opt->priv->orig_index); > + > + pos = index_name_pos(&opt->priv->orig_index, path, strlen(path)); > + > if (0 > pos) > /* we were not tracking this path before the merge */ > return 0; > @@ -802,7 +810,11 @@ static int was_tracked_and_matches(struct merge_options *opt, const char *path, > */ > static int was_tracked(struct merge_options *opt, const char *path) > { > - int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path)); > + int pos; > + > + ensure_full_index(&opt->priv->orig_index); > + > + pos = index_name_pos(&opt->priv->orig_index, path, strlen(path)); > > if (0 <= pos) > /* we were tracking this path before the merge */ > @@ -814,6 +826,9 @@ static int was_tracked(struct merge_options *opt, const char *path) > static int would_lose_untracked(struct merge_options *opt, const char *path) > { > struct index_state *istate = opt->repo->index; > + int pos; > + > + ensure_full_index(istate); > > /* > * This may look like it can be simplified to: > @@ -832,7 +847,7 @@ static int would_lose_untracked(struct merge_options *opt, const char *path) > * update_file()/would_lose_untracked(); see every comment in this > * file which mentions "update_stages". > */ > - int pos = index_name_pos(istate, path, strlen(path)); > + pos = index_name_pos(istate, path, strlen(path)); > > if (pos < 0) > pos = -1 - pos; > @@ -3086,6 +3101,7 @@ static int handle_content_merge(struct merge_file_info *mfi, > * flag to avoid making the file appear as if it were > * deleted by the user. > */ > + ensure_full_index(&opt->priv->orig_index); > pos = index_name_pos(&opt->priv->orig_index, path, strlen(path)); > ce = opt->priv->orig_index.cache[pos]; > if (ce_skip_worktree(ce)) { > diff --git a/name-hash.c b/name-hash.c > index 4e03fac9bb1..0f6d4fcca5a 100644 > --- a/name-hash.c > +++ b/name-hash.c > @@ -679,6 +679,8 @@ int index_dir_exists(struct index_state *istate, const char *name, int namelen) > { > struct dir_entry *dir; > > + ensure_full_index(istate); > + > lazy_init_name_hash(istate); > dir = find_dir_entry(istate, name, namelen); > return dir && dir->nr; > @@ -689,6 +691,8 @@ void adjust_dirname_case(struct index_state *istate, char *name) > const char *startPtr = name; > const char *ptr = startPtr; > > + ensure_full_index( istate); > + > lazy_init_name_hash(istate); > while (*ptr) { > while (*ptr && *ptr != '/') > @@ -712,6 +716,8 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na > struct cache_entry *ce; > unsigned int hash = memihash(name, namelen); > > + ensure_full_index(istate); > + > lazy_init_name_hash(istate); > > ce = hashmap_get_entry_from_hash(&istate->name_hash, hash, NULL, > diff --git a/pathspec.c b/pathspec.c > index 7a229d8d22f..9b105855483 100644 > --- a/pathspec.c > +++ b/pathspec.c > @@ -20,7 +20,7 @@ > * to use find_pathspecs_matching_against_index() instead. > */ > void add_pathspec_matches_against_index(const struct pathspec *pathspec, > - const struct index_state *istate, > + struct index_state *istate, > char *seen) > { > int num_unmatched = 0, i; > @@ -36,6 +36,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec, > num_unmatched++; > if (!num_unmatched) > return; > + ensure_full_index(istate); > for (i = 0; i < istate->cache_nr; i++) { > const struct cache_entry *ce = istate->cache[i]; > ce_path_match(istate, ce, pathspec, seen); > @@ -51,7 +52,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec, > * given pathspecs achieves against all items in the index. > */ > char *find_pathspecs_matching_against_index(const struct pathspec *pathspec, > - const struct index_state *istate) > + struct index_state *istate) > { > char *seen = xcalloc(pathspec->nr, 1); > add_pathspec_matches_against_index(pathspec, istate, seen); > diff --git a/pathspec.h b/pathspec.h > index 454ce364fac..f19c5dcf022 100644 > --- a/pathspec.h > +++ b/pathspec.h > @@ -150,10 +150,10 @@ static inline int ps_strcmp(const struct pathspec_item *item, > } > > void add_pathspec_matches_against_index(const struct pathspec *pathspec, > - const struct index_state *istate, > + struct index_state *istate, > char *seen); > char *find_pathspecs_matching_against_index(const struct pathspec *pathspec, > - const struct index_state *istate); > + struct index_state *istate); > int match_pathspec_attrs(const struct index_state *istate, > const char *name, int namelen, > const struct pathspec_item *item); > diff --git a/read-cache.c b/read-cache.c > index 0522260416e..65679d70d7c 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -622,7 +622,11 @@ void remove_marked_cache_entries(struct index_state *istate, int invalidate) > > int remove_file_from_index(struct index_state *istate, const char *path) > { > - int pos = index_name_pos(istate, path, strlen(path)); > + int pos; > + > + ensure_full_index(istate); > + > + pos = index_name_pos(istate, path, strlen(path)); > if (pos < 0) > pos = -pos-1; > cache_tree_invalidate_path(istate, path); > @@ -640,9 +644,12 @@ static int compare_name(struct cache_entry *ce, const char *path, int namelen) > static int index_name_pos_also_unmerged(struct index_state *istate, > const char *path, int namelen) > { > - int pos = index_name_pos(istate, path, namelen); > + int pos; > struct cache_entry *ce; > > + ensure_full_index(istate); > + > + pos = index_name_pos(istate, path, namelen); > if (pos >= 0) > return pos; > > @@ -717,6 +724,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, > int hash_flags = HASH_WRITE_OBJECT; > struct object_id oid; > > + ensure_full_index(istate); > + > if (flags & ADD_CACHE_RENORMALIZE) > hash_flags |= HASH_RENORMALIZE; > > @@ -1095,6 +1104,8 @@ static int has_dir_name(struct index_state *istate, > size_t len_eq_last; > int cmp_last = 0; > > + ensure_full_index(istate); > + > /* > * We are frequently called during an iteration on a sorted > * list of pathnames and while building a new index. Therefore, > @@ -1338,6 +1349,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti > { > int pos; > > + ensure_full_index(istate); > + > if (option & ADD_CACHE_JUST_APPEND) > pos = istate->cache_nr; > else { > @@ -1547,6 +1560,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, > * we only have to do the special cases that are left. > */ > preload_index(istate, pathspec, 0); > + > + ensure_full_index(istate); > for (i = 0; i < istate->cache_nr; i++) { > struct cache_entry *ce, *new_entry; > int cache_errno = 0; > diff --git a/rerere.c b/rerere.c > index 9281131a9f1..1836a6cfbcf 100644 > --- a/rerere.c > +++ b/rerere.c > @@ -962,6 +962,8 @@ static int handle_cache(struct index_state *istate, > struct rerere_io_mem io; > int marker_size = ll_merge_marker_size(istate, path); > > + ensure_full_index(istate); > + > /* > * Reproduce the conflicted merge in-core > */ > diff --git a/resolve-undo.c b/resolve-undo.c > index 236320f179c..a4265834977 100644 > --- a/resolve-undo.c > +++ b/resolve-undo.c > @@ -125,6 +125,8 @@ int unmerge_index_entry_at(struct index_state *istate, int pos) > if (!istate->resolve_undo) > return pos; > > + ensure_full_index(istate); > + > ce = istate->cache[pos]; > if (ce_stage(ce)) { > /* already unmerged */ > @@ -172,6 +174,8 @@ void unmerge_marked_index(struct index_state *istate) > if (!istate->resolve_undo) > return; > > + ensure_full_index(istate); > + > for (i = 0; i < istate->cache_nr; i++) { > const struct cache_entry *ce = istate->cache[i]; > if (ce->ce_flags & CE_MATCHED) > @@ -186,6 +190,8 @@ void unmerge_index(struct index_state *istate, const struct pathspec *pathspec) > if (!istate->resolve_undo) > return; > > + ensure_full_index(istate); > + > for (i = 0; i < istate->cache_nr; i++) { > const struct cache_entry *ce = istate->cache[i]; > if (!ce_path_match(istate, ce, pathspec, NULL)) > diff --git a/sha1-name.c b/sha1-name.c > index 0b23b86ceb4..c2f17e526ab 100644 > --- a/sha1-name.c > +++ b/sha1-name.c > @@ -1734,6 +1734,8 @@ static void diagnose_invalid_index_path(struct repository *r, > if (!prefix) > prefix = ""; > > + ensure_full_index(r->index); > + > /* Wrong stage number? */ > pos = index_name_pos(istate, filename, namelen); > if (pos < 0) > @@ -1854,6 +1856,7 @@ static enum get_oid_result get_oid_with_context_1(struct repository *repo, > > if (!repo->index || !repo->index->cache) > repo_read_index(repo); > + ensure_full_index(repo->index); > pos = index_name_pos(repo->index, cp, namelen); > if (pos < 0) > pos = -pos - 1; > diff --git a/split-index.c b/split-index.c > index c0e8ad670d0..3150fa6476a 100644 > --- a/split-index.c > +++ b/split-index.c > @@ -4,6 +4,8 @@ > > struct split_index *init_split_index(struct index_state *istate) > { > + ensure_full_index(istate); > + > if (!istate->split_index) { > istate->split_index = xcalloc(1, sizeof(*istate->split_index)); > istate->split_index->refcount = 1; > diff --git a/submodule.c b/submodule.c > index b3bb59f0664..f80cfddbd52 100644 > --- a/submodule.c > +++ b/submodule.c > @@ -33,9 +33,13 @@ static struct oid_array ref_tips_after_fetch; > * will be disabled because we can't guess what might be configured in > * .gitmodules unless the user resolves the conflict. > */ > -int is_gitmodules_unmerged(const struct index_state *istate) > +int is_gitmodules_unmerged(struct index_state *istate) > { > - int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE)); > + int pos; > + > + ensure_full_index(istate); > + > + pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE)); > if (pos < 0) { /* .gitmodules not found or isn't merged */ > pos = -1 - pos; > if (istate->cache_nr > pos) { /* there is a .gitmodules */ > @@ -77,7 +81,11 @@ int is_writing_gitmodules_ok(void) > */ > int is_staging_gitmodules_ok(struct index_state *istate) > { > - int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE)); > + int pos; > + > + ensure_full_index(istate); > + > + pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE)); > > if ((pos >= 0) && (pos < istate->cache_nr)) { > struct stat st; > @@ -301,7 +309,7 @@ int is_submodule_populated_gently(const char *path, int *return_error_code) > /* > * Dies if the provided 'prefix' corresponds to an unpopulated submodule > */ > -void die_in_unpopulated_submodule(const struct index_state *istate, > +void die_in_unpopulated_submodule(struct index_state *istate, > const char *prefix) > { > int i, prefixlen; > @@ -311,6 +319,8 @@ void die_in_unpopulated_submodule(const struct index_state *istate, > > prefixlen = strlen(prefix); > > + ensure_full_index(istate); > + > for (i = 0; i < istate->cache_nr; i++) { > struct cache_entry *ce = istate->cache[i]; > int ce_len = ce_namelen(ce); > @@ -331,11 +341,13 @@ void die_in_unpopulated_submodule(const struct index_state *istate, > /* > * Dies if any paths in the provided pathspec descends into a submodule > */ > -void die_path_inside_submodule(const struct index_state *istate, > +void die_path_inside_submodule(struct index_state *istate, > const struct pathspec *ps) > { > int i, j; > > + ensure_full_index(istate); > + > for (i = 0; i < istate->cache_nr; i++) { > struct cache_entry *ce = istate->cache[i]; > int ce_len = ce_namelen(ce); > @@ -1420,6 +1432,8 @@ static int get_next_submodule(struct child_process *cp, > { > struct submodule_parallel_fetch *spf = data; > > + ensure_full_index(spf->r->index); > + > for (; spf->count < spf->r->index->cache_nr; spf->count++) { > const struct cache_entry *ce = spf->r->index->cache[spf->count]; > const char *default_argv; > diff --git a/submodule.h b/submodule.h > index 4ac6e31cf1f..84640c49c11 100644 > --- a/submodule.h > +++ b/submodule.h > @@ -39,7 +39,7 @@ struct submodule_update_strategy { > }; > #define SUBMODULE_UPDATE_STRATEGY_INIT {SM_UPDATE_UNSPECIFIED, NULL} > > -int is_gitmodules_unmerged(const struct index_state *istate); > +int is_gitmodules_unmerged(struct index_state *istate); > int is_writing_gitmodules_ok(void); > int is_staging_gitmodules_ok(struct index_state *istate); > int update_path_in_gitmodules(const char *oldpath, const char *newpath); > @@ -60,9 +60,9 @@ int is_submodule_active(struct repository *repo, const char *path); > * Otherwise the return error code is the same as of resolve_gitdir_gently. > */ > int is_submodule_populated_gently(const char *path, int *return_error_code); > -void die_in_unpopulated_submodule(const struct index_state *istate, > +void die_in_unpopulated_submodule(struct index_state *istate, > const char *prefix); > -void die_path_inside_submodule(const struct index_state *istate, > +void die_path_inside_submodule(struct index_state *istate, > const struct pathspec *ps); > enum submodule_update_type parse_submodule_update_type(const char *value); > int parse_submodule_update_strategy(const char *value, > diff --git a/tree.c b/tree.c > index e76517f6b18..60f575440c8 100644 > --- a/tree.c > +++ b/tree.c > @@ -170,6 +170,8 @@ int read_tree(struct repository *r, struct tree *tree, int stage, > * to matter. > */ > > + ensure_full_index(istate); > + > /* > * See if we have cache entry at the stage. If so, > * do it the original slow way, otherwise, append and then > diff --git a/wt-status.c b/wt-status.c > index 7074bbdd53c..5366d336938 100644 > --- a/wt-status.c > +++ b/wt-status.c > @@ -509,6 +509,8 @@ static int unmerged_mask(struct index_state *istate, const char *path) > int pos, mask; > const struct cache_entry *ce; > > + ensure_full_index(istate); > + > pos = index_name_pos(istate, path, strlen(path)); > if (0 <= pos) > return 0; > @@ -657,6 +659,8 @@ static void wt_status_collect_changes_initial(struct wt_status *s) > struct index_state *istate = s->repo->index; > int i; > > + ensure_full_index(istate); > + > for (i = 0; i < istate->cache_nr; i++) { > struct string_list_item *it; > struct wt_status_change_data *d; > @@ -2295,6 +2299,9 @@ static void wt_porcelain_v2_print_unmerged_entry( > */ > memset(stages, 0, sizeof(stages)); > sum = 0; > + > + ensure_full_index(istate); > + > pos = index_name_pos(istate, it->string, strlen(it->string)); > assert(pos < 0); > pos = -pos-1; > -- > gitgitgadget >

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/1/2021 3:22 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> From: Derrick Stolee <dstolee@microsoft.com> >> >> This giant patch is not intended for actual review. I have a branch that >> has these changes split out in a sane way with some commentary in each >> file that is modified. >> >> The idea here is to guard certain portions of the codebase that do not >> know how to handle sparse indexes by ensuring that the index is expanded >> to a full index before proceeding with the logic. >> >> This also provides a good mechanism for testing which code needs >> updating to enable the sparse index in a Git builtin. The builtin can >> set the_repository->settings.command_requires_full_index to zero and >> then we can debug the command with a breakpoint on ensure_full_index(). >> That identifies the portion of code that needs adjusting before enabling >> sparse indexes for that command. >> >> Some index operations must be changed to operate on a non-const pointer, >> since ensuring a full index will modify the index itself. >> >> There are likely some gaps to these protections, which is why it will be >> important to carefully test each scenario as we relax the requirements. >> I expect that to be a long effort. > > I think the idea makes sense; it provides a way for us to > incrementally build support for this new feature. > > I skimmed over the code and noticed various interesting places that > had the ensure_full_index() call (e.g. > read_skip_worktree_file_from_index() -- whose existence comes from > sparsity; what irony...). Better breakouts would be great, so I'll > defer commenting much until then. But, just to verify I'm > understanding: the primary defence is the command_requires_full_index > setting, and you have added several ensure_full_index() calls > throughout the code in places you believe would need to be fixed up in > case someone switches the command_requires_full_index setting. Is > that correct? And your comment on the gaps is just that there may be > other places that are missing the secondary protection (as opposed to > my first reading of that paragraph as suggesting we aren't sure if we > have enough protections yet and need to add more before this moves out > of RFC); is that right? Yes, the idea is that we can incrementally enable command_requires_full_index for some builtins and be confident that corner cases will be protected by ensure_full_index(). Further, we can test whether ensure_full_index() was called using test_region in test scripts to demonstrate that a command is truly "sparse aware" or if it is converting to full and back to sparse. There is also the case that when we write the index into a sparse format, the in-memory structure is modified. If the index is re-used afterwards, then we must expand to full again for these code paths. unpack_trees() already has one of these calls because it was necessary for the sparse-index write to work. The ensure_full_index() pattern also works when updating a builtin to work with the sparse-index because of the breakpoint trick. When I submit this as a full series, this patch will be one full patch series submission with careful comments about why each of these is added on a file-by-file basis. Thanks, -Stolee

gitgitgadget · 2021-02-01T20:54:20Z

dir.h

@@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate,
 				char *seen)


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > As a first step to integrate 'git status' and 'git add' with the sparse > index, we must start integrating unpack_trees() with sparse directory > entries. These changes are currently impossible to trigger because > unpack_trees() calls ensure_full_index() if command_requires_full_index > is true. This is the case for all commands at the moment. As we expand > more commands to be sparse-aware, we might find that more changes are > required to unpack_trees(). The current changes will suffice for > 'status' and 'add'. > > unpack_trees() calls the traverse_trees() API using unpack_callback() > to decide if we should recurse into a subtree. We must add new abilities > to skip a subtree if it corresponds to a sparse directory entry. Makes sense. > It is important to be careful about the trailing directory separator > that exists in the sparse directory entries but not in the subtree > paths. The comment makes me wonder if leaving the trailing directory separator out would be better, as it'd allow direct comparisons. Of course, you have a better idea of what is easier or harder based on this decision. Is there any chance you have a quick list of the places that the code was simplified by this decision and a list of places like this one that were made slightly harder? > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > dir.h | 2 +- > preload-index.c | 2 ++ > read-cache.c | 3 +++ > unpack-trees.c | 24 ++++++++++++++++++++++-- > 4 files changed, 28 insertions(+), 3 deletions(-) > > diff --git a/dir.h b/dir.h > index facfae47402..300305ec335 100644 > --- a/dir.h > +++ b/dir.h > @@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate, > char *seen) > { > return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen, > - S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode)); > + S_ISSPARSEDIR(ce) || S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode)); I think this hunk becomes unnecessary if you use ce_mode = 040000 for sparse directory entries. > } > > static inline int dir_path_match(const struct index_state *istate, > diff --git a/preload-index.c b/preload-index.c > index ed6eaa47388..323fc8c5100 100644 > --- a/preload-index.c > +++ b/preload-index.c > @@ -54,6 +54,8 @@ static void *preload_thread(void *_data) > continue; > if (S_ISGITLINK(ce->ce_mode)) > continue; > + if (S_ISSPARSEDIR(ce)) > + continue; > if (ce_uptodate(ce)) > continue; > if (ce_skip_worktree(ce)) > diff --git a/read-cache.c b/read-cache.c > index 65679d70d7c..ab0c2b86ec0 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -1572,6 +1572,9 @@ int refresh_index(struct index_state *istate, unsigned int flags, > if (ignore_submodules && S_ISGITLINK(ce->ce_mode)) > continue; > > + if (istate->sparse_index && S_ISSPARSEDIR(ce)) > + continue; > + > if (pathspec && !ce_path_match(istate, ce, pathspec, seen)) > filtered = 1; > > diff --git a/unpack-trees.c b/unpack-trees.c > index b324eec2a5d..90644856a80 100644 > --- a/unpack-trees.c > +++ b/unpack-trees.c > @@ -583,6 +583,13 @@ static void mark_ce_used(struct cache_entry *ce, struct unpack_trees_options *o) > { > ce->ce_flags |= CE_UNPACKED; > > + /* > + * If this is a sparse directory, don't advance cache_bottom. > + * That will be advanced later using the cache-tree data. > + */ > + if (S_ISSPARSEDIR(ce)) > + return; I don't grok the cache_bottom stuff -- in general, nothing specific about your patch. But since I don't grok that stuff, it means I don't understand how your comment here relates; you may want to ping another reviewer about this portion of the patch. > + > if (o->cache_bottom < o->src_index->cache_nr && > o->src_index->cache[o->cache_bottom] == ce) { > int bottom = o->cache_bottom; > @@ -980,6 +987,9 @@ static int do_compare_entry(const struct cache_entry *ce, > ce_len -= pathlen; > ce_name = ce->name + pathlen; > > + /* remove directory separator if a sparse directory entry */ > + if (S_ISSPARSEDIR(ce)) > + ce_len--; Here's where your comment about trailing separator comes in; makes sense. > return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode); > } > > @@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf > if (cmp) > return cmp; > > + /* If ce is a sparse directory, then allow equality here. */ > + if (S_ISSPARSEDIR(ce)) > + return 0; > + This seems surprising to me. Is there a chance you are comparing sparse directory A with sparse directory B and you return with equality? Or sparse_directory A with regular file B? Do the callers still do the right thing? If your code change here is right, it seems like it deserves an extra comment either in the code or the commit message. > /* > * Even if the beginning compared identically, the ce should > * compare as bigger than a directory leading up to it! > @@ -1239,6 +1253,7 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str > struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, }; > struct unpack_trees_options *o = info->data; > const struct name_entry *p = names; > + unsigned recurse = 1; > > /* Find first entry with a real name (we could use "mask" too) */ > while (!p->mode) > @@ -1280,12 +1295,16 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str > } > } > src[0] = ce; > + > + if (S_ISSPARSEDIR(ce)) > + recurse = 0; > } > break; > } > } > > - if (unpack_nondirectories(n, mask, dirmask, src, names, info) < 0) > + if (recurse && > + unpack_nondirectories(n, mask, dirmask, src, names, info) < 0) > return -1; > > if (o->merge && src[0]) { > @@ -1315,7 +1334,8 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str > } > } > > - if (traverse_trees_recursive(n, dirmask, mask & ~dirmask, > + if (recurse && > + traverse_trees_recursive(n, dirmask, mask & ~dirmask, > names, info) < 0) > return -1; > return mask; The unpack_callback() code has some comparison to a cache-tree, but I'd assume that you'd need to update cache-tree.c somewhat to take advantage of these sparse directory entries. Am I wrong, and you just get cache-tree.c working with sparse directory entries for free? Or is this something coming in a later patch?

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/1/2021 3:50 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> It is important to be careful about the trailing directory separator >> that exists in the sparse directory entries but not in the subtree >> paths. > > The comment makes me wonder if leaving the trailing directory > separator out would be better, as it'd allow direct comparisons. Of > course, you have a better idea of what is easier or harder based on > this decision. Is there any chance you have a quick list of the > places that the code was simplified by this decision and a list of > places like this one that were made slightly harder? I'm going through all of your comments and making notes about areas to fix and clean up before starting a new series for full review. This question of the trailing slash is important, and I will take particular care about answering it as I rework the series. However, the questions in this patch poke at the right places... >> + /* remove directory separator if a sparse directory entry */ >> + if (S_ISSPARSEDIR(ce)) >> + ce_len--; > > Here's where your comment about trailing separator comes in; makes sense. > >> return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode); >> } >> >> @@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf >> if (cmp) >> return cmp; >> >> + /* If ce is a sparse directory, then allow equality here. */ >> + if (S_ISSPARSEDIR(ce)) >> + return 0; >> + > > This seems surprising to me. Is there a chance you are comparing > sparse directory A with sparse directory B and you return with > equality? Or sparse_directory A with regular file B? Do the callers > still do the right thing? If your code change here is right, it seems > like it deserves an extra comment either in the code or the commit > message. Sometimes a caller is asking for the first index entry corresponding to a directory. In these cases, the input could be "A/B/C/". We want to ensure that a sparse directory entry corresponding exactly to that directory is correctly matched. If we place "A/B/C" in the index instead, this search becomes more complicated (I think; I will justify this more after thinking about it). At this point in time, we are just saying "We found the entry with equal path value!" and not failing with the check in the rest of the method: /* * Even if the beginning compared identically, the ce should * compare as bigger than a directory leading up to it! */ return ce_namelen(ce) > traverse_path_len(info, tree_entry_len(n)); >> - if (traverse_trees_recursive(n, dirmask, mask & ~dirmask, >> + if (recurse && >> + traverse_trees_recursive(n, dirmask, mask & ~dirmask, >> names, info) < 0) >> return -1; >> return mask; > > The unpack_callback() code has some comparison to a cache-tree, but > I'd assume that you'd need to update cache-tree.c somewhat to take > advantage of these sparse directory entries. Am I wrong, and you just > get cache-tree.c working with sparse directory entries for free? Or > is this something coming in a later patch? In the RFC, I integrate the cache-tree with the sparse-index at the very end. I will move that integration to be much earlier in the next submission, so it becomes part of the format discussion. Thanks, -Stolee

gitgitgadget · 2021-02-01T22:15:37Z

dir.c

@@ -892,13 +892,15 @@ void add_pattern(const char *string, const char *base,
 	add_pattern_to_hashsets(pl, pattern);


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > When we have sparse directory entries in the index, we want to compare > that directory against sparse-checkout patterns. Those pattern matching > algorithms are built expecting a file path, not a directory path. This > is especially important in the "cone mode" patterns which will match > files that exist within the "parent directories" as well as the > recursive directory matches. > > If path_matches_pattern_list() is given a directory, we can add a bogus > filename ("-") to the directory and get the same results as before, > assuming we are in cone mode. Since sparse index requires cone mode > patterns, this is an acceptable assumption. Why is "-" a bogus filename? Is that only on certain operating systems, or are you just not expecting a user to name their file with such a bad name? What if there is a file with that name in that directory in the repository; do you need the pathname to be bogus? What do you mean by "get the same results as before"? The first paragraph suggests the code wouldn't handle a directory path, and that not handling it was problematic, so it seems unlikely you want the same results as that. But it's not clear what the "before" refers to here. > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > dir.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/dir.c b/dir.c > index ad6eb033cb1..c786fa98d0e 100644 > --- a/dir.c > +++ b/dir.c > @@ -1384,6 +1384,11 @@ enum pattern_match_result path_matches_pattern_list( > strbuf_addch(&parent_pathname, '/'); > strbuf_add(&parent_pathname, pathname, pathlen); > > + /* Directory requests should be added as if they are a file */ > + if (parent_pathname.len > 1 && > + parent_pathname.buf[parent_pathname.len - 1] == '/') Ah, this looks like a case where the trailing slash is helpful; without it, you might have to feed extra data in through the call hierarchy to signify that this is a directory entry. > + strbuf_add(&parent_pathname, "-", 1); > + > if (hashmap_contains_path(&pl->recursive_hashmap, > &parent_pathname)) { > result = MATCHED_RECURSIVE; hashmap_contains_path? Don't we already know (modulo special cases of our bogus value not quite being bogus enough) that this is false since we were adding a bogus path? How could the hashmap have a bogus value in it? Won't this particular call fail with or without our adding "-" to the end of the path? After this hashmap_contains_path() call, the subsequent code looks for the parent of the path by stripping off everything after the last '/'...which seems like the relevant code anyway. Is the problem that the hashmap_contains_path() call was returning true when we didn't add "-" to the end? If so, can we use and if or a goto instead to make the code skip this first check and move on to where we want it to go? Or am I misunderstanding something about this code?

gitgitgadget · 2021-02-01T22:36:41Z

name-hash.c

@@ -8,6 +8,7 @@
 #include "cache.h"


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > We need to check the file hashmap first, then look to see if the > directory signals a non-sparse directory entry. In such a case, we can > rely on the contents of the sparse-index. > > We still use ensure_full_index() in the case that we hit a path that is > within a sparse-directory entry. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > name-hash.c | 6 ++++++ > sparse-index.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 59 insertions(+) > > diff --git a/name-hash.c b/name-hash.c > index 641f6900a7c..cb0f316f652 100644 > --- a/name-hash.c > +++ b/name-hash.c > @@ -110,6 +110,12 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce) > if (ce->ce_flags & CE_HASHED) > return; > ce->ce_flags |= CE_HASHED; > + > + if (ce->ce_mode == CE_MODE_SPARSE_DIRECTORY) { > + add_dir_entry(istate, ce); > + return; > + } > + > hashmap_entry_init(&ce->ent, memihash(ce->name, ce_namelen(ce))); > hashmap_add(&istate->name_hash, &ce->ent); > > diff --git a/sparse-index.c b/sparse-index.c > index dd1a06dfdd3..bf8dce9a09b 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -281,9 +281,62 @@ void ensure_full_index(struct index_state *istate) > trace2_region_leave("index", "ensure_full_index", istate->repo); > } > > +static int in_expand_to_path = 0; > + > void expand_to_path(struct index_state *istate, > const char *path, size_t pathlen, int icase) > { > + struct strbuf path_as_dir = STRBUF_INIT; > + int pos; > + > + /* prevent extra recursion */ > + if (in_expand_to_path) > + return; Maybe "prevent extra expand_to_path() <-> index_file_exists() recursion", just to be extra explicit? > + > + if (!istate || !istate->sparse_index) > + return; > + > + if (!istate->repo) > + istate->repo = the_repository; So, we assume the_repository if istate->repo isn't set. I guess given the number of the_repository assumptions we have in the code, this isn't a big deal. And instead of a USE_THE_REPOSITORY_COMPATIBILITY_MACROS we have a NO_THE_REPOSITORY_COMPATIBILITY_MACROS, so there's nothing to mark this either. > + > + in_expand_to_path = 1; > + > + /* > + * We only need to actually expand a region if the > + * following are both true: > + * > + * 1. 'path' is not already in the index. > + * 2. Some parent directory of 'path' is a sparse directory. > + */ > + > + strbuf_add(&path_as_dir, path, pathlen); > + strbuf_addch(&path_as_dir, '/'); > + > + /* in_expand_to_path prevents infinite recursion here */ > + if (index_file_exists(istate, path, pathlen, icase)) > + goto cleanup; Shouldn't the editing of path_as_dir be done after the index_file_exists() call? In the case that the entry already exists, writing to path_as_dir is wasted work. > + pos = index_name_pos(istate, path_as_dir.buf, path_as_dir.len); > + > + if (pos < 0) > + pos = -pos - 1; > + > + /* > + * Even if the path doesn't exist, if the value isn't exactly a > + * sparse-directory entry, then there is no need to expand the > + * index. > + */ > + if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY) > + goto cleanup; This looked wrong to me until I tried to come up with a counter-example. Here you are relying on the fact that before the comment, pos is going to be the index of a sparse directory entry -- either for path_as_dir or some ancestor directory. It would be nice if the comment mentioned that. > + > + trace2_region_enter("index", "expand_to_path", istate->repo); > + > /* for now, do the obviously-correct, slow thing */ > ensure_full_index(istate); > + > + trace2_region_leave("index", "expand_to_path", istate->repo); > + > +cleanup: > + strbuf_release(&path_as_dir); > + in_expand_to_path = 0; > } > -- > gitgitgadget Looks good otherwise.

gitgitgadget · 2021-02-01T23:12:53Z

builtin/add.c

@@ -491,6 +491,9 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 	add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize;


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > Replace enough callers to ensure_full_index() to instead call > expand_to_path() to reduce how often 'git add' expands a sparse index in > memory (before writing a sparse index again). > > One non-obvious case is index_name_pos_also_unmerged() which is only hit > on the Windows platform (in my tests). Use expand_to_path() instead of > ensure_full_index(). I read this paragraph as saying that the conversion of index_name_pos_also_unmerged() was tricky, whereas after reading the patch it looks like the conversion was trivial and you were perhaps meaning to say that it is easy to miss that this function also needs a conversion. Also, since the second sentence is true of all the conversions, not sure how much it helps to highlight it when just talking about this one function. Both of these are minor quibbles, but if there's a clever way to reword here that reduces potential confusion, that'd be great. > Add a test to check that 'git add -A' and 'git add <file>' does not > expand the index at all, as long as <file> is not within a sparse > directory. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > builtin/add.c | 3 +++ > dir.c | 8 ++++---- > read-cache.c | 10 +++++----- > sparse-index.c | 18 ++++++++++++++---- > t/t1092-sparse-checkout-compatibility.sh | 14 ++++++++++++++ > 5 files changed, 40 insertions(+), 13 deletions(-) > > diff --git a/builtin/add.c b/builtin/add.c > index a825887c503..b73f8d51de6 100644 > --- a/builtin/add.c > +++ b/builtin/add.c > @@ -491,6 +491,9 @@ int cmd_add(int argc, const char **argv, const char *prefix) > add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize; > require_pathspec = !(take_worktree_changes || (0 < addremove_explicit)); > > + prepare_repo_settings(the_repository); > + the_repository->settings.command_requires_full_index = 0; > + > hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR); > > /* > diff --git a/dir.c b/dir.c > index c786fa98d0e..21998c7c4b7 100644 > --- a/dir.c > +++ b/dir.c > @@ -18,6 +18,7 @@ > #include "ewah/ewok.h" > #include "fsmonitor.h" > #include "submodule-config.h" > +#include "sparse-index.h" > > /* > * Tells read_directory_recursive how a file or directory should be treated. > @@ -899,9 +900,9 @@ static int read_skip_worktree_file_from_index(struct index_state *istate, > { > int pos, len; > > - ensure_full_index(istate); > - > len = strlen(path); > + > + expand_to_path(istate, path, len, 0); > pos = index_name_pos(istate, path, len); > if (pos < 0) > return -1; > @@ -1707,8 +1708,7 @@ static enum exist_status directory_exists_in_index(struct index_state *istate, > if (ignore_case) > return directory_exists_in_index_icase(istate, dirname, len); > > - ensure_full_index(istate); > - > + expand_to_path(istate, dirname, len, 0); > pos = index_name_pos(istate, dirname, len); > if (pos < 0) > pos = -pos-1; > diff --git a/read-cache.c b/read-cache.c > index 78910d8f1b7..8c974829497 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -647,7 +647,7 @@ static int index_name_pos_also_unmerged(struct index_state *istate, > int pos; > struct cache_entry *ce; > > - ensure_full_index(istate); > + expand_to_path(istate, path, namelen, 0); > > pos = index_name_pos(istate, path, namelen); > if (pos >= 0) > @@ -724,8 +724,6 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, > int hash_flags = HASH_WRITE_OBJECT; > struct object_id oid; > > - ensure_full_index(istate); > - > if (flags & ADD_CACHE_RENORMALIZE) > hash_flags |= HASH_RENORMALIZE; > > @@ -733,6 +731,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, > return error(_("%s: can only add regular files, symbolic links or git-directories"), path); > > namelen = strlen(path); > + expand_to_path(istate, path, namelen, 0); > + > if (S_ISDIR(st_mode)) { > if (resolve_gitlink_ref(path, "HEAD", &oid) < 0) > return error(_("'%s' does not have a commit checked out"), path); > @@ -1104,7 +1104,7 @@ static int has_dir_name(struct index_state *istate, > size_t len_eq_last; > int cmp_last = 0; > > - ensure_full_index(istate); > + expand_to_path(istate, ce->name, ce->ce_namelen, 0); > > /* > * We are frequently called during an iteration on a sorted > @@ -1349,7 +1349,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti > { > int pos; > > - ensure_full_index(istate); > + expand_to_path(istate, ce->name, ce->ce_namelen, 0); > > if (option & ADD_CACHE_JUST_APPEND) > pos = istate->cache_nr; > diff --git a/sparse-index.c b/sparse-index.c > index bf8dce9a09b..a201f3b905c 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -286,6 +286,7 @@ static int in_expand_to_path = 0; > void expand_to_path(struct index_state *istate, > const char *path, size_t pathlen, int icase) > { > + struct cache_entry *ce = NULL; > struct strbuf path_as_dir = STRBUF_INIT; > int pos; > > @@ -320,13 +321,22 @@ void expand_to_path(struct index_state *istate, > > if (pos < 0) > pos = -pos - 1; > + if (pos < istate->cache_nr) > + ce = istate->cache[pos]; > > /* > - * Even if the path doesn't exist, if the value isn't exactly a > - * sparse-directory entry, then there is no need to expand the > - * index. > + * If we didn't land on a sparse directory, then there is > + * nothing to expand. > */ > - if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY) > + if (ce && !S_ISSPARSEDIR(ce)) > + goto cleanup; Seems like these changes to expand_to_path() could and maybe should be squashed into the commit that introduces expand_to_path()? > + /* > + * If that sparse directory is not a prefix of the path we > + * are looking for, then we don't need to expand. > + */ > + if (ce && > + (ce->ce_namelen >= path_as_dir.len || > + strncmp(ce->name, path_as_dir.buf, ce->ce_namelen))) Should this also be squashed into the commit that introduces expand_to_path()? Why is this check added here? > goto cleanup; > > trace2_region_enter("index", "expand_to_path", istate->repo); > diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh > index 09650f0755c..ae594ab880c 100755 > --- a/t/t1092-sparse-checkout-compatibility.sh > +++ b/t/t1092-sparse-checkout-compatibility.sh > @@ -390,6 +390,20 @@ test_expect_success 'sparse-index is expanded and converted back' ' > GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > git -C sparse-index -c core.fsmonitor="" reset --hard && > test_region index convert_to_sparse trace2.txt && > + test_region index ensure_full_index trace2.txt && > + > + rm trace2.txt && > + echo >>sparse-index/README.md && > + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > + git -C sparse-index -c core.fsmonitor="" add -A && > + test_region index convert_to_sparse trace2.txt && > + test_region index ensure_full_index trace2.txt && > + > + rm trace2.txt && > + echo >>sparse-index/extra.txt && > + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \ > + git -C sparse-index -c core.fsmonitor="" add extra.txt && > + test_region index convert_to_sparse trace2.txt && > test_region index ensure_full_index trace2.txt > ' > > -- > gitgitgadget >

gitgitgadget · 2021-02-01T23:24:50Z

dir.c

@@ -18,6 +18,7 @@
 #include "ewah/ewok.h"


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > The add_patterns() method has a way to extract a pattern file from the > index. If this pattern file is sparse and within a sparse directory > entry, then we need to expand the index before looking for that entry as > a file path. Why? Correct me if I'm wrong, but I thought the point of add_patterns() was to read .gitignore entries, so that we can know whether to e.g. have status report untracked files within some directory or have clean delete files within a directory. But if we have a sparse directory entry in the index, we probably have no such directory in the working directory. And if we have no such working directory, getting .gitignore entries for those directories is a big waste of time. > For now, convert ensure_full_index() into expand_to_path() to only > expand this way when absolutely necessary. Not only should we probably not need to read these files at all, expand_to_path() still expands a lot more than necessary, right? If two directories are sparse -- moduleA and moduleB, and we need something from under moduleA/, then expand_to_path() will call ensure_full_index() and fill out every entry under both modules, even if moduleB is way bigger than moduleA. Unless I've misunderstood something, there's multiple ways we're falling short of "only...when absolutely necessary". Perhaps both of these things are future work you already had planned; if so, some tweaks to the commit message may help keep this reader oriented. :-) > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > dir.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/dir.c b/dir.c > index 21998c7c4b7..7df8d3b1da0 100644 > --- a/dir.c > +++ b/dir.c > @@ -1093,7 +1093,7 @@ static int add_patterns(const char *fname, const char *base, int baselen, > int pos; > > if (istate) > - ensure_full_index(istate); > + expand_to_path(istate, fname, strlen(fname), 0); > > if (oid_stat->valid && > !match_stat_data_racy(istate, &oid_stat->stat, &st)) > -- > gitgitgadget There's also a read_skip_worktree_file_from_index() call earlier in the same function, which in your big RFC patch you protected with the ensure_full_index() call already. Perhaps it should have an expand_to_path() conversion as well? But, in the big picture, it seems like checking if we can avoid reading in that pattern file (whenever the directory doesn't exist within the working copy) would be a better first step.

gitgitgadget · 2021-02-01T23:28:00Z

pathspec.c

@@ -20,7 +20,7 @@
 * to use find_pathspecs_matching_against_index() instead.


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > The add_pathspec_matches_against_index() focuses on matching a pathspec > to file entries in the index. It is possible that this already works > correctly for its only use: checking if untracked files exist in the > index. > > It is likely that this causes a behavior issue when adding a directory > that exists at HEAD but is outside the sparse cone. I'm marking this as > a place to pursue with future tests. Sounds like you're unsure if this patch is good. Should it be marked RFC or something? > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > pathspec.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/pathspec.c b/pathspec.c > index 9b105855483..61dc771aa02 100644 > --- a/pathspec.c > +++ b/pathspec.c > @@ -36,7 +36,6 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec, > num_unmatched++; > if (!num_unmatched) > return; > - ensure_full_index(istate); > for (i = 0; i < istate->cache_nr; i++) { > const struct cache_entry *ce = istate->cache[i]; > ce_path_match(istate, ce, pathspec, seen); > -- > gitgitgadget >

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/1/2021 6:24 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> From: Derrick Stolee <dstolee@microsoft.com> >> >> The add_pathspec_matches_against_index() focuses on matching a pathspec >> to file entries in the index. It is possible that this already works >> correctly for its only use: checking if untracked files exist in the >> index. >> >> It is likely that this causes a behavior issue when adding a directory >> that exists at HEAD but is outside the sparse cone. I'm marking this as >> a place to pursue with future tests. > > Sounds like you're unsure if this patch is good. Should it be marked > RFC or something? ...isn't the whole series marked as RFC? I only specifically marked the ensure_full_index() one because I purposefully squashed it. But in general, everything I'm touching in these areas seems like a potentially problematic change. So many things are used and re-used that I'm not sure what is safe or not. More testing is required for commands to ensure their behavior. I can enable things like GIT_TEST_SPARSE_INDEX to get other tests using sparse-checkout to work (but only if they use cone mode). Hence, I'm relying on what tests I can write to cover the behavior instead of a robust history of valuable tests. I hope to gather more confidence as this goes forward. I definitely work to be confident that I am not making any errors that cause problems for users who do not enable the sparse-index, but I expect there to be a long tail of adjustments required as more corner cases are discovered during the development and testing of the feature. The good news is that I have some engaged users who are willing to test the feature and provide feedback if they hit any snags once there is a minimal set of functionality. Thanks, -Stolee

gitgitgadget · 2021-02-01T23:58:02Z

cache-tree.c

@@ -6,6 +6,7 @@
 #include "object-store.h"


On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > The cache-tree extension was previously disabled with sparse indexes. > However, the cache-tree is an important performance feature for commands > like 'git status' and 'git add'. Integrate it with sparse directory > entries. > > When writing a sparse index, completely clear and recalculate the cache > tree. By starting from scratch, the only integration necessary is to > check if we hit a sparse directory entry and create a leaf of the > cache-tree that has an entry_count of one and no subtrees. > > Once the cache-tree exists within a sparse index, we finally get > improved performance. I test the sparse index performance using a > private monorepo with over 2.1 million files at HEAD, but with a > sparse-checkout definition that has only 68,000 paths in the populated > cone. The sparse index has about 2,000 sparse directory entries. I > compare three scenarios: How many .gitignore entries does this monorepo have? What percentage of those are populated for #2 and #3? Can a testcase be devised that others can repeat? For example, I created https://github.com/newren/gvfs-like-git-bomb once upon a time to create a very small repository with a very large index and a lot of skip-worktree entries, mostly to test some stuff that someone (Ben Peart?) mentioned as being slow and verify for myself that it wasn't Windows specific. > 1. Use the full index. The index size is ~186 MB. > 2. Use the sparse index. The index size is ~5.5 MB. > 3. Use a commit where HEAD matches the populated set. The full index > size is ~5.3MB. I'm not sure I'm understanding the difference between #2 and #3, other than #3 is smaller. How did you form #2? Also, what do you mean by "full index size" for #3, when it's the smallest? Isn't that index the most sparse (or least full)? Or is it an index for a different commit entirely that has far fewer files in it? > The third benchmark is included as a theoretical optimium for a s/optimium/optimum/ > repository of the same object database. This I'm also not understanding, but maybe this goes back to not understanding the difference in how #2 and #3 are constructed. > First, a clean 'git status' improves from 3.1s to 240ms. > > Benchmark #1: full index (git status) > Time (mean ± σ): 3.167 s ± 0.036 s [User: 2.006 s, System: 1.078 s] > Range (min … max): 3.100 s … 3.208 s 10 runs > > Benchmark #2: sparse index (git status) > Time (mean ± σ): 239.5 ms ± 8.1 ms [User: 189.4 ms, System: 226.8 ms] > Range (min … max): 226.0 ms … 251.9 ms 13 runs > > Benchmark #3: small tree (git status) > Time (mean ± σ): 195.3 ms ± 4.5 ms [User: 116.5 ms, System: 84.4 ms] > Range (min … max): 188.8 ms … 202.8 ms 15 runs Always nice to see a speedup factor greater than 10. :-) > > The optimimum is still 45ms faster. This is due in part to the 2,000+ s/optimimum/optimum/ > sparse directory entries, but there might be other optimizations to make > in the sparse-index case. In particular, I find that this performance > difference disappears when I disable FS Monitor, which is somewhat > disabled in the sparse-index case, but might still be adding overhead. The FS monitor wording is unclear to me; it feels like multiple negatives. > The performance numbers for 'git add .' are much closer to optimal: > > Benchmark #1: full index (git add .) > Time (mean ± σ): 3.076 s ± 0.022 s [User: 2.065 s, System: 0.943 s] > Range (min … max): 3.044 s … 3.116 s 10 runs > > Benchmark #2: sparse index (git add .) > Time (mean ± σ): 218.0 ms ± 6.6 ms [User: 195.7 ms, System: 206.6 ms] > Range (min … max): 209.8 ms … 228.2 ms 13 runs > > Benchmark #3: small tree (git add .) > Time (mean ± σ): 217.6 ms ± 5.4 ms [User: 131.9 ms, System: 86.7 ms] > Range (min … max): 212.1 ms … 228.4 ms 14 runs > > In this test, I also used "echo >>README.md" to append a line to the > README.md file, so the 'git add .' command is doing _something_ other > than a no-op. Without this edit (and FS Monitor enabled) the small > tree case again gains about 30ms on the sparse index case. Meaning the small tree is 30 ms faster than reported here, or 30 ms slower, or that both sparse index and small tree are faster but the small tree decreases its time more than the sparse index one does? Sorry, I don't mean to be dense, I'm just struggling with understanding words today it seems. (Also, it seems like there's a joke in there about me being "dense" in a review of a "sparse" feature...but I'm not quite coming up with it.) > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > cache-tree.c | 18 ++++++++++++++++++ > sparse-index.c | 10 +++++++++- > 2 files changed, 27 insertions(+), 1 deletion(-) > > diff --git a/cache-tree.c b/cache-tree.c > index 5f07a39e501..9da6a4394e0 100644 > --- a/cache-tree.c > +++ b/cache-tree.c > @@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it, > > *skip_count = 0; > > + /* > + * If the first entry of this region is a sparse directory > + * entry corresponding exactly to 'base', then this cache_tree > + * struct is a "leaf" in the data structure, pointing to the > + * tree OID specified in the entry. > + */ > + if (entries > 0) { > + const struct cache_entry *ce = cache[0]; > + > + if (S_ISSPARSEDIR(ce) && > + ce->ce_namelen == baselen && > + !strncmp(ce->name, base, baselen)) { > + it->entry_count = 1; > + oidcpy(&it->oid, &ce->oid); > + return 1; > + } > + } > + > if (0 <= it->entry_count && has_object_file(&it->oid)) > return it->entry_count; > > diff --git a/sparse-index.c b/sparse-index.c > index a201f3b905c..9ea3b321400 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -181,7 +181,11 @@ int convert_to_sparse(struct index_state *istate) > istate->cache_nr = convert_to_sparse_rec(istate, > 0, 0, istate->cache_nr, > "", 0, istate->cache_tree); > - istate->drop_cache_tree = 1; > + > + /* Clear and recompute the cache-tree */ > + cache_tree_free(&istate->cache_tree); > + cache_tree_update(istate, 0); > + > istate->sparse_index = 1; > trace2_region_leave("index", "convert_to_sparse", istate->repo); > return 0; > @@ -278,6 +282,10 @@ void ensure_full_index(struct index_state *istate) > > free(full); > > + /* Clear and recompute the cache-tree */ > + cache_tree_free(&istate->cache_tree); > + cache_tree_update(istate, 0); > + > trace2_region_leave("index", "ensure_full_index", istate->repo); > } > > -- > gitgitgadget This is very exciting work!!

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/1/2021 6:54 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> In this test, I also used "echo >>README.md" to append a line to the >> README.md file, so the 'git add .' command is doing _something_ other >> than a no-op. Without this edit (and FS Monitor enabled) the small >> tree case again gains about 30ms on the sparse index case. > > Meaning the small tree is 30 ms faster than reported here, or 30 ms > slower, or that both sparse index and small tree are faster but the > small tree decreases its time more than the sparse index one does? > > Sorry, I don't mean to be dense, I'm just struggling with > understanding words today it seems. (Also, it seems like there's a > joke in there about me being "dense" in a review of a "sparse" > feature...but I'm not quite coming up with it.) I don't blame you! This is a lot to digest, and I appreciate you pushing through to the end of it. Clearly, I was getting a bit inexact near the end. My excitement to share this RFC clearly overshadowed my attention to grammatical detail. I'll go through your feedback more carefully soon and hopefully clarify these and many other questions. Thanks, -Stolee

On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Feb 1, 2021 at 6:41 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 2/1/2021 6:54 PM, Elijah Newren wrote: > > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > >> In this test, I also used "echo >>README.md" to append a line to the > >> README.md file, so the 'git add .' command is doing _something_ other > >> than a no-op. Without this edit (and FS Monitor enabled) the small > >> tree case again gains about 30ms on the sparse index case. > > > > Meaning the small tree is 30 ms faster than reported here, or 30 ms > > slower, or that both sparse index and small tree are faster but the > > small tree decreases its time more than the sparse index one does? > > > > Sorry, I don't mean to be dense, I'm just struggling with > > understanding words today it seems. (Also, it seems like there's a > > joke in there about me being "dense" in a review of a "sparse" > > feature...but I'm not quite coming up with it.) > > I don't blame you! This is a lot to digest, and I appreciate you > pushing through to the end of it. > > Clearly, I was getting a bit inexact near the end. My excitement > to share this RFC clearly overshadowed my attention to grammatical > detail. I'll go through your feedback more carefully soon and > hopefully clarify these and many other questions. I can't blame you for being excited; this series is awesome. I've thought we should do something along this direction for years[1]. Sure I found lots of little nitpicks here and there in your series, but that's just attempting to help find any issues so it can be made even better; overall I'm super excited about it. [1] See "crazy idea" paragraph at https://lore.kernel.org/git/CABPp-BGir_5xyqEfwytDog0rZDydPHXjuqXCpNKk67dVPXjUjA@mail.gmail.com/

gitgitgadget · 2021-02-02T03:16:09Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is based on ds/more-index-cleanups also available as GitGitGadget PR
> #839.
>
> The sparse checkout feature allows users to specify a "populated set" that
> is smaller than the full list of files at HEAD. Files outside the sparse
> checkout definition are not present in the working directory, but are still
> present in the index (and marked with the CE_SKIP_WORKTREE bit).
>
> This means that the working directory has size O(populated), and commands
> like "git status" or "git checkout" operate using an O(populated) number of
> filesystem operations. However, parsing the index still operates on the
> scale of O(HEAD).
>
> This can be particularly jarring if you are merging a small repository with
> a large monorepo for the purpose of simplifying dependency management. Even
> if users have nothing more in their working directory than they had before,
> they suddenly see a significant increase in their "git status" or "git add"
> times. In these cases, simply parsing the index can be a huge portion of the
> command time.
>
> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.
>
> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.
>
> The index is a critical data structure, so making such a drastic change must
> be handled carefully. This RFC does only enough adjustments to demonstrate
> performance improvements for "git status" and "git add." Other commands
> should operate identically to before, since the other commands will expand a
> sparse index into a full index by parsing trees.
>
> WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
> tests. I'm not sure what is causing these failures, but I will explore while
> we discuss the possibility of the feature as a whole.
>
> Here is an outline for this RFC:
>
>  * Patches 1-14: create and test the sparse index format. This is just
>    enough to start writing the format, but all Git commands become slower
>    for using it. This is because everything is guarded to expand to a full
>    index before actually operating on the cache entries.
>
>  * Patch 15: This massive patch is actually a bunch of patches squashed
>    together. I have a branch that adds "ensure_full_index()" guards in each
>    necessary file along with some commentary about how the index is being
>    used. This patch is presented here as one big dump because that
>    commentary isn't particularly interesting if the RFC leads to a very
>    different approach.
>
>  * Patches 16-27: These changes make enough code "sparse aware" such that
>    "git status" and "git add" start operating in time O(populated) instead
>    of O(HEAD).
>
> Performance numbers are given in patch 27, but repeated somewhat here. The
> test environment I use has ~2.1 million paths at HEAD, but only 68,000
> populated paths given the sparse-checkout I'm using. The sparse index has
> about 2,000 sparse directory entries.
>
>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index size
>     is ~5.3MB.
>
> The third benchmark is included as a theoretical optimum for a repository of
> the same object database.
>
> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s
> [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10
> runs
>
> Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms
> [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13
> runs
>
> Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms
> [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15
> runs
>
> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s
> [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10
> runs
>
> Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms
> [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13
> runs
>
> Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms
> [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14
> runs
>
> I expect that making a sparse index work optimally through the most common
> Git commands will take a year of effort. During this process, I expect to
> add a lot of testing infrastructure around the sparse-checkout feature,
> especially in corner cases. (This RFC focuses on the happy paths of
> operating only within the sparse cone, but that will change in the future.)
>
> If this general approach is acceptable, then I would follow it with a
> sequence of patch submissions that follow this approach:
>
>  1. Basics of the format. (Patches 1-14)
>  2. Add additional guards around index interactions (Patch 15, but split
>     appropriately.)
>  3. Speed up "git status" and "git add" (Patches 16-27)
>
> After those three items that are represented in this RFC, the work starts to
> parallelize a bit. My basic ideas for moving forward from this point are to
> do these basic steps:
>
>  * Add new index API abstractions where appropriate, make them sparse-aware.
>  * Add new tests around sparse-checkout corner cases. Ensure the sparse
>    index works correctly.
>  * For a given builtin, add extra testing for sparse-checkouts then it them
>    sparse-aware.
>
> Here are some specific questions I'm hoping to answer in this RFC period:
>
>  1. Are these sparse directory entries an appropriate way to extend the
>     index format?
>  2. Is extensions.sparseIndex a good way to signal that these entries can
>     exist?
>  3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
>     toggle the format?
>  4. Are there specific areas that I should target to harden the index API
>     before I submit this work?
>  5. Does anyone have a good idea how to test a large portion of the test
>     suite with sparse-index enabled? The problem I see is that most tests
>     don't use sparse-checkout, so the sparse index is identical to the full
>     index. Would it be interesting to enable the test setup to add such
>     "extra" directories during the test setup?
>
> Thanks, -Stolee

Thanks for working on this.  It's very exciting seeing this idea come
alive.  I had lots of little nitpicks and questions and whatnot here
and there, but that almost feels like a diversion.  Overall, I think
you divided up the series in a very logical and easy to follow
fashion, and actually achieved quite a bit already.

I suspect I have partially answered some of your questions above among
all my comments, and left others unanswered or worse, just re-asked
the same question(s) myself.  Feel free to ping again with the next
round and I'll see if I can dodge your questions again...er, I mean,
try to think of something helpful to say.  :-)

>
> Derrick Stolee (27):
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: read-cache --table --no-stat
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   [RFC-VERSION] *: ensure full index
>   unpack-trees: make sparse aware
>   dir.c: accept a directory as part of cone-mode patterns
>   status: use sparse-index throughout
>   status: skip sparse-checkout percentage with sparse-index
>   sparse-index: expand_to_path() trivial implementation
>   sparse-index: expand_to_path no-op if path exists
>   add: allow operating on a sparse-only index
>   submodule: die_path_inside_submodule is sparse aware
>   dir: use expand_to_path in add_patterns()
>   fsmonitor: disable if index is sparse
>   pathspec: stop calling ensure_full_index
>   cache-tree: integrate with sparse directory entries
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 +
>  Makefile                                 |   1 +
>  apply.c                                  |  10 +-
>  blame.c                                  |   7 +-
>  builtin/add.c                            |   3 +
>  builtin/checkout-index.c                 |   5 +-
>  builtin/commit.c                         |   3 +
>  builtin/grep.c                           |   2 +
>  builtin/ls-files.c                       |   9 +-
>  builtin/merge-index.c                    |   2 +
>  builtin/mv.c                             |   2 +
>  builtin/rm.c                             |   2 +
>  builtin/sparse-checkout.c                |  35 ++-
>  builtin/update-index.c                   |   2 +
>  cache-tree.c                             |  21 ++
>  cache.h                                  |  15 +-
>  diff.c                                   |   2 +
>  dir.c                                    |  19 +-
>  dir.h                                    |   2 +-
>  entry.c                                  |   2 +
>  fsmonitor.c                              |  18 +-
>  merge-recursive.c                        |  22 +-
>  name-hash.c                              |  10 +
>  pathspec.c                               |   4 +-
>  pathspec.h                               |   4 +-
>  preload-index.c                          |   2 +
>  read-cache.c                             |  51 +++-
>  repo-settings.c                          |  15 +
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  rerere.c                                 |   2 +
>  resolve-undo.c                           |   6 +
>  setup.c                                  |   3 +
>  sha1-name.c                              |   3 +
>  sparse-index.c                           | 360 +++++++++++++++++++++++
>  sparse-index.h                           |  23 ++
>  split-index.c                            |   2 +
>  submodule.c                              |  22 +-
>  submodule.h                              |   6 +-
>  t/helper/test-read-cache.c               |  77 ++++-
>  t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
>  tree.c                                   |   2 +
>  unpack-trees.c                           |  40 ++-
>  wt-status.c                              |  21 +-
>  wt-status.h                              |   1 +
>  46 files changed, 942 insertions(+), 61 deletions(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
>
> base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/847
> --
> gitgitgadget

derrickstolee · 2021-02-23T17:20:48Z

Closing in favor of the actual submissions, starting in #883.

derrickstolee force-pushed the sparse-index/rfc branch from 05c7879 to 1d93900 Compare January 25, 2021 14:54

derrickstolee added 21 commits January 25, 2021 10:42

derrickstolee force-pushed the sparse-index/rfc branch 2 times, most recently from 12c9c36 to 0e31cfa Compare January 25, 2021 17:00

derrickstolee added 6 commits January 25, 2021 12:17

derrickstolee force-pushed the sparse-index/rfc branch from 0e31cfa to 05e7548 Compare January 25, 2021 17:18

gitgitgadget bot reviewed Jan 27, 2021

View reviewed changes

gitgitgadget bot reviewed Feb 1, 2021

View reviewed changes

derrickstolee closed this Feb 23, 2021

		@@ -204,6 +204,10 @@ struct cache_entry {
		#error "CE_EXTENDED_FLAGS out of range"

		@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
		*/

		@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
		test_expect_success 'setup' '

		@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
		if (is_index_unborn(r->index))

		@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
		strbuf_make_traverse_path(&name, info, names->path, names->pathlen);

		@@ -6,3 +6,10 @@ extensions.objectFormat::
		Note that this setting should only be set by linkgit:git-init[1] or

		@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
		When `--cone` is provided, the `core.sparseCheckoutCone` setting is

		@@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state,
		if (!patch->is_new)

		@@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate,
		char *seen)

		@@ -892,13 +892,15 @@ void add_pattern(const char string, const char base,
		add_pattern_to_hashsets(pl, pattern);

		@@ -491,6 +491,9 @@ int cmd_add(int argc, const char *argv, const char prefix)
		add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize;

		@@ -20,7 +20,7 @@
		* to use find_pathspecs_matching_against_index() instead.

Conversation

derrickstolee commented Jan 25, 2021 • edited by gitgitgadget bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

derrickstolee commented Jan 25, 2021

Uh oh!

gitgitgadget bot commented Jan 25, 2021

Uh oh!

gitgitgadget bot commented Jan 25, 2021

Uh oh!

gitgitgadget bot commented Jan 25, 2021

Uh oh!

gitgitgadget bot commented Jan 25, 2021

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 28, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 28, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 27, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Jan 28, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 1, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 1, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 1, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 9, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 1, 2021

Choose a reason for hiding this comment

Uh oh!

derrickstolee commented Jan 25, 2021 •

edited by gitgitgadget bot

Loading