[RFC] Sparse Index#847
[RFC] Sparse Index#847derrickstolee wants to merge 27 commits intogitgitgadget:stolee/more-index-cleanupsfrom
Conversation
05c7879 to
1d93900
Compare
Upcoming changes will introduce modifications to the index format that allow sparse directories. It will be useful to have a mechanism for converting those sparse index files into full indexes by walking the tree at those sparse directories. Name this method ensure_full_index() as it will guarantee that the index is fully expanded. This method is not implemented yet, and instead we focus on the scaffolding to declare it and call it at the appropriate time. Add a 'command_requires_full_index' member to struct repo_settings. This will be an indicator that we need the index in full mode to do certain index operations. This starts as being true for every command, then we will set it to false as some commands integrate with sparse indexes. If 'command_requires_full_index' is true, then we will immediately expand a sparse index to a full one upon reading from disk. This suffices for now, but we will want to add more callers to ensure_full_index() later. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
We will mark an in-memory index_state as having sparse directory entries with the sparse_index bit. These currently cannot exist, but we will add a mechanism for collapsing a full index to a sparse one in a later change. That will happen at write time, so we must first allow parsing the format before writing it. Commands or methods that require a full index in order to operate can call ensure_full_index() to expand that index in-memory. This requires parsing trees using that index's repository. Sparse directory entries have a specific 'ce_mode' value. The macro S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This ce_mode is not possible with the existing index formats, so we don't also verify all properties of a sparse-directory entry, which are: 1. ce->ce_mode == 01000755 2. ce->flags & CE_SKIP_WORKTREE is true 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) 4. ce->oid references a tree object. These are all semi-enforced in ensure_full_index() to some extent. Any deviation will cause a warning at minimum or a failure in the worst case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Add a new 'sparse-index' repo alongside the 'full-checkout' and 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also add run_on_sparse and test_sparse_match helpers. These helpers will be used when the sparse index is implemented. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
This table is helpful for discovering data in the index to ensure it is being written correctly, especially as we build and test the sparse-index. To make the option parsing slightly more robust, wrap the string comparisons in a loop adapted from test-dir-iterator.c. Care must be taken with the final check for the 'cnt' variable. We continue the expectation that the numerical value is the final argument. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The 'test-tool read-cache --table' output is helpful to understand the full contents of the index entries on-disk. This is particularly helpful when trying to diagnose issues with a real repository example. However, for test cases we might want to compare the index contents of two repositories that were updated in similar ways, but will not actually share the same stat data. Add the '--no-stat' option to remove the timestamps and other stat data from the output. This allows us to compare index contents directly. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
We will use 'test-tool read-cache --table' to check that a sparse index is written as part of init_repos. Since we will no longer always expand a sparse index into a full index, add an '--expand' parameter that adds a call to ensure_full_index() so we can compare a sparse index directly against a full index, or at least what the in-memory index looks like when expanded in this way. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The next change will translate full indexes into sparse indexes at write time. The existing logic provides a way for every sparse index to be expanded to a full index at read time. However, there are cases where an index is written and then continues to be used in-memory to perform further updates. unpack_trees() is frequently called after such a write. In particular, commands like 'git reset' do this double-update of the index. Ensure that we have a full index when entering unpack_trees(), but only when command_requires_full_index is true. This is always true at the moment, but we will later relax that after unpack_trees() is updated to handle sparse directory entries. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
As we modify the sparse-checkout definition, we perform index operations on a pattern_list that only exists in-memory. This allows easy backing out in case the index update fails. However, if the index write itself cares about the sparse-checkout pattern set, we need access to that in-memory copy. Place a pointer to a 'struct pattern_list' in the index so we can access this on-demand. This will be used in the next change which uses the sparse-checkout definition to filter out directories that are outsie the sparse cone. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
If we have a full index, then we can convert it to a sparse index by replacing directories outside of the sparse cone with sparse directory entries. The convert_to_sparse() method does this, when the situation is appropriate. For now, we avoid converting the index to a sparse index if: 1. the index is split. 2. the index is already sparse. 3. sparse-checkout is disabled. 4. sparse-checkout does not use cone mode. Finally, we currently limit the conversion to when the GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git config will be added in a later change. The trickiest thing about this conversion is that we might not be able to mark a directory as a sparse directory just because it is outside the sparse cone. There might be unmerged files within that directory, so we need to look for those. Also, if there is some strange reason why a file is not marked with CE_SKIP_WORKTREE, then we should give up on converting that directory. There is still hope that some of its subdirectories might be able to convert to sparse, so we keep looking deeper. The conversion process is assisted by the cache-tree extension. This is calculated from the full index if it does not already exist. We then abandon the cache-tree as it no longer applies to the newly-sparse index. Thus, this cache-tree will be recalculated in every sparse-full-sparse round-trip until we integrate the cache-tree extension with the sparse index. We can compare the behavior of the sparse-index in t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1 when operating on the 'sparse-index' repo. We can also compare the two sparse repos directly, such as comparing their indexes (when expanded to full in the case of the 'sparse-index' repo). We also verify that the index is actually populated with sparse directory entries. The 'checkout and reset (mixed)' test is marked for failure when comparing a sparse repo to a full repo, but we can compare the two sparse-checkout cases directly to ensure that we are not changing the behavior when using a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
A submodule is stored as a "Git link" that actually points to a commit within a submodule. Submodules are populated or not depending on submodule configuration, not sparse-checkout. To ensure that the sparse-index feature integrates correctly with submodules, we should not collapse a directory if there is a Git link within its range. This allows us to remove ensure_full_index() from die_path_inside_submodule() because a sparse-index will not remove the entries for Git links. The loop already 'continue's if the cache entry is not a Git link. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The index_pos_by_traverse_info() currently throws a BUG() when a directory entry exists exactly in the index. We need to consider that it is possible to have a directory in a sparse index as long as that entry is itself marked with the skip-worktree bit. The negation of the 'pos' variable must be conditioned to only when it starts as negative. This is identical behavior as before when the index is full. The starts_with() condition matches because our name.buf terminates with a directory separator, just like our sparse directory entries. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Add a test case that uses test_region to ensure that we are truly expanding a sparse index to a full one, then converting back to sparse when writing the index. As we integrate more Git commands with the sparse index, we will convert these commands to check that we do _not_ convert the sparse index to a full index and instead stay sparse the entire time. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Previously, we enabled the sparse index format only using GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to actually select this mode. Further, sparse directory entries are not understood by the index formats as advertised. We _could_ add a new index version that explicitly adds these capabilities, but there are nuances to index formats 2, 3, and 4 that are still valuable to select as options. For now, create a repo extension, "extensions.sparseIndex", that specifies that the tool reading this repository must understand sparse directory entries. This change only encodes the extension and enables it when GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI mechanism. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The sparse index extension is used to signal that index writes should be in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1. Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that specifies if the sparse index should be used. It also updates the index to use the correct format, either way. Add a warning in the documentation that the use of a repository extension might reduce compatibility with third-party tools. 'git sparse-checkout init' already sets extension.worktreeConfig, which places most sparse-checkout users outside of the scope of most third-party tools. Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of GIT_TEST_SPARSE_INDEX=1. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
This giant patch is not intended for actual review. I have a branch that has these changes split out in a sane way with some commentary in each file that is modified. The idea here is to guard certain portions of the codebase that do not know how to handle sparse indexes by ensuring that the index is expanded to a full index before proceeding with the logic. This also provides a good mechanism for testing which code needs updating to enable the sparse index in a Git builtin. The builtin can set the_repository->settings.command_requires_full_index to zero and then we can debug the command with a breakpoint on ensure_full_index(). That identifies the portion of code that needs adjusting before enabling sparse indexes for that command. Some index operations must be changed to operate on a non-const pointer, since ensuring a full index will modify the index itself. There are likely some gaps to these protections, which is why it will be important to carefully test each scenario as we relax the requirements. I expect that to be a long effort. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
As a first step to integrate 'git status' and 'git add' with the sparse index, we must start integrating unpack_trees() with sparse directory entries. These changes are currently impossible to trigger because unpack_trees() calls ensure_full_index() if command_requires_full_index is true. This is the case for all commands at the moment. As we expand more commands to be sparse-aware, we might find that more changes are required to unpack_trees(). The current changes will suffice for 'status' and 'add'. unpack_trees() calls the traverse_trees() API using unpack_callback() to decide if we should recurse into a subtree. We must add new abilities to skip a subtree if it corresponds to a sparse directory entry. It is important to be careful about the trailing directory separator that exists in the sparse directory entries but not in the subtree paths. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When we have sparse directory entries in the index, we want to compare
that directory against sparse-checkout patterns. Those pattern matching
algorithms are built expecting a file path, not a directory path. This
is especially important in the "cone mode" patterns which will match
files that exist within the "parent directories" as well as the
recursive directory matches.
If path_matches_pattern_list() is given a directory, we can add a bogus
filename ("-") to the directory and get the same results as before,
assuming we are in cone mode. Since sparse index requires cone mode
patterns, this is an acceptable assumption.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
By testing 'git -c core.fsmonitor= status -uno', we can check for the simplest index operations that can be made sparse-aware. The necessary implementation details are already integrated with sparse-checkout, so modify command_requires_full_index to be zero for cmd_status(). By running the debugger for 'git status -uno' after that change, we find two instances of ensure_full_index() that were added for extra safety, but can be removed without issue. In refresh_index(), we loop through the index entries. The refresh_cache_ent() method copies the sparse directories into the refreshed index without issue. The loop within run_diff_files() skips things that are in stage 0 and have skip-worktree enabled, so seems safe to disable ensure_full_index() here. While this change avoids calling ensure_full_index(), it actually slows 'git status' because we do not have the cache-tree extension to help us. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
'git status' began reporting a percentage of populated paths when sparse-checkout is enabled in 051df3c (wt-status: show sparse checkout status as well, 2020-07-18). This percentage is incorrect when the index has sparse directories. It would also be expensive to calculate as we would need to parse trees to count the total number of possible paths. Avoid the expensive computation by simplifying the output to only report that a sparse checkout exists, without the percentage. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Before we check if a specific file or directory exists in the index, it would be good to see if a leading directory is a sparse-directory. If so, we will want to expand the index _just enough_ to be sure that the paths we are interested in are in the index. The actually-interesting implementation will follow in a later change. For now, simply call ensure_full_index() to expand every directory simultaneously. Calls like index_dir_exists(), adjust_dirname_case(), and index_file_exists() in name-hash.c can trust the name hash if the index was properly expanded for the requested names. These methods can transition from ensure_full_index() to expand_to_path(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
We need to check the file hashmap first, then look to see if the directory signals a non-sparse directory entry. In such a case, we can rely on the contents of the sparse-index. We still use ensure_full_index() in the case that we hit a path that is within a sparse-directory entry. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
12c9c36 to
0e31cfa
Compare
Replace enough callers to ensure_full_index() to instead call expand_to_path() to reduce how often 'git add' expands a sparse index in memory (before writing a sparse index again). One non-obvious case is index_name_pos_also_unmerged() which is only hit on the Windows platform (in my tests). Use expand_to_path() instead of ensure_full_index(). Add a test to check that 'git add -A' and 'git add <file>' does not expand the index at all, as long as <file> is not within a sparse directory. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Since we already do not collapse a sparse directory if it contains a submodule, we don't need to expand to a full index in die_path_inside_submodule(). A simple scan of the index entries is sufficient. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The add_patterns() method has a way to extract a pattern file from the index. If this pattern file is sparse and within a sparse directory entry, then we need to expand the index before looking for that entry as a file path. For now, convert ensure_full_index() into expand_to_path() to only expand this way when absolutely necessary. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The FS Monitor feature uses a bitmap over the index entries. This currently interacts poorly with a sparse index. We will revisit this interaction in the future, but for now protect the index by refusing to use the FS Monitor extension at all if the index is sparse. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The add_pathspec_matches_against_index() focuses on matching a pathspec to file entries in the index. It is possible that this already works correctly for its only use: checking if untracked files exist in the index. It is likely that this causes a behavior issue when adding a directory that exists at HEAD but is outside the sparse cone. I'm marking this as a place to pursue with future tests. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.
When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.
Once the cache-tree exists within a sparse index, we finally get
improved performance. I test the sparse index performance using a
private monorepo with over 2.1 million files at HEAD, but with a
sparse-checkout definition that has only 68,000 paths in the populated
cone. The sparse index has about 2,000 sparse directory entries. I
compare three scenarios:
1. Use the full index. The index size is ~186 MB.
2. Use the sparse index. The index size is ~5.5 MB.
3. Use a commit where HEAD matches the populated set. The full index
size is ~5.3MB.
The third benchmark is included as a theoretical optimium for a
repository of the same object database.
First, a clean 'git status' improves from 3.1s to 240ms.
Benchmark #1: full index (git status)
Time (mean ± σ): 3.167 s ± 0.036 s [User: 2.006 s, System: 1.078 s]
Range (min … max): 3.100 s … 3.208 s 10 runs
Benchmark #2: sparse index (git status)
Time (mean ± σ): 239.5 ms ± 8.1 ms [User: 189.4 ms, System: 226.8 ms]
Range (min … max): 226.0 ms … 251.9 ms 13 runs
Benchmark #3: small tree (git status)
Time (mean ± σ): 195.3 ms ± 4.5 ms [User: 116.5 ms, System: 84.4 ms]
Range (min … max): 188.8 ms … 202.8 ms 15 runs
The optimimum is still 45ms faster. This is due in part to the 2,000+
sparse directory entries, but there might be other optimizations to make
in the sparse-index case. In particular, I find that this performance
difference disappears when I disable FS Monitor, which is somewhat
disabled in the sparse-index case, but might still be adding overhead.
The performance numbers for 'git add .' are much closer to optimal:
Benchmark #1: full index (git add .)
Time (mean ± σ): 3.076 s ± 0.022 s [User: 2.065 s, System: 0.943 s]
Range (min … max): 3.044 s … 3.116 s 10 runs
Benchmark #2: sparse index (git add .)
Time (mean ± σ): 218.0 ms ± 6.6 ms [User: 195.7 ms, System: 206.6 ms]
Range (min … max): 209.8 ms … 228.2 ms 13 runs
Benchmark #3: small tree (git add .)
Time (mean ± σ): 217.6 ms ± 5.4 ms [User: 131.9 ms, System: 86.7 ms]
Range (min … max): 212.1 ms … 228.4 ms 14 runs
In this test, I also used "echo >>README.md" to append a line to the
README.md file, so the 'git add .' command is doing _something_ other
than a no-op. Without this edit (and FS Monitor enabled) the small
tree case again gains about 30ms on the sparse index case.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
0e31cfa to
05e7548
Compare
|
/submit |
|
Submitted as pull.847.git.1611596533.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): |
|
On the Git mailing list, Derrick Stolee wrote (reply to this): |
|
User |
| @@ -204,6 +204,10 @@ struct cache_entry { | |||
| #error "CE_EXTENDED_FLAGS out of range" | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We will mark an in-memory index_state as having sparse directory entries
> with the sparse_index bit. These currently cannot exist, but we will add
> a mechanism for collapsing a full index to a sparse one in a later
> change. That will happen at write time, so we must first allow parsing
> the format before writing it.
>
> Commands or methods that require a full index in order to operate can
> call ensure_full_index() to expand that index in-memory. This requires
> parsing trees using that index's repository.
>
> Sparse directory entries have a specific 'ce_mode' value. The macro
> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
> ce_mode is not possible with the existing index formats, so we don't
> also verify all properties of a sparse-directory entry, which are:
>
> 1. ce->ce_mode == 01000755
This is a weird number. What's the reason for choosing it? It looks
deceptively close to 0100755, normal executable files, but has the
extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
file type.
Since it's a directory, why not use S_IFDIR (040000)?
(GITLINK does use the weird 0160000 value, but it happens to be
S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
directory and a symlink")
> 2. ce->flags & CE_SKIP_WORKTREE is true
Makes sense.
> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
Is there a particular reason for this? I'm used to seeing names
without the trailing slash, both in the index and in tree objects. I
don't know enough to be for or against this idea; just curious at this
point.
> 4. ce->oid references a tree object.
Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000?
Also, as a bit of a side comment: There have been other requests in
the past to support directory objects in the index. The only use I
remember for them requested from others was to allow tracking empty
directories. However, I've long wanted to introduce a new "blobtree"
object to git, so that a user can "git add" some big binary file, but
internally git splits the binary file up and stores it as multiple
blobs within git plus a new "blobtree" object that references all the
individual blobs with enough information about how to stitch all the
blobs together to get the original file. I had a few different forms
of "blobtree" things that I was interested in. I think brian once
suggested some similar idea.
> These are all semi-enforced in ensure_full_index() to some extent. Any
> deviation will cause a warning at minimum or a failure in the worst
> case.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> cache.h | 11 +++++-
> read-cache.c | 9 +++++
> sparse-index.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
> sparse-index.h | 1 +
> 4 files changed, 113 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index f9c7a603841..884046ca5b8 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -204,6 +204,10 @@ struct cache_entry {
> #error "CE_EXTENDED_FLAGS out of range"
> #endif
>
> +#define CE_MODE_SPARSE_DIRECTORY 01000755
> +#define SPARSE_DIR_MODE 0100
Another magic value. Feels like the commit message should reference
this one and why it was picked. Seems odd to me, and possibly
problematic to re-use file permission bits that might collide with
files recorded by really old versions of git. Maybe that's not a
concern, though.
> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)
Should the special sauce apply to ce_flags rather than ce_mode? Thus,
instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro
(similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar
to CE_SKIP_WORKTREE)?
Or, alternatively, do we need a single special state here? Could we
check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)?
> +
> /* Forward structure decls */
> struct pathspec;
> struct child_process;
> @@ -249,6 +253,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
> {
> if (S_ISLNK(mode))
> return S_IFLNK;
> + if (mode == SPARSE_DIR_MODE)
> + return CE_MODE_SPARSE_DIRECTORY;
> if (S_ISDIR(mode) || S_ISGITLINK(mode))
> return S_IFGITLINK;
> return S_IFREG | ce_permissions(mode);
> @@ -319,7 +325,8 @@ struct index_state {
> drop_cache_tree : 1,
> updated_workdir : 1,
> updated_skipworktree : 1,
> - fsmonitor_has_run_once : 1;
> + fsmonitor_has_run_once : 1,
> + sparse_index : 1;
> struct hashmap name_hash;
> struct hashmap dir_hash;
> struct object_id oid;
> @@ -721,6 +728,8 @@ int read_index_from(struct index_state *, const char *path,
> const char *gitdir);
> int is_index_unborn(struct index_state *);
>
> +void ensure_full_index(struct index_state *istate);
> +
> /* For use with `write_locked_index()`. */
> #define COMMIT_LOCK (1 << 0)
> #define SKIP_IF_UNCHANGED (1 << 1)
> diff --git a/read-cache.c b/read-cache.c
> index ecf6f689940..1097ecbf132 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -101,6 +101,9 @@ static const char *alternate_index_output;
>
> static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
> {
> + if (S_ISSPARSEDIR(ce))
> + istate->sparse_index = 1;
> +
> istate->cache[nr] = ce;
> add_name_hash(istate, ce);
> }
> @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
> trace2_data_intmax("index", the_repository, "read/cache_nr",
> istate->cache_nr);
>
> + if (!istate->repo)
> + istate->repo = the_repository;
> + prepare_repo_settings(istate->repo);
> + if (istate->repo->settings.command_requires_full_index)
> + ensure_full_index(istate);
> +
> return istate->cache_nr;
>
> unmap:
> diff --git a/sparse-index.c b/sparse-index.c
> index 82183ead563..1e70244dc13 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -1,8 +1,100 @@
> #include "cache.h"
> #include "repository.h"
> #include "sparse-index.h"
> +#include "tree.h"
> +#include "pathspec.h"
> +#include "trace2.h"
> +
> +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
> +{
> + ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
> +
> + istate->cache[nr] = ce;
> + add_name_hash(istate, ce);
> +}
> +
> +static int add_path_to_index(const struct object_id *oid,
> + struct strbuf *base, const char *path,
> + unsigned int mode, int stage, void *context)
> +{
> + struct index_state *istate = (struct index_state *)context;
> + struct cache_entry *ce;
> + size_t len = base->len;
> +
> + if (S_ISDIR(mode))
> + return READ_TREE_RECURSIVE;
> +
> + strbuf_addstr(base, path);
> +
> + ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> + ce->ce_flags |= CE_SKIP_WORKTREE;
> + set_index_entry(istate, istate->cache_nr++, ce);
> +
> + strbuf_setlen(base, len);
> + return 0;
> +}
>
> void ensure_full_index(struct index_state *istate)
> {
> - /* intentionally left blank */
> + int i;
> + struct index_state *full;
> +
> + if (!istate || !istate->sparse_index)
> + return;
> +
> + if (!istate->repo)
> + istate->repo = the_repository;
> +
> + trace2_region_enter("index", "ensure_full_index", istate->repo);
> +
> + /* initialize basics of new index */
> + full = xcalloc(1, sizeof(struct index_state));
> + memcpy(full, istate, sizeof(struct index_state));
> +
> + /* then change the necessary things */
> + full->sparse_index = 0;
> + full->cache_alloc = (3 * istate->cache_alloc) / 2;
> + full->cache_nr = 0;
> + ALLOC_ARRAY(full->cache, full->cache_alloc);
> +
> + for (i = 0; i < istate->cache_nr; i++) {
> + struct cache_entry *ce = istate->cache[i];
> + struct tree *tree;
> + struct pathspec ps;
> +
> + if (!S_ISSPARSEDIR(ce)) {
> + set_index_entry(full, full->cache_nr++, ce);
> + continue;
> + }
> + if (!(ce->ce_flags & CE_SKIP_WORKTREE))
> + warning(_("index entry is a directory, but not sparse (%08x)"),
> + ce->ce_flags);
> +
> + /* recursively walk into cd->name */
> + tree = lookup_tree(istate->repo, &ce->oid);
> +
> + memset(&ps, 0, sizeof(ps));
> + ps.recursive = 1;
> + ps.has_wildcard = 1;
> + ps.max_depth = -1;
> +
> + read_tree_recursive(istate->repo, tree,
> + ce->name, strlen(ce->name),
> + 0, &ps,
> + add_path_to_index, full);
> +
> + /* free directory entries. full entries are re-used */
> + discard_cache_entry(ce);
> + }
> +
> + /* Copy back into original index. */
> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> + istate->sparse_index = 0;
> + istate->cache = full->cache;
Haven't you leaked the original istate->cache here?
> + istate->cache_nr = full->cache_nr;
> + istate->cache_alloc = full->cache_alloc;
> +
> + free(full);
> +
> + trace2_region_leave("index", "ensure_full_index", istate->repo);
> }
> diff --git a/sparse-index.h b/sparse-index.h
> index 8dda92032e2..a2777dcac59 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -3,5 +3,6 @@
>
> struct index_state;
> void ensure_full_index(struct index_state *istate);
> +int convert_to_sparse(struct index_state *istate);
Seems logically that you'd add one, but was this meant to be included
in a later patch?
>
> #endif
> \ No newline at end of file
> --
> gitgitgadget
>
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 1/26/2021 10:05 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
...
>> Sparse directory entries have a specific 'ce_mode' value. The macro
>> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
>> ce_mode is not possible with the existing index formats, so we don't
>> also verify all properties of a sparse-directory entry, which are:
>>
>> 1. ce->ce_mode == 01000755
>
> This is a weird number. What's the reason for choosing it? It looks
> deceptively close to 0100755, normal executable files, but has the
> extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
> file type.
>
> Since it's a directory, why not use S_IFDIR (040000)?
>
> (GITLINK does use the weird 0160000 value, but it happens to be
> S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
> directory and a symlink")
I forget how exactly I came up with these magic constants, but then
completely forgot to think of them critically because I haven't had
to look at them in a while. They _are_ important, especially because
these values affect the file format itself.
I'll think harder on this before submitting a series intended for
merging.
>> 2. ce->flags & CE_SKIP_WORKTREE is true
>
> Makes sense.
>
>> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
>
> Is there a particular reason for this? I'm used to seeing names
> without the trailing slash, both in the index and in tree objects. I
> don't know enough to be for or against this idea; just curious at this
> point.
It's yet another way to distinguish directories from files, but
there are cases where we do string searches up to a prefix, and
having these directory separators did help, IIRC.
>> 4. ce->oid references a tree object.
>
> Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000?
...
>> +#define CE_MODE_SPARSE_DIRECTORY 01000755
>> +#define SPARSE_DIR_MODE 0100
>
> Another magic value. Feels like the commit message should reference
> this one and why it was picked. Seems odd to me, and possibly
> problematic to re-use file permission bits that might collide with
> files recorded by really old versions of git. Maybe that's not a
> concern, though.
>
>> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)
>
> Should the special sauce apply to ce_flags rather than ce_mode? Thus,
> instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro
> (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar
> to CE_SKIP_WORKTREE)?
>
> Or, alternatively, do we need a single special state here? Could we
> check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)?
The intention was that ce_mode be a unique value that could only
be assigned to a directory entry, which would then by necessity be
sparse. Checking both ce_mode and ce_flags seemed wasteful with the
given assumptions
...
>> + /* Copy back into original index. */
>> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
>> + istate->sparse_index = 0;
>> + istate->cache = full->cache;
>
> Haven't you leaked the original istate->cache here?
Yes, seems so. Will fix.
Thanks,
-Stolee
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Wed, Jan 27, 2021 at 5:43 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 1/26/2021 10:05 PM, Elijah Newren wrote:
> > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> ...
> >> Sparse directory entries have a specific 'ce_mode' value. The macro
> >> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
> >> ce_mode is not possible with the existing index formats, so we don't
> >> also verify all properties of a sparse-directory entry, which are:
> >>
> >> 1. ce->ce_mode == 01000755
> >
> > This is a weird number. What's the reason for choosing it? It looks
> > deceptively close to 0100755, normal executable files, but has the
> > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
> > file type.
> >
> > Since it's a directory, why not use S_IFDIR (040000)?
> >
> > (GITLINK does use the weird 0160000 value, but it happens to be
> > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
> > directory and a symlink")
>
> I forget how exactly I came up with these magic constants, but then
> completely forgot to think of them critically because I haven't had
> to look at them in a while. They _are_ important, especially because
> these values affect the file format itself.
>
> I'll think harder on this before submitting a series intended for
> merging.
>
> >> 2. ce->flags & CE_SKIP_WORKTREE is true
> >
> > Makes sense.
> >
> >> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
> >
> > Is there a particular reason for this? I'm used to seeing names
> > without the trailing slash, both in the index and in tree objects. I
> > don't know enough to be for or against this idea; just curious at this
> > point.
>
> It's yet another way to distinguish directories from files, but
> there are cases where we do string searches up to a prefix, and
> having these directory separators did help, IIRC.
>
> >> 4. ce->oid references a tree object.
> >
> > Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000?
>
> ...
>
> >> +#define CE_MODE_SPARSE_DIRECTORY 01000755
> >> +#define SPARSE_DIR_MODE 0100
> >
> > Another magic value. Feels like the commit message should reference
> > this one and why it was picked. Seems odd to me, and possibly
> > problematic to re-use file permission bits that might collide with
> > files recorded by really old versions of git. Maybe that's not a
> > concern, though.
> >
> >> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)
> >
> > Should the special sauce apply to ce_flags rather than ce_mode? Thus,
> > instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro
> > (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar
> > to CE_SKIP_WORKTREE)?
> >
> > Or, alternatively, do we need a single special state here? Could we
> > check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)?
>
> The intention was that ce_mode be a unique value that could only
> be assigned to a directory entry, which would then by necessity be
> sparse. Checking both ce_mode and ce_flags seemed wasteful with the
> given assumptions
040000 is a unique value that could only be assigned to a directory
entry. Since we have no other uses of directories within the index,
you are right, we wouldn't need to check ce_skip_worktree(ce) as well;
just a check for the 040000 mode would be enough.
> ...
>
> >> + /* Copy back into original index. */
> >> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> >> + istate->sparse_index = 0;
> >> + istate->cache = full->cache;
> >
> > Haven't you leaked the original istate->cache here?
>
> Yes, seems so. Will fix.
>
> Thanks,
> -Stolee
There was a problem hiding this comment.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Elijah Newren <newren@gmail.com> writes:
>> 1. ce->ce_mode == 01000755
>
> This is a weird number. What's the reason for choosing it? It looks
> deceptively close to 0100755, normal executable files, but has the
> extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
> file type.
>
> Since it's a directory, why not use S_IFDIR (040000)?
>
> (GITLINK does use the weird 0160000 value, but it happens to be
> S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
> directory and a symlink")
Yes, that combination of IFLNK/IFDIR was the reason why we use the
value. I tend to think IFDIR is the best thing to use here.
| @@ -2,18 +2,55 @@ | |||
| #include "cache.h" | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index.
>
> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> t/helper/test-read-cache.c | 49 ++++++++++++++++++++++++++++++++++----
> 1 file changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bd..cd7d106a675 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -2,18 +2,55 @@
> #include "cache.h"
> #include "config.h"
>
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> + /* stat info */
> + printf("%08x %08x %08x %08x %08x %08x ",
> + ce->ce_stat_data.sd_ctime.sec,
> + ce->ce_stat_data.sd_ctime.nsec,
> + ce->ce_stat_data.sd_mtime.sec,
> + ce->ce_stat_data.sd_mtime.nsec,
> + ce->ce_stat_data.sd_dev,
> + ce->ce_stat_data.sd_ino);
Printing sec & nsec in hexidecimal? Why?
Also, if they'll be displayed in hex, do you want to format them as
0x%08x, similar to what you do with binary below?
> +
> + /* mode in binary */
This comment feels misleading; I think this is the "S_IFMT portion of
mode in binary" not "mode in binary".
> + printf("0b%d%d%d%d ",
> + (ce->ce_mode >> 15) & 1,
> + (ce->ce_mode >> 14) & 1,
> + (ce->ce_mode >> 13) & 1,
> + (ce->ce_mode >> 12) & 1);
Why binary? Also, since you defined a special magic constant of
01000755 which utilizes bit 18; how come you aren't including any bits
higher than 15?
> + /* output permissions? */
> + printf("%04o ", ce->ce_mode & 01777);
01777 instead of 07777 just because we don't have anything using the
setuid or setgid bits? But if it's based on non-use, then we don't
use the sticky bit (01000) either, so this could be just 0777.
Also, if you're using 0b for binary to distinguish and you're clearly
using multiple bases in this code, perhaps use a print format of
0o%04o (or 0o%03o if you only use a mask of 0777).
> + printf("%s ", oid_to_hex(&ce->oid));
> +
> + printf("%s\n", ce->name);
> +}
> +
> +static void print_cache(struct index_state *cache)
> +{
> + int i;
> + for (i = 0; i < the_index.cache_nr; i++)
> + print_cache_entry(the_index.cache[i]);
> +}
> +
> int cmd__read_cache(int argc, const char **argv)
> {
> int i, cnt = 1;
> const char *name = NULL;
> + int table = 0;
>
> - if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
> - argc--;
> - argv++;
> + for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
> + if (skip_prefix(*argv, "--print-and-refresh=", &name))
> + continue;
> + if (!strcmp(*argv, "--table")) {
> + table = 1;
> + }
> }
>
> - if (argc == 2)
> - cnt = strtol(argv[1], NULL, 0);
> + if (argc == 1)
> + cnt = strtol(argv[0], NULL, 0);
> setup_git_directory();
> git_config(git_default_config, NULL);
> for (i = 0; i < cnt; i++) {
> @@ -30,6 +67,8 @@ int cmd__read_cache(int argc, const char **argv)
> ce_uptodate(the_index.cache[pos]) ? "" : " not");
> write_file(name, "%d\n", i);
> }
> + if (table)
> + print_cache(&the_index);
> discard_cache();
> }
> return 0;
> --
> gitgitgadget
>
| @@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *, | |||
| */ | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The next change will translate full indexes into sparse indexes at write
> time. The existing logic provides a way for every sparse index to be
> expanded to a full index at read time. However, there are cases where an
> index is written and then continues to be used in-memory to perform
> further updates.
>
> unpack_trees() is frequently called after such a write. In particular,
> commands like 'git reset' do this double-update of the index.
>
> Ensure that we have a full index when entering unpack_trees(), but only
> when command_requires_full_index is true. This is always true at the
> moment, but we will later relax that after unpack_trees() is updated to
> handle sparse directory entries.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> unpack-trees.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index f5f668f532d..4dd99219073 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
> */
> int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
> {
> + struct repository *repo = the_repository;
> int i, ret;
> static struct cache_entry *dfc;
> struct pattern_list pl;
> @@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
> trace_performance_enter();
> trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
>
> + prepare_repo_settings(repo);
> + if (repo->settings.command_requires_full_index) {
> + ensure_full_index(o->src_index);
> + ensure_full_index(o->dst_index);
I was worried about o->result as well, since there is a
memset(&o->result, 0, sizeof(o->result));
followed by manually initializing the relevant fields of the
index_state. However, the relevant field here is your new
sparse_index bit, and you want that to be 0, i.e. full.
I also checked ensure_full_index() since it is often the case that
o->src_index == o->dst_index, but it'll be safe to be called twice on
the same index state -- at least as currently written.
So, this patch seems good.
> + }
> +
> if (!core_apply_sparse_checkout || !o->update)
> o->skip_sparse_checkout = 1;
> if (!o->skip_sparse_checkout && !o->pl) {
> --
> gitgitgadget
| @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir' | |||
| test_expect_success 'setup' ' | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> add run_on_sparse and test_sparse_match helpers. These helpers will be
> used when the sparse index is implemented.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
> 1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 8cd3e5a8d22..8876eae0fe3 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
> test_expect_success 'setup' '
> git init initial-repo &&
> (
> + (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&
I thought parentheses started a subshell; once the subshell ends,
wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away?
> cd initial-repo &&
> echo a >a &&
> echo "after deep" >e &&
> @@ -87,23 +88,32 @@ init_repos () {
>
> cp -r initial-repo sparse-checkout &&
> git -C sparse-checkout reset --hard &&
> - git -C sparse-checkout sparse-checkout init --cone &&
> +
> + cp -r initial-repo sparse-index &&
> + git -C sparse-index reset --hard &&
>
> # initialize sparse-checkout definitions
> - git -C sparse-checkout sparse-checkout set deep
> + git -C sparse-checkout sparse-checkout init --cone &&
> + git -C sparse-checkout sparse-checkout set deep &&
> + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
> }
>
> run_on_sparse () {
> (
> cd sparse-checkout &&
> - $* >../sparse-checkout-out 2>../sparse-checkout-err
> + GIT_TEST_SPARSE_INDEX=0 $* >../sparse-checkout-out 2>../sparse-checkout-err
> + ) &&
> + (
> + cd sparse-index &&
> + $* >../sparse-index-out 2>../sparse-index-err
> )
> }
>
> run_on_all () {
> (
> cd full-checkout &&
> - $* >../full-checkout-out 2>../full-checkout-err
> + GIT_TEST_SPARSE_INDEX=0 $* >../full-checkout-out 2>../full-checkout-err
> ) &&
> run_on_sparse $*
> }
> @@ -114,6 +124,17 @@ test_all_match () {
> test_cmp full-checkout-err sparse-checkout-err
> }
>
> +test_sparse_match () {
> + run_on_sparse $* &&
> + test_cmp sparse-checkout-out sparse-index-out &&
> + test_cmp sparse-checkout-err sparse-index-err
> +}
> +
> +test_expect_success 'expanded in-memory index matches full index' '
> + init_repos &&
> + test_sparse_match test-tool read-cache --expand --table-no-stat
> +'
> +
> test_expect_success 'status with options' '
> init_repos &&
> test_all_match git status --porcelain=v2 &&
> --
> gitgitgadget
>
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 1/26/2021 10:08 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Add a new 'sparse-index' repo alongside the 'full-checkout' and
>> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
>> add run_on_sparse and test_sparse_match helpers. These helpers will be
>> used when the sparse index is implemented.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>> t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
>> 1 file changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
>> index 8cd3e5a8d22..8876eae0fe3 100755
>> --- a/t/t1092-sparse-checkout-compatibility.sh
>> +++ b/t/t1092-sparse-checkout-compatibility.sh
>> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>> test_expect_success 'setup' '
>> git init initial-repo &&
>> (
>> + (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&
>
> I thought parentheses started a subshell; once the subshell ends,
> wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away?
I think the "export" specifically pushes the setting out of the
first level of subshell. This is the recommendation that comes up
if one runs
export GIT_TEST_SPARSE_INDEX=1 &&
inside a test on macOS, since this isn't completely portable.
Thanks,
-Stolee
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Wed, Jan 27, 2021 at 5:30 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 1/26/2021 10:08 PM, Elijah Newren wrote:
> > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Derrick Stolee <dstolee@microsoft.com>
> >>
> >> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> >> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> >> add run_on_sparse and test_sparse_match helpers. These helpers will be
> >> used when the sparse index is implemented.
> >>
> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> >> ---
> >> t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
> >> 1 file changed, 25 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> >> index 8cd3e5a8d22..8876eae0fe3 100755
> >> --- a/t/t1092-sparse-checkout-compatibility.sh
> >> +++ b/t/t1092-sparse-checkout-compatibility.sh
> >> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
> >> test_expect_success 'setup' '
> >> git init initial-repo &&
> >> (
> >> + (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&
> >
> > I thought parentheses started a subshell; once the subshell ends,
> > wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away?
>
> I think the "export" specifically pushes the setting out of the
> first level of subshell. This is the recommendation that comes up
You're having a child process change the environment variables of a
parent process? ...without some kind of gdb or other debugger
wizardry?
> if one runs
>
> export GIT_TEST_SPARSE_INDEX=1 &&
>
> inside a test on macOS, since this isn't completely portable.
Um, I think you meant to run
GIT_TEST_SPARSE_INDEX=0 &&
export GIT_TEST_SPARSE_INDEX &&
in order to avoid the unportable
export GIT_TEST_SPARSE_INDEX=0 &&
because
(GIT_TEST_SPARSE_INDEX=0 &&
export GIT_TEST_SPARSE_INDEX) &&
looks like a useless no-op. At least it would be in normal bash; is
the test harness doing some special magic with it? In normal bash,
the value definitely does NOT survive the subshell; (export just means
that subprocesses of the subshell where the environment variable is
set will see the value):
$ echo Before: $GIT_TEST_SPARSE_INDEX && (GIT_TEST_SPARSE_INDEX=0 &&
export GIT_TEST_SPARSE_INDEX) && echo After: $GIT_TEST_SPARSE_INDEX
Before:
After:
But in contrast, without the parentheses:
$ echo Before: $GIT_TEST_SPARSE_INDEX && GIT_TEST_SPARSE_INDEX=0 &&
export GIT_TEST_SPARSE_INDEX && echo After: $GIT_TEST_SPARSE_INDEX
Before:
After: 0
| @@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl) | |||
| if (is_index_unborn(r->index)) | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As we modify the sparse-checkout definition, we perform index operations
> on a pattern_list that only exists in-memory. This allows easy backing
> out in case the index update fails.
>
> However, if the index write itself cares about the sparse-checkout
> pattern set, we need access to that in-memory copy. Place a pointer to
> a 'struct pattern_list' in the index so we can access this on-demand.
> This will be used in the next change which uses the sparse-checkout
> definition to filter out directories that are outsie the sparse cone.
s/outsie/outside/
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> builtin/sparse-checkout.c | 17 ++++++++++-------
> cache.h | 2 ++
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 2306a9ad98e..e00b82af727 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
> if (is_index_unborn(r->index))
> return UPDATE_SPARSITY_SUCCESS;
>
> + r->index->sparse_checkout_patterns = pl;
> +
> memset(&o, 0, sizeof(o));
> o.verbose_update = isatty(2);
> o.update = 1;
> @@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
> else
> rollback_lock_file(&lock_file);
>
> + r->index->sparse_checkout_patterns = NULL;
> return result;
> }
>
> @@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
> {
> int result;
> int changed_config = 0;
> - struct pattern_list pl;
> - memset(&pl, 0, sizeof(pl));
> + struct pattern_list *pl = xcalloc(1, sizeof(*pl));
>
> switch (m) {
> case ADD:
> if (core_sparse_checkout_cone)
> - add_patterns_cone_mode(argc, argv, &pl);
> + add_patterns_cone_mode(argc, argv, pl);
> else
> - add_patterns_literal(argc, argv, &pl);
> + add_patterns_literal(argc, argv, pl);
> break;
>
> case REPLACE:
> - add_patterns_from_input(&pl, argc, argv);
> + add_patterns_from_input(pl, argc, argv);
> break;
> }
>
> @@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
> changed_config = 1;
> }
>
> - result = write_patterns_and_update(&pl);
> + result = write_patterns_and_update(pl);
>
> if (result && changed_config)
> set_config(MODE_NO_PATTERNS);
>
> - clear_pattern_list(&pl);
> + clear_pattern_list(pl);
> + free(pl);
> return result;
> }
>
> diff --git a/cache.h b/cache.h
> index 884046ca5b8..b05341cc687 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -311,6 +311,7 @@ static inline unsigned int canon_mode(unsigned int mode)
> struct split_index;
> struct untracked_cache;
> struct progress;
> +struct pattern_list;
>
> struct index_state {
> struct cache_entry **cache;
> @@ -336,6 +337,7 @@ struct index_state {
> struct mem_pool *ce_mem_pool;
> struct progress *progress;
> struct repository *repo;
> + struct pattern_list *sparse_checkout_patterns;
> };
>
> /* Name hashing */
> --
> gitgitgadget
Isn't this the same patch you put in your index cleanup series, or am
I getting confused? It looks very familiar.
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 1/27/2021 12:00 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> As we modify the sparse-checkout definition, we perform index operations
>> on a pattern_list that only exists in-memory. This allows easy backing
>> out in case the index update fails.
>>
>> However, if the index write itself cares about the sparse-checkout
>> pattern set, we need access to that in-memory copy. Place a pointer to
>> a 'struct pattern_list' in the index so we can access this on-demand.
>> This will be used in the next change which uses the sparse-checkout
>> definition to filter out directories that are outsie the sparse cone.
>
> s/outsie/outside/
Thanks!
> Isn't this the same patch you put in your index cleanup series, or am
> I getting confused? It looks very familiar.
I removed it from v2 of that series because it didn't do anything of
value until we start using the sparse_checkout_patterns member in the
next patch of _this_ series.
Thanks,
-Stolee
| @@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names, | |||
| strbuf_make_traverse_path(&name, info, names->path, names->pathlen); | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The index_pos_by_traverse_info() currently throws a BUG() when a
> directory entry exists exactly in the index. We need to consider that it
> is possible to have a directory in a sparse index as long as that entry
> is itself marked with the skip-worktree bit.
>
> The negation of the 'pos' variable must be conditioned to only when it
> starts as negative. This is identical behavior as before when the index
> is full.
The first sentence of this paragraph was really hard for me to parse.
Reading the code and then reading the sentence I could make sense of
it, but I struggled to do so the other way around. Is there some way
to reword this? Or is this just a reading comprehension issue on my
part? (It might be...)
> The starts_with() condition matches because our name.buf terminates with
> a directory separator, just like our sparse directory entries.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> unpack-trees.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 4dd99219073..b324eec2a5d 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
> strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
> strbuf_addch(&name, '/');
> pos = index_name_pos(o->src_index, name.buf, name.len);
> - if (pos >= 0)
> - BUG("This is a directory and should not exist in index");
> - pos = -pos - 1;
> + if (pos >= 0) {
> + if (!o->src_index->sparse_index ||
> + !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
> + BUG("This is a directory and should not exist in index");
> + } else
> + pos = -pos - 1;
> if (pos >= o->src_index->cache_nr ||
> !starts_with(o->src_index->cache[pos]->name, name.buf) ||
> (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
> --
> gitgitgadget
The patch looks pretty straightforward to me.
| @@ -6,6 +6,7 @@ | |||
| #include "object-store.h" | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> If we have a full index, then we can convert it to a sparse index by
> replacing directories outside of the sparse cone with sparse directory
> entries. The convert_to_sparse() method does this, when the situation is
> appropriate.
>
> For now, we avoid converting the index to a sparse index if:
>
> 1. the index is split.
> 2. the index is already sparse.
> 3. sparse-checkout is disabled.
> 4. sparse-checkout does not use cone mode.
>
> Finally, we currently limit the conversion to when the
> GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
> config will be added in a later change.
>
> The trickiest thing about this conversion is that we might not be able
> to mark a directory as a sparse directory just because it is outside the
> sparse cone. There might be unmerged files within that directory, so we
> need to look for those. Also, if there is some strange reason why a file
> is not marked with CE_SKIP_WORKTREE, then we should give up on
> converting that directory. There is still hope that some of its
> subdirectories might be able to convert to sparse, so we keep looking
> deeper.
Oh good, you check for *both* unmerged entries and !CE_SKIP_WORKTREE
ones. Very nice.
>
> The conversion process is assisted by the cache-tree extension. This is
> calculated from the full index if it does not already exist. We then
> abandon the cache-tree as it no longer applies to the newly-sparse
> index. Thus, this cache-tree will be recalculated in every
> sparse-full-sparse round-trip until we integrate the cache-tree
> extension with the sparse index.
When going from full to sparse, won't the parts of the cache-tree for
paths outside of sparsified directories still be valid? Can't we use
those?
Also, when going from sparse to full, can't we just populate the
cache-tree as well since we have to read trees to get the individual
file entries and we will get all the directory (tree) values at the
same time?
> We can compare the behavior of the sparse-index in
> t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
> when operating on the 'sparse-index' repo. We can also compare the two
> sparse repos directly, such as comparing their indexes (when expanded to
> full in the case of the 'sparse-index' repo). We also verify that the
> index is actually populated with sparse directory entries.
>
> The 'checkout and reset (mixed)' test is marked for failure when
> comparing a sparse repo to a full repo, but we can compare the two
> sparse-checkout cases directly to ensure that we are not changing the
> behavior when using a sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> cache-tree.c | 3 +
> read-cache.c | 18 ++-
> sparse-index.c | 139 +++++++++++++++++++++++
> t/t1092-sparse-checkout-compatibility.sh | 63 +++++++++-
> 4 files changed, 218 insertions(+), 5 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c08..5f07a39e501 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
> #include "object-store.h"
> #include "replace-object.h"
> #include "promisor-remote.h"
> +#include "sparse-index.h"
>
> #ifndef DEBUG_CACHE_TREE
> #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
> if (i)
> return i;
>
> + ensure_full_index(istate);
> +
> if (!istate->cache_tree)
> istate->cache_tree = cache_tree();
>
> diff --git a/read-cache.c b/read-cache.c
> index 1097ecbf132..0522260416e 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
> #include "fsmonitor.h"
> #include "thread-utils.h"
> #include "progress.h"
> +#include "sparse-index.h"
>
> /* Mask for the name length in ce_flags in the on-disk index */
>
> @@ -1002,8 +1003,15 @@ int verify_path(const char *path, unsigned mode)
>
> c = *path++;
> if ((c == '.' && !verify_dotfile(path, mode)) ||
> - is_dir_sep(c) || c == '\0')
> + is_dir_sep(c))
> return 0;
> + /*
> + * allow terminating directory separators for
> + * sparse directory enries.
s/enries/entries/
> + */
> + if (c == '\0')
> + return mode == CE_MODE_SPARSE_DIRECTORY ||
> + mode == SPARSE_DIR_MODE;
Why two values here? I get confused why you have both (which isn't
new to this patch; I'm just still confused from when I saw
SPARSE_DIR_MODE).
> } else if (c == '\\' && protect_ntfs) {
> if (is_ntfs_dotgit(path))
> return 0;
> @@ -3062,6 +3070,13 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
> {
> int ret;
>
> + ret = convert_to_sparse(istate);
> +
> + if (ret) {
> + warning(_("failed to convert to a sparse-index"));
> + return ret;
> + }
> +
> /*
> * TODO trace2: replace "the_repository" with the actual repo instance
> * that is associated with the given "istate".
> @@ -3165,6 +3180,7 @@ static int write_shared_index(struct index_state *istate,
> int ret;
>
> move_cache_to_base_index(istate);
> + convert_to_sparse(istate);
>
> trace2_region_enter_printf("index", "shared/do_write_index",
> the_repository, "%s", (*temp)->filename.buf);
> diff --git a/sparse-index.c b/sparse-index.c
> index 1e70244dc13..d8f1a5a13d7 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -4,6 +4,145 @@
> #include "tree.h"
> #include "pathspec.h"
> #include "trace2.h"
> +#include "cache-tree.h"
> +#include "config.h"
> +#include "dir.h"
> +#include "fsmonitor.h"
> +
> +static struct cache_entry *construct_sparse_dir_entry(
> + struct index_state *istate,
> + const char *sparse_dir,
> + struct cache_tree *tree)
> +{
> + struct cache_entry *de;
> +
> + de = make_cache_entry(istate, SPARSE_DIR_MODE, &tree->oid, sparse_dir, 0, 0);
> +
> + de->ce_flags |= CE_SKIP_WORKTREE;
> + return de;
> +}
> +
> +/*
> + * Returns the number of entries "inserted" into the index.
> + */
> +static int convert_to_sparse_rec(struct index_state *istate,
> + int num_converted,
> + int start, int end,
> + const char *ct_path, size_t ct_pathlen,
> + struct cache_tree *ct)
> +{
> + int i, can_convert = 1;
> + int start_converted = num_converted;
> + enum pattern_match_result match;
> + int dtype;
> + struct strbuf child_path = STRBUF_INIT;
> + struct pattern_list *pl = istate->sparse_checkout_patterns;
> +
> + /*
> + * Is the current path outside of the sparse cone?
> + * Then check if the region can be replaced by a sparse
> + * directory entry (everything is sparse and merged).
> + */
> + match = path_matches_pattern_list(ct_path, ct_pathlen,
> + NULL, &dtype, pl, istate);
> + if (match != NOT_MATCHED)
> + can_convert = 0;
I know some people hate gotos, but this seems like one of the cases
where a goto jumping after the following for & if would be clearer
then setting can_convert to 0.
> +
> + for (i = start; can_convert && i < end; i++) {
Instead of checking can_convert here...
> + struct cache_entry *ce = istate->cache[i];
> +
> + if (ce_stage(ce) ||
> + !(ce->ce_flags & CE_SKIP_WORKTREE))
> + can_convert = 0;
...could you just insert a break here?
> + }
> +
> + if (can_convert) {
> + struct cache_entry *se;
> + se = construct_sparse_dir_entry(istate, ct_path, ct);
> +
> + istate->cache[num_converted++] = se;
> + return 1;
> + }
> +
> + for (i = start; i < end; ) {
> + int count, span, pos = -1;
> + const char *base, *slash;
> + struct cache_entry *ce = istate->cache[i];
> +
> + /*
> + * Detect if this is a normal entry oustide of any subtree
s/oustide/outside/
> + * entry.
> + */
> + base = ce->name + ct_pathlen;
> + slash = strchr(base, '/');
> +
> + if (slash)
> + pos = cache_tree_subtree_pos(ct, base, slash - base);
> +
> + if (pos < 0) {
> + istate->cache[num_converted++] = ce;
> + i++;
> + continue;
> + }
> +
> + strbuf_setlen(&child_path, 0);
> + strbuf_add(&child_path, ce->name, slash - ce->name + 1);
> +
> + span = ct->down[pos]->cache_tree->entry_count;
> + count = convert_to_sparse_rec(istate,
> + num_converted, i, i + span,
> + child_path.buf, child_path.len,
> + ct->down[pos]->cache_tree);
> + num_converted += count;
> + i += span;
And there's no i++ in the loop itself, so this is good.
> + }
> +
> + strbuf_release(&child_path);
> + return num_converted - start_converted;
> +}
> +
> +int convert_to_sparse(struct index_state *istate)
> +{
> + if (istate->split_index || istate->sparse_index ||
> + !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> + return 0;
> +
> + /*
> + * For now, only create a sparse index with the
> + * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> + * this once we have a proper way to opt-in (and later still,
> + * opt-out).
> + */
> + if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> + return 0;
> +
> + if (!istate->sparse_checkout_patterns) {
> + istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
> + if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
> + return 0;
> + }
> +
> + if (!istate->sparse_checkout_patterns->use_cone_patterns) {
> + warning(_("attempting to use sparse-index without cone mode"));
> + return -1;
> + }
> +
> + if (cache_tree_update(istate, 0)) {
> + warning(_("unable to update cache-tree, staying full"));
> + return -1;
> + }
> +
> + remove_fsmonitor(istate);
> +
> + trace2_region_enter("index", "convert_to_sparse", istate->repo);
> + istate->cache_nr = convert_to_sparse_rec(istate,
> + 0, 0, istate->cache_nr,
> + "", 0, istate->cache_tree);
> + istate->drop_cache_tree = 1;
> + istate->sparse_index = 1;
> + trace2_region_leave("index", "convert_to_sparse", istate->repo);
> + return 0;
> +}
>
> static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
> {
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 3aa9b0d21b4..22becbaca2e 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2,6 +2,9 @@
>
> test_description='compare full workdir to sparse workdir'
>
> +GIT_TEST_CHECK_CACHE_TREE=0
Why do you need to set this? I vaguely remember needing to mess with
this when working with sparse checkouts because it did weird stuff but
I don't remember details. But since you patch touches cache_trees, it
seems weird to show up without explanation.
> +GIT_TEST_SPLIT_INDEX=0
> +
> . ./test-lib.sh
>
> test_expect_success 'setup' '
> @@ -106,7 +109,7 @@ run_on_sparse () {
> ) &&
> (
> cd sparse-index &&
> - $* >../sparse-index-out 2>../sparse-index-err
> + GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err
> )
> }
>
> @@ -121,7 +124,9 @@ run_on_all () {
> test_all_match () {
> run_on_all $* &&
> test_cmp full-checkout-out sparse-checkout-out &&
> - test_cmp full-checkout-err sparse-checkout-err
> + test_cmp full-checkout-out sparse-index-out &&
> + test_cmp full-checkout-err sparse-checkout-err &&
> + test_cmp full-checkout-err sparse-index-err
> }
>
> test_sparse_match () {
> @@ -130,6 +135,38 @@ test_sparse_match () {
> test_cmp sparse-checkout-err sparse-index-err
> }
>
> +test_expect_success 'sparse-index contents' '
> + init_repos &&
> +
> + test-tool -C sparse-index read-cache --table --no-stat >cache &&
> + for dir in folder1 folder2 x
> + do
> + TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> + grep "0b0000 0755 $TREE $dir/" cache \
> + || return 1
> + done &&
> +
> + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +
> + test-tool -C sparse-index read-cache --table --no-stat >cache &&
> + for dir in deep folder2 x
> + do
> + TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> + grep "0b0000 0755 $TREE $dir/" cache \
It would seem clearer to me if this output were to better match `git
ls-tree -rt ...` output (or at least the ' tree ' lines from such
output).
> + || return 1
> + done &&
> +
> + GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +
> + test-tool -C sparse-index read-cache --table --no-stat >cache &&
> + for dir in deep/deeper2 folder1 folder2 x
> + do
> + TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> + grep "0b0000 0755 $TREE $dir/" cache \
> + || return 1
> + done
> +'
> +
> test_expect_success 'expanded in-memory index matches full index' '
> init_repos &&
> test_sparse_match test-tool read-cache --expand --table --no-stat
> @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
>
> test_expect_success 'status with options' '
> init_repos &&
> + test_sparse_match ls &&
> test_all_match git status --porcelain=v2 &&
> test_all_match git status --porcelain=v2 -z -u &&
> test_all_match git status --porcelain=v2 -uno &&
> @@ -169,7 +207,7 @@ test_expect_success 'add, commit, checkout' '
>
> test_all_match git add -A &&
> test_all_match git status --porcelain=v2 &&
> - test_all_match git commit -m "Extend README.md" &&
> + test_all_match git commit -m "Extend-README.md" &&
Why this change?
>
> test_all_match git checkout HEAD~1 &&
> test_all_match git checkout - &&
> @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
> test_all_match git reset update-folder2
> '
>
> +# Ensure that sparse-index behaves identically to
> +# sparse-checkout with a full index.
> +test_expect_success 'checkout and reset (mixed) [sparse]' '
> + init_repos &&
> +
> + test_sparse_match git checkout -b reset-test update-deep &&
> + test_sparse_match git reset deepest &&
> + test_sparse_match git reset update-folder1 &&
> + test_sparse_match git reset update-folder2
> +'
> +
> test_expect_success 'merge' '
> init_repos &&
>
> @@ -309,14 +358,20 @@ test_expect_success 'clean' '
> test_all_match git status --porcelain=v2 &&
> test_all_match git clean -f &&
> test_all_match git status --porcelain=v2 &&
> + test_sparse_match ls &&
> + test_sparse_match ls folder1 &&
>
> test_all_match git clean -xf &&
> test_all_match git status --porcelain=v2 &&
> + test_sparse_match ls &&
> + test_sparse_match ls folder1 &&
>
> test_all_match git clean -xdf &&
> test_all_match git status --porcelain=v2 &&
> + test_sparse_match ls &&
> + test_sparse_match ls folder1 &&
>
> - test_path_is_dir sparse-checkout/folder1
> + test_sparse_match test_path_is_dir folder1
> '
>
> test_done
> --
> gitgitgadget
| @@ -2,11 +2,15 @@ | |||
|
|
|||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a test case that uses test_region to ensure that we are truly
> expanding a sparse index to a full one, then converting back to sparse
> when writing the index. As we integrate more Git commands with the
> sparse index, we will convert these commands to check that we do _not_
> convert the sparse index to a full index and instead stay sparse the
> entire time.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 22becbaca2e..a22def89e37 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -374,4 +374,21 @@ test_expect_success 'clean' '
> test_sparse_match test_path_is_dir folder1
> '
>
> +test_expect_success 'sparse-index is expanded and converted back' '
> + init_repos &&
> +
> + (
> + (GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) &&
Drop the parentheses.
What system are you running on that this test passed for you with
those parentheses there? I checked out this particular commit and ran
the test -- and it fails for me. Removing the parentheses makes the
test pass.
Is there some shell where parentheses only function as grouping,
similar to bash's {...}, rather than as a subshell, the way bash
handles (...) ?
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> + git -C sparse-index -c core.fsmonitor="" reset --hard &&
> + test_region index convert_to_sparse trace2.txt &&
> + test_region index ensure_full_index trace2.txt &&
> +
> + rm trace2.txt &&
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> + git -C sparse-index -c core.fsmonitor="" status -uno &&
> + test_region index ensure_full_index trace2.txt
> + )
> +'
> +
> test_done
> --
> gitgitgadget
Otherwise, I like the test and this commit.
| @@ -6,3 +6,10 @@ extensions.objectFormat:: | |||
| Note that this setting should only be set by linkgit:git-init[1] or | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Previously, we enabled the sparse index format only using
> GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
> actually select this mode. Further, sparse directory entries are not
> understood by the index formats as advertised.
>
> We _could_ add a new index version that explicitly adds these
> capabilities, but there are nuances to index formats 2, 3, and 4 that
> are still valuable to select as options. For now, create a repo
> extension, "extensions.sparseIndex", that specifies that the tool
> reading this repository must understand sparse directory entries.
>
> This change only encodes the extension and enables it when
> GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
> mechanism.
One other interesting thing to note is that last I checked, jgit
doesn't support index format v4, which makes us unable to use it.
Making a v5 would force jgit to support all previous index formats in
order to support your new feature.
However, the jgit thing is going to make it hard for me to find other
users willing to test out this feature at $DAYJOB. But I don't think
there's anyway around that; you need to change the index format. And
you at least have jrnieder cc'ed. :-)
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/config/extensions.txt | 7 ++++++
> cache.h | 1 +
> repo-settings.c | 7 ++++++
> repository.h | 3 ++-
> setup.c | 3 +++
> sparse-index.c | 38 +++++++++++++++++++++++++----
> 6 files changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
> index 4e23d73cdca..5c86b364873 100644
> --- a/Documentation/config/extensions.txt
> +++ b/Documentation/config/extensions.txt
> @@ -6,3 +6,10 @@ extensions.objectFormat::
> Note that this setting should only be set by linkgit:git-init[1] or
> linkgit:git-clone[1]. Trying to change it after initialization will not
> work and will produce hard-to-diagnose issues.
> +
> +extensions.sparseIndex::
> + When combined with `core.sparseCheckout=true` and
> + `core.sparseCheckoutCone=true`, the index may contain entries
> + corresponding to directories outside of the sparse-checkout
> + definition. Versions of Git that do not understand this extension
> + do not expect directory entries in the index.
Perhaps to make this slightly more explicit ("corresponding to" can be
fuzzy and be read to assume you are talking about file entries
underneath a directory rather than directory entries, so add an extra
phrase to rule that out):
...the index may contain entries corresponding to directories outside
of the sparse-checkout definition in lieu of containing each path
under such directories...
> diff --git a/cache.h b/cache.h
> index b05341cc687..dcf089b7006 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1054,6 +1054,7 @@ struct repository_format {
> int worktree_config;
> int is_bare;
> int hash_algo;
> + int sparse_index;
> char *work_tree;
> struct string_list unknown_extensions;
> struct string_list v1_only_extensions;
> diff --git a/repo-settings.c b/repo-settings.c
> index d63569e4041..9677d50f923 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
> * removed.
> */
> r->settings.command_requires_full_index = 1;
> +
> + /*
> + * Initialize this as off.
> + */
> + r->settings.sparse_index = 0;
> + if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
> + r->settings.sparse_index = 1;
> }
> diff --git a/repository.h b/repository.h
> index e06a2301569..a45f7520fd9 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -42,7 +42,8 @@ struct repo_settings {
>
> int core_multi_pack_index;
>
> - unsigned command_requires_full_index:1;
> + unsigned command_requires_full_index:1,
> + sparse_index:1;
> };
>
> struct repository {
> diff --git a/setup.c b/setup.c
> index c04cd25a30d..cd839456461 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
> return error("invalid value for 'extensions.objectformat'");
> data->hash_algo = format;
> return EXTENSION_OK;
> + } else if (!strcmp(ext, "sparseindex")) {
> + data->sparse_index = 1;
> + return EXTENSION_OK;
> }
> return EXTENSION_UNKNOWN;
> }
> diff --git a/sparse-index.c b/sparse-index.c
> index 5dd0b835b9d..71544095267 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
> return num_converted - start_converted;
> }
>
> +static int enable_sparse_index(struct repository *repo)
> +{
> + const char *config_path = repo_git_path(repo, "config.worktree");
> +
> + if (upgrade_repository_format(1) < 0) {
> + warning(_("unable to upgrade repository format to enable sparse-index"));
> + return -1;
> + }
> + git_config_set_in_file_gently(config_path,
> + "extensions.sparseIndex",
> + "true");
> +
> + prepare_repo_settings(repo);
> + repo->settings.sparse_index = 1;
> + return 0;
> +}
> +
> int convert_to_sparse(struct index_state *istate)
> {
> if (istate->split_index || istate->sparse_index ||
> !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> return 0;
>
> + if (!istate->repo)
> + istate->repo = the_repository;
> +
> + /*
> + * The GIT_TEST_SPARSE_INDEX environment variable triggers the
> + * extensions.sparseIndex config variable to be on.
> + */
> + if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
> + int err = enable_sparse_index(istate->repo);
> + if (err < 0)
> + return err;
> + }
> +
> /*
> - * For now, only create a sparse index with the
> - * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> - * this once we have a proper way to opt-in (and later still,
> - * opt-out).
> + * Only convert to sparse if extensions.sparseIndex is set.
> */
> - if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> + prepare_repo_settings(istate->repo);
> + if (!istate->repo->settings.sparse_index)
> return 0;
>
> if (!istate->sparse_checkout_patterns) {
> --
> gitgitgadget
| @@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the | |||
| When `--cone` is provided, the `core.sparseCheckoutCone` setting is | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The sparse index extension is used to signal that index writes should be
> in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.
>
> Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
> specifies if the sparse index should be used. It also updates the index
> to use the correct format, either way. Add a warning in the
> documentation that the use of a repository extension might reduce
> compatibility with third-party tools. 'git sparse-checkout init' already
> sets extension.worktreeConfig, which places most sparse-checkout users
> outside of the scope of most third-party tools.
Heh, looks like you're addressing my comments on the last patch about
jgit. If I would have just read on...
One side question, though -- I thought I remembered seeing that we
record index versions or extension information directly in the index,
so that third party tools have a way of noting that the index has
something they won't understand, rather than just reading values that
appear to be corrupt to them. Perhaps I missed it, but have you done
anything like that with this series?
>
> Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
> GIT_TEST_SPARSE_INDEX=1.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-sparse-checkout.txt | 14 +++++++++
> builtin/sparse-checkout.c | 17 ++++++++++-
> sparse-index.c | 38 ++++++++++++++++--------
> sparse-index.h | 3 ++
> t/t1092-sparse-checkout-compatibility.sh | 33 ++++++++++----------
> 5 files changed, 75 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
> index a0eeaeb02ee..b51b8450cfd 100644
> --- a/Documentation/git-sparse-checkout.txt
> +++ b/Documentation/git-sparse-checkout.txt
> @@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
> When `--cone` is provided, the `core.sparseCheckoutCone` setting is
> also set, allowing for better performance with a limited set of
> patterns (see 'CONE PATTERN SET' below).
> ++
> +Use the `--[no-]sparse-index` option to toggle the use of the sparse
> +index format. This reduces the size of the index to be more closely
> +aligned with your sparse-checkout definition. This can have significant
> +performance advantages for commands such as `git status` or `git add`.
> +This feature is still experimental. Some commands might be slower with
> +a sparse index until they are properly integrated with the feature.
> ++
> +**WARNING:** Using a sparse index requires modifying the index in a way
> +that is not completely understood by other tools. Enabling sparse index
> +enables the `extensions.spareseIndex` config value, which might cause
extensions.sparseIndex; you have an extra 'e' in there.
> +other tools to stop working with your repository. If you have trouble with
> +this compatibility, then run `git sparse-checkout sparse-index disable` to
> +remove this config and rewrite your index to not be sparse.
>
> 'set'::
> Write a set of patterns to the sparse-checkout file, as given as
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index e00b82af727..ca63e2c64e9 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -14,6 +14,7 @@
> #include "unpack-trees.h"
> #include "wt-status.h"
> #include "quote.h"
> +#include "sparse-index.h"
>
> static const char *empty_base = "";
>
> @@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
> }
>
> static char const * const builtin_sparse_checkout_init_usage[] = {
> - N_("git sparse-checkout init [--cone]"),
> + N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
This all makes sense, but between partial clones, sparse-checkouts and
sparse-indexes, I wonder if we're overloading users with terms and
conditions. Perhaps that's inevitable in the short-term due to the
various caveats that exist, but I'd just like to put out a fuzzy
high-level goal of allowing users in the future to just specify "I
want a sparse clone of this stuff" with as few special knobs and flags
as possible. I don't want them to have to specify all of the
individual things that means, such as they want (a) the history to be
sparse (i.e. partial clone), (b) the checkout to be sparse, (c) the
index to be sparse, (d) several commands to operate in a sparse
manner, limiting their output based on the sparsity paths (hopefully
they aren't required to list each one), and (e) several other commands
shouldn't be limited by the sparsity paths. I guess it might be nice
to _allow_ them to specify all the things it means for users who want
control, but it'd be nice to avoid requiring it of all users.
> NULL
> };
>
> static struct sparse_checkout_init_opts {
> int cone_mode;
> + int sparse_index;
> } init_opts;
>
> static int sparse_checkout_init(int argc, const char **argv)
> @@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
> static struct option builtin_sparse_checkout_init_options[] = {
> OPT_BOOL(0, "cone", &init_opts.cone_mode,
> N_("initialize the sparse-checkout in cone mode")),
> + OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
> + N_("toggle the use of a sparse index")),
> OPT_END(),
> };
>
> repo_read_index(the_repository);
>
> + init_opts.sparse_index = -1;
> +
> argc = parse_options(argc, argv, NULL,
> builtin_sparse_checkout_init_options,
> builtin_sparse_checkout_init_usage, 0);
> @@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
> sparse_filename = get_sparse_checkout_filename();
> res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
>
> + if (init_opts.sparse_index >= 0) {
> + if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
> + die(_("failed to modify sparse-index config"));
> +
> + /* force an index rewrite */
> + repo_read_index(the_repository);
> + the_repository->index->updated_workdir = 1;
> + }
> +
> /* If we already have a sparse-checkout file, use it. */
> if (res >= 0) {
> free(sparse_filename);
> diff --git a/sparse-index.c b/sparse-index.c
> index 71544095267..3552f88fb03 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -104,23 +104,38 @@ static int convert_to_sparse_rec(struct index_state *istate,
>
> static int enable_sparse_index(struct repository *repo)
> {
> - const char *config_path = repo_git_path(repo, "config.worktree");
> + int res;
>
> if (upgrade_repository_format(1) < 0) {
> warning(_("unable to upgrade repository format to enable sparse-index"));
> return -1;
> }
> - git_config_set_in_file_gently(config_path,
> - "extensions.sparseIndex",
> - "true");
> + res = git_config_set_gently("extensions.sparseindex", "true");
>
> prepare_repo_settings(repo);
> repo->settings.sparse_index = 1;
> - return 0;
> + return res;
> +}
> +
> +int set_sparse_index_config(struct repository *repo, int enable)
> +{
> + int res;
> +
> + if (enable)
> + return enable_sparse_index(repo);
> +
> + /* Don't downgrade repository format, just remove the extension. */
> + res = git_config_set_multivar_gently("extensions.sparseindex", NULL, "",
> + CONFIG_FLAGS_MULTI_REPLACE);
> +
> + prepare_repo_settings(repo);
> + repo->settings.sparse_index = 0;
> + return res;
> }
>
> int convert_to_sparse(struct index_state *istate)
> {
> + int test_env;
> if (istate->split_index || istate->sparse_index ||
> !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> return 0;
> @@ -129,14 +144,13 @@ int convert_to_sparse(struct index_state *istate)
> istate->repo = the_repository;
>
> /*
> - * The GIT_TEST_SPARSE_INDEX environment variable triggers the
> - * extensions.sparseIndex config variable to be on.
> + * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
> + * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
> + * then purposefully disable the setting.
> */
> - if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
> - int err = enable_sparse_index(istate->repo);
> - if (err < 0)
> - return err;
> - }
> + test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
> + if (test_env >= 0)
> + set_sparse_index_config(istate->repo, test_env);
>
> /*
> * Only convert to sparse if extensions.sparseIndex is set.
> diff --git a/sparse-index.h b/sparse-index.h
> index a2777dcac59..ca936e95d11 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -5,4 +5,7 @@ struct index_state;
> void ensure_full_index(struct index_state *istate);
> int convert_to_sparse(struct index_state *istate);
>
> +struct repository;
> +int set_sparse_index_config(struct repository *repo, int enable);
> +
> #endif
> \ No newline at end of file
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index a22def89e37..c6b7e8b8891 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
>
> GIT_TEST_CHECK_CACHE_TREE=0
> GIT_TEST_SPLIT_INDEX=0
> +GIT_TEST_SPARSE_INDEX=
>
> . ./test-lib.sh
>
> @@ -98,8 +99,9 @@ init_repos () {
> # initialize sparse-checkout definitions
> git -C sparse-checkout sparse-checkout init --cone &&
> git -C sparse-checkout sparse-checkout set deep &&
> - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
> + git -C sparse-index sparse-checkout init --cone --sparse-index &&
> + test_cmp_config -C sparse-index true extensions.sparseindex &&
> + git -C sparse-index sparse-checkout set deep
> }
>
> run_on_sparse () {
> @@ -109,7 +111,7 @@ run_on_sparse () {
> ) &&
> (
> cd sparse-index &&
> - GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err
> + $* >../sparse-index-out 2>../sparse-index-err
> )
> }
>
> @@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
> || return 1
> done &&
>
> - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> + git -C sparse-index sparse-checkout set folder1 &&
>
> test-tool -C sparse-index read-cache --table --no-stat >cache &&
> for dir in deep folder2 x
> @@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
> || return 1
> done &&
>
> - GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> + git -C sparse-index sparse-checkout set deep/deeper1 &&
>
> test-tool -C sparse-index read-cache --table --no-stat >cache &&
> for dir in deep/deeper2 folder1 folder2 x
> @@ -377,18 +379,15 @@ test_expect_success 'clean' '
> test_expect_success 'sparse-index is expanded and converted back' '
> init_repos &&
>
> - (
> - (GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) &&
> - GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> - git -C sparse-index -c core.fsmonitor="" reset --hard &&
> - test_region index convert_to_sparse trace2.txt &&
> - test_region index ensure_full_index trace2.txt &&
> -
> - rm trace2.txt &&
> - GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> - git -C sparse-index -c core.fsmonitor="" status -uno &&
> - test_region index ensure_full_index trace2.txt
> - )
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> + git -C sparse-index -c core.fsmonitor="" reset --hard &&
> + test_region index convert_to_sparse trace2.txt &&
> + test_region index ensure_full_index trace2.txt &&
> +
> + rm trace2.txt &&
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> + git -C sparse-index -c core.fsmonitor="" status -uno &&
> + test_region index ensure_full_index trace2.txt
> '
>
> test_done
> --
> gitgitgadget
I need to take a break from reviewing again at this point and work on
some other tasks. I'll resume reviewing the series later, perhaps
tomorrow afternoon.
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 1/27/2021 1:18 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>
> I need to take a break from reviewing again at this point and work on
> some other tasks. I'll resume reviewing the series later, perhaps
> tomorrow afternoon.
I appreciate your efforts here! I'm delaying detailed responses until
enough time has passed for interested parties to comment. If it helps,
then focusing on the "big things" is most important for now. I'll be
more careful about typos and things when I submit patches for full
review.
For my part, I'm taking time to look around at other things that need
my attention after being heads-down on this prototype. The sparse
index will need proper care to implement and I'm not expecting it to
move very quickly.
Thanks,
-Stolee
| @@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state, | |||
| if (!patch->is_new) | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This giant patch is not intended for actual review. I have a branch that
> has these changes split out in a sane way with some commentary in each
> file that is modified.
>
> The idea here is to guard certain portions of the codebase that do not
> know how to handle sparse indexes by ensuring that the index is expanded
> to a full index before proceeding with the logic.
>
> This also provides a good mechanism for testing which code needs
> updating to enable the sparse index in a Git builtin. The builtin can
> set the_repository->settings.command_requires_full_index to zero and
> then we can debug the command with a breakpoint on ensure_full_index().
> That identifies the portion of code that needs adjusting before enabling
> sparse indexes for that command.
>
> Some index operations must be changed to operate on a non-const pointer,
> since ensuring a full index will modify the index itself.
>
> There are likely some gaps to these protections, which is why it will be
> important to carefully test each scenario as we relax the requirements.
> I expect that to be a long effort.
I think the idea makes sense; it provides a way for us to
incrementally build support for this new feature.
I skimmed over the code and noticed various interesting places that
had the ensure_full_index() call (e.g.
read_skip_worktree_file_from_index() -- whose existence comes from
sparsity; what irony...). Better breakouts would be great, so I'll
defer commenting much until then. But, just to verify I'm
understanding: the primary defence is the command_requires_full_index
setting, and you have added several ensure_full_index() calls
throughout the code in places you believe would need to be fixed up in
case someone switches the command_requires_full_index setting. Is
that correct? And your comment on the gaps is just that there may be
other places that are missing the secondary protection (as opposed to
my first reading of that paragraph as suggesting we aren't sure if we
have enough protections yet and need to add more before this moves out
of RFC); is that right?
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> apply.c | 10 +++++++++-
> blame.c | 7 ++++++-
> builtin/checkout-index.c | 5 ++++-
> builtin/grep.c | 2 ++
> builtin/ls-files.c | 9 ++++++++-
> builtin/merge-index.c | 2 ++
> builtin/mv.c | 2 ++
> builtin/rm.c | 2 ++
> builtin/sparse-checkout.c | 1 +
> builtin/update-index.c | 2 ++
> cache.h | 1 +
> diff-lib.c | 2 ++
> diff.c | 2 ++
> dir.c | 14 +++++++++++++-
> entry.c | 2 ++
> fsmonitor.c | 11 ++++++++++-
> merge-recursive.c | 22 +++++++++++++++++++---
> name-hash.c | 6 ++++++
> pathspec.c | 5 +++--
> pathspec.h | 4 ++--
> read-cache.c | 19 +++++++++++++++++--
> rerere.c | 2 ++
> resolve-undo.c | 6 ++++++
> sha1-name.c | 3 +++
> split-index.c | 2 ++
> submodule.c | 24 +++++++++++++++++++-----
> submodule.h | 6 +++---
> tree.c | 2 ++
> wt-status.c | 7 +++++++
> 29 files changed, 159 insertions(+), 23 deletions(-)
>
> diff --git a/apply.c b/apply.c
> index 668b16e9893..5bfbd928b38 100644
> --- a/apply.c
> +++ b/apply.c
> @@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state,
> if (!patch->is_new)
> BUG("patch to %s is not a creation", patch->old_name);
>
> + ensure_full_index(state->repo->index);
> +
> pos = index_name_pos(state->repo->index, name, strlen(name));
> if (pos < 0)
> return error(_("%s: does not exist in index"), name);
> @@ -3692,7 +3694,11 @@ static int check_preimage(struct apply_state *state,
> }
>
> if (state->check_index && !previous) {
> - int pos = index_name_pos(state->repo->index, old_name,
> + int pos;
> +
> + ensure_full_index(state->repo->index);
> +
> + pos = index_name_pos(state->repo->index, old_name,
> strlen(old_name));
> if (pos < 0) {
> if (patch->is_new < 0)
> @@ -3751,6 +3757,8 @@ static int check_to_create(struct apply_state *state,
> if (state->check_index && (!ok_if_exists || !state->cached)) {
> int pos;
>
> + ensure_full_index(state->repo->index);
> +
> pos = index_name_pos(state->repo->index, new_name, strlen(new_name));
> if (pos >= 0) {
> struct cache_entry *ce = state->repo->index->cache[pos];
> diff --git a/blame.c b/blame.c
> index a5044fcfaa6..0aa368a35cf 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -108,6 +108,7 @@ static void verify_working_tree_path(struct repository *r,
> return;
> }
>
> + ensure_full_index(r->index);
> pos = index_name_pos(r->index, path, strlen(path));
> if (pos >= 0)
> ; /* path is in the index */
> @@ -277,7 +278,11 @@ static struct commit *fake_working_tree_commit(struct repository *r,
>
> len = strlen(path);
> if (!mode) {
> - int pos = index_name_pos(r->index, path, len);
> + int pos;
> +
> + ensure_full_index(r->index);
> +
> + pos = index_name_pos(r->index, path, len);
> if (0 <= pos)
> mode = r->index->cache[pos]->ce_mode;
> else
> diff --git a/builtin/checkout-index.c b/builtin/checkout-index.c
> index 4bbfc92dce5..24c85b1c125 100644
> --- a/builtin/checkout-index.c
> +++ b/builtin/checkout-index.c
> @@ -48,11 +48,14 @@ static void write_tempfile_record(const char *name, const char *prefix)
> static int checkout_file(const char *name, const char *prefix)
> {
> int namelen = strlen(name);
> - int pos = cache_name_pos(name, namelen);
> + int pos;
> int has_same_name = 0;
> int did_checkout = 0;
> int errs = 0;
>
> + ensure_full_index(the_repository->index);
> + pos = index_name_pos(the_repository->index, name, namelen);
> +
> if (pos < 0)
> pos = -pos - 1;
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index ca259af4416..e53cf817204 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -506,6 +506,8 @@ static int grep_cache(struct grep_opt *opt,
> if (repo_read_index(repo) < 0)
> die(_("index file corrupt"));
>
> + ensure_full_index(repo->index);
> +
> for (nr = 0; nr < repo->index->cache_nr; nr++) {
> const struct cache_entry *ce = repo->index->cache[nr];
> strbuf_setlen(&name, name_base_len);
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index c8eae899b82..933e259cdbe 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -150,7 +150,7 @@ static void show_other_files(const struct index_state *istate,
> }
> }
>
> -static void show_killed_files(const struct index_state *istate,
> +static void show_killed_files(struct index_state *istate,
> const struct dir_struct *dir)
> {
> int i;
> @@ -159,6 +159,8 @@ static void show_killed_files(const struct index_state *istate,
> char *cp, *sp;
> int pos, len, killed = 0;
>
> + ensure_full_index(istate);
> +
> for (cp = ent->name; cp - ent->name < ent->len; cp = sp + 1) {
> sp = strchr(cp, '/');
> if (!sp) {
> @@ -313,6 +315,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> show_killed_files(repo->index, dir);
> }
> if (show_cached || show_stage) {
> + ensure_full_index(repo->index);
> for (i = 0; i < repo->index->cache_nr; i++) {
> const struct cache_entry *ce = repo->index->cache[i];
>
> @@ -332,6 +335,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> }
> }
> if (show_deleted || show_modified) {
> + ensure_full_index(repo->index);
> for (i = 0; i < repo->index->cache_nr; i++) {
> const struct cache_entry *ce = repo->index->cache[i];
> struct stat st;
> @@ -368,6 +372,7 @@ static void prune_index(struct index_state *istate,
>
> if (!prefix || !istate->cache_nr)
> return;
> + ensure_full_index(istate);
> pos = index_name_pos(istate, prefix, prefixlen);
> if (pos < 0)
> pos = -pos-1;
> @@ -428,6 +433,8 @@ void overlay_tree_on_index(struct index_state *istate,
> if (!tree)
> die("bad tree-ish %s", tree_name);
>
> + ensure_full_index(istate);
> +
> /* Hoist the unmerged entries up to stage #3 to make room */
> for (i = 0; i < istate->cache_nr; i++) {
> struct cache_entry *ce = istate->cache[i];
> diff --git a/builtin/merge-index.c b/builtin/merge-index.c
> index 38ea6ad6ca2..3e1ddabd650 100644
> --- a/builtin/merge-index.c
> +++ b/builtin/merge-index.c
> @@ -80,6 +80,8 @@ int cmd_merge_index(int argc, const char **argv, const char *prefix)
>
> read_cache();
>
> + ensure_full_index(&the_index);
> +
> i = 1;
> if (!strcmp(argv[i], "-o")) {
> one_shot = 1;
> diff --git a/builtin/mv.c b/builtin/mv.c
> index 7dac714af90..2ab6416fce9 100644
> --- a/builtin/mv.c
> +++ b/builtin/mv.c
> @@ -145,6 +145,8 @@ int cmd_mv(int argc, const char **argv, const char *prefix)
> if (read_cache() < 0)
> die(_("index file corrupt"));
>
> + ensure_full_index(&the_index);
> +
> source = internal_prefix_pathspec(prefix, argv, argc, 0);
> modes = xcalloc(argc, sizeof(enum update_mode));
> /*
> diff --git a/builtin/rm.c b/builtin/rm.c
> index 4858631e0f0..2db4fcd22d9 100644
> --- a/builtin/rm.c
> +++ b/builtin/rm.c
> @@ -291,6 +291,8 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>
> refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, &pathspec, NULL, NULL);
>
> + ensure_full_index(&the_index);
> +
> seen = xcalloc(pathspec.nr, 1);
>
> for (i = 0; i < active_nr; i++) {
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index ca63e2c64e9..14022b5e182 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -123,6 +123,7 @@ static int update_working_directory(struct pattern_list *pl)
> o.pl = pl;
>
> setup_work_tree();
> + ensure_full_index(r->index);
>
> repo_hold_locked_index(r, &lock_file, LOCK_DIE_ON_ERROR);
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 79087bccea4..521a6c23c75 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -1088,6 +1088,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>
> the_index.updated_skipworktree = 1;
>
> + ensure_full_index(&the_index);
> +
> /*
> * Custom copy of parse_options() because we want to handle
> * filename arguments as they come.
> diff --git a/cache.h b/cache.h
> index dcf089b7006..306eab444b9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -346,6 +346,7 @@ void add_name_hash(struct index_state *istate, struct cache_entry *ce);
> void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
> void free_name_hash(struct index_state *istate);
>
> +void ensure_full_index(struct index_state *istate);
>
> /* Cache entry creation and cleanup */
>
> diff --git a/diff-lib.c b/diff-lib.c
> index b73cc1859a4..3743e4463b4 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -96,6 +96,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
> uint64_t start = getnanotime();
> struct index_state *istate = revs->diffopt.repo->index;
>
> + ensure_full_index(istate);
> +
> diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
>
> refresh_fsmonitor(istate);
> diff --git a/diff.c b/diff.c
> index 2253ec88029..02fafee8587 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3901,6 +3901,8 @@ static int reuse_worktree_file(struct index_state *istate,
> if (!want_file && would_convert_to_git(istate, name))
> return 0;
>
> + ensure_full_index(istate);
> +
> len = strlen(name);
> pos = index_name_pos(istate, name, len);
> if (pos < 0)
> diff --git a/dir.c b/dir.c
> index d153a63bbd1..ad6eb033cb1 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -892,13 +892,15 @@ void add_pattern(const char *string, const char *base,
> add_pattern_to_hashsets(pl, pattern);
> }
>
> -static int read_skip_worktree_file_from_index(const struct index_state *istate,
> +static int read_skip_worktree_file_from_index(struct index_state *istate,
> const char *path,
> size_t *size_out, char **data_out,
> struct oid_stat *oid_stat)
> {
> int pos, len;
>
> + ensure_full_index(istate);
> +
> len = strlen(path);
> pos = index_name_pos(istate, path, len);
> if (pos < 0)
> @@ -1088,6 +1090,10 @@ static int add_patterns(const char *fname, const char *base, int baselen,
> close(fd);
> if (oid_stat) {
> int pos;
> +
> + if (istate)
> + ensure_full_index(istate);
> +
> if (oid_stat->valid &&
> !match_stat_data_racy(istate, &oid_stat->stat, &st))
> ; /* no content change, oid_stat->oid still good */
> @@ -1696,6 +1702,8 @@ static enum exist_status directory_exists_in_index(struct index_state *istate,
> if (ignore_case)
> return directory_exists_in_index_icase(istate, dirname, len);
>
> + ensure_full_index(istate);
> +
> pos = index_name_pos(istate, dirname, len);
> if (pos < 0)
> pos = -pos-1;
> @@ -2050,6 +2058,8 @@ static int get_index_dtype(struct index_state *istate,
> int pos;
> const struct cache_entry *ce;
>
> + ensure_full_index(istate);
> +
> ce = index_file_exists(istate, path, len, 0);
> if (ce) {
> if (!ce_uptodate(ce))
> @@ -3536,6 +3546,8 @@ static void connect_wt_gitdir_in_nested(const char *sub_worktree,
> if (repo_read_index(&subrepo) < 0)
> die(_("index file corrupt in repo %s"), subrepo.gitdir);
>
> + ensure_full_index(subrepo.index);
> +
> for (i = 0; i < subrepo.index->cache_nr; i++) {
> const struct cache_entry *ce = subrepo.index->cache[i];
>
> diff --git a/entry.c b/entry.c
> index a0532f1f000..d505e6f2c6e 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -412,6 +412,8 @@ static void mark_colliding_entries(const struct checkout *state,
>
> ce->ce_flags |= CE_MATCHED;
>
> + ensure_full_index(state->istate);
> +
> for (i = 0; i < state->istate->cache_nr; i++) {
> struct cache_entry *dup = state->istate->cache[i];
>
> diff --git a/fsmonitor.c b/fsmonitor.c
> index fe9e9d7baf4..7b8cd3975b9 100644
> --- a/fsmonitor.c
> +++ b/fsmonitor.c
> @@ -97,6 +97,9 @@ int read_fsmonitor_extension(struct index_state *istate, const void *data,
> void fill_fsmonitor_bitmap(struct index_state *istate)
> {
> unsigned int i, skipped = 0;
> +
> + ensure_full_index(istate);
> +
> istate->fsmonitor_dirty = ewah_new();
> for (i = 0; i < istate->cache_nr; i++) {
> if (istate->cache[i]->ce_flags & CE_REMOVE)
> @@ -158,7 +161,11 @@ static int query_fsmonitor(int version, const char *last_update, struct strbuf *
>
> static void fsmonitor_refresh_callback(struct index_state *istate, const char *name)
> {
> - int pos = index_name_pos(istate, name, strlen(name));
> + int pos;
> +
> + ensure_full_index(istate);
> +
> + pos = index_name_pos(istate, name, strlen(name));
>
> if (pos >= 0) {
> struct cache_entry *ce = istate->cache[pos];
> @@ -330,6 +337,8 @@ void tweak_fsmonitor(struct index_state *istate)
>
> if (istate->fsmonitor_dirty) {
> if (fsmonitor_enabled) {
> + ensure_full_index(istate);
> +
> /* Mark all entries valid */
> for (i = 0; i < istate->cache_nr; i++) {
> istate->cache[i]->ce_flags |= CE_FSMONITOR_VALID;
> diff --git a/merge-recursive.c b/merge-recursive.c
> index f736a0f6323..12109f37723 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -522,6 +522,8 @@ static struct string_list *get_unmerged(struct index_state *istate)
>
> unmerged->strdup_strings = 1;
>
> + ensure_full_index(istate);
> +
> for (i = 0; i < istate->cache_nr; i++) {
> struct string_list_item *item;
> struct stage_data *e;
> @@ -762,6 +764,8 @@ static int dir_in_way(struct index_state *istate, const char *path,
> strbuf_addstr(&dirpath, path);
> strbuf_addch(&dirpath, '/');
>
> + ensure_full_index(istate);
> +
> pos = index_name_pos(istate, dirpath.buf, dirpath.len);
>
> if (pos < 0)
> @@ -785,9 +789,13 @@ static int dir_in_way(struct index_state *istate, const char *path,
> static int was_tracked_and_matches(struct merge_options *opt, const char *path,
> const struct diff_filespec *blob)
> {
> - int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> + int pos;
> struct cache_entry *ce;
>
> + ensure_full_index(&opt->priv->orig_index);
> +
> + pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> +
> if (0 > pos)
> /* we were not tracking this path before the merge */
> return 0;
> @@ -802,7 +810,11 @@ static int was_tracked_and_matches(struct merge_options *opt, const char *path,
> */
> static int was_tracked(struct merge_options *opt, const char *path)
> {
> - int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> + int pos;
> +
> + ensure_full_index(&opt->priv->orig_index);
> +
> + pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
>
> if (0 <= pos)
> /* we were tracking this path before the merge */
> @@ -814,6 +826,9 @@ static int was_tracked(struct merge_options *opt, const char *path)
> static int would_lose_untracked(struct merge_options *opt, const char *path)
> {
> struct index_state *istate = opt->repo->index;
> + int pos;
> +
> + ensure_full_index(istate);
>
> /*
> * This may look like it can be simplified to:
> @@ -832,7 +847,7 @@ static int would_lose_untracked(struct merge_options *opt, const char *path)
> * update_file()/would_lose_untracked(); see every comment in this
> * file which mentions "update_stages".
> */
> - int pos = index_name_pos(istate, path, strlen(path));
> + pos = index_name_pos(istate, path, strlen(path));
>
> if (pos < 0)
> pos = -1 - pos;
> @@ -3086,6 +3101,7 @@ static int handle_content_merge(struct merge_file_info *mfi,
> * flag to avoid making the file appear as if it were
> * deleted by the user.
> */
> + ensure_full_index(&opt->priv->orig_index);
> pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> ce = opt->priv->orig_index.cache[pos];
> if (ce_skip_worktree(ce)) {
> diff --git a/name-hash.c b/name-hash.c
> index 4e03fac9bb1..0f6d4fcca5a 100644
> --- a/name-hash.c
> +++ b/name-hash.c
> @@ -679,6 +679,8 @@ int index_dir_exists(struct index_state *istate, const char *name, int namelen)
> {
> struct dir_entry *dir;
>
> + ensure_full_index(istate);
> +
> lazy_init_name_hash(istate);
> dir = find_dir_entry(istate, name, namelen);
> return dir && dir->nr;
> @@ -689,6 +691,8 @@ void adjust_dirname_case(struct index_state *istate, char *name)
> const char *startPtr = name;
> const char *ptr = startPtr;
>
> + ensure_full_index( istate);
> +
> lazy_init_name_hash(istate);
> while (*ptr) {
> while (*ptr && *ptr != '/')
> @@ -712,6 +716,8 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
> struct cache_entry *ce;
> unsigned int hash = memihash(name, namelen);
>
> + ensure_full_index(istate);
> +
> lazy_init_name_hash(istate);
>
> ce = hashmap_get_entry_from_hash(&istate->name_hash, hash, NULL,
> diff --git a/pathspec.c b/pathspec.c
> index 7a229d8d22f..9b105855483 100644
> --- a/pathspec.c
> +++ b/pathspec.c
> @@ -20,7 +20,7 @@
> * to use find_pathspecs_matching_against_index() instead.
> */
> void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> - const struct index_state *istate,
> + struct index_state *istate,
> char *seen)
> {
> int num_unmatched = 0, i;
> @@ -36,6 +36,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> num_unmatched++;
> if (!num_unmatched)
> return;
> + ensure_full_index(istate);
> for (i = 0; i < istate->cache_nr; i++) {
> const struct cache_entry *ce = istate->cache[i];
> ce_path_match(istate, ce, pathspec, seen);
> @@ -51,7 +52,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> * given pathspecs achieves against all items in the index.
> */
> char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
> - const struct index_state *istate)
> + struct index_state *istate)
> {
> char *seen = xcalloc(pathspec->nr, 1);
> add_pathspec_matches_against_index(pathspec, istate, seen);
> diff --git a/pathspec.h b/pathspec.h
> index 454ce364fac..f19c5dcf022 100644
> --- a/pathspec.h
> +++ b/pathspec.h
> @@ -150,10 +150,10 @@ static inline int ps_strcmp(const struct pathspec_item *item,
> }
>
> void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> - const struct index_state *istate,
> + struct index_state *istate,
> char *seen);
> char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
> - const struct index_state *istate);
> + struct index_state *istate);
> int match_pathspec_attrs(const struct index_state *istate,
> const char *name, int namelen,
> const struct pathspec_item *item);
> diff --git a/read-cache.c b/read-cache.c
> index 0522260416e..65679d70d7c 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -622,7 +622,11 @@ void remove_marked_cache_entries(struct index_state *istate, int invalidate)
>
> int remove_file_from_index(struct index_state *istate, const char *path)
> {
> - int pos = index_name_pos(istate, path, strlen(path));
> + int pos;
> +
> + ensure_full_index(istate);
> +
> + pos = index_name_pos(istate, path, strlen(path));
> if (pos < 0)
> pos = -pos-1;
> cache_tree_invalidate_path(istate, path);
> @@ -640,9 +644,12 @@ static int compare_name(struct cache_entry *ce, const char *path, int namelen)
> static int index_name_pos_also_unmerged(struct index_state *istate,
> const char *path, int namelen)
> {
> - int pos = index_name_pos(istate, path, namelen);
> + int pos;
> struct cache_entry *ce;
>
> + ensure_full_index(istate);
> +
> + pos = index_name_pos(istate, path, namelen);
> if (pos >= 0)
> return pos;
>
> @@ -717,6 +724,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
> int hash_flags = HASH_WRITE_OBJECT;
> struct object_id oid;
>
> + ensure_full_index(istate);
> +
> if (flags & ADD_CACHE_RENORMALIZE)
> hash_flags |= HASH_RENORMALIZE;
>
> @@ -1095,6 +1104,8 @@ static int has_dir_name(struct index_state *istate,
> size_t len_eq_last;
> int cmp_last = 0;
>
> + ensure_full_index(istate);
> +
> /*
> * We are frequently called during an iteration on a sorted
> * list of pathnames and while building a new index. Therefore,
> @@ -1338,6 +1349,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
> {
> int pos;
>
> + ensure_full_index(istate);
> +
> if (option & ADD_CACHE_JUST_APPEND)
> pos = istate->cache_nr;
> else {
> @@ -1547,6 +1560,8 @@ int refresh_index(struct index_state *istate, unsigned int flags,
> * we only have to do the special cases that are left.
> */
> preload_index(istate, pathspec, 0);
> +
> + ensure_full_index(istate);
> for (i = 0; i < istate->cache_nr; i++) {
> struct cache_entry *ce, *new_entry;
> int cache_errno = 0;
> diff --git a/rerere.c b/rerere.c
> index 9281131a9f1..1836a6cfbcf 100644
> --- a/rerere.c
> +++ b/rerere.c
> @@ -962,6 +962,8 @@ static int handle_cache(struct index_state *istate,
> struct rerere_io_mem io;
> int marker_size = ll_merge_marker_size(istate, path);
>
> + ensure_full_index(istate);
> +
> /*
> * Reproduce the conflicted merge in-core
> */
> diff --git a/resolve-undo.c b/resolve-undo.c
> index 236320f179c..a4265834977 100644
> --- a/resolve-undo.c
> +++ b/resolve-undo.c
> @@ -125,6 +125,8 @@ int unmerge_index_entry_at(struct index_state *istate, int pos)
> if (!istate->resolve_undo)
> return pos;
>
> + ensure_full_index(istate);
> +
> ce = istate->cache[pos];
> if (ce_stage(ce)) {
> /* already unmerged */
> @@ -172,6 +174,8 @@ void unmerge_marked_index(struct index_state *istate)
> if (!istate->resolve_undo)
> return;
>
> + ensure_full_index(istate);
> +
> for (i = 0; i < istate->cache_nr; i++) {
> const struct cache_entry *ce = istate->cache[i];
> if (ce->ce_flags & CE_MATCHED)
> @@ -186,6 +190,8 @@ void unmerge_index(struct index_state *istate, const struct pathspec *pathspec)
> if (!istate->resolve_undo)
> return;
>
> + ensure_full_index(istate);
> +
> for (i = 0; i < istate->cache_nr; i++) {
> const struct cache_entry *ce = istate->cache[i];
> if (!ce_path_match(istate, ce, pathspec, NULL))
> diff --git a/sha1-name.c b/sha1-name.c
> index 0b23b86ceb4..c2f17e526ab 100644
> --- a/sha1-name.c
> +++ b/sha1-name.c
> @@ -1734,6 +1734,8 @@ static void diagnose_invalid_index_path(struct repository *r,
> if (!prefix)
> prefix = "";
>
> + ensure_full_index(r->index);
> +
> /* Wrong stage number? */
> pos = index_name_pos(istate, filename, namelen);
> if (pos < 0)
> @@ -1854,6 +1856,7 @@ static enum get_oid_result get_oid_with_context_1(struct repository *repo,
>
> if (!repo->index || !repo->index->cache)
> repo_read_index(repo);
> + ensure_full_index(repo->index);
> pos = index_name_pos(repo->index, cp, namelen);
> if (pos < 0)
> pos = -pos - 1;
> diff --git a/split-index.c b/split-index.c
> index c0e8ad670d0..3150fa6476a 100644
> --- a/split-index.c
> +++ b/split-index.c
> @@ -4,6 +4,8 @@
>
> struct split_index *init_split_index(struct index_state *istate)
> {
> + ensure_full_index(istate);
> +
> if (!istate->split_index) {
> istate->split_index = xcalloc(1, sizeof(*istate->split_index));
> istate->split_index->refcount = 1;
> diff --git a/submodule.c b/submodule.c
> index b3bb59f0664..f80cfddbd52 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -33,9 +33,13 @@ static struct oid_array ref_tips_after_fetch;
> * will be disabled because we can't guess what might be configured in
> * .gitmodules unless the user resolves the conflict.
> */
> -int is_gitmodules_unmerged(const struct index_state *istate)
> +int is_gitmodules_unmerged(struct index_state *istate)
> {
> - int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
> + int pos;
> +
> + ensure_full_index(istate);
> +
> + pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
> if (pos < 0) { /* .gitmodules not found or isn't merged */
> pos = -1 - pos;
> if (istate->cache_nr > pos) { /* there is a .gitmodules */
> @@ -77,7 +81,11 @@ int is_writing_gitmodules_ok(void)
> */
> int is_staging_gitmodules_ok(struct index_state *istate)
> {
> - int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
> + int pos;
> +
> + ensure_full_index(istate);
> +
> + pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
>
> if ((pos >= 0) && (pos < istate->cache_nr)) {
> struct stat st;
> @@ -301,7 +309,7 @@ int is_submodule_populated_gently(const char *path, int *return_error_code)
> /*
> * Dies if the provided 'prefix' corresponds to an unpopulated submodule
> */
> -void die_in_unpopulated_submodule(const struct index_state *istate,
> +void die_in_unpopulated_submodule(struct index_state *istate,
> const char *prefix)
> {
> int i, prefixlen;
> @@ -311,6 +319,8 @@ void die_in_unpopulated_submodule(const struct index_state *istate,
>
> prefixlen = strlen(prefix);
>
> + ensure_full_index(istate);
> +
> for (i = 0; i < istate->cache_nr; i++) {
> struct cache_entry *ce = istate->cache[i];
> int ce_len = ce_namelen(ce);
> @@ -331,11 +341,13 @@ void die_in_unpopulated_submodule(const struct index_state *istate,
> /*
> * Dies if any paths in the provided pathspec descends into a submodule
> */
> -void die_path_inside_submodule(const struct index_state *istate,
> +void die_path_inside_submodule(struct index_state *istate,
> const struct pathspec *ps)
> {
> int i, j;
>
> + ensure_full_index(istate);
> +
> for (i = 0; i < istate->cache_nr; i++) {
> struct cache_entry *ce = istate->cache[i];
> int ce_len = ce_namelen(ce);
> @@ -1420,6 +1432,8 @@ static int get_next_submodule(struct child_process *cp,
> {
> struct submodule_parallel_fetch *spf = data;
>
> + ensure_full_index(spf->r->index);
> +
> for (; spf->count < spf->r->index->cache_nr; spf->count++) {
> const struct cache_entry *ce = spf->r->index->cache[spf->count];
> const char *default_argv;
> diff --git a/submodule.h b/submodule.h
> index 4ac6e31cf1f..84640c49c11 100644
> --- a/submodule.h
> +++ b/submodule.h
> @@ -39,7 +39,7 @@ struct submodule_update_strategy {
> };
> #define SUBMODULE_UPDATE_STRATEGY_INIT {SM_UPDATE_UNSPECIFIED, NULL}
>
> -int is_gitmodules_unmerged(const struct index_state *istate);
> +int is_gitmodules_unmerged(struct index_state *istate);
> int is_writing_gitmodules_ok(void);
> int is_staging_gitmodules_ok(struct index_state *istate);
> int update_path_in_gitmodules(const char *oldpath, const char *newpath);
> @@ -60,9 +60,9 @@ int is_submodule_active(struct repository *repo, const char *path);
> * Otherwise the return error code is the same as of resolve_gitdir_gently.
> */
> int is_submodule_populated_gently(const char *path, int *return_error_code);
> -void die_in_unpopulated_submodule(const struct index_state *istate,
> +void die_in_unpopulated_submodule(struct index_state *istate,
> const char *prefix);
> -void die_path_inside_submodule(const struct index_state *istate,
> +void die_path_inside_submodule(struct index_state *istate,
> const struct pathspec *ps);
> enum submodule_update_type parse_submodule_update_type(const char *value);
> int parse_submodule_update_strategy(const char *value,
> diff --git a/tree.c b/tree.c
> index e76517f6b18..60f575440c8 100644
> --- a/tree.c
> +++ b/tree.c
> @@ -170,6 +170,8 @@ int read_tree(struct repository *r, struct tree *tree, int stage,
> * to matter.
> */
>
> + ensure_full_index(istate);
> +
> /*
> * See if we have cache entry at the stage. If so,
> * do it the original slow way, otherwise, append and then
> diff --git a/wt-status.c b/wt-status.c
> index 7074bbdd53c..5366d336938 100644
> --- a/wt-status.c
> +++ b/wt-status.c
> @@ -509,6 +509,8 @@ static int unmerged_mask(struct index_state *istate, const char *path)
> int pos, mask;
> const struct cache_entry *ce;
>
> + ensure_full_index(istate);
> +
> pos = index_name_pos(istate, path, strlen(path));
> if (0 <= pos)
> return 0;
> @@ -657,6 +659,8 @@ static void wt_status_collect_changes_initial(struct wt_status *s)
> struct index_state *istate = s->repo->index;
> int i;
>
> + ensure_full_index(istate);
> +
> for (i = 0; i < istate->cache_nr; i++) {
> struct string_list_item *it;
> struct wt_status_change_data *d;
> @@ -2295,6 +2299,9 @@ static void wt_porcelain_v2_print_unmerged_entry(
> */
> memset(stages, 0, sizeof(stages));
> sum = 0;
> +
> + ensure_full_index(istate);
> +
> pos = index_name_pos(istate, it->string, strlen(it->string));
> assert(pos < 0);
> pos = -pos-1;
> --
> gitgitgadget
>
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 2/1/2021 3:22 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This giant patch is not intended for actual review. I have a branch that
>> has these changes split out in a sane way with some commentary in each
>> file that is modified.
>>
>> The idea here is to guard certain portions of the codebase that do not
>> know how to handle sparse indexes by ensuring that the index is expanded
>> to a full index before proceeding with the logic.
>>
>> This also provides a good mechanism for testing which code needs
>> updating to enable the sparse index in a Git builtin. The builtin can
>> set the_repository->settings.command_requires_full_index to zero and
>> then we can debug the command with a breakpoint on ensure_full_index().
>> That identifies the portion of code that needs adjusting before enabling
>> sparse indexes for that command.
>>
>> Some index operations must be changed to operate on a non-const pointer,
>> since ensuring a full index will modify the index itself.
>>
>> There are likely some gaps to these protections, which is why it will be
>> important to carefully test each scenario as we relax the requirements.
>> I expect that to be a long effort.
>
> I think the idea makes sense; it provides a way for us to
> incrementally build support for this new feature.
>
> I skimmed over the code and noticed various interesting places that
> had the ensure_full_index() call (e.g.
> read_skip_worktree_file_from_index() -- whose existence comes from
> sparsity; what irony...). Better breakouts would be great, so I'll
> defer commenting much until then. But, just to verify I'm
> understanding: the primary defence is the command_requires_full_index
> setting, and you have added several ensure_full_index() calls
> throughout the code in places you believe would need to be fixed up in
> case someone switches the command_requires_full_index setting. Is
> that correct? And your comment on the gaps is just that there may be
> other places that are missing the secondary protection (as opposed to
> my first reading of that paragraph as suggesting we aren't sure if we
> have enough protections yet and need to add more before this moves out
> of RFC); is that right?
Yes, the idea is that we can incrementally enable
command_requires_full_index for some builtins and be confident that
corner cases will be protected by ensure_full_index(). Further, we
can test whether ensure_full_index() was called using test_region
in test scripts to demonstrate that a command is truly "sparse aware"
or if it is converting to full and back to sparse.
There is also the case that when we write the index into a sparse
format, the in-memory structure is modified. If the index is re-used
afterwards, then we must expand to full again for these code paths.
unpack_trees() already has one of these calls because it was necessary
for the sparse-index write to work.
The ensure_full_index() pattern also works when updating a builtin to
work with the sparse-index because of the breakpoint trick.
When I submit this as a full series, this patch will be one full
patch series submission with careful comments about why each of these
is added on a file-by-file basis.
Thanks,
-Stolee
| @@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate, | |||
| char *seen) | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As a first step to integrate 'git status' and 'git add' with the sparse
> index, we must start integrating unpack_trees() with sparse directory
> entries. These changes are currently impossible to trigger because
> unpack_trees() calls ensure_full_index() if command_requires_full_index
> is true. This is the case for all commands at the moment. As we expand
> more commands to be sparse-aware, we might find that more changes are
> required to unpack_trees(). The current changes will suffice for
> 'status' and 'add'.
>
> unpack_trees() calls the traverse_trees() API using unpack_callback()
> to decide if we should recurse into a subtree. We must add new abilities
> to skip a subtree if it corresponds to a sparse directory entry.
Makes sense.
> It is important to be careful about the trailing directory separator
> that exists in the sparse directory entries but not in the subtree
> paths.
The comment makes me wonder if leaving the trailing directory
separator out would be better, as it'd allow direct comparisons. Of
course, you have a better idea of what is easier or harder based on
this decision. Is there any chance you have a quick list of the
places that the code was simplified by this decision and a list of
places like this one that were made slightly harder?
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> dir.h | 2 +-
> preload-index.c | 2 ++
> read-cache.c | 3 +++
> unpack-trees.c | 24 ++++++++++++++++++++++--
> 4 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/dir.h b/dir.h
> index facfae47402..300305ec335 100644
> --- a/dir.h
> +++ b/dir.h
> @@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate,
> char *seen)
> {
> return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen,
> - S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
> + S_ISSPARSEDIR(ce) || S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
I think this hunk becomes unnecessary if you use ce_mode = 040000 for
sparse directory entries.
> }
>
> static inline int dir_path_match(const struct index_state *istate,
> diff --git a/preload-index.c b/preload-index.c
> index ed6eaa47388..323fc8c5100 100644
> --- a/preload-index.c
> +++ b/preload-index.c
> @@ -54,6 +54,8 @@ static void *preload_thread(void *_data)
> continue;
> if (S_ISGITLINK(ce->ce_mode))
> continue;
> + if (S_ISSPARSEDIR(ce))
> + continue;
> if (ce_uptodate(ce))
> continue;
> if (ce_skip_worktree(ce))
> diff --git a/read-cache.c b/read-cache.c
> index 65679d70d7c..ab0c2b86ec0 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1572,6 +1572,9 @@ int refresh_index(struct index_state *istate, unsigned int flags,
> if (ignore_submodules && S_ISGITLINK(ce->ce_mode))
> continue;
>
> + if (istate->sparse_index && S_ISSPARSEDIR(ce))
> + continue;
> +
> if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
> filtered = 1;
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index b324eec2a5d..90644856a80 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -583,6 +583,13 @@ static void mark_ce_used(struct cache_entry *ce, struct unpack_trees_options *o)
> {
> ce->ce_flags |= CE_UNPACKED;
>
> + /*
> + * If this is a sparse directory, don't advance cache_bottom.
> + * That will be advanced later using the cache-tree data.
> + */
> + if (S_ISSPARSEDIR(ce))
> + return;
I don't grok the cache_bottom stuff -- in general, nothing specific
about your patch. But since I don't grok that stuff, it means I don't
understand how your comment here relates; you may want to ping another
reviewer about this portion of the patch.
> +
> if (o->cache_bottom < o->src_index->cache_nr &&
> o->src_index->cache[o->cache_bottom] == ce) {
> int bottom = o->cache_bottom;
> @@ -980,6 +987,9 @@ static int do_compare_entry(const struct cache_entry *ce,
> ce_len -= pathlen;
> ce_name = ce->name + pathlen;
>
> + /* remove directory separator if a sparse directory entry */
> + if (S_ISSPARSEDIR(ce))
> + ce_len--;
Here's where your comment about trailing separator comes in; makes sense.
> return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
> }
>
> @@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
> if (cmp)
> return cmp;
>
> + /* If ce is a sparse directory, then allow equality here. */
> + if (S_ISSPARSEDIR(ce))
> + return 0;
> +
This seems surprising to me. Is there a chance you are comparing
sparse directory A with sparse directory B and you return with
equality? Or sparse_directory A with regular file B? Do the callers
still do the right thing? If your code change here is right, it seems
like it deserves an extra comment either in the code or the commit
message.
> /*
> * Even if the beginning compared identically, the ce should
> * compare as bigger than a directory leading up to it!
> @@ -1239,6 +1253,7 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
> struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
> struct unpack_trees_options *o = info->data;
> const struct name_entry *p = names;
> + unsigned recurse = 1;
>
> /* Find first entry with a real name (we could use "mask" too) */
> while (!p->mode)
> @@ -1280,12 +1295,16 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
> }
> }
> src[0] = ce;
> +
> + if (S_ISSPARSEDIR(ce))
> + recurse = 0;
> }
> break;
> }
> }
>
> - if (unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
> + if (recurse &&
> + unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
> return -1;
>
> if (o->merge && src[0]) {
> @@ -1315,7 +1334,8 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
> }
> }
>
> - if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
> + if (recurse &&
> + traverse_trees_recursive(n, dirmask, mask & ~dirmask,
> names, info) < 0)
> return -1;
> return mask;
The unpack_callback() code has some comparison to a cache-tree, but
I'd assume that you'd need to update cache-tree.c somewhat to take
advantage of these sparse directory entries. Am I wrong, and you just
get cache-tree.c working with sparse directory entries for free? Or
is this something coming in a later patch?
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 2/1/2021 3:50 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> It is important to be careful about the trailing directory separator
>> that exists in the sparse directory entries but not in the subtree
>> paths.
>
> The comment makes me wonder if leaving the trailing directory
> separator out would be better, as it'd allow direct comparisons. Of
> course, you have a better idea of what is easier or harder based on
> this decision. Is there any chance you have a quick list of the
> places that the code was simplified by this decision and a list of
> places like this one that were made slightly harder?
I'm going through all of your comments and making notes about areas
to fix and clean up before starting a new series for full review.
This question of the trailing slash is important, and I will take
particular care about answering it as I rework the series. However,
the questions in this patch poke at the right places...
>> + /* remove directory separator if a sparse directory entry */
>> + if (S_ISSPARSEDIR(ce))
>> + ce_len--;
>
> Here's where your comment about trailing separator comes in; makes sense.
>
>> return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
>> }
>>
>> @@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
>> if (cmp)
>> return cmp;
>>
>> + /* If ce is a sparse directory, then allow equality here. */
>> + if (S_ISSPARSEDIR(ce))
>> + return 0;
>> +
>
> This seems surprising to me. Is there a chance you are comparing
> sparse directory A with sparse directory B and you return with
> equality? Or sparse_directory A with regular file B? Do the callers
> still do the right thing? If your code change here is right, it seems
> like it deserves an extra comment either in the code or the commit
> message.
Sometimes a caller is asking for the first index entry corresponding
to a directory. In these cases, the input could be "A/B/C/". We want
to ensure that a sparse directory entry corresponding exactly to that
directory is correctly matched. If we place "A/B/C" in the index instead,
this search becomes more complicated (I think; I will justify this more
after thinking about it).
At this point in time, we are just saying "We found the entry with
equal path value!" and not failing with the check in the rest of the
method:
/*
* Even if the beginning compared identically, the ce should
* compare as bigger than a directory leading up to it!
*/
return ce_namelen(ce) > traverse_path_len(info, tree_entry_len(n));
>> - if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
>> + if (recurse &&
>> + traverse_trees_recursive(n, dirmask, mask & ~dirmask,
>> names, info) < 0)
>> return -1;
>> return mask;
>
> The unpack_callback() code has some comparison to a cache-tree, but
> I'd assume that you'd need to update cache-tree.c somewhat to take
> advantage of these sparse directory entries. Am I wrong, and you just
> get cache-tree.c working with sparse directory entries for free? Or
> is this something coming in a later patch?
In the RFC, I integrate the cache-tree with the sparse-index at the
very end. I will move that integration to be much earlier in the next
submission, so it becomes part of the format discussion.
Thanks,
-Stolee
| @@ -892,13 +892,15 @@ void add_pattern(const char *string, const char *base, | |||
| add_pattern_to_hashsets(pl, pattern); | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When we have sparse directory entries in the index, we want to compare
> that directory against sparse-checkout patterns. Those pattern matching
> algorithms are built expecting a file path, not a directory path. This
> is especially important in the "cone mode" patterns which will match
> files that exist within the "parent directories" as well as the
> recursive directory matches.
>
> If path_matches_pattern_list() is given a directory, we can add a bogus
> filename ("-") to the directory and get the same results as before,
> assuming we are in cone mode. Since sparse index requires cone mode
> patterns, this is an acceptable assumption.
Why is "-" a bogus filename? Is that only on certain operating
systems, or are you just not expecting a user to name their file with
such a bad name? What if there is a file with that name in that
directory in the repository; do you need the pathname to be bogus?
What do you mean by "get the same results as before"? The first
paragraph suggests the code wouldn't handle a directory path, and that
not handling it was problematic, so it seems unlikely you want the
same results as that. But it's not clear what the "before" refers to
here.
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> dir.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/dir.c b/dir.c
> index ad6eb033cb1..c786fa98d0e 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1384,6 +1384,11 @@ enum pattern_match_result path_matches_pattern_list(
> strbuf_addch(&parent_pathname, '/');
> strbuf_add(&parent_pathname, pathname, pathlen);
>
> + /* Directory requests should be added as if they are a file */
> + if (parent_pathname.len > 1 &&
> + parent_pathname.buf[parent_pathname.len - 1] == '/')
Ah, this looks like a case where the trailing slash is helpful;
without it, you might have to feed extra data in through the call
hierarchy to signify that this is a directory entry.
> + strbuf_add(&parent_pathname, "-", 1);
> +
> if (hashmap_contains_path(&pl->recursive_hashmap,
> &parent_pathname)) {
> result = MATCHED_RECURSIVE;
hashmap_contains_path? Don't we already know (modulo special cases of
our bogus value not quite being bogus enough) that this is false since
we were adding a bogus path? How could the hashmap have a bogus value
in it? Won't this particular call fail with or without our adding "-"
to the end of the path?
After this hashmap_contains_path() call, the subsequent code looks for
the parent of the path by stripping off everything after the last
'/'...which seems like the relevant code anyway. Is the problem that
the hashmap_contains_path() call was returning true when we didn't add
"-" to the end? If so, can we use and if or a goto instead to make
the code skip this first check and move on to where we want it to go?
Or am I misunderstanding something about this code?
| @@ -8,6 +8,7 @@ | |||
| #include "cache.h" | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We need to check the file hashmap first, then look to see if the
> directory signals a non-sparse directory entry. In such a case, we can
> rely on the contents of the sparse-index.
>
> We still use ensure_full_index() in the case that we hit a path that is
> within a sparse-directory entry.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> name-hash.c | 6 ++++++
> sparse-index.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+)
>
> diff --git a/name-hash.c b/name-hash.c
> index 641f6900a7c..cb0f316f652 100644
> --- a/name-hash.c
> +++ b/name-hash.c
> @@ -110,6 +110,12 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
> if (ce->ce_flags & CE_HASHED)
> return;
> ce->ce_flags |= CE_HASHED;
> +
> + if (ce->ce_mode == CE_MODE_SPARSE_DIRECTORY) {
> + add_dir_entry(istate, ce);
> + return;
> + }
> +
> hashmap_entry_init(&ce->ent, memihash(ce->name, ce_namelen(ce)));
> hashmap_add(&istate->name_hash, &ce->ent);
>
> diff --git a/sparse-index.c b/sparse-index.c
> index dd1a06dfdd3..bf8dce9a09b 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -281,9 +281,62 @@ void ensure_full_index(struct index_state *istate)
> trace2_region_leave("index", "ensure_full_index", istate->repo);
> }
>
> +static int in_expand_to_path = 0;
> +
> void expand_to_path(struct index_state *istate,
> const char *path, size_t pathlen, int icase)
> {
> + struct strbuf path_as_dir = STRBUF_INIT;
> + int pos;
> +
> + /* prevent extra recursion */
> + if (in_expand_to_path)
> + return;
Maybe "prevent extra expand_to_path() <-> index_file_exists()
recursion", just to be extra explicit?
> +
> + if (!istate || !istate->sparse_index)
> + return;
> +
> + if (!istate->repo)
> + istate->repo = the_repository;
So, we assume the_repository if istate->repo isn't set. I guess given
the number of the_repository assumptions we have in the code, this
isn't a big deal. And instead of a
USE_THE_REPOSITORY_COMPATIBILITY_MACROS we have a
NO_THE_REPOSITORY_COMPATIBILITY_MACROS, so there's nothing to mark
this either.
> +
> + in_expand_to_path = 1;
> +
> + /*
> + * We only need to actually expand a region if the
> + * following are both true:
> + *
> + * 1. 'path' is not already in the index.
> + * 2. Some parent directory of 'path' is a sparse directory.
> + */
> +
> + strbuf_add(&path_as_dir, path, pathlen);
> + strbuf_addch(&path_as_dir, '/');
> +
> + /* in_expand_to_path prevents infinite recursion here */
> + if (index_file_exists(istate, path, pathlen, icase))
> + goto cleanup;
Shouldn't the editing of path_as_dir be done after the
index_file_exists() call? In the case that the entry already exists,
writing to path_as_dir is wasted work.
> + pos = index_name_pos(istate, path_as_dir.buf, path_as_dir.len);
> +
> + if (pos < 0)
> + pos = -pos - 1;
> +
> + /*
> + * Even if the path doesn't exist, if the value isn't exactly a
> + * sparse-directory entry, then there is no need to expand the
> + * index.
> + */
> + if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY)
> + goto cleanup;
This looked wrong to me until I tried to come up with a
counter-example. Here you are relying on the fact that before the
comment, pos is going to be the index of a sparse directory entry --
either for path_as_dir or some ancestor directory. It would be nice
if the comment mentioned that.
> +
> + trace2_region_enter("index", "expand_to_path", istate->repo);
> +
> /* for now, do the obviously-correct, slow thing */
> ensure_full_index(istate);
> +
> + trace2_region_leave("index", "expand_to_path", istate->repo);
> +
> +cleanup:
> + strbuf_release(&path_as_dir);
> + in_expand_to_path = 0;
> }
> --
> gitgitgadget
Looks good otherwise.
| @@ -491,6 +491,9 @@ int cmd_add(int argc, const char **argv, const char *prefix) | |||
| add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize; | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Replace enough callers to ensure_full_index() to instead call
> expand_to_path() to reduce how often 'git add' expands a sparse index in
> memory (before writing a sparse index again).
>
> One non-obvious case is index_name_pos_also_unmerged() which is only hit
> on the Windows platform (in my tests). Use expand_to_path() instead of
> ensure_full_index().
I read this paragraph as saying that the conversion of
index_name_pos_also_unmerged() was tricky, whereas after reading the
patch it looks like the conversion was trivial and you were perhaps
meaning to say that it is easy to miss that this function also needs a
conversion. Also, since the second sentence is true of all the
conversions, not sure how much it helps to highlight it when just
talking about this one function.
Both of these are minor quibbles, but if there's a clever way to
reword here that reduces potential confusion, that'd be great.
> Add a test to check that 'git add -A' and 'git add <file>' does not
> expand the index at all, as long as <file> is not within a sparse
> directory.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> builtin/add.c | 3 +++
> dir.c | 8 ++++----
> read-cache.c | 10 +++++-----
> sparse-index.c | 18 ++++++++++++++----
> t/t1092-sparse-checkout-compatibility.sh | 14 ++++++++++++++
> 5 files changed, 40 insertions(+), 13 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index a825887c503..b73f8d51de6 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -491,6 +491,9 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize;
> require_pathspec = !(take_worktree_changes || (0 < addremove_explicit));
>
> + prepare_repo_settings(the_repository);
> + the_repository->settings.command_requires_full_index = 0;
> +
> hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
>
> /*
> diff --git a/dir.c b/dir.c
> index c786fa98d0e..21998c7c4b7 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -18,6 +18,7 @@
> #include "ewah/ewok.h"
> #include "fsmonitor.h"
> #include "submodule-config.h"
> +#include "sparse-index.h"
>
> /*
> * Tells read_directory_recursive how a file or directory should be treated.
> @@ -899,9 +900,9 @@ static int read_skip_worktree_file_from_index(struct index_state *istate,
> {
> int pos, len;
>
> - ensure_full_index(istate);
> -
> len = strlen(path);
> +
> + expand_to_path(istate, path, len, 0);
> pos = index_name_pos(istate, path, len);
> if (pos < 0)
> return -1;
> @@ -1707,8 +1708,7 @@ static enum exist_status directory_exists_in_index(struct index_state *istate,
> if (ignore_case)
> return directory_exists_in_index_icase(istate, dirname, len);
>
> - ensure_full_index(istate);
> -
> + expand_to_path(istate, dirname, len, 0);
> pos = index_name_pos(istate, dirname, len);
> if (pos < 0)
> pos = -pos-1;
> diff --git a/read-cache.c b/read-cache.c
> index 78910d8f1b7..8c974829497 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -647,7 +647,7 @@ static int index_name_pos_also_unmerged(struct index_state *istate,
> int pos;
> struct cache_entry *ce;
>
> - ensure_full_index(istate);
> + expand_to_path(istate, path, namelen, 0);
>
> pos = index_name_pos(istate, path, namelen);
> if (pos >= 0)
> @@ -724,8 +724,6 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
> int hash_flags = HASH_WRITE_OBJECT;
> struct object_id oid;
>
> - ensure_full_index(istate);
> -
> if (flags & ADD_CACHE_RENORMALIZE)
> hash_flags |= HASH_RENORMALIZE;
>
> @@ -733,6 +731,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
> return error(_("%s: can only add regular files, symbolic links or git-directories"), path);
>
> namelen = strlen(path);
> + expand_to_path(istate, path, namelen, 0);
> +
> if (S_ISDIR(st_mode)) {
> if (resolve_gitlink_ref(path, "HEAD", &oid) < 0)
> return error(_("'%s' does not have a commit checked out"), path);
> @@ -1104,7 +1104,7 @@ static int has_dir_name(struct index_state *istate,
> size_t len_eq_last;
> int cmp_last = 0;
>
> - ensure_full_index(istate);
> + expand_to_path(istate, ce->name, ce->ce_namelen, 0);
>
> /*
> * We are frequently called during an iteration on a sorted
> @@ -1349,7 +1349,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
> {
> int pos;
>
> - ensure_full_index(istate);
> + expand_to_path(istate, ce->name, ce->ce_namelen, 0);
>
> if (option & ADD_CACHE_JUST_APPEND)
> pos = istate->cache_nr;
> diff --git a/sparse-index.c b/sparse-index.c
> index bf8dce9a09b..a201f3b905c 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -286,6 +286,7 @@ static int in_expand_to_path = 0;
> void expand_to_path(struct index_state *istate,
> const char *path, size_t pathlen, int icase)
> {
> + struct cache_entry *ce = NULL;
> struct strbuf path_as_dir = STRBUF_INIT;
> int pos;
>
> @@ -320,13 +321,22 @@ void expand_to_path(struct index_state *istate,
>
> if (pos < 0)
> pos = -pos - 1;
> + if (pos < istate->cache_nr)
> + ce = istate->cache[pos];
>
> /*
> - * Even if the path doesn't exist, if the value isn't exactly a
> - * sparse-directory entry, then there is no need to expand the
> - * index.
> + * If we didn't land on a sparse directory, then there is
> + * nothing to expand.
> */
> - if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY)
> + if (ce && !S_ISSPARSEDIR(ce))
> + goto cleanup;
Seems like these changes to expand_to_path() could and maybe should be
squashed into the commit that introduces expand_to_path()?
> + /*
> + * If that sparse directory is not a prefix of the path we
> + * are looking for, then we don't need to expand.
> + */
> + if (ce &&
> + (ce->ce_namelen >= path_as_dir.len ||
> + strncmp(ce->name, path_as_dir.buf, ce->ce_namelen)))
Should this also be squashed into the commit that introduces
expand_to_path()? Why is this check added here?
> goto cleanup;
>
> trace2_region_enter("index", "expand_to_path", istate->repo);
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 09650f0755c..ae594ab880c 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -390,6 +390,20 @@ test_expect_success 'sparse-index is expanded and converted back' '
> GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> git -C sparse-index -c core.fsmonitor="" reset --hard &&
> test_region index convert_to_sparse trace2.txt &&
> + test_region index ensure_full_index trace2.txt &&
> +
> + rm trace2.txt &&
> + echo >>sparse-index/README.md &&
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> + git -C sparse-index -c core.fsmonitor="" add -A &&
> + test_region index convert_to_sparse trace2.txt &&
> + test_region index ensure_full_index trace2.txt &&
> +
> + rm trace2.txt &&
> + echo >>sparse-index/extra.txt &&
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> + git -C sparse-index -c core.fsmonitor="" add extra.txt &&
> + test_region index convert_to_sparse trace2.txt &&
> test_region index ensure_full_index trace2.txt
> '
>
> --
> gitgitgadget
>
| @@ -18,6 +18,7 @@ | |||
| #include "ewah/ewok.h" | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The add_patterns() method has a way to extract a pattern file from the
> index. If this pattern file is sparse and within a sparse directory
> entry, then we need to expand the index before looking for that entry as
> a file path.
Why?
Correct me if I'm wrong, but I thought the point of add_patterns() was
to read .gitignore entries, so that we can know whether to e.g. have
status report untracked files within some directory or have clean
delete files within a directory. But if we have a sparse directory
entry in the index, we probably have no such directory in the working
directory. And if we have no such working directory, getting
.gitignore entries for those directories is a big waste of time.
> For now, convert ensure_full_index() into expand_to_path() to only
> expand this way when absolutely necessary.
Not only should we probably not need to read these files at all,
expand_to_path() still expands a lot more than necessary, right? If
two directories are sparse -- moduleA and moduleB, and we need
something from under moduleA/, then expand_to_path() will call
ensure_full_index() and fill out every entry under both modules, even
if moduleB is way bigger than moduleA. Unless I've misunderstood
something, there's multiple ways we're falling short of "only...when
absolutely necessary".
Perhaps both of these things are future work you already had planned;
if so, some tweaks to the commit message may help keep this reader
oriented. :-)
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> dir.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/dir.c b/dir.c
> index 21998c7c4b7..7df8d3b1da0 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1093,7 +1093,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
> int pos;
>
> if (istate)
> - ensure_full_index(istate);
> + expand_to_path(istate, fname, strlen(fname), 0);
>
> if (oid_stat->valid &&
> !match_stat_data_racy(istate, &oid_stat->stat, &st))
> --
> gitgitgadget
There's also a read_skip_worktree_file_from_index() call earlier in
the same function, which in your big RFC patch you protected with the
ensure_full_index() call already. Perhaps it should have an
expand_to_path() conversion as well? But, in the big picture, it
seems like checking if we can avoid reading in that pattern file
(whenever the directory doesn't exist within the working copy) would
be a better first step.
| @@ -20,7 +20,7 @@ | |||
| * to use find_pathspecs_matching_against_index() instead. | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The add_pathspec_matches_against_index() focuses on matching a pathspec
> to file entries in the index. It is possible that this already works
> correctly for its only use: checking if untracked files exist in the
> index.
>
> It is likely that this causes a behavior issue when adding a directory
> that exists at HEAD but is outside the sparse cone. I'm marking this as
> a place to pursue with future tests.
Sounds like you're unsure if this patch is good. Should it be marked
RFC or something?
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> pathspec.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/pathspec.c b/pathspec.c
> index 9b105855483..61dc771aa02 100644
> --- a/pathspec.c
> +++ b/pathspec.c
> @@ -36,7 +36,6 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> num_unmatched++;
> if (!num_unmatched)
> return;
> - ensure_full_index(istate);
> for (i = 0; i < istate->cache_nr; i++) {
> const struct cache_entry *ce = istate->cache[i];
> ce_path_match(istate, ce, pathspec, seen);
> --
> gitgitgadget
>
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 2/1/2021 6:24 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The add_pathspec_matches_against_index() focuses on matching a pathspec
>> to file entries in the index. It is possible that this already works
>> correctly for its only use: checking if untracked files exist in the
>> index.
>>
>> It is likely that this causes a behavior issue when adding a directory
>> that exists at HEAD but is outside the sparse cone. I'm marking this as
>> a place to pursue with future tests.
>
> Sounds like you're unsure if this patch is good. Should it be marked
> RFC or something?
...isn't the whole series marked as RFC? I only specifically marked the
ensure_full_index() one because I purposefully squashed it.
But in general, everything I'm touching in these areas seems like a
potentially problematic change. So many things are used and re-used
that I'm not sure what is safe or not. More testing is required for
commands to ensure their behavior.
I can enable things like GIT_TEST_SPARSE_INDEX to get other tests
using sparse-checkout to work (but only if they use cone mode). Hence,
I'm relying on what tests I can write to cover the behavior instead of
a robust history of valuable tests.
I hope to gather more confidence as this goes forward. I definitely
work to be confident that I am not making any errors that cause
problems for users who do not enable the sparse-index, but I expect
there to be a long tail of adjustments required as more corner
cases are discovered during the development and testing of the
feature.
The good news is that I have some engaged users who are willing to
test the feature and provide feedback if they hit any snags once
there is a minimal set of functionality.
Thanks,
-Stolee
| @@ -6,6 +6,7 @@ | |||
| #include "object-store.h" | |||
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The cache-tree extension was previously disabled with sparse indexes.
> However, the cache-tree is an important performance feature for commands
> like 'git status' and 'git add'. Integrate it with sparse directory
> entries.
>
> When writing a sparse index, completely clear and recalculate the cache
> tree. By starting from scratch, the only integration necessary is to
> check if we hit a sparse directory entry and create a leaf of the
> cache-tree that has an entry_count of one and no subtrees.
>
> Once the cache-tree exists within a sparse index, we finally get
> improved performance. I test the sparse index performance using a
> private monorepo with over 2.1 million files at HEAD, but with a
> sparse-checkout definition that has only 68,000 paths in the populated
> cone. The sparse index has about 2,000 sparse directory entries. I
> compare three scenarios:
How many .gitignore entries does this monorepo have? What percentage
of those are populated for #2 and #3?
Can a testcase be devised that others can repeat? For example, I
created https://github.com/newren/gvfs-like-git-bomb once upon a time
to create a very small repository with a very large index and a lot of
skip-worktree entries, mostly to test some stuff that someone (Ben
Peart?) mentioned as being slow and verify for myself that it wasn't
Windows specific.
> 1. Use the full index. The index size is ~186 MB.
> 2. Use the sparse index. The index size is ~5.5 MB.
> 3. Use a commit where HEAD matches the populated set. The full index
> size is ~5.3MB.
I'm not sure I'm understanding the difference between #2 and #3, other
than #3 is smaller. How did you form #2? Also, what do you mean by
"full index size" for #3, when it's the smallest? Isn't that index
the most sparse (or least full)? Or is it an index for a different
commit entirely that has far fewer files in it?
> The third benchmark is included as a theoretical optimium for a
s/optimium/optimum/
> repository of the same object database.
This I'm also not understanding, but maybe this goes back to not
understanding the difference in how #2 and #3 are constructed.
> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status)
> Time (mean ± σ): 3.167 s ± 0.036 s [User: 2.006 s, System: 1.078 s]
> Range (min … max): 3.100 s … 3.208 s 10 runs
>
> Benchmark #2: sparse index (git status)
> Time (mean ± σ): 239.5 ms ± 8.1 ms [User: 189.4 ms, System: 226.8 ms]
> Range (min … max): 226.0 ms … 251.9 ms 13 runs
>
> Benchmark #3: small tree (git status)
> Time (mean ± σ): 195.3 ms ± 4.5 ms [User: 116.5 ms, System: 84.4 ms]
> Range (min … max): 188.8 ms … 202.8 ms 15 runs
Always nice to see a speedup factor greater than 10. :-)
>
> The optimimum is still 45ms faster. This is due in part to the 2,000+
s/optimimum/optimum/
> sparse directory entries, but there might be other optimizations to make
> in the sparse-index case. In particular, I find that this performance
> difference disappears when I disable FS Monitor, which is somewhat
> disabled in the sparse-index case, but might still be adding overhead.
The FS monitor wording is unclear to me; it feels like multiple negatives.
> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .)
> Time (mean ± σ): 3.076 s ± 0.022 s [User: 2.065 s, System: 0.943 s]
> Range (min … max): 3.044 s … 3.116 s 10 runs
>
> Benchmark #2: sparse index (git add .)
> Time (mean ± σ): 218.0 ms ± 6.6 ms [User: 195.7 ms, System: 206.6 ms]
> Range (min … max): 209.8 ms … 228.2 ms 13 runs
>
> Benchmark #3: small tree (git add .)
> Time (mean ± σ): 217.6 ms ± 5.4 ms [User: 131.9 ms, System: 86.7 ms]
> Range (min … max): 212.1 ms … 228.4 ms 14 runs
>
> In this test, I also used "echo >>README.md" to append a line to the
> README.md file, so the 'git add .' command is doing _something_ other
> than a no-op. Without this edit (and FS Monitor enabled) the small
> tree case again gains about 30ms on the sparse index case.
Meaning the small tree is 30 ms faster than reported here, or 30 ms
slower, or that both sparse index and small tree are faster but the
small tree decreases its time more than the sparse index one does?
Sorry, I don't mean to be dense, I'm just struggling with
understanding words today it seems. (Also, it seems like there's a
joke in there about me being "dense" in a review of a "sparse"
feature...but I'm not quite coming up with it.)
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> cache-tree.c | 18 ++++++++++++++++++
> sparse-index.c | 10 +++++++++-
> 2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 5f07a39e501..9da6a4394e0 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
>
> *skip_count = 0;
>
> + /*
> + * If the first entry of this region is a sparse directory
> + * entry corresponding exactly to 'base', then this cache_tree
> + * struct is a "leaf" in the data structure, pointing to the
> + * tree OID specified in the entry.
> + */
> + if (entries > 0) {
> + const struct cache_entry *ce = cache[0];
> +
> + if (S_ISSPARSEDIR(ce) &&
> + ce->ce_namelen == baselen &&
> + !strncmp(ce->name, base, baselen)) {
> + it->entry_count = 1;
> + oidcpy(&it->oid, &ce->oid);
> + return 1;
> + }
> + }
> +
> if (0 <= it->entry_count && has_object_file(&it->oid))
> return it->entry_count;
>
> diff --git a/sparse-index.c b/sparse-index.c
> index a201f3b905c..9ea3b321400 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -181,7 +181,11 @@ int convert_to_sparse(struct index_state *istate)
> istate->cache_nr = convert_to_sparse_rec(istate,
> 0, 0, istate->cache_nr,
> "", 0, istate->cache_tree);
> - istate->drop_cache_tree = 1;
> +
> + /* Clear and recompute the cache-tree */
> + cache_tree_free(&istate->cache_tree);
> + cache_tree_update(istate, 0);
> +
> istate->sparse_index = 1;
> trace2_region_leave("index", "convert_to_sparse", istate->repo);
> return 0;
> @@ -278,6 +282,10 @@ void ensure_full_index(struct index_state *istate)
>
> free(full);
>
> + /* Clear and recompute the cache-tree */
> + cache_tree_free(&istate->cache_tree);
> + cache_tree_update(istate, 0);
> +
> trace2_region_leave("index", "ensure_full_index", istate->repo);
> }
>
> --
> gitgitgadget
This is very exciting work!!
There was a problem hiding this comment.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 2/1/2021 6:54 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> In this test, I also used "echo >>README.md" to append a line to the
>> README.md file, so the 'git add .' command is doing _something_ other
>> than a no-op. Without this edit (and FS Monitor enabled) the small
>> tree case again gains about 30ms on the sparse index case.
>
> Meaning the small tree is 30 ms faster than reported here, or 30 ms
> slower, or that both sparse index and small tree are faster but the
> small tree decreases its time more than the sparse index one does?
>
> Sorry, I don't mean to be dense, I'm just struggling with
> understanding words today it seems. (Also, it seems like there's a
> joke in there about me being "dense" in a review of a "sparse"
> feature...but I'm not quite coming up with it.)
I don't blame you! This is a lot to digest, and I appreciate you
pushing through to the end of it.
Clearly, I was getting a bit inexact near the end. My excitement
to share this RFC clearly overshadowed my attention to grammatical
detail. I'll go through your feedback more carefully soon and
hopefully clarify these and many other questions.
Thanks,
-Stolee
There was a problem hiding this comment.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Mon, Feb 1, 2021 at 6:41 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/1/2021 6:54 PM, Elijah Newren wrote:
> > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> In this test, I also used "echo >>README.md" to append a line to the
> >> README.md file, so the 'git add .' command is doing _something_ other
> >> than a no-op. Without this edit (and FS Monitor enabled) the small
> >> tree case again gains about 30ms on the sparse index case.
> >
> > Meaning the small tree is 30 ms faster than reported here, or 30 ms
> > slower, or that both sparse index and small tree are faster but the
> > small tree decreases its time more than the sparse index one does?
> >
> > Sorry, I don't mean to be dense, I'm just struggling with
> > understanding words today it seems. (Also, it seems like there's a
> > joke in there about me being "dense" in a review of a "sparse"
> > feature...but I'm not quite coming up with it.)
>
> I don't blame you! This is a lot to digest, and I appreciate you
> pushing through to the end of it.
>
> Clearly, I was getting a bit inexact near the end. My excitement
> to share this RFC clearly overshadowed my attention to grammatical
> detail. I'll go through your feedback more carefully soon and
> hopefully clarify these and many other questions.
I can't blame you for being excited; this series is awesome. I've
thought we should do something along this direction for years[1].
Sure I found lots of little nitpicks here and there in your series,
but that's just attempting to help find any issues so it can be made
even better; overall I'm super excited about it.
[1] See "crazy idea" paragraph at
https://lore.kernel.org/git/CABPp-BGir_5xyqEfwytDog0rZDydPHXjuqXCpNKk67dVPXjUjA@mail.gmail.com/
|
On the Git mailing list, Elijah Newren wrote (reply to this): |
|
Closing in favor of the actual submissions, starting in #883. |
This is based on ds/more-index-cleanups also available as GitGitGadget PR #839.
The sparse checkout feature allows users to specify a "populated set" that is smaller than the full list of files at HEAD. Files outside the sparse checkout definition are not present in the working directory, but are still present in the index (and marked with the CE_SKIP_WORKTREE bit).
This means that the working directory has size O(populated), and commands like "git status" or "git checkout" operate using an O(populated) number of filesystem operations. However, parsing the index still operates on the scale of O(HEAD).
This can be particularly jarring if you are merging a small repository with a large monorepo for the purpose of simplifying dependency management. Even if users have nothing more in their working directory than they had before, they suddenly see a significant increase in their "git status" or "git add" times. In these cases, simply parsing the index can be a huge portion of the command time.
This RFC proposes an update to the index formats to allow "sparse directory entries". These entries correspond to directories that are completely excluded from the sparse checkout definition. We can detect that a directory is excluded when using "cone mode" patterns.
Since having directory entries is a radical departure from the existing index format, a new extension "extensions.sparseIndex" is added. Using a sparse index should cause incompatible tools to fail because they do not understand this extension.
The index is a critical data structure, so making such a drastic change must be handled carefully. This RFC does only enough adjustments to demonstrate performance improvements for "git status" and "git add." Other commands should operate identically to before, since the other commands will expand a sparse index into a full index by parsing trees.
WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout tests. I'm not sure what is causing these failures, but I will explore while we discuss the possibility of the feature as a whole.
Here is an outline for this RFC:
Patches 1-14: create and test the sparse index format. This is just enough to start writing the format, but all Git commands become slower for using it. This is because everything is guarded to expand to a full index before actually operating on the cache entries.
Patch 15: This massive patch is actually a bunch of patches squashed together. I have a branch that adds "ensure_full_index()" guards in each necessary file along with some commentary about how the index is being used. This patch is presented here as one big dump because that commentary isn't particularly interesting if the RFC leads to a very different approach.
Patches 16-27: These changes make enough code "sparse aware" such that "git status" and "git add" start operating in time O(populated) instead of O(HEAD).
Performance numbers are given in patch 27, but repeated somewhat here. The test environment I use has ~2.1 million paths at HEAD, but only 68,000 populated paths given the sparse-checkout I'm using. The sparse index has about 2,000 sparse directory entries.
size is ~5.3MB.
The third benchmark is included as a theoretical optimum for a repository of the same object database.
First, a clean 'git status' improves from 3.1s to 240ms.
Benchmark #1: full index (git status)
Time (mean ± σ): 3.167 s ± 0.036 s [User: 2.006 s, System: 1.078 s]
Range (min … max): 3.100 s … 3.208 s 10 runs
Benchmark #2: sparse index (git status)
Time (mean ± σ): 239.5 ms ± 8.1 ms [User: 189.4 ms, System: 226.8 ms]
Range (min … max): 226.0 ms … 251.9 ms 13 runs
Benchmark #3: small tree (git status)
Time (mean ± σ): 195.3 ms ± 4.5 ms [User: 116.5 ms, System: 84.4 ms]
Range (min … max): 188.8 ms … 202.8 ms 15 runs
The performance numbers for 'git add .' are much closer to optimal:
Benchmark #1: full index (git add .)
Time (mean ± σ): 3.076 s ± 0.022 s [User: 2.065 s, System: 0.943 s]
Range (min … max): 3.044 s … 3.116 s 10 runs
Benchmark #2: sparse index (git add .)
Time (mean ± σ): 218.0 ms ± 6.6 ms [User: 195.7 ms, System: 206.6 ms]
Range (min … max): 209.8 ms … 228.2 ms 13 runs
Benchmark #3: small tree (git add .)
Time (mean ± σ): 217.6 ms ± 5.4 ms [User: 131.9 ms, System: 86.7 ms]
Range (min … max): 212.1 ms … 228.4 ms 14 runs
I expect that making a sparse index work optimally through the most common Git commands will take a year of effort. During this process, I expect to add a lot of testing infrastructure around the sparse-checkout feature, especially in corner cases. (This RFC focuses on the happy paths of operating only within the sparse cone, but that will change in the future.)
If this general approach is acceptable, then I would follow it with a sequence of patch submissions that follow this approach:
After those three items that are represented in this RFC, the work starts to parallelize a bit. My basic ideas for moving forward from this point are to do these basic steps:
Here are some specific questions I'm hoping to answer in this RFC period:
git sparse-checkout init --cone --sparse-indexan appropriate way to toggle the format?Thanks,
-Stolee
Cc: gitster@pobox.com
Cc: newren@gmail.com
Cc: peff@peff.net
Cc: jrnieder@gmail.com
Cc: sunshine@sunshineco.com
Cc: pclouds@gmail.com
cc: Derrick Stolee stolee@gmail.com