
hasher/operations: fix in-flight hash caching, fingerprinting, and modtime-fallback for hash-less remotes#9500

Draft
davispw wants to merge 15 commits into
rclone:masterfrom
davispw:fix-equal-hash-modtime-unsupported


@davispw davispw commented Jun 7, 2026


Summary

Four related fixes that together make the hasher backend overlay work reliably with Google Photos, and improve correctness for any remote that lacks native hash support, reliable modtimes, or reliable sizes.

  1. equal() modtime fallback: equal() now falls back to comparing hashes when modtime is not supported but a common hash type exists.
  2. In-flight update hashing: hasher now computes and caches hashes in-flight during file updates for remotes without native hashes.
  3. Local size fingerprinting: hasher stores the computed hash under a fingerprint built from the local source file size rather than the remote-reported size. This prevents the re-upload loop caused by async batch uploads reporting size 0, and the cache misses caused by listings reporting size -1.
  4. Google Photos Async Trash Fix: Simplifies the duplicate check in Google Photos Update to execute the trash workaround immediately during async uploads, preventing duplicates in the album.

Previously required (all of the following, and still had edge cases)

--checksum                     # force hash-based equality check (no modtime)
--ignore-checksum              # ignore remote hash (Google Photos changes it)
--gphotos-read-size            # fetch actual file size via HEAD (otherwise -1)
--gphotos-batch-mode=sync      # async caused a nil-pointer panic and duplicate uploads

Even with all of the above, overwriting a file caused the hasher cache entry to be pruned without replacement, so the next sync always re-downloaded the full file to re-verify the hash.

Now required

--ignore-checksum              # still needed (Google Photos may change hash)
--ignore-size                  # REQUIRED (tells hasher to ignore size on fingerprint matching)

--checksum is no longer needed (Part 1 below handles this automatically).
--gphotos-read-size is no longer needed when using --ignore-size (Part 3 below handles this automatically), which avoids the extra HEAD requests and the API quota they consume.
--gphotos-batch-mode=async now works without panicking or creating duplicates in albums (fixed in Part 4 below).


Part 1: equal() no longer silently skips when modtime is unsupported

Problem

When modifyWindow == fs.ModTimeNotSupported and file sizes match, equal() in fs/operations/operations.go returned true immediately — assuming the files were identical. A file whose content changed but size didn't would be silently skipped during a sync. This made --checksum necessary to force a hash comparison.

Fix

if modifyWindow == fs.ModTimeNotSupported {
    common := src.Fs().Hashes().Overlap(dst.Fs().Hashes())
    if common.Count() == 0 {
        return true  // no common hash type — can't do better
    }
    // Fall through to CheckHashes below
}

When a common hash type exists (e.g. MD5 via the hasher overlay), we fall through to compare hashes. For backends with no common hash, the previous behaviour is preserved. --checksum is no longer needed.
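The decision can be modelled outside rclone as a small sketch. The object type and the commonHash and equalNoModTime helpers below are hypothetical stand-ins written for illustration; they are not rclone's fs.Object or hash.Set APIs, and only the branch structure mirrors the fix above.

```go
package main

import "fmt"

// object is a hypothetical stand-in for an rclone object: a size plus
// whatever hashes its backend can supply, keyed by hash type name.
type object struct {
	size   int64
	hashes map[string]string
}

// commonHash returns a hash type that both objects can supply, if any.
func commonHash(src, dst object) (string, bool) {
	for name := range src.hashes {
		if _, ok := dst.hashes[name]; ok {
			return name, true
		}
	}
	return "", false
}

// equalNoModTime models the decision when modtime is unsupported:
// sizes must match, then hashes are compared if a common type exists.
func equalNoModTime(src, dst object) bool {
	if src.size != dst.size {
		return false
	}
	name, ok := commonHash(src, dst)
	if !ok {
		return true // no common hash type: size is all we have
	}
	return src.hashes[name] == dst.hashes[name]
}

func main() {
	a := object{size: 5, hashes: map[string]string{"md5": "aaa"}}
	b := object{size: 5, hashes: map[string]string{"md5": "bbb"}}
	c := object{size: 5} // backend with no hashes at all

	fmt.Println(equalNoModTime(a, a)) // true: same size, same hash
	fmt.Println(equalNoModTime(a, b)) // false: same size, different hash
	fmt.Println(equalNoModTime(a, c)) // true: no common hash, size-only fallback
}
```

The second case is the one that was previously reported as equal and skipped.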


Part 2: Update now caches its hash in-flight

Problem

When hasher wraps a hash-less remote (e.g. Google Photos), Put already computed the hash in-flight via a hashingReader and cached it immediately. But Update (overwrite) only pruned the existing cache entry without computing a replacement. The next sync found no cached hash and had to re-download the entire file to verify it.

Fix

In backend/hasher/object.go, Update now:

  1. Checks if the underlying remote has no native hashes (o.Object.Fs().Hashes().IsEmpty()).
  2. If so, wraps the reader in a hashingReader to compute the hash in-flight.
  3. On successful upload, stores the result via putHashes(ctx, hashes, src.Size()).

For remotes that do support hashes natively (S3, B2, etc.), the existing behaviour (prune + let remote verification handle it) is unchanged.


Part 3: Hash stored under local source size, not remote-reported size

Problem (Primary Motivation: Async Upload Loops)

The hasher cache is keyed on an object's fingerprint: "size,modtime,hash".

  1. The Async Batch 0 Size Loop (Primary Issue):
    In batch_mode = async, the upload returns immediately before the item is fully processed and committed on Google's servers. At this point, the returned size is 0. The hasher cache would record the computed hash under fingerprint 0,-,-. On the next sync, the remote file is listed with its real size (e.g., 12345). Because the fingerprint 12345,-,- has no cached hash (it was stored under 0,-,-), this causes a cache miss. The mismatch triggers a new upload, leading to an infinite upload loop.

  2. The -1 Listing Mismatch:
    Without batch_mode = async, Google Photos returns a size of -1 unless --gphotos-read-size is explicitly used. Storing the hash under -1,-,- and later listing the file with its real size (12345) similarly caused cache misses.
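Both failure modes reduce to the same key mismatch, sketched below with a plain map standing in for the hasher database. The fingerprint helper is hypothetical and only reproduces the "size,modtime,hash" layout described above, with "-" for components the remote cannot supply.

```go
package main

import "fmt"

// fingerprint builds a "size,modtime,hash" key for a remote that has
// neither modtime nor a fast hash, so both are rendered as "-".
func fingerprint(size int64) string {
	return fmt.Sprintf("%d,-,-", size)
}

func main() {
	cache := map[string]string{}

	// The async upload returns before processing finishes: size is 0.
	cache[fingerprint(0)] = "md5-of-uploaded-file"

	// The next listing reports the real size, so the key differs.
	_, hit := cache[fingerprint(12345)]
	fmt.Println(hit) // false: cache miss, file is treated as changed
}
```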

Fix

putHashes accepts an optional size override, referred to here as localSize; in the code signature it is a variadic parameter named expectedSize. When called from Update and Put, putHashes now receives src.Size() (the local file size) and uses it as the fingerprint key. A new fingerprintWithSize(ctx, size) helper separates this from the standard fingerprint(ctx) path used for lookups.

Additionally, hasher now implements ignore-size fingerprint matching in backend/hasher/kv.go when the global IgnoreSize flag is enabled. When ci.IgnoreSize is active, the size component of the fingerprint is ignored during database lookup, and a record matches if its modtime and fast-hash components match (which, for Google Photos, are both stable as -).

This means:

  • The hash is stored under the stable local source size 12345,-,-.
  • When syncing with --ignore-size, hasher ignores the size component, so a lookup with fingerprint -1,-,- (or any other size) still matches the stored record and produces a cache hit.
  • This removes the need for --gphotos-read-size and the extra HEAD requests it issues.
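The matching rule can be sketched as follows. The record type and the sameIgnoringSize and lookup helpers are hypothetical and do not reflect the actual code in backend/hasher/kv.go; the sketch only shows the comparison of the modtime and hash components when size is ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a hypothetical cache entry keyed by "size,modtime,hash".
type record struct {
	fingerprint string
	md5         string
}

// sameIgnoringSize reports whether two fingerprints agree on everything
// after the first comma, that is, on the modtime and hash components.
func sameIgnoringSize(a, b string) bool {
	_, restA, okA := strings.Cut(a, ",")
	_, restB, okB := strings.Cut(b, ",")
	return okA && okB && restA == restB
}

// lookup returns the cached hash if the fingerprint matches exactly,
// or matches apart from size when ignoreSize is set.
func lookup(rec record, want string, ignoreSize bool) (string, bool) {
	if rec.fingerprint == want || (ignoreSize && sameIgnoringSize(rec.fingerprint, want)) {
		return rec.md5, true
	}
	return "", false
}

func main() {
	rec := record{fingerprint: "12345,-,-", md5: "abc"} // stored under local size

	_, hit := lookup(rec, "-1,-,-", false)
	fmt.Println(hit) // false: sizes differ and size is significant

	_, hit = lookup(rec, "-1,-,-", true)
	fmt.Println(hit) // true: size ignored, remaining components match
}
```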

Impact on other backends

For remotes where o.Object.Size() == src.Size() (S3, B2, and any backend with accurate, immediate size reporting), the stored fingerprint is identical to what it was before — no behavioural change. The fix only makes a meaningful difference on backends where the remote size lags, is zero during async processing, or is permanently -1.


Part 4: Google Photos Trash Workaround Async Fix

Problem

When using the hasher backend overlay to sync metadata changes (like updating EXIF description or tags, which changes the file hash), rclone performs an Update operation. If Google Photos is configured with batch_mode = async, the new upload is committed in the background, so the newly uploaded media item's details are not returned immediately (the batcher returns nil synchronously).

Because info is nil in async mode, o.id is not updated to the new media item's ID synchronously. The overwrite check if oldID != "" && oldID != o.id therefore compared oldID with itself, evaluated to false, and skipped the trash workaround entirely, leaving duplicate files in the Google Photos album.

Fix

The check in backend/googlephotos/googlephotos.go is simplified to:

if oldID != "" {
	err = o.fs.trashMediaItem(ctx, oldID, albumID)
	...
}

Since Object.Update is only called by rclone when replacing/overwriting a file that already exists at the destination, a non-empty oldID always guarantees that an overwrite is occurring, and thus the old item should always be moved to the trash album.
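The difference between the two conditions can be reproduced in isolation. In the sketch below, mediaItem, oldCheck, and newCheck are hypothetical names; a nil info models the async batcher returning before the new item's details are known.

```go
package main

import "fmt"

// mediaItem is a hypothetical stand-in for the upload result.
type mediaItem struct{ id string }

// oldCheck models the previous condition: the object's ID is refreshed
// only when the upload returned details, then compared with the old ID.
func oldCheck(oldID string, info *mediaItem) bool {
	currentID := oldID
	if info != nil {
		currentID = info.id
	}
	return oldID != "" && oldID != currentID
}

// newCheck models the simplified condition: any non-empty old ID means
// an existing item is being replaced and should be trashed.
func newCheck(oldID string) bool {
	return oldID != ""
}

func main() {
	fmt.Println(oldCheck("A", &mediaItem{id: "B"})) // true: sync mode worked
	fmt.Println(oldCheck("A", nil))                 // false: async mode skipped the trash step
	fmt.Println(newCheck("A"))                      // true: trash step runs in both modes
}
```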


Automated Tests

fs/operations/operations_test.go:

Test                     Covers
TestEqualHashFallback    Hash comparison used when modtime unsupported and common hash exists

backend/hasher/hasher_internal_test.go:

Test                                              Covers
UpdateInFlightHashing/UnderlyingLacksHashes       Hash computed in-flight and cached in BoltDB on Update for hash-less remotes
UpdateInFlightHashing/UnderlyingSupportsHashes    Cache pruned (not re-written) on Update for hash-native remotes

backend/googlephotos/googlephotos_workaround_test.go:

Test                              Covers
TestUpdateTrashWorkaroundAsync    Trash workaround executes immediately on Update under async batch mode

ok  github.com/rclone/rclone/backend/hasher        2.8s
ok  github.com/rclone/rclone/fs/operations         20.9s
ok  github.com/rclone/rclone/backend/googlephotos  2.5s

Manual Test Plan (Add, Update, Remove Operations)

Prerequisites

go build .

# Download three distinct test images
curl -L -o photo_a.jpg "https://picsum.photos/seed/rclone_pr3_a/800/600"
curl -L -o photo_b.jpg "https://picsum.photos/seed/rclone_pr3_b/800/600"
curl -L -o photo_c.jpg "https://picsum.photos/seed/rclone_pr3_c/800/600"

# Configure the hasher overlay remote (if not already done)
rclone config create gphotos_cache hasher remote=gphotos: hashes=md5 max_age=24h

Step 1: Add operation — hashes cached in-flight

mkdir -p /tmp/pr3_test
cp photo_a.jpg photo_b.jpg /tmp/pr3_test/

./rclone sync /tmp/pr3_test/ gphotos_cache:album/rclone_PR3_Test \
  --gphotos-batch-mode=sync --ignore-size --ignore-checksum -v

Expected: both files transferred.

Transferred:      2 / 2, 100%

Step 2: Skip operation (Re-sync with no changes) — zero transfers (cache hit)

./rclone sync /tmp/pr3_test/ gphotos_cache:album/rclone_PR3_Test \
  --gphotos-batch-mode=sync --ignore-size --ignore-checksum -v

Expected: nothing transferred. Without the ignore-size fingerprint fix, rclone would re-upload the files or re-download them to verify.

There was nothing to transfer

Step 3: Update operation (Overwrite one photo) — change detected, one file transferred

cp photo_c.jpg /tmp/pr3_test/photo_a.jpg

./rclone sync /tmp/pr3_test/ gphotos_cache:album/rclone_PR3_Test \
  --gphotos-batch-mode=sync --ignore-size --ignore-checksum -v

Expected: exactly one file transferred (photo_a.jpg, whose content changed). photo_b.jpg is skipped (unchanged, cache hit).

Transferred:      1 / 1, 100%

Step 4: Skip operation (Re-sync after overwrite) — zero transfers (new hash cached correctly)

./rclone sync /tmp/pr3_test/ gphotos_cache:album/rclone_PR3_Test \
  --gphotos-batch-mode=sync --ignore-size --ignore-checksum -v

Expected: nothing transferred. Verifies that the hash written during Step 3's overwrite (stored under the src.Size() fingerprint key) is correctly found on lookup.

There was nothing to transfer

Step 5: Remove operation (Delete one photo) — change detected, remote photo deleted/trashed

rm /tmp/pr3_test/photo_b.jpg

./rclone sync /tmp/pr3_test/ gphotos_cache:album/rclone_PR3_Test \
  --gphotos-batch-mode=sync --ignore-size --ignore-checksum -v

Expected: exactly one deletion/removal processed (photo_b.jpg removed).

Deleted: 1 (files)

Step 6: Skip operation (Re-sync after deletion) — zero transfers and no re-creation

./rclone sync /tmp/pr3_test/ gphotos_cache:album/rclone_PR3_Test \
  --gphotos-batch-mode=sync --ignore-size --ignore-checksum -v

Expected: nothing transferred, and photo_b.jpg is NOT recreated. Hasher's cache was successfully pruned of the deleted file's hash.

There was nothing to transfer

Cleanup

Warning

Running rclone purge on rclone_Trash deletes the album container but leaves the files themselves orphaned ("zombie files") in your main Google Photos library. You must delete the files from your library first.

  1. Open the Google Photos Web UI, navigate to the rclone_Trash album, select all photos, and click "Move to trash" (or delete them).
  2. Purge the empty album containers:
./rclone purge gphotos_cache:album/rclone_PR3_Test
./rclone purge gphotos:album/rclone_Trash
  3. Remove local test files:
rm -rf /tmp/pr3_test photo_a.jpg photo_b.jpg photo_c.jpg

Dependencies (gh-stack)

This PR is part of a stack. It depends on:

davispw added 2 commits June 6, 2026 14:26
The Google Photos Library API does not support permanently deleting
media items (https://issuetracker.google.com/issues/109759781).

To work around this, rclone now moves deleted and overwritten items
into a designated "trash" album (default: "rclone_Trash") and removes
them from the active album. Users can then review and permanently
delete items from the trash album via the Google Photos web UI.

- Move deleted items to trash album in Remove()
- Move old item to trash album after Update() re-uploads
- Add TrashAlbumName to Options (defaults to "rclone_Trash")
- Add api.BatchAddItems and api.BatchRemoveItems request types
- Add unit tests: TestRemoveTrashWorkaround, TestUpdateTrashWorkaround

The Google Photos Library API has no delete endpoint for media items
(https://issuetracker.google.com/issues/109759781). Add documentation
covering the trash album workaround added in the previous commit,
including the trash_album_name option and instructions for users to
review and permanently delete items from the trash album via the
Google Photos web UI.
@davispw davispw force-pushed the fix-equal-hash-modtime-unsupported branch 5 times, most recently from dea1384 to ef9f34e Compare June 7, 2026 18:08
@davispw davispw changed the title fs/operations: use hash comparison when modtime is not supported hasher/operations: fix in-flight hash caching, fingerprinting, and modtime-fallback for hash-less remotes Jun 7, 2026
@davispw davispw force-pushed the fix-equal-hash-modtime-unsupported branch 2 times, most recently from f321a92 to f218a82 Compare June 7, 2026 18:43
@davispw davispw force-pushed the fix-equal-hash-modtime-unsupported branch 3 times, most recently from 3ba5f44 to 320d2a2 Compare June 7, 2026 19:44
davispw added 10 commits June 7, 2026 17:48
When uploading media, rclone can now read description metadata from
EXIF/IPTC/XMP tags and pass it to the Google Photos API as the media
item description, visible in the Google Photos UI.

This is controlled by two new options:
- read_exif_description: enable the feature (default: false)
- exif_description_fields: ordered list of tag names to try
  (default: Description,Caption-Abstract,ImageDescription,Title,ObjectName)

The first non-empty matching tag value is used. The feature uses the
github.com/bep/imagemeta library and reads only the first 512 KiB of
the upload stream to extract metadata before uploading.

Add unit test TestEXIFDescriptionMapping.

…ption fields

This adds a custom HandleXMP parser to extract Dublin Core nested title and
description tags (dc:title and dc:description) which are skipped by the default
imagemeta parser or are otherwise ignored since they are nested tags and not attributes.

This ensures that Lightroom-exported titles and descriptions successfully map
to the Google Photos description on upload.

…upported

Add unit tests covering the fix that makes equal() use hash comparison
when a backend does not support modtime but does advertise a common hash
type. Tests are written TDD-style (this commit introduces them before
the fix) and cover all three new branches:
  A. No common hash → size-only fallback (existing behaviour preserved)
  B. Common hash, same content → equal
  C. Common hash, different content → not equal (the behaviour fixed by
     the next commit)

…less remotes

fix(hasher): cache hashes under source size for remotes with delayed size reporting

chore: merge all PR branches for combined testing (davispw-head)
@davispw davispw force-pushed the fix-equal-hash-modtime-unsupported branch from 320d2a2 to 1ffd962 Compare June 8, 2026 00:49
@davispw davispw force-pushed the fix-equal-hash-modtime-unsupported branch from 1ffd962 to 57f0115 Compare June 8, 2026 02:50