Metadata-based Duplication Checking #25

Closed
MihaiStreames wants to merge 3 commits into Googolplexed0:main from MihaiStreames:main

@MihaiStreames

While re-downloading playlists whose song order had changed (because I had added or removed songs), I noticed that songs were downloaded again whenever their filenames were not identical, even though the songs themselves were the same.

  • The existing check only verified that a file with the exact same filename already exists and is not empty (i.e., its size is greater than zero), even though the song IDs were already recorded in the .song_ids or .song_archive files.

I decided to change that logic to check metadata instead:

  • Using music_tag, if a file with the same name exists, we read its tracktitle and artist tags and compare them to the new song's metadata.
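
As a rough illustration of the check described above (not the exact code in this PR), here is a minimal sketch. The function names `is_already_downloaded`, `read_tags_with_music_tag`, and `_normalize` are hypothetical, and the whitespace and case normalization is an assumption added for the example. The tag reader is passed in as a parameter so the comparison can be tried without real audio files.

```python
import os
from typing import Callable, Tuple


def read_tags_with_music_tag(path: str) -> Tuple[str, str]:
    """Read (title, artist) from an audio file using the music_tag package."""
    import music_tag  # third-party; imported lazily so the rest works without it

    audio = music_tag.load_file(path)
    return str(audio["tracktitle"]), str(audio["artist"])


def _normalize(text: str) -> str:
    # Assumption for this sketch: ignore case and repeated whitespace.
    return " ".join(text.split()).casefold()


def is_already_downloaded(
    path: str,
    title: str,
    artist: str,
    read_tags: Callable[[str], Tuple[str, str]] = read_tags_with_music_tag,
) -> bool:
    """Return True if a non-empty file at `path` carries the same title and artist."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    try:
        existing_title, existing_artist = read_tags(path)
    except Exception:
        # Unreadable tags: treat the file as not matching, so it gets downloaded.
        return False
    return (
        _normalize(existing_title) == _normalize(title)
        and _normalize(existing_artist) == _normalize(artist)
    )
```

With the default reader, a call would look like `is_already_downloaded("Album/01 - Song.ogg", "Song", "Artist")`; the path and tag values here are made up.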

This should work in most cases.

@Googolplexed0
Owner

Googolplexed0 commented Jun 16, 2025

This sounds like a bug that a few others have reported previously. I think the issue may be the fundamental design choice that SKIP_EXISTING requires both a filename match and a directory database match to trigger the skip. This was how the skip was designed when I took over the project, and I have done my best to change functionality details only where necessary (to minimize deviation). I thought that SKIP_PREVIOUSLY_DOWNLOADED (the global database) would catch edge cases where the filename is different but the song_ids are the same. On further thought, you may be right about removing the filename requirement for local database match skips. I'll play around with that change and see if there was a good reason it was implemented that way originally.
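
To make the two skip rules concrete, here is a minimal sketch of the decision only. It is not the project's actual code: `should_skip` and its parameters are hypothetical names, and the local database is assumed to be already loaded into a set of song IDs.

```python
from typing import Set


def should_skip(
    song_id: str,
    filename_exists: bool,
    local_ids: Set[str],
    require_filename_match: bool,
) -> bool:
    """Decide whether a download can be skipped based on the local ID record."""
    in_local_db = song_id in local_ids
    if require_filename_match:
        # Stricter rule: both the ID record and a same-named file must be present.
        return in_local_db and filename_exists
    # Relaxed rule: the ID record alone is enough, even if the file was renamed.
    return in_local_db


# Example: the song is recorded locally, but its filename changed after a reorder.
known_ids = {"abc123"}
strict = should_skip("abc123", filename_exists=False,
                     local_ids=known_ids, require_filename_match=True)
relaxed = should_skip("abc123", filename_exists=False,
                      local_ids=known_ids, require_filename_match=False)
```

Under the stricter rule the renamed song would be downloaded again, while under the relaxed rule it would be skipped.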

EDIT: See if 4cc7e18 fixes the redownloads you were seeing before.

As far as using song metadata to check for duplicates, I am fairly sure that is not the direction that I want to go. Metadata can be (and often is) changed over the lifetime of a track on the API side. Theoretically, song_ids are the one thing that should not change, which is the reason a local record of them was chosen to maintain "identity". Additionally, many users (myself included) may make manual changes to metadata to fit personal stylistic/organizational formats.
TL;DR: metadata isn't enough to confirm a track's identity in my opinion. Willing to discuss this further though.

Googolplexed0 added the wontfix (This will not be worked on) label Jul 10, 2025
Googolplexed0 changed the title from "Better duplication checking" to "Metadata-based Duplication Checking" Jul 10, 2025
