Mine hard negatives: optionally output similarity scores#3506

Merged
tomaarsen merged 9 commits into huggingface:main from tsbalzhanov:mine_hard_negatives
Dec 11, 2025

@tsbalzhanov tsbalzhanov commented Aug 30, 2025

Hello

This PR adds an option to include similarity scores in the result of the mine_hard_negatives function.
This option might be helpful for fine-tuning the parameters of the mining function without needing to recalculate the scores, or for moving the negative-selection logic outside of the mining function.
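
As a rough illustration of that use case, the sketch below re-applies a margin filter to scores that were already stored next to the mined negatives, so the embedding model does not have to run again. The row layout (`positive_score`, `negatives`, `negative_scores`) and the helper `refilter_negatives` are hypothetical; they are not the columns or API that `mine_hard_negatives` actually produces.

```python
from typing import Sequence


def refilter_negatives(
    positive_score: float,
    negatives: Sequence[str],
    negative_scores: Sequence[float],
    absolute_margin: float,
) -> list[str]:
    """Keep negatives scoring at least `absolute_margin` below the positive."""
    threshold = positive_score - absolute_margin
    return [text for text, score in zip(negatives, negative_scores) if score <= threshold]


# Hypothetical stored row; the column names are assumptions for illustration.
row = {
    "positive_score": 0.9,
    "negatives": ["neg_a", "neg_b", "neg_c"],
    "negative_scores": [0.85, 0.6, 0.3],
}

# Try two margins against the same stored scores, without re-encoding anything.
loose = refilter_negatives(row["positive_score"], row["negatives"], row["negative_scores"], 0.1)
strict = refilter_negatives(row["positive_score"], row["negatives"], row["negative_scores"], 0.5)
```

With the stored scores available, trying another margin is a cheap list filter rather than a new mining run.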

Tsyren Balzhanov

@tsbalzhanov tsbalzhanov marked this pull request as ready for review August 30, 2025 14:46
@tomaarsen
Member

Hello!

I think it's indeed a good idea to also allow exporting scores, but so far I've introduced that via the n-tuple-scores: https://github.com/UKPLab/sentence-transformers/blob/1def8d3d6289e72bfa6a6a48592b1342053e6ff2/sentence_transformers/util/hard_negatives.py#L209

If we instead add a parameter akin to include_scores, then we'll have to deprecate the n-tuple-scores presumably. That's not really an issue, though. I'll do some more thinking on it.

  • Tom Aarsen

@tsbalzhanov
Contributor Author

@tomaarsen

Hi, did you decide what to do with output_format=n-tuple-scores?
I've just updated the PR: I rebased it on the current master branch and made output_format=n-tuple with include_scores=True equivalent to output_format=n-tuple-scores.

@tomaarsen
Member

Apologies for the delay. I think it would be preferable indeed to move towards output_scores and deprecate n-tuple-scores. If n-tuple-scores is passed, we can simply give a warning and set output_format="n-tuple" and include_scores=True indeed.

I want to share that I'll be taking 3 weeks off starting Monday, so I won't be able to move this PR forward in the meantime. Apologies for this.

  • Tom Aarsen

@tsbalzhanov
Contributor Author

> If n-tuple-scores is passed, we can simply give a warning and set output_format="n-tuple" and include_scores=True indeed.

Okay, I've implemented this
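
The fallback discussed here can be sketched roughly as follows. This is a minimal sketch and not the code from the PR: the helper name `resolve_output_options` and the choice of `DeprecationWarning` are assumptions, and the flag is called `output_scores` here even though the thread also refers to it as `include_scores`.

```python
import warnings


def resolve_output_options(output_format: str, output_scores: bool = False) -> tuple[str, bool]:
    """Map the deprecated "n-tuple-scores" format onto "n-tuple" plus a scores flag."""
    if output_format == "n-tuple-scores":
        # Hypothetical warning text and category; the real ones may differ.
        warnings.warn(
            'output_format="n-tuple-scores" is deprecated; '
            'use output_format="n-tuple" with output_scores=True instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        return "n-tuple", True
    # Every other format passes through unchanged.
    return output_format, output_scores
```

Resolving the options once at the top of the function means the rest of the mining code only has to handle the non-deprecated formats.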

…r both

And consider "scores" and "labels" special label columns for all model archetypes, not just CrossEncoder

tomaarsen commented Dec 8, 2025

@tsbalzhanov
I made some more changes to be a bit more in line with the general format of Sentence Transformer training datasets:

  1. The score and label (and also scores and labels for the CrossEncoder class) columns are "special" in ST: they're considered the label column and they're passed to the labels section of a loss. (In this PR I'm extending these special columns to all 4 for all classes)
    a. This means that positive_scores and negative_scores wouldn't be considered labels and wouldn't be passed to a loss, which is not ideal.
    b. Multiple columns that match the special columns are also not supported: it would become unclear whether the binarized "labels" or the "scores" would be used in the losses.

So I've made it output either labels or scores, rather than both, for labeled-pair and labeled-list.
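
To make the ambiguity concrete, here is a small sketch of how a collator might separate the label column from the input columns. It assumes a "first match wins" rule and a particular priority order, both of which are assumptions for illustration; `split_columns` is a hypothetical helper, not the Sentence Transformers data collator.

```python
from typing import Optional, Sequence

# The four special names mentioned above; their priority order is an assumption.
VALID_LABEL_COLUMNS = ("label", "score", "labels", "scores")


def split_columns(column_names: Sequence[str]) -> tuple[Optional[str], list[str]]:
    """Return (label_column, input_columns), using the first special name found."""
    for candidate in VALID_LABEL_COLUMNS:
        if candidate in column_names:
            inputs = [name for name in column_names if name != candidate]
            return candidate, inputs
    # No special column present: everything is treated as an input.
    return None, list(column_names)


# With both "labels" and "scores" present, only one can become the label;
# the other silently ends up among the input columns.
label_column, input_columns = split_columns(["query", "passage", "labels", "scores"])
```

Under these assumptions, a dataset with both columns would train without error but with one of them misread as an input, which is the confusion the "either labels or scores" rule avoids.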

I think the current implementation gives you all the outputs that you might want, while also working nicely out of the box with the Sentence Transformers trainers etc.
I hope you like the proposal here, I'd like to include it in the upcoming v5.2 release.

  • Tom Aarsen

@tsbalzhanov
Contributor Author

@tomaarsen

I think using scores without labels defeats the purpose of using them in the first place, because we need to be able to distinguish between hard negatives and positives when training a cross encoder.
The label information is the important bit; scores are very useful but ultimately secondary.

In the case of labeled-pair we can't distinguish between positives and negatives at all. In the cases of triplet, n-tuple and labeled-list we can use the position in the scores array, but that's not a great solution, because it relies on particularities of the current implementation, which might change in the future.

Is there some way to have both scores and labels included in the output?
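
The position-based workaround mentioned above could look like the sketch below. It assumes the positive's score comes first in each scores array, which is exactly the implementation detail the comment warns against relying on; `labels_from_position` is a hypothetical helper.

```python
def labels_from_position(scores: list[float], num_positives: int = 1) -> list[int]:
    """Derive binary labels purely from position: leading entries are positives."""
    return [1 if index < num_positives else 0 for index in range(len(scores))]


# Assumed layout: [positive_score, negative_score_1, negative_score_2].
derived = labels_from_position([0.91, 0.62, 0.35])
```

If the ordering of the scores array ever changed, these derived labels would be wrong without any error being raised. For labeled-pair there is a single score per row, so no such positional trick is available.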

@tomaarsen
Member

Thanks for your considered response. I agree completely that more information is almost always better, but this time it contradicts one of the goals of mine_hard_negatives: that it produces a dataset that immediately works with some loss(es) in Sentence Transformers. If there's both a labels and a scores column, in that order, then I think the scores would be included as a text column and only labels would be used as the labels for the loss. It would be rather confusing.

I also agree that the "gold" (human-annotated) positives vs negatives (i.e. labels) is valuable for some losses, while others prefer the "silver" (machine-generated) positives vs negatives (i.e. scores) for a more detailed range from not similar to extremely similar. But I can't really picture a situation where someone would want both simultaneously (except I suppose to experiment with multiple losses). We also recently added caching to embeddings, so it would be possible to cheaply rerun the hard negatives mining with different output formats (unless you're also using a CrossEncoder).

Do you know of a situation where you'd need both columns?

  • Tom Aarsen

Copilot AI left a comment

Pull request overview

This PR adds an optional output_scores parameter to the mine_hard_negatives function, allowing users to include similarity scores in the output dataset alongside the mined hard negatives. This enables fine-tuning mining parameters without recalculating scores and supports extracting selection logic outside the mining function. The PR also deprecates the n-tuple-scores output format in favor of using n-tuple with output_scores=True.

Key changes:

  • Added output_scores parameter to optionally include similarity scores in all output formats
  • Deprecated n-tuple-scores format with a migration path to n-tuple + output_scores=True
  • Updated data collator to recognize "labels" and "scores" as valid label columns
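
As a toy illustration of the first key change, the sketch below builds labeled-pair style rows that carry either a binary label or a similarity score, but never both. The column names (`query`, `passage`, `label`, `score`) and the helper `to_labeled_pairs` are assumptions for illustration and are not taken from the merged implementation.

```python
from typing import Sequence


def to_labeled_pairs(
    query: str,
    positive: str,
    negatives: Sequence[str],
    positive_score: float,
    negative_scores: Sequence[float],
    output_scores: bool = False,
) -> list[dict]:
    """Emit one row per passage with a single label-like column."""
    passages = [positive, *negatives]
    if output_scores:
        # Similarity scores replace the binary labels.
        column, values = "score", [positive_score, *negative_scores]
    else:
        # Binary labels: 1 for the positive, 0 for every mined negative.
        column, values = "label", [1] + [0] * len(negatives)
    return [{"query": query, "passage": p, column: v} for p, v in zip(passages, values)]


rows_with_labels = to_labeled_pairs("q", "pos", ["neg"], 0.9, [0.4])
rows_with_scores = to_labeled_pairs("q", "pos", ["neg"], 0.9, [0.4], output_scores=True)
```

Keeping a single label-like column per row is what lets such a dataset be passed to a loss without further column selection.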

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| sentence_transformers/util/hard_negatives.py | Implements output_scores parameter, deprecates n-tuple-scores format, and adds score extraction logic for all output formats |
| tests/util/test_hard_negatives.py | Adds comprehensive test coverage for output_scores parameter across all output formats and validates deprecated format behavior |
| sentence_transformers/data_collator.py | Extends valid label columns to include "labels" and "scores" alongside existing "label" and "score" |
| docs/sentence_transformer/training_overview.md | Updates documentation to reflect new valid label column names |
| docs/cross_encoder/training_overview.md | Removes obsolete comparison note about label column differences |
| docs/cross_encoder/loss_overview.md | Documents ability to output similarity scores instead of binary labels |

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


@tomaarsen tomaarsen enabled auto-merge (squash) December 11, 2025 10:57
@tomaarsen tomaarsen merged commit 32cb5de into huggingface:main Dec 11, 2025
17 checks passed