grouping-trainer

Training and eval code for Sentry's AI grouping model.

Set up local

  1. Install uv, direnv, Google Cloud SDK.

  2. Sign up for WandB and add your API key as a secret named wandb-api-key in your GCP project.

  3. Create .env:

    cp .env.example .env

    Fill in your GCP project name and GCS bucket name.

  4. Set up the local Python environment:

    bin/set_up_local.sh

Usage

Assumes data with the columns defined in src/grouping_trainer/data.py has been written to GCS under gs://$GROUPING_TRAINER_BUCKET/final_csvs/.
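
Before launching a run, it can help to confirm that this prefix resolves to what you expect. The sketch below is not part of the repo: final_csvs_uri is a hypothetical helper name, and it assumes GROUPING_TRAINER_BUCKET is already set in your shell (for example, loaded from .env by direnv).

```shell
# Hypothetical helper (not part of this repo): print the GCS prefix
# that training data is expected under.
final_csvs_uri() {
    # Abort with a message if the bucket variable is unset or empty.
    : "${GROUPING_TRAINER_BUCKET:?set GROUPING_TRAINER_BUCKET in .env}"
    printf 'gs://%s/final_csvs/\n' "$GROUPING_TRAINER_BUCKET"
}
```

With the Google Cloud SDK installed, gcloud storage ls "$(final_csvs_uri)" should then list the CSVs under that prefix.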

Train

Sanity-check that the plumbing works locally:

python train.py --tiny_run

Launch a full remote training run:

python train.py --gpu h100 --run_shortname my-run

For a DDP (distributed data parallel) run:

python train.py --gpu h100-ddp-4 --run_shortname my-run

Debug

Launch a bare instance to SSH into

python -m grouping_trainer.launch --gpu h100

SSH into an instance from local

Add this function to your ~/.zshrc or similar:

gssh() {
    gcloud compute ssh "$1" --zone="${2:-us-central1-a}" --tunnel-through-iap
}

Then find your instance:

gcloud compute instances list --filter="name~grouping-trainer"

And SSH in:

gssh your-instance
# Override the zone if needed
gssh your-instance your-instance-zone
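
To see how the zone default in gssh resolves without opening a connection, a dry-run variant can print the command instead of running it. This is only a sketch: gssh_dry_run is a hypothetical name, not something the repo provides.

```shell
# Hypothetical dry-run counterpart to gssh: prints the gcloud command
# that gssh would run, without executing it.
gssh_dry_run() {
    # ${2:-us-central1-a} falls back to us-central1-a when no second
    # argument (the zone) is given, mirroring gssh above.
    printf 'gcloud compute ssh %s --zone=%s --tunnel-through-iap\n' \
        "$1" "${2:-us-central1-a}"
}
```

Calling gssh_dry_run your-instance shows the default zone being filled in. Passing a second argument replaces it.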

Check instance output

SSH into the instance and run:

logs
# shortcut for:
# sudo tail -n 50 -f /var/log/grouping_trainer_run.log

If that file doesn't exist, the startup script never reached the eval $COMMAND block. Check what it actually did:

sudo journalctl -u google-startup-scripts.service --no-pager
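
The choice between the two commands above can be sketched as a small helper that checks for the run log and prints which command to use next. run_log_hint is a hypothetical name. The default path is the one tailed by the logs shortcut.

```shell
# Hypothetical helper: print the command to run next, depending on
# whether the run log exists on this instance.
run_log_hint() {
    log_file="${1:-/var/log/grouping_trainer_run.log}"
    if [ -e "$log_file" ]; then
        # The startup script reached the run; follow its output.
        echo "sudo tail -n 50 -f $log_file"
    else
        # No run log yet; inspect the startup script itself.
        echo "sudo journalctl -u google-startup-scripts.service --no-pager"
    fi
}
```

Run it with no arguments on the instance, or pass a different path to check another log file.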

From local (use when you can't SSH in, e.g., the boot itself failed):

gcloud compute instances get-serial-port-output your-instance --zone=your-instance-zone | tail -100

Eval

See ./eval/.
