Training and eval code for Sentry's AI grouping model.
- Install uv, direnv, and the Google Cloud SDK.
- Sign up for WandB and add a secret at wandb-api-key in your GCP project.
- Create .env:
  cp .env.example .env
  Fill in your GCP project name and GCS bucket name.
- Set up the local Python environment:
  bin/set_up_local.sh
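
Before running the setup script, it can help to confirm that the tools from the first step are on your PATH. This is a minimal sketch, assuming a bash-compatible shell; `missing_tools` is a hypothetical helper that this repo does not ship, and `gcloud` is the CLI installed by the Google Cloud SDK.

```shell
# Hypothetical helper (not part of this repo).
# Prints each tool that is not found on PATH, one per line.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  local status=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   missing_tools uv direnv gcloud || echo "Install the tools listed above first."
```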
The training commands below assume that data with the columns defined in src/grouping_trainer/data.py has already been written to GCS at
gs://$GROUPING_TRAINER_BUCKET/final_csvs/.
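
One way to check that assumption before launching anything is to list the prefix. This is a sketch only: `list_training_csvs` is a hypothetical name, and it assumes `GROUPING_TRAINER_BUCKET` is already exported in your shell (for example, loaded from .env by direnv, which is an assumption about how this repo wires things up).

```shell
# Hypothetical helper (not part of this repo).
# Lists the objects under the prefix the trainer reads from.
list_training_csvs() {
  if [ -z "${GROUPING_TRAINER_BUCKET:-}" ]; then
    echo "GROUPING_TRAINER_BUCKET is not set" >&2
    return 1
  fi
  gcloud storage ls "gs://${GROUPING_TRAINER_BUCKET}/final_csvs/"
}

# Usage:
#   list_training_csvs
```

An empty listing or an error here suggests the data has not been written yet, or that the bucket name in .env is wrong.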
Sanity check that plumbing works locally:
python train.py --tiny_run
Launch a full remote training run:
python train.py --gpu h100 --run_shortname my-run
For DDP:
python train.py --gpu h100-ddp-4 --run_shortname my-run

Launch a bare instance to SSH into
python -m grouping_trainer.launch --gpu h100

SSH into an instance from local
Add this function to your ~/.zshrc (or your shell's equivalent):
gssh() {
  gcloud compute ssh "$1" --zone="${2:-us-central1-a}" --tunnel-through-iap
}
Then find your instance:
gcloud compute instances list --filter="name~grouping-trainer"
And SSH in:
gssh your-instance
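
If you would rather not look up the zone by hand, the lookup and the SSH step can be combined. This is a rough sketch, not something the repo provides: `instance_zone` and `gssh_auto` are hypothetical names, and the sketch assumes the instance name matches exactly one instance in your project.

```shell
# Hypothetical helpers (not part of this repo).

# Prints the short zone name (e.g. us-central1-a) of the named instance.
# If several instances match, only the first zone is printed.
instance_zone() {
  gcloud compute instances list \
    --filter="name=$1" \
    --format="value(zone.basename())" | head -n 1
}

# Resolves the zone, then SSHes in through IAP, like gssh does.
gssh_auto() {
  local zone
  zone="$(instance_zone "$1")"
  if [ -z "$zone" ]; then
    echo "No instance named $1 found" >&2
    return 1
  fi
  gcloud compute ssh "$1" --zone="$zone" --tunnel-through-iap
}

# Usage:
#   gssh_auto your-instance
```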
# Override the zone if needed
gssh your-instance your-instance-zone

Check instance output
SSH into the instance and run:
logs
# shortcut for:
# sudo tail -n 50 -f /var/log/grouping_trainer_run.log
If that file doesn't exist, the startup script never reached the eval $COMMAND block. Check what it actually did:
sudo journalctl -u google-startup-scripts.service --no-pager
From local (use when you can't SSH in, e.g., the boot itself failed):
gcloud compute instances get-serial-port-output your-instance --zone=your-instance-zone | tail -100

For eval, see ./eval/.
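
For the on-instance output checks described earlier, the two steps (follow the run log, or fall back to the startup-script journal when the log is missing) could be folded into one function. This is a sketch under assumptions: `show_run_output` is a hypothetical name, and the default path is simply the one the `logs` shortcut is documented to tail.

```shell
# Hypothetical helper (not part of this repo); run it on the instance.
# Follows the run log if it exists, otherwise prints the startup-script journal.
show_run_output() {
  local log_file="${1:-/var/log/grouping_trainer_run.log}"
  if sudo test -f "$log_file"; then
    sudo tail -n 50 -f "$log_file"
  else
    echo "$log_file not found; showing the startup script journal instead" >&2
    sudo journalctl -u google-startup-scripts.service --no-pager
  fi
}

# Usage:
#   show_run_output
```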