
Conversation

HermitSun (Contributor) commented on Apr 1, 2025

Motivation

Resolve #4822 ([Feature] Load model weight in parallel).

Modifications

Add support for loading safetensors weights with runai_streamer. It can be enabled by passing the option --load-format runai_streamer when launching the server.
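
For example, a minimal launch sketch (the model path below is a placeholder, and the exact set of server arguments may differ across SGLang versions):

    python -m sglang.launch_server --model-path /path/to/safetensors-model --load-format runai_streamer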

Checklist

"bitsandbytes",
"layered",
"remote",
"runai_streamer",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to add a description for it below.

HermitSun (Contributor, Author) replied:

Thanks for the reminder, I've already added it.
As for why I didn't add a description for remote: I think the logic of runai_streamer and remote can be merged, so I'll try to refactor this a bit later.
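
For context, the excerpt reviewed above extends the set of --load-format choices. Below is a minimal sketch of how such a choices list is typically registered with argparse; the surrounding names and the full list of choices here are assumptions for illustration, not the actual SGLang source:

    import argparse

    parser = argparse.ArgumentParser()
    # Hypothetical registration of the weight-loading format option.
    # Only "runai_streamer" is confirmed by this PR; the rest of the
    # list and the default value are assumed.
    parser.add_argument(
        "--load-format",
        type=str,
        default="auto",
        choices=[
            "auto",
            "pt",
            "safetensors",
            "dummy",
            "bitsandbytes",
            "layered",
            "remote",
            "runai_streamer",
        ],
        help="The format of the model weights to load.",
    )
    args = parser.parse_args()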

brayden-hai (Contributor) commented:

Hi @HermitSun, I'm wondering whether existing SGLang already supports the RunAI streamer: I was able to install it, but the performance was still not as good as expected. I'm interested in the S3 use case. Right now I'm using the basic MP model loader; I wonder if you have compared this performance with the MP loader in #7277.

ajmyyra commented on Oct 22, 2025

Hi @HermitSun, would you be willing to rebase this PR so it can be considered? I was working on a similar implementation for a PR when I found yours, and as you were first, it would be good to have your changes considered (and hopefully merged). If you're short on time, I can help test it out.

RunAI's Model Streamer performs up to 2x better when loading safetensors weights from a filesystem over a network (such as NFS). It has also been implemented under similar naming in vLLM, and since SGLang's model loader design seems to follow vLLM's quite closely, it would be good to support this in SGLang as well.
