@leonardehrenfried
Member

Summary

This is a proof of concept for more efficient processing of stop times and shapes: rather than reading all of them into a huge list/array they are streamed off the CSV source line by line.

This yields large memory savings: in a typical graph build you can save 30-40%.

Combined with #6752 this saves about 60% of memory.
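For readers unfamiliar with the idea, here is a minimal sketch of the streaming approach described above. It is not the code in this PR: the class, record and method names are hypothetical, and the naive `split(",")` stands in for a real CSV parser. It only shows that each row is handed to a consumer as soon as it is parsed, so the full list of stop times never has to be held in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative only: all names are hypothetical and not taken from this PR. */
final class StopTimeStreamingSketch {

  /** One row of stop_times.txt; the trip is kept as a plain id, not an object. */
  record StopTimeRow(String tripId, String stopId, int stopSequence) {}

  /**
   * Reads the header, then passes each row to the consumer right after parsing it.
   * Only the current row is referenced here, so memory use does not grow with file size.
   * Assumption: fields contain no quotes or embedded commas (real GTFS needs a CSV parser).
   */
  static void streamStopTimes(BufferedReader reader, Consumer<StopTimeRow> consumer) {
    try {
      String header = reader.readLine();
      if (header == null) {
        return; // empty file, nothing to stream
      }
      List<String> columns = List.of(header.split(","));
      int trip = columns.indexOf("trip_id");
      int stop = columns.indexOf("stop_id");
      int seq = columns.indexOf("stop_sequence");
      if (trip < 0 || stop < 0 || seq < 0) {
        throw new IllegalArgumentException("Missing required column in header: " + header);
      }
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isBlank()) {
          continue; // skip empty lines
        }
        String[] fields = line.split(",", -1);
        consumer.accept(
          new StopTimeRow(fields[trip], fields[stop], Integer.parseInt(fields[seq].trim()))
        );
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Example consumer: counts stop times per trip without retaining any row. */
  static Map<String, Integer> countPerTrip(String csv) {
    Map<String, Integer> counts = new HashMap<>();
    streamStopTimes(
      new BufferedReader(new StringReader(csv)),
      row -> counts.merge(row.tripId(), 1, Integer::sum)
    );
    return counts;
  }

  public static void main(String[] args) {
    String csv =
      "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n" +
      "T1,08:00:00,08:00:00,A,1\n" +
      "T1,08:05:00,08:05:00,B,2\n" +
      "T2,09:00:00,09:00:00,A,1\n";
    System.out.println(countPerTrip(csv));
  }
}
```

The consumer decides what to keep, which is where the savings come from: a list-based reader has to allocate every row before any of them can be processed.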

The downside is that we now have two ways of reading GTFS data: one streaming and one from the OBA library.

We need to discuss the various trade-offs involved, which is why this is a draft. (It also depends on a PR that isn't merged yet.)

cc @tkalvas @abyrd @jessicaKoehnke

Member

@optionsome left a comment

I guess one option would also be to add some sort of a streaming reader mode to the OBA library and read the rows through it, but we would probably need to do it in a slightly more generic way, which might lead to more memory consumption.

Comment on lines 320 to 321
dao.setPackShapePoints(true);
dao.setPackStopTimes(true);
Member

What do these do?

Member Author

They instruct OBA to use a more compact way of representing these entities. But if we go with the streaming approach, they are no longer necessary.

@leonardehrenfried
Member Author

I guess one option would also be to add some sort of a streaming reader mode to the OBA library and read the rows through it, but we would probably need to do it in a slightly more generic way, which might lead to more memory consumption.

I had the same idea. The problem is that streaming the entities gives up the referential integrity checks in the library: for example, StopTime.trip is no longer a full Trip but a trip id, which the consumer has to resolve.

This means that we need a new data model. So with a new way of reading data and a new data model there isn't much left of OBA. Also, now that I've maintained OBA for a while, I see that there is a huge amount of complicated indirection in there which to me doesn't make a lot of sense.

My favourite solution is this: we create a new module in this repo where we develop a new streaming library. Once we are satisfied with it we can consider moving it to another repo either in the OBA or the OTP orgs.
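To make the referential integrity point above concrete, here is a rough sketch of what consumer-side resolution could look like. None of these types exist in OBA or OTP; `Trip`, `StopTimeRow`, `ResolvedStopTime` and `TripResolutionSketch` are hypothetical names chosen for illustration, and it assumes trips are loaded before stop times are streamed.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: all names are hypothetical and are not OBA or OTP APIs. */
final class TripResolutionSketch {

  record Trip(String id, String routeId) {}

  /** What a streaming reader would emit: the trip is only an id. */
  record StopTimeRow(String tripId, String stopId, int stopSequence) {}

  /** What the consumer builds once the id has been resolved. */
  record ResolvedStopTime(Trip trip, String stopId, int stopSequence) {}

  // Assumption: trips.txt is small enough to index fully before streaming stop times.
  private final Map<String, Trip> tripsById = new HashMap<>();

  void addTrip(Trip trip) {
    tripsById.put(trip.id(), trip);
  }

  /**
   * Resolves the trip id of a streamed row. This is the referential integrity
   * check that the reader no longer performs on the consumer's behalf.
   */
  ResolvedStopTime resolve(StopTimeRow row) {
    Trip trip = tripsById.get(row.tripId());
    if (trip == null) {
      throw new IllegalArgumentException(
        "stop_times.txt references unknown trip_id '" + row.tripId() + "'"
      );
    }
    return new ResolvedStopTime(trip, row.stopId(), row.stopSequence());
  }

  public static void main(String[] args) {
    TripResolutionSketch resolver = new TripResolutionSketch();
    resolver.addTrip(new Trip("T1", "R1"));
    System.out.println(resolver.resolve(new StopTimeRow("T1", "A", 1)));
  }
}
```

Whether an unknown id should throw, be logged, or be skipped is a policy decision the new library would have to make; throwing is used here only to keep the sketch short.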

@t2gran added this to the 2.8 (next release) milestone on Aug 11, 2025
3 participants