Streaming GTFS stop times and shapes #6754
base: dev-2.x
optionsome
left a comment
I guess one option would also be to add some sort of streaming reader mode to the OBA library and read the rows through it, but we would probably need to do it in a slightly more generic way, which might lead to more memory consumption.
dao.setPackShapePoints(true);
dao.setPackStopTimes(true);
What do these do?
They instruct OBA to use a more compact representation of these entities. But if we take the streaming approach, they are no longer necessary.
I had the same idea. The problem is that streaming the entities gives up the library's referential integrity checks: for example, StopTime.trip is no longer a full Trip but a trip id, which the consumer has to resolve. This means that we need a new data model. So with a new way of reading data and a new data model, there isn't much left of OBA. Also, now that I've maintained OBA for a while, I see that there is a huge amount of complicated indirection in there, which to me doesn't make a lot of sense. My favourite solution is this: we create a new module in this repo where we develop a new streaming library. Once we are satisfied with it, we can consider moving it to another repo in either the OBA or the OTP org.
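To make the id-resolution point concrete, here is a rough sketch of what "the consumer has to resolve" could look like. It assumes that trips are loaded into a map before stop times are streamed. Every name in it (TripResolverSketch, Trip, StopTimeRow, resolve) is hypothetical and exists in neither OBA nor OTP.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in OBA or OTP.
final class TripResolverSketch {

  /** Hypothetical full trip entity, loaded before stop times are streamed. */
  static final class Trip {
    final String id;
    final String routeId;

    Trip(String id, String routeId) {
      this.id = id;
      this.routeId = routeId;
    }
  }

  /** Hypothetical streamed row: it carries only the trip id, not a Trip object. */
  static final class StopTimeRow {
    final String tripId;
    final String stopId;

    StopTimeRow(String tripId, String stopId) {
      this.tripId = tripId;
      this.stopId = stopId;
    }
  }

  private final Map<String, Trip> tripsById = new HashMap<>();
  private final List<String> unknownTripIds = new ArrayList<>();

  void addTrip(Trip trip) {
    tripsById.put(trip.id, trip);
  }

  /**
   * Resolves the row's trip id to a Trip. A dangling id is recorded instead of
   * failing, so the consumer now performs the referential integrity check itself.
   */
  Optional<Trip> resolve(StopTimeRow row) {
    Trip trip = tripsById.get(row.tripId);
    if (trip == null) {
      unknownTripIds.add(row.tripId);
    }
    return Optional.ofNullable(trip);
  }

  /** Trip ids that were referenced by stop times but never defined. */
  List<String> unknownTripIds() {
    return unknownTripIds;
  }
}
```

Whether dangling references should be collected like this or should abort the import is one of the trade-offs that would need to be decided; the sketch only shows where that decision would live.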
Summary
This is a proof of concept for more efficient processing of stop times and shapes: rather than reading all of them into a huge list/array they are streamed off the CSV source line by line.
This yields huge memory savings: in a typical graph build you can save 30-40% of memory!
Combined with #6752 this saves about 60% of memory.
The downside is that we now have two ways of reading GTFS data: one streaming and one from the OBA library.
We need to discuss the various trade-offs, which is why this is a draft. (It also depends on a PR that isn't merged yet.)
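For readers unfamiliar with the idea, the streaming approach described in this summary can be illustrated roughly as follows. This is a minimal sketch and not the code in this PR: the class StopTimeStreamSketch, its row type, and the stream method are hypothetical, only three stop_times.txt columns are read, and the CSV parsing is deliberately naive (it assumes no quoted fields).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: not the code in this PR and not an OBA API.
final class StopTimeStreamSketch {

  /** Hypothetical row type holding only three of the stop_times.txt columns. */
  static final class StopTimeRow {
    final String tripId;
    final String stopId;
    final int stopSequence;

    StopTimeRow(String tripId, String stopId, int stopSequence) {
      this.tripId = tripId;
      this.stopId = stopId;
      this.stopSequence = stopSequence;
    }
  }

  /**
   * Passes each data row to the consumer as soon as it is parsed and returns the
   * row count. Only the current line is held here; what is retained afterwards
   * is up to the consumer.
   */
  static int stream(Reader source, Consumer<StopTimeRow> consumer) {
    try (BufferedReader reader = new BufferedReader(source)) {
      String header = reader.readLine();
      if (header == null) {
        return 0; // empty file
      }
      // Look columns up by name, since GTFS does not fix the column order.
      List<String> columns = Arrays.asList(header.trim().split(",", -1));
      int trip = columns.indexOf("trip_id");
      int stop = columns.indexOf("stop_id");
      int seq = columns.indexOf("stop_sequence");
      if (trip < 0 || stop < 0 || seq < 0) {
        throw new IllegalArgumentException("Missing required column in header: " + header);
      }
      int count = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) {
          continue; // skip blank lines
        }
        // Naive split: assumes no quoted fields containing commas.
        String[] cells = line.split(",", -1);
        consumer.accept(
          new StopTimeRow(cells[trip].trim(), cells[stop].trim(), Integer.parseInt(cells[seq].trim()))
        );
        count++;
      }
      return count;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String csv = "trip_id,stop_id,stop_sequence\nT1,S1,1\nT1,S2,2\nT2,S1,1\n";
    int rows = stream(new StringReader(csv), row ->
      System.out.println(row.tripId + " " + row.stopSequence + " " + row.stopId));
    System.out.println(rows + " rows streamed");
  }
}
```

The point of the callback shape is that the reader never builds a list of all rows; memory use is then determined by whatever the consumer chooses to keep.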
cc @tkalvas @abyrd @jessicaKoehnke