PYTHON-1721 Improve GridFS file download performance #413
ShaneHarvey wants to merge 10 commits into mongodb:master
Conversation
gridfs/grid_file.py
Outdated
    try:
        chunk = self.__chunk_iter.next()
    except StopIteration:
Is this possible? It looks like _GridOutChunkIterator will never raise StopIteration.
_GridOutChunkIterator does raise StopIteration after reading the final chunk and checking that no extra chunks exist. I believe I made this change to catch a specific test failure where the final chunk in a file was truncated, but that was on an earlier version of these changes. Now that _GridOutChunkIterator checks for missing/empty chunks and raises CorruptGridFile itself, you are right that StopIteration is impossible here. See also the coverage report showing this line was never reached: https://s3.amazonaws.com/mciuploads/mongo-python-driver/coverage/a84f50b998ea4614312b8ed13278c52031db2a54/5c7f1e502a60ed235e482752/htmlcov/gridfs_grid_file_py.html#n488
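To illustrate why StopIteration became unreachable there, here is a minimal sketch of the validation the chunk iterator now performs itself. This is not PyMongo's actual `_GridOutChunkIterator`; it uses plain dicts standing in for chunk documents and a local `CorruptGridFile` stand-in, and the names are illustrative.

```python
class CorruptGridFile(Exception):
    """Stand-in for gridfs.errors.CorruptGridFile."""


class ChunkIterator:
    """Sketch of a validating chunk iterator.

    Iterates chunk docs (dicts with "n" and "data") and raises
    CorruptGridFile for missing, out-of-order, or truncated chunks,
    so a caller never sees a bare StopIteration for a corrupt file.
    """

    def __init__(self, chunks, length, chunk_size):
        self._chunks = iter(chunks)
        self._next_n = 0
        # Number of chunks implied by the files collection document.
        self._max_n = (length + chunk_size - 1) // chunk_size
        self._chunk_size = chunk_size
        self._length = length

    def __iter__(self):
        return self

    def __next__(self):
        if self._next_n >= self._max_n:
            raise StopIteration  # all expected chunks consumed
        try:
            chunk = next(self._chunks)
        except StopIteration:
            raise CorruptGridFile("no chunk #%d" % self._next_n)
        if chunk["n"] != self._next_n:
            raise CorruptGridFile(
                "missing chunk: expected chunk #%d, got #%d"
                % (self._next_n, chunk["n"]))
        # Only the final chunk may be short of chunk_size.
        if self._next_n == self._max_n - 1:
            expected = self._length - self._next_n * self._chunk_size
        else:
            expected = self._chunk_size
        if len(chunk["data"]) != expected:
            raise CorruptGridFile("truncated chunk #%d" % self._next_n)
        self._next_n += 1
        return chunk
```

With this shape, a truncated or missing chunk surfaces as CorruptGridFile inside the iterator, which is why the outer `except StopIteration` handler can never fire.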
    try:
        self.__chunk_iter.next()
    except StopIteration:
        pass
The gridfs spec has this to say:
> Should drivers report an error if a stored file has extra chunks?
> The length and the chunkSize fields of the files collection document together imply exactly how many chunks a stored file should have. If the chunks collection has any extra chunks the stored file is in an inconsistent state. Ideally we would like to report that as an error, but this is an extremely unlikely state and we don't want to pay a performance penalty checking for an error that is almost never there. Therefore, drivers MAY ignore extra chunks.
Given that, I've removed the check on every call to read() but left the check for after reading the entire file. Calling next() one extra time will have no performance impact in the common case (because the cursor is already exhausted) and still keeps the CorruptGridFile check.
Note that there is a minor difference here: An app used to see CorruptGridFile("Extra chunk found..") on the first read() of a file with extra chunks. Now the app will only see the error on the final read(). I think this is a totally acceptable change and is so minor I don't think it's necessary to call out in the docs.
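The end-of-file check described above can be sketched as follows. This is an illustrative stand-alone function, not the actual PyMongo code: the cursor is any iterator of chunk docs, and `CorruptGridFile` is a local stand-in.

```python
class CorruptGridFile(Exception):
    """Stand-in for gridfs.errors.CorruptGridFile."""


def check_no_extra_chunks(cursor):
    """Sketch of the extra-chunk check performed after the final chunk.

    Draw one more document from the cursor. In the common case the
    cursor is already exhausted, so this costs nothing; any surviving
    document means the stored file has extra chunks.
    """
    try:
        extra = next(cursor)
    except StopIteration:
        return  # no extra chunks: the file is consistent
    raise CorruptGridFile("extra chunk found: #%d" % extra["n"])
```

Because the check runs only once, after the last expected chunk, a file with extra chunks now fails on the final read() instead of the first one, as noted above.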
You should call it out in the changelog.
I added some versionchanged entries and updated the changelog. I also tweaked the chunk iterator to retry once on CursorNotFound for backwards compatibility: previously an app could spend an arbitrary amount of time between two read() calls because each call used a separate find_one, but with a cursor the server-side cursor can time out, so the same code would no longer work. I will add a test for this on Monday.
I added a new test that simulates a cursor timeout by killing the underlying cursor before a getMore call.
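The single-retry behavior can be sketched like this. `open_cursor`, `RetryingChunkReader`, and the local `CursorNotFound` class are all illustrative stand-ins (the real exception is pymongo.errors.CursorNotFound), not the actual implementation.

```python
class CursorNotFound(Exception):
    """Stand-in for pymongo.errors.CursorNotFound."""


class RetryingChunkReader:
    """Sketch: retry once after a server-side cursor timeout.

    open_cursor(start_n) is an assumed factory returning an iterator
    over chunk docs with n >= start_n. Exactly one retry is attempted,
    mirroring the backward-compatibility behavior described above; a
    second CursorNotFound propagates to the caller.
    """

    def __init__(self, open_cursor):
        self._open_cursor = open_cursor
        self._cursor = open_cursor(0)
        self._next_n = 0

    def next(self):
        try:
            chunk = next(self._cursor)
        except CursorNotFound:
            # The server may have timed out the cursor between two
            # read() calls; re-open it at the next expected chunk and
            # retry once.
            self._cursor = self._open_cursor(self._next_n)
            chunk = next(self._cursor)
        self._next_n = chunk["n"] + 1
        return chunk
```

The key point is that the reader remembers the next expected chunk number, so the replacement cursor resumes exactly where the timed-out one left off.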
This change uses a single cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk.
Force-pushed from 4650b53 to 4a9b5c8.
This change uses a cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk. Detect truncated/missing/extra chunks in _GridOutChunkIterator. Only detect extra chunks after reading the final chunk, not on every call to read(). Retry once after CursorNotFound for backward compatibility. (cherry picked from commit 956fd92)
Jira: https://jira.mongodb.org/browse/PYTHON-1721
This change uses a single cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk. I've also refactored the code a bit to add earlier detection of truncated chunks.
The performance benefit of using a single cursor will vary but it should always be an improvement.
Python 2.7, local server: negligible benefit (latency is ~0.09ms):
Python 3.7, local server: 15% improvement:
Once we run the benchmark against a server with some real-world latency, using a cursor becomes much more advantageous. Let's test with an Atlas (M10) cluster.
Python 2.7, Atlas 4.0.6 replica set: ~33% speedup (latency is ~5ms):
Python 3.7, Atlas 4.0.6 replica set: again ~33% speedup:
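The latency effect above comes down to round-trip counts, which a back-of-the-envelope calculation makes concrete. The batch size here is an illustrative assumption, not PyMongo's actual chunk batching.

```python
import math


def round_trips_find_one(num_chunks):
    """One find_one per chunk means one round trip per chunk."""
    return num_chunks


def round_trips_cursor(num_chunks, batch_size=100):
    """The initial find returns one batch; each getMore another batch."""
    return max(1, math.ceil(num_chunks / batch_size))


# With the default 255 KiB chunk size, a ~25 MB file has ~101 chunks.
n = 101
latency_ms = 5.0  # round-trip latency, as measured against Atlas above
print(round_trips_find_one(n) * latency_ms)  # ~505 ms spent on latency
print(round_trips_cursor(n) * latency_ms)    # ~10 ms spent on latency
```

On a local server the per-round-trip cost is ~0.09 ms, so the difference is negligible; at ~5 ms per round trip the cursor approach eliminates most of the network time, consistent with the benchmarks above.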