PYTHON-1721 Improve GridFS file download performance by ShaneHarvey · Pull Request #413 · mongodb/mongo-python-driver

ShaneHarvey · 2019-03-06T01:11:44Z

Jira: https://jira.mongodb.org/browse/PYTHON-1721

This change uses a single cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk. I've also refactored the code a bit to add earlier detection of truncated chunks.

The performance benefit of using a single cursor will vary but it should always be an improvement.

Python 2.7 local server: negligible benefit (latency is ~0.09ms):

$ mongo-python-driver git:(PYTHON-1721-master-final) TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, cursor
Running TestGridFsDownload. MEDIAN=0.322574853897
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 162532197.9273853
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 35.708s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, find_one
Running TestGridFsDownload. MEDIAN=0.326012134552
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 160818553.80029458
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 33.828s

OK

Python 3.7 local server: 15% improvement:

$ mongo-python-driver git:(PYTHON-1721-master-final) TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, cursor
Running TestGridFsDownload. MEDIAN=0.25183119000000076
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 208190256.33798516
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 29.609s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, cursor
Running TestGridFsDownload. MEDIAN=0.29158426900000034
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 179806682.23222953
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 30.502s

OK

Now, once we run the benchmark against a server with some real world latency, using a cursor becomes much more advantageous. Let's test with an Atlas (M10) cluster.

Python 2.7, Atlas 4.0.6 replica set, 33% speed up (latency is ~5ms):

$ mongo-python-driver git:(PYTHON-1721-master-final) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Atlas find cursor
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 83 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=3.35720396042
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 15616805.120610315
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 313.563s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
M	test/__init__.py
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Atlas find_one
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 65 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=4.43294596672
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 11827078.514739025
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 311.489s

OK

Python 3.7, Atlas 4.0.6 replica set: Again ~33% speed up:

$ mongo-python-driver git:(PYTHON-1721-master-final) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Atlas find cursor
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 91 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=3.1946222420000083
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 16411580.471303768
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 317.615s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
M	test/__init__.py
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Atlas find_one
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 65 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=4.296203145000021
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 12203519.766288832
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 313.310s

OK

behackett · 2019-03-06T01:29:10Z

gridfs/grid_file.py

+
+            try:
+                chunk = self.__chunk_iter.next()
+            except StopIteration:


Is this possible? It looks like _GridOutChunkIterator will never raise StopIteration.

_GridOutChunkIterator does raise StopIteration after reading the final chunk and checking that no extra chunks exist and I believe I had to make this change to catch a specific test failure when the final chunk in a file was truncated. However, this was on an earlier version of these changes. Now that _GridOutChunkIterator checks for missing/empty chunks and raises CorruptGridFile itself you are right that StopIteration is impossible here. Also see the coverage report saying this line was never reached: https://s3.amazonaws.com/mciuploads/mongo-python-driver/coverage/a84f50b998ea4614312b8ed13278c52031db2a54/5c7f1e502a60ed235e482752/htmlcov/gridfs_grid_file_py.html#n488

ShaneHarvey · 2019-03-07T20:09:13Z

gridfs/grid_file.py

+            try:
+                self.__chunk_iter.next()
+            except StopIteration:
+                pass


The gridfs spec has this to say:

Should drivers report an error if a stored file has extra chunks?

The length and the chunkSize fields of the files collection document together imply exactly how many chunks a stored file should have. If the chunks collection has any extra chunks the stored file is in an inconsistent state. Ideally we would like to report that as an error, but this is an extremely unlikely state and we don't want to pay a performance penalty checking for an error that is almost never there. Therefore, drivers MAY ignore extra chunks.

Given that, I've removed the check on every call to read() but left the check for after reading the entire file. Calling next() one extra time will have no performance impact in the common case (because the cursor is already exhausted) and still keeps the CorruptGridFile check.

Note that there is a minor difference here: An app used to see CorruptGridFile("Extra chunk found..") on the first read() of a file with extra chunks. Now the app will only see the error on the final read(). I think this is a totally acceptable change and is so minor I don't think it's necessary to call out in the docs.

You should call it out in the changelog.

I added some versionchanged entries and updated the changelog. I also tweaked the chunk iterator to retry once on CursorNotFound for backwards compatibility -- previously an app could spent an arbitrary amount of time between two read() calls because each used a find_one. However when using a cursor the server side cursor can timeout and thus the same code would not work. I will add a test for this on Monday.

I added a new test that simulates a cursor timeout by killing the underlying cursor before a getMore call.

This change uses a single cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk.

This change uses a cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk. Detect truncated/missing/extra chunks in _GridOutChunkIterator. Only detect extra chunks after reading the final chunk, not on every call to read(). Retry once after CursorNotFound for backward compatibility.

This change uses a cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk. Detect truncated/missing/extra chunks in _GridOutChunkIterator. Only detect extra chunks after reading the final chunk, not on every call to read(). Retry once after CursorNotFound for backward compatibility. (cherry picked from commit 956fd92)

ShaneHarvey · 2019-03-13T22:50:42Z

Merged to master: 956fd92 and v3.8: 8069e13.

behackett reviewed Mar 6, 2019

View reviewed changes

ShaneHarvey commented Mar 7, 2019

View reviewed changes

behackett approved these changes Mar 8, 2019

View reviewed changes

ShaneHarvey added 9 commits March 11, 2019 15:56

PYTHON-1721 Improve GridFS file download performance

ea299af

This change uses a single cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk.

Detect truncated/missing/extra chunks in _GridOutChunkIterator

2e96f59

Standardize on %d over %i

5ccb503

Remove unused properties

2a237ae

Remove dead code

4de3d73

Only detect axtra chunks after reading the final chunk

1a07af3

Update docs and add changelog

e2c4cc6

GridOut should survive CursorNotFound for backwards compatibility

b3e4292

Add test that GridOut survives cursor timeout

4a9b5c8

ShaneHarvey force-pushed the PYTHON-1721-master-final branch from 4650b53 to 4a9b5c8 Compare March 11, 2019 23:07

ShaneHarvey requested a review from behackett March 11, 2019 23:07

behackett approved these changes Mar 12, 2019

View reviewed changes

Remove completed TODO comment

7a753ac

ShaneHarvey closed this Mar 13, 2019

ShaneHarvey deleted the PYTHON-1721-master-final branch March 13, 2019 22:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PYTHON-1721 Improve GridFS file download performance#413

PYTHON-1721 Improve GridFS file download performance#413
ShaneHarvey wants to merge 10 commits intomongodb:masterfrom
ShaneHarvey:PYTHON-1721-master-final

ShaneHarvey commented Mar 6, 2019

Uh oh!

behackett Mar 6, 2019

Uh oh!

ShaneHarvey Mar 7, 2019

Uh oh!

ShaneHarvey Mar 7, 2019

Uh oh!

behackett Mar 8, 2019

Uh oh!

ShaneHarvey Mar 8, 2019

Uh oh!

ShaneHarvey Mar 11, 2019

Uh oh!

ShaneHarvey commented Mar 13, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ShaneHarvey commented Mar 6, 2019

Uh oh!

behackett Mar 6, 2019

Choose a reason for hiding this comment

Uh oh!

ShaneHarvey Mar 7, 2019

Choose a reason for hiding this comment

Uh oh!

ShaneHarvey Mar 7, 2019

Choose a reason for hiding this comment

Uh oh!

behackett Mar 8, 2019

Choose a reason for hiding this comment

Uh oh!

ShaneHarvey Mar 8, 2019

Choose a reason for hiding this comment

Uh oh!

ShaneHarvey Mar 11, 2019

Choose a reason for hiding this comment

Uh oh!

ShaneHarvey commented Mar 13, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants