Skip to content

PYTHON-1721 Improve GridFS file download performance#413

Closed
ShaneHarvey wants to merge 10 commits intomongodb:masterfrom
ShaneHarvey:PYTHON-1721-master-final
Closed

PYTHON-1721 Improve GridFS file download performance#413
ShaneHarvey wants to merge 10 commits intomongodb:masterfrom
ShaneHarvey:PYTHON-1721-master-final

Conversation

@ShaneHarvey
Copy link
Copy Markdown
Contributor

Jira: https://jira.mongodb.org/browse/PYTHON-1721

This change uses a single cursor to download all the chunks in a GridFS file instead of using individual find_one operations to read each chunk. I've also refactored the code a bit to add earlier detection of truncated chunks.

The performance benefit of using a single cursor will vary but it should always be an improvement.

Python 2.7 local server: negligible benefit (latency is ~0.09ms):

$ mongo-python-driver git:(PYTHON-1721-master-final) TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, cursor
Running TestGridFsDownload. MEDIAN=0.322574853897
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 162532197.9273853
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 35.708s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, find_one
Running TestGridFsDownload. MEDIAN=0.326012134552
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 160818553.80029458
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 33.828s

OK

Python 3.7 local server: 15% improvement:

$ mongo-python-driver git:(PYTHON-1721-master-final) TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, cursor
Running TestGridFsDownload. MEDIAN=0.25183119000000076
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 208190256.33798516
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 29.609s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Local repl 1 node, cursor
Running TestGridFsDownload. MEDIAN=0.29158426900000034
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 179806682.23222953
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 30.502s

OK

Now, once we run the benchmark against a server with some real world latency, using a cursor becomes much more advantageous. Let's test with an Atlas (M10) cluster.

Python 2.7, Atlas 4.0.6 replica set, 33% speed up (latency is ~5ms):

$ mongo-python-driver git:(PYTHON-1721-master-final) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Atlas find cursor
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 83 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=3.35720396042
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 15616805.120610315
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 313.563s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
M	test/__init__.py
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python test/performance/perf_test.py TestGridFsDownload # Atlas find_one
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 65 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=4.43294596672
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 11827078.514739025
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 311.489s

OK

Python 3.7, Atlas 4.0.6 replica set: Again ~33% speed up:

$ mongo-python-driver git:(PYTHON-1721-master-final) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Atlas find cursor
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 91 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=3.1946222420000083
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 16411580.471303768
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 317.615s

OK
$ mongo-python-driver git:(PYTHON-1721-master-final) gcm
M	test/__init__.py
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ mongo-python-driver git:(master) <ATLAS_ENV_VARS> TEST_PATH=driver-performance-test-data python3.7 test/performance/perf_test.py TestGridFsDownload # Atlas find_one
test/performance/perf_test.py:114: UserWarning: Test timed out, completed 65 iterations.
  warnings.warn('Test timed out, completed %s iterations.' % i)
Running TestGridFsDownload. MEDIAN=4.296203145000021
.{
    "results": [
        {
            "name": "TestGridFsDownload",
            "results": {
                "1": {
                    "ops_per_sec": 12203519.766288832
                }
            }
        }
    ]
}

----------------------------------------------------------------------
Ran 1 test in 313.310s

OK


try:
chunk = self.__chunk_iter.next()
except StopIteration:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this possible? It looks like _GridOutChunkIterator will never raise StopIteration.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_GridOutChunkIterator does raise StopIteration after reading the final chunk and checking that no extra chunks exist and I believe I had to make this change to catch a specific test failure when the final chunk in a file was truncated. However, this was on an earlier version of these changes. Now that _GridOutChunkIterator checks for missing/empty chunks and raises CorruptGridFile itself you are right that StopIteration is impossible here. Also see the coverage report saying this line was never reached: https://s3.amazonaws.com/mciuploads/mongo-python-driver/coverage/a84f50b998ea4614312b8ed13278c52031db2a54/5c7f1e502a60ed235e482752/htmlcov/gridfs_grid_file_py.html#n488

try:
self.__chunk_iter.next()
except StopIteration:
pass
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gridfs spec has this to say:

Should drivers report an error if a stored file has extra chunks?

The length and the chunkSize fields of the files collection document together imply exactly how many chunks a stored file should have. If the chunks collection has any extra chunks the stored file is in an inconsistent state. Ideally we would like to report that as an error, but this is an extremely unlikely state and we don't want to pay a performance penalty checking for an error that is almost never there. Therefore, drivers MAY ignore extra chunks.

Given that, I've removed the check on every call to read() but left the check for after reading the entire file. Calling next() one extra time will have no performance impact in the common case (because the cursor is already exhausted) and still keeps the CorruptGridFile check.

Note that there is a minor difference here: An app used to see CorruptGridFile("Extra chunk found..") on the first read() of a file with extra chunks. Now the app will only see the error on the final read(). I think this is a totally acceptable change and is so minor I don't think it's necessary to call out in the docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should call it out in the changelog.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some versionchanged entries and updated the changelog. I also tweaked the chunk iterator to retry once on CursorNotFound for backwards compatibility -- previously an app could spent an arbitrary amount of time between two read() calls because each used a find_one. However when using a cursor the server side cursor can timeout and thus the same code would not work. I will add a test for this on Monday.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new test that simulates a cursor timeout by killing the underlying cursor before a getMore call.

@ShaneHarvey ShaneHarvey force-pushed the PYTHON-1721-master-final branch from 4650b53 to 4a9b5c8 Compare March 11, 2019 23:07
@ShaneHarvey ShaneHarvey requested a review from behackett March 11, 2019 23:07
ShaneHarvey added a commit that referenced this pull request Mar 13, 2019
This change uses a cursor to download all the chunks in a GridFS file
instead of using individual find_one operations to read each chunk.
Detect truncated/missing/extra chunks in _GridOutChunkIterator.
Only detect extra chunks after reading the final chunk, not on every
call to read().
Retry once after CursorNotFound for backward compatibility.
ShaneHarvey added a commit that referenced this pull request Mar 13, 2019
This change uses a cursor to download all the chunks in a GridFS file
instead of using individual find_one operations to read each chunk.
Detect truncated/missing/extra chunks in _GridOutChunkIterator.
Only detect extra chunks after reading the final chunk, not on every
call to read().
Retry once after CursorNotFound for backward compatibility.

(cherry picked from commit 956fd92)
@ShaneHarvey
Copy link
Copy Markdown
Contributor Author

Merged to master: 956fd92 and v3.8: 8069e13.

@ShaneHarvey ShaneHarvey deleted the PYTHON-1721-master-final branch March 13, 2019 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants