feat(datasets): allow multipart uploads for large datasets #384
Conversation
Mostly not awful, just need to navigate some lingering query param errors: ref. https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
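For context on those query param errors: with SigV4 query-string auth, each part's presigned URL carries the signature, `partNumber`, and `uploadId` as query parameters, so the client has to PUT to the URL exactly as returned. A minimal sketch of how the server side might presign per-part URLs with boto3 (client setup, bucket, and key names are assumptions, not this PR's actual code):

```python
import boto3
from botocore.config import Config

# SigV4 signing is required for presigned multipart part URLs.
s3 = boto3.client("s3", config=Config(signature_version="s3v4"))

def presign_part(bucket, key, upload_id, part_number, expires=3600):
    # Every query parameter in the returned URL (X-Amz-Signature, partNumber,
    # uploadId, ...) is part of what was signed, so the uploader must use the
    # URL verbatim or the signature check fails.
    return s3.generate_presigned_url(
        ClientMethod="upload_part",
        Params={
            "Bucket": bucket,
            "Key": key,
            "UploadId": upload_id,
            "PartNumber": part_number,
        },
        ExpiresIn=expires,
    )
```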
Force-pushed from 610c066 to 5481ffe
@classmethod
def _put(cls, path, url, content_type):
# @classmethod
I don't think this breaks anything; at least, I haven't been able to make it break.
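For readers skimming the diff: `_put` is just the worker that streams a file to a presigned URL. A hedged sketch of what such a helper could look like (the real implementation in this PR may differ; `requests` is assumed to be available):

```python
import requests

def _put(path, url, content_type):
    # Stream the file from disk and PUT it to the presigned URL. If the
    # Content-Type header was included when the URL was signed, it has to
    # match here, otherwise S3 rejects the request with a signature mismatch.
    with open(path, "rb") as fh:
        resp = requests.put(url, data=fh, headers={"Content-Type": content_type})
    resp.raise_for_status()
    return resp
```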
Force-pushed from 5481ffe to 8de0ab6
update_status()
pool.put(self._put, url=pre_signed.url,
         path=result['path'], content_type=result['mimetype'])
pool.put(
Granted, this isn't really ideal. We're single-threading all parts of an upload in a single worker rather than distributing all N parts among all M workers in the pool. This will result in longer upload times, but that's better than the broken upload we have today. Soooo baby steps.
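In other words, each dataset's multipart upload is handed to the pool as one job, and that single worker walks the parts sequentially. A rough sketch of that shape (the pool API, helper names, and ETag bookkeeping are all assumptions for illustration):

```python
def _put_multipart(self, path, parts):
    # `parts` is a list of (part_number, presigned_url) tuples. All N parts
    # are uploaded one after another inside this single worker, so a pool of
    # M workers only parallelizes across datasets, not across parts.
    etags = []
    for part_number, url in parts:
        resp = self._put_part(path, part_number, url)  # hypothetical helper
        etags.append({"PartNumber": part_number, "ETag": resp.headers["ETag"]})
    return etags

# Submitted to the pool as a single unit of work, mirroring the diff above:
# pool.put(self._put_multipart, path=result['path'], parts=parts)
```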
# less than the part_minsize, AND we want to 1-index
# our range to match what AWS expects for part
# numbers
for part in range(1, (size // part_minsize) + 2):
this'll also add an extra empty part if the upload is exactly divisible by 500MB, which will probably cause an error from AWS due to it being too small. But also 🤷
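To make the off-by-one concrete: floor division plus 2 yields an extra, empty trailing part whenever the size is an exact multiple of the part size, while ceiling division yields exactly the parts that contain data. A small illustration (part_minsize taken as the 500MB threshold from the description; variable names are assumptions):

```python
import math

part_minsize = 500 * 1024 * 1024   # 500MB
size = 2 * part_minsize            # exactly divisible

# The diff's formula: range(1, (size // part_minsize) + 2) -> [1, 2, 3].
# Part 3 would be empty, which S3 may reject as too small.
naive_parts = list(range(1, (size // part_minsize) + 2))

# Ceiling division produces only the parts that actually hold data: [1, 2].
exact_parts = list(range(1, math.ceil(size / part_minsize) + 1))
```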
This attempts to fall back to a multipart upload strategy with presigned URLs in the event that a dataset is larger than 500MB.
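Roughly, the fallback flow is: keep the single presigned PUT for small files, and for anything over the threshold create a multipart upload, PUT each chunk to its own presigned URL, and complete the upload with the collected ETags. A self-contained sketch under those assumptions (every helper name here is hypothetical, not this library's API):

```python
import os
import requests

PART_SIZE = 500 * 1024 * 1024  # 500MB threshold from the description

def upload_dataset(path, content_type):
    size = os.path.getsize(path)
    if size <= PART_SIZE:
        # Small datasets keep the existing single presigned PUT.
        url = presign_single(path, content_type)           # hypothetical
        with open(path, "rb") as fh:
            requests.put(url, data=fh,
                         headers={"Content-Type": content_type}).raise_for_status()
        return

    # Large datasets fall back to multipart: one presigned URL per part,
    # then a completion call listing every part's number and ETag.
    upload_id = create_upload(path, content_type)          # hypothetical
    etags = []
    with open(path, "rb") as fh:
        part_number = 1
        while chunk := fh.read(PART_SIZE):
            url = presign_part(upload_id, part_number)     # hypothetical
            resp = requests.put(url, data=chunk)
            resp.raise_for_status()
            etags.append({"PartNumber": part_number, "ETag": resp.headers["ETag"]})
            part_number += 1
    complete_upload(upload_id, etags)                      # hypothetical
```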
Force-pushed from 8de0ab6 to 5b2a78c
marquiswashere left a comment:
LGTM, we can test multi-threading later haha.
🎉 This PR is included in version 1.11.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀