feat(datasets): allow multipart uploads for large datasets #384
Conversation
Mostly not awful, just need to navigate some lingering query param errors: ref. https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
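For context on those query param errors: with SigV4 query-string auth, each part's presigned URL carries the signature, `partNumber`, and `uploadId` as query parameters, so the client has to PUT to the URL exactly as returned. A minimal sketch of how the server side might presign per-part URLs with boto3 (client setup, bucket, and key names are assumptions, not this PR's actual code):

```python
import boto3
from botocore.config import Config

# SigV4 signing is required for presigned multipart part URLs.
s3 = boto3.client("s3", config=Config(signature_version="s3v4"))

def presign_part(bucket, key, upload_id, part_number, expires=3600):
    # Every query parameter in the returned URL (X-Amz-Signature, partNumber,
    # uploadId, ...) is part of what was signed, so the uploader must use the
    # URL verbatim or the signature check fails.
    return s3.generate_presigned_url(
        ClientMethod="upload_part",
        Params={
            "Bucket": bucket,
            "Key": key,
            "UploadId": upload_id,
            "PartNumber": part_number,
        },
        ExpiresIn=expires,
    )
```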
Force-pushed from 610c066 to 5481ffe
@classmethod
def _put(cls, path, url, content_type):
# @classmethod
I don't think this breaks anything; at least, I haven't been able to make it break.
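For readers skimming the diff: `_put` is just the worker that streams a file to a presigned URL. A hedged sketch of what such a helper could look like (the real implementation in this PR may differ; `requests` is assumed to be available):

```python
import requests

def _put(path, url, content_type):
    # Stream the file from disk and PUT it to the presigned URL. If the
    # Content-Type header was included when the URL was signed, it has to
    # match here, otherwise S3 rejects the request with a signature mismatch.
    with open(path, "rb") as fh:
        resp = requests.put(url, data=fh, headers={"Content-Type": content_type})
    resp.raise_for_status()
    return resp
```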
Force-pushed from 5481ffe to 8de0ab6
update_status()
pool.put(self._put, url=pre_signed.url,
         path=result['path'], content_type=result['mimetype'])
pool.put(
Granted, this isn't really ideal. We're single-threading all parts of an upload in a single worker rather than distributing all N parts among all M workers in the pool. This will result in longer upload times, but that's better than the broken upload we have today. Soooo baby steps.
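In other words, each dataset's multipart upload is handed to the pool as one job, and that single worker walks the parts sequentially. A rough sketch of that shape (the pool API, helper names, and ETag bookkeeping are all assumptions for illustration):

```python
def _put_multipart(self, path, parts):
    # `parts` is a list of (part_number, presigned_url) tuples. All N parts
    # are uploaded one after another inside this single worker, so a pool of
    # M workers only parallelizes across datasets, not across parts.
    etags = []
    for part_number, url in parts:
        resp = self._put_part(path, part_number, url)  # hypothetical helper
        etags.append({"PartNumber": part_number, "ETag": resp.headers["ETag"]})
    return etags

# Submitted to the pool as a single unit of work, mirroring the diff above:
# pool.put(self._put_multipart, path=result['path'], parts=parts)
```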
# less than the part_minsize, AND we want to 1-index
# our range to match what AWS expects for part
# numbers
for part in range(1, (size // part_minsize) + 2):
this'll also add an extra empty part if the upload is exactly divisible by 500MB, which will probably cause an error from AWS due to it being too small. But also 🤷
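To make the off-by-one concrete: floor division plus 2 yields an extra, empty trailing part whenever the size is an exact multiple of the part size, while ceiling division yields exactly the parts that contain data. A small illustration (part_minsize taken as the 500MB threshold from the description; variable names are assumptions):

```python
import math

part_minsize = 500 * 1024 * 1024   # 500MB
size = 2 * part_minsize            # exactly divisible

# The diff's formula: range(1, (size // part_minsize) + 2) -> [1, 2, 3].
# Part 3 would be empty, which S3 may reject as too small.
naive_parts = list(range(1, (size // part_minsize) + 2))

# Ceiling division produces only the parts that actually hold data: [1, 2].
exact_parts = list(range(1, math.ceil(size / part_minsize) + 1))
```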
This attempts to fall back to a multipart upload strategy with presigned URLs in the event that a dataset is larger than 500MB.
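Roughly, the fallback flow is: keep the single presigned PUT for small files, and for anything over the threshold create a multipart upload, PUT each chunk to its own presigned URL, and complete the upload with the collected ETags. A self-contained sketch under those assumptions (every helper name here is hypothetical, not this library's API):

```python
import os
import requests

PART_SIZE = 500 * 1024 * 1024  # 500MB threshold from the description

def upload_dataset(path, content_type):
    size = os.path.getsize(path)
    if size <= PART_SIZE:
        # Small datasets keep the existing single presigned PUT.
        url = presign_single(path, content_type)           # hypothetical
        with open(path, "rb") as fh:
            requests.put(url, data=fh,
                         headers={"Content-Type": content_type}).raise_for_status()
        return

    # Large datasets fall back to multipart: one presigned URL per part,
    # then a completion call listing every part's number and ETag.
    upload_id = create_upload(path, content_type)          # hypothetical
    etags = []
    with open(path, "rb") as fh:
        part_number = 1
        while chunk := fh.read(PART_SIZE):
            url = presign_part(upload_id, part_number)     # hypothetical
            resp = requests.put(url, data=chunk)
            resp.raise_for_status()
            etags.append({"PartNumber": part_number, "ETag": resp.headers["ETag"]})
            part_number += 1
    complete_upload(upload_id, etags)                      # hypothetical
```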
Force-pushed from 8de0ab6 to 5b2a78c
marquiswashere left a comment:
LGTM, we can test multi-threading later haha.
🎉 This PR is included in version 1.11.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀