doi: add Dataverse direct mode with token, versions, ingest and tree #9467
ErykKul wants to merge 1 commit
As you noted, we have a doi backend (https://rclone.org/doi/) with a Dataverse provider already. Why can't this be a patch to the doi backend?
Loading the doi backend (which is meant to be a generic resolver that also covers Zenodo and Invenio) with Dataverse-specific things like

That's why I went with the third option: a separate Dataverse backend that addresses installs directly via

One thing worth flagging: Flora's implementation matches rclone idiom (

It's possible I got some of this wrong, or was too eager about a specific use case to see the broader fit; happy to be told. Open to discussion.
Branch force-pushed from cfae02d to 087eb2d.
Quick heads-up before you dig in: I've pushed a round of polish to this branch since opening it, mainly a lazy, paginated listing path (feature-detected, with a fallback to the original whole-version listing) plus some cleanup to match rclone's usual
I don't like the substantial overlap with the existing `doi` backend.

In your PR description "Relationship to

What's in the code is the simpler

The new capabilities are: direct

The biggest gap to close in

```go
// Options gains three fields:
type Options struct {
	Doi               string `config:"doi"`
	Provider          string `config:"provider"`
	DoiResolverAPIURL string `config:"doi_resolver_api_url"`
	Host              string `config:"host"`        // NEW: direct-mode base URL, skips doi.org
	DatasetPID        string `config:"dataset_pid"` // NEW: persistentId for direct mode
	Token             string `config:"token"`       // NEW: X-Dataverse-Key (optional)
}
```

The exact meaning of the new config options could depend on the provider.

```go
// init() gains options, e.g.:
}, {
	Name: "host",
	Help: `Base URL of the installation, e.g. https://demo.dataverse.org.
When set with dataset_pid, rclone addresses the dataset directly and skips
doi.org resolution. Lets you reach local/staging installs and unpublished
drafts whose PIDs aren't globally resolvable.`,
	Required: false,
	Advanced: true,
}, {
	Name:      "token",
	Help:      "API token (Dataverse X-Dataverse-Key). Blank means guest access.",
	Sensitive: true,
	Required:  false,
	Advanced:  true,
},
// ...plus dataset_pid and version (Dataverse-specific).
```

```go
// resolveEndpoint branches before touching doi.org:
func resolveEndpoint(ctx context.Context, srv *rest.Client, pacer *fs.Pacer, opt *Options) (provider Provider, endpoint *url.URL, err error) {
	// Direct mode: caller gave us the installation + persistentId, no resolution needed.
	if opt.Host != "" && opt.DatasetPID != "" {
		base, err := url.Parse(opt.Host)
		if err != nil {
			return "", nil, fmt.Errorf("invalid host %q: %w", opt.Host, err)
		}
		switch strings.ToLower(opt.Provider) {
		case string(Dataverse), "": // default direct provider is Dataverse
			q := url.Values{}
			q.Add("persistentId", opt.DatasetPID)
			endpoint = base.ResolveReference(&url.URL{
				Path:     "/api/datasets/:persistentId/",
				RawQuery: q.Encode(),
			})
			return Dataverse, endpoint, nil
		default:
			return "", nil, fmt.Errorf("direct mode not supported for provider %q", opt.Provider)
		}
	}
	// Resolved mode: existing behaviour, unchanged.
	resolvedURL, err := resolveDoiURL(ctx, srv, pacer, opt)
	// ...
}
```

Token auth would then be a one-liner where the Dataverse provider builds its requests (set

So my suggestion is to try to avoid the code duplication by folding these features into the `doi` backend.
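To make the direct-mode branch concrete, here is a minimal standalone sketch of just the URL construction, assuming the same `/api/datasets/:persistentId/` path and `persistentId` query parameter as the snippet above. The function name `directEndpoint` and the scheme check are illustrative additions, not part of the proposed patch.

```go
package main

import (
	"fmt"
	"net/url"
)

// directEndpoint is a hypothetical helper that mirrors the direct-mode branch:
// it turns an installation base URL plus a persistent ID into the dataset
// endpoint, without contacting doi.org.
func directEndpoint(host, pid string) (string, error) {
	base, err := url.Parse(host)
	if err != nil {
		return "", fmt.Errorf("invalid host %q: %w", host, err)
	}
	// Assumption: reject bare hostnames, because url.Parse accepts
	// "demo.dataverse.org" as a relative path rather than a host.
	if base.Scheme == "" || base.Host == "" {
		return "", fmt.Errorf("host %q must include a scheme, e.g. https://", host)
	}
	q := url.Values{}
	q.Add("persistentId", pid)
	endpoint := base.ResolveReference(&url.URL{
		Path:     "/api/datasets/:persistentId/",
		RawQuery: q.Encode(),
	})
	return endpoint.String(), nil
}

func main() {
	u, err := directEndpoint("https://demo.dataverse.org", "doi:10.5072/FK2/ABCD")
	fmt.Println(u, err)
	// https://demo.dataverse.org/api/datasets/:persistentId/?persistentId=doi%3A10.5072%2FFK2%2FABCD <nil>

	_, err = directEndpoint("demo.dataverse.org", "doi:10.5072/FK2/ABCD")
	fmt.Println(err)
}
```

One design point worth noting: because the reference path is absolute, `ResolveReference` replaces any path component in the base URL, so an installation served under a sub-path would need different handling.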
Branch force-pushed from 087eb2d to 3c457f0, then from 3c457f0 to 112b5a7.
Hi @ncw, I've now folded everything into the existing `doi` backend.

It's a single commit on top of

For the record on the history: the original low-level implementation did have the singleflight dedup, the X-Amz-Expires TTL URL cache, and the bytes=0-0 probe. When I moved to lib/rest plumbing (per your steer toward repo idiom) I dropped all three in favour of the single redirect-following GET + resumingReader, since the cache/probe stopped earning their complexity once reads go through the standard path. Sorry for the confusion. I will update the PR description.
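For readers unfamiliar with the resuming-reader idea mentioned above, the following is a rough sketch of the general technique, not the code in this PR. It assumes a hypothetical `openAt` callback that reopens the stream at a byte offset (in the real backend that would correspond to a fresh GET with a `Range: bytes=N-` header) and a cap of 5 resumes. The `flakyBody` type only exists to simulate a connection that drops mid-stream.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// openAt is a hypothetical callback that (re)opens the stream at a byte
// offset, for example by sending a GET with "Range: bytes=<offset>-".
type openAt func(offset int64) (io.ReadCloser, error)

// resumingReader retries non-EOF read errors by reopening at the offset
// already delivered to the caller, up to maxResumes times.
type resumingReader struct {
	open       openAt
	body       io.ReadCloser
	offset     int64 // bytes already handed to the caller
	resumes    int
	maxResumes int
}

func (r *resumingReader) Read(p []byte) (int, error) {
	for {
		if r.body == nil {
			body, err := r.open(r.offset)
			if err != nil {
				return 0, err
			}
			r.body = body
		}
		n, err := r.body.Read(p)
		r.offset += int64(n)
		if err == nil || errors.Is(err, io.EOF) {
			return n, err
		}
		// Non-EOF failure: drop the broken body and resume from r.offset.
		_ = r.body.Close()
		r.body = nil
		if r.resumes >= r.maxResumes {
			return n, fmt.Errorf("giving up after %d resumes: %w", r.resumes, err)
		}
		r.resumes++
		if n > 0 {
			return n, nil // deliver what we have; the next Read reopens
		}
	}
}

func (r *resumingReader) Close() error {
	if r.body == nil {
		return nil
	}
	return r.body.Close()
}

// flakyBody simulates a dropped connection: it fails with a non-EOF error
// after yielding `left` bytes.
type flakyBody struct {
	src  io.Reader
	left int
}

func (f *flakyBody) Read(p []byte) (int, error) {
	if f.left <= 0 {
		return 0, errors.New("connection reset")
	}
	if len(p) > f.left {
		p = p[:f.left]
	}
	n, err := f.src.Read(p)
	f.left -= n
	return n, err
}

func (f *flakyBody) Close() error { return nil }

// readWithResume reads data through a resumingReader whose first body breaks
// after failAfter bytes, and reports the content plus the number of opens.
func readWithResume(data string, failAfter int) (string, int, error) {
	opens := 0
	rr := &resumingReader{maxResumes: 5, open: func(off int64) (io.ReadCloser, error) {
		opens++
		src := strings.NewReader(data[off:])
		if opens == 1 {
			return &flakyBody{src: src, left: failAfter}, nil
		}
		return io.NopCloser(src), nil
	}}
	defer rr.Close()
	got, err := io.ReadAll(rr)
	return string(got), opens, err
}

func main() {
	got, opens, err := readWithResume("0123456789abcdef", 5)
	fmt.Println(got, opens, err)
}
```

The caller sees one uninterrupted stream; only the open count reveals that a second fetch happened. A real implementation would also need to check that the server honoured the range request before trusting the resumed bytes.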
What is the purpose of this change?
Adds Dataverse direct mode to the existing `backend/doi`. Set `host` + `dataset_pid` and rclone addresses a single Dataverse dataset directly over the Native API, skipping doi.org resolution, so `rclone mount`/`copy`/`tree` work on any dataset by host + persistent ID, with the dataset's folder structure and human-readable names preserved.

```
rclone config create dv doi \
    host=https://demo.dataverse.org \
    dataset_pid=doi:10.5072/FK2/ABCD
rclone mount --read-only dv: /mnt/dataset
```

`host` + `dataset_pid` together select Dataverse direct mode (no `provider` needed). One remote == one dataset.

Why direct mode (vs doi.org resolution). The existing doi-Dataverse provider resolves a registered DOI through doi.org. Direct mode skips resolution, so it reaches installs whose PIDs aren't globally registered (registration carries recurring per-agency costs some institutions skip), local/staging/pre-prod stacks, and unpublished drafts whose PIDs aren't resolvable yet. Works across PID schemes: DOI (`doi:`), Handle (`hdl:`), PermaLink (`perma:`, local PIDs added in Dataverse 5.14, expanded in 6.2).

Authentication. New `token` option (`X-Dataverse-Key`). Optional: blank is guest access (public datasets and files); a token is only needed for restricted files, drafts, or owner-only datasets. The token is attached to Dataverse API/list/read calls and stripped on the cross-host redirect to S3, so it never reaches the storage host.

Reads (rewritten for direct mode). The shared `Object.Open` now reads through a redirect-following `http.Client` off the pacer: one GET to the Dataverse access endpoint, following the 302 to the S3-direct presigned URL and stripping the `X-Dataverse-Key` on the cross-host hop so it never reaches the storage host; proxy-mode instances return the bytes directly (200/206). This replaces the previous `lib/rest` read, which only re-issued once on a stray `Location` and never sent a token. There's no mode-detection probe and no presigned-URL cache: the URL is minted fresh by Dataverse on each open, so it's never stale at open time; content caching stays in rclone's VFS/transfer layers. Native-API-only, so it stays agnostic to Dataverse's storage driver (local FS, S3, Swift, …).

Mid-stream resume (new). `Open` wraps the body in a resuming reader: on a non-EOF `Read` error (presigned-URL TTL crossed mid-stream, S3 dropped the connection, a transient blip) it re-fetches from the byte offset already delivered (`Range: bytes=N-`) and continues, bounded to 5 refreshes per `Open`. rclone's higher layers also re-open on body errors, but resuming in place avoids rebuilding the full TCP + redirect chain on long single-file transfers. Because `Open` is shared across the backend's providers, this, together with the rewritten read above, also covers Invenio/Zenodo reads.

Dataset version selection. New `version` option: `:latest` (default), `:draft`, `:latest-published`, or `1.0`/`2.0`/…, for reproducibility and for mounting in-progress drafts.

Lazy listing via `/tree`. When the instance exposes the paginated dataset `/tree` endpoint (feature-detected; from IQSS/dataverse#12382), listing is lazy and paged, and carries per-file access markers (public/restricted/embargoed) that make access-denied errors actionable. Any instance without it falls back to the whole-version file listing.

Tabular ingest. Dataverse parses CSV/Stata/SPSS uploads into a normalised `.tab` archival form alongside the original. `ingest_format` chooses which to surface (`original` was already supported by the doi-Dataverse provider; `archival` is new):

- `original` (default): original filename + original bytes (`?format=original`) + the stored MD5, so `rclone copy`/`check` verify end-to-end.
- `archival`: the post-ingest `.tab` name and bytes. `Size()` is the archival size (so length is still verified on copy), but the stored MD5 is the original's and won't match the archival bytes, so hash checks don't apply to ingested files in this mode.

Non-ingested files are unaffected either way.
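The `ingest_format` rules above can be summarised as a small decision function. This is a sketch under stated assumptions, not the backend's code: the `fileMeta` and `surfaced` types and their field names are hypothetical, and an empty `MD5` in the result stands for "no hash check applies".

```go
package main

import "fmt"

// fileMeta is a hypothetical view of one file's listing metadata.
type fileMeta struct {
	StoredName   string // name as stored, e.g. "survey.tab" after ingest
	StoredSize   int64  // size of the stored (archival) bytes
	OriginalName string // set only for ingested files, e.g. "survey.csv"
	OriginalSize int64  // size of the original upload
	MD5          string // checksum recorded for the original upload
}

// surfaced is what the remote would expose for that file.
type surfaced struct {
	Name  string
	Size  int64
	Query string // extra query string for the download request, if any
	MD5   string // empty means hash checks do not apply
}

// surface applies the ingest_format rules; non-ingested files fall through
// to the same result in both modes.
func surface(f fileMeta, ingestFormat string) (surfaced, error) {
	ingested := f.OriginalName != ""
	switch ingestFormat {
	case "", "original":
		if ingested {
			return surfaced{Name: f.OriginalName, Size: f.OriginalSize, Query: "format=original", MD5: f.MD5}, nil
		}
	case "archival":
		if ingested {
			// The recorded MD5 describes the original bytes, so it is withheld.
			return surfaced{Name: f.StoredName, Size: f.StoredSize}, nil
		}
	default:
		return surfaced{}, fmt.Errorf("invalid ingest_format %q", ingestFormat)
	}
	return surfaced{Name: f.StoredName, Size: f.StoredSize, MD5: f.MD5}, nil
}

func main() {
	f := fileMeta{StoredName: "survey.tab", StoredSize: 900, OriginalName: "survey.csv", OriginalSize: 1200, MD5: "abc123"}
	for _, mode := range []string{"original", "archival", "bogus"} {
		s, err := surface(f, mode)
		fmt.Printf("%s: %+v err=%v\n", mode, s, err)
	}
}
```

Withholding the hash in `archival` mode, rather than reporting a checksum that cannot match, is what lets a copy still verify length without producing spurious hash mismatches.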
Read-only. `Put`/`Update`/`Remove`/`Mkdir`/`Rmdir` return errors; uploads go through Dataverse's UI / Native API.

Layout. Changes live in `backend/doi`: the Dataverse provider in `dataverse.go`, the shared `Fs`/`Object`/read path in `doi.go`, wire types in `api/dataversetypes.go`. Docs in `docs/content/doi.md`.

Tests. 22 Dataverse-focused unit tests (`dataverse_internal_test.go`, `httptest`-driven) covering: direct-mode listing (root / subdir / unknown dir / root-is-file), `/tree` listing + pagination + forwarded ingest originals, ingest `original` vs `archival` + invalid-format rejection, full and range reads, mid-stream resume (asserts the body keeps streaming and that a re-fetch happened), transient-status retry, token-strip on cross-host redirect, read-only enforcement, host+PID validation, and attributed restricted/embargoed/bad-auth errors. The standard `fstests` harness (`TestIntegration`) exercises a configured remote.

Manually verified end-to-end against:

- `demo.dataverse.org` as guest (no token): list + range + `rclone check` MD5 match. Exercises S3-direct.
- `download-redirect=true`: S3-direct with `--multi-thread-streams`; range slices byte-identical to source.

Limitations. Read-only by design; file list frozen at `NewFs` time so version bumps need a remount; restricted files can appear in listings if the token can list-but-not-read; metadata blocks aren't surfaced as objects.

Downstream. gdcc/dataverse-recipes#35 packages this into a one-command Docker image (mount a dataset, or publish it as a personal Globus endpoint for HPC transfers). It builds against this branch and switches to upstream once merged.
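The token-stripping behaviour described under Authentication and Reads can be illustrated with the standard library alone. The sketch below is one way to do it, assuming a `CheckRedirect` hook that compares each hop's host with the first request's host; the function names and the two local test servers (standing in for the Dataverse host and the storage host) are illustrative and not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const keyHeader = "X-Dataverse-Key"

// newStrippingClient returns a client that follows redirects but removes the
// API key header whenever the next hop's host differs from the first request's.
func newStrippingClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 10 {
				return errors.New("stopped after 10 redirects")
			}
			if req.URL.Host != via[0].URL.Host {
				req.Header.Del(keyHeader)
			}
			return nil
		},
	}
}

// fetchViaRedirect runs one GET through a redirect from a fake API host to a
// fake storage host and reports which key value each host received.
func fetchViaRedirect(token string) (atAPI, atStorage string, err error) {
	apiSaw := make(chan string, 1)
	storageSaw := make(chan string, 1)

	storage := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		storageSaw <- r.Header.Get(keyHeader)
		fmt.Fprint(w, "file bytes")
	}))
	defer storage.Close()

	api := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		apiSaw <- r.Header.Get(keyHeader)
		http.Redirect(w, r, storage.URL+"/object", http.StatusFound)
	}))
	defer api.Close()

	// The request path is only illustrative.
	req, err := http.NewRequest(http.MethodGet, api.URL+"/api/access/datafile/1", nil)
	if err != nil {
		return "", "", err
	}
	if token != "" {
		req.Header.Set(keyHeader, token) // blank token means guest access
	}
	resp, err := newStrippingClient().Do(req)
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	return <-apiSaw, <-storageSaw, nil
}

func main() {
	atAPI, atStorage, err := fetchViaRedirect("secret-token")
	fmt.Printf("API host saw %q, storage host saw %q, err=%v\n", atAPI, atStorage, err)
}
```

The hook is needed because Go's `http.Client` only drops a fixed set of sensitive headers (such as `Authorization` and `Cookie`) on cross-domain redirects; a custom header like `X-Dataverse-Key` is otherwise forwarded. Comparing against `via[0]` rather than the previous hop keeps the key off every host other than the one originally addressed.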
Was the change discussed in an issue or in the forum before?
Discussed in this PR thread with @ncw: the standalone-backend vs `backend/doi`-direct-mode question, resolved by folding into `doi`.
Checklist