doi: add Dataverse direct mode with token, versions, ingest and tree #9467

Open
ErykKul wants to merge 1 commit into rclone:master from ErykKul:dataverse-backend

@ErykKul ErykKul commented May 28, 2026

What is the purpose of this change?

Adds Dataverse direct mode to the existing backend/doi. Set host + dataset_pid and rclone addresses a single Dataverse dataset directly over the Native API, skipping doi.org resolution — so rclone mount/copy/tree work on any dataset by host + persistent ID, with the dataset's folder structure and human-readable names preserved.

Note on history: this opened as a standalone backend/dataverse and was reworked into backend/doi per review. The original low-level implementation had a per-(file,format) presigned-URL cache (TTL from X-Amz-Expires), singleflight first-touch dedup, and a bytes=0-0 mode-detection probe. Folding into backend/doi and moving to the standard plumbing dropped all three in favour of a single redirect-following GET plus an in-place resuming reader — simpler, and the cache/probe stopped earning their complexity on the shared read path. The description below reflects what's actually in the diff; the standalone backend is gone and this is a single commit on top of backend/doi.

rclone config create dv doi \
    host=https://demo.dataverse.org \
    dataset_pid=doi:10.5072/FK2/ABCD
rclone mount --read-only dv: /mnt/dataset

host + dataset_pid together select Dataverse direct mode (no provider needed). One remote == one dataset.

Why direct mode (vs doi.org resolution). The existing doi-Dataverse provider resolves a registered DOI through doi.org. Direct mode skips resolution, so it reaches installs whose PIDs aren't globally registered (registration carries recurring per-agency costs some institutions skip), local/staging/pre-prod stacks, and unpublished drafts whose PIDs aren't resolvable yet. Works across PID schemes — DOI (doi:), Handle (hdl:), PermaLink (perma:, local PIDs added in Dataverse 5.14, expanded in 6.2).
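A minimal sketch of the kind of host + PID check that direct mode implies, assuming the three scheme prefixes named above are the only ones accepted. `validateDirect` and `pidSchemes` are hypothetical names for illustration, not identifiers from the diff, and the real validation may be stricter or looser.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// pidSchemes lists the persistent-ID prefixes named in the description.
var pidSchemes = []string{"doi:", "hdl:", "perma:"}

// validateDirect is a hypothetical helper, not the PR's code. It checks that
// host is an absolute http(s) URL and that pid carries a known scheme prefix
// followed by at least one character.
func validateDirect(host, pid string) error {
	if host == "" || pid == "" {
		return errors.New("direct mode needs both host and dataset_pid")
	}
	u, err := url.Parse(host)
	if err != nil {
		return fmt.Errorf("invalid host %q: %w", host, err)
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return fmt.Errorf("host %q must be an absolute http(s) URL", host)
	}
	for _, scheme := range pidSchemes {
		if strings.HasPrefix(pid, scheme) && len(pid) > len(scheme) {
			return nil
		}
	}
	return fmt.Errorf("dataset_pid %q has no recognised scheme prefix", pid)
}

func main() {
	fmt.Println(validateDirect("https://demo.dataverse.org", "doi:10.5072/FK2/ABCD"))
	fmt.Println(validateDirect("demo.dataverse.org", "doi:10.5072/FK2/ABCD"))
}
```

Rejecting a scheme-less host early keeps a typo such as `demo.dataverse.org` from being treated as a relative path later on.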

Authentication. New token option (X-Dataverse-Key). Optional — blank is guest access (public datasets and files); a token is only needed for restricted files, drafts, or owner-only datasets. The token is attached to Dataverse API/list/read calls and stripped on the cross-host redirect to S3, so it never reaches the storage host.

Reads (rewritten for direct mode). The shared Object.Open now reads through a redirect-following http.Client off the pacer: one GET to the Dataverse access endpoint, following the 302 to the S3-direct presigned URL and stripping the X-Dataverse-Key on the cross-host hop so it never reaches the storage host; proxy-mode instances return the bytes directly (200/206). This replaces the previous lib/rest read, which only re-issued once on a stray Location and never sent a token. There's no mode-detection probe and no presigned-URL cache — the URL is minted fresh by Dataverse on each open, so it's never stale at open time; content caching stays in rclone's VFS/transfer layers. Native-API-only, so it stays agnostic to Dataverse's storage driver (local FS, S3, Swift, …).
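The token-stripping redirect policy can be sketched as below. This is an illustration under assumptions rather than the code in the diff: `stripTokenOnCrossHost`, `newReadClient`, and `headerAfterRedirect` are hypothetical names, and the host comparison is a plain string match on `URL.Host`. It relies on Go's `net/http` forwarding custom headers such as `X-Dataverse-Key` across redirects unless a `CheckRedirect` hook removes them.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

const tokenHeader = "X-Dataverse-Key"

// stripTokenOnCrossHost is a CheckRedirect policy: it drops the API token
// whenever the redirect target is on a different host than the first request.
func stripTokenOnCrossHost(req *http.Request, via []*http.Request) error {
	if len(via) >= 10 {
		return errors.New("stopped after 10 redirects")
	}
	if len(via) > 0 && req.URL.Host != via[0].URL.Host {
		req.Header.Del(tokenHeader)
	}
	return nil
}

// newReadClient returns a redirect-following client using the policy above.
func newReadClient() *http.Client {
	return &http.Client{CheckRedirect: stripTokenOnCrossHost}
}

// headerAfterRedirect simulates one redirect hop without network I/O and
// returns the token value left on the redirected request.
func headerAfterRedirect(from, to string) (string, error) {
	first, err := http.NewRequest(http.MethodGet, from, nil)
	if err != nil {
		return "", err
	}
	first.Header.Set(tokenHeader, "secret")
	next, err := http.NewRequest(http.MethodGet, to, nil)
	if err != nil {
		return "", err
	}
	// Mimic net/http, which copies custom headers onto the follow-up request.
	next.Header.Set(tokenHeader, "secret")
	if err := newReadClient().CheckRedirect(next, []*http.Request{first}); err != nil {
		return "", err
	}
	return next.Header.Get(tokenHeader), nil
}

func main() {
	got, err := headerAfterRedirect(
		"https://dv.example.org/api/access/datafile/1",
		"https://s3.example.com/bucket/key",
	)
	fmt.Printf("token on storage hop: %q, err: %v\n", got, err)
}
```

A same-host redirect keeps the header, so proxy-mode instances that bounce between their own paths still authenticate.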

Mid-stream resume (new). Open wraps the body in a resuming reader: on a non-EOF Read error (presigned-URL TTL crossed mid-stream, S3 dropped the connection, a transient blip) it re-fetches from the byte offset already delivered (Range: bytes=N-) and continues, bounded to 5 refreshes per Open. rclone's higher layers also re-open on body errors, but resuming in place avoids rebuilding the full TCP + redirect chain on long single-file transfers. Because Open is shared across the backend's providers, this — and the rewritten read above — also covers Invenio/Zenodo reads.
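The resume behaviour described above can be approximated by the sketch below. It is not the `resumingReader` from the diff: the `open` callback, `flakyBody`, `readWithDrops`, and `rangeHeader` are assumptions made for illustration, and the simulated source stands in for the HTTP re-fetch with `Range: bytes=N-`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

const maxResumes = 5 // bound taken from the description

// rangeHeader builds the Range value a real re-fetch would send.
func rangeHeader(offset int64) string { return fmt.Sprintf("bytes=%d-", offset) }

// resumingReader re-opens the source at the delivered offset on non-EOF errors.
type resumingReader struct {
	open    func(offset int64) (io.ReadCloser, error)
	body    io.ReadCloser
	offset  int64
	retries int
}

func (r *resumingReader) Read(p []byte) (int, error) {
	for {
		if r.body == nil {
			body, err := r.open(r.offset)
			if err != nil {
				return 0, err
			}
			r.body = body
		}
		n, err := r.body.Read(p)
		r.offset += int64(n)
		if err == nil || err == io.EOF {
			return n, err
		}
		if r.retries >= maxResumes {
			return n, err // budget exhausted: surface the error
		}
		r.retries++
		_ = r.body.Close()
		r.body = nil
		if n > 0 {
			return n, nil // hand over delivered bytes; the next Read re-opens
		}
	}
}

// flakyBody serves failAfter bytes and then fails, simulating a dropped stream.
type flakyBody struct {
	data      []byte
	failAfter int
	served    int
}

func (b *flakyBody) Read(p []byte) (int, error) {
	if b.served >= b.failAfter {
		return 0, errors.New("simulated connection drop")
	}
	if len(b.data) == 0 {
		return 0, io.EOF
	}
	n := len(p)
	if n > len(b.data) {
		n = len(b.data)
	}
	if n > b.failAfter-b.served {
		n = b.failAfter - b.served
	}
	copy(p, b.data[:n])
	b.data = b.data[n:]
	b.served += n
	return n, nil
}

func (b *flakyBody) Close() error { return nil }

// readWithDrops reads data through a source that drops every failAfter bytes
// and reports the content read plus how many times the source was opened.
func readWithDrops(data string, failAfter int) (string, int, error) {
	opens := 0
	r := &resumingReader{open: func(offset int64) (io.ReadCloser, error) {
		opens++
		return &flakyBody{data: []byte(data[offset:]), failAfter: failAfter}, nil
	}}
	out, err := io.ReadAll(r)
	return string(out), opens, err
}

func main() {
	got, opens, err := readWithDrops("abcdefghijklmnopqrstuvwxyz", 10)
	fmt.Println(got, opens, err, rangeHeader(10))
}
```

Tracking the offset in the reader rather than in the caller is what lets the consumer see one uninterrupted stream.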

Dataset version selection. New version option: :latest (default), :draft, :latest-published, or 1.0/2.0/… — for reproducibility and for mounting in-progress drafts.
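One way the version option could be validated and turned into a request URL, as a sketch only. `normalizeVersion` and `filesURL` are hypothetical names, the numbered form is assumed to be `major.minor`, and the path follows the Dataverse Native API file-listing route; the diff may wire the version in differently (the review thread suggests a query parameter).

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// numberedVersion matches the major.minor form such as 1.0 or 2.0 (assumed).
var numberedVersion = regexp.MustCompile(`^\d+\.\d+$`)

// normalizeVersion applies the default and rejects unknown selectors.
func normalizeVersion(v string) (string, error) {
	switch {
	case v == "":
		return ":latest", nil
	case v == ":latest", v == ":draft", v == ":latest-published":
		return v, nil
	case numberedVersion.MatchString(v):
		return v, nil
	}
	return "", fmt.Errorf("invalid version %q", v)
}

// filesURL builds a file-listing URL for one dataset version.
func filesURL(host, pid, version string) (string, error) {
	v, err := normalizeVersion(version)
	if err != nil {
		return "", err
	}
	base, err := url.Parse(host)
	if err != nil {
		return "", fmt.Errorf("invalid host %q: %w", host, err)
	}
	q := url.Values{}
	q.Set("persistentId", pid)
	u := base.ResolveReference(&url.URL{
		Path:     "/api/datasets/:persistentId/versions/" + v + "/files",
		RawQuery: q.Encode(),
	})
	return u.String(), nil
}

func main() {
	u, err := filesURL("https://demo.dataverse.org", "doi:10.5072/FK2/ABCD", ":draft")
	fmt.Println(u, err)
}
```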

Lazy listing via /tree. When the instance exposes the paginated dataset /tree endpoint (feature-detected; from IQSS/dataverse#12382), listing is lazy and paged, and carries per-file access markers (public/restricted/embargoed) that make access-denied errors actionable. Any instance without it falls back to the whole-version file listing.
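The feature-detect-then-fall-back flow can be sketched as follows. The /tree response shape is not shown in this thread, so `page`, `treeFetcher`, `errTreeUnsupported`, and `demoListing` are all hypothetical stand-ins, and the loop collects every page eagerly for brevity, whereas the backend is described as paging lazily.

```go
package main

import (
	"errors"
	"fmt"
)

// errTreeUnsupported is an assumed sentinel for "instance has no /tree".
var errTreeUnsupported = errors.New("tree endpoint not available")

// page is a hypothetical page of results with an opaque continuation cursor.
type page struct {
	Names []string
	Next  string
}

type treeFetcher func(cursor string) (page, error)

// listDir pages through the tree endpoint and falls back to the
// whole-version listing if the very first call reports it is unsupported.
// It returns the names and which path ("tree" or "fallback") produced them.
func listDir(fetchTree treeFetcher, fetchAll func() ([]string, error)) ([]string, string, error) {
	var names []string
	cursor := ""
	for {
		p, err := fetchTree(cursor)
		if errors.Is(err, errTreeUnsupported) && cursor == "" {
			all, err := fetchAll()
			return all, "fallback", err
		}
		if err != nil {
			return nil, "", err
		}
		names = append(names, p.Names...)
		if p.Next == "" {
			return names, "tree", nil
		}
		cursor = p.Next
	}
}

// demoListing runs listDir against in-memory fakes.
func demoListing(treeSupported bool) ([]string, string, error) {
	pages := map[string]page{
		"":   {Names: []string{"a.csv", "b.csv"}, Next: "p2"},
		"p2": {Names: []string{"c.csv"}},
	}
	fetchTree := func(cursor string) (page, error) {
		if !treeSupported {
			return page{}, errTreeUnsupported
		}
		return pages[cursor], nil
	}
	fetchAll := func() ([]string, error) {
		return []string{"a.csv", "b.csv", "c.csv"}, nil
	}
	return listDir(fetchTree, fetchAll)
}

func main() {
	fmt.Println(demoListing(true))
	fmt.Println(demoListing(false))
}
```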

Tabular ingest. Dataverse parses CSV/Stata/SPSS uploads into a normalised .tab archival form alongside the original. ingest_format chooses which to surface (original was already supported by the doi-Dataverse provider; archival is new):

  • original (default): original filename + original bytes (?format=original) + the stored MD5, so rclone copy/check verify end-to-end.
  • archival: the post-ingest .tab name and bytes. Size() is the archival size (so length is still verified on copy), but the stored MD5 is the original's and won't match the archival bytes, so hash checks don't apply to ingested files in this mode.

Non-ingested files are unaffected either way.
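The two ingest_format modes can be summarised in a small selection function. This is a sketch under assumptions: `dataFile`, `surfaced`, and `surface` are illustrative names rather than the wire types in api/dataversetypes.go, and a file is treated as ingested when it carries an original filename.

```go
package main

import "fmt"

// dataFile holds the fields relevant to ingest handling (illustrative).
type dataFile struct {
	Filename         string // stored name, .tab for ingested files
	FileSize         int64  // size of the stored (archival) bytes
	MD5              string // stored checksum, computed over the original
	OriginalFileName string // empty when the file was not ingested
	OriginalFileSize int64
}

// surfaced is what the backend would expose for one object.
type surfaced struct {
	Name string
	Size int64
	MD5  string // empty means hash checks do not apply
}

// surface picks name, size and hash according to ingest_format.
func surface(f dataFile, ingestFormat string) (surfaced, error) {
	ingested := f.OriginalFileName != ""
	switch ingestFormat {
	case "", "original":
		if ingested {
			return surfaced{Name: f.OriginalFileName, Size: f.OriginalFileSize, MD5: f.MD5}, nil
		}
	case "archival":
		if ingested {
			// The stored MD5 belongs to the original bytes, so it is withheld.
			return surfaced{Name: f.Filename, Size: f.FileSize}, nil
		}
	default:
		return surfaced{}, fmt.Errorf("invalid ingest_format %q", ingestFormat)
	}
	// Non-ingested files look the same in either mode.
	return surfaced{Name: f.Filename, Size: f.FileSize, MD5: f.MD5}, nil
}

func main() {
	f := dataFile{Filename: "data.tab", FileSize: 120, MD5: "abc",
		OriginalFileName: "data.csv", OriginalFileSize: 100}
	fmt.Println(surface(f, "original"))
	fmt.Println(surface(f, "archival"))
}
```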

Read-only. Put/Update/Remove/Mkdir/Rmdir return errors; uploads go through Dataverse's UI / Native API.

Layout. Changes live in backend/doi: the Dataverse provider in dataverse.go, the shared Fs/Object/read path in doi.go, wire types in api/dataversetypes.go. Docs in docs/content/doi.md.

Tests. 22 Dataverse-focused unit tests (dataverse_internal_test.go, httptest-driven) covering: direct-mode listing (root / subdir / unknown dir / root-is-file), /tree listing + pagination + forwarded ingest originals, ingest original vs archival + invalid-format rejection, full and range reads, mid-stream resume (asserts the body keeps streaming and that a re-fetch happened), transient-status retry, token-strip on cross-host redirect, read-only enforcement, host+PID validation, and attributed restricted/embargoed/bad-auth errors. The standard fstests harness (TestIntegration) exercises a configured remote.

Manually verified end-to-end against:

  • demo.dataverse.org as guest (no token) — list + range + rclone check MD5 match. Exercises S3-direct.
  • Local Dataverse + MinIO with download-redirect=true — S3-direct with --multi-thread-streams; range slices byte-identical to source.
  • Local Dataverse + file-storage driver — proxy mode; full reads + MD5s match Dataverse-reported hashes.

Limitations. Read-only by design; file list frozen at NewFs time so version bumps need a remount; restricted files can appear in listings if the token can list-but-not-read; metadata blocks aren't surfaced as objects.

Downstream. gdcc/dataverse-recipes#35 packages this into a one-command Docker image (mount a dataset, or publish it as a personal Globus endpoint for HPC transfers). It builds against this branch and switches to upstream once merged.

Was the change discussed in an issue or in the forum before?

Discussed in this PR thread with @ncw — the standalone-backend vs backend/doi-direct-mode question, resolved by folding into doi.

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

ErykKul commented May 28, 2026

Author

Refs gdcc/dataverse-recipes#35

ncw commented May 28, 2026

Member

As you noted we have a doi backend https://rclone.org/doi/ with a dataverse provider already. Why can't this be a patch to the doi backend?

ErykKul commented May 28, 2026

Author

Loading the doi backend (that is meant to be a generic resolver also covering Zenodo and Invenio) with Dataverse-specific things like token / version / ingest_format / bypassing-doi.org-resolution felt wrong architecturally; at that point it stops being a DOI resolver and becomes a Dataverse client wearing the doi name. Deleting the existing doi-dataverse provider in favor of delegating it to a new backend also felt wrong. There is nothing wrong with Flora's design for its audience.

That's why I went with the third option: a separate Dataverse backend that addresses installs directly via host + dataset_pid. It works across PID schemes beyond DOI: hdl: Handle (e.g. dataverse.fgv.br Brazil, data.brin.go.id Indonesia, abacus.library.ubc.ca Canada, lida.dataverse.lt, datasets.up.edu.pe) and perma: PermaLink for fully local PIDs (added in Dataverse 5.14, expanded in 6.2; e.g. lore.list.lu Luxembourg, datos.unlp.edu.ar Argentina). The new backend is also tuned for the Globus endpoint case: URL caching with TTL, singleflight first-touch dedup, and in-place byte-resume on mid-transfer URL expiry. I think the Globus endpoint use case is the most valuable extension on top of the mount, as it adds free Globus transfers to all Dataverse installations without any modifications to their setups.

One thing worth flagging: Flora's implementation matches rclone idiom (lib/rest, lib/pacer, lib/cache), which is the right call for a backend that primarily reuses standard plumbing. I went with bare http.Client because that's what I'm comfortable with, and the specific things I needed (Range: bytes=0-0 probe without redirect-follow, per-(file, format) URL cache with X-Amz-Expires TTL, singleflight dedup, in-place body resume across mid-stream errors) didn't feel like where lib/rest would add value. Could be wrong about that; lib/rest would likely have worked too with a custom redirect policy.

It's possible I got some of this wrong, or was too eager about a specific use case to see the broader fit; happy to be told. Open to discussion.

@ErykKul ErykKul force-pushed the dataverse-backend branch from cfae02d to 087eb2d on May 29, 2026 12:53

ErykKul commented May 29, 2026

Author

Quick heads-up before you dig in: I've pushed a round of polish to this branch since opening it; mainly a lazy, paginated listing path (feature-detected, with a fallback to the original whole-version listing) plus some cleanup to match rclone's usual lib/rest plumbing. The lazy path builds on a new dataset tree endpoint we're adding to Dataverse over in IQSS/dataverse#12382; it's optional and feature-detected, so the backend still works against any current release without it. No urgency at all, just flagging that it's freshened up ahead of your first look. Thanks for taking it on!

ncw commented May 29, 2026

Member

I don't like the substantial overlap with the existing backend/doi code. backend/doi/dataverse.go lists the same /api/datasets/:persistentId/ endpoint, produces a near-identical Object (remote/contentURL/size/modTime/md5/contentType), is read-only, synthesises the directory tree from the flat file list, handles non-compliant redirects in Open, and already implements your ingest_format=original behaviour (it swaps in OriginalFileName/OriginalFileSize/OriginalFileFormat). So a large fraction of dataverse.go plus the api/ wire types duplicate code that's already in tree.

In your PR description "Relationship to backend/doi" section you describe three mechanisms as the differentiators, but I can't find them in the diff:

  • singleflight-deduped URL fetches - no golang.org/x/sync import; readers look independent.
  • per-(file,format) access-URL cache with TTL from X-Amz-Expires + evict-on-403 - there's no URL cache; X-Amz-Expires only appears in a test fixture.
  • a Range: bytes=0-0 probe with CheckRedirect: ErrUseLastResponse - the code comments actually say "There is no separate access-URL probe" / "No separate probe."

What's in the code is the simpler resumingReader (re-fetch Range: bytes=N- and resume on a body error, bounded to 5) plus a redirect-following client that strips the token on the cross-host hop. Could you update the PR description to match? It matters here because it's also the core of the "different scope" argument.

The new capabilities are: direct host+dataset_pid addressing (no doi.org resolution, so local/staging/unregistered installs and unpublished drafts work), token auth via X-Dataverse-Key, version selection, ingest_format=archival, /tree lazy pagination, and resume-in-place. None of these are incompatible with backend/doi. They read to me as enhancements to its Dataverse provider rather than a reason for a parallel backend.

The biggest gap to close in backend/doi is that it always resolves through doi.org first (resolveEndpoint -> resolveDoiURL). A "direct mode" that skips resolution when host+dataset_pid are supplied would let us get this functionality in the doi backend.

// Options gains three fields:
type Options struct {
    Doi               string `config:"doi"`
    Provider          string `config:"provider"`
    DoiResolverAPIURL string `config:"doi_resolver_api_url"`
    Host              string `config:"host"`         // NEW: direct-mode base URL, skips doi.org
    DatasetPID        string `config:"dataset_pid"`  // NEW: persistentId for direct mode
    Token             string `config:"token"`        // NEW: X-Dataverse-Key (optional)
}

The exact meaning of the new config options could depend on the provider.

// init() gains options, e.g.:
}, {
    Name: "host",
    Help: `Base URL of the installation, e.g. https://demo.dataverse.org.

When set with dataset_pid, rclone addresses the dataset directly and skips
doi.org resolution. Lets you reach local/staging installs and unpublished
drafts whose PIDs aren't globally resolvable.`,
    Required: false,
    Advanced: true,
}, {
    Name:      "token",
    Help:      "API token (Dataverse X-Dataverse-Key). Blank means guest access.",
    Sensitive: true,
    Required:  false,
    Advanced:  true,
},
// ...plus dataset_pid and version (Dataverse-specific).

// resolveEndpoint branches before touching doi.org:
func resolveEndpoint(ctx context.Context, srv *rest.Client, pacer *fs.Pacer, opt *Options) (provider Provider, endpoint *url.URL, err error) {
    // Direct mode: caller gave us the installation + persistentId, no resolution needed.
    if opt.Host != "" && opt.DatasetPID != "" {
        base, err := url.Parse(opt.Host)
        if err != nil {
            return "", nil, fmt.Errorf("invalid host %q: %w", opt.Host, err)
        }
        switch strings.ToLower(opt.Provider) {
        case string(Dataverse), "": // default direct provider is Dataverse
            q := url.Values{}
            q.Add("persistentId", opt.DatasetPID)
            endpoint = base.ResolveReference(&url.URL{
                Path:     "/api/datasets/:persistentId/",
                RawQuery: q.Encode(),
            })
            return Dataverse, endpoint, nil
        default:
            return "", nil, fmt.Errorf("direct mode not supported for provider %q", opt.Provider)
        }
    }

    // Resolved mode: existing behaviour, unchanged.
    resolvedURL, err := resolveDoiURL(ctx, srv, pacer, opt)
    // ...
}

Token auth would then be a one-liner where the Dataverse provider builds its requests (set X-Dataverse-Key when opt.Token != "", and strip it on cross-host redirects as you already do). version becomes a query param on the endpoint. The archival variant of ingest_format slots into the existing original-handling block. /tree paging and the resume-in-place wrapper are the two larger additions - and the resume wrapper, if moved into the shared Object.Open, would also benefit the Invenio/Zenodo providers.

So my suggestion is to try to avoid the code duplication by folding these features into the backend/doi Dataverse provider via the direct-mode branch above.

@ErykKul ErykKul force-pushed the dataverse-backend branch from 087eb2d to 3c457f0 on May 29, 2026 22:03
@ErykKul ErykKul force-pushed the dataverse-backend branch from 3c457f0 to 112b5a7 on May 29, 2026 22:18

ErykKul commented May 29, 2026

Author

Hi @ncw

I've now folded everything into the existing backend/doi Dataverse provider as a "direct mode" rather than maintaining a parallel backend:

  • host + dataset_pid address a dataset directly, skipping doi.org resolution. Drafts and restricted datasets with no public DOI work too. Plus token, version, and ingest_format (and you're right, ingest_format=original was already there).
  • Each read is a single redirect-following GET (no per-file probe) with transparent range-resume. The speedup is just from dropping now-redundant per-file calls and avoiding the unnecessary 10 ms pacer sleeps they triggered; the pacer still paces the listing/metadata calls. It also speeds up the existing Invenio/Zenodo reads and gives them resume for free.
  • Listing uses the lazy, paginated /tree endpoint when present (from IQSS/dataverse#12382, "Reusable React components on JSF: file uploader and lazy file tree view (#6691, #12179)"; feature-detected) and falls back to the whole-version list otherwise.

It's a single commit on top of backend/doi now, and the standalone backend/dataverse is gone.

For the record on the history: the original low-level implementation did have the singleflight dedup, the X-Amz-Expires TTL URL cache, and the bytes=0-0 probe. When I moved to lib/rest plumbing (per your steer toward repo idiom) I dropped all three in favour of the single redirect-following GET + resumingReader, since the cache/probe stopped earning their complexity once reads go through the standard path. Sorry for the confusion. I will update the PR description.

@ErykKul ErykKul changed the title from "dataverse: add new read-only backend" to "doi: add Dataverse direct mode with token, versions, ingest and tree" on May 30, 2026