fix(spark): S3/GCS PyArrow filesystem resolution for staging paths #6442

Merged
ntkathole merged 3 commits into feast-dev:master from abhijeet-dhumal:fix/spark-staging-filesystem
Jun 3, 2026

@abhijeet-dhumal abhijeet-dhumal commented May 27, 2026

Contributor

What this PR does / why we need it

Staging parquet reads fail on S3-compatible stores (MinIO, Ceph, custom AWS endpoints) because raw s3:// / s3a:// URIs are passed directly to pyarrow.dataset:

FileNotFoundError: s3://my-bucket/...

Replaces _normalize_staging_paths with _resolve_staging_filesystem, which builds the correct pyarrow.fs.S3FileSystem or pyarrow.fs.GcsFileSystem from the URI scheme. It picks up AWS_ENDPOINT_URL_S3 (for MinIO/LocalStack) and AWS_DEFAULT_REGION from the environment. Local and file:// paths pass through unchanged.
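For readers who want a feel for the resolution logic described above, here is a minimal sketch. It is not the code merged in this PR: the helper names plan_staging_filesystem and build_filesystem are hypothetical, and the exact region fallback behaviour of the real function is not shown in this conversation, so the sketch only sets a region when AWS_DEFAULT_REGION is present. It assumes Python 3.9+ for str.removeprefix.

```python
import os
from typing import Any, Dict, List, Mapping, Optional, Tuple


def plan_staging_filesystem(
    paths: List[str], env: Optional[Mapping[str, str]] = None
) -> Tuple[Optional[str], Dict[str, Any], List[str]]:
    """Hypothetical helper: choose a filesystem kind, its kwargs, and bare paths."""
    env = os.environ if env is None else env
    if not paths:
        return None, {}, []
    sample = paths[0]

    if sample.startswith(("s3://", "s3a://")):
        kwargs: Dict[str, Any] = {}
        region = env.get("AWS_DEFAULT_REGION")
        if region:
            kwargs["region"] = region
        endpoint = env.get("AWS_ENDPOINT_URL_S3")
        if endpoint:
            # S3FileSystem takes the scheme and the host[:port] as separate arguments.
            kwargs["scheme"] = "https" if endpoint.startswith("https://") else "http"
            kwargs["endpoint_override"] = endpoint.removeprefix("https://").removeprefix("http://")
        stripped = [p.removeprefix("s3a://").removeprefix("s3://") for p in paths]
        return "s3", kwargs, stripped

    if sample.startswith("gs://"):
        return "gcs", {}, [p.removeprefix("gs://") for p in paths]

    # Local and file:// paths pass through unchanged, with no explicit filesystem.
    return None, {}, list(paths)


def build_filesystem(kind: Optional[str], kwargs: Dict[str, Any]):
    """Instantiate the PyArrow filesystem; pyarrow is imported only when needed."""
    if kind is None:
        return None
    import pyarrow.fs as pafs

    if kind == "s3":
        return pafs.S3FileSystem(**kwargs)
    return pafs.GcsFileSystem(**kwargs)
```

With AWS_ENDPOINT_URL_S3=http://minio:9000 set, the resulting filesystem and stripped paths could then be handed to pyarrow.dataset.dataset(stripped, filesystem=fs, format="parquet"). Splitting the pure planning step from the pyarrow construction is a choice made here only to keep the sketch easy to unit test.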

Which issue(s) this PR fixes

Enables staging path reads for on-prem / private cloud Spark deployments (MinIO, Ceph, LocalStack).

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests — 12 cases covering S3 endpoint override, GCS, local, file://, MinIO, region fallback
  • Manual tests — staging reads from MinIO endpoint

@abhijeet-dhumal abhijeet-dhumal changed the title Fix/spark staging filesystem fix(spark): S3/GCS PyArrow filesystem for staging and offline-only feature view validation May 27, 2026
@abhijeet-dhumal abhijeet-dhumal force-pushed the fix/spark-staging-filesystem branch 2 times, most recently from 7c3fe0f to 394f421 on May 27, 2026 12:25
@abhijeet-dhumal abhijeet-dhumal changed the title fix(spark): S3/GCS PyArrow filesystem for staging and offline-only feature view validation fix(spark): S3/GCS PyArrow filesystem resolution for staging paths May 27, 2026
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review May 28, 2026 06:23
@abhijeet-dhumal abhijeet-dhumal requested a review from a team as a code owner May 28, 2026 06:23

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

)
kwargs["scheme"] = "https" if endpoint.startswith("https") else "http"
fs = pafs.S3FileSystem(**kwargs)
stripped = [p.replace("s3a://", "").replace("s3://", "") for p in paths]

Member

Better to use p.removeprefix("s3a://").removeprefix("s3://") here; the same applies to the https/http handling and the other paths.

Contributor Author

Thanks for the catch! Updated to use removeprefix() for both the S3 path stripping and the endpoint scheme removal.
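To illustrate why the reviewer's suggestion matters, the short sketch below compares the two approaches on a deliberately awkward object key. The helper names and the sample path are made up for illustration. str.replace removes every occurrence of the substring, while str.removeprefix (Python 3.9+) only touches the start of the string.

```python
def strip_with_replace(path: str) -> str:
    # Removes the scheme text wherever it appears, including inside the key.
    return path.replace("s3a://", "").replace("s3://", "")


def strip_with_removeprefix(path: str) -> str:
    # Removes the scheme only when it is the leading prefix.
    return path.removeprefix("s3a://").removeprefix("s3://")


tricky = "s3://my-bucket/exports/s3://legacy/part-0.parquet"

print(strip_with_replace(tricky))
# my-bucket/exports/legacy/part-0.parquet  (key silently altered)

print(strip_with_removeprefix(tricky))
# my-bucket/exports/s3://legacy/part-0.parquet  (key preserved)
```

For ordinary paths such as s3a://my-bucket/stage/part-0.parquet both helpers return the same result, so the difference only shows up on unusual keys.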

return fs, stripped

if sample.startswith("gs://"):
    import pyarrow.fs as pafs

Member

The same import is repeated multiple times.

Contributor Author

Good point. Hoisted import pyarrow.fs as pafs above the S3/GCS branches so it's imported once.
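The shape of that change might look roughly like the sketch below. This is an illustration rather than the merged code: the function name resolve_sketch is hypothetical, endpoint and region handling is omitted, and the filesystems are built with default arguments. The point is only that the local-path case returns early, and a single function-level import then serves both cloud branches.

```python
from typing import List


def resolve_sketch(paths: List[str]):
    """Hypothetical outline showing one hoisted import shared by both branches."""
    sample = paths[0]

    # Local and file:// paths need no PyArrow filesystem, so return before importing.
    if not sample.startswith(("s3://", "s3a://", "gs://")):
        return None, list(paths)

    # Imported once here instead of once per branch.
    import pyarrow.fs as pafs

    if sample.startswith("gs://"):
        return pafs.GcsFileSystem(), [p.removeprefix("gs://") for p in paths]

    stripped = [p.removeprefix("s3a://").removeprefix("s3://") for p in paths]
    return pafs.S3FileSystem(), stripped
```

Keeping the import inside the function (rather than at module top level) is an assumption made here so that code paths handling only local files do not require pyarrow.fs to be loaded.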

@abhijeet-dhumal abhijeet-dhumal force-pushed the fix/spark-staging-filesystem branch from 394f421 to feec382 on June 1, 2026 07:27
@abhijeet-dhumal abhijeet-dhumal requested a review from ntkathole June 1, 2026 07:36
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@ntkathole ntkathole force-pushed the fix/spark-staging-filesystem branch from feec382 to 63adf09 on June 3, 2026 08:03
@ntkathole ntkathole merged commit ae50414 into feast-dev:master Jun 3, 2026
40 of 43 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Jun 13, 2026
# [0.64.0](v0.63.0...v0.64.0) (2026-06-13)

### Bug Fixes

* Add async_supported property to RedisOnlineStore ([9b088fe](9b088fe))
* Add missing feast init templates to operator CRD and enhance persistence documentation ([1941d4d](1941d4d))
* Allow to publish from reference branch ([5458ec8](5458ec8))
* API calls list ([4203eb7](4203eb7))
* **bigquery:** Enable list inference for parquet loads in offline_write_batch ([9243497](9243497)), closes [#5845](#5845)
* Bump grpcio dependencies ([07b4782](07b4782))
* **compute-engine/local:** Honor field_mapping on join keys in dedup + join nodes ([#6395](#6395)) ([bd01824](bd01824))
* **dynamodb:** Avoid tag race condition by using diff-based tag updates ([#6479](#6479)) ([bad2b7d](bad2b7d)), closes [#6418](#6418)
* **dynamodb:** Fix mypy type for _build_projection_expression return ([217b4da](217b4da))
* Fix intermittent async test failures for DynamoDB and Redis ([63c5eb1](63c5eb1))
* Fix mongodb blog title ([57d28d4](57d28d4))
* Fix shared SQL registry crash - avoid unnecessary UDF deserialization in proto cache building ([ac588d7](ac588d7))
* Fix SparkRetrievalJob.persist() failing for SparkSource ([209d7cd](209d7cd))
* Fixed formatting and image for mongo blog ([#6377](#6377)) ([f8389fb](f8389fb))
* Fixes for ray source ([7f592a4](7f592a4))
* **go:** skip registry refresh when cache_ttl_seconds <= 0 ([97ed40c](97ed40c))
* Handle array of strings columns in Athena materialization ([#6324](#6324)) ([4ed0278](4ed0278))
* make milvus VARCHAR max_length configurable, remove hardcoded 512 limit ([3b98c22](3b98c22))
* **operator:** Set appProtocol: grpc on registry gRPC Service ([#6367](#6367)) ([c9ae2b4](c9ae2b4))
* PyJWT 2.10+ added validation that rejects empty HMAC keys ([e756ffe](e756ffe))
* RemoteOnlineStore sends all features in a single HTTP request ([8f187dd](8f187dd))
* Remove registry proto dump to enforce RBAC and add permission checks to Commit/Refresh RPCs ([328431f](328431f))
* Remove selector migration job - no longer needed ([51c325e](51c325e))
* replace broken .claude skill symlink with correct relative path ([4541690](4541690))
* Replace selector label strip patch with migration Job for upgrade-safe selector uniqueness ([00dea50](00dea50))
* Scope feature view name conflict check to current project in file-based registry ([#6369](#6369)) ([a4fde83](a4fde83)), closes [#6209](#6209)
* **snowflake:** Stop double-quoting connection identifiers ([#6462](#6462)) ([e914d59](e914d59))
* **spark:** S3/GCS PyArrow filesystem resolution for staging paths ([#6442](#6442)) ([ae50414](ae50414))
* **trino:** Clean up temporary entity tables after retrieval ([#6381](#6381)) ([d86b13d](d86b13d)), closes [#6306](#6306)
* Update go-feature-server base image to Go 1.25 and fix operator Dockerfile COPY permissions ([86ef0bc](86ef0bc))

### Features

* [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI ([#6202](#6202)) ([5458c37](5458c37))
* Add apache flink compute engine ([#6476](#6476)) ([9636d6a](9636d6a))
* Add demo notebooks for users ([e362173](e362173))
* Add enabled/disabled toggle for feature views ([#6401](#6401)) ([5f1fa0d](5f1fa0d)), closes [#6395](#6395)
* Add Label View to init template ([ec272d5](ec272d5))
* Add mTLS support to remote registry gRPC client ([#6474](#6474)) ([c9602d8](c9602d8))
* Add Prometheus gauges for FeatureStore installation telemetry ([#6354](#6354)) ([1b681b7](1b681b7))
* Adds registry REST API endpoints for managing entities, data sources, and feature views ([#6413](#6413)) ([f77bd1d](f77bd1d))
* Allow CRUD on entities, data sources, and feature views from UI ([#6412](#6412)) ([2321c07](2321c07))
* Allow default openlineage configuration ([#6467](#6467)) ([276b6df](276b6df))
* **bigquery:** Support DATE-type event timestamp columns ([#6362](#6362)) ([753dee5](753dee5)), closes [#2530](#2530)
* **cli:** Add `feast projects delete` command (closes [#5095](#5095)) ([#6318](#6318)) ([1a4b96c](1a4b96c))
* Data Quality Monitoring added in feast UI ([#6422](#6422)) ([fa271be](fa271be))
* **dynamodb:** Use ProjectionExpression when requested_features is set ([0adc906](0adc906)), closes [#6058](#6058)
* Enhance DataSource and FeatureView modals with error handling and submission states ([96d7169](96d7169))
* Expose registry endpoints on feature server for MCP access ([f77981c](f77981c))
* Feast First-Class LabelView Implementation ([#6292](#6292)) ([c0e7e5d](c0e7e5d))
* Feast-MLflow Integration ([#6235](#6235)) ([7279c75](7279c75))
* Operational metrics for offline store and SOX metrics for both ([#6340](#6340)) ([65b1b80](65b1b80))
* Pre-compute feature service ([8011550](8011550))
* REST API-backed UI for RBAC compatibility and per-page lazy loading ([#6414](#6414)) ([6ae80af](6ae80af))
* Support non-string map key types ([#6382](#6382)) ([#6383](#6383)) ([728aa2e](728aa2e))
* Update FeatureStore CRD with DRA Fields ([01241e4](01241e4))

### Performance Improvements

* Cache feature view resolution in get_online_features to reduce per-request overhead ([55c2f18](55c2f18))
* Optimize feature serving latency with batched async Redis, cached checks fix ([103809a](103809a))
* Replace MessageToDict with optimized custom dict builder ([#6015](#6015)) ([9902064](9902064))