The current Extraction functions make an assumption that writers will always write to a local directory. This limits flexibility and prevents integration with various storage backends like:
Block storage systems
HTTP APIs
In-memory storage
Temporary file systems
etc
Proposed Solution:
Introduce a WritableFS interface that abstracts file system operations, allowing for multiple storage implementations. This provides:
// WritableFS defines an interface for a writable file system or directory-like storage.
type WritableFS interface {
	// Base returns the base path or identifier of the file system (e.g., directory path or "memory").
	Base() string

	// WriteFile writes data to the named file, creating it if necessary.
	// If the file exists, it may overwrite it (implementation-dependent).
	// The perm parameter specifies the file permissions (e.g., 0644).
	WriteFile(name string, data []byte, perm fs.FileMode) error

	// Mkdir creates a directory named path, along with any necessary parents.
	// The perm parameter specifies the directory permissions (e.g., 0755).
	Mkdir(path string, perm fs.FileMode) error
}
Better separation of concerns
Flexibility to use different storage backends
Cleaner API surface
Improved testability with in-memory implementations
Key Changes:
Added WritableFS
Implemented two concrete types:
OsFS for local file system operations
MemFS for in-memory storage (useful for testing)
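To make the two types concrete, here is a minimal sketch of how OsFS and MemFS could satisfy WritableFS. It is not the code from this PR: the struct fields, the 0755 default for parent directories, and the ReadFile helper on MemFS are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path"
	"path/filepath"
	"sync"
)

// WritableFS mirrors the interface proposed in this PR.
type WritableFS interface {
	Base() string
	WriteFile(name string, data []byte, perm fs.FileMode) error
	Mkdir(path string, perm fs.FileMode) error
}

// OsFS writes below a base directory on the local file system.
type OsFS struct{ base string }

func NewOsFS(base string) *OsFS { return &OsFS{base: base} }

func (o *OsFS) Base() string { return o.base }

func (o *OsFS) WriteFile(name string, data []byte, perm fs.FileMode) error {
	full := filepath.Join(o.base, name)
	// Create missing parent directories first (0755 is an assumed default).
	if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
		return err
	}
	return os.WriteFile(full, data, perm)
}

func (o *OsFS) Mkdir(p string, perm fs.FileMode) error {
	return os.MkdirAll(filepath.Join(o.base, p), perm)
}

// MemFS keeps file contents in a map; directories are implicit.
type MemFS struct {
	mu    sync.Mutex
	files map[string][]byte
}

func NewMemFS() *MemFS { return &MemFS{files: map[string][]byte{}} }

func (m *MemFS) Base() string { return "memory" }

// WriteFile ignores the permission bits, since nothing touches the disk.
func (m *MemFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	// Copy so later changes to the caller's slice do not alter stored data.
	m.files[path.Clean(name)] = append([]byte(nil), data...)
	return nil
}

// Mkdir is a no-op because MemFS has no explicit directories.
func (m *MemFS) Mkdir(string, fs.FileMode) error { return nil }

// ReadFile is a hypothetical helper so tests can inspect what was written.
func (m *MemFS) ReadFile(name string) ([]byte, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	data, ok := m.files[path.Clean(name)]
	return data, ok
}

func main() {
	var out WritableFS = NewMemFS()
	if err := out.WriteFile("images/page_1.png", []byte("png bytes"), 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	data, _ := out.(*MemFS).ReadFile("images/page_1.png")
	fmt.Printf("%s holds %d bytes\n", out.Base(), len(data))
}
```

Because extraction code only sees the WritableFS interface, a test can pass NewMemFS() and assert on the stored bytes without touching the disk.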
Example Usage:
// Using local filesystem
osFs := NewOsFS("/output/dir")

// Using in-memory storage (for testing)
memFs := NewMemFS()

// Normal way:
ExtractContent(inFile, "/output/dir", pages, conf)

// Using WritableFS:
ExtractContentFS(inFile, NewOsFS("/output/dir"), pages, conf)
Benefits:
Backward compatible (can still use local file system)
Enables new storage integrations without changing core logic
Makes code more testable
Follows Go's interface-oriented design principles
Opens possibilities for future storage implementations (S3, GCS, etc.)
This is production code, so baby steps are required; hence I wanted to start with ExtractContentFS.
This is a nice idea. It could also clear the way for writing to stdout, for piped processing. I regularly use an operation of the form pdfcpu | jq | pdfcpu, but of course at the moment it creates temp files.
The core idea is abstracting where extracted PDF content gets written. Instead of hardcoding filesystem operations, the PDF extraction logic delegates writing to an interface (WritableFS). This enables:
Writing to disk (via OsFS).
Writing to memory (via MemFS, useful for tests).
Writing to stdout or pipes (see below).
Future backends (e.g., cloud storage, ZIP files, databases).
Example
1. Support for Pipes/Stdout (Your Example)
The comment you referenced highlights a real pain point: temporary files break clean piping (e.g., pdfcpu | jq | pdfcpu). With WritableFS, you could add a StdoutFS implementation:
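A minimal sketch of what such a StdoutFS might look like follows. It is an assumption rather than code from this PR: the type, its constructor, and the choice to ignore file names and simply stream each file's bytes to the writer are all hypothetical, and a real version would need a framing scheme if several files are extracted in one run.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
)

// WritableFS mirrors the interface proposed in this PR.
type WritableFS interface {
	Base() string
	WriteFile(name string, data []byte, perm fs.FileMode) error
	Mkdir(path string, perm fs.FileMode) error
}

// StdoutFS (hypothetical) streams every written file to a single writer.
type StdoutFS struct{ w io.Writer }

// NewStdoutFS returns a StdoutFS that writes to the process's standard output.
func NewStdoutFS() *StdoutFS { return &StdoutFS{w: os.Stdout} }

func (s *StdoutFS) Base() string { return "stdout" }

// WriteFile ignores the name and permissions and forwards the bytes as-is.
func (s *StdoutFS) WriteFile(_ string, data []byte, _ fs.FileMode) error {
	_, err := s.w.Write(data)
	return err
}

// Mkdir is a no-op: a stream has no directories.
func (s *StdoutFS) Mkdir(string, fs.FileMode) error { return nil }

// Compile-time check that StdoutFS satisfies the interface.
var _ WritableFS = (*StdoutFS)(nil)

func main() {
	// Swap in a buffer to show what a downstream process would receive.
	var buf bytes.Buffer
	out := &StdoutFS{w: &buf}
	_ = out.Mkdir("ignored", 0o755)
	if err := out.WriteFile("meta.json", []byte(`{"pages":3}`), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println(buf.String())
}
```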
No changes to ExtractContentFS are needed; just pass S3FS instead of OsFS.
4. Consistent Directory Handling
Both OsFS and MemFS:
Normalize paths (e.g., clean/../path → path).
Auto-create parent directories (no "directory not found" errors).
Respect file permissions (even if just in-memory for tests).
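The path handling described above could be sketched roughly as follows. The helper names normalize and parents are hypothetical, and rejecting names that escape the base is an added assumption, not something this PR states.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// normalize cleans a slash-separated name. As an assumption of this sketch,
// it also rejects absolute names and names that climb above the base.
func normalize(name string) (string, error) {
	clean := path.Clean(name)
	switch {
	case clean == ".":
		return "", errors.New("empty file name")
	case path.IsAbs(clean), clean == "..", strings.HasPrefix(clean, "../"):
		return "", fmt.Errorf("name %q escapes the base", name)
	}
	return clean, nil
}

// parents lists the directories that must exist before name can be written,
// ordered from the outermost directory to the innermost.
func parents(name string) []string {
	var dirs []string
	for dir := path.Dir(name); dir != "." && dir != "/"; dir = path.Dir(dir) {
		dirs = append([]string{dir}, dirs...)
	}
	return dirs
}

func main() {
	for _, name := range []string{"clean/../path", "a/b/c.txt", "../outside"} {
		clean, err := normalize(name)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%q is stored as %q, parents to create: %v\n", name, clean, parents(clean))
	}
}
```

An implementation would call normalize at the top of WriteFile and Mkdir, then create each entry returned by parents before storing the file.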
Why This Design Matters
Separation of concerns: PDF extraction doesn’t care where data goes.
Unix philosophy: Enable pipes by supporting stdout (via StdoutFS).
Future-proof: New storage backends don’t require refactoring.
For your specific use case (avoiding temp files in pipes), the missing piece is a StdoutFS implementation, which would be trivial to add thanks to WritableFS. The interface makes this explicitly possible without hacks.
The only thing is that I am not sure whether the _ fs.FileMode parameter is even needed.