Introduce `WritableFS` by olekukonko · Pull Request #1153 · pdfcpu/pdfcpu

olekukonko · 2025-05-10T22:08:19Z

The current extraction functions assume that output is always written to a local directory. This limits flexibility and prevents integration with other storage backends such as:

Block storage systems
HTTP APIs
In-memory storage
Temporary file systems
etc

Proposed Solution:
Introduce a WritableFS interface that abstracts file system operations, allowing for multiple storage implementations:

// WritableFS defines an interface for a writable file system or directory-like storage.
type WritableFS interface {
    // Base returns the base path or identifier of the file system (e.g., directory path or "memory").
    Base() string

    // WriteFile writes data to the named file, creating it if necessary.
    // If the file exists, it may overwrite it (implementation-dependent).
    // The perm parameter specifies the file permissions (e.g., 0644).
    WriteFile(name string, data []byte, perm fs.FileMode) error

    // Mkdir creates a directory named path, along with any necessary parents.
    // The perm parameter specifies the directory permissions (e.g., 0755).
    Mkdir(path string, perm fs.FileMode) error
}

This provides:

Better separation of concerns
Flexibility to use different storage backends
Cleaner API surface
Improved testability with in-memory implementations

Key Changes:

Added the WritableFS interface
Implemented two concrete types:
- OsFS for local file system operations
- MemFS for in-memory storage (useful for testing)
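
To make the disk-backed type concrete, here is a minimal sketch of what an OsFS could look like. It is an assumption based on the interface in this description, not the code in this PR's commits; the root field and the writeAndReadBack helper are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// WritableFS mirrors the interface proposed in this PR.
type WritableFS interface {
	Base() string
	WriteFile(name string, data []byte, perm fs.FileMode) error
	Mkdir(path string, perm fs.FileMode) error
}

// OsFS writes below a root directory on the local file system.
// Hypothetical sketch; the PR's actual OsFS may differ.
type OsFS struct{ root string }

func NewOsFS(root string) *OsFS { return &OsFS{root: root} }

func (o *OsFS) Base() string { return o.root }

// WriteFile creates missing parent directories, then writes the file.
func (o *OsFS) WriteFile(name string, data []byte, perm fs.FileMode) error {
	full := filepath.Join(o.root, name)
	if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
		return err
	}
	return os.WriteFile(full, data, perm)
}

// Mkdir creates path and any missing parents below the root.
func (o *OsFS) Mkdir(path string, perm fs.FileMode) error {
	return os.MkdirAll(filepath.Join(o.root, path), perm)
}

// writeAndReadBack writes one file through the interface into a fresh
// temporary directory and returns what ends up on disk.
func writeAndReadBack(name string, data []byte) (string, error) {
	dir, err := os.MkdirTemp("", "writablefs")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	var fsys WritableFS = NewOsFS(dir)
	if err := fsys.WriteFile(name, data, 0o644); err != nil {
		return "", err
	}
	got, err := os.ReadFile(filepath.Join(dir, name))
	return string(got), err
}

func main() {
	got, err := writeAndReadBack("pages/1.txt", []byte("hello"))
	fmt.Println(got, err) // prints "hello <nil>"
}
```

Creating parent directories inside WriteFile is a design choice of this sketch; an implementation could equally require callers to call Mkdir first.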

Example Usage:

// Using local filesystem
osFs := NewOsFS("/output/dir")

// Using in-memory storage (for testing)
memFs := NewMemFS()

// Existing API: write to a local directory.
ExtractContent(inFile, "/output/dir", pages, conf)

// New API: write through any WritableFS.
ExtractContentFS(inFile, NewOsFS("/output/dir"), pages, conf)

Benefits:

Backward compatible (can still use local file system)
Enables new storage integrations without changing core logic
Makes code more testable
Follows Go's interface-oriented design principles
Opens possibilities for future storage implementations (S3, GCS, etc.)
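
One way backward compatibility could be kept is to turn the directory-based function into a thin wrapper around the interface-based one. The sketch below illustrates that shape with hypothetical stand-ins (extract, extractFS, dirFS, listAfterExtract); it is not pdfcpu's actual API or the code in this PR.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// WritableFS mirrors the interface proposed in this PR.
type WritableFS interface {
	Base() string
	WriteFile(name string, data []byte, perm fs.FileMode) error
	Mkdir(path string, perm fs.FileMode) error
}

// dirFS is a minimal local-directory backend standing in for OsFS.
type dirFS struct{ root string }

func (d dirFS) Base() string { return d.root }

func (d dirFS) WriteFile(name string, data []byte, perm fs.FileMode) error {
	return os.WriteFile(filepath.Join(d.root, name), data, perm)
}

func (d dirFS) Mkdir(p string, perm fs.FileMode) error {
	return os.MkdirAll(filepath.Join(d.root, p), perm)
}

// extractFS stands in for the core logic: it only talks to the interface.
func extractFS(pages map[string][]byte, fsys WritableFS) error {
	if err := fsys.Mkdir(".", 0o755); err != nil {
		return err
	}
	for name, data := range pages {
		if err := fsys.WriteFile(name, data, 0o644); err != nil {
			return err
		}
	}
	return nil
}

// extract keeps the old directory-based signature and simply delegates.
func extract(pages map[string][]byte, outDir string) error {
	return extractFS(pages, dirFS{root: outDir})
}

// listAfterExtract runs the wrapper against a temp dir and lists the result.
func listAfterExtract(pages map[string][]byte) ([]string, error) {
	dir, err := os.MkdirTemp("", "extract")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	if err := extract(pages, dir); err != nil {
		return nil, err
	}
	entries, err := os.ReadDir(dir) // entries come back sorted by name
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

func main() {
	names, err := listAfterExtract(map[string][]byte{
		"2.txt": []byte("b"),
		"1.txt": []byte("a"),
	})
	fmt.Println(names, err) // prints "[1.txt 2.txt] <nil>"
}
```

With this shape, existing callers keep their signature while all writing goes through a single code path.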

Since this is production code, small steps are appropriate, so I wanted to start with ExtractContentFS.

Introduce writeable

add basic test

CLAassistant · 2025-05-10T22:08:25Z

All committers have signed the CLA.

hhrutter · 2025-05-12T07:54:54Z

Please create a feature request issue.
PRs are the result of issue discussions.
Thank you!

mdmcconnell · 2025-05-19T12:47:17Z

This is a nice idea. It could also clear the way for writing to stdout, for piped processing. I regularly use an operation of the form pdfcpu | jq | pdfcpu, but of course at the moment it creates temp files.

olekukonko · 2025-05-20T08:26:33Z

What Problem Does `WritableFS` Solve?

The core idea is abstracting where extracted PDF content gets written. Instead of hardcoding filesystem operations, the PDF extraction logic delegates writing to an interface (WritableFS). This enables:

Writing to disk (via OsFS).
Writing to memory (via MemFS, useful for tests).
Writing to stdout or pipes (see below).
Future backends (e.g., cloud storage, ZIP files, databases).

Example

1. Support for Pipes/Stdout (Your Example)

The comment you referenced highlights a real pain point: temporary files break clean piping (e.g., pdfcpu | jq | pdfcpu). With WritableFS, you could add a StdoutFS implementation:

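// StdoutFS streams file contents to standard output (needs "io/fs" and "os").
type StdoutFS struct{}

func (s *StdoutFS) Base() string { return "stdout" }

// WriteFile ignores name and perm and writes data to stdout.
func (s *StdoutFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
    _, err := os.Stdout.Write(data)
    return err
}

// Mkdir is a no-op because stdout has no directories.
func (s *StdoutFS) Mkdir(string, fs.FileMode) error { return nil }

// Usage: ExtractContentFS(pdfData, &StdoutFS{}, ...)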

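The extracted PDF content then streams directly to stdout with no temp files needed, which fits Unix-style pipelines. One caveat: if an extraction produces several files, this implementation writes them to stdout back to back with no separator.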
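
A slightly more general variant wraps any io.Writer rather than os.Stdout alone, which also makes the behavior testable with a buffer. This is only a sketch: WriterFS and capture are hypothetical names, not part of this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
)

// WriterFS sends every written file to a single io.Writer.
// Hypothetical type; file names and permissions are ignored.
type WriterFS struct{ W io.Writer }

func (w *WriterFS) Base() string { return "writer" }

func (w *WriterFS) WriteFile(_ string, data []byte, _ fs.FileMode) error {
	_, err := w.W.Write(data)
	return err
}

// Mkdir is a no-op because a stream has no directories.
func (w *WriterFS) Mkdir(string, fs.FileMode) error { return nil }

// capture writes each chunk as a separate "file" into a buffer-backed
// WriterFS and returns everything the writer received.
func capture(chunks ...string) (string, error) {
	var buf bytes.Buffer
	fsys := &WriterFS{W: &buf}
	for i, c := range chunks {
		name := fmt.Sprintf("%d.txt", i+1)
		if err := fsys.WriteFile(name, []byte(c), 0o644); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	// In a CLI, the same type would wrap os.Stdout.
	stdout := &WriterFS{W: os.Stdout}
	_ = stdout.WriteFile("ignored.txt", []byte("to stdout\n"), 0o644)

	got, _ := capture("page1;", "page2;")
	fmt.Println(got) // prints "page1;page2;"
}
```

Because the file name is dropped, a consumer downstream cannot tell where one file ends and the next begins; a real implementation would probably need a framing convention for multi-file output.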

2. Testability Without I/O Overhead

MemFS lets you verify what would be written without touching the disk:

func TestExtractContent(t *testing.T) {
    memFS := NewMemFS()
    if err := ExtractContentFS(pdfReader, memFS, []string{"1"}, nil); err != nil {
        t.Fatalf("ExtractContentFS: %v", err)
    }
    // Check that memFS.files["1_123.txt"] contains the expected data.
}

Faster tests: No filesystem cleanup.
More reliable: No permission/disk space issues.
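
For reference, an in-memory backend along these lines can be quite small. The sketch below is an assumption about how a MemFS might be built (a mutex-guarded map plus a ReadFile test helper), not the implementation in this PR.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sync"
)

// MemFS keeps written files in a map keyed by cleaned path.
// Hypothetical sketch; the PR's actual MemFS may differ.
type MemFS struct {
	mu    sync.Mutex
	files map[string][]byte
}

func NewMemFS() *MemFS { return &MemFS{files: make(map[string][]byte)} }

func (m *MemFS) Base() string { return "memory" }

func (m *MemFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	// Copy so later changes to the caller's slice do not alter stored content.
	m.files[path.Clean(name)] = append([]byte(nil), data...)
	return nil
}

// Mkdir is a no-op: directories are implied by the file keys.
func (m *MemFS) Mkdir(string, fs.FileMode) error { return nil }

// ReadFile is a test helper and not part of WritableFS.
func (m *MemFS) ReadFile(name string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	data, ok := m.files[path.Clean(name)]
	return string(data), ok
}

func main() {
	mem := NewMemFS()
	_ = mem.WriteFile("out/1.txt", []byte("page one"), 0o644)
	got, ok := mem.ReadFile("out/1.txt")
	fmt.Println(got, ok) // prints "page one true"
}
```

Exposing a read helper instead of the raw map keeps tests independent of the internal layout.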

3. Swappable Storage Backends

The same extraction logic works across backends. For example, to add S3 support:

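// S3FS sketches an S3-backed WritableFS; Base and Mkdir are omitted for brevity.
type S3FS struct{ Bucket string }

func (s *S3FS) WriteFile(name string, data []byte, _ fs.FileMode) error {
    // s3.Upload is a placeholder for whatever upload call the S3 client provides.
    return s3.Upload(s.Bucket, name, data)
}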

No changes to ExtractContentFS are needed: pass an *S3FS instead of an OsFS.
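
A remote backend has to implement all three interface methods, and a compile-time assertion is a cheap way to catch a missing one. The sketch below uses a hypothetical Uploader interface and ObjectFS adapter so it can run offline with a fake client; none of these names come from this PR or from a real S3 SDK.

```go
package main

import (
	"fmt"
	"io/fs"
)

// WritableFS mirrors the interface proposed in this PR.
type WritableFS interface {
	Base() string
	WriteFile(name string, data []byte, perm fs.FileMode) error
	Mkdir(path string, perm fs.FileMode) error
}

// Uploader is a hypothetical, minimal view of an object-store client.
type Uploader interface {
	Upload(bucket, key string, data []byte) error
}

// ObjectFS adapts an Uploader to WritableFS.
type ObjectFS struct {
	Bucket string
	Client Uploader
}

// Compile-time check that *ObjectFS provides every WritableFS method.
var _ WritableFS = (*ObjectFS)(nil)

func (o *ObjectFS) Base() string { return o.Bucket }

func (o *ObjectFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
	return o.Client.Upload(o.Bucket, name, data)
}

// Mkdir is a no-op: object stores have key prefixes, not real directories.
func (o *ObjectFS) Mkdir(string, fs.FileMode) error { return nil }

// fakeUploader records uploads so the adapter can be exercised offline.
type fakeUploader struct{ keys []string }

func (f *fakeUploader) Upload(bucket, key string, _ []byte) error {
	f.keys = append(f.keys, bucket+"/"+key)
	return nil
}

func main() {
	up := &fakeUploader{}
	var fsys WritableFS = &ObjectFS{Bucket: "docs", Client: up}
	_ = fsys.WriteFile("1.txt", []byte("x"), 0o644)
	fmt.Println(up.keys) // prints "[docs/1.txt]"
}
```

Depending on a small Uploader interface rather than a concrete SDK type keeps the adapter testable without network access.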

4. Consistent Directory Handling

Both OsFS and MemFS:

Normalize paths (e.g., clean/../path → path).
Auto-create parent directories (no "directory not found" errors).
Respect file permissions (even if just in-memory for tests).
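
The path normalization mentioned above can be sketched with the standard path package. The normalize helper and its rule of rejecting names that climb above the base are assumptions for illustration; the PR does not say how OsFS and MemFS treat such names.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

var errEscapes = errors.New("path escapes base")

// normalize cleans a slash-separated name, treats leading slashes as
// relative to the base, and rejects names that would climb above it.
// Hypothetical helper, not taken from the PR.
func normalize(name string) (string, error) {
	clean := path.Clean(strings.TrimLeft(name, "/"))
	if clean == ".." || strings.HasPrefix(clean, "../") {
		return "", errEscapes
	}
	return clean, nil
}

func main() {
	for _, in := range []string{"clean/../path", "a//b/./c", "../secret"} {
		out, err := normalize(in)
		fmt.Printf("%q -> %q, %v\n", in, out, err)
	}
	// "clean/../path" -> "path", <nil>
	// "a//b/./c" -> "a/b/c", <nil>
	// "../secret" -> "", path escapes base
}
```

Applying the same helper in every backend would keep file keys consistent between disk and in-memory storage.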

Why This Design Matters

Separation of concerns: PDF extraction doesn’t care where data goes.
Unix philosophy: Enable pipes by supporting stdout (via StdoutFS).
Future-proof: New storage backends don’t require refactoring.

For your specific use case (avoiding temp files in pipes), the missing piece is a StdoutFS implementation—which would be trivial to add thanks to WritableFS. The interface makes this explicitly possible without hacks.

The only open question is whether the fs.FileMode parameter is even needed.

olekukonko added 2 commits May 10, 2025 23:05

Create writeable.go

8c26f1c

Introduce writeable

Update extract_test.go

947c984

add basic test

olekukonko mentioned this pull request May 12, 2025

Introducing WritableFS #1155

Open

olekukonko wants to merge 2 commits into pdfcpu:master from olekukonko:master
