Introduce WritableFS #1153

Open
olekukonko wants to merge 2 commits into pdfcpu:master from olekukonko:master

Conversation

@olekukonko

The current extraction functions assume that output is always written to a local directory. This limits flexibility and prevents integration with other storage backends, such as:

  • Block storage systems
  • HTTP APIs
  • In-memory storage
  • Temporary file systems
  • etc

Proposed Solution:
Introduce a WritableFS interface that abstracts file system operations, allowing for multiple storage implementations:

// WritableFS defines an interface for a writable file system or directory-like storage.
type WritableFS interface {
    // Base returns the base path or identifier of the file system (e.g., directory path or "memory").
    Base() string

    // WriteFile writes data to the named file, creating it if necessary.
    // If the file exists, it may overwrite it (implementation-dependent).
    // The perm parameter specifies the file permissions (e.g., 0644).
    WriteFile(name string, data []byte, perm fs.FileMode) error

    // Mkdir creates a directory named path, along with any necessary parents.
    // The perm parameter specifies the directory permissions (e.g., 0755).
    Mkdir(path string, perm fs.FileMode) error
}

This provides:

  1. Better separation of concerns
  2. Flexibility to use different storage backends
  3. Cleaner API surface
  4. Improved testability with in-memory implementations

Key Changes:

  1. Added the WritableFS interface
  2. Implemented two concrete types:
    • OsFS for local file system operations
    • MemFS for in-memory storage (useful for testing)
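The OsFS type itself is not shown in this description. As a rough sketch of what a disk-backed implementation could look like, the following uses only the standard library. The `root` field, the `resolve` helper, and the 0755 mode for auto-created parent directories are assumptions for illustration, not the code from this PR.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// OsFS is a hypothetical disk-backed WritableFS rooted at a base directory.
type OsFS struct{ root string }

func NewOsFS(root string) *OsFS { return &OsFS{root: filepath.Clean(root)} }

// Base returns the root directory.
func (o *OsFS) Base() string { return o.root }

// resolve maps name to a path under the root. Cleaning a rooted copy of
// name first collapses leading ".." elements, so the result stays below root.
func (o *OsFS) resolve(name string) string {
	return filepath.Join(o.root, filepath.Clean(string(filepath.Separator)+name))
}

// Mkdir creates the directory and any missing parents.
func (o *OsFS) Mkdir(dir string, perm fs.FileMode) error {
	return os.MkdirAll(o.resolve(dir), perm)
}

// WriteFile creates missing parent directories (assumed mode 0755),
// then writes data, truncating an existing file.
func (o *OsFS) WriteFile(name string, data []byte, perm fs.FileMode) error {
	p := o.resolve(name)
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	return os.WriteFile(p, data, perm)
}

// demo writes one file below a temporary directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "osfs-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	w := NewOsFS(dir)
	if err := w.WriteFile("pages/1.txt", []byte("hello"), 0o644); err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(dir, "pages", "1.txt"))
	return string(b), err
}

func main() {
	content, err := demo()
	fmt.Println(content, err)
}
```

Whether names that point outside the root should be clamped, as here, or rejected with an error is a design choice the PR discussion does not settle.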

Example Usage:

// Using local filesystem
osFs := NewOsFS("/output/dir")

// Using in-memory storage (for testing)
memFs := NewMemFS()

// Existing API (writes to a local directory):
ExtractContent(inFile, "/output/dir", pages, conf)

// Using WritableFS:
ExtractContentFS(inFile, NewOsFS("/output/dir"), pages, conf)

Benefits:

  • Backward compatible (can still use local file system)
  • Enables new storage integrations without changing core logic
  • Makes code more testable
  • Follows Go's interface-oriented design principles
  • Opens possibilities for future storage implementations (S3, GCS, etc.)
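The backward-compatibility point above can be pictured as the old directory-based entry point delegating to the new interface-based one. The sketch below uses made-up names (`saveAll`, `saveAllFS`, `Writer`, `dirWriter`) rather than pdfcpu's actual functions, and trims the interface to a single method to stay short.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Writer is a trimmed-down stand-in for WritableFS (WriteFile only).
type Writer interface {
	WriteFile(name string, data []byte, perm fs.FileMode) error
}

// dirWriter writes files below a local directory.
type dirWriter struct{ dir string }

func (d dirWriter) WriteFile(name string, data []byte, perm fs.FileMode) error {
	return os.WriteFile(filepath.Join(d.dir, name), data, perm)
}

// saveAllFS is the hypothetical new entry point: it only knows the interface.
func saveAllFS(w Writer, files map[string][]byte) error {
	for name, data := range files {
		if err := w.WriteFile(name, data, 0o644); err != nil {
			return fmt.Errorf("write %s: %w", name, err)
		}
	}
	return nil
}

// saveAll keeps the old directory-based signature and simply delegates,
// so existing callers do not change.
func saveAll(outDir string, files map[string][]byte) error {
	return saveAllFS(dirWriter{dir: outDir}, files)
}

// roundTrip calls the old-style function and reads the result back from disk.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "compat-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	if err := saveAll(dir, map[string][]byte{"1.txt": []byte("page one")}); err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(dir, "1.txt"))
	return string(b), err
}

func main() {
	content, err := roundTrip()
	fmt.Println(content, err)
}
```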

This is production code, so baby steps are required. That is why I wanted to start with ExtractContentFS.

Introduce writeable
add basic test
@CLAassistant

CLAassistant commented May 10, 2025

CLA assistant check
All committers have signed the CLA.

@hhrutter
Collaborator

Please create a feature request issue.
PRs are the result of issue discussions.
Thank you!

@mdmcconnell
Contributor

This is a nice idea. It could also clear the way for writing to stdout, for piped processing. I regularly use an operation of the form pdfcpu | jq | pdfcpu, but of course at the moment it creates temp files.

@olekukonko
Author

What Problem Does WritableFS Solve?

The core idea is abstracting where extracted PDF content gets written. Instead of hardcoding filesystem operations, the PDF extraction logic delegates writing to an interface (WritableFS). This enables:

  1. Writing to disk (via OsFS).
  2. Writing to memory (via MemFS, useful for tests).
  3. Writing to stdout or pipes (see below).
  4. Future backends (e.g., cloud storage, ZIP files, databases).

Example

1. Support for Pipes/Stdout (Your Example)

The comment you referenced highlights a real pain point: temporary files break clean piping (e.g., pdfcpu | jq | pdfcpu). With WritableFS, you could add a StdoutFS implementation:

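// StdoutFS sketches a WritableFS that streams file contents to stdout.
type StdoutFS struct{}

// Base identifies the backend; there is no directory behind it.
func (s *StdoutFS) Base() string { return "stdout" }

// WriteFile ignores name and permissions and writes data to stdout.
func (s *StdoutFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
    _, err := os.Stdout.Write(data)
    return err
}

// Mkdir is a no-op: stdout has no directories.
func (s *StdoutFS) Mkdir(path string, _ fs.FileMode) error { return nil }

// Usage: ExtractContentFS(pdfData, &StdoutFS{}, ...)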

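With this in place, the extracted content streams directly to stdout and no temp files are needed, which fits Unix-style pipelines. Note that if an extraction produces several files, their contents are written to the stream back to back.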
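One possible variation on the StdoutFS idea, offered only as a sketch: wrap any io.Writer instead of hardcoding os.Stdout, so the same type serves pipelines and can also be pointed at a buffer in tests. The name `WriterFS` and its `W` field are hypothetical and not part of this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
)

// WriterFS is a hypothetical WritableFS-style backend that streams every
// file's bytes to a single io.Writer (os.Stdout for pipelines).
type WriterFS struct{ W io.Writer }

// Base returns a fixed identifier, since there is no directory behind it.
func (s *WriterFS) Base() string { return "writer" }

// WriteFile ignores the name and permissions and appends data to the stream.
func (s *WriterFS) WriteFile(_ string, data []byte, _ fs.FileMode) error {
	_, err := s.W.Write(data)
	return err
}

// Mkdir is a no-op: a stream has no directories.
func (s *WriterFS) Mkdir(string, fs.FileMode) error { return nil }

// capture shows the same backend writing into a buffer instead of stdout.
// Two files written in sequence end up concatenated in the stream.
func capture() string {
	var buf bytes.Buffer
	w := &WriterFS{W: &buf}
	_ = w.WriteFile("1.txt", []byte("page 1\n"), 0o644)
	_ = w.WriteFile("2.txt", []byte("page 2\n"), 0o644)
	return buf.String()
}

func main() {
	out := &WriterFS{W: os.Stdout}
	_ = out.WriteFile("1.txt", []byte("hello from stdout\n"), 0o644)
	fmt.Print(capture())
}
```

Because file names are dropped, a consumer on the other end of the pipe cannot tell where one file ends and the next begins. A real implementation would probably need a framing convention or a restriction to single-file output.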

2. Testability Without I/O Overhead

MemFS lets you verify what would be written without touching the disk:

func TestExtractContent(t *testing.T) {
    memFS := NewMemFS()
    if err := ExtractContentFS(pdfReader, memFS, []string{"1"}, nil); err != nil {
        t.Fatal(err)
    }
    // Check memFS.files["1_123.txt"] contains expected data
}

  • Faster tests: No filesystem cleanup.
  • More reliable: No permission/disk space issues.
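To make the idea concrete, here is a sketch of a minimal in-memory backend and the kind of check it enables. It is not the MemFS from this PR: the `files` and `dirs` maps, the `Get` accessor, and the decision to overwrite existing entries are all assumptions.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sync"
)

// MemFS is a hypothetical in-memory WritableFS (not the PR's implementation).
type MemFS struct {
	mu    sync.Mutex
	files map[string][]byte
	dirs  map[string]fs.FileMode
}

func NewMemFS() *MemFS {
	return &MemFS{files: map[string][]byte{}, dirs: map[string]fs.FileMode{}}
}

// Base returns a fixed identifier for the in-memory store.
func (m *MemFS) Base() string { return "memory" }

// WriteFile stores a copy of data under the cleaned name,
// overwriting any previous content.
func (m *MemFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[path.Clean(name)] = append([]byte(nil), data...)
	return nil
}

// Mkdir records the directory and all of its parents.
func (m *MemFS) Mkdir(dir string, perm fs.FileMode) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for d := path.Clean(dir); d != "." && d != "/"; d = path.Dir(d) {
		m.dirs[d] = perm
	}
	return nil
}

// Get returns the stored content for name and whether it exists.
func (m *MemFS) Get(name string) ([]byte, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	b, ok := m.files[path.Clean(name)]
	return b, ok
}

func main() {
	m := NewMemFS()
	_ = m.Mkdir("out/pages", 0o755)
	_ = m.WriteFile("out/pages/../pages/1.txt", []byte("hello"), 0o644)

	// A test would assert on this instead of printing it.
	b, ok := m.Get("out/pages/1.txt")
	fmt.Println(string(b), ok)
}
```

Copying the data on write keeps later changes to the caller's slice from silently altering what the test inspects.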

3. Swappable Storage Backends

The same extraction logic works across backends. For example, to add S3 support:

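// S3FS sketches an S3-backed writer. Base and Mkdir are omitted here,
// but a real type needs them to satisfy WritableFS.
type S3FS struct{ Bucket string }

func (s *S3FS) WriteFile(name string, data []byte, _ fs.FileMode) error {
    // s3.Upload is a placeholder, not a real SDK call.
    return s3.Upload(s.Bucket, name, data)
}

No changes to ExtractContentFS are needed: just pass an *S3FS instead of an OsFS.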
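For a backend that can be written with the standard library alone, here is a sketch of an HTTP-based variant that uploads each file with a PUT request. The type name `HTTPFS`, the URL layout, and the example host are assumptions. The demo swaps in a fake transport so that it runs without any network access.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"net/http"
	"strings"
)

// HTTPFS is a hypothetical backend that PUTs each file to BaseURL/name.
type HTTPFS struct {
	BaseURL string
	Client  *http.Client
}

// Base returns the base URL files are uploaded under.
func (h *HTTPFS) Base() string { return h.BaseURL }

// WriteFile uploads data with a PUT request; perm has no meaning here.
func (h *HTTPFS) WriteFile(name string, data []byte, _ fs.FileMode) error {
	url := strings.TrimRight(h.BaseURL, "/") + "/" + strings.TrimLeft(name, "/")
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return err
	}
	resp, err := h.Client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("PUT %s: unexpected status %s", url, resp.Status)
	}
	return nil
}

// Mkdir is a no-op: this backend has no real directories.
func (h *HTTPFS) Mkdir(string, fs.FileMode) error { return nil }

// fakeTransport answers requests in-process so the demo needs no network.
type fakeTransport struct{ seen []string }

func (f *fakeTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return nil, err
	}
	f.seen = append(f.seen, r.Method+" "+r.URL.Path+" "+string(body))
	return &http.Response{
		StatusCode: http.StatusCreated,
		Status:     "201 Created",
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader("")),
		Request:    r,
	}, nil
}

// demo uploads one file through the fake transport and reports what was sent.
func demo() ([]string, error) {
	ft := &fakeTransport{}
	h := &HTTPFS{
		BaseURL: "http://storage.example/out/", // placeholder host
		Client:  &http.Client{Transport: ft},
	}
	err := h.WriteFile("pages/1.txt", []byte("hello"), 0o644)
	return ft.seen, err
}

func main() {
	seen, err := demo()
	fmt.Println(seen, err)
}
```

The extraction code would see only WriteFile and Mkdir. Authentication, retries, and content types would be the backend's concern.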

4. Consistent Directory Handling

Both OsFS and MemFS:

  • Normalize paths (e.g., clean/../path → path).
  • Auto-create parent directories (no "directory not found" errors).
  • Respect file permissions (even if just in-memory for tests).

Why This Design Matters

  • Separation of concerns: PDF extraction doesn’t care where data goes.
  • Unix philosophy: Enable pipes by supporting stdout (via StdoutFS).
  • Future-proof: New storage backends don’t require refactoring.

For your specific use case (avoiding temp files in pipes), the missing piece is a StdoutFS implementation—which would be trivial to add thanks to WritableFS. The interface makes this explicitly possible without hacks.

The only thing is that I'm not sure whether the fs.FileMode parameter is even needed.


4 participants