OCRSmith

OCRSmith is a Python library for generating synthetic OCR datasets, with support for Arabic and Latin text. Its modular architecture lets you build training data for OCR models from a variety of text sources.

🚀 Features

Core Functionality

Synthetic text image generation with configurable fonts and backgrounds
Multi-language support: Arabic and Latin text rendering with proper font handling
Flexible text placement strategies: random, centered, grid-based, and contextual positioning
Rich augmentation pipeline: noise, blur, brightness, rotation, and custom effects
Comprehensive background generation: solid colors, gradients, noise patterns, and custom images

Text Placement & Layout

Contextual placement strategies:
- Page titles: Top-centered positioning with proper margins
- Page numbers: Bottom-right corner placement
- Random placement: Within configurable margins
- Grid-based placement: Structured positioning
- Center placement: Perfect centering on backgrounds
Smart composition: Each placement strategy handles image composition internally
Rich metadata: Detailed placement information for training optimization
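
To make the placement strategies listed above more concrete, here is a minimal sketch of how each one could compute the top-left corner of the text on a background. The function names, the default margin of 20 pixels, and the exact formulas are illustrative assumptions, not OCRSmith's actual implementation.

```python
import random


def center_position(bg_size, text_size):
    """Top-left corner that centers the text on the background."""
    bg_w, bg_h = bg_size
    text_w, text_h = text_size
    return (bg_w - text_w) // 2, (bg_h - text_h) // 2


def page_title_position(bg_size, text_size, margin_y=20):
    """Horizontally centered, offset from the top edge by margin_y."""
    x, _ = center_position(bg_size, text_size)
    return x, margin_y


def page_number_position(bg_size, text_size, margin_x=20, margin_y=20):
    """Bottom-right corner, inset by the margins."""
    bg_w, bg_h = bg_size
    text_w, text_h = text_size
    return bg_w - text_w - margin_x, bg_h - text_h - margin_y


def random_position(bg_size, text_size, margin_x=20, margin_y=20, rng=None):
    """Uniformly random corner that keeps the text inside the margins."""
    rng = rng or random.Random()
    bg_w, bg_h = bg_size
    text_w, text_h = text_size
    max_x = bg_w - text_w - margin_x
    max_y = bg_h - text_h - margin_y
    if max_x < margin_x or max_y < margin_y:
        raise ValueError("text does not fit inside the margins")
    return rng.randint(margin_x, max_x), rng.randint(margin_y, max_y)
```

All four helpers return an `(x, y)` pair that could be passed to an image paste call; only the random variant needs a fit check, because its valid range can be empty.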

Data Sources

Multiple input formats:
- CSV files with configurable text columns
- Hugging Face datasets with automatic loading
- Parquet files for efficient data handling
- Direct text input
Batch processing: Generate thousands of samples efficiently
Memory-optimized: Iterator-based text loading for large datasets
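
The idea behind iterator-based loading can be sketched with the standard library alone. The function below is a hypothetical illustration (the name `iter_csv_texts` and its parameters are assumptions, not part of OCRSmith's API): it yields one text at a time from a chosen CSV column, so the whole file never has to sit in memory.

```python
import csv
from typing import Iterator


def iter_csv_texts(path: str, text_column: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield the non-empty values of one CSV column, one row at a time."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        # The header row is read lazily, on the first iteration of the generator.
        if reader.fieldnames is None or text_column not in reader.fieldnames:
            raise KeyError(f"column {text_column!r} not found in {path}")
        for row in reader:
            # Short rows yield None for missing cells, so normalize before stripping.
            text = (row[text_column] or "").strip()
            if text:
                yield text
```

Because this is a generator, a consumer can stop after the first N samples without reading the rest of the file.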

Augmentation System

Pipeline-based augmentation: Chain multiple effects with probability control
Built-in augmentations:
- Gaussian noise injection
- Blur effects
- Brightness adjustment
- Rotation transforms
Extensible: Easy to add custom augmentation strategies
Configurable probabilities: Fine-tune augmentation frequency
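
As a rough illustration of probability-controlled chaining, the sketch below applies each step independently with its own probability. The class name `ProbabilisticPipeline` and its methods are assumptions made for this example; OCRSmith's own pipeline class may be organized differently.

```python
import random
from typing import Any, Callable, List, Optional, Tuple


class ProbabilisticPipeline:
    """Apply each registered step independently with its own probability."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._steps: List[Tuple[Callable[[Any], Any], float]] = []

    def add(self, step: Callable[[Any], Any], probability: float = 1.0) -> "ProbabilisticPipeline":
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        self._steps.append((step, probability))
        return self  # returning self allows chained .add() calls

    def __call__(self, image: Any) -> Any:
        for step, probability in self._steps:
            # random() is in [0, 1), so 1.0 always applies and 0.0 never does.
            if self._rng.random() < probability:
                image = step(image)
        return image
```

A step is any callable that takes an image and returns one; with Pillow, for instance, a blur step could be `lambda img: img.filter(ImageFilter.GaussianBlur(0.5))`. Seeding the pipeline makes a generated dataset reproducible.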

🏗️ Architecture

OCRSmith follows a modular, strategy-pattern architecture:

OCRSmith/
├── core/
│   ├── BackgroundManager.py       # Background generation orchestration
│   ├── FontManager.py            # Font loading and caching
│   ├── TextRenderer.py           # Text-to-image rendering
│   ├── placement/                # Text placement strategies
│   │   ├── RandomPlacement       # Random positioning
│   │   ├── CenterPlacement       # Centered positioning
│   │   ├── GridPlacement         # Grid-based positioning
│   │   ├── PageTitlePlacement    # Title positioning
│   │   └── PageNumberPlacement   # Page number positioning
│   ├── backgrounds/              # Background generation
│   │   ├── SolidColorBackground  # Solid color backgrounds
│   │   ├── GradientBackground    # Gradient backgrounds
│   │   ├── NoiseBackground       # Noise pattern backgrounds
│   │   └── ImageBackground       # Custom image backgrounds
│   ├── augmentation/             # Image augmentation pipeline
│   │   ├── NoiseAugmentation     # Noise injection
│   │   ├── BlurAugmentation      # Blur effects
│   │   ├── BrightnessAugmentation # Brightness adjustment
│   │   └── RotationAugmentation  # Rotation transforms
│   └── fonts/                    # Font management system
├── datasets/                     # Data loading utilities
│   ├── CSVTextLoader            # CSV file support
│   ├── ParquetTextLoader        # Parquet file support
│   └── HuggingFaceTextLoader    # HuggingFace dataset support
└── config/                      # Configuration management
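
The strategy pattern behind this layout boils down to a registry that maps names to interchangeable implementations. The following is a generic sketch under that assumption; `StrategyRegistry` is a hypothetical name and does not mirror any specific OCRSmith class.

```python
from typing import Any, Callable, Dict


class StrategyRegistry:
    """Map names to interchangeable callables and dispatch by name."""

    def __init__(self) -> None:
        self._strategies: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, strategy: Callable[..., Any]) -> None:
        if name in self._strategies:
            raise ValueError(f"strategy {name!r} is already registered")
        self._strategies[name] = strategy

    def run(self, name: str, *args: Any, **kwargs: Any) -> Any:
        try:
            strategy = self._strategies[name]
        except KeyError:
            known = sorted(self._strategies)
            raise KeyError(f"unknown strategy {name!r}; known: {known}") from None
        return strategy(*args, **kwargs)
```

Callers only depend on the name and the shared call signature, which is what makes it possible to add a new placement, background, or augmentation without touching existing code.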

📦 Installation

# Create and activate conda environment
conda create -n ocrsmith python=3.10 -y
conda activate ocrsmith

# Clone the repository
git clone https://github.com/yourusername/OCRSmith.git
cd OCRSmith

# Install as a regular package
pip install .

# Or install in editable (development) mode
pip install -e .

🎯 Quick Start

Basic Usage

from ocrsmith.core.BackgroundManager import BackgroundManager
from ocrsmith.core.backgrounds.BackgroundFactory import BackgroundFactory
from ocrsmith.core.backgrounds.creators import (
    SolidColorBackground,
    GradientBackground,
    NoiseBackground,
    ImageBackground,
)
from ocrsmith.config import load_config
from ocrsmith.core.FontManager import FontManager
from ocrsmith.core.TextRenderer import TextRenderer
from ocrsmith.core.text_renderers.strategies.HorizontalRenderingStrategy import HorizontalRenderingStrategy
from ocrsmith.core.placement import PlacementManager, RandomPlacementStrategy

# Setup background factory
factory = BackgroundFactory()
factory.register_creator('solid', SolidColorBackground)
factory.register_creator('gradient', GradientBackground)
factory.register_creator('noise', NoiseBackground)
factory.register_creator('image', ImageBackground)

# Load configuration and initialize managers
configs = load_config()
background_manager = BackgroundManager(configs, factory)
font_manager = FontManager(font_paths=["assets/fonts"], default_size=24)

# Setup placement
placement_manager = PlacementManager()
placement_manager.register_strategy('random', RandomPlacementStrategy())

# Generate text image
font = font_manager.load_font(font_size=18)
text = "Sample text for OCR training"
text_renderer = TextRenderer(HorizontalRenderingStrategy())
text_image, mask, (width, height) = text_renderer.generate_text_image(font, text)

# Generate background
background_creator = background_manager.get_random_background()
background_image = background_creator.render(width + 100, height + 100)

# Place text and get composed image
placement_result = placement_manager.place_text(text_image, background_image, 'random')
final_image = placement_result.composed_image

# Save result
final_image.save("output.png")

CLI Usage (examples)

You can run the app directly with command-line overrides:

Example 1 — set source path/type/column inline:

python -m ocrsmith.core.app --num-samples 100 --output-dir outputs \
  --set text_data.source_path=assets/text_data/sentences.csv \
  --set text_data.source_type=csv \
  --set text_data.text_column=darija_ar \
  --set seed=123 \
  --workers 4

Example 2 — specify a config file (update src/ocrsmith/config/default_config.yaml first if needed):

python -m ocrsmith.core.app --config src/ocrsmith/config/default_config.yaml \
  --num-samples 100 --output-dir outputs --workers 6 --seed 123

Example 3 — use the built-in default config (no --config):

python -m ocrsmith.core.app --num-samples 100 --output-dir outputs --workers 6 --seed 123

You can also update the default config file at src/ocrsmith/config/default_config.yaml (fonts, text_data, layout, augmentations, etc.) and run with the --config option shown above.

🔧 Configuration

OCRSmith uses YAML configuration files for easy customization:

# src/ocrsmith/config/default_config.yaml
backgrounds:
  solid:
    enabled: true
    colors: ["#FFFFFF", "#F0F0F0", "#E0E0E0"]
  
  gradient:
    enabled: true
    directions: ["horizontal", "vertical", "diagonal"]
  
  noise:
    enabled: true
    intensity: [0.1, 0.3]

fonts:
  default_size: 24
  size_range: [16, 32]
  paths: ["assets/fonts"]

placement:
  default_strategy: "random"
  margins:
    x: 20
    y: 20

augmentation:
  noise:
    enabled: true
    factor: 0.05
    probability: 0.3
  
  blur:
    enabled: true
    radius: 0.5
    probability: 0.2
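
Once such a file is parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), consuming it is plain dictionary work. The sketch below uses a hand-written dict that mirrors part of the YAML above; the helper names, the treatment of a missing `enabled` flag as disabled, and the inclusive reading of `size_range` are assumptions, not documented OCRSmith behavior.

```python
import random

# Hand-written stand-in for the parsed YAML shown above.
config = {
    "fonts": {"default_size": 24, "size_range": [16, 32]},
    "augmentation": {
        "noise": {"enabled": True, "factor": 0.05, "probability": 0.3},
        "blur": {"enabled": True, "radius": 0.5, "probability": 0.2},
    },
}


def enabled_sections(section):
    """Return the sub-sections whose 'enabled' flag is true."""
    return {name: opts for name, opts in section.items() if opts.get("enabled", False)}


def sample_font_size(fonts, rng):
    """Draw a size from the inclusive size_range, falling back to default_size."""
    size_range = fonts.get("size_range")
    if not size_range:
        return fonts["default_size"]
    low, high = size_range
    return rng.randint(low, high)


active = enabled_sections(config["augmentation"])
font_size = sample_font_size(config["fonts"], random.Random(123))
```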

📊 Supported Font Collections

OCRSmith includes extensive font support:

Arabic Fonts

Amiri: Traditional Arabic typography (Regular, Bold, Italic, BoldItalic)
Fustat: Modern Arabic font family (7 weights)
IBM Plex Sans Arabic: Professional Arabic fonts (7 weights)
Kufam: Versatile Arabic/Latin dual-script font (10 styles)
Mada: Clean, modern Arabic font (8 weights)
Mirza: Elegant Arabic display font (4 weights)
Noto Sans Arabic: Google's comprehensive Arabic font family
Noto Kufi Arabic: Kufi-style Arabic fonts
Noto Naskh Arabic: Traditional Naskh Arabic fonts
Vazirmatn: High-quality Persian/Arabic font (9 weights)

Latin Fonts

IBM Plex Sans: Modern, professional Latin fonts
Noto Sans Mono: Monospace fonts for technical text

📈 Dataset Generation

Output Format

OCRSmith generates datasets with rich annotations:

{
  "image_path": "sample_000001.png",
  "text": "النص العربي للاختبار",
  "bbox": [45, 67, 234, 98],
  "placement_metadata": {
    "placement_type": "random",
    "position": [45, 67],
    "margins": [20, 20],
    "content_type": "body_text"
  }
}
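
A record like the one above can be read back with the standard `json` module. The sketch below assumes the `bbox` is ordered as `(left, top, right, bottom)`, which is consistent with the `position` field in the sample but is not stated explicitly; `bbox_size` is a hypothetical helper.

```python
import json

record_json = """
{
  "image_path": "sample_000001.png",
  "text": "النص العربي للاختبار",
  "bbox": [45, 67, 234, 98],
  "placement_metadata": {
    "placement_type": "random",
    "position": [45, 67],
    "margins": [20, 20],
    "content_type": "body_text"
  }
}
"""


def bbox_size(bbox):
    """Width and height of a (left, top, right, bottom) box."""
    left, top, right, bottom = bbox
    return right - left, bottom - top


record = json.loads(record_json)
width, height = bbox_size(record["bbox"])
print(record["image_path"], width, height)
```

Under that assumption the sample box is 189 pixels wide and 31 pixels tall, and its top-left corner matches `placement_metadata.position`.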

Batch Generation

# Generate large datasets efficiently.
# `engine` is an already-initialized OCRSmith generation engine (setup not shown here).
engine.generate_dataset(
    num_samples=10000,
    output_dir="large_dataset",
    placement_strategies=['random', 'center', 'title']
)

🎨 Customization

Adding Custom Placement Strategies

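class CustomPlacementStrategy(TextPlacementStrategy):
    def place_text(self, text_image, background_image, **kwargs):
        # Custom placement logic; calculate_position stands for your own
        # positioning helper (not shown here)
        x, y = self.calculate_position(text_image, background_image)

        # Compose on a copy so the original background stays untouched.
        # Passing text_image as the mask uses its alpha channel, so it
        # needs one (for example, an RGBA image).
        composed_image = background_image.copy()
        composed_image.paste(text_image, (x, y), text_image)

        # Return the result with a (left, top, right, bottom) bbox and metadata
        bbox = (x, y, x + text_image.size[0], y + text_image.size[1])
        metadata = {'placement_type': 'custom', 'position': (x, y)}

        return PlacementResult(composed_image, bbox, metadata)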

# Register custom strategy
placement_manager.register_strategy('custom', CustomPlacementStrategy())

Adding Custom Augmentations

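class CustomAugmentation(AugmentationStrategy):
    def apply(self, image, **kwargs):
        # Custom augmentation logic goes here; this placeholder
        # returns an unmodified copy of the input image
        modified_image = image.copy()
        return modified_image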

# Add to pipeline
engine.augmentation_pipeline.add_augmentation(
    CustomAugmentation(), 
    probability=0.4
)

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Create and activate conda environment
conda create -n ocrsmith python=3.10 -y
conda activate ocrsmith

# Clone the repository
git clone https://github.com/yourusername/OCRSmith.git
cd OCRSmith

# Install in development mode
pip install -e .

# Run tests
pytest tests/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Font providers for high-quality Arabic and Latin fonts

The OCR community for inspiration and feedback

Contributors who help improve OCRSmith

Made with ❤️ for the OCR community

