Singapore Project Data Collector & Intelligence System

An ELT system that collects (Collector), stores (Storage), and processes (Processor) information on global projects in the architecture, civil engineering, and geotechnical fields.

1. Architecture Overview

graph LR
    A["Collector (Batch)"] -->|Raw Data| B(Storage API)
    B -->|Upsert| D[(PostgreSQL)]
    D -->|Read| C["Processor (Batch/Stream)"]
    C -->|Update Analysis| B

2. Project Goals

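Singapore Data Collection (Phase 1)
- Targets: GeBIZ (government-wide procurement), LTA (transport), PUB (water), BCA (construction)
- Goal: Secure **"project-centric"** data such as the client, designer, contractor, construction cost, and schedule.
ELT Pipeline Establishment
- Extract: Reliable web data collection with Playwright.
- Load: Write to the DB immediately through the API to prevent data loss.
- Transform: At a later stage, turn unstructured data (PDFs, etc.) into useful information with LLM/OCR.
Hybrid Data Strategy
- Economic Analysis: Structured data such as construction cost and duration is managed strictly in the RDBMS.
- Technical Expansion: Variable technical data such as specifications and ground conditions is stored flexibly as JSONB.
Microservice Readiness
- Each component (Collector, Storage, Processor) is designed so that it can be deployed and scaled independently.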
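The Hybrid Data Strategy above can be illustrated with a small sketch that separates strictly typed fields from free-form technical attributes. The field names (`cost_sgd`, `duration_months`, and so on) and the `split_record` helper are assumptions made for illustration; the actual schema is not shown in this README. In the real system the `technical` dictionary would presumably map to a PostgreSQL JSONB column.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical set of strictly managed columns (not taken from the real schema).
CORE_FIELDS = ("title", "client", "contractor", "cost_sgd", "duration_months")


@dataclass
class ProjectRecord:
    # Structured fields intended for regular RDBMS columns.
    title: str
    client: Optional[str] = None
    contractor: Optional[str] = None
    cost_sgd: Optional[float] = None
    duration_months: Optional[int] = None
    # Everything else (specifications, ground conditions, ...) is kept
    # as one flexible document, analogous to a JSONB column.
    technical: dict[str, Any] = field(default_factory=dict)

    def technical_json(self) -> str:
        """Serialize the flexible part as a JSON string."""
        return json.dumps(self.technical, ensure_ascii=False, sort_keys=True)


def split_record(raw: dict[str, Any]) -> ProjectRecord:
    """Route known keys to typed fields and all other keys to `technical`."""
    core = {k: raw[k] for k in CORE_FIELDS if k in raw}
    extras = {k: v for k, v in raw.items() if k not in CORE_FIELDS}
    return ProjectRecord(technical=extras, **core)
```

With this split, a new technical attribute scraped from a source needs no schema migration, while the cost and duration fields keep fixed types for analysis.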

3. Directory Structure & Execution Models

🟢 Collector (`/collector`)

Type: Batch Job (Short-lived)
Role: Data collection (Playwright + PDF Parsing)
Architecture: Source-specific crawlers implemented on the Base Crawler pattern
Orchestration: Intended to be run periodically by Airflow, Cron, a Kubernetes Job, or a similar scheduler.
Key Content:
- src/crawlers/base_crawler.py - base class with shared functionality
- src/crawlers/singapore/ - crawlers for each Singapore source
- main.py - unified entrypoint
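As a rough illustration of the Base Crawler pattern, the sketch below shows how shared behaviour (applying a limit, skipping storage on a dry run) could live in a base class while each source implements only `scrape_data`. This is an assumption about the shape of `base_crawler.py`, not its actual contents; `run`, `save`, the constructor arguments, and `DemoCrawler` are hypothetical names.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Optional


class BaseCrawler(ABC):
    """Illustrative stand-in; the real base class may look different."""

    def __init__(self, limit: Optional[int] = None, dry_run: bool = False):
        self.limit = limit
        self.dry_run = dry_run
        self.stored: list[dict[str, Any]] = []

    @abstractmethod
    async def scrape_data(self, browser: Any) -> list[dict[str, Any]]:
        """Source-specific scraping; returns raw records."""

    async def save(self, records: list[dict[str, Any]]) -> None:
        # Placeholder for sending records to the Storage API.
        self.stored.extend(records)

    async def run(self, browser: Any = None) -> list[dict[str, Any]]:
        """Shared pipeline: scrape, truncate to the limit, then store."""
        records = await self.scrape_data(browser)
        if self.limit is not None:
            records = records[: self.limit]
        if not self.dry_run:
            await self.save(records)
        return records


class DemoCrawler(BaseCrawler):
    """Toy subclass that fabricates five records instead of using a browser."""

    async def scrape_data(self, browser: Any) -> list[dict[str, Any]]:
        return [{"id": i} for i in range(5)]
```

For example, `asyncio.run(DemoCrawler(limit=2).run())` returns the first two records and stores them, while `DemoCrawler(dry_run=True)` returns all five and stores nothing.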

🔵 Storage (`/storage`)

Type: Long-running Service (Always-on)
Role: Data storage and validation API (FastAPI + SQLAlchemy)
Orchestration: Runs continuously as a Docker container or Kubernetes Deployment. It receives data from the collector jobs and loads it into the DB.
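The architecture diagram labels the write from the Storage API to the database as an upsert. The sketch below shows the general idea using the standard-library `sqlite3` module as a stand-in for PostgreSQL, so that it runs without a database server; both engines accept the `INSERT ... ON CONFLICT ... DO UPDATE` form used here. The table layout and the `(source, external_id)` key are assumptions, not the project's real schema.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per project per source.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS projects (
            source TEXT NOT NULL,
            external_id TEXT NOT NULL,
            title TEXT,
            payload TEXT,
            PRIMARY KEY (source, external_id)
        )
        """
    )


def upsert_project(conn, source, external_id, title, payload) -> None:
    """Insert a project, or update it if the same key was already loaded."""
    conn.execute(
        """
        INSERT INTO projects (source, external_id, title, payload)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (source, external_id)
        DO UPDATE SET title = excluded.title, payload = excluded.payload
        """,
        (source, external_id, title, json.dumps(payload)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_db(conn)
# Loading the same project twice leaves a single, updated row.
upsert_project(conn, "bca", "P-001", "Old title", {"v": 1})
upsert_project(conn, "bca", "P-001", "New title", {"v": 2})
```

Because re-running a crawler only updates existing rows, a batch job can be retried after a failure without creating duplicates.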

🟣 Processor (`/processor`)

Type: Batch Job or Stream Consumer
Role: Data processing and AI analysis (Pandas + LLM)
Orchestration: Runs as an Airflow task (post-processing) after collection finishes, or as a Kafka consumer.

4. Collector Usage

Quick Start

# Run single crawler
cd collector
python main.py bca --limit 50

# Run with browser UI (debugging)
python main.py gebiz --no-headless --limit 10

# Dry-run mode (no storage)
python main.py bca --dry-run

# Run all crawlers
python main.py all --limit 100
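The flags used in the commands above could be parsed along the following lines. This is a minimal sketch with `argparse`, assuming the option names shown in Quick Start; the actual `main.py` may define them differently or use another CLI library.

```python
import argparse

# Source names that appear in this README, plus the "all" shortcut.
SOURCES = ("gebiz", "bca", "all")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py", description="Run one crawler or all crawlers."
    )
    parser.add_argument("source", choices=SOURCES, help="which source to crawl")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of items to collect")
    # --no-headless flips the default (headless=True) to show the browser UI.
    parser.add_argument("--no-headless", dest="headless", action="store_false",
                        help="show the browser window for debugging")
    parser.add_argument("--dry-run", action="store_true",
                        help="collect data without sending it to storage")
    return parser
```

Parsing `["gebiz", "--no-headless", "--limit", "10"]` with this parser yields `source="gebiz"`, `limit=10`, `headless=False`, and `dry_run=False`.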

Available Sources

| Source | Description | Max Concurrent Pages |
| --- | --- | --- |
| `gebiz` | GeBIZ (Government e-procurement) | 1 (strict limit) |
| `bca` | Building & Construction Authority | 5 (more lenient) |
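One common way to enforce a per-source page limit like the one listed above is an `asyncio.Semaphore`. The sketch below assumes that approach; the README does not say how the collector actually enforces it, and `crawl_urls` and the `fetch` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

# Limits copied from the Available Sources table.
MAX_CONCURRENT_PAGES = {"gebiz": 1, "bca": 5}


async def crawl_urls(
    source: str,
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
) -> list[str]:
    """Fetch all URLs, never running more than the source's limit at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_PAGES[source])

    async def bounded(url: str) -> str:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await fetch(url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

With `source="gebiz"` the fetches run strictly one at a time, while `source="bca"` allows up to five pages in flight.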

Adding New Sources

# 1. Create crawler class inheriting from BaseCrawler
from src.crawlers.base_crawler import BaseCrawler

class LTACrawler(BaseCrawler):
    async def scrape_data(self, browser):
        # LTA-specific scraping logic
        pass

# 2. Register in factory (crawler_factory.py)
CrawlerFactory.register('lta', LTACrawler)

# 3. Add config (config.py)
CRAWLER_CONFIG['lta'] = {
    'max_concurrent_pages': 3,
    'default_headless': True,
}

# 4. Use immediately (run from the shell, not from Python):
#    python main.py lta --limit 50
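The registration step above implies a name-to-class registry. A minimal sketch of such a factory follows; only `CrawlerFactory.register(name, cls)` is taken from the README, while the `create` method, the error handling, and the simplified `LTACrawler` are assumptions for illustration.

```python
from typing import Any


class CrawlerFactory:
    """Hypothetical registry mapping source names to crawler classes."""

    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, crawler_cls: type) -> None:
        if name in cls._registry:
            raise ValueError(f"source {name!r} is already registered")
        cls._registry[name] = crawler_cls

    @classmethod
    def create(cls, name: str, **kwargs: Any) -> Any:
        """Instantiate the crawler registered under `name`."""
        try:
            crawler_cls = cls._registry[name]
        except KeyError:
            known = sorted(cls._registry)
            raise ValueError(f"unknown source {name!r}; known: {known}") from None
        return crawler_cls(**kwargs)


class LTACrawler:
    """Simplified placeholder; the real class would inherit from BaseCrawler."""

    def __init__(self, limit=None):
        self.limit = limit


CrawlerFactory.register("lta", LTACrawler)
```

A registry of this kind lets `main.py` resolve the `<source>` argument to a class without a growing chain of `if` statements.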

5. Operational Protocols

Mode Switching: When starting a task, set a clear goal through task_boundary.
Documentation: Update the docs/ folder whenever there is a significant design change.
Skill Usage: For repetitive tasks (e.g. Git commits), use the skills in .agent/skills/.

6. Key Commands

Run DB Infra: cd docker && docker-compose up -d
Run Storage: make run-storage
Run Collector: cd collector && python main.py <source> [options]
Setup Local Env: make setup
