arXiv URL in -> translation-pipeline research brief out
┌─────────────────────────────┐      ┌──────────────────────────────────────┐
│                             │      │  {paper_slug}/                       │
│  /paper2code                │      │  ├── README.md                       │
│  https://arxiv.org/abs/     │ ───▶ │  ├── EVIDENCE_AUDIT.md               │
│  1706.03762                 │      │  ├── PIPELINE_FIT.md                 │
│                             │      │  ├── EXPERIMENT_PLAN.md              │
│                             │      │  └── OBSIDIAN_NOTE.md                │
└─────────────────────────────┘      └──────────────────────────────────────┘
This fork keeps the strong parts of the original paper2code workflow:
- arXiv acquisition
- appendix and footnote mining
- paper structure extraction
- official-code discovery
But it changes the end product. Instead of turning papers into implementation repos, this fork turns them into a decision-ready research pack for the IS->EN translation pipeline.
The translation-pipeline does not need generic paper summaries or speculative rewrites. It needs a workflow that answers questions like:
- Does this paper help with chunking, terminology, translation, QA, or bilingual memoQ output?
- Are the reported gains relevant to legal or regulatory document delivery?
- Does the paper assume language pairs, datasets, or hardware that make it a bad fit here?
- What is the smallest safe experiment to run in the current pipeline?
This fork is tuned for that kind of research and discovery.
- Evidence discipline: Benchmark wins are not treated as deployable improvements. Claims are separated into supported, partial, and unknown.
- Pipeline-fit mapping: Every paper is mapped onto concrete translation-pipeline surfaces such as src/prompts/, src/validators/, src/db/termbase.py, src/bilingual/runner.py, and the human-gated 9-stage flow.
- Operational realism: Legal/regulatory quality, terminology control, bilingual .docx handling, tag integrity, and deterministic validation matter more than leaderboard deltas.
- Experiment-first outputs: The final deliverable is a recommendation and experiment plan, not speculative production code.
- Vault-ready packaging: The workflow produces an Obsidian-friendly literature note so research can move directly into the wider knowledge system.
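As an illustration of the evidence-discipline split, here is a minimal Python sketch of how audited claims could be bucketed into supported, partial, and unknown. The `Support`, `Claim`, and `group_by_support` names and the two sample claims are hypothetical; they are not part of the skill and do not describe any real paper.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    """The three evidence levels named above."""
    SUPPORTED = "supported"
    PARTIAL = "partial"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Claim:
    text: str          # the claim as stated in the paper
    support: Support   # how well the paper backs it up
    note: str = ""     # why it was classified this way


def group_by_support(claims):
    """Bucket claims by support level; every level is present, even if empty."""
    buckets = {level: [] for level in Support}
    for claim in claims:
        buckets[claim.support].append(claim)
    return buckets


# Hypothetical sample claims, for illustration only.
claims = [
    Claim("Improves terminology consistency", Support.PARTIAL, "one language pair only"),
    Claim("Works on legal text", Support.UNKNOWN, "no legal-domain test set reported"),
]
buckets = group_by_support(claims)
for level, items in buckets.items():
    print(level.value, len(items))
```

Keeping an explicit empty bucket makes "nothing in this paper is fully supported" visible in the audit instead of silently missing.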
npx skills add laufeyg/paper2code/skills/paper2code
Once installed, run:
/paper2code https://arxiv.org/abs/1706.03762
/paper2code https://arxiv.org/abs/2006.11239 --mode full
/paper2code https://arxiv.org/abs/2106.09685 --mode educational
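The commands above all take an arXiv URL. As a rough sketch of what extracting the identifier from such a URL could look like, the Python snippet below handles new-style arXiv IDs (`YYMM.NNNN` or `YYMM.NNNNN`, with an optional version suffix). The `parse_arxiv_url` helper is hypothetical and is not taken from the skill; old-style identifiers such as `hep-th/9901001` are deliberately out of scope.

```python
import re

# New-style arXiv identifier with an optional version suffix such as "v2".
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def parse_arxiv_url(url):
    """Return (arxiv_id, version) from an arXiv abs/pdf URL; version may be None."""
    match = ARXIV_ID.search(url)
    if match is None:
        raise ValueError(f"not a recognised arXiv URL: {url}")
    return match.group(1), match.group(2)


print(parse_arxiv_url("https://arxiv.org/abs/1706.03762"))  # ('1706.03762', None)
```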
{paper_slug}/
├── README.md # Executive brief and recommendation
├── EVIDENCE_AUDIT.md # Claims, datasets, metrics, missing details, reproducibility
├── PIPELINE_FIT.md # Mapping to translation-pipeline stages, modules, and gates
├── EXPERIMENT_PLAN.md # Minimal safe experiment inside the current workflow
└── OBSIDIAN_NOTE.md # Vault-ready literature note with frontmatter
| File | Purpose |
|---|---|
| README.md | The shortest possible answer to "should I care about this paper?" |
| EVIDENCE_AUDIT.md | Separates demonstrated results from paper hype or missing detail. |
| PIPELINE_FIT.md | Shows exactly where the paper could affect the pipeline, if anywhere. |
| EXPERIMENT_PLAN.md | Defines a bounded next step instead of a vague "we should try this." |
| OBSIDIAN_NOTE.md | Makes the paper easy to store, link, and revisit in the vault. |
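For a sense of what "vault-ready with frontmatter" could mean in practice, the Python sketch below renders a note that opens with a YAML frontmatter block. The field names (`title`, `arxiv`, `verdict`, `tags`) and the `render_note` helper are assumptions made for illustration; the actual template used by the skill may differ.

```python
import json


def render_note(title, arxiv_id, verdict, tags):
    """Return note text: YAML frontmatter followed by a title heading."""
    lines = [
        "---",
        f"title: {json.dumps(title)}",     # JSON string quoting is also valid YAML
        f"arxiv: {json.dumps(arxiv_id)}",  # quoted so the ID is not parsed as a number
        f"verdict: {verdict}",
        "tags:",
    ]
    lines += [f"  - {tag}" for tag in tags]
    lines += ["---", "", f"# {title}", ""]
    return "\n".join(lines)


# Placeholder values only; the verdict here is not an actual assessment.
note = render_note(
    "Attention Is All You Need",
    "1706.03762",
    "WATCHLIST",
    ["paper", "translation-pipeline"],
)
print(note)
```

Quoting the arXiv ID matters because an unquoted `1706.03762` would be read as a float by YAML parsers and could lose its leading or trailing zeros.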
Every run should end with one of these:
- ADOPT NOW: narrow change, strong evidence, low operational risk
- PROTOTYPE: promising, but needs a bounded experiment in the pipeline
- WATCHLIST: interesting but not ready to spend engineering time on
- REJECT: weak fit, weak evidence, or too much operational risk
REJECT is a valid outcome. The point of this fork is better decisions, not more experiments.
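The four outcomes can be modelled as an enum with a small decision rule. The sketch below is one possible reading of the descriptions above, written in Python; the `decide` function, its three-level input scales, and the exact thresholds are assumptions, not the rule the skill applies.

```python
from enum import Enum


class Verdict(Enum):
    ADOPT_NOW = "ADOPT NOW"
    PROTOTYPE = "PROTOTYPE"
    WATCHLIST = "WATCHLIST"
    REJECT = "REJECT"


def decide(evidence, fit, risk, narrow_change):
    """Map coarse ratings to a verdict (thresholds are assumed).

    evidence, fit: "weak" | "moderate" | "strong"
    risk:          "low" | "medium" | "high"
    narrow_change: True if the change touches a small, well-defined surface
    """
    # Weak fit, weak evidence, or too much operational risk: reject outright.
    if evidence == "weak" or fit == "weak" or risk == "high":
        return Verdict.REJECT
    # Narrow change, strong evidence, low risk, and a clear fit: adopt.
    if evidence == "strong" and fit == "strong" and risk == "low" and narrow_change:
        return Verdict.ADOPT_NOW
    # A strong fit that still needs proof: run a bounded experiment.
    if fit == "strong":
        return Verdict.PROTOTYPE
    # Interesting, but not worth engineering time yet.
    return Verdict.WATCHLIST


print(decide("moderate", "strong", "medium", False).value)  # PROTOTYPE
```

Checking the REJECT conditions first keeps the rule conservative: no combination of other strengths can override a weak fit or high risk.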
The paper is judged against translation-pipeline realities:
- IS->EN or otherwise relevant document-translation evidence
- terminology control and termbase compatibility
- bilingual/memoQ or document-structure friendliness
- quality assurance compatibility with validators and human gates
- legal/regulatory robustness, not just sentence-level benchmark quality
- operational cost and complexity relative to likely gain
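One way to keep these criteria explicit during a review is a plain checklist that reports which ones a paper fails. The Python sketch below assumes a simple yes/no answer per criterion; the `FitChecklist` class and its field names are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class FitChecklist:
    """One boolean per criterion, in the order listed above."""
    relevant_translation_evidence: bool
    terminology_compatible: bool
    bilingual_structure_friendly: bool
    qa_compatible: bool
    legal_regulatory_robust: bool
    cost_proportionate: bool

    def unmet(self):
        """Return the names of criteria the paper does not satisfy."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical answers for an imaginary paper.
checklist = FitChecklist(True, True, False, True, False, True)
print(checklist.unmet())
```

Listing unmet criteria by name, instead of collapsing them into a score, keeps the reason for a weak fit visible in PIPELINE_FIT.md.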
- It will not pretend BLEU or COMET gains automatically improve client deliverables.
- It will not recommend major architecture rewrites from a thin paper.
- It will not treat missing datasets, prompts, or evaluation details as "close enough."
- It will not generate production code just to make the output look complete.
The most useful additions are:
- Translation-relevant worked examples
- Domain knowledge for legal/regulatory translation research
- Better guardrails for evidence quality and pipeline-fit decisions
- Tighter vault note templates
If a paper repeatedly suggests the same kind of improvement, capture that pattern in knowledge/ or guardrails/ instead of rediscovering it every run.