High-performance C++ neural network for MNIST digit recognition with SIMD kernels, OpenMP, and reproducible benchmarks.
- SIMD-accelerated matrix ops (AVX2/AVX-512/NEON) with aligned storage.
- OpenMP-aware hot paths for dot, transpose, and axpy.
- P2 PGM parser with in-memory + on-disk cache for repeat runs.
- CLI training + evaluation pipeline with configurable epochs.
- Catch2 tests wired to CTest.
- Google Benchmark suite with published results + charts.
- Doxygen docs target and clang-format config.
- CI on Linux/macOS/Windows via GitHub Actions.
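The first feature bullet pairs SIMD kernels with aligned storage. Below is a minimal sketch of that idea. It assumes 64-byte alignment, which is one cache line and also satisfies AVX-512 aligned loads. `make_aligned_floats`, `is_aligned`, and `dot` are hypothetical names, not the project's API. The inner-product loop is portable scalar C++ rather than hand-written AVX2/AVX-512/NEON intrinsics. Its four independent accumulators shorten the floating-point add dependency chain, which is the same trick a SIMD kernel applies across vector lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumed alignment: one cache line, wide enough for AVX-512 aligned loads.
constexpr std::size_t kAlignment = 64;

// True if p sits on a kAlignment-byte boundary.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}

// Frees memory obtained from the aligned form of operator new.
struct AlignedDeleter {
    void operator()(float* p) const noexcept {
        ::operator delete(p, std::align_val_t{kAlignment});
    }
};

using AlignedFloats = std::unique_ptr<float[], AlignedDeleter>;

// Hypothetical helper: allocates n zero-initialized floats on a 64-byte boundary.
AlignedFloats make_aligned_floats(std::size_t n) {
    void* raw = ::operator new(n * sizeof(float), std::align_val_t{kAlignment});
    float* data = static_cast<float*>(raw);
    std::uninitialized_fill_n(data, n, 0.0f);  // construct every element as 0.0f
    assert(is_aligned(data));
    return AlignedFloats(data);
}

// Inner product of two length-n arrays, the building block of a matrix product.
float dot(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // main loop: four independent partial sums
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {  // scalar tail for n not divisible by 4
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Usage: allocate both operands with `make_aligned_floats(n)` and pass `a.get()` and `b.get()` to `dot`. Note that reordering the additions can change the last bits of the result compared with a strictly sequential loop.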
Run `python3 tools/run.py` to download MNIST, build the project, and run a training pass.
Use `python3 tools/run.py --help` for the available flags.
Benchmark run files:
- docs/benchmarks/runs/bench-20251226-154121-baseline.json
- docs/benchmarks/runs/bench-20251226-154121-native.json
- docs/benchmarks/runs/bench-20251226-154121-openmp-native.json
Configs:
- baseline: OpenMP off, native off
- native: OpenMP off, native on
- openmp+native: OpenMP on, native on
Environment:
- Apple M2, macOS 15.5
- Apple clang 17.0.0
- Release build (`-O3`, OpenMP on/off, `-march=native` on/off)
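To check which configuration a binary was built with, the standard predefined macros are enough. `_OPENMP` is defined when OpenMP is enabled at compile time, and `__AVX512F__`, `__AVX2__`, or `__ARM_NEON` reflect the target instruction set. The sketch below is hypothetical: `build_config` is not part of the project, and it only inspects the translation unit it is compiled in. One caveat applies to the Apple M2 runs above. NEON is mandatory on arm64, so `__ARM_NEON` is defined with or without `-march=native`, and on that machine the macro cannot distinguish baseline from native.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: describes this translation unit's compile-time
// configuration using standard predefined macros.
std::string build_config() {
    std::string config;
#ifdef _OPENMP
    config += "openmp";      // compiled with OpenMP enabled (e.g. -fopenmp)
#else
    config += "no-openmp";
#endif
#if defined(__AVX512F__)
    config += " avx512";
#elif defined(__AVX2__)
    config += " avx2";       // e.g. -march=native on an AVX2-capable x86-64 CPU
#elif defined(__ARM_NEON)
    config += " neon";       // always defined on arm64, native or not
#else
    config += " generic";
#endif
    return config;
}
```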
Matrix ops (ns/op, lower is better):
| Case | baseline | native | openmp+native |
|---|---|---|---|
| dot 32 | 6165 | 6229 | 6287 |
| dot 64 | 65252 | 57222 | 89130 |
| dot 128 | 575281 | 587767 | 374400 |
| dot 256 | 4835360 | 4759132 | 1379835 |
| transpose 128 | 5441 | 5292 | 23662 |
| transpose 256 | 23098 | 22104 | 31108 |
| transpose 512 | 198735 | 178676 | 87914 |
| transpose 1024 | 978383 | 861078 | 502426 |
| axpy 128 | 3486 | 3477 | 23917 |
| axpy 256 | 13886 | 13896 | 26335 |
| axpy 512 | 55848 | 55441 | 35846 |
| axpy 1024 | 230626 | 229230 | 114910 |
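The speedup of openmp+native over baseline is the ratio of the ns/op values, so any value below 1 is a slowdown. Here it is worked for four rows of the table:

$$
\begin{aligned}
S &= \frac{t_{\text{baseline}}}{t_{\text{openmp+native}}} \\
S_{\text{dot }256} &= \frac{4835360}{1379835} \approx 3.50 \\
S_{\text{axpy }1024} &= \frac{230626}{114910} \approx 2.01 \\
S_{\text{dot }32} &= \frac{6165}{6287} \approx 0.98 \\
S_{\text{axpy }128} &= \frac{3486}{23917} \approx 0.15
\end{aligned}
$$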
Training/inference throughput (img/s, higher is better):
| Case | baseline | native | openmp+native |
|---|---|---|---|
| learn step | 48755 | 49399 | 48636 |
| classify | 81628 | 80712 | 69994 |
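OpenMP overhead dominates at small sizes. With OpenMP on, dot 32/64, transpose 128/256, and axpy 128/256 are slower than baseline, and classify throughput drops from 81628 to 69994 img/s. Parallelism pays off from dot 128, transpose 512, and axpy 512 upward. The line charts illustrate where that crossover happens.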
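A common way to keep OpenMP overhead off small inputs is the `if` clause on a parallel region. When its condition is false, the loop runs on a single thread. Below is a minimal sketch for axpy (y ← a·x + y) on a flat `std::vector<float>`. It is not a claim about how the project implements axpy. `kParallelThreshold` is a hypothetical cutoff that would need tuning from measurements such as the table above. The loop index is signed because MSVC's OpenMP 2.0 support, relevant to the Windows CI, requires a signed loop variable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff: below this many elements, thread start-up and join
// are assumed to cost more than the arithmetic itself.
constexpr std::ptrdiff_t kParallelThreshold = 1 << 14;

// y <- a * x + y, element-wise.
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const auto n = static_cast<std::ptrdiff_t>(x.size());
    // The if() clause keeps small inputs serial. Without -fopenmp the pragma
    // is ignored and this is an ordinary serial loop.
#pragma omp parallel for if (n >= kParallelThreshold)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```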
See docs/benchmarks/benchmarks.md for methodology and scripts.
Run the benchmarks:
`python3 tools/run_benchmarks.py --openmp --native`
Build and test manually:
`cmake -S . -B build -DCMAKE_BUILD_TYPE=Release`
`cmake --build build`
`ctest --test-dir build`
macOS quickstart:
`./tools/bootstrap_macos.sh`
Run the CLI directly:
`./build/fast_mnist_cli data 5000 10 TrainingSetList.txt TestingSetList.txt`
Format the sources:
`clang-format -i src/*.cpp include/fast_mnist/*.h apps/*.cpp`
Build the Doxygen docs:
`cmake -S . -B build -DFAST_MNIST_ENABLE_DOXYGEN=ON`
`cmake --build build --target docs`
Prepare the MNIST data:
`python3 tools/prepare_mnist.py --output data --list-dir .`
The script auto-installs tqdm for progress bars; pass `--no-auto-install` to skip that step.
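The feature list mentions a P2 PGM parser. P2 is the plain-text PGM variant. A file starts with a `P2` magic number, then the width, height, and maximum gray value, followed by width × height decimal samples, with `#` comments allowed. Below is a minimal sketch of such a parser, not the project's implementation. `PgmImage` and `parse_p2` are hypothetical names, and the sketch assumes 8-bit images (maximum value ≤ 255), which fits MNIST's 28×28 grayscale digits. The caching layer is not shown. The string overload only illustrates parsing bytes already held in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical image type: row-major 8-bit pixels.
struct PgmImage {
    int width = 0;
    int height = 0;
    int max_value = 0;
    std::vector<std::uint8_t> pixels;  // width * height entries
};

// Returns the next whitespace-separated token, skipping '#' comments.
std::string next_token(std::istream& in) {
    std::string tok;
    while (in >> tok) {
        if (tok[0] != '#') return tok;
        std::string rest;
        std::getline(in, rest);  // discard the remainder of the comment line
    }
    throw std::runtime_error("PGM: unexpected end of input");
}

PgmImage parse_p2(std::istream& in) {
    if (next_token(in) != "P2") throw std::runtime_error("PGM: expected P2 magic");
    PgmImage img;
    img.width = std::stoi(next_token(in));
    img.height = std::stoi(next_token(in));
    img.max_value = std::stoi(next_token(in));
    if (img.width <= 0 || img.height <= 0 || img.max_value <= 0 || img.max_value > 255)
        throw std::runtime_error("PGM: unsupported header");
    const std::size_t count =
        static_cast<std::size_t>(img.width) * static_cast<std::size_t>(img.height);
    img.pixels.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const int v = std::stoi(next_token(in));
        if (v < 0 || v > img.max_value) throw std::runtime_error("PGM: sample out of range");
        img.pixels.push_back(static_cast<std::uint8_t>(v));
    }
    assert(img.pixels.size() == count);
    return img;
}

// Convenience overload for file contents already held in memory.
PgmImage parse_p2(const std::string& text) {
    std::istringstream in(text);
    return parse_p2(in);
}
```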
MIT -- see LICENSE.