Fast MNIST NN


High-performance C++ neural network for MNIST digit recognition with SIMD kernels, OpenMP, and reproducible benchmarks.

Highlights

  • SIMD-accelerated matrix ops (AVX2/AVX-512/NEON) with aligned storage.
  • OpenMP-aware hot paths for dot, transpose, and axpy.
  • P2 PGM parser with in-memory + on-disk cache for repeat runs.
  • CLI training + evaluation pipeline with configurable epochs.
  • Catch2 tests wired to CTest.
  • Google Benchmark suite with published results + charts.
  • Doxygen docs target and clang-format config.
  • CI on Linux/macOS/Windows via GitHub Actions.
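As an illustration of the P2 PGM parsing mentioned above, here is a minimal sketch of an ASCII PGM reader. It is not the project's parser: the `PgmImage` struct and the `next_token` and `parse_p2_pgm` functions are hypothetical names chosen for this example, and caching is left out. It relies only on the P2 layout itself: the magic number `P2`, then width, height, and maximum gray value, then whitespace-separated pixel values, with `#` starting a comment that runs to the end of the line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container; the project's real image type may differ.
struct PgmImage {
    std::size_t width = 0;
    std::size_t height = 0;
    int max_value = 0;
    std::vector<std::uint16_t> pixels;  // row-major, width * height entries
};

// Returns the next whitespace-delimited token, skipping '#' comments.
static std::string next_token(std::istream& in) {
    std::string tok;
    while (in >> tok) {
        if (tok[0] == '#') {  // comment: discard the rest of the line
            std::string rest;
            std::getline(in, rest);
            continue;
        }
        return tok;
    }
    throw std::runtime_error("unexpected end of PGM data");
}

// Parses an ASCII (P2) PGM image from a stream.
PgmImage parse_p2_pgm(std::istream& in) {
    if (next_token(in) != "P2") {
        throw std::runtime_error("not a P2 (ASCII) PGM file");
    }
    PgmImage img;
    img.width = std::stoul(next_token(in));
    img.height = std::stoul(next_token(in));
    img.max_value = std::stoi(next_token(in));
    if (img.max_value <= 0 || img.max_value > 65535) {
        throw std::runtime_error("invalid maximum gray value");
    }
    const std::size_t count = img.width * img.height;
    img.pixels.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const int v = std::stoi(next_token(in));
        if (v < 0 || v > img.max_value) {
            throw std::runtime_error("pixel value out of range");
        }
        img.pixels.push_back(static_cast<std::uint16_t>(v));
    }
    assert(img.pixels.size() == count);
    return img;
}
```

To read a file, open it with `std::ifstream` and pass the stream to `parse_p2_pgm`; a `std::istringstream` works the same way for in-memory data.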

Quickstart

python3 tools/run.py

This downloads MNIST, builds the project, and runs a training pass. Use python3 tools/run.py --help for flags.

Benchmarks

Run files:

  • docs/benchmarks/runs/bench-20251226-154121-baseline.json
  • docs/benchmarks/runs/bench-20251226-154121-native.json
  • docs/benchmarks/runs/bench-20251226-154121-openmp-native.json

Configs:

  • baseline: OpenMP off, native off
  • native: OpenMP off, native on
  • openmp+native: OpenMP on, native on

Environment:

  • Apple M2, macOS 15.5
  • Apple clang 17.0.0
  • Release (-O3, OpenMP on/off, -march=native on/off)

Matrix ops (ns/op, lower is better):

Case              baseline      native  openmp+native
dot 32                6165        6229           6287
dot 64               65252       57222          89130
dot 128             575281      587767         374400
dot 256            4835360     4759132        1379835
transpose 128         5441        5292          23662
transpose 256        23098       22104          31108
transpose 512       198735      178676          87914
transpose 1024      978383      861078         502426
axpy 128              3486        3477          23917
axpy 256             13886       13896          26335
axpy 512             55848       55441          35846
axpy 1024           230626      229230         114910

Training/inference throughput (img/s, higher is better):

Case              baseline      native  openmp+native
learn step           48755       49399          48636
classify             81628       80712          69994

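OpenMP overhead dominates at the smaller sizes: with OpenMP on, transpose 128 and axpy 128 run roughly 4x and 7x slower than baseline. Parallelism starts to pay off between sizes 64 and 128 for dot, and between 256 and 512 for transpose and axpy; at dot 256 the openmp+native build is about 3.5x faster than baseline. End-to-end throughput does not benefit: learn step is essentially flat across configs, and classify is about 14% lower with OpenMP on. The line charts show these crossover points.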
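One common way to keep OpenMP overhead away from small inputs is the `if` clause on `parallel for`, which runs the loop on a single thread when the condition is false. The sketch below applies it to an axpy kernel (y = a*x + y). It is an illustration only, not the project's implementation: the `axpy` signature and the `kParallelThreshold` cutoff are assumptions, and a real cutoff should come from measurements like the ones above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff (element count); tune it per machine from benchmarks.
constexpr std::ptrdiff_t kParallelThreshold = 1 << 16;

// Computes y <- a * x + y element-wise.
// The `if` clause keeps inputs below the cutoff on one thread, so they
// do not pay the cost of starting and synchronizing a thread team.
// When compiled without OpenMP support the pragma is ignored and the
// loop simply runs serially.
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    const float* xp = x.data();
    float* yp = y.data();
    #pragma omp parallel for schedule(static) if (n >= kParallelThreshold)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

Calling `axpy(2.0f, x, y)` with `x = {1, 2, 3}` and `y = {10, 20, 30}` leaves `y` as `{12, 24, 36}`. The same guard can be applied to other loops where small sizes regress with threading enabled.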

[Charts: Dot scaling, Transpose scaling, Axpy scaling, Throughput comparison]

See docs/benchmarks/benchmarks.md for methodology and scripts.

Run Benchmarks

python3 tools/run_benchmarks.py --openmp --native

Build and Test

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build

macOS quickstart:

./tools/bootstrap_macos.sh

Run

./build/fast_mnist_cli data 5000 10 TrainingSetList.txt TestingSetList.txt

Formatting

clang-format -i src/*.cpp include/fast_mnist/*.h apps/*.cpp

Documentation

cmake -S . -B build -DFAST_MNIST_ENABLE_DOXYGEN=ON
cmake --build build --target docs

Data

python3 tools/prepare_mnist.py --output data --list-dir .

The script auto-installs tqdm for progress bars; pass --no-auto-install to skip that step.

License

MIT -- see LICENSE.
