Ever dreamed of building a clap recognition system that stays focused even when a dog barks or dishes clatter in the background? This lightweight neural network delivers 97% accurate clap detection while running efficiently on Arduino boards without straining their resources.
The system is built of the following components:
- Hardware: an Arduino board (including AVR boards like the Mega 2560) with an analog microphone and an LED.
- Sound sampling of the audio input.
- A neural network that distinguishes claps from other loud sounds.
- Training sound patterns collected with the same system to improve model accuracy.
This library was inspired by a great article Neural Network from Scratch (C++) by Thakee Nathees.
For neural network theory, you can check out Backpropagation calculus.
While developing this library, I focused on simplicity and code elegance rather than raw performance, so there is still room for optimization.
Clap recognition, including its neural network, is not very demanding. An Arduino Mega 2560 or an ESP32, for example, will do just fine. Analog microphones are cheap, but those that accurately output an analog sound signal are not so easy to find. An LED is connected to a board pin through a 220 ohm resistor.
A clap lasts around 7 ms. In this time we want to capture 256 samples, which gives a sampling rate of about 35 kHz. It would be better to sample 10 ms at 40 kHz, but due to limited processing power, 7 ms at 35 kHz will do just fine.
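A minimal sketch of what such sampling might look like on an AVR board. It assumes the microphone is wired to analog pin A0 and speeds up the ADC by lowering its clock prescaler; the register tweak and the pin are assumptions for illustration, not part of the library.
#define MIC_PIN A0
#define SAMPLES 256
int sample [SAMPLES];
void sampleClap () {
    #ifdef __AVR__
        // speed up the ADC: prescaler 16 instead of the default 128, otherwise analogRead is far too slow for ~35 kHz
        ADCSRA = (ADCSRA & ~0x07) | 0x04;
    #endif
    for (int i = 0; i < SAMPLES; i++)
        sample [i] = analogRead (MIC_PIN); // roughly 7 ms for 256 samples with the faster ADC clock
}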
The neuralNetwork.hpp library is designed to use as little memory as possible, so it is suitable for Arduino controllers. Besides that, it doesn't use any heap memory at all, which also makes it suitable for AVR boards.
Let's go briefly through the basics. A neural network is a set of layers, each containing its neurons. The output of one layer is the input to the next. Each input influences each neuron, but not equally. This is why a matrix of weights is used at each layer. Neuron values at layer L are calculated as $af^{L} = af \left( z^{L} \right)$, where $z^{L} = weight^{L} \times input^{L} + bias^{L}$.
Bias is a vector of equal size as the output vector of each layer. By providing each neuron with a trainable constant value, bias increases the flexibility of the model, allowing the network to fit the data more accurately.
af is a non-linear neuron activation function. It must be non-linear; otherwise, the whole neural network would just be a linear system. The ones used here are Sigmoid and ReLU: $Sigmoid \left( z \right) = \frac{1}{1 + e^{-z}}$ and $ReLU \left( z \right) = max \left( 0, z \right)$.
The size of the input, the weight matrix, the output, and bias vary in each layer.
The output layer neuron values can be expressed as $af^{Loutput} = af \left( weight^{Loutput} \times af^{Loutput-1} + bias^{Loutput} \right)$, where $af^{Loutput-1}$ is itself calculated from the layer before it, all the way back to the input pattern.
The output of the neural network is an array containing the probabilities of the categories that the input pattern can belong to. The equation above does not produce exact probabilities, so the neuron values at the output layer need to be further normalized, with the softmax function for example: $softmax \left( af_{[n]}^{Loutput} \right) = \frac{e^{af_{[n]}^{Loutput}}}{\sum_{k=0}^{N^{Loutput}-1} e^{af_{[k]}^{Loutput}}}$.
The correctness of the output depends on how weights and biases are set in all neural network layers. This is where training comes in, which will be addressed later.
When calculating the output from the input pattern, the calculation starts at the input layer and proceeds to the output layer. This is why this process is called a forward pass.
#include "neuralnetwork.hpp"
// .--- the number of neurons in the first layer - it corresponds to the size of the patterns that the neural network will use to make the categorization
// | .--- second layer activation function
// | | .--- the number of neurons in the second layer
// | | | .--- output layer activation function
// | | | | .--- the number of neurons in the output layer - it corresponds to the number of categories that the neural network recognizes
// | | | | |
neuralNetworkLayer_t<8, ReLU, 16, /* add more if needed */ Sigmoid, 2> neuralNetwork;
// at this point neuralNetwork is initialized with random weights and biases and it is ready for training
// - you can either start training it and export the trained model when finished
// - or you can load an already trained model that is capable of making usable outputs
void setup () {
cinit (); // instead of Serial.begin (115200) or Serial.begin (9600) for AVR boards
// import trained model from C++ initializer list or int32_t array
neuralNetwork = {1030875393,1053884929,-1080728422,1065850332, ... ,1056579159};
// categorize the input pattern
auto probability = neuralNetwork.forwardPass ( { 18, 20, 7, 2, 1 } ); // forwardPass returns an array whose size corresponds to the number of output layer neurons
cout << "probabilities: ( "; // instead of Serial.print
for (auto p : probability)
cout << p << " ";
cout << ")\n";It is difficult to tell the neural network topology (meaning the number of layers, how many neurons each would have and their activation functions) in advance. Just try different arrangements and see which works best for your case.
// hidden layers
template <size_t inputCount, size_t af, size_t neuronCount, size_t... sizes>
class neuralNetworkLayer_t<inputCount, af, neuronCount, sizes...> {
// data structures needed for this layer: weight and bias
float weight [neuronCount][inputCount];
float bias [neuronCount];
// include the next layer instance which will include the next layer itself, ...
neuralNetworkLayer_t<neuronCount, sizes...> nextLayer;
public:
// calculates the neurons of this layer and returns the category that the input belongs to
array<float, outputCount> forwardPass (const float input [inputCount]) {
float neuron [neuronCount];
// z = weight x input + bias
// neuron = af (z)
float z [neuronCount];
for (size_t n = 0; n < neuronCount; n++) {
z [n] = bias [n];
for (size_t i = 0; i < inputCount; i++)
z [n] += weight [n][i] * input [i];
neuron [n] = af (z [n]);
}
// return what the next layer thinks about the neurons calculated here
return nextLayer.forwardPass (neuron);
}
};
// output layer
template <size_t inputCount, size_t af, size_t neuronCount>
class neuralNetworkLayer_t<inputCount, af, neuronCount> {
// data structures needed for this layer: weight and bias
float weight [neuronCount][inputCount];
float bias [neuronCount];
public:
// calculates the output neurons of the neural network and returns the category that the input belongs to
array<float, neuronCount> forwardPass (const float input [inputCount]) {
array<float, neuronCount> neuron {};
// z = weight x input + bias
// neuron = af (z)
float z [neuronCount];
for (size_t n = 0; n < neuronCount; n++) {
z [n] = bias [n];
for (size_t i = 0; i < inputCount; i++)
z [n] += weight [n][i] * input [i];
neuron [n] = af (z [n]);
}
// softmax normalization of the result
float sum = 0;
for (size_t n = 0; n < neuronCount; n++)
sum += expf (neuron [n]);
for (size_t n = 0; n < neuronCount; n++)
if (sum > 0)
neuron [n] = expf (neuron [n]) / sum;
else
neuron [n] = 0;
// start returning the result through all the previous layers
return neuron;
}
};
The following training code is a little oversimplified, but it works well for demonstration.
// This part, including testing different topologies, can be done more efficiently on a larger computer rather than on the controller itself,
// as Arduino code is portable to standard C++.
#define epoch 1000 // choose the right number of training iterations so the model gets trained but not overtrained
for (int trainingIteration = 0; trainingIteration < epoch; trainingIteration++) {
// normally we would need something like 1,000 training patterns
float errorOverAllPatterns = 0;
// .--- tell neuralNetwork that the pattern belongs to category 0 (0 is the index of output vector that designates category 0)
// |
errorOverAllPatterns += neuralNetwork.backwardPropagation ( { 1, 2, 6, 18, 20 }, { 1, 0, 0 } ); // expected = probability vector telling the neural network that the pattern belongs to category (with index) 0
// .--- tell neuralNetwork that the pattern belongs to category 1 (1 is the index of output vector that designates category 1)
// |
errorOverAllPatterns += neuralNetwork.backwardPropagation ( { 1, 2, 25, 3, 1 }, { 0, 1, 0 } ); // expected = probability vector telling the neural network that the pattern belongs to category (with index) 1
// .--- tell neuralNetwork that the pattern belongs to category 2 (2 is the index of output vector that designates category 2)
// |
errorOverAllPatterns += neuralNetwork.backwardPropagation ( { 19, 10, 3, 2, 1 }, { 0, 0, 1 } ); // expected = probability vector telling the neural network that the pattern belongs to category (with index) 2
}
// export trained model as C++ int32_t initializer list
cout << neuralNetwork << endl;
Training the neural network involves setting weights and biases using training patterns at the input and expected results at the output. For each given pattern and expected result, an error is assessed at the output layer, and the weights and biases of the output layer are adjusted to minimize the error. Then the same process is applied to the previous layer, and so on. This is why the process is called backward propagation: it propagates the error at the output layer back through the previous layers.
With repeated training iterations, the error typically decreases. However, more training doesn't always lead to better classification accuracy. A neural network can become overtrained—meaning it learns the training patterns too precisely, resulting in minimal error on known data but poor generalization to new inputs. This is not the desired outcome. The graph illustrates a typical error reduction trend during the training process.
Since there are many variables to optimize, the error function contains numerous local minima, and we can never be certain in which one the training process will end up. Therefore, the whole process needs to be repeated multiple times with different random initializations in order to obtain a better result.
Training may be too demanding for Arduino boards, but you can do it on a bigger machine, export the trained model there, and import it into the Arduino. This is why the compatibility.h library is included: to let Arduino code compile in standard C++ and the other way around.
Initially, all the weights in all layers of the neural network are initialized with random values.
Normal Xavier initialization (besides the normal Xavier initialization there is also a uniform one, which is calculated slightly differently) uses a Gaussian probability distribution with a mean of 0 and a standard deviation of $\sqrt{\frac{2}{inputCount + neuronCount}}$.
He initialization uses a Gaussian probability distribution with a mean of 0 and a standard deviation of $\sqrt{\frac{2}{inputCount}}$.
When the Sigmoid activation function is used, input values are squashed into the range between 0 and 1. This nonlinearity can lead to vanishing gradients, making it difficult for backpropagation to effectively update the network's weights. Xavier initialization addresses this issue by keeping the variance of activations and gradients stable across layers, improving signal flow during training.
In contrast, the ReLU activation function zeroes out all negative inputs, effectively “killing” half of the signal. To compensate for this drop in signal strength, He initialization uses a larger variance than Xavier. This helps maintain the flow of information through the network and ensures more stable training with ReLU-activated layers.
C++ has a built-in pseudo-random generator that produces uniformly distributed random values. The Box-Muller transform can efficiently transform them into Gaussian distributed random numbers. It produces two independent random numbers N1 and N2 from two independent uniformly distributed random numbers U1 and U2 in the interval (0, 1), but we only need one of them.
The other would be $N2 = \sqrt{-2 \cdot \ln \left( U1 \right)} \cdot \sin \left( 2 \pi \cdot U2 \right)$, but it is not needed here.
// Box-Muller transform to calculate a normally distributed random variable from a uniformly distributed random function
// 1. select 2 independent uniformly distributed random values in interval (0, 1)
#define MAX_LONG 2147483647
float U1 = ((float) random (MAX_LONG - 1) + 1) / MAX_LONG;
float U2 = ((float) random (MAX_LONG - 1) + 1) / MAX_LONG;
// 2. use the Box-Muller transform to transform them into two independent normally distributed random values with a mean of 0 and a variance of 1
float N1 = sqrt (-2 * log (U1)) * cos (2 * M_PI * U2);
// float N2 = sqrt (-2 * log (U1)) * sin (2 * M_PI * U2); // we don't actually need the second independent random variable here
// 3. apply the desired mean of 0 and standard deviation (sqrt (2.0 / (inputCount + neuronCount)) for Xavier, sqrt (2.0 / inputCount) for He) to random variable N1
float Xavier = 0 + sqrt (2.0 / (inputCount + neuronCount)) * N1;
float He = 0 + sqrt (2.0 / inputCount) * N1;
Biases are usually set to 0.
This part is based on Backpropagation calculus.
At this point, the results that the neural network produces are, well, pretty random. The neural network must be trained first with a set of patterns belonging to already known categories. The idea of training on each of the known input patterns is to minimize the difference (or the error) between the expected result and the result that the neural network actually produced. To assess the error of a single pattern, we’ll use a variant of the MSE function (mean squared error): $E = \sum_{k=0}^{N^{Loutput}-1} \frac{1}{2} \cdot \left( af_{[k]}^{Loutput} - expected_{[k]} \right)^{2}$
We could use some other error functions as well but this one differentiates nicely.
What we need to do is calculate the error function gradients $\frac{\partial E}{\partial weight}$ and $\frac{\partial E}{\partial bias}$, so that the weights and biases can be nudged in the direction that reduces the error.
These are the notations and definitions we are going to use so it will be easier to follow the deduction.
L – we will count the layers of the neural network with L
Loutput = output layer of the neural network
n - we will count the neurons within each layer with n
NL = the number of neurons at layer L, which is also the number of outputs of layer L
i – we will usually count the inputs to each layer with i
IL = the number of inputs to layer L
weightL = weight matrix at layer L
weight L[n][i] = element [n][i] of weight matrix at layer L
biasL = bias vector at layer L
biasL[n] = element [n] of bias vector at layer L
zL = intermediate result vector at layer L: zL = weightL x inputL + biasL
zL[n] = element [n] of z vector at layer L
af = neuron activation function: af (z)
afL[n] = element [n] of calculated neuron value (output) vector at layer L: af (zL[n])
af‘ = neuron activation function derivative
expected = vector of expected values at the output layer; since there is only one, we do not have to explicitly label it as expectedLoutput
expected[n] = element [n] of vector of expected values
E = error function calculated on the output of neural network for a single pattern on its input
delta = error term vector at layer L, defined as $delta_{[n]}^{L} = \frac{\partial E}{\partial z_{[n]}^{L}}$
Since E is a function of z: E = E (z), and z is a function of weight: z = z (weight), we can use the chain rule for derivatives: $\frac{\partial E}{\partial weight_{[n][i]}^{Loutput}} = \frac{\partial E}{\partial af_{[n]}^{Loutput}} \cdot \frac{\partial af_{[n]}^{Loutput}}{\partial z_{[n]}^{Loutput}} \cdot \frac{\partial z_{[n]}^{Loutput}}{\partial weight_{[n][i]}^{Loutput}}$
- Let's solve this expression part by part. In the first part all the derivatives of the
$\sum_{k=0}^{N^{Loutput}-1} \frac{1}{2} \cdot \left(af_{[k]}^{Loutput} - expected_{[k]} \right)^{2}$ terms are 0 except when k = n, which leaves $\left( af_{[n]}^{Loutput} - expected_{[n]} \right)$.
- The second part is af' by definition.
- In the third part all the derivatives of the
$\sum_{k=0}^{I^{Loutput}-1} \left( weight_{[n][k]}^{Loutput} \cdot af_{[k]}^{Loutput-1} \right) + bias_{[n]}^{Loutput}$ terms are 0 except when k = i, which leaves $af_{[i]}^{Loutput-1}$.
Note
Let us at this point define delta as $delta_{[n]}^{Loutput} = \frac{\partial E}{\partial af_{[n]}^{Loutput}} \cdot \frac{\partial af_{[n]}^{Loutput}}{\partial z_{[n]}^{Loutput}} = \left( af_{[n]}^{Loutput} - expected_{[n]} \right) \cdot af' \left( z_{[n]}^{Loutput} \right)$. We will use delta to:
- update weights
- update biases
- propagate the error to the previous layers
Considering the terms that we have just solved above we get: $\frac{\partial E}{\partial weight_{[n][i]}^{Loutput}} = delta_{[n]}^{Loutput} \cdot af_{[i]}^{Loutput-1}$
And finally: $weight_{[n][i]}^{Loutput} = weight_{[n][i]}^{Loutput} - learningRate \cdot delta_{[n]}^{Loutput} \cdot af_{[i]}^{Loutput-1}$
Similarly we can calculate $\frac{\partial E}{\partial bias_{[n]}^{Loutput}}$.
Important
Considering that the neurons at the output layer are already calculated by the time the error is assessed, we can calculate deltaLoutput directly from them.
Considering that $\frac{\partial z_{[n]}^{Loutput}}{\partial bias_{[n]}^{Loutput}} = 1$, we get $\frac{\partial E}{\partial bias_{[n]}^{Loutput}} = delta_{[n]}^{Loutput}$.
Similarly we can update biasLoutput: $bias_{[n]}^{Loutput} = bias_{[n]}^{Loutput} - learningRate \cdot delta_{[n]}^{Loutput}$
Gradient $\frac{\partial E}{\partial weight}$ in hidden layers
Similarly to what we did with the output layer, we can go one layer back:
While we can use the same deduction as we did for the output layer for the second part, which is af'[i]Loutput-1, and the third part, which is af[j]Loutput-2, we cannot do the same for the first part, $\frac{\partial E}{\partial af_{[i]}^{Loutput-1}}$, since af[i]Loutput-1 influences every neuron of the output layer and all of their contributions to the error have to be summed up.
As we have already derived for the output layer, the first two parts under E are delta[n]Loutput. The third part is weight[n][i]Loutput, since all the derivatives of the $\sum_{k=0}^{I^{Loutput}-1} \left( weight_{[n][k]}^{Loutput} \cdot af_{[k]}^{Loutput-1} \right) + bias_{[n]}^{Loutput}$ terms with respect to $af_{[i]}^{Loutput-1}$ are 0 except when k = i.
So finally: $\frac{\partial E}{\partial weight_{[i][j]}^{Loutput-1}} = \left( \sum_{n=0}^{N^{Loutput}-1} weight_{[n][i]}^{Loutput} \cdot delta_{[n]}^{Loutput} \right) \cdot af' \left( z_{[i]}^{Loutput-1} \right) \cdot af_{[j]}^{Loutput-2}$
Note
This equation gives us a useful connection between deltaLoutput and deltaLoutput-1: $delta_{[i]}^{Loutput-1} = \left( \sum_{n=0}^{N^{Loutput}-1} weight_{[n][i]}^{Loutput} \cdot delta_{[n]}^{Loutput} \right) \cdot af' \left( z_{[i]}^{Loutput-1} \right)$. Going to even lower layers would just mean repeating the same steps again at each layer,
so we can write a recursive equation for all hidden layers: $delta_{[i]}^{L-1} = \left( \sum_{n=0}^{N^{L}-1} weight_{[n][i]}^{L} \cdot delta_{[n]}^{L} \right) \cdot af' \left( z_{[i]}^{L-1} \right)$
Gradient $\frac{\partial E}{\partial bias}$ in previous (hidden) layers
Similarly we can calculate $\frac{\partial E}{\partial bias_{[n]}^{L}} = delta_{[n]}^{L}$.
Important
Weight and bias update formulas for previous (hidden) layer
Let's shift L by 1 in the recursive equation for hidden layers deltaL for practical reasons: $delta_{[i]}^{L} = \left( \sum_{n=0}^{N^{L+1}-1} weight_{[n][i]}^{L+1} \cdot delta_{[n]}^{L+1} \right) \cdot af' \left( z_{[i]}^{L} \right)$
Updating weightL is the same as for the output layer: $weight_{[n][i]}^{L} = weight_{[n][i]}^{L} - learningRate \cdot delta_{[n]}^{L} \cdot af_{[i]}^{L-1}$
Updating biasL is the same as for the output layer: $bias_{[n]}^{L} = bias_{[n]}^{L} - learningRate \cdot delta_{[n]}^{L}$
Please note that delta gets updated in two distinct layers: L and L+1. The neural network here is implemented as a C++ variadic template, so one layer cannot directly access another's internal data. This is why delta gets updated in two parts. In layer L+1, weightL+1 x deltaL+1 gets calculated. Once processing returns to layer L, it is multiplied by af'(zL).
// hidden layers
template <size_t inputCount, size_t af, size_t neuronCount, size_t... sizes>
class neuralNetworkLayer_t<inputCount, af, neuronCount, sizes...> {
private:
// data structures needed for this layer: weight and bias
float weight [neuronCount][inputCount];
float bias [neuronCount];
// include the next layer instance which will include the next layer itself, ...
neuralNetworkLayer_t<neuronCount, sizes...> nextLayer;
public:
// iterate from the last layer to the first and update weight and bias meanwhile
template<typename input_t, typename expected_t>
float backwardPropagation (const input_t (&input) [inputCount], const expected_t (&expected) [outputCount], float previousLayerDelta [inputCount] = NULL) { // the size of expected in all layers equals the size of the output of the output layer
// while moving forward do exactly the same as forwardPass function does
float z [neuronCount];
float neuron [neuronCount];
// z = weight x input + bias
// neuron = af (z)
for (size_t n = 0; n < neuronCount; n++) {
z [n] = bias [n];
for (size_t i = 0; i < inputCount; i++)
z [n] += weight [n][i] * input [i];
neuron [n] = af (z [n]);
}
// calculate the first part of delta in the next layer then apply activation function derivative here
// delta = next layer weight * next layer delta * af' (z)
float delta [neuronCount];
float error = nextLayer.backwardPropagation (neuron, expected, delta);
// calculate only the second part of delta; the first part has already been calculated at the next layer
for (size_t n = 0; n < neuronCount; n++)
delta [n] *= afDerivative (z [n]);
// update weight and bias at this layer
for (size_t n = 0; n < neuronCount; n++) {
// update weight
for (size_t i = 0; i < inputCount; i++)
weight [n][i] -= learningRate * delta [n] * input [i];
// update bias
bias [n] -= learningRate * delta [n];
}
// calculate only the first part of previous layer delta, since z from previous layer is not available here
// previousLayerDelta = weight * delta * af' (previous layer z)
if (previousLayerDelta != NULL) // the outermost layer is called without a previousLayerDelta buffer
for (size_t i = 0; i < inputCount; i++) {
previousLayerDelta [i] = 0;
for (size_t n = 0; n < neuronCount; n++)
previousLayerDelta [i] += weight [n][i] * delta [n];
}
return error;
}
};
// output layer
template <size_t inputCount, size_t af, size_t neuronCount>
class neuralNetworkLayer_t<inputCount, af, neuronCount> {
private:
// data structures needed for this layer: weight and bias
float weight [neuronCount][inputCount];
float bias [neuronCount];
public:
// iterate from the last layer to the first and adjust weight and bias meanwhile; returns the error calculated at the output layer
template<typename input_t, typename expected_t>
float backwardPropagation (const input_t (&input) [inputCount], const expected_t (&expected) [neuronCount], float previousLayerDelta [inputCount] = NULL) { // the size of expected in all layers equals the size of the output of the output layer
// while moving forward do exactly the same as forwardPass function does
float z [neuronCount];
float neuron [neuronCount];
// z = weight x input + bias
// neuron = af (z)
for (size_t n = 0; n < neuronCount; n++) {
z [n] = bias [n];
for (size_t i = 0; i < inputCount; i++)
z [n] += weight [n][i] * input [i];
neuron [n] = af (z [n]);
}
// calculate the error
float error = 0;
for (size_t n = 0; n < neuronCount; n++)
error += (expected [n] - neuron [n]) * (expected [n] - neuron [n]);
error = sqrt (error) / 2;
// update weight and bias at output layer
// delta = (neuron - expected) * af' (z)
// weight -= learningRate * delta * input
// bias -= learningRate * delta
float delta [neuronCount];
for (size_t n = 0; n < neuronCount; n++) {
// calculate delta at the output layer
delta [n] = (neuron [n] - expected [n]) * afDerivative (z [n]);
// update weight
for (size_t i = 0; i < inputCount; i++)
weight [n][i] -= learningRate * delta [n] * input [i];
// update bias
bias [n] -= learningRate * delta [n];
}
// calculate only the first part of previous layer delta, since z from previous layer is not available at this layer
// previousLayerDelta = weight * delta * af' (previous layer z)
for (size_t i = 0; i < inputCount; i++) {
previousLayerDelta [i] = 0;
for (size_t n = 0; n < neuronCount; n++)
previousLayerDelta [i] += weight [n][i] * delta [n];
}
return error;
}
};
Proper preparation of input data is crucial for effective neural network performance. If we had built a convolutional neural network (CNN), we could feed it raw audio recordings directly, allowing the CNN to learn spatial and temporal patterns on its own. However, with a fully connected neural network like ours, a different strategy is required. Instead of raw data, we must extract a set of distinctive features from the sound recording, features that the network can then use for accurate classification.
A clap has a distinctive shape on the oscilloscope, characterized by a loud, rapidly fading high-frequency sound. The most significant condition, a high amplitude of the sound, is easily checked before the other features are estimated. These features are extracted or calculated from the audio recording and (hopefully) distinguish claps from other loud sounds. Here is an example of a clap audio recording and its features. These features serve as the input to our neural network.
The sound is sampled using 256 samples at a rate of 35.74 kHz, resulting in approximately 7.1 milliseconds of audio. This duration is sufficient to reliably detect a clap.
Some features can be extracted directly from the time-domain signal, as the waveform varies over time.
Zero crossings are calculated by counting how many times the signal crosses the time axis. For a typical clap, around 20 zero crossings can be expected within a 7-millisecond window.
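A minimal sketch of counting zero crossings, assuming the 256 raw samples are already stored in a sample [] array and have been centered around 0 (the names and the centering step are illustrative assumptions, not the library's API):
// count how many times the (centered) signal changes sign
int zeroCrossings (const int sample [], int sampleCount) {
    int crossings = 0;
    for (int i = 1; i < sampleCount; i++)
        if ((sample [i - 1] < 0 && sample [i] >= 0) || (sample [i - 1] >= 0 && sample [i] < 0))
            crossings++;
    return crossings; // around 20 for a typical 7 ms clap
}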
To estimate how quickly the sound energy decays, we calculate the linear regression coefficient of the signal’s power. Instead of using raw power values, we apply a logarithmic transformation, which better reflects human perception of loudness. This also ensures that louder segments of the recording don’t dominate the analysis, allowing quieter parts to contribute meaningfully.
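A sketch of how such a decay estimate could be computed, assuming the recording is split into short segments whose logarithmic power is fitted with an ordinary least squares line (the segment size and function name are assumptions for illustration):
#include <math.h>
// estimate how quickly the sound energy decays: slope of a least squares line fitted to log (power) per segment
float powerDecaySlope (const int sample [], int sampleCount, int segmentSize) {
    int segments = sampleCount / segmentSize;
    float sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (int s = 0; s < segments; s++) {
        float power = 0;
        for (int i = 0; i < segmentSize; i++) {
            float v = sample [s * segmentSize + i];
            power += v * v;
        }
        float y = logf (power / segmentSize + 1); // + 1 avoids log (0) for silent segments
        sumX += s; sumY += y; sumXY += s * y; sumXX += (float) s * s;
    }
    // ordinary least squares slope: (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
    return (segments * sumXY - sumX * sumY) / (segments * sumXX - sumX * sumX);
}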
More distinctive features can be extracted from the frequency domain of the sound signal. Since the focus here is on the mathematical modeling and neural network implementation, we’ll only briefly outline the frequency-based feature extraction process.
To analyze the frequency spectrum of a signal, we apply a Fourier Transform. This mathematical technique decomposes the signal into a sum of sine waves of varying frequencies and amplitudes, revealing which frequencies are present and how strongly they contribute. Since our signal is digitized, we use the discrete form, known as the Discrete Fourier Transform (DFT). Efficient algorithms like the Fast Fourier Transform (FFT) compute the DFT in O(n log n) time.
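In practice an FFT would be used, but the underlying DFT can be illustrated with a short, naive O(n²) sketch (the function and buffer names are only for illustration):
#include <math.h>
#define SAMPLES 256
// naive DFT: magnitude of each frequency bin k, 0 <= k < SAMPLES / 2
void dftMagnitude (const float sample [SAMPLES], float magnitude [SAMPLES / 2]) {
    for (int k = 0; k < SAMPLES / 2; k++) {
        float re = 0, im = 0;
        for (int n = 0; n < SAMPLES; n++) {
            float angle = 2 * M_PI * k * n / SAMPLES;
            re += sample [n] * cosf (angle);
            im -= sample [n] * sinf (angle);
        }
        magnitude [k] = sqrtf (re * re + im * im);
    }
}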
Human perception of sound—such as pitch and loudness—is inherently nonlinear. We perceive pitch on a logarithmic scale, not a linear one. To better align with this perceptual model, frequency magnitudes are mapped to the mel scale (short for “melody”). This transformation ensures that the frequency representation reflects how we actually hear sound.
Next, mel filters are applied to resample the frequency magnitudes from a linear scale to the mel scale. These magnitudes are then transformed logarithmically to compress the dynamic range and highlight perceptually significant features.
The number of mel filters can vary depending on the application. For example, 20 mel filters are highly effective for recognizing hand claps.
Mel filters are triangular filters distributed evenly across the mel scale. Because the mel scale is logarithmic, their spacing on the linear frequency axis is uneven. Each mel filter is defined by three points: a start, a peak, and an end. To construct 20 mel filters, we calculate 22 mel frequency points, which serve as the boundaries and peaks for the filters.
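A sketch of how the 22 mel frequency points mentioned above could be computed, using the commonly used conversion mel(f) = 2595 · log10(1 + f / 700) and its inverse (the constants and helper names are the widely used ones, assumed here rather than taken from this library):
#include <math.h>
#define MEL_FILTERS 20
// convert between Hz and the mel scale
float hzToMel (float hz)  { return 2595.0f * log10f (1 + hz / 700.0f); }
float melToHz (float mel) { return 700.0f * (powf (10.0f, mel / 2595.0f) - 1); }
// calculate 22 frequency points, evenly spaced on the mel scale, that bound 20 triangular mel filters
void melPoints (float lowHz, float highHz, float pointHz [MEL_FILTERS + 2]) {
    float lowMel = hzToMel (lowHz);
    float highMel = hzToMel (highHz);
    for (int p = 0; p < MEL_FILTERS + 2; p++)
        pointHz [p] = melToHz (lowMel + (highMel - lowMel) * p / (MEL_FILTERS + 1));
    // filter f starts at pointHz [f], peaks at pointHz [f + 1] and ends at pointHz [f + 2]
}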
The final step in feature extraction for neural networks is computing Mel-Frequency Cepstral Coefficients (MFCCs). While the raw output of mel filters can be fed directly into a neural network, using cepstral coefficients offers several advantages:
- Reduced correlation between features
- Lower dimensionality (typically only the first few coefficients are used)
- Improved noise robustness (higher-order coefficients tend to capture noise)
These benefits often lead to better performance in tasks like speech recognition or audio classification.
MFCCs are computed using the Discrete Cosine Transform (DCT), which converts the logarithmic mel filter outputs into cepstral coefficients. The DCT is similar to the DFT but operates only on real numbers. In essence, we’re analyzing the spectrum of a spectrum, and in a playful twist, someone reversed the first letters of “spectrum” to coin the term “cepstrum.”
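A minimal DCT-II sketch that turns the 20 logarithmic mel filter outputs into cepstral coefficients (the number of coefficients kept, the missing normalization factor, and the names are illustrative assumptions):
#include <math.h>
#define MEL_FILTERS 20
#define MFCC_COEFFICIENTS 8
// DCT-II of the log mel filter outputs - the first few coefficients are the MFCC features
void mfcc (const float logMel [MEL_FILTERS], float coefficient [MFCC_COEFFICIENTS]) {
    for (int c = 0; c < MFCC_COEFFICIENTS; c++) {
        coefficient [c] = 0;
        for (int m = 0; m < MEL_FILTERS; m++)
            coefficient [c] += logMel [m] * cosf (M_PI * c * (m + 0.5f) / MEL_FILTERS);
    }
}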
Preparing the features to feed into the neural network may have been a bit confusing so far. Here is an overview that recapitulates everything.






