Understanding Batch Size, Epochs, and Training Steps in Neural Networks
When diving into deep learning, terms such as batch size, epochs, and iterations can often lead to confusion, especially for newcomers.
In deep learning, these elements—batch size, epochs, and training steps—are known as model hyperparameters, which must be set manually. They are standard parameters found in nearly every neural network model.
Understanding the definitions of these hyperparameters is crucial because we will need to adjust their values during the training process within the fit() method, as demonstrated below.
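The original code figure is not reproduced here, so below is a minimal sketch of such a call, assuming a compiled Keras model named model and NumPy arrays x_train, y_train, x_val, and y_val are already defined; the hyperparameter values match the example used throughout this article.

```python
# A minimal sketch of a fit() call with the hyperparameters discussed here.
# The compiled model and the arrays are assumed to exist already.
history = model.fit(
    x_train, y_train,
    batch_size=128,                  # number of samples per batch
    epochs=20,                       # passes over the full training set
    steps_per_epoch=None,            # derived from dataset size / batch size
    shuffle=True,                    # reshuffle the training data before each epoch
    validation_data=(x_val, y_val),  # assumed hold-out set, so validation loss is reported
)
```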
In the code example above:
- The batch_size indicates the number of samples in each batch.
- The epochs represent the total cycles through the training dataset.
- The steps_per_epoch denotes the number of training steps within a single epoch.
Additionally, the shuffle parameter is another hyperparameter that plays a role in the training process (more on that soon).
Upon training a neural network model with the specified hyperparameter values, the output will appear similar to the following.
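The exact log depends on your model, data, and hardware. For a 60,000-instance training set and a batch size of 128 (the example used throughout this article), the progress bar printed by fit() looks roughly like this; the loss and accuracy figures are purely illustrative.

```text
Epoch 1/20
469/469 [==============================] - 3s 5ms/step - loss: 0.4215 - accuracy: 0.8832 - val_loss: 0.2381 - val_accuracy: 0.9311
Epoch 2/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1987 - accuracy: 0.9420 - val_loss: 0.1769 - val_accuracy: 0.9478
...
```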
You've likely encountered such outputs when invoking the fit() method during the model's training phase.
Let’s break down each term and explore the relationships among them.
Training Connections: Batch Size, Epochs, and Training Steps
Neural networks typically work with extensive datasets that may contain thousands or millions of samples. Utilizing the entire dataset for each gradient update would be both time-consuming and resource-intensive. In fact, very large datasets may not even fit into the system's memory.
To address this, we implement batches, which are subsets of the dataset, to conduct gradient updates during training.
A batch consists of multiple training instances (samples).
Batch size refers to the number of instances within a batch. For example, when batch_size=128, it indicates that each batch contains 128 training instances.
It's essential to differentiate between batch size and the number of batches. The number of batches is calculated as follows:
No. of batches = Size of the entire dataset / batch size, rounded up to the nearest whole number
This calculation shows that the number of batches depends on two factors: the total size of the dataset and the chosen batch size. Rounding up accounts for a final, smaller batch whenever the division leaves a remainder; if the dataset size divides evenly by the batch size, nothing extra is added.
Let’s clarify this with an example:
Suppose the training dataset has 60,000 instances. To find the number of batches, we divide the dataset size by the batch size of 128.
60,000 / 128 = 468.75
Since this results in a fractional number, we add 1 to 468, leading to:
No. of batches = 468 + 1 = 469
Hence, in this case, we have 469 batches. As highlighted in the earlier output, 468 of these batches will contain 128 training instances each, while the last batch (resulting from the fractional part) will have 96 training instances [60,000 - (128x468)].
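A quick way to verify this arithmetic in plain Python (nothing here is Keras-specific):

```python
import math

dataset_size = 60_000
batch_size = 128

num_batches = math.ceil(dataset_size / batch_size)  # 469 batches in total
full_batches = dataset_size // batch_size           # 468 batches of 128 instances
last_batch_size = dataset_size % batch_size          # 96 instances in the final batch

print(num_batches, full_batches, last_batch_size)    # 469 468 96
```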
During training, batches of data flow through the neural network, where the loss function computes the error. This error is then backpropagated through the network to adjust its parameters, enhancing predictive performance in subsequent steps.
Let’s delve into the mechanics of this process.
Imagine the training data comprises 60,000 instances with a batch size of 128.
Before drawing batches from the dataset, the algorithm may shuffle the training data if we set shuffle=True in the fit() method.
The algorithm then begins to extract batches from the dataset, starting with the first 128 instances (first batch). It trains the model, computes the average error, and performs a gradient update—this constitutes one training step (or iteration).
More precisely, a training step (iteration) refers to a single gradient update.
Next, the algorithm processes the second batch of 128 instances, trains the model, calculates the average error, and updates the parameters again. This represents another training step.
The algorithm continues this process until all batches are processed, totaling 469 in our example. This marks the conclusion of one epoch, during which the model has seen the entire dataset.
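To make the mechanics concrete, here is a schematic, framework-free version of one epoch of this loop written with NumPy; the commented-out update is a placeholder for whatever forward pass, loss computation, and gradient update your framework actually performs.

```python
import numpy as np

def run_one_epoch(x_train, y_train, batch_size=128, shuffle=True):
    n = len(x_train)
    indices = np.arange(n)
    if shuffle:                      # corresponds to shuffle=True in fit()
        np.random.shuffle(indices)

    steps = 0
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        x_batch, y_batch = x_train[batch_idx], y_train[batch_idx]

        # One training step (iteration): forward pass, average loss,
        # backpropagation, and a single gradient update.
        # update_parameters(x_batch, y_batch)   # placeholder for the real update
        steps += 1

    return steps

# e.g. run_one_epoch(np.zeros((60_000, 10)), np.zeros(60_000)) returns 469
```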
Epochs refer to how many times the model is exposed to the full dataset.
The total number of gradient updates executed in one epoch corresponds to the number of training steps (iterations) during that epoch.
It is important to note that epochs and iterations are distinct concepts.
We can summarize the relationship with the following equation regarding one training epoch:
No. of training steps = No. of batches = No. of gradient updates
In our example, with 469 batches, there are also 469 training steps or gradient updates in one epoch.
Since we specified 20 epochs in total, only one epoch of training has been completed at this point.
Next, the algorithm prepares for the second epoch, again shuffling the training data. The same procedure from the first epoch is repeated here.
Once the algorithm completes all 20 epochs (the full dataset has been shown to the model 20 times), the training process concludes.
To calculate the total number of gradient updates for the complete training process, we can use the following formula:
No. of ALL gradient updates = No. of batches x No. of epochs
Thus, in this scenario, the algorithm has performed 9380 (469x20) gradient updates or completed 9380 training steps throughout the entire training process.
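Continuing the small sanity check from earlier, the same numbers can be reproduced with a few lines of Python:

```python
import math

def total_gradient_updates(dataset_size, batch_size, epochs):
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * epochs

print(total_gradient_updates(60_000, 128, 20))   # 9380
```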
Identifying an Optimal Batch Size
Increasing the batch size has the following effects:
- The algorithm performs more stable gradient updates.
- Each training step (iteration) takes longer to complete.
- The overall training process becomes more resource-intensive and time-consuming.
In tf.keras, the batch size is specified using the batch_size hyperparameter (argument) within the model's fit() method.
The batch_size parameter accepts an integer or None. If set to None or left unspecified, the default is 32. Common alternatives for batch_size are 16, 64, 128, and 256.
The minimum batch size is 1, in which case each gradient update is based on a single training instance. The count of gradient updates, training steps, or batches in one epoch then equals the size of the complete training dataset.
Conversely, the maximum batch size is equivalent to the total dataset size, performing one gradient update for the entire dataset. Here, the count of gradient updates, training steps, or batches in one epoch equals 1.
Any integer value between these two extremes can be selected. The number of gradient updates, training steps, or batches in one epoch will depend on the chosen batch size.
Batch Size and Variants of Gradient Descent
Gradient descent is an iterative optimization method used to train machine learning and deep learning models by adjusting model parameters to minimize the loss function. The batch_size and epochs are key hyperparameters of this algorithm, set in the model's fit() method as mentioned earlier.
When the batch size equals the full dataset size, the algorithm is referred to as batch gradient descent. Here, model parameters are updated once after each training epoch, offering computational efficiency.
When the batch size is set to 1, it’s termed stochastic gradient descent. In this scenario, model parameters are updated after each training example within one epoch, which is computationally costly.
When the batch size falls between 1 and the total dataset size, it is identified as mini-batch gradient descent.
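In tf.keras terms, the three variants differ only in the batch_size passed to fit(). A hedged sketch, assuming the same model and x_train, y_train arrays as before:

```python
# Batch gradient descent: one update per epoch, using the whole dataset at once.
model.fit(x_train, y_train, batch_size=len(x_train), epochs=20)

# Stochastic gradient descent: one update per training instance.
model.fit(x_train, y_train, batch_size=1, epochs=20)

# Mini-batch gradient descent: anything in between (the usual choice).
model.fit(x_train, y_train, batch_size=128, epochs=20)
```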
Determining the Appropriate Number of Epochs
In tf.keras, the number of epochs is defined using the epochs hyperparameter (argument) in the fit() method of the model. This parameter accepts an integer.
The algorithm requires a sufficient number of epochs to effectively complete training. Initially, you might set the number of epochs to integers such as 5, 10, 20, 50, etc., and then track the training and validation losses against the epoch counts.
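One way to produce such a plot is from the History object returned by fit(), assuming matplotlib is available and validation data was supplied as in the earlier sketch:

```python
import matplotlib.pyplot as plt

# 'history' is the object returned by model.fit() above.
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')  # requires validation data
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```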
If the resulting plot suggests a simultaneous decrease in both training and validation losses, consider increasing the number of epochs to explore further reductions in losses.
In this case, the model may be underfitting the training data, indicating subpar performance on both training and validation datasets.
Now, consider this alternate plot.
If you observe this type of plot, it suggests the model is being trained for longer than necessary: both training and validation losses plateau after the 20th epoch, so 20 epochs may suffice for effective training in this case.
Now, review this plot.
If this plot appears, it’s wise to halt the training process at the 5th epoch since the validation loss begins to rise thereafter. Otherwise, you risk overfitting the model to the training data, which would lead to strong performance on training data but poor results on new, unseen data.
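A common way to automate this stopping point in tf.keras is the EarlyStopping callback, which halts training once the validation loss stops improving; a sketch under the same assumptions as before:

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',        # watch the validation loss
    patience=3,                # stop after 3 epochs without improvement
    restore_best_weights=True  # roll back to the weights from the best epoch
)

model.fit(x_train, y_train, batch_size=128, epochs=50,
          validation_data=(x_val, y_val), callbacks=[early_stop])
```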
Generally, increasing the number of epochs results in a more resource-demanding and time-consuming training process.
Determining Training Steps Within an Epoch
In tf.keras, the number of training steps in a single epoch is defined by the steps_per_epoch hyperparameter (argument) in the model's fit() method. This can be an integer or None, with the default being set to None. You generally do not need to specify a value for this parameter, as the algorithm automatically calculates it using the following equation:
No. of steps = Size of the entire dataset / batch size, rounded up to the nearest whole number
Essentially, the number of training steps per epoch corresponds to the number of batches!
If you choose to specify a value for the steps_per_epoch hyperparameter, it will override the default. For instance, setting steps_per_epoch=500 will instruct the algorithm to use 500 iterations (batches) in each epoch instead of the 469 iterations (batches) previously mentioned.
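Overriding steps_per_epoch is most commonly done when feeding the model from a tf.data dataset or a generator rather than plain NumPy arrays. A sketch of what that might look like, reusing the assumed x_train and y_train arrays:

```python
import tensorflow as tf

# Wrap the arrays in a tf.data pipeline; repeat() lets the model draw
# 500 batches per epoch even though the dataset only yields 469 of them.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(60_000)
            .batch(128)
            .repeat())

model.fit(train_ds, epochs=20, steps_per_epoch=500)
```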
This concludes today’s discussion.
Please feel free to reach out with any questions or feedback.
Recommended Reading
- All episodes of my “Neural Networks and Deep Learning Course”
Support My Writing
If you enjoyed this article, please consider supporting me by signing up for a membership to gain unlimited access to Medium. It costs just $5 a month, and I will receive a portion of your membership fee.
Thank you immensely for your continued support! I look forward to seeing you in the next article. Happy learning!
Rukshan Pramoditha 2022–08–26