One way to ensure that a Siamese network produces the required encodings of the input images is to define a triplet loss function and apply gradient descent to it.
The learning objective of the model is to compare an anchor image with one positive image of the same person and one negative image of a different person, minimising the distance between the encodings of the anchor and the positive image while maximising the distance between the encodings of the anchor and the negative image. This objective is known as the triplet loss function, because it always looks at three images (anchor, positive, and negative) during model training.
A small margin α is added to the equation so that the network cannot trivially satisfy the objective without learning useful parameters, for example by outputting the same encoding (such as all zeros) for every image. The margin also ensures that the distance between the anchor and the negative image exceeds the distance between the anchor and the positive image by at least α.
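As a concrete illustration, the loss described above can be sketched in a few lines of NumPy. The function name and the default margin value are assumptions for this sketch, not part of the original text:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss for one (anchor, positive, negative) triple.

    Each argument is a 1-D encoding vector f(x) produced by the
    Siamese network; alpha is the margin discussed above.
    """
    # Squared L2 distances between the encodings.
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    # Loss becomes zero only once neg_dist exceeds pos_dist by at
    # least alpha, so all-zero encodings cannot satisfy the objective.
    return max(pos_dist - neg_dist + alpha, 0.0)
```

When the negative encoding is far from the anchor the loss is zero; when it is too close, the loss is positive and gradient descent pushes the encodings apart.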
Training this network for a face recognition system requires multiple images of the same person. Ideally, we want to choose triplets that are 'hard' to train on, i.e. where the distance between anchor and positive image is similar to that between anchor and negative image, which increases the computational efficiency of the learning algorithm. Conversely, if triplets are chosen at random, it is very easy for the network to satisfy the objective function, and gradient descent is not able to learn much.
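One simple way to act on this is to filter a batch of candidate triplets and keep only those that still violate the margin. This is a hedged sketch (the helper name and batch layout are assumptions), not a full mining strategy:

```python
import numpy as np

def select_hard_triplets(anchors, positives, negatives, alpha=0.2):
    """Return indices of triplets that still incur a non-zero loss.

    Each argument is an (n, d) array of encodings; row i of the three
    arrays forms one candidate (anchor, positive, negative) triplet.
    """
    pos_dist = np.sum((anchors - positives) ** 2, axis=1)
    neg_dist = np.sum((anchors - negatives) ** 2, axis=1)
    # Randomly drawn triplets are usually 'easy' (loss already zero)
    # and contribute no gradient; keep only margin-violating ones.
    hard = pos_dist - neg_dist + alpha > 0
    return np.where(hard)[0]
```

Training on the returned subset concentrates gradient updates on the triplets the network has not yet separated.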
Once the network is trained and tuned, it can be applied to one-shot learning: for face recognition, a single image of a person is enough for the system to recognise them.
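At recognition time, one-shot verification reduces to comparing encoding distances against a threshold. A minimal sketch, assuming a trained embedding function `encode` and a threshold `tau` tuned on a validation set (both hypothetical names):

```python
import numpy as np

def is_same_person(encode, stored_encoding, new_image, tau=0.7):
    """One-shot check: a single stored encoding per person suffices.

    encode         -- the trained Siamese network's embedding function
    stored_encoding -- encoding of the one enrolled image of a person
    new_image      -- the image to verify
    tau            -- distance threshold (assumed, tuned on held-out data)
    """
    # Same person if the new encoding lies within tau of the stored one.
    distance = np.linalg.norm(encode(new_image) - stored_encoding)
    return bool(distance < tau)
```

No retraining is needed when a new person is enrolled; only their single stored encoding is added.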