Convolutional Neural Networks - Part 2

February 15, 2019 — 12 min

Convolutional neural networks are widely used for image-based problems such as object or character detection and face recognition. In this article, we focus on the most famous architectures, from LeNet to Siamese networks, most of which share the following structure:

3d arch

If you don’t have any knowledge about convolutional neural networks, I advise you to read the first part of this article, discussing the fundamentals of CNNs.

Table of contents

1 - Cross-Entropy
2 - Image classification
3 - Object detection - YOLOv3
4 - Face Recognition - Siamese Networks

1 - Cross-Entropy

When classifying an image, we often use a softmax function at the last layer, whose output has size $(C,1)$, where $C$ is the number of classes.
The $i^{th}$ entry of the vector is the probability that the input image belongs to class $i$. The predicted class is the one with the highest probability.
At the last layer $a^{[L]}$, we have:

$$(softmax(a^{[L]}))_i=\mathbb{P}(input\in class_i)=\frac{\exp(a^{[L]}_i)}{\sum_{j=1}^{C}\exp(a^{[L]}_j)}\quad\forall i\in [|1,C|]$$

The network learns through backpropagation by minimizing the cross-entropy, defined as follows:

$$\mathcal{H}(p,q)=-\sum_x \sum_{class} p(x, class)\log(q(x, class))$$


  • $p(x, class)$ is the reference probability: it equals 1 if the object really belongs to the given class and 0 otherwise
  • $q(x, class)$ is the probability, learned by the network through the softmax, that the object $x$ belongs to that class

For an input $x\in class_j$:

$$p(x, class_i)=\mathbb{1}_{i=j} \quad \text{(ground truth)}$$

$$q(x, class_i)=\mathbb{P}(input\in class_i)=\frac{\exp(a^{[L]}_i)}{\sum_{l=1}^{C}\exp(a^{[L]}_l)} \quad \text{(predicted)}$$

Thus, we set the loss function as:

$$\begin{aligned}cross\_entropy&=-\frac{1}{m}\sum_{k=1}^{m}\sum_{i=1}^{C}\mathbb{1}_{class(input_k)=i}\log(\mathbb{P}(input_k\in class_i))\\&=-\frac{1}{m}\sum_{k=1}^{m}\log(\mathbb{P}(input_k\in class(input_k)))\end{aligned}$$

We average the loss, where $m$ is the size of the training set.
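The softmax and the averaged cross-entropy above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy scores are assumptions made for the example, and a real network would rely on a deep learning framework rather than on lists.

```python
import math

def softmax(logits):
    """Turn the raw scores a^[L] into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Average of -log P(input_k in class(input_k)) over the m examples."""
    m = len(batch_logits)
    loss = 0.0
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        loss -= math.log(probs[label])  # only the true class contributes
    return loss / m

# Toy batch: m = 2 examples, C = 3 classes (made-up scores).
batch_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
labels = [0, 1]  # index of the true class of each example
loss = cross_entropy(batch_logits, labels)
```

With identical scores for all classes, the loss of a single example is $\log(C)$, which is a handy sanity check at the start of training.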

2 - Image classification

LeNet - Digits Recognition

LeNet is an architecture developed by Yann LeCun in 1998 that aims at recognizing the digit present in the input.
Given grayscale images of handwritten digits from 0 to 9, the convolutional neural network predicts the digit shown in the image.

The dataset is MNIST, which contains 70,000 images (60,000 for training and 10,000 for testing) of $28\times 28\times 1$ pixels. The neural network has the following architecture, with more than 60K parameters:

LeNet 5

For more details, I advise you to read the official paper.


AlexNet is a famous architecture which won the ImageNet competition in 2012. It is similar to LeNet but has more layers, uses dropout and relies on the ReLU activation function for most layers.


The training set is a subset of the ImageNet database, a collection of 15 million labeled high-resolution images representing more than 22k categories.
AlexNet used more than 1.2 million images in the training set, 50k in the validation set and 150k in the test set, all resized to $227\times 227\times 3$. The architecture has more than 60 million parameters and was therefore trained on 2 GPUs. It outputs a softmax vector of size $(1000,1)$. For more information, I advise you to read the official paper.


VGG-16 is a convolutional neural network for image classification, trained on the same ImageNet dataset. It has more than 138 million parameters and was trained on GPUs.
The architecture is the following:

VGG 16

It is deeper and more accurate than AlexNet, since it replaces the large $11\times 11$ and $5\times 5$ kernels with successive $3\times 3$ kernels.
For more details, check the official paper of the VGG project.
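To see why stacking small kernels is attractive, the short sketch below compares one $5\times 5$ convolution with two stacked $3\times 3$ convolutions. It assumes stride 1, no biases and the same (hypothetical) number of channels at the input and output of every layer; the helper names are made up for the example.

```python
def conv_weights(kernel_size, channels, n_layers=1):
    """Weights (biases ignored) of n stacked k x k convolutions, C in / C out."""
    return n_layers * kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of n stacked k x k convolutions with stride 1."""
    return 1 + n_layers * (kernel_size - 1)

channels = 64  # hypothetical channel count, identical for input and output
single_5x5 = conv_weights(5, channels)            # 25 * C^2 weights
two_3x3 = conv_weights(3, channels, n_layers=2)   # 18 * C^2 weights
field_two_3x3 = stacked_receptive_field(3, 2)     # same field as one 5x5
```

Under these assumptions, the two stacked layers cover the same $5\times 5$ receptive field with fewer weights, and they add one extra non-linearity in between.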

3 - Object detection - YOLO

Object detection is the task of detecting multiple objects in an image, which comprises both object localization and object classification. A first rough approach would be to slide a window with customizable dimensions over the image and predict, each time, the class of its content using a network trained on cropped images. This process has a high computational cost, which can fortunately be reduced by implementing it with convolutions.
YOLO stands for You Only Look Once. The basic idea consists in placing a grid on the image (usually $19\times 19$) where:

Only one cell, the one containing the center/midpoint of an object, is responsible for detecting this object.

Each cell $(i,j)$ of the grid is labelled as follows:

$$y_{i,j}={}^T[p_c, b_x, b_y, b_w, b_h,c_1,c_2,...,c_N]$$


  • $p_c$: the probability of presence of an object
  • $(b_x,b_y)$: the normalized coordinates of the object’s center
  • $(b_w, b_h)$: the normalized width and height of the object’s bounding box
  • $c_i$: the probability that the object belongs to the $i^{th}$ class
  • $N$: the number of classes
annot yolo

Hence, for each image the target output is of size:

$$size(target\_output)=[n_x\_cell\times n_y\_cell\times (5+N), 1]$$

where $(n_x\_cell\times n_y\_cell)$ is the size of the grid.
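As an illustration of this labelling, the sketch below builds the target of one image for a square grid. The function name, the input format and the choice of storing the center relative to its cell are assumptions made for the example; actual implementations differ in how they normalize the coordinates.

```python
def encode_target(objects, n_cells, n_classes):
    """Build the grid of label vectors [p_c, b_x, b_y, b_w, b_h, c_1..c_N].

    objects: list of (x_center, y_center, width, height, class_index),
    with coordinates normalized to [0, 1) relative to the whole image.
    """
    target = [[[0.0] * (5 + n_classes) for _ in range(n_cells)]
              for _ in range(n_cells)]
    for x, y, w, h, cls in objects:
        col = int(x * n_cells)  # cell containing the object's center
        row = int(y * n_cells)
        one_hot = [1.0 if k == cls else 0.0 for k in range(n_classes)]
        # Assumed convention: the center is stored relative to its cell,
        # the width and height stay relative to the whole image.
        b_x = x * n_cells - col
        b_y = y * n_cells - row
        target[row][col] = [1.0, b_x, b_y, w, h] + one_hot
    return target

# One made-up object of class 2 centered at (0.52, 0.31).
target = encode_target([(0.52, 0.31, 0.20, 0.40, 2)], n_cells=19, n_classes=3)
flat_size = sum(len(cell) for row in target for cell in row)
```

Only the cell that contains the center of the object gets $p_c=1$; every other cell keeps a zero vector.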


In order to evaluate the object localization, we use the Intersection Over Union (IOU), which measures the overlap between two bounding boxes as the area of their intersection divided by the area of their union.
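A minimal sketch of this measure is given below. It assumes that boxes are described by their corners `(x_min, y_min, x_max, y_max)`, which is a convenient choice for the example rather than the center/width/height format used in the labels.

```python
def iou(box_a, box_b):
    """Intersection Over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # The intersection is empty when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

The value ranges from 0 for disjoint boxes to 1 for identical boxes.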


When predicting the bounding box of a given object in a given cell of the grid, many outputs might be produced. Non-Max Suppression helps you detect the object only once: it keeps the box with the highest probability $p_c$ and suppresses the other boxes having a high overlap (IOU) with it.
For each cell of the grid, the algorithm is the following:

  • Given the output predictions ${}^T[p_c, b_x, b_y, b_w, b_h,c_1,c_2,...,c_N]$, discard all the boxes with $p_c<0.6$ (threshold on the presence of an object)
  • While there are any remaining boxes:

    • Pick the box with the highest $p_c$ and output it as a prediction
    • Discard any remaining box with $IOU\geq 0.5$ with the previous output box
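The steps above can be sketched as follows. The representation of a box as `(p_c, x_min, y_min, x_max, y_max)` and the function names are assumptions made for the example; the two thresholds are the ones quoted in the text.

```python
def iou(a, b):
    """Intersection Over Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, score_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (p_c, x_min, y_min, x_max, y_max) for one class."""
    # Step 1: discard the boxes with a low probability of presence
    remaining = sorted((b for b in boxes if b[0] >= score_threshold),
                       key=lambda b: b[0], reverse=True)
    kept = []
    # Step 2: repeatedly keep the best box and drop its strong overlaps
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if iou(best[1:], b[1:]) < iou_threshold]
    return kept

boxes = [
    (0.9, 0.0, 0.0, 2.0, 2.0),
    (0.8, 0.1, 0.1, 2.1, 2.1),  # overlaps heavily with the first box
    (0.7, 5.0, 5.0, 7.0, 7.0),  # a separate object
    (0.3, 0.0, 0.0, 1.0, 1.0),  # below the presence threshold
]
kept = non_max_suppression(boxes)
```

In this toy example, the second box is suppressed by the first one and the last box never passes the presence threshold, so two detections remain.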

Anchor Boxes

A grid cell might contain multiple objects; anchor boxes allow the detection of all of them. In the case of 2 anchor boxes, each cell $(i,j)$ of the grid is labeled as follows:

$$y_{i,j}={}^T[p_c, b_x, b_y, b_w, b_h,c_1,c_2,...,c_N, \boldsymbol{ p_c, b_x, b_y, b_w, b_h,c_1,c_2,...,c_N }]$$


More generally, the target output is of size:

$$size(target\_output)=[n_x\_cell\times n_y\_cell\times (M\times(5+N)), 1]$$

where $N$ is the number of classes and $M$ the number of anchor boxes.
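As a quick numerical check of this formula, the snippet below evaluates it for a $19\times 19$ grid and 80 classes. The function name is hypothetical and the number of anchor boxes is picked only for illustration.

```python
def target_output_size(n_x_cell, n_y_cell, n_classes, n_anchors=1):
    """Number of values in the target: n_x * n_y * (M * (5 + N))."""
    return n_x_cell * n_y_cell * n_anchors * (5 + n_classes)

# 19 x 19 grid with 80 classes, without and with 2 anchor boxes
size_one_anchor = target_output_size(19, 19, n_classes=80)
size_two_anchors = target_output_size(19, 19, n_classes=80, n_anchors=2)
```

With a single anchor box, the formula reduces to the size given earlier for the plain grid.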

YOLOv3 Algorithm

YOLO was trained on the COCO dataset, a large-scale object detection, segmentation and captioning database with 80 object categories. YOLOv3 uses the Darknet-53 architecture as its feature extractor, also called the backbone.

Here too, the training is carried out by minimizing a loss function using gradient methods.
The loss combines:

  • Logistic regression loss on $p_c$
  • Squared error loss for the box coordinates $b_i$
  • Softmax loss (cross-entropy) for the probabilities $c_i$
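The three terms listed above can be sketched for a single grid cell as follows. This is a simplified illustration and not the exact YOLOv3 loss: the terms are summed with equal weights, a single anchor box is assumed, and the function names are hypothetical.

```python
import math

def safe_log(p, eps=1e-12):
    """Logarithm clamped away from log(0)."""
    return math.log(min(max(p, eps), 1.0))

def cell_loss(pred, target, n_classes):
    """Loss for one cell; vectors are [p_c, b_x, b_y, b_w, b_h, c_1..c_N].

    pred holds probabilities for p_c and the c_i (already passed through
    a sigmoid / softmax); target is the ground-truth label vector.
    """
    p_hat, p = pred[0], target[0]
    # Logistic loss on the presence probability p_c
    loss = -(p * safe_log(p_hat) + (1.0 - p) * safe_log(1.0 - p_hat))
    if p == 1.0:  # box and class terms only count when an object is present
        # Squared error on b_x, b_y, b_w, b_h
        loss += sum((pred[i] - target[i]) ** 2 for i in range(1, 5))
        # Cross-entropy on the class probabilities
        loss -= sum(target[5 + k] * safe_log(pred[5 + k])
                    for k in range(n_classes))
    return loss

target = [1.0, 0.5, 0.5, 0.2, 0.2, 1.0, 0.0]
good = [0.9, 0.5, 0.5, 0.2, 0.2, 0.9, 0.1]  # close to the label
bad = [0.2, 0.1, 0.9, 0.6, 0.6, 0.3, 0.7]   # far from the label
loss_good = cell_loss(good, target, n_classes=2)
loss_bad = cell_loss(bad, target, n_classes=2)
```

The total loss of an image would then be obtained by summing this quantity over all the cells of the grid.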

At each epoch, for each cell, we generate the output $y_{i,j}$ and evaluate the loss function.
When making predictions, for each grid cell, we discard the predictions whose $p_c$ is too low and apply non-max suppression for each class to generate the final output.
For more information, I advise you to read the official paper.

4 - Face Recognition - Siamese Networks

Siamese networks are neural networks, often convolutional, which compute the degree of similarity between two inputs (images in our case) as follows:

siamese network

The purpose of the CNN module is to represent the information in the image in another space, called the embedding space, through a function $f$. We then compare the two embeddings using a certain distance. Learning in Siamese networks is done by minimizing an objective function built from a loss function called the triplet loss.

The triplet loss takes 3 vectors as input: an anchor $A$, a positive $P$ (similar to $A$) and a negative $N$ (different from $A$). We are thus looking to have:

$$\|f(A)-f(P)\|^2 \leq \|f(A)-f(N)\|^2$$

where $\|x\|^2=\langle x,x\rangle$ for a given scalar product. To prevent the network from learning the trivial solution $f=0$, which satisfies this inequality, we introduce a margin $0<\alpha\leq 1$ so that:

$$\begin{aligned}&\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 \leq -\alpha \\ \iff &\|f(A)-f(P)\|^2 + \alpha - \|f(A)-f(N)\|^2 \leq 0\end{aligned}$$

Thus, we define the loss function as follows:

$$\mathcal{L}(A,P,N)=\max(0,\|f(A)-f(P)\|^2 + \alpha - \|f(A)-f(N)\|^2)$$

Starting from a training database $(A^{(i)}, P^{(i)}, N^{(i)})_{i\in[1,...,n]}$ of size $n$, the objective function to be minimized using gradient methods is:

$$\mathcal{J}=\frac{1}{n}\sum_{i=1}^n \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})$$
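The triplet loss and the objective can be sketched as below. The sketch works directly on embeddings, i.e. on the vectors $f(A)$, $f(P)$ and $f(N)$ already produced by the network, and assumes the usual Euclidean scalar product; the margin value and the toy vectors are arbitrary choices made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||^2 between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(0, ||f(A)-f(P)||^2 + alpha - ||f(A)-f(N)||^2)."""
    return max(0.0, squared_distance(f_a, f_p) + alpha
               - squared_distance(f_a, f_n))

def objective(triplets, alpha=0.2):
    """Average triplet loss over the n triplets of the database."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets) / len(triplets)

# Made-up 2-D embeddings (f(A), f(P), f(N)) for two triplets
triplets = [
    ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]),  # negative far away: zero loss
    ([0.0, 0.0], [0.5, 0.0], [0.6, 0.0]),  # negative too close: positive loss
]
J = objective(triplets)
```

Only the triplets that violate the margin contribute to the objective, which is why the choice of triplets matters during training.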
When training the architecture, for each epoch, we fix the number of triplets and, for each one:

  • We randomly choose two images of the same class (Anchor & Positive)
  • We randomly pick an image from another class (Negative)

A triplet $(A, P, N)$ can be:

  • Easy negative, when $\|f(A)-f(P)\|^2 + \alpha - \|f(A)-f(N)\|^2 \leq 0$
  • Semi-hard negative, when $\|f(A)-f(P)\|^2 + \alpha > \|f(A)-f(N)\|^2 > \|f(A)-f(P)\|^2$
  • Hard negative, when $\|f(A)-f(N)\|^2 < \|f(A)-f(P)\|^2$

We usually choose to focus on the semi-hard negatives to train the neural network.
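The three categories can be told apart with a small helper like the one below, which could be used to keep only the semi-hard triplets of a batch. The function names and the margin value are assumptions made for the example, and the boundary case where both distances are equal, which the definitions above leave open, is counted here as hard.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||^2 between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_kind(f_a, f_p, f_n, alpha=0.2):
    """Classify a triplet of embeddings as 'easy', 'semi-hard' or 'hard'."""
    d_ap = squared_distance(f_a, f_p)
    d_an = squared_distance(f_a, f_n)
    if d_ap + alpha <= d_an:
        return "easy"       # the margin is already satisfied
    if d_an > d_ap:
        return "semi-hard"  # correct order, but inside the margin
    return "hard"           # negative at least as close as the positive

anchor = [0.0, 0.0]
kinds = [
    triplet_kind(anchor, [0.1, 0.0], [1.0, 0.0]),
    triplet_kind(anchor, [0.5, 0.0], [0.6, 0.0]),
    triplet_kind(anchor, [0.5, 0.0], [0.3, 0.0]),
]
```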

Application: Face Recognition

Siamese networks can be used to develop a system capable of identifying faces. Given an image taken by a camera, the architecture compares it to all the images in the database.
Since our database usually does not contain multiple images of the same person, we train the Siamese network on an open-source image set rich enough to create the triplets.

face recognition

The convolutional neural network learns an embedding function $f$ that maps an image into $\mathbb{R}^{n^{[L]}}$, where $n^{[L]}$ is the size of the output layer.
Given a camera picture, we compare it to each $image_j$ of the database such that:

  • if $d(f(image), f(image_j))\leq\tau$, both images represent the same person
  • if $d(f(image), f(image_j))>\tau$, the images are of two different persons

We choose the face $image_j$ which is the closest to $image$ in terms of the distance $d$. The threshold $\tau$ is chosen, for example, so that the $F_1$-score is the highest.
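The matching step can be sketched as follows, assuming that the embeddings have already been computed by the network and that $d$ is the Euclidean distance. The database content, the names and the value of the threshold are made up for the example.

```python
import math

def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def identify(query_embedding, database, tau):
    """Return the closest identity, or None if it is farther than tau.

    database: dict mapping a person's name to the embedding of their image.
    """
    best_name, best_dist = None, float("inf")
    for name, embedding in database.items():
        d = distance(query_embedding, embedding)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= tau else None

# Hypothetical 2-D embeddings of the reference images
database = {"alice": [0.1, 0.9], "bob": [0.8, 0.2]}
match = identify([0.15, 0.85], database, tau=0.5)  # close to alice
unknown = identify([5.0, 5.0], database, tau=0.5)  # far from everyone
```

Returning `None` when even the closest image is farther than $\tau$ lets the system reject people who are not in the database.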