4.2 Proposed Method
4.2.3 Face-IM Pair Classification for Splicing Detection
It has been observed that face images captured under the same illumination environment have similar visual features (i.e., texture, shape, and colour) in their corresponding IMs [9], [41]. If the faces come from different illumination environments, these features will differ. Therefore, by checking the consistency of these visual features in a pair-wise manner, the authenticity of an image can be decided. If there are N faces present in an image, there will be N(N−1)/2 pairs. An image is considered spliced if at least one face pair is classified as spliced; otherwise, the image is considered authentic.
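To make this decision rule concrete, the following is a minimal Python sketch. Here, `classify_pair` is a hypothetical stand-in for the trained pair classifier described in the rest of this section, and the 0.5 decision threshold is an assumption.

```python
from itertools import combinations

def is_spliced(face_ims, classify_pair, threshold=0.5):
    """face_ims: the N face-IMs from one image -> N(N-1)/2 candidate pairs."""
    # The image is declared spliced if at least one pair is classified as spliced.
    return any(classify_pair(f1, f2) > threshold
               for f1, f2 in combinations(face_ims, 2))
```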
4.2.3.1 Siamese Network for Feature Learning
A siamese neural network-based method was first proposed by Bromley et al. [99] for solving the signature verification problem. A siamese network consists of two identical sub-networks, each accepting one input. The outputs of the two sub-networks are passed to a distance layer, which computes a distance metric between them. The distance metric produces a high value if the two images in a pair come from two different classes and a low value if they come from the same class.
To learn features from the face-IMs that can differentiate the authentic face pairs from the spliced ones, we propose to employ a siamese network consisting of twin CNNs [94].
In this way, the CNN part of the network learns the features helpful for detecting the spliced faces automatically from the training data itself. The proposed siamese network takes two inputs, F1 and F2, one for each of the CNNs, and the outputs are fed to the distance layer [94]. The distance layer computes a distance metric between the feature vectors learned by the twin networks. The weight sharing ensures that the same set of features is learned from both inputs. Figure 4.3 shows the block diagram of the siamese network. It contains twin CNNs, i.e., CNN1 and CNN2, which compute the features f1 and f2 from the input face-IM pairs. The distance layer first finds a difference vector by computing the absolute difference between the corresponding components of the features f1 and f2. The difference vector is then fed to a fully-connected layer consisting of a single neuron with a sigmoidal activation function. The output of the distance layer is in the range [0, 1]. Hence, it can be considered as the class prediction of the pair of face-IMs. A high value of this prediction indicates that the face-IMs in the pair have different visual features, which in turn indicates that the two faces come from different illumination environments, i.e., a spliced pair. Similarly, a low value of this prediction indicates that the two face-IMs are similar and hence come from the same illumination environment, i.e., an authentic pair.
4.2.3.2 Network Architecture
1) CNN
The CNN part of the siamese network is responsible for learning visual features from the training face-IM pairs. These features help in discriminating the spliced pairs from the authentic ones. Each of the twin CNNs used in this method takes an input image of size 155×155. We experimented with smaller input sizes, such as 32×32 and 64×64; however, the results on these smaller input sizes were worse than those on 155×155. The reason for this might be the following: since our focus is on extracting features from face-IMs, we have to retain as much illumination-related information as possible. When the image size is reduced, the illumination estimation from the superpixels becomes less accurate due to the lower number of pixels in each superpixel. The number of superpixels in the face regions also reduces with the reduced input size. This results in homogeneous face-IMs with only a few illuminant colours over the face regions. The dataset we use for training the siamese network contains images where the face regions are at least of resolution 155×155. In case the face regions in the test images are of different sizes, we resize the face-IMs to 155×155.

Figure 4.4: The convolutional network architecture used in the siamese network.
In the initial investigations, we experimented with deep CNNs, similar to the 16-layer VGGNet proposed in [100]. However, due to the large number of parameters, the deep CNNs overfitted the training data. This is because our training set contains a small number of images (as discussed in Section 4.3, below) compared to the number of images in the datasets used to train deep networks like VGGNet [100]. Therefore, we propose to employ a shallow CNN containing only a few convolutional layers. Shallow CNNs with a few layers have been shown to be effective in other computer vision tasks, e.g., as in [101]. Finally, the CNN part of the proposed siamese network is designed to have 4 convolutional layers and 4 max-pooling layers, as shown in Figure 4.4. The numbers of filters in the first, second, third, and fourth convolutional layers are 16, 32, 32, and 16, respectively, with a kernel of size 3×3 and stride 1 in all the layers. The batch normalization technique [102] is applied after each convolutional layer, followed by the rectified linear unit (ReLU) nonlinearity. All four max-pooling layers employ kernels of size 2×2 with stride 2.
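For illustration, below is a minimal PyTorch sketch of this CNN. The text does not specify the convolution padding or how the final feature vector is formed; 'same' padding, a 3-channel input, and an adaptive average pooling to 4×4 (giving a 16 × 4 × 4 = 256-dimensional feature, consistent with the 256 distance-layer parameters mentioned below) are assumptions.

```python
import torch
import torch.nn as nn

class FaceIMEncoder(nn.Module):
    """Shallow CNN trunk: 4 conv + BN + ReLU + max-pool stages, per the text."""
    def __init__(self):
        super().__init__()
        filters = [16, 32, 32, 16]            # filters per conv layer, from the text
        layers, in_ch = [], 3                 # assumed: face-IMs are 3-channel colour maps
        for out_ch in filters:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(out_ch),       # batch normalization after each conv [102]
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(4)   # assumed: yields a 16x4x4 = 256-dim feature

    def forward(self, x):                     # x: (B, 3, 155, 155)
        f = self.pool(self.features(x))
        return f.flatten(1)                   # (B, 256) feature vector
```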
2) Distance Layer
The distance layer computes the weighted-$L_1$ distance between the features $f_1$ and $f_2$, computed by CNN1 and CNN2, respectively. The sigmoid nonlinearity is applied to this layer to map its output to the range [0, 1]. Hence, the output of this layer can be considered to be the class prediction of the input face-IM pair. For this, a new vector is first formed by computing the element-wise absolute difference between $f_1$ and $f_2$. The difference vector is fed to a fully-connected single neuron with the sigmoid activation function. So, this layer outputs the weighted-$L_1$ distance between the features as

$$p = \sigma\Big(\sum_j \alpha_j \big| f_{1j} - f_{2j} \big|\Big),$$

where $\sigma(\cdot)$ and $|\cdot|$ denote the sigmoidal activation function and the absolute value, respectively. The $\alpha_j$s are the learnable weights of the final fully-connected layer, representing the importance of each component of the difference vector. There are 256 parameters in the distance layer.
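Continuing the sketch above, the distance layer can be written as a single fully-connected neuron over the element-wise absolute difference. `bias=False` is assumed so that the layer has exactly the 256 weights $\alpha_j$ stated in the text.

```python
class SiameseFaceIM(nn.Module):
    """Twin encoders with shared weights, followed by the distance layer."""
    def __init__(self, encoder: FaceIMEncoder, feat_dim: int = 256):
        super().__init__()
        self.encoder = encoder                            # one module, applied to both inputs
        self.alpha = nn.Linear(feat_dim, 1, bias=False)   # learnable weights alpha_j

    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)       # shared-weight features
        d = torch.abs(f1 - f2)                            # element-wise |f1 - f2|
        # p = sigma( sum_j alpha_j * |f_1j - f_2j| )
        return torch.sigmoid(self.alpha(d)).squeeze(1)
```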
3) Learning
The proposed siamese network is trained by minimizing the cross-entropy loss over a mini-batch of face-IM pairs. As the sigmoid activation function is used in the final layer of our siamese network, the cross-entropy loss is the preferred cost function to train the network. We assign the ground-truth label $g(F_1, F_2) = 0$ when both the images, $F_1$ and $F_2$, in the pair come from the same image. Otherwise, we assign $g(F_1, F_2) = 1$. The cross-entropy loss is given by

$$\mathcal{L}_{CE} = -\frac{1}{M} \sum_{i=1}^{M} \Big[ g(F_1^i, F_2^i) \log p(F_1^i, F_2^i) + \big(1 - g(F_1^i, F_2^i)\big) \log\big(1 - p(F_1^i, F_2^i)\big) \Big], \tag{4.6}$$

where $p(F_1, F_2)$ is the prediction of the network and $M$ is the number of face pairs in the mini-batch.
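A minimal training-step sketch follows, continuing the classes above. The text specifies only the loss; the Adam optimizer and the data handling are assumptions. PyTorch's `nn.BCELoss` computes exactly the negated, mini-batch-averaged cross-entropy of Eq. (4.6).

```python
import torch
import torch.nn as nn

model = SiameseFaceIM(FaceIMEncoder())            # classes from the sketches above
optimizer = torch.optim.Adam(model.parameters())  # assumed optimizer
criterion = nn.BCELoss()                          # cross-entropy loss of Eq. (4.6)

def train_step(F1, F2, g):
    """F1, F2: (M, 3, 155, 155) face-IM pairs; g: (M,) labels, 0=authentic, 1=spliced."""
    optimizer.zero_grad()
    p = model(F1, F2)                             # pair predictions in [0, 1]
    loss = criterion(p, g.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```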
4.2.3.3 Feature Extraction
There is no face splicing dataset that contains a sufficient number of authentic and spliced face pairs to train the siamese network. Therefore, we propose to generate the authentic and the spliced pairs artificially from a set of authentic images. We convert the images to IMs, extract the face parts, and create pairs of face-IMs. If both faces of a pair come from the same image, we label the pair as authentic. If the two faces of a pair come from two different images, the pair is labeled as spliced. The siamese network is trained on these artificially created spliced and authentic face-IM pairs.
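The pair-generation procedure can be sketched as follows; `compute_illumination_map` and `extract_face_ims` are hypothetical placeholders for the IM-estimation and face-extraction steps described earlier in this chapter.

```python
import itertools
import random

def make_training_pairs(images):
    """images: authentic images, each containing one or more faces."""
    # Hypothetical helpers: compute the IM of each image and crop out the face-IMs.
    face_ims = [extract_face_ims(compute_illumination_map(img)) for img in images]

    pairs, labels = [], []
    # Faces from the same image -> authentic pair (g = 0).
    for faces in face_ims:
        for fa, fb in itertools.combinations(faces, 2):
            pairs.append((fa, fb)); labels.append(0)
    # Faces from two different images -> artificial spliced pair (g = 1).
    for fs_a, fs_b in itertools.combinations(face_ims, 2):
        if fs_a and fs_b:
            pairs.append((random.choice(fs_a), random.choice(fs_b))); labels.append(1)
    return pairs, labels
```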
After the training process, the CNN part of the network is used to extract features from the face-IMs present in real-life spliced and authentic images. Though the siamese network learns to differentiate between the authentic and the artificially-created spliced face pairs present in the training set, the difference between the face-IMs in an artificially-created spliced pair is larger than that in the spliced face pairs present in real-life spliced images. Hence, the siamese network trained on the artificial training dataset will not be able to correctly classify the real spliced faces. It is experimentally found that the siamese network trained on the artificially-created spliced and authentic