I used Python to implement the convolutional neural network. The code was adapted from an existing implementation and modified to work with this dataset. The network was built using the Theano library in Python. This implementation simplifies the model: it does not implement location-specific gain and bias parameters, and it implements pooling by maximum rather than by average. The LeNet5 model uses logistic regression for image classification. The network was trained by passing it the compressed training, test and validation datasets. The feature maps are convolved with filters, there is one bias per output feature map, and each feature map is individually downsampled using max-pooling.
The compressed dataset is divided into small batches to reduce the overhead of computing and copying data for each individual image. The batch size for this model is set to 500. The learning rate, which is the step-size factor for stochastic gradient descent, is set to 0.1. The maximum number of epochs for running the optimizer is set to 200, so the learning for each label runs for at most 200 epochs.
When the first convolutional pooling layer is constructed, filtering reduces the image size to 24×24, which max-pooling further reduces to 12×12. In the second convolutional pooling layer, filtering reduces the image size to 8×8 and max-pooling reduces it further to 4×4. Since the hidden layer is fully connected, it operates on 2D matrices of rasterized images. With the default settings, this produces a matrix of shape (500, 800). The values of the fully connected hidden layer are classified using logistic regression. The cost minimized during training is the negative log-likelihood of the model. A Theano function, test model, is constructed to compute the mistakes made by the model.
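The layer sizes quoted above can be checked with a short calculation. The sketch below is illustrative only: the 28×28 input size, 5×5 filters, 2×2 pooling windows and 50 feature maps in the second layer are assumptions that are not stated in the text; they are chosen because they reproduce the quoted sizes (24, 12, 8, 4) and the (500, 800) matrix shape.

```python
def conv_valid_size(size, filter_size):
    """Side length after a 'valid' convolution (no padding, stride 1)."""
    return size - filter_size + 1


def pool_size(size, pool):
    """Side length after non-overlapping pooling with a pool x pool window."""
    return size // pool


# Assumed values (not stated in the text): 28x28 inputs, 5x5 filters,
# 2x2 pooling windows and 50 feature maps in the second layer.
INPUT_SIDE = 28
FILTER_SIDE = 5
POOL_SIDE = 2
N_MAPS_LAYER2 = 50
BATCH_SIZE = 500  # stated in the text

conv1 = conv_valid_size(INPUT_SIDE, FILTER_SIDE)  # first layer, after filtering
pool1 = pool_size(conv1, POOL_SIDE)               # first layer, after max-pooling
conv2 = conv_valid_size(pool1, FILTER_SIDE)       # second layer, after filtering
pool2 = pool_size(conv2, POOL_SIDE)               # second layer, after max-pooling

# Each image is flattened (rasterized) before the fully connected layer.
hidden_input_shape = (BATCH_SIZE, N_MAPS_LAYER2 * pool2 * pool2)
print(conv1, pool1, conv2, pool2, hidden_input_shape)
```

Under these assumptions the flattened input to the hidden layer has 50 × 4 × 4 = 800 values per image, which matches the (500, 800) shape for a batch of 500.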
We create two lists: one of all model parameters to be fitted by gradient descent, and one of the gradients for all model parameters. The model parameters are updated by stochastic gradient descent (SGD) in train model, which is a Theano function. Because the model has many parameters, writing an update rule for each one manually would be tedious, so the updates list is created automatically by looping over all (parameter, gradient) pairs. We also set an improvement threshold: a relative improvement of at least this amount is considered significant. Once training of the convolutional neural network is complete, the trained model is returned.
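As a minimal sketch of the update loop and the improvement threshold, the plain-Python example below pairs each parameter with its gradient and applies the SGD rule. It does not use Theano, and the function names and the threshold value of 0.995 are hypothetical choices for illustration; only the learning rate of 0.1 comes from the text.

```python
def sgd_updates(params, grads, learning_rate):
    """Return updated values p - learning_rate * g for each (param, grad) pair."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def is_significant(new_loss, best_loss, improvement_threshold):
    """True when new_loss beats best_loss by the required relative margin."""
    return new_loss < best_loss * improvement_threshold


# Toy scalar parameters and gradients; real models hold arrays here.
params = [1.0, -2.0, 0.5]
grads = [0.5, -1.0, 0.0]
new_params = sgd_updates(params, grads, learning_rate=0.1)

# Assumed threshold of 0.995: the loss must drop by more than 0.5%.
big_step = is_significant(0.099, 0.100, improvement_threshold=0.995)
small_step = is_significant(0.0999, 0.100, improvement_threshold=0.995)
```

Looping over the zipped lists is what removes the need to write one update rule per parameter by hand.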
Deep learning refers to a class of machine learning techniques in which many layers of information-processing stages in hierarchical architectures are exploited for pattern classification and for feature or representation learning. It lies at the intersection of several research areas, including neural networks, graphical modeling, optimization, pattern recognition and signal processing. Yann LeCun adopted the deep supervised backpropagation convolutional network for digit recognition. In the recent past, deep learning has become a valuable research topic in both computer vision and machine learning, where it achieves state-of-the-art results for a variety of tasks. The deep convolutional neural networks (CNNs) proposed by Hinton and colleagues in "ImageNet Classification with Deep Convolutional Neural Networks" came first in the ImageNet image classification task. The model was trained on more than one million images and achieved a winning top-5 test error rate of 15.3% over 1,000 classes. Later work obtained better results by improving CNN models: the top-5 test error rate decreased to 13.24% by training the model to simultaneously classify, locate and detect objects. Besides image classification, the object detection task can also benefit from the CNN model. Generally speaking, three important reasons for the popularity of deep learning today are drastically increased chip processing abilities (e.g., GPUs), the significantly lower cost of computing hardware, and recent advances in machine learning and signal/information processing research. Over the past several years, a rich family of deep learning techniques has been proposed and extensively studied, e.g., Deep Belief Networks (DBN), Boltzmann Machines (BM), Restricted Boltzmann Machines (RBM), Deep Boltzmann Machines (DBM) and Deep Neural Networks (DNN).
Among these techniques, the deep convolutional neural network, which is a discriminative deep architecture and belongs to the DNN category, has achieved state-of-the-art performance on various tasks and competitions in computer vision and image recognition. Specifically, the CNN model consists of several convolutional layers and pooling layers stacked one on top of another. The convolutional layer shares many weights, and the pooling layer sub-samples the output of the convolutional layer and reduces the data rate from the layer below. The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some invariance properties (e.g., translation invariance). My work is similar to that of Ji Wan et al. but differs in that the dataset I use is different from the ones used in their study. In addition, my approach to image matching has not, to my knowledge, been used in any similar study.
Research on CBIR over the past decade has touched on many aspects of the problem, and deep learning has been seen to give the best results. The problem of annotated images has also been touched upon, but not in combination with deep learning. In my thesis I propose to show better results for annotated images by using not only the images but also the annotations provided with each image. I will be using a convolutional neural network.
I have previously worked on CBIR using the Bag-of-Words model with the same dataset. The results of this study are evaluated against the results I achieved in that previous study and are discussed in chapter 5. The work is also compared against the results of Ji Wan et al., who used deep learning for CBIR on a variety of datasets; their datasets, however, consisted of plain images without any annotations.
“Convolutional neural network (CNN) is a type of feed-forward artificial neural network where the individual neurons are tiled in such a way that they respond to overlapping regions in the visual field.” CNNs are biologically inspired variants of Multilayer Perceptrons (MLP) that are designed to require minimal preprocessing. These models are widely used in image and video recognition. When CNNs are used for image recognition, they look at small portions of the input image, called receptive fields, through multiple layers of small neuron collections. The outputs of these collections are then tiled so that they overlap, which gives a better representation of the original image; every such layer repeats this process. This is why CNNs are able to tolerate translation of the input image. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters. Inspired by biological processes, convolutional networks also contain various combinations of fully connected layers and convolutional layers, with a point-wise nonlinearity applied at the end of or after each layer. The convolution operation is applied to small regions because, if all the layers were fully connected, the network would have billions of parameters. Convolutional networks use shared weights in the convolutional layers, i.e. the same filter (weights bank) is used for each pixel in the layer, which reduces the required memory size and improves performance. CNNs also require relatively little pre-processing compared to other image classification algorithms.
CNNs enforce a local connectivity pattern between neurons of adjacent layers to exploit spatially local correlation. As illustrated in Fig. 4.1, the inputs of hidden units in layer m come from a subset of units in layer m-1, namely units that have spatially adjoining receptive fields.
Every filter $h_i$ in a CNN is duplicated across the complete visual field. The duplicated units share the same parameters, i.e. weights and bias, and together form a feature map. Fig. 4.2 shows three hidden units that belong to the same feature map. Weights of the same color are shared, that is, constrained to be identical. Gradient descent can still be used to learn such shared parameters with only a small change to the original algorithm: the gradient of a shared weight is the sum of the gradients of the parameters being shared. Duplicating the units in this way allows features to be detected regardless of their location in the visual field. Weight sharing also increases learning efficiency, because it greatly reduces the number of free parameters being learnt. The constraints on these models enable CNNs to achieve better generalization on vision problems.
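The summation rule for shared weights can be illustrated with a one-dimensional toy example. This is a sketch under simplifying assumptions: a single filter of three weights slides over a five-value input, giving three hidden units that share the filter, and the `upstream` argument is a hypothetical stand-in for the gradient arriving from the layers above.

```python
def conv1d_valid(x, w):
    """Apply one shared filter w at every position of x (no padding)."""
    n = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n)]


def shared_weight_gradient(x, w, upstream):
    """Gradient of sum_i upstream[i] * y[i] with respect to the shared weights.

    Each output position i uses its own copy of w, and the gradient for the
    copy at position i is upstream[i] * x[i + k]. The gradient of the shared
    weight w[k] is the sum of those per-position gradients.
    """
    n = len(x) - len(w) + 1
    per_position = [
        [upstream[i] * x[i + k] for k in range(len(w))] for i in range(n)
    ]
    return [sum(per_position[i][k] for i in range(n)) for k in range(len(w))]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, -1.0, 0.25]
y = conv1d_valid(x, w)  # three hidden units sharing one filter
grad = shared_weight_gradient(x, w, upstream=[1.0, 1.0, 1.0])
```

With an upstream gradient of 1 at every position, each entry of `grad` is the sum of the three input values that the corresponding weight touches.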
We obtain a feature map by repeatedly applying a function across sub-regions of the entire image, that is, by convolving the input image with a linear filter, adding a bias term and then applying a non-linear function. The k-th feature map at a given layer is denoted $h^k$, and its filters are determined by the weights $W^k$ and the bias $b_k$. Writing $x$ for the input and $f$ for the non-linear function, the feature map is obtained as:
$$h^k_{ij} = f\big((W^k * x)_{ij} + b_k\big)$$
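The feature map computation described above (convolve, add a bias, apply a non-linearity) can be sketched in plain Python for a single input channel. The choice of tanh as the non-linear function and the function name `feature_map` are assumptions made for illustration; the text does not say which non-linearity was used.

```python
import math


def feature_map(x, W, b, f=math.tanh):
    """Compute h[i][j] = f((W * x)[i][j] + b) using a 'valid' 2D convolution."""
    fh, fw = len(W), len(W[0])
    out_h = len(x) - fh + 1
    out_w = len(x[0]) - fw + 1
    h = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # True convolution: the filter is flipped relative to the patch.
            s = sum(
                W[u][v] * x[i + fh - 1 - u][j + fw - 1 - v]
                for u in range(fh)
                for v in range(fw)
            )
            row.append(f(s + b))
        h.append(row)
    return h


x = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0]]
W = [[1.0, 0.0],
     [0.0, -1.0]]
h = feature_map(x, W, b=0.5)  # 2x2 feature map from a 3x3 input
```

A 3×3 input and a 2×2 filter give a 2×2 feature map, following the same size rule (input size minus filter size plus one) that takes the images from 12×12 to 8×8 in the second layer.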
Fig. 4.3 depicts two layers of a CNN. There are four feature maps in layer m-1 and two feature maps ($h^0$ and $h^1$) in hidden layer m. The pixels in the feature maps $h^0$ and $h^1$ (blue and red squares) are computed from the pixels of layer (m-1) that lie within their 2×2 receptive field in the layer below (colored squares). Note that the receptive field spans all four input feature maps. The weights $W^0$ and $W^1$ of $h^0$ and $h^1$ are therefore 3D weight tensors: the leading dimension indexes the input feature maps, while the other two refer to the pixel coordinates. Putting it all together, as shown in Fig. 4.3, $W^{kl}_{ij}$ denotes the weight connecting each pixel of the k-th feature map at layer m with the pixel at coordinates (i,j) of the l-th feature map of layer (m-1).
Max-pooling, a form of non-linear down-sampling, is an important concept in CNNs. The input image is partitioned into a set of non-overlapping rectangles, and the maximum value is output for each such sub-region. Max-pooling is useful in vision for two reasons. First, it reduces the computation required by upper layers by eliminating non-maximal values. Second, it provides a form of translation invariance. Suppose a max-pooling layer is cascaded with a convolutional layer. The input image can be translated by a single pixel in 8 directions. If max-pooling is done over a 2×2 region, 3 out of these 8 possible configurations produce exactly the same output at the convolutional layer. This rises to 5 out of 8 for max-pooling over a 3×3 region. Because it provides this additional robustness to position, max-pooling is an effective way of reducing the dimensionality of intermediate representations.
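A minimal sketch of non-overlapping max-pooling in plain Python is given below. The function name `max_pool` is hypothetical, and dropping rows and columns that do not fill a complete window is an assumption about border handling rather than something stated in the text.

```python
def max_pool(image, pool):
    """Non-overlapping max-pooling with a pool x pool window.

    Rows and columns that do not fill a complete window are dropped.
    """
    out_h = len(image) // pool
    out_w = len(image[0]) // pool
    return [
        [
            max(
                image[i * pool + di][j * pool + dj]
                for di in range(pool)
                for dj in range(pool)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool(image, 2)  # 4x4 input becomes 2x2
```

Each 2×2 block is replaced by its largest value, so the 4×4 input is reduced to a 2×2 output while the strongest response in each region is kept.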
The dataset I chose for this thesis is from the SUN database. The major reason for choosing this dataset is that its images are pre-annotated, with the annotations for each image stored in an XML file. The SUN database is huge, so I had to choose a small subset of it for this study. I classify images into eight classes, namely water, car, mountain, ground, tree, building, snow and sky, plus an "unknown" class that covers all remaining classes. I chose only those sets of images that I felt were most relevant to these classes, and collected a database of 3000 images from 41 categories. Not all images in every category were annotated, so I filtered out the annotated images and used only those for this study. I randomly divided the dataset into an 80% training set and a 20% testing set, and the training set was further divided into roughly 80% training and 20% validation. This gives 1900 training images, 600 testing images and 500 validation images. The major drawback of this dataset is that the images are annotated by humans and the annotations are not perfect, which may have some effect on the results. I try to handle this problem by collecting as many synonyms as I can for each class label. Examples of synonyms that all belong to the class label water include lake, lake water, sea water, river water, wave, ripple, river and sea. I mapped these synonyms to the respective class labels used in this study. Fig. 4.5 shows an example of an image from the dataset and its annotation file, where it can be seen how a river is annotated by the user.
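The synonym mapping can be sketched as a simple lookup table. The example below is a hypothetical illustration: the names `SYNONYM_TO_CLASS` and `to_class_label` are invented, only the synonyms for water listed above are included, and reading the annotation names from the XML files is left out.

```python
# The eight class labels used in the study; anything else is "unknown".
CLASS_LABELS = {
    "water", "car", "mountain", "ground", "tree", "building", "snow", "sky",
}

# Only the example synonyms for "water" from the text are listed here;
# a complete table would need entries for the other class labels too.
SYNONYM_TO_CLASS = {
    "lake": "water",
    "lake water": "water",
    "sea water": "water",
    "river water": "water",
    "wave": "water",
    "ripple": "water",
    "river": "water",
    "sea": "water",
}


def to_class_label(annotation):
    """Map a raw annotation name to a class label, or to 'unknown'."""
    name = annotation.strip().lower()
    if name in CLASS_LABELS:
        return name
    return SYNONYM_TO_CLASS.get(name, "unknown")


labels = [to_class_label(a) for a in ["River Water", "sky", "bench"]]
```

Normalizing case and whitespace before the lookup is a design choice that absorbs some of the inconsistency in human-written annotations.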