Overview

These models use a local connectivity pattern that convolves the inputs at each layer with a shared window of weights called a kernel or a filter. This connective structure enables networks to learn representations that are robust to translational shifts. Each convolutional layer applies many kernels, and the output of each kernel's convolution with the input is a feature map.

Spatial convolution in one dimension, where f is the input and g is the filter: $$ f[x] * g[x] = \sum_i f[i] \cdot g[x-i] $$
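The sum above can be translated almost literally into code. This is a minimal sketch, assuming NumPy and a hypothetical signal `f` and filter `g` (names chosen to match the formula):

```python
import numpy as np

def conv1d(f, g):
    """Full 1-D convolution: (f * g)[x] = sum_i f[i] * g[x - i]."""
    out = np.zeros(len(f) + len(g) - 1)
    for x in range(len(out)):
        for i in range(len(f)):
            # Only accumulate where the shifted filter index is valid.
            if 0 <= x - i < len(g):
                out[x] += f[i] * g[x - i]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(conv1d(f, g))  # agrees with np.convolve(f, g)
```

The explicit double loop is for clarity only; in practice a library routine such as `np.convolve` does the same computation far more efficiently.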

In two dimensions, $$ f[x, y] * g[x, y] = \sum_i \sum_j f[i, j] \cdot g[x-i, y-j] $$
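The two-dimensional case follows the same pattern, just with a second pair of indices. A minimal sketch, again assuming NumPy and illustrative names matching the formula:

```python
import numpy as np

def conv2d(f, g):
    """Full 2-D convolution: (f * g)[x, y] = sum_{i,j} f[i, j] * g[x - i, y - j]."""
    H, W = f.shape
    h, w = g.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for i in range(H):
                for j in range(W):
                    # Skip terms where the shifted filter index falls outside g.
                    if 0 <= x - i < h and 0 <= y - j < w:
                        out[x, y] += f[i, j] * g[x - i, y - j]
    return out

f = np.array([[1.0, 2.0], [3.0, 4.0]])
g = np.array([[1.0, 0.0], [0.0, 1.0]])
print(conv2d(f, g))
```

This quadruple loop is O(H·W·h·w) and is written purely to mirror the math; real frameworks implement the same operation with vectorized or FFT-based routines.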

Now let's break the image up into patches.

Since a 3x3 kernel doesn't evenly divide a 28x28 image (28 = 9·3 + 1), the patches in the last row and column are smaller than the rest.
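One way to see the ragged edge concretely is to tile the image with non-overlapping 3x3 windows. A minimal NumPy sketch, using a dummy 28x28 array in place of a real image:

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)  # stand-in for a 28x28 image
k = 3

# Non-overlapping patches: step the window by the kernel size.
patches = [image[r:r + k, c:c + k]
           for r in range(0, 28, k)
           for c in range(0, 28, k)]

print(len(patches))        # 100 patches in a 10x10 grid
print(patches[0].shape)    # (3, 3): a full interior patch
print(patches[-1].shape)   # (1, 1): the ragged corner patch
```

Because 28 = 9·3 + 1, nine full patches fit along each axis and a leftover 1-pixel strip remains, which is why the last row and column of patches are truncated.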

What if we want some overlap?
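Overlap comes from sliding the window with a stride smaller than the kernel size. A minimal sketch with stride 1 (the usual setting for a convolutional layer), again on a dummy 28x28 array:

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)  # stand-in for a 28x28 image
k, stride = 3, 1

# Only start positions where a full 3x3 window fits inside the image.
positions = range(0, 28 - k + 1, stride)
patches = [image[r:r + k, c:c + k] for r in positions for c in positions]

print(len(patches))  # 26 * 26 = 676 overlapping 3x3 patches
```

With the window restricted to positions where it fits entirely, every patch is a full 3x3, and the output grid shrinks to (28 - 3 + 1) = 26 per side; adjusting the stride trades off overlap against the number of patches.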