Conditional principal components analysis

Conditional principal components analysis (CPCA) restricts each neuron to perform principal components analysis only over the inputs for which that neuron is activated.

Model


We use a set of linear neurons with binary inputs. Given a k-dimensional binary input represented as a column vector $$\vec{x} = [x_1, x_2, \cdots, x_k]^T$$, and a set of m linear neurons with (initially random) synaptic weights from the inputs, represented as a matrix formed by m weight column vectors (i.e. a matrix with k rows and m columns):

$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m}\\ w_{21} & w_{22} & \cdots & w_{2m}\\ \vdots & & & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{km} \end{bmatrix}$$

where $$w_{ij}$$ is the weight between input i and neuron j, the output of the set of neurons is defined as follows (but see also the section on contrast enhancement below):

$$\vec{y} = \mathbf{W}^T \vec{x}$$
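The output computation can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `neuron_outputs` and the example numbers are assumptions made here, and the weight matrix is stored as a list of k rows with m entries each.

```python
def neuron_outputs(W, x):
    """Return y = W^T x for a k-by-m weight matrix W (list of k rows) and a length-k input x."""
    k = len(W)
    m = len(W[0])
    # Output j is the weighted sum of the inputs over column j of W.
    return [sum(W[i][j] * x[i] for i in range(k)) for j in range(m)]

# Hypothetical example: k = 3 inputs, m = 2 neurons.
W = [[0.5, 0.25],
     [0.25, 0.75],
     [0.75, 0.5]]
x = [1, 0, 1]
y = neuron_outputs(W, x)
print(y)  # [1.25, 0.75]
```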

The CPCA update rule, applied after each input pattern is presented with learning rate $$\eta$$, is:

$$\Delta w_{ij} = \eta y_j(x_i - w_{ij})$$
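As a rough sketch, the rule can be applied to every weight of a k-by-m matrix in one pass. The function name `cpca_update`, the in-place update, and the example values are illustrative assumptions.

```python
def cpca_update(W, x, y, eta):
    """Apply w_ij += eta * y_j * (x_i - w_ij) in place to a k-by-m matrix W."""
    for i, x_i in enumerate(x):
        for j, y_j in enumerate(y):
            W[i][j] += eta * y_j * (x_i - W[i][j])
    return W

# Hypothetical example: 2 inputs, 2 neurons, only neuron 0 active.
W = [[0.5, 0.5],
     [0.5, 0.5]]
cpca_update(W, x=[1, 0], y=[1, 0], eta=0.5)
print(W)  # [[0.75, 0.5], [0.25, 0.5]]
```

Only the column of the active neuron changes: its weight from the active input moves toward 1, and its weight from the inactive input moves toward 0.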

With a set of such neurons, typically a k-Winner-Takes-All pass is run before the update: all neurons are evaluated, the $$k$$ neurons with the highest outputs have their outputs set to 1, and the rest have their outputs set to 0. Here $$k$$ is the number of winners and is unrelated to the input dimension k used above.

Derivation
We want the weight from input $$i$$ to neuron $$j$$ to eventually settle at the probability that input $$i$$ is active given that neuron $$j$$ is active. That is, when the weight is at equilibrium, we have:

$$w_{ij} = P(x_i=1 \mid y_j=1)$$

By the definition of conditional probability:

$$w_{ij} = P(x_i=1 \mid y_j=1) = \frac{P(y_j=1 \wedge x_i=1)}{P(y_j=1)}$$

Using the law of total probability, we can condition the numerator and denominator on the input patterns. Denoting an input pattern by $$t$$, we have:

$$\begin{align} P(y_j=1) &= \sum_t P(y_j=1 \mid t)P(t),\\ P(y_j=1 \wedge x_i=1) &= \sum_t P(y_j=1 \wedge x_i=1 \mid t)P(t) \end{align}$$

Substituting back into the equation for $$w_{ij}$$, multiplying both sides by the denominator, and moving everything to one side, we get:

$$0 = \sum_t P(y_j=1 \wedge x_i=1 \mid t)P(t) - w_{ij} \sum_t P(y_j=1 \mid t)P(t)$$

Assume that all input patterns in the set are equally likely to appear, so that $$P(t)$$ is a constant and can be divided out:

$$0 = \sum_t P(y_j=1 \wedge x_i=1 \mid t) - w_{ij} \sum_t P(y_j=1 \mid t)$$

Since inputs and outputs are either 0 or 1, conveniently the average over all patterns of an input or output (or a combination of input and output) will be equal to the probability of that input or output (or combination of input and output) being 1. Thus:

$$ 0 = \left \langle y_j x_i \right \rangle_t - w_{ij} \left \langle  y_j \right \rangle_t$$

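This can be turned into an update rule that drives the weights toward the above equilibrium condition by applying the term inside the averages to each pattern as it is presented, scaled by a learning rate $$\eta$$: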

$$\Delta w_{ij} = \eta (y_j x_i - w_{ij} y_j) = \eta y_j(x_i - w_{ij})$$
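A small numerical check can make the equilibrium claim concrete. The sketch below follows a single weight, sweeps repeatedly over a made-up set of equally likely patterns, and compares the result with the ratio $$\left \langle y_j x_i \right \rangle_t / \left \langle y_j \right \rangle_t$$ implied by the equilibrium condition. The pattern set, learning rate, starting weight, and number of sweeps are all assumptions chosen for illustration. Because the learning rate is fixed, the weight keeps fluctuating slightly around the target rather than settling on it exactly.

```python
# Hypothetical (x_i, y_j) pairs, one per input pattern t, all equally likely.
patterns = [(1, 1), (0, 1), (0, 1), (0, 1), (1, 0)]

# Target from the equilibrium condition: <y_j x_i>_t / <y_j>_t.
target = sum(x * y for x, y in patterns) / sum(y for _, y in patterns)

eta = 0.01   # small learning rate (assumed value)
w = 0.9      # arbitrary starting weight
for _ in range(2000):           # repeated sweeps over the pattern set
    for x, y in patterns:
        w += eta * y * (x - w)  # CPCA update for a single weight

print(target)       # 0.25
print(round(w, 2))  # close to the target
```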

Interpretation
Since the inputs and outputs are binary, the update rule can be interpreted as follows:


 * If the output is not active, do not alter any weight.
 * If the output is active and an input is not active, subtract the weight (times a learning rate).
 * If the output is active and an input is also active, add 1 minus the weight (times a learning rate).

The second rule has the effect of driving the weight towards zero (asymptotically), and the third rule has the effect of driving the weight towards one (asymptotically). Overall, the rules combine to equilibrate the weight towards the probability that the input is active given that the output is active.
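The asymptotic behaviour of the second and third rules can be seen by applying each one several times in a row. In this sketch the learning rate of 0.5 is an assumption, chosen large so that the pattern is easy to read: each step closes half of the remaining gap, so the weight approaches its limit without reaching it.

```python
eta = 0.5  # assumed learning rate, large for illustration

# Output active, input active: the gap to 1 is halved on each step.
w = 0.0
up = []
for _ in range(4):
    w += eta * 1 * (1 - w)
    up.append(w)
print(up)    # [0.5, 0.75, 0.875, 0.9375]

# Output active, input inactive: the weight is halved on each step.
w = 1.0
down = []
for _ in range(4):
    w += eta * 1 * (0 - w)
    down.append(w)
print(down)  # [0.5, 0.25, 0.125, 0.0625]
```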

k-Winner Takes All
Without some form of competition between neurons fed the same input, neurons may tend to represent the same component of the input. To prevent this, the k-Winner Takes All algorithm selects the $$k$$ neurons with the highest outputs and sets their outputs to one; the outputs of all other neurons are set to zero.
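A minimal sketch of the selection step follows. The function name `k_winners_take_all` is an assumption, and so is the tie-breaking rule (lower index wins), since the description above does not say how ties are handled.

```python
def k_winners_take_all(y, k):
    """Return a binary list with 1 for the k largest entries of y and 0 elsewhere."""
    # Indices sorted by descending output. Python's sort is stable, so ties
    # go to the lower index (an arbitrary choice made for this sketch).
    order = sorted(range(len(y)), key=lambda j: -y[j])
    winners = set(order[:k])
    return [1 if j in winners else 0 for j in range(len(y))]

print(k_winners_take_all([0.2, 1.3, 0.7, 0.9], k=2))  # [0, 1, 0, 1]
```

The resulting binary vector is what the update rule would then use as the outputs $$y_j$$.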

Contrast enhancement
Since weights are initially randomized, some neurons may end up representing weak correlations while others represent strong correlations. In addition, because of the asymptotic nature of the weights, it is difficult for weights to become highly selective (i.e. to get near zero or near one). To fix this problem, the weights are passed through a sigmoid function during computation of the output:

$$y_j = \sum_i x_i \frac{1}{1 + \left ( \frac{\theta (1-w_{ij})}{w_{ij}} \right )^\gamma}$$

where $$\theta$$ is a parameter indicating the center of the sigmoid ($$\theta=1$$ sets it at 0.5, lower values set it lower, higher values set it higher), and $$\gamma$$ is a parameter indicating how sharp the transition is.
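The effect of the two parameters can be explored with a short sketch. It assumes the form of the sigmoid in which $$\theta$$ multiplies $$(1-w_{ij})$$, which places the midpoint at $$w_{ij} = \theta/(1+\theta)$$ and so matches the description that $$\theta=1$$ centers the sigmoid at 0.5 and larger values move the center higher. The function name `contrast_enhance`, the default parameter values, and the handling of the end points 0 and 1 are assumptions made for illustration.

```python
def contrast_enhance(w, theta=1.0, gamma=2.0):
    """Sigmoidal contrast enhancement of a weight w in [0, 1].

    theta and gamma defaults are illustrative values, not taken from the text.
    """
    if w <= 0.0:
        return 0.0  # limit of the expression as w -> 0
    if w >= 1.0:
        return 1.0  # limit of the expression as w -> 1
    return 1.0 / (1.0 + (theta * (1.0 - w) / w) ** gamma)

print(contrast_enhance(0.5))              # 0.5  (midpoint when theta = 1)
print(contrast_enhance(0.25))             # 0.1  (pushed towards 0)
print(round(contrast_enhance(0.75), 3))   # 0.9  (pushed towards 1)
print(contrast_enhance(0.75, theta=3.0))  # 0.5  (midpoint moved up to 0.75)
```

Raising `gamma` makes the transition around the midpoint steeper, so weights on either side are pushed harder towards 0 or 1.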