Conditional principal components analysis

Conditional principal components analysis seeks to restrict neurons to perform principal components analysis only when activated.^[1]

Model

Model of a neuron. j is the index of the neuron when there is more than one neuron. For a linear neuron, the activation function is not present (or simply the identity function).

We use a set of linear neurons with binary inputs. Given a set of k-dimensional binary inputs represented as a column vector <math>\vec{x} = [x_1, x_2, \cdots, x_k]^T</math>, and a set of m linear neurons with (initially random) synaptic weights from the inputs, represented as a matrix formed by m weight column vectors (i.e. a k row x m column matrix):

<math>\mathbf{W} = \begin{bmatrix}

w_{11} & w_{12} & \cdots & w_{1m}\\ w_{21} & w_{22} & \cdots & w_{2m}\\ \vdots & & & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{km}

\end{bmatrix}</math>

where <math>w_{ij}</math> is the weight between input i and neuron j, the output of the set of neurons is defined as follows:

<math>\vec{y} = \mathbf{W}^T \vec{x}</math>

The CPCA rule gives the update rule which is applied after an input pattern is presented:

<math>\Delta w_{ij} = \eta y_j(x_i - w_{ij})</math>

With a set of such neurons, typically a k-Winner-Takes-All pass is run before the update: all neurons are evaluated, and the <math>k</math> neurons with the highest outputs have their outputs set to 1, while the rest have their outputs set to 0.

Derivation

We want the weight from input <math>i</math> to neuron <math>j</math> to eventually settle at the probability that neuron <math>j</math> will be activated given that input <math>i</math> is activated. That is, when the weight is at equilibrium, we have:

By the definition of conditional probability:

<math>w_{ij} = P(y_j=1 \mid x_i=1) = \frac{P(y_j=1 \wedge x_i=1)}{P(y_j=1)}</math>

Using the total probability theorem, we can condition the numerator and denominator on the input patterns. If an input pattern is <math>t</math>, then we have:

<math>\begin{align}

P(y_j=1) &= \sum_t P(y_j=1 \mid t)P(t),\\ P(y_j=1 \wedge x_i=1) &= \sum_t P(y_j=1 \wedge x_i=1 \mid t)P(t)

\end{align}</math>

Substituting back into the equation for <math>w_{ij}</math> and doing some rearrangement, we get:

<math>0 = \sum_t P(y_j=1 \wedge x_i=1 \mid t)P(t) - w_{ij} \sum_t P(y_j=1 \mid t)P(t)</math>

A good assumption is that all input patterns in the set of input patterns are equally likely to appear, so that <math>P(t)</math> is a constant and can be eliminated:

<math>0 = \sum_t P(y_j=1 \wedge x_i=1 \mid t) - w_{ij} \sum_t P(y_j=1 \mid t)</math>

Since inputs and outputs are either 0 or 1, conveniently the average over all patterns of an input or output (or a combination of input and output) will be equal to the probability of that input or output (or combination of input and output) being 1. Thus:

<math> 0 = \left \langle y_j x_i \right \rangle_t - w_{ij} \left \langle y_j \right \rangle_t</math>

We can easily turn this into an update rule which will drive the weights to the above equilibrium condition:

<math>\Delta w_{ij} = \eta (y_j x_i - w_{ij} y_j) = \eta y_j(x_i - w_{ij})</math>

Interpretation

Since the inputs and outputs are binary, the update rule can be interpreted as follows:

If the output is not active, do not alter any weight.
If the output is active and an input is not active, subtract the weight (times a learning rate).
If the output is active and an input is also active, add 1 minus the weight (times a learning rate).

This will have the effect of attempting to saturate the weight towards zero or one.

References

↑ Template:Cite book

[1] Template:Cite book

[1]

Conditional principal components analysis

Contents

Model

Derivation

Interpretation

References

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools