Week 7: Learning
Synaptic plasticity, Hebbian and statistical learning
Biological basis of learning
Hippocampus: first observations of synaptic plasticity
Long Term Potentiation: Long-term increase in synaptic strength. Observed as an increase in EPSP size over time.
Long Term Depression: Long-term decrease in synaptic strength, observed as decrease of EPSP.
Q: Can you speculate what might be happening to the post-synaptic cell that might cause an increase or decrease in EPSP amplitude?
- An increase/decrease in the number of neurotransmitter-gated channels.
- An increase/decrease in the number of leak channels.
- Both of these.
- Neither of these
Explanation: Anything that works to change the cell’s rest conductance (changing number of leak channels) will change its excitability. Similarly, anything that changes the amount of current that can flow into the post synaptic cell upon binding with neurotransmitter (changing neurotransmitter-gated channels) will change its excitability.
Hebb’s rule
Donald Hebb suggested the Hebbian learning rule before the discovery of LTP/D:
If neuron A takes part in the firing of neuron B, then the synapse from A to B is strengthened.
Let the input vector $\mathbf{u}$ be connected to the output unit (scalar) $v$ through the weight vector $\mathbf{w}$.
Assume that the network dynamics are fast enough to be ignored. Then, the output is determined by the steady-state output:
$$v = \mathbf{w} \cdot \mathbf{u} = \mathbf{w}^T \mathbf{u}$$
Then, the continuous Hebbian learning rule can be expressed as the following rate equation:
$$\tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u}$$
In English, the change in connection strength is proportional to the product of input and output: this product is low if either or both units don’t fire, and high if and only if both units fire.
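The discrete-time version of the Hebbian learning rule,
$$\tau_w \frac{\Delta\mathbf{w}}{\Delta t} = v\,\mathbf{u}$$
leads to the following discrete-time implementation:
$$\mathbf{w}_{i+1} = \mathbf{w}_i + \epsilon\, v\,\mathbf{u}$$
where $\epsilon = \Delta t / \tau_w$ is the learning rate.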
Average effect of Hebb’s rule
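With several input vectors $\mathbf{u}$, the Hebbian rule changes the weights, on average, by multiplying them by the input correlation matrix:
$$\tau_w \frac{d\mathbf{w}}{dt} = \langle v\,\mathbf{u} \rangle = \langle \mathbf{u}\mathbf{u}^T \rangle\, \mathbf{w} = Q\mathbf{w}$$
The input correlation matrix is $Q = \langle \mathbf{u}\mathbf{u}^T \rangle$.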
Stability of the Hebbian learning rule
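Does $\mathbf{w}$ converge? Looking at $\frac{d\|\mathbf{w}\|^2}{dt}$ should indicate whether it explodes.
For the continuous Hebbian learning rule,
$$\frac{d\|\mathbf{w}\|^2}{dt} = 2\,\mathbf{w}^T \frac{d\mathbf{w}}{dt} = \frac{2}{\tau_w}\, v\,\mathbf{w}^T\mathbf{u} = \frac{2}{\tau_w}\, v^2 \geq 0$$
which means that $\|\mathbf{w}\|$ grows without bound.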
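As a rough illustration of this instability, the sketch below applies the discrete update $\mathbf{w} \leftarrow \mathbf{w} + \epsilon\, v\, \mathbf{u}$ (with $v = \mathbf{w}^T\mathbf{u}$) to a few made-up inputs and records the weight norm after every step. The inputs, initial weights, learning rate and the helper name `hebb_step` are assumptions chosen for the example.

```python
import numpy as np

def hebb_step(w, u, eps):
    """One discrete Hebbian update: w <- w + eps * v * u, with v = w . u."""
    v = w @ u
    return w + eps * v * u

# Hypothetical toy inputs (one per row) and starting weights.
inputs = np.array([[1.0, 0.5], [0.8, 0.2], [0.9, 0.7]])
w = np.array([0.1, 0.1])
eps = 0.1  # learning rate, playing the role of dt / tau_w

norms = [np.linalg.norm(w)]
for epoch in range(50):
    for u in inputs:
        w = hebb_step(w, u, eps)
        norms.append(np.linalg.norm(w))

print(norms[0], norms[-1])  # the norm never decreases along the way
```

Each update adds $2\epsilon v^2 + \epsilon^2 v^2 \|\mathbf{u}\|^2 \geq 0$ to $\|\mathbf{w}\|^2$, so the recorded norms can only go up.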
The covariance rule
LTD is modeled to happen when there is no or low output even though there is an input. But Hebb’s rule doesn’t model that.
The covariance rule generalizes the Hebbian learning rule to also account for LTD. In the continuous expression of the Hebbian learning rule, $\tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u}$, we replace the output $v$ by its deviation from its average $\langle v \rangle$:
$$\tau_w \frac{d\mathbf{w}}{dt} = (v - \langle v \rangle)\,\mathbf{u}$$
Q: If v (output of post-synaptic neuron) is large relative to the average of v over time in this equation, what does that imply about w?
- w will increase in magnitude and the cell will undergo LTP
- w will decrease in magnitude and the cell will undergo LTP
- w will increase in magnitude and the cell will undergo LTD
- w will decrease in magnitude and the cell will undergo LTD
Q: Conversely, if v is small (approaches 0) relative to the average of v over time, what does that imply about w?
- w will increase in magnitude and the cell will undergo LTP
- w will decrease in magnitude and the cell will undergo LTP
- w will increase in magnitude and the cell will undergo LTD
- w will decrease in magnitude and the cell will undergo LTD
We note that $(v - \langle v \rangle)$ is positive for high outputs, which amounts to Hebbian learning, and negative for low outputs, which makes the change of weights negative, leading to LTD.
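The sign change can be checked numerically. The sketch below applies one discrete covariance-rule step, $\mathbf{w} \leftarrow \mathbf{w} + \epsilon\,(v - \langle v \rangle)\,\mathbf{u}$, to two made-up synapses that receive the same input; the function name `covariance_step`, the input, the weights and the assumed average output `v_mean` are all illustrative.

```python
import numpy as np

def covariance_step(w, u, v_mean, eps):
    """One discrete covariance-rule update: w <- w + eps * (v - <v>) * u."""
    v = w @ u
    return w + eps * (v - v_mean) * u

u = np.array([1.0, 1.0])        # hypothetical presynaptic input (active)
eps = 0.1                       # learning rate
v_mean = 1.0                    # assumed running average of the output

w_high = np.array([1.0, 1.0])   # gives v = 2.0, above the average
w_low = np.array([0.1, 0.1])    # gives v = 0.2, below the average

ltp = covariance_step(w_high, u, v_mean, eps)  # v > <v>: weights grow
ltd = covariance_step(w_low, u, v_mean, eps)   # v < <v>: weights shrink
print(ltp, ltd)
```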
Average effect of the covariance rule
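With several input vectors $\mathbf{u}$, the covariance rule changes the weights, on average, by multiplying them by the input covariance matrix:
$$\tau_w \frac{d\mathbf{w}}{dt} = \langle (v - \langle v \rangle)\,\mathbf{u} \rangle = C\mathbf{w}$$
The covariance matrix of the inputs is $C = \langle (\mathbf{u} - \langle \mathbf{u} \rangle)(\mathbf{u} - \langle \mathbf{u} \rangle)^T \rangle = \langle \mathbf{u}\mathbf{u}^T \rangle - \langle \mathbf{u} \rangle \langle \mathbf{u} \rangle^T$.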
Stability of the covariance rule
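For the covariance learning rule,
$$\frac{d\|\mathbf{w}\|^2}{dt} = 2\,\mathbf{w}^T \frac{d\mathbf{w}}{dt} = \frac{2}{\tau_w}\, v\,(v - \langle v \rangle)$$
Averaged over the inputs, the right-hand side is $\frac{2}{\tau_w}\left(\langle v^2 \rangle - \langle v \rangle^2\right) \geq 0$ (proportional to the variance of $v$), so $\|\mathbf{w}\|$ grows without bound too.
Unstable learning rules can be stabilized by normalizing $\mathbf{w}$ at each update, ensuring that $\|\mathbf{w}\| = 1$ at all times. But that prevents small weights and LTD.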
Oja’s learning rule
For $\alpha > 0$,
$$\tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u} - \alpha\, v^2\, \mathbf{w}$$
Stability of Oja’s rule
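Looking at the derivative of $\|\mathbf{w}\|^2$,
$$\tau_w \frac{d\|\mathbf{w}\|^2}{dt} = 2\,\mathbf{w}^T\left(v\,\mathbf{u} - \alpha\, v^2\,\mathbf{w}\right) = 2\,v^2\left(1 - \alpha\|\mathbf{w}\|^2\right)$$
The norm of $\mathbf{w}$ will converge when $\frac{d\|\mathbf{w}\|^2}{dt} = 0$, which corresponds to $\|\mathbf{w}\|^2 = 1/\alpha$.
So Oja’s rule is stable, and $\|\mathbf{w}\|$ converges to $1/\sqrt{\alpha}$.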
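A small simulation can make this convergence concrete. The sketch below runs the discrete form of Oja’s rule, $\mathbf{w} \leftarrow \mathbf{w} + \epsilon\,(v\,\mathbf{u} - \alpha\, v^2\, \mathbf{w})$, on synthetic zero-mean inputs. The covariance matrix, learning rate, seed and initial weights are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-mean 2D inputs with correlated components.
cov = np.array([[2.0, 1.0], [1.0, 1.0]])
inputs = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=20000)

alpha = 1.0     # Oja's decay constant (alpha > 0)
eps = 0.005     # learning rate
w = np.array([0.3, -0.2])

for u in inputs:
    v = w @ u
    w = w + eps * (v * u - alpha * v**2 * w)   # Hebbian term minus decay term

# Compare with the principal eigenvector of the sample correlation matrix.
Q = inputs.T @ inputs / len(inputs)
eigvals, eigvecs = np.linalg.eigh(Q)           # eigenvalues in ascending order
e1 = eigvecs[:, -1]

w_norm = np.linalg.norm(w)
alignment = abs(w @ e1) / w_norm               # |cosine| of angle between w and e1
print(w_norm, 1 / np.sqrt(alpha), alignment)
```

With a small learning rate, the printed norm should sit close to $1/\sqrt{\alpha}$ and the alignment close to 1, up to the jitter of online updates.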
What do Hebbian, Covariance and Oja’s learning rules do mathematically?
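The averaged Hebbian learning rule gives us an equation in terms of $\mathbf{w}$: $\tau_w \frac{d\mathbf{w}}{dt} = Q\mathbf{w}$. To solve that equation, we use eigenvectors (!).
Let $\mathbf{e}_i$ be the eigenvectors of the correlation matrix $Q$, with eigenvalues $\lambda_i$. Since $Q$ is symmetric, the $\mathbf{e}_i$ can be chosen orthonormal, so we can write $\mathbf{w}(t)$ as $\sum_i c_i(t)\,\mathbf{e}_i$. Substituting it in the expression of the averaged Hebbian rule:
$$\tau_w \frac{dc_i}{dt} = \lambda_i c_i$$
with the solution
$$c_i(t) = c_i(0)\exp(\lambda_i t / \tau_w)$$
Using this solution in $\mathbf{w}(t) = \sum_i c_i(t)\,\mathbf{e}_i$, we get
$$\mathbf{w}(t) = \sum_i c_i(0)\exp(\lambda_i t / \tau_w)\,\mathbf{e}_i$$
$\mathbf{w}(t)$ is a linear combination of the eigenvectors of $Q$, with terms that depend exponentially on the eigenvalues of $Q$. In consequence, in the long-time limit, the term with the largest eigenvalue dominates the expression, and we have $\mathbf{w}(t) \propto \mathbf{e}_1$, where $\mathbf{e}_1$ is the eigenvector with the largest eigenvalue.
Similarly, for Oja’s rule, $\mathbf{w}$ converges to $\mathbf{e}_1 / \sqrt{\alpha}$.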
This means that those rules are equivalent to Principal Component Analysis.
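The dominance of the principal eigenvector can be checked with the closed-form solution of the averaged Hebbian rule, $\mathbf{w}(t) = \sum_i c_i(0)\exp(\lambda_i t / \tau_w)\,\mathbf{e}_i$. In the sketch below, the correlation matrix `Q`, the initial weights `w0` and the helper names `w_at` and `alignment` are assumptions chosen for illustration.

```python
import numpy as np

Q = np.array([[2.0, 1.0], [1.0, 1.0]])   # hypothetical correlation matrix
tau_w = 1.0
w0 = np.array([1.0, -1.0])               # hypothetical initial weights

eigvals, eigvecs = np.linalg.eigh(Q)     # columns of eigvecs are the e_i
c0 = eigvecs.T @ w0                      # c_i(0) = e_i . w(0)

def w_at(t):
    """Closed-form w(t) = sum_i c_i(0) exp(lambda_i t / tau_w) e_i."""
    return eigvecs @ (c0 * np.exp(eigvals * t / tau_w))

def alignment(w, e):
    """Absolute cosine of the angle between w and the unit vector e."""
    return abs(w @ e) / np.linalg.norm(w)

e1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
for t in [0.0, 1.0, 5.0]:
    print(t, alignment(w_at(t), e1))
```

The alignment with $\mathbf{e}_1$ starts low for this choice of `w0` and approaches 1 as $t$ grows, even though the norm of $\mathbf{w}(t)$ keeps increasing.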
Principal Component Analysis
For data with zero mean, the Hebb rule orients the weight vector in the direction of the maximum variance of the dataset (direction of the first eigenvector). It does not do that if the mean is not $0$.
Without constraint on the mean, the covariance learning rule will orient the weights vector in the direction of the maximum variance.
This means that Hebbian learning and its derivatives perform PCA on the dataset, leading to the linear dimensionality reduction that maximizes the variance of the projected data.
When applied to clustered data (for instance, two clouds of points in a 2D input space), the covariance rule will still align with the overall maximum variance, which is less interesting for clustered data.
Networks of neurons performing unsupervised learning are able to take better advantage of such data.
Introduction to Unsupervised Learning
With clustered data, the covariance rule doesn’t find the most interesting feature of the input data, namely the two clusters.
A feedforward network that learns 2 clusters
We have two-dimensional data input through the vector $\mathbf{u}$, connected to output units $v_A$ and $v_B$ via the weight vectors $\mathbf{w}_A$ and $\mathbf{w}_B$. The output of unit $i \in \{A, B\}$ is
$$v_i = \mathbf{w}_i^T \mathbf{u}$$
Neuron A (B) can then represent cluster A (B) by having $\mathbf{w}_A$ ($\mathbf{w}_B$) be equal to the center (mean) of cluster A (B).
Q: Would neuron A or B be more active with an input that is closer (in Euclidean distance) to $\mathbf{w}_A$?
- A
- B
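For a particular input, the most active neuron will be the one whose weight vector is closest to the input:
$$\arg\max_i v_i = \arg\max_i \mathbf{w}_i^T \mathbf{u} = \arg\min_i \|\mathbf{u} - \mathbf{w}_i\|$$
Indeed, assuming that the input and weight vectors have been normalized, minimizing $\|\mathbf{u} - \mathbf{w}_i\|$ is equivalent to maximizing $\mathbf{w}_i^T \mathbf{u}$.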
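To see why the two criteria agree, the squared distance can be expanded as in the following short derivation, which assumes unit-norm vectors ($\|\mathbf{u}\| = \|\mathbf{w}_i\| = 1$):

```latex
\begin{aligned}
\|\mathbf{u} - \mathbf{w}_i\|^2
  &= \|\mathbf{u}\|^2 + \|\mathbf{w}_i\|^2 - 2\,\mathbf{w}_i^T\mathbf{u} \\
  &= 2 - 2\,\mathbf{w}_i^T\mathbf{u}
\end{aligned}
```

The distance therefore decreases exactly when the dot product increases. Without normalization, a unit with a large weight norm could win even when its weight vector is farther from the input.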
Updating the weights
Given a new input vector $\mathbf{u}_t$ ($t$ is not time but a chronological index of the inputs, from 1 to $t$),
- Find the new input’s cluster
- Update the weight of that cluster by setting the weight vector to the running average of the inputs of that cluster.
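How to compute that running average? We want
$$\mathbf{w}_t = \frac{1}{t}\sum_{i=1}^{t} \mathbf{u}_i = \mathbf{w}_{t-1} + \frac{1}{t}\left(\mathbf{u}_t - \mathbf{w}_{t-1}\right)$$
So, in the code, the modification of the weight vector has to be
$$\Delta\mathbf{w} = \epsilon\,(\mathbf{u}_t - \mathbf{w}_{t-1})$$
where $\epsilon$ can be set as $1/t$ to compute the running average, or a small positive value.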
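The equivalence between the incremental update with a step of $1/t$ and the plain average can be checked on a handful of points. The inputs below are made up, and `cluster_inputs` is a hypothetical name for the inputs assigned to one cluster.

```python
import numpy as np

# Hypothetical inputs assigned to one cluster, in order of arrival.
cluster_inputs = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [0.0, 2.0]])

w = np.zeros(2)
for t, u in enumerate(cluster_inputs, start=1):
    eps = 1.0 / t            # a step of 1/t gives the exact running average
    w = w + eps * (u - w)    # move w toward the new input

print(w, cluster_inputs.mean(axis=0))
```

Both printed vectors should agree. Replacing `eps` with a small constant would instead give an exponentially weighted average that keeps tracking recent inputs.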
Q: (1/t) may seem like a logical value for epsilon, but can you speculate why having a positive constant for epsilon might be more beneficial?
- It allows u to remain larger than w
- It allows the algorithm to update w indefinitely.
- (1/t) is actually the best value for epsilon.
- None of these.
Competitive learning
The competitive learning algorithm goes as follows:
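- Select the most active neuron (let it be A), i.e. the one whose weights are closest to the new input (for instance using a winner-take-all (WTA) type of network)
- Update the weights: $\Delta\mathbf{w}_A = \epsilon\,(\mathbf{u}_t - \mathbf{w}_A)$
The weight vector $\mathbf{w}_A$ is shifted from its current location in the direction of the new input $\mathbf{u}_t$, by a fraction $\epsilon$ of the distance separating them.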
Example
We have 3 output neurons with randomly assigned initial weight values $\mathbf{w}_1$, $\mathbf{w}_2$ and $\mathbf{w}_3$. Inputs come one at a time, and one neuron’s weights are adapted for each input. In the end, a partitioning into 3 clusters has emerged, and each weight vector is close to the centroid of its cluster.
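A minimal sketch of the competitive learning loop is given below, on two synthetic clusters rather than three to keep it short. The cluster centers, spread, seed, learning rate and the fixed initial weights are all assumptions. The winner is picked by Euclidean distance because the inputs here are not normalized.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two well-separated 2D clusters, shuffled together.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
data = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centers])
rng.shuffle(data)                               # shuffles the rows in place

weights = np.array([[1.0, 1.0], [4.0, 4.0]])    # assumed initial weights
eps = 0.05                                      # small constant learning rate

for u in data:
    # Winner-take-all: pick the unit whose weight vector is closest to u.
    winner = np.argmin(np.linalg.norm(weights - u, axis=1))
    # Move only the winner's weights toward the input.
    weights[winner] += eps * (u - weights[winner])

print(weights)
```

After one pass over the data, each row of `weights` should sit near one of the two cluster centers. With less favorable initial weights, a unit may never win and stay where it started.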
Relation to ANN algorithms
Competitive learning is closely related to Kohonen maps (SOM). Kohonen maps, however, update the weight vectors of other neurons in the neighborhood of the winning neuron too.
In this update, the neighborhood is taken in a topological sense, as a network-intrinsic neighborhood relationship has been defined prior to learning. Typically, neurons form a topological two- or three-dimensional grid. The initial weights of neurons are typically aligned to their location in the topological map.
In biological neural systems, cortical maps can have a similar organization, with neural preferences for one input feature neatly arranged in the 2-D topological map of neighboring neurons. For instance, V1 orientation preference maps are made of neighboring neurons that prefer similar orientations.
Unsupervised learning
The input $\mathbf{u}$ is generated by a set of hidden causes $\mathbf{v}$.
A generative model defines the transformation $\mathbf{v} \to \mathbf{u}$. The goal is to learn a good model of the data generation process.
There is a prior probability of the causes, $p[\mathbf{v}; G]$, and a likelihood function $p[\mathbf{u} \mid \mathbf{v}; G]$. The model is determined by a set of unknown parameters $G$ that we would like to learn.
The two sub-problems of unsupervised learning are:
- problem of recognition: estimate the causes $\mathbf{v}$ for any observed input $\mathbf{u}$
    - compute the posterior probability $p[\mathbf{v} \mid \mathbf{u}; G]$
- use that information to learn the parameters $G$
Example
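Back to our example of 2 clusters of points A and B in a 2D plane.
We assume a model of two Gaussian generators, $\mathcal{N}(\boldsymbol{\mu}_A, \sigma_A)$ and $\mathcal{N}(\boldsymbol{\mu}_B, \sigma_B)$. It’s a mixture of Gaussians model:
$$p[\mathbf{u}; G] = \sum_{v \in \{A, B\}} p[\mathbf{u} \mid v; G]\, p[v; G] = \sum_{v \in \{A, B\}} \gamma_v\, \mathcal{N}(\mathbf{u}; \boldsymbol{\mu}_v, \sigma_v)$$
where $\gamma_v$ is the prior probability $p[v; G]$. The parameters are $G = (\boldsymbol{\mu}_v, \sigma_v, \gamma_v)$.
The problem of recognition (finding the posterior $p[v \mid \mathbf{u}; G]$) here is similar to the first step of competitive learning, which was to assign an incoming data point to a cluster. Here, we compute the posterior probability of the data point belonging to cluster A or B: “data is A given that its coordinates are $\mathbf{u}$ and that cluster A has distribution parameters $\boldsymbol{\mu}_A, \sigma_A, \gamma_A$” ($p[v = A \mid \mathbf{u}; G]$) and “data is B given that its coordinates are $\mathbf{u}$ and that cluster B has distribution parameters $\boldsymbol{\mu}_B, \sigma_B, \gamma_B$” ($p[v = B \mid \mathbf{u}; G]$), and find which of the two is larger.
The second step of unsupervised learning is to update the parameters $G$ of the generative model, using the posterior $p[v \mid \mathbf{u}; G]$. That corresponds to the second step of competitive learning, which consists in updating the weight vector to reflect the new cluster center. But here, in unsupervised learning, we do that update using the Expectation-Maximization algorithm.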
The EM algorithm for unsupervised learning
The algorithm consists in iterating two steps until convergence.
The Expectation step
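In the E step, we compute the posterior distribution of $v$ (of clusters A and B in our example) for each $\mathbf{u}$.
Using Bayes’ rule, with $\gamma_v = p[v; G]$ and Gaussian likelihoods of mean $\boldsymbol{\mu}_v$ and standard deviation $\sigma_v$,
$$p[v \mid \mathbf{u}; G] = \frac{p[\mathbf{u} \mid v; G]\, p[v; G]}{p[\mathbf{u}; G]} = \frac{\gamma_v\, \mathcal{N}(\mathbf{u}; \boldsymbol{\mu}_v, \sigma_v)}{\sum_{v'} \gamma_{v'}\, \mathcal{N}(\mathbf{u}; \boldsymbol{\mu}_{v'}, \sigma_{v'})}$$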
As compared to competitive learning, this is a softer competition than WTA.
The Maximization step
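In the M step, we change the set of parameters $G$ using the results of the E step.
The mean is changed as follows:
$$\boldsymbol{\mu}_v = \frac{\sum_{\mathbf{u}} p[v \mid \mathbf{u}; G]\, \mathbf{u}}{\sum_{\mathbf{u}} p[v \mid \mathbf{u}; G]}$$
the variance (for isotropic Gaussians, with $N_u$ the dimension of $\mathbf{u}$):
$$\sigma_v^2 = \frac{\sum_{\mathbf{u}} p[v \mid \mathbf{u}; G]\, \|\mathbf{u} - \boldsymbol{\mu}_v\|^2}{N_u \sum_{\mathbf{u}} p[v \mid \mathbf{u}; G]}$$
and the prior probability (with $N$ the number of data points):
$$\gamma_v = \frac{1}{N}\sum_{\mathbf{u}} p[v \mid \mathbf{u}; G]$$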
In comparison, competitive learning adapted the mean, but not the variance or prior probability of the clusters.
The EM algorithm assumes that all datapoints are available at the time of learning (batch learning), whereas real-time learning is possible in competitive learning (online learning).
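The two steps can be sketched in batch form as follows, for a mixture of two isotropic 2D Gaussians. The synthetic data, seed, initial parameters and number of iterations are assumptions made for the example. The E step computes the posterior of each cluster for every point, and the M step re-estimates means, variances and priors as posterior-weighted averages.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two isotropic 2D Gaussian clusters (std 0.7 each).
true_means = np.array([[0.0, 0.0], [4.0, 4.0]])
data = np.vstack([m + 0.7 * rng.standard_normal((300, 2)) for m in true_means])
n_points, n_dims = data.shape

# Assumed initial parameters G = (means, variances, priors).
means = np.array([[1.0, 0.0], [3.0, 5.0]])
variances = np.array([1.0, 1.0])
priors = np.array([0.5, 0.5])

for _ in range(30):
    # E step: posterior p[v | u; G], one row per point, one column per cluster.
    sq_dist = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    likelihood = np.exp(-0.5 * sq_dist / variances) / (2 * np.pi * variances) ** (n_dims / 2)
    joint = priors * likelihood
    posterior = joint / joint.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters, weighting each point by its posterior.
    weight_sum = posterior.sum(axis=0)
    means = (posterior.T @ data) / weight_sum[:, None]
    sq_dist = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    variances = (posterior * sq_dist).sum(axis=0) / (n_dims * weight_sum)
    priors = weight_sum / n_points

print(means, variances, priors)
```

Unlike the winner-take-all assignment of competitive learning, every point contributes to both clusters here, in proportion to its posterior.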
Sparse Coding and Predictive Coding
Principal Component Analysis
Even on large input spaces, the eigenvectors of the input covariance matrix can be linearly combined to approximate input exemplars.
For instance, given a set of black-and-white pictures of faces of $N$ pixels, Turk and Pentland (1991) computed the eigenvectors $\mathbf{e}_i$ of the input covariance matrix, and subsequently expressed each face $\mathbf{u}$ as a linear combination of these “eigenfaces”:
$$\mathbf{u} = \sum_{i=1}^{N} v_i\, \mathbf{e}_i$$
If we restrict the reconstruction to the first $M$ principal eigenvectors (associated with the $M$ largest eigenvalues of the covariance matrix), then a face is
$$\mathbf{u} = \sum_{i=1}^{M} v_i\, \mathbf{e}_i + \text{noise}$$
This model can be used for lossy image compression.
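The compression idea can be sketched on synthetic data standing in for images. Everything below (sample count, dimension, seed, the helper `reconstruct`) is an assumption for illustration. The sketch also subtracts the sample mean before projecting, which the formulas above leave implicit.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for images: 100 samples of N = 8 "pixels" each.
N = 8
mixing = rng.standard_normal((N, N))
samples = rng.standard_normal((100, N)) @ mixing.T
mean = samples.mean(axis=0)
centered = samples - mean

cov = centered.T @ centered / len(centered)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
eigvecs = eigvecs[:, ::-1]                 # reorder: largest eigenvalue first

def reconstruct(M):
    """Approximate every sample with its first M principal components."""
    basis = eigvecs[:, :M]                 # shape (N, M)
    coeffs = centered @ basis              # v_i = e_i . (u - mean)
    return mean + coeffs @ basis.T

errors = [np.mean((samples - reconstruct(M)) ** 2) for M in range(N + 1)]
print(errors)
```

The mean squared reconstruction error shrinks as more eigenvectors are kept and vanishes (up to rounding) when all $N$ are used.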
However, an eigenvector analysis (or, equivalently, a PCA) is not good for local components extraction (edges, parts, etc…).
Linear model of natural images
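We now split up the input sample $\mathbf{u}$ (a natural image of $N$ pixels, for instance) into a number of weighted basis features (e.g. localized oriented edges). The image is the weighted sum of $M$ (this time possibly larger than $N$) basis vectors, plus noise:
$$\mathbf{u} = \sum_{i=1}^{M} v_i\, \mathbf{g}_i + \text{noise} = G\mathbf{v} + \text{noise}$$
where $G$ is the matrix whose columns are the basis vectors $\mathbf{g}_i$, and $\mathbf{v}$ is the column vector of coefficients $v_i$ ($M$ elements).
We need to learn $G$ and $\mathbf{v}$.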
Generative model of natural images
To specify a generative model for natural images, we need to specify a prior probability distribution over the causes, $p[\mathbf{v}; G]$, and a likelihood function $p[\mathbf{u} \mid \mathbf{v}; G]$.
Likelihood function for the generative model of natural images, assuming white noise
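In our linear model $\mathbf{u} = G\mathbf{v} + \text{noise}$, if the noise vector is assumed to be Gaussian white noise (no correlation across the components of the noise vector) with zero mean, then the likelihood function is also Gaussian, with mean $G\mathbf{v}$ and identity covariance:
$$p[\mathbf{u} \mid \mathbf{v}; G] = \mathcal{N}(\mathbf{u}; G\mathbf{v}, I) \propto \exp\left(-\frac{1}{2}\|\mathbf{u} - G\mathbf{v}\|^2\right)$$
Because of the proportionality to the exponential above, the log likelihood is
$$\log p[\mathbf{u} \mid \mathbf{v}; G] = -\frac{1}{2}\|\mathbf{u} - G\mathbf{v}\|^2 + C$$
where, in the quadratic term, $\mathbf{u} - G\mathbf{v}$ can be identified as the difference between the input image and its reconstruction.
Q: Based on the equation $\log p[\mathbf{u} \mid \mathbf{v}; G] = -\frac{1}{2}\|\mathbf{u} - G\mathbf{v}\|^2 + C$, can you see what effect minimizing the squared reconstruction error has on the likelihood function?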
- Minimizes it.
- Maximizes it.
- Stabilizes it.
- None of these.
Prior distribution for the generative model of natural images, assuming that causes are independent
If we can assume that the causes are independent (we typically can’t for natural images, but let’s start like that), then the prior probability for $\mathbf{v}$ is equal to the product of the individual prior probabilities of its components:
$$p[\mathbf{v}; G] = \prod_i p[v_i; G]$$
In log terms:
$$\log p[\mathbf{v}; G] = \sum_i \log p[v_i; G]$$
How to find the individual priors $p[v_i; G]$?
As we assume that the $v_i$ represent very specific components of the image (matching the sparseness of biological neural representations), for any input we only want a few $v_i$ to be active. In consequence, we expect $p[v_i; G]$ to be a leptokurtic (super-Gaussian) distribution (peak at 0, long tail, e.g. exponential $p[v_i] \propto \exp(-|v_i|)$, Cauchy $p[v_i] \propto \frac{1}{1 + v_i^2}$, …).
We represent each $p[v_i; G]$ in the form of an exponential: $p[v_i; G] \propto \exp(g(v_i))$. For instance, $g(v_i) = -|v_i|$ for an exponential pdf, $g(v_i) = -\log(1 + v_i^2)$ for a Cauchy pdf, etc.
Then, we have
$$\log p[\mathbf{v}; G] = \sum_i g(v_i) + c$$
Bayesian approach to finding and learning in the generative model of natural stimuli
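Bayesian = maximize the posterior probability of causes:
$$p[\mathbf{v} \mid \mathbf{u}; G] = k\; p[\mathbf{u} \mid \mathbf{v}; G]\; p[\mathbf{v}; G]$$
where $k$ is a normalization factor that does not depend on $\mathbf{v}$.
This is equivalent to maximizing the log-posterior probability of causes:
$$F(\mathbf{v}, G) = \log p[\mathbf{u} \mid \mathbf{v}; G] + \log p[\mathbf{v}; G] + \log k = -\frac{1}{2}\|\mathbf{u} - G\mathbf{v}\|^2 + \sum_i g(v_i) + K$$
We note that $\|\mathbf{u} - G\mathbf{v}\|^2$ represents the reconstruction error, which we want to minimize, and $\sum_i g(v_i)$ is the sparseness constraint, which we try to maximize.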
Maximization algorithm
We note the similarity with the EM algorithm.
Repeat those two steps:
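- Maximize the log-posterior $F$ w.r.t. $\mathbf{v}$ (keep $G$ fixed) (like the E step in EM)
- Maximize $F$ w.r.t. $G$ (keep $\mathbf{v}$ fixed to the value found in the previous step) (like the M step in EM)
Maximizing $F$ w.r.t. $\mathbf{v}$ with gradient ascent
To maximize $F$ w.r.t. $\mathbf{v}$, we look at $\frac{dF}{d\mathbf{v}}$. We change $\mathbf{v}$ proportionately to the slope $\frac{dF}{d\mathbf{v}}$. Here,
$$\tau_v \frac{d\mathbf{v}}{dt} = \frac{dF}{d\mathbf{v}} = G^T(\mathbf{u} - G\mathbf{v}) + g'(\mathbf{v})$$
where $\tau_v$ is a time constant, $\mathbf{u} - G\mathbf{v}$ is $\mathbf{u}$ minus its reconstruction, so it’s the error, and $g'(\mathbf{v})$ (the derivative of $g$ applied to each component $v_i$) comes from the sparseness constraint.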
This equation has the form of a recurrent network’s firing rate kinetic equation!
Recurrent network implementation of sparse coding
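In $\tau_v \frac{d\mathbf{v}}{dt} = G^T(\mathbf{u} - G\mathbf{v}) + g'(\mathbf{v}) = G^T\mathbf{u} - G^TG\mathbf{v} + g'(\mathbf{v})$,
- $G^T\mathbf{u}$ is the feedforward input to the output layer $\mathbf{v}$: it is the input $\mathbf{u}$ after weighting through the feedforward weight matrix $G^T$
- $-G^TG\mathbf{v}$ is the recurrent (intra-layer) input, with the recurrent weights $-G^TG$
- There is a feedback connection from $\mathbf{v}$ to $\mathbf{u}$ with weights $G$. It computes $G\mathbf{v}$, which is the mean of the generative distribution, and the prediction.
- Hence, $\mathbf{u} - G\mathbf{v}$, the error, is computed at the input layer and propagated to the output layer through the feedforward weights $G^T$. With that, the output layer corrects its estimate $\mathbf{v}$.
- This prediction-correction cycle is iterated until convergence is achieved ($\mathbf{v}$ stable for any given $\mathbf{u}$)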
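The prediction-correction cycle can be sketched as a discretized gradient ascent on the causes, with the basis held fixed. The sketch assumes a Cauchy prior, $g(v_i) = -\log(1 + v_i^2)$, a made-up overcomplete basis `G` of three unit vectors in 2D, a made-up input `u`, and a step size standing in for $dt/\tau_v$.

```python
import numpy as np

# Hypothetical overcomplete basis: M = 3 unit-norm basis vectors for N = 2 pixels.
angles = np.array([0.0, np.pi / 3, 2 * np.pi / 3])
G = np.vstack([np.cos(angles), np.sin(angles)])   # shape (N, M), columns g_i
u = np.array([1.0, 0.2])                          # hypothetical input

def g(v):
    return -np.log(1.0 + v**2)        # Cauchy log-prior, up to a constant

def g_prime(v):
    return -2.0 * v / (1.0 + v**2)    # derivative of g, applied elementwise

def F(v):
    """Log-posterior up to a constant: -0.5 ||u - G v||^2 + sum_i g(v_i)."""
    return -0.5 * np.sum((u - G @ v) ** 2) + np.sum(g(v))

v = np.zeros(3)
step = 0.05                           # plays the role of dt / tau_v
history = [F(v)]
for _ in range(500):
    error = u - G @ v                 # input minus the feedback prediction G v
    v = v + step * (G.T @ error + g_prime(v))
    history.append(F(v))

print(v, history[0], history[-1])
```

With a small enough step, the recorded values of `F` do not decrease from one iteration to the next, which is the discrete counterpart of the network settling to a stable estimate.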
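Learning $G$ with gradient ascent
Like for $\mathbf{v}$, we maximize $F$ w.r.t. $G$ by changing $G$ proportionately to the slope $\frac{dF}{dG}$:
$$\tau_G \frac{dG}{dt} = \frac{dF}{dG} = (\mathbf{u} - G\mathbf{v})\,\mathbf{v}^T$$
Taking $\tau_G > \tau_v$ guarantees that $\mathbf{v}$ converges faster than $G$, making it possible to use the converged $\mathbf{v}$ to update $G$.
This learning rule, $\tau_G \frac{dG}{dt} = (\mathbf{u} - G\mathbf{v})\,\mathbf{v}^T$, is very similar to Oja’s learning rule. However, thanks to the sparseness criterion, it does not learn the eigenvectors.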
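A single discrete step of the rule $\Delta G \propto (\mathbf{u} - G\mathbf{v})\,\mathbf{v}^T$ can be checked in isolation, as sketched below. The sizes, seed and step `eta` (standing in for $dt/\tau_G$) are assumptions, and `v` is drawn at random here as a placeholder for the converged causes.

```python
import numpy as np

rng = np.random.default_rng(4)

N, M = 4, 6                               # pixels, basis vectors (hypothetical sizes)
G = 0.1 * rng.standard_normal((N, M))     # hypothetical initial basis
u = rng.standard_normal(N)                # hypothetical input
v = rng.standard_normal(M)                # placeholder for the converged causes
eta = 0.01                                # plays the role of dt / tau_G, kept small

error_before = u - G @ v
G_new = G + eta * np.outer(error_before, v)   # dG proportional to (u - G v) v^T
error_after = u - G_new @ v

print(np.linalg.norm(error_before), np.linalg.norm(error_after))
```

For a fixed $\mathbf{v}$, the new error equals the old one scaled by $1 - \eta\|\mathbf{v}\|^2$, so a small step always shrinks the reconstruction error for that input.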
Take a guess just for fun. Based on what we have discussed regarding the visual system so far, especially our characterization of receptive field structure in early visual processing, what kind of basis vectors would you predict are learned for natural image patches?
- Eigenvectors of natural image patches
- Objects in the environment
- Complex shapes
- Oriented bars
Feeding the network with patches from natural images, the basis vectors (columns of $G$) learnt resemble the oriented receptive fields of V1. As an interpretive model, this indicates that the brain creates an efficient sparse representation of natural images through these receptive fields.
Predictive coding networks
The sparse coding network is an instance of a predictive coding network, which uses feedback connections to transmit predictions about the input, and feedforward connections to convey a prediction error signal. The predictive estimator (output layer) maintains an estimate of the (hidden) causes of the input (vector $\mathbf{v}$).
Predictive coding models can have, in addition to feedforward and feedback weights, a set of recurrent weights that learn time-varying input correlations, for instance in moving images. The internal representation (recurrent weights) is allowed to vary over time to model the dynamics of the inputs. Another possible component of predictive coding networks is a sensory error gain on the input layer, which makes it possible to model effects such as visual attention.
The visual cortices present a puzzle: the connections between cortical areas in the visual streams are almost always bidirectional. Why the feedback connections?
Predictive coding network models of the visual cortex give an explanation. They suggest that the feedback connections convey predictions of activity from higher to lower cortical areas, and that feedforward connections convey the error signals (activity minus prediction). Contextual effects, surround suppression, etc. can be explained by hierarchical predictive coding models trained on natural images.