08: Neural Networks - Representation


Neural networks - Overview and summary

Why do we need neural networks?

Example: Problems where n is large - computer vision


Neurons and the brain
Model representation I

Artificial neural network - representation of a neurone


Neural networks - notation


Model representation II

Here we'll look at how to carry out the computation efficiently through a vectorized implementation. We'll also consider why NNs are good and how we can use them to learn complex non-linear things.
  • Below is our original problem from before
    • Sequence of steps to compute output of hypothesis are the equations below
  • Define some additional terms
    • z1(2) = Ɵ10(1)x0 + Ɵ11(1)x1 + Ɵ12(1)x2 + Ɵ13(1)x3
    • Which means that
      • a1(2) = g(z1(2))
    • NB: the superscript number (in parentheses) indicates the layer the value is associated with
  • Similarly, we define the others as
    • z2(2) and z3(2) 
    • These values are likewise just linear combinations of the input values
  • If we look at the block we just redefined
    • We can vectorize the neural network computation
    • So let's define
      • x as the feature vector
      • z(2) as the vector of z values from the second layer 

  • z(2) is a 3x1 vector
  • We can vectorize the computation of the neural network in two steps
    • z(2) = Ɵ(1)x
      • i.e. Ɵ(1) is the matrix defined above
      • x is the feature vector
    • a(2) = g(z(2))
      • To be clear, z(2) is a 3x1 vector
      • a(2) is also a 3x1 vector
      • g() applies the sigmoid (logistic) function element-wise to each member of the z(2) vector
  • To make the notation for the input layer consistent:
    • a(1) = x
      • a(1) is the vector of activations in the input layer
      • Obviously the "activation" for the input layer is just the input!
    • So we define x as a(1) for clarity 
      • So 
        • a(1) is the vector of inputs
        • a(2) is the vector of values calculated by the g(z(2)) function
  • Having calculated the z(2) vector, we also need a0(2) for the final hypothesis calculation
                      
  • To take care of the extra bias unit, add a0(2) = 1 
    • So add a0(2) to a(2), making it a 4x1 vector
  • So, 
    • z(3) = Ɵ(2)a(2) 
      • This is the inner term of the above equation
    • hƟ(x) = a(3) = g(z(3))
  • This process is also called forward propagation
    • Start off with the activations of the input units
      • i.e. the x vector as input
    • Forward propagate and calculate the activation of each layer sequentially
    • This is a vectorized version of this implementation (a numpy sketch follows below)
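As a rough sketch of this vectorized forward propagation in Python/numpy (the layer sizes, the random weights, and the function name forward_propagate are illustrative assumptions, not something defined in the lecture):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2):
    """Forward propagation for a network with one hidden layer.

    x      - input feature vector, shape (3,)   i.e. (x1, x2, x3)
    Theta1 - weights mapping layer 1 -> layer 2, shape (3, 4) (includes bias column)
    Theta2 - weights mapping layer 2 -> layer 3, shape (1, 4) (includes bias column)
    """
    a1 = np.concatenate(([1.0], x))            # a(1) = x, with the bias unit x0 = 1 added
    z2 = Theta1 @ a1                           # z(2) = Theta(1) a(1)  -> 3x1 vector
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2) = g(z(2)), plus bias a0(2) = 1 -> 4x1
    z3 = Theta2 @ a2                           # z(3) = Theta(2) a(2)
    return sigmoid(z3)                         # h(x) = a(3) = g(z(3))

# Example with random (untrained) weights, just to show the shapes line up
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])
print(forward_propagate(x, rng.normal(size=(3, 4)), rng.normal(size=(1, 4))))
```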
Neural networks learning their own features
  • Diagram below looks a lot like logistic regression

  • Layer 3 is a logistic regression node
    • The hypothesis output = g(Ɵ10(2)a0(2) + Ɵ11(2)a1(2) + Ɵ12(2)a2(2) + Ɵ13(2)a3(2))
    • This is just logistic regression 
      • The only difference is that, instead of the input being the feature vector, the features are the values calculated by the hidden layer
  • The features a1(2), a2(2), and a3(2) are calculated/learned - not original features
  • So the mapping from layer 1 to layer 2 (i.e. the calculations which generate the a(2) features) is determined by another set of parameters - Ɵ(1)
    • So instead of being constrained by the original input features, a neural network can learn its own features to feed into logistic regression
    • Depending on the Ɵ(1) parameters you can learn some interesting things
      • Flexibility to learn whatever features it wants to feed into the final logistic regression calculation
        • So, if we compare this to the logistic regression we saw previously, there you would have to engineer the features yourself to define the best way to classify or describe something
        • Here, we're letting the hidden layers do that, so we feed the hidden layers our input values, and let them learn whatever gives the best final result to feed into the final output layer
  • As well as the networks already seen, other architectures (topology) are possible
    • More/less nodes per layer
    • More layers
    • Once again - say layer 2 has three hidden units and layer 3 has two hidden units; by the time you get to the output layer you can build up a very interesting non-linear hypothesis

  • Some of the intuitions here are complicated and hard to understand
    • In the following lectures we're going to go through a detailed example to understand how to do non-linear analysis

Neural network example - computing a complex, nonlinear function of the input
  • Non-linear classification: XOR/XNOR
    •  x1, x2 are binary
  • Example on the right shows a simplified version of the more complex problem we're dealing with (on the left)
  • We want to learn a non-linear decision boundary to separate the positive and negative examples
y = x1 XOR x2 
      x1 XNOR x2   (the XNOR version is the one we'll actually work with)

Where XNOR = NOT (x1 XOR x2)
  • Positive examples are when both inputs are true or both are false
    • Let's start with something a little more straight forward...
    • Don't worry about how we're determining the weights (Ɵ values) for now - just get a flavor of how NNs work

Neural Network example 1: AND function 
  • Simple first example

  • Can we get a one-unit neural network to compute this logical AND function? (probably...)
    • Add a bias unit
    • Add some weights for the networks
      • What are weights?
        • Weights are the parameter values which multiply into the input nodes (i.e. Ɵ)

  • Sometimes it's convenient to add the weights into the diagram
    • These values are in fact just the Ɵ parameters so
      • Ɵ10(1) = -30
      • Ɵ11(1) = 20
      • Ɵ12(1) = 20
    • To use our original notation
  • Look at the four input values
  • So, as we can see, when we evaluate each of the four possible inputs, only (1,1) gives a positive output (see the quick check below)
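A quick numerical check of this AND unit, just evaluating g(-30 + 20x1 + 20x2) with the weights above (the script itself is only an illustrative sanity check, not part of the lecture):

```python
from math import exp

g = lambda z: 1.0 / (1.0 + exp(-z))   # sigmoid

# h(x) = g(-30 + 20*x1 + 20*x2), using the weights above
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(g(-30 + 20 * x1 + 20 * x2), 4))

# Output: only (1, 1) gives ~1, everything else is ~0
# 0 0 0.0
# 0 1 0.0
# 1 0 0.0
# 1 1 1.0
```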
Neural Network example 2: NOT function 
  • How about negation?

  • Negation is achieved by putting a large negative weight in front of the variable you want to negate (a quick numerical check follows below)
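For example, one choice of weights that implements NOT x1 is a large positive bias and a large negative weight on x1, e.g. Ɵ = [10, -20] (these particular numbers are just one example that works):

```python
from math import exp

g = lambda z: 1.0 / (1.0 + exp(-z))   # sigmoid

# h(x) = g(10 - 20*x1): the big negative weight on x1 flips the output
for x1 in (0, 1):
    print(x1, round(g(10 - 20 * x1), 4))

# Output:
# 0 1.0   g(+10) ~ 1
# 1 0.0   g(-10) ~ 0
```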
Neural Network example 3: XNOR function 
  • So how do we make the XNOR function work?
    • XNOR is short for NOT XOR 
      • i.e. NOT an exclusive or, so either go big (1,1) or go home (0,0)
    • So we want to structure this so the inputs which produce a positive output are
      • AND (i.e. both true)
        OR
      • Neither (which we can shortcut by saying not only one being true)
  • So we combine these into a neural network as shown below (a numerical check of the combined network follows):


  • Simplez!
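Putting the pieces together numerically: the two hidden units are the AND and (NOT x1) AND (NOT x2) building blocks from above, and the output unit is an OR-style unit with weights [-10, 20, 20]. Treat this as an illustrative sketch of one set of weights that works, not the only possible choice:

```python
from math import exp

g = lambda z: 1.0 / (1.0 + exp(-z))   # sigmoid

def xnor(x1, x2):
    a1 = g(-30 + 20 * x1 + 20 * x2)    # hidden unit 1: x1 AND x2
    a2 = g(10 - 20 * x1 - 20 * x2)     # hidden unit 2: (NOT x1) AND (NOT x2)
    return g(-10 + 20 * a1 + 20 * a2)  # output unit: a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))

# Output: positive only when both inputs match
# 0 0 1
# 0 1 0
# 1 0 0
# 1 1 1
```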
Neural network intuition - handwritten digit classification
  • Yann LeCun = machine learning pioneer
  • An early machine learning application was postcode reading
    • Hilarious music, impressive demonstration!
Multiclass classification
  • Multiclass classification is, unsurprisingly, when you distinguish between more than two categories (i.e. more than 1 or 0)
  • With the handwritten digit recognition problem there are 10 possible categories (0-9)
    • How do you do that?
    • Done using an extension of one vs. all classification 
  • Recognizing pedestrian, car, motorbike or truck
    • Build a neural network with four output units
    • Output a vector of four numbers
      • 1 is 0/1 pedestrian
      • 2 is 0/1 car
      • 3 is 0/1 motorcycle
      • 4 is 0/1 truck
    • When the image is a pedestrian we get [1,0,0,0], and so on
  • Just like one vs. all described earlier
    • Here we have four logistic regression classifiers

  • Training set here is images of our four classifications
    • While previously we'd written y as an integer {1,2,3,4}
    • Now we represent y as a 4x1 vector of 0s and 1s - e.g. [1,0,0,0] for a pedestrian, [0,1,0,0] for a car, and so on (a small sketch of this encoding follows)
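A minimal sketch of that vector encoding (the helper name to_indicator and the class ordering are assumptions made for illustration):

```python
import numpy as np

# Map an integer label y in {1, 2, 3, 4} to a 4-element indicator vector
def to_indicator(y, num_classes=4):
    vec = np.zeros(num_classes)
    vec[y - 1] = 1.0
    return vec

print(to_indicator(1))  # pedestrian -> [1. 0. 0. 0.]
print(to_indicator(2))  # car        -> [0. 1. 0. 0.]
print(to_indicator(4))  # truck      -> [0. 0. 0. 1.]
```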