- Second major problem type
- In unsupervised learning, we get unlabeled data
- Just told: here is a data set, can you find structure in it?
- One way of doing this would be to cluster the data into groups
- This is a clustering algorithm (a minimal sketch appears at the end of this section)
Clustering algorithm
- Example of a clustering algorithm
- Google news
- Groups news stories into cohesive groups
- Used in many other problems as well
- Genomics
- Microarray data
- Have a group of individuals
- For each individual, measure the expression of a gene
- Run algorithm to cluster individuals into types of people
- Organize computer clusters
- Identify potential weak spots or distribute workload effectively
- Social network analysis
- Astronomical data analysis
- Algorithms give amazing results
- Basically, can you automatically find structure in the data?
- Because we don't give the algorithm the answer, it's unsupervised learning
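- To make "cluster data into groups" concrete, below is a minimal Octave sketch of one standard clustering approach (k-means, which isn't named in the lecture) run on made-up, unlabeled 2-D points; the data, the choice of k = 2, and the naive initialisation are all assumptions for illustration

    % Minimal k-means sketch: group unlabeled 2-D points into k = 2 clusters.
    % The data is made up purely for illustration.
    X = [1.0 1.1; 0.9 1.3; 1.2 0.8;   % points near (1, 1)
         5.0 5.2; 5.1 4.9; 4.8 5.1];  % points near (5, 5)
    k = 2;
    centroids = X(1:k, :);            % naive initialisation: first k points
    idx = zeros(size(X, 1), 1);       % cluster label for each point

    for iter = 1:10
      % Assignment step: label each point with its nearest centroid
      for i = 1:size(X, 1)
        [~, idx(i)] = min(sum((centroids - X(i, :)).^2, 2));
      end
      % Update step: move each centroid to the mean of its assigned points
      for j = 1:k
        centroids(j, :) = mean(X(idx == j, :), 1);
      end
    end

    disp(idx')        % group labels, found without being given any answers
    disp(centroids)   % the two learned cluster centres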
Cocktail party algorithm
- The cocktail party problem
- Lots of overlapping voices - hard to hear what everyone is saying
- Two people talking
- Microphones at different distances from speakers
- Record slightly different versions of the conversation depending on where your microphone is
- But overlapping nonetheless
- Have recordings of the conversation from each microphone
- Give them to a cocktail party algorithm
- Algorithm processes audio recordings
- Determines there are two audio sources
- Separates out the two sources
- This seems like a very complicated problem
- Algorithm can be done with one line of code!
- [W,s,v] = svd((repmat(sum(x.*x,1), size(x,1),1).*x)*x');
- Not easy to understand at a glance
- But, programs can be short!
- Using Octave (or MATLAB) for examples
- Often prototype algorithms in Octave/MATLAB to test them, as it's very fast
- Only once you've shown it works do you migrate it to C++
- Gives much faster, more agile development
- Understanding this algorithm
- svd - singular value decomposition, a linear algebra routine built into Octave
- In C++ this would be very complicated!
- Experience has shown that prototyping in Octave/MATLAB is a really good way to work
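- To make the "built into Octave" point concrete, here is a tiny sketch (the matrix is made up for illustration) showing that a full singular value decomposition really is one call

    % svd is a built-in linear algebra routine: one call factorises
    % A into U * S * V', with the singular values on the diagonal of S.
    A = [4 0; 3 -5];         % any small example matrix
    [U, S, V] = svd(A);      % the entire decomposition in one line
    disp(diag(S)')           % the singular values of A
    disp(norm(U*S*V' - A))   % ~0: the factors reconstruct A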
Linear Regression
- Housing price data example used earlier
- Supervised learning regression problem
- What do we start with?
- Training set (this is your data set)
- Notation (used throughout the course)
- m = number of training examples
- x's = input variables / features
- y's = output / "target" variable
- (x,y) - single training example
- (x^(i), y^(i)) - a specific example (the ith training example)
- i is an index into the training set
- With our training set defined - how do we use it?
- Take training set
- Pass into a learning algorithm
- Algorithm outputs a function (denoted h ) (h = hypothesis)
- This function takes an input (e.g. size of new house)
- Tries to output the estimated value of Y
- How do we represent hypothesis h ?
- hθ(x) = θ0 + θ1x
- What does this mean?
- Means y is a linear function of x!
- θi are parameters
- θ0 is the zero condition (the intercept)
- θ1 is the gradient (the slope)
- This kind of function is a linear regression with one variable
- Also called univariate linear regression
- So in summary
- A hypothesis takes in some variable
- Uses parameters determined by a learning system
- Outputs a prediction based on that input
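- As a small Octave sketch of that summary - the parameter values below are assumed stand-ins for what a learning algorithm would produce, not values from the lecture

    % Hypothesis: h_theta(x) = theta0 + theta1 * x
    theta0 = 50;                    % hypothetical intercept ("zero condition")
    theta1 = 0.06;                  % hypothetical gradient
    h = @(x) theta0 + theta1 * x;   % the hypothesis function

    disp(h(2104))   % estimated price for a house of (say) 2104 square feet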
Linear regression - implementation (cost function)
- A cost function lets us figure out how to fit the best straight line to our data
- Choosing values for θi (parameters)
- Different values give you different functions
- If θ0 is 1.5 and θ1 is 0 then we get a horizontal straight line at y = 1.5
- If θ1 is > 0 then we get a positive slope
- Based on our training set we want to generate parameters which make the straight line fit the data well
- i.e. choose these parameters so hθ(x) is close to y for our training examples
- Basically, use the x values in the training set with hθ(x) to give an output which is as close to the actual y value as possible
- Think of hθ(x) as a "y imitator" - it tries to convert the x into y, and considering we already have y we can evaluate how well hθ(x) does this
- To formalize this:
- We want to solve a minimization problem
- Minimize (hθ(x) - y)²
- i.e. minimize the difference between h(x) and y for each/any/every example
- Sum this over the training set
- Minimize the squared difference between predicted house price and actual house price
- Scale by 1/2m
- The 1/m means we determine the average
- The extra 1/2 makes the math a bit easier, and doesn't change the minimizing values at all (i.e. the θ values that minimize half of J also minimize J itself!)
- Minimizing over θ0/θ1 means we find the values of θ0 and θ1 which give, on average, the minimal deviation of hθ(x) from y when we use those parameters in our hypothesis function
- More cleanly, this is a cost function: J(θ0, θ1) = (1/2m) Σᵢ (hθ(x^(i)) - y^(i))², summed over the m training examples
- And we want to minimize this cost function
- Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time
- So to recap
- Hypothesis - is like your prediction machine, throw in an x value, get a putative y value
- Cost - is a way to, using your training data, determine values for your θ parameters which make the hypothesis as accurate as possible
- This cost function is also called the squared error cost function
- This cost function is a reasonable choice for most regression problems
- Probably the most commonly used cost function
- In case J(θ0, θ1) is a bit abstract, we'll go into what it does, why it works, and how we use it in the coming sections
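- As a minimal Octave sketch of the squared error cost function (the four-example training set is made up for illustration)

    % J(theta0, theta1) = 1/(2m) * sum over i of (h(x_i) - y_i)^2
    x = [1; 2; 3; 4];    % made-up inputs (e.g. house sizes)
    y = [2; 4; 6; 8];    % made-up targets (e.g. prices); here y = 2x exactly
    m = length(y);       % number of training examples

    J = @(theta0, theta1) sum((theta0 + theta1 * x - y).^2) / (2 * m);

    disp(J(0, 2))   % 0    - this hypothesis fits the data perfectly
    disp(J(0, 1))   % 3.75 - a worse hypothesis gives a larger cost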
Cost function - a deeper look
- Let's consider some intuition about the cost function and why we want to use it
- The cost function determines the parameters
- The values chosen for the parameters determine how your hypothesis behaves: different values generate different hypotheses (different straight lines)
- Simplified hypothesis: set θ0 = 0, so hθ(x) = θ1x
- Cost function and goal here are very similar to when we also have θ0, but with a simpler parameter set
- The simplified hypothesis makes visualizing the cost function J() a bit easier
- So the hypothesis passes through (0,0)
- Two key functions we want to understand
- hθ(x)
- Hypothesis is a function of x - function of what the size of the house is
- J(θ1)
- Is a function of the parameter θ1
- So for example, if we compute J(θ1) for a range of θ1 values and plot J(θ1) vs θ1
- We get a polynomial curve (looks like a quadratic)
- The optimization objective for the learning algorithm is to find the value of θ1 which minimizes J(θ1)
- So, here θ1 = 1 is the best value for θ1
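- A short Octave sketch of that picture, using a made-up training set where y = x exactly, so θ1 = 1 should come out as the minimiser

    % Simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0),
    % so the line always passes through the origin.
    x = [1; 2; 3];
    y = [1; 2; 3];       % chosen so y = x exactly
    m = length(y);

    t1 = -0.5:0.1:2.5;   % range of theta1 values to try
    J = arrayfun(@(t) sum((t * x - y).^2) / (2 * m), t1);

    plot(t1, J);                        % quadratic-looking bowl
    xlabel('theta_1'); ylabel('J(theta_1)');

    [~, best] = min(J);
    disp(t1(best))                      % ~1, the minimising theta1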
A deeper insight into the cost function - simplified cost function
- Assume you're familiar with contour plots or contour figures
- Using same cost function, hypothesis and goal as previously
- It's OK to skip parts of this section if you don't understand contour plots
- Using our original, more complex hypothesis with its two parameters (θ0 and θ1)
- Example: say we pick particular values for θ0 and θ1
- Previously we plotted our cost function by plotting J(θ1) vs θ1
- Now we have two parameters
- Plot becomes a bit more complicated
- Generates a 3D surface plot where the axes are θ0, θ1, and J(θ0, θ1)
- We can see that the height (the vertical axis) indicates the value of the cost function, so find where the height is at a minimum
- Instead of a surface plot we can use contour figures/plots
- Set of ellipses in different colors
- Each colour represents the same value of J(θ0, θ1), but the points obviously plot to different locations because θ0 and θ1 vary
- Imagine a bowl-shaped function coming out of the screen, so the middle of the concentric ellipses is the minimum
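- A minimal Octave sketch of both views, with a made-up training set generated by y = 1 + 2x so the minimum sits at (θ0, θ1) = (1, 2)

    % Surface and contour views of J(theta0, theta1).
    x = [1; 2; 3; 4];
    y = [3; 5; 7; 9];            % generated by y = 1 + 2x
    m = length(y);

    t0 = linspace(-2, 4, 100);   % candidate theta0 values
    t1 = linspace(-1, 5, 100);   % candidate theta1 values
    J = zeros(length(t1), length(t0));
    for i = 1:length(t0)
      for j = 1:length(t1)
        J(j, i) = sum((t0(i) + t1(j) * x - y).^2) / (2 * m);
      end
    end

    figure; surf(t0, t1, J);          % bowl-shaped 3D surface
    xlabel('theta_0'); ylabel('theta_1'); zlabel('J');

    figure; contour(t0, t1, J, 30);   % concentric ellipses around (1, 2)
    xlabel('theta_0'); ylabel('theta_1');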