17: Large Scale Machine Learning

Previous Next Index

Learning with large datasets
Why large datasets?
Learning with large datasets

Stochastic Gradient Descent

Stochastic gradient descent

Mini Batch Gradient Descent
Mini-batch algorithm

Mini-batch gradient descent vs. stochastic gradient descent

Stochastic gradient descent convergence
Checking for convergence
Learning rate 
Online learning
Another example - product search

Map reduce and data parallelism
  • Previously spoke about stochastic gradient descent and other algorithms
    • These could be run on one machine
    • Some problems are just too big for one computer
    • Talk here about a different approach called Map Reduce 
  • Map reduce example
    • We want to do batch gradient descent
      • Assume m = 400
        • Normally m would be more like 400 000 000
        • If m is large this is really expensive
    • Split training sets into different subsets
      • So split training set into 4 pieces
    • Machine 1: use (x1, y1), ..., (x100, y100)
      • Uses first quarter of training set
      • Just does the summation for the first 100

    • So now we have these four temp values, and each machine does 1/4 of the work
    • Once we've got our temp variables
      • Send to to a centralized master server
      • Put them back together
      • Update θ using
        • This equation is doing the same as our original batch gradient descent algorithm
  • More generally map reduce uses the following scheme (e.g. where you split into 4)
  • The bulk of the work in gradient descent is the summation
    • Now, because each of the computers does a quarter of the work at the same time, you get a 4x speedup
    • Of course, in practice, because of network latency, combining results, it's slightly less than 4x, but still good!
  • Important thing to ask is
    • "Can algorithm be expressed as computing sums of functions of the training set?"
      • Many algorithms can!
  • Another example
    • Using an advanced optimization algorithm with logistic regression
      • Need to calculate cost function - see we sum over training set
      • So split training set into x machines, have x machines compute the sum of the value over 1/xth of the data
      • These terms are also a sum over the training set
      • So use same approach
  • So with these results send temps to central server to deal with combining everything
  • More broadly, by taking algorithms which compute sums you can scale them to very large datasets through parallelization 
    • Parallelization can come from
      • Multiple machines
      • Multiple CPUs
      • Multiple cores in each CPU
    • So even on a single compute can implement parallelization 
  • The advantage of thinking about Map Reduce here is because you don't need to worry about network issues
    • It's all internal to the same machine
  • Finally caveat/thought
    • Depending on implementation detail, certain numerical linear algebra libraries can automatically parallelize your calculations across multiple cores
    • So, if this is the case and you have a good vectorization implementation you can not worry about local parallelization and the local libraries sort optimization out for you
  • Hadoop is a good open source Map Reduce implementation
  • Represents a top-level Apache project develop by a global community of developers
    • Large developer community all over the world
  • Written in Java
  • Yahoo has been the biggest contributor
    • Pushed a lot early on
  • Support now from Cloudera
Interview with Cloudera CEO Mike Olson (2010)
  • Seeing a change in big data industry (Twitter, Facebook etc) - relational databases can't scale to the volumes of data being generated
    • Q: Where the tech came from?
      • Early 2000s - Google had too much data to process (and index)
        • Designed and built Map Reduce
          • Buy and mount a load of rack servers
          • Spread the data out among these servers (with some duplication)
          • Now you've stored the data and you have this processing infrastructure spread among the data
            • Use local CPU to look at local data
            • Massive data parallelism
          • Published as a paper in 2004
            • At the time wasn't obvious why it was necessary - didn't support queries, transactions, SQL etc
        • When data was at "human" scale relational databases centralized around a single server was fine
        • But now we're scaling by Moore's law in two ways
          • More data
          • Cheaper to store
  • Q: How do people approach the issues in big data?
    • People still use relational databases - great if you have predictable queries over structured data
      • Data warehousing still used - long term market
    • But the data people want to work with is becoming more complex and bigger
      • Free text, unstructured data doesn't fit will into tables
      • Do sentiment analysis in SQL isn't really that good
      • So to do new kinds of processing need a new type of architecture
    • Hadoop lets you do data processing - not transactional processing - on the big scale
    • Increasingly things like NoSQL is being used
    • Data centers are starting to chose technology which is aimed at a specific problem, rather than trying to shoehorn problems into an ER issue
    • Open source technologies are taking over for developer facing infrastructures and platforms
  • Q: What is Hadoop?
    • Open source implementation of Map reduce (Apache software)
    • Yahoo invested a lot early on - developed a lot the early progress
    • Is two things
      • HDFS
        • Disk on ever server
        • Software infrastructure to spread data
      • Map reduce
        • Lets you push code down to the data in parallel
    • As size increases you can just add more servers to scale up
  • Q: What is memcached?
    • Ubiquitous invisible infrastructure that makes the web run
    • You go to a website, see data being delivered out of a MySQL database
      • BUT, when infrastructure needs to scale querying a disk EVERY time is too much
    • Memcache is a memory layer between disk and web server
      • Cache reads
      • Push writes through incrementally
      • Is the glue that connects a website with a disk-backend
    • Northscale is commercializing this technology
    • New data delivery infrastructure which has pretty wide adoption
  • Q: What is Django?
    • Open source tool/language
    • Written in Python, uses MVC design and is basically a framework for hosting webpages - think Rails in Python (where Rails is in Ruby)
  • Q: What are some of the tool sets being used in data management? What is MySQL drizzle?
    • Drizzle is a re-implementation of MySQL
      • Team developing Drizzle feels they learned a lot of lessons when building MySQL
      • More modern architecture better targeted at web applications
    • NoSQL
      • Distributed hash tables
      • Idea is instead of a SQL query and fetching a record, go look up something from a store by name
        • Go pull a record by name from a store
      • So now systems being developed to support that, such as
        • MongoDB
        • CouchDB
    • Hadoop companion projects
      • Hive
        • Lets you use SQL to talk to a Hadoop cluster
      • HBase
        • Sits on top of HDFS 
        • Gives you key-value storage on top of HDFS - provides abstraction from distributed system
    • Good time to be working in big data
      • Easy to set up small cluster through cloud system
      • Get a virtual cluster through Rackspace or Cloudera