I took notes during a recent talk at UMass. I present them here in the first person, from Bengio's point of view. I will write up some thoughts I had in another post.
You should go check out the Deep Learning book, which is free online. It is meant as an introduction for those entering the field, which is timely because there is a "huge demand by companies" for deep learning.
My goals are to understand some broad principles that underlie both human and machine intelligence. I see these as:
- data: information about the world
- models: some way to hold what we have learned about data
- compute: a way to do computation on that data
- priors: prior knowledge that informs what is possible
- inference: make predictions about the world given our model
I focus on the priors. These are necessary to "defeat the curse of dimensionality". We need strong, general assumptions about the world (ones that restrict us to reasonable models). Then we can use a small amount of data to create models with an exponential number of factors. For example, if we look at a photo of a person in motion, we (humans) have some idea of what the next frame might look like. If we see a photo of someone jogging, we can predict that a photo taken one second later would show them still jogging, having moved forward some amount. For a computer to make this same prediction, it needs to know a lot about the world.
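To make "the curse of dimensionality" concrete, here is a small sketch of my own (not an example from the talk): if we tried to learn about the world by discretizing each input dimension into bins, the number of cells we would need data for grows exponentially with dimension.

```python
# Curse of dimensionality: discretize each of d input dimensions into
# k bins, and the number of distinct cells is k**d. To learn naively
# (one observation per cell) we would need exponentially much data.
def num_cells(bins_per_dim, num_dims):
    return bins_per_dim ** num_dims

for d in [1, 2, 5, 10]:
    print(d, num_cells(10, d))
# With 10 bins per dimension, 10 dimensions already means 10**10 cells --
# far more than we could ever collect training examples for. Strong
# priors let us generalize without visiting every cell.
```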
When we make this prediction about the next frame, we factorize the image into a bunch of different parts. We have knowledge of what humans look like, so we can identify them; we also know how roads look, and how clothes look (so we know they are jogging rather than just stretching), and about sweat (so we know they are active) and facial expressions (so we see their exertion). We can identify all of these factors separately (and some are hierarchical), and we use them together to make a prediction about what will happen next. In order for a computer to do the same thing, it must also break the image up into a bunch of factors (a distributed representation).
Neural networks use a distributed representation, through their horizontal and vertical composition. They learn this composition:
Important features such as face detectors and text detectors are learned, even though we do not ask the networks to specifically learn these things. Instead, they learn them because they help with other tasks (e.g. identifying bowties, which are usually paired with faces, and bookcases, which are usually full of books labeled with text).
Distributed representations allow us to represent an exponential number of outputs given a linear number of features. Another example is this paper which learns different factors of an image in an unsupervised way.
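A minimal sketch of that exponential capacity (my own illustration, not from the talk): n binary features can distinguish 2**n inputs, while a "local" one-hot representation would need one unit per input.

```python
import itertools

n = 10  # number of binary features

# Distributed code: every combination of the n features is a valid
# representation, so n units can distinguish 2**n distinct inputs.
distributed_codes = set(itertools.product([0, 1], repeat=n))
print(len(distributed_codes))  # 1024 representations from only 10 units

# A local (one-hot) code needs a separate unit for each input, so the
# same capacity costs 2**n units instead of n.
one_hot_units = 2 ** n
print(one_hot_units)  # 1024 units
```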
How deep should our neural networks be? Technically any function can be represented with a single hidden layer, but this might require an exponentially large number of units. Instead, for a certain class of problems, we can increase the depth and reduce the width, lowering the total number of required nodes.
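The parity (XOR) function is a standard illustration of this trade-off (my own sketch, not an example Bengio gave): a single hidden layer needs a number of units roughly exponential in the input size, while a deep network can chain two-unit XOR gates, using only a linear number of units.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def xor_gate(a, b):
    # Exact XOR of two {0,1} values using just two ReLU units:
    # relu(a+b) - 2*relu(a+b-1)
    return relu(a + b) - 2.0 * relu(a + b - 1.0)

def deep_parity(bits):
    # Chain XOR gates: depth n-1 with only 2*(n-1) ReLU units in total,
    # versus the ~2**(n-1) hidden units a single-layer network needs.
    out = bits[0]
    for b in bits[1:]:
        out = xor_gate(out, b)
    return out

bits = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0])
print(deep_parity(bits))  # 1.0, since five bits are set
```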
Training deep neural networks was originally thought to be hard because they have many local minima. We are looking for a set of weights that optimizes the model. If we start with random weights and continually nudge them in the direction that lowers the loss, we may end up at a point where every neighboring point has a higher loss, yet we haven't hit the global minimum: some other set of weights does better, but we are stuck in a little dip in the landscape of weights vs. loss. However, as the number of dimensions goes up, it becomes more likely that we will hit saddle points instead of local minima. The probability of a true local minimum decreases, because there are more possible directions in which to descend (in a random model). So we can hope that local descent methods remain effective as the number of dimensions increases, even if they are not guaranteed to find the absolute best solution.
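A tiny sketch of the difference between a saddle point and a local minimum (my own toy example): on f(x, y) = x² − y², gradient descent started exactly on the x-axis slides into the saddle at the origin, but the slightest perturbation in y lets descent escape and keep lowering the loss.

```python
def grad(x, y):
    # f(x, y) = x**2 - y**2, so grad f = (2x, -2y). The origin is a
    # saddle: a minimum along x, a maximum along y.
    return 2.0 * x, -2.0 * y

def descend(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Started exactly on the x-axis, descent converges to the saddle (0, 0).
print(descend(1.0, 0.0))

# A tiny perturbation in y gets amplified: the saddle is not a minimum,
# and descent escapes along the descending y direction.
print(descend(1.0, 1e-6))
```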
Recurrent Neural Networks apply the same network repeatedly over a series of steps, which allows past calculations to inform future ones:
However, the storage of previous computations degrades over time, often more than we want it to, which makes it hard to keep things in memory for a long time. As a solution, we add higher-level connections between time steps that are farther apart, giving us more direct access to things in the past. Really, we are just adding some structure to this graph; it would be great to do this in a more automated way.
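A sketch of that degradation (my own illustration): in a simple linear recurrence h_t = W h_{t-1}, the influence of the initial state shrinks geometrically when W's largest singular value is below 1, while a skip connection from a distant step would keep that information directly reachable.

```python
import numpy as np

rng = np.random.default_rng(0)

# A contractive recurrent weight matrix (largest singular value = 0.5).
W = 0.5 * np.eye(4)
h0 = rng.normal(size=4)

h = h0.copy()
for t in range(50):
    h = W @ h  # each step shrinks the information carried forward

print(np.linalg.norm(h0), np.linalg.norm(h))
# After 50 steps the contribution of h0 has norm 0.5**50 * ||h0|| --
# effectively gone. A skip connection, e.g. h_t = W @ h_{t-1} + h_{t-k},
# would carry the distant state forward without this geometric decay.
```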
Another way to address this problem is an attention-based approach (related to Neural Turing Machines). Here, if we are translating from French to English, instead of going from a French sentence, to a single vector of numbers that describes that sentence, to an English sentence, we do this in a more iterative fashion, drawing from different parts of the French sentence as we generate the English sentence.
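The core of the attention step can be sketched in a few lines (my own simplification; in a real translation model all of these vectors are learned): at each output step, the decoder's state is compared against every encoder state, and a softmax over the scores decides which parts of the source sentence to draw from.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Encoder states: one vector per French word (random stand-ins here).
encoder_states = rng.normal(size=(5, 8))  # 5 source words, dimension 8
decoder_state = rng.normal(size=8)        # current English-side state

# Dot-product scores: how relevant is each source word right now?
scores = encoder_states @ decoder_state
weights = softmax(scores)                 # a distribution over source words

# The context vector is a weighted mix of source positions, recomputed
# at every output step instead of squeezing the whole sentence into
# one fixed vector.
context = weights @ encoder_states
print(weights.round(3), context.shape)
```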
I really see the future here in Unsupervised Learning. Currently, all industrial products use supervised learning. Unsupervised learning is harder, statistically speaking, because it requires modeling the full joint distribution of X and Y, instead of just P(Y|X). Whereas supervised learning involves "discovering some statistical dependence" about the world, unsupervised learning is about learning everything we can from the data. It is more general.
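A small sketch of why the joint is the harder object (my own toy example): P(x, y) factorizes as P(y|x)·P(x), so a generative model carries strictly more information than the supervised conditional, because it must also capture P(x), the distribution over inputs themselves.

```python
from collections import Counter

# Toy (x, y) observations. Supervised learning only needs P(y | x);
# unsupervised/generative modeling needs the full joint P(x, y).
data = [("rain", "umbrella"), ("rain", "umbrella"), ("rain", "coat"),
        ("sun", "hat"), ("sun", "hat"), ("sun", "umbrella")]

joint_counts = Counter(data)
x_counts = Counter(x for x, _ in data)
n = len(data)

p_joint = {xy: c / n for xy, c in joint_counts.items()}               # P(x, y)
p_cond = {xy: c / x_counts[xy[0]] for xy, c in joint_counts.items()}  # P(y | x)
p_x = {x: c / n for x, c in x_counts.items()}                         # P(x)

# The factorization P(x, y) = P(y | x) * P(x): the joint includes the
# input distribution P(x), which the conditional never has to model.
for xy in p_joint:
    assert abs(p_joint[xy] - p_cond[xy] * p_x[xy[0]]) < 1e-12
print(p_joint)
```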
The "killer app" for unsupervised learning is model-based reinforcement learning. Currently, automated cars are trained using supervised learning (showing them how humans drive) or policy learning (trying to maximize a certain outcome). However, we can't gather enough experience to predict rare events. Instead, we need to be able to reason about the world, to predict what states we will be in in the future. Kids don't learn physics from their parents telling them how gravity works, or from being told when they are wrong or right. Instead, they learn it by observing the world and occasionally taking action in it to see the response (reinforcement learning).
Overall, our models are vulnerable to examples that are not in our distribution. We can address this through disentanglement (invariance). It isn't about creating independent variables, but rather latent variables that are independently controllable. For example, when we are listening to speech, we want to know who the person talking is (their age, language, gender), their mood, as well as the words they are saying. These things aren't independent, but they could be disentangled.
In response to a question I got, I believe we can get to this general model through unsupervised learning, by using unstructured generative models. These are like the networks above that learn intermediate features without being told those features exist. One way to do this is by "enforcing things like causality".