What is backpropagation really doing? | Deep learning, chapter 3


Here we tackle backpropagation, the core algorithm behind how neural networks learn. After a quick recap of where we are, the first thing I’ll do is an intuitive walkthrough of what the algorithm is actually doing, without any reference to the formulas. Then, for those of you who do want to dive into the math, the next video goes into the calculus underlying all this. If you watched the last two videos, or if you’re just jumping in with the appropriate background, you know what a neural network is and how it feeds forward information. Here we’re doing the classic example of recognizing handwritten digits, whose pixel values get fed into the first layer of the network with 784 neurons. And I’ve been showing a network with two hidden layers having just 16 neurons each, and an output layer of 10 neurons, indicating which digit the network is choosing as its answer. I’m also expecting you to understand gradient descent as described in the last video, and how what we mean by learning is that we want to find which weights and biases minimize a certain cost function. As a quick reminder, for the cost of a single training example, what you do is take the output that the network gives, along with the output that you wanted it to give, and you just add up the squares of the differences between each component. Doing this for all of your tens of thousands of training examples and averaging the results gives you the total cost of the network. And as if that’s not enough to think about, as described in the last video, the thing that we’re looking for is the negative gradient of this cost function, which tells you how you need to change all of the weights and biases, all of these connections, so as to most efficiently decrease the cost. Backpropagation, the topic of this video, is an algorithm for computing that crazy complicated gradient.
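The cost recap above can be sketched in a few lines of plain Python. This is only an illustration: the function names `example_cost` and `total_cost` and the ten output values are made up for the example and do not come from the video.

```python
def example_cost(output, target):
    """Cost of one training example: sum of squared differences, component by component."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def total_cost(outputs, targets):
    """Total cost: the average of the per-example costs."""
    costs = [example_cost(o, t) for o, t in zip(outputs, targets)]
    return sum(costs) / len(costs)


# Hypothetical output activations for an image of a 2 (made-up numbers).
output = [0.5, 0.8, 0.2, 0.1, 0.3, 0.0, 0.4, 0.6, 0.2, 0.1]
# Desired output: 1.0 for the digit 2 neuron, 0.0 for every other neuron.
target = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

cost = example_cost(output, target)
print(cost)
```

In a real run, `total_cost` would be applied to the outputs for all of the training examples rather than to a single one.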
And the one idea from the last video that I really want you to hold firmly in your mind right now is that because thinking of the gradient vector as a direction in 13,000 dimensions is, to put it lightly, beyond the scope of our imaginations, there’s another way you can think about it: the magnitude of each component here is telling you how sensitive the cost function is to each weight and bias. For example, let’s say you go through the process I’m about to describe, and you compute the negative gradient, and the component associated with the weight on this edge here comes out to be 3.2, while the component associated with this edge here comes out as 0.1. The way you would interpret that is that the cost function is 32 times more sensitive to changes in that first weight. So if you were to wiggle that value just a little bit, it’s gonna cause some change to the cost, and that change is 32 times greater than what the same wiggle to that second weight would give. Personally, when I was first learning about backpropagation, I think the most confusing aspect was just the notation and the index chasing of it all. But once you unwrap what each part of this algorithm is really doing, each individual effect that it’s having is actually pretty intuitive. It’s just that there are a lot of little adjustments getting layered on top of each other. So I’m gonna start things off here with a complete disregard for the notation, and just step through the effects that each training example is having on the weights and biases. Because the cost function involves averaging a certain cost per example over all the tens of thousands of training examples, the way that we adjust the weights and biases for a single gradient descent step also depends on every single example, or rather in principle it should, but for computational efficiency we’re going to do a little trick later to keep you from needing to hit every single example for every single step.
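To make the "32 times more sensitive" reading concrete, here is a small sketch with a made-up two-weight cost. The function `toy_cost` is purely hypothetical; its slopes were chosen to match the 3.2 and 0.1 in the example, and a real network's cost would depend on all 13,000 or so weights and biases.

```python
def toy_cost(weights):
    """A made-up cost surface whose slopes match the numbers in the example."""
    w1, w2 = weights
    return 1.0 - 3.2 * w1 - 0.1 * w2


def wiggle(cost, weights, index, nudge=1e-4):
    """Change in cost caused by adding a small nudge to a single weight."""
    nudged = list(weights)
    nudged[index] += nudge
    return cost(nudged) - cost(weights)


weights = [0.5, 0.5]
change_first = wiggle(toy_cost, weights, 0)   # same wiggle applied to the first weight
change_second = wiggle(toy_cost, weights, 1)  # ... and to the second weight
ratio = change_first / change_second
print(ratio)  # approximately 32
```

Both changes are negative here, because nudging either weight upward lowers this toy cost, which is what a positive component of the negative gradient means.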
In any case, right now, all we’re gonna do is focus our attention on one single example: this image of a 2. What effect should this one training example have on how the weights and biases get adjusted? Let’s say we’re at a point where the network is not well trained yet, so the activations in the output are gonna look pretty random, maybe something like 0.5, 0.8, 0.2, on and on. Now we can’t directly change those activations, we only have influence on the weights and biases, but it is helpful to keep track of which adjustments we wish should take place to that output layer, and since we want it to classify the image as a 2, we want that third value to get nudged up, while all of the others get nudged down. Moreover, the sizes of these nudges should be proportional to how far away each current value is from its target value. For example, the increase to that number 2 neuron’s activation is, in a sense, more important than the decrease to the number 8 neuron, which is already pretty close to where it should be. So zooming in further, let’s focus just on this one neuron, the one whose activation we wish to increase. Remember, that activation is defined as a certain weighted sum of all of the activations in the previous layer, plus a bias, which has all been plugged into something like the sigmoid squishification function or a ReLU. So there are three different avenues that can team up together to help increase that activation: you can increase the bias, you can increase the weights, and you can change the activations from the previous layer. Focusing just on how the weights should be adjusted, notice how the weights actually have differing levels of influence: the connections with the brightest neurons from the preceding layer have the biggest effect, since those weights are multiplied by larger activation values.
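The point about brighter neurons giving their weights more influence can be sketched for a single sigmoid neuron. The numbers and the helper name `weight_sensitivities` are assumptions made for this illustration; the sketch relies on the standard fact that the derivative of the activation with respect to a weight is the sigmoid's slope times the activation that weight multiplies.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_activation(weights, prev_activations, bias):
    """Weighted sum of the previous layer's activations, plus a bias, squished by a sigmoid."""
    z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    return sigmoid(z)


def weight_sensitivities(weights, prev_activations, bias):
    """How much the activation changes per unit change in each incoming weight."""
    a = neuron_activation(weights, prev_activations, bias)
    slope = a * (1.0 - a)  # derivative of the sigmoid at this neuron's weighted sum
    return [slope * prev for prev in prev_activations]


# Made-up previous-layer activations: bright, dim, off, medium.
prev_activations = [0.9, 0.1, 0.0, 0.6]
weights = [0.2, -0.4, 0.7, 0.1]
bias = -0.3

sens = weight_sensitivities(weights, prev_activations, bias)
print(sens)  # largest for the brightest neuron, exactly zero for the one that is off
```

Note that the sensitivity depends on how bright the previous neuron is, not on how large the weight currently is.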
So if you were to increase one of those weights, it actually has a stronger influence on the ultimate cost function than increasing the weights of connections with dimmer neurons, at least as far as this one training example is concerned. Remember, when we talked about gradient descent, we don’t just care about whether each component should get nudged up or down, we care about which ones give you the most bang for your buck. This, by the way, is at least somewhat reminiscent of a theory in neuroscience for how biological networks of neurons learn, Hebbian theory, often summed up in the phrase “neurons that fire together wire together”. Here, the biggest increases to weights, the biggest strengthening of connections, happen between neurons which are the most active and the ones which we wish to become more active. In a sense, the neurons that are firing while seeing a 2 get more strongly linked to those firing when thinking about a 2. To be clear, I really am not in a position to make statements one way or another about whether artificial networks of neurons behave anything like biological brains, and this fires-together-wire-together idea comes with a couple of meaningful asterisks. But taken as a very loose analogy, I do find it interesting to note. Anyway, the third way that we can help increase this neuron’s activation is by changing all the activations in the previous layer. Namely, if everything connected to that digit 2 neuron with a positive weight got brighter, and if everything connected with a negative weight got dimmer, then that digit 2 neuron would become more active. And similar to the weight changes, you’re going to get the most bang for your buck by seeking changes that are proportional to the size of the corresponding weights. Now of course, we cannot directly influence those activations, we only have control over the weights and biases. But just as with the last layer, it’s helpful to just keep a note of what those desired changes are.
But keep in mind, zooming out one step here, this is only what that digit 2 output neuron wants. Remember, we also want all of the other neurons in the last layer to become less active, and each of those other output neurons has its own thoughts about what should happen to that second-to-last layer. So, the desire of this digit 2 neuron is added together with the desires of all the other output neurons for what should happen to this second-to-last layer, again in proportion to the corresponding weights, and in proportion to how much each of those neurons needs to change. This right here is where the idea of propagating backwards comes in. By adding together all these desired effects, you basically get a list of nudges that you want to happen to the second-to-last layer. And once you have those, you can recursively apply the same process to the relevant weights and biases that determine those values, repeating the same process I just walked through and moving backwards through the network. And zooming out a bit further, remember that this is all just how a single training example wishes to nudge each one of those weights and biases. If we only listened to what that 2 wanted, the network would ultimately be incentivized just to classify all images as a 2. So what you do is you go through this same backprop routine for every other training example, recording how each of them would like to change the weights and the biases, and you average together those desired changes. This collection here of the averaged nudges to each weight and bias is, loosely speaking, the negative gradient of the cost function referenced in the last video, or at least something proportional to it. I say “loosely speaking” only because I have yet to get quantitatively precise about those nudges.
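The adding-together of desires can be sketched for one tiny layer. This is a minimal illustration under assumptions that go beyond the informal description: sigmoid activations, the squared-difference cost, a made-up layer of 3 neurons feeding 2 output neurons, and the hypothetical helper name `previous_layer_nudges`. It uses the chain rule from calculus to make "in proportion to" precise.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, prev):
    """Output activations; weights[j][k] connects previous neuron k to output neuron j."""
    return [sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
            for row, b in zip(weights, biases)]


def cost(output, target):
    return sum((o - t) ** 2 for o, t in zip(output, target))


def previous_layer_nudges(weights, biases, prev, target):
    """Desired change to each previous-layer activation, summed over all output neurons."""
    output = forward(weights, biases, prev)
    # Each output neuron's own wish: bigger when it is further from its target.
    wishes = [-2.0 * (o - t) * o * (1.0 - o) for o, t in zip(output, target)]
    # Add the wishes together, each weighted by the connection it travels back along.
    return [sum(weights[j][k] * wishes[j] for j in range(len(weights)))
            for k in range(len(prev))]


prev = [0.9, 0.1, 0.6]                         # made-up second-to-last layer activations
weights = [[0.5, -0.3, 0.8], [-0.6, 0.4, 0.2]]  # made-up weights
biases = [0.1, -0.2]
target = [1.0, 0.0]                            # first output should rise, second should fall

nudges = previous_layer_nudges(weights, biases, prev, target)
print(nudges)
```

Repeating the same step with `nudges` playing the role of the wishes for the layer before is the backwards recursion described above.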
But if you understood every change that I just referenced, why some are proportionally bigger than others, and how they all need to be added together, you understand the mechanics of what backpropagation is actually doing. By the way, in practice it takes computers an extremely long time to add up the influence of every single training example for every single gradient descent step. So here’s what’s commonly done instead: you randomly shuffle your training data, and then divide it into a whole bunch of mini-batches, let’s say, each one having 100 training examples. Then you compute a step according to the mini-batch. It’s not going to be the actual gradient of the cost function, which depends on all of the training data, not this tiny subset. So it’s not the most efficient step downhill. But each mini-batch does give you a pretty good approximation, and more importantly, it gives you a significant computational speedup. If you were to plot the trajectory of your network on the relevant cost surface, it would be a little more like a drunk man stumbling aimlessly down a hill, but taking quick steps, rather than a carefully calculating man determining the exact downhill direction of each step before taking a very slow and careful step in that direction. This technique is referred to as “stochastic gradient descent”. There’s kind of a lot going on here, so let’s just sum it up for ourselves, shall we? Backpropagation is the algorithm for determining how a single training example would like to nudge the weights and biases, not just in terms of whether they should go up or down, but in terms of what relative proportions of those changes cause the most rapid decrease to the cost. A true gradient descent step would involve doing this for all your tens of thousands of training examples and averaging the desired changes that you get. But that’s computationally slow.
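The shuffle-and-split routine can be sketched on a toy problem. Everything here is an assumption made for illustration: a one-weight model `y = weight * x` stands in for the network, the data are generated from `y = 3x`, and the learning rate and the names `make_mini_batches` and `sgd_epoch` are invented for the example.

```python
import random


def make_mini_batches(examples, batch_size, rng):
    """Randomly shuffle the training data, then divide it into mini-batches."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


def sgd_epoch(weight, examples, batch_size, learning_rate, rng):
    """One pass over all the mini-batches, taking one step per mini-batch."""
    for batch in make_mini_batches(examples, batch_size, rng):
        # Gradient of the average squared error over this mini-batch only.
        grad = sum(2.0 * (weight * x - y) * x for x, y in batch) / len(batch)
        weight -= learning_rate * grad
    return weight


rng = random.Random(0)
# 1,000 toy training examples generated by y = 3x.
examples = [(x / 100.0, 3.0 * x / 100.0) for x in range(1, 1001)]

weight = 0.0
for _ in range(20):
    weight = sgd_epoch(weight, examples, batch_size=100, learning_rate=0.01, rng=rng)
print(weight)  # close to 3.0
```

Because this toy data has no noise, every mini-batch agrees on the downhill direction; with real training data each mini-batch step would only approximate the true gradient, which is where the stumbling comes from.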
So instead you randomly subdivide the data into these mini-batches and compute each step with respect to a mini-batch. Repeatedly going through all of the mini batches and making these adjustments, you will converge towards a local minimum of the cost function, which is to say, your network is going to end up doing a really good job on the training examples. So with all of that said, every line of code that would go into implementing backprop actually corresponds with something that you have now seen, at least in informal terms. But sometimes knowing what the math does is only half the battle, and just representing the damn thing is where it gets all muddled and confusing. So for those of you who do want to go deeper, the next video goes through the same ideas that were just presented here but in terms of the underlying calculus, which should hopefully make it a little more familiar as you see the topic in other resources. Before that, one thing worth emphasizing is that for this algorithm to work, and this goes for all sorts of machine learning beyond just neural networks, you need a lot of training data. In our case, one thing that makes handwritten digits such a nice example is that there exists the MNIST database with so many examples that have been labeled by humans. So a common challenge that those of you working in machine learning will be familiar with is just getting the labeled training data that you actually need, whether that’s having people label tens of thousands of images or whatever other data type you might be dealing with. And this actually transitions really nicely to today’s extremely relevant sponsor – CrowdFlower, which is a software platform where data scientists and machine learning teams can create training data. They allow you to upload text or audio or image data, and have it annotated by real people. 
You may have heard of the human-in-the-loop approach before, and this is essentially what we’re talking about here: “leveraging human intelligence to train machine intelligence”. They employ a whole bunch of pretty smart quality control mechanisms to keep the data clean and accurate, and they’ve helped to train, test, and tune thousands of data and AI projects. And what’s most fun, there’s actually a free t-shirt in this for you guys. If you go to 3b1b.co/crowdflower, or follow the link on screen and in the description, you can create a free account and run a project, and they’ll send you a free shirt once you’ve done the job. And the shirt is actually pretty cool, I quite like it. So thanks to CrowdFlower for supporting this video, and thank you also to everyone on Patreon helping support these videos.

100 thoughts on “What is backpropagation really doing? | Deep learning, chapter 3”

  1. I might misunderstand but maybe not. At 7:38, we want to reduce the activation of the neuron responsible for 3 right? So we should decrease the activation of the neurons that have a positive weight to 3, no? In the video, it's actually the opposite. For example, the first neuron has a positive weight with the neuron responsible for 3, so we should decrease its activation right?
    Can someone help me on that please?

  2. Suppose that we have a neural network with one input layer, one output layer, and one hidden layer. Let's refer to the weights from input to hidden as w and the weights from hidden to output as v. Suppose that we have calculated v via backpropagation. When finding the weights for w, do we keep the weights v constant when updating w or do we allow v to update along with w?

  3. 5:09 How do you decide how to balance your 3 different avenues? Do you put the emphasis on the biases, then the weights of the current layer and give a low priority to the activations of the previous layer? That would mean that the weights of the first layer will hardly change. Do you put the highest priority on the weights of the current layer? Do you emphasize the activations of the previous layer? How do you decide?
    Also, you talk about changing the weights and activations, but you don't talk about how to find the changes to the biases.

  4. Thanks, this is super helpful. I studied back prop in grad school and after rewinding a few times, I think I understand now. I'm ready to tackle the math now. Really great explanation.

  5. Sorry for the in-depth, detailed analysis, but shouldn't the colors of the arrows (related to the amount of desired change) appearing at 7:37 be the opposite color (for digit outputs 0, 1, 3, …, 9, i.e. the last layer)?

  6. Help. I have a question. Is it true that the gradient gets computed for a training example and that the average of the cost function is only for measurement?

  7. Here is another very nice tutorial with step by step Mathematical explanation and full coding.

    http://www.adeveloperdiary.com/data-science/machine-learning/understand-and-implement-the-backpropagation-algorithm-from-scratch-in-python/

  8. Awesome explanation.. i have a doubt.. How do u decide the number of neurons in the hidden layer?? is it by trial and error??.. what if i increase the number of hidden layers ?? will it increase the efficiency of the network?? and also how it will affect the system when the number of neurons in the hidden layer is increased or decreased?

  9. Please make a video for Convolutional Neural Networks!!!!
    You are an amazing teacher! Thank you for these videos.

  10. Came for the knowledge, stayed for the Animations.

    Here I'm thinking how it can be done in After Effects, and turned out it's custom engine!

  11. 8:05 – that's where I don't get the idea of the backpropagation method. The net shows the wrong results, right, but how can we be at all sure that it wasn't just the last layer's fault? What's more, why should the whole net be pushed in one direction?
    For example, let's imagine the net "decided" to do what you suggested it could do in the first video: one layer picks out geometry and the other analyzes this geometry's locations relative to each other. Then, a mistake in determining the right geometry positions for the desired result has in general nothing to do with detecting the geometry itself. Vice versa, if the net doesn't detect geometry very well, positioning it correctly will still yield some wrong results, and adjusting both in the same way might be detrimental. And we are changing both layers in the same way instead of making each do better work on its own. How and why does this work?

  12. 9:00 Eureka moment. I cried.
    * eagerly waiting for a video about convolutional neural networks *
    At 12:25, he labels Fermat as a tease 😂

  13. After 6:42, does it mean that the weights and biases in the hidden layer do not change but only the activation value changes by changing the weights and biases from the first to second layer??

  14. What about the rights to use the results of machine learning? Imagine the possibilities for big market players to predict how to (economically) kill off all the competition from small players just to get a bigger share of the market. They only need to predict how long they must keep dumping prices, and that's it. And even without machine learning I can predict, with 100% accuracy, that they will do this. What about using these technologies for something like this, which in theory is not forbidden by law, when we all know what businessmen will do to gain dominance and authority? What about the rights to use facial recognition and movement data, and all the other statistical data about people that someone will certainly want to store? How will we keep democracy, the rule of law, and equality for all people in this new world? Or am I crazy and nobody thinks about this?
    Are we as a society ready for these changes? What about politicians? They talk only about dominance, but not about freedom and equality.

  15. I imagined the nodes as flexible water balloons and the connections as fixed size pipes! This illustrates the change of the pipe sizes and their effect a little better in my mind. Making one pipe bigger means greater flow of water to a certain balloon making it fill up with more water, getting bigger.

    I also started imagining this neural network of fun balloons with just 2 layers; the first, "input balloons" (each pixel a different sized water balloon based on how lit it was by the example), and second 10 "output balloons" representing the numbers 0-9. So imagine 784 water balloons connected to 10 other water balloons with same sized pipes and water flowing to the output balloons when you squeeze the input balloons. Now fill up the input balloons with water based on how lit a pixel is from an example and then squeeze all the water out of the first layer into the second, the result would be that the 10 output balloons would have the same amount of water in each balloon. Now imagine changing the pipe sizes to allow more water to go to certain output balloons that better representing the number that matches the example, then squeeze all the water back and do it again. You end up with the balloon that represents the number in the example the biggest balloon! A very crude model (i.e combination of pipe sizes) that at least could get that specific example number with some success. It can't do much else.

    Think about balloons with water being squeezed back and forth between the layers and how sizes of the pipes would change where the water ends up. Kinda fun.

    Now how could you improve on this to guess multiple example numbers and not just one? We probably need to have more control over where the water from the input balloons flows to the output balloons. An easy way to do this is to add more pipes and strategically link only certain balloons with these pipes. BUT like 784 pipes weren't enough, imagine say doubling this amount. You will have 784*10 * 2, which comes to 15,680 pipes…thats a lot of pipes. And even if we used math to calculate the best pipe combinations, that would already be more calculations than the 13k or so as explained in the video.

    Is there a more efficient solution?

    How can we add more control but not go crazy with the number of pipes we need to control?

    Knowing that we only have 784 input balloons and have to end up with 10 output balloons, we can create more pipes by adding another layer of balloons in between them, with say 16 balloons – my imagination starts to get a little hazy around here – we connect the input balloons to this middle layer and the middle layer to the output balloons. Water flows from the first to the second giving us a whole bunch of pipe sizes we can play with and then another set between the second and third. The middle balloons kind of group the input balloons pipes into just 16. I feel like it's kinda summarizing or zooming out.

    It's super hard to imagine so many pipe connections, I find it MUCH EASIER thinking of an example with smaller number of balloons, like a grayscale image of a traffic light that can be expressed in just 6 pixels or so with say 3 balloons in the middle, and 1 output balloon that tells you to STOP if the balloon is FULL or go if its empty…. or something like that.

    I've probably screwed the examples up, but wanted to share because it helped me understand it a little better. Hope it helps someone.

    btw these videos are so awesome! thank you so much!!!

  16. Thank you for these great tutorials! You make something complicated seem logical and fun, definitely fuels my interest for learning more!

  17. This guy is amazing! Can't stop watching! What a phenomenal teacher! Thank you again and again! I'm not even working today, just watching these classes 🙂

  18. I don't understand one thing. He told that we needed to modify the outputs of the last hidden layer, but then talks about a gradient with a size of 13,000. This would include the weights that go from the last hidden layer to the output layer. Moreover, if we do this recursively, we would arrive to the input layer, where we now have to modify the weights and bias. So I think that the gradient should only have 784×16 elements

  19. You're saving my presentation! I didn't know how to explain neural networks to my fellow students but now I know how to do it! Thanks mate!

  20. I mean, in practice, you probably won't even use batch, stochastic, or mini-batch gradient descent but a more efficient optimization algorithm like conjugate gradient, BFGS, and other algorithms. I would love it if you could make videos on how these work cause I never bothered to try and understand them, and just use them…

  21. I remember seeing this a year back and not really understanding what is going on. I have recently started Andrew Ng's Machine Learning course and now this video feels a lot clearer 😀

  22. I love these moments in maths or other sciences, where you have this one singular moment of pure enlightenment. You got me at 7:37 ….I got goose bumps when the first column of "+" appeared. This is just brilliant.

  23. 1. Gradient descent (step): You calculate the negative gradient of your multidimensional cost function at a random point and then move to the point where the (displacement) vector, i.e. the gradient vector is pointing to and calculate the gradient again. You do this until you find a point where the magnitude of the gradient vector is very small (approaching zero) – meaning that you do the gradient descent until you find a local minimum of the cost function.

    2. impact on the result a.k.a. bang for the buck: focussing on a single input, the weights must be changed (nudged) by particular values proportional to the activation of the neurons they are referring to. The learning rate alpha is the proportionality factor. Let's focus on the last layer. You take a single neuron and calculate the needed changes for the activations of the neurons in the previous layer. You do that for every neuron in this layer and add all these desired changes for every activation of the neurons in the previous layer.

    3. backpropagation: it's called backpropagation because you first calculate the weights of the last layer. Then you move on to the second last hidden layer until the last hidden layer.

  24. <!DOCTYPE html>
    <html>
    <head>
    <title>What is backpropagation really doing? | Deep learning, chapter 3</title>
    </head>
    <body>
    <div id="player" link="https://www.youtube.com/watch?v=Ilg3gGewQ5U">
    <p class="topic" id="plan"> … </p>

    <p class="topic" id="Stochastic gradient descent">It will take a long time to get the average cost for all the data in your training set. So instead of calculating the average cost for everything in your dataset, you instead take a mini batch and calculate the cost. Sure it is less accurate, but you will get a significant computational speedup!</p>

    </div>

    </body>
    </html>

  25. Um… the subtitles translate "bias" as "prejudice"… and several other words were translated with meanings that differ from the actual context…

  26. What a lousy subtitle translation. The jerk who did this is a vandal: without even trying to understand the topic, he just dumped the English subtitles into Google Translate and that was it. To hell with you.

  27. Quite frustrating I have to end up here on this guy's channel. I don't like his video format (what's up with this weird music as if this is some super mysterious stuff? It actually turns away a lot of people and I don't think he realizes that he is actually making it look way more complicated than it should be). However, I don't see any proper video on backpropagation so I have to make use of this. But still unliking because of the poor presentation of rather simple ideas. Why do people not use a simple pen and paper approach as should be the case. Weird animations like these just scare a lot of people away.

  28. I'm forcing myself to watch all these videos before going on to write my own code to do this! I've got some awesome ideas that I can't wait to implement, plus in my next year of University, I will be doing computer vision and machine learning, so having this under my belt will be really great! I just wanna start programming though!!

  29. once I took the description of visual system by Hubel and Wiesel and programmed a neural network to simulate it and it was amazing. Then I used the tricks I learned from the brain to program a language cortex. That is even more amazing because I just flush down a language corpus without even telling it what encoding it is in or where the word boundaries are and it learns all the phrases words suffixes prefixes by itself. How do I figure out what I programmed when I dont understand the neural network theory at all? Like how its called and stuff? I find it boring to study and a lot of mental effort. I am lazy to decipher the ideas from the mathematical formulae.

  30. Wittgenstein said something like "What we can talk about, we can say clearly; otherwise it is better to keep silent". You talk very, very clearly. Congratulations!

  31. 11:51 I think there's a mistake in one of the equations. You have z superscript L superscript L instead of just z superscript L.

  32. What kind of resources do you need to apply neural networks and AI to researching other fields? I know that learning is computationally expensive, but if I am doing image object recognition on a large dataset, is there any description of time or space complexity as a function of the size of the dataset or the number of levels/nodes in a neural network?

  33. You are awesome,best animation,good knowledge,strong concept,may be computer scientist and etc…
    But i am not understanding much
    What is wrong with me

  34. NOBODY LECTURING IN A HALL TRIES TO TALK OVER A LOAD OF BACKGROUND MUSIC! SO, WHY DO YOU DO IT IN YOUR VIDEOS ??? YOU'RE TALKING TO THE WORLD HERE, AND NOT EVERYONE SPEAKS AMERICAN! YOUR ACCENT, AND THE SPEED AT WHICH YOU SPEAK ALREADY PUTS SOME LISTENERS AT A DISADVANTAGE. TO A BEGINNER, THIS SUBJECT ITSELF IS COMPLICATED, AND REQUIRES EFFORT TO GRASP. SO, THE LAST THING ONE NEEDS IS TO BE ADDITIONALLY STRUGGLING AGAINST A LOT OF UNNECESSARY AND DISRUPTIVE RHYTHM, AND BACKGROUND NOISE!

  35. This is one of the most extraordinary teaching examples I've ever seen, and should be modeled in education systems all over the world.

  36. 1.3 million views … wow. I'm kind of blown away by how many people are interested and following this series. That's far more than I would have guessed! Great!

  37. I can not thank you enough for this video. It's easy to understand and very well made. Thanks to you i will now be able to make a decent AI. 🙂 I am lucky to have found your channel this early in life.

  38. What if you scold your computer differently? Like, instead of making it chase a 00010000 kind of last layer, allow it to be in doubt between two options. For instance, using the result of the error the trained network made, create a second label for each data set, that means "what else does this number looks like, besides what it is". Then retrain the network allowing it to have a "second most probable" number guess. If you consider the second guess as also valid, could we reach 99,99% accuracy?
