Welcome to the online introduction to artificial intelligence. | ▶ 00:00 |
My name is Sebastian Thrun. >>I'm Peter Norvig. | ▶ 00:04 |
We are teaching this class at Stanford, | ▶ 00:07 |
and now we are teaching it online for the entire world. | ▶ 00:09 |
We are really excited about this. | ▶ 00:11 |
It's great to have you all here. | ▶ 00:13 |
It's exciting to have such a record-breaking number of people. | ▶ 00:14 |
We think we can deliver a good introduction to artificial intelligence. | ▶ 00:18 |
We hope you'll stick with it. | ▶ 00:22 |
It's going to be a lot of work, | ▶ 00:24 |
but we think it's going to be very rewarding. | ▶ 00:25 |
The way that it is going to be organized is that | ▶ 00:27 |
every week there are going to be new videos, and with these videos, quizzes. | ▶ 00:29 |
With these quizzes, you can test your knowledge about AI. | ▶ 00:32 |
For the advanced version of this class, we also post homework assignments and exams | ▶ 00:35 |
on which you'll be quizzed. | ▶ 00:38 |
We're going to grade those to give you a final score to see | ▶ 00:40 |
if you can actually master artificial intelligence the same way | ▶ 00:44 |
any good student at Stanford would do it. | ▶ 00:47 |
If you do that, then at the end of the class, we'll sign a letter of accomplishment, | ▶ 00:49 |
and let you know that you've achieved this and what your rank in the class was. | ▶ 00:54 |
So I hope you have fun. Watch us on videotape. | ▶ 00:58 |
We will teach you AI. | ▶ 01:02 |
Participate in the discussion forum. | ▶ 01:04 |
Ask your questions, and help others answer questions. | ▶ 01:06 |
I hope we have a fantastic time ahead of us in the next 10 weeks. | ▶ 01:09 |
Welcome to the class. We'll see you online. | ▶ 01:12 |
Welcome to the first unit of Online Introduction to Artificial Intelligence. | ▶ 00:00 |
I will be teaching you the very, very basics today. | ▶ 00:05 |
This is Unit 1 of Artificial Intelligence. | ▶ 00:09 |
Welcome. | ▶ 00:14 |
The purpose of this class is twofold: | ▶ 00:16 |
Number 1, to teach you the very basics of artificial intelligence | ▶ 00:20 |
so you'll be able to talk to people in the field | ▶ 00:25 |
and understand the basic tools of the trade; | ▶ 00:29 |
and also, very importantly, to excite you about the field. | ▶ 00:32 |
I have been in the field of artificial intelligence for about 20 years, | ▶ 00:37 |
and it's been truly rewarding. | ▶ 00:42 |
So I want you to participate in the beauty and the excitement of AI | ▶ 00:44 |
so you can become a professional who gets the same reward | ▶ 00:48 |
and excitement out of this field as I do. | ▶ 00:52 |
The basic structure of this class involves videos | ▶ 00:55 |
in which Peter or I will teach you something new, | ▶ 01:00 |
then also quizzes, in which we will test your ability to answer AI questions, | ▶ 01:03 |
and finally, answer videos in which we tell you what the right answer would have been | ▶ 01:11 |
for the quiz that you might have incorrectly answered before. | ▶ 01:17 |
This will all be reiterated, and every so often you get a homework assignment, | ▶ 01:22 |
also in the form of quizzes but without the answers. | ▶ 01:28 |
And then we also have video exams. | ▶ 01:34 |
If you check our website, there are requirements | ▶ 01:37 |
on how you have to do assignments and exams. | ▶ 01:39 |
Please go to ai-class.org for this class. | ▶ 01:43 |
An AI program is called wetware, a formula, or an intelligent agent. | ▶ 01:48 |
Pick the one that fits best. | ▶ 01:58 |
[Thrun] The correct answer is intelligent agent. | ▶ 00:00 |
Let's talk about intelligent agents. | ▶ 00:04 |
Here is my intelligent agent, | ▶ 00:07 |
and it gets to interact with an environment. | ▶ 00:11 |
The agent can perceive the state of the environment | ▶ 00:17 |
through its sensors, | ▶ 00:22 |
and it can affect its state through its actuators. | ▶ 00:25 |
The big question of artificial intelligence is the function that maps sensors to actuators. | ▶ 00:29 |
That is called the control policy for the agent. | ▶ 00:37 |
So all of this class will deal with how does an agent make decisions | ▶ 00:41 |
that it can carry out with its actuators based on past sensor data. | ▶ 00:48 |
Those decisions take place many, many times, | ▶ 00:54 |
and the loop of environment feedback to sensors, agent decision, | ▶ 00:58 |
actuator interaction with the environment, and so on, is called the perception-action cycle. | ▶ 01:03 |
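As a rough illustration of the perception-action cycle just described, here is a minimal Python sketch; the environment object and its percept() and apply_action() methods are hypothetical placeholders, not part of any particular library.

```python
# A minimal sketch of the perception-action cycle (illustrative; the
# environment object and its methods are hypothetical).
def run_agent(environment, policy, steps=100):
    percept_history = []                    # the agent's memory of past sensor data
    for _ in range(steps):
        percept = environment.percept()     # sensors perceive the environment's state
        percept_history.append(percept)
        action = policy(percept_history)    # the control policy maps percepts to an action
        environment.apply_action(action)    # actuators affect the environment's state
```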
So here is my very first quiz for you. | ▶ 01:12 |
Artificial intelligence, AI, has successfully been used in finance, | ▶ 01:15 |
robotics, games, medicine, and the Web. | ▶ 01:21 |
Check any or all of those that apply. | ▶ 01:26 |
And if none of them applies, check the box down here that says none of them. | ▶ 01:28 |
So the correct answer is all of those-- | ▶ 00:00 |
finance, robotics, games, medicine, the Web, and many more applications. | ▶ 00:03 |
So let me talk about them in some detail. | ▶ 00:08 |
There is a huge number of applications of artificial intelligence in finance, | ▶ 00:10 |
very often in the shape of making trading decisions-- | ▶ 00:15 |
in which case, the agent is called a trading agent. | ▶ 00:18 |
And the environment might be things like the stock market or the bond market | ▶ 00:21 |
or the commodities market. | ▶ 00:27 |
And our trading agent can sense the prices of certain things, | ▶ 00:29 |
like stocks or bonds or commodities. | ▶ 00:33 |
It can also read the news online and follow certain events. | ▶ 00:35 |
And its decisions are usually things like buy or sell decisions--trades. | ▶ 00:40 |
There's a huge history of artificial intelligence finding methods to look at data over time | ▶ 00:48 |
and make predictions as to how prices develop over time-- | ▶ 00:55 |
and then put in trades behind those. | ▶ 00:58 |
And very frequently, people using artificial intelligence trading agents | ▶ 01:01 |
have made a good amount of money with superior trading decisions. | ▶ 01:06 |
There's also a long history of AI in Robotics. | ▶ 01:10 |
Here is my depiction of a robot. | ▶ 01:14 |
Of course, there are many different types of robots | ▶ 01:17 |
and they all interact with their environments through their sensors, | ▶ 01:20 |
which include things like cameras, microphones, and tactile sensors for touch. | ▶ 01:24 |
And the way they impact their environments is to move motors around, | ▶ 01:33 |
in particular, their wheels, their legs, their arms, their grippers. | ▶ 01:38 |
They can also say things to people using voice. | ▶ 01:43 |
Now there's a huge history of using artificial intelligence in robotics. | ▶ 01:46 |
Pretty much, every robot that does something interesting today uses AI. | ▶ 01:50 |
In fact, often AI has been studied together with robotics, as one discipline. | ▶ 01:54 |
But because robots are somewhat special in that they use physical actuators | ▶ 01:58 |
and deal with physical environments, they are a little bit different from | ▶ 02:03 |
just artificial intelligence, as a whole. | ▶ 02:06 |
When the Web came out, the early Web crawlers were called robots | ▶ 02:08 |
and to block a robot from accessing your website, to the present day, | ▶ 02:15 |
there's a file called robots.txt that allows you to deny any Web crawler | ▶ 02:20 |
access to retrieve that information from your website. | ▶ 02:24 |
So historically, robotics played a huge role in artificial intelligence | ▶ 02:28 |
and a good chunk of this class will be focusing on robotics. | ▶ 02:32 |
AI has a huge history in games-- | ▶ 02:36 |
to make games smarter or feel more natural. | ▶ 02:39 |
There are 2 ways in which AI has been used in games, as a game agent. | ▶ 02:43 |
One is to play against you, as a human user. | ▶ 02:47 |
So for example, if you play the game of Chess, | ▶ 02:50 |
then you are the environment to the game agent. | ▶ 02:54 |
The game agent gets to observe your moves, and it generates its own moves | ▶ 02:57 |
with the purpose of defeating you in Chess. | ▶ 03:03 |
So most adversarial games, where you play against an opponent | ▶ 03:07 |
and the opponent is a computer program, | ▶ 03:10 |
the game agent is built to play against you--against your own interests--and make you lose. | ▶ 03:13 |
And of course, your objective is to win. | ▶ 03:20 |
That's an AI games-type situation. | ▶ 03:22 |
The second thing is that game agents in AI | ▶ 03:25 |
also are used to make games feel more natural. | ▶ 03:29 |
So very often games have characters inside, and these characters act in some way. | ▶ 03:32 |
And it's important for you, as the player, to feel that these characters are believable. | ▶ 03:36 |
There's an entire sub-field of artificial intelligence to use AI | ▶ 03:42 |
to make characters in a game more believable--look smarter, so to speak-- | ▶ 03:45 |
so that you, as a player, think you're playing a better game. | ▶ 03:51 |
Artificial intelligence has a long history in medicine as well. | ▶ 03:55 |
The classic example is that of a diagnostic agent. | ▶ 04:00 |
So here you are--and you might be sick, and you go to your doctor. | ▶ 04:04 |
And your doctor wishes to understand | ▶ 04:09 |
what the reason for your symptoms and your sickness is. | ▶ 04:11 |
The diagnostic agent will observe you through various measurements-- | ▶ 04:17 |
for example, blood pressure and heart signals, and so on-- | ▶ 04:21 |
and it'll come up with the hypothesis as to what you might be suffering from. | ▶ 04:25 |
But rather than intervene directly, in most cases the diagnosis of your disease | ▶ 04:29 |
is communicated to the doctor, who then takes on the intervention. | ▶ 04:34 |
This is called a diagnostic agent. | ▶ 04:38 |
There are many other versions of AI in medicine. | ▶ 04:40 |
AI is used in intensive care to understand whether there are situations | ▶ 04:43 |
that need immediate attention. | ▶ 04:48 |
It's been used for life-long medicine to monitor signs over long periods of time. | ▶ 04:50 |
And as medicine becomes more personal, the role of artificial intelligence | ▶ 04:54 |
will definitely increase. | ▶ 04:58 |
We already mentioned AI on the Web. | ▶ 05:01 |
The most generic version of AI is to crawl the Web and understand the Web, | ▶ 05:05 |
and assist you in answering questions. | ▶ 05:09 |
So when you have this search box over here | ▶ 05:12 |
and it says "Search" on the left, | ▶ 05:15 |
and "I'm Feeling Lucky" on the right, | ▶ 05:18 |
and you type in the words, | ▶ 05:20 |
what AI does for you is it understands what words you typed in | ▶ 05:21 |
and finds the most relevant pages. | ▶ 05:28 |
That is really core artificial intelligence. | ▶ 05:30 |
It's used by a number of companies, such as Microsoft and Google | ▶ 05:32 |
and Amazon, Yahoo, and many others. | ▶ 05:36 |
And the way this works is that there's a crawling agent that can go | ▶ 05:39 |
to the World Wide Web and retrieve pages, through just a computer program. | ▶ 05:43 |
It then sorts these pages into a big database inside the crawler | ▶ 05:51 |
and also analyzes the relevance of each page to any possible query. | ▶ 05:56 |
When you then come and issue a query, | ▶ 06:01 |
the AI system is able to give you a response-- | ▶ 06:04 |
for example, a collection of 10 best Web links. | ▶ 06:08 |
In short, every time you try to write a piece of software | ▶ 06:12 |
that makes your computer smart, | ▶ 06:15 |
you will likely need artificial intelligence. | ▶ 06:18 |
And in this class, Peter and I will teach you | ▶ 06:20 |
many of the basic tricks of the trade | ▶ 06:23 |
to make your software really smart. | ▶ 06:25 |
It will be good to introduce some basic terminology | ▶ 00:00 |
that is commonly used in artificial intelligence to distinguish different types of problems. | ▶ 00:04 |
The very first distinction I will teach you is fully versus partially observable. | ▶ 00:09 |
An environment is called fully observable if what your agent can sense | ▶ 00:16 |
at any point in time is completely sufficient to make the optimal decision. | ▶ 00:19 |
So, for example, in many card games, | ▶ 00:26 |
when all the cards are on the table, the momentary sight of all those cards | ▶ 00:29 |
is really sufficient to make the optimal choice. | ▶ 00:36 |
That is in contrast to some other environments where you need memory | ▶ 00:40 |
on the side of the agent to make the best possible decision. | ▶ 00:46 |
For example, in the game of poker, the cards aren't openly on the table, | ▶ 00:50 |
and memorizing past moves will help you make a better decision. | ▶ 00:55 |
To fully understand the difference, consider the interaction of an agent | ▶ 01:00 |
with the environment through its sensors and its actuators, | ▶ 01:04 |
and this interaction takes place over many cycles, | ▶ 01:08 |
often called the perception-action cycle. | ▶ 01:11 |
For many environments, it's convenient to assume | ▶ 01:16 |
that the environment has some sort of internal state. | ▶ 01:19 |
For example, in a card game where the cards are not openly on the table, | ▶ 01:22 |
the state might pertain to the cards in your hand. | ▶ 01:28 |
An environment is fully observable if the sensors can always see | ▶ 01:33 |
the entire state of the environment. | ▶ 01:37 |
It's partially observable if the sensors can only see a fraction of the state, | ▶ 01:41 |
yet memorizing past measurements gives us additional information of the state | ▶ 01:46 |
that is not readily observable right now. | ▶ 01:52 |
So any game, for example, where past moves have information about | ▶ 01:55 |
what might be in a person's hand, those games are partially observable, | ▶ 02:01 |
and they require different treatment. | ▶ 02:06 |
Very often agents that deal with partially observable environments | ▶ 02:08 |
need to acquire internal memory to understand what | ▶ 02:12 |
the state of the environment is, and we'll talk extensively | ▶ 02:15 |
about how to build such internal memory | ▶ 02:18 |
when we discuss hidden Markov models. | ▶ 02:21 |
A second terminology for environments pertains to whether the environment | ▶ 02:23 |
is deterministic or stochastic. | ▶ 02:26 |
A deterministic environment is one where your agent's actions | ▶ 02:29 |
uniquely determine the outcome. | ▶ 02:35 |
So, for example, in chess, there's really no randomness when you move a piece. | ▶ 02:37 |
The effect of moving a piece is completely predetermined, | ▶ 02:42 |
and no matter where I'm going to move the same piece, the outcome is the same. | ▶ 02:46 |
That we call deterministic. | ▶ 02:50 |
Games with dice, for example, like backgammon, are stochastic. | ▶ 02:52 |
While you can still deterministically move your pieces, | ▶ 02:56 |
the outcome of an action also involves throwing of the dice, | ▶ 03:00 |
and you can't predict those. | ▶ 03:03 |
There's a certain amount of randomness involved for the outcome of dice, | ▶ 03:05 |
and therefore, we call this stochastic. | ▶ 03:08 |
Let me talk about discrete versus continuous. | ▶ 03:10 |
A discrete environment is one where you have finitely many action choices, | ▶ 03:14 |
and finitely many things you can sense. | ▶ 03:18 |
So, for example, in chess, again, there's finitely many board positions, | ▶ 03:21 |
and finitely many things you can do. | ▶ 03:25 |
That is different from a continuous environment | ▶ 03:28 |
where the space of possible actions or things you could sense may be infinite. | ▶ 03:30 |
So, for example, if you throw darts, there's infinitely many ways to angle the darts | ▶ 03:35 |
and to accelerate them. | ▶ 03:41 |
Finally, we distinguish benign versus adversarial environments. | ▶ 03:43 |
In benign environments, the environment might be random. | ▶ 03:49 |
It might be stochastic, but it has no objective of its own | ▶ 03:53 |
that would contradict your own objectives. | ▶ 03:57 |
So, for example, weather is benign. | ▶ 03:59 |
It might be random. It might affect the outcome of your actions. | ▶ 04:02 |
But it isn't really out there to get you. | ▶ 04:06 |
Contrast this with adversarial environments, such as many games, like chess, | ▶ 04:08 |
where your opponent is really out there to get you. | ▶ 04:14 |
It turns out it's much harder to find good actions in adversarial environments | ▶ 04:16 |
where the opponent actively observes you and counteracts what you're trying to achieve | ▶ 04:21 |
relative to benign environment, where the environment might merely be stochastic | ▶ 04:26 |
but isn't really interested in making your life worse. | ▶ 04:30 |
So, let's see to what extent these expressions make sense to you | ▶ 04:35 |
by going to our next quiz. | ▶ 04:38 |
So here are the 4 concepts again: partially observable versus fully, | ▶ 04:40 |
stochastic versus deterministic, continuous versus discrete, | ▶ 04:45 |
adversarial versus benign. | ▶ 04:50 |
And let me ask you about the game of checkers. | ▶ 04:52 |
Check one or all of those attributes that apply. | ▶ 04:56 |
So, if you think checkers is partially observable, check this one. | ▶ 05:00 |
Otherwise, just don't check it. | ▶ 05:03 |
If you think it's stochastic, check this one, | ▶ 05:05 |
continuous, check this one, adversarial, check this one. | ▶ 05:07 |
If you don't know about checkers, you can check the Web and Google it | ▶ 05:11 |
to find a little more information about checkers. | ▶ 05:15 |
So, checkers is an interesting game. | ▶ 00:00 |
Here's the typical board of the game of checkers. | ▶ 00:04 |
Your pieces might look like this, | ▶ 00:08 |
and your opponent's pieces might look like this. | ▶ 00:11 |
And apart from some very cryptic rules in checkers, | ▶ 00:16 |
which I won't really discuss here, the board basically tells you | ▶ 00:19 |
everything there is to know about checkers, so it's clearly fully observable. | ▶ 00:23 |
It is deterministic because your move and your opponent's move | ▶ 00:28 |
very clearly affect the state of the board in ways that have | ▶ 00:33 |
absolutely no stochasticity. | ▶ 00:36 |
It is also discrete because there's finitely many action choices | ▶ 00:39 |
and finitely many board positions, | ▶ 00:45 |
and obviously, it is adversarial, since your opponent is out to get you. | ▶ 00:47 |
[Male narrator] The game of poker--is this partially observable, stochastic, | ▶ 00:00 |
continuous, or adversarial? | ▶ 00:06 |
Please check any or all of those that apply. | ▶ 00:09 |
[Male narrator] I would argue poker is partially observable | ▶ 00:00 |
because you can't see what is in your opponent's hand. | ▶ 00:03 |
It is stochastic because you're being dealt cards that are kind of coming at random. | ▶ 00:08 |
It is not continuous; there are finitely many cards | ▶ 00:13 |
and finitely many actions you can take, even though you might argue | ▶ 00:16 |
that there's a huge number of different amounts of money you can bet. | ▶ 00:20 |
It's still finite, and it is clearly adversarial. | ▶ 00:24 |
If you've ever played poker before, you know how brutal it can be. | ▶ 00:27 |
[Male narrator] Here is a favorite: a robotic car. | ▶ 00:00 |
I wish to know whether it is partially observable, | ▶ 00:04 |
stochastic, continuous, or adversarial. | ▶ 00:06 |
That is, is the problem of driving robotically-- | ▶ 00:11 |
say, in a city--subject to any of those 4 categories? | ▶ 00:16 |
Please check any or all that might apply. | ▶ 00:20 |
Well, the robotic car clearly deals with a partially observable environment: | ▶ 00:00 |
if you just look at the momentary sensor input, you can't even tell how fast other cars are going. | ▶ 00:04 |
So, you need to memorize something. | ▶ 00:10 |
It is stochastic because it's inherently unpredictable | ▶ 00:12 |
what's going to happen next with other cars. | ▶ 00:15 |
It is continuous. | ▶ 00:17 |
There are infinitely many ways to set your steering | ▶ 00:20 |
or push your gas pedal or your brake, | ▶ 00:23 |
and, well, you can argue whether it's adversarial or not. | ▶ 00:26 |
Depending on where you live, it might be highly adversarial. | ▶ 00:29 |
Where I live, it isn't. | ▶ 00:31 |
I'm going to briefly talk about AI as something else, | ▶ 00:00 |
which is that AI is the technique of uncertainty management in computer software. | ▶ 00:03 |
Put differently, AI is the discipline that you apply when you want to know what to do | ▶ 00:10 |
when you don't know what to do. | ▶ 00:17 |
Now, there's many reasons why there might be uncertainty in a computer program. | ▶ 00:22 |
There could be a sensor limit. | ▶ 00:27 |
That is, your sensors are unable to tell you | ▶ 00:29 |
what exactly is the case outside the AI system. | ▶ 00:33 |
There could be adversaries who act in a way that makes it hard for you | ▶ 00:37 |
to understand what is the case. | ▶ 00:41 |
There could be stochastic environments. | ▶ 00:44 |
Every time you roll the dice in a dice game, | ▶ 00:48 |
the stochasticity of the dice will make it impossible for you | ▶ 00:51 |
to be absolutely certain of what's the situation. | ▶ 00:55 |
There could be laziness. | ▶ 00:57 |
So perhaps you can actually compute what the situation is, | ▶ 01:00 |
but your computer program is just too lazy to do it. | ▶ 01:04 |
And here's my favorite: ignorance, plain ignorance. | ▶ 01:07 |
Many people are just ignorant of what's going on. | ▶ 01:11 |
They could know it, but they just don't care. | ▶ 01:14 |
All of these things are cause for uncertainty. | ▶ 01:17 |
AI is the discipline that deals with uncertainty and manages it in decision making. | ▶ 01:21 |
Now we've had an introduction to AI. | ▶ 00:00 |
We've heard about some of the properties of environments, | ▶ 00:03 |
and we've seen some possible architecture for agents. | ▶ 00:06 |
I'd like next to show you some examples of AI in practice. | ▶ 00:10 |
And Sebastian and I have some experience personally in things we have done | ▶ 00:13 |
at Google, at NASA, and at Stanford. | ▶ 00:18 |
And I want to tell you a little bit about some of those. | ▶ 00:21 |
One of the best successes of AI technology at Google | ▶ 00:25 |
has been the machine translation system. | ▶ 00:28 |
Here we see an example of an article in Italian automatically translated into English. | ▶ 00:31 |
Now, these systems are built for 50 different languages, | ▶ 00:37 |
and we can translate from any of the languages into any of the other languages. | ▶ 00:41 |
So, that's about 2,500 different systems, and we've done this all | ▶ 00:46 |
using machine learning techniques, using AI techniques, | ▶ 00:51 |
rather than trying to build them by hand. | ▶ 00:55 |
And the way it works is that we go out and collect examples of text | ▶ 00:58 |
that's aligned between the 2 languages. | ▶ 01:03 |
So we find, say, a newspaper that publishes 2 editions, | ▶ 01:06 |
an Italian edition and an English edition, and now we have examples of translations. | ▶ 01:11 |
And if anybody ever asked us for exactly the translation of this one particular article, | ▶ 01:16 |
then we could just look it up and say "We already know that." | ▶ 01:22 |
But of course, we aren't often going to be asked that. | ▶ 01:25 |
Rather, we're going to be asked parts of this. | ▶ 01:27 |
Here are some words that we've seen before, and we have to figure out | ▶ 01:30 |
which words in this article correspond to which words in the translation article. | ▶ 01:34 |
And we do that by examining many, many millions of words of text | ▶ 01:40 |
in the 2 languages and making the correspondence, | ▶ 01:45 |
and then we can put that all together. | ▶ 01:49 |
And then when we see a new example of text that we haven't seen before, | ▶ 01:51 |
we can just look up what we've seen in the past for that correspondence. | ▶ 01:54 |
So, the task is really two parts. | ▶ 01:58 |
Off-line, before we see an example of text we want to translate, | ▶ 02:01 |
we first build our translation model. | ▶ 02:05 |
We do that by examining all of the different examples | ▶ 02:07 |
and figuring out which part aligns to which. | ▶ 02:10 |
Now, when we're given a text to translate, we use that model, | ▶ 02:14 |
and we go through and find the most probable translation. | ▶ 02:18 |
So, what does it look like? | ▶ 02:22 |
Well, let's look at it in some example text. | ▶ 02:24 |
And rather than look at news articles, I'm going to look at something simpler. | ▶ 02:26 |
I'm going to switch from Italian to Chinese. | ▶ 02:29 |
Here's a bilingual text. | ▶ 02:35 |
Now, for large-scale machine translation, examples are found on the Web. | ▶ 02:37 |
This example was found in a Chinese restaurant by Adam Lopez. | ▶ 02:41 |
Now, it's given, for a text of this form, | ▶ 02:46 |
that a line in Chinese corresponds to a line in English, | ▶ 02:49 |
and that's true for each of the individual lines. | ▶ 02:55 |
But to learn from this text, what we really want to discover | ▶ 02:59 |
is what individual words in Chinese correspond to individual words | ▶ 03:02 |
or small phrases in English. | ▶ 03:07 |
I've started that process by highlighting the word "wonton" in English. | ▶ 03:09 |
It appears 3 times throughout the text. | ▶ 03:16 |
Now, in each of those lines, there's a character that appears, | ▶ 03:18 |
and that's the only place in the Chinese text where that character appears. | ▶ 03:23 |
So, that seems like it's a high probability that this character in Chinese | ▶ 03:27 |
corresponds to the word "wonton" in English. | ▶ 03:33 |
Let's see if we can go farther. | ▶ 03:36 |
My question for you is what word or what character or characters in Chinese | ▶ 03:38 |
correspond to the word "chicken" in English? | ▶ 03:44 |
And here we see "chicken" appears in these locations. | ▶ 03:47 |
Click on the character or characters in Chinese that corresponds to "chicken." | ▶ 03:54 |
The answer is that chicken appears here, | ▶ 00:01 |
here, here, and here. | ▶ 00:04 |
Now, I don't know for sure, 100%, that that is the character for chicken in Chinese, | ▶ 00:10 |
but I do know that there is a good correspondence. | ▶ 00:14 |
Every place the word chicken appears in English, | ▶ 00:17 |
this character appears in Chinese and no other place. | ▶ 00:20 |
Let's go 1 step farther. | ▶ 00:24 |
Let's see if we can work out a phrase in Chinese | ▶ 00:27 |
and see if it corresponds to a phrase in English. | ▶ 00:30 |
Here's the phrase corn cream. | ▶ 00:33 |
Click on the characters in Chinese that correspond to corn cream. | ▶ 00:38 |
The answer is: these 2 characters here | ▶ 00:00 |
appear only in these 2 locations | ▶ 00:04 |
corresponding to the words corn cream | ▶ 00:07 |
which appear only in these locations in the English text. | ▶ 00:10 |
Again, we're not 100% sure that's the right answer, | ▶ 00:13 |
but it looks like a strong correlation. | ▶ 00:17 |
Now, 1 more question. | ▶ 00:20 |
Tell me what character or characters in Chinese | ▶ 00:22 |
correspond to the English word soup. | ▶ 00:26 |
The answer is that soup occurs in most of these phrases | ▶ 00:00 |
but not 100% of them. | ▶ 00:09 |
It's missing in this phrase. | ▶ 00:11 |
Equivalently, on the Chinese side | ▶ 00:14 |
we see this character occurs | ▶ 00:17 |
in most of the phrases, | ▶ 00:20 |
but it's missing here. | ▶ 00:23 |
So we see that the correspondence doesn't have to be 100% | ▶ 00:27 |
to tell us that there is still a good chance of a correlation. | ▶ 00:31 |
When we're learning to do machine translation | ▶ 00:34 |
we use these kinds of alignments to learn probability tables | ▶ 00:37 |
of what is the probability of one phrase in one language | ▶ 00:41 |
corresponding to the phrase in another language. | ▶ 00:45 |
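To make the counting idea concrete, here is a minimal sketch of estimating word-translation probabilities from co-occurrence counts in aligned lines; the sentence pairs and placeholder "Chinese" tokens are invented for illustration, and a real alignment model is considerably more sophisticated.

```python
from collections import Counter, defaultdict

# Hypothetical aligned lines: (English tokens, placeholder foreign tokens).
pairs = [
    (["wonton", "soup"],          ["w", "s"]),
    (["chicken", "wonton"],       ["c", "w"]),
    (["chicken", "corn", "soup"], ["c", "k", "s"]),
]

cooc = defaultdict(Counter)          # cooc[english][foreign] = co-occurrence count
for eng, chi in pairs:
    for e in eng:
        for f in chi:
            cooc[e][f] += 1

def translation_prob(e, f):
    """P(foreign word f | English word e), estimated by relative co-occurrence."""
    return cooc[e][f] / sum(cooc[e].values())

print(translation_prob("wonton", "w"))   # highest for "w": it appears wherever "wonton" does
print(translation_prob("soup", "c"))     # lower: a weaker correspondence
```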
So congratulations, you just finished unit 1. | ▶ 00:00 |
You just finished unit 1 of this class, | ▶ 00:03 |
where I told you about key applications | ▶ 00:07 |
of artificial intelligence, | ▶ 00:10 |
I told you about the definition of an intelligent agent, | ▶ 00:13 |
I gave you 4 key attributes of intelligent agents | ▶ 00:18 |
(partial observability, stochasticity, continuous spaces, and adversarial natures), | ▶ 00:24 |
I discussed sources and management of uncertainty, | ▶ 00:31 |
and I briefly mentioned the mathematical concept of rationality. | ▶ 00:34 |
Obviously, I only touched on each of these issues superficially, | ▶ 00:40 |
but as this class goes on you're going to dive into each of those | ▶ 00:45 |
and learn much more about | ▶ 00:49 |
what it takes to make a truly intelligent AI system. | ▶ 00:51 |
Thank you. | ▶ 00:55 |
[PROBLEM SOLVING] | ▶ 00:00 |
In this unit we're going to talk about problem solving. | ▶ 00:01 |
The theory and technology of building agents | ▶ 00:04 |
that can plan ahead to solve problems. | ▶ 00:06 |
In particular, we're talking about problem solving | ▶ 00:10 |
where the complexity of the problem comes from the idea that there are many states. | ▶ 00:13 |
As in this problem here. | ▶ 00:17 |
A navigation problem where there are many choices to start with. | ▶ 00:19 |
And the complexity comes from picking the right choice now and picking the right choice at the | ▶ 00:24 |
next intersection and the intersection after that. | ▶ 00:29 |
Stringing together a sequence of actions. | ▶ 00:32 |
This is in contrast to the type of complexity shown in this picture, | ▶ 00:35 |
where the complexity comes from the partial observability | ▶ 00:39 |
that we can't see through the fog where the possible paths are. | ▶ 00:43 |
We can't see the results of our actions | ▶ 00:46 |
and even the actions themselves are not known. | ▶ 00:48 |
This type of complexity will be covered in a later unit. | ▶ 00:51 |
Here's an example of a problem. | ▶ 00:56 |
This is a route-finding problem where we're given a start city, | ▶ 00:58 |
in this case, Arad, and a destination, Bucharest, the capital of Romania, | ▶ 01:03 |
of which this map shows one corner. | ▶ 01:09 |
And the problem then is to find a route from Arad to Bucharest. | ▶ 01:11 |
The actions that the agent can execute are driving | ▶ 01:16 |
from one city to the next along one of the roads shown on the map. | ▶ 01:20 |
The question is, is there a solution that the agent can come up with | ▶ 01:23 |
given the knowledge shown here to the problem of driving from Arad to Bucharest? | ▶ 01:28 |
And the answer is no. | ▶ 00:00 |
There is no solution that the agent can come up with | ▶ 00:03 |
because Bucharest doesn't appear on the map, | ▶ 00:06 |
and so the agent doesn't know any actions that can arrive there. | ▶ 00:08 |
So let's give the agent a better chance. | ▶ 00:12 |
Now we've given the agent the full map of Romania. | ▶ 00:19 |
To start, he's in Arad, and the destination--or goal--is in Bucharest. | ▶ 00:23 |
And the agent is given the problem of coming up with a sequence of actions | ▶ 00:30 |
that will arrive at the destination. | ▶ 00:35 |
Now, is it possible for the agent to solve this problem? | ▶ 00:37 |
And the answer is yes. | ▶ 00:43 |
There are many routes or steps or sequences of actions that will arrive at the destination. | ▶ 00:45 |
Here is one of them: | ▶ 00:50 |
Starting out in Arad, taking this step first, then this one, then this one, | ▶ 00:53 |
then this one, and then this one to arrive at the destination. | ▶ 01:00 |
So that would count as a solution to the problem. | ▶ 01:05 |
So a sequence of actions, chained together, that is guaranteed to get us to the goal. | ▶ 01:08 |
[DEFINITION OF A PROBLEM] | ▶ 01:12 |
Now let's formally define what a problem looks like. | ▶ 01:14 |
A problem can be broken down into a number of components. | ▶ 01:17 |
First, the initial state that the agent starts out with. | ▶ 01:21 |
In our route finding problem, the initial state was the agent being in the city of Arad. | ▶ 01:25 |
Next, a function--Actions--that takes a state as input and returns | ▶ 01:32 |
a set of possible actions that the agent can execute when the agent is in this state. | ▶ 01:41 |
[ACTIONS(s) → {a1, a2, a3, ...}] | ▶ 01:47 |
In some problems, the agent will have the same actions available in all states | ▶ 01:50 |
and in other problems, he'll have different actions dependent on the state. | ▶ 01:54 |
In the route finding problem, the actions are dependent on the state. | ▶ 01:58 |
When we're in one city, we can take the routes to the neighboring cities-- | ▶ 02:02 |
but we can't go to any other cities. | ▶ 02:06 |
Next we have a function called Result, which takes, as input, a state and an action | ▶ 02:09 |
and delivers, as its output, a new state. | ▶ 02:20 |
So, for example, if the agent is in the city of Arad, and takes--that would be the state-- | ▶ 02:24 |
and takes the action of driving along Route E-671 towards Timisoara, | ▶ 02:33 |
then the result of applying that action in that state would be the new state-- | ▶ 02:40 |
where the agent is in the city of Timisoara. | ▶ 02:45 |
Next, we need a function called Goal Test, | ▶ 02:51 |
which takes a state and returns a Boolean value-- | ▶ 02:58 |
true or false--telling us if this state is a goal or not. | ▶ 03:04 |
In a route-finding problem, the only goal would be being in the destination city-- | ▶ 03:09 |
the city of Bucharest--and all the other states would return false for the Goal Test. | ▶ 03:14 |
And finally, we need one more thing which is a Path Cost function-- | ▶ 03:19 |
which takes a path, a sequence of state/action transitions, | ▶ 03:28 |
and returns a number, which is the cost of that path. | ▶ 03:40 |
Now, for most of the problems we'll deal with, we'll make the Path Cost function be additive | ▶ 03:44 |
so that the cost of the path is just the sum of the costs of the individual steps. | ▶ 03:50 |
And so we'll implement this Path Cost function, in terms of a Step Cost function. | ▶ 03:56 |
The Step Cost function takes a state, an action, and the resulting state from that action | ▶ 04:04 |
and returns a number--n--which is the cost of that action. | ▶ 04:14 |
In the route finding example, the cost might be the number of miles traveled | ▶ 04:18 |
or maybe the number of minutes it takes to get to that destination. | ▶ 04:24 |
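As a rough sketch of these components in Python, a route-finding problem might be encoded like this; the map fragment and step costs below are only illustrative.

```python
# A fragment of the Romania map: state -> {neighboring state: step cost}.
ROMANIA = {
    "Arad":   {"Zerind": 75, "Sibiu": 140, "Timisoara": 118},
    "Zerind": {"Arad": 75, "Oradea": 71},
    "Oradea": {"Zerind": 71, "Sibiu": 151},
    "Sibiu":  {"Arad": 140, "Oradea": 151, "Fagaras": 99, "Rimnicu Vilcea": 80},
    # ... remaining cities omitted
}

class RouteProblem:
    def __init__(self, initial, goal, graph=ROMANIA):
        self.initial, self.goal, self.graph = initial, goal, graph

    def actions(self, state):                    # ACTIONS(s) -> {a1, a2, a3, ...}
        return list(self.graph.get(state, {}))

    def result(self, state, action):             # RESULT(s, a) -> s'
        return action                            # here an action is "drive to city a"

    def goal_test(self, state):                  # GOAL_TEST(s) -> True / False
        return state == self.goal

    def step_cost(self, state, action, result):  # STEP_COST(s, a, s') -> n
        return self.graph[state][action]

    def path_cost(self, states):                 # additive cost of a path (list of states)
        return sum(self.step_cost(s, s2, s2) for s, s2 in zip(states, states[1:]))
```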
Now let’s see how the definition of a problem | ▶ 00:00 |
maps onto the route-finding domain. | ▶ 00:06 |
First, the initial state was given. | ▶ 00:10 |
Let’s say we start off in Arad, | ▶ 00:12 |
and the goal test, | ▶ 00:15 |
let’s say that the state of being in Bucharest | ▶ 00:17 |
is the only state that counts as a goal, | ▶ 00:22 |
and all the other states are not goals. | ▶ 00:24 |
Now the set of all of the states here | ▶ 00:26 |
is known as the state space, | ▶ 00:29 |
and we navigate the state space by applying actions. | ▶ 00:31 |
The actions are specific to each city, | ▶ 00:35 |
so when we are in Arad, there are three possible actions, | ▶ 00:39 |
to follow this road, this one, or this one. | ▶ 00:42 |
And as we follow them, we build paths | ▶ 00:46 |
or sequences of actions. | ▶ 00:49 |
So just being in Arad is the path of length zero, | ▶ 00:51 |
and now we could start exploring the space | ▶ 00:55 |
and add in this path of length one, | ▶ 00:58 |
this path of length one, | ▶ 01:01 |
and this path of length one. | ▶ 01:03 |
We could add in another path here of length two | ▶ 01:06 |
and another path here of length two. | ▶ 01:11 |
Here is another path of length two. | ▶ 01:14 |
Here is a path of length three. | ▶ 01:17 |
Another path of length two, and so on. | ▶ 01:21 |
Now at every point, | ▶ 01:26 |
we want to separate the states out into three parts. | ▶ 01:28 |
First, the ends of the paths— | ▶ 01:34 |
The farthest paths that have been explored, | ▶ 01:37 |
we call the frontier. | ▶ 01:40 |
And so the frontier in this case | ▶ 01:42 |
consists of these states | ▶ 01:46 |
that are the farthest out we have explored. | ▶ 01:51 |
And then to the left of that in this diagram, | ▶ 01:55 |
we have the explored part of the state space. | ▶ 01:59 |
And then off to the right, | ▶ 02:02 |
we have the unexplored. | ▶ 02:04 |
So let’s write down those three components. | ▶ 02:06 |
We have the frontier. | ▶ 02:09 |
We have the unexplored region, | ▶ 02:15 |
and we have the explored region. | ▶ 02:20 |
One more thing, | ▶ 02:25 |
in this diagram we have labeled the step cost | ▶ 02:27 |
of each action along the route. | ▶ 02:30 |
So the step cost of going from Neamt to Iasi | ▶ 02:33 |
would be 87, corresponding to a distance of 87 kilometers, | ▶ 02:37 |
and the path cost is just the sum of the step costs. | ▶ 02:42 |
So the cost of the path | ▶ 02:46 |
of going from Arad to Oradea | ▶ 02:48 |
would be 71 plus 75. | ▶ 02:50 |
[Narrator] Now let's define a function for solving problems. | ▶ 00:00 |
It's called Tree Search because it superimposes | ▶ 00:04 |
a search tree over the state space. | ▶ 00:07 |
Here's how it works: It starts off by | ▶ 00:10 |
initializing the frontier to be the path | ▶ 00:12 |
consisting of only the initial state, | ▶ 00:14 |
and then it goes into a loop | ▶ 00:16 |
in which it first checks to see | ▶ 00:18 |
do we still have anything left in the frontier? | ▶ 00:21 |
If not we fail, there can be no solution. | ▶ 00:23 |
If we do have something, then we make a choice. | ▶ 00:25 |
Tree Search is really a family of functions | ▶ 00:28 |
not a single algorithm which | ▶ 00:31 |
depends on how we make that choice, | ▶ 00:33 |
and we'll see some of the options later. | ▶ 00:35 |
If we go ahead and make a choice of one of | ▶ 00:38 |
the paths on the frontier and remove that | ▶ 00:41 |
path from the frontier, we find the state | ▶ 00:43 |
which is at the end of the path, and if that | ▶ 00:45 |
state is a goal, then we're done. | ▶ 00:47 |
We found a path to the goal; otherwise, | ▶ 00:49 |
we do what's called expanding that path. | ▶ 00:51 |
We look at all the actions from that state, | ▶ 00:54 |
and we add to the path the actions | ▶ 00:57 |
and the results of those actions; so we get | ▶ 01:00 |
a new path that has the old path, the action | ▶ 01:03 |
and the result of that action, and we | ▶ 01:06 |
stick all of those paths back onto the frontier. | ▶ 01:09 |
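Here is a minimal sketch of that loop, assuming the RouteProblem-style interface sketched above; the choose function both picks a path off the frontier and removes it, and different choices give the different algorithms discussed next.

```python
def tree_search(problem, choose):
    frontier = [[problem.initial]]              # paths, each a list of states
    while frontier:                             # an empty frontier means failure
        path = choose(frontier)                 # pick and remove one path from the frontier
        state = path[-1]                        # the state at the end of that path
        if problem.goal_test(state):
            return path                         # found a path to the goal
        for action in problem.actions(state):   # expand: one new path per available action
            frontier.append(path + [problem.result(state, action)])
    return None

# For example, always choosing the oldest (hence shortest) path gives breadth-first behavior:
shortest_first = lambda frontier: frontier.pop(0)
```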
Now Tree Search represents a whole family | ▶ 01:17 |
of algorithms, and where you get the family | ▶ 01:19 |
resemblance is that they're all looking | ▶ 01:22 |
at the frontier, popping items off and | ▶ 01:24 |
looking to see if they pass the goal test, | ▶ 01:26 |
but where you get the difference is right here, | ▶ 01:29 |
in the choice of how you're going to expand | ▶ 01:31 |
the next item on the frontier, which | ▶ 01:34 |
path do we look at first, and we'll go through | ▶ 01:36 |
different sets of algorithms that make | ▶ 01:39 |
different choices for which path to look at first. | ▶ 01:42 |
The first algorithm I want to consider | ▶ 01:47 |
is called Breadth-First Search. | ▶ 01:49 |
Now it could be called shortest-first search | ▶ 01:51 |
because what it does is always choose | ▶ 01:54 |
from the frontier one of the paths that hasn't been | ▶ 01:56 |
considered yet that is the shortest possible. | ▶ 01:59 |
So how does it work? | ▶ 02:02 |
Well we start off with the path of | ▶ 02:04 |
length 0, starting in the start state, and | ▶ 02:06 |
that's the only path in the frontier so | ▶ 02:10 |
it's the shortest one so we pick it, | ▶ 02:13 |
and then we expand it, and we add in | ▶ 02:15 |
all the paths that result from | ▶ 02:17 |
applying all the possible actions. | ▶ 02:20 |
So now we've removed | ▶ 02:22 |
this path from the frontier, | ▶ 02:25 |
but we've added in 3 new paths. | ▶ 02:28 |
This one, | ▶ 02:31 |
this one, and this one. | ▶ 02:33 |
Now we're in a position where | ▶ 02:37 |
we have 3 paths on the frontier, and | ▶ 02:39 |
we have to pick the shortest one. | ▶ 02:42 |
Now in this case all 3 paths | ▶ 02:45 |
have the same length, length 1, so we | ▶ 02:47 |
break the tie at random or using some | ▶ 02:50 |
other technique, and let's suppose that | ▶ 02:52 |
in this case we choose this path | ▶ 02:56 |
from Arad to Sibiu. | ▶ 02:58 |
Now the question I want you to answer | ▶ 03:00 |
is once we remove that from the frontier, | ▶ 03:03 |
what paths are we going to add next? | ▶ 03:09 |
So show me by checking off the cities | ▶ 03:11 |
that end the paths, which paths | ▶ 03:14 |
are going to be added to the frontier? | ▶ 03:16 |
[Male narrator] The answer is that in Sibiu, the action function gives us 4 actions | ▶ 00:00 |
corresponding to traveling along these 4 roads, | ▶ 00:06 |
so we have to add in paths for each of those actions. | ▶ 00:09 |
One of those paths goes here, | ▶ 00:15 |
the other path continues from Arad and goes out here. | ▶ 00:17 |
The third path continues out here | ▶ 00:21 |
and then the fourth path goes from here--from Arad to Sibiu | ▶ 00:25 |
and then backtracks back to Arad. | ▶ 00:31 |
Now, it may seem silly and redundant to have a path that starts in Arad, | ▶ 00:36 |
goes to Sibiu and returns to Arad. | ▶ 00:41 |
How can that help us get to our destination in Bucharest? | ▶ 00:44 |
But we can see if we're dealing with a tree search, | ▶ 00:49 |
why it's natural to have this type of formulation | ▶ 00:52 |
and why the tree search doesn't even notice that it's backtracked. | ▶ 00:56 |
What the tree search does is superimpose on top of the state space | ▶ 01:00 |
a tree of searches, and the tree looks like this. | ▶ 01:05 |
We start off in state A, and in state A, there were 3 actions, | ▶ 01:09 |
so that gave us paths going to Z, S, and T. | ▶ 01:15 |
And from S, there were 4 actions, so that gave us paths going to O, F, R, and A, | ▶ 01:21 |
and then the tree would continue on from here. | ▶ 01:34 |
We'd take one of the next items | ▶ 01:37 |
and we'd move it and continue on, but notice that we returned to the A state | ▶ 01:40 |
in the state space, but in the tree, | ▶ 01:48 |
it's just another item in the tree. | ▶ 01:51 |
Now, here's another representation of the search space | ▶ 01:55 |
and what's happening is as we start to explore the state, | ▶ 01:57 |
we keep track of the frontier, which is the set of states that are at the end of the paths | ▶ 02:01 |
that we haven't explored yet, and behind that frontier | ▶ 02:09 |
is the set of explored states, and ahead of the frontier is the unexplored states. | ▶ 02:13 |
Now the reason we keep track of the explored states | ▶ 02:19 |
is that when we want to expand and we find a duplicate-- | ▶ 02:22 |
so say when we expand from here, if we pointed back to state T, | ▶ 02:27 |
if we hadn't kept track of that, we would have to add in a new state for T down here. | ▶ 02:33 |
But because we've already seen it and we know that this is actually a regressive step | ▶ 02:42 |
into the already explored state, now, because we kept track of that, | ▶ 02:47 |
we don't need it anymore. | ▶ 02:51 |
Now we see how to modify the Tree Search Function | ▶ 00:00 |
to make it be a Graph Search Function | ▶ 00:04 |
to avoid those repeated paths. | ▶ 00:06 |
What we do, is we start off and initialize a set | ▶ 00:09 |
called the explored set of states that we have already explored. | ▶ 00:13 |
Then, when we consider a new path, | ▶ 00:16 |
we add the new state to the set of already explored states, | ▶ 00:19 |
and then when we are expanding the path | ▶ 00:23 |
and adding in new states to the end of it, | ▶ 00:26 |
we don’t add that in if we have already seen that new state | ▶ 00:29 |
in either the frontier or the explored. | ▶ 00:33 |
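A minimal sketch of this Graph Search variant, again assuming the problem interface above; using a FIFO queue for the frontier makes it a breadth-first graph search.

```python
from collections import deque

def breadth_first_graph_search(problem):
    frontier = deque([[problem.initial]])            # FIFO queue of paths
    explored = set()
    while frontier:
        path = frontier.popleft()                    # shortest (oldest) path first
        state = path[-1]
        if problem.goal_test(state):                 # goal test when removed from the frontier
            return path
        explored.add(state)
        for action in problem.actions(state):
            s2 = problem.result(state, action)
            on_frontier = any(p[-1] == s2 for p in frontier)
            if s2 not in explored and not on_frontier:   # skip states we have already seen
                frontier.append(path + [s2])
    return None
```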
Now back to Breadth First Search. | ▶ 00:37 |
Let’s assume we are using the Graph Search | ▶ 00:39 |
so that we have eliminated the duplicate paths. | ▶ 00:41 |
Arad is crossed off the list. | ▶ 00:44 |
The path that goes from Arad to Sibiu | ▶ 00:47 |
and back to Arad is removed, | ▶ 00:49 |
and we are left with these one, two, three, | ▶ 00:51 |
four, five possible paths. | ▶ 00:53 |
Given these 5 paths, | ▶ 00:57 |
show me which ones are candidates to be expanded next | ▶ 00:59 |
by the Breadth First Search Algorithm. | ▶ 01:02 |
[Male narrator] And the answer is that Breadth-First Search always considers | ▶ 00:00 |
the shortest paths first, and in this case, there are 2 paths of length 1: | ▶ 00:03 |
the paths from Arad to Zerind and from Arad to Timisoara, | ▶ 00:08 |
so those would be the 2 paths that would be considered. | ▶ 00:12 |
Now, let's suppose that the tie is broken in some way | ▶ 00:15 |
and we chose this path from Arad to Zerind. | ▶ 00:18 |
Now, we want to expand that node. | ▶ 00:22 |
We remove it from the frontier and put it in the explored list | ▶ 00:25 |
and now we say, "What paths are we going to add?" | ▶ 00:31 |
So check off the ends of the paths the cities that we're going to add. | ▶ 00:35 |
[Male narrator] In this case, there's nothing to add | ▶ 00:00 |
because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier, | ▶ 00:03 |
and if we're using graph search, then we won't add either of those. | ▶ 00:09 |
[Male narrator] So we move on, we look for another shortest path. | ▶ 00:00 |
There's one path left of length 1, so we look at that path, we expand it, | ▶ 00:04 |
add in this path, put that one on the explored list, | ▶ 00:11 |
and now we've got 3 paths of length 2. | ▶ 00:16 |
We choose 1 of them, and let's say we choose this one. | ▶ 00:20 |
Now, my question is show me which states we add to the path | ▶ 00:23 |
and tell me whether we're going to terminate the algorithm at this point | ▶ 00:30 |
because we've reached the goal or whether we're going to continue. | ▶ 00:35 |
[Male narrator] The answer is that we add 1 more path, the path to Bucharest. | ▶ 00:00 |
We don't add the path going back because it's in the explored list, | ▶ 00:08 |
but we don't terminate it yet. | ▶ 00:11 |
True, we have added a path that ends in Bucharest, | ▶ 00:13 |
but the goal test isn't applied when we add a path to the frontier. | ▶ 00:16 |
Rather, it's applied when we remove that path from the frontier, | ▶ 00:22 |
and we haven't done that yet. | ▶ 00:26 |
[Male narrator] Now, why doesn't the general tree search or graph search algorithm stop | ▶ 00:00 |
when it adds a goal node to the frontier? | ▶ 00:06 |
The reason is because it might not be the best path to the goal. | ▶ 00:09 |
Now, here we found a path of length 2 | ▶ 00:13 |
and we added a path of length 3 that reached the goal. | ▶ 00:16 |
The general graph search or tree search doesn't know | ▶ 00:21 |
that there might be some other path that we could expand | ▶ 00:24 |
that would have a distance of say, 2-1/2, | ▶ 00:27 |
but there's an optimization that could be made. | ▶ 00:30 |
If we know we're doing Breadth-First Search | ▶ 00:33 |
and we know there's no possibility of a path of length 2-1/2, | ▶ 00:35 |
then we can change the algorithm so that it checks states | ▶ 00:40 |
as soon as they're added to the frontier | ▶ 00:44 |
rather than waiting until they're expanded | ▶ 00:46 |
and in that case, we can write a specific Breadth-First Search routine | ▶ 00:49 |
that terminates early and gives us a result as soon as we add a goal state to the frontier. | ▶ 00:53 |
Breadth-First Search will find this path | ▶ 01:01 |
that ends up in Bucharest, and if we're looking for the shortest path | ▶ 01:04 |
in terms of number of steps, | ▶ 01:08 |
Breadth-First Search is guaranteed to find it, | ▶ 01:10 |
But if we're looking for the shortest path in terms of total cost | ▶ 01:12 |
by adding up the step costs, then it turns out | ▶ 01:17 |
that this path is shorter than the path found by Breadth-First Search. | ▶ 01:21 |
So let's look at how we could find that path. | ▶ 01:26 |
An algorithm that has traditionally been called uniform-cost search | ▶ 00:00 |
but could be called cheapest-first search, | ▶ 00:05 |
is guaranteed to find the path with the cheapest total cost. | ▶ 00:08 |
Let's see how it works. | ▶ 00:11 |
We start out as before in the start state. | ▶ 00:14 |
And we pop that empty path off. | ▶ 00:19 |
Move it from the frontier to explored, | ▶ 00:24 |
and then add in the paths out of that state. | ▶ 00:28 |
As before, there will be 3 of those paths. | ▶ 00:33 |
And now, which path are we going to pick next | ▶ 00:39 |
in order to expand according to the rules of cheapest first? | ▶ 00:43 |
Cheapest first says that we pick the path with | ▶ 00:00 |
the lowest total cost. | ▶ 00:04 |
And that would be this path. | ▶ 00:06 |
It has a cost of 75 compared to the cost of 118 and 140 | ▶ 00:07 |
for the other paths. | ▶ 00:13 |
So we get here. We take that path off the frontier, | ▶ 00:14 |
put it on the explored list, add in its neighbors. | ▶ 00:19 |
Not going back to Arad, | ▶ 00:23 |
but adding in this new path. | ▶ 00:26 |
Summing up the total cost of that path, | ▶ 00:30 |
71 + 75 is 146 for this path. | ▶ 00:33 |
And now the question is, | ▶ 00:40 |
which path gets expanded next? | ▶ 00:41 |
Of the 3 paths on the frontier, we have ones | ▶ 00:00 |
with a cost of 146, 140, and 118. | ▶ 00:05 |
And that's the cheapest, so this one gets expanded. | ▶ 00:10 |
We take it off the frontier, move it to explored, | ▶ 00:13 |
add in its successors. In this case it's only 1. | ▶ 00:16 |
And that has a path total of 229. | ▶ 00:21 |
Which path do we expand next? | ▶ 00:29 |
Well, we've got 146, 140, and 229 | ▶ 00:30 |
So 140 is the lowest. | ▶ 00:33 |
Take it off the frontier. Put it on explored. | ▶ 00:38 |
Add in this path | ▶ 00:41 |
for a total cost of 220. | ▶ 00:44 |
And this path for a total cost of 239. | ▶ 00:48 |
And now the question is, which path do we expand next? | ▶ 00:53 |
The answer is this one, 146. | ▶ 00:00 |
Put it on explored. | ▶ 00:04 |
But there's nothing to add because | ▶ 00:07 |
both of its neighbors have already been explored. | ▶ 00:12 |
Which path do we look at next? | ▶ 00:13 |
The answer is this one. Two-twenty is less than 229 or 239. | ▶ 00:00 |
Take it off the frontier. Put it on explored. | ▶ 00:05 |
Add in 2 more paths and sum them up. | ▶ 00:09 |
So, 220 plus 146 is 366. | ▶ 00:15 |
And 220 plus 97 is 317. | ▶ 00:21 |
Okay, and now, notice that we're closing in on Bucharest. | ▶ 00:29 |
We've got 2 neighbors almost there, but it's not their turn yet. | ▶ 00:32 |
Instead, the cheapest path is this one over here, | ▶ 00:38 |
so move it to the explored list. | ▶ 00:43 |
Add 70 to the path cost so far, | ▶ 00:45 |
and we get 299. | ▶ 00:50 |
Now the cheapest node is 239 here, | ▶ 00:57 |
so we expand, finally, into Bucharest at a cost of 460. | ▶ 01:01 |
And now the question is are we done? Can we terminate the algorithm? | ▶ 01:09 |
[Male] And the answer is no, we're not done yet. | ▶ 00:00 |
We've put Bucharest, the goal state, onto the frontier, | ▶ 00:03 |
but we haven't popped it off the frontier yet. | ▶ 00:07 |
And the reason is because we've got to look around and see if there's a better path | ▶ 00:09 |
that can reach Bucharest. | ▶ 00:13 |
And so, let's continue. | ▶ 00:15 |
Look at everything on the frontier. | ▶ 00:18 |
Here's the cheapest one over here. | ▶ 00:20 |
Expand that. | ▶ 00:23 |
Now, what's the cheapest next one? | ▶ 00:26 |
Well, over here. | ▶ 00:30 |
Oops, forgot to take this one off the list. | ▶ 00:33 |
So now, we're at 317 plus 101 gives us another path into Bucharest, | ▶ 00:36 |
and this is a better path. | ▶ 00:44 |
This is 418, gives us another route in. | ▶ 00:46 |
But we have to keep going. | ▶ 00:54 |
The best path on the frontier is 366, | ▶ 00:59 |
so pop that off, and that would give us 2 more routes into here, | ▶ 01:06 |
and eventually we pop off all of these. | ▶ 01:14 |
And then we get to the point where 418 was the best path on the frontier. | ▶ 01:18 |
We pop that off, and then we recognize that we'd reach the goal, | ▶ 01:24 |
and the reason that uniform cost finds the optimal path, the cheapest cost, | ▶ 01:29 |
is because it's guaranteed that it will first pop off this cheapest path, | ▶ 01:35 |
the 418, before it gets to the more expensive path, like the 460. | ▶ 01:40 |
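A minimal sketch of uniform-cost search under the same assumed problem interface: the frontier becomes a priority queue keyed on total path cost, and the goal test is applied only when a path is popped off the frontier, which is exactly what guarantees the cheapest path is found first.

```python
import heapq

def uniform_cost_search(problem):
    frontier = [(0, [problem.initial])]              # heap of (path cost, path)
    explored = set()
    while frontier:
        cost, path = heapq.heappop(frontier)         # cheapest path on the frontier
        state = path[-1]
        if problem.goal_test(state):
            return cost, path
        if state in explored:
            continue                                 # a cheaper path to this state was already expanded
        explored.add(state)
        for action in problem.actions(state):
            s2 = problem.result(state, action)
            if s2 not in explored:
                step = problem.step_cost(state, action, s2)
                heapq.heappush(frontier, (cost + step, path + [s2]))
    return None
```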
So, we've looked at 2 search algorithms. | ▶ 00:00 |
One, breadth-first search, in which we always expand first | ▶ 00:03 |
the shallowest paths, the shortest paths. | ▶ 00:08 |
Second, cheapest-first search, in which we always expand first the path | ▶ 00:12 |
with the lowest total cost. | ▶ 00:17 |
And I'm going to take this opportunity to introduce a third algorithm, depth-first search, | ▶ 00:20 |
which is in a way the opposite of breadth-first search. | ▶ 00:25 |
In depth-first search, we always expand first the longest path, | ▶ 00:28 |
the path with the most steps in it. | ▶ 00:33 |
Now, what I want to ask you to do is for each of these nodes in each of the trees, | ▶ 00:36 |
tell us in what order they're expanded, | ▶ 00:42 |
first, second, third, fourth, fifth and so on by putting a number into the box. | ▶ 00:44 |
And if there are ties, put that number in and resolve the ties in left to right order. | ▶ 00:49 |
Then I want you to ask one more question or answer one more question | ▶ 00:58 |
which is are these searches optimal? | ▶ 01:03 |
That is, are they guaranteed to find the best solution? | ▶ 01:06 |
And for breadth-first search, optimal would mean finding the shortest path. | ▶ 01:11 |
If you think it's guaranteed to find the shortest path, check here. | ▶ 01:16 |
For cheapest first, it would mean finding the path with the lowest total path cost. | ▶ 01:21 |
Check here if you think it's guaranteed to do that. | ▶ 01:26 |
And we'll allow the assumption that all costs have to be positive. | ▶ 01:30 |
And in depth first, cheapest or optimal would mean, again, | ▶ 01:34 |
as in breadth first, finding the shortest possible path in terms of number of steps. | ▶ 01:41 |
Check here if you think depth first will always find that. | ▶ 01:46 |
Here are the answers. | ▶ 00:00 |
Breadth-first search, as the name implies, expands nodes in this order. | ▶ 00:04 |
One, 2, 3, 4, 5, 6, 7. | ▶ 00:10 |
So, it's going across a stripe at a time, breadth first. | ▶ 00:17 |
Is it optimal? | ▶ 00:23 |
Well, it's always expanding the shortest paths first, | ▶ 00:25 |
and so wherever the goal is hiding, it's going to find it without examining | ▶ 00:28 |
any longer paths, so in fact, it is optimal. | ▶ 00:34 |
Cheapest first, first we expand the path of length zero, | ▶ 00:38 |
then the path of length 2. | ▶ 00:45 |
Now there's a path of length 4, path of length 5, | ▶ 00:47 |
path of length 6, a path of length 7, and finally, a path of length 8. | ▶ 00:53 |
And as we've seen, it's guaranteed to find the cheapest path of all, | ▶ 01:02 |
assuming that all the individual step costs are not negative. | ▶ 01:08 |
Depth-first search tries to go as deep as it can first, | ▶ 01:14 |
so it goes 1, 2, 3, then backs up, 4, | ▶ 01:17 |
then backs up, 5, 6, 7. | ▶ 01:24 |
And you can see that it doesn't necessarily find the shortest path of all. | ▶ 01:29 |
Let's say that there were goals in position 5 and in position 3. | ▶ 01:34 |
It would find the longer path to position 3 and find the goal there | ▶ 01:39 |
and would not find the goal in position 5. | ▶ 01:43 |
So, it is not optimal. | ▶ 01:46 |
Given the non-optimality of depth-first search, | ▶ 00:00 |
why would anybody choose to use it? | ▶ 00:04 |
Well, the answer has to do with the storage requirements. | ▶ 00:07 |
Here I've illustrated a state space | ▶ 00:10 |
consisting of a very large or even infinite binary tree. | ▶ 00:13 |
As we go to levels 1, 2, 3, down to level n, | ▶ 00:18 |
the tree gets larger and larger. | ▶ 00:22 |
Now, let's consider the frontier for each of these search algorithms. | ▶ 00:24 |
For breadth-first search, we know a frontier looks like that, | ▶ 00:29 |
and so when we get down to level n, we'll require a storage space of | ▶ 00:35 |
2 to the n paths in a breadth-first search. | ▶ 00:40 |
For cheapest first, the frontier is going to be more complicated. | ▶ 00:45 |
It's going to sort of work out this contour of cost, | ▶ 00:49 |
but it's going to have a similar total number of nodes. | ▶ 00:53 |
But for depth-first search, as we go down the tree, we start going down this branch, | ▶ 00:57 |
and then we back up, but at any point, our frontier is only going to have n nodes | ▶ 01:03 |
rather than 2 to the n nodes, so that's a substantial savings for depth-first search. | ▶ 01:08 |
Now, of course, if we're also keeping track of the explored set, | ▶ 01:14 |
then we don't get that much savings. | ▶ 01:19 |
But without the explored set, depth-first search has a huge advantage | ▶ 01:21 |
in terms of space saved. | ▶ 01:25 |
One more property of the algorithms to consider | ▶ 01:27 |
is the property of completeness, meaning if there is a goal somewhere, | ▶ 01:30 |
will the algorithm find it? | ▶ 01:35 |
So, let's move from very large trees to infinite trees, | ▶ 01:37 |
and let's say that there's some goal hidden somewhere deep down in that tree. | ▶ 01:41 |
And the question is, are each of these algorithms complete? | ▶ 01:47 |
That is, are they guaranteed to find a path to the goal? | ▶ 01:51 |
Mark off the check boxes for the algorithms that you believe are complete in this sense. | ▶ 01:55 |
The answer is that breadth-first search is complete, | ▶ 00:00 |
so even if the tree is infinite, if the goal is placed at any finite level, | ▶ 00:04 |
eventually, we're going to march down and find that goal. | ▶ 00:10 |
Same with cheapest first. | ▶ 00:16 |
No matter where the goal is, if it has a finite cost, | ▶ 00:18 |
eventually, we're going to go down and find it. | ▶ 00:21 |
But not so for depth-first search. | ▶ 00:25 |
If there's an infinite path, depth-first search will keep following that, | ▶ 00:28 |
so it will keep going down and down and down along this path | ▶ 00:33 |
and never get to the path on which the goal sits. | ▶ 00:42 |
So, depth-first search is not complete. | ▶ 00:46 |
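A minimal Python sketch of the storage difference discussed above, assuming a full binary tree of depth 10 and measuring the largest frontier either strategy ever holds; the tree shape and the depth limit are assumptions made only for this illustration:

    from collections import deque

    def children(node, max_depth=10):
        # Full binary tree: every node above the maximum depth has two children.
        depth, label = node
        if depth >= max_depth:
            return []
        return [(depth + 1, label + "L"), (depth + 1, label + "R")]

    def max_frontier_size(order):
        # order = "breadth" treats the frontier as a FIFO queue; "depth" treats it as a stack.
        frontier = deque([(0, "")])
        biggest = 0
        while frontier:
            biggest = max(biggest, len(frontier))
            node = frontier.popleft() if order == "breadth" else frontier.pop()
            frontier.extend(children(node))
        return biggest

    print(max_frontier_size("breadth"))   # 1024, i.e. 2 to the n paths for n = 10
    print(max_frontier_size("depth"))     # 11, i.e. only about n paths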
Let's try to understand a little better how uniform cost search works. | ▶ 00:00 |
We start at a start state, | ▶ 00:05 |
and then we start expanding out from there looking at different paths, | ▶ 00:08 |
and what we end up doing is expanding in terms of contours, like on a topographical map, | ▶ 00:13 |
where first we span out to a certain distance, then to a farther distance, | ▶ 00:21 |
and then to a farther distance. | ▶ 00:28 |
Now at some point we meet up with a goal. Let's say the goal is here. | ▶ 00:31 |
Now we found a path from the start to the goal. | ▶ 00:35 |
But notice that the search really wasn't directed in any way towards the goal. | ▶ 00:42 |
It was expanding out everywhere in the space and depending on where the goal is, | ▶ 00:46 |
we should expect to have to explore half the space, on average, before we find the goal. | ▶ 00:52 |
If the space is small, that can be fine, | ▶ 00:57 |
but when spaces are large, that won't get us to the goal fast enough. | ▶ 01:00 |
Unfortunately, there is really nothing we can do, with what we know, to do better than that, | ▶ 01:05 |
and so if we want to improve, if we want to be able to find the goal faster, | ▶ 01:10 |
we're going to have to add more knowledge. | ▶ 01:15 |
The type of knowledge that has proven most useful in search is an estimate of the distance | ▶ 01:21 |
from a state to the goal. | ▶ 01:27 |
So let's say we're dealing with a route-finding problem, | ▶ 01:32 |
and we can move in any direction--up or down, right or left-- | ▶ 01:36 |
and we'll take as our estimate, the straight line distance between a state and a goal, | ▶ 01:43 |
and we'll try to use that estimate to find our way to the goal fastest. | ▶ 01:50 |
Now an algorithm called greedy best-first search does exactly that. | ▶ 01:55 |
It expands first the path that's closest to the goal according to the estimate. | ▶ 02:04 |
So what do the contours look like in this approach? | ▶ 02:09 |
Well, we start here, and then we look at all the neighboring states, | ▶ 02:13 |
and the ones that appear to be closest to the goal we would expand first. | ▶ 02:17 |
So we'd start expanding like this and like this and like this and like this | ▶ 02:21 |
and that would lead us directly to the goal. | ▶ 02:30 |
So now instead of exploring whole circles that go out everywhere in the search space, | ▶ 02:33 |
our search is directed towards the goal. | ▶ 02:38 |
In this case it gets us immediately towards the goal, but that won't always be the case | ▶ 02:41 |
if there are obstacles along the way. | ▶ 02:46 |
Consider this search space. We have a start state and a goal, | ▶ 02:50 |
and there's an impassable barrier. | ▶ 02:54 |
Now greedy best-first search will start expanding out as before, | ▶ 02:57 |
trying to get towards the goal, | ▶ 03:02 |
and when it reaches the barrier, what will it do next? | ▶ 03:08 |
Well, it will try to continue along a path that's getting closer and closer to the goal. | ▶ 03:11 |
So it won't consider going back this way which is farther from the goal. | ▶ 03:15 |
Rather it will continue expanding out along these lines | ▶ 03:20 |
which always get closer and closer to the goal, | ▶ 03:24 |
and eventually it will find its way towards the goal. | ▶ 03:28 |
So it does find a path, and it does it by expanding a small number of nodes, | ▶ 03:31 |
but it's willing to accept a path which is longer than other paths. | ▶ 03:36 |
Now if we explored in the other direction, we could have found a much simpler path, | ▶ 03:42 |
a much shorter path, by just popping over the barrier and then going directly to the goal, | ▶ 03:47 |
but greedy best-first search wouldn't have done that because | ▶ 03:54 |
that would have involved getting to this point, which is this distance to the goal, | ▶ 03:56 |
and then considering states which were farther from the goal. | ▶ 04:01 |
What we would really like is an algorithm that combines the best parts | ▶ 04:08 |
of greedy search which explores a small number of nodes in many cases | ▶ 04:11 |
and uniform cost search which is guaranteed to find a shortest path. | ▶ 04:17 |
We'll show how to do that next using an algorithm called the A-star algorithm. | ▶ 04:22 |
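Before moving on, here is a minimal Python sketch of this whole family of strategies, written as one generic best-first search driven by a priority function f; the function names are assumptions of this sketch, not code from the class. Uniform cost search corresponds to f = g, greedy best-first search to f = h, and A*, described next, to f = g + h.

    import heapq, itertools

    def best_first_search(start, goal, neighbors, f):
        # neighbors(state) yields (next_state, step_cost) pairs;
        # f(g, state) is the priority used to order the frontier.
        tie = itertools.count()              # tie-breaker so the heap never compares states
        frontier = [(f(0, start), next(tie), 0, start, [start])]
        explored = set()
        while frontier:
            _, _, g, state, path = heapq.heappop(frontier)
            if state == goal:                # goal test when a path comes off the frontier
                return path, g
            if state in explored:
                continue
            explored.add(state)
            for nxt, step_cost in neighbors(state):
                if nxt not in explored:
                    g2 = g + step_cost
                    heapq.heappush(frontier, (f(g2, nxt), next(tie), g2, nxt, path + [nxt]))
        return None, float("inf")

    # Uniform cost search:  f = lambda g, s: g
    # Greedy best-first:    f = lambda g, s: h(s)        (for some heuristic h)
    # A* (described next):  f = lambda g, s: g + h(s)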
[Male narrator] A* Search works by always expanding the path | ▶ 00:00 |
that has a minimum value of the function f | ▶ 00:03 |
which is defined as a sum of the g + h components. | ▶ 00:07 |
Now, the function g of a path | ▶ 00:12 |
is just the path cost, | ▶ 00:16 |
and the function h of a path | ▶ 00:19 |
is equal to the h value of the state, | ▶ 00:23 |
which is the final state of the path, | ▶ 00:27 |
which is equal to the estimated distance to the goal. | ▶ 00:30 |
Here's an example of how A* works. | ▶ 00:36 |
Suppose we found this path through the state space to a state x | ▶ 00:39 |
and we're trying to give a measure to the value of this path. | ▶ 00:44 |
The measure f is a sum of g, the path cost so far, | ▶ 00:48 |
and h, which is the estimated distance that it will still take | ▶ 00:55 |
to complete the path to the goal. | ▶ 01:02 |
Now, minimizing g helps us keep the path short | ▶ 01:04 |
and minimizing h helps us keep focused on finding the goal | ▶ 01:08 |
and the result is a search strategy that is the best possible | ▶ 01:13 |
in the sense that it finds the shortest length path | ▶ 01:17 |
while expanding the minimum number of paths possible. | ▶ 01:20 |
It could be called "best estimated total path cost first," | ▶ 01:24 |
but the name A* is traditional. | ▶ 01:28 |
Now let's go back to Romania and apply the A* algorithm | ▶ 01:32 |
and we're going to use a heuristic, which is a straight line distance | ▶ 01:36 |
between a state and the goal. | ▶ 01:40 |
The goal, again, is Bucharest, | ▶ 01:42 |
and so the distance from Bucharest to Bucharest is, of course, 0. | ▶ 01:44 |
And for all the other states, I've written in red | ▶ 01:47 |
the straight line distance. | ▶ 01:51 |
For example, straight across like that. | ▶ 01:53 |
Now, I should say that all the roads here I've drawn as straight lines, | ▶ 01:55 |
but actually, roads are going to be curved to some degree, | ▶ 01:59 |
so the actual distance along the roads is going to be longer | ▶ 02:03 |
than the straight line distance. | ▶ 02:06 |
Now, we start out as usual--we'll start in Arad as a start state-- | ▶ 02:09 |
and we'll expand out Arad and so we'll add 3 paths | ▶ 02:13 |
and the evaluation function, f, will be the sum of the path length, | ▶ 02:21 |
which is given in black, and the estimated distance, | ▶ 02:26 |
which is given in red. | ▶ 02:29 |
And so the path length from this path | ▶ 02:32 |
will be 140+253 or 393; | ▶ 02:37 |
for this path, 75+374, or 449; | ▶ 02:45 |
and for this path, 118+329, or 447. | ▶ 02:55 |
And now, the question is out of all the paths that are on the frontier, | ▶ 03:05 |
which path would we expand next under the A* algorithm? | ▶ 03:09 |
The answer is that we select this path first--the one from Arad to Sibiu-- | ▶ 00:00 |
because it has the smallest value--393--of the sum f=g+h. | ▶ 00:05 |
Let's go ahead and expand this node now. | ▶ 00:00 |
So we're going to add 3 paths. | ▶ 00:03 |
This one has a path cost of 291 | ▶ 00:06 |
and an estimated distance to the goal of 380, | ▶ 00:10 |
for a total of 671. | ▶ 00:14 |
This one has a path cost of 239 | ▶ 00:18 |
and an estimated distance of 176, for a total of 415. | ▶ 00:21 |
And the final one is 220+193=413. | ▶ 00:27 |
And now the question is: which state do we expand next? | ▶ 00:33 |
The answer is we expand this path next | ▶ 00:00 |
because its total, 413, | ▶ 00:03 |
is less than all the other ones on the frontier-- | ▶ 00:06 |
although only slightly less than the 415 for this path. | ▶ 00:09 |
So we expand this node, | ▶ 00:00 |
giving us 2 more paths-- | ▶ 00:03 |
this one with an f-value of 417, | ▶ 00:06 |
and this one with an f-value of 526. | ▶ 00:10 |
The question again--which path are we going to expand next? | ▶ 00:16 |
And the answer is that we expand this path, Fagaras, next, | ▶ 00:00 |
because its f-total, 415, | ▶ 00:05 |
is less than all the other paths on the frontier. | ▶ 00:08 |
Now we expand Fagaras | ▶ 00:01 |
and we get a path that reaches the goal | ▶ 00:04 |
and it has a path length of 450 and an estimated distance of 0 | ▶ 00:07 |
for a total f value of 450, | ▶ 00:11 |
and now the question is: What do we do next? | ▶ 00:14 |
Click here if you think we're at the end of the algorithm | ▶ 00:17 |
and we don't need to expand next | ▶ 00:22 |
or click on the node that you think we will expand next. | ▶ 00:24 |
The answer is that we're not done yet, | ▶ 00:00 |
because the algorithm works by doing the goal test, | ▶ 00:03 |
when we take a path off the frontier, | ▶ 00:06 |
not when we put a path on the frontier. | ▶ 00:08 |
Instead, we just continue in the normal way and choose the node | ▶ 00:11 |
on the frontier which has the lowest f value. | ▶ 00:15 |
That would be this one--the path through Pitesti, with a total of 417. | ▶ 00:18 |
So let's expand the node at Pitesti. | ▶ 00:01 |
We have to go down this direction, up, | ▶ 00:04 |
then we reach a path we've seen before, | ▶ 00:08 |
and we go in this direction. | ▶ 00:11 |
Now we reach Bucharest, which is the goal, | ▶ 00:13 |
and the h value is going to be 0 | ▶ 00:16 |
because we're at the goal, and the g value works out to 418. | ▶ 00:19 |
Again, we don't stop here just because we put a path onto the frontier. | ▶ 00:24 |
When we put it there, we don't apply the goal test yet; | ▶ 00:31 |
instead, we go back to the frontier, | ▶ 00:35 |
and it turns out that this 418 is the lowest-cost path on the frontier. | ▶ 00:38 |
So now we pull it off, do the goal test, | ▶ 00:43 |
and now we found our path to the goal, | ▶ 00:45 |
and it is, in fact, the shortest possible path. | ▶ 00:49 |
In this case, A-star was able to find the lowest-cost path. | ▶ 00:55 |
Now the question that you'll have to think about, | ▶ 00:59 |
because we haven't explained it yet, | ▶ 01:02 |
is whether A-star will always do this. | ▶ 01:04 |
Answer yes if you think A-star will always find the shortest cost path, | ▶ 01:06 |
or answer no if you think it depends on the particular problem given, | ▶ 01:12 |
or answer no if you think it depends on the particular heuristic estimate function, h. | ▶ 01:17 |
The answer is that it depends on the h function. | ▶ 00:02 |
A-star will find the lowest-cost path | ▶ 00:06 |
if the h function for a state is less than or equal to the true cost | ▶ 00:09 |
of the path to the goal through that state. | ▶ 00:16 |
In other words, we want the h to never overestimate the distance to the goal. | ▶ 00:20 |
We also say that h is optimistic. | ▶ 00:26 |
Another way of stating that | ▶ 00:31 |
is that h is admissible, | ▶ 00:34 |
meaning it is admissible to use it to find the lowest-cost path. | ▶ 00:37 |
Think of all of these as being the same way | ▶ 00:41 |
of stating the conditions under which A-star finds the lowest-cost path. | ▶ 00:45 |
Here we give you an intuition as to why | ▶ 00:01 |
an optimistic heuristic function, h, finds the lowest-cost path. | ▶ 00:03 |
When A-star ends, it returns a path, p, with estimated cost, c. | ▶ 00:08 |
It turns out that c is also the actual cost, | ▶ 00:15 |
because at the goal the h component is 0, | ▶ 00:20 |
and so the path cost is the total cost as estimated by the function. | ▶ 00:23 |
Now, all the paths on the frontier | ▶ 00:28 |
have an estimated cost that's greater than c, | ▶ 00:31 |
and we know that because the frontier is explored in cheapest-first order. | ▶ 00:35 |
If h is optimistic, then the estimated cost | ▶ 00:40 |
is less than the true cost, | ▶ 00:44 |
so the path p must have a cost that's less than the true cost | ▶ 00:47 |
of any of the paths on the frontier. | ▶ 00:51 |
Any paths that go beyond the frontier | ▶ 00:54 |
must have a cost that's greater than that | ▶ 00:57 |
because we agree that the step cost is always 0 or more. | ▶ 00:59 |
So that means that this path, p, must be the minimal cost path. | ▶ 01:04 |
Now, this argument, I should say, only goes through | ▶ 01:09 |
as is for tree search. | ▶ 01:13 |
For graph search the argument is slightly more complicated, | ▶ 01:16 |
but the general intuitions hold the same. | ▶ 01:19 |
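To make the Romania walkthrough concrete, here is a small Python sketch that runs A* on just the roads and straight-line distances quoted in this example; the cities the video leaves unnamed are given their usual textbook names here, which is an assumption of the sketch rather than something stated in the lecture. It returns the same path found above, through Pitesti, with cost 418.

    import heapq, itertools

    # Road lengths (the black numbers), restricted to the roads whose lengths appear in the video.
    roads = {
        "Arad":           [("Zerind", 75), ("Timisoara", 118), ("Sibiu", 140)],
        "Sibiu":          [("Oradea", 151), ("Fagaras", 99), ("Rimnicu Vilcea", 80)],
        "Rimnicu Vilcea": [("Pitesti", 97), ("Craiova", 146)],
        "Fagaras":        [("Bucharest", 211)],
        "Pitesti":        [("Bucharest", 101)],
    }
    # Straight-line distances to Bucharest (the red numbers).
    h = {"Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253, "Oradea": 380,
         "Fagaras": 176, "Rimnicu Vilcea": 193, "Pitesti": 100, "Craiova": 160, "Bucharest": 0}

    def a_star(start, goal):
        tie = itertools.count()
        frontier = [(h[start], next(tie), 0, start, [start])]   # entries are (f, tie, g, state, path)
        explored = set()
        while frontier:
            f, _, g, state, path = heapq.heappop(frontier)
            if state == goal:
                return path, g
            if state in explored:
                continue
            explored.add(state)
            for city, d in roads.get(state, []):
                if city not in explored:
                    heapq.heappush(frontier, (g + d + h[city], next(tie), g + d, city, path + [city]))

    print(a_star("Arad", "Bucharest"))
    # (['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'], 418)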
So far we've looked at the state space of cities in Romania-- | ▶ 00:01 |
a 2-dimensional, physical space. | ▶ 00:05 |
But the technology for problem solving through search | ▶ 00:07 |
can deal with many types of state spaces, | ▶ 00:10 |
dealing with abstract properties, not just x-y position in a plane. | ▶ 00:12 |
Here I introduce another state space--the vacuum world. | ▶ 00:17 |
It's a very simple world in which there are only 2 positions | ▶ 00:21 |
as opposed to the many positions in the Romania state space. | ▶ 00:25 |
But there are additional properties to deal with as well. | ▶ 00:30 |
The robot vacuum cleaner can be in either of the 2 conditions, | ▶ 00:33 |
but as well as that each of the positions | ▶ 00:36 |
can either have dirt in it or not have dirt in it. | ▶ 00:40 |
Now the question is to represent this as a state space | ▶ 00:43 |
how many states do we need? | ▶ 00:47 |
You can fill in the number of states in this box here. | ▶ 00:51 |
And the answer is there are 8 states. | ▶ 00:01 |
There are 2 physical states that the robot vacuum cleaner can be in-- | ▶ 00:04 |
either in state A or in state B. | ▶ 00:10 |
But in addition to that, there are states about how the world is | ▶ 00:12 |
as well as where the robot is in the world. | ▶ 00:17 |
So state A can be dirty or not. | ▶ 00:19 |
That's 2 possibilities. | ▶ 00:24 |
And B can be dirty or not. | ▶ 00:26 |
That's 2 more possibilities. | ▶ 00:28 |
We multiply those together. We get 8 possible states. | ▶ 00:31 |
Here is a diagram of the state space for the vacuum world. | ▶ 00:01 |
Note that there are 8 states, and we have the actions connecting the states | ▶ 00:05 |
just as we did in the Romania problem. | ▶ 00:09 |
Now let's look at a path through this state. | ▶ 00:12 |
Let's say we start out in this position, | ▶ 00:15 |
and then we apply the action of moving right. | ▶ 00:19 |
Then we end up in a position where the state of the world looks the same, | ▶ 00:23 |
except the robot has moved from position 'A' to position 'B'. | ▶ 00:27 |
Now if we turn on the sucking action, | ▶ 00:32 |
then we end up in a state where the robot is in the same position | ▶ 00:37 |
but that position is no longer dirty. | ▶ 00:42 |
Let's take this very simple vacuum world | ▶ 00:47 |
and make a slightly more complicated one. | ▶ 00:50 |
First, we'll say that the robot has a power switch, | ▶ 00:53 |
which can be in one of three conditions: on, off, or sleep. | ▶ 00:56 |
Next, we'll say that the robot has a dirt-sensing camera, | ▶ 01:04 |
and that camera can either be on or off. | ▶ 01:09 |
Third, this is the deluxe model of robot | ▶ 01:13 |
in which the brushes that clean up the dust | ▶ 01:16 |
can be set at 1 of 5 different heights | ▶ 01:19 |
to be appropriate for whatever level of carpeting you have. | ▶ 01:22 |
Finally, rather than just having the 2 positions, | ▶ 01:27 |
we'll extend that out and have 10 positions. | ▶ 01:30 |
Now the question is how many states are in this state space? | ▶ 01:37 |
The answer is that the number of states is the cross product | ▶ 00:01 |
of the numbers of all the variables, since they're each independent, | ▶ 00:05 |
and any combination can occur. | ▶ 00:08 |
For the power we have 3 possible positions. | ▶ 00:10 |
The camera has 2. | ▶ 00:14 |
The brush height has 5. | ▶ 00:18 |
The dirt has 2 for each of the 10 positions. | ▶ 00:23 |
That's 2^10 or 1024. | ▶ 00:28 |
Then the robot's position can be any of those 10 positions as well. | ▶ 00:33 |
That works out to 307,200 states in the state space. | ▶ 00:39 |
Notice how a fairly trivial problem-- | ▶ 00:44 |
we're only modeling a few variables and only 10 positions-- | ▶ 00:46 |
works out to a large number of states in the state space. | ▶ 00:50 |
That's why we need efficient algorithms for searching through state spaces. | ▶ 00:52 |
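The count above is just the product of the independent choices, which a couple of lines of Python can confirm:

    power, camera, brush, positions = 3, 2, 5, 10
    states = power * camera * brush * (2 ** positions) * positions
    print(states)   # 3 * 2 * 5 * 1024 * 10 = 307200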
I want to introduce one more problem that can be solved with search techniques. | ▶ 00:01 |
This is a sliding blocks puzzle, called a 15 puzzle. | ▶ 00:05 |
You may have seen something like this. | ▶ 00:08 |
So there are a bunch of little squares or blocks or tiles | ▶ 00:10 |
and you can slide them around. | ▶ 00:14 |
and the goal is to get into a certain configuration. | ▶ 00:19 |
So we'll say that this is the goal state, where the numbers 1-15 are in order | ▶ 00:21 |
left to right, top to bottom. | ▶ 00:27 |
The starting state would be some state where all the positions are messed up. | ▶ 00:29 |
Now the question is: Can we come up with a good heuristic for this? | ▶ 00:34 |
Let's examine that as a way of thinking about where heuristics come from. | ▶ 00:38 |
The first heuristic we're going to consider | ▶ 00:42 |
we'll call h1, and that is equal to the number of misplaced blocks. | ▶ 00:46 |
So here 10 and 11 are misplaced because they should be there and there, respectively, | ▶ 00:54 |
12 is in the right place, 13 is in the right place, | ▶ 00:59 |
and 14 and 15 are misplaced. | ▶ 01:02 |
That's a total of 4 misplaced blocks. | ▶ 01:04 |
The 2nd heuristic, h2, is equal to | ▶ 01:07 |
the sum of the distances that each block would have to move to get to the right position. | ▶ 01:13 |
For this position, 10 would have to move 1 space to get to the right position, | ▶ 01:19 |
11 would have to move 1, so that's a total of 2 so far, | ▶ 01:26 |
13 is in the right place, | ▶ 01:30 |
14 is 1 displaced, | ▶ 01:31 |
and 15 is 1 displaced, | ▶ 01:33 |
so that would also be a total of 4. | ▶ 01:35 |
Now, the question is: Which, if any, of these heuristics are admissible? | ▶ 01:38 |
Check the boxes next to the heuristics that you think | ▶ 01:44 |
are admissible. | ▶ 01:47 |
H1 is admissible, because every tile that's in the wrong position | ▶ 00:02 |
must be moved at least once to get into the right position. | ▶ 00:07 |
So h1 never overestimates. | ▶ 00:10 |
How about h2? | ▶ 00:13 |
H2 is also admissible, because every tile in the wrong position | ▶ 00:15 |
can be moved closer to the correct position no faster than 1 space per move. | ▶ 00:20 |
Therefore, both are admissible. | ▶ 00:26 |
But notice that h2 is always greater than or equal to h1. | ▶ 00:28 |
That means that, with the exception of breaking ties, | ▶ 00:33 |
an A* search using h2 will always expand | ▶ 00:35 |
fewer paths than one using h1. | ▶ 00:39 |
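Here is a minimal Python sketch of the two heuristics, assuming a state is written as a tuple of 16 entries in row-major order with 0 standing for the blank square; that encoding is an assumption of this sketch, not something fixed by the lecture.

    GOAL = tuple(range(1, 16)) + (0,)          # tiles 1..15 in order, blank last

    def h1(state):
        # Number of misplaced tiles (the blank does not count).
        return sum(1 for tile, want in zip(state, GOAL) if tile != 0 and tile != want)

    def h2(state):
        # Sum over all tiles of the Manhattan distance to the tile's goal square.
        total = 0
        for index, tile in enumerate(state):
            if tile == 0:
                continue
            goal_index = tile - 1              # tile k belongs at position k - 1
            total += abs(index // 4 - goal_index // 4) + abs(index % 4 - goal_index % 4)
        return total

    # Both are admissible, and h2(s) >= h1(s) for every state s, so A* with h2
    # never expands more paths than A* with h1, except for tie-breaking.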
Now, we're trying to build an artificial intelligence | ▶ 00:01 |
that can solve problems like this all on its own. | ▶ 00:04 |
You can see that the search algorithms do a great job | ▶ 00:08 |
of finding solutions to problems like this. | ▶ 00:12 |
But, you might complain that in order for the search algorithms to work, | ▶ 00:15 |
we had to provide it with a heuristic function. | ▶ 00:19 |
The heuristic function came from the outside. | ▶ 00:22 |
You might think that coming up with a good heuristic function is really where all the intelligence is. | ▶ 00:25 |
So, a problem solver that uses a heuristic function given to it | ▶ 00:30 |
really isn't intelligent at all. | ▶ 00:34 |
So let's think about where the intelligence could come from | ▶ 00:36 |
and whether we can automatically come up with good heuristic functions. | ▶ 00:39 |
I'm going to sketch a description of | ▶ 00:45 |
a program that can automatically come up with good heuristics | ▶ 00:47 |
given a description of a problem. | ▶ 00:50 |
Suppose this program is given a description of the sliding blocks puzzle | ▶ 00:52 |
where we say that a block can move from square A to square B | ▶ 00:57 |
if A is adjacent to B and B is blank. | ▶ 01:02 |
Now, imagine that we try to loosen this restriction. | ▶ 01:06 |
We cross out "B is blank," | ▶ 01:10 |
and then we get the rule | ▶ 01:14 |
"a block can move from A to B if A is adjacent to B," | ▶ 01:16 |
and that's equal to our heuristic h2 | ▶ 01:20 |
because a block can move anywhere to an adjacent state. | ▶ 01:23 |
Now, we could also cross out the other part of the rule, | ▶ 01:27 |
and we now get "a block can move from any square A | ▶ 01:31 |
to any square B" regardless of any condition. | ▶ 01:36 |
That gives us heuristic h1. | ▶ 01:40 |
So we see that both of our heuristics can be derived | ▶ 01:43 |
from a simple mechanical manipulation | ▶ 01:48 |
of the formal description of the problem. | ▶ 01:50 |
Once we've generated automatically these candidate heuristics, | ▶ 01:53 |
another way to come up with a good heuristic is to say | ▶ 01:58 |
that a new heuristic, h, | ▶ 02:02 |
is equal to the maximum of h1 and h2, | ▶ 02:04 |
and that's guaranteed to be admissible as long as | ▶ 02:10 |
h1 and h2 are admissible | ▶ 02:13 |
because it still never overestimates, | ▶ 02:16 |
and it's guaranteed to be better because it's getting closer to the true value. | ▶ 02:18 |
The only problem with combining multiple heuristics like this | ▶ 02:22 |
is that there is some cost to compute the heuristic, | ▶ 02:27 |
and it could take longer to compute, | ▶ 02:29 |
even if we end up expanding fewer paths. | ▶ 02:31 |
Crossing out parts of the rules like this | ▶ 02:35 |
is called "generating a relaxed problem." | ▶ 02:38 |
What we've done is we've taken the original problem, | ▶ 02:41 |
where it's hard to move squares around, | ▶ 02:44 |
and made it easier by relaxing one of the constraints. | ▶ 02:46 |
You can see that as adding new links in the state space, | ▶ 02:49 |
so if we have a state space in which there are only particular links, | ▶ 02:54 |
by relaxing the problem it's as if we are adding new operators | ▶ 02:59 |
that traverse the state space in new ways. | ▶ 03:05 |
So adding new operators only makes the problem easier, | ▶ 03:07 |
and thus the relaxed solution's cost never overestimates the true cost, and thus the heuristic is admissible. | ▶ 03:11 |
We've seen what search can do for problem solving. | ▶ 00:00 |
It can find the lowest-cost path to a goal, | ▶ 00:03 |
and it can do that in a way in which we never generate more paths than we have to. | ▶ 00:06 |
We can find the optimal number of paths to generate, | ▶ 00:12 |
and we can do that with a heuristic function that we generate on our own | ▶ 00:15 |
by relaxing the existing problem definition. | ▶ 00:19 |
But let's be clear on what search can't do. | ▶ 00:22 |
All the solutions that we have found consist of a fixed sequence of actions. | ▶ 00:25 |
In other words, the agent, here in Arad, thinks, comes up with a plan that it wants to execute, | ▶ 00:31 |
and then essentially closes its eyes and starts driving, | ▶ 00:38 |
never considering along the way if something has gone wrong. | ▶ 00:42 |
That works fine for this type of problem, | ▶ 00:46 |
but it only works when we satisfy the following conditions. | ▶ 00:49 |
[Problem solving works when:] | ▶ 00:53 |
Problem-solving technology works when the following set of conditions is true: | ▶ 00:55 |
First, the domain must be fully observable. | ▶ 00:59 |
In other words, we must be able to see what initial state we start out with. | ▶ 01:03 |
Second, the domain must be known. | ▶ 01:08 |
That is, we have to know the set of actions available to us. | ▶ 01:12 |
Third, the domain must be discrete. | ▶ 01:16 |
There must be a finite number of actions to choose from. | ▶ 01:20 |
Fourth, the domain must be deterministic. | ▶ 01:24 |
We have to know the result of taking an action. | ▶ 01:28 |
Finally, the domain must be static. | ▶ 01:32 |
There must be nothing else in the world that can change the world except our own actions. | ▶ 01:36 |
If all these conditions are true, then we can search for a plan | ▶ 01:41 |
which solves the problem and is guaranteed to work. | ▶ 01:44 |
In later units, we will see what to do if any of these conditions fail to hold. | ▶ 01:47 |
Our description of the algorithm has talked about paths in the state space. | ▶ 00:01 |
I want to say a little bit now about how to implement that in terms of a computer algorithm. | ▶ 00:08 |
We talk about paths, but we want to implement that in some ways. | ▶ 00:15 |
In the implementation we talk about nodes. | ▶ 00:19 |
A node is a data structure, and it has four fields. | ▶ 00:22 |
The state field indicates the state at the end of the path. | ▶ 00:27 |
The action was the action it took to get there. | ▶ 00:35 |
The cost is the total cost, | ▶ 00:40 |
and the parent is a pointer to another node. | ▶ 00:45 |
In this case, this node has state "S", | ▶ 00:50 |
and it has a parent pointer which points to the node that has state "A", | ▶ 00:56 |
and that node has a parent pointer that's null. | ▶ 01:06 |
So we have a linked list of nodes representing the path. | ▶ 01:10 |
We'll use the word "path" for the abstract idea, | ▶ 01:15 |
and the word "node" for the representation in the computer memory. | ▶ 01:18 |
But otherwise, you can think of those two terms as being synonyms, | ▶ 01:22 |
because they're in a one-to-one correspondence. | ▶ 01:26 |
Now there are two main data structures that deal with nodes. | ▶ 01:31 |
We have the "frontier" and we have the "explored" list. | ▶ 01:35 |
Let's talk about how to implement them. | ▶ 01:41 |
In the frontier the operations we have to deal with | ▶ 01:44 |
are removing the best item from the frontier and adding in new ones. | ▶ 01:48 |
And that suggests we should implement it as a priority queue, | ▶ 01:52 |
which knows how to keep track of the best items in proper order. | ▶ 01:55 |
But we also need to have an additional operation | ▶ 01:59 |
of a membership test, to check whether a new item is already in the frontier. | ▶ 02:03 |
And that suggests representing it as a set, | ▶ 02:07 |
which can be built from a hash table or a tree. | ▶ 02:10 |
So the most efficient implementations of search actually have both representations. | ▶ 02:14 |
The explored set, on the other hand, is easier. | ▶ 02:20 |
All we have to do there is be able to add new members and check for membership. | ▶ 02:23 |
So we represent that as a single set, | ▶ 02:28 |
which again can be done with either a hash table or tree. | ▶ 02:31 |
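A rough Python sketch of these data structures; the class does not prescribe a particular implementation, so the names below are illustrative. The frontier pairs a heap ordered by path cost with a dictionary, so that both "remove the best path" and "is this state already on the frontier?" are cheap.

    import heapq
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        state: object
        action: Optional[object] = None      # action taken to reach this state
        cost: float = 0.0                    # total path cost so far
        parent: Optional["Node"] = None      # previous node on the path; None for the start node

    def path_states(node):
        # Follow parent pointers back to the start to recover the path of states.
        states = []
        while node is not None:
            states.append(node.state)
            node = node.parent
        return list(reversed(states))

    class Frontier:
        # Priority queue ordered by path cost, plus a membership test by state.
        def __init__(self):
            self.heap = []                   # entries are (cost, tie_breaker, node)
            self.counts = {}                 # state -> number of frontier entries with that state
            self.tie = 0
        def add(self, node):
            self.tie += 1
            heapq.heappush(self.heap, (node.cost, self.tie, node))
            self.counts[node.state] = self.counts.get(node.state, 0) + 1
        def __contains__(self, state):
            return self.counts.get(state, 0) > 0
        def pop_cheapest(self):
            cost, _, node = heapq.heappop(self.heap)
            self.counts[node.state] -= 1
            return node

    explored = set()                         # the explored set only needs add and membership tests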
Congratulations. | ▶ 00:00 |
You just made it to assignment 1. | ▶ 00:02 |
This is homework assignment #1. | ▶ 00:00 |
This is a question about peg solitaire. | ▶ 00:01 |
In peg solitaire, a single player faces | ▶ 00:04 |
the following kind of board. | ▶ 00:08 |
Initially, all positions are occupied by pegs except for the center position. | ▶ 00:13 |
You can find more information on peg solitaire at the following URL. | ▶ 00:22 |
[http://en.wikipedia.org/wiki/peg_solitaire] | ▶ 00:26 |
I wish to know whether this game is partially observable. | ▶ 00:36 |
Please say yes or no. | ▶ 00:40 |
I wish to know whether it is stochastic. | ▶ 00:43 |
Please say yes if it is and no if it's deterministic. | ▶ 00:46 |
Let me know if it's continuous, yes or no, | ▶ 00:50 |
and let me know if it's adversarial, yes or no. | ▶ 00:55 |
>>Peg Solitaire is not partially observable because you can see the board at all times. | ▶ 00:00 |
It is not stochastic because you just make all the moves, | ▶ 00:06 |
and they have deterministic effects. | ▶ 00:09 |
It is not continuous. There are just finitely many choices of actions | ▶ 00:11 |
and finitely many board positions, so therefore, it is not continuous. | ▶ 00:15 |
And it's not adversarial because there are no adversaries--it's just you playing. | ▶ 00:18 |
I am going to ask you about the problem to learn about a loaded coin. | ▶ 00:01 |
A loaded coin is a coin, | ▶ 00:05 |
that if you flip it, | ▶ 00:07 |
might have a non 0.5 chance | ▶ 00:09 |
of coming up heads or tails. | ▶ 00:13 |
Fair coins always come up 50% heads or tails. | ▶ 00:16 |
Loaded coins might come up, for example, | ▶ 00:20 |
0.9 chance heads and 0.1 chance tails. | ▶ 00:23 |
Your task will be to understand, | ▶ 00:27 |
from coin flips, | ▶ 00:30 |
whether a coin is loaded, | ▶ 00:31 |
and if so, at what probability. | ▶ 00:33 |
I don't want you to solve the problem, | ▶ 00:35 |
but I want you to answer the following questions: | ▶ 00:37 |
Is it partially observable? | ▶ 00:40 |
Yes or no. | ▶ 00:42 |
Is it stochastic? | ▶ 00:44 |
Yes or no. | ▶ 00:46 |
Is it continuous? [Yes or no.] | ▶ 00:48 |
And finally, is it adversarial? | ▶ 00:51 |
Yes or no. | ▶ 00:53 |
[Thrun] So the loaded coin example is clearly partially observable, | ▶ 00:00 |
and the reason is that it actually requires memory: | ▶ 00:06 |
if you flip it more than 1 time, you can learn more about what the actual probability is. | ▶ 00:09 |
Therefore, looking at the most recent coin flip is insufficient to make your choice. | ▶ 00:14 |
It is stochastic because you flip a coin. | ▶ 00:20 |
It is not continuous because there's only 1 action--a flip--and 2 outcomes. | ▶ 00:25 |
And it isn't really adversarial because while you do your learning task | ▶ 00:31 |
no adversary interferes. | ▶ 00:36 |
Let's talk about the problem of finding a path through a maze. | ▶ 00:00 |
Let me draw you a maze. | ▶ 00:05 |
Suppose you wish to find the path from the start to your goal. | ▶ 00:10 |
I don't want you to solve this problem. | ▶ 00:15 |
Rather I want you to tell me whether it's partially observable. | ▶ 00:19 |
Yes or no. | ▶ 00:23 |
Is it stochastic? | ▶ 00:25 |
Yes or no. | ▶ 00:27 |
Is it continuous? | ▶ 00:29 |
Yes or no. | ▶ 00:31 |
[Thrun] The path through the maze is clearly not partially observable | ▶ 00:00 |
because you can see the maze entirely at all times. | ▶ 00:03 |
It is not stochastic. There is no randomness involved. | ▶ 00:06 |
It isn't really continuous. | ▶ 00:10 |
There's typically just finitely many choices--go left or right. | ▶ 00:12 |
And it isn't adversarial because there's no real adversary involved. | ▶ 00:15 |
This is a search question. | ▶ 00:00 |
Suppose we are given the following search tree. | ▶ 00:02 |
We are searching from the top, the start node, | ▶ 00:05 |
to the goal, which is over here. | ▶ 00:08 |
Assume we expand from left to right. | ▶ 00:12 |
Tell me how many nodes are expanded | ▶ 00:17 |
if we expand from left to right, | ▶ 00:20 |
counting the start node and the goal node in your answer. | ▶ 00:23 |
And give me the same answer for Depth First Search. | ▶ 00:27 |
Now, let's assume you're going to search from right to left. | ▶ 00:32 |
How many nodes would we now expand in Breadth First Search, | ▶ 00:35 |
and how many do we expand in Depth First Search? | ▶ 00:39 |
[Thrun] Breadth first from left to right is 6-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6. | ▶ 00:03 |
Depth first from left to right is 4--1, 2, 3, 4. | ▶ 00:07 |
Breadth first searched from right to left is 9-- | ▶ 00:15 |
1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:19 |
And depth first from right to left is 9-- | ▶ 00:25 |
1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:28 |
Another search problem-- | ▶ 00:00 |
Consider the following search tree, | ▶ 00:03 |
where this is the start node. | ▶ 00:08 |
Now, assume we search from left to right. | ▶ 00:12 |
I would like you to tell me the number of nodes expanded from Breadth-First Search | ▶ 00:15 |
and Depth-First Search. | ▶ 00:19 |
Please do count the start and the goal node, | ▶ 00:22 |
and please give me the same numbers for Right-to-Left Search, | ▶ 00:25 |
for Breadth-First, and Depth-First. | ▶ 00:28 |
[Thrun] The correct answer for breadth first left to right is 13-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. | ▶ 00:05 |
And for depth first it is 10-- | ▶ 00:13 |
1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. | ▶ 00:17 |
For right to left search, the right answer for breadth first is 11-- | ▶ 00:28 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. | ▶ 00:32 |
And for depth first the right answer is 7-- | ▶ 00:38 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:42 |
This is another search problem. | ▶ 00:00 |
Let's assume we have a search graph. | ▶ 00:04 |
It isn't quite a tree but looks like this. | ▶ 00:07 |
Obviously in the structure we can reach nodes through multiple paths. | ▶ 00:13 |
So let's assume that our search never expands the same node twice. | ▶ 00:18 |
Let's also assume this start node is on top. We search down. | ▶ 00:22 |
And this over here is our goal node. | ▶ 00:27 |
So left-to-right search, tell me how many nodes | ▶ 00:30 |
breadth first would expand--do count the start and goal node in the final answer. | ▶ 00:35 |
Give me the same result for a depth-first search. | ▶ 00:43 |
Again counting the start and the goal node in your answer. | ▶ 00:48 |
And again give me your answer for breadth-first | ▶ 00:51 |
and for depth-first in the right-to-left search paradigm. | ▶ 00:54 |
[Thrun] The right answer over here is 10 for breadth first from left to right-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10. | ▶ 00:05 |
Depth first is 16, or all nodes-- | ▶ 00:11 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. | ▶ 00:15 |
And notice how I never expanded a node twice. | ▶ 00:30 |
Correct answer for breadth first right to left is 7-- | ▶ 00:34 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:38 |
And the correct answer for depth first from right to left is 4--1, 2, 3, and 4. | ▶ 00:43 |
Let's talk about A* search. | ▶ 00:00 |
Let's assume we have the following grid. | ▶ 00:03 |
The start state is right here. | ▶ 00:08 |
And the goal state is right here. | ▶ 00:13 |
And just for convenience, I will give each here a little label. | ▶ 00:16 |
A. B. C. D. | ▶ 00:22 |
Let me draw a heuristic function. | ▶ 00:26 |
Please take a look for a moment | ▶ 00:30 |
and tell me whether this heuristic function is admissible. | ▶ 00:32 |
Check here if yes and here if no. | ▶ 00:38 |
Which one is the first node a star would expand? | ▶ 00:41 |
B1 or A2? | ▶ 00:46 |
What's the second node to expand? | ▶ 00:51 |
B1, C1, A2, A3, or B2? | ▶ 00:56 |
And finally, what is the third node to expand? | ▶ 01:06 |
D1, C2, B3, or A4? | ▶ 01:10 |
[Thrun] Clearly this is an admissible heuristic because the distance to the goal | ▶ 00:00 |
is strictly underestimated. | ▶ 00:05 |
From here it would take 1 step, | ▶ 00:07 |
from here it will take 1, 2 steps, so the answer is yes. | ▶ 00:09 |
Now, to understand A*, let me also draw the g function | ▶ 00:15 |
for the relevant part of this table. | ▶ 00:22 |
Clearly g is 0 over here. | ▶ 00:24 |
To understand which node to expand, this one or this one, | ▶ 00:27 |
let's project the g function, which is 1, | ▶ 00:31 |
and we will see that 3 plus 1 is smaller than 4 plus 1; | ▶ 00:34 |
therefore, this is the second node to expand, which is b1. | ▶ 00:40 |
Now let me, for the next step, expand the g function from this guy here: 2 and 2. | ▶ 00:47 |
So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is c1. | ▶ 00:55 |
And finally, the g function from here would go 3 and 3. | ▶ 01:08 |
3 plus 1 is better than 3 plus 2, so we would expand d1 next. | ▶ 01:14 |
And notice how in the sum of g and h, | ▶ 01:24 |
this node over here, which has a total of 4, is better than any other node that is unexpanded. | ▶ 01:29 |
So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, | ▶ 01:35 |
and 2 plus 3 is 5 as well, so this is the next one to expand. | ▶ 01:40 |
So the next units will be concerned with probabilities | ▶ 00:00 |
and particularly with structured probabilities using Bayes networks. | ▶ 00:03 |
This is some of the most involved material in this class. | ▶ 00:08 |
And since this is a Stanford level class, | ▶ 00:12 |
you will find out that some of the quizzes are actually really hard. | ▶ 00:14 |
So as you go through the material, I hope the hardness of the quizzes won't discourage you; | ▶ 00:18 |
it'll really entice you to take a piece of paper and a pen and work them out. | ▶ 00:23 |
Let me give you a flavor of a Bayes network using an example. | ▶ 00:30 |
Suppose you find in the morning that your car won't start. | ▶ 00:35 |
Well, there's many causes why your car might not start. | ▶ 00:39 |
One is that your battery is flat. | ▶ 00:43 |
Even for a flat battery there is multiple causes. | ▶ 00:46 |
One, it's just plain dead, | ▶ 00:50 |
and one is that the battery is okay but it's not charging. | ▶ 00:52 |
The reason why a battery might not charge is that the alternator might be broken | ▶ 00:55 |
or the fan belt might be broken. | ▶ 01:01 |
If you look at this influence diagram, also called a Bayes network, | ▶ 01:03 |
you'll find there's many different ways to explain that the car won't start. | ▶ 01:07 |
And a natural question you might have is, "Can we diagnose the problem?" | ▶ 01:12 |
One diagnostic tool is a battery meter, | ▶ 01:17 |
which may increase or decrease your belief that the battery may cause your car failure. | ▶ 01:20 |
You might also know your battery age. | ▶ 01:26 |
Older batteries tend to go dead more often. | ▶ 01:29 |
And there's many other ways to look at reasons why the car might not start. | ▶ 01:31 |
You might inspect the lights, the oil light, the gas gauge. | ▶ 01:37 |
You might even dip into the engine to see what the oil level is with a dipstick. | ▶ 01:43 |
All of those relate to alternative reasons why the car might not be starting, | ▶ 01:48 |
like no oil, no gas, the fuel line might be blocked, or the starter may be broken. | ▶ 01:52 |
And all of these can influence your measurements, | ▶ 01:59 |
like the oil light or the gas gauge, in different ways. | ▶ 02:04 |
For example, the battery flat would have an effect on the lights. | ▶ 02:07 |
It might have an effect on the oil light and on the gas gauge, | ▶ 02:12 |
but it won't really affect the oil you measure with the dipstick. | ▶ 02:16 |
That is affected by the actual oil level, which also affects the oil light. | ▶ 02:20 |
Gas will affect the gas gauge, and of course without gas the car doesn't start. | ▶ 02:26 |
So this is a complicated structure that really describes one way to understand | ▶ 02:32 |
how a car doesn't start. | ▶ 02:39 |
A car is a complex system. | ▶ 02:41 |
It has lots of variables you can't really measure immediately, | ▶ 02:43 |
and it has sensors which allow you to understand a little bit about the state of the car. | ▶ 02:46 |
What the Bayes network does, | ▶ 02:52 |
it really assists you in reasoning from observable variables, like the car won't start | ▶ 02:54 |
and the value of the dipstick, to hidden causes, like is the fan belt broken | ▶ 03:01 |
or is the battery dead. | ▶ 03:06 |
What you have here is a Bayes network. | ▶ 03:09 |
A Bayes network is composed of nodes. | ▶ 03:13 |
These nodes correspond to events that you might or might not know | ▶ 03:15 |
that are typically called random variables. | ▶ 03:21 |
These nodes are linked by arcs, and the arcs suggest that a child of an arc | ▶ 03:24 |
is influenced by its parent but not in a deterministic way. | ▶ 03:31 |
It might be influenced in a probabilistic way, which means an older battery, for example, | ▶ 03:35 |
has a higher chance of causing the battery to be dead, | ▶ 03:41 |
but it's not clear that every old battery is dead. | ▶ 03:45 |
There is a total of 16 variables in this Bayes network. | ▶ 03:48 |
What the graph structure and associated probabilities specify | ▶ 03:53 |
is a huge probability distribution in the space of all of these 16 variables. | ▶ 03:59 |
If they are all binary, which we'll assume throughout this unit, | ▶ 04:06 |
they can take 2 to the 16th different values, which is a lot. | ▶ 04:10 |
The Bayes network, as we find out, is a complex representation | ▶ 04:15 |
of a distribution over this very, very large joint probability distribution of all of these variables. | ▶ 04:18 |
Further, once we specify the Bayes network, | ▶ 04:26 |
we can observe, for example, the car won't start. | ▶ 04:29 |
We can observe things like the oil light and the lights and the battery meter | ▶ 04:33 |
and then compute probabilities of the hypothesis, like the alternator is broken | ▶ 04:37 |
or the fan belt is broken or the battery is dead. | ▶ 04:41 |
So in this class we're going to talk about how to construct this Bayes network, | ▶ 04:45 |
what the semantics are, and how to reason in this Bayes network | ▶ 04:50 |
to find out about variables we can't observe, like whether the fan belt is broken or not. | ▶ 04:56 |
That's an overview. | ▶ 05:02 |
Throughout this unit I am going to assume that every event is discrete-- | ▶ 05:04 |
in fact, it's binary. | ▶ 05:08 |
We'll start with some consideration of basic probability, | ▶ 05:10 |
we'll work our way into some simple Bayes networks, | ▶ 05:14 |
we'll talk about concepts like conditional independence | ▶ 05:19 |
and then define Bayes networks more generally, | ▶ 05:23 |
move into concepts like D-separation and start doing parameter counts. | ▶ 05:26 |
Later on, Peter will tell you about inference in Bayes networks. | ▶ 05:32 |
So we won't do this in this unit. | ▶ 05:36 |
I can't overemphasize how important this class is. | ▶ 05:38 |
Bayes networks are used extensively in almost all fields of smart computer systems-- | ▶ 05:43 |
in diagnostics, for prediction, for machine learning, and fields like finance, | ▶ 05:49 |
inside Google, in robotics. | ▶ 05:57 |
Bayes networks are also the building blocks of more advanced AI techniques | ▶ 06:00 |
such as particle filters, hidden Markov models, MDPs and POMDPs, | ▶ 06:05 |
Kalman filters, and many others. | ▶ 06:12 |
These are words that don't sound familiar quite yet, | ▶ 06:14 |
but as you go through the class, I can promise you you will get to know what they mean. | ▶ 06:18 |
So let's start now at the very, very basics. | ▶ 06:22 |
[Thrun] So let's talk about probabilities. | ▶ 00:00 |
Probabilities are the cornerstone of artificial intelligence. | ▶ 00:02 |
They are used to express uncertainty, | ▶ 00:05 |
and the management of uncertainty is really key to many, many things in AI | ▶ 00:08 |
such as machine learning and Bayes network inference | ▶ 00:12 |
and filtering and robotics and computer vision and so on. | ▶ 00:16 |
So I'm going to start with some very basic questions, | ▶ 00:21 |
and we're going to work our way up from there. | ▶ 00:24 |
Here is a coin. | ▶ 00:26 |
The coin can come up heads or tails, and my question is the following: | ▶ 00:28 |
Suppose the probability for heads is 0.5. | ▶ 00:32 |
What's the probability for it coming up tails? | ▶ 00:38 |
[Thrun] So the right answer is a half, or 0.5, | ▶ 00:00 |
and the reason is the coin can only come up heads or tails. | ▶ 00:03 |
We know that it has to be either one. | ▶ 00:07 |
Therefore, the total probability of both coming up is 1. | ▶ 00:10 |
So if half of the probability is assigned to heads, then the other half is assigned to tail. | ▶ 00:14 |
[Thrun] Let me ask my next quiz. | ▶ 00:00 |
Suppose the probability of heads is a quarter, 0.25. | ▶ 00:02 |
What's the probability of tail? | ▶ 00:06 |
[Thrun] And the answer is 3/4. | ▶ 00:00 |
It's a loaded coin, and the reason is, well, | ▶ 00:02 |
each of them come up with a certain probability. | ▶ 00:05 |
The total of those is 1. The quarter is claimed by heads. | ▶ 00:08 |
Therefore, 3/4 remain for tail, which is the answer over here. | ▶ 00:12 |
[Thrun] Here's another quiz. | ▶ 00:00 |
What's the probability that the coin comes up heads, heads, heads, three times in a row, | ▶ 00:02 |
assuming that each one of those has a probability of a half | ▶ 00:08 |
and that these coin flips are independent? | ▶ 00:12 |
[Thrun] And the answer is 0.125. | ▶ 00:00 |
Each head has a probability of a half. | ▶ 00:04 |
We can multiply those probabilities because they are independent events, | ▶ 00:06 |
and that gives us 1 over 8 or 0.125. | ▶ 00:10 |
[Thrun] Now let's flip the coin 4 times, and let's call Xi the result of the i-th coin flip. | ▶ 00:00 |
So each Xi is going to be drawn from heads or tail. | ▶ 00:11 |
What's the probability that all 4 of those flips give us the same result, | ▶ 00:16 |
no matter what it is, assuming that each one of those has identically | ▶ 00:22 |
an equally distributed probability of coming up heads of the half? | ▶ 00:26 |
[Thrun] And the answer is, well, there's 2 ways that we can achieve this. | ▶ 00:00 |
One is the all heads and one is all tails. | ▶ 00:04 |
You already know that 4 times heads is 1/16, | ▶ 00:06 |
and we know that 4 times tail is also 1/16. | ▶ 00:10 |
These are mutually exclusive events. | ▶ 00:13 |
The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125. | ▶ 00:15 |
[Thrun] So here's another one. | ▶ 00:00 |
What's the probability that within the set of X1, X2, X3, and X4 | ▶ 00:02 |
there are at least three heads? | ▶ 00:07 |
[Thrun] And the solution is let's look at different sequences | ▶ 00:00 |
in which head occurs at least 3 times. | ▶ 00:03 |
It could be head, head, head, head, in which it comes 4 times. | ▶ 00:06 |
It could be head, head, head, tail and so on, all the way to tail, head, head, head. | ▶ 00:10 |
There's 1, 2, 3, 4, 5 of those outcomes. | ▶ 00:16 |
Each of them has a 16th for probability, so it's 5 times a 16th, which is 0.3125. | ▶ 00:19 |
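Both of the last two answers are easy to double-check by brute force, since the 16 sequences of 4 fair flips are equally likely; a tiny Python check might look like this:

    from itertools import product

    flips = list(product("HT", repeat=4))                         # all 16 equally likely sequences
    all_same = sum(1 for f in flips if len(set(f)) == 1)          # HHHH or TTTT
    at_least_3_heads = sum(1 for f in flips if f.count("H") >= 3)
    print(all_same / 16)            # 0.125
    print(at_least_3_heads / 16)    # 0.3125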
[Thrun] So we just learned a number of things. | ▶ 00:00 |
One is about complementary probability. | ▶ 00:02 |
If an event has a certain probability, p, | ▶ 00:05 |
the complementary event has the probability 1-p. | ▶ 00:08 |
We also learned about independence. | ▶ 00:13 |
If 2 random variables, X and Y, are independent, | ▶ 00:15 |
which you're going to write like this, | ▶ 00:19 |
that means the probability of any joint value the 2 variables can assume | ▶ 00:21 |
is the product of the marginals. | ▶ 00:26 |
So rather than asking the question, "What is the probability | ▶ 00:30 |
"for any combination that these 2 coins or maybe 5 coins could have taken?" | ▶ 00:34 |
we can now look at the probability of each coin individually, | ▶ 00:40 |
look at its probability and just multiply them up. | ▶ 00:42 |
[Thrun] So let me ask you about dependence. | ▶ 00:00 |
Suppose we flip 2 coins. | ▶ 00:03 |
Our first coin is a fair coin, and we're going to denote the outcome by X1. | ▶ 00:05 |
So the chance of X1 coming up heads is half. | ▶ 00:12 |
But now we branch into picking a coin based on the first outcome. | ▶ 00:15 |
So if the first outcome was heads, | ▶ 00:20 |
you pick a coin whose probability of coming up heads is going to be 0.9. | ▶ 00:23 |
The way I word this is by conditional probability, | ▶ 00:28 |
probability of the second coin flip coming up heads | ▶ 00:32 |
provided that or given that X1, the first coin flip, was heads, is 0.9. | ▶ 00:35 |
The first coin flip might also come up tails, | ▶ 00:41 |
in which case I pick a very different coin. | ▶ 00:44 |
In this case I pick a coin which with 0.8 probability will once again give me tails, | ▶ 00:47 |
conditioned on the first coin flip coming up tails. | ▶ 00:54 |
So my question for you is, | ▶ 00:57 |
what's the probability of the second coin flip coming up heads? | ▶ 00:59 |
[Thrun] The answer is 0.55. | ▶ 00:00 |
The way to compute this is by the theorem of total probability. | ▶ 00:04 |
Probability of X2 equals heads. | ▶ 00:08 |
There's 2 ways I can get to this outcome. | ▶ 00:12 |
One is via this path over here, and one is via this path over here. | ▶ 00:15 |
Let me just write both of them down. | ▶ 00:18 |
So first of all, it could be the probability of X2 equals heads | ▶ 00:20 |
given that--and I will assume--X1 was heads already. | ▶ 00:26 |
Now I have to add the complementary event. | ▶ 00:30 |
Suppose X1 came up tails. | ▶ 00:32 |
Then I can ask the question, what is the probability that X2 comes up heads regardless, | ▶ 00:35 |
even though X1 was tails? | ▶ 00:40 |
Plugging in the numbers gives us the following. | ▶ 00:42 |
This one over here is 0.9 times a half. | ▶ 00:44 |
The probability of tails following tails is 0.8, | ▶ 00:49 |
thereby my heads probability becomes 1 minus 0.8, which is 0.2. | ▶ 00:51 |
Adding all of this together gives me 0.45 plus 0.1, | ▶ 00:58 |
which is exactly 0.55. | ▶ 01:03 |
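The same total-probability computation, written out as a few lines of arithmetic:

    p_h1 = 0.5                 # P(X1 = heads), the fair first coin
    p_h2_given_h1 = 0.9        # P(X2 = heads | X1 = heads)
    p_t2_given_t1 = 0.8        # P(X2 = tails | X1 = tails)
    p_h2 = p_h2_given_h1 * p_h1 + (1 - p_t2_given_t1) * (1 - p_h1)
    print(p_h2)                # 0.9 * 0.5 + 0.2 * 0.5 = 0.55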
So, we actually just learned some interesting lessons. | ▶ 00:00 |
The probability of any random variable Y can be written as | ▶ 00:02 |
probability of Y given that some other random variable X assumes value i | ▶ 00:08 |
times probability of X equals i, | ▶ 00:13 |
summed over all possible outcomes i for the variable X. | ▶ 00:17 |
This is called total probability. | ▶ 00:22 |
The second thing we learned has to do with negation of probabilities. | ▶ 00:24 |
We found that probability of not X given Y is 1 minus probability of X given Y. | ▶ 00:27 |
Now, you might be tempted to say "What about the probability of X given not Y?" | ▶ 00:37 |
"Is this the same as 1 minus probability of X given Y?" | ▶ 00:43 |
And the answer is absolutely no. | ▶ 00:51 |
That's not the case. | ▶ 00:54 |
If you condition on something that has a certain probability value, | ▶ 00:56 |
you can take the event you're looking at and negate this, | ▶ 01:00 |
but you can never negate your conditional variable | ▶ 01:03 |
and assume these values add up to 1. | ▶ 01:05 |
We assume there is sometimes sunny days and sometimes rainy days, | ▶ 00:00 |
and on day 1, which we're going to call D1, | ▶ 00:06 |
the probability of sunny is 0.9. | ▶ 00:09 |
And then let's assume that a sunny day follows a sunny day with 0.8 chance, | ▶ 00:13 |
and a rainy day follows a sunny day with--well-- | ▶ 00:20 |
Well, the correct answer is 0.2, which is a negation of this event over here. | ▶ 00:00 |
A sunny day follows a rainy day with 0.6 chance, | ▶ 00:00 |
and a rainy day follows a rainy day-- | ▶ 00:06 |
please give me your number. | ▶ 00:11 |
0.4 | ▶ 00:00 |
So, what are the chances that D2 is sunny? | ▶ 00:00 |
Suppose the same dynamics apply from D2 to D3, | ▶ 00:03 |
so just replace the D2s over here with D3s and the D1s with D2s. | ▶ 00:06 |
That means the transition probabilities from one day to the next remain the same. | ▶ 00:10 |
Tell me, what's the probability that D3 is sunny? | ▶ 00:14 |
So, the correct answer over here is 0.78, | ▶ 00:00 |
and over here it's 0.756. | ▶ 00:04 |
To get there, let's complete this one first. | ▶ 00:10 |
The probability of D2 = sunny. | ▶ 00:13 |
Well, we know there's a 0.9 chance it's sunny on D1, | ▶ 00:16 |
and then if it is sunny, we know it stays sunny with a 0.8 chance. | ▶ 00:21 |
So, we multiply these 2 things together, and we get 0.72. | ▶ 00:25 |
We know there's a 0.1 chance of it being rainy on day 1, which is the complement, | ▶ 00:29 |
but if it's rainy, we know it switches to sunny with 0.6 chance, | ▶ 00:33 |
so you multiply these 2 things, and you get 0.06. | ▶ 00:37 |
Adding those two up equals 0.78. | ▶ 00:41 |
Now, for the next day, we know our prior for sunny is 0.78. | ▶ 00:46 |
If it is sunny, it stays sunny with 0.8 probability. | ▶ 00:51 |
Multiplying these 2 things gives us 0.624. | ▶ 00:55 |
We know it's rainy with 0.22 chance, which is the complement of 0.78, | ▶ 01:01 |
and if it was rainy, it switches to sunny with a 0.6 chance. | ▶ 01:07 |
Multiplying those gives us 0.132. | ▶ 01:10 |
Adding those 2 things up gives us 0.756. | ▶ 01:14 |
So, to some extent, it's tedious to compute these values, | ▶ 01:19 |
but they can be perfectly computed, as shown here. | ▶ 01:23 |
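The tedium goes away if a short loop applies the same transition day after day; this little sketch just repeats the arithmetic above:

    p_sunny = 0.9                                       # P(sunny) on day 1
    for day in range(2, 4):
        p_sunny = 0.8 * p_sunny + 0.6 * (1 - p_sunny)   # stays sunny, or switches over from rainy
        print(day, round(p_sunny, 4))                   # day 2: 0.78, day 3: 0.756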
Next example is a cancer example. | ▶ 00:00 |
Suppose there's a specific type of cancer which exists for 1% of the population. | ▶ 00:05 |
I'm going to write this as follows. | ▶ 00:11 |
You can probably tell me now what the probability of not having this cancer is. | ▶ 00:13 |
And yes, the answer is 0.99. | ▶ 00:00 |
Let's assume there's a test for this cancer, | ▶ 00:04 |
which gives us probabilistically an answer whether we have this cancer or not. | ▶ 00:07 |
So, let's say the probability of a test being positive, as indicated by this + sign, | ▶ 00:12 |
given that we have cancer, is 0.9. | ▶ 00:18 |
The probability of the test coming out negative if we have the cancer is--you name it. | ▶ 00:22 |
0.1, which is the difference between 1 and 0.9. | ▶ 00:00 |
Let's assume the probability of the test coming out positive | ▶ 00:06 |
given that we don't have this cancer is 0.2. | ▶ 00:11 |
In other words, the probability of the test correctly saying | ▶ 00:15 |
we don't have the cancer if we're cancer free is 0.8. | ▶ 00:19 |
Now, ultimately, I'd like to know what's the probability | ▶ 00:24 |
of having this cancer given that we just received a single, positive test. | ▶ 00:28 |
Before I do this, please help me filling out some other probabilities | ▶ 00:35 |
that are actually important. | ▶ 00:39 |
Specifically, the joint probabilities. | ▶ 00:41 |
The probability of a positive test and having cancer. | ▶ 00:45 |
The probability of a negative test and having cancer, | ▶ 00:51 |
and this is not conditional anymore. | ▶ 00:53 |
It's now a joint probability. | ▶ 00:55 |
So, please give me those 4 values over here. | ▶ 00:57 |
And here the correct answer is 0.009, | ▶ 00:00 |
which is the product of your prior, 0.01, times the conditional, 0.9. | ▶ 00:05 |
Over here we get 0.001, the probability of our prior cancer times 0.1. | ▶ 00:12 |
Over here we get 0.198, | ▶ 00:21 |
the probability of not having cancer is 0.99 | ▶ 00:26 |
times still getting a positive reading, which is 0.2. | ▶ 00:29 |
And finally, we get 0.792, | ▶ 00:32 |
which is the probability of this guy over here, and this guy over here. | ▶ 00:37 |
Now, our next quiz, I want you to fill in the probability of | ▶ 00:00 |
the cancer given that we just received a positive test. | ▶ 00:04 |
And the correct answer is 0.043. | ▶ 00:00 |
So, even though I received a positive test, | ▶ 00:06 |
my probability of having cancer is just 4.3%, | ▶ 00:09 |
which is not very much given that the test itself is quite sensitive. | ▶ 00:14 |
It really gives me a 0.8 chance of getting a negative result if I don't have cancer. | ▶ 00:18 |
It gives me a 0.9 chance of detecting cancer given that I have cancer. | ▶ 00:26 |
Now, why does this come out so small? | ▶ 00:32 |
Well, let's just put all the cases together. | ▶ 00:35 |
You already know that we received a positive test. | ▶ 00:38 |
Therefore, this entry over here, and this entry over here are relevant. | ▶ 00:41 |
Now, the chance of having a positive test and having cancer is 0.009. | ▶ 00:47 |
Well, I might--when I receive a positive test--have cancer or not cancer, | ▶ 00:56 |
so we will just normalize by these 2 possible causes for the positive test, | ▶ 01:01 |
which is 0.009 + 0.198. | ▶ 01:06 |
Putting these 2 things together gives 0.009 over 0.207, | ▶ 01:11 |
which is approximately 0.043. | ▶ 01:20 |
Now, the interesting thing in this equation is that the chances | ▶ 01:23 |
of having seen a positive test result in the absence of cancer | ▶ 01:28 |
are still much, much higher than the chance of seeing a positive result | ▶ 01:32 |
in the presence of cancer, and that's because our prior for cancer | ▶ 01:35 |
is so small in the population that it's just very unlikely to have cancer. | ▶ 01:39 |
So, the additional information of a positive test | ▶ 01:44 |
only raised my posterior probability to 0.043. | ▶ 01:47 |
So, we've just learned about what's probably the most important | ▶ 00:00 |
piece of math for this class in statistics called Bayes Rule. | ▶ 00:03 |
It was invented by Reverend Thomas Bayes, who was a British mathematician | ▶ 00:09 |
and a Presbyterian minister in the 18th century. | ▶ 00:15 |
Bayes Rule is usually stated as follows: P of A given B where B is the evidence | ▶ 00:18 |
and A is the variable we care about is P of B given A times P of A over P of B. | ▶ 00:27 |
This expression is called the likelihood. | ▶ 00:36 |
This is called the prior, and this is called marginal likelihood. | ▶ 00:40 |
The expression over here is called the posterior. | ▶ 00:46 |
The interesting thing here is the way the probabilities are reworded. | ▶ 00:50 |
Say we have evidence B. | ▶ 00:55 |
We know about B, but we really care about the variable A. | ▶ 00:57 |
So, for example, B is a test result. | ▶ 01:01 |
We don't care about the test result as much as we care about the fact | ▶ 01:03 |
whether we have cancer or not. | ▶ 01:06 |
This diagnostic reasoning--which is from evidence to its causes-- | ▶ 01:08 |
is turned upside down by Bayes Rule into a causal reasoning, | ▶ 01:16 |
which is given--hypothetically, if we knew the cause, | ▶ 01:22 |
what would be the probability of the evidence we just observed. | ▶ 01:27 |
But to correct for this inversion, we have to multiply | ▶ 01:31 |
by the prior of the cause to be the case in the first place, | ▶ 01:36 |
in this case, having cancer or not, | ▶ 01:40 |
and divide it by the probability of the evidence, P(B), | ▶ 01:42 |
which often is expanded using the theorem of total probability as follows. | ▶ 01:47 |
The probability of B is a sum over all probabilities of B | ▶ 01:52 |
conditional on A equals lowercase a, times the probability of A equals lowercase a. | ▶ 01:58 |
This is total probability as we already encountered it. | ▶ 02:04 |
So, let's apply this to the cancer case | ▶ 02:08 |
and say we really care about whether you have cancer, | ▶ 02:10 |
which is our cause, conditioned on the evidence | ▶ 02:13 |
that is the result of this hidden cause, in this case, a positive test result. | ▶ 02:17 |
Let's just plug in the numbers. | ▶ 02:23 |
Our likelihood is the probability of seeing a positive test result | ▶ 02:25 |
given that you have cancer multiplied by the prior probability | ▶ 02:30 |
of having cancer over the probability of the positive test result, | ▶ 02:33 |
and that is--according to the tables we looked at before-- | ▶ 02:38 |
0.9 times a prior of 0.01 over-- | ▶ 02:43 |
now we're going to expand this right over here according to total probability | ▶ 02:50 |
which gives us 0.9 times 0.01. | ▶ 02:55 |
That's the probability of + given that we do have cancer, times the prior, | ▶ 03:01 |
plus the probability of + given that we don't have cancer, which is 0.2, | ▶ 03:06 |
times the prior of not having cancer, which is 0.99. | ▶ 03:11 |
So, if we plug in the numbers we know about, we get 0.009 | ▶ 03:15 |
over 0.009 + 0.198. | ▶ 03:20 |
That is approximately 0.0434, which is the number we saw before. | ▶ 03:27 |
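As a quick sanity check, here is a minimal Python sketch, not part of the lecture, that plugs these cancer-test numbers into Bayes Rule with the total-probability expansion in the denominator:

```python
# Bayes Rule for the cancer test, with the denominator expanded by total probability:
# P(C | +) = P(+ | C) P(C) / (P(+ | C) P(C) + P(+ | not C) P(not C))
p_cancer = 0.01          # prior P(C)
p_pos_given_c = 0.9      # P(+ | C)
p_pos_given_not_c = 0.2  # P(+ | not C), i.e. 1 - 0.8

numerator = p_pos_given_c * p_cancer                       # 0.009
marginal = numerator + p_pos_given_not_c * (1 - p_cancer)  # 0.009 + 0.198 = 0.207
print(numerator / marginal)  # 0.04347..., the ~0.043 posterior from the lecture
```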
So, if you want to draw Bayes rule graphically, | ▶ 00:00 |
we have a situation where we have an internal variable A, | ▶ 00:03 |
like whether I'm going to die of cancer, but we can't sense A. | ▶ 00:08 |
Instead, we have a second variable, called B, | ▶ 00:13 |
which is our test, and B is observable, but A isn't. | ▶ 00:16 |
This is a classical example of a Bayes network. | ▶ 00:21 |
The Bayes network is composed of 2 variables, A and B. | ▶ 00:26 |
We know the prior probability for A, | ▶ 00:30 |
and we know the conditional. | ▶ 00:33 |
A causes B--whether or not we have cancer, | ▶ 00:35 |
causes the test result to be positive or not, | ▶ 00:38 |
although there was some randomness involved. | ▶ 00:41 |
So, we know what the probability of B given the different values for A, | ▶ 00:44 |
and what we care about in this specific instance is called diagnostic reasoning, | ▶ 00:49 |
which is the inverse of the causal reasoning, | ▶ 00:54 |
the probability of A given B or similarly, probability of A given not B. | ▶ 00:58 |
This is our very first Bayes network, and the graphical representation | ▶ 01:06 |
of drawing 2 variables, A and B, connected with an arc | ▶ 01:11 |
that goes from A to B is the graphical representation of a distribution | ▶ 01:15 |
of 2 variables that are specified in the structure over here, | ▶ 01:22 |
which has a prior probability and has a conditional probability as shown over here. | ▶ 01:26 |
Now, I do have a quick quiz for you. | ▶ 01:31 |
How many parameters does it take to specify | ▶ 01:34 |
the entire joint probability within A and B, or differently, the entire Bayes network? | ▶ 01:37 |
I'm not looking for structural parameters that relate to the graph over here. | ▶ 01:43 |
I'm just looking for the numerical parameters of the underlying probabilities. | ▶ 01:48 |
And the answer is 3. | ▶ 00:00 |
It takes 1 parameter to specify P of A from which we can derive P of not A. | ▶ 00:02 |
It takes 2 parameters to specify P of B given A and P of B given not A, | ▶ 00:09 |
from which we can derive P of not B given A and P of not B given not A. | ▶ 00:15 |
So, it's a total of 3 parameters for this Bayes network. | ▶ 00:21 |
So, we just encountered our very first Bayes network | ▶ 00:00 |
and did a number of interesting calculations. | ▶ 00:03 |
Let's now talk about Bayes Rule and look into more complex Bayes networks. | ▶ 00:06 |
I will look at Bayes Rule again and make an observation | ▶ 00:10 |
that is really non-trivial. | ▶ 00:13 |
Here is Bayes Rule, and in practice, what we find is | ▶ 00:15 |
this term here is relatively easy to compute. | ▶ 00:20 |
It's just a product, whereas this term is really hard to compute. | ▶ 00:23 |
However, this term over here does not depend on what we assume for variable A. | ▶ 00:28 |
It's just a function of B. | ▶ 00:33 |
So, suppose for a moment we also care about the complementary event of not A | ▶ 00:35 |
given B, for which Bayes Rule unfolds as follows. | ▶ 00:40 |
Then we find that the normalizer, P(B), is identical, | ▶ 00:43 |
whether we assume A on the left side or not A on the left side. | ▶ 00:47 |
We also know from prior work that P of A given B plus | ▶ 00:51 |
P of not A given B must be one because these are 2 complementary events. | ▶ 00:57 |
That allows us to compute Bayes Rule very differently | ▶ 01:03 |
by basically ignoring the normalizer, so here's how it goes. | ▶ 01:06 |
We compute P prime of A given B--and I want to call this prime, | ▶ 01:11 |
because it's not a real probability--to be just P of B given A times P of A, | ▶ 01:16 |
omitting the normalizer, that is, the denominator of the expression over here. | ▶ 01:23 |
We do the same thing with not A. | ▶ 01:28 |
So, in both cases, we compute the posterior probability non-normalized | ▶ 01:31 |
by omitting the normalizer P of B. | ▶ 01:36 |
And then we can recover the original probabilities by normalizing | ▶ 01:38 |
based on those values over here, so the probability of A given B, | ▶ 01:43 |
the actual probability, is a normalizer, eta, | ▶ 01:48 |
times this non-normalized form over here. | ▶ 01:52 |
The same is true for the negation of A over here. | ▶ 01:55 |
And eta is just the normalizer obtained by adding these 2 values over here together | ▶ 01:59 |
as shown over here, and taking one over that sum. | ▶ 02:06 |
So, take a look at this for a moment. | ▶ 02:10 |
What we've done is we deferred the calculation of the normalizer over here | ▶ 02:13 |
by computing pseudo probabilities that are non-normalized. | ▶ 02:18 |
This made the calculation much easier, and when we were done with everything, | ▶ 02:22 |
we just folded it back into the normalizer based on the resulting | ▶ 02:26 |
pseudo probabilities and got the correct answer. | ▶ 02:29 |
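A small sketch of this deferred-normalization trick, again not from the lecture itself: compute the non-normalized products first, then recover real probabilities by dividing by their sum, where eta is one over that sum.

```python
p_cancer = 0.01
p_pos_given_c, p_pos_given_not_c = 0.9, 0.2

# Pseudo probabilities: the numerators of Bayes Rule only, normalizer omitted.
p_prime_c = p_pos_given_c * p_cancer                # P'(C | +)     = 0.009
p_prime_not_c = p_pos_given_not_c * (1 - p_cancer)  # P'(not C | +) = 0.198

eta = 1.0 / (p_prime_c + p_prime_not_c)  # normalizer, applied only at the end
print(eta * p_prime_c, eta * p_prime_not_c)  # ~0.0435 and ~0.9565, which sum to 1
```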
The reason why I gave you all this is because I want you to apply it now | ▶ 00:00 |
to a slightly more complicated problem, which is the 2-test cancer example. | ▶ 00:03 |
In this example, we again might have our unobservable cancer C, | ▶ 00:08 |
but now we're running 2 tests, test 1 and test 2. | ▶ 00:14 |
As before, the prior probability of cancer is 0.01. | ▶ 00:18 |
The probability of receiving a positive test result, given cancer, is 0.9 for either test. | ▶ 00:24 |
The probability of getting a negative result, given that you're cancer free, is 0.8. | ▶ 00:30 |
And from those, we were able to compute all the other probabilities, | ▶ 00:36 |
and we're just going to write them down over here. | ▶ 00:40 |
So, take a moment to just verify those. | ▶ 00:43 |
Now, let's assume both of my tests come back positive, | ▶ 00:46 |
so T1 = + and T2 = +. | ▶ 00:50 |
What's the probability of cancer now written in short form probability of | ▶ 00:56 |
C given ++? | ▶ 01:00 |
I want you to tell me what that is, and this is a non-trivial question. | ▶ 01:03 |
So, the correct answer is 0.1698 approximately, | ▶ 00:00 |
and to compute this, I used the trick I've shown you before. | ▶ 00:10 |
Let me write down the running count for cancer and for not cancer | ▶ 00:15 |
as I integrate the various multiplications in Bayes Rule. | ▶ 00:24 |
My prior for cancer was 0.01 and for non-cancer was 0.99. | ▶ 00:28 |
Then I get my first +, and the probability of a + given they have cancer is 0.9, | ▶ 00:37 |
and the same for non-cancer is 0.2. | ▶ 00:43 |
So, according to the non-normalized Bayes Rule, | ▶ 00:48 |
I now multiply these 2 things together to get my non-normalized probability | ▶ 00:52 |
of having cancer given the plus. | ▶ 00:58 |
Since multiplication is commutative, | ▶ 01:00 |
I can do the same thing again with my 2nd test result, 0.9 and 0.2, | ▶ 01:03 |
and I multiply all of these 3 things together to get my non-normalized probability | ▶ 01:09 |
P prime to be the following: 0.0081, if you multiply those things together, | ▶ 01:14 |
and 0.0396 if you multiply these facts together. | ▶ 01:21 |
And these are not yet probabilities. | ▶ 01:28 |
If we add those for the 2 complementary of cancer/non-cancer, | ▶ 01:30 |
I get 0.0477. | ▶ 01:34 |
However, if I now divide, that is, I normalize | ▶ 01:38 |
those non-normalized probabilities over here by this factor over here, | ▶ 01:42 |
I actually get the correct posterior probability P of cancer given ++. | ▶ 01:47 |
And they look as follows: | ▶ 01:52 |
approximately 0.1698 and approximately 0.8301. | ▶ 01:54 |
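Here is the same running-product calculation as a short Python sketch of mine, for illustration: multiply the prior by one likelihood factor per positive test, then normalize at the end.

```python
p_c, p_not_c = 0.01, 0.99                # priors for cancer / no cancer
pos_given_c, pos_given_not_c = 0.9, 0.2  # P(+ | C), P(+ | not C)

# Two positive tests: multiply in one likelihood factor per test (the tests are
# conditionally independent given C), keeping a running non-normalized product.
p_prime_c = p_c * pos_given_c * pos_given_c                  # 0.0081
p_prime_not_c = p_not_c * pos_given_not_c * pos_given_not_c  # 0.0396

total = p_prime_c + p_prime_not_c                 # 0.0477
print(p_prime_c / total, p_prime_not_c / total)   # ~0.1698 and ~0.8302
```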
Calculate for me the probability of cancer | ▶ 00:00 |
given that I received one positive and one negative test result. | ▶ 00:03 |
Please write your number into this box. | ▶ 00:08 |
We apply the same trick as before | ▶ 00:00 |
where we use the exact same prior of 0.01. | ▶ 00:03 |
Our first + gives us the following factors: 0.9 and 0.2. | ▶ 00:07 |
And our minus gives us the probability 0.1 for a negative test result given that we have cancer, | ▶ 00:13 |
and 0.8 for a negative result given that we don't have cancer. | ▶ 00:20 |
We multiply those together. | ▶ 00:26 |
We get our non-normalized probability. | ▶ 00:28 |
And if we now normalize by the sum of those two things | ▶ 00:30 |
to turn this back into a probability, we get 0.0009 | ▶ 00:35 |
over the sum of those two things over here, and this is 0.0056 | ▶ 00:41 |
for the chance of having cancer and 0.9943 for the chance of being cancer free. | ▶ 00:50 |
And this adds up approximately to 1, and therefore, is a probability distribution. | ▶ 00:59 |
I want to use a few words of terminology. | ▶ 00:00 |
This, again, is a Bayes network, of which the hidden variable C | ▶ 00:03 |
causes the still stochastic test outcomes T1 and T2. | ▶ 00:08 |
And what is really important is that we assume not just | ▶ 00:16 |
that T1 and T2 are identically distributed. | ▶ 00:19 |
We use the same 0.9 for test 1 as we use for test 2, | ▶ 00:22 |
but we also assume that they are conditionally independent. | ▶ 00:27 |
We assumed that if God told us whether we actually had cancer or not, | ▶ 00:31 |
if we knew with absolute certainty the value of the variable C, | ▶ 00:37 |
that knowing anything about T1 would not help us make a statement about T2. | ▶ 00:41 |
Put differently, we assumed that the probability of T2 given C and T1 | ▶ 00:48 |
is the same as the probability of T2 given C. | ▶ 00:55 |
This is called conditional independence, which is given the value of the cancer variable C. | ▶ 01:00 |
If you knew this for a fact, then T2 would be independent of T1. | ▶ 01:08 |
It's conditionally independent because the independence only holds true | ▶ 01:17 |
if we actually know C, and it comes out of this diagram over here. | ▶ 01:21 |
If we look at this diagram, if you knew the variable C over here, | ▶ 01:26 |
then C separately causes T1 and T2. | ▶ 01:32 |
So, as a result, if you know C, whatever happened over here | ▶ 01:39 |
is kind of cut off causally from what happens over here. | ▶ 01:46 |
That causes these 2 variables to be conditionally independent. | ▶ 01:48 |
So, conditional independence is a really big thing in Bayes networks. | ▶ 01:52 |
Here's a Bayes network where A causes B and C, | ▶ 01:58 |
and for a Bayes network of this structure, we know that given A, | ▶ 02:02 |
B and C are independent. | ▶ 02:08 |
It's written as B conditionally independent of C given A. | ▶ 02:11 |
So, here's a question. | ▶ 02:16 |
Suppose we have conditional independence between B and C given A. | ▶ 02:18 |
Would that imply--and there's my question--that B and C are independent? | ▶ 02:21 |
So, suppose we don't know A. | ▶ 02:28 |
We don't know whether we have cancer, for example. | ▶ 02:30 |
That would mean that the test results individually are still independent of each other | ▶ 02:33 |
even if we don't know about the cancer situation. | ▶ 02:38 |
Please answer yes or no. | ▶ 02:42 |
And the correct answer is no. | ▶ 00:00 |
Intuitively, getting a positive test result about cancer | ▶ 00:03 |
gives us information about whether you have cancer or not. | ▶ 00:08 |
So if you get a positive test result | ▶ 00:13 |
you're going to raise the probability of having cancer | ▶ 00:15 |
relative to the prior probability. | ▶ 00:18 |
With that increased probability we will predict | ▶ 00:20 |
that another test will with a higher likelihood | ▶ 00:24 |
give us a positive response than if we hadn't taken the previous test. | ▶ 00:27 |
That's really important to understand. | ▶ 00:33 |
So that we understand it, let me make you calculate those probabilities. | ▶ 00:36 |
Let me draw the cancer example again with two tests. | ▶ 00:00 |
Here's my cancer variable | ▶ 00:05 |
and then there's two conditionally independent tests T1 and T2. | ▶ 00:07 |
And as before let me assume that the prior probability of cancer is 0.01 | ▶ 00:13 |
What I want you to compute for me is the probability of the second test | ▶ 00:19 |
to be positive if we know that the first test was positive. | ▶ 00:26 |
So write this into the following box. | ▶ 00:33 |
So, for this one, we want to apply total probability. | ▶ 00:00 |
This thing over here is the same as probability of test 2 to be positive, | ▶ 00:04 |
which I'm going to abbreviate with a +2 over here, | ▶ 00:10 |
conditioned on test 1 being positive and me having cancer | ▶ 00:14 |
times the probability of me having cancer given test 1 was positive plus | ▶ 00:19 |
the probability of test 2 being positive conditioned on test 1 being positive | ▶ 00:25 |
and me not having cancer times the probability of me not having cancer | ▶ 00:31 |
given that test 1 is positive. | ▶ 00:36 |
That's the same as the theorem of total probability, | ▶ 00:38 |
but now everything is conditioned on +1. | ▶ 00:42 |
Take a moment to verify this. | ▶ 00:46 |
Now, here I can plug in the numbers. | ▶ 00:48 |
You already calculated this one before, which is approximately 0.043, | ▶ 00:50 |
and this one over here is 1 minus that, which is 0.957 approximately. | ▶ 00:57 |
And this term over here now exploits conditional independence, | ▶ 01:05 |
which is given that I know C, knowledge of the first test | ▶ 01:09 |
gives me no more information about the second test. | ▶ 01:14 |
It only gives me information if C was unknown, as was the case over here. | ▶ 01:17 |
So, I can rewrite this thing over here as follows: | ▶ 01:21 |
P of +2 given that I have cancer. | ▶ 01:24 |
I can drop the +1, and the same is true over here. | ▶ 01:27 |
This is exploiting my conditional independence. | ▶ 01:31 |
I knew that P of +1 or +2 conditioned on C | ▶ 01:34 |
is the same as P of +2 conditioned on C and +1. | ▶ 01:41 |
I can now read those off my table over here, | ▶ 01:47 |
which is 0.9 times 0.043 plus 0.2, | ▶ 01:50 |
which is 1 minus 0.8 over here times 0.957, | ▶ 01:58 |
which gives me approximately 0.2301. | ▶ 02:03 |
So, that says if my first test comes in positive, | ▶ 02:09 |
I expect my second test to be positive with probability 0.2301. | ▶ 02:14 |
That's an increase over the default probability, | ▶ 02:21 |
which we calculated before--the probability of any single test, | ▶ 02:24 |
such as test 2, coming in positive--which was the normalizer of Bayes Rule, 0.207. | ▶ 02:29 |
So, my first test has a 20% chance of coming in positive. | ▶ 02:38 |
My second test, after seeing a positive test, | ▶ 02:43 |
has now an increased probability of about 23% of coming in positive. | ▶ 02:47 |
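This calculation is easy to check with a few lines of Python (my own sketch). Note that it gives 0.2304 rather than 0.2301 because it keeps the full posterior 0.0434... instead of the rounded 0.043 used on the board.

```python
# P(+2 | +1) = P(+2 | C) P(C | +1) + P(+2 | not C) P(not C | +1),
# using conditional independence of the two tests given C.
p_c_given_pos1 = 0.009 / 0.207           # posterior after one positive test, ~0.0435
pos_given_c, pos_given_not_c = 0.9, 0.2  # P(+ | C), P(+ | not C)

p_pos2_given_pos1 = (pos_given_c * p_c_given_pos1
                     + pos_given_not_c * (1 - p_c_given_pos1))
print(p_pos2_given_pos1)  # ~0.2304, versus the prior ~0.207 for a single test
```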
So, now we've learned about independence, | ▶ 00:00 |
and the corresponding Bayes network has 2 nodes. | ▶ 00:02 |
They're just not connected at all. | ▶ 00:04 |
And we learned about conditional independence, | ▶ 00:07 |
in which case we have a Bayes network that looks like this. | ▶ 00:09 |
Now I would like to know whether absolute independence | ▶ 00:12 |
implies conditional independence. | ▶ 00:16 |
True or false? | ▶ 00:18 |
And I'd also like to know whether conditional independence implies absolute independence. | ▶ 00:20 |
Again, true or false? | ▶ 00:25 |
And the answer is both of them are false. | ▶ 00:00 |
We already saw that conditional independence, as shown over here, | ▶ 00:03 |
doesn't give us absolute independence. | ▶ 00:07 |
So, for example, this is test #1 and test #2. | ▶ 00:09 |
You might or might not have cancer. | ▶ 00:13 |
Our first test gives us information about whether you have cancer or not. | ▶ 00:15 |
As a result, we've changed our prior probability | ▶ 00:18 |
for the second test to come in positive. | ▶ 00:21 |
That means that conditional independence does not imply absolute independence, | ▶ 00:24 |
which means this assumption here is false, | ▶ 00:30 |
and it also turns out that if you have absolute independence, | ▶ 00:32 |
things might not be conditionally independent for reasons that I can't quite explain so far, | ▶ 00:37 |
but that we will learn about next. | ▶ 00:43 |
[Thrun] For my next example, I will study a different type of a Bayes network. | ▶ 00:00 |
Before, we've seen networks of the following type, | ▶ 00:04 |
where a single hidden cause caused 2 different measurements. | ▶ 00:08 |
I now want to study a network that looks just like the opposite. | ▶ 00:13 |
We have 2 independent hidden causes, | ▶ 00:17 |
but they get confounded within a single observational variable. | ▶ 00:20 |
I would like to use the example of happiness. | ▶ 00:26 |
Suppose I can be happy or unhappy. | ▶ 00:29 |
What makes me happy is when the weather is sunny or if I get a raise in my job, | ▶ 00:33 |
which means I make more money. | ▶ 00:41 |
So let's call this sunny, let's call this a raise, and call this happiness. | ▶ 00:43 |
Perhaps the probability of it being sunny is 0.7, | ▶ 00:47 |
probability of a raise is 0.01. | ▶ 00:53 |
And I will tell you that the probability of being happy is governed as follows. | ▶ 00:58 |
The probability of being happy given that both of these things occur-- | ▶ 01:05 |
I got a raise and it is sunny--is 1. | ▶ 01:09 |
The probability of being happy given that it is not sunny and I still got a raise is 0.9. | ▶ 01:13 |
The probability of being happy given that it's sunny but I didn't get a raise is 0.7. | ▶ 01:20 |
And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1. | ▶ 01:27 |
This is a perfectly fine specification of a probability distribution | ▶ 01:35 |
where 2 causes affect the variable down here, the happiness. | ▶ 01:39 |
So I'd like you to calculate for me the following questions. | ▶ 01:46 |
Probability of a raise given that it is sunny, according to this model. | ▶ 01:50 |
Please enter your answer over here. | ▶ 01:57 |
[Thrun] The answer is surprisingly simple. | ▶ 00:00 |
It is 0.01. | ▶ 00:03 |
How do I know this so fast? | ▶ 00:05 |
Well, if you look at this Bayes network, | ▶ 00:08 |
both the sunniness and the question whether I got a raise impact my happiness. | ▶ 00:12 |
But since I don't know anything about the happiness, | ▶ 00:21 |
there is no way that just the weather might implicate or impact whether I get a raise or not. | ▶ 00:24 |
In fact, it might be independently sunny, and I might independently get a raise at work. | ▶ 00:32 |
There is no mechanism of which these 2 things would co-occur. | ▶ 00:39 |
Therefore, the probability of a raise given that it's sunny | ▶ 00:46 |
is just the same as the probability of a raise given any weather, which is 0.01. | ▶ 00:49 |
[Thrun] Let me talk about a really interesting special instance of Bayes net reasoning | ▶ 00:00 |
which is called explaining away. | ▶ 00:07 |
And I'll first give you the intuitive answer, | ▶ 00:10 |
then I'll wish you to compute probabilities for me that manifest the explain away effect | ▶ 00:14 |
in a Bayes network of this type. | ▶ 00:19 |
Explaining away means that if we know that we are happy, | ▶ 00:22 |
then sunny weather can explain away the cause of happiness. | ▶ 00:27 |
If I then also know that it's sunny, it becomes less likely that I received a raise. | ▶ 00:34 |
Let me put this differently. | ▶ 00:41 |
Suppose I'm a happy guy on a specific day | ▶ 00:43 |
and my wife asks me, "Sebastian, why are you so happy?" | ▶ 00:45 |
"Is it sunny, or did you get a raise?" | ▶ 00:49 |
If she then looks outside and sees it is sunny, | ▶ 00:52 |
then she might explain to herself, | ▶ 00:55 |
"Well, Sebastian is happy because it is sunny." | ▶ 00:57 |
"That makes it effectively less likely that he got a raise | ▶ 01:00 |
"because I could already explain his happiness by it being sunny." | ▶ 01:05 |
If she looks outside and it is rainy, | ▶ 01:10 |
that makes it more likely I got a raise, | ▶ 01:13 |
because the weather can't really explain my happiness. | ▶ 01:16 |
In other words, if we see a certain effect that could be caused by multiple causes, | ▶ 01:20 |
seeing one of those causes can explain away any other potential cause | ▶ 01:27 |
of this effect over here. | ▶ 01:33 |
So let me put this in numbers and ask you the challenging question of | ▶ 01:36 |
what's the probability of a raise given that I'm happy and it's sunny? | ▶ 01:43 |
[Thrun] The answer is approximately 0.0142, | ▶ 00:00 |
and it is an exercise in expanding this term using Bayes' rule, | ▶ 00:07 |
using total probability, which I'll just do for you. | ▶ 00:11 |
Using Bayes' rule, you can transform this into P of H given R comma S | ▶ 00:16 |
times P of R given S over P of H given S. | ▶ 00:24 |
We observe the conditional independence of R and S | ▶ 00:34 |
to simplify this to just P of R, | ▶ 00:37 |
and the denominator is expanded by folding in R and not R, | ▶ 00:40 |
P of H given R comma S | ▶ 00:46 |
times P of R plus P of H given not R and S | ▶ 00:49 |
times P of not R, which is total probability. | ▶ 00:54 |
We can now read off the numbers from the tables over here, | ▶ 00:58 |
which gives us 1 times 0.01 divided by this expression | ▶ 01:01 |
that is the same as the expression over here, so 0.01 plus this thing over here, | ▶ 01:10 |
which you can find over here to be 0.7, times this guy over here, | ▶ 01:17 |
which is 1 minus the value over here, 0.99, | ▶ 01:23 |
which gives us approximately 0.0142. | ▶ 01:27 |
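Here is that calculation as a short sketch (not from the lecture): Bayes Rule with the denominator expanded by total probability over R, and P of R given S simplified to P of R by conditional independence.

```python
p_r = 0.01               # P(R): prior probability of a raise
p_h_given_r_s = 1.0      # P(H | R, S)
p_h_given_not_r_s = 0.7  # P(H | not R, S)

numerator = p_h_given_r_s * p_r
denominator = numerator + p_h_given_not_r_s * (1 - p_r)
print(numerator / denominator)  # ~0.0142
```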
[Thrun] Now, to understand the explain away effect, | ▶ 00:00 |
you have to compare this to the probability of a raise given that we're just happy | ▶ 00:04 |
and we don't know anything about the weather. | ▶ 00:11 |
So let's do that exercise next. | ▶ 00:14 |
So my next quiz is, what's the probability of a raise given that all I know is that I'm happy | ▶ 00:16 |
and I don't know about the weather? | ▶ 00:24 |
This happens to be once again a pretty complicated question, so take your time. | ▶ 00:26 |
[Thrun] So this is a difficult question. | ▶ 00:00 |
Let me compute an auxiliary variable, which is P of happiness. | ▶ 00:02 |
That one is expanded by looking at the different conditions that can make us happy. | ▶ 00:12 |
P of happiness given S and R | ▶ 00:19 |
times P of S and R, which is of course the product of those 2 | ▶ 00:24 |
because they are independent, | ▶ 00:29 |
plus P of happiness given not S and R, times the probability of not S and R, | ▶ 00:31 |
plus P of H given S and not R | ▶ 00:39 |
times the probability of S and not R, plus the last case, | ▶ 00:43 |
P of H given not S and not R, times the probability of not S and not R. | ▶ 00:48 |
So this just looks at the happiness under all 4 combinations of the variables | ▶ 00:52 |
that can lead to happiness. | ▶ 00:56 |
And you can plug those straight in. | ▶ 00:58 |
This one over here is 1, and this one over here is the product of S and R, | ▶ 01:00 |
which is 0.7 times 0.01. | ▶ 01:05 |
And as you plug all of those in, | ▶ 01:10 |
you get as a result 0.5245. | ▶ 01:14 |
That's P of H. | ▶ 01:21 |
Just take some time and do the math by going through these different cases | ▶ 01:24 |
using total probability, and you get this result. | ▶ 01:28 |
Armed with this number, the rest now becomes easy, | ▶ 01:32 |
which is we can use Bayes' rule to turn this around. | ▶ 01:38 |
P of H given R times P of R over P of H. | ▶ 01:43 |
P of R we know from over here, the probability of a raise is 0.01. | ▶ 01:49 |
So the only thing we need to compute now is P of H given R. | ▶ 01:54 |
And again, we apply total probability. | ▶ 01:57 |
Let me just do this over here. | ▶ 01:59 |
We can factor P of H given R as P of H given R and S, sunny, | ▶ 02:02 |
times probability of sunny plus P of H given R and not sunny | ▶ 02:09 |
times the probability of not sunny. | ▶ 02:14 |
And if you plug in the numbers with this, you get 1 times 0.7 | ▶ 02:16 |
plus 0.9 times 0.3. | ▶ 02:21 |
That happens to be 0.97. | ▶ 02:25 |
So if we now plug this all back into this equation over here, | ▶ 02:30 |
we get 0.97 times 0.01 over 0.5245. | ▶ 02:33 |
This gives us approximately as the correct answer 0.0185. | ▶ 02:45 |
[Thrun] And if you got this right, I will be deeply impressed | ▶ 00:00 |
about the fact you got this right. | ▶ 00:04 |
But the interesting thing now to observe is if we happen to know it's sunny | ▶ 00:07 |
and I'm happy, then the probability of a raise is 1.4%, or 0.014. | ▶ 00:13 |
If I don't know about the weather and I'm happy, | ▶ 00:21 |
then the probability of a raise goes up to about 18.5%. | ▶ 00:26 |
Why is that? | ▶ 00:30 |
Well, it's the explaining away effect. | ▶ 00:32 |
My happiness is well explained by the fact that it's sunny. | ▶ 00:35 |
So if someone observes me to be happy and asks the question, | ▶ 00:40 |
"Is this because Sebastian got a raise at work?" | ▶ 00:43 |
well, if you know it's sunny and this is a fairly good explanation for me being happy, | ▶ 00:46 |
you don't have to assume I got a raise. | ▶ 00:53 |
If you don't know about the weather, then obviously the chances are higher | ▶ 00:55 |
that the raise caused my happiness, | ▶ 01:01 |
and therefore this number goes up from 0.014 to 0.018. | ▶ 01:03 |
Let me ask you one final question in this next quiz, | ▶ 01:10 |
which is the probability of the raise given that I look happy and it's not sunny. | ▶ 01:14 |
This is the most extreme case for making a raise likely | ▶ 01:23 |
because I am a happy guy, and it's definitely not caused by the weather. | ▶ 01:27 |
So it could be just random, or it could be caused by the raise. | ▶ 01:33 |
So please calculate this number for me and enter it into this box. | ▶ 01:37 |
[Thrun] Well, the answer follows the exact same scheme as before, | ▶ 00:00 |
with S being replaced by not S. | ▶ 00:04 |
So this should be an easier question for you to answer. | ▶ 00:08 |
P of R given H and not S can be inverted by Bayes' rule to be as follows. | ▶ 00:11 |
Once we apply Bayes' rule, as indicated over here where we swapped H to the left side | ▶ 00:20 |
and R to the right side, you can observe that this value over here | ▶ 00:24 |
can be readily found in the table. | ▶ 00:29 |
It's actually the 0.9 over there. | ▶ 00:32 |
This value over here, the raise is independent of the weather | ▶ 00:35 |
by virtue of our Bayes network, so it's just 0.01. | ▶ 00:41 |
And as before, we apply total probability to the expression over here, | ▶ 00:45 |
and we obtain this quotient over here, where these 2 expressions are the same. | ▶ 00:52 |
P of H given not S, not R is the value over here, | ▶ 00:58 |
and the 0.99 is the complement of probability of R taken from over here, | ▶ 01:03 |
and that ends up to be 0.0833. | ▶ 01:08 |
This would have been the right answer. | ▶ 01:16 |
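All three of these explaining-away quizzes can also be checked by brute-force enumeration over the tiny joint distribution of S, R, and H. The sketch below is my own illustration using the model's numbers, not code from the class:

```python
from itertools import product

P_S, P_R = 0.7, 0.01
# P(H = true | S, R) for each combination (sunny, raise):
P_H = {(True, True): 1.0, (False, True): 0.9, (True, False): 0.7, (False, False): 0.1}

def joint(s, r, h):
    """Full joint P(S=s, R=r, H=h); S and R are independent in this network."""
    p = (P_S if s else 1 - P_S) * (P_R if r else 1 - P_R)
    p_h = P_H[(s, r)]
    return p * (p_h if h else 1 - p_h)

def p_raise_given(evidence):
    """P(R = true | evidence) by summing the joint; evidence maps 's'/'h' to booleans."""
    num = den = 0.0
    for s, r, h in product([True, False], repeat=3):
        world = {'s': s, 'r': r, 'h': h}
        if any(world[var] != val for var, val in evidence.items()):
            continue
        p = joint(s, r, h)
        den += p
        if r:
            num += p
    return num / den

print(p_raise_given({'h': True, 's': True}))   # ~0.0142: sunshine explains the happiness away
print(p_raise_given({'h': True}))              # ~0.0185: weather unknown
print(p_raise_given({'h': True, 's': False}))  # ~0.0833: happy despite bad weather
```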
[Thrun] It's really interesting to compare this to the situation over here. | ▶ 00:00 |
In both cases I'm happy, as shown over here, | ▶ 00:04 |
and I ask the same question, which is whether I got a raise at work, as R over here. | ▶ 00:08 |
But in one case I observe that the weather is sunny; in the other one it isn't. | ▶ 00:15 |
And look what it does to my probability of having received a raise. | ▶ 00:21 |
The sunniness perfectly well explains my happiness, | ▶ 00:25 |
and my probability of having received a raise ends up to be a mere 1.4%, or 0.014. | ▶ 00:30 |
However, if my wife observes it to be non-sunny, then it is much more likely | ▶ 00:41 |
that the cause of my happiness is related to a raise at work, | ▶ 00:47 |
and now the probability is 8.3%, which is significantly higher than the 1.4% before. | ▶ 00:51 |
This is a Bayes network of which S and R are independent | ▶ 00:58 |
but H adds a dependence between S and R. | ▶ 01:04 |
Let me talk about this in a little bit more detail on the next paper. | ▶ 01:10 |
So here is our Bayes network again. | ▶ 01:16 |
In our previous exercises, we computed for this network | ▶ 01:18 |
that the probability of a raise of R given any of these variables shown here was as follows. | ▶ 01:22 |
The really interesting thing is that in the absence of information about H, | ▶ 01:29 |
which is the middle case over here, | ▶ 01:34 |
the probability of R is unaffected by knowledge of S-- | ▶ 01:37 |
that is, R and S are independent. | ▶ 01:41 |
This is the same as probability of R, | ▶ 01:46 |
and R and S are independent. | ▶ 01:49 |
However, if I know something about the variable H, | ▶ 01:56 |
then S and R become dependent-- | ▶ 02:02 |
that is, knowing about my happiness over here renders S and R dependent. | ▶ 02:06 |
This is not the same as probability of just R given H. | ▶ 02:15 |
Obviously, it isn't because if I now vary S from S to not S, | ▶ 02:23 |
it affects my probability for the variable R. | ▶ 02:28 |
That is a really unusual situation | ▶ 02:33 |
where we have R and S are independent | ▶ 02:36 |
but given the variable H, R and S are not independent anymore. | ▶ 02:40 |
So knowledge of H makes 2 variables that previously were independent non-independent. | ▶ 02:50 |
Put differently, 2 variables that are independent may in certain cases not be | ▶ 02:58 |
conditionally independent. | ▶ 03:06 |
Independence does not imply conditional independence. | ▶ 03:08 |
[Thrun] So we're now ready to define Bayes networks in a more general way. | ▶ 00:00 |
Bayes networks define probability distributions over graphs or random variables. | ▶ 00:05 |
Here is an example graph of 5 variables, | ▶ 00:10 |
and this Bayes network defines the distribution over those 5 random variables. | ▶ 00:14 |
Instead of enumerating all possibilities of combinations of these 5 random variables, | ▶ 00:19 |
the Bayes network is defined by probability distributions | ▶ 00:24 |
that are inherent to each individual node. | ▶ 00:28 |
For node A and B, we just have a distribution P of A and P of B | ▶ 00:32 |
because A and B have no incoming arcs. | ▶ 00:38 |
C is a conditional distribution conditioned on A and B. | ▶ 00:42 |
D and E are conditioned on C. | ▶ 00:47 |
The joint probability represented by a Bayes network | ▶ 00:52 |
is the product of various Bayes network probabilities | ▶ 00:56 |
that are defined over individual nodes | ▶ 01:00 |
where each node's probability is only conditioned on the incoming arcs. | ▶ 01:03 |
So A has no incoming arc; therefore, we just write P of A. | ▶ 01:08 |
C has 2 incoming arcs, so we define the probability of C conditioned on A and B. | ▶ 01:12 |
And D and E have 1 incoming arc that's shown over here. | ▶ 01:18 |
The definition of this joint distribution by using the following factors | ▶ 01:22 |
has one really big advantage. | ▶ 01:27 |
Whereas the joint distribution over any 5 variables requires 2 to the 5 minus 1, | ▶ 01:30 |
which is 31 probability values, | ▶ 01:40 |
the Bayes network over here only requires 10 such values. | ▶ 01:43 |
P of A is one value, for which we can derive P of not A. | ▶ 01:48 |
Same for P of B. | ▶ 01:53 |
P of C given A B is given by a distribution over C | ▶ 01:55 |
conditioned on any combination of A and B, of which there are 4 since A and B are binary. | ▶ 02:02 |
P of D given C is 2 parameters for P of D given C and P of D given not C. | ▶ 02:07 |
And the same is true for P of E given C. | ▶ 02:15 |
So if you add those up, you get 10 parameters in total. | ▶ 02:18 |
So the compactness of the Bayes network | ▶ 02:21 |
leads to a representation that scales significantly better to large networks | ▶ 02:25 |
than the combinatorial approach which goes through all combinations of variable values. | ▶ 02:31 |
That is a key advantage of Bayes networks, | ▶ 02:36 |
and that is the reason why Bayes networks are being used so extensively | ▶ 02:39 |
for all kinds of problems. | ▶ 02:43 |
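The parameter count is easy to automate: a binary variable with k parents needs 2 to the k conditional probability values. Here is a small sketch of mine that encodes the 5-node network above as a map from each node to its parents:

```python
def count_parameters(parents):
    """Number of probability values needed for a Bayes net over binary variables:
    each node with k parents contributes 2**k values."""
    return sum(2 ** len(p) for p in parents.values())

# The network above: A and B are roots, C has parents A and B, D and E have parent C.
network = {'A': [], 'B': [], 'C': ['A', 'B'], 'D': ['C'], 'E': ['C']}
print(count_parameters(network))  # 10, versus 2**5 - 1 = 31 for the unstructured joint
```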
So here is a quiz. | ▶ 02:45 |
How many probability values are required to specify this Bayes network? | ▶ 02:47 |
Please put your answer in the following box. | ▶ 02:51 |
[Thrun] And the answer is 13. | ▶ 00:00 |
One over here, 2 over here, and 4 over here. | ▶ 00:03 |
Simply speaking, any variable that has K inputs requires 2 to the K such parameters. | ▶ 00:06 |
So in total we have 1, 9, 13. | ▶ 00:15 |
[Thrun] Here's another quiz. | ▶ 00:00 |
How many parameters do we need to specify the joint distribution | ▶ 00:02 |
for this Bayes network over here | ▶ 00:06 |
where A, B, and C point into D, D points into E, F, and G | ▶ 00:09 |
and C also points into G? | ▶ 00:13 |
Please write your answer into this box. | ▶ 00:15 |
[Thrun] And the answer is 19. | ▶ 00:00 |
So 1 here, 1 here, 1 here, 2 here, 2 here, 2 arcs point into G, which makes for 4, | ▶ 00:02 |
and 3 arcs point into D. Two to the 3 is 8. | ▶ 00:09 |
So we get 1, 1, 1, 8, 2, 2, 4. If you add those up, it's 19. | ▶ 00:13 |
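The same counting rule reproduces this answer; here is a quick sketch of this quiz network as a parent map:

```python
# A, B, and C point into D; D points into E, F, and G; C also points into G.
network = {'A': [], 'B': [], 'C': [],
           'D': ['A', 'B', 'C'],
           'E': ['D'],
           'F': ['D'],
           'G': ['C', 'D']}
print(sum(2 ** len(p) for p in network.values()))  # 1 + 1 + 1 + 8 + 2 + 2 + 4 = 19
```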
[Thrun] And here is our car network which we discussed at the very beginning of this unit. | ▶ 00:00 |
How many parameters do we need to specify this network? | ▶ 00:06 |
Remember, there are 16 total variables, | ▶ 00:11 |
and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535. | ▶ 00:15 |
Please write your answer into this box over here. | ▶ 00:25 |
[Thrun] To answer this question, let us add up these numbers. | ▶ 00:00 |
Battery age is 1, 1, 1. | ▶ 00:04 |
This has 1 incoming arc, so it's 2. | ▶ 00:08 |
Two incoming arcs makes 4. | ▶ 00:10 |
One incoming arc is 2, 2 equals 4. | ▶ 00:13 |
Four incoming arcs makes 16. | ▶ 00:17 |
If we add up all these numbers, we get 47. | ▶ 00:21 |
[Thrun] So it takes 47 numerical probabilities to specify the joint | ▶ 00:00 |
compared to 65,000 if you didn't have the graph-like structure. | ▶ 00:05 |
I think this example really illustrates the advantage | ▶ 00:11 |
of compact Bayes network representations over unstructured joint representations. | ▶ 00:14 |
[Thrun] The next concept I'd like to teach you is called D-separation. | ▶ 00:00 |
And let me start the discussion of this concept by a quiz. | ▶ 00:04 |
We have here a Bayes network, | ▶ 00:09 |
and I'm going to ask you a conditional independence question. | ▶ 00:11 |
Is C independent of A? | ▶ 00:16 |
Please tell me yes or no. | ▶ 00:20 |
Is C independent of A given B? | ▶ 00:22 |
Is C independent of D? | ▶ 00:27 |
Is C independent of D given A? | ▶ 00:30 |
And is E independent of C given D? | ▶ 00:32 |
[Thrun] So C is not independent of A. | ▶ 00:00 |
In fact, A influences C by virtue of B. | ▶ 00:04 |
But if you know B, then A becomes independent of C, | ▶ 00:09 |
which means the only determinant of C is B. | ▶ 00:13 |
If you know B for sure, then knowledge of A won't really tell you anything about C. | ▶ 00:17 |
C is also not independent of D, just the same way C is not independent of A. | ▶ 00:22 |
If I learn something about D, I can infer more about C. | ▶ 00:27 |
But if I do know A, then it's hard to imagine how knowledge of D would help me with C | ▶ 00:31 |
because I can't learn anything more about A than knowing A already. | ▶ 00:38 |
Therefore, given A, C and D are independent. | ▶ 00:42 |
The same is true for E and C. | ▶ 00:45 |
If we know D, then E and C become independent. | ▶ 00:48 |
[Thrun] In this specific example, the rule that we could apply is very, very simple. | ▶ 00:00 |
Any 2 variables are independent if they're not linked by just unknown variables. | ▶ 00:04 |
So for example, if we know B, then everything downstream of B | ▶ 00:10 |
becomes independent of anything upstream of B. | ▶ 00:14 |
E is now independent of C, conditioned on B. | ▶ 00:18 |
However, knowledge of B does not render A and E independent. | ▶ 00:22 |
In this graph over here, A and B connect to C and C connects to D and to E. | ▶ 00:26 |
So let me ask you, is A independent of E, | ▶ 00:33 |
A independent of E given B, | ▶ 00:37 |
A independent of E given C, | ▶ 00:39 |
A independent of B, | ▶ 00:41 |
and A independent of B given C? | ▶ 00:43 |
[Thrun] And the answer for this one is really interesting. | ▶ 00:00 |
A is clearly not independent of E because through C we can see an influence of A to E. | ▶ 00:03 |
Given B, that doesn't change. | ▶ 00:08 |
A still influences C, despite the fact we know B. | ▶ 00:11 |
However, if we know C, the influence is cut off. | ▶ 00:15 |
There is no way A can influence E if we know C. | ▶ 00:18 |
A is clearly independent of B. | ▶ 00:22 |
They are different entry variables. They have no incoming arcs. | ▶ 00:25 |
But here is the caveat. | ▶ 00:29 |
Given C, A and B become dependent. | ▶ 00:32 |
So whereas initially A and B were independent, | ▶ 00:35 |
if you are given C, they become dependent. | ▶ 00:38 |
And the reason why they become dependent we've studied before. | ▶ 00:41 |
This is the explain away effect. | ▶ 00:44 |
If you know, for example, C to be true, | ▶ 00:48 |
then knowledge of A will substantially affect what we believe about B. | ▶ 00:51 |
If there's 2 joint causes for C and we happen to know A is true, | ▶ 00:57 |
we will discredit cause B. | ▶ 01:02 |
If we happen to know A is false, we will increase our belief for the cause B. | ▶ 01:04 |
That was an effect we studied extensively in the happiness example I gave you before. | ▶ 01:09 |
The interesting thing here is we are facing a situation | ▶ 01:15 |
where knowledge of variable C renders previously independent variables dependent. | ▶ 01:19 |
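These independence checks can also be automated. One standard test that is equivalent to tracing active and inactive paths is to restrict the graph to the variables of interest and their ancestors, moralize it (connect co-parents and drop arrow directions), delete the evidence nodes, and check whether the two variables are still connected. Below is a rough sketch of mine, applied to this A, B into C, C into D and E network:

```python
def independent(x, y, given, parents):
    """True if x and y are conditionally independent given the set `given`,
    using the moralized-ancestral-graph test for d-separation."""
    # 1. Keep only x, y, the evidence, and all of their ancestors.
    relevant, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])
    # 2. Moralize: link every pair of co-parents, then treat all edges as undirected.
    edges = {n: set() for n in relevant}
    for child in relevant:
        for p in parents[child]:
            edges[child].add(p)
            edges[p].add(child)
        for a in parents[child]:
            for b in parents[child]:
                if a != b:
                    edges[a].add(b)
    # 3. Remove the evidence nodes and see whether x can still reach y.
    blocked = set(given)
    stack, seen = [x], {x}
    while stack:
        node = stack.pop()
        if node == y:
            return False   # a path survives, so x and y are dependent
        for nb in edges[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                stack.append(nb)
    return True            # no remaining path: independent given the evidence

net = {'A': [], 'B': [], 'C': ['A', 'B'], 'D': ['C'], 'E': ['C']}
print(independent('A', 'E', [], net))     # False: A influences E through C
print(independent('A', 'E', ['C'], net))  # True: knowing C cuts A off from E
print(independent('A', 'B', [], net))     # True: the two root causes start out independent
print(independent('A', 'B', ['C'], net))  # False: explaining away makes them dependent
```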
[Thrun] This leads me to the general study of conditional independence in Bayes networks, | ▶ 00:00 |
often called D-separation or reachability. | ▶ 00:06 |
D-separation is best studied by so-called active triplets and inactive triplets | ▶ 00:10 |
where active triplets render variables dependent | ▶ 00:17 |
and inactive triplets render them independent. | ▶ 00:20 |
Any chain of 3 variables like this makes the initial and final variable dependent | ▶ 00:23 |
if all variables are unknown. | ▶ 00:30 |
However, if the center variable is known-- | ▶ 00:32 |
that is, it's behind the conditioning bar-- | ▶ 00:35 |
then this variable and this variable become independent. | ▶ 00:38 |
So if we have a structure like this and it's quote-unquote cut off | ▶ 00:42 |
by a known variable in the middle, that separates or d-separates | ▶ 00:47 |
the left variable from the right variable, and they become independent. | ▶ 00:53 |
Similarly, any structure like this renders the left variable and the right variable dependent | ▶ 00:57 |
unless the center variable is known, | ▶ 01:04 |
in which case the left and right variable become independent. | ▶ 01:08 |
Another active triplet now requires knowledge of a variable. | ▶ 01:12 |
This is the explain away case. | ▶ 01:16 |
If this variable is known for a Bayes network that converges into a single variable, | ▶ 01:19 |
then this variable and this variable over here become dependent. | ▶ 01:25 |
Contrast this with a case where all variables are unknown. | ▶ 01:29 |
A situation like this means that the variables on the left and on the right are actually independent. | ▶ 01:33 |
In a single final example, we also get dependence if we have the following situation: | ▶ 01:40 |
a direct successor of the convergent variable is known. | ▶ 01:48 |
So it is sufficient if a successor of this variable is known. | ▶ 01:52 |
The variable itself does not have to be known, | ▶ 01:57 |
and the reason is if you know this guy over here, | ▶ 01:59 |
we get knowledge about this guy over here. | ▶ 02:02 |
And by virtue of that, the case over here essentially applies. | ▶ 02:05 |
If you look at those rules, | ▶ 02:09 |
those rules allow you to determine for any Bayes network | ▶ 02:11 |
whether variables are dependent or not dependent given the evidence you have. | ▶ 02:15 |
If you color the nodes dark for which you do have evidence, | ▶ 02:20 |
then you can use these rules to understand whether any 2 variables | ▶ 02:25 |
are conditionally independent or not. | ▶ 02:29 |
So let me ask you for this relatively complicated Bayes network the following questions. | ▶ 02:31 |
Is F independent of A? | ▶ 02:37 |
Is F independent of A given D? | ▶ 02:41 |
Is F independent of A given G? | ▶ 02:45 |
And is F independent of A given H? | ▶ 02:49 |
Please mark your answers as you see fit. | ▶ 02:51 |
[Thrun] And the answer is yes, F is independent of A. | ▶ 00:00 |
What we find for our rules of D-separation is that F is dependent on D | ▶ 00:04 |
and A is dependent on D. | ▶ 00:08 |
But if you don't know D, you can't establish any dependence between A and F at all. | ▶ 00:11 |
If you do know D, then F and A become dependent. | ▶ 00:16 |
And the reason is B and E are dependent given D, | ▶ 00:20 |
and we can transform this back into dependence of A and F | ▶ 00:25 |
because B and A are dependent and E and F are dependent. | ▶ 00:29 |
There is an active path between A and F which goes across here and here | ▶ 00:33 |
because D is known. | ▶ 00:38 |
If we know G, the same thing is true because G gives us knowledge about D, | ▶ 00:40 |
and D can be applied back to this path over here. | ▶ 00:44 |
However, if you know H, that's not the case. | ▶ 00:47 |
So H might tell us something about G, | ▶ 00:49 |
but it doesn't tell us anything about D, | ▶ 00:51 |
and therefore, we have no reason to close the path between A and F. | ▶ 00:53 |
The path between A and F is still passive, even though we have knowledge of H. | ▶ 00:59 |
[Thrun] So congratulations. You learned a lot about Bayes networks. | ▶ 00:00 |
You learned about the graph structure of Bayes networks, | ▶ 00:03 |
you understood how this is a compact representation, | ▶ 00:06 |
you learned about conditional independence, | ▶ 00:10 |
and we talked a little bit about application of Bayes network | ▶ 00:12 |
to interesting reasoning problems. | ▶ 00:15 |
But by all means this was a mostly theoretical unit of this class, | ▶ 00:18 |
and in future classes we will talk more about applications. | ▶ 00:23 |
The instrument of Bayes networks is really essential to a number of problems. | ▶ 00:27 |
It really characterizes the sparse dependencies that exist in many real-world problems | ▶ 00:31 |
like in robotics and computer vision and filtering and diagnostics and so on. | ▶ 00:36 |
I really hope you enjoyed this class, | ▶ 00:41 |
and I really hope you understood in depth how Bayes networks work. | ▶ 00:43 |
[Probabilistic Inference] | ▶ 00:00 |
[Male] Welcome back. In the previous unit, we went over the basics | ▶ 00:02 |
of probability theory and saw how | ▶ 00:05 |
a Bayes network could concisely represent a joint probability distribution, | ▶ 00:12 |
including the representation of independence between the variables. | ▶ 00:17 |
In this unit, we will see how to do probabilistic inference. | ▶ 00:24 |
That is, how to answer probability questions using Bayes nets. | ▶ 00:31 |
Let's put up a simple Bayes net. | ▶ 00:36 |
We'll use the familiar example of the earthquake | ▶ 00:40 |
where we can have a burglary or an earthquake | ▶ 00:45 |
setting off an alarm, and if the alarm goes off, | ▶ 00:50 |
either John or Mary might call. | ▶ 00:53 |
Now, what kinds of questions can we ask to do inference about? | ▶ 00:58 |
The simplest type of question is the same question we ask | ▶ 01:02 |
with an ordinary subroutine or function in a programming language. | ▶ 01:05 |
Namely, given some inputs, what are the outputs? | ▶ 01:08 |
So, in this case, we could say given the inputs of B and E, | ▶ 01:12 |
what are the outputs, J and M? | ▶ 01:18 |
Rather than call them input and output variables, | ▶ 01:22 |
in probabilistic inference, we'll call them evidence and query variables. | ▶ 01:26 |
That is, the variables that we know the values of are the evidence, | ▶ 01:36 |
and the ones that we want to find out the values of are the query variables. | ▶ 01:39 |
Anything that is neither evidence nor query is known as a hidden variable. | ▶ 01:44 |
That is, we won't tell you what its value is. | ▶ 01:52 |
We won't figure out what its value is and report it, | ▶ 01:55 |
but we'll have to compute with it internally. | ▶ 01:58 |
And now furthermore, in probabilistic inference, | ▶ 02:01 |
the output is not a single number for each of the query variables, | ▶ 02:05 |
but rather, it's a probability distribution. | ▶ 02:10 |
So, the answer is going to be a complete, joint probability distribution | ▶ 02:13 |
over the query variables. | ▶ 02:17 |
We call this the posterior distribution, given the evidence, | ▶ 02:19 |
and we can write it like this. | ▶ 02:23 |
It's the probability distribution of one or more query variables | ▶ 02:26 |
given the values of the evidence variables. | ▶ 02:34 |
And there can be zero or more evidence variables, | ▶ 02:39 |
and each of them are given an exact value. | ▶ 02:42 |
And that's the computation we want to come up with. | ▶ 02:47 |
There's another question we can ask. | ▶ 02:53 |
Which is the most likely explanation? | ▶ 02:56 |
That is, out of all the possible values for all the query variables, | ▶ 02:58 |
which combination of values has the highest probability? | ▶ 03:03 |
We write the formula like this, asking which Q values | ▶ 03:08 |
maximize the probability given the evidence values. | ▶ 03:12 |
Now, in an ordinary programming language, each function goes only one way. | ▶ 03:16 |
It has input variables, does some computation, | ▶ 03:22 |
and comes up with a result variable or result variables. | ▶ 03:26 |
One great thing about Bayes nets is that we're not restricted | ▶ 03:31 |
to going only in one direction. | ▶ 03:34 |
We could go in the causal direction, giving as evidence | ▶ 03:36 |
the root nodes of the tree and asking as query values the nodes at the bottom. | ▶ 03:41 |
Or, we could reverse that causal flow. | ▶ 03:47 |
For example, we could have J and M be the evidence variables | ▶ 03:50 |
and B and E be the query variables, | ▶ 03:55 |
or we could have any other combination. | ▶ 03:58 |
For example, we could have M be the evidence variable | ▶ 04:01 |
and J and B be the query variables. | ▶ 04:05 |
Here's a question for you. | ▶ 04:11 |
Imagine the situation where Mary has called to report that the alarm is going off, | ▶ 04:13 |
and we want to know whether or not there has been a burglary. | ▶ 04:18 |
For each of the nodes, click on the circle to tell us | ▶ 04:22 |
if the node is an evidence node, a hidden node, | ▶ 04:27 |
or a query node. | ▶ 04:32 |
The answer is that Mary calling is the evidence node. | ▶ 00:00 |
The burglary is the query node, | ▶ 00:04 |
and all the others are hidden variables in this case. | ▶ 00:07 |
Now we're going to talk about how to do inference on Bayes net. | ▶ 00:00 |
We'll start with our familiar network, and we'll talk about a method | ▶ 00:04 |
called enumeration, | ▶ 00:08 |
which goes through all the possibilities, adds them up, | ▶ 00:12 |
and comes up with an answer. | ▶ 00:15 |
So, what we do is start by stating the problem. | ▶ 00:17 |
We're going to ask the question of what is the probability | ▶ 00:24 |
that the burglar alarm occurred given that John called and Mary called? | ▶ 00:27 |
We'll use the definition of conditional probability to answer this. | ▶ 00:34 |
So, this query is equal to the joint probability distribution | ▶ 00:39 |
of all 3 variables divided by the probability of the conditioning variables. | ▶ 00:47 |
Now, note I'm using a notation here where instead of writing out the probability | ▶ 00:55 |
of some variable equals true, I'm just using the notation plus | ▶ 01:01 |
and then the variable name in lower case, | ▶ 01:05 |
and if I wanted the negation, I would use negation sign. | ▶ 01:08 |
Notice there's a different notation where instead of writing out | ▶ 01:13 |
the plus and negation signs, we just use the variable name itself, P(e), | ▶ 01:17 |
to indicate E is true. | ▶ 01:22 |
That notation works well, but it can get confusing between | ▶ 01:25 |
does P(e) mean E is true, or does it mean E is a variable? | ▶ 01:29 |
And so we're going to stick to the notation where we explicitly have | ▶ 01:34 |
the pluses and negation signs. | ▶ 01:37 |
To do inference by enumeration, we first take a conditional probability | ▶ 01:41 |
and rewrite it as unconditional probabilities. | ▶ 01:45 |
Now we enumerate all the atomic probabilities and calculate the sum of products. | ▶ 01:49 |
Let's look at just the complex term on the numerator first. | ▶ 01:56 |
The procedure for figuring out the denominator would be similar, and we'll skip that. | ▶ 02:00 |
So, the probability of these 3 terms together | ▶ 02:05 |
can be determined by enumerating all possible values of the hidden variables. | ▶ 02:12 |
In this case, there are 2, E and A, | ▶ 02:17 |
so we'll sum over those variables for all values of E and for all values of A. | ▶ 02:22 |
In this case, they're boolean, so there's only 2 values of each. | ▶ 02:29 |
We ask what's the probability of this unconditional term? | ▶ 02:34 |
And that we get by summing out over all possibilities, | ▶ 02:41 |
E and A being true or false. | ▶ 02:44 |
Now, to get the values of these atomic events, | ▶ 02:49 |
we'll have to rewrite this equation in a form that corresponds | ▶ 02:52 |
to the conditional probability tables that we have associated with the Bayes net. | ▶ 02:55 |
So, we'll take this whole expression and rewrite it. | ▶ 03:00 |
It's still a sum over the hidden variables E and A, | ▶ 03:04 |
but now I'll rewrite this expression in terms of the parents | ▶ 03:08 |
of each of the nodes in the network. | ▶ 03:12 |
So, that gives us the product of these 5 terms, | ▶ 03:15 |
which we then have to sum over all values of E and A. | ▶ 03:21 |
If we call this product f(e,a), | ▶ 03:24 |
then the whole answer is the sum of F for all values of E and A, | ▶ 03:31 |
so as the sum of 4 terms where each of the terms is a product of 5 numbers. | ▶ 03:43 |
Where do we get the numbers to fill in this equation? | ▶ 03:51 |
From the conditional probability tables from our model, | ▶ 03:54 |
so let's put the equation back up, and we'll ask you for the case | ▶ 03:58 |
where both E and A are positive | ▶ 04:03 |
to look up in the conditional probability tables and fill in the numbers | ▶ 04:09 |
for each of these 5 terms, and then multiply them together and fill in the product. | ▶ 04:14 |
We get the answer by reading numbers off the conditional probability tables, | ▶ 00:00 |
so probability of B being positive is 0.001. | ▶ 00:04 |
Of E being positive, because we're dealing with the positive case now | ▶ 00:11 |
for the variable E, is 0.002. | ▶ 00:16 |
The probability of A being positive, because we're dealing with that case, | ▶ 00:22 |
given that B is positive and the case for an E is positive, | ▶ 00:26 |
that we can read off here as 0.95. | ▶ 00:30 |
The probability that J is positive given that A is positive is 0.9. | ▶ 00:37 |
And finally, the probability that M is positive given that A is positive | ▶ 00:44 |
we read off here as 0.7. | ▶ 00:50 |
We multiply all those together; it's going to be a small number | ▶ 00:54 |
because we've got the .001 and the .002 here. | ▶ 00:57 |
Can't quite fit it in the box, but it works out to .000001197. | ▶ 01:00 |
That seems like a really small number, but remember, | ▶ 01:12 |
we have to normalize by the P(+j,+m) term, | ▶ 01:14 |
and this is only 1 of the 4 possibilities. | ▶ 01:19 |
We have to enumerate over all 4 possibilities for E and A, | ▶ 01:22 |
and in the end, it works out that the probability of the burglar alarm being true | ▶ 01:26 |
given that John and Mary called, is 0.284. | ▶ 01:32 |
And we get that number because intuitively, | ▶ 01:38 |
it seems that the alarm is fairly reliable. | ▶ 01:42 |
John and Mary calling are very reliable, | ▶ 01:44 |
but the prior probability of burglary is low. | ▶ 01:47 |
And those 2 terms combine together to give us the 0.284 value | ▶ 01:49 |
when we sum up each of the 4 terms of these products. | ▶ 01:54 |
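Here is the full enumeration as a Python sketch. The lecture reads out only some of the conditional probability table entries (0.001, 0.002, 0.95, 0.9, 0.7); the remaining entries below are the values from the standard textbook version of this alarm network, so treat them as an assumption of this sketch rather than numbers quoted from the video.

```python
from itertools import product

P_B, P_E = 0.001, 0.002
# P(+a | B, E); only the (True, True) entry is read out in the lecture,
# the rest are the usual textbook values (assumed here).
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}  # P(+j | A)
P_M = {True: 0.70, False: 0.01}  # P(+m | A)

def joint(b, e, a, j, m):
    """Product of the five CPT factors, one per node."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Enumerate the hidden variables E and A, with evidence +j and +m.
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = num + sum(joint(False, e, a, True, True) for e, a in product([True, False], repeat=2))
print(num / den)  # ~0.284, the posterior probability of a burglary
```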
[Norvig] We've seen how to do enumeration to solve the inference problem | ▶ 00:00 |
on belief networks. | ▶ 00:04 |
For a simple network like the alarm network, that's all we need to know. | ▶ 00:06 |
There's only 5 variables, so even if all 5 of them were hidden, | ▶ 00:10 |
there would only be 32 rows in the table to sum up. | ▶ 00:14 |
From a theoretical point of view, we're done. | ▶ 00:20 |
But from a practical point of view, other networks could give us trouble. | ▶ 00:22 |
Consider this network, which is one for determining insurance for car owners. | ▶ 00:26 |
There are 27 different variables. | ▶ 00:35 |
If each of the variables were boolean, that would give us over 100 million rows to sum out. | ▶ 00:38 |
But in fact, some of the variables are non-boolean, | ▶ 00:44 |
they have multiple values, and it turns out that representing this entire network | ▶ 00:46 |
and doing enumeration we'd have to sum over a quadrillion rows. | ▶ 00:52 |
That's just not practical, so we're going to have to come up with methods | ▶ 00:57 |
that are faster than enumerating everything. | ▶ 01:01 |
The first technique we can use to get a speed-up in doing inference on Bayes nets | ▶ 01:04 |
is to pull out terms from the enumeration. | ▶ 01:09 |
For example, here the probability of b is going to be the same for all values of E and a. | ▶ 01:13 |
So we can take that term and move it out of the summation, | ▶ 01:20 |
and now we have a little bit less work to do. | ▶ 01:26 |
We can multiply by that term once rather than having it in each row of the table. | ▶ 01:28 |
We can also move this term, the P of e, to the left of the summation over a, | ▶ 01:33 |
because it doesn't depend on a. | ▶ 01:40 |
By doing this, we're doing less work. | ▶ 01:43 |
The inner loop of the summation now has only 3 terms rather than 5 terms. | ▶ 01:45 |
So we've reduced the cost of doing each row of the table. | ▶ 01:50 |
But we still have the same number of rows in the table, | ▶ 01:53 |
so we're going to have to do better than that. | ▶ 01:57 |
The next technique for efficient inference is to maximize independence of variables. | ▶ 02:00 |
The structure of a Bayes net determines how efficient it is to do inference on it. | ▶ 02:08 |
For example, a network that's a linear string of variables, | ▶ 02:12 |
X1 through Xn, can have inference done in time proportional to the number n, | ▶ 02:17 |
whereas a network that's a complete network | ▶ 02:27 |
where every node points to every other node and so on could take time 2 to the n | ▶ 02:31 |
if all n variables are boolean variables. | ▶ 02:40 |
In the alarm network we saw previously, we took care | ▶ 02:45 |
to make sure that we had all the independence relations represented | ▶ 02:50 |
in the structure of the network. | ▶ 02:54 |
But if we put the nodes together in a different order, | ▶ 02:57 |
we would end up with a different structure. | ▶ 03:00 |
Let's start by ordering the node John calls first | ▶ 03:03 |
and then adding in the node Mary calls. | ▶ 03:09 |
The question is, given just these 2 nodes and looking at the node for Mary calls, | ▶ 03:13 |
is that node dependent or independent of the node for John calls? | ▶ 03:19 |
[Norvig] The answer is that the node for Mary calls in this network | ▶ 00:01 |
is dependent on John calls. | ▶ 00:05 |
In the previous network, they were independent given that we knew that the alarm had occurred. | ▶ 00:08 |
But here we don't know that the alarm had occurred, | ▶ 00:13 |
and so the nodes are dependent | ▶ 00:16 |
because having information about one will affect the information about the other. | ▶ 00:18 |
[Norvig] Now we'll continue and we'll add the node A for alarm to the network. | ▶ 00:00 |
And what I want you to do is click on all the other variables | ▶ 00:05 |
that A is dependent on in this network. | ▶ 00:09 |
[Norvig] The answer is that alarm is dependent on both John and Mary. | ▶ 00:01 |
And so we can draw both nodes in, both arrows in. | ▶ 00:05 |
Intuitively that makes sense because if John calls, | ▶ 00:09 |
then it's more likely that the alarm has occurred, | ▶ 00:14 |
likewise if Mary calls, and if both called, it's really likely. | ▶ 00:16 |
So you can figure out the answer by intuitive reasoning, | ▶ 00:20 |
or you can figure it out by going to the conditional probability tables | ▶ 00:23 |
and seeing according to the definition of conditional probability | ▶ 00:27 |
whether the numbers work out. | ▶ 00:31 |
[Norvig] Now we'll continue and we'll add the node B for burglary | ▶ 00:01 |
and ask again, click on all the variables that B is dependent on. | ▶ 00:05 |
[Norvig] The answer is that B is dependent only on A. | ▶ 00:00 |
In other words, B is independent of J and M given A. | ▶ 00:04 |
[Norvig] And finally, we'll add the last node, E, | ▶ 00:00 |
and ask you to click on all the nodes that E is dependent on. | ▶ 00:04 |
[Norvig] And the answer is that E is dependent on A. | ▶ 00:00 |
That much is fairly obvious. | ▶ 00:04 |
But it's also dependent on B. | ▶ 00:06 |
Now, why is that? | ▶ 00:08 |
E is dependent on A because if the earthquake did occur, | ▶ 00:10 |
then it's more likely that the alarm would go off. | ▶ 00:13 |
On the other hand, E is also dependent on B | ▶ 00:16 |
because if a burglary occurred, then that would explain why the alarm is going off, | ▶ 00:19 |
and it would mean that the earthquake is less likely. | ▶ 00:23 |
[Norvig] The moral is that Bayes nets tend to be the most compact | ▶ 00:00 |
and thus the easiest to do inference on when they're written in the causal direction-- | ▶ 00:04 |
that is, when the networks flow from causes to effects. | ▶ 00:12 |
Let's return to this equation, which we use to show how to do inference by enumeration. | ▶ 00:00 |
In this equation, we join up the whole joint distribution | ▶ 00:06 |
before we sum out over the hidden variables. | ▶ 00:10 |
That's slow, because we end up repeating a lot of work. | ▶ 00:15 |
Now we're going to show a new technique called variable elimination, | ▶ 00:18 |
which in many networks operates much faster. | ▶ 00:25 |
It's still a difficult computation, an NP-hard computation, | ▶ 00:27 |
to do inference over Bayes nets in general. | ▶ 00:30 |
Variable elimination works faster than inference by enumeration | ▶ 00:34 |
in most practical cases. | ▶ 00:38 |
It requires an algebra for manipulating factors, | ▶ 00:41 |
which are just names for multidimensional arrays | ▶ 00:45 |
that come out of these probabilistic terms. | ▶ 00:48 |
We'll use another example to show how variable elimination works. | ▶ 00:53 |
We'll start off with a network that has 3 boolean variables. | ▶ 00:57 |
R indicates whether or not it's raining. | ▶ 01:00 |
T indicates whether or not there's traffic, | ▶ 01:04 |
and T is dependent on whether it's raining. | ▶ 01:12 |
And finally, L indicates whether or not I'll be late for my next appointment, | ▶ 01:15 |
and that depends on whether or not there's traffic. | ▶ 01:19 |
Now we'll put up the conditional probability tables for each of these 3 variables. | ▶ 01:22 |
And then we can use inference to figure out the answer to questions like | ▶ 01:29 |
am I going to be late? | ▶ 01:35 |
And we know by definition that we could do that through enumeration | ▶ 01:38 |
by going through all the possible values for R and T | ▶ 01:42 |
and summing up the product of these 3 nodes. | ▶ 01:47 |
Now, in a simple network like this, straight enumeration would work fine, | ▶ 01:54 |
but in a more complex network, what variable elimination does is give us a way | ▶ 01:59 |
to combine together parts of the network into smaller parts | ▶ 02:03 |
and then enumerate over those smaller parts and then continue combining. | ▶ 02:09 |
So, we start with a big network. | ▶ 02:13 |
We eliminate some of the variables. | ▶ 02:15 |
We compute by marginalizing out, and then we have a smaller network to deal with, | ▶ 02:17 |
and we'll show you how those 2 steps work. | ▶ 02:24 |
The first operation in variable elimination is called joining factors. | ▶ 02:28 |
A factor, again, is one of these tables. | ▶ 02:35 |
It's a multidimensional matrix, and what we do is choose 2 of the factors, | ▶ 02:39 |
2 or more of the factors. | ▶ 02:43 |
In this case, we'll choose these 2, and we'll combine them together | ▶ 02:45 |
to form a new factor which represents | ▶ 02:49 |
the joint probability of all the variables in that factor. | ▶ 02:52 |
In this case, R and T. | ▶ 02:56 |
Now we'll draw out that table. | ▶ 03:00 |
In each case, we just look up in the corresponding table, | ▶ 03:03 |
figure out the numbers, and multiply them together. | ▶ 03:06 |
For example, in this row we have a +r and a +t, | ▶ 03:08 |
so the +r is 0.1, and the entry for +r and +t is 0.8, | ▶ 03:13 |
so multiply them together and you get 0.08. | ▶ 03:19 |
Go all the way down. For example, in the last row we have a -r and a -t. | ▶ 03:22 |
-r is 0.9. The entry for -r and -t is also 0.9. | ▶ 03:28 |
Multiply those together and you get 0.81. | ▶ 03:34 |
So, what have we done? | ▶ 03:40 |
We used the operation of joining factors on these 2 factors, | ▶ 03:42 |
getting us a new factor which is part of the existing network. | ▶ 03:45 |
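Here is a minimal Python sketch of the joining step, using the factor tables read off the example (P(+r) = 0.1, P(+t | +r) = 0.8, P(+t | -r) = 0.1); the dictionary representation is just one convenient choice for a factor:

```python
# Factors as dictionaries from values (or value tuples) to probabilities.
P_R = {'+r': 0.1, '-r': 0.9}                              # P(R)
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,      # P(T | R)
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}

# Joining: pointwise multiplication of the rows that agree on the shared variable R.
f_RT = {(r, t): P_R[r] * p for (r, t), p in P_T_given_R.items()}

print({k: round(v, 2) for k, v in f_RT.items()})
# {('+r', '+t'): 0.08, ('+r', '-t'): 0.02, ('-r', '+t'): 0.09, ('-r', '-t'): 0.81}
```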
Now we want to apply a second operation called elimination, | ▶ 03:50 |
also called summing out or marginalization, to take this table and reduce it. | ▶ 03:56 |
Right now, the tables we have look like this. | ▶ 04:02 |
We could sum out or marginalize over the variable R | ▶ 04:06 |
to give us a table that just operates on T. | ▶ 04:10 |
So, the question is to fill in this table for P(T)-- | ▶ 04:14 |
there will be 2 entries in this table, the +t entry, formed by summing out | ▶ 04:20 |
all the entries here for all values of r for which t is positive, | ▶ 04:23 |
and the -t entry, formed the same way, by looking in this table | ▶ 04:28 |
and summing up all the rows over all values of r where t is negative. | ▶ 04:32 |
Put your answers in these boxes. | ▶ 04:37 |
The answer is that for +t we look up the 2 possible values for r, | ▶ 00:00 |
and we get 0.08 or 0.09. | ▶ 00:05 |
Sum those up, get 0.17, | ▶ 00:09 |
and then we look at the 2 possible values of R for -t, | ▶ 00:13 |
and we get 0.02 and 0.81. | ▶ 00:18 |
Add those up, and we get 0.83. | ▶ 00:22 |
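Continuing that sketch, summing out (marginalizing) R from the joined factor reproduces the 0.17 / 0.83 answer:

```python
# Joined factor f(R, T) from the previous step.
f_RT = {('+r', '+t'): 0.08, ('+r', '-t'): 0.02,
        ('-r', '+t'): 0.09, ('-r', '-t'): 0.81}

# Sum out R: for each value of T, add up the rows over all values of R.
P_T = {}
for (r, t), p in f_RT.items():
    P_T[t] = P_T.get(t, 0.0) + p

print({k: round(v, 2) for k, v in P_T.items()})   # {'+t': 0.17, '-t': 0.83}
```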
So, we took our network with RT and L. We summed out over R. | ▶ 00:00 |
That gives us a new network with T and L | ▶ 00:04 |
with these conditional probability tables. | ▶ 00:09 |
And now we want to do a join over T and L | ▶ 00:13 |
and give us a new table with the joint probability of P(T, L). | ▶ 00:17 |
And that table is going to look like this. | ▶ 00:25 |
The answer, again, for joining variables is determined by pointwise multiplication, | ▶ 00:00 |
so for +t and +l we have 0.17 times 0.3 is 0.051, | ▶ 00:05 |
and for +t and -l, 0.17 times 0.7 is 0.119. | ▶ 00:12 |
Then we go to the minuses. | ▶ 00:21 |
For -t and +l, 0.83 times 0.1 is 0.083. | ▶ 00:23 |
And finally, for -t and -l, 0.83 times 0.9 is 0.747. | ▶ 00:31 |
Now we're down to a network with a single node over T and L, | ▶ 00:00 |
with this joint probability table, and the only operation we have left to do | ▶ 00:06 |
is to sum out to give us a node with just L in it. | ▶ 00:12 |
So, the question is to compute P(L) for both values of L, | ▶ 00:17 |
+l and -l. | ▶ 00:26 |
The answer is that the +l values, | ▶ 00:00 |
0.051 plus 0.083 equals 0.134. | ▶ 00:03 |
And the negative values, 0.119 plus 0.747 | ▶ 00:11 |
equals 0.866. | ▶ 00:15 |
So, that's how variable elimination works. | ▶ 00:00 |
It's a continued process of joining together factors | ▶ 00:03 |
to form a larger factor and then eliminating variables by summing out. | ▶ 00:06 |
If we make a good choice of the order in which we apply these operations, | ▶ 00:11 |
then variable elimination can be much more efficient | ▶ 00:15 |
than just doing the whole enumeration. | ▶ 00:18 |
Now I want to talk about approximate inference | ▶ 00:00 |
by means of sampling. | ▶ 00:07 |
What do I mean by that? | ▶ 00:12 |
Say we want to deal with a joint probability distribution, | ▶ 00:14 |
say the distribution of heads and tails over these 2 coins. | ▶ 00:17 |
We can build a table and then start counting by sampling. | ▶ 00:24 |
Here we have our first sample. | ▶ 00:30 |
We flip the coins and the one-cent piece came up heads, | ▶ 00:32 |
and the five-cent piece came up tails, | ▶ 00:35 |
so we would mark down one count. | ▶ 00:39 |
Then we'd toss them again. | ▶ 00:42 |
This time the five cents is heads, and the one cent is tails, | ▶ 00:45 |
so we put down a count there, and we'd repeat that process | ▶ 00:50 |
and keep repeating it until we got enough counts that we could estimate | ▶ 01:00 |
the joint probability distribution by looking at the counts. | ▶ 01:06 |
Now, if we do a small number of samples, the counts might not be very accurate. | ▶ 01:11 |
There may be some random variation that causes them not to converge | ▶ 01:15 |
to their true values, but as we add more samples, | ▶ 01:19 |
the counts we get will come closer to the true distribution. | ▶ 01:25 |
Thus, sampling has an advantage over inference in that we know a procedure | ▶ 01:29 |
for coming up with at least an approximate value for the joint probability distribution, | ▶ 01:35 |
as opposed to exact inference, where the computation may be very complex. | ▶ 01:42 |
There's another advantage to sampling, which is if we don't know | ▶ 01:50 |
what the conditional probability tables are, as we did in our other models, | ▶ 01:53 |
if we don't know these numeric values, but we can simulate the process, | ▶ 01:59 |
we can still proceed with sampling, whereas we couldn't with exact inference. | ▶ 02:04 |
Here's a new network that we'll use to investigate | ▶ 00:00 |
how sampling can be used to do inference. | ▶ 00:05 |
In this network, we have 4 variables. They're all boolean. | ▶ 00:10 |
Cloudy tells us if it's cloudy or not outside, | ▶ 00:14 |
and that can have an effect on whether the sprinklers are turned on, | ▶ 00:17 |
and whether it's raining. | ▶ 00:21 |
And those 2 variables in turn have an effect on whether the grass gets wet. | ▶ 00:23 |
Now, to do inference over this network using sampling, | ▶ 00:28 |
we start off with a variable where all the parents are defined. | ▶ 00:34 |
In this case, there's only one such variable, Cloudy. | ▶ 00:38 |
And its conditional probability table tells us that the probability is 50% for Cloudy, | ▶ 00:42 |
50% for not Cloudy, and so we sample from that. | ▶ 00:48 |
We generate a random number, and let's say it comes up with positive for Cloudy. | ▶ 00:52 |
Now that variable is defined, we can choose another variable. | ▶ 00:59 |
In this case, let's choose Sprinkler, and we look at the rows in the table | ▶ 01:02 |
for which Cloudy, the parent, is positive, and we see we should sample | ▶ 01:08 |
with probability 10% for +s and 90% for -s. | ▶ 01:13 |
And so let's say we do that sampling with a random number generator, | ▶ 01:19 |
and it comes up negative for Sprinkler. | ▶ 01:23 |
Now let's jump over here. Look at the Rain variable. | ▶ 01:26 |
Again, the parent, Cloudy, is positive, | ▶ 01:29 |
so we're looking at this part of the table. | ▶ 01:34 |
We get a 0.8 probability for Rain being positive, | ▶ 01:38 |
and a 0.2 probability for Rain being negative. | ▶ 01:41 |
Let's say we sample that randomly, and it comes up Rain is positive. | ▶ 01:44 |
And now we're ready to sample the final variable, | ▶ 01:51 |
and what I want you to do is tell me which of the rows | ▶ 01:54 |
of this table should we be considering and tell me what's more likely. | ▶ 02:01 |
Is it more likely that we have a +w or a -w? | ▶ 02:07 |
The answer to the question is that we look at the parents. | ▶ 00:00 |
We find that the Sprinkler variable is negative, | ▶ 00:03 |
so we're looking at this part of the table. | ▶ 00:06 |
And the Rain variable is positive, so we're looking at this part. | ▶ 00:09 |
So, it would be these 2 rows that we would consider, | ▶ 00:14 |
and thus, we'd find there's a 0.9 probability for w, the grass being wet, | ▶ 00:18 |
and only 0.1 for it being negative, | ▶ 00:25 |
so the positive is more likely. | ▶ 00:28 |
And once we've done that, then we generated a complete sample, | ▶ 00:31 |
and we can write down the sample here. | ▶ 00:34 |
We had +c, -s, +r. | ▶ 00:37 |
And assuming the 0.9-probability outcome came up, in favor of +w, | ▶ 00:43 |
that would be the end of the sample. | ▶ 00:51 |
Then we could throw all this information out and start over again | ▶ 00:54 |
by having another 50/50 choice for cloudy and then working our way through the network. | ▶ 00:59 |
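A minimal Python sketch of this sampling procedure for the Cloudy/Sprinkler/Rain/WetGrass network follows. Only some of the table entries are stated in the lecture (P(+c) = 0.5, P(+s | +c) = 0.1, P(+r | +c) = 0.8, P(+w | +s, +r) = 0.99, P(+w | -s, +r) = 0.9); the remaining entries below are made-up placeholders so the sketch can run.

```python
import random

# Conditional probability tables; rows marked "assumed" are placeholders.
P_c = 0.5
P_s_given_c = {True: 0.1, False: 0.5}                      # P(+s | C); False row assumed
P_r_given_c = {True: 0.8, False: 0.2}                      # P(+r | C); False row assumed
P_w_given_sr = {(True, True): 0.99, (False, True): 0.9,    # P(+w | S, R)
                (True, False): 0.9, (False, False): 0.0}   # last two rows assumed

def sample_once():
    """Sample every variable in topological order, parents before children."""
    c = random.random() < P_c
    s = random.random() < P_s_given_c[c]
    r = random.random() < P_r_given_c[c]
    w = random.random() < P_w_given_sr[(s, r)]
    return c, s, r, w

# With many samples the counts approach the true joint distribution (consistency).
samples = [sample_once() for _ in range(100000)]
print(sum(w for _, _, _, w in samples) / len(samples))     # estimate of P(+w)
```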
Now, the probability of sampling a particular variable, | ▶ 00:00 |
choosing a +w or a -w, depends on the values of the parents. | ▶ 00:04 |
But those are chosen according to the conditional probability tables, | ▶ 00:10 |
so in the limit, the count of each sampled variable | ▶ 00:14 |
will approach the true probability. | ▶ 00:18 |
That is, with an infinite number of samples, this procedure computes the true | ▶ 00:20 |
joint probability distribution. | ▶ 00:24 |
We say that the sampling method is consistent. | ▶ 00:27 |
We can use this kind of sampling to compute the complete joint probability distribution, | ▶ 00:33 |
or we can use it to compute a value for an individual variable. | ▶ 00:38 |
But what if we wanted to compute a conditional probability? | ▶ 00:43 |
Say we wanted to compute the probability of wet grass | ▶ 00:47 |
given that it's not cloudy. | ▶ 00:53 |
To do that, the sample that we generated here wouldn't be helpful at all | ▶ 00:58 |
because it has to do with being cloudy, not with being not cloudy. | ▶ 01:03 |
So, we would cross this sample off the list. | ▶ 01:08 |
We would say that we reject the sample, and this technique is called rejection sampling. | ▶ 01:11 |
We go through ignoring any samples that don't match | ▶ 01:17 |
the conditional probabilities that we're interested in | ▶ 01:21 |
and keeping samples that do, say the sample -c, +s, +r, -w. | ▶ 01:24 |
We would just continue going through generating samples, | ▶ 01:34 |
crossing off the ones that don't match, keeping the ones that do. | ▶ 01:37 |
And this procedure would also be consistent. | ▶ 01:41 |
We call this procedure rejection sampling. | ▶ 01:46 |
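Here is how rejection sampling could be layered on top of that sampling sketch, for a query like P(+w | -c): generate complete samples, reject those whose Cloudy value contradicts the evidence, and count the rest. The CPT rows not given in the lecture are, again, made-up placeholders.

```python
import random

def sample_once():
    """Prior sample from the sprinkler network (partly assumed CPTs, as before)."""
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)              # -c row assumed
    r = random.random() < (0.8 if c else 0.2)              # -c row assumed
    w = random.random() < {(True, True): 0.99, (False, True): 0.9,
                           (True, False): 0.9, (False, False): 0.0}[(s, r)]
    return c, s, r, w

# Rejection sampling for P(+w | -c): keep only samples where Cloudy came out negative.
kept = [smp for smp in (sample_once() for _ in range(200000)) if not smp[0]]
print(sum(w for _, _, _, w in kept) / len(kept))           # estimate of P(+w | -c)
```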
But there's a problem with rejection sampling. | ▶ 00:00 |
If the evidence is unlikely, you end up rejecting a lot of the samples. | ▶ 00:03 |
Let's go back to the alarm network where we had variables for burglary and for an alarm | ▶ 00:08 |
and say we're interested in computing the probability of a burglary, | ▶ 00:16 |
given that the alarm goes off. | ▶ 00:22 |
The problem is that burglaries are very infrequent, | ▶ 00:25 |
so most of the samples we would get would end up being-- | ▶ 00:28 |
we start with generating a B, and we get a -b and then a -a. | ▶ 00:32 |
We go back and say does this match? | ▶ 00:39 |
No, we have to reject this sample, | ▶ 00:43 |
so we generate another sample, and we get another -b, -a. | ▶ 00:45 |
We reject that. We get another -b, -a. | ▶ 00:50 |
And we keep rejecting, and eventually we get a +b, | ▶ 00:54 |
but we'd end up spending a lot of time rejecting samples. | ▶ 01:00 |
So, we're going to introduce a new method called likelihood weighting | ▶ 01:04 |
that generates samples so that we can keep every one. | ▶ 01:13 |
With likelihood weighting, we fix the evidence variables. | ▶ 01:17 |
That is, we say that A will always be positive, | ▶ 01:20 |
and then we sample the rest of the variables, | ▶ 01:25 |
so then we get samples that we want. | ▶ 01:28 |
We would get a list like -b, +a, | ▶ 01:31 |
-b, +a, | ▶ 01:37 |
+b, +a. | ▶ 01:40 |
We get to keep every sample, but we have a problem. | ▶ 01:42 |
The resulting set of samples is inconsistent. | ▶ 01:46 |
We can fix that, however, by assigning a probability | ▶ 01:52 |
to each sample and weighting them correctly. | ▶ 01:56 |
In likelihood weighting, we're going to be collecting samples just like before, | ▶ 00:00 |
but we're going to add a probabilistic weight to each sample. | ▶ 00:05 |
Now, let's say we want to compute the probability of rain | ▶ 00:11 |
given that the sprinklers are on, and the grass is wet. | ▶ 00:17 |
We start as before. | ▶ 00:22 |
We make a choice for Cloudy, and let's say that, again, | ▶ 00:24 |
we choose Cloudy being positive. | ▶ 00:30 |
Now we want to make a choice for Sprinkler, | ▶ 00:33 |
but we're constrained to always choose Sprinkler being positive, | ▶ 00:37 |
so we'll make that choice. | ▶ 00:41 |
And we know we were dealing with Cloudy being positive, | ▶ 00:44 |
so we're in this row, and we're forced to make the choice of Sprinkler being positive, | ▶ 00:50 |
and that has a probability of only 0.1, so we'll put that 0.1 into the weight. | ▶ 00:56 |
Next, we'll look at the Rain variable, | ▶ 01:05 |
and here we're not constrained in any way, so we make a choice | ▶ 01:09 |
according to the probability tables with Cloudy being positive. | ▶ 01:13 |
And let's say that we choose the more popular choice, and Rain gets the positive value. | ▶ 01:19 |
Now, we look at Wet Grass. | ▶ 01:27 |
We're constrained to choose positive, and we know that the parents | ▶ 01:30 |
are also positive, so we're dealing with this row here. | ▶ 01:35 |
Since it's a constrained choice, we're going to add in or multiply in an additional weight, | ▶ 01:41 |
and I want you to tell me what that weight should be. | ▶ 01:47 |
The answer is we're looking for the probability | ▶ 00:00 |
of having a +w given a +s and a +r, | ▶ 00:04 |
so that's in this row, so it's 0.99. | ▶ 00:09 |
So, we take our old weight and multiply it by 0.99, | ▶ 00:16 |
gives us a final weight of 0.099 | ▶ 00:22 |
for a sample of +c, +s, +r and +w. | ▶ 00:28 |
When we include the weights, | ▶ 00:00 |
counting this sample that was forced to have a +s and a +w | ▶ 00:03 |
with a weight of 0.099, instead of counting it as a full one sample, | ▶ 00:08 |
we find that likelihood weighting is also consistent. | ▶ 00:14 |
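And here is the same query style with likelihood weighting, estimating P(+r | +s, +w): the evidence variables are fixed rather than sampled, and each sample carries the product of the probabilities of those forced choices as its weight (the 0.1 times 0.99 = 0.099 from the example). CPT rows not given in the lecture are placeholders.

```python
import random

P_s_given_c = {True: 0.1, False: 0.5}                      # -c row assumed
P_r_given_c = {True: 0.8, False: 0.2}                      # -c row assumed
P_w_given_sr = {(True, True): 0.99, (False, True): 0.9,
                (True, False): 0.9, (False, False): 0.0}   # partly assumed

def weighted_sample():
    """Fix the evidence Sprinkler = +s and WetGrass = +w; sample the rest and weight."""
    weight = 1.0
    c = random.random() < 0.5
    s = True                       # evidence: forced, so multiply its probability in
    weight *= P_s_given_c[c]
    r = random.random() < P_r_given_c[c]
    w = True                       # evidence: forced
    weight *= P_w_given_sr[(s, r)]
    return r, weight

samples = [weighted_sample() for _ in range(200000)]
print(sum(wt for r, wt in samples if r) / sum(wt for _, wt in samples))
# weighted estimate of P(+r | +s, +w); every sample is kept, none rejected
```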
Likelihood weighting is a great technique, | ▶ 00:00 |
but it doesn't solve all our problems. | ▶ 00:03 |
Suppose we wanted to compute the probability of C given +s and +r. | ▶ 00:05 |
In other words, we're constraining Sprinkler and Rain to always be positive. | ▶ 00:14 |
Since we use the evidence when we generate a node that has that evidence as parents, | ▶ 00:21 |
the Wet Grass node will always get good values based on that evidence. | ▶ 00:27 |
But the Cloudy node won't, and so it will be generating values at random | ▶ 00:31 |
without looking at these values, and most of the time, or some of the time, | ▶ 00:39 |
it will be generating values that don't go well with the evidence. | ▶ 00:44 |
Now, we won't have to reject them like we do in rejection sampling, | ▶ 00:48 |
but they'll have a low probability associated with them. | ▶ 00:51 |
A technique called Gibbs sampling, | ▶ 00:00 |
named after the physicist Josiah Gibbs, | ▶ 00:07 |
takes all the evidence into account and not just the upstream evidence. | ▶ 00:10 |
It uses a method called Markov Chain Monte Carlo, or MCMC. | ▶ 00:14 |
The idea is that we resample just one variable at a time | ▶ 00:26 |
conditioned on all the others. | ▶ 00:31 |
That is, we have a set of variables, | ▶ 00:33 |
and we initialize them to random values, keeping the evidence values fixed. | ▶ 00:37 |
Maybe we have values like this, | ▶ 00:44 |
and that constitutes one sample, and now, at each iteration through the loop, | ▶ 00:48 |
we select just one non-evidence variable and resample it | ▶ 00:54 |
based on all the other variables. | ▶ 01:01 |
And that will give us another sample, and repeat that again. | ▶ 01:04 |
Choose another variable. | ▶ 01:11 |
Resample that variable and repeat. | ▶ 01:15 |
We end up walking around in this space of assignments of variables randomly. | ▶ 01:21 |
Now, in rejection sampling and likelihood weighting, | ▶ 01:27 |
each sample was independent of the other samples. | ▶ 01:30 |
In MCMC, that's not true. | ▶ 01:34 |
The samples are dependent on each other, and in fact, | ▶ 01:37 |
adjacent samples are very similar. | ▶ 01:40 |
They only vary or differ in one place. | ▶ 01:42 |
However, the technique is still consistent. We won't show the proof for that. | ▶ 01:46 |
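To make the loop concrete, here is a small Gibbs sampler for the Rain/Traffic/Late chain from the variable-elimination example, with the evidence L = +l held fixed. For a chain this small the Markov-blanket conditionals can be written out by hand; a general Gibbs sampler would compute them from the CPTs automatically.

```python
import random

P_r = {True: 0.1, False: 0.9}                               # P(R)
P_t_given_r = {(True, True): 0.8, (True, False): 0.2,       # P(T | R), keyed (R, T)
               (False, True): 0.1, (False, False): 0.9}
P_l_given_t = {(True, True): 0.3, (True, False): 0.7,       # P(L | T), keyed (T, L)
               (False, True): 0.1, (False, False): 0.9}

def resample_r(t):
    """P(R | T = t) is proportional to P(R) * P(t | R)."""
    w_true = P_r[True] * P_t_given_r[(True, t)]
    w_false = P_r[False] * P_t_given_r[(False, t)]
    return random.random() < w_true / (w_true + w_false)

def resample_t(r, l):
    """P(T | R = r, L = l) is proportional to P(T | r) * P(l | T)."""
    w_true = P_t_given_r[(r, True)] * P_l_given_t[(True, l)]
    w_false = P_t_given_r[(r, False)] * P_l_given_t[(False, l)]
    return random.random() < w_true / (w_true + w_false)

l = True                                                    # evidence stays fixed
r, t = random.random() < 0.5, random.random() < 0.5         # random initialization
hits, N = 0, 100000
for _ in range(N):
    r = resample_r(t)              # resample one non-evidence variable at a time
    t = resample_t(r, l)
    hits += r
print(hits / N)    # approaches P(+r | +l); adjacent samples differ in only one variable
```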
Now, just one more thing. | ▶ 00:00 |
I can't help but describe what is probably the most famous probability problem of all. | ▶ 00:02 |
It's called the Monty Hall Problem after the game show host. | ▶ 00:07 |
And the idea is that you're on a game show, and there's 3 doors: | ▶ 00:11 |
door #1, door #2, and door #3. | ▶ 00:15 |
And behind each door is a prize, and you know that one of the doors | ▶ 00:20 |
contains an expensive sports car, which you would find desirable, | ▶ 00:26 |
and the other 2 doors contain a goat, which you would find less desirable. | ▶ 00:29 |
Now, say you're given a choice, and let's say you choose door #1. | ▶ 00:35 |
But according to the conventions of the game, the host, Monty Hall, | ▶ 00:42 |
will now open one of the doors, knowing that the door that he opens | ▶ 00:47 |
contains a goat, and he shows you door #3. | ▶ 00:52 |
And he now gives you the opportunity to stick with your choice | ▶ 00:57 |
or to switch to the other door. | ▶ 01:02 |
What I want you to tell me is, what is your probability of winning | ▶ 01:05 |
if you stick to door #1, and what is the probability of winning | ▶ 01:10 |
if you switched to door #2? | ▶ 01:15 |
The answer is that you have a 1/3 chance of winning if you stick with door #1 | ▶ 00:00 |
and a 2/3 chance if you switch to door #2. | ▶ 00:08 |
How do we explain that, and why isn't it 50/50? | ▶ 00:12 |
Well, it's true that there's 2 possibilities, | ▶ 00:16 |
but we've learned from probability that just because there are 2 options | ▶ 00:18 |
doesn't mean that both options are equally likely. | ▶ 00:22 |
It's easier to explain why the first door has a 1/3 probability | ▶ 00:26 |
because when you started, the car could be in any one of 3 places. | ▶ 00:30 |
You chose one of them. That probability was 1/3. | ▶ 00:34 |
And that probability hasn't been changed by the revealing of one of the other doors. | ▶ 00:37 |
Why is door #2 two-thirds? | ▶ 00:43 |
Well, one way to explain it is that the probability has to sum to 1, | ▶ 00:45 |
and if 1/3 is here, the 2/3 has to be here. | ▶ 00:49 |
But why doesn't the same argument that you use for 1 hold for 2? | ▶ 00:53 |
Why can't we say the probability of 2 holding the car | ▶ 00:58 |
was 1/3 before this door was revealed? | ▶ 01:03 |
Why has that changed 2 and has not changed 1? | ▶ 01:07 |
And the reason is because we've learned something about door #2. | ▶ 01:11 |
We've learned that it wasn't the door that was flipped over by the host, | ▶ 01:14 |
and so that additional information has updated the probability, | ▶ 01:18 |
whereas we haven't learned anything additional about door #1 | ▶ 01:22 |
because it was never an option that the host might switch door #1. | ▶ 01:26 |
And in fact, in this case, if we reveal the door, | ▶ 01:30 |
we find that's where the car actually is. | ▶ 01:37 |
So you see, learning probability may end up winning you something. | ▶ 01:40 |
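If you prefer to convince yourself empirically, here is a short simulation of the game; with enough plays the stick and switch strategies settle near 1/3 and 2/3.

```python
import random

def play(switch):
    """One round of the game; returns True if the player ends up with the car."""
    doors = [1, 2, 3]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Monty opens a door that is neither the player's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

N = 100000
print(sum(play(switch=False) for _ in range(N)) / N)   # close to 1/3
print(sum(play(switch=True) for _ in range(N)) / N)    # close to 2/3
```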
Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself | ▶ 00:00 |
in 1990 to Professor Lawrence Denenberg of Harvard | ▶ 00:07 |
who, with Harry Lewis, wrote a statistics book | ▶ 00:10 |
in which they used the Monty Hall Problem as an example, | ▶ 00:14 |
and they wrote to Monty asking him for permission to use his name. | ▶ 00:18 |
Monty kindly granted the permission, but in his letter, | ▶ 00:23 |
he writes, "As I see it, it wouldn't make any difference after the player | ▶ 00:26 |
has selected Door A, and having been shown Door C-- | ▶ 00:31 |
why should he then attempt to switch to Door B?" | ▶ 00:34 |
So, we see Monty Hall himself did not understand the Monty Hall Problem. | ▶ 00:38 |
[Thrun] Given the following Bayes network with P of A equal to 0.5, | ▶ 00:00 |
P of B given A equal to 0.2, | ▶ 00:06 |
and P of B given not A equal to 0.8, | ▶ 00:08 |
calculate the following probability. | ▶ 00:12 |
[Thrun] Consider a network of the following type: | ▶ 00:00 |
a variable, A, that is binary connects to three variables, X1, X2, and X3, | ▶ 00:03 |
that are also binary. | ▶ 00:10 |
The probability of A is 0.5, and for every variable Xi we have the probability of Xi given A is 0.2, | ▶ 00:12 |
and the probability of Xi given not A equals 0.6. | ▶ 00:24 |
I would like to know from you the probability of A | ▶ 00:29 |
given that we observed X1, X2, and not X3. | ▶ 00:31 |
Notice that these variables over here are conditionally independent given A. | ▶ 00:37 |
[Thrun] Let us consider the same network again. | ▶ 00:00 |
I would like to know the probability of X3 given that I observed X1. | ▶ 00:03 |
[Thrun] In this next homework assignment I will be drawing you a Bayes network | ▶ 00:00 |
and will ask you some conditional independence questions. | ▶ 00:04 |
Is B conditionally independent of C? And say yes or no. | ▶ 00:09 |
Is B conditionally independent of C given D? And say yes or no. | ▶ 00:14 |
Is B conditionally independent of C given A? And say yes or no. | ▶ 00:19 |
And is B conditionally independent of C given A and D? And say yes or no. | ▶ 00:24 |
[Thrun] Consider the following network. | ▶ 00:00 |
I would like to know whether the following statements are true or false. | ▶ 00:02 |
C is conditionally independent of E given A. | ▶ 00:08 |
B is conditionally independent of D given C and E. | ▶ 00:12 |
A is conditionally independent of C given E. | ▶ 00:18 |
And A is conditionally independent of C given B. | ▶ 00:21 |
Please check yes or no for each of these questions. | ▶ 00:25 |
[Thrun] In my final question I'll look at the exact same network as before, | ▶ 00:00 |
but I would like to know the minimum number of numerical parameters | ▶ 00:04 |
such as the values to define probabilities and conditional probabilities | ▶ 00:08 |
that are necessary to specify the joint distribution of all 5 variables. | ▶ 00:13 |
[Thrun] The answer is 0.2, | ▶ 00:00 |
and this follows directly from Bayes' rule. | ▶ 00:03 |
In this formula, we can read off the first 2 values straight from the table over here, | ▶ 00:07 |
and we expand the denominator by total probability. | ▶ 00:11 |
Observing that this is exactly the same expression as up here, | ▶ 00:15 |
we get 0.1 divided by 0.1 plus this expression over here can be copied from over here, | ▶ 00:19 |
and P of not A is directly obtained up here. | ▶ 00:27 |
Hence we get 0.5 over here, and as a result we get 0.2. | ▶ 00:30 |
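Written out as a single line, the calculation described here is Bayes' rule with the denominator expanded by total probability:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)} = \frac{0.2 \cdot 0.5}{0.2 \cdot 0.5 + 0.8 \cdot 0.5} = \frac{0.1}{0.5} = 0.2$$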
[Thrun] For this question we will be exploring a little trick | ▶ 00:00 |
about non-normalized probability. | ▶ 00:03 |
We will observe that P of A given X1, X2 and not X3, | ▶ 00:05 |
the expression on the left can be resolved by Bayes' rule into this expression over here. | ▶ 00:11 |
We will take X3 to the left and replace it by A, | ▶ 00:16 |
both conditioned on the variables X1 and X2. | ▶ 00:20 |
Then we have P of A given X1, X2 divided by P of not X3 given X1, X2. | ▶ 00:23 |
Next we employ 2 things. | ▶ 00:29 |
One is the denominator does not depend on A, | ▶ 00:31 |
so whether I put an A or not A has no bearing on any calculation here, | ▶ 00:34 |
which means I can defer its calculation until later, and it will turn out to be important. | ▶ 00:39 |
So I'm going to be proportional to just the stuff over here. | ▶ 00:44 |
And second, I export my conditional independence | ▶ 00:49 |
whereby I can omit X1 and X2 from the probability of not X3 conditioned on A. | ▶ 00:52 |
These variables are conditionally independent. | ▶ 00:58 |
This gives me the following recursion | ▶ 01:02 |
where I now removed the third variable from the estimation problem | ▶ 01:05 |
and just retained the first 2 relative to my initial expression. | ▶ 01:10 |
If I keep expanding this, I get the following solution. | ▶ 01:14 |
P of not X3 given A, P X2 given A, P X1 given A times P of A. | ▶ 01:19 |
You might take a minute to just verify this, | ▶ 01:27 |
but this is exploiting the conditional independence | ▶ 01:30 |
very much as in the first step I showed you over here. | ▶ 01:32 |
This step lacks the normalizer, | ▶ 01:35 |
so let me work on the normalizer by expressing the opposite probability, | ▶ 01:38 |
P of not A given the same events, X1, X2, and not X3, | ▶ 01:44 |
which resolves to P of not X3 given not A, | ▶ 01:50 |
P of X2 given not A, P of X1 given not A, | ▶ 01:54 |
and P of not A. | ▶ 02:00 |
I can now plug in the values from above. | ▶ 02:02 |
So the first term gives me 0.8 times 0.2 times 0.2 times 0.5. | ▶ 02:04 |
In the second term I get 0.4 times 0.6 times 0.6 times 0.5, | ▶ 02:15 |
which resolves to 0.016 and 0.072. | ▶ 02:24 |
This is clearly not a probability because we left out the normalizer. | ▶ 02:31 |
But as we know, the normalizer does not depend on whether I put A or not A in here. | ▶ 02:36 |
As a result, it will be the same for both of these expressions, | ▶ 02:40 |
and I can obtain it by just adding these non-normalized probabilities | ▶ 02:44 |
and then subsequently divide these non-normalized probabilities accordingly. | ▶ 02:47 |
So let me just do this. | ▶ 02:52 |
We get for the desired probability over here 0.1818 | ▶ 02:55 |
and for the inverse probability over here 0.8182. | ▶ 03:01 |
Our desired answer therefore is 0.1818. | ▶ 03:08 |
This was not an easy question. | ▶ 03:14 |
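The arithmetic of the trick can be checked in a few lines of Python; the two non-normalized products are computed first and only divided by their sum at the end, which is where the deferred normalizer comes back in.

```python
# Non-normalized terms for A and not-A given X1, X2, not-X3,
# with P(A) = 0.5, P(Xi | A) = 0.2, P(Xi | not A) = 0.6.
p_a = (1 - 0.2) * 0.2 * 0.2 * 0.5        # P(-x3|A) P(x2|A) P(x1|A) P(A)     = 0.016
p_not_a = (1 - 0.6) * 0.6 * 0.6 * 0.5    # P(-x3|-A) P(x2|-A) P(x1|-A) P(-A) = 0.072

# Normalize at the end: the shared normalizer cancels out of the comparison.
print(round(p_a / (p_a + p_not_a), 4))   # 0.1818
```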
[Thrun] The answer is a little bit involved. | ▶ 00:00 |
We use total probability to re-express this by bringing in A. | ▶ 00:03 |
P of X3 given X1 is the sum of P of X3 given X1 and A | ▶ 00:08 |
times P of A given X1, plus the not-A term, P of X3 given X1 and not A, | ▶ 00:15 |
times P of not A given X1. | ▶ 00:22 |
That is just total probability. | ▶ 00:24 |
Next we utilized conditional independence by which we can simplify this expression | ▶ 00:26 |
to drop X1 in the conditional variables | ▶ 00:30 |
and we transform this expression by Bayes' rule again. | ▶ 00:33 |
The same applies to the right side with not A replacing A. | ▶ 00:36 |
All of those expressions over here can be found | ▶ 00:41 |
either in the table up there or just by their complements, | ▶ 00:45 |
with the exception of P of X1. | ▶ 00:49 |
But P of X1 can again be just obtained by total probability, | ▶ 00:52 |
which resolves to 0.2 times 0.5 plus 0.6 times 0.5, | ▶ 00:58 |
which gives me 0.4. | ▶ 01:11 |
We are now in a position to calculate the last term over here, which goes as follows. | ▶ 01:13 |
This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4, | ▶ 01:19 |
which gives us as a final result 0.5. | ▶ 01:36 |
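The same answer can be checked numerically by following the total-probability expansion step by step:

```python
# P(X3 | X1) by total probability over A, with P(A) = 0.5, P(Xi | A) = 0.2, P(Xi | not A) = 0.6.
p_x1 = 0.2 * 0.5 + 0.6 * 0.5                          # P(X1) = 0.4
p_a_given_x1 = 0.2 * 0.5 / p_x1                       # Bayes' rule: 0.25
p_not_a_given_x1 = 0.6 * 0.5 / p_x1                   # 0.75
print(0.2 * p_a_given_x1 + 0.6 * p_not_a_given_x1)    # 0.5
```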
[Thrun] And the answer is as follows. | ▶ 00:00 |
No, no, yes, and no. | ▶ 00:02 |
B and C in the absence of any other information are dependent through A, | ▶ 00:06 |
which is if you learn something about B, you can infer something about A, | ▶ 00:11 |
and then we'll know more about C. | ▶ 00:17 |
If you know D, that doesn't change a thing. | ▶ 00:20 |
You can just take D out of the pool. | ▶ 00:22 |
If you know A, B and C become conditionally independent. | ▶ 00:24 |
This dependence goes away, and ignorance of D doesn't render B and C dependent. | ▶ 00:29 |
However, if we add D back to the mix, | ▶ 00:36 |
then knowledge of D will render B and C dependent by way of the explaining away effect. | ▶ 00:39 |
[Thrun] So the correct answer is tricky in this case. | ▶ 00:00 |
It is no, no, no, and yes. | ▶ 00:03 |
The first one is straightforward. | ▶ 00:07 |
C and E are conditionally independent based on D, | ▶ 00:09 |
and knowledge of A doesn't change anything. | ▶ 00:13 |
B and D are conditionally independent through A, | ▶ 00:15 |
and knowledge of C or E doesn't change that. | ▶ 00:20 |
A and C is interesting. | ▶ 00:23 |
A and C is independent. But if you know D, they become dependent. | ▶ 00:25 |
It turns out if you know E, you can know something about D, | ▶ 00:29 |
and as a result, A and C become dependent through the explain away effect. | ▶ 00:32 |
That doesn't apply if you know B. | ▶ 00:37 |
Even though B tells you something about E, | ▶ 00:39 |
it tells you nothing about D because B and D are independent. | ▶ 00:42 |
Therefore, knowing B tells you nothing about D, | ▶ 00:46 |
and the explain away effect does not occur between A and C. | ▶ 00:49 |
The answer here is yes. | ▶ 00:52 |
[Thrun] The correct answer is 16. | ▶ 00:00 |
The probability of A and C require 1 parameter each. | ▶ 00:03 |
The complements, not A and not C, follow as 1 minus that parameter. | ▶ 00:06 |
This guy over here requires 2 parameters. | ▶ 00:12 |
You need to know the probability of B given A and B given not A. | ▶ 00:15 |
The complements can be obtained easily. | ▶ 00:18 |
The probability of D is conditioned on 2 variables which can take 4 possible values. | ▶ 00:20 |
Hence the number is 4. | ▶ 00:24 |
And E is conditioned on 3 variables, so it can take a total of 8 different values, | ▶ 00:26 |
2 to the 3rd, which is 8. | ▶ 00:30 |
If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16. | ▶ 00:32 |
Welcome to the machine learning unit. | ▶ 00:00 |
Machine learning is a fascinating area. | ▶ 00:03 |
The world has become immeasurably data-rich. | ▶ 00:06 |
The world wide web has come up over the last decade. | ▶ 00:09 |
The human genome is being sequenced. | ▶ 00:12 |
Vast chemical databases, pharmaceutical databases, | ▶ 00:15 |
and financial databases are now available | ▶ 00:19 |
on a scale unthinkable even 5 years ago. | ▶ 00:22 |
To make sense out of the data, | ▶ 00:26 |
to extract information from the data, | ▶ 00:28 |
machine learning is the discipline to turn to. | ▶ 00:30 |
Machine learning is an important subfield of artificial intelligence; | ▶ 00:33 |
it's my personal favorite next to robotics | ▶ 00:37 |
because I believe it has a huge impact on society | ▶ 00:40 |
and is absolutely necessary as we move forward. | ▶ 00:43 |
So in this class, I teach you some of the very basics of | ▶ 00:47 |
machine learning, and in our next unit | ▶ 00:50 |
Peter will tell you some more about machine learning. | ▶ 00:52 |
We'll talk about supervised learning, which is one side of machine learning, | ▶ 00:56 |
and Peter will tell you about unsupervised learning, | ▶ 01:00 |
which is a different style. | ▶ 01:02 |
Later in this class we will also encounter reinforcement learning, | ▶ 01:05 |
which is yet another style of machine learning. | ▶ 01:07 |
Anyhow, let's just dive in. | ▶ 01:10 |
Welcome to the first class on machine learning. | ▶ 00:00 |
So far we talked a lot about Bayes Networks. | ▶ 00:03 |
And the way we talked about them | ▶ 00:07 |
is all about reasoning within Bayes Networks | ▶ 00:10 |
that are known. | ▶ 00:14 |
Machine learning addresses the problem | ▶ 00:15 |
of how to find those networks | ▶ 00:17 |
or other models | ▶ 00:19 |
based on data. | ▶ 00:20 |
Learning models from data | ▶ 00:22 |
is a major, major area of artificial intelligence | ▶ 00:25 |
and it's perhaps the one | ▶ 00:29 |
that had the most commercial success. | ▶ 00:31 |
In many commercial applications | ▶ 00:33 |
the models themselves are fitted | ▶ 00:37 |
based on data. | ▶ 00:39 |
For example, Google | ▶ 00:40 |
uses data to understand | ▶ 00:42 |
how to respond to each search query. | ▶ 00:44 |
Amazon uses data | ▶ 00:46 |
to understand how to place products on their website. | ▶ 00:49 |
And these machine learning techniques | ▶ 00:52 |
are the enabling techniques that make that possible. | ▶ 00:53 |
So this class | ▶ 00:56 |
which is about supervised learning | ▶ 00:57 |
will go through some very basic methods | ▶ 00:59 |
for learning models from data | ▶ 01:02 |
in particular, specific types of Bayes Networks. | ▶ 01:04 |
We will complement this | ▶ 01:06 |
with a class on unsupervised learning | ▶ 01:08 |
that will be taught next | ▶ 01:10 |
after this class. | ▶ 01:14 |
Let me start off with a quiz. | ▶ 01:15 |
The quiz is: What companies are famous | ▶ 01:18 |
for machine learning using data? | ▶ 01:20 |
Google for mining the web. | ▶ 01:24 |
Netflix for mining what people | ▶ 01:29 |
would like to rent on DVDs. | ▶ 01:31 |
Which is DVD recommendations. | ▶ 01:36 |
Amazon.com for product placement. | ▶ 01:40 |
Check any or all | ▶ 01:45 |
and if none of those apply | ▶ 01:47 |
check down here. | ▶ 01:49 |
And, not surprisingly, the answer is | ▶ 00:00 |
all of those companies and many, many, many more | ▶ 00:03 |
use massive machine learning for making decisions | ▶ 00:06 |
that are really essential to the businesses. | ▶ 00:09 |
Google mines the web and uses machine learning for translation, | ▶ 00:12 |
as we've seen in the introductory level. Netflix has used | ▶ 00:15 |
machine learning extensively for understanding what type of DVD to recommend to you next. | ▶ 00:18 |
Amazon composes its entire product pages using | ▶ 00:22 |
machine learning by understanding how customers | ▶ 00:25 |
respond to different compositions and placements of their products, | ▶ 00:28 |
and many, many other examples exist. | ▶ 00:31 |
I would argue that in Silicon Valley, | ▶ 00:35 |
at least half the companies dealing with customers and online products | ▶ 00:37 |
do extensively use machine learning, | ▶ 00:41 |
so it makes machine learning a really exciting discipline. | ▶ 00:43 |
In my own research, I've extensively used machine learning for robotics. | ▶ 00:00 |
What you see here is a robot my students and I built at Stanford | ▶ 00:05 |
called Stanley, and it won the DARPA Grand Challenge. | ▶ 00:08 |
It's a self-driving car that drives without any human assistance whatsoever, | ▶ 00:12 |
and this vehicle extensively uses machine learning. | ▶ 00:16 |
The robot is equipped with a laser system | ▶ 00:22 |
I will talk more about lasers in my robotics class, | ▶ 00:25 |
but here you can see how the robot is able to build | ▶ 00:28 |
3-D models of the terrain ahead. | ▶ 00:31 |
These are almost like video game models that allow it to make | ▶ 00:34 |
assessments where to drive and where not to drive. | ▶ 00:37 |
Essentially, it's trying to drive on flat ground. | ▶ 00:39 |
The problem with these lasers is that they don't see very far. | ▶ 00:43 |
They see about 25 meters out, so to drive really fast | ▶ 00:46 |
the robot has to see further. | ▶ 00:50 |
This is where machine learning comes into play. | ▶ 00:53 |
What you see here is camera images delivered by the robot | ▶ 00:56 |
superimposed with laser data that doesn't see very far, | ▶ 00:58 |
but the laser is good enough to extract samples | ▶ 01:01 |
of driveable road surface that can then be machine learned | ▶ 01:04 |
and extrapolated into the entire camera image. | ▶ 01:08 |
That enables the robot to use the camera | ▶ 01:10 |
to see driveable terrain all the way to the horizon | ▶ 01:13 |
up to like 200 meters out, enough to drive really, really fast. | ▶ 01:16 |
This ability to adapt its vision by deriving its own training examples from the lasers | ▶ 01:22 |
but seeing out 200 meters or more | ▶ 01:27 |
was a key factor in winning the race. | ▶ 01:30 |
Machine learning is a very large field | ▶ 00:00 |
with many different methods | ▶ 00:03 |
and many different applications. | ▶ 00:04 |
I will now define some of the very basic terminology | ▶ 00:06 |
that is being used to distinguish | ▶ 00:10 |
different machine learning methods. | ▶ 00:12 |
Let's start with the what. | ▶ 00:13 |
What is being learned? | ▶ 00:17 |
You can learn parameters | ▶ 00:19 |
like the probabilities of a Bayes Network. | ▶ 00:23 |
You can learn structure | ▶ 00:26 |
like the arc structure of a Bayes Network. | ▶ 00:27 |
And you might even discover hidden concepts. | ▶ 00:31 |
For example | ▶ 00:34 |
you might find that certain training examples | ▶ 00:35 |
form a hidden group. | ▶ 00:37 |
For example, with Netflix data, | ▶ 00:39 |
you might find that there are different types of customers, | ▶ 00:41 |
some that care about classic movies | ▶ 00:43 |
some of them care about modern movies | ▶ 00:45 |
and those might form hidden concepts | ▶ 00:47 |
whose discovery can really help you | ▶ 00:49 |
make better sense of the data. | ▶ 00:51 |
Next is what from? | ▶ 00:53 |
Every machine learning method | ▶ 00:57 |
is driven by some sort of target information | ▶ 01:00 |
that you care about. | ▶ 01:02 |
In supervised learning | ▶ 01:03 |
which is the subject of today's class | ▶ 01:06 |
we're given specific target labels | ▶ 01:08 |
and I give you examples just in a second. | ▶ 01:10 |
We also talk about unsupervised learning | ▶ 01:13 |
where target labels are missing | ▶ 01:15 |
and we use replacement principles | ▶ 01:19 |
to find, for example | ▶ 01:21 |
hidden concepts. | ▶ 01:22 |
Later there will be a class in reinforcement learning | ▶ 01:24 |
when an agent learns from feedback with the physical environment | ▶ 01:27 |
by interacting and trying actions | ▶ 01:32 |
and receiving some sort of evaluation | ▶ 01:34 |
from the environment | ▶ 01:37 |
like "Well done" or "That works." | ▶ 01:37 |
Again, we will talk about those in detail later. | ▶ 01:41 |
There's different things you could try to do | ▶ 01:43 |
with machine learning technique. | ▶ 01:46 |
You might care about prediction. | ▶ 01:48 |
For example, you might want to predict what's going to happen in the future | ▶ 01:49 |
in the stock market. | ▶ 01:53 |
You might care to diagnose something | ▶ 01:55 |
which is you get data and you wish to explain it | ▶ 01:57 |
and you use machine learning for that. | ▶ 01:59 |
Sometimes your objective is to summarize something. | ▶ 02:01 |
For example if you read a long article | ▶ 02:04 |
your machine learning method might aim to | ▶ 02:07 |
produce a short article that summarizes the long article. | ▶ 02:09 |
And there's many, many, many more different things. | ▶ 02:12 |
You can talk about the how to learn. | ▶ 02:14 |
We use the word passive | ▶ 02:16 |
if your learning agent is just an observer | ▶ 02:19 |
and has no impact on the data itself. | ▶ 02:23 |
Otherwise, you call it active. | ▶ 02:24 |
Sometimes learning occurs online | ▶ 02:26 |
which means while the data is being generated | ▶ 02:30 |
and some of it is offline | ▶ 02:32 |
which means learning occurs | ▶ 02:35 |
after the data has been generated. | ▶ 02:37 |
There's different types of outputs | ▶ 02:39 |
of a machine learning algorithm. | ▶ 02:42 |
Today we'll talk about classification | ▶ 02:44 |
versus regression. | ▶ 02:47 |
In classification the output is binary | ▶ 02:50 |
or a fixed number of classes | ▶ 02:53 |
for example something is either a chair or not. | ▶ 02:55 |
Regression is continuous. | ▶ 02:57 |
The temperature might be 66.5 degrees | ▶ 02:59 |
in our prediction. | ▶ 03:01 |
And there's tons of internal details | ▶ 03:03 |
we will talk about. | ▶ 03:05 |
Just to name one. | ▶ 03:07 |
We will distinguish generative | ▶ 03:09 |
from discriminative. | ▶ 03:12 |
Generative seeks to model the data | ▶ 03:14 |
as generally as possible | ▶ 03:16 |
versus discriminative methods | ▶ 03:18 |
seek to distinguish data | ▶ 03:20 |
and this might sound like a superficial distinction | ▶ 03:21 |
but it has enormous ramification | ▶ 03:24 |
on the learning algorithm. | ▶ 03:26 |
Now to tell you the truth | ▶ 03:27 |
it took me many years | ▶ 03:29 |
to fully learn all these words here | ▶ 03:30 |
and I don't expect you to pick them all up | ▶ 03:33 |
in one class | ▶ 03:36 |
but you should as well know that they exist. | ▶ 03:37 |
And as they come up | ▶ 03:39 |
I'll emphasize them | ▶ 03:41 |
so you can sort any learning method | ▶ 03:42 |
I tell you about back into the specific taxonomy over here. | ▶ 03:44 |
The vast amount of work in the field | ▶ 00:00 |
falls into the area of supervised learning. | ▶ 00:02 |
In supervised learning | ▶ 00:06 |
you're given for each training example | ▶ 00:08 |
a feature vector | ▶ 00:10 |
and a target label named Y. | ▶ 00:13 |
For example, for a credit rating agency | ▶ 00:16 |
X1, X2, X3 might be a feature | ▶ 00:20 |
such as is the person employed? | ▶ 00:23 |
What is the salary of the person? | ▶ 00:25 |
Has the person previously defaulted on a credit card? | ▶ 00:27 |
And so on. | ▶ 00:30 |
And Y is a predictor | ▶ 00:32 |
whether the person is to default | ▶ 00:34 |
on the credit or not. | ▶ 00:36 |
Now machine learning | ▶ 00:38 |
is to be carried out on past data | ▶ 00:40 |
where the credit rating agency | ▶ 00:42 |
might have collected features just like these | ▶ 00:44 |
and actual occurrences of default or not. | ▶ 00:46 |
What it wishes to produce | ▶ 00:49 |
is a function that allows us | ▶ 00:51 |
to predict future customers. | ▶ 00:53 |
So the new person comes in | ▶ 00:55 |
with a different feature vector. | ▶ 00:56 |
Can we predict as well as possible | ▶ 00:58 |
the functional relationship | ▶ 01:00 |
between these features X1 to Xn all the way to Y? | ▶ 01:02 |
You can apply the exact same example | ▶ 01:05 |
in image recognition | ▶ 01:08 |
where X might be pixels of images | ▶ 01:09 |
or it might be features of things found in images | ▶ 01:11 |
and Y might be a label that says | ▶ 01:14 |
whether a certain object is contained | ▶ 01:16 |
in an image or not. | ▶ 01:17 |
Now in supervised learning | ▶ 01:19 |
you're given many such examples. | ▶ 01:20 |
X21 to X2n | ▶ 01:25 |
leads to Y2 | ▶ 01:28 |
all the way to index m. | ▶ 01:32 |
This is called your data. | ▶ 01:35 |
If we call each input vector Xm, | ▶ 01:38 |
then we wish to find the function that, | ▶ 01:43 |
given any Xm or any future vector X, | ▶ 01:44 |
produces as closely as possible | ▶ 01:50 |
my target signal Y. | ▶ 01:53 |
Now this isn't always possible | ▶ 01:55 |
and sometimes it's acceptable | ▶ 01:57 |
in fact preferable | ▶ 01:59 |
to tolerate a certain amount of error | ▶ 02:00 |
in your training data. | ▶ 02:03 |
But the subject of machine learning | ▶ 02:05 |
is to identify this function over here. | ▶ 02:07 |
And once you identify it | ▶ 02:10 |
you can use it for future Xs | ▶ 02:11 |
that weren't part of the training set | ▶ 02:13 |
to produce a prediction | ▶ 02:16 |
that hopefully is really, really good. | ▶ 02:19 |
So let me ask you a question. | ▶ 02:21 |
And this is a question | ▶ 02:24 |
for which I haven't given you the answer | ▶ 02:27 |
but I'd like to appeal to your intuition. | ▶ 02:28 |
Here's one data set | ▶ 02:31 |
where the X is one dimensionally plotted horizontally | ▶ 02:34 |
and the Y is vertically | ▶ 02:37 |
and suppose it looks like this. | ▶ 02:39 |
Suppose my machine learning algorithm | ▶ 02:44 |
gives me 2 hypotheses. | ▶ 02:45 |
One is this function over here | ▶ 02:47 |
which is a linear function | ▶ 02:51 |
and one is this function over here. | ▶ 02:52 |
I'd like to know which of the functions | ▶ 02:53 |
you find preferable | ▶ 02:57 |
as an explanation for the data. | ▶ 02:59 |
Is it function A? | ▶ 03:01 |
Or function B? | ▶ 03:02 |
Check here for A | ▶ 03:06 |
here for B | ▶ 03:08 |
and here for neither. | ▶ 03:09 |
And I hope you guessed function A. | ▶ 00:00 |
Even though both perfectly describe the data | ▶ 00:04 |
B is much more complex than A. | ▶ 00:08 |
In fact, outside the data | ▶ 00:10 |
B seems to go to minus infinity much faster | ▶ 00:12 |
than these data points | ▶ 00:16 |
and to plus infinity much faster | ▶ 00:17 |
with these data points over here. | ▶ 00:19 |
And in between | ▶ 00:21 |
we have wide oscillations | ▶ 00:22 |
that don't correspond to any data. | ▶ 00:23 |
So I would argue | ▶ 00:25 |
A is preferable. | ▶ 00:27 |
The reason why I asked this question | ▶ 00:31 |
is because of something called Occam's Razor. | ▶ 00:32 |
Occam can be spelled in many different ways. | ▶ 00:35 |
And what Occam says is that | ▶ 00:38 |
everything else being equal | ▶ 00:41 |
choose the less complex hypothesis. | ▶ 00:43 |
Now in practice | ▶ 00:46 |
there's actually a trade-off | ▶ 00:48 |
between a really good data fit | ▶ 00:50 |
and low complexity. | ▶ 00:53 |
Let me illustrate this to you | ▶ 00:55 |
by a hypothetical example. | ▶ 00:58 |
Consider the following graph | ▶ 00:59 |
where the horizontal axis graphs | ▶ 01:02 |
complexity of the solution. | ▶ 01:04 |
For example, if you use polynomials | ▶ 01:07 |
this might be a high-degree polynomial over here | ▶ 01:10 |
and maybe a linear function over here | ▶ 01:12 |
which is a low-degree polynomial | ▶ 01:14 |
your training data error | ▶ 01:16 |
tends to go like this. | ▶ 01:19 |
The more complex the hypothesis you allow | ▶ 01:22 |
the more you can just fit your data. | ▶ 01:25 |
However, in reality | ▶ 01:29 |
your generalization error on unknown data | ▶ 01:31 |
tends to go like this. | ▶ 01:33 |
It is the sum of the training data error | ▶ 01:37 |
and another function | ▶ 01:40 |
which is called the overfitting error. | ▶ 01:42 |
Not surprisingly | ▶ 01:46 |
the best complexity is obtained | ▶ 01:47 |
where the generalization error is minimum. | ▶ 01:49 |
There are methods | ▶ 01:52 |
to calculate the overfitting error. | ▶ 01:53 |
They go into a statistical field | ▶ 01:55 |
under the name bias-variance methods. | ▶ 01:57 |
However, in practice | ▶ 02:01 |
you're often just given the training data error. | ▶ 02:02 |
You'll find that if you don't pick the model | ▶ 02:04 |
that minimizes the training data error | ▶ 02:08 |
but instead pushes back the complexity | ▶ 02:11 |
your algorithm tends to perform better | ▶ 02:14 |
and that is something we will study a little bit | ▶ 02:17 |
in this class. | ▶ 02:20 |
However, this slide is really important | ▶ 02:22 |
for anybody doing machine learning in practice. | ▶ 02:26 |
If you deal with data | ▶ 02:29 |
and you have ways to fit your data | ▶ 02:31 |
be aware that overfitting | ▶ 02:33 |
is a major source of poor performance | ▶ 02:36 |
of a machine learning algorithm. | ▶ 02:39 |
And I give you examples in just one second. | ▶ 02:41 |
So a really important example | ▶ 00:00 |
of machine learning is SPAM detection. | ▶ 00:02 |
We all get way too much email | ▶ 00:04 |
and a good number of those are SPAM. | ▶ 00:06 |
Here are 3 examples of email. | ▶ 00:08 |
Dear Sir: First I must solicit your confidence | ▶ 00:12 |
in this transaction, this is by virtue of its nature | ▶ 00:14 |
being utterly confidential and top secret... | ▶ 00:16 |
This is likely SPAM. | ▶ 00:19 |
Here's another one. | ▶ 00:22 |
In upper caps. | ▶ 00:23 |
99 MILLION EMAIL ADDRESSES FOR ONLY $99 | ▶ 00:25 |
This is very likely SPAM. | ▶ 00:28 |
And here's another one. | ▶ 00:31 |
Oh, I know it's blatantly OT | ▶ 00:33 |
but I'm beginning to go insane. | ▶ 00:35 |
Had an old Dell Dimension XPS sitting in the corner | ▶ 00:37 |
and decided to put it to use. | ▶ 00:40 |
And so on and so on. | ▶ 00:41 |
Now this is likely not SPAM. | ▶ 00:42 |
How can a computer program | ▶ 00:45 |
distinguish between SPAM and not SPAM? | ▶ 00:47 |
Let's use this as an example | ▶ 00:49 |
to talk about machine learning for discrimination | ▶ 00:51 |
using Bayes Networks. | ▶ 00:55 |
In SPAM detection | ▶ 00:59 |
we get an email | ▶ 01:01 |
and we wish to categorize it | ▶ 01:03 |
either as SPAM | ▶ 01:05 |
in which case we don't even show it to the user, | ▶ 01:07 |
or what we call HAM | ▶ 01:10 |
which is the technical word for | ▶ 01:12 |
an email worth passing on to the person being emailed. | ▶ 01:15 |
So the function over here | ▶ 01:19 |
is the function we're trying to learn. | ▶ 01:21 |
Most SPAM filters use human input. | ▶ 01:23 |
When you go through email | ▶ 01:26 |
you have a button called IS SPAM | ▶ 01:28 |
which allows you as a user to flag SPAM | ▶ 01:32 |
and occasionally you will say an email is SPAM. | ▶ 01:34 |
If you look at this | ▶ 01:37 |
you have a typical supervised machine learning situation | ▶ 01:40 |
where the input is an email | ▶ 01:43 |
and the output is whether you flag it as SPAM | ▶ 01:45 |
or if we don't flag it | ▶ 01:47 |
we just think it's HAM. | ▶ 01:49 |
Now to make this amenable to | ▶ 01:52 |
a machine learning algorithm | ▶ 01:54 |
we have to talk about how to represent emails. | ▶ 01:55 |
They're all using different words and different characters | ▶ 01:57 |
and they might have different graphics included. | ▶ 02:00 |
Let's pick a representation that's easy to process. | ▶ 02:02 |
And this representation is often called | ▶ 02:06 |
Bag of Words. | ▶ 02:09 |
Bag of Words is a representation | ▶ 02:10 |
of a document | ▶ 02:14 |
that just counts the frequency | ▶ 02:15 |
of words. | ▶ 02:17 |
If an email were to say Hello | ▶ 02:18 |
I will say Hello. | ▶ 02:22 |
The Bag of Words representation | ▶ 02:24 |
is the following. | ▶ 02:26 |
2-1-1-1 | ▶ 02:27 |
for the dictionary | ▶ 02:31 |
that contains the 4 words | ▶ 02:33 |
Hello I will say. | ▶ 02:36 |
Now look at the subtlety here. | ▶ 02:38 |
Rather than representing each individual word | ▶ 02:41 |
we have a count of each word | ▶ 02:43 |
and the count is oblivious | ▶ 02:46 |
to the order in which the words were stated. | ▶ 02:49 |
A Bag of Words representation | ▶ 02:52 |
relative to a fixed dictionary | ▶ 02:55 |
represents the counts of each word | ▶ 02:57 |
relative to the words in the dictionary. | ▶ 03:01 |
If you were to use a different dictionary | ▶ 03:03 |
like hello and good-bye | ▶ 03:06 |
our counts would be | ▶ 03:08 |
2 and 0. | ▶ 03:10 |
However, in most cases | ▶ 03:13 |
you make sure that all the words found | ▶ 03:14 |
in messages | ▶ 03:17 |
are actually included in the dictionary. | ▶ 03:18 |
So the dictionary might be very, very large. | ▶ 03:19 |
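A minimal Python sketch of the Bag of Words representation, reproducing the 2-1-1-1 example and the 2-0 count for the smaller dictionary:

```python
from collections import Counter

def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in the text; word order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

print(bag_of_words("hello I will say hello", ["hello", "i", "will", "say"]))  # [2, 1, 1, 1]
print(bag_of_words("hello I will say hello", ["hello", "goodbye"]))           # [2, 0]
```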
Let me make up an unofficial example | ▶ 03:22 |
of a few SPAM and a few HAM messages. | ▶ 03:25 |
Offer is secret. | ▶ 03:30 |
Click secret link. | ▶ 03:32 |
Secret sports link. | ▶ 03:35 |
Obviously those are contrived | ▶ 03:37 |
and I tried to restrict the vocabulary | ▶ 03:40 |
to a small number of words | ▶ 03:42 |
to make this example workable. | ▶ 03:44 |
In practice we need thousands | ▶ 03:46 |
of such messages | ▶ 03:47 |
to get good information. | ▶ 03:48 |
Play sports today. | ▶ 03:50 |
Went play sports. | ▶ 03:52 |
Secret sports event. | ▶ 03:54 |
Sport is today. | ▶ 03:56 |
Sport costs money. | ▶ 03:59 |
My first quiz is | ▶ 04:02 |
What is the size of the vocabulary | ▶ 04:06 |
that contains all words in these messages? | ▶ 04:08 |
Please enter the value in this box over here. | ▶ 04:12 |
Well let's count. | ▶ 00:00 |
Offer is secret click. | ▶ 00:02 |
Secret occurs over here already | ▶ 00:08 |
so we don't have to count it twice. | ▶ 00:10 |
Link, sports, play, today, went, event | ▶ 00:12 |
costs money. | ▶ 00:18 |
So the answer is | ▶ 00:20 |
12. | ▶ 00:22 |
There's 12 different words | ▶ 00:24 |
contained in these 8 messages. | ▶ 00:26 |
[Narrator] Another quiz. | ▶ 00:00 |
What is the probability that a random message | ▶ 00:03 |
that arrives falls into the spam bucket? | ▶ 00:06 |
Assuming that those messages | ▶ 00:09 |
are all drawn at random. | ▶ 00:11 |
[writing on page] | ▶ 00:13 |
[Narrator] And the answer is: | ▶ 00:00 |
there's 8 different messages | ▶ 00:02 |
of which 3 are spam. | ▶ 00:04 |
So the maximum likelihood estimate | ▶ 00:06 |
is 3/8. | ▶ 00:09 |
[writing on paper] | ▶ 00:11 |
So, let's look at this a little bit more formally and talk about maximum likelihood. | ▶ 00:00 |
Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. | ▶ 00:03 |
And what we care about is what's our prior probability of spam | ▶ 00:12 |
that maximizes the likelihood of this data? | ▶ 00:17 |
So, let's assume we're going to assign a value of pi to this, | ▶ 00:20 |
and we wish to find the pi that maximizes the likelihood of this data over here, | ▶ 00:24 |
assuming that each email is drawn independently | ▶ 00:29 |
according to an identical distribution. | ▶ 00:33 |
The probability p(yi) of the i-th data item is then pi if yi = spam, | ▶ 00:37 |
and 1 - pi if yi = ham. | ▶ 00:48 |
If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, | ▶ 00:53 |
we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi. | ▶ 00:59 |
It's not that easy to see that this is equivalent, | ▶ 01:13 |
but say yi = 1. | ▶ 01:16 |
Then this term will fall out. | ▶ 01:19 |
It just multiplies by 1 because the exponent is zero, and we get pi, as over here. | ▶ 01:22 |
If yi = 0, then this term falls out, and this one here becomes 1 - pi as over here. | ▶ 01:28 |
Now assuming independence, we get for the entire data set | ▶ 01:36 |
that the joint probability of all data items is the product | ▶ 01:44 |
of the individual data items over here, | ▶ 01:49 |
which can now be written as follows: | ▶ 01:52 |
pi to the count of instances where yi = 1 times | ▶ 01:56 |
1 - pi to the count of the instances where yi = 0. | ▶ 02:03 |
And we know in our example, this count over here is 3, | ▶ 02:09 |
and this count over here is 5, so we get pi to the 3rd times 1 - pi to the 5th. | ▶ 02:13 |
We now wish to find the pi that maximizes this expression over here. | ▶ 02:22 |
We can also maximize the logarithm of this expression, | ▶ 02:28 |
which is 3 times log pi + 5 times log (1 - pi) | ▶ 02:33 |
Optimizing the log is the same as optimizing p because the log is monotonic in p. | ▶ 02:42 |
The maximum of this function is attained with a derivative of 0, | ▶ 02:50 |
so let's compute with a derivative and set it to 0. | ▶ 02:54 |
This is the derivative, 3 over pi - 5 over 1 - pi. | ▶ 03:00 |
We now bring this expression to the right side, | ▶ 03:05 |
multiply the denominators up, and sort all the expressions containing pi to the left, | ▶ 03:09 |
which gives us pi = 3/8, exactly the number we were at before. | ▶ 03:18 |
We just derived mathematically that the data likelihood maximizing number | ▶ 03:26 |
for the probability is indeed the empirical count, | ▶ 03:33 |
which means when we looked at this quiz before | ▶ 03:37 |
and we said a maximum likelihood for the prior probability of spam is 3/8, | ▶ 03:41 |
by simply counting 3 over 8 emails were spam, | ▶ 03:49 |
we actually followed proper mathematical principles | ▶ 03:54 |
to do maximum likelihood estimation. | ▶ 03:57 |
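For reference, the derivation just sketched can be written compactly as follows, with pi the prior probability of spam and the counts 3 and 5 taken from the example above:

```latex
L(\pi) = \prod_{i} p(y_i) = \pi^{3}\,(1-\pi)^{5}, \qquad
\log L(\pi) = 3\log\pi + 5\log(1-\pi),
\frac{d}{d\pi}\log L(\pi) = \frac{3}{\pi} - \frac{5}{1-\pi} = 0
\;\Longrightarrow\; 3(1-\pi) = 5\pi
\;\Longrightarrow\; \pi = \frac{3}{8}.
```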
Now, you might not fully have gotten the derivation of this, | ▶ 03:59 |
and I recommend you to watch it again, but it's not that important | ▶ 04:03 |
for the progress in this class. | ▶ 04:07 |
So, here's another quiz. | ▶ 04:09 |
I'd like the maximum likelihood, or ML solutions, | ▶ 04:11 |
for the following probabilities. | ▶ 04:17 |
The probability that the word "secret" comes up, | ▶ 04:19 |
assuming that we already know a message is spam, | ▶ 04:21 |
and the probability that the same word "secret" comes up | ▶ 04:25 |
if we happen to know the message is not spam, it's ham. | ▶ 04:28 |
And just as before | ▶ 00:00 |
we count the word secret | ▶ 00:02 |
in SPAM and in HAM | ▶ 00:04 |
as I've underlined here. | ▶ 00:06 |
Three out of 9 words in SPAM | ▶ 00:07 |
are the word secret | ▶ 00:11 |
so we have a third over here | ▶ 00:12 |
or 0.333 | ▶ 00:14 |
and only 1 out of all the 15 words in HAM | ▶ 00:18 |
are secret | ▶ 00:21 |
so you get a fifteenth | ▶ 00:22 |
or 0.0667. | ▶ 00:23 |
By now, you might have recognized what we're really building up is a Bayes network | ▶ 00:00 |
where the parameters of the Bayes networks are estimated using supervised learning | ▶ 00:06 |
by a maximum likelihood estimator based on training data. | ▶ 00:10 |
The Bayes network has at its root an unobservable variable called spam, | ▶ 00:15 |
which is binary, and it has as many children as there are words in a message, | ▶ 00:20 |
where each word has an identical conditional distribution | ▶ 00:28 |
of the word occurrence given the class spam or not spam. | ▶ 00:33 |
If you look at our dictionary over here, | ▶ 00:39 |
you might remember the dictionary had 12 different words, | ▶ 00:42 |
so here are 5 of the 12: offer, is, secret, click, and sports. | ▶ 00:48 |
Then for the spam class, we found the probability of secret given spam is 1/3, | ▶ 00:52 |
and we also found that the probability of secret given ham is 1/15, | ▶ 00:59 |
so here's a quiz. | ▶ 01:05 |
Assuming a vocabulary size of 12, or put differently, | ▶ 01:07 |
the dictionary has 12 words, how many parameters | ▶ 01:12 |
do we need to specify this Bayes network? | ▶ 01:16 |
And the correct answer is 23. | ▶ 00:00 |
We need 1 parameter for the prior p (spam), | ▶ 00:03 |
and then we have 2 dictionary distributions: the probability of any word | ▶ 00:07 |
i given spam, and the same for ham. | ▶ 00:12 |
Now, there's 12 words in a dictionary, | ▶ 00:16 |
but this distribution only needs 11 parameters, | ▶ 00:18 |
so the 12th can be figured out because they have to add up to 1. | ▶ 00:20 |
And the same is true over here, so if you add all these together, | ▶ 00:24 |
we get 23. | ▶ 00:27 |
So, here's a quiz. | ▶ 00:00 |
Let's assume we fit all the 23 parameters of the Bayes network | ▶ 00:02 |
as explained using maximum likelihood. | ▶ 00:06 |
Let's now do classification and see what class a message ends up with. | ▶ 00:09 |
Let me start with a very simple message, and it contains a single word | ▶ 00:14 |
just to make it a little bit simpler. | ▶ 00:18 |
What's the probability that we classify this one word message as spam? | ▶ 00:21 |
And the answer is 0.1667 or 3/18. | ▶ 00:00 |
How do I get there? Well, let's apply Bayes rule. | ▶ 00:07 |
This form is easily transformed into this expression over here, | ▶ 00:13 |
the probability of the message given spam times the prior probability of spam | ▶ 00:19 |
over the normalizer over here. | ▶ 00:25 |
Now, we know that the word "sports" occurs once in our 9 words of spam, | ▶ 00:29 |
and our prior probability for spam is 3/8, | ▶ 00:34 |
which gives us this expression over here. | ▶ 00:38 |
We now have to add the same probabilities for the class ham. | ▶ 00:40 |
"Sports" occurs 5 times out of 15 in the ham class, | ▶ 00:45 |
and the prior probability for ham is 5/8, | ▶ 00:51 |
which gives us 3/72 divided by 18/72, which is 3/18 or 1/6. | ▶ 00:55 |
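As a sketch, the same Bayes-rule calculation can be written in a few lines of Python; the counts are the maximum likelihood estimates from the example messages, and the function name is my own.

```python
from fractions import Fraction as F

p_spam = F(3, 8)                     # maximum likelihood prior
p_word_spam = {"sports": F(1, 9)}    # "sports" is 1 of the 9 spam words
p_word_ham = {"sports": F(5, 15)}    # "sports" is 5 of the 15 ham words

def p_spam_given(words):
    """P(spam | message) via Bayes rule, with conditionally independent words."""
    num = p_spam
    den_ham = 1 - p_spam
    for w in words:
        num *= p_word_spam[w]
        den_ham *= p_word_ham[w]
    return num / (num + den_ham)

print(p_spam_given(["sports"]))      # -> 1/6, about 0.1667
```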
This gets to a more complicated quiz. | ▶ 00:00 |
Say the message now contains 3 words. | ▶ 00:03 |
"Secret is secret," not a particularly meaningful email, | ▶ 00:06 |
but the frequent occurrence of "secret" seems to suggest it might be spam. | ▶ 00:10 |
What's the probability you're going to judge this to be spam? | ▶ 00:16 |
And the answer is surprisingly high. It's 25/26, or 0.9615. | ▶ 00:00 |
To see this, we apply Bayes rule, which multiplies the prior for spam-ness | ▶ 00:10 |
with the conditional probability of each word given spam. | ▶ 00:16 |
"Secret" carries 1/3, "is" 1/9, and "secret" 1/3 again. | ▶ 00:19 |
We normalize this by the same expression plus the probability for | ▶ 00:26 |
the non-spam case. | ▶ 00:32 |
5/8 is a prior. | ▶ 00:36 |
"Secret" is 1/15. | ▶ 00:38 |
"Is" is 1/15, | ▶ 00:42 |
and "secret" again. | ▶ 00:45 |
This resolves to 1/216 over this expression plus 1/5400, | ▶ 00:48 |
and when you work it all out, it is 25/26. | ▶ 00:57 |
The final quiz, let's assume our message is "Today is secret." | ▶ 00:00 |
And again, it might look like spam because the word "secret" occurs. | ▶ 00:08 |
I'd like you to compute for me the probability of spam given this message. | ▶ 00:12 |
And surprisingly, the probability for this message to be spam is 0. | ▶ 00:00 |
It's not 0.001. It's flat 0. | ▶ 00:07 |
In other words, it's impossible, according to our model, | ▶ 00:11 |
that this text could be a spam message. | ▶ 00:14 |
Why is this? | ▶ 00:17 |
When we apply the same rule as before, we get the prior for spam which is 3/8. | ▶ 00:19 |
And we multiply the conditional for each word into this. | ▶ 00:24 |
For "secret," we know it to be 1/3. | ▶ 00:28 |
For "is," to be 1/9, but for today, it's 0. | ▶ 00:31 |
It's 0 because the maximum likelihood estimate for the probability of "today" in spam is 0. | ▶ 00:39 |
"Today" just never occurred in a spam message so far. | ▶ 00:45 |
Now, this 0 is troublesome because as we compute the outcome-- | ▶ 00:49 |
and I'm plugging in all the numbers as before-- | ▶ 00:55 |
none of the words matter anymore, just the 0 matters. | ▶ 01:00 |
So, we get 0 over something which is plain 0. | ▶ 01:03 |
Are we overfitting? You bet. | ▶ 01:10 |
We are clearly overfitting. | ▶ 01:13 |
It can't be that a single word determines the entire outcome of our analysis. | ▶ 01:15 |
The reason is that for our model to assign a probability of 0 to the word "today" | ▶ 01:21 |
in the class spam is just too aggressive. | ▶ 01:26 |
Let's change this. | ▶ 01:29 |
One technique to deal with the overfitting problem is called Laplace smoothing. | ▶ 01:34 |
In maximum likelihood estimation, we assign to our probability | ▶ 01:39 |
the quotient of the count of this specific event over the count of all events in our data set. | ▶ 01:45 |
For example, for the prior probability, we found that 3/8 messages are spam. | ▶ 01:51 |
Therefore, our maximum likelihood estimate | ▶ 01:57 |
for the prior probability of spam was 3/8. | ▶ 02:00 |
In Laplace Smoothing, we use a different estimate. | ▶ 02:05 |
We add the value k to the count | ▶ 02:10 |
and normalize as if we added k to every single class | ▶ 02:15 |
that we've tried to estimate something over. | ▶ 02:20 |
This is equivalent to assuming we have a couple of fake training examples | ▶ 02:23 |
where we add k to each observation count. | ▶ 02:28 |
Now, if k equals 0, we get our maximum likelihood estimator. | ▶ 02:32 |
But if k is larger than 0 and n is finite, we get different answers. | ▶ 02:36 |
Let's say k equals 1, | ▶ 02:41 |
and let's assume we get one message, | ▶ 02:47 |
and that message was spam, so we're going to write it one message, one spam. | ▶ 02:51 |
What is p(spam) for Laplace smoothing with k = 1? | ▶ 02:56 |
Let's do the same with 10 messages, and we get 6 spam. | ▶ 03:03 |
And 100 messages, of which 60 are spam. | ▶ 03:09 |
Please enter your numbers into the boxes over here. | ▶ 03:16 |
The answer here is 2/3 or 0.667 and is computed as follows. | ▶ 00:00 |
We have 1 message, of which 1 is spam, and we're going to add k = 1 up here. | ▶ 00:10 |
We're going to add k times 2 down here because there are 2 different classes. | ▶ 00:16 |
With k = 1, that is 1 times 2 = 2, which gives us 2/3. | ▶ 00:22 |
The answer over here is 7/12. | ▶ 00:28 |
Again, we have 6/10 but we add 2 down here and 1 over here, so you get 7/12. | ▶ 00:32 |
And correspondingly, we get 61/102, which is 60 + 1 over 100 + 2. | ▶ 00:41 |
If we look at the numbers over here, we get 0.5833 | ▶ 00:49 |
and 0.5986. | ▶ 00:56 |
Interestingly, the maximum likelihood on the last 2 cases over here | ▶ 00:59 |
will give us .6, but we only get a value that's closer to .5, | ▶ 01:03 |
which is the effect of our smoothing prior for the Laplacian smoothing. | ▶ 01:09 |
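Here is a small sketch of the Laplace-smoothed prior used in this quiz; the function name and arguments are assumptions for illustration.

```python
from fractions import Fraction as F

def laplace_prior(spam_count, total, k=1, num_classes=2):
    """Laplace smoothing: add k to the count and k per class to the total."""
    return F(spam_count + k, total + k * num_classes)

print(laplace_prior(1, 1))      # -> 2/3
print(laplace_prior(6, 10))     # -> 7/12
print(laplace_prior(60, 100))   # -> 61/102
```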
Let's use the Laplacian smoother with K=1 | ▶ 00:00 |
to calculate a few interesting probabilities-- | ▶ 00:05 |
P of SPAM, P of HAM, | ▶ 00:09 |
and then the probability of the word "today", | ▶ 00:12 |
given that it's in the SPAM class or the HAM class. | ▶ 00:15 |
And you might assume that our vocabulary size | ▶ 00:19 |
is 12 different words here. | ▶ 00:22 |
This one is easy to calculate for SPAM and HAM. | ▶ 00:00 |
For SPAM, it's 2/5, | ▶ 00:03 |
and the reason is, we had previously | ▶ 00:05 |
3 out of 8 messages assigned to SPAM. | ▶ 00:08 |
But thanks to the Laplacian smoother, we add 1 over here. | ▶ 00:12 |
And there are 2 classes, so we add 2 times 1 over here, | ▶ 00:15 |
which gives us 4/10, which is 2/5. | ▶ 00:19 |
Similarly, we get 3/5 over here. | ▶ 00:22 |
Now the tricky part comes up over here. | ▶ 00:26 |
Before, we had 0 occurrences of the word "today" in the SPAM class, | ▶ 00:29 |
and we had 9 data points. | ▶ 00:33 |
But now we are going to add 1 for Laplacian smoother, | ▶ 00:35 |
and down here, we are going to add 12. | ▶ 00:38 |
And the reason that we add 12 is because | ▶ 00:40 |
there's 12 different words in our dictionary. | ▶ 00:42 |
Hence, for each word in the dictionary, we are going to add 1. | ▶ 00:44 |
So we have a total of 12, which gives us the 12 over here. | ▶ 00:47 |
That makes 1/21. | ▶ 00:50 |
In the HAM class, we had 2 occurrences | ▶ 00:53 |
of the word "today"--over here and over here. | ▶ 00:56 |
We add 1, normalize by 15, | ▶ 00:59 |
plus 12 for the dictionary size, | ▶ 01:04 |
which is 3/27 or 1/9. | ▶ 01:07 |
This was not an easy question. | ▶ 01:14 |
We come now to the final quiz here, | ▶ 00:00 |
which is--I would like to compute the probability | ▶ 00:03 |
that the message "today is secret" | ▶ 00:05 |
falls into the SPAM box with | ▶ 00:08 |
Laplacian smoother using K=1. | ▶ 00:10 |
Please just enter your number over here. | ▶ 00:13 |
This is a non-trivial question. | ▶ 00:16 |
It might take you a while to calculate this. | ▶ 00:18 |
And the approximate probability is 0.4858. | ▶ 00:00 |
How did we get this? | ▶ 00:06 |
Well, the prior probability for SPAM | ▶ 00:08 |
under the Laplacian smoothing is 2/5. | ▶ 00:12 |
"Today" doesn't occur, but we have already calculated this to be 1/21. | ▶ 00:15 |
"Is" occurs once, so we get 2 over here over 21. | ▶ 00:22 |
"Secret" occurs 3 times, so we get a 4 over here over 21, | ▶ 00:26 |
and we normalize this by the same expression over here. | ▶ 00:32 |
Plus the prior for HAM, which is 3/5, | ▶ 00:37 |
we have 2 occurrences of "today", plus 1, equals 3/27. | ▶ 00:42 |
"Is" occurs once--2/27. | ▶ 00:47 |
And "secret" occurs once--again 2/27. | ▶ 00:50 |
When you work this all out, you get this number over here. | ▶ 00:54 |
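The full calculation is easier to follow in code; this sketch plugs in the word counts from the example messages with k = 1 and a vocabulary of 12, and the names are my own.

```python
from fractions import Fraction as F

k, vocab = 1, 12
spam_total, ham_total = 9, 15              # word counts per class
spam_counts = {"today": 0, "is": 1, "secret": 3}
ham_counts = {"today": 2, "is": 1, "secret": 1}

def smoothed(count, total):
    return F(count + k, total + k * vocab)

def p_spam_given(words):
    num = F(3 + k, 8 + 2 * k)              # smoothed prior p(spam) = 2/5
    den = F(5 + k, 8 + 2 * k)              # smoothed prior p(ham)  = 3/5
    for w in words:
        num *= smoothed(spam_counts[w], spam_total)
        den *= smoothed(ham_counts[w], ham_total)
    return num / (num + den)

print(float(p_spam_given("today is secret".split())))   # -> about 0.4858
```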
So we learned quite a bit. | ▶ 00:00 |
We learned about Naive Bayes | ▶ 00:02 |
as our first supervised learning methods. | ▶ 00:04 |
The setup was that we had | ▶ 00:06 |
features of documents or training examples, and labels. | ▶ 00:08 |
In this case, SPAM or not SPAM. | ▶ 00:14 |
And from those pieces, | ▶ 00:17 |
we made a generative model for the SPAM class | ▶ 00:19 |
and the non-SPAM class | ▶ 00:23 |
that described the conditional probability | ▶ 00:25 |
of each individual feature. | ▶ 00:28 |
We then used first maximum likelihood | ▶ 00:30 |
and then a Laplacian smoother | ▶ 00:33 |
to fit those parameters over here. | ▶ 00:36 |
And then using Bayes rule, | ▶ 00:38 |
we could take any training examples over here | ▶ 00:41 |
and figure out what the class probability was over here. | ▶ 00:44 |
This is called a generative model | ▶ 00:48 |
in that the conditional probabilities all aim to maximize | ▶ 00:51 |
the probability of individual features as if those | ▶ 00:55 |
describe the physical world. | ▶ 01:00 |
We also used what is called a bag of words model, | ▶ 01:02 |
in which our representation of each email | ▶ 01:06 |
was such that we just counted the occurrences of words, | ▶ 01:09 |
irrespective of their order. | ▶ 01:12 |
Now this is a very powerful method for fighting SPAM. | ▶ 01:15 |
Unfortunately, it is not powerful enough. | ▶ 01:19 |
It turns out spammers know about Naive Bayes, | ▶ 01:21 |
and they've long learned to come up with messages | ▶ 01:24 |
that are fooling your SPAM filter if it uses Naive Bayes. | ▶ 01:27 |
So companies like Google and others | ▶ 01:31 |
have become much more involved | ▶ 01:33 |
in methods for SPAM filtering. | ▶ 01:35 |
Now I can give you some more examples how to filter SPAM, | ▶ 01:38 |
but all of those quite easily fit with the same Naive Bayes model. | ▶ 01:42 |
[Narrator] So here are features that you might consider when you write | ▶ 00:00 |
an advanced spam filter. | ▶ 00:03 |
For example, | ▶ 00:05 |
does the email come from | ▶ 00:07 |
a known spamming IP or computer? | ▶ 00:09 |
Have you emailed this person before? | ▶ 00:12 |
In which case it is less likely to be spam. | ▶ 00:16 |
Here's a powerful one: | ▶ 00:19 |
have 1000 other people | ▶ 00:22 |
recently received the same message? | ▶ 00:25 |
Is the email header consistent? | ▶ 00:29 |
For example, if the from field says your bank, | ▶ 00:32 |
is the IP address really your bank's? | ▶ 00:35 |
Surprisingly: is the email in all caps? | ▶ 00:38 |
Strangely many spammers believe if you write | ▶ 00:42 |
things in all caps you'll pay more attention to it. | ▶ 00:44 |
Do the inline URLs point to those pages | ▶ 00:48 |
where they say they're pointing to? | ▶ 00:51 |
Are you addressed by your correct name? | ▶ 00:54 |
Now these are some features, | ▶ 00:56 |
I'm sure you can think of more. | ▶ 00:58 |
You can toss them easily into the | ▶ 01:00 |
Naive Bayes model and get better classification. | ▶ 01:02 |
In fact, modern spam filters keep learning | ▶ 01:05 |
as people flag emails as spam, and | ▶ 01:08 |
of course spammers keep learning as well | ▶ 01:10 |
and trying to fool modern spam filters. | ▶ 01:13 |
Who's going to win? | ▶ 01:16 |
Well so far the spam filters are clearly winning. | ▶ 01:18 |
Most of my spam I never see, but who knows | ▶ 01:21 |
what's going to happen with the future? | ▶ 01:23 |
It's a really fascinating machine learning problem. | ▶ 01:25 |
[Narrator] Naive Bayes can also be applied to | ▶ 00:00 |
the problem of handwritten digit recognition. | ▶ 00:02 |
This is a sample of hand-written digits taken | ▶ 00:05 |
from a U.S. postal data set | ▶ 00:09 |
where hand written zip codes on letters are | ▶ 00:12 |
being scanned and automatically classified. | ▶ 00:17 |
The machine-learning problem here is: | ▶ 00:21 |
taking a symbol just like this, | ▶ 00:23 |
What is the corresponding number? | ▶ 00:28 |
Here it's obviously 0. | ▶ 00:30 |
Here it's obviously 1. | ▶ 00:32 |
Here it's obviously 2, 1. | ▶ 00:34 |
For the one down here, | ▶ 00:36 |
it's a little bit harder to tell. | ▶ 00:38 |
Now when you apply Naive Bayes, | ▶ 00:41 |
the input vector | ▶ 00:44 |
could be the pixel values | ▶ 00:46 |
of each individual pixel so we have | ▶ 00:48 |
a 16 x 16 input resolution. | ▶ 00:50 |
You would get 256 different values | ▶ 00:54 |
corresponding to the brightness of each pixel. | ▶ 00:59 |
Now obviously, given sufficiently many | ▶ 01:02 |
training examples, you might hope | ▶ 01:05 |
to recognize digits, | ▶ 01:07 |
but one of the deficiencies of this approach is | ▶ 01:09 |
it is not particularly shift invariant. | ▶ 01:12 |
So for example a pattern like this | ▶ 01:15 |
will look fundamentally different | ▶ 01:19 |
from a pattern like this. | ▶ 01:21 |
Even though the pattern on the right is obtained | ▶ 01:24 |
by shifting the pattern on the left | ▶ 01:27 |
by 1 to the right. | ▶ 01:29 |
There's many different solutions, but a common one could be | ▶ 01:31 |
to use smoothing in a different way from | ▶ 01:34 |
the way we discussed it before. | ▶ 01:36 |
Instead of just using each pixel value's count on its own, | ▶ 01:38 |
you could mix it with counts of the | ▶ 01:40 |
neighboring pixel values so if | ▶ 01:42 |
all pixels are slightly shifted, | ▶ 01:44 |
we get about the same statistics | ▶ 01:46 |
as the pixel itself. | ▶ 01:48 |
Such a method is called input smoothing. | ▶ 01:50 |
You can do what's technically called convolving | ▶ 01:52 |
the input vector of pixel values, and | ▶ 01:55 |
you might get better results than if you | ▶ 01:57 |
do Naive Bayes on the raw pixels. | ▶ 02:00 |
Now to tell you the truth for | ▶ 02:02 |
digit recognition of this type, | ▶ 02:04 |
Naive Bayes is not a good choice. | ▶ 02:06 |
The conditional independence assumption | ▶ 02:08 |
of each pixel, given the class, | ▶ 02:10 |
is too strong an assumption in this case, | ▶ 02:12 |
but it's fun to talk about image recognition | ▶ 02:14 |
in the context of Naive Bayes regardless. | ▶ 02:17 |
So, let me step back a step and talk a bit about | ▶ 00:00 |
overfitting prevention in machine learning | ▶ 00:04 |
because it's such an important topic. | ▶ 00:07 |
We talked about Occam's Razor, | ▶ 00:09 |
which in a generalized way suggests there is | ▶ 00:12 |
a tradeoff between how well we can fit the data | ▶ 00:16 |
and how smooth our learning algorithm is. | ▶ 00:22 |
In our class in smoothing, we already found 1 way | ▶ 00:28 |
to let Occam's Razor play, which is by | ▶ 00:32 |
selecting the value K to make our statistical counts smoother. | ▶ 00:34 |
I alluded to a similar way in the image recognition domain | ▶ 00:40 |
where we smoothed the image so the neighboring pixels count similar. | ▶ 00:44 |
This all raises the question of how to choose the smoothing parameter. | ▶ 00:49 |
So, in particular, in Laplacian smoothing, how to choose the K. | ▶ 00:53 |
There is a method called cross-validation | ▶ 00:58 |
which can help you find an answer. | ▶ 01:02 |
This method assumes there is plenty of training examples, but | ▶ 01:05 |
to tell you the truth, in spam filtering there is more than you'd ever want. | ▶ 01:09 |
Take your training data | ▶ 01:14 |
and divide it into 3 buckets. | ▶ 01:17 |
Train, cross-validate, and test. | ▶ 01:19 |
Typical ratios will be 80% goes into train, | ▶ 01:24 |
10% into cross-validate, | ▶ 01:27 |
and 10% into test. | ▶ 01:30 |
You use the train to find all your parameters. | ▶ 01:33 |
For example, the probabilities of a base network. | ▶ 01:37 |
You use your cross-validation set | ▶ 01:40 |
to find the optimal K, and the way you do this is | ▶ 01:43 |
you train for different values of K, | ▶ 01:46 |
you observe how well the trained model performs on the CV data, | ▶ 01:49 |
not touching the test data, | ▶ 01:55 |
and then you maximize over all the Ks to get the best performance | ▶ 01:58 |
on the cross-validation set. | ▶ 02:01 |
You iterate this many times until you find the best K. | ▶ 02:03 |
When you're done with the best K, | ▶ 02:06 |
you train again, and then finally | ▶ 02:09 |
only one you touch the test data | ▶ 02:12 |
to verify the performance, | ▶ 02:15 |
and this is the performance you report. | ▶ 02:17 |
It's really important in cross-validation | ▶ 02:20 |
to split apart a cross-validation set that's different from the test set. | ▶ 02:23 |
If you were to use the test set to find the optimal K, | ▶ 02:28 |
then your test set becomes an effective part of your training routine, | ▶ 02:31 |
and you might overfit your test data, | ▶ 02:35 |
and you wouldn't even know. | ▶ 02:38 |
By keeping the test data separate from the beginning, | ▶ 02:40 |
and training on the training data, you use | ▶ 02:43 |
the cross-validation data to find how well your trained model is doing, | ▶ 02:46 |
and to fine-tune the unknown parameter K. | ▶ 02:49 |
Finally, only once you use the test data | ▶ 02:53 |
do you get a fair answer to the question, | ▶ 02:56 |
"How well will your model perform on future data?" | ▶ 02:59 |
So, pretty much everybody in machine learning | ▶ 03:02 |
uses this model. | ▶ 03:05 |
You can redo the split between training and the cross-validation part, | ▶ 03:08 |
people often use the word 10-fold cross-validation | ▶ 03:12 |
where they do 10 different foldings | ▶ 03:15 |
and run the model 10 times to find the optimal K | ▶ 03:17 |
or smoothing parameter. | ▶ 03:20 |
No matter which way you do it, find the optimal smoothing parameter | ▶ 03:22 |
and then use the test set exactly once to verify and report the performance. | ▶ 03:25 |
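A minimal sketch of this train / cross-validate / test recipe in Python; train_fn and score_fn are hypothetical placeholders for whatever training and scoring code you use.

```python
import random

def choose_k(data, candidate_ks, train_fn, score_fn, seed=0):
    """80/10/10 split: pick the smoothing parameter on the CV set,
    then touch the test set exactly once for the reported score."""
    random.Random(seed).shuffle(data)
    n = len(data)
    train = data[: int(0.8 * n)]
    cv = data[int(0.8 * n): int(0.9 * n)]
    test = data[int(0.9 * n):]

    best_k = max(candidate_ks, key=lambda k: score_fn(train_fn(train, k), cv))
    final_model = train_fn(train, best_k)
    return best_k, score_fn(final_model, test)
```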
Let me back up a step further, | ▶ 00:00 |
and let's look at supervised learning more generally. | ▶ 00:03 |
Our example so far was one of classification. | ▶ 00:06 |
The characteristic of classification is | ▶ 00:09 |
that the target labels or the target class is discrete. | ▶ 00:12 |
In our case it was actually binary. | ▶ 00:16 |
In many problems, we try to predict a continuous quantity. | ▶ 00:18 |
For example, in the interval 0 to 1 or perhaps a real number. | ▶ 00:23 |
Those machine learning problems are called regression problems. | ▶ 00:29 |
Regression problems are fundamentally different from classification problems. | ▶ 00:33 |
For example, our Bayes network doesn't afford us an answer | ▶ 00:37 |
to a problem where the target value could be anywhere between 0 and 1. | ▶ 00:42 |
A regression problem, for example, would be one to | ▶ 00:45 |
predict the weather tomorrow. | ▶ 00:48 |
Temperature is a continuous value. Our Bayes network would not be able | ▶ 00:50 |
to predict the temperature, it only can predict discrete classes. | ▶ 00:53 |
A regression algorithm is able to give us a continuous prediction | ▶ 00:58 |
about the temperature tomorrow. | ▶ 01:01 |
So let's look at the regression next. | ▶ 01:04 |
So here's my first quiz for you on regression. | ▶ 01:07 |
This scatter plot shows for Berkeley California for a period of time | ▶ 01:10 |
the data for each house that was sold. | ▶ 01:18 |
Each dot is a sold house. | ▶ 01:21 |
It graphs the size of the house in square feet | ▶ 01:24 |
to the sales price in thousands of dollars. | ▶ 01:27 |
As you can see, roughly speaking, | ▶ 01:32 |
as the size of the house goes up, | ▶ 01:34 |
so does the sales price. | ▶ 01:37 |
I wonder, for a house of about 2500 square feet, | ▶ 01:40 |
what is the approximate sales price you would assume | ▶ 01:45 |
based just on the scatter plot data? | ▶ 01:49 |
Is it 400k, 600k, 800k, or 1000k? | ▶ 01:52 |
My answer is, there seems to be a roughly linear relationship, | ▶ 00:00 |
maybe not quite linear, between the house size and the price. | ▶ 00:05 |
So we look at a linear graph that best describes the data-- | ▶ 00:11 |
you get this dashed line over here. | ▶ 00:15 |
And for the dashed line, if you walk up the 2500 square feet, | ▶ 00:18 |
you end up with roughly 800K. | ▶ 00:22 |
So this would have been the best answer. | ▶ 00:24 |
Now obviously you can answer this question without understanding anything about regression. | ▶ 00:00 |
But what you find is this is different from classification as before. | ▶ 00:05 |
This is not a binary concept anymore of like expensive and cheap. | ▶ 00:10 |
It really is a relationship between two variables. | ▶ 00:13 |
One you care about--the house price, and one that you can observe, | ▶ 00:17 |
which is the house size in square feet. | ▶ 00:20 |
And your goal is to fit a curve that best explains the data. | ▶ 00:23 |
Once again, we have a case where we can play Occam's razor. | ▶ 00:28 |
There clearly is a data fit that is not linear that might be better, | ▶ 00:31 |
like this one over here. | ▶ 00:35 |
And when you go beyond linear curves, | ▶ 00:37 |
you might even be inclined to draw a curve like this. | ▶ 00:40 |
Now of course the curve I'm drawing right now is likely an overfit. | ▶ 00:44 |
And you don't want to postulate that this is the general relationship | ▶ 00:49 |
between the size of a house and the sales price. | ▶ 00:54 |
So even though my black curve might describe the data better, | ▶ 00:57 |
the blue curve or the dashed linear curve over here might be a better explanation by virtue of Occam's razor. | ▶ 01:01 |
So let's look a little bit deeper into what we call regression. | ▶ 01:08 |
As in all regression problems, our data will be comprised of | ▶ 01:15 |
input vectors of length n that map to another continuous value. | ▶ 00:19 |
And we might be given a total of M data points. | ▶ 01:25 |
This is like the classification case, except this time the Ys are continuous. | ▶ 00:30 |
Once again, we're looking for function f that maps our vector x into y. | ▶ 01:36 |
In linear regression, the function has a particular form which is W1 times X plus W0. | ▶ 01:44 |
In this case X is one dimensional which is N = 1. | ▶ 01:54 |
Or in the high-dimensional space, we might just write W times X plus W0, | ▶ 01:59 |
where W is a vector and X is a vector. | ▶ 02:07 |
And this is the inner product of these 2 vectors over here. | ▶ 02:12 |
Let's for now just consider the one-dimensional case. | ▶ 02:16 |
In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0. | ▶ 02:20 |
I've given you a data set. | ▶ 02:27 |
And this data set happens to be fittable by a linear regression model without any residual error. | ▶ 02:30 |
Without any math, can you look at this and tell me what the 2 parameters, W0 and W1, are? | ▶ 02:36 |
This is a surprisingly challenging question. | ▶ 00:00 |
If you look at these numbers from 3 to 6. | ▶ 00:03 |
When we increase X by 3, Y decreases by 3, | ▶ 00:07 |
which suggests W1 is -1. | ▶ 00:14 |
Now let's see if this holds. | ▶ 00:18 |
If we increase X by 3, it decreases Y by 3. | ▶ 00:20 |
If we increase X by 1, we decrease Y by 1. | ▶ 00:24 |
If we increase X by 2, we decrease Y by 2. | ▶ 00:28 |
So this number seems to be an exact fit. | ▶ 00:32 |
Next we have to get the constant W0 right. | ▶ 00:36 |
For X = 3, we get -3 as an expression over here, | ▶ 00:41 |
because we know W1 = -1. | ▶ 00:48 |
So if this has to equal zero in the end, then W0 has to be 3. | ▶ 00:50 |
Let's do a quick check. | ▶ 00:57 |
-3 plus 3 is 0. | ▶ 00:59 |
-6 plus 3 is -3. | ▶ 01:02 |
And if we plug in any of the numbers, you find those are correct. | ▶ 01:05 |
Now this is the case of an exact data set. | ▶ 01:09 |
It gets much more challenging if the data set cannot be fit with a linear function. | ▶ 01:12 |
To define linear regression, | ▶ 00:00 |
we need to understand what we are trying to minimize. | ▶ 00:02 |
The term used here is "loss function," | ▶ 00:05 |
and the loss function is the amount of residual error we obtain | ▶ 00:08 |
after fitting the linear function as well as possible. | ▶ 00:12 |
The residual error is the sum over all training examples | ▶ 00:16 |
j of yj, which is the target label, | ▶ 00:20 |
minus our prediction, w1 xj + w0, taken to the square. | ▶ 00:25 |
This is the quadratic error between our target labels | ▶ 00:34 |
and what our best hypothesis can produce. | ▶ 00:37 |
Minimizing this loss | ▶ 00:41 |
is what we do when solving a linear regression problem, | ▶ 00:43 |
and you can write it as follows: | ▶ 00:46 |
Our solution to the regression problem W* | ▶ 00:50 |
is the arg min of the loss over all possible vectors W. | ▶ 00:52 |
The problem of minimizing quadratic loss for linear functions can be solved in closed form. | ▶ 00:00 |
When I reduce, I will do this for the one-dimensional case on paper. | ▶ 00:07 |
I will also give you the solution for the case where your input space is multidimensional, | ▶ 00:12 |
which is often called "multivariate regression." | ▶ 00:17 |
We seek to minimize a sum of a quadratic expression | ▶ 00:22 |
where the target labels are subtracted with the output of our linear regression model | ▶ 00:26 |
parameterized by w1 and w0. | ▶ 00:33 |
The summation here is overall training examples, | ▶ 00:36 |
and I leave the index of the summation out if not necessary. | ▶ 00:40 |
The minimum of this is obtained where the derivative of this function equals zero. | ▶ 00:45 |
Let's call this function "L." | ▶ 00:50 |
For the partial derivative with respect to w0, we get this expression over here, | ▶ 00:53 |
which we have to set to zero. | ▶ 00:59 |
We can easily get rid of the -2 and transform this as follows: | ▶ 01:02 |
Here M is the number of training examples. | ▶ 01:11 |
This expression over here gives us w0 as a function of w1, | ▶ 01:17 |
but we don't know w1. Let's do the same trick for w1 | ▶ 01:21 |
and set this to zero as well, | ▶ 01:28 |
which gets us the expression over here. | ▶ 01:32 |
We can now plug in the w0 over here into this expression over here | ▶ 01:38 |
and obtain this expression over here, | ▶ 01:44 |
which looks really involved but is relatively straightforward. | ▶ 01:47 |
With a few steps of further calculation, which I'll spare you for now, | ▶ 01:52 |
we get for w1 the following important formula: | ▶ 01:56 |
This is the final quotient for w1, | ▶ 02:02 |
where we take the number of training examples times the sum of all xy | ▶ 02:05 |
minus the sum of x times the sum of y divided by this expression over here. | ▶ 02:10 |
Once we've computed w1, | ▶ 02:16 |
we can go back to our original articulation of w0 over here | ▶ 02:19 |
and plug w1 into w0 and obtain w0. | ▶ 02:23 |
These are the two important formulas we can also find in the textbook. | ▶ 02:30 |
I'd like to go back and use those formulas to calculate these two coefficients over here. | ▶ 02:39 |
You get 4 times the sum of all x times y, which is -32, | ▶ 02:45 |
minus the product of the sum of x, which is 18, and the sum of y, which is -6, | ▶ 02:56 |
divided by 4 times the sum of x squared, which is 86, minus the square of the sum of x, | ▶ 03:05 |
which is 18 times 18, or 324. | ▶ 03:16 |
If you work this all out, it becomes -1, which is w1. | ▶ 03:20 |
W0 is now obtained by computing a quarter times the sum of all y, | ▶ 03:25 |
which is -6, minus w1, which is -1, times a quarter times the sum of all x. | ▶ 03:31 |
If you plug this all in, you get 3, as over here. Our formula is actually correct. | ▶ 03:39 |
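Here is a minimal Python sketch of these closed-form formulas, applied to a data set (x = 3, 6, 4, 5 and y = 0, -3, -1, -2) that is consistent with the sums quoted above; it recovers w1 = -1 and w0 = 3.

```python
def fit_line(xs, ys):
    """Closed-form one-dimensional least squares; returns (w0, w1)."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    w0 = (sy - w1 * sx) / m
    return w0, w1

print(fit_line([3, 6, 4, 5], [0, -3, -1, -2]))   # -> (3.0, -1.0)
```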
Here is another quiz for linear regression. We have the follow data: | ▶ 03:46 |
Here is the data plotted graphically. | ▶ 03:51 |
I wonder what the best regression is. | ▶ 03:53 |
Give me w0 and w1. Apply the formulas I just gave you. | ▶ 03:56 |
And the answer is W0 = 0.5, and W1 = 0.9. | ▶ 00:00 |
If I were to draw a line, it would go about like this. | ▶ 00:09 |
It doesn't really hit the two points at the end. | ▶ 00:14 |
If you were thinking of something like this, you were wrong. | ▶ 00:19 |
If you draw a curve like this, your quadratic error becomes 2. | ▶ 00:24 |
One over here, and one over here. | ▶ 00:28 |
The quadratic error is smaller for the line that goes in between those points. | ▶ 00:30 |
This is easily seen by computing as shown in the previous slide. | ▶ 00:35 |
W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400) which is 0.9. | ▶ 00:41 |
This is merely plugging in those numbers into the formulas I gave you. | ▶ 00:55 |
W0 then becomes ¼ x 20 | ▶ 01:00 |
minus W1, 0.9, over 4, times 20, which equals 0.5. | ▶ 01:05 |
This is an example of linear regression, | ▶ 01:12 |
in which case there is a residual error, | ▶ 01:16 |
and the best-fitting curve is the one that minimizes | ▶ 01:18 |
the total of the residual vertical error in this graph over here. | ▶ 01:22 |
So linear regression works well | ▶ 00:00 |
if the data is approximately linear, | ▶ 00:03 |
but there are many examples when linear regression performs poorly. | ▶ 00:05 |
Here's one where we have a | ▶ 00:09 |
curve that is really nonlinear. | ▶ 00:12 |
This is an interesting one where we seem to have a linear relationship | ▶ 00:15 |
that is flatter than the linear regression indicates, | ▶ 00:18 |
but there is one outlier. | ▶ 00:21 |
Because if you are minimizing quadratic error, | ▶ 00:23 |
outliers penalize you over-proportionately. | ▶ 00:26 |
So outliers are particularly bad for linear regression. | ▶ 00:30 |
And here is a case, | ▶ 00:34 |
where the data clearly suggests | ▶ 00:35 |
a very different phenomenon than a linear one. | ▶ 00:37 |
We seem to have only two values of the variable even being used, | ▶ 00:40 |
and this one has a strong frequency | ▶ 00:42 |
and a strong vertical spread. | ▶ 00:45 |
Clearly a linear regression model | ▶ 00:47 |
is a very poor one to explain | ▶ 00:49 |
this data over here. | ▶ 00:51 |
Another problem with linear regression | ▶ 00:53 |
is that as you go to infinity in the X space, | ▶ 00:55 |
your Ys also become infinite. | ▶ 00:59 |
In some problems that isn't a plausible model. | ▶ 01:02 |
For example, if you wish to predict the weather | ▶ 01:05 |
anytime into the future, | ▶ 01:08 |
it's implausible to assume the further the prediction goes out, | ▶ 01:10 |
the hotter or the cooler it becomes. | ▶ 01:13 |
For such situations there is a | ▶ 01:15 |
model called logistic regression, | ▶ 01:17 |
which uses a slightly more complicated | ▶ 01:20 |
model than linear regression, | ▶ 01:22 |
which goes as follows:. | ▶ 01:24 |
Let f of x be our linear function, | ▶ 01:25 |
and the output of logistic regression | ▶ 01:30 |
is obtained by the following function: | ▶ 01:32 |
One over one plus exponential of minus F of X. | ▶ 01:34 |
So here's a quick quiz for you. | ▶ 01:40 |
What is the range in which Z might fall | ▶ 01:43 |
given this function over here, | ▶ 01:48 |
assuming f of x is the linear function over here. | ▶ 01:49 |
Is it zero, one? | ▶ 01:53 |
Is it minus one, one? | ▶ 01:56 |
Is it minus one, zero? | ▶ 01:59 |
Minus two, two? | ▶ 02:02 |
Or none of the above? | ▶ 02:04 |
The answer is zero, one. | ▶ 00:00 |
If this expression over here, | ▶ 00:02 |
F of X, | ▶ 00:05 |
grows to positive infinity, | ▶ 00:07 |
then Z becomes one. | ▶ 00:09 |
And the reason is | ▶ 00:14 |
as this term over here becomes very large, | ▶ 00:16 |
E to the minus of that term approaches zero; | ▶ 00:19 |
one over one equals one. | ▶ 00:22 |
If F of X goes to minus infinity, | ▶ 00:25 |
then Z goes to zero. | ▶ 00:30 |
And the reason is, | ▶ 00:33 |
if this expression over here goes to minus infinity, | ▶ 00:34 |
E to the infinity becomes very large; | ▶ 00:38 |
one over something very large becomes zero. | ▶ 00:41 |
When we plot the logistic function it looks like this: | ▶ 00:44 |
So it's approximately linear | ▶ 00:49 |
around F of X equals zero, | ▶ 00:51 |
but it levels off to zero and one | ▶ 00:54 |
as we go to the extremes. | ▶ 00:58 |
Another problem with linear regression has to do with the regularization | ▶ 00:00 |
or complexity control. | ▶ 00:04 |
Just like before, we sometimes wish to have | ▶ 00:06 |
a less complex model. | ▶ 00:08 |
So in regularization, the loss function is the sum | ▶ 00:10 |
of the loss over the data and a complexity control term, | ▶ 00:15 |
which is often called the loss of the parameters. | ▶ 00:21 |
The loss of the data is simply the quadratic loss, as we discussed before. | ▶ 00:24 |
The loss of parameters might just be a function that penalizes | ▶ 00:29 |
the parameters to become large | ▶ 00:35 |
raised to some power P, where P is usually either 1 or 2. | ▶ 00:37 |
If you draw this graphically, | ▶ 00:43 |
in a parameter space comprised of 2 parameters, | ▶ 00:46 |
your quadratic term for minimizing the data error | ▶ 00:49 |
might look like this, where the minimum sits over here. | ▶ 00:53 |
Your term for regularization might pull these parameters toward 0. | ▶ 00:57 |
It pulls them toward 0, along circles if you use quadratic (L2) regularization, | ▶ 01:02 |
and it does it in a diamond-shaped way | ▶ 01:09 |
for L1 regularization--either one works well. | ▶ 01:14 |
L1 has the advantage in that parameters tend to get really sparse. | ▶ 01:20 |
If you look at this diagram, there is a tradeoff between W-0 and W-1. | ▶ 01:24 |
In the L1 case, that allows one of them to be driven to 0. | ▶ 01:30 |
In the L2 case, parameters tend not to be as sparse. | ▶ 01:33 |
So L1 is often preferred. | ▶ 01:37 |
This all raises the question, | ▶ 00:00 |
how to minimize more complicated loss functions | ▶ 00:03 |
than the one we discussed so far. | ▶ 00:06 |
Are there closed-form solutions of the type we found for linear regression? | ▶ 00:09 |
Or do we have to resort to iterative methods? | ▶ 00:14 |
The general answer is, unfortunately, we have to resort to iterative methods. | ▶ 00:17 |
Even though there are special cases in which closed-form solutions may exist, | ▶ 00:23 |
in general, our loss functions now become complicated enough | ▶ 00:28 |
that all we can do is iterate. | ▶ 00:32 |
Here is a prototypical loss function | ▶ 00:35 |
and the method for iteration will be called gradient descent. | ▶ 00:40 |
In gradient descent, you start with an initial guess, | ▶ 00:44 |
W-0, where 0 is your iteration number, | ▶ 00:48 |
and then you update it iteratively. | ▶ 00:53 |
Your i plus 1st parameter guess will be obtained by taking your i-th guess | ▶ 00:55 |
and subtracting from it the gradient of your loss function, | ▶ 01:04 |
at that guess, multiplied by a small learning rate alpha, | ▶ 01:10 |
where alpha is often as small as 0.01. | ▶ 01:15 |
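As a sketch, here is the update rule in Python on a toy one-dimensional loss; the loss L(w) = (w - 3)^2 is made up just to show the iteration.

```python
def gradient_descent(grad, w, alpha=0.01, iterations=1000):
    """Iterate w_{i+1} = w_i - alpha * grad(w_i)."""
    for _ in range(iterations):
        w = w - alpha * grad(w)
    return w

# Toy loss L(w) = (w - 3)^2 with gradient 2*(w - 3); the minimum is at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3), w=0.0))   # -> close to 3.0
```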
I have a couple of questions for you. | ▶ 01:19 |
Consider the following 3 points. | ▶ 01:21 |
We call them A, B, C. | ▶ 01:25 |
I wish to know, for points A, B, and C, | ▶ 01:27 |
Is the gradient at this point positive, about zero, or negative? | ▶ 01:34 |
For each of those, check exactly one of those cases. | ▶ 01:40 |
In case A, the gradient is negative. | ▶ 00:00 |
If you move to the right in the X space, | ▶ 00:03 |
then your loss decreases. | ▶ 00:06 |
In B, it's about zero. | ▶ 00:09 |
In C, it's pointing up; it's positive. | ▶ 00:12 |
So if you apply the rule over here, | ▶ 00:15 |
if you were to start at A as your W-zero, | ▶ 00:18 |
then your gradient is negative. | ▶ 00:21 |
Therefore, you would add something to the value of W. | ▶ 00:23 |
You move to the right, and your loss has decreased. | ▶ 00:26 |
You do this until you find yourself | ▶ 00:29 |
with what's called a local minimum, where B resides. | ▶ 00:31 |
In this instance over here, gradient descent starting at A | ▶ 00:34 |
would not get you to the global minimum, | ▶ 00:37 |
which sits over here because there's a bump in between. | ▶ 00:39 |
Gradient methods are known to be subject to local minimum. | ▶ 00:42 |
I have another gradient quiz. | ▶ 00:00 |
Consider the following quadratic error function. | ▶ 00:03 |
We are considering the gradient in 3 different places. | ▶ 00:06 |
a. b. and c. | ▶ 00:09 |
And I ask you: which gradient is the largest? | ▶ 00:13 |
a, b, or c or are they all equal? | ▶ 00:17 |
In which case, you would want to check the last box over here | ▶ 00:23 |
And the answer is C. | ▶ 00:00 |
The derivative of a quadratic function is a linear function. | ▶ 00:04 |
Which would look about like this. | ▶ 00:08 |
And as we go outside, our gradient becomes larger and larger. | ▶ 00:11 |
This over here is much steeper than this curve over here. | ▶ 00:15 |
[Thrun] Here is a final gradient descent quiz. | ▶ 00:00 |
Suppose we have a loss function like this | ▶ 00:04 |
and our gradient descent starts over here. | ▶ 00:08 |
Will it likely reach the global minimum? | ▶ 00:12 |
Yes or no. | ▶ 00:15 |
Please check one of those boxes. | ▶ 00:17 |
[Thrun] And the answer is yes, | ▶ 00:00 |
although, technically speaking, to reach the absolute global minimum | ▶ 00:02 |
we need the learning rates to become smaller and smaller over time. | ▶ 00:06 |
If they stay constant, there is a chance this thing might bounce around | ▶ 00:11 |
between 2 points in the end and never reach the global minimum. | ▶ 00:15 |
But assuming that we implement gradient descent correctly, | ▶ 00:18 |
we will finally reach the global minimum. | ▶ 00:22 |
That's not the case if you start over here, where we can get stuck over here | ▶ 00:24 |
and settle for the minimum over here, which is a local minimum | ▶ 00:29 |
and not the best solution to our optimization problem. | ▶ 00:32 |
So one of the important points to take away from this is | ▶ 00:35 |
gradient descent is universally applicable to more complicated problems-- | ▶ 00:38 |
problems that don't have a closed-form solution. | ▶ 00:43 |
But you have to check whether there are many local minima, | ▶ 00:46 |
and if so, you have to worry about this. | ▶ 00:49 |
Any optimization book can tell you tricks how to overcome this. | ▶ 00:51 |
I won't go into any more depth here in this class. | ▶ 00:55 |
[Thrun] It's interesting to see how to minimize a loss function using gradient descent. | ▶ 00:00 |
In our linear case, we have L equals sum over the correct labels | ▶ 00:05 |
minus our linear function to the square, | ▶ 00:12 |
which we seek to minimize. | ▶ 00:16 |
We already know that this has a closed form solution, | ▶ 00:18 |
but just for the fun of it, let's look at gradient descent. | ▶ 00:21 |
The gradient of L with respect to W1 is minus 2, sum of all J | ▶ 00:25 |
of the difference as before but without the square times Xj. | ▶ 00:33 |
The gradient with respect to W0 is very similar. | ▶ 00:39 |
So in gradient descent we start with W1 0 and W0 0 | ▶ 00:43 |
where the upper cap 0 corresponds to the iteration index of gradient descent. | ▶ 00:49 |
And then we iterate. | ▶ 00:55 |
In the m-th iteration we get our new estimate by using the old estimate | ▶ 00:57 |
minus a learning rate of this gradient over here | ▶ 01:06 |
taken at the position of the old estimate, W1 at iteration m minus 1. | ▶ 01:10 |
Similarly, for W0 we get this expression over here. | ▶ 01:15 |
And these expressions look nasty, | ▶ 01:20 |
but what it really means is we subtract an expression like this | ▶ 01:24 |
every time we do gradient descent from W1 | ▶ 01:28 |
and an expression like this every time we do gradient descent from W0, | ▶ 01:31 |
which is easy to implement, and that implements gradient descent. | ▶ 01:36 |
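Here is the same gradient descent written out as a short Python sketch for the one-dimensional linear case; the data set is the one used in the closed-form example earlier, and the learning rate and iteration count are assumptions.

```python
def gd_linear_regression(xs, ys, alpha=0.01, iterations=10000, w1=0.0, w0=0.0):
    """Gradient descent on the quadratic loss of y = w1*x + w0."""
    for _ in range(iterations):
        errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        grad_w1 = -2 * sum(e * x for e, x in zip(errors, xs))
        grad_w0 = -2 * sum(errors)
        w1 -= alpha * grad_w1
        w0 -= alpha * grad_w0
    return w0, w1

print(gd_linear_regression([3, 6, 4, 5], [0, -3, -1, -2]))  # -> approx (3.0, -1.0)
```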
Now, there are many different ways to apply linear functions in machine learning. | ▶ 00:00 |
We so far have studied linear functions for regression, | ▶ 00:08 |
but linear functions are also used for classification, | ▶ 00:12 |
and specifically for an algorithm called the perceptron algorithm. | ▶ 00:16 |
This algorithm happens to be a very early model of a neuron, | ▶ 00:21 |
as in the neurons we have in our brains, | ▶ 00:27 |
and was invented in the 1940s. | ▶ 00:30 |
Suppose we give a data set of positive samples and negative samples. | ▶ 00:33 |
A linear separator is a linear equation that separates positive from negative examples. | ▶ 00:41 |
Obviously, not all sets possess a linear separator, but some do. | ▶ 00:49 |
For those we can define the algorithm of the perceptron and it actually converges. | ▶ 00:55 |
To define a linear separator, let's start with our linear equation as before-- | ▶ 01:02 |
w1x + w0; in cases where x is higher dimensional, w1 might actually be a vector--never mind. | ▶ 01:07 |
If this is larger or equal to zero, then we call our classification 1. | ▶ 01:18 |
Otherwise, we call it zero. | ▶ 01:26 |
Here's our linear separation classification function | ▶ 01:30 |
where this is our common linear function. | ▶ 01:35 |
Now, as I said, perceptron only converges if the data is linearly separable, | ▶ 01:39 |
and then it converges to a linear separation of the data, | ▶ 01:45 |
which is quite amazing. | ▶ 01:49 |
Perceptron is an iterative algorithm that is not dissimilar from gradient descent. | ▶ 01:52 |
In fact, the update rule echoes that of gradient descent, and here's how it goes. | ▶ 01:56 |
We start with a random guess for w1 and w0, | ▶ 02:03 |
which may correspond to a random separation line, | ▶ 02:09 |
but usually is inaccurate. | ▶ 02:13 |
Then the m-th estimate of weight i is obtained by using the old weight plus some learning rate alpha | ▶ 02:17 |
times the difference between the desired target label | ▶ 02:29 |
and the target label produced by our function at the point m-1. | ▶ 02:33 |
Now, this is an online learning rule, which is we don't process all the data in batch. | ▶ 02:39 |
We process one data at a time, and we might go through the data many, many times-- | ▶ 02:45 |
hence the j over here-- | ▶ 02:50 |
but every time we do this, we apply this rule over here. | ▶ 02:52 |
What this rule gives us is a method to adapt our weights in proportion to the error. | ▶ 02:55 |
If the prediction of our function f equals our target label, | ▶ 03:03 |
and the error is zero, then no update occurs. | ▶ 03:07 |
If there is a difference, however, we update in a way so as to minimize the error. | ▶ 03:11 |
Alpha is a small learning weight. | ▶ 03:18 |
Once again, perceptron converges to a correct linear separator | ▶ 03:22 |
if such a linear separator exists. | ▶ 03:28 |
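Here is a minimal sketch of this online update for one-dimensional inputs; the usual form multiplies the error by the input for the weight and uses a constant input of 1 for the offset, and the toy data is made up for illustration.

```python
def perceptron(data, alpha=0.1, passes=100, w1=0.0, w0=0.0):
    """Online perceptron: classify with f(x) = 1 if w1*x + w0 >= 0 else 0,
    and nudge the weights in proportion to the error on each example."""
    for _ in range(passes):
        for x, y in data:                  # one example at a time
            f = 1 if w1 * x + w0 >= 0 else 0
            w1 += alpha * (y - f) * x
            w0 += alpha * (y - f)
    return w1, w0

# Toy linearly separable data: label 1 for large x, 0 for small x.
data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (9, 1)]
w1, w0 = perceptron(data)
print(all((1 if w1 * x + w0 >= 0 else 0) == y for x, y in data))   # -> True
```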
Now, the case of linear separation has recently received a lot of attention in machine learning. | ▶ 03:31 |
If you look at the picture over here, you'll find there are many different linear separators. | ▶ 03:36 |
There is one over here. There is one over here. There is one over here. | ▶ 03:42 |
One of the questions that has recently been researched extensively is which one to prefer. | ▶ 03:47 |
Is it a, b, or c? | ▶ 03:53 |
Even though you probably have never seen this literature, | ▶ 03:57 |
I will just ask your intuition in this following quiz. | ▶ 04:01 |
Which linear separator would you prefer if you look at these three different linear separators-- | ▶ 04:05 |
a, b, c, or none of them? | ▶ 04:10 |
[Narrator] And intuitively I would argue it's B, | ▶ 00:00 |
and the reason why is | ▶ 00:04 |
C comes really close to examples. | ▶ 00:06 |
So if these examples are noisy, | ▶ 00:09 |
it's quite likely that | ▶ 00:12 |
by being so close to these examples | ▶ 00:14 |
that future examples cross the line. | ▶ 00:17 |
Similarly A comes close to examples. | ▶ 00:20 |
B is the one that stays really far away | ▶ 00:23 |
from any example. | ▶ 00:26 |
So there's this entire region over here | ▶ 00:28 |
where there's no example anywhere near B. | ▶ 00:31 |
This region is often called the margin. | ▶ 00:34 |
The margin of the linear separator | ▶ 00:37 |
is the distance of the separator | ▶ 00:40 |
to the closest training example. | ▶ 00:43 |
The margin is a really important concept | ▶ 00:45 |
in machine learning. | ▶ 00:47 |
There is an entire class of maximum margin | ▶ 00:49 |
learning algorithms, | ▶ 00:51 |
and the 2 most popular are | ▶ 00:53 |
support vector machines and boosting. | ▶ 00:56 |
If you are familiar with machine learning, | ▶ 01:00 |
you've come across these terms. | ▶ 01:02 |
These are very frequently used these days | ▶ 01:04 |
in actual discrimination learning tasks. | ▶ 01:07 |
I will not go into any details because it would go | ▶ 01:10 |
way beyond the scope of this introduction | ▶ 01:12 |
to artificial intelligence class, but let's see | ▶ 01:16 |
a few abstract words specifically about | ▶ 01:18 |
support vector machines or SVMs. | ▶ 01:21 |
As I said before a support vector machine | ▶ 01:25 |
derives a linear separator, and it takes | ▶ 01:30 |
the one that actually maximizes the margin | ▶ 01:34 |
as shown over here. | ▶ 01:39 |
By doing so it attains additional robustness | ▶ 01:42 |
over perceptron which only picks | ▶ 01:44 |
a linear separator without | ▶ 01:46 |
consideration of the margin. | ▶ 01:48 |
Now the problem of finding the | ▶ 01:51 |
margin maximizing linear separator | ▶ 01:53 |
can be solved by a quadratic program | ▶ 01:55 |
which is a numerical method for finding the best | ▶ 01:59 |
linear separator that maximizes the margin. | ▶ 02:03 |
One of the nice things that support | ▶ 02:06 |
vector machines do in practice is | ▶ 02:08 |
they use linear techniques to solve | ▶ 02:12 |
nonlinear separation problems, | ▶ 02:16 |
and I'm just going to give you a glimpse of | ▶ 02:19 |
what's happening without going into any detail. | ▶ 02:22 |
Suppose the data looks as follows: | ▶ 02:25 |
we have a positive class | ▶ 02:28 |
which is near the origin of a coordinate system | ▶ 02:31 |
and a negative class that surrounds the positive class. | ▶ 02:33 |
Clearly these 2 classes | ▶ 02:37 |
are not linearly separable | ▶ 02:39 |
because there's no line I can draw that | ▶ 02:41 |
separates the negative examples from the positive examples. | ▶ 02:43 |
An idea that underlies SVMs, | ▶ 02:47 |
that will ultimately be known as | ▶ 02:49 |
the kernel trick, | ▶ 02:51 |
is to augment the feature set by new features. | ▶ 02:53 |
Suppose this is X1, and this is X2, | ▶ 02:56 |
and normally X1 and X2 | ▶ 02:58 |
will be the input features. | ▶ 03:00 |
In this example, you might derive | ▶ 03:03 |
a 3rd one. | ▶ 03:05 |
Let me pick a 3rd one | ▶ 03:07 |
Suppose X3 equals the square root of | ▶ 03:09 |
X1 squared + X2 squared. | ▶ 03:13 |
In other words X3 is the distance | ▶ 03:18 |
of any data point from the center | ▶ 03:22 |
of the coordinate system. | ▶ 03:25 |
Then things do become linearly separable | ▶ 03:27 |
so that just along the 3rd dimension | ▶ 03:31 |
all the positive examples end up | ▶ 03:33 |
close to the origin, | ▶ 03:36 |
and all the negative examples | ▶ 03:39 |
are further away, and a line that is | ▶ 03:41 |
orthogonal to the 3rd input feature | ▶ 03:43 |
solves the separation problem. | ▶ 03:46 |
Mapped back into the space over here, | ▶ 03:49 |
this is actually a circle, which is the set of all | ▶ 03:52 |
points with the same value of X3, that is, equidistant | ▶ 03:55 |
from the origin. | ▶ 04:00 |
Now this trick could be done in any linear learning algorithm, | ▶ 04:02 |
and it's really an amazing trick. | ▶ 04:06 |
You can take any nonlinear problem, add | ▶ 04:08 |
features of this type or any other type, | ▶ 04:10 |
and use linear techniques | ▶ 04:13 |
and get better solutions. | ▶ 04:15 |
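A small sketch of this feature augmentation; the positive and negative points are made up so that the extra radial feature x3 makes a simple linear threshold separate the classes.

```python
import math
import random

def augment(x1, x2):
    """Add the radial feature x3 = sqrt(x1^2 + x2^2)."""
    return (x1, x2, math.hypot(x1, x2))

rng = random.Random(0)
positives = [augment(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
negatives = [augment(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 1.5, 2.5, 3.5, 4.5)]

# In the augmented space, the linear rule "positive if x3 < 2" separates the classes.
print(all(p[2] < 2 for p in positives), all(n[2] >= 2 for n in negatives))  # True True
```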
This is a very deep machine learning insight | ▶ 04:17 |
that you can extend your feature space | ▶ 04:19 |
in this way, and there's numerous | ▶ 04:21 |
papers written about this. | ▶ 04:23 |
In SVMs, the extension of the feature space is mathematically done by | ▶ 04:25 |
what's called a kernel. | ▶ 04:31 |
I can't really tell you about this in this class, | ▶ 04:33 |
but it makes it possible to write | ▶ 04:36 |
very large new feature spaces including | ▶ 04:38 |
infinitely dimensional new feature spaces. | ▶ 04:41 |
These methods are very powerful. | ▶ 04:44 |
It turns out you never | ▶ 04:46 |
really compute all those features. | ▶ 04:48 |
They are implicitly represented by | ▶ 04:50 |
so called kernels, and if you care about this, | ▶ 04:52 |
I recommend you to dive | ▶ 04:55 |
deeper into the literature | ▶ 04:57 |
of support vector machines. | ▶ 04:59 |
This is meant to just give you | ▶ 05:01 |
an overview of the essence of | ▶ 05:03 |
what support vector machines are all about. | ▶ 05:05 |
So in summary, | ▶ 05:08 |
linear methods we learned about | ▶ 05:10 |
using them for regression | ▶ 05:12 |
and also classification. | ▶ 05:15 |
We learned about exact solutions | ▶ 05:17 |
versus iterative solutions. | ▶ 05:19 |
We talked about smoothing, | ▶ 05:23 |
and we even talked about | ▶ 05:25 |
using linear methods for nonlinear problems. | ▶ 05:27 |
So we covered quite a bit of ground. | ▶ 05:30 |
This is a really significant cross section | ▶ 05:33 |
of machine learning. | ▶ 05:35 |
As the final method in this unit, I'd like now to talk about k-nearest neighbors. | ▶ 00:00 |
And the distinguishing factor of k-nearest neighbors | ▶ 00:06 |
is that it is a nonparametric machine learning method. | ▶ 00:09 |
So far we've talked about parametric methods. | ▶ 00:13 |
Parametric methods have parameters, like probabilities or weights, | ▶ 00:16 |
and the number of parameters is constant. | ▶ 00:21 |
Or to put it differently, the number of parameters is independent of the training set size. | ▶ 00:25 |
So for example in the Naive Bayes, if we bring up more data, | ▶ 00:29 |
the number of conditional probabilities will stay the same. | ▶ 00:34 |
Well, that wasn't technically always the case. | ▶ 00:37 |
Our vocabulary might increase, and with it the number of parameters. | ▶ 00:41 |
But for any fixed dictionary, the number of parameters is truly independent of the training set size. | ▶ 00:46 |
The same was true, for example, in our regression cases | ▶ 00:53 |
where the number of regression weights is independent of the number of data points. | ▶ 00:56 |
Now this is very different from non-parametric | ▶ 01:02 |
where the number of parameters can grow. | ▶ 01:06 |
In fact, it can grow a lot over time. | ▶ 01:10 |
Those techniques are called non-parametric. | ▶ 01:13 |
Nearest neighbor is so straightforward. | ▶ 01:16 |
I'd really like to introduce you using a quiz. | ▶ 01:20 |
So here's my quiz. | ▶ 01:23 |
Suppose we have a number of data points. | ▶ 01:25 |
I want you for 1-nearest neighbor to check those squared areas | ▶ 01:29 |
that you believe will carry a positive label. | ▶ 01:37 |
And I will give you the label of the existing data points. | ▶ 01:41 |
So please check any of those boxes that you believe are now | ▶ 01:45 |
1-nearest neighbor that carry a positive label. | ▶ 01:50 |
And the algorithm, of course, searches for the nearest point in this Euclidean space and just copies its label. | ▶ 01:54 |
And the answer was: This is a positive point, | ▶ 00:00 |
and this is a positive point. | ▶ 00:03 |
These 2 points over here are negative. | ▶ 00:05 |
So let's define k-nearest neighbors. | ▶ 00:08 |
The algorithm is really blatantly simple. | ▶ 00:12 |
In the learning step, you simply memorize all data. | ▶ 00:16 |
If a new example comes along whose input value you know | ▶ 00:20 |
but which you wish to classify, you do the following. | ▶ 00:23 |
You first find the k-nearest neighbors. | ▶ 00:28 |
And then you return the majority class label as your final class label for the new example. | ▶ 00:31 |
Simple, isn't it? | ▶ 00:38 |
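The whole algorithm fits in a few lines; here is a sketch for points in the plane, with made-up data and helper names of my own.

```python
from collections import Counter
import math

def knn_classify(query, data, k):
    """Memorize the data, find the k nearest points in Euclidean distance,
    and return the majority label among them."""
    neighbors = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

data = [((0, 0), "+"), ((1, 0), "+"), ((0, 1), "+"), ((5, 5), "-"), ((6, 5), "-")]
print(knn_classify((0.5, 0.5), data, k=3))   # -> '+'
print(knn_classify((5.5, 5.0), data, k=1))   # -> '-'
```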
So here's a somewhat contrived situation of the data point we wish to classify | ▶ 00:41 |
where the labeled data lies on a spiral of increasing diameter as it goes outwards. | ▶ 00:45 |
Please answer for me in this quiz what class label you'd assign | ▶ 00:53 |
for k = 1, k = 3, 5, 7, and all the way to 9. | ▶ 00:57 |
And this is an easy answer. | ▶ 00:00 |
The nearest neighbor is this point over here, | ▶ 00:02 |
so for 1 we say plus. | ▶ 00:04 |
For 3 neighbors, we get 2 positive, 1 negative. | ▶ 00:06 |
It's still plus. | ▶ 00:09 |
For 5 neighbors--1, 2, 3, 4, 5-- | ▶ 00:11 |
we get 3 negative, 2 positive. | ▶ 00:14 |
It's a minus. | ▶ 00:16 |
For 7, we get 3 positive but still 4 negative, | ▶ 00:18 |
so it's negative. | ▶ 00:21 |
And for 9, the positives outweigh the negative, | ▶ 00:23 |
so you get a plus. | ▶ 00:26 |
Obviously, as you can see, as K increases, | ▶ 00:28 |
more and more data points are being consulted. | ▶ 00:33 |
So when K finally becomes 9, | ▶ 00:35 |
all those data points are in and make a much smoother result. | ▶ 00:37 |
Just as in the Laplacian smoothing example before, | ▶ 00:00 |
the value of k is a smoothing parameter. | ▶ 00:05 |
It makes the function less scattered. | ▶ 00:08 |
Here is an example of k=1 | ▶ 00:11 |
for a 2-class nearest neighbor problem. | ▶ 00:15 |
You can see the separation boundary is what's called a Voronoi diagram | ▶ 00:18 |
between the positive and negative class, and | ▶ 00:25 |
in cases where there is noise between these class boundaries, | ▶ 00:29 |
you'll find really funny, complex boundaries as indicated over here. | ▶ 00:33 |
Particularly interesting is this guy over here where the class of this circle over here | ▶ 00:38 |
protrudes way into the otherwise solid class. | ▶ 00:45 |
Now, as you go to k=3, you get this graph over here, | ▶ 00:50 |
which is smoother. | ▶ 00:55 |
So if you are over here, your two nearest neighbors are of this type over there, | ▶ 00:57 |
and you get a uniform class over here. | ▶ 01:01 |
In this region over here, you get uniform classes as solid classes | ▶ 01:05 |
as shown over here. | ▶ 01:09 |
The more you drive up k, the more clean this decision boundary becomes, | ▶ 01:11 |
but the more outliers are actually misclassified as well. | ▶ 01:15 |
So if I go back to my k-nearest neighbor method, | ▶ 01:19 |
we just learned that k is a regularizer. | ▶ 01:22 |
It controls the complexity of the k-nearest neighbor algorithm. | ▶ 01:26 |
The larger k is, the smoother the output. | ▶ 01:30 |
We can, once again, use cross-validation to find the optimal k | ▶ 01:34 |
because there is an inherent trade-off between the complexity of what we want to fit | ▶ 01:38 |
and the goodness of the fit. | ▶ 01:42 |
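As a rough illustration of that trade-off, here is one way cross-validation over k might look (my own sketch, not the lecture's; leave-one-out error using the simple classifier shown earlier):

```python
import numpy as np

def knn_classify(X, y, xq, k):
    idx = np.argsort(np.linalg.norm(X - xq, axis=1))[:k]
    labels, counts = np.unique(y[idx], return_counts=True)
    return labels[np.argmax(counts)]

def loo_error(X, y, k):
    # Leave-one-out cross-validation: hold out each point in turn.
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        errors += knn_classify(X[mask], y[mask], X[i], k) != y[i]
    return errors / len(X)

# Pick the k (odd values, to avoid ties) with the lowest held-out error:
# best_k = min([1, 3, 5, 7, 9], key=lambda k: loo_error(X, y, k))
```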
What are the problems of kNN? | ▶ 00:00 |
Well, I would argue that there are two. | ▶ 00:02 |
One is very large data sets, | ▶ 00:04 |
and one is very large feature spaces. | ▶ 00:07 |
Now the first one results in lengthy searches | ▶ 00:10 |
when you try to find the k nearest neighbors. | ▶ 00:14 |
Now, fortunately there are | ▶ 00:17 |
methods to search efficiently. | ▶ 00:19 |
Often you represent your data | ▶ 00:22 |
not by a linear list, in which case the search | ▶ 00:24 |
would be linear in the number of data points, | ▶ 00:27 |
but by a tree, where the search becomes logarithmic. | ▶ 00:29 |
The method of choice is called k-d trees, | ▶ 00:34 |
though there are many other ways | ▶ 00:38 |
to represent data points as trees. | ▶ 00:40 |
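As an illustration of such a tree-based search, here is a small sketch using SciPy's k-d tree (assuming SciPy is installed; the random data is purely illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

data = np.random.rand(10000, 3)                 # 10,000 points in 3 dimensions
tree = cKDTree(data)                            # build the tree once ("memorize" the data)
dists, idx = tree.query([0.5, 0.5, 0.5], k=5)   # 5 nearest neighbors, roughly log-time search
```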
Now very large feature spaces | ▶ 00:43 |
cause more of a problem. | ▶ 00:45 |
It turns out computing nearest neighbors, | ▶ 00:48 |
as the feature space for the input vector increases, | ▶ 00:51 |
becomes increasingly difficult, | ▶ 00:54 |
and the tree methods become increasingly brittle. | ▶ 00:57 |
And the reason is shown in the following graph: | ▶ 01:00 |
If you graph input dimension against | ▶ 01:03 |
the average edge length of your neighborhood, | ▶ 01:06 |
you'll find that for randomly chosen points, | ▶ 01:09 |
very quickly all points are really far away. | ▶ 01:12 |
An edge length of one is obtained | ▶ 01:16 |
if your query point | ▶ 01:19 |
is one unit away from its nearest neighbor. | ▶ 01:23 |
If you have one hundred dimensions, | ▶ 01:26 |
that is almost certain. | ▶ 01:28 |
Why is that? | ▶ 01:29 |
Well, in one hundred dimensions, | ▶ 01:31 |
there tends to be one in which, just by chance, | ▶ 01:33 |
you're far away. | ▶ 01:35 |
The number of points you need | ▶ 01:37 |
to get something close | ▶ 01:39 |
grows exponentially with the number of dimensions. | ▶ 01:40 |
So, for any fixed data set size | ▶ 01:45 |
you will find yourself in a situation | ▶ 01:47 |
where all your neighbors are far away. | ▶ 01:49 |
Nearest neighbor works really well | ▶ 01:52 |
for small input spaces like three or four dimensions. | ▶ 01:54 |
It works very poorly | ▶ 01:58 |
if your input space is twenty, twenty-five, | ▶ 01:59 |
or maybe one hundred dimensions. | ▶ 02:01 |
So don't trust nearest neighbor to do a good job | ▶ 02:03 |
if your input and feature spaces are high-dimensional. | ▶ 02:06 |
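A tiny experiment, along the lines of the graph just described, shows this effect (my sketch; the exact numbers depend on the random draw):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100]:
    points = rng.random((1000, d))       # 1000 random points in the unit cube
    query = rng.random(d)
    nearest = np.min(np.linalg.norm(points - query, axis=1))
    print(d, round(nearest, 3))          # the nearest-neighbor distance grows with d
```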
So congratulations. | ▶ 00:00 |
You've just learned a lot about machine learning. | ▶ 00:02 |
We focused on supervised machine learning | ▶ 00:04 |
which deals with situations | ▶ 00:07 |
where you have input vectors | ▶ 00:10 |
and given output labels | ▶ 00:11 |
and your goal is to predict the output label | ▶ 00:13 |
from an input vector. | ▶ 00:16 |
And we looked into parametric models | ▶ 00:18 |
like Naive Bayes | ▶ 00:21 |
and non-parametric models. | ▶ 00:23 |
We talked about classification | ▶ 00:25 |
where the output is discrete | ▶ 00:27 |
versus regression where the output is continuous | ▶ 00:29 |
and we looked at samples of techniques | ▶ 00:32 |
for each of these situations. | ▶ 00:34 |
Now obviously | ▶ 00:36 |
we just scratched the surface on machine learning. | ▶ 00:38 |
There's books written about it | ▶ 00:40 |
and courses taught about it. | ▶ 00:41 |
Machine learning is a super fascinating topic. | ▶ 00:43 |
It's the area within artificial intelligence | ▶ 00:46 |
that I love the most. | ▶ 00:48 |
And what's great is that in the real world, | ▶ 00:50 |
as we gain more data, | ▶ 00:52 |
like the world wide web | ▶ 00:54 |
or medical data sets | ▶ 00:55 |
or financial data sets, | ▶ 00:56 |
Machine learning is poised | ▶ 00:58 |
to become more and more important. | ▶ 00:59 |
I hope that the things you learned in this class so far | ▶ 01:02 |
really excite you | ▶ 01:05 |
and entice you to apply machine learning | ▶ 01:07 |
to problems that you face | ▶ 01:09 |
in your professional life. | ▶ 01:11 |
[Narrator] So welcome to the class | ▶ 00:00 |
on unsupervised learning. | ▶ 00:02 |
We talked a lot about supervised learning | ▶ 00:04 |
in which we are given data and target labels. | ▶ 00:06 |
In unsupervised learning we're just given data. | ▶ 00:09 |
So here's a data matrix of | ▶ 00:12 |
data items of N features each. | ▶ 00:15 |
There are M in total. | ▶ 00:17 |
So the task of unsupervised learning is | ▶ 00:19 |
to find structure in data of this type. | ▶ 00:21 |
To illustrate why this is an interesting problem | ▶ 00:25 |
let me start with a quiz. | ▶ 00:28 |
Suppose we have 2 feature values. | ▶ 00:31 |
One over here, and one over here, | ▶ 00:34 |
and our data looks as follows. | ▶ 00:37 |
Even though I haven't told you | ▶ 00:39 |
anything about unsupervised learning yet, I'd like to | ▶ 00:41 |
quiz your intuition on the following 2 questions: | ▶ 00:43 |
1. Is there structure? | ▶ 00:46 |
Or put differently do you think there's | ▶ 00:48 |
something to be learned about data like this, | ▶ 00:51 |
or is it entirely random? | ▶ 00:53 |
And second, to narrow this down, | ▶ 00:57 |
it feels to me that there are clusters | ▶ 01:01 |
in data like this. | ▶ 01:03 |
So how many clusters can you see? | ▶ 01:05 |
And I give you a couple of choices, 1, 2, 3, 4, or none. | ▶ 01:08 |
[Narrator] The answer to the first question is yes, there is structure. | ▶ 00:00 |
Obviously these data do not seem to be completely random. | ▶ 00:03 |
To me, there seem to be 2 clusters. | ▶ 00:06 |
So the correct answer for the second question is 2. | ▶ 00:09 |
There's a cluster over here, and there's a cluster over here. | ▶ 00:12 |
So one of the tasks of unsupervised learning | ▶ 00:15 |
will be to recover the number of clusters, and | ▶ 00:17 |
the center of these clusters, and the variances of these clusters in data | ▶ 00:21 |
of the type I've just shown you. | ▶ 00:23 |
[Narrator] Let me ask you a second quiz. | ▶ 00:00 |
Again, we haven't talked about any details. | ▶ 00:03 |
I would like to get your intuition on the following question. | ▶ 00:06 |
Suppose in a two dimensional space, | ▶ 00:08 |
all data lies as follows. | ▶ 00:10 |
This may be reminiscent of the question I | ▶ 00:12 |
asked you for housing prices and square footage. | ▶ 00:14 |
Suppose we have 2 axes, X1 and X2. | ▶ 00:17 |
I'm going to ask you 2 questions here. | ▶ 00:20 |
One is what is the dimensionality of this space | ▶ 00:22 |
in which this data falls, and the second one | ▶ 00:25 |
is an intuitive question which is | ▶ 00:27 |
how many dimensions do you need | ▶ 00:29 |
to represent this data to capture the essence, | ▶ 00:32 |
and, again, this is not a clear crisp 0 or 1 type question, | ▶ 00:35 |
but give me your best guess. | ▶ 00:38 |
How many dimensions are intuitively needed? | ▶ 00:40 |
No subtitles... | ▶ 00:00 |
[Narrator] So to start with some lingo about unsupervised learning. | ▶ 00:00 |
If you look at this as a probabilist, you're given data, and | ▶ 00:03 |
we typically assume the data is IID, | ▶ 00:06 |
which means independently drawn and identically distributed, | ▶ 00:09 |
all from the same distribution. | ▶ 00:11 |
So a good chunk of unsupervised learning | ▶ 00:13 |
seeks to recover the underlying density--the | ▶ 00:15 |
probability distribution that generated the data. | ▶ 00:18 |
It's called density estimation. | ▶ 00:21 |
As we will find out, our methods for clustering | ▶ 00:23 |
are versions of density estimation | ▶ 00:25 |
using what's called mixture models. | ▶ 00:27 |
Dimensionality reduction is also a method | ▶ 00:29 |
for doing density estimation, | ▶ 00:31 |
and there are many others. | ▶ 00:33 |
Unsupervised learning can be applied to find | ▶ 00:35 |
structure in data. | ▶ 00:37 |
One of the fascinating ones that | ▶ 00:39 |
I believe exists is called | ▶ 00:41 |
blind signal separation. | ▶ 00:43 |
Suppose you are given a microphone, and | ▶ 00:45 |
two people simultaneously talk, and you | ▶ 00:48 |
record the joint signal of both of those speakers. | ▶ 00:51 |
Blind source separation or blind signal separation | ▶ 00:54 |
addresses the question of can you recover | ▶ 00:56 |
those two speakers and filter | ▶ 00:59 |
the data into two separate streams. | ▶ 01:01 |
One for each speaker. | ▶ 01:03 |
Now this is a really complicated unsupervised | ▶ 01:05 |
learning task, but is one of the many things | ▶ 01:07 |
that don't require target signals, as in unsupervised learning, | ▶ 01:09 |
yet make for | ▶ 01:11 |
really interesting learning problems. | ▶ 01:13 |
This can be construed as an example | ▶ 01:15 |
of what's called factor analysis where each | ▶ 01:17 |
speaker is a factor in the joint signal that your microphone records. | ▶ 01:19 |
There are many other examples of unsupervised learning. | ▶ 01:23 |
I will show you a few in a second. | ▶ 01:25 |
Here is one of my favorite examples of unsupervised learning-- | ▶ 00:00 |
one that is yet unsolved. | ▶ 00:03 |
At Google, I had the opportunity to participate-- | ▶ 00:05 |
in the building of Street View, | ▶ 00:08 |
which is a huge photographic database-- | ▶ 00:10 |
of many, many streets in the world. | ▶ 00:13 |
As you dive into Street View-- | ▶ 00:16 |
you can get ground imagery-- | ▶ 00:18 |
of almost any location in the world-- | ▶ 00:20 |
like this house here, that I chose at random. | ▶ 00:23 |
In these images, there are vast regularities. | ▶ 00:26 |
You can go somewhere else-- | ▶ 00:29 |
and you'll find that the type of objects-- | ▶ 00:31 |
visible in Street View-- | ▶ 00:33 |
is not entirely random. | ▶ 00:35 |
For example, there are many images of homes-- | ▶ 00:37 |
many images of cars-- | ▶ 00:39 |
trees, pavement, lane markers-- | ▶ 00:41 |
stop signs, just to name a few. | ▶ 00:44 |
So one of the fascinating, unsolved, unsupervised learning tasks is: | ▶ 00:47 |
Can you take hundreds of billions of images-- | ▶ 00:52 |
as comprised in the Street View data set-- | ▶ 00:55 |
and discover from it that there are concepts such as-- | ▶ 00:58 |
trees, lane markers, stop signs, cars, and pedestrians? | ▶ 01:01 |
It seems to be tedious to hand label each image-- | ▶ 01:05 |
for the occurrence of such objects. | ▶ 01:07 |
And attempts to do so-- | ▶ 01:09 |
have resulted in very small image data sets. | ▶ 01:11 |
Humans can learn from data-- | ▶ 01:14 |
even without explicit target labels. | ▶ 01:16 |
We often just observe. | ▶ 01:18 |
In observing, we apply unsupervised learning techniques. | ▶ 01:20 |
So one of the great, great open questions of artificial intelligence is: | ▶ 01:23 |
Can you observe many intersections and many streets and many roads-- | ▶ 01:27 |
and learn from it what concepts are contained in the imagery? | ▶ 01:32 |
Of course, I can't teach you anything as complex in this class. | ▶ 01:35 |
I don't even know the answer myself. | ▶ 01:39 |
So let me start with something simple. | ▶ 01:41 |
Clustering. Clustering is the most basic form of unsupervised learning. | ▶ 01:43 |
And I will tell you about two algorithms that are very related. | ▶ 01:47 |
One is called k-means, | ▶ 01:50 |
one is called expectation maximization. | ▶ 01:52 |
K-means is a nice, intuitive algorithm to derive clusterings. | ▶ 01:55 |
Expectation maximization is a probabilistic-- | ▶ 01:59 |
generalization of k-means. | ▶ 02:02 |
It can be derived from first principles. | ▶ 02:04 |
Let me explain k-means by an example. | ▶ 00:00 |
Suppose we're given the following data points in a 2-dimensional space. | ▶ 00:03 |
K-means estimates, for a fixed number of clusters k (here k = 2), | ▶ 00:07 |
the best centers of clusters representing those data points. | ▶ 00:12 |
Those are found iteratively by the following algorithm. | ▶ 00:17 |
Step 1: Guess cluster centers at random, as shown over here with the 2 stars. | ▶ 00:20 |
Step 2: Assign to each cluster center, even though they are randomly chosen, | ▶ 00:25 |
the most likely corresponding data points. | ▶ 00:30 |
This is done by minimizing Euclidean distance. | ▶ 00:33 |
In particular, each cluster center represents half of the space. | ▶ 00:36 |
And the line that separates the space between the left and right cluster center | ▶ 00:41 |
is the equidistant line, often called a Voronoi graph. | ▶ 00:45 |
All the data points on the left correspond to the red cluster, | ▶ 00:48 |
and the ones on the right to the green cluster. | ▶ 00:53 |
Step 3: Given now we have a correspondence between the data points and cluster centers, | ▶ 00:55 |
find the optimal cluster center that corresponds to the points associated with the cluster center. | ▶ 01:00 |
Our red cluster center has only 2 data points attached. | ▶ 01:06 |
So the optimal cluster center would be the halfway point in the middle. | ▶ 01:09 |
Our right cluster center has more than 2 points attached; | ▶ 01:13 |
yet it isn't placed optimally, as you can see as they move with the animation back and forth. | ▶ 01:16 |
By minimizing the joint quadratic distance to all of those points, | ▶ 01:21 |
the new cluster center has attained the center of those data points. | ▶ 01:25 |
Now the final step is iterate. Go back and reassign cluster centers. | ▶ 01:29 |
Now the Voronoi diagram has shifted, and the points are associated differently, | ▶ 01:35 |
and then reevaluate what the optimal cluster center looks like given the associated points. | ▶ 01:39 |
And in both cases we see significant motion. | ▶ 01:45 |
Repeat. Now this is the clustering. | ▶ 01:47 |
The point association doesn't change, and as a result, we just converged. | ▶ 01:49 |
You just learned about an exciting clustering algorithm | ▶ 00:00 |
that's really easy to implement called k-means. | ▶ 00:03 |
To give you the algorithm in pseudocode, | ▶ 00:07 |
initially we select k cluster centers at random and then we repeat. | ▶ 00:09 |
In a corresponding step, we correspond all the data points to the nearest cluster center, | ▶ 00:15 |
and then we calculate the new cluster center by the mean of the corresponding data points. | ▶ 00:21 |
We repeat this until nothing changes any more. | ▶ 00:26 |
Now special care has to be taken if a cluster center becomes empty--that means no data point is associated. | ▶ 00:30 |
In which case, we just restart cluster centers at random that have no corresponding points. | ▶ 00:37 |
Empty cluster centers restart at random. | ▶ 00:43 |
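Here is a minimal sketch of that pseudocode in Python (assuming numpy; real implementations add smarter initialization and tolerance checks):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)  # random initial centers
    for _ in range(n_iter):
        # Correspondence step: assign each point to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members) == 0:
                new_centers[j] = X[rng.integers(len(X))]   # empty cluster: restart at random
            else:
                new_centers[j] = members.mean(axis=0)      # mean of the corresponding points
        if np.allclose(new_centers, centers):              # stop when nothing changes any more
            break
        centers = new_centers
    return centers, assign
```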
This algorithm is known to converge to a locally optimal clustering of data points. | ▶ 00:46 |
The general clustering problem is known to be NP-hard. | ▶ 00:54 |
So a locally optimal solution, in a way, is the best we can hope for. | ▶ 00:58 |
Now let me talk about problems with k-means. | ▶ 01:03 |
First we need to know k, the number of cluster centers. | ▶ 01:06 |
As I mentioned, the local minimum. | ▶ 01:09 |
For example, for 4 data points like this and 2 cluster centers that happen to be just over here, | ▶ 01:10 |
with the separation line like this, there would be no motion of k-means. | ▶ 01:16 |
Even though moving one over here and one over there would give a better solution. | ▶ 01:20 |
There's a general problem of high dimensionality of the space | ▶ 01:25 |
that is not dissimilar from the way k-nearest neighbor suffers from high dimensionality. | ▶ 01:28 |
And then there's lack of a mathematical basis. | ▶ 01:32 |
Now if you're a practitioner, you might not care about a mathematical basis. | ▶ 01:35 |
But for the sake of this class, let's just care about it. | ▶ 01:38 |
So here's a first quiz for k-means. | ▶ 01:42 |
Given the following two cluster centers, C1 and C2, | ▶ 01:45 |
click on exactly those points that are associated with C1 and not with C2. | ▶ 01:50 |
And the answer is these 4 points over here. | ▶ 00:00 |
And the reason is, if you draw the line of equal distance between C1 and C2, | ▶ 00:05 |
the separation of these 2 cluster areas falls over here. | ▶ 00:11 |
C2 is down there. C1 is up here. | ▶ 00:15 |
So here's my second quiz. | ▶ 00:00 |
Given the association that we just derived for C1, where do you think the new cluster center, | ▶ 00:01 |
C1, will be found after a single step of estimating its best location given the associated points? | ▶ 00:06 |
I'll give you a couple of choices. | ▶ 00:13 |
Please click on the one that you find most plausible. | ▶ 00:14 |
And the answer is, over here. | ▶ 00:00 |
These 4 data points are associated with C1, so we can safely ignore all the other ones. | ▶ 00:02 |
This one over here would be at the center of the 3 data points over here, | ▶ 00:08 |
but this one pulls back this data point drastically towards it. | ▶ 00:12 |
This is about the best trade-off between these 3 points over here that all have a string | ▶ 00:17 |
attached and pull in this direction, compared to this point over here. | ▶ 00:22 |
Any of the other ones don't even lie between those points, | ▶ 00:26 |
and therefore won't be good cluster centers. | ▶ 00:29 |
The one over here is way too far to the right. | ▶ 00:32 |
In our next quiz let's assume we've done one iteration, | ▶ 00:00 |
and the cluster center of C1 moved over there and C2 moved over here. | ▶ 00:03 |
Can you once again click on all those data points that correspond to C1? | ▶ 00:07 |
And the answer is now simple. It's this one over here. | ▶ 00:00 |
This one, this one, and this one. | ▶ 00:03 |
And the reason is, the line separating both clusters runs around here. | ▶ 00:05 |
That means all the area over here is C2 territory, and the area over here is C1 territory. | ▶ 00:10 |
Obviously, as we now iterate k-means, this cluster that has moved straight over | ▶ 00:16 |
here will be able to stay, whereas C2 will end up somewhere over here. | ▶ 00:20 |
So, let's now generalize k-means into expectation maximization. | ▶ 00:00 |
Expectation maximization is an algorithm that uses actual probability distributions | ▶ 00:05 |
to describe what we're doing, and it's in many ways more general, | ▶ 00:10 |
and it's also nice in that it really has a probabilistic basis. | ▶ 00:14 |
To get there, I have to take the discourse and tell you all about Gaussians, | ▶ 00:17 |
or the normal distribution, and the reason is so far, | ▶ 00:21 |
we've just encountered discrete distributions, | ▶ 00:24 |
and Gaussians will be the first example of a continuous distribution. | ▶ 00:26 |
Many of you know that a Gaussian is described by a density that looks as follows, | ▶ 00:30 |
where the mean is called mu, and the variance is called sigma or sigma squared. | ▶ 00:34 |
And for any X along the horizontal axis, the density is given by the following function: | ▶ 00:41 |
1 over square root of 2 pi times sigma, and then an exponential function | ▶ 00:47 |
of minus ½ of x - mu squared over sigma squared. | ▶ 00:52 |
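Written out as a formula (just a transcription of what was said, in standard notation), the density is:

```latex
f(x \mid \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left( -\frac{1}{2}\,\frac{(x-\mu)^2}{\sigma^2} \right)
```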
This function might look complex, but it's also very, very beautiful. | ▶ 00:56 |
It peaks at X = mu where the value in the exponent becomes 0. | ▶ 01:01 |
And towards plus or minus infinity, it goes to 0 quickly. | ▶ 01:07 |
In fact, exponentially fast. | ▶ 01:11 |
The argument inside is a quadratic function. | ▶ 01:14 |
The exponential function makes it exponential. | ▶ 01:16 |
And this over here is a normalizer to make sure that the area underneath | ▶ 01:20 |
sums up to one, which is characteristic of any probability density function. | ▶ 01:23 |
If you map this back to our discrete random variables, | ▶ 01:29 |
for each possible X, we can now assign a density value, | ▶ 01:32 |
which is the function of this, and that's effectively | ▶ 01:37 |
the probability that this X might be drawn. | ▶ 01:41 |
Now, the space itself is infinite, so any individual value will have a probability of 0, | ▶ 01:43 |
but what you can do is you can make an interval, A and B, | ▶ 01:48 |
and the area underneath this function is the total probability | ▶ 01:52 |
that an experiment will come up between A and B. | ▶ 01:56 |
Clearly, it's more likely to generate values around mu | ▶ 02:00 |
than it is to generate values out in the periphery over here. | ▶ 02:03 |
And just for completeness, I'm going to give you the formula | ▶ 02:07 |
for what's called the multi-variate Gaussian | ▶ 02:09 |
where multi-variate means nothing else but we have more than one input variable. | ▶ 02:12 |
You might have a Gaussian over a 2-dimensional space or a 3-dimensional space. | ▶ 02:17 |
Often, these Gaussians are drawn by what's called level sets, | ▶ 02:21 |
sets of equal probability. | ▶ 02:24 |
Here's one in a 2-dimensional space, X1 and X2. | ▶ 02:26 |
The Gaussian itself can be thought of as coming out of the paper towards me | ▶ 02:30 |
where the most likely or highest point of probability is the center over here. | ▶ 02:35 |
And these rings measure areas of equal probability. | ▶ 02:39 |
The formula for a multi-variate Gaussian looks as follows: | ▶ 02:43 |
N is the number of dimensions in the input space. | ▶ 02:49 |
Sigma is a covariance matrix that generalizes the value over here. | ▶ 02:53 |
And the inner product inside the exponential | ▶ 02:57 |
is now done using linear algebra where this is the difference between | ▶ 03:02 |
a probe point and the mean vector mu | ▶ 03:08 |
transposed sigma to the minus 1 times X - mu. | ▶ 03:12 |
You can find this formula in any textbook or web page | ▶ 03:16 |
on Gaussians or multi-variate normal distributions. | ▶ 03:21 |
It looks cryptic at first, but the key thing to remember is | ▶ 03:25 |
it's just a generalization of the 1-dimensional case. | ▶ 03:29 |
We have a quadratic area over here as manifested by the product | ▶ 03:33 |
of this guy and this guy. | ▶ 03:36 |
We have a normalization by a variance or covariance | ▶ 03:38 |
as shown by this number over here or the inverse matrix over here. | ▶ 03:42 |
And then this entire thing is an exponential form in both cases, | ▶ 03:48 |
and the normalizer looks a little more different in the multi-variate case, | ▶ 03:51 |
but all it does is make sure that the volume underneath adds up to 1 | ▶ 03:55 |
to make it a legitimate probability density function. | ▶ 03:59 |
For most of this explanation, I will stick with 1-dimensional Gaussians, | ▶ 04:02 |
so all you have to do is to worry about this formula over here, | ▶ 04:07 |
but this is given just for completeness. | ▶ 04:10 |
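For reference, the multivariate density described above can be written in the standard textbook form (with N the number of input dimensions):

```latex
f(x \mid \mu, \Sigma) \;=\; \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\,
\exp\!\left( -\frac{1}{2}\,(x-\mu)^{\mathsf{T}}\,\Sigma^{-1}\,(x-\mu) \right)
```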
I will now talk about fitting Gaussians to data or Gaussian learning. | ▶ 00:00 |
You may be given some data points, and you might worry about | ▶ 00:06 |
what is the best Gaussian fitting the data? | ▶ 00:09 |
Now, to explain this, let me first tell you what parameters characterizes a Gaussian. | ▶ 00:12 |
In the 1-dimensional case, it is mu and sigma squared. | ▶ 00:18 |
Mu is the mean. Sigma squared is called the variance. | ▶ 00:24 |
If we look at the formula of a Gaussian, it's a function over any possible input X, | ▶ 00:28 |
and it requires knowledge of mu and sigma squared. | ▶ 00:34 |
And as before, I'm just restating what I said before. | ▶ 00:38 |
We get this function over here that specifies any probability | ▶ 00:42 |
for a value X given a specific mu and sigma squared. | ▶ 00:48 |
Suppose we wish to fit data, and our data is 1-dimensional, and it looks as follows. | ▶ 00:53 |
Just looking at this diagram makes me believe | ▶ 01:01 |
that there's a high density of data points over here | ▶ 01:03 |
and a fading density of data points over there, | ▶ 01:06 |
so maybe the most likely Gaussian will look a little bit like this | ▶ 01:09 |
where this is mu and this is sigma. | ▶ 01:13 |
There are really easy formulas for fitting Gaussians to data, | ▶ 01:17 |
and I'll give you the result right now. | ▶ 01:21 |
The optimal or most likely mean is just the average of the data points. | ▶ 01:23 |
There's M data points, X1 to Xm. | ▶ 01:30 |
The average will look like this. | ▶ 01:33 |
The sum of all data points divided by the total number of data points. | ▶ 01:35 |
That's called the average, and once you calculate the average, | ▶ 01:41 |
the sigma squared is obtained by a similar normalization | ▶ 01:44 |
in a slightly more complex sum. | ▶ 01:48 |
We sum the deviation from the mean | ▶ 01:51 |
and compute the average deviation to the square from the mean, | ▶ 01:54 |
and that gives us sigma squared. | ▶ 01:58 |
So, intuitively speaking, the formulas are really easy. | ▶ 02:00 |
Mu is the mean, or the average. | ▶ 02:03 |
Sigma squared is the average quadratic deviation from the mean, as shown over here. | ▶ 02:06 |
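In symbols, the two fitting formulas just described are:

```latex
\mu \;=\; \frac{1}{M}\sum_{j=1}^{M} x_j
\qquad\qquad
\sigma^2 \;=\; \frac{1}{M}\sum_{j=1}^{M} (x_j - \mu)^2
```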
Now I want to take a second to convince ourselves | ▶ 00:00 |
this is indeed the maximum likelihood estimate | ▶ 00:03 |
of the mean and the variance. | ▶ 00:06 |
Suppose our data looks like this-- | ▶ 00:09 |
There's "M" data points. | ▶ 00:12 |
And the probability of those data points | ▶ 00:15 |
for any Gaussian model--mu and sigma squared | ▶ 00:18 |
is the product of the individual data likelihoods, x-i. | ▶ 00:22 |
And if you plug in our Gaussian formula, you get the following-- | ▶ 00:29 |
This is the normalizer multiplied "M" times | ▶ 00:34 |
where the square root is now drawn into the half over here, | ▶ 00:37 |
and here is our joint exponential. | ▶ 00:43 |
We took the product of the individual exponentials | ▶ 00:45 |
and moved it up straight in here where it becomes a sum. | ▶ 00:49 |
So the best estimates for mu and sigma squared | ▶ 00:53 |
are those that maximize this entire expression over here | ▶ 00:58 |
for given data set X1 to Xm. | ▶ 01:01 |
So we seek to maximize this over the unknown parameters | ▶ 01:05 |
mu and sigma squared. | ▶ 01:08 |
And now I will apply a trick. | ▶ 01:10 |
Instead of maximizing this expression, | ▶ 01:12 |
I will maximize the logarithm of this expression. | ▶ 01:14 |
The logarithm is a monotonic function. | ▶ 01:17 |
So let's maximize instead the logarithm | ▶ 01:19 |
where this expression over here resolves to this expression over here. | ▶ 01:23 |
The multiplication becomes a minus sign from over here, | ▶ 01:27 |
and this is the argument inside the exponent | ▶ 01:32 |
written slightly differently, | ▶ 01:35 |
but pulling the 2 sigma squared to the left. | ▶ 01:37 |
So let's maximize this one instead. | ▶ 01:40 |
The maximum was obtained where the first | ▶ 01:42 |
derivative is zero. | ▶ 01:45 |
If we do this for our variable mu, | ▶ 01:48 |
we take the "log f" expression and | ▶ 01:51 |
compute the derivative with respect to mu, | ▶ 01:53 |
we get the following-- | ▶ 01:56 |
This expression does not depend on mu at all, so it falls out. | ▶ 01:58 |
And we can still get this expression over here, which we've set to zero. | ▶ 02:01 |
And now we can multiply everything by sigma squared, which is still equal to zero, | ▶ 02:05 |
and then bring the Xi to the right and the mu to the left. | ▶ 02:11 |
The sum over all "E" of the mu is mu equals sum over i, xi. | ▶ 02:15 |
Hence, we proved that the mean is indeed the maximum likelihood estimate | ▶ 02:24 |
for the Gaussian. | ▶ 02:31 |
This is now easily repeated for the variance. | ▶ 02:33 |
If you compute the derivative of this expression over here | ▶ 02:38 |
with respect to the variance, | ▶ 02:41 |
we get minus "m" over sigma, which happens to be the derivative | ▶ 02:43 |
of this expression over here. | ▶ 02:48 |
Keep in mind that the derivative of | ▶ 02:50 |
a logarithm is 1 over its internal argument | ▶ 02:53 |
times, by the chain rule, the derivative of the internal argument, | ▶ 02:57 |
which if you work out becomes this expression over here. | ▶ 03:01 |
And this guy over here changes signs | ▶ 03:05 |
but becomes the following. | ▶ 03:08 |
And again, you move this guy to the left side, | ▶ 03:10 |
multiply by sigma cubed, and divide by "m". | ▶ 03:13 |
So we get the following result over here. | ▶ 03:18 |
You might take a moment to verify these steps over here, | ▶ 03:22 |
I was a little bit fast, | ▶ 03:25 |
but this is relatively straightforward mathematics. | ▶ 03:27 |
And if you will verify them, | ▶ 03:32 |
you will find that the maximum likelihood estimate | ▶ 03:34 |
for sigma squared is the average | ▶ 03:36 |
deviation of data points from the mean mu. | ▶ 03:39 |
This gives us a very nice basis to fit | ▶ 03:43 |
Gaussians to data points. | ▶ 03:45 |
So keeping these formulas in mind, here's a quick quiz, | ▶ 03:48 |
which I ask you to actually calculate the mean and variance for a data sequence. | ▶ 03:52 |
So suppose the data you observe is 3, 4, 5, 6, and 7. | ▶ 03:58 |
There are 5 data points. | ▶ 04:02 |
Compute for me the mean and the variance | ▶ 04:04 |
using the maximum likelihood estimator I just gave you. | ▶ 04:07 |
So the mean is obviously 5, | ▶ 00:00 |
it's the middle value over here. | ▶ 00:02 |
If I add those things together, I get 25 and divide by 5. | ▶ 00:04 |
The average value over here is 5. | ▶ 00:08 |
The more interesting case is sigma square, | ▶ 00:10 |
and I do this in the following steps-- | ▶ 00:12 |
I subtract 5 from each of the data points | ▶ 00:14 |
for which I get -2, -1, 0, 1, and 2. | ▶ 00:16 |
I square those differences, | ▶ 00:20 |
which gives me 4, 1, 0, 1, 4. | ▶ 00:22 |
And now I compute the mean of those square differences. | ▶ 00:24 |
To do so, I add them all up, which is 10. | ▶ 00:27 |
10 divided by 5 is 2, and sigma square equals 2. | ▶ 00:30 |
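A quick numerical check of this quiz (assuming numpy; ddof=0 gives the maximum likelihood 1/m variance rather than the unbiased 1/(m-1) one):

```python
import numpy as np

data = np.array([3, 4, 5, 6, 7], dtype=float)
print(data.mean())        # 5.0
print(data.var(ddof=0))   # 2.0
```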
Here is another quiz--Suppose my DATA | ▶ 00:00 |
looks as follows--3,9,9,3. | ▶ 00:02 |
Compute for me mu and sigma squared | ▶ 00:06 |
using the maximum likelihood estimator I just gave you. | ▶ 00:08 |
And the answer is relatively easy. | ▶ 00:00 |
3 + 9 + 9 + 3 = 24 | ▶ 00:02 |
divided by m = 4 is 6 | ▶ 00:05 |
so the mean value is 6 | ▶ 00:08 |
subtracting the mean from the data gives us -3, 3, 3, and -3 | ▶ 00:10 |
squaring those gives us 9, 9, 9, 9 | ▶ 00:14 |
and the mean of 4 nines equals 9. | ▶ 00:18 |
I now have a more challenging quiz for you | ▶ 00:00 |
in which I give you multivariate data | ▶ 00:03 |
in this case 2-dimensional data. | ▶ 00:06 |
So suppose my data goes as follows. | ▶ 00:08 |
In the first column I get 3, 4, 5, 6, 7 | ▶ 00:11 |
these are 5 data points and this is the first feature | ▶ 00:15 |
and the second feature will be 8, 7, 5, 3, 2. | ▶ 00:18 |
The formulas for calculating Mu | ▶ 00:24 |
and the covariance matrix Sigma | ▶ 00:26 |
generalize the ones we studied before | ▶ 00:29 |
and they are given over here. | ▶ 00:31 |
So what I would like you to compute is the vector Mu | ▶ 00:33 |
which now has 2 values | ▶ 00:36 |
one for the first and one for the second column | ▶ 00:38 |
and the covariance matrix Sigma | ▶ 00:41 |
which now has 4 different values | ▶ 00:44 |
using the formula shown over here. | ▶ 00:47 |
Now the mean is calculated as before | ▶ 00:00 |
independently for each of the 2 features here. | ▶ 00:03 |
Three, 4, 5, 6, 7, the mean is 5. | ▶ 00:06 |
Eight, 7, 5, 3, 2, the mean is 5 again. | ▶ 00:08 |
Easy calculation. If you subtract the mean | ▶ 00:12 |
from the data we get the following matrix | ▶ 00:14 |
and now we just have to plug it in. | ▶ 00:19 |
For the main diagonal elements you get the same formula as before. | ▶ 00:21 |
You can do this separately for each of the 2 columns. | ▶ 00:24 |
But for the off-diagonal elements you just have to plug it in. | ▶ 00:27 |
So this is the result after plugging it in | ▶ 00:30 |
and you might just want to verify it using a computer. | ▶ 00:32 |
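If you do want to verify it with a computer, a sketch along these lines works (assuming numpy; bias=True gives the 1/m covariance used here, since numpy's default is 1/(m-1)):

```python
import numpy as np

data = np.array([[3, 8], [4, 7], [5, 5], [6, 3], [7, 2]], dtype=float)
print(data.mean(axis=0))                      # [5. 5.]
print(np.cov(data, rowvar=False, bias=True))  # [[ 2.  -3.2]
                                              #  [-3.2  5.2]]
```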
So this finishes the lecture on Gaussians. | ▶ 00:00 |
You learned about what a Gaussian is. | ▶ 00:02 |
We talked about the fit from data | ▶ 00:04 |
and we even talked about multivariate Gaussians. | ▶ 00:06 |
But even though I asked you to fit one of those | ▶ 00:08 |
the one we are going to focus on right now | ▶ 00:11 |
is the one-dimensional Gaussian. | ▶ 00:13 |
So let's now move back to the expectation maximization algorithm. | ▶ 00:14 |
It is now really easy to explain expectation maximization | ▶ 00:00 |
as a generalization of K-means. | ▶ 00:04 |
Again, we have a couple of data points here | ▶ 00:06 |
and 2 randomly chosen cluster centers. | ▶ 00:09 |
But in the correspondence step instead of making a hard correspondence | ▶ 00:12 |
we make a soft correspondence. | ▶ 00:16 |
Each data point is attracted to a cluster center | ▶ 00:18 |
in proportion to the posterior likelihood | ▶ 00:22 |
which we will define in a minute. | ▶ 00:24 |
In the adjustment step or the maximization step | ▶ 00:26 |
the cluster centers are being optimized just like before | ▶ 00:30 |
but now the correspondence is a soft variable | ▶ 00:34 |
and they correspond to all data points in different strengths | ▶ 00:37 |
not just the nearest ones. | ▶ 00:39 |
As a result, in EM the cluster centers | ▶ 00:41 |
tend not to move as far as in K-means. | ▶ 00:44 |
Their movement is smoother. | ▶ 00:46 |
A new correspondence over here gives us different strength | ▶ 00:48 |
as indicated by the different coloring of the links | ▶ 00:50 |
and another relaxation step gives us better cluster centers. | ▶ 00:53 |
And as you can see over time, gradually | ▶ 00:57 |
the EM will then converge to about the same solution as K-means. | ▶ 00:59 |
However, all the correspondences are still alive. | ▶ 01:03 |
Which means there is not a 0, 1 correspondence. | ▶ 01:05 |
There is a soft correspondence | ▶ 01:08 |
which relates to a posterior probability, which I will explain next. | ▶ 01:09 |
The model of expectation maximization | ▶ 00:00 |
is that each data point | ▶ 00:02 |
is generated from what's called a mixture. | ▶ 00:04 |
The sum of all possible classes | ▶ 00:06 |
or clusters, of which there are K | ▶ 00:08 |
we draw a class at random | ▶ 00:10 |
with a prior probability of p of the class C = i | ▶ 00:13 |
and then we draw data point X | ▶ 00:17 |
from the distribution corresponding to its class over here. | ▶ 00:19 |
The way to think about this is that there are K different cluster centers shown over here, | ▶ 00:22 |
each one of those has a generic Gaussian attached. | ▶ 00:28 |
In the generative version of expectation maximization | ▶ 00:30 |
you first draw a cluster center | ▶ 00:34 |
and then we draw from the Gaussian attached to this cluster center. | ▶ 00:36 |
The unknowns here are the prior probabilities for each cluster center | ▶ 00:39 |
which we call P-i, and the Mu-i, and in the general case Sigma-i, | ▶ 00:43 |
for each of the individual Gaussian. | ▶ 00:49 |
Where i = 1 all the way to K. | ▶ 00:51 |
Expectation maximization iterates 2 steps just like K-means. | ▶ 00:54 |
One is called the E-step or expectation step | ▶ 00:59 |
for which we assume that we know the Gaussian parameters and the P-i. | ▶ 01:01 |
With those known values calculating the sum over here | ▶ 01:08 |
is a fairly trivial exercise. | ▶ 01:11 |
This is our known formula for a Gaussian | ▶ 01:13 |
we just plug that in and this is a fixed probability. | ▶ 01:17 |
The sum of all possible classes. | ▶ 01:21 |
So you get for e-ij | ▶ 01:24 |
the probability that the j-th data point | ▶ 01:27 |
corresponds to cluster center number i | ▶ 01:30 |
P-i times the normalizer | ▶ 01:32 |
times the Gaussian expression. | ▶ 01:36 |
Where we have a quadratic of Xj minus Mu-i | ▶ 01:38 |
times Sigma-i to the -1 times the same thing again over here. | ▶ 01:42 |
These are the probabilities | ▶ 01:47 |
that the j-th data point | ▶ 01:49 |
corresponds to the i-th cluster center | ▶ 01:52 |
under the assumption that we do know | ▶ 01:54 |
the parameters P-i, Mu-i, and Sigma-i. | ▶ 01:57 |
In the M-step we now figure out where these parameters should have been. | ▶ 02:00 |
For the prior probability of each cluster center | ▶ 02:03 |
we just take the sum over all the e-ijs, over all data points | ▶ 02:06 |
divided by the total number of data points. | ▶ 02:11 |
The mean is obtained by the weighted mean of the x-js | ▶ 02:14 |
normalized by the sum over e-ijs | ▶ 02:21 |
and finally the sigma is obtained as a sum | ▶ 02:25 |
over the weighted expression like this | ▶ 02:30 |
and this is the same expression as before | ▶ 02:33 |
and now again we are normalizing over the sum over all e-ijs. | ▶ 02:35 |
And these are exactly the same calculations | ▶ 02:40 |
as before when we fit a Gaussian but just weighted by | ▶ 02:42 |
the soft correspondence of a data point to each Gaussian. | ▶ 02:46 |
And this weighting is relatively straightforward to apply in Gaussian fitting. | ▶ 02:51 |
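To make the E-step and M-step concrete, here is a compact sketch of EM for a mixture of 1-dimensional Gaussians (my own sketch assuming numpy; it omits the numerical safeguards a real implementation would need):

```python
import numpy as np

def em_1d(x, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False)      # initial means picked at random
    var = np.full(k, x.var())                 # initial variances
    pi = np.full(k, 1.0 / k)                  # prior P_i for each cluster
    for _ in range(n_iter):
        # E-step: soft correspondence e[i, j] of data point j to cluster i.
        gauss = np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2 / var[:, None]) \
                / np.sqrt(2 * np.pi * var[:, None])
        e = pi[:, None] * gauss
        e /= e.sum(axis=0, keepdims=True)     # normalize over clusters
        # M-step: re-estimate priors, means, and variances from the weights.
        n_i = e.sum(axis=1)
        pi = n_i / len(x)
        mu = (e * x[None, :]).sum(axis=1) / n_i
        var = (e * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / n_i
    return pi, mu, var
```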
Let's do a very quick quiz for EM. | ▶ 02:55 |
Suppose we're given 3 data points and 2 cluster centers. | ▶ 02:58 |
And the question is, does this point over here | ▶ 03:01 |
called X1 correspond to C1 or C2 or both of them? | ▶ 03:04 |
Please check exactly one of these 3 different check boxes here. | ▶ 03:09 |
[Thrun] And the answer is both of them, | ▶ 00:00 |
and the reason is X1 might be closer to C2 than C1, | ▶ 00:02 |
but the correspondence in EM is soft, | ▶ 00:05 |
which means each data point always corresponds to all cluster centers. | ▶ 00:07 |
It is just that this correspondence over here | ▶ 00:11 |
is much stronger than the correspondence over here. | ▶ 00:13 |
[Thrun] Here is another EM quiz. | ▶ 00:00 |
For this quiz we will assume a degenerate case of 3 data points and just 1 cluster center. | ▶ 00:02 |
My question pertains to the shape of the Gaussian after fitting, | ▶ 00:08 |
specifically its sigma. | ▶ 00:11 |
And the question is, is sigma circular, which would be like this, | ▶ 00:13 |
or elongated, which would be like this or like this? | ▶ 00:18 |
[Thrun] And the answer is, of course, elongated. | ▶ 00:00 |
As you look over here, what you find is that this is the best Gaussian describing the data points, | ▶ 00:02 |
and this is what EM will calculate. | ▶ 00:06 |
[Thrun] This is a quiz in which I compare EM versus K-means. | ▶ 00:00 |
Suppose we are given 4 data points, as indicated by those circles. | ▶ 00:05 |
Suppose we have 2 initial cluster centers, shown here in red, | ▶ 00:08 |
and those converge to possible places that are indicated by those 4 squares. | ▶ 00:11 |
Of course they won't take all 4 of them; they will just take 2 of them. | ▶ 00:17 |
But for now I'm going to give you 4 choices. | ▶ 00:19 |
We call this cluster 1, cluster 2, A, B, C, and D. | ▶ 00:21 |
In EM will C1 move towards A or will C1 move towards B? | ▶ 00:26 |
And in contrast, in K-means will C1 move towards A | ▶ 00:32 |
or will C1 move towards B? | ▶ 00:36 |
This is just asking about the left side of the diagram. | ▶ 00:38 |
So the question is will K-means find itself in the more extreme situation, | ▶ 00:41 |
or will EM find itself in the more extreme situation? | ▶ 00:44 |
[Thrun] And the answer is that while K-means will go all the way to the extreme, A, | ▶ 00:00 |
which is this one over here, EM will not. | ▶ 00:05 |
And this has to do with the soft versus hard nature of the correspondence. | ▶ 00:08 |
In K-means the correspondence is hard. | ▶ 00:13 |
So after the first situation, only these 2 data points over here | ▶ 00:17 |
correspond to cluster center 1, | ▶ 00:20 |
and they will find themselves straight in the middle where A is located. | ▶ 00:22 |
In EM, however, we find that there will still be a soft correspondence | ▶ 00:25 |
to these further away points which will then lead to a small shift of the cluster center | ▶ 00:29 |
to the right side, as indicated by B. | ▶ 00:33 |
That means K-means and EM will converge at different models of the data. | ▶ 00:36 |
[Thrun] One of the remaining open questions pertains to the number of clusters. | ▶ 00:00 |
So far I've assumed it's simply constant and you know it. | ▶ 00:03 |
But in reality, you don't know it. | ▶ 00:06 |
Practical implementations often guess the number of clusters along with the parameters. | ▶ 00:08 |
And the way this works is that you periodically evaluate which data is poorly covered | ▶ 00:12 |
by the existing mixture, you generate new cluster centers | ▶ 00:17 |
at random near unexplained points, and then you run the algorithm for a while | ▶ 00:21 |
to see whether the existence of your clusters is still justified. | ▶ 00:25 |
And the justification test is based on a minimization of a criterion | ▶ 00:29 |
that combines the negative log likelihood of your data itself | ▶ 00:33 |
and a penalty for each cluster. | ▶ 00:37 |
In particular, you're going to minimize the negative log likelihood of your data | ▶ 00:40 |
given the model plus a constant penalty per cluster. | ▶ 00:43 |
If we look at this expression, this is the expression that EM already minimizes. | ▶ 00:46 |
We maximize the posterior probability of the data; | ▶ 00:51 |
the logarithm is a monotonic function, and I put a minus sign over here | ▶ 00:53 |
so the optimization problem becomes a minimization problem. | ▶ 00:57 |
This one over here, the constant cost per cluster, is new. | ▶ 01:00 |
If you increase the number of clusters, you would pay a penalty | ▶ 01:04 |
that is in the way of your attempted minimization. | ▶ 01:07 |
Typically, this expression balances out at a certain number of clusters, | ▶ 01:10 |
and it is generically the best explanation for your data. | ▶ 01:14 |
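Written as a formula (my notation, with c the constant penalty per cluster), the score being minimized is roughly:

```latex
\text{score}(k) \;=\; -\log p(\text{data} \mid \text{model with } k \text{ clusters}) \;+\; c \cdot k
```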
So the algorithm looks as follows. | ▶ 01:16 |
Guess an initial K, run EM, remove unnecessary clusters | ▶ 01:18 |
that make this score over here go up, | ▶ 01:22 |
create some new random clusters, and go back and run EM. | ▶ 01:24 |
There is all kinds of variants of this algorithm. | ▶ 01:27 |
One of the nice things here is this algorithm also overcomes local minima problems | ▶ 01:30 |
to some extent. | ▶ 01:33 |
If, for example, 2 clusters end up grabbing the same data, | ▶ 01:35 |
then your tests would show you that 1 of the clusters can be omitted; | ▶ 01:39 |
thereby the score can be improved. | ▶ 01:42 |
That cluster can later be restarted somewhere else, | ▶ 01:44 |
and by randomly restarting clusters, you tend to get a much, much better solution | ▶ 01:47 |
than if you run EM just once with a fixed number of clusters. | ▶ 01:51 |
So this trick is highly recommended for any implementation of expectation maximization. | ▶ 01:54 |
[Thrun] This finishes my unit on clustering, | ▶ 00:00 |
at least so far. | ▶ 00:03 |
I just want to briefly summarize what we've learned. | ▶ 00:05 |
We talked about K-means, and we talked about expectation maximization. | ▶ 00:07 |
K-means is a very simple almost binary algorithm | ▶ 00:10 |
that allows you to find cluster centers. | ▶ 00:14 |
EM is a probabilistic generalization that also allows you to find clusters | ▶ 00:16 |
but also modifies the shapes of the clusters by modifying the covariance matrix. | ▶ 00:19 |
EM is probabilistically sound, and you can prove convergence | ▶ 00:23 |
in a log likelihood space. K-means also converges. | ▶ 00:26 |
Both are prone to local minima. | ▶ 00:29 |
In both cases you need to know the number of cluster centers, K. | ▶ 00:31 |
I showed you a brief trick how to estimate the K as you go, | ▶ 00:34 |
which also overcomes local minima to some extent. | ▶ 00:39 |
Let's now talk about a 2nd class of unsupervised learning avenues | ▶ 00:00 |
that are called dimensionality reduction. | ▶ 00:04 |
We're going to start with a little quiz, in which I will check your intuition. | ▶ 00:06 |
Suppose we're given a 2-dimensional data field, and our data lines up as follows. | ▶ 00:10 |
My quiz is: How many dimensions do we really need? | ▶ 00:14 |
The key is the word really, | ▶ 00:17 |
which means we're willing to tolerate a certain amount of error in accuracy, | ▶ 00:19 |
because we're going to capture the essence of the problem. | ▶ 00:22 |
The answer is obviously 1. | ▶ 00:00 |
This is the key dimension over here. | ▶ 00:02 |
The orthogonal dimension in this direction carries almost no information, | ▶ 00:04 |
so it suffices, in most cases, to project the data onto this 1 dimensional space. | ▶ 00:07 |
Here is a quiz that is a little bit more tricky. | ▶ 00:00 |
I'm going to draw data for you like this. | ▶ 00:02 |
I'm going to ask the same question. | ▶ 00:04 |
How many dimensions do we really need? | ▶ 00:06 |
This answer is not at all trivial, and I don't blame you if you get it wrong. | ▶ 00:00 |
The answer is actually 1, but the projection itself is nonlinear. | ▶ 00:05 |
I can draw, really easily, a nice 1-dimensional space that follows these data points. | ▶ 00:10 |
If I am able to project all the data points on this 1-dimensional space, | ▶ 00:15 |
I capture the essence of the data. | ▶ 00:19 |
The trick, of course, is to find the nonlinear 1-dimensional space and describe it. | ▶ 00:21 |
This is what's going on in the state-of-the-art in dimensionality reduction research. | ▶ 00:25 |
For the remainder of this unit, | ▶ 00:00 |
I am going to talk about linear dimensionality reduction. | ▶ 00:02 |
Here the idea is that we're given data points like this, | ▶ 00:05 |
and we seek to find a linear subspace onto which to project the data. | ▶ 00:08 |
In this case, I would submit this is probably the most suitable linear subspace. | ▶ 00:13 |
So we remap the data onto the space over here, with x1 over here and x2 over here. | ▶ 00:17 |
Then we can capture the data in just 1 dimension. | ▶ 00:23 |
The algorithm is amazingly simple. | ▶ 00:25 |
Number 1: Fit a gaussian; we now know how this works. | ▶ 00:28 |
The gaussian will look something like this. | ▶ 00:31 |
Number 2: Calculate the eigenvalues and eigenvectors of this Gaussian. | ▶ 00:34 |
In this gaussian this would be the dominant eigenvector, | ▶ 00:39 |
and this would be the 2nd eigenvector over here. | ▶ 00:42 |
Step 3 is take those eigenvectors whose eigenvalues are the largest. | ▶ 00:45 |
Step 4 is to project the data onto the subspace of eigenvectors you chose. | ▶ 00:50 |
Now to understand this, you have to be familiar with eigenvectors and eigenvalues. | ▶ 00:55 |
I give you an intuitive familiarity with those. | ▶ 00:59 |
This is standard statistics material, and you will find this in many linear algebra classes. | ▶ 01:02 |
So let me just go through this very quickly | ▶ 01:07 |
and give you an intuition how to do linear dimensionality reduction. | ▶ 01:09 |
Suppose you're given the following data points: | ▶ 01:14 |
Your axes are 0, 1, 2, 3, and 4, | ▶ 01:16 |
for x1, and 1.9, 3.1, 4, 5.1, and 5.9 for x2. | ▶ 01:20 |
These are essentially 2, 3, 4, 5, 6, | ▶ 01:28 |
but slightly modified to define actual variance over this dimension. | ▶ 01:33 |
So I draw this in here. | ▶ 01:38 |
What I get is a set of points that doesn't quite fit a line, but almost. | ▶ 01:40 |
There is a little error over here, a little error over here, and here and here. | ▶ 01:44 |
The mean is easily calculated; it's 2 and 4. | ▶ 01:47 |
The covariance matrix looks as follows. | ▶ 01:50 |
Notice the slightly different variance for the 1st variable, which is exactly 2, | ▶ 01:53 |
to the 2nd variable, which is 2.008. | ▶ 01:59 |
The eigenvectors happen to be 0.7064 and 0.7078 with an eigenvalue of 4.004, | ▶ 02:02 |
and the 2nd one is orthogonal with an eigenvalue much smaller. | ▶ 02:13 |
So obviously this is the eigenvector that dominates the spread of the data points. | ▶ 02:18 |
If you look at this vector over here, it is centered around the mean, | ▶ 02:22 |
which sits over here, and is exactly this vector shown over here. | ▶ 02:27 |
Where this one is the orthogonal vector shown over here. | ▶ 02:31 |
So this single dimension with a large weight explains the data relative to | ▶ 02:34 |
any other dimension, which has a very small eigenvalue. | ▶ 02:39 |
I should mention why these numerical examples might look confusing. | ▶ 02:41 |
This is very standard linear algebra. | ▶ 02:47 |
When you estimate covariance from data and try to understand which direction they point, | ▶ 02:49 |
this kind of eigenvalue analysis gives you the right answer. | ▶ 02:53 |
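Here is a short sketch of that recipe on this very data set (assuming numpy; bias=True gives the 1/m covariance, which reproduces the 2 and 2.008 entries and the dominant 4.004 eigenvalue mentioned above):

```python
import numpy as np

X = np.array([[0, 1.9], [1, 3.1], [2, 4.0], [3, 5.1], [4, 5.9]])
mu = X.mean(axis=0)                           # step 1: fit the Gaussian (mean [2, 4]...
sigma = np.cov(X, rowvar=False, bias=True)    # ...and covariance matrix)
vals, vecs = np.linalg.eigh(sigma)            # step 2: eigenvalues and eigenvectors
order = np.argsort(vals)[::-1]                # step 3: keep the largest eigenvalues
top = vecs[:, order[:1]]                      # dominant eigenvector spans the subspace
projected = (X - mu) @ top                    # step 4: project the data onto it
```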
The dimensionality reduction looks a little bit silly when you go | ▶ 00:00 |
from 2 dimensions to 1 dimension. | ▶ 00:04 |
But in truly high-dimensional space it has a very strong utility. | ▶ 00:05 |
Here's an example that goes back to MIT several decades ago | ▶ 00:09 |
on something called eigenfaces. | ▶ 00:13 |
These are all well-aligned faces. | ▶ 00:15 |
The objective in eigenface research has been to find | ▶ 00:17 |
simple ways to describe different people in a parameter space, | ▶ 00:21 |
in which we can easily identify the same person again. | ▶ 00:25 |
Images like these are very high-dimensional statistics. | ▶ 00:27 |
If each image is 50 by 50 pixels, | ▶ 00:31 |
each image itself becomes a data point in a 2500 dimensional feature space. | ▶ 00:33 |
Now obviously, we don't have random images. | ▶ 00:39 |
We don't fill the space of 2500 dimensions with all face images. | ▶ 00:43 |
Instead, it is reasonable to assume that all the faces live on a small subspace in that space. | ▶ 00:48 |
Obviously, you as a human can easily distinguish what is a valid image of a face | ▶ 00:54 |
and what is a valid image of a non face, like a car or a cloud or the sky. | ▶ 00:58 |
Therefore, there are many, many images that you can | ▶ 01:02 |
represent with 2500 pixels that are not faces. | ▶ 01:04 |
So research on eigenfaces has applied | ▶ 01:08 |
principle component analysis and eigenvalues to the space of faces. | ▶ 01:10 |
Here is a database in which faces are aligned. | ▶ 01:15 |
A researcher, Santiago Serrano, extracted from it | ▶ 01:19 |
the average face after alignment on the right side. | ▶ 01:23 |
The truly interesting phenomenon occurs when you look at the eigenvalues. | ▶ 01:27 |
The face on the top left, over here, is the average face, | ▶ 01:31 |
and these are the variations, | ▶ 01:34 |
the eigenvectors that correspond to the largest eigenvalues over here. | ▶ 01:37 |
This is the strongest variation. | ▶ 01:41 |
You see a certain amount of different regions in and around the head shape | ▶ 01:42 |
and the hair that gets excited. | ▶ 01:46 |
That's the 2nd strongest one, where the shirt gets more excited. | ▶ 01:48 |
As you go down, | ▶ 01:50 |
you find more and more interesting variations that can be used to reconstruct faces. | ▶ 01:51 |
Typically a dozen or so will suffice to make a face completely reconstructable, | ▶ 01:56 |
which means you've just mapped a 2500 dimensional feature space | ▶ 02:01 |
into a, perhaps, 12 dimensional feature space | ▶ 02:05 |
on which we can now learn much, much easier. | ▶ 02:08 |
In our own research, we also have applied eigenvector decomposition | ▶ 00:00 |
to relatively challenging problems that don't look like a linear problem at the surface. | ▶ 00:06 |
We scanned a good number of people with different physiques: | ▶ 00:11 |
Some thin, some not so thin, some tall, some short, some male, some female. | ▶ 00:15 |
We also scanned them in 3-D in different body postures: | ▶ 00:19 |
The arms down, the arms up, walking, throwing a ball, and so on. | ▶ 00:23 |
We applied eigenvector decomposition of the type I've just shown you | ▶ 00:28 |
to understand whether there is a latent low-dimensional space | ▶ 00:33 |
that is sufficient to represent the different physiques that people have, | ▶ 00:37 |
like thin or thick, and the different postures people can assume, like standing and so on. | ▶ 00:41 |
It turns out if you apply eigenvector decomposition | ▶ 00:46 |
to the space of all the formations of your body, | ▶ 00:51 |
you can find relatively low dimensional linear spaces, | ▶ 00:55 |
in which you can express different physiques and different body postures. | ▶ 01:00 |
For the space of all different physiques it turns out only 3 dimensions sufficed | ▶ 01:05 |
to explain different heights, different thicknesses or body weights, | ▶ 01:11 |
and also different genders. | ▶ 01:15 |
That is, even though our surfaces themselves are representative | ▶ 01:18 |
of tens of thousands of data points, the underlying dimensionality | ▶ 01:22 |
when scanning people is really small. | ▶ 01:25 |
I'll let you watch the entire movie. | ▶ 01:29 |
Please enjoy. | ▶ 01:31 |
[SCAPE: Shape Completion and Animation of People] | ▶ 01:32 |
We present a method named SCAPE for simultaneously modeling | ▶ 01:34 |
the space of all human shapes and poses. | ▶ 01:38 |
Further, we demonstrate the method's usefulness | ▶ 01:41 |
for both shape completion and animation. | ▶ 01:44 |
The model is computed from an example set of surface meshes. | ▶ 01:48 |
We require only a limited set of training data: | ▶ 01:51 |
Examples of posed variation from a single subject | ▶ 01:55 |
and examples of the shape variation between subjects. | ▶ 01:58 |
The resulting model can represent both articulated motion | ▶ 02:02 |
and, importantly, the nonrigid muscle deformations | ▶ 02:06 |
required for natural appearance in a wide variety of poses. | ▶ 02:10 |
The model can also represent a wide variety of different body shapes, | ▶ 02:14 |
spanning both men and women. | ▶ 02:18 |
Because SCAPE incorporates both shape and pose | ▶ 02:20 |
we can jointly vary both shape and pose to create people who never existed | ▶ 02:23 |
and poses that were never observed. | ▶ 02:28 |
We demonstrate the use of this model 1st for shape completion of scanned meshes. | ▶ 02:31 |
Even when a subject has only been partially observed, | ▶ 02:36 |
we can use the model to estimate a complete surface. | ▶ 02:39 |
In this case, the entire front half of the subject has been synthesized. | ▶ 02:42 |
Note that the synthesized data both conforms to the individual subject's | ▶ 02:47 |
specific shape and faithfully represents | ▶ 02:51 |
the nonrigid muscle deformations associated with a specific pose. | ▶ 02:54 |
Mesh completion is possible even when | ▶ 02:59 |
neither the person or the pose exists in the original training set. | ▶ 03:01 |
None of the women in our example set | ▶ 03:05 |
look similar to the woman in this sequence. | ▶ 03:07 |
Shape completion can also be used to synthesize complete | ▶ 03:11 |
animated surface meshes. | ▶ 03:15 |
Starting from a single scanned mesh of an actor | ▶ 03:18 |
and a timed series of motion capture markers | ▶ 03:20 |
we can treat the markers themselves | ▶ 03:24 |
as a very sparse sampling of surface geometry | ▶ 03:26 |
and complete the surface which best fits the available data at each point in time. | ▶ 03:29 |
Using this method, animated surface models | ▶ 03:34 |
for a wide variety of motions can be created with relative ease. | ▶ 03:36 |
In addition, the target identity of the surface model can easily be changed | ▶ 03:40 |
simply by replacing the subject portion of our factorized model with a different vector. | ▶ 03:45 |
The new identity need not be present in our training set | ▶ 03:50 |
or even correspond to a real person. | ▶ 03:54 |
An artist is free to alter the identity arbitrarily. | ▶ 03:56 |
[Thrun] In modern dimensionality reduction, the trick has been to define nonlinear, | ▶ 00:00 |
sometimes piece-wise linear, subspaces on which data is being projected. | ▶ 00:05 |
This is not dissimilar from K nearest neighbors, | ▶ 00:09 |
where local regions are being defined based on local data neighborhoods. | ▶ 00:12 |
But here we need ways of leveraging neighbors | ▶ 00:16 |
to make sure that the subspace itself becomes a feasible subspace. | ▶ 00:18 |
Common methods include local linear embedding, or LLE, or the Isomap method. | ▶ 00:22 |
If you're interested in this, check the Web. | ▶ 00:27 |
There's tons of information on these methods on the World Wide Web. | ▶ 00:29 |
We now talk about spectral clustering. | ▶ 00:00 |
The fundamental idea of spectral clustering | ▶ 00:04 |
is to cluster by affinity. | ▶ 00:07 |
And to understand the importance of spectral clustering, | ▶ 00:09 |
let me ask you a simple intuitive quiz. | ▶ 00:12 |
Suppose you are given data like this, | ▶ 00:16 |
and you wish to learn that there's 2 clusters-- | ▶ 00:18 |
a cluster over here and a cluster over here. | ▶ 00:22 |
So my question is, from what you understand, | ▶ 00:25 |
do you think that "EM" or "K" means | ▶ 00:28 |
we would do a great job finding those clusters | ▶ 00:30 |
or do you think they will likely fail to find those clusters? | ▶ 00:33 |
So what were the questions--Do "EM" or "K" | ▶ 00:36 |
mean succeed in finding the 2 clusters? | ▶ 00:38 |
There is a likely yes and a likely no. | ▶ 00:40 |
And the answer is likely no. | ▶ 00:00 |
The reason being that these aren't clusters | ▶ 00:02 |
defined by a center of data points, | ▶ 00:05 |
but they're clusters defined by affinity, | ▶ 00:08 |
which means they're defined by the presence of nearby points. | ▶ 00:10 |
So take for example the area over here, which I'm going to circle, | ▶ 00:14 |
and ask yourself, what's the best cluster center? | ▶ 00:17 |
It's likely somewhere over here where I drew the red dot. | ▶ 00:20 |
This is the cluster center for this cluster, | ▶ 00:23 |
and perhaps this is the cluster center for the other cluster. | ▶ 00:25 |
And these points over here will likely | ▶ 00:28 |
be classified as belonging to the cluster center over here. | ▶ 00:30 |
So, "EM" will likely do a bad job. | ▶ 00:32 |
So let's look at this example again--let me redraw the data. | ▶ 00:00 |
What makes these clusters so different | ▶ 00:03 |
is not the absolute location of each data point, | ▶ 00:05 |
but the connectedness of these data points. | ▶ 00:08 |
The fact that these 2 points belong together | ▶ 00:11 |
is likely because there's lots of points in-between. | ▶ 00:13 |
In other words, it's the affinity | ▶ 00:16 |
that defines those clusters, not the absolute location. | ▶ 00:18 |
So spectral clustering uses a notion of affinity | ▶ 00:21 |
to make clustering happen. | ▶ 00:25 |
So let me look at a simple example for spectral clustering | ▶ 00:27 |
that would also work for K-means or EM, | ▶ 00:30 |
but it will be useful to illustrate spectral clustering. | ▶ 00:33 |
Let's assume there's 9 data points as shown over here, | ▶ 00:36 |
and I've colored them differently in blue, red, and black. | ▶ 00:39 |
But to clustering algorithms, they all come with the same color. | ▶ 00:43 |
Now the key element of spectral clustering | ▶ 00:46 |
is called the affinity matrix, | ▶ 00:48 |
which is a 9 by 9 matrix in this case, | ▶ 00:50 |
where each data point gets graphed | ▶ 00:53 |
relative to each other data point. | ▶ 00:56 |
So let me write down all the 9 data points | ▶ 00:58 |
into the different rows of this matrix-- | ▶ 01:00 |
the red ones, the black ones, and the blue ones. | ▶ 01:03 |
And in the columns, I graphed the exact same 9 data points. | ▶ 01:05 |
I then calculate for each pair of data points their affinity, | ▶ 01:09 |
where I use for now affinity as the | ▶ 01:13 |
quadratic distance in this diagram over here. | ▶ 01:16 |
Clearly, the red dots to each other have a high affinity, | ▶ 01:19 |
which means a small quadratic distance. | ▶ 01:22 |
Let me indicate this as follows-- | ▶ 01:24 |
But relative to all the other points, the affinity is weak. | ▶ 01:26 |
So there's a very small value in these elements over here. | ▶ 01:29 |
Similarly, the affinity of the black | ▶ 01:32 |
data points to each other is very high, | ▶ 01:34 |
which means that the following block diagonal | ▶ 01:36 |
in this matrix will have a very large value. | ▶ 01:38 |
Yet the affinity to all the other data points will be low. | ▶ 01:41 |
And of course, the same is true for the blue data points. | ▶ 01:44 |
The interesting thing to notice now | ▶ 01:47 |
is that this is an approximately rank-deficient matrix. | ▶ 01:49 |
And further, the data points that belong to the same class-- | ▶ 01:52 |
like the 3 red dots or the 3 black dots, | ▶ 01:56 |
have a similar affinity vector to all the other data points. | ▶ 01:59 |
So this vector over here is similar to this vector over here. | ▶ 02:03 |
It's similar to this vector over here, | ▶ 02:06 |
but it's very different to this vector over here, | ▶ 02:08 |
which then itself is similar to the vector over here, | ▶ 02:10 |
yet different to the previous ones. | ▶ 02:13 |
Such a situation is easily addressed by what's called | ▶ 02:15 |
principal component analysis, or PCA. | ▶ 02:17 |
PCA is a method to identify vectors that are similar | ▶ 02:21 |
in an approximate rank-deficient matrix. | ▶ 02:25 |
Consider once again our affinity matrix. | ▶ 02:28 |
With principal component analysis, | ▶ 02:31 |
which is a standard linear trick, | ▶ 02:33 |
we can re-represent this matrix | ▶ 02:36 |
by the most dominant eigenvectors you'll find there. | ▶ 02:38 |
And the first one, might look like this. | ▶ 02:42 |
The second one, which would be orthogonal, may look like this. | ▶ 02:44 |
The third one, like this. | ▶ 02:47 |
These are called eigenvectors, and the key point | ▶ 02:49 |
now is that each eigenvector has an eigenvalue | ▶ 02:51 |
that states how prevalent this vector is in the original data. | ▶ 02:53 |
And for these 3 vectors, you're going to find a large eigenvalue | ▶ 02:57 |
because there's a number of data points that represent | ▶ 03:00 |
these vectors quite prevalently, | ▶ 03:03 |
like the first 3 do for this guy over here. | ▶ 03:06 |
There might be additional eigenvectors like something like this, | ▶ 03:09 |
but such eigenvectors will have a small eigenvalue | ▶ 03:12 |
simply because this vector isn't really | ▶ 03:15 |
required to explain the data over here. | ▶ 03:17 |
It might just be explaining some of the noise | ▶ 03:19 |
in the affinity matrix | ▶ 03:21 |
that I didn't even dare draw in here. | ▶ 03:23 |
Now if you take the eigenvectors with the largest | ▶ 03:25 |
eigenvalues--3 in this case, | ▶ 03:27 |
you first discover the dimensionality | ▶ 03:29 |
of the underlying data space. | ▶ 03:32 |
The dimensionality equals the number of large eigenvalues. | ▶ 03:34 |
Further, if you re-represent each data vector | ▶ 03:37 |
using those eigenvectors, | ▶ 03:40 |
you'll find a 3 dimensional space | ▶ 03:42 |
where the original data falls into a variety of different places. | ▶ 03:44 |
And these places are easily told apart by conventional clustering. | ▶ 03:48 |
So in summary, spectral clustering builds | ▶ 03:51 |
an affinity matrix of the data points. | ▶ 03:53 |
It extracts the eigenvectors with the largest eigenvalues, | ▶ 03:55 |
and then re-maps those vectors into a new space | ▶ 03:58 |
where the data points are easily clustered the conventional way. | ▶ 04:01 |
This is called affinity-based clustering or spectral clustering. | ▶ 04:05 |
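Here is a minimal Python sketch of the procedure just summarized: build an affinity matrix, take its dominant eigenvectors, and cluster the re-represented points the conventional way. The Gaussian affinity, the tiny k-means loop, and the toy data are my own assumptions for illustration, and this follows the lecture's affinity-plus-eigenvector description rather than the graph-Laplacian variant found in most libraries.

    import numpy as np

    def spectral_clusters(points, k, sigma=1.0, iters=20):
        # Affinity matrix: one entry per pair of points,
        # large when the quadratic distance is small.
        d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        A = np.exp(-d2 / (2 * sigma ** 2))
        # Eigenvectors with the largest eigenvalues (eigh sorts ascending).
        vals, vecs = np.linalg.eigh(A)
        embedding = vecs[:, -k:]        # re-represent each point in k dimensions
        # Conventional clustering: a tiny k-means in the embedded space.
        centers = embedding[np.random.choice(len(points), k, replace=False)]
        for _ in range(iters):
            labels = ((embedding[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            centers = np.array([embedding[labels == j].mean(0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        return labels

    points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1],   # one affinity group
                       [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])  # another affinity group
    print(spectral_clusters(points, k=2))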
Let me illustrate this once again with the | ▶ 04:09 |
data set that has a different spectral clustering | ▶ 04:11 |
than a conventional clustering. | ▶ 04:13 |
In this data set, the different clusters belong | ▶ 04:15 |
together because their affinity is similar. | ▶ 04:17 |
These 2 points belong together | ▶ 04:19 |
because there is a point in-between. | ▶ 04:21 |
If we now draw the affinity matrix for those data points, | ▶ 04:23 |
you find that the first and second data points are close together | ▶ 04:26 |
and the second and the third, but not the first and the third. | ▶ 04:29 |
Hence these 2 off diagonal elements here have remained small. | ▶ 04:32 |
Similarly for the red points as shown here | ▶ 04:35 |
with these 2 elements over here relatively small. | ▶ 04:38 |
And also for the black points | ▶ 04:40 |
where these 2 elements over here are small. | ▶ 04:42 |
And interestingly enough, even though these aren't block diagonal, | ▶ 04:44 |
your first 3 largest eigenvectors | ▶ 04:47 |
will still look the same as before. | ▶ 04:50 |
I find this quite remarkable | ▶ 04:52 |
that even though these aren't exactly blocks, | ▶ 04:54 |
those vectors still represent the 3 most | ▶ 04:56 |
important vectors with which to recover | ▶ 04:59 |
the data using principal component analysis. | ▶ 05:01 |
So in this case, spectral clustering would easily | ▶ 05:04 |
assign those guys and those guys and those guys | ▶ 05:06 |
to the respective same cluster, | ▶ 05:10 |
which wouldn't be quite as easily the case for | ▶ 05:12 |
expectation-maximization or k-means. | ▶ 05:14 |
So let me ask you the following quiz. | ▶ 05:16 |
Suppose we have 8 data points. | ▶ 05:18 |
How many elements will the affinity matrix have? | ▶ 05:20 |
And the answer is 64. | ▶ 00:00 |
There's 8 data points--8 times 8 is 64. | ▶ 00:02 |
My second question is, how many large eigenvalues | ▶ 00:00 |
will PCA find? | ▶ 00:04 |
Now I understand this doesn't have a unique answer, | ▶ 00:06 |
but in the best possible case | ▶ 00:10 |
where spectral clustering works well, | ▶ 00:12 |
how many large eigenvalues do you find? | ▶ 00:15 |
And the answer is 2. | ▶ 00:00 |
There's a cluster over here and a cluster over here. | ▶ 00:02 |
And while it might happen that it's as many as 8, | ▶ 00:06 |
if you adjust your affinity matrix well, | ▶ 00:08 |
those 2 should correspond to the 2 largest eigenvalues. | ▶ 00:10 |
So, congratulations. | ▶ 00:00 |
You just made it through the unsupervised learning section of this class. | ▶ 00:02 |
I think you've learned a lot. | ▶ 00:05 |
You learned about K-means, you learned about expectation maximization, | ▶ 00:07 |
about dimensionality reduction and even spectral clustering. | ▶ 00:10 |
The first 3 items--K-means, EM, and dimensionality reduction-- | ▶ 00:14 |
are used very frequently, and spectral clustering is a more rarely used method | ▶ 00:17 |
that shows some of the most recent research going on in the field. | ▶ 00:22 |
I hope you have fun applying these methods in practice. | ▶ 00:26 |
I'd like to say a few final words about supervised versus unsupervised learning. | ▶ 00:30 |
In both cases you're given data, but in 1 case you have labeled data, | ▶ 00:35 |
in another you have unlabeled data. | ▶ 00:39 |
The supervised learning paradigm is the dominant paradigm in machine learning, | ▶ 00:41 |
and there are a vast amount of papers being written about it. | ▶ 00:45 |
We talked about classification and regression | ▶ 00:48 |
and different methods to do supervised learning. | ▶ 00:51 |
The unsupervised paradigm is much less explored, | ▶ 00:53 |
even though I think it's at least equally important--possibly even more important. | ▶ 00:56 |
Many systems can collect vast amounts of data such as web crawlers, | ▶ 01:00 |
robots, I told you about street view, | ▶ 01:05 |
and getting the data is cheap, but getting labels is hard. | ▶ 01:08 |
So to me, unsupervised is the method of the future. | ▶ 01:11 |
It's one of the most interesting open research topics | ▶ 01:14 |
to see whether we can make sense out of large amounts of unlabeled or poorly labeled data. | ▶ 01:17 |
In between, there are techniques that do both: supervised and unsupervised. | ▶ 01:21 |
They are called semi-supervised or self-supervised, | ▶ 01:26 |
and they use elements of unsupervised learning and pair them with supervised learning. | ▶ 01:29 |
Those are fascinating in their own right. | ▶ 01:32 |
Our robot Stanley, for example, that won the DARPA Grand Challenge | ▶ 01:35 |
used its own sensors to produce labels on the fly for other data. | ▶ 01:38 |
And I'll talk about this when I talk about robotics in more detail. | ▶ 01:43 |
But for the time being, understand that the paradigms supervised and unsupervised | ▶ 01:46 |
span 2 very large areas of machine learning, and you've learned quite a bit about both. | ▶ 01:51 |
Welcome to the third homework assignment covering topics of machine learning. | ▶ 00:00 |
[Thrun] This question is about naive Bayes and Laplacian smoothing. | ▶ 00:00 |
Our training data is a set of movie titles: A Perfect World, | ▶ 00:06 |
My Perfect Woman, and Pretty Woman. | ▶ 00:12 |
We also have a song class of song titles: A Perfect Day, Electric Storm, | ▶ 00:16 |
Another Rainy Day. | ▶ 00:26 |
Suppose we get a new title, the query Perfect Storm, | ▶ 00:28 |
and we wish to know whether Perfect Storm is more likely a movie or a song. | ▶ 00:33 |
Compute for me the following model probabilities: | ▶ 00:40 |
the probability for movie class and song class, | ▶ 00:44 |
the probability of the word "perfect" conditioned on the movie class, | ▶ 00:50 |
the probability of the word "perfect" conditioned on the song class, | ▶ 00:53 |
and the same for the word "storm." | ▶ 00:58 |
Please use Laplacian smoothing for this with K equals 1. | ▶ 01:01 |
Don't compute the maximum likelihood estimate. | ▶ 01:06 |
[Thrun] Remember in Laplacian smoothing our best estimate | ▶ 00:00 |
is the count of the occurrence of the words divided by N, | ▶ 00:03 |
but we add our Laplacian smoother over here, | ▶ 00:09 |
and down here we add K times number of classes. | ▶ 00:13 |
For the movie prior we have 3 examples of movie titles over 6 total titles, | ▶ 00:16 |
which gives us 3 over 6. | ▶ 00:24 |
We add our Laplacian prior, 1 over here. | ▶ 00:28 |
There's 2 classes, movie and song, 2 over here. | ▶ 00:30 |
We get 4 over 8, which is a half. | ▶ 00:33 |
The same is the case for song. | ▶ 00:35 |
It gets more interesting for this probability over here. | ▶ 00:38 |
In our movie class there's 2 occurrences of the word "perfect" out of 8 words, | ▶ 00:42 |
so we get 2 over 8. | ▶ 00:48 |
But in adding the Laplacian prior, 1 over here | ▶ 00:50 |
and 1 number to add down here, | ▶ 00:53 |
the number of classes here is the size of the vocabulary. | ▶ 00:55 |
In total for this model there are 11 different words. | ▶ 01:01 |
There are 16 total words across all the titles, | ▶ 01:06 |
but because of repetition there's only 11 distinct words: | ▶ 01:10 |
a, perfect, world, my, woman, pretty, day, electric, storm, another, rainy. | ▶ 01:14 |
So we add the number of classes over here, which is 11. | ▶ 01:26 |
We obtain 3 over 19. | ▶ 01:30 |
For the song class there's 1 occurrence of perfect. | ▶ 01:33 |
Adding 1 we get 2 over 19. | ▶ 01:37 |
There's no occurrence of storm in the movie class. | ▶ 01:41 |
However, our Laplacian prior gives us 1 over 19. | ▶ 01:43 |
And there's 1 occurrence of storm over here, which gives us 2 over 19. | ▶ 01:46 |
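Here is a small Python check of the Laplace-smoothed estimates just derived, with k = 1. The titles are the ones from the question; splitting them into lowercase words is my own convention for the sketch.

    movies = ["a perfect world", "my perfect woman", "pretty woman"]
    songs  = ["a perfect day", "electric storm", "another rainy day"]

    movie_words = " ".join(movies).split()        # 8 words in the movie class
    song_words  = " ".join(songs).split()         # 8 words in the song class
    vocab = set(movie_words) | set(song_words)    # 11 distinct words

    k = 1
    # Class prior: (count + k) / (total + k * number of classes)
    p_movie = (len(movies) + k) / (len(movies) + len(songs) + k * 2)   # 4/8 = 1/2

    # Word likelihood: (count + k) / (words in class + k * vocabulary size)
    def smoothed(word, words):
        return (words.count(word) + k) / (len(words) + k * len(vocab))

    print(p_movie)                                # 0.5
    print(smoothed("perfect", movie_words))       # 3/19
    print(smoothed("perfect", song_words))        # 2/19
    print(smoothed("storm", movie_words))         # 1/19
    print(smoothed("storm", song_words))          # 2/19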
[Thrun] For the same example I now would like to know the probability | ▶ 00:00 |
of movie title for my query. | ▶ 00:04 |
So please write this into the following box. | ▶ 00:09 |
[Thrun] As usual, we can resolve this using Bayes' rule. | ▶ 00:00 |
Probability of Perfect Storm given movie times P of movie | ▶ 00:04 |
divided by the same expression plus this expression for the opposite class, song. | ▶ 00:09 |
Here I simply write 3 dots for the text Perfect Storm. | ▶ 00:15 |
Plugging in the values over here and assuming conditional independence, | ▶ 00:21 |
as is the case when I use naive Bayes, we get the probability of "perfect" given movie, | ▶ 00:24 |
which is 3/19, and "storm" given movie, 1/19, times the prior of one half, | ▶ 00:29 |
and we divide this by the same number plus probability of "perfect" given song, | ▶ 00:35 |
which is 2/19, and the probability of "storm" given song, which is 2/19 times the prior of half. | ▶ 00:43 |
Now, all the common factors of 1/19, 1/19, and the prior of one half cancel out, and we get 3 times 1 over 3 times 1 plus 2 times 2, which is 3 over 7. | ▶ 00:51 |
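And here is a quick Python check of the posterior with exact fractions, using the smoothed values from above; the common factors cancel, leaving 3 / (3 + 4) = 3/7.

    from fractions import Fraction as F

    p_movie, p_song        = F(1, 2), F(1, 2)
    p_perfect_m, p_storm_m = F(3, 19), F(1, 19)
    p_perfect_s, p_storm_s = F(2, 19), F(2, 19)

    num_movie = p_perfect_m * p_storm_m * p_movie   # naive Bayes numerator for movie
    num_song  = p_perfect_s * p_storm_s * p_song    # and for song
    print(num_movie / (num_movie + num_song))       # 3/7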
[Thrun] I would now like to ask the exact same question | ▶ 00:00 |
for the maximum likelihood estimator. | ▶ 00:03 |
So let's not assume we have Laplacian smoothing | ▶ 00:05 |
and instead use the maximum likelihood estimator. | ▶ 00:09 |
Simply compute for me the probability of movie for the title Perfect Storm. | ▶ 00:12 |
[Thrun] And the answer is simply 0, without much math. | ▶ 00:00 |
The word "perfect" occurs in movie, but the word "storm" has never been seen before. | ▶ 00:04 |
Therefore, under the maximum likelihood estimate we will assign a 0 probability to the word "storm," | ▶ 00:09 |
which will make the total product of the various factors involved in "storm" just 0. | ▶ 00:14 |
That is not the case for song. | ▶ 00:21 |
There is a non-zero probability for "perfect" and a non-zero probability for "storm." | ▶ 00:23 |
Hence, it will have a non-zero probability. | ▶ 00:27 |
After normalization this will become 1 and this will become 0. | ▶ 00:29 |
So without much math I can calculate the correct posterior | ▶ 00:34 |
under the maximum likelihood model, which of course is disappointing | ▶ 00:38 |
because Perfect Storm is actually a movie title. | ▶ 00:41 |
[Thrun] In this question I quiz you about linear regression. | ▶ 00:00 |
Given the following data, my first question is, | ▶ 00:04 |
can this data be fit exactly using a linear function that maps from X to Y? | ▶ 00:07 |
Yes or no. | ▶ 00:13 |
[Thrun] And the answer is no. | ▶ 00:00 |
To see why, let's look at the slope of the linear function if it existed. | ▶ 00:02 |
From 0 to 1 we increment Y by 3. | ▶ 00:08 |
We go from 3 to 6. | ▶ 00:11 |
Therefore, the slope of it must be 3. | ▶ 00:13 |
However, from 1 to 2 we only increase the function by 1, from 6 to 7. | ▶ 00:16 |
Therefore, it can't be fit linearly. | ▶ 00:21 |
We can see the same if we plot the linear points. | ▶ 00:24 |
Over here we could fit a linear function, but it's very shallow, | ▶ 00:28 |
whereas those points over here have a much steeper situation. | ▶ 00:32 |
So any linear function would probably miss these points in between. | ▶ 00:36 |
[Thrun] I would now like to ask you to perform linear regression on these data points | ▶ 00:00 |
and calculate for me W0 and W1. | ▶ 00:05 |
as defined in this class; you might have to go back | ▶ 00:09 |
and look up the exact formulas from the lecture that I taught on linear regression. | ▶ 00:12 |
[Thrun] For answering these questions, let me restate the essential formulas. | ▶ 00:00 |
W1 is obtained by M times sum of XY minus sum of X times sum of Y | ▶ 00:05 |
over M times sum Xi square minus sum of Xi in brackets square. | ▶ 00:12 |
And if you plug in these numbers over here for M equals 5 | ▶ 00:19 |
because there's 5 training examples, we get 5 times 88 minus 10 times 35 | ▶ 00:25 |
over 5 times 30 minus 100, which is 1.8. | ▶ 00:34 |
That is the correct answer for W1. | ▶ 00:40 |
W0 was obtained by 1 over M times sum over Ys minus W1 over M times sum over X. | ▶ 00:43 |
And plugging in the table over here gives us 1/5 times 35 minus 1.8 over 5 times 10, | ▶ 00:52 |
and that is 3.4, which would have been the correct answer over here. | ▶ 01:02 |
And again here are the data points with the solution. | ▶ 01:07 |
So if you take the axis where X equals 0, | ▶ 01:10 |
the Y value is actually 3.4, and the slope is 1.8. | ▶ 01:14 |
It's a little smaller than if you just look at the end points, | ▶ 01:20 |
which would give us a slope of 2, because there is a residual error over here, | ▶ 01:23 |
a residual error over here, a residual error over here, and a residual error over here. | ▶ 01:27 |
The resulting linear function ends up splitting in a quadratically optimal way | ▶ 01:30 |
the errors between these different data points. | ▶ 01:37 |
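Here is a short Python check of these formulas. The transcript doesn't show the full data table, so the values below are an assumption: X = 0..4 with Y = 3, 6, 7, 8, 11, chosen because they reproduce the sums used above (sum X = 10, sum Y = 35, sum X squared = 30, sum XY = 88).

    xs = [0, 1, 2, 3, 4]
    ys = [3, 6, 7, 8, 11]
    m = len(xs)

    sum_x  = sum(xs)
    sum_y  = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    w1 = (m * sum_xy - sum_x * sum_y) / (m * sum_xx - sum_x ** 2)   # slope, 1.8
    w0 = sum_y / m - w1 * sum_x / m                                 # intercept, 3.4
    print(w0, w1)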
In my next question I would like to ask you about K-nearest neighbors. | ▶ 00:00 |
Consider the following data set | ▶ 00:04 |
where plus indicates a positive training example | ▶ 00:07 |
and minus a negative training example in this 2-dimensional space. | ▶ 00:10 |
I want you, for the following places, | ▶ 00:14 |
to check for those boxes over here | ▶ 00:18 |
whether they will be plus for K=5. | ▶ 00:21 |
Only check those boxes for which the label will be positive. | ▶ 00:26 |
And the answer would be this box over here | ▶ 00:00 |
and this box over here, nothing else. | ▶ 00:03 |
This guy has clearly 4 positive nearest neighbors, | ▶ 00:06 |
so no matter what the 5th one is we stay positive. | ▶ 00:10 |
Similarly, this guy over here, | ▶ 00:14 |
when you draw a circle, | ▶ 00:16 |
has probably these 4 guys as nearest neighbors, | ▶ 00:18 |
perhaps this one as well, but it is a little further away. | ▶ 00:21 |
With those 4 ones, it already has 3 pluses, | ▶ 00:24 |
so whatever the 5th one is it can't overturn, it must be positive. | ▶ 00:27 |
All of these are negative, even this one over here has just 2 pluses as neighbors | ▶ 00:30 |
that are positive, and those guys over here are all negative. | ▶ 00:34 |
Similarly, over here there are possibly 2 pluses | ▶ 00:37 |
in the 5 nearest neighbors, | ▶ 00:41 |
but these guys over here are all negative, and this guy is | ▶ 00:43 |
surrounded by negative examples, so they will just be negative. | ▶ 00:46 |
Here's another nearest neighbor example, and now I'm going to ask a different question. | ▶ 00:00 |
Given all the black data points, | ▶ 00:05 |
I want to make sure that the red ones are classified as indicated | ▶ 00:07 |
and I am free to choose a different value for K. | ▶ 00:10 |
Say I can choose K to be 1, 3, 5, 7, or 9. | ▶ 00:14 |
Check any or all of the K values | ▶ 00:19 |
for which you believe these 3 data points | ▶ 00:23 |
are classified correctly relative to the black training data set. | ▶ 00:26 |
And the answer is just 5. | ▶ 00:00 |
If you look carefully for K=1, | ▶ 00:04 |
this guy will be mis-classified. | ▶ 00:07 |
It's closer to a plus than a minus. | ▶ 00:10 |
Similarly, for K=3, this guy has 2 nearby pluses and 1 minus, | ▶ 00:13 |
so it would be positive. | ▶ 00:17 |
For K=5, to get the correct answer, | ▶ 00:19 |
the 5 nearest neighbors of this guy are | ▶ 00:22 |
those 3 minuses plus perhaps those 2 pluses over here. | ▶ 00:25 |
This guy has in his 5 neighborhood | ▶ 00:30 |
these 3 pluses over here, plus a minus, plus a plus. | ▶ 00:33 |
This data point over here has 2 pluses with 3 of the surrounding minuses, | ▶ 00:36 |
and they are all classified correctly. | ▶ 00:41 |
For K=7, this minus data point will have | ▶ 00:43 |
4 pluses, 3 minuses over here, | ▶ 00:48 |
and then everything in the vicinity becomes positive, | ▶ 00:51 |
so with 4 pluses against 3 minuses it will be mis-classified. | ▶ 00:54 |
The same is true for K=9. | ▶ 00:57 |
The minus over here will have 1, 2, 3 minuses, | ▶ 01:00 |
5 pluses, and the minus over here, | ▶ 01:04 |
which makes 4 minuses. | ▶ 01:07 |
It will be classified as positive. | ▶ 01:10 |
So K=5 would have been the only correct answer. | ▶ 01:12 |
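Here is a minimal Python sketch of the k-nearest-neighbor rule used in these quizzes: find the k closest training points and take a majority vote. The training points and query below are made up for illustration; they are not the figure from the video.

    import math
    from collections import Counter

    def knn_classify(train, query, k):
        # train is a list of ((x, y), label) pairs; query is a point (x, y).
        by_dist = sorted(train, key=lambda item: math.dist(item[0], query))
        votes = Counter(label for _, label in by_dist[:k])
        return votes.most_common(1)[0][0]

    train = [((0, 0), '+'), ((1, 0), '+'), ((0, 1), '+'),
             ((4, 4), '-'), ((5, 4), '-'), ((4, 5), '-'), ((5, 5), '-')]
    print(knn_classify(train, (0.5, 0.5), k=5))   # '+': 3 of the 5 nearest are positive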
Now let me ask about the perceptron algorithm. Suppose you have the following | ▶ 00:00 |
2-dimensional data set where plus indicates | ▶ 00:05 |
a positive class label and minus a negative class label, and my first question is, | ▶ 00:08 |
"Are these data linearly separable?" | ▶ 00:13 |
I'd also like to know if we start perceptron | ▶ 00:15 |
with an initial separating plane like this, | ▶ 00:18 |
will it actually converge? | ▶ 00:21 |
Please check the appropriate boxes yes or no, and yes or no. | ▶ 00:25 |
[Narrator] And the answer is yes in both cases. | ▶ 00:00 |
There is a linear separation that | ▶ 00:03 |
goes along here that separates the positive class from the negative class, | ▶ 00:05 |
and it's been shown in the 60s | ▶ 00:09 |
that the perceptron algorithm always | ▶ 00:12 |
converges after finitely many steps | ▶ 00:14 |
if such a linear separator exists. | ▶ 00:16 |
I'm not going to prove this, and I didn't prove this | ▶ 00:18 |
in class, but I clearly stated this in class. | ▶ 00:20 |
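Here is a minimal Python sketch of the perceptron update rule. On linearly separable data it converges after finitely many mistakes; the four training points below are my own toy example, not the figure from the quiz.

    def train_perceptron(data, epochs=100):
        w = [0.0, 0.0, 0.0]       # weights (bias, w1, w2); an initial separating plane
        for _ in range(epochs):
            mistakes = 0
            for (x1, x2), label in data:                   # label is +1 or -1
                pred = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1
                if pred != label:                          # update only on mistakes
                    w[0] += label
                    w[1] += label * x1
                    w[2] += label * x2
                    mistakes += 1
            if mistakes == 0:                              # no mistakes: data is separated
                return w
        return w

    data = [((1, 2), 1), ((2, 3), 1), ((3, 1), -1), ((4, 2), -1)]
    print(train_perceptron(data))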
Congratulations! You just finished the third homework assignment. | ▶ 00:00 |
Welcome back. | ▶ 00:00 |
So far we've talked about AI | ▶ 00:02 |
as managing complexity and uncertainty. | ▶ 00:04 |
We've seen how a search can discover sequences | ▶ 00:08 |
of actions to solve problems. | ▶ 00:11 |
We've seen how probability theory | ▶ 00:13 |
can represent and reason with uncertainty. | ▶ 00:15 |
And we've seen how machine learning | ▶ 00:18 |
can be used to learn and improve. | ▶ 00:20 |
AI is a big and dynamic field | ▶ 00:24 |
because we are pushing against complexity | ▶ 00:26 |
in at least 3 directions. | ▶ 00:28 |
First, in terms of agent design, | ▶ 00:30 |
we start with a simple reflex-based agent | ▶ 00:32 |
and move into goal-based and utility-based agents. | ▶ 00:35 |
Secondly, in terms of the complexity of the environment, | ▶ 00:39 |
we start with simple environments | ▶ 00:42 |
and then start looking at partial observability, | ▶ 00:44 |
stochastic actions, multiple agents, and so on. | ▶ 00:47 |
And finally, in terms of representation, | ▶ 00:51 |
the agent's model of the world | ▶ 00:54 |
becomes increasingly complex. | ▶ 00:56 |
And this unit will concentrate | ▶ 00:58 |
on that third aspect of representation, | ▶ 01:00 |
showing how the tools of logic | ▶ 01:03 |
can be used by an agent to better model the world. | ▶ 01:05 |
The first logic we will consider is called propositional logic. | ▶ 00:00 |
Let's jump right into an example, recasting the alarm problem in propositional logic. | ▶ 00:07 |
We have propositional symbols B, E, A, M, and J | ▶ 00:12 |
corresponding to the events of a burglary occurring, of the earthquake occurring, | ▶ 00:23 |
of the alarm going off, of Mary calling, and of John calling. | ▶ 00:28 |
And just as in the probabilistic models, | ▶ 00:34 |
these can be either true or false, | ▶ 00:37 |
but unlike in probability, our degree of belief in propositional logic | ▶ 00:40 |
is not a number. | ▶ 00:44 |
Rather, our belief is that each of these is either true or false or unknown. | ▶ 00:47 |
Now, we can make logical sentences using these symbols | ▶ 00:53 |
and also using the logical constants true and false | ▶ 00:57 |
by combining them together using logical operators. | ▶ 01:04 |
For example, we can say that the alarm is true | ▶ 01:08 |
whenever the earthquake or burglary is true with this sentence. | ▶ 01:12 |
(E V B) => A--E or B implies A. | ▶ 01:16 |
So that says whenever the earthquake or the burglary is true, | ▶ 01:28 |
then the alarm will be true. | ▶ 01:35 |
We use this V symbol to mean or | ▶ 01:38 |
and a right arrow to mean implies. | ▶ 01:40 |
We could also say that it would be true that both John and Mary call | ▶ 01:43 |
when the alarm is true. | ▶ 01:47 |
We write that as A implies (J ^ M) | ▶ 01:50 |
and we use this symbol ^ to indicate an and, | ▶ 02:01 |
so that this upward-facing wedge looks kind of like an A | ▶ 02:05 |
with the crossbar missing, and so you can remember A is for "and" | ▶ 02:09 |
whereas this downward-facing V symbol is the opposite of and, | ▶ 02:14 |
so that's the symbol for or. | ▶ 02:19 |
Now, there's 2 more connectors we haven't seen yet. | ▶ 02:22 |
There's a double arrow for equivalent, also known as a biconditional, | ▶ 02:25 |
and a not sign for negation, | ▶ 02:29 |
so we could say if we wanted to that John calls if and only if Mary calls. | ▶ 02:32 |
We would write that as J <=> M. | ▶ 02:39 |
John is equivalent to Mary--when one is true, the other is true; | ▶ 02:45 |
when one is false, the other is false. | ▶ 02:48 |
Or we could say that when John calls, Mary doesn't, and vice versa. | ▶ 02:51 |
We could write that as J <=> not M--John is equivalent to not Mary, | ▶ 02:56 |
and this is the not sign. | ▶ 03:04 |
Now, how do we know what the sentences mean? | ▶ 03:08 |
A propositional logic sentence is either true or false | ▶ 03:11 |
with respect to a model of the world. | ▶ 03:14 |
Now, a model is just a set of true/false values for all the propositional symbols, | ▶ 03:17 |
so a model might be the set B is true, E is false, and so on. | ▶ 03:21 |
We can define the truth of the sentence in terms of the truth of the symbols | ▶ 03:34 |
with respect to the models using truth tables. | ▶ 03:39 |
[Male narrator] Here are the truth tables for all the logical connectives. | ▶ 00:00 |
What a truth table does is list all the possibilities for the propositional symbols, | ▶ 00:05 |
so P and Q can be false and false, false and true, true and false, or true and true. | ▶ 00:10 |
Those are the only 4 possibilities, | ▶ 00:16 |
and then for each of those possibilities, the truth table lists the truth value | ▶ 00:19 |
of the compound sentence. | ▶ 00:24 |
So the sentence not P is true when P is false and false when P is true. | ▶ 00:26 |
The sentence P and Q is true only when both P and Q are true and false otherwise. | ▶ 00:32 |
The sentence P or Q is true when either P or Q is true | ▶ 00:41 |
and false when both are false. | ▶ 00:47 |
Now, so far, those mostly correspond to the English meaning of those sentences | ▶ 00:50 |
with one exception, which is that in English, the word "or" is somewhat ambiguous | ▶ 00:57 |
between the inclusive and exclusive or, | ▶ 01:02 |
and this "or" means either or both. | ▶ 01:07 |
We translate this mark into English P implies Q; or as if P, then Q, | ▶ 01:12 |
but the meaning in logic is not quite the same as the meaning in ordinary English. | ▶ 01:19 |
The meaning in logic is defined explicitly by this truth table | ▶ 01:24 |
and by nothing else, but let's look at some examples in ordinary English. | ▶ 01:29 |
If we have the proposition O and have that mean 5 is an odd number | ▶ 01:34 |
and P meaning Paris is the capital of France, | ▶ 01:44 |
then under the ordinary model of the truth in the real world, | ▶ 01:50 |
what could we say about the sentence O implies P? | ▶ 01:54 |
That is, 5 is an odd number implies Paris is the capital of France. | ▶ 02:01 |
Would that be true or false? | ▶ 02:08 |
And let's look at one more example. | ▶ 02:14 |
If E is the proposition that 5 is an even number | ▶ 02:17 |
and M is the proposition that Moscow is the capital of France, | ▶ 02:21 |
what about E implies M? | ▶ 02:26 |
5 is an even number implies Moscow is the capital of France. | ▶ 02:31 |
Is that true or false? | ▶ 02:36 |
[Male narrator] The answers are first, | ▶ 00:00 |
the sentence if 5 is an odd number, | ▶ 00:03 |
then Paris is the capital of France, is true | ▶ 00:06 |
in propositional logic. | ▶ 00:10 |
It may sound odd in ordinary English, | ▶ 00:12 |
but in propositional logic, this is the same as true implies true | ▶ 00:15 |
and if we look on this line--the final line for P and Q, | ▶ 00:21 |
P implies Q is true. | ▶ 00:25 |
The second sentence, 5 is an even number, implies Moscow is the capital of France. | ▶ 00:28 |
That's the same as false implies false, | ▶ 00:35 |
and false implies false according to the definition is also true. | ▶ 00:38 |
[Male narrator] Here's a quiz. | ▶ 00:00 |
Use truth tables or whatever other method you want | ▶ 00:02 |
to fill in the values of these tables. | ▶ 00:06 |
For each of the values of P and Q--false/false, false/true, true/false, or true/true-- | ▶ 00:09 |
look at each of these boxes and click on just the boxes | ▶ 00:14 |
in which the formula for that column will be true. | ▶ 00:18 |
So which of these 4 boxes, if any, will this formula be true, | ▶ 00:22 |
and this formula and this formula? | ▶ 00:28 |
[Male narrator] Here are the answers. | ▶ 00:00 |
For P and P implies Q, we know that P is true | ▶ 00:03 |
in these bottom 2 cases, and P implies Q, we saw the truth table for P implies Q | ▶ 00:08 |
is true in the first, second, and fourth case. | ▶ 00:14 |
So the only case that's true for both P and P implies Q is the fourth case. | ▶ 00:19 |
Now, this formula, not of the quantity (not P or not Q), works out to be the same | ▶ 00:28 |
as P and Q, and we know that P and Q is true only when both are true, | ▶ 00:37 |
so that would be true only in the fourth case and none of the other cases. | ▶ 00:46 |
And now, we're asking for an equivalent or biconditional between these 2 cases. | ▶ 00:51 |
Is this one the same as this one? | ▶ 00:57 |
And we see that it is the same because they match up in all 4 cases. | ▶ 00:59 |
They're false for each of the first 3 and true in the fourth one, | ▶ 01:03 |
so that means that this is going to be true no matter what. | ▶ 01:07 |
They're always equivalent, either both false or both true, | ▶ 01:11 |
and so we should check all 4 boxes. | ▶ 01:15 |
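Here is a small Python truth-table check of this quiz, enumerating all four models of P and Q; the implies helper is just the truth table for the arrow.

    from itertools import product

    def implies(a, b):
        return (not a) or b

    for P, Q in product([False, True], repeat=2):
        f1 = P and implies(P, Q)            # first column: P and (P implies Q)
        f2 = not ((not P) or (not Q))       # second column: not (not P or not Q)
        f3 = (f1 == f2)                     # third column: the biconditional of the two
        print(P, Q, f1, f2, f3)
    # Only the True/True row makes f1 and f2 true; f3 is true in all four rows.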
[Male narrator] Here's one more example of reasoning in propositional logic. | ▶ 00:00 |
In a particular model of the world, we know the following 3 sentences are true. | ▶ 00:04 |
E or B implies A, | ▶ 00:08 |
A implies J and M, | ▶ 00:15 |
and B. | ▶ 00:23 |
We know those 3 sentences to be true, and that's all we know. | ▶ 00:26 |
Now, I want you to tell me for each of the 5 propositional symbols, | ▶ 00:31 |
is that symbol true or false, or unknown in this model, | ▶ 00:38 |
and tell me for the symbols E, B, A, J, and M. | ▶ 00:45 |
The answer is that B is true. | ▶ 00:00 |
And we know that because it was one of the 3 sentences that was given to us. | ▶ 00:04 |
And now the first sentence says that if E or B is true, then A is true. | ▶ 00:08 |
So now we know that A is true. | ▶ 00:15 |
And the second sentence says if A is true then J and M are true. | ▶ 00:17 |
What about E? That wasn't mentioned. | ▶ 00:24 |
Does that mean E is false? No. | ▶ 00:26 |
It means that it is unknown: a model where E is true and a model where E is false | ▶ 00:28 |
would both satisfy these 3 sentences. So we mark E as unknown. | ▶ 00:34 |
Now for a little more terminology. | ▶ 00:00 |
We say that a valid sentence is one that is true in every possible model, | ▶ 00:03 |
for every combination of values of the propositional symbols. | ▶ 00:09 |
And a satisfiable sentence is one that is true in some models, but not necessarily in all the models. | ▶ 00:14 |
So what I want you to do is tell me for each of these sentences, | ▶ 00:24 |
whether it is valid, satisfiable but not valid, or unsatisfiable, in other words, false for all models. | ▶ 00:30 |
And the sentences are P or not P, P and not P, | ▶ 00:42 |
P or Q or P is equivalent to Q, P implies Q or Q implies P. | ▶ 00:51 |
And finally, Food implies Party or Drinks implies party implies Food and Drinks implies Party. | ▶ 01:10 |
The answers are: P or not P is valid. | ▶ 00:00 |
That is, it's true when P is true because of this, and it's true when P is false because of this clause. | ▶ 00:05 |
P and not P is unsatisfiable. | ▶ 00:13 |
A symbol can't be both true and false at the same time. | ▶ 00:17 |
P or Q or P is equivalent to Q is valid. | ▶ 00:22 |
So we know that it's true when either P or Q is true, so that's 3 out of the 4 cases. | ▶ 00:28 |
In the fourth case, both P and Q are false, and that means P is equivalent to Q. | ▶ 00:34 |
And therefore, in all 4 cases, it's true. | ▶ 00:40 |
P implies Q or Q implies P, that's also valid. | ▶ 00:44 |
Now in ordinary English that wouldn't be valid. | ▶ 00:48 |
If the 2 clauses or the 2 symbols P and Q were irrelevant to each other we wouldn't say that either one of those was true. | ▶ 00:51 |
But in logic, one or the other must be true, according to the definitions of the truth tables. | ▶ 00:58 |
And finally, this one's more complicated, | ▶ 01:04 |
if Food then Party or if Drinks then Party implies if Food and Drinks then Party. | ▶ 01:08 |
You can work it all out and both sides of the main implication work out to be equivalent to Not Food or Not Drinks or Party. | ▶ 01:17 |
So that's the same as saying P implies P, saying one side is equivalent to the other side. | ▶ 01:29 |
And if they're equivalent, then the implication relation holds. | ▶ 01:35 |
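Here is a sketch in Python of deciding valid, satisfiable, or unsatisfiable by enumerating every model, for sentences over P and Q; each sentence is written as a small function of the two truth values.

    from itertools import product

    def classify(sentence):
        values = [sentence(P, Q) for P, Q in product([False, True], repeat=2)]
        if all(values):
            return "valid"
        return "satisfiable" if any(values) else "unsatisfiable"

    print(classify(lambda P, Q: P or not P))                        # valid
    print(classify(lambda P, Q: P and not P))                       # unsatisfiable
    print(classify(lambda P, Q: P or Q or (P == Q)))                # valid
    print(classify(lambda P, Q: ((not P) or Q) or ((not Q) or P)))  # valid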
Propositional logic. It's a powerful language for what it does. | ▶ 00:00 |
And there are very efficient inference mechanisms for determining | ▶ 00:04 |
validity and satisfiability, although we haven't discussed them. | ▶ 00:07 |
But propositional logic has a few limitations. | ▶ 00:12 |
First, it can only handle true and false values. | ▶ 00:15 |
No capability to handle uncertainty like we did in probability theory. | ▶ 00:19 |
And second, we can only talk about events that are true or false in the world. | ▶ 00:27 |
We can't talk about objects that have properties, | ▶ 00:31 |
such as size, weight, color, and so on. | ▶ 00:37 |
Nor can we talk about the relations between objects. | ▶ 00:40 |
And third, there are no shortcuts to succinctly talk about a lot of different things happening. | ▶ 00:44 |
Say if we had a vacuum world with a thousand locations, and we wanted to say that every location is free of dirt. | ▶ 00:53 |
We would need a conjunction of a thousand propositions. | ▶ 00:59 |
There's no way to have a single sentence saying that all the locations are clean all at once. | ▶ 01:03 |
So, we will next cover first-order logic which addresses these two limitations. | ▶ 01:09 |
[Norvig] I'm going to talk about first order logic | ▶ 00:00 |
and its relation to the other logics we've seen so far-- | ▶ 00:04 |
namely, propositional logic and probability theory. | ▶ 00:09 |
We're going to talk about them in terms of what they say about the world, | ▶ 00:18 |
which we call the ontological commitment of these logics, | ▶ 00:23 |
and what types of beliefs agents can have using these logics, | ▶ 00:29 |
which we call the epistemological commitments. | ▶ 00:35 |
So in first order logic we have relations about things in the world, | ▶ 00:39 |
objects, and functions on those objects. | ▶ 00:46 |
And what we can believe about those relations is that they're true or false or unknown. | ▶ 00:49 |
So this is an extension of propositional logic | ▶ 00:59 |
in which all we had was facts about the world | ▶ 01:02 |
and we could believe that those facts were true or false or unknown. | ▶ 01:06 |
In probability theory we had the same types of facts as in propositional logic-- | ▶ 01:13 |
the symbols or variables--but the beliefs could be a real number in the range 0 to 1. | ▶ 01:21 |
So logics vary both in what you can say about the world | ▶ 01:30 |
and what you can believe about what's been said about the world. | ▶ 01:34 |
Another way to look at representation | ▶ 01:38 |
is to break the world up into representations that are atomic, | ▶ 01:41 |
meaning that a representation of the state is just an individual state | ▶ 01:50 |
with no pieces inside of it. | ▶ 01:54 |
And that's what we used for search and problem solving. | ▶ 01:57 |
We had a state, like state A, | ▶ 02:03 |
and then we transitioned to another state, like state B, | ▶ 02:06 |
and all we could say about those states was are they identical to each other or not | ▶ 02:11 |
and maybe is one of them a goal state or not. | ▶ 02:15 |
But there wasn't any internal structure to those states. | ▶ 02:19 |
In propositional logic, as well as in probability theory, | ▶ 02:24 |
we break up the world into a set of facts that are true or false, | ▶ 02:28 |
so we call this a factored representation-- | ▶ 02:33 |
that is, the representation of an individual state of the world | ▶ 02:37 |
is factored into several variables--the B and E and A and M and J, for example-- | ▶ 02:41 |
and those could be Boolean variables or in some types of representations-- | ▶ 02:47 |
not in propositional logic--they can be other types of variables besides Boolean. | ▶ 02:51 |
Then the third type--the most complex type of representation--we call structured. | ▶ 02:59 |
And in a structured representation, an individual state is not just a set of values for variables, | ▶ 03:06 |
but it can include relationships between objects, | ▶ 03:14 |
a branching structure, and complex representations and relations | ▶ 03:17 |
between one object and another. | ▶ 03:22 |
And that's what we see in traditional programming languages, | ▶ 03:25 |
it's what we see in databases--they're called structured databases, | ▶ 03:28 |
and we have structured query languages over those databases-- | ▶ 03:32 |
and that's a more powerful representation, | ▶ 03:36 |
and that's what we get in first order logic. | ▶ 03:39 |
[Norvig] How does first order logic work? What does it do? | ▶ 00:00 |
Like propositional logic, we start with a model. | ▶ 00:04 |
In propositional logic a model was a value for each propositional symbol. | ▶ 00:08 |
So we might say that the symbol P was true | ▶ 00:13 |
and the symbol Q was false, | ▶ 00:18 |
and that would be a model that corresponds to what's going on in a possible world. | ▶ 00:22 |
In first order logic the models are more complex. | ▶ 00:30 |
We start off with a set of objects. | ▶ 00:32 |
Here I've shown 4 objects, these 4 tiles, | ▶ 00:35 |
but we could have more objects than that. | ▶ 00:39 |
We could say, for example, that the numbers 1, 2, and 3 | ▶ 00:42 |
were also objects in our model. | ▶ 00:46 |
So we have a set of objects. | ▶ 00:49 |
We can also have a set of constants that refer to those objects. | ▶ 00:51 |
So I could use the constant names A, B, C, D, 1, 2, 3, | ▶ 00:58 |
but I don't have to have a one-to-one correspondence | ▶ 01:08 |
between constants and objects. | ▶ 01:10 |
I could have 2 different constant names that refer to the same object. | ▶ 01:13 |
I could also have, say, the name C that refers to this object, | ▶ 01:18 |
or I could have some of the objects that don't have any names at all. | ▶ 01:24 |
But I've got a set of constants, and I also have a set of functions. | ▶ 01:28 |
A function is defined as a mapping from objects to objects. | ▶ 01:38 |
And so, for example, I might have the Number Of function | ▶ 01:46 |
that maps from a tile to the number on that tile, | ▶ 01:52 |
and that function then would be defined by the mapping from A to 1 | ▶ 01:56 |
and B to 3 and C to 3 and D to 2, | ▶ 02:04 |
and I could have other functions as well. | ▶ 02:13 |
In addition to functions, I can have relations. | ▶ 02:17 |
For example, I could have the Above relation, | ▶ 02:23 |
and I could say in this model of the world the Above relation is a set of tuples. | ▶ 02:28 |
Say A is above B and C is above D. | ▶ 02:36 |
So that was a binary relation holding between 2 objects. | ▶ 02:41 |
Say 1 block is above another block. | ▶ 02:46 |
We can have other types of relations. | ▶ 02:50 |
For example, here is a unary relation--vowel-- | ▶ 02:52 |
and if we want to say the relation Vowel is true only of the object that we call A, | ▶ 02:57 |
then that's a set of tuples of length 1 that contains just A. | ▶ 03:04 |
We can even have relations over no objects. | ▶ 03:11 |
Say we wanted to have the relation Rainy, which doesn't refer to any objects at all | ▶ 03:16 |
but just refers to the current situation. | ▶ 03:20 |
Then since it's not rainy today, we would represent that as the empty set. | ▶ 03:24 |
There's no tuples corresponding to that relation. | ▶ 03:30 |
Or, if it was rainy, we could say that it's represented by a singleton set, | ▶ 03:34 |
and since the arity of Rainy is 0, there would be 0 elements in each one of those tuples. | ▶ 03:42 |
So that's what a model in first order logic looks like. | ▶ 03:50 |
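Here is a small Python sketch of this model, with the function as a dictionary and each relation as a set of tuples; the names mirror the example above.

    objects   = {"A", "B", "C", "D", 1, 2, 3}
    number_of = {"A": 1, "B": 3, "C": 3, "D": 2}   # a function: tiles to numbers
    above     = {("A", "B"), ("C", "D")}           # a binary relation: set of 2-tuples
    vowel     = {("A",)}                           # a unary relation: set of 1-tuples
    rainy     = set()                              # an arity-0 relation; empty means false

    print(("A", "B") in above)    # True: A is above B in this model
    print(("B", "A") in above)    # False: the model says nothing in the other direction
    print(number_of["A"])         # 1
    print(bool(rainy))            # False: it isn't rainy in this model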
[Man] Now let's talk about the syntax of first order logic, | ▶ 00:00 |
and like in propositional logic, | ▶ 00:05 |
we have sentences which describe facts that are true or false. | ▶ 00:09 |
But unlike propositional logic, we also have terms | ▶ 00:14 |
which describe objects. | ▶ 00:20 |
Now, the atomic sentences are predicates corresponding to relations, | ▶ 00:22 |
so we can say vowel (A) is an atomic sentence | ▶ 00:29 |
or above (A, B). | ▶ 00:37 |
And we also have a distinguished relation--the equality relation. | ▶ 00:43 |
We can say 2 = 2 and the equality relation is always in every model, | ▶ 00:49 |
and sentences can be combined with all the operators from propositional logic | ▶ 00:58 |
so that's and, or, not, implies, equivalent, and parentheses. | ▶ 01:07 |
Now, terms, which refer to objects, can be constants, | ▶ 01:20 |
like A, B, and 2. | ▶ 01:26 |
They can be variables. | ▶ 01:30 |
We normally use lowercase, like x and y. | ▶ 01:32 |
And they can be functions, like number of A, | ▶ 01:36 |
which is just another name or another expression that refers to the same object as 1, | ▶ 01:41 |
at least in the model that we showed previously. | ▶ 01:48 |
And then, there's 1 more type of complex sentence | ▶ 01:50 |
besides the sentences we get by combining operators, | ▶ 01:53 |
that makes first order logic unique, and these are the quantifiers. | ▶ 01:57 |
And there are two quantifiers: for all, which we write with an upside-down A | ▶ 02:03 |
followed by a variable that it introduces | ▶ 02:09 |
and there exists, which we write with an upside-down E | ▶ 02:13 |
followed by the variable that it introduces. | ▶ 02:18 |
So for example, we could say for all x, if x is a vowel, | ▶ 02:21 |
then the number of (x) is equal to 1, | ▶ 02:28 |
and that's the valid sentence in first order logic. | ▶ 02:33 |
Or we could say there exists an x such that the number of (x) | ▶ 02:36 |
is equal to 2, | ▶ 02:45 |
and this is saying that there's some object in the domain | ▶ 02:47 |
to which the number of function applies and has a value of 2, | ▶ 02:51 |
but we're not saying what that object is. | ▶ 02:55 |
Now, another note is that sometimes as an abbreviation, | ▶ 02:58 |
we'll omit the quantifier, and when we do that, | ▶ 03:01 |
you can just assume that it means for all; that's left out just as a shortcut. | ▶ 03:06 |
And I should say that these forms, or these sentences are typical, | ▶ 03:13 |
and you'll see these form over and over again, | ▶ 03:16 |
so typically, whenever we have a "for all" quantifier introduced, | ▶ 03:19 |
it tends to go with a conditional like vowel of (x) implies number of (x) =1, | ▶ 03:24 |
and the reason is because we usually don't want to say something about every object | ▶ 03:31 |
in the domain, since the objects can be so different, | ▶ 03:35 |
but rather, we want to say something about a particular type of object, | ▶ 03:39 |
say, in this case, vowels. | ▶ 03:43 |
And also, typically, when we have an exists an x, or an exists any variable, | ▶ 03:45 |
that typically goes with just a form like this, | ▶ 03:54 |
and not with a conditional, because we're talking about just 1 object | ▶ 03:58 |
that we want to describe. | ▶ 04:02 |
[man] Now let's go back to the 2-location vacuum world | ▶ 00:00 |
and represent it in first order logic. | ▶ 00:03 |
So first of all, we can have locations. | ▶ 00:06 |
We can call the left location A and the right location B | ▶ 00:09 |
and the vacuum V, and the dirt--say, D1 and D2. | ▶ 00:15 |
Then, we can have relations. | ▶ 00:23 |
The relation loc, which is true of any location; | ▶ 00:27 |
vacuum, which is true of the vacuum; | ▶ 00:32 |
dirt, which is true of dirt; | ▶ 00:34 |
and at, which is true of an object and a location. | ▶ 00:37 |
And so if we wanted to say the vacuum is at location A, | ▶ 00:44 |
we just say at (V, A). | ▶ 00:49 |
If we want to say there's no dirt in any location, it's a little bit more complicated. | ▶ 00:54 |
We can say for all dirt and for all locations, | ▶ 01:00 |
if D is a dirt, and L is a location, | ▶ 01:07 |
then D is not at L. | ▶ 01:13 |
So that says there's no dirt in any location. | ▶ 01:18 |
Now, note if there were thousands of locations instead of just 2, | ▶ 01:21 |
this sentence would still hold, and that's really the power of first order logic. | ▶ 01:26 |
Let's keep going and try some more examples. | ▶ 01:32 |
If I want to say the vacuum is in a location with dirt without specifying what location it's in, | ▶ 01:35 |
I can do that. | ▶ 01:42 |
I can say there exists an L and there exists a D | ▶ 01:44 |
such that D is a dirt and L is a location | ▶ 01:53 |
and the vacuum is at the location | ▶ 02:01 |
and the dirt is at that same location. | ▶ 02:07 |
and that's the power of first order logic. | ▶ 02:11 |
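Here is a sketch of evaluating these quantified sentences over a finite model in Python, where "for all" becomes all() and "there exists" becomes any(); the particular model below, a clean two-location world, is an assumption for illustration.

    objects = {"A", "B", "V", "D1", "D2"}
    loc     = {"A", "B"}
    dirt    = {"D1", "D2"}
    at      = {("V", "A")}    # the vacuum is at location A; no dirt is at any location

    # For all d, l: dirt(d) and loc(l) implies not at(d, l)  -- no dirt anywhere
    no_dirt = all((not (d in dirt and l in loc)) or ((d, l) not in at)
                  for d in objects for l in objects)

    # Exists l, d: dirt(d) and loc(l) and at(V, l) and at(d, l)
    dirt_under_vacuum = any(d in dirt and l in loc and ("V", l) in at and (d, l) in at
                            for d in objects for l in objects)

    print(no_dirt, dirt_under_vacuum)   # True False in this clean model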
Now one final thing. | ▶ 02:14 |
You might ask what "first order" means. | ▶ 02:16 |
It means that the relations are on objects, but not on relations, | ▶ 02:19 |
and that would be called "higher order." | ▶ 02:24 |
In higher order logic, we could, say, define the notion of a transitive relation | ▶ 02:26 |
talking about relations itself, and so we could say | ▶ 02:33 |
for all R, transitive of R is equivalent to for all A, B, and C; | ▶ 02:38 |
R of (A, B) and R of (B, C) implies R (A, C). | ▶ 02:52 |
So that would be a valid statement in higher order logic | ▶ 03:06 |
that would define the notion of a transitive relation, | ▶ 03:10 |
but this would be invalid in first order logic. | ▶ 03:13 |
[Man] Now let's get some practice in first order logic. | ▶ 00:00 |
I'm going to give you some sentences, and for each one, | ▶ 00:03 |
I want you to tell me if it is valid--that is, always true-- | ▶ 00:06 |
satisfiable, but not valid; that is, there's some models for which it is true; | ▶ 00:12 |
or unsatisfiable, meaning there are no models for which it is true. | ▶ 00:19 |
And the first sentence is there exists an x and a y | ▶ 00:25 |
such that x = y. | ▶ 00:31 |
Second sentence: there exists an x such that x = x, | ▶ 00:35 |
implies for all y there exists a z such that y = z. | ▶ 00:43 |
Third sentence: for all x, p of x or not p of x. | ▶ 00:56 |
And fourth: there exists an x, P of x. | ▶ 01:06 |
[Man] The answers are the first sentence is valid. | ▶ 00:00 |
It's always true. | ▶ 00:04 |
Why is that? | ▶ 00:06 |
Because every model has to have at least 1 object | ▶ 00:07 |
and we can have both x and y refer to that same object, | ▶ 00:10 |
and so that object must be equal to itself. | ▶ 00:14 |
Second, let's see. | ▶ 00:17 |
The left-hand side of this implication has to be true. | ▶ 00:20 |
X is always equal to x, | ▶ 00:23 |
and the right-hand side says for every y, does there exist a z | ▶ 00:25 |
such that y equals z? | ▶ 00:31 |
And we can say yes, there is. | ▶ 00:34 |
We can always choose y itself for the value of z, | ▶ 00:35 |
and then y = y, so true implies true. | ▶ 00:38 |
That's always true. | ▶ 00:42 |
Valid. | ▶ 00:45 |
Third sentence: for all x, P of x or not P of x, | ▶ 00:46 |
and that's always true because everything has to be either in the relation for P | ▶ 00:50 |
or out of the relation for P, so that's valid. | ▶ 00:55 |
And the fourth: there exists an x, P of x, and that's true for the models | ▶ 01:00 |
in which there is some x that is a member of P, | ▶ 01:05 |
but there doesn't necessarily have to be any at all. | ▶ 01:09 |
P might be an empty relation, so this is satisfiable. | ▶ 01:11 |
True in some models, but not true in all models. | ▶ 01:16 |
[Man] Now I'm going to give you some sentences or axioms in first order logic, | ▶ 00:00 |
and I want you to tell me if they correctly or incorrectly represent the English | ▶ 00:05 |
that I'm asking about. | ▶ 00:11 |
So tell me yes or no, are these good representations? | ▶ 00:13 |
And the first, I want to represent the English sentence | ▶ 00:19 |
"Sam has 2 jobs," and the first order logic sentence is | ▶ 00:23 |
there exists an x and y such that job of Sam x | ▶ 00:29 |
and job of Sam y and not x = y. | ▶ 00:39 |
And so tell me yes, that correctly represents Sam has 2 jobs, | ▶ 00:51 |
or no, there's a problem. | ▶ 00:57 |
And secondly, I want to represent the idea of set membership. | ▶ 00:59 |
Now, assume I've already defined the notion of adding an element to a set. | ▶ 01:04 |
Can I define set membership with these 2 axioms? | ▶ 01:10 |
For all x and s, x is a member of the result of adding x to any set s, | ▶ 01:13 |
and for all x and s, x is a member of s implies that for all y, | ▶ 01:26 |
x is a member of the set that you get when you add y to s. | ▶ 01:39 |
And third, I'm going to try to define the notion of adjacent squares | ▶ 01:50 |
on, say, a checkerboard, where the squares are numbered with x and y coordinates | ▶ 01:56 |
and we want to just talk about adjacency in the horizontal and vertical direction. | ▶ 02:02 |
Can I define that as follows? | ▶ 02:08 |
For all x and y, the square (x, y) is adjacent to the square (+(x, 1), y), | ▶ 02:11 |
and the square (x, y) is adjacent to the square (x, +(y, 1)), | ▶ 02:26 |
and assume that we've defined the notion of + somewhere | ▶ 02:40 |
and that the character set allows + to occur as the character for a function. | ▶ 02:44 |
Tell me yes or no, is that a good representation of the notion of adjacency? | ▶ 02:53 |
[Man] The first answer is yes, this is a good representation | ▶ 00:00 |
of the sentence "Sam has 2 jobs." | ▶ 00:06 |
It says there exists an x and y, and one of them is a job of Sam. | ▶ 00:09 |
The other one is a job of Sam, and crucially, we have to say that x is not equal to y. | ▶ 00:13 |
Otherwise, this would be satisfied and we could have the same job | ▶ 00:18 |
represented by the variables x and y. | ▶ 00:23 |
Is this a good representation of the member function? | ▶ 00:26 |
No. | ▶ 00:30 |
It does do a good job of telling you what is a member, | ▶ 00:31 |
so if x is a member of a set because it's one member | ▶ 00:34 |
and then we can always add other members and it's still a member of that set, | ▶ 00:40 |
but it doesn't tell you anything about what x is not a member of. | ▶ 00:44 |
So for example, we want to know that 3 is not a member of the empty set, | ▶ 00:48 |
but we can't prove that with what we have here. | ▶ 00:53 |
And we have a similar problem down here. | ▶ 00:57 |
This is not a good representation of adjacent relation. | ▶ 01:00 |
So it will tell you, for example, that square (1,1) is adjacent to square (2,1) | ▶ 01:05 |
and also to square (1,2). | ▶ 01:16 |
So it's doing something right, but one problem is that it doesn't tell you in the other direction. | ▶ 01:20 |
It doesn't tell you that (2,1) is adjacent to (1,1) | ▶ 01:25 |
and another problem is that it doesn't tell you that (1,1) is not adjacent to (8,9) | ▶ 01:29 |
because again, there's no way to prove the negative. | ▶ 01:37 |
And the moral is that when you're trying to do a definition, | ▶ 01:40 |
like adjacent or member, what you usually want to do | ▶ 01:43 |
is have a sentence with the equivalent or the biconditional sign | ▶ 01:47 |
to say this is true if and only if rather than to just have an assertion | ▶ 01:52 |
or to have an implication in one direction. | ▶ 02:01 |
[Narrator] Hi, and welcome back. | ▶ 00:00 |
This unit is about planning. | ▶ 00:02 |
We defined AI to be the study | ▶ 00:04 |
and process of finding appropriate | ▶ 00:06 |
actions for an agent. | ▶ 00:08 |
So in some sense planning is really | ▶ 00:10 |
the core of all of AI. | ▶ 00:12 |
The technique we looked at so far | ▶ 00:14 |
was problem solving search | ▶ 00:16 |
over a state space using techniques | ▶ 00:18 |
like A star. | ▶ 00:20 |
Given a state space and a problem description, | ▶ 00:23 |
we can find a solution, | ▶ 00:25 |
a path to the goal. | ▶ 00:27 |
Those approaches are great for a variety of environments, | ▶ 00:29 |
but they only work when the environment | ▶ 00:31 |
is deterministic and fully observable. | ▶ 00:33 |
In this unit, we will see how to relax those constraints. | ▶ 00:36 |
[Narrator] You remember our problem-solving work? | ▶ 00:00 |
We have a state space like this, and | ▶ 00:03 |
we're given a start space and | ▶ 00:06 |
a goal to reach, | ▶ 00:09 |
and then we'd search for a path | ▶ 00:11 |
to find that goal, and maybe we find | ▶ 00:13 |
this path. | ▶ 00:16 |
Now the way a problem-solving agent | ▶ 00:19 |
would work is first it does all the work | ▶ 00:21 |
to figure out the path to the goal | ▶ 00:24 |
just doing by thinking, | ▶ 00:26 |
and then it starts to execute that path | ▶ 00:29 |
to drive or walk, however you want to get there, | ▶ 00:31 |
from the start state to the end state, | ▶ 00:35 |
but think about what would happen | ▶ 00:37 |
if you did that in real life; if you did all | ▶ 00:39 |
your planning ahead of time, you had the complete goal, | ▶ 00:41 |
and then without interacting with the world, | ▶ 00:43 |
without sensing it at all, | ▶ 00:46 |
you started to execute that path. | ▶ 00:48 |
Well this has, in fact, been studied. | ▶ 00:50 |
People have gone out and | ▶ 00:53 |
blindfolded walkers, put them in a field | ▶ 00:56 |
and told them to walk in a straight line, | ▶ 00:59 |
and the results are not pretty. | ▶ 01:01 |
Here are the GPS tracks to prove it. | ▶ 01:04 |
So we take a hiker, we put him at a | ▶ 01:07 |
start location, say here, | ▶ 01:09 |
and we blindfold him so that he can't | ▶ 01:11 |
see anything in the horizon, | ▶ 01:13 |
but just has enough to see his or her feet | ▶ 01:15 |
so that they won't stumble over something, | ▶ 01:18 |
and tell them to execute the plan of going forward. | ▶ 01:20 |
Put one foot in front of each other and walk forward in a straight line, | ▶ 01:23 |
and these are the typical paths we see. | ▶ 01:26 |
They start out going straight for awhile | ▶ 01:28 |
but then go in loop de loops | ▶ 01:30 |
and end up not on a straight path at all. | ▶ 01:32 |
These ones over here, starting in this location, | ▶ 01:35 |
are even more convoluted. | ▶ 01:37 |
They get going straight for a little bit | ▶ 01:39 |
and then go in very tight loops. | ▶ 01:41 |
So people are incapable of walking a straight line | ▶ 01:43 |
without any feedback from the environment. | ▶ 01:45 |
Now here on this yellow path, this one did much better, | ▶ 01:48 |
and why was that? | ▶ 01:51 |
Well it's because these paths were on overcast days, | ▶ 01:53 |
and so there was no input to make sense of. | ▶ 01:56 |
Whereas this path was on a very sunny day, | ▶ 01:59 |
and so even though the hiker couldn't | ▶ 02:02 |
see farther than a few feet in front of him, | ▶ 02:04 |
he could see shadows and say, | ▶ 02:07 |
"As long as I keep the shadows pointing in the right direction then | ▶ 02:10 |
I can go in a relatively straight line." | ▶ 02:12 |
So the moral is we need some feedback from the environment. | ▶ 02:15 |
We can't just plan ahead and come up with a whole plan. | ▶ 02:18 |
We've got to interleave planning | ▶ 02:21 |
and executing. | ▶ 02:24 |
[Narrator] Now why do we have to interleave | ▶ 00:00 |
planning and execution? | ▶ 00:02 |
Mostly because of properties of the | ▶ 00:04 |
environment that make it difficult to deal with. | ▶ 00:06 |
The most important one is | ▶ 00:08 |
if the environment is | ▶ 00:10 |
stochastic. | ▶ 00:12 |
That is if we don't know for sure what | ▶ 00:14 |
an action is going to do. | ▶ 00:16 |
If we know what everything is going to do, | ▶ 00:18 |
we can plan it all out right from the start, but if we don't, we have to | ▶ 00:20 |
be able to deal with contingencies of | ▶ 00:22 |
say I tried to move forward, | ▶ 00:24 |
and the wheels slipped, and I went someplace else, | ▶ 00:26 |
or the brakes might skid, or | ▶ 00:29 |
if we're walking our feet don't go 100% straight, | ▶ 00:31 |
or consider the problem of traffic lights. | ▶ 00:34 |
If the traffic light is red, | ▶ 00:37 |
then the result of the action of go | ▶ 00:39 |
forward through the intersection | ▶ 00:41 |
is bound to be different than if the traffic light is green. | ▶ 00:43 |
Another difficulty we have to deal with | ▶ 00:46 |
is multi-agent environments. | ▶ 00:48 |
If there are other cars and people that can get in our way, | ▶ 00:51 |
we have to plan about what they're going to do, | ▶ 00:54 |
and we have to react when they do something unexpected, | ▶ 00:57 |
and we can only know that | ▶ 01:00 |
at execution time, not at planning time. | ▶ 01:02 |
The other big problem is with | ▶ 01:05 |
partial observability. | ▶ 01:07 |
Suppose we've come up with a plan | ▶ 01:11 |
to go from A to S to F to B. | ▶ 01:14 |
That plan looks like it will work, | ▶ 01:19 |
but we know that at S, | ▶ 01:21 |
the road to F is sometimes closed, | ▶ 01:24 |
and there will be a sign there | ▶ 01:27 |
telling us whether it's closed or not, | ▶ 01:29 |
but when we start off, we can't read that sign. | ▶ 01:31 |
So that's partial observability. | ▶ 01:33 |
Another way to look at it is when we start off | ▶ 01:35 |
we don't know what state we're in. | ▶ 01:37 |
We know we're in A, but we don't know | ▶ 01:39 |
if we're in A in the state where | ▶ 01:41 |
the road is closed or if we're in A | ▶ 01:43 |
in the state where the road is open, | ▶ 01:46 |
and it's not until we get to S | ▶ 01:48 |
that we discover what state we're actually in, | ▶ 01:50 |
and then we know if we can continue along | ▶ 01:53 |
that route or if we have to take a detour south. | ▶ 01:55 |
Now in addition to these properties of | ▶ 01:58 |
the environment, we can also have | ▶ 02:00 |
difficulty because of | ▶ 02:02 |
lack of knowledge on our own part. | ▶ 02:04 |
So if some model of the world is unknown, | ▶ 02:06 |
that is, for example, | ▶ 02:12 |
we have a map or GPS software | ▶ 02:14 |
that's inaccurate or incomplete, | ▶ 02:16 |
then we won't be able to | ▶ 02:18 |
execute a straight-line plan, | ▶ 02:20 |
and, similarly, often we want to deal with | ▶ 02:23 |
a case where the plans have to be | ▶ 02:26 |
hierarchical. | ▶ 02:29 |
And, certainly, a plan like this | ▶ 02:31 |
is at a very high level. | ▶ 02:33 |
We can't really execute the action | ▶ 02:37 |
of going from A to S | ▶ 02:39 |
when we're in a car. | ▶ 02:41 |
All the actions that we can actually execute | ▶ 02:43 |
are things like turn the steering wheel a little bit | ▶ 02:45 |
to the right, press on the pedal a little bit more. | ▶ 02:47 |
So those are the low-level steps of the plan, | ▶ 02:50 |
but those aren't sketched out in detail when we start, | ▶ 02:54 |
when we only have the high-level parts of the plan, | ▶ 02:57 |
and then it's during execution that we schedule | ▶ 03:00 |
the rest of the low-level parts of the plan. | ▶ 03:03 |
Now most of these difficulties can be | ▶ 03:05 |
addressed by changing our point of view. | ▶ 03:08 |
Instead of planning in the space of world states, | ▶ 03:10 |
we plan in the space of belief states. | ▶ 03:13 |
To understand that, let's look at a state space. | ▶ 03:16 |
[Narrator] Here's a state space | ▶ 00:00 |
diagram for a simple problem. | ▶ 00:02 |
It involves a room with 2 locations. | ▶ 00:04 |
The left we call A, and the right we call B, | ▶ 00:07 |
and in that environment | ▶ 00:11 |
there's a vacuum cleaner, and there | ▶ 00:13 |
may or may not be dirt in either of the 2 locations, | ▶ 00:15 |
and so that gives us 8 total states. | ▶ 00:18 |
Dirt is here or not, here or not, and | ▶ 00:22 |
the vacuum cleaner is here or here. | ▶ 00:25 |
So that's 2 times 2 times 2 | ▶ 00:27 |
is 8 possible states, and I've drawn | ▶ 00:29 |
here the state space diagram | ▶ 00:31 |
with all the transitions | ▶ 00:33 |
for the 3 possible actions, and the actions are moving right. | ▶ 00:35 |
So we'd go from this state to this state. | ▶ 00:38 |
Moving left, we'd go from this state to this state, | ▶ 00:40 |
and sucking up dirt, we'd go from this state | ▶ 00:43 |
to this state for example, and | ▶ 00:45 |
in this state space diagram, | ▶ 00:48 |
if we have a fully deterministic, | ▶ 00:51 |
fully observable world, it's easy to plan. | ▶ 00:53 |
Say we start in this state, and we want to be-- | ▶ 00:56 |
end up in a goal state where both sides are clean. | ▶ 00:59 |
We can execute the suck-dirt action | ▶ 01:02 |
and get here and then move right, | ▶ 01:04 |
and then suck dirt again, | ▶ 01:06 |
and now we end up in a goal state | ▶ 01:08 |
where everything is clean. | ▶ 01:11 |
Now suppose our robot vacuum cleaner's | ▶ 01:14 |
sensors break down, and so the robot | ▶ 01:16 |
can no longer perceive either | ▶ 01:18 |
which location it's in | ▶ 01:20 |
or whether there's any dirt. | ▶ 01:22 |
So we now have an unobservable | ▶ 01:24 |
or sensor-less world rather | ▶ 01:26 |
than a fully observable one, | ▶ 01:28 |
and how does the agent then represent the state of the world? | ▶ 01:30 |
Well it could be in any one of these 8 states, | ▶ 01:33 |
and so all we can do to represent | ▶ 01:36 |
the current state is draw a big circle | ▶ 01:39 |
or box around everything, and say, | ▶ 01:42 |
"I know I'm somewhere inside here." | ▶ 01:44 |
Now that doesn't seem like it helps very much. | ▶ 01:48 |
What good is it to know that | ▶ 01:50 |
we don't really know anything at all? | ▶ 01:52 |
But the point is that we can search in the | ▶ 01:54 |
space of belief states rather | ▶ 01:57 |
than in the space of actual world states. | ▶ 01:59 |
So we believe that we're in 1 of these 8 states, | ▶ 02:02 |
and now when we execute an action, | ▶ 02:05 |
we're going to get to another belief state. | ▶ 02:07 |
Let's take a look at how that works. | ▶ 02:09 |
[Narrator] This is the belief state space | ▶ 00:00 |
for the sensor-less vacuum problem. | ▶ 00:03 |
So we started off here. | ▶ 00:05 |
We drew the circle around this belief state. | ▶ 00:07 |
So we don't know anything about where we are, | ▶ 00:10 |
but the amazing thing is, | ▶ 00:13 |
if we execute actions, we can gain knowledge | ▶ 00:15 |
about the world even without sensing. | ▶ 00:17 |
So let's say we move right, | ▶ 00:20 |
then we'll know we're in the right-hand location. | ▶ 00:22 |
Either we were in the left, and we moved right | ▶ 00:26 |
and arrived there, or we were in the right | ▶ 00:28 |
to begin with, and we bumped against the wall | ▶ 00:30 |
and stayed there. | ▶ 00:32 |
So now we end up in this state. | ▶ 00:34 |
We now know more about the world. | ▶ 00:37 |
We're down to 4 possibilities rather than 8, | ▶ 00:40 |
even though we haven't observed anything, | ▶ 00:43 |
and now note something interesting, | ▶ 00:46 |
that in the real world, the operations | ▶ 00:48 |
of going left and going right are | ▶ 00:50 |
inverses of each other, but | ▶ 00:52 |
in the belief state world | ▶ 00:54 |
going right and going left are not inverses. | ▶ 00:56 |
If we go right, and then we go left, | ▶ 00:59 |
we don't end up back where we were | ▶ 01:01 |
in a state of total uncertainty, rather | ▶ 01:03 |
going left takes us over here | ▶ 01:05 |
where we still know we're in 1 of 4 states | ▶ 01:08 |
rather than in 1 of 8 states. | ▶ 01:10 |
Note that it's possible to form a plan that | ▶ 01:13 |
reaches a goal without ever observing the world. | ▶ 01:15 |
Plans like that are called conformant plans. | ▶ 01:18 |
For example, if the goal is to be | ▶ 01:21 |
in a clean location | ▶ 01:23 |
all we have to do is suck. | ▶ 01:25 |
So we go from one of these 8 states | ▶ 01:28 |
to one of these 4 states and, | ▶ 01:30 |
every one of those 4, | ▶ 01:32 |
we're in a clean location. | ▶ 01:34 |
We don't know which of the 4 we're in, | ▶ 01:36 |
but we know we've achieved the goal. | ▶ 01:38 |
It's also possible to arrive | ▶ 01:41 |
at a completely known state. | ▶ 01:43 |
For example, if we start here, | ▶ 01:45 |
we go left; we suck up the dirt there. | ▶ 01:47 |
If we go right and suck up the dirt, | ▶ 01:50 |
now we're down to a belief state | ▶ 01:53 |
consisting of 1 single state that is | ▶ 01:55 |
we know exactly where we are. | ▶ 01:57 |
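To make the belief-state idea concrete, here is a minimal Python sketch written for this transcript rather than taken from the course materials. A world state is a tuple (location, dirt in A, dirt in B), a belief state is just a set of such tuples, and the names result and predict are my own illustrative choices.

    # Sketch of belief-state prediction in the sensor-less, deterministic
    # vacuum world. A world state is (loc, dirt_a, dirt_b) with loc in {'A', 'B'}.
    def result(state, action):
        """Deterministic result of one action in one world state."""
        loc, dirt_a, dirt_b = state
        if action == 'Right':
            return ('B', dirt_a, dirt_b)
        if action == 'Left':
            return ('A', dirt_a, dirt_b)
        if action == 'Suck':
            return (loc, False, dirt_b) if loc == 'A' else (loc, dirt_a, False)
        raise ValueError(action)

    def predict(belief, action):
        """A belief state is a set of world states; apply the action to each member."""
        return {result(s, action) for s in belief}

    # Start knowing nothing: all 8 world states are possible.
    b0 = {(loc, da, db) for loc in 'AB' for da in (True, False) for db in (True, False)}
    b1 = predict(b0, 'Right')  # 4 states: we now know the vacuum is in B
    b2 = predict(b1, 'Suck')   # 2 states: B is certainly clean, A is unknown
    print(len(b0), len(b1), len(b2))  # 8 4 2

Note that, just as in the lecture, the actions alone shrink the belief state from 8 to 4 to 2 even though nothing is ever observed.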
Here's a question for you: | ▶ 02:00 |
How do I get from the state where I know | ▶ 02:02 |
my current square is clean, | ▶ 02:04 |
but know nothing else, to the belief state | ▶ 02:06 |
where I know that I'm in the right-hand side | ▶ 02:08 |
location and that that location is clean? | ▶ 02:10 |
What I want you to do is click on the | ▶ 02:14 |
sequence of actions, left, right, or suck | ▶ 02:16 |
that will take us from that start to that goal. | ▶ 02:18 |
[Narrator] And the answer is that the state | ▶ 00:00 |
of knowing that your current square is clean | ▶ 00:03 |
corresponds to this state. | ▶ 00:06 |
This belief state with 4 possible world states, | ▶ 00:08 |
and if I then execute the right action, | ▶ 00:10 |
followed by the suck action, | ▶ 00:13 |
then I end up in this belief state, | ▶ 00:15 |
and that satisfies the goal. | ▶ 00:17 |
I know I'm in the right-hand-side location | ▶ 00:19 |
and I know that location is clean. | ▶ 00:21 |
[Narrator] We've been considering sensor-less planning in a deterministic world. | ▶ 00:00 |
Now I want to turn our attention to partially observable planning | ▶ 00:05 |
but still in a deterministic world. | ▶ 00:08 |
Suppose we have what's called local sensing, | ▶ 00:10 |
that is our vacuum can see what location | ▶ 00:13 |
it is in and it can see | ▶ 00:15 |
what's going on in the current location, that is | ▶ 00:17 |
whether there's dirt in the current location or not, | ▶ 00:21 |
but it can't see anything about | ▶ 00:23 |
whether there's dirt in any other location. | ▶ 00:25 |
So here's a partial diagram of the-- | ▶ 00:29 |
part of the belief state space for that world, | ▶ 00:31 |
and I want to show | ▶ 00:35 |
how the belief state unfolds | ▶ 00:37 |
as 2 things happen. | ▶ 00:39 |
First, as we take action, | ▶ 00:41 |
so we start in this state, | ▶ 00:43 |
and we take the action of going right, | ▶ 00:46 |
and in this case we still go | ▶ 00:49 |
from 2 world states in our belief state | ▶ 00:53 |
to 2 new ones, | ▶ 00:56 |
but then, after we do an action, | ▶ 00:58 |
we do an observation, and we have the act- | ▶ 01:00 |
percept cycle, and now, | ▶ 01:03 |
once we get the observation, | ▶ 01:05 |
we can split that world, | ▶ 01:07 |
we can split our belief state to say, | ▶ 01:09 |
"If we observe that we're in | ▶ 01:11 |
location B and it's dirty, then we know | ▶ 01:13 |
we're in this belief state here, | ▶ 01:15 |
which happens to have exactly 1 world state in it, | ▶ 01:18 |
and if we observe that we're clean | ▶ 01:21 |
then we know that we're in this state, | ▶ 01:23 |
which also has exactly 1 in it. | ▶ 01:25 |
Now what does the act-observe cycle do | ▶ 01:27 |
to the sizes of the belief states? | ▶ 01:29 |
Well in a deterministic world, | ▶ 01:32 |
each of the individual world states within | ▶ 01:34 |
a belief state maps into exactly 1 other one. | ▶ 01:36 |
That's what we mean by deterministic, | ▶ 01:40 |
and so that means the size of the belief state | ▶ 01:42 |
will either stay the same or it might decrease | ▶ 01:45 |
if 2 of the actions sort of accidentally | ▶ 01:48 |
end up bringing you to the same place. | ▶ 01:50 |
On the other hand, the observation | ▶ 01:53 |
works in kind of the opposite way. | ▶ 01:55 |
When we observe the world, what we're doing | ▶ 01:58 |
is we're taking the current belief state and | ▶ 02:00 |
partitioning it up into pieces. | ▶ 02:02 |
Observations alone can't introduce | ▶ 02:05 |
a new state--a new world state into the belief state. | ▶ 02:07 |
All they can do is say, | ▶ 02:10 |
"Some of them go here and some of them go here." | ▶ 02:13 |
Now maybe that for some observation | ▶ 02:16 |
all the world states in the belief state go into 1 bin, | ▶ 02:18 |
and so we make an observation | ▶ 02:21 |
that we don't learn anything new, but at least | ▶ 02:23 |
the observation can't make us more confused | ▶ 02:25 |
than we were before the observation. | ▶ 02:28 |
[Norvig] Now let's move on to stochastic environments. | ▶ 00:00 |
Let's consider a robot that has slippery wheels | ▶ 00:03 |
so that sometimes when you make a movement--a left or a right action-- | ▶ 00:06 |
the wheels slip and you stay in the same location. | ▶ 00:10 |
And sometimes they work and you arrive where you expected to go. | ▶ 00:13 |
And let's assume that the suck action always works perfectly. | ▶ 00:17 |
We get a belief state space that looks something like this. | ▶ 00:21 |
Notice that the results of actions will often result in a belief state | ▶ 00:25 |
that's larger than it was before--that is, the action will increase uncertainty | ▶ 00:30 |
because we don't know what the result of the action is going to be. | ▶ 00:34 |
And so here for each of the individual world states belonging to a belief state, | ▶ 00:37 |
we have multiple outcomes for the action, and that's what stochastic means. | ▶ 00:42 |
And so we end up with a larger belief state here. | ▶ 00:47 |
But in terms of the observation, the same thing holds as in the deterministic world. | ▶ 00:50 |
The observation partitions the belief state into smaller belief states. | ▶ 00:55 |
So in a stochastic partially observable environment, | ▶ 01:01 |
the actions tend to increase uncertainty, | ▶ 01:04 |
and the observations tend to bring that uncertainty back down. | ▶ 01:07 |
Now, how would we do planning in this type of environment? | ▶ 01:11 |
I haven't told you yet, so you won't know the answer for sure, | ▶ 01:14 |
but I want you to try to figure it out anyways, even if you might get the answer wrong. | ▶ 01:17 |
Imagine I had the whole belief state from which I've diagrammed just a little bit here | ▶ 01:21 |
and I wanted to know how to get from this belief state | ▶ 01:27 |
to one in which all squares are clean. | ▶ 01:31 |
So I'm going to give you some possible plans, | ▶ 01:34 |
and I want you to tell me whether you think each of these plans will always work | ▶ 01:36 |
or maybe sometimes work depending on how the stochasticity works out. | ▶ 01:42 |
Here are the possible plans. | ▶ 01:47 |
Remember I'm starting here, and I want to know how to get to a belief state | ▶ 01:49 |
in which all the squares are clean. | ▶ 01:54 |
One possibility is Suck, Right, Suck; one is Right, Suck, Left, Suck; | ▶ 01:57 |
one is Suck, Right, Right, Suck; | ▶ 02:06 |
and the other is Suck, Right, Suck, Right, Suck. | ▶ 02:11 |
So some of these actions might take you out of this little belief state here, | ▶ 02:18 |
but just use what you knew from the previous definition of the state space | ▶ 02:22 |
and the results of each of those actions | ▶ 02:27 |
and the fact that the right and left actions are nondeterministic | ▶ 02:29 |
and tell me which of these you think will always achieve the goal | ▶ 02:34 |
or will maybe achieve the goal. | ▶ 02:39 |
And then I want you to also answer for the fill-in-the-blank plan-- | ▶ 02:42 |
that is, is there some plan, some ideal plan, which always or maybe achieves the goal? | ▶ 02:48 |
And the answer is that any plan that would work | ▶ 00:00 |
in the deterministic world might work in the stochastic world | ▶ 00:03 |
if everything works out okay | ▶ 00:07 |
and all of these plans meet that criterion. | ▶ 00:10 |
But no finite plan is guaranteed to always work | ▶ 00:13 |
because a successful plan has to include at least 1 move action. | ▶ 00:18 |
And if we try a move action a finite number of times, | ▶ 00:23 |
each of those times, the wheels might slip, and it won't move, | ▶ 00:27 |
and so we can never be guaranteed to achieve the goal | ▶ 00:30 |
with a finite sequence of actions. | ▶ 00:33 |
Now, what about an infinite sequence of actions? | ▶ 00:36 |
Well, we can't represent that in the language we have so far | ▶ 00:39 |
where a plan is a linear sequence. | ▶ 00:42 |
But we can introduce a new notion of plans | ▶ 00:45 |
in which we do have infinite sequences. | ▶ 00:47 |
In this new notation, instead of writing plans | ▶ 00:00 |
as a linear sequence of, say, suck, move right, and suck, | ▶ 00:03 |
I'm going to write them as a tree structure. | ▶ 00:09 |
We start off in this belief state here, | ▶ 00:12 |
which we'll diagram like this. | ▶ 00:15 |
And then we do a suck action. | ▶ 00:18 |
We end up in a new state. | ▶ 00:22 |
And then we do a right action, | ▶ 00:27 |
and now we have to observe the world, | ▶ 00:33 |
and if we observe that we're still in state A, | ▶ 00:36 |
we loop back to this part of the plan. | ▶ 00:41 |
And if we observe that we're in B, | ▶ 00:46 |
we go on and then execute the suck action. | ▶ 00:49 |
And now we're at the end of the plan. | ▶ 00:56 |
So, we see that there's a choice point here, | ▶ 00:59 |
which we indicate with this sort of tie | ▶ 01:03 |
to say we're following a straight line, but now we can branch. | ▶ 01:06 |
There's a conditional, and we can either loop, | ▶ 01:09 |
or we can continue on, | ▶ 01:12 |
so we see that this finite representation | ▶ 01:14 |
represents an infinite sequence of plans. | ▶ 01:17 |
We could write it in a more sort of linear notation | ▶ 01:21 |
as S, while we observe A, | ▶ 01:26 |
do R, and then do S. | ▶ 01:32 |
Now, what can we say about this plan? | ▶ 01:36 |
Does this plan achieve the goal? | ▶ 01:38 |
Well, what we can say is that if the stochasticity | ▶ 01:40 |
is independent, that is, if sometimes it works | ▶ 01:44 |
and sometimes it doesn't, | ▶ 01:47 |
then with probability 1 in the limit, | ▶ 01:49 |
this plan will, in fact, achieve the goal, | ▶ 01:53 |
but we can't state any bounded number of steps | ▶ 01:55 |
under which it's guaranteed to achieve the goal. | ▶ 02:00 |
We can only say it's guaranteed at infinity. | ▶ 02:03 |
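As a side illustration (not from the lecture), here is a small Python sketch of executing that looping plan, [Suck, while we observe A: Right, Suck], against a simulated slippery-wheels world; the simulator, the 50% slip probability, and the function name run_plan are assumptions made purely for illustration.

    import random

    # Execute the plan "[Suck, while A: Right], Suck" in a world where the
    # Right action sometimes fails (the wheels slip) but Suck always works.
    def run_plan(slip_probability=0.5, seed=None):
        rng = random.Random(seed)
        loc, dirt = 'A', {'A': True, 'B': True}

        dirt[loc] = False                      # Suck: clean the current square
        attempts = 0
        while loc == 'A':                      # while we observe we are still in A...
            attempts += 1
            if rng.random() >= slip_probability:
                loc = 'B'                      # ...keep retrying the Right action
        dirt[loc] = False                      # Suck: clean B
        return attempts, dirt

    print(run_plan())  # the number of Right attempts varies, but both squares end up clean

With probability 1 the loop eventually terminates, but no finite bound on the number of Right attempts can be given, which is exactly the unbounded guarantee discussed above.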
Now, I've told you what a successful plan looks like, | ▶ 00:00 |
but I haven't told you how to find one. | ▶ 00:03 |
The process of finding it can be done through search | ▶ 00:06 |
just as we did in problem solving. | ▶ 00:08 |
So, remember in problem solving, | ▶ 00:10 |
we start off in a state, and it's a single state, not a belief state. | ▶ 00:12 |
And then we start searching a tree, | ▶ 00:15 |
and we have a big triangle of possible states | ▶ 00:19 |
that we search through, and then we find | ▶ 00:23 |
one path that gets us all the way to a goal state. | ▶ 00:27 |
And we pick from this big tree a single path. | ▶ 00:31 |
So, with belief states and with branching | ▶ 00:36 |
plan structures, we do the same sort of process, | ▶ 00:40 |
only the tree is just a little bit more complicated. | ▶ 00:43 |
Here we show one of these trees, | ▶ 00:46 |
and it has different possibilities. | ▶ 00:48 |
For example, we start off here, and we have one possibility | ▶ 00:52 |
that the first action will be going right, | ▶ 00:55 |
or another possibility that the first action | ▶ 00:57 |
will be performing a suck. | ▶ 00:59 |
But then it also has branches that are part of the plan itself. | ▶ 01:02 |
This branch here is actually part of the plan | ▶ 01:06 |
as we saw before. | ▶ 01:09 |
It's not a branch in the search space. | ▶ 01:11 |
It's a branch in the plan, so what we do | ▶ 01:14 |
is we search through this tree. | ▶ 01:17 |
We try right as a first action. | ▶ 01:19 |
We try suck as a first action. | ▶ 01:21 |
We keep expanding nodes | ▶ 01:23 |
until we find a portion of the tree | ▶ 01:25 |
like this path is a portion of this search tree. | ▶ 01:28 |
We find that portion which is a successful plan | ▶ 01:31 |
according to the criteria of reaching the goal. | ▶ 01:35 |
Let's say we performed that search. | ▶ 00:00 |
We had a big search tree, and then we threw out | ▶ 00:03 |
all the branches except one, and this branch of the search tree | ▶ 00:06 |
does itself have branches, but this branch of the search tree | ▶ 00:09 |
through the belief state represents a single plan, | ▶ 00:13 |
not multiple possible plans. | ▶ 00:17 |
Now, what I want to know is, for this single plan, | ▶ 00:19 |
what can we guarantee about it? | ▶ 00:22 |
So, say we wanted to know is this plan guaranteed to find the goal | ▶ 00:24 |
in an unbounded number of steps? | ▶ 00:29 |
And what do we need to guarantee that? | ▶ 00:32 |
So, it's an unbounded solution. | ▶ 00:35 |
Do we need to guarantee that | ▶ 00:38 |
some leaf node is a goal? | ▶ 00:43 |
So, for example, here's a plan to go through, | ▶ 00:47 |
and at the bottom, there's a leaf node. | ▶ 00:49 |
Now, if this were in problem solving, | ▶ 00:53 |
then remember, it would be a sequence of steps | ▶ 00:57 |
with no branches in it, and we know it's a solution | ▶ 01:01 |
if the one leaf node is a goal. | ▶ 01:04 |
But for these with branches, do we need to guarantee | ▶ 01:07 |
that some leaf is a goal, | ▶ 01:10 |
or do we need to guarantee | ▶ 01:13 |
that every leaf is a goal, | ▶ 01:17 |
or is there no possible guarantee | ▶ 01:22 |
that will mean that for sure we've got a solution, | ▶ 01:27 |
although the solution may be of unbounded length? | ▶ 01:31 |
Then I also want you to answer | ▶ 01:33 |
what does it take to guarantee | ▶ 01:36 |
that we have a bounded solution? | ▶ 01:38 |
That is, a solution that is guaranteed to reach the goal | ▶ 01:41 |
in a bounded, finite number of steps. | ▶ 01:45 |
Do we need to have a plan that has | ▶ 01:49 |
no branches in it, like this branch? | ▶ 01:53 |
Or a plan that has no loops in it, | ▶ 01:57 |
like this loop that goes back to a previous state? | ▶ 02:02 |
Or is there no guarantee that we have a bounded solution? | ▶ 02:05 |
And the answer is we have an unbounded solution | ▶ 00:00 |
if every leaf in the plan ends up in a goal. | ▶ 00:03 |
So, if we follow through the plan, no matter what path | ▶ 00:07 |
we execute based on the observations-- | ▶ 00:09 |
and remember, we don't get to pick the observations. | ▶ 00:12 |
The observations come into us, and we follow one path or another | ▶ 00:16 |
based on what we observe. | ▶ 00:18 |
So, we can't guide it in one direction or another, | ▶ 00:20 |
and so we need every possible leaf node. | ▶ 00:23 |
This one only has one, but if a plan had multiple leaf nodes, | ▶ 00:26 |
every one of them would have to be a goal. | ▶ 00:30 |
Now, in terms of a bounded solution, | ▶ 00:33 |
it's okay to have branches but not to have loops. | ▶ 00:35 |
If we had branches and we ended up with one goal here | ▶ 00:39 |
and one goal here in 1, 2, 3, steps, | ▶ 00:42 |
1, 2, 3, steps, that would be a bounded solution. | ▶ 00:45 |
But if we have a loop, we might be 1, 2, 3, 4, 5-- | ▶ 00:48 |
we don't know how many steps it's going to take. | ▶ 00:54 |
Now, some people like manipulating trees | ▶ 00:00 |
and some people like a more--sort of formal--mathematical notation. | ▶ 00:02 |
So if you're one of those, I'm going to give you another way to think about | ▶ 00:06 |
whether or not we have a solution; | ▶ 00:09 |
and let's start with a problem-solving | ▶ 00:12 |
where a plan consists of a straight line sequence. | ▶ 00:15 |
And we said one way to decide if this is a plan that satisfies the goal | ▶ 00:20 |
is to say, "Is the end state a goal state?" | ▶ 00:25 |
If we want to be more formal and write that out mathematically, | ▶ 00:30 |
what we can say is--what this plan represents | ▶ 00:33 |
is--we started in the start state, | ▶ 00:37 |
and then we transitioned | ▶ 00:40 |
to the state that is the result of applying the action | ▶ 00:43 |
of going from A to S, to that start state; | ▶ 00:47 |
and then we applied to that, the result of starting in that intermediate state | ▶ 00:53 |
and applying the action of going from S to F. | ▶ 01:01 |
And if that resulting state is an element of the set of Goals, | ▶ 01:08 |
then this plan is valid; this plan gives us a solution. | ▶ 01:14 |
And so that's a mathematical formulation of what it means for this plan to be a Goal. | ▶ 01:19 |
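Written out compactly (my rendering of the sentence above, using the lecture's Result notation), the condition is:

    \mathrm{Result}(\mathrm{Result}(A,\ A \to S),\ S \to F) \in \mathrm{Goals}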
Now, in stochastic partially observable worlds, | ▶ 01:24 |
the equations are a little bit more complicated. | ▶ 01:27 |
Instead of just having S Prime is a result of applying some action to the initial state, | ▶ 01:30 |
we're dealing with belief states, rather than individual states. | ▶ 01:40 |
And what we say is our new belief state | ▶ 01:44 |
is the result of updating what we get from predicting what our action will do; | ▶ 01:50 |
and then updating it, based on our observation, O, of the world. | ▶ 01:59 |
So the prediction step is when we start off in a belief state; | ▶ 02:06 |
we look at the action, we look at each possible result of the action-- | ▶ 02:10 |
because they're stochastic--to each possible member of the belief state, | ▶ 02:15 |
and so that gives us a larger belief state; | ▶ 02:18 |
and then we update that belief state by taking account of the observation-- | ▶ 02:21 |
and that will give us a smaller--or same size--belief state. | ▶ 02:25 |
And now, that gives us the new state. | ▶ 02:29 |
Now, we can use this to predict and update cycle | ▶ 02:32 |
to keep track of where we are in a belief state. | ▶ 02:35 |
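In symbols, the cycle just described can be written as follows (my rendering; b is the current belief state, a the action taken, and o the observation received):

    b' = \mathrm{Update}(\mathrm{Predict}(b, a),\ o)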
Here's an example of tracking the Predict Update Cycle; | ▶ 00:00 |
and this is in a world in which the actions are guaranteed to work, as advertised-- | ▶ 00:04 |
that is, if you start to clean up the current location, | ▶ 00:09 |
and if you move right or left, the wheels actually turn; and you do move. | ▶ 00:12 |
But we can call this the kindergarten world because there are little toddlers | ▶ 00:17 |
walking around who can deposit Dirt in any location, at any time. | ▶ 00:22 |
So if we start off in this state, and execute the Suck action, | ▶ 00:27 |
we can predict that we'll end up in one of these 2 states. | ▶ 00:32 |
Then, if we have an observation--well, we know what that observation's going to be | ▶ 00:38 |
because we know the Suck action always works, and we know we were in A; | ▶ 00:42 |
so the only observation we can get is that we're in A--and that it's Clean-- | ▶ 00:45 |
so we end up in that same belief state. | ▶ 00:50 |
And then, if we execute the Right action-- | ▶ 00:54 |
well, then lots of things could happen; | ▶ 00:58 |
because we move Right, and somebody might have dropped Dirt in the Right location, | ▶ 01:01 |
and somebody might have dropped Dirt in the Left location--or maybe not. | ▶ 01:06 |
So we end up with 4 possibilities, | ▶ 01:10 |
and then we can update again when we get the next observation-- | ▶ 01:12 |
say, if we observed that we're in B and it's Dirty, then we end up in this belief state. | ▶ 01:17 |
And we can keep on going--specifying new belief states-- | ▶ 01:23 |
as a result of a succession of predicts and updates. | ▶ 01:27 |
Now, this Predict Update Cycle gives us a kind of calculus of belief states | ▶ 01:33 |
that can tell us, really, everything we need to know. | ▶ 01:38 |
But there is one weakness with this approach-- | ▶ 01:41 |
that, as you can see here, some of the belief states start to get large; | ▶ 01:43 |
and this is a tiny little world. | ▶ 01:47 |
Already, we have a belief state with 4 world states in it. | ▶ 01:49 |
We could have one with 8, 16, 1,024--or whatever. | ▶ 01:53 |
And it seems that there may be more succinct representations of a belief state, | ▶ 01:58 |
rather than to just list all the world states. | ▶ 02:03 |
For example, take this one here: | ▶ 02:06 |
If we had divided the world up--not into individual world states, | ▶ 02:08 |
but into variables describing that state, | ▶ 02:13 |
then this whole belief state could be represented just by: Vacuum is on the Right. | ▶ 02:17 |
So the whole world could be represented by 3 variables: | ▶ 02:23 |
One, where is the Vacuum--is it on the Right, or not? | ▶ 02:29 |
Secondly, is there Dirt in the Left location? | ▶ 02:33 |
And third, is there Dirt in the Right location? | ▶ 02:36 |
And we could have some formula, over those variables, to describe states. | ▶ 02:39 |
And with that type of formulation, | ▶ 02:44 |
some very large states--in terms of enumerating the world states-- | ▶ 02:47 |
can be made small, in terms of the description. | ▶ 02:51 |
[Norvig] I want to describe a notation which we call classical planning, | ▶ 00:00 |
which is a representation language for dealing with states and actions and plans, | ▶ 00:06 |
and it's also an approach for dealing with the problem of complexity | ▶ 00:13 |
by factoring the world into variables. | ▶ 00:17 |
So under classical planning, a state space consists of all the possible assignments | ▶ 00:21 |
to k Boolean variables. | ▶ 00:28 |
So that means there'll be 2 to the k states in that state space. | ▶ 00:32 |
And if we think about the 2 location vacuum world, | ▶ 00:38 |
we would have 3 Boolean variables. | ▶ 00:41 |
We could have dirt in location A, dirt in location B, and vacuum in location A. | ▶ 00:44 |
The vacuum has to be in either A or B. | ▶ 00:57 |
So these 3 variables will do, and there will be 8 possible states in that world, | ▶ 01:00 |
but they can be succinctly represented through the 3 variables. | ▶ 01:06 |
And then a world state consists of a complete assignment of true or false | ▶ 01:11 |
through each of the 3 variables. | ▶ 01:18 |
And then a belief state. | ▶ 01:20 |
Just as in problem solving, the belief state depends on | ▶ 01:24 |
what type of environment you want to deal with. | ▶ 01:28 |
In the core classical planning, the belief state had to be a complete assignment, | ▶ 01:31 |
and that was useful for dealing with deterministic fully observable domains. | ▶ 01:38 |
But we can easily extend classical planning, | ▶ 01:43 |
and we can deal with belief states that are partial assignments-- | ▶ 01:47 |
that is, some of the variables have values and others don't. | ▶ 01:51 |
So we could have the belief state consisting of vacuum in A is true | ▶ 01:56 |
and the others are unknown, and that small formula represents 4 possible world states. | ▶ 02:01 |
We can even have a belief state which is an arbitrary formula in Boolean logic, | ▶ 02:08 |
and that can represent anything we want. | ▶ 02:18 |
So that's what states look like. | ▶ 02:20 |
Now we have to figure out what actions look like | ▶ 02:22 |
and what the results of those actions look like. | ▶ 02:25 |
These are represented in classical planning by something called an action schema. | ▶ 02:28 |
It's called a schema because it represents many possible actions that are similar to each other. | ▶ 02:34 |
So let's take an example of we want to send cargo around the world, | ▶ 02:40 |
and we've got a bunch of planes in airports, and we have cargo and so on. | ▶ 02:46 |
I'll show you the action for having a plane fly from one location to another. | ▶ 02:50 |
Here's one possible representation. | ▶ 02:56 |
We say it's an action schema, so we write the word Action | ▶ 02:59 |
and then we write the action operator and its arguments, | ▶ 03:03 |
so it's a Fly of P from X to Y. | ▶ 03:08 |
And then we list the preconditions, | ▶ 03:15 |
what needs to be true in order to be able to execute this action. | ▶ 03:19 |
We can say something like P better be a plane. | ▶ 03:24 |
It's no good trying to fly a truck or a submarine. | ▶ 03:29 |
And we'll use the And formula from Boolean propositional logic. | ▶ 03:35 |
X better be an airport. | ▶ 03:43 |
We don't want to try to take off from my backyard. | ▶ 03:47 |
And similarly, Y better be an airport. | ▶ 03:50 |
And, most importantly, P better be at airport X in order to take off from there. | ▶ 03:55 |
And then we represent the effects of the action by saying | ▶ 04:02 |
what's going to happen. | ▶ 04:08 |
Once we fly from X to Y, | ▶ 04:10 |
the plane is no longer at X, | ▶ 04:13 |
so we say not at P,X--the plane is no longer at X-- | ▶ 04:16 |
and the plane is now at Y. | ▶ 04:23 |
This is called an action schema. | ▶ 04:27 |
It represents a set of actions for all possible planes, for all X and for all Y, | ▶ 04:30 |
represents all of those actions in one schema | ▶ 04:36 |
that says what we need to know in order to apply the action and it says what will happen. | ▶ 04:39 |
In terms of the transition between states, this variable will become false | ▶ 04:45 |
and this one will become true. | ▶ 04:50 |
When we look at this formula, this looks like a term in first order logic, | ▶ 04:53 |
but we're actually dealing with a completely propositional world. | ▶ 05:00 |
It just looks like that because this is a schema. | ▶ 05:04 |
We can apply this schema to specific ground states, specific world states, | ▶ 05:08 |
and then P and X would have specific values, | ▶ 05:15 |
and you could just think of it as concatenating their names all together, | ▶ 05:18 |
and that's just the name of one variable. | ▶ 05:21 |
The name just happens to have this complex form with parentheses and commas in it | ▶ 05:24 |
to make it easier to write one schema that covers all the individual fly actions. | ▶ 05:29 |
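As a concrete sketch, here is one way the Fly schema could be represented and grounded in Python. This is written for this transcript, not taken from the course; fluents are tuples such as ('At', 'P1', 'SFO'), and the names Schema, ground_actions, applicable and apply_action are illustrative only.

    from itertools import product

    class Schema:
        """An action schema: parameter names plus precondition, add and delete lists."""
        def __init__(self, name, params, precond, add, delete):
            self.name, self.params = name, params
            self.precond, self.add, self.delete = precond, add, delete

    fly = Schema('Fly', ('p', 'x', 'y'),
                 precond=[('Plane', 'p'), ('Airport', 'x'), ('Airport', 'y'), ('At', 'p', 'x')],
                 add=[('At', 'p', 'y')],
                 delete=[('At', 'p', 'x')])

    def substitute(fluents, binding):
        """Replace schema parameters by concrete object names."""
        return [tuple(binding.get(term, term) for term in f) for f in fluents]

    def ground_actions(schema, objects):
        """Every way of binding the schema's parameters to objects gives one ground action."""
        for values in product(objects, repeat=len(schema.params)):
            yield dict(zip(schema.params, values))

    def applicable(state, schema, binding):
        return all(f in state for f in substitute(schema.precond, binding))

    def apply_action(state, schema, binding):
        return (state - set(substitute(schema.delete, binding))) | set(substitute(schema.add, binding))

    state = {('Plane', 'P1'), ('Airport', 'SFO'), ('Airport', 'JFK'), ('At', 'P1', 'SFO')}
    binding = {'p': 'P1', 'x': 'SFO', 'y': 'JFK'}
    if applicable(state, fly, binding):
        print(apply_action(state, fly, binding))  # P1 is now At JFK and no longer At SFO

One schema plus the grounding loop stands in for every individual Fly action, which is the point made above about succinctness.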
[Norvig] Here we see a more complete representation of a problem solving domain | ▶ 00:00 |
in the language of classical planning. | ▶ 00:05 |
Here's the Fly action schema. | ▶ 00:09 |
I've made it a little bit more explicit with from and to airports | ▶ 00:11 |
rather than X or Y. | ▶ 00:14 |
We want to deal with transporting cargo. | ▶ 00:17 |
So in addition to flying, we have an operator to load cargo, C, onto a plane, P, at airport A-- | ▶ 00:21 |
you can see the preconditions and effects there-- | ▶ 00:29 |
and an action to unload the cargo from the plane | ▶ 00:32 |
with preconditions and effects. | ▶ 00:35 |
We have a representation of the initial state. | ▶ 00:37 |
There's 2 pieces of cargo, there's 2 planes and 2 airports. | ▶ 00:40 |
This representation is rich enough and the algorithms on it are good enough | ▶ 00:45 |
that we could have hundreds or thousands of cargo planes and so on | ▶ 00:50 |
representing millions of ground actions. | ▶ 00:57 |
If we had 10 airports and 100 planes, that would be 100 times 10 times 10--10,000 different Fly actions. | ▶ 01:00 |
And if we had thousands of pieces of cargo, | ▶ 01:12 |
there would be even more Load and Unload actions, | ▶ 01:16 |
but they can all be represented by the succinct schema. | ▶ 01:18 |
So the initial state tells us what's what, where everything is, | ▶ 01:22 |
and then we can represent the goal state: | ▶ 01:27 |
that we want to have this piece of cargo has to be delivered to this airport, | ▶ 01:30 |
and another piece of cargo has to be delivered to this airport. | ▶ 01:34 |
So now we know what actions and problems of initial and goal state looks like | ▶ 01:38 |
in this representation, but how do we do planning using this? | ▶ 01:45 |
[Norvig] The simplest way to do planning is really the exact same way | ▶ 00:00 |
that we did it in problem solving. | ▶ 00:04 |
We start off in an initial state. | ▶ 00:06 |
So P1 was at SFO, say, and cargo, C1, was also at SFO, | ▶ 00:09 |
and all the other things that were in that initial state. | ▶ 00:20 |
And then we start branching on the possible actions, | ▶ 00:25 |
so say one possible action would be to load the cargo, C1, onto the plane, P1, at SFO, | ▶ 00:30 |
and then that would bring us to another state | ▶ 00:41 |
which would have a different set of state variables set, | ▶ 00:45 |
and we'd continue branching out like that until we hit a state which satisfied the goal predicate. | ▶ 00:51 |
So we call that forward or progression state space search | ▶ 00:58 |
in that we're searching through the space of exact states. | ▶ 01:03 |
Each of these is an individual world state, | ▶ 01:09 |
and if the actions are deterministic, then it's the same thing as we had before. | ▶ 01:12 |
But because we have this representation, | ▶ 01:17 |
there are other possibilities that weren't available to us before. | ▶ 01:20 |
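Here is a minimal Python sketch of that forward, progression search, written as breadth-first search over sets of true fluents. The tiny hand-ground domain below (one plane, one piece of cargo, a single Fly action from SFO to JFK) is an assumption made only to keep the example self-contained.

    from collections import deque

    # Each ground action maps its name to (preconditions, add effects, delete effects).
    actions = {
        'Load(C1,P1,SFO)':   ({('At', 'C1', 'SFO'), ('At', 'P1', 'SFO')},
                              {('In', 'C1', 'P1')}, {('At', 'C1', 'SFO')}),
        'Fly(P1,SFO,JFK)':   ({('At', 'P1', 'SFO')},
                              {('At', 'P1', 'JFK')}, {('At', 'P1', 'SFO')}),
        'Unload(C1,P1,JFK)': ({('In', 'C1', 'P1'), ('At', 'P1', 'JFK')},
                              {('At', 'C1', 'JFK')}, {('In', 'C1', 'P1')}),
    }

    def forward_search(start, goal):
        """Breadth-first progression search over world states (frozensets of fluents)."""
        frontier, explored = deque([(start, [])]), {start}
        while frontier:
            state, plan = frontier.popleft()
            if goal <= state:                    # every goal fluent holds
                return plan
            for name, (pre, add, dele) in actions.items():
                if pre <= state:                 # action is applicable here
                    nxt = frozenset((state - dele) | add)
                    if nxt not in explored:
                        explored.add(nxt)
                        frontier.append((nxt, plan + [name]))
        return None

    start = frozenset({('At', 'C1', 'SFO'), ('At', 'P1', 'SFO')})
    goal = {('At', 'C1', 'JFK')}
    print(forward_search(start, goal))
    # ['Load(C1,P1,SFO)', 'Fly(P1,SFO,JFK)', 'Unload(C1,P1,JFK)']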
[Norvig] Another way to search is called backwards or regression search | ▶ 00:00 |
in which we start at the goal. | ▶ 00:05 |
So we take the description of the goal state. | ▶ 00:07 |
C1 is at JFK and C2 is at SFO, so that's the goal state. | ▶ 00:10 |
And notice that that's the complete goal state. | ▶ 00:21 |
It's not that I left out all the other facts about the state; | ▶ 00:23 |
it's that that's all that's known about the state is that these 2 propositions are true | ▶ 00:26 |
and all the others can be anything you want. | ▶ 00:31 |
And now we can start searching backwards. | ▶ 00:34 |
We can say what actions would lead to that state. | ▶ 00:37 |
Remember in problem solving we did have that option of searching backwards. | ▶ 00:40 |
If there was a single goal state, we could say what other arcs are coming into that goal state. | ▶ 00:45 |
But here, this goal state doesn't represent a single state; | ▶ 00:51 |
it represents a whole family of states with different values for all the other variables. | ▶ 00:54 |
And so we can't just look at that, | ▶ 01:01 |
but what we can do is look at the definition of possible actions that will result in this goal. | ▶ 01:03 |
So let's look at it one at a time. | ▶ 01:10 |
Let's first look at what actions could result at C1, JFK. | ▶ 01:12 |
We look at our action schema, and there's only 1 action schema that adds an At, | ▶ 01:19 |
and that would be the Unload schema. | ▶ 01:26 |
Unload of C, P, A adds At of C, A. | ▶ 01:30 |
And so what we would know is if we want to achieve this, | ▶ 01:37 |
then we would have to do an Unload where the C variable would have to be C1, | ▶ 01:40 |
the P variable is still unknown--it could be any plane-- | ▶ 01:50 |
and the A variable has to be JFK. | ▶ 01:55 |
Notice what we've done here. | ▶ 02:01 |
We have this representation in terms of logical formula | ▶ 02:03 |
that allows us to specify a goal as a set of many world states, | ▶ 02:07 |
and we also can use that same representation to represent an arrow here | ▶ 02:12 |
not as a single action but as a set of possible actions. | ▶ 02:18 |
So this is representing all possible actions for any plane, P, | ▶ 02:21 |
of unloading cargo at the destination. | ▶ 02:26 |
And then we can regress this state over this operator | ▶ 02:29 |
and now we have another representation of this state here. | ▶ 02:36 |
But just as this state was uncertain--not all the variables were known-- | ▶ 02:40 |
this state too will be uncertain. | ▶ 02:44 |
For example, we won't know anything about what plane, P, is involved, | ▶ 02:46 |
and now we continue searching backwards until we get to a state | ▶ 02:51 |
where enough of the variables are filled in and where we match against the initial state. | ▶ 02:56 |
And then we have our solution. | ▶ 03:01 |
We found it going backwards, but we can apply the solution going forwards. | ▶ 03:03 |
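One regression step can be sketched in Python like this (my own simplification, assuming the action is already ground so there are no unbound variables to track): to regress a goal through an action, remove the fluents the action adds and add back the action's preconditions, provided the action achieves part of the goal and deletes none of it.

    # Regress a goal (a set of fluents) through one ground action.
    # An action is a triple (preconditions, add effects, delete effects).
    def regress(goal, action):
        pre, add, dele = action
        if not (add & goal):         # the action must achieve some part of the goal
            return None
        if dele & goal:              # ...and must not destroy another part of it
            return None
        return (goal - add) | pre    # the subgoal that must hold before acting

    unload = ({('In', 'C1', 'P1'), ('At', 'P1', 'JFK')},   # preconditions
              {('At', 'C1', 'JFK')},                        # add effects
              {('In', 'C1', 'P1')})                         # delete effects

    goal = {('At', 'C1', 'JFK'), ('At', 'C2', 'SFO')}
    print(regress(goal, unload))
    # {('In', 'C1', 'P1'), ('At', 'P1', 'JFK'), ('At', 'C2', 'SFO')}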
[Norvig] Let's show an example of where a backwards search makes sense. | ▶ 00:00 |
I'm going to describe a world in which there is one action, | ▶ 00:04 |
the action of buying a book. | ▶ 00:07 |
And the precondition is we have to know which book it is, | ▶ 00:14 |
and let's identify them by ISBN number. | ▶ 00:18 |
So we can buy ISBN number B, and the effect is that we own B. | ▶ 00:21 |
And probably there should be something about money, | ▶ 00:30 |
but we're going to leave that out for now to make it simple. | ▶ 00:32 |
And then the goal would be to own ISBN number 0136042597. | ▶ 00:35 |
Now, if we try to solve this problem with forward search, we'd start in the initial state. | ▶ 00:47 |
Let's say the initial state is we don't own anything. | ▶ 00:51 |
And then we'd think about what actions can we apply. | ▶ 00:55 |
If there are 10 million different books, 10 million ISBN numbers, | ▶ 00:59 |
then there is a branching factor of 10 million coming out of this node, | ▶ 01:02 |
and we'd have to try them all in order until we happened to hit upon one that was the right one. | ▶ 01:08 |
It seems very inefficient. | ▶ 01:12 |
If we go in the backward direction, then we start at the goal. | ▶ 01:14 |
The goal is to own this number. | ▶ 01:19 |
Then we look at our available actions, and out of the 10 million actions | ▶ 01:22 |
there's only 1 action schema, | ▶ 01:25 |
and that action schema can match the goal in exactly one way, | ▶ 01:27 |
when B equals this number, and therefore we know the action is to buy this number, | ▶ 01:30 |
and we can connect the goal to the initial state in the backwards direction in just 1 step. | ▶ 01:36 |
So that's the advantage of doing backwards or regression search rather than forward search. | ▶ 01:42 |
[Norvig] There's one more type of search for plans | ▶ 00:00 |
that we can do with the classical planning language | ▶ 00:03 |
that we couldn't do before, and this is searching through the space of plans | ▶ 00:06 |
rather than searching through the space of states. | ▶ 00:11 |
In forward search we were searching through concrete world states. | ▶ 00:14 |
In backward search we were searching through abstract states | ▶ 00:18 |
in which some of the variables were unspecified. | ▶ 00:22 |
But in plan space search we search through the space of plans. | ▶ 00:25 |
And here's how it works. | ▶ 00:29 |
We start off with an empty plan. | ▶ 00:31 |
We have the start state and the goal state, and that's all we know about the plan. | ▶ 00:33 |
So obviously, this plan is flawed. It doesn't lead us from the start to the goal. | ▶ 00:38 |
And then we say let's do an operation to edit or modify that plan | ▶ 00:43 |
by adding something in new. | ▶ 00:48 |
And here we're tackling the problem of how to get dressed | ▶ 00:50 |
and put on all the clothes in the right order, | ▶ 00:53 |
so we say out of all the operators we have, we could add one of those operators into the plan. | ▶ 00:56 |
And so here we say what if we added the put on right shoe operator. | ▶ 01:01 |
Then we end up with this plan. | ▶ 01:06 |
That still doesn't solve the problem, so we need to keep refining that plan. | ▶ 01:09 |
Then we come here and say maybe we could add in the put on left shoe operator. | ▶ 01:13 |
And here I've shown the plan as a parallel branching structure | ▶ 01:20 |
rather than just as a sequence. | ▶ 01:24 |
And that's a useful thing to do because it captures the fact | ▶ 01:27 |
that these can be done in either order. | ▶ 01:30 |
And we keep refining like that, adding on new branches or new operators | ▶ 01:32 |
into the plan until we got a plan that was guaranteed to work. | ▶ 01:38 |
This approach was popular in the 1980s, but it's faded from popularity. | ▶ 01:42 |
Right now the most popular approaches have to do with forward search. | ▶ 01:47 |
We saw some of the advantages of backward search. | ▶ 01:52 |
The advantage of forward search seems to be that we can come up with very good heuristics. | ▶ 01:55 |
So we can do heuristic search, and we saw how important it was to have good heuristics | ▶ 01:59 |
to do heuristic search. | ▶ 02:04 |
And because the forward search deals with concrete plan states, | ▶ 02:06 |
it seems to be easier to come up with good heuristics. | ▶ 02:09 |
[Norvig] To understand the idea of heuristics, let's talk about another domain. | ▶ 00:00 |
Here we have the sliding puzzle domain. | ▶ 00:05 |
Remember we can slide around these little tiles and we try to reach a goal state. | ▶ 00:07 |
A 16 puzzle is kind of big, so let's show you the state space for the smaller 8 puzzle. | ▶ 00:13 |
Here is just a small portion of it. | ▶ 00:20 |
Let's figure out what the action schema looks like for this puzzle. | ▶ 00:22 |
We only need to describe one action, which is to slide a tile, T, | ▶ 00:27 |
from location A to location B. | ▶ 00:33 |
The precondition: the tile has to be on location A | ▶ 00:38 |
and has to be a tile | ▶ 00:45 |
and B has to be blank and A and B have to be adjacent. | ▶ 00:50 |
This should be an And sign, not an A. | ▶ 01:02 |
So that's the action schema. | ▶ 01:06 |
Oops. I forgot we need an effect, which should be that the tile is now on B | ▶ 01:08 |
and the blank is now on A and the tile is no longer on A and the blank is no longer on B. | ▶ 01:19 |
We talked before about how a human analyst could examine a problem | ▶ 01:38 |
and come up with heuristics and encode those heuristics as a function | ▶ 01:43 |
that would help search do a better job. | ▶ 01:47 |
But with this kind of a formal representation | ▶ 01:50 |
we can automatically come up with good representations of heuristics. | ▶ 01:53 |
For example, if we came up with a relaxed problem | ▶ 01:57 |
by automatically going in and throwing out some of the prerequisites-- | ▶ 02:02 |
if you throw out a prerequisite, you make the problem strictly easier-- | ▶ 02:06 |
then you get a new heuristic. | ▶ 02:10 |
So for example, if we crossed out the requirement that B has to be blank, | ▶ 02:12 |
then we end up with the Manhattan or city block heuristic. | ▶ 02:17 |
And if we also throw out the requirement that A and B have to be adjacent, | ▶ 02:22 |
then we get the number of misplaced tiles heuristic. | ▶ 02:28 |
So that means we could slide a tile from any A to any B, no matter how far apart they were. | ▶ 02:31 |
That's the number of misplaced tiles. | ▶ 02:37 |
Other heuristics are possible. | ▶ 02:40 |
For example, one popular thing is to ignore negative effects, | ▶ 02:42 |
to say let's not say that this takes away the blank being in B. | ▶ 02:46 |
So if we ignore that negative effect, we make the whole problem strictly easier. | ▶ 02:52 |
We'd have a relaxed problem, and that might end up being a good heuristic. | ▶ 02:56 |
So because we have our actions encoded in this logical form, | ▶ 03:00 |
we can automatically edit that form. | ▶ 03:04 |
A program can do that, and the program can come up with heuristics | ▶ 03:07 |
rather than requiring the human to come up with heuristics. | ▶ 03:10 |
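Here is a short Python sketch of the two relaxed-problem heuristics just described, for the 8 puzzle. The state encoding, a tuple of 9 entries read row by row with 0 standing for the blank, is my own choice for illustration.

    # Two relaxed-problem heuristics for the 8 puzzle.
    # A state is a tuple of 9 entries, row by row, with 0 for the blank.
    GOAL = (0, 1, 2,
            3, 4, 5,
            6, 7, 8)

    def misplaced_tiles(state, goal=GOAL):
        """Relaxation: a tile may jump to any square (drop 'blank' and 'adjacent')."""
        return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

    def manhattan(state, goal=GOAL):
        """Relaxation: a tile may move to an adjacent square even if it's occupied."""
        goal_pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
        total = 0
        for i, tile in enumerate(state):
            if tile != 0:
                r, c = divmod(i, 3)
                gr, gc = goal_pos[tile]
                total += abs(r - gr) + abs(c - gc)
        return total

    s = (1, 4, 2,
         3, 0, 5,
         6, 7, 8)
    print(misplaced_tiles(s), manhattan(s))  # 2 2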
[Norvig] Now I want to talk about 1 more representation for planning | ▶ 00:00 |
called situation calculus. | ▶ 00:03 |
To motivate this, suppose we wanted to have the goal of moving all the cargo | ▶ 00:07 |
from airport A to airport B, regardless of how many pieces of cargo there are. | ▶ 00:12 |
You can't express the notion of All in propositional languages like classical planning, | ▶ 00:17 |
but you can in first order logic. | ▶ 00:22 |
There are several ways to use first order logic for planning. | ▶ 00:25 |
The best known is situation calculus. | ▶ 00:27 |
It's not a new kind of logic; | ▶ 00:30 |
rather, it's regular first order logic with a set of conventions | ▶ 00:32 |
for how to represent states and actions. | ▶ 00:36 |
I'll show you what the conventions are. | ▶ 00:38 |
First, actions are represented as objects in first order logic, | ▶ 00:41 |
normally by functions. | ▶ 00:49 |
And so we would have a function like the function Fly | ▶ 00:51 |
of a plane and a From Airport and a To Airport | ▶ 00:56 |
which represents an object, which is the action. | ▶ 01:02 |
Then we have situations, and situations are also objects in the logic, | ▶ 01:08 |
and they correspond not to states but rather to paths-- | ▶ 01:16 |
the paths of actions that we have in state space search. | ▶ 01:22 |
So if you arrive at what would be the same world state by 2 different sets of actions, | ▶ 01:27 |
those would be considered 2 different situations in situation calculus. | ▶ 01:33 |
We describe the situations by objects, so we usually have an initial situation, | ▶ 01:37 |
often called S0, | ▶ 01:43 |
and then we have a function on situations called Result. | ▶ 01:46 |
So the result of a situation object and an action object is equal to another situation. | ▶ 01:52 |
And now instead of describing the actions that are applicable | ▶ 02:02 |
in a situation with a predicate Actions of S, | ▶ 02:07 |
situation calculus for some reason decided not to do that | ▶ 02:14 |
and instead we're going to talk about the actions that are possible in the state, | ▶ 02:17 |
and we're going to do that with a predicate. | ▶ 02:23 |
If we have a predicate Possible of A and S, is an action A possible in a state? | ▶ 02:28 |
There's a specific form for describing these predicates, | ▶ 02:37 |
and in general, it has the form of some precondition of state S | ▶ 02:43 |
implies that it's possible to do action A in state S. | ▶ 02:52 |
I'll show you the possibility axiom for the Fly action. | ▶ 02:59 |
We would say if there is some P, which is the plane in state S, | ▶ 03:04 |
and there is some X, which is an airport in state S, | ▶ 03:10 |
and there is some Y, which is also an airport in state S, | ▶ 03:16 |
and P is at location X in state S, | ▶ 03:21 |
then that implies that it's possible to fly P from X to Y in state S. | ▶ 03:28 |
And that's known as the possibility axiom for the action Fly. | ▶ 03:41 |
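In symbols, one common way to write that possibility axiom is the following (my rendering of the spoken description; the free variables p, x, y and s are understood to be universally quantified):

    \mathrm{Plane}(p, s) \wedge \mathrm{Airport}(x, s) \wedge \mathrm{Airport}(y, s) \wedge \mathrm{At}(p, x, s)
    \;\Rightarrow\; \mathrm{Poss}(\mathrm{Fly}(p, x, y), s)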
[Norvig] There's a convention in situation calculus that predicates like At-- | ▶ 00:01 |
we said plane P was at airport X in situation S-- | ▶ 00:07 |
these types of predicates that can vary from 1 situation to another are called fluents, | ▶ 00:14 |
from the word fluent, having to do with fluidity or change over time. | ▶ 00:19 |
And the convention is that they refer to a specific situation, | ▶ 00:25 |
and we always put that situation argument as the last in the predicate. | ▶ 00:29 |
Now, the trickiest part about situation calculus is describing what changes | ▶ 00:35 |
and what doesn't change as a result of an action. | ▶ 00:41 |
Remember in classical planning we had action schemas | ▶ 00:44 |
where we described 1 action at a time and said what changed. | ▶ 00:48 |
For situation calculus it turns out to be easier to do it the other way around. | ▶ 00:53 |
Instead of writing 1 action or 1 schema or 1 axiom for each action, | ▶ 00:57 |
we do 1 for each fluent, for each predicate that can change. | ▶ 01:03 |
We use the convention called successor state axioms. | ▶ 01:07 |
These are used to describe what happens in the state | ▶ 01:12 |
that's a successor of executing an action. | ▶ 01:15 |
And in general, a successor state axiom will have the form of saying | ▶ 01:19 |
for all actions and states, if it's possible to execute action A in state S, | ▶ 01:26 |
then--and I'll show in general what they look like here-- | ▶ 01:35 |
the fluent is true if and only if action A made it true | ▶ 01:42 |
or action A didn't undo it. | ▶ 01:54 |
So that is, either it wasn't true before and A made it be true, | ▶ 02:03 |
or it was true before and A didn't do something to stop it being true. | ▶ 02:08 |
For example, I'll show you the successor state axiom for the In predicate. | ▶ 02:14 |
And just to make it a little bit simpler, I'll leave out all the For All quantifiers. | ▶ 02:18 |
So wherever you see a variable without a quantifier, assume that there's a For All. | ▶ 02:23 |
What we'll say is it's possible to execute A in situation S. | ▶ 02:28 |
If that's true, then the In predicate holds between some cargo C | ▶ 02:38 |
and some plane in the state, which is the result of executing action A in state S. | ▶ 02:48 |
So that In predicate will hold if and only if either A was a load action-- | ▶ 03:01 |
so if we load the cargo into the plane, then the result of executing that action A | ▶ 03:12 |
is that the cargo is in the plane-- | ▶ 03:19 |
or it might be that it was already true that the cargo was in the plane in situation S | ▶ 03:23 |
and A is not equal to an unload action. | ▶ 03:30 |
So for all A and S for which it's possible to execute A in situation S, | ▶ 03:38 |
the In predicate holds if and only if the action was a load | ▶ 03:45 |
or the In predicate used to hold in the previous state and the action is not an unload. | ▶ 03:50 |
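Putting that spoken description into symbols (my rendering; the free variables are again implicitly universally quantified, and x is whichever airport the load or unload happens at):

    \mathrm{Poss}(a, s) \;\Rightarrow\;
    \big[\, \mathrm{In}(c, p, \mathrm{Result}(s, a)) \;\Leftrightarrow\;
    a = \mathrm{Load}(c, p, x) \;\vee\; \big( \mathrm{In}(c, p, s) \wedge a \neq \mathrm{Unload}(c, p, x) \big) \,\big]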
[Norvig] So I've talked about the possibility axioms and the successor state axioms. | ▶ 00:00 |
That's most of what's in situation calculus, | ▶ 00:04 |
and that's used to describe an entire domain like the airport cargo domain. | ▶ 00:07 |
And now we describe a particular problem within that domain by describing the initial state. | ▶ 00:11 |
Typically we call that S0, the initial situation. | ▶ 00:18 |
And in S0 we can make various types of assertions | ▶ 00:23 |
of different types of predicates. | ▶ 00:29 |
So we could say that plane P1 is at airport JFK in S0, so just a simple predicate. | ▶ 00:31 |
And we could also make larger sentences, so we could say | ▶ 00:43 |
for all C, if C is cargo, then that C is at JFK in situation S0. | ▶ 00:52 |
So we have much more flexibility in situation calculus to say almost anything we want. | ▶ 01:07 |
Anything that's a valid sentence in first order logic can be asserted about the initial state. | ▶ 01:11 |
The goal state is similar. | ▶ 01:18 |
We could have a goal of saying there exists some goal state S | ▶ 01:20 |
such that for all C, if C is cargo, then we want that cargo to be at SFO in state S. | ▶ 01:25 |
So this initial state and this goal says move all the cargo-- | ▶ 01:41 |
I don't care how much there is--from JFK to SFO. | ▶ 01:45 |
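Written out (my rendering of the assertions just described), the initial situation and the goal are:

    \mathrm{At}(P_1, \mathrm{JFK}, S_0), \qquad \forall c\;\; \mathrm{Cargo}(c) \Rightarrow \mathrm{At}(c, \mathrm{JFK}, S_0)

    \exists s\;\; \forall c\;\; \mathrm{Cargo}(c) \Rightarrow \mathrm{At}(c, \mathrm{SFO}, s)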
The great thing about situation calculus is that once we've described this | ▶ 01:50 |
in the ordinary language of first order logic, | ▶ 01:55 |
we don't need any special programs to manipulate it and come up with the solution | ▶ 01:58 |
because we already have theorem provers for first order logic | ▶ 02:03 |
and we can just state this as a problem, | ▶ 02:06 |
apply the normal theorem prover that we already had for other uses, | ▶ 02:08 |
and it can come up with an answer of a path that satisfies this goal, | ▶ 02:13 |
a situation which corresponds to a path which satisfies this | ▶ 02:19 |
given the initial state and given the descriptions of the actions. | ▶ 02:23 |
So the advantage of situation calculus is that we have the full power of first order logic. | ▶ 02:28 |
We can represent anything we want. | ▶ 02:32 |
Much more flexibility than in problem solving or classical planning. | ▶ 02:34 |
So all together now, we've seen several ways of dealing with planning. | ▶ 02:39 |
We started in deterministic, fully observable environments | ▶ 02:42 |
and we moved into stochastic and partially observable environments. | ▶ 02:45 |
We were able to distinguish between plans that can or cannot solve a problem, | ▶ 02:49 |
but we had 1 weakness in all these different approaches. | ▶ 02:55 |
It is that we weren't able to distinguish between probable and improbable solutions. | ▶ 02:58 |
And that will be the subject of the next unit. | ▶ 03:03 |
In this exercise, I'm going to write some logical expressions | ▶ 00:00 |
in propositional logic and ask you | ▶ 00:04 |
if these expressions are always true or always false | ▶ 00:07 |
or if their truth value depends on the values of the propositional variables. | ▶ 00:13 |
The first sentence is smoke implies fire | ▶ 00:19 |
is equivalent to smoke or not fire. | ▶ 00:26 |
Is that true or false, or does it depend on the values of smoke and fire? | ▶ 00:32 |
The second sentence, again, smoke implies fire | ▶ 00:39 |
is equivalent to not smoke implies not fire. | ▶ 00:45 |
The third sentence, smoke implies fire | ▶ 00:54 |
is equivalent to not fire implies not smoke. | ▶ 01:00 |
The fourth sentence, big or dumb | ▶ 01:09 |
or big implies dumb. | ▶ 01:17 |
The final sentence, big and dumb | ▶ 01:22 |
is equivalent to not, not big or not dumb. | ▶ 01:31 |
For each of these, tell me if they're always true regardless of the values of the variables, | ▶ 01:39 |
always false or sometimes true and sometimes false. | ▶ 01:45 |
Here are the answers. The first sentence is true half the time and false half the time. | ▶ 00:00 |
It would have been true all the time if we had written fire or not smoke | ▶ 00:06 |
rather than smoke or not fire on the right-hand side. | ▶ 00:10 |
The second sentence is false when fire is true and smoke is false | ▶ 00:15 |
and otherwise true. | ▶ 00:19 |
The third sentence is always true, and this is called the contrapositive. | ▶ 00:22 |
Smoke implies fire is the same thing as not fire implies not smoke. | ▶ 00:27 |
The fourth sentence is always true, and you can figure that out | ▶ 00:33 |
by writing out the full truth tables or by reasoning about the variable big. | ▶ 00:37 |
When big is true, the whole sentence is true because big is one of the disjuncts, | ▶ 00:42 |
and when big is false, it's true because big implies dumb is true | ▶ 00:48 |
whenever the antecedent is false. | ▶ 00:53 |
And the final sentence is also always true, | ▶ 00:57 |
and this is known as de Morgan's law. | ▶ 01:00 |
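Written out as formulas, the two equivalences behind the last two answers are the contrapositive and de Morgan's law; this is just a restatement of what was said above.

\[ (S \Rightarrow F) \;\equiv\; (\lnot F \Rightarrow \lnot S) \]
\[ (B \land D) \;\equiv\; \lnot(\lnot B \lor \lnot D) \]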
In this exercise, I'm going to give you some English sentences | ▶ 00:00 |
and then some first-order logic sentences | ▶ 00:04 |
and ask you does the first-order logic sentence | ▶ 00:07 |
correctly encode the English sentence, does it incorrectly encode it, | ▶ 00:10 |
or is it just an error that is not a legitimate sentence | ▶ 00:15 |
in first-order logic? | ▶ 00:21 |
The first English sentence is "Paris and Nice are both in France." | ▶ 00:25 |
Here's one possible translation. | ▶ 00:30 |
Paris and Nice are in France. | ▶ 00:33 |
Here's another. | ▶ 00:41 |
Paris is in France, and Nice is in France. | ▶ 00:44 |
Tell us if each of these is a correct encoding of English, | ▶ 00:49 |
incorrect, or if it's erroneous first-order logic syntax. | ▶ 00:54 |
The second sentence in English is "There is a country that borders Iran and Syria." | ▶ 01:00 |
Here are the possible translations. | ▶ 01:07 |
There exists a c, and we're going to use the predicate capital C | ▶ 01:09 |
to mean that the argument is a country. | ▶ 01:14 |
So, there exists a c such that C of c, | ▶ 01:22 |
and we're going to use the predicate B to mean 2 objects border each other. | ▶ 01:26 |
So, c borders Iran, and c borders Syria. | ▶ 01:32 |
That's one translation. Here's the other translation. | ▶ 01:40 |
There exists a c if C is a country, | ▶ 01:44 |
then c borders Iran and c borders Syria. | ▶ 01:50 |
And the final English sentence is no 2 bordering countries | ▶ 02:01 |
can have the same map color, and we're going to use the predicate MC for map color. | ▶ 02:04 |
Here's one possibility for all x and y. | ▶ 02:10 |
X is a country, and y is a country. | ▶ 02:14 |
And x and y border each other. | ▶ 02:21 |
That implies it's not the case that the map color | ▶ 02:26 |
of x equals the map color of y. | ▶ 02:32 |
And I should say we're using map color here as a function, not as a predicate. | ▶ 02:38 |
Here's another possibility. | ▶ 02:43 |
For all x and y, it's not the case that x is a country, | ▶ 02:46 |
or it's not the case that y is a country, | ▶ 02:53 |
or it's not the case that x and y border, | ▶ 02:59 |
or it's not the case that the map color of x | ▶ 03:05 |
is equal to the map color of y. | ▶ 03:11 |
The answers are the first sentence has erroneous syntax. | ▶ 00:00 |
We're using an and here between 2 terms, but you can't do that. | ▶ 00:06 |
An and can only be used between sentences, not between terms, in first-order logic. | ▶ 00:10 |
The second sentence does correctly encode the English sentence | ▶ 00:16 |
Paris and Nice are both in France. | ▶ 00:19 |
Similarly, the third sentence does correctly encode | ▶ 00:23 |
there's a country that borders Iran and Syria, | ▶ 00:27 |
but the fourth one incorrectly encodes it. | ▶ 00:30 |
Here we have an existential. There exists a C. | ▶ 00:34 |
And then an implication, and that's usually the wrong thing. | ▶ 00:37 |
The problem here is not if C represents a country, | ▶ 00:42 |
but what if C represents something that's not a country, say my dog. | ▶ 00:46 |
My dog is not a country, so there does exist a c, | ▶ 00:51 |
which is my dog, such that this implication is true | ▶ 00:56 |
because whenever the antecedent of an implication is false, | ▶ 01:00 |
my dog is not a country, then the whole thing is true. | ▶ 01:04 |
For the final sentence in English, no 2 bordering countries | ▶ 01:08 |
can have the same map color, both of these are correct encodings. | ▶ 01:11 |
The first one seems more obvious, | ▶ 01:16 |
and the second one we've just manipulated things a little bit. | ▶ 01:18 |
We know that A implies B is the same thing as saying not A or B, | ▶ 01:21 |
so here we've just taken the left-hand side and negated it | ▶ 01:27 |
and then put those all together with an or, | ▶ 01:31 |
so these 2 sentences represent the same thing, | ▶ 01:34 |
and they're both correct. | ▶ 01:38 |
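As a summary, one plausible way to write the correct encodings discussed above, using the quiz's predicate names C (country), B (borders), and the function MC (map color):

\[ \exists c\;\; C(c) \land B(c, \mathit{Iran}) \land B(c, \mathit{Syria}) \]
\[ \forall x \forall y\;\; \big(C(x) \land C(y) \land B(x, y)\big) \Rightarrow \lnot\big(MC(x) = MC(y)\big) \]

The second, disjunctive encoding of the map-coloring sentence then follows from the equivalence A implies B is the same as not A or B, as mentioned above.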
This problem is about planning in belief space. | ▶ 00:00 |
We have the 2 room vacuum world, and we've represented various belief states here. | ▶ 00:04 |
Now, in this version, there are no sensors and so no percepts. | ▶ 00:10 |
The actions are all deterministic. | ▶ 00:14 |
A right or left or suck action will always do what it's supposed to do, | ▶ 00:17 |
and the environment is static. | ▶ 00:22 |
That is, dirt stays put until it's cleaned up. | ▶ 00:24 |
Now, in the start, you know nothing about the environment. | ▶ 00:28 |
You have no input. You don't know what location you're in. | ▶ 00:32 |
You don't know where the dirt is, and your goal is to be in the leftmost of the 2 squares | ▶ 00:34 |
and have both squares cleaned up, and what I want you to do | ▶ 00:40 |
is click on the sequence of actions, an action like this one or this one or this one | ▶ 00:44 |
or this one, that constitute a path from the start state to the goal state. | ▶ 00:49 |
And then I want you to click on yes if that path is guaranteed | ▶ 00:56 |
to always reach the goal, and no if the path only sometimes reaches the goal. | ▶ 01:03 |
The answer is that we start off knowing nothing, | ▶ 00:00 |
so we're in this belief state here where any of | ▶ 00:03 |
the 8 possible states are possibilities. | ▶ 00:06 |
Then our path to the goal is to move right, and we arrive in this belief state, | ▶ 00:10 |
and then suck up the dirt there. | ▶ 00:16 |
Then move left, and then suck up the dirt there, | ▶ 00:19 |
and we end up in a belief state with only a single world state, | ▶ 00:23 |
and that's one that reaches the goal where we're on the left | ▶ 00:28 |
and both squares are clean. | ▶ 00:31 |
And yes, that is guaranteed to reach the goal. | ▶ 00:33 |
In this problem, we're again in the 2-location vacuum world, | ▶ 00:00 |
but this time around, we have local sensing, | ▶ 00:06 |
meaning at each turn, we get input of what location we're at, | ▶ 00:10 |
the left or the right, and whether there's dirt in that location. | ▶ 00:15 |
But we don't know what's going on in the other location. | ▶ 00:18 |
We have a dynamic world where dirt can appear anywhere. | ▶ 00:21 |
As we move around, dirt can spontaneously appear | ▶ 00:28 |
in the location we left or in the location we're going to visit. | ▶ 00:33 |
However, if we're sucking, the dirt can't appear, | ▶ 00:37 |
because if it did appear there, we would successfully suck it up. | ▶ 00:41 |
And now in addition, the right and left moves | ▶ 00:45 |
are stochastic in that they don't always succeed. | ▶ 00:50 |
Sometimes when you try to go right, you do successfully go right, | ▶ 00:54 |
and sometimes you stay in the same location, same for left. | ▶ 00:57 |
The suck action is always successful. | ▶ 01:01 |
It will always clean up dirt in the current location. | ▶ 01:04 |
Now, when we start out, we get the percept | ▶ 01:07 |
saying that we're in the leftmost location | ▶ 01:10 |
and that location is clean, and that means our belief state | ▶ 01:14 |
is that we're in either 5 or 7. | ▶ 01:19 |
Now, the first thing I want you to answer is if we decide to move right, | ▶ 01:23 |
what do we predict the possible belief state will be, | ▶ 01:29 |
the possible set of states in our belief state will be | ▶ 01:33 |
after we execute the right movement? | ▶ 01:36 |
The answer is the right movement is stochastic, | ▶ 00:00 |
so it may fail, so that means we may stay on the left, and we may move to the right. | ▶ 00:05 |
And the world is dynamic, which means dirt may appear | ▶ 00:09 |
in either the left or the right location, | ▶ 00:13 |
and we didn't know for sure if there was dirt or not in the right, | ▶ 00:16 |
so that means any of the 8 possible states belong to the belief state | ▶ 00:20 |
for the prediction of moving right. | ▶ 00:26 |
Now we get a percept from the world, and we've observed | ▶ 00:00 |
that we're in the rightmost square, so the action worked, | ▶ 00:03 |
and that square is dirty. | ▶ 00:07 |
Now we want to update our belief state | ▶ 00:09 |
and click on all the states that belong to the belief state now | ▶ 00:12 |
as we update due to this percept. | ▶ 00:15 |
The answer is state 6 and state 2. | ▶ 00:00 |
Those are the 2 states in which the vacuum is on the right and that state is dirty. | ▶ 00:04 |
The other location we can't observe, and it could have been in any state before, | ▶ 00:08 |
so now it can be either clean or dirty. | ▶ 00:12 |
Now our belief state contains 2 and 6, | ▶ 00:00 |
and we decide we want to execute the suck action. | ▶ 00:04 |
Now tell me, by clicking on the appropriate states, | ▶ 00:08 |
what states belong to the belief state | ▶ 00:11 |
after we make a prediction for what's going to happen after the suck action. | ▶ 00:14 |
The answer is states 4 and 8. | ▶ 00:00 |
We know that the suck action will make it clean | ▶ 00:03 |
in our current location, but we don't know what's going on in the other location. | ▶ 00:06 |
Now, we make the observation | ▶ 00:00 |
right, clean, and I want you to | ▶ 00:04 |
update our belief state by clicking on the states | ▶ 00:09 |
that belong to the new belief state now | ▶ 00:12 |
after taking that observation into account. | ▶ 00:16 |
The answer is nothing has changed. | ▶ 00:00 |
Our belief state is still 4 and 8. | ▶ 00:02 |
We didn't really get any new information from that input | ▶ 00:05 |
because we knew that the result of the suck action | ▶ 00:07 |
was going to have to clean up locally, and we still didn't know anything | ▶ 00:10 |
about the other non-local state. | ▶ 00:13 |
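Here is a minimal Python sketch of the belief-state tracking used in this sequence of quizzes, under the assumptions stated above: local sensing, stochastic Right/Left moves that may fail, dirt that can spontaneously appear, and a Suck action that always cleans the current square. States are written as tuples rather than with the slide's 1-8 numbering, which is not reproduced here.

from itertools import product

ALL_STATES = {(loc, dl, dr) for loc in ('L', 'R')
              for dl, dr in product((True, False), repeat=2)}

def successors(state, action):
    """Set of world states that could result from one action (before sensing)."""
    loc, dl, dr = state
    results = set()
    if action == 'Suck':
        # Suck always cleans the current square; the other square may become dirty.
        if loc == 'L':
            results |= {('L', False, dr2) for dr2 in (dr, True)}
        else:
            results |= {('R', dl2, False) for dl2 in (dl, True)}
    else:
        # Right/Left may fail (stay) or succeed; dirt may appear in either square.
        locs = {loc, 'R' if action == 'Right' else 'L'}
        for loc2 in locs:
            for dl2 in (dl, True):
                for dr2 in (dr, True):
                    results.add((loc2, dl2, dr2))
    return results

def predict(belief, action):
    """Prediction step: union of successors over every state in the belief."""
    return set().union(*(successors(s, action) for s in belief))

def update(belief, percept):
    """Update step: keep only states consistent with the local percept."""
    loc, dirty = percept              # e.g. ('R', True) means: at the right square, and it is dirty
    def consistent(s):
        sloc, dl, dr = s
        local_dirt = dl if sloc == 'L' else dr
        return sloc == loc and local_dirt == dirty
    return {s for s in belief if consistent(s)}

# Reproducing the quiz: start knowing we are at the left square and it is clean.
b = update(ALL_STATES, ('L', False))
b = predict(b, 'Right')               # all 8 states, as in the answer
b = update(b, ('R', True))            # the 2 states with the vacuum on a dirty right square
b = predict(b, 'Suck')                # 2 states: right square clean, left unknown
b = update(b, ('R', False))           # unchanged, as in the answer
print(len(b), b)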
This is a famous problem called the monkey and bananas problem, | ▶ 00:00 |
described in the language of classical planning. | ▶ 00:04 |
There are six actions. The monkey can go from location x to y. | ▶ 00:08 |
It can push some object from x to y. It can climb up an object. It can grab something. | ▶ 00:12 |
It can climb down from an object, and it can un-grab something. | ▶ 00:19 |
Initially, the monkey is at location A. The bananas are at location B. | ▶ 00:24 |
The box is at C, and the monkey is at a low height, as is the box, | ▶ 00:29 |
but the bananas are at a high height, | ▶ 00:34 |
but the box is pushable and climbable. | ▶ 00:37 |
Now, assuming that we execute this plan--go from A to C, push the box from C to B, | ▶ 00:40 |
climb up on the box, grasp the bananas, and climb down from the box. | ▶ 00:46 |
What I want you to do is look at these definitions of actions, | ▶ 00:51 |
tell me how the state unfolds from this initial state here to the final state, | ▶ 00:57 |
and then click off all of these instances that are going to be true in the final state. | ▶ 01:03 |
The answers are, yes, the monkey has the bananas. | ▶ 00:00 |
No, the box is not at C. It has been pushed to B. | ▶ 00:04 |
Yes, the monkey is at B. Yes, the bananas are at B. | ▶ 00:08 |
No, the height of the monkey is not high, because he climbed down, | ▶ 00:13 |
which means that the effect was that he is at height low. | ▶ 00:17 |
But, yes, the height of the bananas is high, according to these definitions. | ▶ 00:22 |
You would think once the monkey grasped the bananas and climbed down | ▶ 00:27 |
that the height of the bananas should be low, | ▶ 00:32 |
but if we look at the operator for climb down, it doesn't say that. | ▶ 00:35 |
It refers to the monkey, but it doesn't refer to anything that the monkey is holding. | ▶ 00:39 |
That kind of thing is difficult to express in the language of classical planning. | ▶ 00:44 |
You could say that's a weakness in the definition of climb down. | ▶ 00:49 |
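To see why the bananas stay high, here is one plausible way the climb down schema could read; the exact schema is on the slide, but any version that only mentions the monkey's height has the same consequence.

\[ Action(\mathit{ClimbDown}(b),\;\; \textsc{Precond: } On(\mathit{Monkey}, b) \land \mathit{Height}(\mathit{Monkey}, \mathit{High}), \]
\[ \qquad\qquad \textsc{Effect: } \lnot On(\mathit{Monkey}, b) \land \lnot \mathit{Height}(\mathit{Monkey}, \mathit{High}) \land \mathit{Height}(\mathit{Monkey}, \mathit{Low})) \]

Nothing here says anything about objects the monkey is holding, so the planner never infers that the bananas come down too.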
[Norvig] The final problem involves situation calculus. | ▶ 00:00 |
In the domain I want to describe, we have a combination lock with 4 digits, | ▶ 00:05 |
and the correct combination that will open the lock we'll call X. | ▶ 00:11 |
There are 2 actions you can perform. | ▶ 00:17 |
One is to dial any combination on the dial, | ▶ 00:20 |
and if you dial the correct one, X, then the lock will open. | ▶ 00:24 |
And the other action you can perform is to press a lock button, | ▶ 00:29 |
and if you press that button, then the lock will be locked, | ▶ 00:34 |
whether it was open before or not. | ▶ 00:37 |
I'm going to describe some axioms, | ▶ 00:40 |
and I want you to tell me whether these axioms are correct for the domain or not. | ▶ 00:43 |
First the possibility axioms. | ▶ 00:51 |
One choice is the possibility axiom that says | ▶ 00:53 |
if C equals X, then it's possible to dial C in situation S. | ▶ 00:59 |
And here I'm assuming that all variables are scoped | ▶ 01:09 |
so that we say an implicit for all C and for all S here. | ▶ 01:13 |
And X is not a variable. This is a constant, referring to the correct combination. | ▶ 01:18 |
The other possible axiom is for all C if C is greater than or equal to 0 | ▶ 01:26 |
and less than or equal to 9999, | ▶ 01:36 |
then it's possible to dial C in any situation S. | ▶ 01:41 |
So tell me which, if any or both, of these axioms you think correctly encode the situation. | ▶ 01:50 |
Next we'll look at the possibility axioms for the lock action. | ▶ 02:00 |
Here's one. | ▶ 02:05 |
We can say if the safe is open in situation S, | ▶ 02:07 |
then it's possible to execute the lock action in S. | ▶ 02:12 |
Or maybe we should say if the safe is not open in S, | ▶ 02:18 |
then it's possible to execute Lock in S. | ▶ 02:24 |
Or maybe we should say if true, | ▶ 02:30 |
then it's possible to execute the lock action in situation S. | ▶ 02:35 |
And tell me which, if any, of those represents a correct representation of the problem. | ▶ 02:42 |
And finally we need successor state axioms for all the fluents, | ▶ 02:50 |
but there's really only one fluent, and that's whether or not the safe is open. | ▶ 02:54 |
So here's one example of a successor state axiom. | ▶ 02:59 |
We could say for any situation and action, | ▶ 03:06 |
if it's possible to execute that action in the situation, | ▶ 03:13 |
then the Open fluent is going to be true in the result of executing that action | ▶ 03:18 |
if and only if the action is dialing the correct combination, X, | ▶ 03:26 |
or if the safe was already open in S and the action is not equal to Lock. | ▶ 03:36 |
That's one option. | ▶ 03:47 |
And the other option is the same thing on the left-hand side, | ▶ 03:49 |
and on the right-hand side it's open if and only if the action is dialing the correct combination | ▶ 03:54 |
and the action is not equal to Lock. | ▶ 04:03 |
So tell me which, if any or all, of these are accurate representations of the problem. | ▶ 04:09 |
In each case I want you to tell me if each of these axioms is good as it stands alone. | ▶ 04:16 |
I don't want you to look at any combinations of axioms | ▶ 04:23 |
but just go through each one and check the box if you think that the axiom on that line alone | ▶ 04:26 |
is a correct representation of the problem. | ▶ 04:33 |
[Norvig] The answers are, in this case, only the second is a correct representation. | ▶ 00:00 |
Any combination is possible to be dialed. | ▶ 00:06 |
It's not the case that it's only possible to dial the correct combination. | ▶ 00:09 |
Now, here we said that the lock button works at any point. | ▶ 00:15 |
Whether it's open or not, the lock button will always lock it. | ▶ 00:19 |
And so that's represented by the third option. | ▶ 00:23 |
True implies it's possible to lock. | ▶ 00:25 |
In this case the first one is a correct representation | ▶ 00:30 |
of the successor state axiom for Open, | ▶ 00:34 |
and the second one is not, because note what it says. | ▶ 00:38 |
If we already have the lock open and then we execute some dialing action | ▶ 00:41 |
that's not dialing the correct combination, X, we want it to remain open. | ▶ 00:49 |
But this second axiom would make it be closed, which is not what we want. | ▶ 00:54 |
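For reference, the first, correct option can be written out as a successor-state axiom roughly as follows; this is a reconstruction from the wording above, so the exact symbols on the slide may differ.

\[ \forall a, s\;\; Poss(a, s) \Rightarrow \Big[ Open(Result(a, s)) \Leftrightarrow \big( a = Dial(X) \;\lor\; (Open(s) \land a \neq Lock) \big) \Big] \]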
So today is an exciting day. | ▶ 00:00 |
We'll talk about planning under uncertainty, | ▶ 00:02 |
and it really puts together the material | ▶ 00:05 |
we've talked about in past classes. | ▶ 00:07 |
We talked about planning, | ▶ 00:09 |
but not under uncertainty, and you've had | ▶ 00:11 |
many, many classes on uncertainty, | ▶ 00:13 |
and now it gets to the point where we can make | ▶ 00:15 |
decisions under uncertainty. | ▶ 00:17 |
This is really important for my own research field | ▶ 00:19 |
like robotics where the world is full of uncertainty, and the | ▶ 00:21 |
type of techniques I'll tell you about today | ▶ 00:24 |
will really make it possible to drive robots | ▶ 00:26 |
in the actual physical world and | ▶ 00:28 |
find good plans for these robots to execute. | ▶ 00:30 |
[Narrator] Planning under uncertainty. | ▶ 00:00 |
In this class so far | ▶ 00:04 |
we talked a good deal about planning. | ▶ 00:06 |
We talked about uncertainty and probabilities, | ▶ 00:08 |
and we also talked about learning, | ▶ 00:12 |
but all 3 items were discussed separately. | ▶ 00:15 |
We never brought planning and uncertainty together, | ▶ 00:18 |
uncertainty and learning, or planning and learning. | ▶ 00:20 |
So in the class today, we'll fuse planning and uncertainty | ▶ 00:23 |
using techniques known as Markov decision processes or MDPs, | ▶ 00:26 |
and partially observable Markov decision processes or POMDPs. | ▶ 00:31 |
We also have a class coming up on reinforcement learning | ▶ 00:36 |
which combines all 3 of these aspects, | ▶ 00:39 |
planning, uncertainty, and machine learning. | ▶ 00:41 |
You might remember in the very first class | ▶ 00:44 |
we distinguished very different characteristics of agent tasks, | ▶ 00:46 |
and here are some of those. | ▶ 00:49 |
We distinguished deterministic versus stochastic environments, | ▶ 00:51 |
and we also talked about fully versus partially observable environments. | ▶ 00:54 |
In the area of planning so far | ▶ 00:58 |
all of our algorithms fall into this field over here, | ▶ 01:01 |
like A*, depth-first, breadth-first, and so on. | ▶ 01:04 |
The MDP algorithms | ▶ 01:10 |
which I will talk about first | ▶ 01:12 |
fall into the intersection of fully observable | ▶ 01:14 |
yet stochastic, and just to remind us | ▶ 01:17 |
what the difference was, | ▶ 01:19 |
stochastic is an environment where the outcome of an action is somewhat random. | ▶ 01:21 |
Whereas in a deterministic environment, | ▶ 01:24 |
the outcome of an action is predictable | ▶ 01:26 |
and always the same. | ▶ 01:29 |
An environment is fully observable if you can | ▶ 01:31 |
see the state of the environment which means if you can make all decisions | ▶ 01:33 |
based on the momentary sensory input. | ▶ 01:35 |
Whereas if you need memory, | ▶ 01:37 |
it's partially observable. | ▶ 01:39 |
Planning in the partially observable case | ▶ 01:41 |
is called POMDP, and towards the end of this class, | ▶ 01:43 |
I'll briefly talk about POMDPs but not in any depth. | ▶ 01:47 |
So most of this class focuses on Markov decision processes | ▶ 01:50 |
as opposed to partially observable Markov decision processes. | ▶ 01:53 |
So what is a Markov decision process? | ▶ 01:57 |
One way you can specify a Markov decision process is by a graph. | ▶ 01:59 |
Suppose you have states S1, S2, and S3, | ▶ 02:04 |
and you have actions A1 and A2. | ▶ 02:08 |
A state transition graph like this | ▶ 02:11 |
is a finite state machine, | ▶ 02:14 |
and it becomes Markov if the outcomes of actions are somewhat random. | ▶ 02:16 |
So for example if A1 over here, with a 50% probability, leads to | ▶ 02:20 |
state S2 but with another 50% probability | ▶ 02:25 |
leads to state S3. | ▶ 02:29 |
So put differently, a Markov decision process consists of | ▶ 02:32 |
states, actions, and a state transition matrix, | ▶ 02:34 |
often written in the following form, | ▶ 02:40 |
which is just the same as | ▶ 02:42 |
a conditional state transition probability | ▶ 02:44 |
that a state S prime | ▶ 02:47 |
is the correct posterior state | ▶ 02:49 |
after executing action A in a state S, | ▶ 02:51 |
and the missing thing is the objective for the Markov decision process. | ▶ 02:55 |
What do we want to achieve? | ▶ 02:58 |
For that we often define a reward function, | ▶ 03:00 |
and for the sake of this lecture, | ▶ 03:03 |
I will attach rewards just to states. | ▶ 03:05 |
So each state will have a function R attached | ▶ 03:07 |
that tells me how good the state is. | ▶ 03:10 |
So for example it might be worth $10 | ▶ 03:12 |
to be in the state over here, | ▶ 03:14 |
$0 to be in the state over here, | ▶ 03:16 |
and $100 to be in a state over here. | ▶ 03:18 |
So the planning problem is now the problem | ▶ 03:21 |
of assigning an action to each possible state | ▶ 03:23 |
so that we maximize our total reward. | ▶ 03:27 |
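In symbols, the ingredients just listed are a set of states S, a set of actions A, a state transition probability, and a reward function attached to states; this is just a compact restatement of the verbal definition.

\[ T(s, a, s') \;=\; P(s' \mid s, a), \qquad R(s). \]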
[Narrator] Before diving into too much detail, | ▶ 00:00 |
let me explain to you why MDPs really matter. | ▶ 00:03 |
What you see here is a robotic tour guide | ▶ 00:07 |
that the University of Bonn, with my assistance, | ▶ 00:10 |
deployed in the Deutsches Museum in Bonn, | ▶ 00:13 |
and the objective of this robot was to | ▶ 00:17 |
navigate the museum and guide visitors, | ▶ 00:20 |
mostly kids, from exhibit to exhibit. | ▶ 00:23 |
This is a challenging planning problem because | ▶ 00:27 |
as the robot moves | ▶ 00:31 |
it can't really predict its action outcomes | ▶ 00:33 |
because of the randomness of the environment | ▶ 00:35 |
and the carpet and the wheels of the robot. | ▶ 00:38 |
The robot is not able to really follow its own commands very well, | ▶ 00:40 |
and it has to take this into consideration during the planning process | ▶ 00:44 |
so when it finds itself in a location it didn't expect, | ▶ 00:47 |
it knows what to do. | ▶ 00:50 |
In the second video here, you see a successor robot | ▶ 00:53 |
that was deployed in the Smithsonian National | ▶ 00:56 |
Museum of American History in the late 1990s | ▶ 00:59 |
where it guided many, many thousands of kids | ▶ 01:03 |
through the entrance hall of the museum, | ▶ 01:05 |
and once again, this is a challenging planning problem. | ▶ 01:08 |
As you can see people are often in the way of the robot. | ▶ 01:10 |
The robot has to take detours. | ▶ 01:13 |
Now this one is particularly difficult because | ▶ 01:15 |
there were obstacles that were invisible | ▶ 01:17 |
like a downward staircase. | ▶ 01:19 |
So this is a challenging localization problem | ▶ 01:21 |
trying to find out where you are, | ▶ 01:23 |
but that's for a later class. | ▶ 01:25 |
In the video here, you see a robot being deployed in a nursing home | ▶ 01:30 |
with the objective to assist elderly people | ▶ 01:33 |
by guiding them around, bring them to appointments, | ▶ 01:36 |
reminding them to take their medication, and | ▶ 01:39 |
interacting with them, and this robot has been active for many, many years | ▶ 01:42 |
and been used, and, again, it's a very challenging planning problem | ▶ 01:45 |
to navigate through this elderly home. | ▶ 01:48 |
And the final robot I'm showing you here. | ▶ 01:52 |
This was built with my colleague Will Whittaker at Carnegie Mellon University. | ▶ 01:54 |
The objective here was to explore abandoned mines. | ▶ 01:57 |
Pennsylvania and West Virginia | ▶ 02:01 |
and other states are heavily mined. | ▶ 02:03 |
There's many abandoned old coal mines, | ▶ 02:06 |
and for many of these mines, | ▶ 02:09 |
it's unknown what the conditions are and where exactly they are. | ▶ 02:11 |
They're not really human accessible. | ▶ 02:14 |
They tend to have roof fall and very low oxygen levels. | ▶ 02:16 |
So we made a robot that went inside | ▶ 02:19 |
and built maps of those mines. | ▶ 02:21 |
All these applications have in common that they | ▶ 02:26 |
pose really challenging planning problems. | ▶ 02:29 |
The environments are stochastic. | ▶ 02:32 |
That is, the outcomes of actions are unknown, | ▶ 02:34 |
and the robot has to be able to react to | ▶ 02:36 |
all kinds of situations, even the ones that it didn't plan for. | ▶ 02:39 |
[Narrator] Let me give a much simpler example | ▶ 00:00 |
often called grid world for MDPs, | ▶ 00:03 |
and I'll be using this insanely simple | ▶ 00:06 |
example over here throughout this class. | ▶ 00:08 |
Let's assume we have a starting state over here, | ▶ 00:11 |
and there are 2 goal states, which | ▶ 00:13 |
are often called absorbing states | ▶ 00:16 |
with very different reward or payout. | ▶ 00:18 |
Plus 100 for the state over here, | ▶ 00:21 |
minus 100 for the state over here, | ▶ 00:23 |
and our agent is able to move about the environment, | ▶ 00:25 |
and when it reaches one of those 2 states, | ▶ 00:28 |
the game is over and the task is done. | ▶ 00:30 |
Obviously the top state is much more attractive than the bottom state with minus 100. | ▶ 00:33 |
Now to turn this into an MDP, let's assume | ▶ 00:37 |
actions are somewhat stochastic. | ▶ 00:41 |
So suppose we had a grid cell, and we attempt to go north. | ▶ 00:43 |
The deterministic agent would always succeed | ▶ 00:46 |
to go to the north square if it's available, | ▶ 00:49 |
but let's assume that we only have an 80% chance | ▶ 00:53 |
to make it to the cell in the north. | ▶ 00:55 |
If there's no cell at all, | ▶ 00:57 |
there's a wall like over here, | ▶ 00:59 |
we assume with 80% chance, we just bounce back to the same cell, | ▶ 01:01 |
but with 10% chance, we instead go left. | ▶ 01:04 |
Another 10% chance, we go right. | ▶ 01:08 |
So if an agent is over here and wishes to go north, | ▶ 01:10 |
then with 80% chance, it finds itself over here, | ▶ 01:13 |
10% over here, 10% over here. | ▶ 01:15 |
If it goes north from here, | ▶ 01:17 |
because there's no north cell, | ▶ 01:19 |
it'll bounce back with 80% probability, | ▶ 01:21 |
10% left, 10% right. | ▶ 01:23 |
In a cell like this one over here, | ▶ 01:25 |
it'll bounce back with 90% probability, | ▶ 01:27 |
80 from the top and 10 from the left, | ▶ 01:30 |
but it still has a 10% chance of going right. | ▶ 01:32 |
This is a stochastic state transition which | ▶ 01:35 |
we can equally define for actions, | ▶ 01:37 |
south, west and east, and | ▶ 01:39 |
now we can see that in a situation like this, | ▶ 01:41 |
conventional planning is insufficient. | ▶ 01:43 |
So for example, if you plan a sequence of actions starting over here, | ▶ 01:45 |
you might go north, north, east, east, east | ▶ 01:48 |
to reach our plus 100 absorbing or final state, | ▶ 01:51 |
but with this state transition model over here, | ▶ 01:55 |
even with the first step, it might happen with 10% chance that you find yourself over here, | ▶ 01:57 |
in which case conventional planning would not give us an answer. | ▶ 02:02 |
So we wish to have a planning method that provides an answer no matter where we are | ▶ 02:05 |
and that's called a policy. | ▶ 02:09 |
A policy assigns actions to any state. | ▶ 02:12 |
So for example a policy might look as follows: | ▶ 02:15 |
for this state, we wish to go north, north, east, east, east, | ▶ 02:17 |
but for this state over here, we wish to go north, maybe east over here, | ▶ 02:24 |
and maybe west over here. | ▶ 02:28 |
So for each state, except for the absorbing states, | ▶ 02:31 |
we have to define an action to define a policy. | ▶ 02:33 |
The planning problem we have becomes one of finding the optimal policy. | ▶ 02:36 |
[Narrator] To understand the beauty | ▶ 00:00 |
of a policy, let me look into stochastic environments, | ▶ 00:02 |
and let me try to apply conventional planning. | ▶ 00:07 |
Consider the same grid I just gave you, | ▶ 00:11 |
and let's assume there's a discrete start state, | ▶ 00:15 |
the one over here, and we wish to find | ▶ 00:18 |
an action sequence that leads us to | ▶ 00:20 |
the goal state over here. | ▶ 00:22 |
In conventional planning we would create a tree. | ▶ 00:24 |
In C1, we're given 4 action choices, | ▶ 00:27 |
north, south, west, and east. | ▶ 00:30 |
However, the outcome of those choices | ▶ 00:35 |
is not deterministic. | ▶ 00:37 |
So rather than having a single outcome, | ▶ 00:39 |
nature will choose for us the actual outcome. | ▶ 00:41 |
In the case of going north, for example, | ▶ 00:44 |
we may find ourselves in B1, | ▶ 00:46 |
or back into C1. | ▶ 00:49 |
Similarly for going south, we might find | ▶ 00:52 |
ourselves back in C1, or in C2, and so on. | ▶ 00:54 |
This tree has a number of problems. | ▶ 01:01 |
The first problem is the branching factor. | ▶ 01:04 |
While we have 4 different action choices, | ▶ 01:07 |
nature will give us up to 3 different outcomes | ▶ 01:10 |
which makes up to 12 different things we have to follow. | ▶ 01:13 |
Now in conventional planning we might have to follow just 1 of those, | ▶ 01:17 |
but here we might have to follow up to 3 of those things. | ▶ 01:20 |
So every time we plan a step ahead, | ▶ 01:23 |
we might have to increase the breadth of | ▶ 01:26 |
the search of tree by at least a factor of 3. | ▶ 01:28 |
So one of the problem is the branching | ▶ 01:32 |
factor is too large. | ▶ 01:34 |
[Narrator] To understand the branching factor, | ▶ 00:00 |
let me quiz you on | ▶ 00:02 |
how many states you can possibly reach | ▶ 00:04 |
from any other states, and as an example | ▶ 00:07 |
from C1, you can reach under | ▶ 00:09 |
any action choice B1, C1, and C2, and that | ▶ 00:12 |
will give you an effective branching factor of 3. | ▶ 00:17 |
So when I ask you, what's the effective branching factor in B3? | ▶ 00:19 |
What is the maximum number of states | ▶ 00:23 |
you can reach under | ▶ 00:25 |
any possible action from B3? | ▶ 00:27 |
So how many states can we reach | ▶ 00:29 |
from B3 over here? | ▶ 00:31 |
[Narrator] And the answer is 8. | ▶ 00:00 |
If you go north, you might reach this state over here, | ▶ 00:02 |
this one over here, this one over here. | ▶ 00:05 |
If you go east, you might reach this state over here, | ▶ 00:08 |
or this one over here, or this one over here, | ▶ 00:10 |
or this one over here. | ▶ 00:12 |
When you put it all together, you can reach all of those 8 states over here. | ▶ 00:14 |
[Narrator] There are other problems with | ▶ 00:00 |
the search paradigm. | ▶ 00:02 |
The second one is that the tree could be very deep, | ▶ 00:04 |
and the reason is we might be able to | ▶ 00:08 |
circle forever in the area over here | ▶ 00:10 |
without reaching the goal state, and | ▶ 00:13 |
that makes for a very deep tree, and until | ▶ 00:15 |
we reach the goal state, we won't even know | ▶ 00:17 |
it's the best possible action. | ▶ 00:20 |
So conventional planning might have difficulties | ▶ 00:22 |
with basically infinite loops. | ▶ 00:24 |
The third problem is that many states | ▶ 00:27 |
recur in the search. | ▶ 00:30 |
In A*, we were careful | ▶ 00:32 |
to visit each state only once, | ▶ 00:34 |
but here because the actions might | ▶ 00:37 |
carry you back here to the same state, | ▶ 00:40 |
C1 is, for example, over here and over here. | ▶ 00:42 |
You might find that many states in the tree | ▶ 00:45 |
might be visited many, many different times. | ▶ 00:47 |
Now if you reach a state, it doesn't really matter how you got there. | ▶ 00:50 |
Yet, the tree doesn't understand this, and it | ▶ 00:53 |
might expand states more than once. | ▶ 00:55 |
These are the 3 problems | ▶ 00:58 |
that are overcome by our policy method, | ▶ 01:00 |
and this motivates in part why calculating policies | ▶ 01:04 |
is so much better of an idea than using | ▶ 01:07 |
conventional planning in stochastic environments. | ▶ 01:09 |
So let's get back to the policy case. | ▶ 01:12 |
[Narrator] Let's look at the grid world, again, | ▶ 00:00 |
and let me ask you a question. | ▶ 00:03 |
I wish to find an optimal policy | ▶ 00:05 |
for all these states that | ▶ 00:07 |
with maximum probability leads me to | ▶ 00:09 |
the absorbing state plus 100, | ▶ 00:11 |
and as I just discussed, I assume | ▶ 00:14 |
there's 4 different actions, | ▶ 00:16 |
north, south, west, and east | ▶ 00:18 |
that succeed with probability 80% provided | ▶ 00:20 |
that the corresponding grid cell is actually attainable. | ▶ 00:23 |
I wish to know what is the optimal action | ▶ 00:26 |
in the corner state over here, A1, | ▶ 00:29 |
and I give you 4 choices, | ▶ 00:32 |
north, south, west, and east. | ▶ 00:34 |
[Narrator] And the answer is east. | ▶ 00:00 |
East in expectation transfers you to the right side, | ▶ 00:02 |
and you're one closer to your goal position. | ▶ 00:04 |
[Narrator] Let me ask the same question for the state over here, C1, | ▶ 00:00 |
which one is the optimal action for C1? | ▶ 00:03 |
[Narrator] And the answer is north. It gets you one step closer. | ▶ 00:00 |
There are 2 equally long paths, but over here | ▶ 00:03 |
you risk falling into the minus 100; therefore, you'd rather go north. | ▶ 00:05 |
[Narrator] The next question is challenging. | ▶ 00:00 |
Consider state C4, which one is the optimal action | ▶ 00:02 |
provided that you can run around as long as you want. | ▶ 00:06 |
There's no costs associated with steps, but | ▶ 00:09 |
you wish to maximize the probability of ending up in plus 100 over here. | ▶ 00:11 |
Think before you answer this question. | ▶ 00:15 |
[Narrator] And the answer is south. | ▶ 00:00 |
The reason why it's south is if we attempt to go south, | ▶ 00:02 |
an 80% probability we'll stay in the same cell. | ▶ 00:06 |
In fact, a 90% probability because we can't | ▶ 00:08 |
go south and we can't go east. | ▶ 00:10 |
In a 10% probability, we find ourselves over here which is a relatively | ▶ 00:12 |
safe state because we can actually go to the left side. | ▶ 00:15 |
If we were to go just west which is the intuitive answer, | ▶ 00:18 |
then there's a 10% chance we end up in the minus 100 absorbing state. | ▶ 00:22 |
You can convince yourself if you go south, | ▶ 00:27 |
find ourselves eventually in state C3, and then | ▶ 00:29 |
go west, west, north, north, east, east, east. | ▶ 00:32 |
You will never ever run risk of falling into the minus 100, and | ▶ 00:37 |
that argument is tricky and to convince ourselves | ▶ 00:42 |
let me ask the other hard question: | ▶ 00:45 |
so what shall we do in state B3 that's the optimal action? | ▶ 00:47 |
[Narrator] And the answer is west. | ▶ 00:00 |
If you're over here, and we go east, | ▶ 00:02 |
we'd likely end up with minus 100. | ▶ 00:04 |
If you go north, which seems to be the intuitive answer, | ▶ 00:06 |
there's a 10% chance we fall into the minus 100. | ▶ 00:09 |
However, if we go west, then there's absolutely | ▶ 00:12 |
no chance we fall into the minus 100. | ▶ 00:14 |
We might find ourselves over here. | ▶ 00:16 |
We might be in the same state. We might find ourselves over here, | ▶ 00:18 |
but from these states over here, | ▶ 00:20 |
there's safe policies that can safely avoid the minus 100. | ▶ 00:22 |
[Narrator] So even for the simple grid world, | ▶ 00:00 |
the optimal control policy assuming stochastic actions | ▶ 00:04 |
and no costs of moving, except for the final absorbing costs, | ▶ 00:08 |
is somewhat nontrivial. | ▶ 00:12 |
Take a second to look at this. | ▶ 00:14 |
Along here it seems pretty obvious, but | ▶ 00:17 |
for the state over here, B3, and for the state over here, C4, | ▶ 00:19 |
we choose an action that just avoids falling into the minus 100, | ▶ 00:24 |
which is more important than trying to make progress towards the plus 100. | ▶ 00:27 |
Now obviously this is not the general case of an MDP, | ▶ 00:32 |
and it's somewhat frustrating that we'd be willing to run into the wall | ▶ 00:35 |
just so as to avoid falling into the minus 100, | ▶ 00:38 |
and the reason why this seems unintuitive is | ▶ 00:41 |
because we're really forgetting the issue of costs. | ▶ 00:43 |
In normal life, there is a cost associated with moving. | ▶ 00:46 |
MDPs are general enough to have a cost factor, | ▶ 00:49 |
and the way we're going to denote costs | ▶ 00:53 |
is by defining a reward function over any possible state. | ▶ 00:56 |
Reaching the state A4 | ▶ 01:00 |
gives us plus 100, minus 100 for B4, | ▶ 01:03 |
and perhaps minus 3 for every other state, | ▶ 01:07 |
which reflects the fact that if you take a step somewhere | ▶ 01:10 |
we will pay minus 3. | ▶ 01:13 |
So this gives an incentive to shorten the final action sequence. | ▶ 01:15 |
So we're now ready to state the actual objective | ▶ 01:19 |
of an MDP, which is to maximize not | ▶ 01:23 |
just the momentary reward, but the sum | ▶ 01:25 |
of all future rewards, | ▶ 01:29 |
and we're going to write R_t to denote the fact that | ▶ 01:32 |
this reward is received at time t, and because | ▶ 01:35 |
our reward itself is stochastic, | ▶ 01:38 |
we have to compute the expectation over those, | ▶ 01:41 |
and that we seek to maximize. | ▶ 01:44 |
So we seek to find the policy that maximizes the expression over here. | ▶ 01:46 |
Now another interesting caveat is that sometimes people put | ▶ 01:50 |
a so-called discount factor into this equation | ▶ 01:54 |
with an exponent of t, where the discount factor might be 0.9, | ▶ 01:57 |
and what this does is it decays future reward | ▶ 02:01 |
relative to more immediate rewards, and it's | ▶ 02:04 |
kind of an alternative way to specify costs. | ▶ 02:07 |
So we can make this explicit by a negative reward per state | ▶ 02:10 |
or we can bring in a discount factor | ▶ 02:13 |
that discounts the plus 100 by the | ▶ 02:16 |
number of steps that it went by before it reached the plus 100. | ▶ 02:19 |
This also gives an incentive to get to the goal as fast as possible. | ▶ 02:23 |
The nice mathematical thing about the discount factor is | ▶ 02:27 |
it keeps this expectation bounded. | ▶ 02:30 |
It is easy to show that this expression over here | ▶ 02:33 |
will always be smaller than or equal to 1 over 1 minus gamma times the | ▶ 02:36 |
maximum absolute reward, | ▶ 02:41 |
which in this case would be plus 100. | ▶ 02:44 |
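Written out, the objective and the bound just mentioned are as follows, with R_t the reward received at time t and gamma the discount factor:

\[ \max_{\pi}\; E\Big[\sum_{t=0}^{\infty} \gamma^{t} R_t\Big], \qquad \Big|\sum_{t=0}^{\infty} \gamma^{t} R_t\Big| \;\le\; \frac{1}{1-\gamma}\,\max_s |R(s)|. \]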
The definition of the expected sum of future, | ▶ 00:00 |
possibly discounted, rewards that I just gave you | ▶ 00:03 |
allows me to define a value function. | ▶ 00:06 |
For each state S, my value of the state | ▶ 00:10 |
is the expected sum of future discounted rewards, | ▶ 00:13 |
provided that I start in state S | ▶ 00:17 |
and then execute policy pi. | ▶ 00:21 |
This expression looks really complex, | ▶ 00:23 |
but it really means something really simple, | ▶ 00:26 |
which is suppose we start in the state over here, | ▶ 00:28 |
and you get +100 over here, -100 over here. | ▶ 00:30 |
And suppose for now, every other state costs you -3. | ▶ 00:34 |
For any possible policy that assigns actions to | ▶ 00:38 |
the non-absorbing states, you can now | ▶ 00:41 |
simulate the agent for quite a while and compute empirically | ▶ 00:44 |
what is the average reward that is being received | ▶ 00:47 |
until you finally hit a goal state. | ▶ 00:52 |
For example, for the policy that you like, | ▶ 00:54 |
the value would, of course, for any state | ▶ 00:57 |
depend on how much you make progress towards the goal, | ▶ 01:00 |
or whether you bounce back and forth. | ▶ 01:02 |
In fact, in this state over here, you might bounce down | ▶ 01:04 |
and have to do the loop again. | ▶ 01:06 |
But there's a well defined expectation | ▶ 01:08 |
over any possible execution of the policy pi | ▶ 01:11 |
that is specific to each state and each policy pi. | ▶ 01:14 |
That's called a value. | ▶ 01:17 |
And value functions are absolutely essential to MDP, | ▶ 01:19 |
so the way we're going to plan is we're going to iterate | ▶ 01:21 |
and compute value functions, and it will turn out | ▶ 01:25 |
that by doing this, we're going to find better and better policies as well. | ▶ 01:28 |
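As a formula, the value of a state s under a policy pi, as just described, is:

\[ V^{\pi}(s) \;=\; E_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R_t \;\Big|\; s_0 = s\Big]. \]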
Before I dive into mathematical detail about | ▶ 00:00 |
value functions, let me just give you a tutorial. | ▶ 00:03 |
The value function is a potential function | ▶ 00:06 |
that leads from the goal location--in this case, the 100 in the upper right-- | ▶ 00:08 |
all the way into the space so that hill climbing | ▶ 00:13 |
in this potential function leads you on the shortest path to the goal. | ▶ 00:16 |
The algorithm is a recursive algorithm. | ▶ 00:20 |
It spreads the value through the space, as you can see in this animation, | ▶ 00:22 |
and after a number of iterations, it converges, | ▶ 00:25 |
and you have a grayscale value | ▶ 00:28 |
that really corresponds to the best way of getting to the goal. | ▶ 00:30 |
Hill climbing in that function gets you to the goal. | ▶ 00:34 |
You can simplify. | ▶ 00:36 |
Think about this as pouring a glass of milk | ▶ 00:38 |
into the 100th state and having the milk | ▶ 00:41 |
descend through the maze, and later on, | ▶ 00:44 |
when you go in the gradient of the milk flow, | ▶ 00:47 |
you will reach the goal in the optimal possible way. | ▶ 00:51 |
Let me tell you about a truly magical algorithm called value iteration. | ▶ 00:00 |
In value iteration, we recursively calculate the value function | ▶ 00:05 |
so that in the end, we get what's called the optimal value function. | ▶ 00:08 |
And from that, we can derive, | ▶ 00:12 |
look up, the optimal policy. | ▶ 00:14 |
Here's how it goes. | ▶ 00:18 |
Suppose we start with a value function of 0 everywhere | ▶ 00:20 |
except for the 2 absorbing states, whose value is +100 and -100. | ▶ 00:26 |
Then we can ask ourselves the question: is, for example, | ▶ 00:30 |
0 a good value for the field A3? | ▶ 00:33 |
And the answer is no, it isn't. It is somewhat inconsistent. | ▶ 00:37 |
We can compute a better value. | ▶ 00:40 |
In particular, we can understand that | ▶ 00:42 |
if we're in A3 and we choose to go east, | ▶ 00:46 |
then with 0.8 chance we should expect a value of 100. | ▶ 00:50 |
With 0.1 chance, we'll stay in the same state, | ▶ 00:55 |
in which case the value is -3. | ▶ 00:58 |
And with 0.1 chance, we're going to stay down here for -3. | ▶ 01:01 |
With the appropriate definition of value, | ▶ 01:05 |
we would get the following formula, | ▶ 01:08 |
which is 77. | ▶ 01:11 |
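Spelled out, the backup just described is the following, taking gamma equal to 1 here, which is consistent with the 77 on the slide, and using the initial values of 0 for the non-absorbing neighbors:

\[ V(A3) \;\leftarrow\; -3 + 0.8 \cdot 100 + 0.1 \cdot 0 + 0.1 \cdot 0 \;=\; 77. \]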
So, 77 is a better estimate of value | ▶ 01:13 |
for the state over here. | ▶ 01:18 |
And now that we've done it, we can ask ourselves the question | ▶ 01:20 |
is this a good value, or this a good value, or this a good value? | ▶ 01:22 |
And we can propagate value backwards | ▶ 01:25 |
in reverse order of action execution | ▶ 01:27 |
from the positive absorbing state through this grid world | ▶ 01:30 |
and fill every single state with a better value estimate | ▶ 01:34 |
than the one we assumed initially. | ▶ 01:38 |
If we do this for the grid over here and run value iteration | ▶ 01:42 |
through convergence, then we get the following value function. | ▶ 01:46 |
We get 93 over here. We're very close to the goal. | ▶ 01:50 |
89, 85, 81, 77, 73, 70, over here. | ▶ 01:53 |
This state will be worth 68, and this state is worth 47, | ▶ 01:58 |
and the reason why these are not so good is because | ▶ 02:02 |
we might stay quite a while in those | ▶ 02:04 |
before we'll be able to execute an action | ▶ 02:06 |
that gets us outside the state. | ▶ 02:09 |
Let me give you an algorithm that defines value iteration. | ▶ 02:12 |
We wish to estimate recursively the value of state S. | ▶ 02:15 |
And we do this based on a possible successor state | ▶ 02:20 |
S prime that we look up in the existing table. | ▶ 02:23 |
Now, actions A are non-deterministic. | ▶ 02:27 |
Therefore, we have to go through all possible S primes | ▶ 02:30 |
and weigh each outcome with the associated probability. | ▶ 02:34 |
The probability of reaching S prime given that we start in state S | ▶ 02:37 |
and apply action A. | ▶ 02:40 |
This expression is usually discounted by gamma, | ▶ 02:42 |
and we also add the reward or the costs of the state. | ▶ 02:46 |
And because there's multiple actions and it's up to us | ▶ 02:51 |
to choose the right action, we will maximize over all possible actions. | ▶ 02:54 |
See, we look at this equation, and it looks really complicated, | ▶ 03:00 |
but it's actually really simple. | ▶ 03:03 |
We compute a value recursively based on successor values | ▶ 03:06 |
plus the reward and minus the cost that it takes us to get us there. | ▶ 03:11 |
Because Mother Nature picks a successor state for us for any given action, | ▶ 03:15 |
we compute an expectation over the value of the successor state | ▶ 03:20 |
weighted by the corresponding probabilities, which is happening over here, | ▶ 03:25 |
and because we can choose our action, | ▶ 03:29 |
we maximize over all possible actions. | ▶ 03:32 |
Therefore, the max as opposed to the expectation on the left side over here. | ▶ 03:35 |
This is an equation that's called backup. | ▶ 03:39 |
In terminal states, we just assign R(s), | ▶ 03:43 |
and obviously, in the beginning of value iteration, | ▶ 03:48 |
these expressions are different, and we have to update. | ▶ 03:52 |
But as Bellman has shown a while ago, | ▶ 03:55 |
this process of updates converges. | ▶ 03:58 |
After convergence, this assignment over here | ▶ 04:01 |
is replaced by the equality sign, | ▶ 04:05 |
and when this equality holds true, | ▶ 04:07 |
we have what is called a Bellman equality or Bellman equation. | ▶ 04:10 |
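Written out in symbols, the backup just described is (V is the value table, R(s) the reward or cost of state s, and P(s' | s, a) the transition probability; R(s) can sit outside the max because it does not depend on the action):

    V(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, V(s') \qquad \text{for non-terminal } s
    V(s) \leftarrow R(s) \qquad \text{for terminal } s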
And that's all there is to know to compute values. | ▶ 04:16 |
If you assign this specific equation over and over again to each state, | ▶ 04:19 |
eventually you get a value function that looks just like this, | ▶ 04:24 |
where the value really corresponds to the optimal future | ▶ 04:27 |
cost-reward trade-off that you can achieve | ▶ 04:30 |
if you act optimally in any given state over here. | ▶ 04:33 |
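As a concrete sketch (not code from the lecture), here is a small Python program that runs this backup to convergence on the 3-row, 4-column grid world used above: +100 and -100 absorbing states, a wall assumed at B2, a -3 step cost, and the 0.8/0.1/0.1 action noise. With NOISE set to 0.0 the converged values come out as 100 minus 3 per step to the goal, matching the 97, 94, 91, 88, 85 answers in the quizzes that follow; with the default noise it produces the kind of value function described above (highest next to the +100, lowest next to the -100).

    # Value iteration for the grid world in this unit (a sketch, not the
    # lecture's own code).  Rows are A (top), B, C (bottom); columns 1..4.
    # A4 holds the +100 absorbing state, B4 the -100 one, and B2 is assumed
    # to be a wall, matching the standard 4x3 grid world.
    GAMMA = 1.0
    STEP_REWARD = -3.0        # reward (cost) of every non-absorbing state
    NOISE = 0.1               # chance of slipping to each side; 0.0 = deterministic

    rows, cols = "ABC", (1, 2, 3, 4)
    terminals = {("A", 4): 100.0, ("B", 4): -100.0}
    walls = {("B", 2)}
    states = [(r, c) for r in rows for c in cols if (r, c) not in walls]

    moves = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
    left_of = {"N": "W", "W": "S", "S": "E", "E": "N"}
    right_of = {v: k for k, v in left_of.items()}

    def step(s, direction):
        """Deterministic move; bumping into a wall or the edge means staying put."""
        dr, dc = moves[direction]
        r, c = rows.index(s[0]) + dr, s[1] + dc
        if 0 <= r < 3 and 1 <= c <= 4 and (rows[r], c) not in walls:
            return (rows[r], c)
        return s

    def transitions(s, a):
        """P(s' | s, a): the intended direction, plus slips to the left and right."""
        return [(1 - 2 * NOISE, step(s, a)),
                (NOISE, step(s, left_of[a])),
                (NOISE, step(s, right_of[a]))]

    V = {s: terminals.get(s, 0.0) for s in states}
    for _ in range(200):                          # plenty of sweeps for this grid
        V = {s: V[s] if s in terminals else
                STEP_REWARD + GAMMA * max(
                    sum(p * V[s2] for p, s2 in transitions(s, a)) for a in moves)
             for s in states}

    for r in rows:
        print("  ".join(f"{V[(r, c)]:7.1f}" if (r, c) in V else "   wall"
                        for c in cols))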
Let me take my example world | ▶ 00:00 |
and apply value iteration in a quiz. | ▶ 00:02 |
As before, assume the value is initialized as | ▶ 00:06 |
+100 and -100 for the absorbing states | ▶ 00:09 |
and 0 everywhere else. | ▶ 00:12 |
And let me make the assumption that our transition probability is deterministic. | ▶ 00:14 |
That is, if I execute the east action from this state over here, | ▶ 00:18 |
with probability 1 I end up over here; | ▶ 00:22 |
and if I execute the north action over here, with probability 1, | ▶ 00:24 |
I will find myself in the same state as before. | ▶ 00:28 |
There is no uncertainty anymore. | ▶ 00:30 |
That's really important for now, just for this one quiz. | ▶ 00:32 |
I'll also assume gamma equals 1, | ▶ 00:35 |
just to make things a little bit simpler. | ▶ 00:38 |
And the cost over here is -3 unless you reach an absorbing state. | ▶ 00:40 |
What I'd like to know, after a single backup, | ▶ 00:44 |
what's the value of A3? | ▶ 00:47 |
And the answer is 97. | ▶ 00:00 |
It's easy to see that the action east | ▶ 00:02 |
is the value maximizing action. | ▶ 00:04 |
Let's plug in east over here. Gamma equals 1. | ▶ 00:06 |
There is a single successor state with value 100, | ▶ 00:09 |
so if we have 100 over here minus the 3 | ▶ 00:12 |
that it costs us to get there, we get 97. | ▶ 00:15 |
Let's run value iteration again, and let me ask | ▶ 00:00 |
what's the value for B3, assuming that we already updated | ▶ 00:03 |
the value for A3 as shown over here. | ▶ 00:06 |
And again, making use of the observation that our | ▶ 00:00 |
state transition function is deterministic, | ▶ 00:02 |
we get 94, and the logic is the same as before. | ▶ 00:04 |
The optimal action here is going north, | ▶ 00:07 |
which we will succeed with the probability 1. | ▶ 00:09 |
Therefore, we can use the value recursively from A3 | ▶ 00:12 |
to propagate back to B3. | ▶ 00:16 |
97 - 3 gives us 94. | ▶ 00:18 |
And finally, I would like to know what's the value of | ▶ 00:00 |
C1, the field down here, after we ran value iteration | ▶ 00:02 |
over and over again all the way to a convergence. | ▶ 00:06 |
Again, gamma equals 1. State transition function is deterministic. | ▶ 00:09 |
And the answer is easily obtained if you just | ▶ 00:00 |
subtract 3 for each step. | ▶ 00:02 |
We get 88 and 85 over here. | ▶ 00:05 |
We could also reach the same value going around here. | ▶ 00:08 |
So, 85 would have been the right answer, | ▶ 00:11 |
and this will be the value function after convergence. | ▶ 00:13 |
It's beautiful to see that the value function is effectively | ▶ 00:16 |
100 minus 3 times the distance | ▶ 00:24 |
to the positive absorbing state. | ▶ 00:26 |
So, we have 97, 94, 91, 88, 85 and so on. | ▶ 00:28 |
This is a degenerate case in which | ▶ 00:32 |
we have a deterministic state transition function; | ▶ 00:34 |
it gets more tricky to calculate in the stochastic case. | ▶ 00:36 |
Let me ask the same question for the stochastic case. | ▶ 00:00 |
We have the same world as before, | ▶ 00:04 |
and actions have stochastic outcomes. | ▶ 00:06 |
With probability 0.8, we get the action we commanded; | ▶ 00:09 |
otherwise we go left or right. | ▶ 00:12 |
And assuming that the initial values are all 0, | ▶ 00:15 |
calculate for me for a single backup the value of A3. | ▶ 00:17 |
This should look familiar from the previous material. | ▶ 00:00 |
It's 77, and the reason is in A3, | ▶ 00:03 |
we have an 80% chance | ▶ 00:06 |
for the action going east to reach 100. | ▶ 00:09 |
But the remaining 20%, we either stay in place or go to the field down here, | ▶ 00:14 |
both of which have an initial value of 0. | ▶ 00:18 |
That gives us 0, but we have to subtract the cost of 3, | ▶ 00:20 |
and that gives us 80 - 3 = 77. | ▶ 00:23 |
It's also easy to verify that any of the other actions have lower values. | ▶ 00:27 |
For example, the value of going west will be | ▶ 00:30 |
0 in all possible outcomes given the current value function | ▶ 00:34 |
minus 3, so the value of going west would right now | ▶ 00:38 |
be estimated as -3, and 77 is larger than -3. | ▶ 00:41 |
Therefore, we'll pick 77 as the action that maximizes | ▶ 00:45 |
the updated equation over here. | ▶ 00:49 |
And here's a somewhat non-trivial quiz. | ▶ 00:00 |
For the state B3, calculate the value function | ▶ 00:02 |
assuming that we have a value function as shown over here | ▶ 00:06 |
and all the open states have an assumed value of 0, | ▶ 00:10 |
because we're still in the beginning of our value update. | ▶ 00:13 |
What would be our very first value function for B3 | ▶ 00:16 |
that we compute based on the values shown over here? | ▶ 00:19 |
And the answer is 48.6. | ▶ 00:00 |
And obviously, it's not quite as trivial as the calculation before | ▶ 00:05 |
because there's 2 competing actions. | ▶ 00:08 |
We can try to go north, which gives us the 77 | ▶ 00:11 |
but risks the chance of falling into the -100. | ▶ 00:14 |
Or we can go west, as before, which gives us a much smaller chance | ▶ 00:17 |
to reach 77, but avoids the -100. | ▶ 00:20 |
Let's do both and see which one is better. | ▶ 00:23 |
If we go north, we have a 0.8 chance of reaching 77. | ▶ 00:25 |
There's now a 10% chance of paying -100 | ▶ 00:31 |
and a 10% chance of staying in the same location, | ▶ 00:36 |
which at this point is still a value of 0. | ▶ 00:39 |
We subtract our costs of 3, and we get 61.6 | ▶ 00:42 |
- 10 - 3 = 48.6. | ▶ 00:46 |
Let's check the west action value. | ▶ 00:51 |
We reach the 77 with probability 0.1 | ▶ 00:54 |
with 0.8 chance we stay in the same cell, | ▶ 00:58 |
which has the value of 0, | ▶ 01:01 |
and with 0.1 chance, we end up down here, | ▶ 01:03 |
which also has a value of 0. | ▶ 01:06 |
We subtract our cost of 3, | ▶ 01:08 |
and that gives us 7.7 - 3 = 4.7. | ▶ 01:11 |
At this point, going west is vastly inferior | ▶ 01:16 |
to going north, and the reason is we already propagated | ▶ 01:20 |
a great value of 77 for this cell over here, | ▶ 01:22 |
whereas this one is still set to 0. | ▶ 01:25 |
So, we will set it to 48.6. | ▶ 01:28 |
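Putting the two candidate actions side by side (writing Q(s, a) informally for the expression being maximized, with gamma = 1 here):

    Q(B3, North) = -3 + 0.8 * 77 + 0.1 * (-100) + 0.1 * 0 = -3 + 61.6 - 10 = 48.6
    Q(B3, West)  = -3 + 0.1 * 77 + 0.8 * 0      + 0.1 * 0 = -3 + 7.7       =  4.7
    V(B3) <- max(48.6, 4.7) = 48.6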
So, now that we have a value backup function | ▶ 00:00 |
that we discussed in depth, the question now becomes | ▶ 00:03 |
what's the optimal policy? | ▶ 00:05 |
And it turns out this value backup function defines | ▶ 00:07 |
the optimal policy and makes it completely obvious | ▶ 00:10 |
which action to pick: | ▶ 00:12 |
it's just the action that maximizes this expression over here. | ▶ 00:14 |
For any state S, any value function V, | ▶ 00:18 |
we can define a policy, | ▶ 00:22 |
and that's the one that picks the action under argmax | ▶ 00:24 |
that maximizes the expression over here. | ▶ 00:28 |
For the maximization, we can safely drop gamma and R(s). | ▶ 00:31 |
Baked in the value iteration function was already | ▶ 00:35 |
an action choice that picks the best action. | ▶ 00:38 |
We just made it explicit. | ▶ 00:41 |
This is the way of backing up values, | ▶ 00:43 |
and once values have been backed up, | ▶ 00:45 |
this is the way to find the optimal thing to do. | ▶ 00:48 |
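In code, and reusing the names from the value-iteration sketch a few segments back, the policy extraction step is just the following (gamma and R(s) are dropped, per the remark above, because they don't change which action wins the argmax):

    # Greedy policy extraction from a converged value function V
    # (reuses moves, transitions, terminals, states and V from the earlier sketch).
    def best_action(s):
        if s in terminals:
            return None            # no action to choose in an absorbing state
        return max(moves, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))

    policy = {s: best_action(s) for s in states}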
I'd like to show you some value function after convergence | ▶ 00:00 |
and the corresponding policies. | ▶ 00:04 |
If we assume gamma = 1 and our cost for the non-absorbing state | ▶ 00:07 |
equals -3, as before, we get the following approximate value function | ▶ 00:11 |
after convergence, and the corresponding policy looks as follows. | ▶ 00:15 |
Up here we go right until we hit the absorbing state. | ▶ 00:21 |
Over here we prefer to go north. | ▶ 00:25 |
Here we go left, and here we go north again. | ▶ 00:27 |
I left the policy open for the absorbing states | ▶ 00:31 |
because there's no action to be chosen here. | ▶ 00:33 |
This is a situation where | ▶ 00:36 |
the risk of falling into the -100 is balanced by | ▶ 00:39 |
the time spent going around. | ▶ 00:42 |
We have an action over here in this visible state here | ▶ 00:45 |
that risks the 10% chance of falling into the -100. | ▶ 00:48 |
But that's preferable under the cost model of -3 | ▶ 00:52 |
to the action of going south. | ▶ 00:55 |
Now, this all changes if we assume a cost of 0 | ▶ 00:58 |
for all the states over here, in which case, | ▶ 01:02 |
the value function after convergence looks interesting. | ▶ 01:05 |
And with some thought, you realize it's exactly the right one. | ▶ 01:09 |
Each value is exactly 100, | ▶ 01:13 |
and the reason is with a cost of 0, | ▶ 01:16 |
it doesn't matter how long we move around. | ▶ 01:18 |
Eventually we can guarantee in this case we reach the 100, | ▶ 01:21 |
therefore each value after backups will become 100. | ▶ 01:24 |
The corresponding policy is the one we discussed before. | ▶ 01:28 |
And the crucial thing here is that for this state, | ▶ 01:32 |
we go south, if you're willing to wait the time. | ▶ 01:35 |
For this state over here, we go west, | ▶ 01:38 |
willing to wait the time so as to avoid | ▶ 01:40 |
falling into the -100. | ▶ 01:42 |
And all the other states resolve | ▶ 01:44 |
exactly as you would expect them to resolve | ▶ 01:46 |
as shown over here. | ▶ 01:49 |
If we set the costs to -200, | ▶ 01:52 |
so each step itself is even more expensive | ▶ 01:55 |
than falling into this ditch over here, | ▶ 01:58 |
we get a value function that's strongly negative everywhere | ▶ 02:02 |
with this being the most negative state. | ▶ 02:05 |
But more interesting is the policy. | ▶ 02:08 |
This is a situation where our agent tries to end the game | ▶ 02:11 |
as fast as possible so as not to endure the penalty of -200. | ▶ 02:14 |
And even over here, where it throws itself into the -100, | ▶ 02:18 |
it's still better than going north and taking an extra 200 as a penalty | ▶ 02:21 |
before finally reaching the +100. | ▶ 02:25 |
Similarly, over here we go straight north, | ▶ 02:27 |
and over here we go as fast as possible | ▶ 02:30 |
to the state over here. | ▶ 02:32 |
Now, this is an extreme case. | ▶ 02:35 |
I don't know why it would make sense to set a penalty for living | ▶ 02:37 |
that is so negative that even a negative death is better than living, | ▶ 02:39 |
but certainly that's the result of running value iteration in this extreme case. | ▶ 02:45 |
So, we've learned quite a bit so far. | ▶ 00:00 |
We've learned about Markov Decision Processes. | ▶ 00:02 |
They are fully observable, with a set of states | ▶ 00:06 |
and corresponding actions that have stochastic action effects, | ▶ 00:10 |
characterized by a conditional probability distribution P of S prime | ▶ 00:14 |
given that we apply action A in state S. | ▶ 00:19 |
We seek to maximize a reward function | ▶ 00:22 |
that we define over states. | ▶ 00:25 |
You can equally define it over state-action pairs. | ▶ 00:27 |
The objective was to maximize the expected | ▶ 00:30 |
future accumulated and discounted rewards, | ▶ 00:33 |
as shown by this formula over here. | ▶ 00:36 |
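In standard notation, the objective being referred to is the expected sum of discounted rewards (the same objective Peter writes out in a later unit):

    E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t)\right]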
The key to solving them was called value iteration | ▶ 00:38 |
where we assigned a value to each state. | ▶ 00:42 |
There's alternative techniques that have assigned values | ▶ 00:45 |
to state action pairs, often called Q(s, a), | ▶ 00:47 |
but we didn't really consider this so far. | ▶ 00:50 |
We defined a recursive update rule | ▶ 00:53 |
to update V(s) that was very logical | ▶ 00:55 |
after we understood that we have an action choice, | ▶ 00:58 |
but nature chooses for us the outcome of the action | ▶ 01:00 |
in a stochastic transition probability over here. | ▶ 01:03 |
And then we observed that value iteration converges, | ▶ 01:07 |
and we're able to define a policy by taking | ▶ 01:10 |
the argmax in the value iteration expression, | ▶ 01:12 |
which I don't spell out over here. | ▶ 01:16 |
This is a beautiful framework. | ▶ 01:18 |
It's really different from the planning we did before | ▶ 01:20 |
because of the stochasticity of the action effects. | ▶ 01:22 |
Rather than making a single sequence of states and actions, | ▶ 01:26 |
as would be the case in deterministic planning, | ▶ 01:29 |
now we make an entire field, a so-called policy, | ▶ 01:31 |
that assigns an action to every possible state. | ▶ 01:35 |
And we compute this using a technique called value iteration | ▶ 01:39 |
that spreads value in reverse order through the field of states. | ▶ 01:42 |
So far, we talked about the fully observable case, | ▶ 00:00 |
and I'd like to get back to the more general case | ▶ 00:04 |
of partial observability. | ▶ 00:06 |
Now, to warn you, I don't think it's worthwhile in this class | ▶ 00:08 |
to go into full depth about the type of techniques | ▶ 00:11 |
that are being used for planning under uncertainty | ▶ 00:13 |
if the world is partially observable. | ▶ 00:17 |
But I'd like to give you a good flavor | ▶ 00:19 |
about what it really means to plan in information spaces, | ▶ 00:22 |
and to reflect the types of methods that are being brought to bear | ▶ 00:26 |
in planning under uncertainty. | ▶ 00:30 |
Like my Stanford class, I don't go into details here either | ▶ 00:32 |
because the details are much more subject to more specialized classes, | ▶ 00:36 |
but I hope you can enjoy the type of flavor of materials | ▶ 00:39 |
that you're going to get to see in the next couple of minutes. | ▶ 00:42 |
So we've now learned about fully observable environments | ▶ 00:00 |
and planning in stochastic environments with MDPs. | ▶ 00:04 |
I'd like to say a few words about partially observable environments, | ▶ 00:08 |
or POMDPs--which I won't go into in depth; the material is relatively complex. | ▶ 00:12 |
But I'd like to give you a feeling for why this is important, and what type of problems | ▶ 00:17 |
you can solve with this, that you could never possibly solve with MDPs. | ▶ 00:21 |
So, for example, POMDPs address problems of optimal exploration versus exploitation, | ▶ 00:25 |
where some of the actions might be information-gathering actions; | ▶ 00:32 |
whereas others might be goal-driven actions. | ▶ 00:35 |
That's not really possible in the MDPs because the state space is fully observable | ▶ 00:39 |
and therefore, there is no notion of information gathering. | ▶ 00:43 |
I'd like to illustrate the problem, using a very simple environment | ▶ 00:00 |
that looks as follows: | ▶ 00:05 |
Suppose you live in a world like this; | ▶ 00:07 |
and your agent starts over here, | ▶ 00:09 |
and there are 2 possible outcomes. | ▶ 00:11 |
You can exit the maze over here-- | ▶ 00:13 |
where you get a plus 100-- | ▶ 00:16 |
or you can exit the maze over here, | ▶ 00:18 |
where you receive a minus 100. | ▶ 00:20 |
Now, in a fully observable case, | ▶ 00:22 |
and in a deterministic case, | ▶ 00:25 |
the optimal plan might look something like this; | ▶ 00:28 |
and whether or not it goes straight over here depends on the details. | ▶ 00:32 |
For example, whether the agent has momentum or not. | ▶ 00:35 |
But you'll find a single sequence of actions and states that might cut the corners, | ▶ 00:38 |
as close as possible, to reach the plus 100 as fast as possible. | ▶ 00:44 |
That's conventional planning. | ▶ 00:47 |
Let's contrast this with the case we just learned about, | ▶ 00:50 |
which is the fully observable but stochastic case. | ▶ 00:53 |
We just learned that the best thing to compute is a policy | ▶ 00:57 |
that assigns to every possible state, an optimal action; | ▶ 01:01 |
and simplified speaking, this might look as follows: | ▶ 01:04 |
Where each of these arrows corresponds | ▶ 01:07 |
to a sample control policy. | ▶ 01:09 |
And those are defined even in parts of the state space that are far away. | ▶ 01:12 |
So this would be an example of a control policy | ▶ 01:16 |
where all the arrows gradually point you over here. | ▶ 01:18 |
We just learned about this, using MDPs and value iteration. | ▶ 01:22 |
The case I really want to get at is the case of partial observability-- | ▶ 01:25 |
which we will eventually solve, using a technique called POMDP. | ▶ 01:29 |
And in this case, I'm going to keep the location of the agent in the maze observable. | ▶ 01:32 |
The part I'm going to make unobservable is where, exactly, I receive plus 100 | ▶ 01:37 |
and where I receive minus 100. | ▶ 01:43 |
Instead, I'm going to put a sign over here | ▶ 01:45 |
that tells the agent where to expect plus 100, | ▶ 01:48 |
and where to expect minus 100. | ▶ 01:51 |
So the optimum policy would be to first move to the sign, | ▶ 01:53 |
read the sign; | ▶ 01:57 |
and then return and go to the corresponding exit, | ▶ 01:59 |
for which the agent now knows where to receive plus 100. | ▶ 02:03 |
So, for example, if this exit over here gives us plus 100, | ▶ 02:07 |
the sign will say Left. | ▶ 02:10 |
If this exit over here gives us plus 100, the sign will say Right. | ▶ 02:12 |
What makes this environment interesting is | ▶ 02:15 |
that if the agent knew which exit would have plus 100, | ▶ 02:17 |
it would go north from its starting position. | ▶ 02:21 |
It goes south exclusively to gather information. | ▶ 02:23 |
So the question becomes: Can we devise a method for planning | ▶ 02:26 |
that understands that, even though we'd wish to receive the plus 100 as the best exit, | ▶ 02:30 |
there's a detour necessary to gather information. | ▶ 02:36 |
So here's a solution that doesn't work: | ▶ 02:40 |
Obviously, the agent might be in 2 different worlds--and it doesn't know. | ▶ 02:42 |
It might be in the world where there's plus 100 on the Left side | ▶ 02:46 |
or it might be in the world with plus 100 on the Right side, | ▶ 02:49 |
with minus 100 in the corresponding other exit. | ▶ 02:51 |
What doesn't work is you can't solve the problem for both of these cases | ▶ 02:53 |
and then put these solutions together-- | ▶ 02:59 |
for example, by averaging. | ▶ 03:00 |
The reason why this doesn't work is | ▶ 03:02 |
this agent, after averaging, would go north. | ▶ 03:04 |
It would never have the idea that it is worthwhile to go south, | ▶ 03:08 |
read the sign, and then return to the optimal exit. | ▶ 03:11 |
When it arrives, finally, at the intersection over here, | ▶ 03:15 |
it doesn't really know what to do. | ▶ 03:18 |
So here is the situation that does work-- | ▶ 03:20 |
and it's related to information space or belief space. | ▶ 03:22 |
In the information space or belief space representation you do planning, | ▶ 03:25 |
not in the set of physical world states, | ▶ 03:29 |
but in what you might know about those states. | ▶ 03:31 |
And if you're really honest, you find out that there's a multitude of belief states. | ▶ 03:34 |
Here's the initial one, where you just don't know where to receive 100. | ▶ 03:39 |
Now, if you move around and either reach one of these exits or the sign, | ▶ 03:44 |
you will suddenly know where to receive 100. | ▶ 03:48 |
And that makes your belief state change-- | ▶ 03:51 |
and that makes your belief state change. | ▶ 03:55 |
So, for example, if you find out that 100 is Left, | ▶ 03:58 |
then your belief state will look like this-- | ▶ 04:01 |
where the ambiguity is now resolved. | ▶ 04:03 |
Now, how would you jump from this state space to this state space? | ▶ 04:05 |
The answer is: when you read the sign, | ▶ 04:09 |
there's a 50 percent chance that the location over here | ▶ 04:12 |
will result in a transition to the location over here-- | ▶ 04:16 |
50 percent because there's a 50 percent chance that the plus 100 is on the Left. | ▶ 04:19 |
There's also a 50 percent chance that the plus 100 is on the Right, | ▶ 04:23 |
so the transition over here is stochastic; | ▶ 04:28 |
and with 50 percent chance, it will result in a transition over here. | ▶ 04:31 |
If we now do the MDP trick in this new belief space, | ▶ 04:35 |
and you pour water in here, it kind of flows through here | ▶ 04:39 |
and creates all these gradients--as we had before. | ▶ 04:44 |
We do the same over here, and all these gradients that are being created | ▶ 04:48 |
point to this exit on the Left side. | ▶ 04:51 |
Then, eventually, this water will flow through here and create gradients like this; | ▶ 04:53 |
and then flow back through here, where it creates gradients like this. | ▶ 04:58 |
So the value function is plus 100 over here, plus 100 over here | ▶ 05:02 |
that gradually decrease down here, down here; | ▶ 05:06 |
and then gradually further decrease over here-- | ▶ 05:08 |
and even further decrease over there, so we've got arrows like these. | ▶ 05:11 |
And that shows you that in this new belief space, you can find a solution. | ▶ 05:15 |
In fact, you can use value iteration--MDP's value iteration-- | ▶ 05:20 |
in this new space to find a solution to this really complicated | ▶ 05:24 |
partially observable planning process. | ▶ 05:28 |
And the solution--just to reiterate-- | ▶ 05:30 |
we'll suggest: Go south first, | ▶ 05:33 |
read the sign, | ▶ 05:35 |
expose yourself to the random position to the Left or Right world | ▶ 05:37 |
in which you are now able to reach the plus 100 with absolute confidence. | ▶ 05:41 |
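Here is a small sketch (not from the lecture) of the belief-space bookkeeping for this maze: the observable position is paired with what the agent knows about the +100 exit, and reading the sign is the one place where that knowledge changes, with the 50/50 split described above. The position name "sign" and the belief labels are made up for illustration.

    # Belief-space sketch for the sign-reading maze (illustrative names only).
    # A belief state is a pair (position, knowledge), where knowledge is what
    # the agent knows about the location of the +100 exit.
    BELIEFS = ("unknown", "plus100-left", "plus100-right")

    def belief_update(position, knowledge):
        """How the knowledge component changes when the agent reaches a position."""
        if position == "sign" and knowledge == "unknown":
            # Reading the sign resolves the ambiguity: a 50/50 stochastic jump,
            # because a priori the +100 is equally likely to be left or right.
            return [(0.5, "plus100-left"), (0.5, "plus100-right")]
        return [(1.0, knowledge)]          # anywhere else, the belief is unchanged

    print(belief_update("sign", "unknown"))
    # -> [(0.5, 'plus100-left'), (0.5, 'plus100-right')]
    # Value iteration can now be run over (position, knowledge) pairs exactly as
    # in the MDP case, and "go read the sign first" falls out as the optimal plan.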
So now we have learned pretty much all there is to know about Planning Under Uncertainty. | ▶ 00:00 |
We talked about Markov Decision Processes. | ▶ 00:05 |
We explained the concept of information spaces; | ▶ 00:07 |
and what's better, you can actually apply it. | ▶ 00:09 |
You can apply it to a huge number of problems | ▶ 00:12 |
where the outcomes of actions are uncertain. | ▶ 00:14 |
There is a huge literature about robot motion planning. | ▶ 00:17 |
Here are some examples of robots moving through our environments | ▶ 00:20 |
that use MDP-style planning techniques; | ▶ 00:23 |
and these methods have become vastly popular in artificial intelligence-- | ▶ 00:26 |
so I'm really glad you now understand the basics | ▶ 00:29 |
of those and you can apply them yourself. | ▶ 00:31 |
Hi--welcome back. | ▶ 00:00 |
You just learned how Markov Decision Processes | ▶ 00:02 |
can be used to determine an optimal sequence | ▶ 00:05 |
of actions for an agent in a stochastic environment. | ▶ 00:07 |
And that is, an agent that knows the correct model of the environment | ▶ 00:11 |
can navigate, finding its ways to the positive | ▶ 00:15 |
rewards and avoiding the negative penalties. | ▶ 00:18 |
But it can only do that if it knows where the rewards and penalties are. | ▶ 00:21 |
In this Unit, we'll see how a technique | ▶ 00:25 |
called reinforcement learning | ▶ 00:28 |
can guide the agent to an optimal policy, | ▶ 00:30 |
even though it doesn't know anything about the rewards when it starts out. | ▶ 00:33 |
For example, in the 4 by 3 GridWorld, | ▶ 00:00 |
what if we don't know where the plus 1 and minus 1 rewards are when we start out? | ▶ 00:03 |
A reinforcement learning agent can learn to explore the territory, | ▶ 00:08 |
find where the rewards are, | ▶ 00:13 |
and then learn an optimal policy. | ▶ 00:15 |
Whereas, an MDP solver can only do that | ▶ 00:17 |
once it knows exactly where the rewards are. | ▶ 00:19 |
Now, this idea of wandering around and then finding a plus 1 or a minus 1 | ▶ 00:22 |
is analogous to many forms of games, such as backgammon-- | ▶ 00:27 |
and here's an example: backgammon is a stochastic game; | ▶ 00:32 |
and at the end, you either win or lose. | ▶ 00:35 |
And in the 1990s, Gerald Tesauro at IBM | ▶ 00:38 |
wrote a program to play backgammon. | ▶ 00:40 |
His first attempt tried to learn the utility of a Game state, U of S, | ▶ 00:43 |
using examples that were labelled by human expert backgammon players. | ▶ 00:49 |
But this was tedious work for the experts, | ▶ 00:53 |
so only a small number of states were labelled. | ▶ 00:55 |
The program tried to generalize from that, | ▶ 00:58 |
using supervised learning, | ▶ 01:00 |
and was not able to perform very well. | ▶ 01:02 |
So Tesauro's second attempt used no human expertise and no supervision. | ▶ 01:04 |
Instead, he had 1 copy of his program play against another; | ▶ 01:11 |
and at the end of the game, the winner got a positive reward, | ▶ 01:14 |
and the loser, a negative. | ▶ 01:18 |
So he used reinforcement learning; | ▶ 01:20 |
he backed up that knowledge throughout the Game states, | ▶ 01:22 |
and he was able to arrive at a function | ▶ 01:25 |
that had no input from human expert players, | ▶ 01:27 |
but, still, was able to perform | ▶ 01:30 |
at the level of the very best players in the world. | ▶ 01:32 |
He was able to do this, after learning from examples of about 200,000 games. | ▶ 01:35 |
Now, that may seem like a lot-- | ▶ 01:41 |
but it really only covers about 1 trillionth | ▶ 01:43 |
of the total state space of backgammon. | ▶ 01:46 |
Now, here's another example: | ▶ 01:49 |
This is a remote controlled helicopter | ▶ 01:51 |
that Professor Andrew Ng at Stanford trained, | ▶ 01:54 |
using reinforcement learning; | ▶ 01:56 |
and the helicopter--oh--oh, sorry-- | ▶ 01:58 |
I made a mistake--I put this picture upside down | ▶ 02:00 |
because--really, Ng trained the helicopter | ▶ 02:04 |
to be able to fly fancy maneuvers--like flying upside down. | ▶ 02:08 |
And he did that by looking at only a few hours | ▶ 02:11 |
of training data from expert helicopter pilots | ▶ 02:15 |
who would take over the remote controls, | ▶ 02:18 |
pilot the helicopter--and those would all be recorded-- | ▶ 02:20 |
and then, you would get rewards from when it did something good, | ▶ 02:23 |
or when it did something bad; | ▶ 02:27 |
and Ng was able to use reinforcement learning | ▶ 02:29 |
to build an automated helicopter pilot, | ▶ 02:32 |
just from those training examples. | ▶ 02:34 |
And that automated pilot, too, can perform tricks | ▶ 02:36 |
that only a handful of humans are capable of performing. | ▶ 02:39 |
But enough of this still picture--let's watch a video of Ng's helicopters in action. | ▶ 02:43 |
[Stanford University Autonomous Helicopter] | ▶ 02:49 |
[sound of helicopter flying] [Chaos] | ▶ 02:52 |
[Stanford University Autonomous Helicopter] | ▶ 03:05 |
Let's stop and review the 3 main forms of learning. | ▶ 00:00 |
We have supervised learning, | ▶ 00:03 |
in which the training set | ▶ 00:05 |
is a bunch of input/output pairs-- | ▶ 00:07 |
X1,Y1; X2, Y2; et cetera-- | ▶ 00:10 |
in which we try to produce a function: | ▶ 00:15 |
y equals f of x-- | ▶ 00:18 |
and so the learning is producing this function, f. | ▶ 00:21 |
Then we have unsupervised learning, | ▶ 00:24 |
in which we're given just a set of data points-- | ▶ 00:27 |
X1, X2, and so on-- | ▶ 00:29 |
and each of these points, maybe, has many | ▶ 00:33 |
dimensions, many features. | ▶ 00:35 |
And what we're trying to learn is some patterns in that-- | ▶ 00:37 |
some clusters of these data-- | ▶ 00:39 |
or you could just say what we're trying to learn | ▶ 00:42 |
is a probability distribution | ▶ 00:45 |
or what's the probability that this | ▶ 00:47 |
random variable will have particular values; | ▶ 00:49 |
and learn something interesting from that. | ▶ 00:52 |
In this Unit, we're introducing the third type of learning-- | ▶ 00:55 |
reinforcement learning-- | ▶ 00:58 |
in which we have a sequence of action and state transitions. | ▶ 01:01 |
So: state and action, state and action--and so on. | ▶ 01:05 |
And at some point, we have some rewards associated with these. | ▶ 01:11 |
So there's a reward, and maybe not a reward for this state; | ▶ 01:16 |
and then another reward for this state-- | ▶ 01:21 |
and the rewards are just scalar numbers, positive or negative numbers. | ▶ 01:23 |
What we're trying to learn here is: | ▶ 01:27 |
at optimal policy, what's the right thing to do in any of the states? | ▶ 01:29 |
Let's show some examples of machine learning problems | ▶ 00:00 |
and I want you to tell me, for each one, | ▶ 00:03 |
whether it's best addressed with supervised learning, | ▶ 00:05 |
unsupervised learning, | ▶ 00:08 |
or reinforcement learning. | ▶ 00:10 |
And the first example is speech recognition-- | ▶ 00:12 |
where I have examples of voice recordings, | ▶ 00:16 |
and then the corresponding text transcripts for each of those recordings; | ▶ 00:19 |
and from them, I try to learn a model of language. | ▶ 00:23 |
Is that supervised, unsupervised or reinforcement? | ▶ 00:26 |
Next example is analyzing the spectral emissions of stars | ▶ 00:32 |
and trying to find clusters of stars of similar types | ▶ 00:37 |
that may be of interest to astronomers. | ▶ 00:41 |
Would that be supervised, unsupervised or reinforcement? | ▶ 00:44 |
The data here would just consist of: | ▶ 00:49 |
for each star, a list of all the different emission frequencies of light coming to earth. | ▶ 00:51 |
Next example is lever pressing. | ▶ 00:58 |
So--I have a rat who is trained to press a lever | ▶ 01:02 |
to get a release of food | ▶ 01:06 |
when certain conditions are met. | ▶ 01:08 |
Is that supervised, unsupervised or reinforcement learning? | ▶ 01:10 |
And finally, the problem of an elevator controller. | ▶ 01:16 |
Say I have a bank of elevators in a building | ▶ 01:20 |
and they have to have some program--some policy-- | ▶ 01:22 |
to decide which elevator goes up | ▶ 01:25 |
and which elevator goes down | ▶ 01:27 |
in response to the percepts, | ▶ 01:29 |
which would be the button presses at various floors in the building. | ▶ 01:31 |
And so, I have a sequence of button presses, | ▶ 01:35 |
and I have the wait time that I am trying to minimize-- | ▶ 01:39 |
so after each button press, the elevator moves; | ▶ 01:44 |
the person waiting is waiting for a certain amount of time, | ▶ 01:48 |
and then gets picked up, | ▶ 01:53 |
and the algorithm is given that amount of wait time. | ▶ 01:55 |
Would that be supervised, unsupervised or reinforcement? | ▶ 01:59 |
The answers are that speech recognition | ▶ 00:00 |
can be handled quite well by supervised learning. | ▶ 00:02 |
That is, we have input/output pairs; | ▶ 00:05 |
the input is the speech signal, | ▶ 00:07 |
and the output is the words that they correspond to. | ▶ 00:09 |
Analyzing the spectral emissions of stars | ▶ 00:12 |
is an example of unsupervised clustering | ▶ 00:15 |
where we're taking the input data-- | ▶ 00:19 |
we have data for each star, but we don't have any label associated with it. | ▶ 00:21 |
Rather, we're trying to make up labels | ▶ 00:25 |
by clustering them together, | ▶ 00:27 |
giving them to scientists, | ▶ 00:29 |
and then letting the scientists see: | ▶ 00:31 |
Do these clusters make any sense? | ▶ 00:33 |
Lever pressing is a classic example of reinforcement learning. | ▶ 00:35 |
In fact, the term "reinforcement learning" | ▶ 00:38 |
was used for a long time in animal psychology, | ▶ 00:40 |
before it was used in computer science. | ▶ 00:43 |
And elevator controllers is another area | ▶ 00:46 |
that has been investigated, | ▶ 00:48 |
using reinforcement learning | ▶ 00:50 |
and, in fact, very good algorithms-- | ▶ 00:52 |
better than the previous state of the art-- | ▶ 00:54 |
have been made, using reinforcement learning techniques. | ▶ 00:56 |
So the input, again, is a set of state/action transitions; | ▶ 00:59 |
and then the reinforcement is-- | ▶ 01:04 |
in this case, it's always a negative number | ▶ 01:07 |
because there's always a wait time. | ▶ 01:09 |
And so that's the penalty--we're trying to minimize that penalty, | ▶ 01:11 |
but all we get is the amount of wait time that we're trying to minimize. | ▶ 01:14 |
Now, before we get into the math of reinforcement learning, | ▶ 00:00 |
let's review MDPs-- | ▶ 00:03 |
which are, of course, Markov Decision Processes. | ▶ 00:05 |
An MDP consists of a set of states-- | ▶ 00:09 |
S is an element of the state, S; | ▶ 00:13 |
a set of actions-- | ▶ 00:16 |
A is an element of the actions that are available in each of the states, S. | ▶ 00:18 |
And we're going to distinguish a Start state, | ▶ 00:26 |
which we'll call S-zero, | ▶ 00:28 |
and then we need a transition function that says: | ▶ 00:30 |
How does the world evolve as we take actions in the world? | ▶ 00:34 |
And we can denote that by | ▶ 00:37 |
the probability that we get a Result state, S prime-- | ▶ 00:39 |
given that we start in state, S, | ▶ 00:45 |
and apply action, A. | ▶ 00:49 |
That's a probability distribution | ▶ 00:50 |
because the world is stochastic. | ▶ 00:52 |
The same result doesn't happen every time, | ▶ 00:54 |
when we do the same action, | ▶ 00:56 |
so we have this probability distribution. | ▶ 00:58 |
In some notations, you'll see: | ▶ 01:00 |
T of S, A, S prime--for the transition function. | ▶ 01:03 |
And then, in addition to the transition, | ▶ 01:08 |
we need a reward function-- | ▶ 01:10 |
which we'll denote R. | ▶ 01:12 |
Sometimes that's over the whole triplet-- | ▶ 01:14 |
the reward that you get from starting in one state, | ▶ 01:16 |
taking an action, and arriving at another state; | ▶ 01:19 |
sometimes we only need to talk about the result state. | ▶ 01:22 |
So in the 4 by 3 Grid World, for example, | ▶ 01:26 |
we don't care how you got to this state-- | ▶ 01:29 |
it's just, when you get to one of the states | ▶ 01:31 |
in the upper right, you get a plus 1 or minus 1 reward. | ▶ 01:33 |
And similarly, in a game like backgammon-- | ▶ 01:35 |
when you win or lose, | ▶ 01:38 |
you get a positive or negative reward. | ▶ 01:40 |
It doesn't matter what move you took to win or lose. | ▶ 01:42 |
And so that's all there is to MDPs. | ▶ 01:44 |
Now to solve an MDP, | ▶ 00:00 |
we're trying to find a policy--pi of S-- | ▶ 00:02 |
that's going to be our answer. | ▶ 00:06 |
The pi that we want--the optimal policy-- | ▶ 00:08 |
is the one that's going to maximize | ▶ 00:10 |
the discounted, total Reward. | ▶ 00:13 |
So what we mean is: | ▶ 00:15 |
we want to take the sum over all Times | ▶ 00:17 |
into the future of the Reward | ▶ 00:20 |
that you get from starting out | ▶ 00:23 |
in the state that you're in, in time T-- | ▶ 00:25 |
and then applying the policy to that state, | ▶ 00:28 |
and arriving at a new state, at time T plus 1. | ▶ 00:32 |
And so we want to maximize that sum-- | ▶ 00:35 |
but the sum might be infinite | ▶ 00:37 |
and so, what we do is | ▶ 00:39 |
we take this value, Gamma, | ▶ 00:41 |
and raise it to the T power, saying | ▶ 00:43 |
we're going to count future Rewards less than | ▶ 00:46 |
current Rewards--and that way, | ▶ 00:49 |
we'll make sure that the sum total is bounded. | ▶ 00:52 |
So we want the policy that maximizes that result. | ▶ 00:55 |
If we figure out the utility of the state | ▶ 00:58 |
by solving the Markov Decision Process, | ▶ 01:00 |
then we have: the utility of any state, S, | ▶ 01:03 |
is equal to the maximum over all | ▶ 01:07 |
possible actions that we could take in S | ▶ 01:09 |
of the expected value of taking that action. | ▶ 01:12 |
And what's the expected value? | ▶ 01:15 |
Well, it's just the sum over all resulting states | ▶ 01:17 |
of the transition model-- | ▶ 01:21 |
the probability that we get to that state, | ▶ 01:23 |
given from the start state, we take an action | ▶ 01:25 |
specified by the optimal policy | ▶ 01:28 |
times the utility of that resulting state. | ▶ 01:31 |
So--look at all possible actions; | ▶ 01:34 |
choose the best one-- | ▶ 01:37 |
according to the expected utility, weighted by the transition probabilities. | ▶ 01:39 |
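In symbols, and with the per-state reward and the discount made explicit (they are implicit in the spoken description), the objective and the resulting utility equation are:

    \pi^{*} = \arg\max_{\pi}\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t)\right]
    U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')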
Now here's where reinforcement learning comes into play: | ▶ 00:00 |
What if you don't know R--the Reward function? | ▶ 00:03 |
What if you don't even know P--the transition model of the world? | ▶ 00:06 |
Then you can't solve the Markov Decision Process | ▶ 00:09 |
because you don't have what you need to solve it. | ▶ 00:12 |
However, with reinforcement learning, | ▶ 00:14 |
you can learn R and P by interacting with the world | ▶ 00:16 |
or you can learn substitutes that will tell you | ▶ 00:19 |
as much as you need to know, so that you never actually have to compute with R and P. | ▶ 00:22 |
What you learn, exactly, depends on what you already know and what you want to do. | ▶ 00:26 |
So we have several choices. | ▶ 00:30 |
One choice is we can build a utility-based agent. | ▶ 00:32 |
So we're going to list agent types, based on what we know, | ▶ 00:36 |
what we want to learn, | ▶ 00:41 |
and what we then use once we've learned. | ▶ 00:43 |
So for a utility-based agent, | ▶ 00:45 |
if we already know P, the transition model, | ▶ 00:47 |
but we don't know R, the Reward model, | ▶ 00:51 |
then we can learn R--and use that, | ▶ 00:54 |
along with P, to learn our utility function; | ▶ 00:57 |
and then go ahead and use the utility function | ▶ 01:01 |
just as we did in normal Markov Decision Processes. | ▶ 01:04 |
So that's one agent design. | ▶ 01:07 |
Another design that we'll see in this Unit | ▶ 01:09 |
is called a Q-learning agent. | ▶ 01:11 |
In this one, we don't have to know P or R; | ▶ 01:14 |
and we learn a value function, which is usually denoted by Q. | ▶ 01:17 |
And that's a type of utility | ▶ 01:22 |
but, rather than being a utility over states, | ▶ 01:26 |
it's a utility of state action pairs--and that tells us: | ▶ 01:28 |
For any given state and any given action, | ▶ 01:32 |
what's the utility of that result-- | ▶ 01:36 |
without knowing the utilities and rewards, individually? | ▶ 01:38 |
And then we can just use that Q directly. | ▶ 01:42 |
So we don't actually have to ever learn the transition model, P, | ▶ 01:45 |
with a Q-learning agent. | ▶ 01:49 |
And finally, we can have a reflex agent | ▶ 01:51 |
where, again, we don't need to know P and R to begin with; | ▶ 01:54 |
and we learn directly, the policy, pi of S; | ▶ 01:57 |
and then we just go ahead and apply pi. | ▶ 02:02 |
So it's called a reflex agent because it's pure stimulus response: | ▶ 02:05 |
I'm in a certain state, I take a certain action. | ▶ 02:09 |
I don't have to think about modeling the world, in terms of: | ▶ 02:11 |
What are the transitions--where am I going to go next? | ▶ 02:15 |
I just go ahead and take that action. | ▶ 02:17 |
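The three agent designs just described, side by side:

    Agent design     Already knows      Learns                  Then uses
    utility-based    P (transitions)    R, and from it U(s)     U to pick actions
    Q-learning       neither P nor R    Q(s, a)                 Q directly
    reflex           neither P nor R    the policy pi(s)        pi directly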
Now, the next choice we have in agent design | ▶ 00:00 |
revolves around how adventurous it wants to be. | ▶ 00:03 |
One possibility is what's called the passive reinforcement learning agent-- | ▶ 00:06 |
and that can be any of these agent designs, | ▶ 00:11 |
but what passive means is that the agent | ▶ 00:14 |
has a fixed policy and executes that policy. | ▶ 00:16 |
But it learns about the reward function, R, | ▶ 00:19 |
and maybe the transition function, P, | ▶ 00:22 |
if it didn't already know that. | ▶ 00:25 |
It learns that while executing the fixed policy. | ▶ 00:27 |
So let me give you an example. | ▶ 00:30 |
Imagine that you're on a ship in uncharted waters | ▶ 00:32 |
and the captain has a policy for piloting the ship. | ▶ 00:35 |
You can't change the captain's policy. | ▶ 00:38 |
He or she is going to execute that, no matter what. | ▶ 00:41 |
But it's your job to learn all you can about the uncharted waters. | ▶ 00:44 |
In other words, learn the reward function, | ▶ 00:47 |
given the actions and the state transitions | ▶ 00:50 |
that the ship is going through. | ▶ 00:53 |
You learn, and remember what you've learned, | ▶ 00:55 |
but that doesn't change the captain's policy-- | ▶ 00:57 |
and that's passive learning. | ▶ 00:59 |
Now, the alternative is called | ▶ 01:01 |
active reinforcement learning-- | ▶ 01:04 |
and that's where we change the policy as we go. | ▶ 01:06 |
So let's say, eventually, you've done such a great job | ▶ 01:09 |
of learning about the uncharted water | ▶ 01:12 |
that the captain says to you, | ▶ 01:14 |
"Okay--I'm going to hand over control | ▶ 01:16 |
and as you learn, I'm going to allow you | ▶ 01:19 |
to change the policy for this ship. | ▶ 01:21 |
You can make decisions of where we're going to go next." | ▶ 01:23 |
And that's good, because you can start to | ▶ 01:26 |
cash in early on your learning | ▶ 01:28 |
and it's also good because it gives you | ▶ 01:30 |
a possibility to explore. | ▶ 01:32 |
Rather than just say: What's the best action I can do right now?-- | ▶ 01:35 |
you can say: What's the action that might allow me to learn something-- | ▶ 01:38 |
to allow me to do better in the future? | ▶ 01:42 |
Let's start by looking at passive reinforcement learning. | ▶ 00:00 |
I'm going to describe an algorithm called | ▶ 00:03 |
Temporal Difference Learning--or TD. | ▶ 00:05 |
And what that means--sounds like a fancy name, | ▶ 00:07 |
but all it really means is we're going to move | ▶ 00:09 |
from one state to the next; | ▶ 00:11 |
and we're going to look at the difference between the 2 states, | ▶ 00:13 |
and learn that--and then kind of back up | ▶ 00:16 |
the values, from one state to the next. | ▶ 00:19 |
So if we're going to follow a fixed policy, pi, | ▶ 00:22 |
and let's say our policy tells us to go this way, and then go this way. | ▶ 00:27 |
We'll eventually learn that we get a plus 1 reward there | ▶ 00:31 |
and we'll start feeding back that plus 1, saying: | ▶ 00:35 |
if it was good to get a plus 1 here, | ▶ 00:38 |
it must be somewhat good to be in this state, | ▶ 00:40 |
somewhat good to be in this state--and so on, back to the start state. | ▶ 00:42 |
So, in order to run this algorithm, | ▶ 00:46 |
we're going to try to build up a table of utilities for each state | ▶ 00:48 |
and along the way, we're going to keep track of | ▶ 00:53 |
the number of times that we visited each state. | ▶ 00:56 |
Now, the table of utilities, we're going to start blank-- | ▶ 00:59 |
we're not going to start them at zero or anything else; | ▶ 01:01 |
they're just going to be undefined. | ▶ 01:03 |
And the table of numbers, we're going to start at zero, | ▶ 01:05 |
saying we visited each state a total of zero times. | ▶ 01:07 |
What we're going to do is run the policy, | ▶ 01:11 |
have a trial that goes through the state; | ▶ 01:14 |
when it gets to a terminal state, | ▶ 01:16 |
we start it over again at the start and run it again; | ▶ 01:18 |
and we keep track of how many times we visited each state, | ▶ 01:21 |
we update the utilities, and we get a better | ▶ 01:24 |
and better estimate for the utility. | ▶ 01:26 |
And this is what the inner loop of the algorithm looks like-- | ▶ 01:28 |
and let's see if we can trace it out. | ▶ 01:30 |
So we'll start at a start state, | ▶ 01:32 |
we'll apply the policy--and let's say the policy tells us to move in this direction. | ▶ 01:34 |
Then we get a reward here, | ▶ 01:39 |
which is zero; | ▶ 01:42 |
and then we look at it with the algorithm, | ▶ 01:44 |
and the algorithm tells us if the state | ▶ 01:46 |
is new--yes, it is; we've never been there before-- | ▶ 01:48 |
then set the utility of that state to the new reward, which is zero. | ▶ 01:51 |
Okay--so now we have a zero here; | ▶ 01:56 |
and then let's say, the next step, we move up here. | ▶ 01:58 |
So, again, we have a zero; | ▶ 02:02 |
and let's say our policy looks like a good one, | ▶ 02:04 |
so we get: here, we have a zero. | ▶ 02:07 |
We get: here, we have a zero. | ▶ 02:10 |
We get: here--now, this state, | ▶ 02:12 |
we get a reward of 1, so that state gets a utility of 1. | ▶ 02:16 |
And all along the way, we have to think about | ▶ 02:20 |
how we're backing up these values, as well. | ▶ 02:23 |
So when we get here, we have to look at this formula to say: | ▶ 02:26 |
How are we going to update the utility of the prior state? | ▶ 02:31 |
And the difference between this state and this state is zero. | ▶ 02:35 |
so this difference, here, is going to be zero-- | ▶ 02:38 |
the reward is zero, and so there's going to be no update to this state. | ▶ 02:43 |
But now, finally--for the first time--we're going to have an actual update. | ▶ 02:46 |
So we're going to update this state to be plus 1, | ▶ 02:50 |
and now we're going to think about changing this state. | ▶ 02:54 |
And what was its old utility?--well, it was zero. | ▶ 02:57 |
And then there's a factor called Alpha, | ▶ 03:00 |
which is the learning rate | ▶ 03:03 |
that tells us how much we want to move this utility | ▶ 03:05 |
towards something that's maybe a better estimate. | ▶ 03:08 |
And the learning rate should be such that, | ▶ 03:11 |
if we are brand new, | ▶ 03:14 |
we want to move a big step; | ▶ 03:16 |
and if we've seen this state a lot of times, | ▶ 03:18 |
we're pretty confident of our number | ▶ 03:20 |
and we want to make a small step. | ▶ 03:22 |
So let's say that the Alpha function is 1 over N plus 1-- | ▶ 03:24 |
we'd better not make it just 1 over N, which wouldn't work when N is zero. | ▶ 03:29 |
So 1 over N plus 1 would be ½; | ▶ 03:31 |
and then the reward in this state was zero; | ▶ 03:35 |
plus, we had a Gamma-- | ▶ 03:39 |
and let's just say that Gamma is 1, | ▶ 03:41 |
so there's no discounting; and then | ▶ 03:44 |
we look at the difference between the utility | ▶ 03:46 |
of the resulting state--which is 1-- | ▶ 03:49 |
minus the utility of this state, which was zero. | ▶ 03:52 |
So we get ½ times (1 minus zero)--which is ½. | ▶ 03:57 |
So we update this; | ▶ 04:01 |
and we change this zero to ½. | ▶ 04:03 |
Now let's say we start all over again | ▶ 04:06 |
and let's say our policy is right on track; | ▶ 04:10 |
and nothing unusual, stochastically, has happened. | ▶ 04:12 |
So we follow the same path, | ▶ 04:16 |
we don't update--because they're all zeros all along this path. | ▶ 04:19 |
We go here, here, here; | ▶ 04:23 |
and now it's time for an update. | ▶ 04:26 |
So now, we've transitioned from a zero to ½-- | ▶ 04:28 |
so how are we going to update this state? | ▶ 04:33 |
Well, the old state was zero | ▶ 04:35 |
and now we have a 1 over N plus 1-- | ▶ 04:37 |
so let's say 1/3. | ▶ 04:41 |
So we're getting a little bit more confident--because we've been there | ▶ 04:44 |
twice, rather than just once. | ▶ 04:46 |
The reward in this state was zero, | ▶ 04:48 |
and then we have to look at the difference between these 2 states. | ▶ 04:51 |
That's where we get the name, Temporal Difference; | ▶ 04:54 |
and so, we have ½ minus zero-- | ▶ 04:57 |
and so that's 1/3 times ½-- | ▶ 05:01 |
so that's 1/6. | ▶ 05:03 |
Now we update this state. | ▶ 05:05 |
It was zero; now it becomes 1/6. | ▶ 05:07 |
And you can see how the results | ▶ 05:11 |
of the positive 1 starts to propagate | ▶ 05:13 |
backwards--but it propagates slowly. | ▶ 05:16 |
We have to have 1 trial at a time | ▶ 05:18 |
to get that to propagate backwards. | ▶ 05:20 |
Now, how about the update from this state to this state? | ▶ 05:22 |
Now, we were ½ here--so our old utility was ½; | ▶ 05:25 |
plus Alpha--the learning rate--is 1/3. | ▶ 05:31 |
The reward in the old state was zero; | ▶ 05:35 |
plus the difference between these two, | ▶ 05:39 |
which is 1 minus ½. | ▶ 05:42 |
So that's ½ plus 1/6 is 2/3. | ▶ 05:45 |
And now the second time through, | ▶ 05:49 |
we've updated the utility of this state from 1/2 to 2/3. | ▶ 05:51 |
And we keep on going--and you can see the results of the positive, propagating backwards. | ▶ 05:57 |
And if we did more examples through here, | ▶ 06:02 |
you would see the results of the negative propagating backwards. | ▶ 06:04 |
And eventually, it converges to the correct utilities for this policy. | ▶ 06:08 |
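Here is a minimal Python sketch (not the lecture's code) of this passive TD update on a toy chain of states chosen so that the numbers match the walkthrough: three zero-reward states leading to a +1 terminal, gamma = 1, and alpha = 1/(N+1).

    # Passive temporal-difference learning on a toy chain s0 -> s1 -> s2 -> T,
    # where T carries the +1 reward.  After trial 1 the state before T is 1/2,
    # and after trial 2 the last two states are 1/6 and 2/3, as worked out above.
    GAMMA = 1.0
    R = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "T": 1.0}     # reward of each state
    trajectory = ["s0", "s1", "s2", "T"]                # what the fixed policy yields

    U = {}                      # utilities start out undefined
    N = {s: 0 for s in R}       # visit counts start at zero

    def td_trial(states):
        prev = None
        for s in states:
            if s not in U:
                U[s] = R[s]                     # a new state starts at its reward
            if prev is not None:
                N[prev] += 1
                alpha = 1.0 / (N[prev] + 1)     # learn fast at first, slower later
                U[prev] += alpha * (R[prev] + GAMMA * U[s] - U[prev])
            prev = s

    for trial in (1, 2):
        td_trial(trajectory)
        print(f"after trial {trial}:",
              ", ".join(f"U({s})={U[s]:.3f}" for s in trajectory))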
Now here are some results from running the passive TD algorithm on the 4 by 3 maze. | ▶ 00:00 |
On the right, we see a graph of the average | ▶ 00:05 |
error in the utility function--average across all the states. | ▶ 00:08 |
So it starts off--for the first 5 or so trials, | ▶ 00:11 |
the error rate is very high--it's off the charts. | ▶ 00:15 |
But then it starts to settle down, through 10, 20, 40; | ▶ 00:18 |
and up to about 60 or so, it's still improving; | ▶ 00:23 |
and then it gets to a final steady state | ▶ 00:26 |
after about 60 trials of about .05 in the average error in utility. | ▶ 00:29 |
So that's not too bad, but not really converging all the way down to no rate of error. | ▶ 00:36 |
And on the left, you see the utility estimates | ▶ 00:40 |
for various different states; | ▶ 00:43 |
and, as we see--as we get out to 500 trials, | ▶ 00:45 |
they're starting to converge a little bit, | ▶ 00:49 |
close to their true values. | ▶ 00:51 |
But we see in the first 100 or so trials-- | ▶ 00:54 |
they were all over the map, and so it wasn't doing very well. | ▶ 00:56 |
It took awhile for it to converge to something close to the true values. | ▶ 00:59 |
Now I want to do a little quiz, | ▶ 00:00 |
and ask you: True or False, | ▶ 00:02 |
which of the following are possible | ▶ 00:04 |
weaknesses in this TD learning | ▶ 00:07 |
with a passive approach to reinforcement learning? | ▶ 00:09 |
One: Is it possible that we would have | ▶ 00:12 |
a long convergence time-- | ▶ 00:15 |
that it might take a long time to converge to the correct utility values? | ▶ 00:17 |
Secondly, are we limited by the policy that we choose? | ▶ 00:21 |
So remember: in passive reinforcement learning, | ▶ 00:26 |
we choose a fixed policy | ▶ 00:28 |
and execute that policy; | ▶ 00:30 |
and any deviance from the policy | ▶ 00:32 |
results from the stochasticity. | ▶ 00:36 |
We may visit different squares | ▶ 00:38 |
because the environment is stochastic, | ▶ 00:40 |
but not because we made different choices. | ▶ 00:42 |
So there's that limitation. | ▶ 00:44 |
Third, can there be a problem with missing states? | ▶ 00:47 |
That is, could there be some states that have | ▶ 00:50 |
a zero count--that we never visited, | ▶ 00:53 |
and never got a utility estimate? | ▶ 00:56 |
And fourth, could there be a problem with a poor estimate for certain states? | ▶ 00:58 |
So could it be that, even though a state didn't have a count of zero, | ▶ 01:03 |
it had a low count, and we weren't able to get a good utility estimate for that state? | ▶ 01:08 |
An answer is that every one of these | ▶ 00:00 |
is a potential problem for passive reinforcement learning. | ▶ 00:03 |
So every problem won't show up in every possible domain. | ▶ 00:06 |
It'll depend on what the environment looks like. | ▶ 00:09 |
But it is a possibility that you could get bitten by any of these problems. | ▶ 00:12 |
And they all stem from the same cause, | ▶ 00:16 |
from the fact that passive learning | ▶ 00:19 |
stubbornly sticks to the same policy throughout. | ▶ 00:21 |
We have a policy, pi of S, | ▶ 00:25 |
and we always execute that policy. | ▶ 00:27 |
So if the policy here was to go up and then go right, | ▶ 00:29 |
then we would always stick to that; | ▶ 00:34 |
and the only time we would explore any other state is when those actions failed. | ▶ 00:36 |
If we tried to go up from this state-- | ▶ 00:41 |
because that's what the policy said; | ▶ 00:43 |
but, stochastically, we slipped over to this state-- | ▶ 00:45 |
then we would do something else, according to the policy, | ▶ 00:48 |
and so we'd get a little bit of exploration, | ▶ 00:51 |
but we'd only vary from the chosen path | ▶ 00:53 |
because of that variation | ▶ 00:56 |
and we wouldn't intentionally explore enough of the space. | ▶ 00:58 |
So let's move on to Active Reinforcement Learning | ▶ 00:00 |
and, in particular, let's examine a simple | ▶ 00:03 |
approach called a Greedy Reinforcement Learner. | ▶ 00:06 |
And the way that works is it uses the same | ▶ 00:10 |
passive TD learning algorithm that we talked about, | ▶ 00:13 |
but, after each time we update the utilities | ▶ 00:16 |
or maybe after a couple of updates--you can decide how often you want to do it-- | ▶ 00:20 |
after the change to the utilities, | ▶ 00:23 |
we recompute the new optimal policy, pi. | ▶ 00:25 |
So we throw away our old pi, pi1, | ▶ 00:28 |
and replace it with a new pi, pi2-- | ▶ 00:32 |
which is a result of solving the MDP described by our new estimates of the utilities. | ▶ 00:35 |
Now we have a new policy, | ▶ 00:41 |
and we continue learning with that new policy. | ▶ 00:43 |
And so, if the initial policy was flawed, | ▶ 00:45 |
the Greedy algorithm would tend to move away from the initial policy, | ▶ 00:49 |
towards a better policy--and we can show how well that works. | ▶ 00:52 |
Here's the result of running the Greedy agent over 500 trials. | ▶ 00:00 |
And I've graphed 2 things here: | ▶ 00:04 |
One is the error; and you see, over the top-- | ▶ 00:06 |
over the first 40 or so trials-- | ▶ 00:09 |
the error was very high--way up here. | ▶ 00:12 |
But then, suddenly, it jumped down | ▶ 00:14 |
to a lower level, and stayed along that level all the way through to 500. | ▶ 00:16 |
I've also graphed, with a dotted line, the policy loss. | ▶ 00:22 |
What does that mean?--so that's the difference | ▶ 00:25 |
between the policy that the agent has learned and the optimal policy. | ▶ 00:28 |
So if it had learned the optimal policy, the policy loss would be zero, down here. | ▶ 00:32 |
It doesn't quite get to zero. | ▶ 00:37 |
It was high, up here, | ▶ 00:39 |
and then at around step 40, it learned something important. | ▶ 00:41 |
What did it learn?--well, here's the final policy that it came up with. | ▶ 00:45 |
Maybe it started out originally going in this direction and hitting the minus 1; | ▶ 00:49 |
and then it flipped and learned a new policy that went in a better direction. | ▶ 00:53 |
But it still hasn't learned the optimal policy. | ▶ 00:57 |
And we can see--for example, this looks like a mistake here. | ▶ 00:59 |
In state 1-2, its policy is moving down | ▶ 01:03 |
and then following this path, which it learned, towards the goal. | ▶ 01:08 |
But really, a better route would be to take the northern route, and go through this path. | ▶ 01:12 |
But it hasn't learned that. | ▶ 01:17 |
Because it was Greedy, it found something | ▶ 01:19 |
that seemed to be doing good for it, and then it never deviated from that. | ▶ 01:21 |
So the question, then, is: How do we get this learner out of its rut? | ▶ 00:00 |
It improved its policy for a while, | ▶ 00:04 |
but then it got stuck in this policy | ▶ 00:07 |
where we go here, go up and then go right. | ▶ 00:09 |
Most of the time, that's a perfectly good policy. | ▶ 00:13 |
But if a stochastic error makes us slip into the minus 1, then it hurts us. | ▶ 00:16 |
We'd like to be able to say we're going to stop doing that | ▶ 00:21 |
and somehow find this route. | ▶ 00:25 |
But in order to find that new route, | ▶ 00:28 |
we'd have to spend some time executing a policy | ▶ 00:30 |
which was not the best policy known to us. | ▶ 00:32 |
In other words, we'd have to stop exploiting | ▶ 00:35 |
the best policy we'd found so far--which is this one-- | ▶ 00:38 |
and start exploring, to see if maybe there's a better policy. | ▶ 00:42 |
And exploring could lead us astray | ▶ 00:46 |
and cause us to waste a lot of time. | ▶ 00:48 |
So we have to figure out: what's the right trade-off? | ▶ 00:51 |
When is it worth exploring to try to find something better for the long term-- | ▶ 00:53 |
even though we know that exploring is going to hurt us in the short term? | ▶ 00:57 |
Now, one possibility is, certainly, random exploration. | ▶ 01:02 |
That is, we can follow our best policy | ▶ 01:06 |
some percentage of the time, | ▶ 01:09 |
and then randomly, at some point, | ▶ 01:11 |
we can decide to take an action which is not the optimal action. | ▶ 01:14 |
So we're here, the optimal action would be to go east; | ▶ 01:17 |
and we say, "Well, this time we're gong to choose something else-- | ▶ 01:20 |
let's try going north. | ▶ 01:23 |
And then we explore from there | ▶ 01:25 |
and see if we've learned something. | ▶ 01:27 |
So that policy does, in fact, work-- | ▶ 01:29 |
randomly making moves with some probability--but it tends to be slow to converge. | ▶ 01:31 |
In order to get something better, we have to really understand | ▶ 01:37 |
what's going on with our exploration, versus exploitation. | ▶ 01:39 |
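The random-exploration idea in code form (this is what is commonly called an epsilon-greedy choice; the lecture doesn't give it a name):

    import random

    def choose_action(s, pi, actions, epsilon=0.1):
        """Follow the current best policy, but with probability epsilon try
        some other action, just to see what can be learned from it."""
        if random.random() < epsilon:
            return random.choice(list(actions))     # explore
        return pi[s]                                # exploit the best known policy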
So let's really think about what we're doing when we're executing | ▶ 00:00 |
the active TD learning algorithm. | ▶ 00:03 |
First, we're keeping track of the optimal policy we've found so far; | ▶ 00:05 |
and that gets updated as we go, | ▶ 00:09 |
and replaced with new policies. | ▶ 00:11 |
Secondly, we're keeping track of the utilities of states-- | ▶ 00:13 |
and those, too, get updated as we go along. | ▶ 00:17 |
And third, we're keeping track of the number | ▶ 00:20 |
of times that we visited each state. | ▶ 00:23 |
And that gets incremented on each trial. | ▶ 00:26 |
Now, what could happen? What could go wrong? | ▶ 00:28 |
There are really 2 reasons | ▶ 00:30 |
why our utility estimates could be off. | ▶ 00:32 |
First, we haven't sampled enough. | ▶ 00:36 |
The N values are too low for that state | ▶ 00:39 |
and the utilities that we got were just some | ▶ 00:42 |
random fluctuations and weren't | ▶ 00:44 |
a very good, true estimate. | ▶ 00:46 |
And secondly, we could get a bad utility | ▶ 00:48 |
because our policy was off. | ▶ 00:51 |
The policy was telling us to do something that wasn't really the best thing, | ▶ 00:53 |
and so the utility wasn't as high as it could be. | ▶ 00:57 |
So let's do a little quiz. | ▶ 01:00 |
I want you to tell me, for the 2 sources of possible error-- | ▶ 01:02 |
too little sampling and wrong policy-- | ▶ 01:06 |
I want you to tell me, is it True or False--each of these statements: | ▶ 01:09 |
One: Could the error--either the sampling error or the policy error-- | ▶ 01:13 |
could that make the utility estimates too low? | ▶ 01:19 |
And secondly, could it make utility too high? | ▶ 01:23 |
And third, could it be improved with higher N values--that is, more trials? | ▶ 01:29 |
And here are the answers: For the error introduced by a lack of enough sampling, | ▶ 00:00 |
all these statements are true. | ▶ 00:06 |
If you don't have enough samples, | ▶ 00:08 |
it might make the utility too high; it might make the utility too low-- | ▶ 00:10 |
and it could certainly be improved by taking more trials. | ▶ 00:13 |
But with the differences due to having not quite the right policy, | ▶ 00:16 |
the answers aren't the same. | ▶ 00:20 |
So yes, if you don't have the right policy, | ▶ 00:22 |
that could make the utilities too low--if you're doing something silly, | ▶ 00:24 |
like starting in this state and the policy says, | ▶ 00:28 |
"Drive straight into the minus 1" | ▶ 00:31 |
that could make the utility of this state lower than it really should be. | ▶ 00:34 |
But it can't make the utility too high. | ▶ 00:37 |
So we really have a bound on the utility here. | ▶ 00:40 |
The bound is: what does the optimal policy do? | ▶ 00:43 |
And no matter what policy we have, | ▶ 00:47 |
it's not going to be better than the optimal policy; | ▶ 00:49 |
and so we can only be making things worse | ▶ 00:51 |
with our policy, not making them better. | ▶ 00:54 |
And finally, having more N won't necessarily improve things. | ▶ 00:56 |
It will decrease the variance, but it won't decrease or improve the mean. | ▶ 01:00 |
Now what that suggests is the design for an exploration agent | ▶ 00:00 |
that will be more proactive about exploring the world when it's uncertain, | ▶ 00:04 |
and will fall back to exploiting the optimal policy--or whatever policy it has that's closest to optimal-- | ▶ 00:09 |
when it becomes more certain about the world. | ▶ 00:15 |
And what we can do is go through this | ▶ 00:17 |
normal cycle of TD learning-- | ▶ 00:19 |
like we always did. | ▶ 00:21 |
But when we're looking for the estimate | ▶ 00:23 |
of the utility of the state, | ▶ 00:25 |
what we can do is say: | ▶ 00:27 |
The utility of the state estimate will be | ▶ 00:29 |
some large value, plus R-- | ▶ 00:33 |
say, plus 1--in the case of this example-- | ▶ 00:36 |
the largest reward we can expect to get. | ▶ 00:40 |
In every case, when the number of visits to the state | ▶ 00:43 |
is less than some threshold, E, the exploration threshold. | ▶ 00:48 |
And when we've visited a state E times, | ▶ 00:52 |
then we revert to the learned probabilities | ▶ 00:55 |
or the learned utilities, rather. | ▶ 00:58 |
So when we start out, we're going to explore from new states; | ▶ 01:01 |
and once we have a good estimate of what the true utility of the state actually is, | ▶ 01:05 |
then we stop exploring and we go with those utilities. | ▶ 01:09 |
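A minimal Python sketch of this optimistic estimate, assuming a tabular grid-world agent; the names R_PLUS and EXPLORE_THRESHOLD, and the particular threshold value, are illustrative and not from the lecture.

    R_PLUS = 1.0              # largest reward we expect to see (+1 in this grid world)
    EXPLORE_THRESHOLD = 5     # the exploration threshold E (value chosen arbitrarily here)

    def exploration_utility(learned_utility, visit_count):
        """Utility estimate used while learning.

        learned_utility -- the current estimate U(s)
        visit_count     -- N(s), how often state s has been visited
        """
        if visit_count < EXPLORE_THRESHOLD:
            return R_PLUS           # be optimistic about rarely visited states
        return learned_utility      # once visited enough, trust the learned value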
And here we have the result of some simulations of the exploratory agent. | ▶ 00:00 |
We see it's doing much better than the passive agent or than the Greedy agent. | ▶ 00:04 |
So I'm graphing here; and we only had to go through 100 trials. | ▶ 00:09 |
We didn't have to go through 500--so it's converging much faster. | ▶ 00:13 |
And it's converging to much better results. | ▶ 00:16 |
So the policy loss--the dotted line-- | ▶ 00:19 |
started off high; but after only 20 trials, | ▶ 00:22 |
it's come down to perfect. | ▶ 00:25 |
So it learned the exact, correct policy after 20 trials. | ▶ 00:27 |
The error in the utilities--so you can have the perfect policy, | ▶ 00:31 |
while not quite having the right utilities for each state-- | ▶ 00:35 |
and the error in the utilities comes down, | ▶ 00:39 |
and that, too, comes down to a level that's lower than the previous agent's-- | ▶ 00:42 |
but still, not quite perfect. | ▶ 00:47 |
And we see here that it, in fact, learns the correct policy. | ▶ 00:49 |
Now, let's say we've done all this learning, | ▶ 00:00 |
we've applied our agent, and we've come up with a utility model; | ▶ 00:03 |
and we have the estimates for the utility for every state. | ▶ 00:06 |
Now what do we do when we want to act in the world? | ▶ 00:10 |
Well, we now have our policy for the state, | ▶ 00:13 |
which is determined by the expected value | ▶ 00:17 |
and we compute the expected value | ▶ 00:20 |
of each state by looking at the utility, | ▶ 00:22 |
which we just learned. | ▶ 00:25 |
But then, we have to multiply by the transition probabilities. | ▶ 00:27 |
What's the probability of each resulting state that we have to look up the utility of? | ▶ 00:31 |
And so, we need to know that-- | ▶ 00:37 |
and in some cases, we're given the transition model, and so we know all these probabilities. | ▶ 00:39 |
But in other cases, we don't have it; | ▶ 00:44 |
and so if we haven't learned it, we can't apply | ▶ 00:46 |
our policy, even though we know the utilities. | ▶ 00:48 |
I want to talk, briefly, about this alternative method | ▶ 00:51 |
called Q Learning, that I mentioned before. | ▶ 00:54 |
Where in Q Learning, we don't learn U directly, | ▶ 00:57 |
and we don't need the transition model. | ▶ 01:01 |
Instead, what we learned is a direct mapping, | ▶ 01:03 |
Q, from states and actions | ▶ 01:08 |
to utilities and so then, once we've learned Q, | ▶ 01:11 |
we can determine the optimal policy of the state, | ▶ 01:15 |
just by taking the maximum over all possible actions of these Q of S, A values. | ▶ 01:18 |
Now, how do we do Q Learning? | ▶ 00:00 |
Well, we start off with this table of Q values-- | ▶ 00:02 |
and notice that there's more entries in this | ▶ 00:05 |
table than there were in the utility table. | ▶ 00:08 |
So for each state, I've divided it up | ▶ 00:10 |
into different actions--so here's the action of going north, south, east or west | ▶ 00:14 |
from this particular state. | ▶ 00:20 |
They all start out with utility-- | ▶ 00:22 |
or rather Q utility, at zero. | ▶ 00:24 |
But as we go, we start to update, | ▶ 00:26 |
and we have an update formula that's very | ▶ 00:29 |
similar to the formula for TD learning. | ▶ 00:31 |
It has the same learning rate, Alpha, | ▶ 00:34 |
and the same discount factor, Gamma; | ▶ 00:37 |
and we just start applying that. | ▶ 00:39 |
So we start tracking through the state space, | ▶ 00:41 |
and when we get a transition--say we go | ▶ 00:45 |
east from here, | ▶ 00:48 |
and then east and then north and then north; | ▶ 00:50 |
and then east-- | ▶ 00:55 |
and then we would back up this value; | ▶ 00:57 |
and depending on what the values of Alpha and Gamma were, | ▶ 00:59 |
we might update this to .6 or something; | ▶ 01:03 |
and then the next time through, | ▶ 01:07 |
we might update that to .7, and update this one to .4, and so on. | ▶ 01:09 |
In each case, we'd be updating | ▶ 01:16 |
only the action we took, | ▶ 01:18 |
associated with that state, not the whole state. | ▶ 01:21 |
We'd keep repeating that process until we had values filled in for all the action state pairs. | ▶ 01:24 |
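Here is a rough Python sketch of the tabular update just described; it is not code from the class, and the constants and names are illustrative. It uses the standard max-over-next-actions form of the update, whereas the quiz later in this unit uses the SARSA variant, where the next action actually taken supplies Q(s', a').

    from collections import defaultdict

    ALPHA, GAMMA = 0.5, 0.9            # learning rate and discount factor
    Q = defaultdict(float)             # Q[(state, action)] starts at 0 for every pair

    def q_update(s, a, r, s_next, next_actions):
        """Update Q(s, a) after observing reward r and successor state s_next."""
        best_next = max(Q[(s_next, a2)] for a2 in next_actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    def greedy_action(s, actions):
        """Once learning is done, the policy is the argmax over the Q values."""
        return max(actions, key=lambda a: Q[(s, a)])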
Now, in some sense, you've learned all you need to know about reinforcement learning. | ▶ 00:00 |
Yes, it's a huge field, and there's a lot of other details that we haven't covered | ▶ 00:05 |
but you've seen all the basics. | ▶ 00:09 |
The theory is there and it works. | ▶ 00:11 |
But in another sense, we haven't gone very far | ▶ 00:13 |
because what we've done works for these small 4 by 3 Grid Worlds, | ▶ 00:16 |
but it won't work very well for larger problems: | ▶ 00:21 |
dealing with flying helicopters or playing backgammon-- | ▶ 00:24 |
because there's just too many states | ▶ 00:27 |
and we can't visit every one of the states | ▶ 00:29 |
and build up the correct utility values, or Q values, | ▶ 00:33 |
for all the billions or trillions or quadrillions of states we would need to represent. | ▶ 00:36 |
So let's go back to a simpler type of example. | ▶ 00:41 |
Here's a state in a Pacman game | ▶ 00:44 |
and we can see that this is a bad state, | ▶ 00:47 |
where Pacman is surrounded by 2 bad guys, | ▶ 00:49 |
and there's no place for him to escape. | ▶ 00:53 |
And so reinforcement learning could quickly learn that this is bad. | ▶ 00:56 |
But the problem is that that state has | ▶ 01:00 |
no relation whatsoever to this state. | ▶ 01:02 |
Where conceptually, it's the same problem-- | ▶ 01:06 |
that the Pacman is stuck in a corner | ▶ 01:10 |
and there are bad guys on either side of him. | ▶ 01:12 |
But in terms of a concrete state, | ▶ 01:15 |
the 2 are completely different. | ▶ 01:18 |
So what we want to be able to do is | ▶ 01:20 |
find some generalization, so that these 2 states look the same, | ▶ 01:22 |
and what I learn for this state-- | ▶ 01:26 |
that learning can transfer over into this state. | ▶ 01:29 |
And so, just as we did in supervised machine learning, where we wanted to take | ▶ 01:32 |
similar points in the state and be able to reason about them, together, | ▶ 01:37 |
we want to be able to do the same thing for reinforcement learning. | ▶ 01:42 |
And we can use the same type of approach. | ▶ 01:45 |
So we can represent a state, | ▶ 00:00 |
not by an exhaustive listing of everything that's true in the state-- | ▶ 00:02 |
every single dot, and so on. | ▶ 00:05 |
But rather, by a collection of important features. | ▶ 00:07 |
So we can say that a state is this collection | ▶ 00:11 |
of Feature 1, Feature 2, and so on. | ▶ 00:15 |
And what are the features? | ▶ 00:18 |
Well, they don't have to be the exact position | ▶ 00:20 |
of every piece in the board. | ▶ 00:22 |
They could be things like the distance to the nearest Ghost | ▶ 00:25 |
or maybe the square of the distance--or the inverse square; | ▶ 00:28 |
or the distance to a dot or food-- | ▶ 00:31 |
or the number of Ghosts remaining. | ▶ 00:34 |
And then we can represent the utility of a state, | ▶ 00:36 |
or let's go with a Q value, of a state action pair | ▶ 00:39 |
and represent that as the sum over some set of weights times the value of each feature. | ▶ 00:43 |
And our task, then, is to learn good values of these weights-- | ▶ 00:51 |
how important is each feature, whether they're positive or negative, and so on. | ▶ 00:55 |
This formulation will be good to the extent that similar states have the same value. | ▶ 01:00 |
So if these 2 states have the same value, that would be good | ▶ 01:05 |
because we could learn that, in both cases, Pacman is trapped. | ▶ 01:08 |
It would be bad, to the extent that dissimilar states have the same value-- | ▶ 01:12 |
say, if we're ignoring something important. | ▶ 01:18 |
So, for example, if one of the features was: | ▶ 01:20 |
Is Pacman in a tunnel? | ▶ 01:25 |
It would probably be important to know: is that tunnel a dead end or not? | ▶ 01:27 |
And if we represented all tunnels the same, we'd probably be making a mistake. | ▶ 01:31 |
Now, the great thing is that we can make a small modification to our Q learning algorithm | ▶ 01:35 |
where, when we were updating, the Q of S, A got updated | ▶ 01:42 |
in terms of a small change to the existing Q of S, A values. | ▶ 01:48 |
We can do the same thing with the weights--the w sub i values. | ▶ 01:53 |
We can update them as we make each change to the Q values. | ▶ 01:59 |
And they're both driven by the amount of error. | ▶ 02:02 |
If the Q values are off by a lot, we have to make a big change; | ▶ 02:05 |
if they're not, we make a small change-- | ▶ 02:09 |
the same thing with the w sub i values. | ▶ 02:11 |
And that looks just like what we did when we | ▶ 02:13 |
used supervised machine learning to update our weights. | ▶ 02:17 |
So we can apply that same process, even though it's not supervised. | ▶ 02:20 |
It's as if we're bringing our own supervision to reinforcement learning. | ▶ 02:24 |
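A short sketch of the feature-based update described above, under the assumption that features(s, a) returns a list of numeric feature values (such as distance to the nearest Ghost); the names and constants are illustrative, not from the lecture.

    ALPHA, GAMMA = 0.1, 0.9

    def q_value(weights, feature_values):
        # Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(w * f for w, f in zip(weights, feature_values))

    def update_weights(weights, s, a, r, s_next, next_actions, features):
        """Shift every weight in proportion to the error and to its own feature value."""
        best_next = max(q_value(weights, features(s_next, a2)) for a2 in next_actions)
        error = (r + GAMMA * best_next) - q_value(weights, features(s, a))
        return [w + ALPHA * error * f for w, f in zip(weights, features(s, a))]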
In summary, then, we've learned how to do a lot with MDPs-- | ▶ 00:00 |
especially using reinforcement learning. | ▶ 00:03 |
If we don't know what the MDP is, | ▶ 00:06 |
we know how to estimate it and then solve it. | ▶ 00:08 |
We can estimate the utility for some fixed policy, pi; | ▶ 00:11 |
or we could estimate the Q values for the | ▶ 00:15 |
optimal policy while executing an exploration policy. | ▶ 00:18 |
And we saw something about how we can make the right trade-offs | ▶ 00:22 |
between exploration and exploitation. | ▶ 00:25 |
So reinforcement learning remains one of the most exciting areas of AI. | ▶ 00:28 |
Some of the biggest surprises have come out of reinforcement learning-- | ▶ 00:31 |
things like Tesauro's backgammon player | ▶ 00:35 |
or Andrew Ng's helicopter; | ▶ 00:37 |
and we think that there's a lot more that we can learn. | ▶ 00:39 |
It's an exciting field, and one where there's plenty of room for new innovation. | ▶ 00:43 |
The answer is we're transitioning from this state to this state. | ▶ 00:00 |
We get a reward of zero in the old state. | ▶ 00:04 |
Then we get the Q value of 100 minus the Q value of zero, | ▶ 00:09 |
and the discount rate is 0.9. | ▶ 00:14 |
That's a difference of 90. | ▶ 00:17 |
Then the alpha, the learning rate, is 1/2. That gives us 45. | ▶ 00:20 |
We apply that 45 to the state action pair. | ▶ 00:27 |
We were in this state, and we executed the action north, | ▶ 00:30 |
so the 45 goes here. | ▶ 00:34 |
Notice it doesn't go over here. | ▶ 00:36 |
We did end up going to the east, but we didn't execute the action of going to the east. | ▶ 00:38 |
All the other actions remain unchanged. | ▶ 00:43 |
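Written out in one line (using the learning rate alpha = 1/2 and discount gamma = 0.9 quoted in this answer and in the problem statement that follows), the update is:

    Q(s,\text{north}) \leftarrow Q(s,\text{north}) + \alpha\,\bigl[r + \gamma\, Q(s',a') - Q(s,\text{north})\bigr]
                      = 0 + \tfrac{1}{2}\,(0 + 0.9 \cdot 100 - 0) = 45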
This problem involves the Q-learning agent who is currently situated at this square | ▶ 00:00 |
called (3,3), and executes the NORTH action trying to go up, | ▶ 00:05 |
but because the environment is stochastic, it actually ends up arriving at this terminal state | ▶ 00:11 |
with value 100. | ▶ 00:17 |
And what I want you to answer is how should the Q-values be updated for this state, | ▶ 00:19 |
and I want you to enter the Q-values over here because we don't want you to | ▶ 00:26 |
mess up the original, and we'll use the formula below which I should point out is | ▶ 00:31 |
from the Sarsa version of Q-learning, | ▶ 00:37 |
and in this formula, the parameter alpha--the learning rate--will take on the value of 1/2, and | ▶ 00:41 |
gamma--the discount rate--will be 0.9, | ▶ 00:47 |
and all the rewards for moving from one state to the next are 0 | ▶ 00:52 |
with the exception of moving into the terminal state, | ▶ 00:57 |
and this Q of S prime, A prime--that means what goes on in the next state, | ▶ 01:01 |
so here we were in this S, and we took the action of going NORTH, | ▶ 01:06 |
and we transferred into this state, and in that state, no matter what action you take | ▶ 01:12 |
the Q value is always 100, so this value here will always be 100. | ▶ 01:18 |
To work out the answer, let's look at the individual features for each of the states. | ▶ 00:00 |
For this state up here, the values for F1, F2, and F3 would be 2, 1, and 1. | ▶ 00:05 |
That is, the distance from the agent to the goal is 2, | ▶ 00:13 |
the distance to the closest bad guy is 1, and the distance of the bad guy to the goal is 1. | ▶ 00:17 |
Now this state here also has values 2, 1, and 1. | ▶ 00:24 |
That would be indistinguishable under either function F or G. | ▶ 00:30 |
This state here has values 2, 1, and 3. | ▶ 00:36 |
The 2 and the 1 are the same, so that would be indistinguishable under F, | ▶ 00:41 |
but would be different under G. | ▶ 00:46 |
And this state has values 2, 3, and 1, and the 2 and 3 are different than 2 and 1, | ▶ 00:49 |
so that would be different under either F or G. | ▶ 00:55 |
Now, the question: which is a more useful function? | ▶ 00:59 |
the answer is G is more useful, because G can actually distinguish between these 2 states. | ▶ 01:02 |
In this state the agent is surrounded by bad guys, so that's a bad situation. | ▶ 01:09 |
In this state the agent has a clear path to the goal, so that's a good situation. | ▶ 01:14 |
You'd want a function that says that those two are different | ▶ 01:19 |
rather than one that says they're the same. | ▶ 01:22 |
G says they are different whereas F says they're the same. | ▶ 01:24 |
This question involves function generalization in reinforcement learning, | ▶ 00:00 |
and we're operating in a 1-dimensional environment of squares, | ▶ 00:05 |
and we're going to consider a state generalization function, | ▶ 00:09 |
that is a function that takes a state such as this and condenses it into some features | ▶ 00:12 |
to represent that state. | ▶ 00:18 |
The first function we're going to consider F has these features-- | ▶ 00:20 |
f1 is the distance from the Agent represented by A to the goal represented by G, | ▶ 00:23 |
and f2--the distance from the Agent to the closest Bad guy | ▶ 00:29 |
which is represented by a B. | ▶ 00:33 |
So that's the function F, and we also want to consider the function G | ▶ 00:36 |
which has the same 2 features--f1 and f2--and adds a third feature | ▶ 00:39 |
which is the distance of the closest Bad guy to the goal. | ▶ 00:44 |
That is distance from the goal to the Bad guy--the minimum of that over | ▶ 00:49 |
all possible Bad guys, | ▶ 00:54 |
and now I want you to say which of the states below--these 3 states-- | ▶ 00:55 |
have the same value as the state above--this state--under the functions F and G. | ▶ 01:00 |
And click off the ones that have the same, and then I want you to answer for me-- | ▶ 01:06 |
In this world, agents and Bad guys can move one Square at a time, | ▶ 01:11 |
and the agent tries to get to the goal without encountering Bad guys, | ▶ 01:16 |
and for the agent to do that, which is a more useful generalization function | ▶ 01:19 |
to use over these states--F or G? | ▶ 01:25 |
The answer is according to the policy the agent would prefer to follow this straight line, | ▶ 00:00 |
because it is the most direct route to the goal. | ▶ 00:06 |
Now, at any point he might slip off to one of these squares. | ▶ 00:09 |
Those would all potentially be explored, | ▶ 00:14 |
but if he did he would go back down onto the road. | ▶ 00:17 |
Likewise, he might fall off onto any of these squares, | ▶ 00:20 |
but if he did, he would also go back towards the road. | ▶ 00:25 |
That's certainly true under this situation, when he's off road, | ▶ 00:29 |
but it also turns out to be true here and here, | ▶ 00:33 |
because the closest way to get to the goal would be to go in the north direction. | ▶ 00:37 |
Therefore, these three rows could all potentially be explored, | ▶ 00:43 |
but the bottom two rows would never be explored under any conditions | ▶ 00:48 |
no matter what happens stochastically as long as the agent is following this fixed policy. | ▶ 00:53 |
In this problem, a passive TD-reinforcement learning agent starts at S and moves to G | ▶ 00:00 |
under a fixed policy which says first, make moves that get closest to G. | ▶ 00:07 |
So if we started here, we'd want to go in this direction because it's closest to G, | ▶ 00:13 |
and (b) stay on these gray squares which represent roads, | ▶ 00:18 |
and (c) if you do happen to go off the road, then move back onto the road immediately. | ▶ 00:22 |
The actions are stochastic, and they may go in the intended direction, | ▶ 00:28 |
or they may go 90 degrees off, so if we were here, we'd plan to start under this policy | ▶ 00:32 |
going in this direction. We might end up there, but we might end up here or here. | ▶ 00:38 |
And if we did end up here, then we'd immediately head back towards the road, | ▶ 00:44 |
so we'd aim back down in this direction. | ▶ 00:49 |
And what I want you to do is click on all the squares that would never be explored | ▶ 00:52 |
by this reinforcement learning agent following this passive fixed policy. | ▶ 00:57 |
Today I have the great, great pleasure | ▶ 00:01 |
to teach you about hidden Markov models and filter algorithms. | ▶ 00:04 |
The reason why I'm so excited is in pretty much all of my scientific career, | ▶ 00:09 |
hidden Markov models and filters have played a major role. | ▶ 00:15 |
There's no robot that I program today that wouldn't extensively use hidden Markov models | ▶ 00:19 |
and things such as particle filters. | ▶ 00:24 |
In fact, when I applied for a job at Stanford University as a professor many years ago, | ▶ 00:27 |
my job talk that I used to market myself to Stanford | ▶ 00:33 |
was extensively about a version of hidden Markov models and particle filters | ▶ 00:37 |
applied to robotic mapping. | ▶ 00:42 |
Today I will teach you those algorithms so you can use them in many challenging problems. | ▶ 00:45 |
I can't quite promise you that once you have mastered the material | ▶ 00:52 |
you will get a job at Stanford, but you can really, really apply them | ▶ 00:55 |
to a vast array of problems in places such as finance, medicine, robotics, | ▶ 01:00 |
weather prediction, time series analysis, and many, many other domains. | ▶ 01:06 |
This is going to be a really fun class. | ▶ 01:11 |
[Thrun] Hidden Markov models, or abbreviated HMMs, | ▶ 00:01 |
are used to analyze or to predict time series. | ▶ 00:07 |
Applications include robotics, medical, finance, | ▶ 00:15 |
speech and language technologies, and many, many, many other domains. | ▶ 00:20 |
In fact, HMMs and filters are at the core of a huge amount of deployed practical systems | ▶ 00:25 |
from elevators to airplanes. | ▶ 00:35 |
Every time there is a time series that involves noise or sensors or uncertainty, | ▶ 00:38 |
this is the method of choice. | ▶ 00:43 |
So today I'll teach you all about HMMs and filters | ▶ 00:45 |
so you can apply some of the basic algorithms in a wide array of practical problems. | ▶ 00:48 |
[Thrun] The essence of HMMs is really simply characterized | ▶ 00:00 |
by the following Bayes network. | ▶ 00:04 |
There's a sequence of states that evolve over time, | ▶ 00:07 |
and each state depends only on the previous state in this Bayes network. | ▶ 00:11 |
Each state also emits what's called a measurement. | ▶ 00:17 |
It is this Bayes network that is the core of hidden Markov models | ▶ 00:22 |
and various probabilistic filters such as Kalman filters, particle filters, and many others. | ▶ 00:27 |
These are words that might sound cryptic and they might not mean anything to you, | ▶ 00:35 |
but you might come across them as you study different disciplines of computer science | ▶ 00:40 |
and control theory. | ▶ 00:44 |
The real key here is the graphical model. | ▶ 00:46 |
If you look at the evolution of states, | ▶ 00:49 |
what you'll find is that these states evolve as what's called a Markov chain. | ▶ 00:52 |
In a Markov chain, each state only depends on its predecessor. | ▶ 00:57 |
So for example, state S3 is conditioned on S2 but not on S1. | ▶ 01:02 |
It's only mediated through S2 that S3 might be influenced by S1. | ▶ 01:07 |
That's called a Markov chain, and we're going to study Markov chains quite a bit | ▶ 01:11 |
in this class to understand them well. | ▶ 01:15 |
But what makes it a hidden Markov model or hidden Markov chain, if you wish, | ▶ 01:17 |
is the fact that there are measurement variables. | ▶ 01:22 |
So rather than being able to observe the state itself, what you get to see are measurements. | ▶ 01:25 |
Let me put this into perspective by showing you several of the robots I've built | ▶ 01:31 |
that possess hidden state. | ▶ 01:36 |
And where I only get to observe certain measurements, | ▶ 01:38 |
I have to infer something about the hidden state. | ▶ 01:42 |
[Thrun] What's shown here is the tour guide robot that I showed you earlier, | ▶ 00:00 |
but now I'll talk about what's called the localization problem-- | ▶ 00:03 |
the problem of finding out where in the world this robot is. | ▶ 00:07 |
This problem is important because to find its way around the museum | ▶ 00:13 |
and to arrive at exhibits of interest, it must know where it is. | ▶ 00:17 |
The difficulty with this problem is that the robot doesn't have a sensor that tells it where it is. | ▶ 00:23 |
Instead, it's given what's called range finders. | ▶ 00:30 |
These are sensors that measure distances to surrounding objects. | ▶ 00:34 |
It's also given the map of the environment, | ▶ 00:38 |
and it can compare these range finder measurements with the map of the environment | ▶ 00:40 |
and infer from that where it might be. | ▶ 00:46 |
The process of inferring the hidden state of the robot's location from the measurements, | ▶ 00:49 |
the range sensor measurements, that's the problem of filtering. | ▶ 00:56 |
And the underlying model is exactly the same I showed you before. | ▶ 01:00 |
It's a hidden Markov model where the state is the sequence of locations | ▶ 01:05 |
that the robot assumes in the museum | ▶ 01:09 |
and the measurements are the sequence of range measurements it perceives | ▶ 01:12 |
while it navigates the museum. | ▶ 01:16 |
A second example is the underground robotic mapping robot | ▶ 01:19 |
which has pretty much the same problem--finding out where it is-- | ▶ 01:24 |
but now it is not given a map; it builds the map from scratch. | ▶ 01:28 |
What this animation here shows you is a so-called particle filter applied to robotic mapping. | ▶ 01:32 |
Intuitively--what you see is very simple-- | ▶ 01:41 |
as the robot descends into a mine, it builds a map. | ▶ 01:44 |
But the many black lines are hypotheses on where the robot might have been | ▶ 01:48 |
when building this map. | ▶ 01:54 |
It can't tell because of the noise in its motors and in its sensors. | ▶ 01:56 |
As the robot reconnects and closes the loop in this map, | ▶ 02:00 |
one of these black hypotheses--what we call particles in the trade-- | ▶ 02:05 |
is selected as the best one, | ▶ 02:09 |
and by virtue of having maintained many of those, the robot is able to build a coherent map. | ▶ 02:13 |
In fact, this animation was a key animation in my job talk | ▶ 02:19 |
when I applied to become a professor at Stanford University. | ▶ 02:23 |
Here is one final example I'd like to discuss with you which is called speech recognition. | ▶ 02:27 |
If you have a microphone that records speech | ▶ 02:32 |
and you want to make your computer recognize the speech, | ▶ 02:35 |
you will likely come across hidden Markov models. | ▶ 02:38 |
This is a typical speech signal over here. | ▶ 02:41 |
It's an oscillation for the words "speech lab" which I borrowed from Simon Arnfield. | ▶ 02:44 |
And if you blow up a small region over here, you'll find that there is an oscillation, | ▶ 02:51 |
and this oscillation in time is the speech signal. | ▶ 02:58 |
What speech recognizing systems do is they transform this signal over here | ▶ 03:03 |
back into letters like "speech lab." | ▶ 03:09 |
And you can see it's not an easy task. | ▶ 03:12 |
There is some signal here. | ▶ 03:14 |
The E, for example, is a certain shape. | ▶ 03:16 |
But different speakers speak differently, and there might be background noise, | ▶ 03:18 |
so decoding this back into speech is challenging. | ▶ 03:22 |
There's been enormous progress in the field | ▶ 03:25 |
mostly due to hidden Markov models that have been researched for more than 20 years. | ▶ 03:28 |
And today's best speech recognizers all use variants of hidden Markov models. | ▶ 03:33 |
So once again, I can't teach you everything in this class, but I'll teach you the very basics | ▶ 03:38 |
that you can apply to things such as speech signals. | ▶ 03:43 |
[Thrun] So let's begin by taking the hidden out of the Markov model | ▶ 00:00 |
and study Markov chains. | ▶ 00:05 |
We're going to use an example for which I will quiz you. | ▶ 00:07 |
Suppose there are 2 types of weather--rainy, which we call R, | ▶ 00:11 |
and sunny, which we call S-- | ▶ 00:15 |
and suppose we have the following state transition diagram. | ▶ 00:17 |
If it's rainy, it stays rainy with a 0.6 chance while with 0.4 it becomes sunny. | ▶ 00:21 |
Sunny remains sunny with 0.8 chance but moves to rainy with 0.2 chance. | ▶ 00:28 |
This is obviously a temporal sequence so the weather at time 1 will be called R1 or S1, | ▶ 00:33 |
at time 2, R2 or S2. | ▶ 00:41 |
Suppose in the beginning we happen to know it is rainy, | ▶ 00:44 |
which means R at time 0 when we begin. | ▶ 00:48 |
We have the probability of rain, R at time 0, equals 1 and the probability of sun, S at time 0, equals 0. | ▶ 00:52 |
I'd like to know from you what's the probability of rain on day 1, the same for day 2, | ▶ 00:59 |
and the same for day 3. | ▶ 01:08 |
[Thrun] And the answer will be 0.6, 0.44, and 0.376. | ▶ 00:00 |
It's really an exercise applying probability theory. | ▶ 00:08 |
In the very beginning we know to be in state R, | ▶ 00:13 |
and the probability of remaining there is 0.6, which is directly the value on the arc over here. | ▶ 00:17 |
On the second state we know that the probability of R is 0.6 | ▶ 00:24 |
and therefore, the probability of sun is 0.4, | ▶ 00:29 |
and we compute the probability of rain on day 2 using total probability. | ▶ 00:34 |
The probability of rain on day 2 given rain on day 1 | ▶ 00:40 |
times the probability of rain on day 1 plus the probability of rain on day 2 | ▶ 00:45 |
given it was sunny on day 1 times the probability of sun on day 1. | ▶ 00:49 |
And if you plug in all these values, | ▶ 00:54 |
we get 0.6 times 0.6 plus rain following sun which is this arc over here, 0.2, | ▶ 00:56 |
times 0.4 as the prior, and this results in 0.44. | ▶ 01:05 |
We can now do the same with the probability of rain on day 3, | ▶ 01:12 |
which is the same 0.6 over here, but now our prior is different--it's 0.44-- | ▶ 01:17 |
plus the same 0.2 over here with the prior of 0.56, which is 1 minus 0.44. | ▶ 01:26 |
And when you work this all out, it is 0.376 as indicated over here. | ▶ 01:33 |
So what we really learned here is that this is a temporal Bayes network | ▶ 01:38 |
to which we can apply conventional probability rules such as total probability, | ▶ 01:42 |
which was also known as variable elimination in the Bayes network lecture. | ▶ 01:48 |
All these fancy words aside, it's really easy to evaluate those. | ▶ 01:52 |
So if you want to do this and you ask yourself, given the probability of a certain time step | ▶ 01:56 |
like time step 1, how is it related to time step 2, | ▶ 02:01 |
you ask yourself what are the states that I might encounter in time step 1. | ▶ 02:04 |
There are usually 2 in this case. | ▶ 02:09 |
What are the transition probabilities that lead me to the desired state in time step 2 | ▶ 02:11 |
like the 0.6 if you started in R and 0.2 if you started in S, | ▶ 02:16 |
and you add all these cases up and you just get the right number. | ▶ 02:22 |
It's really an easy piece of mathematics if you think about it. | ▶ 02:25 |
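A few lines of Python make this recursion concrete; the variable names are illustrative, and the transition values are the ones from the rain/sun diagram above.

    p_rain_given_rain = 0.6   # rain stays rain
    p_rain_given_sun = 0.2    # sun turns to rain

    p_rain = 1.0              # it is raining on day 0
    for day in (1, 2, 3):
        p_rain = p_rain_given_rain * p_rain + p_rain_given_sun * (1.0 - p_rain)
        print(day, round(p_rain, 3))   # prints 0.6, then 0.44, then 0.376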
[Thrun] Let's practice this again with another 2-state Markov chain. | ▶ 00:00 |
States are A and B. | ▶ 00:04 |
A has a 50% chance of transitioning to B, | ▶ 00:07 |
and B always transitions into A. | ▶ 00:11 |
There is no loop from B to itself. | ▶ 00:13 |
Let's assume again at a time, 0, we know with certainty to be in state A. | ▶ 00:16 |
I would like to know the probability of A at time 1, at time 2, and at time 3. | ▶ 00:22 |
[Thrun] And again the solution follows directly from the state diagram over here. | ▶ 00:00 |
In the beginning we do know we're in state A | ▶ 00:03 |
and the chance of remaining in A is 0.5. | ▶ 00:07 |
This is the 0.5 over here. We can just read this off. | ▶ 00:09 |
For the next state we find ourselves to be with 0.5 chance to be in A | ▶ 00:13 |
and 0.5 chance to be in B. | ▶ 00:19 |
If we're in B, we transition with certainty to A. | ▶ 00:21 |
That's because of the 0.5. | ▶ 00:24 |
But if we're in A, we stay in A with a 0.5 chance. So you put this together. | ▶ 00:26 |
0.5 probability being in A times 0.5 probability of remaining in A | ▶ 00:31 |
plus 0.5 probability to be in B times 1 probability to transition to A. | ▶ 00:36 |
That gives us 0.75. | ▶ 00:41 |
Following the same logic but now we're in A with 0.75 times a 0.5 probability | ▶ 00:44 |
of staying in A plus 0.25 in B, which is 1 minus 0.75, | ▶ 00:52 |
and the transition certainty back to A of 1, we get 0.625. | ▶ 00:58 |
So now you should be able to take a Markov chain and compute by hand | ▶ 01:06 |
or write a piece of software the probabilities of future states. | ▶ 01:11 |
You will be able to predict something. That's really exciting. | ▶ 01:16 |
[Thrun] So one of the questions you might ask for a Markov chain like this is | ▶ 00:00 |
what happens if time becomes really large? | ▶ 00:04 |
What happens for the probability of A1000? | ▶ 00:07 |
Or let's go extreme. | ▶ 00:11 |
What about in the limit, A infinity, often written as the limits of time | ▶ 00:13 |
going to infinity of any P of At. | ▶ 00:19 |
That's like the fancy math notation, but what it really means is we just wait a long, long time. | ▶ 00:22 |
What is going to happen to the Markov chain over here? What is that probability? | ▶ 00:28 |
This probability is called a stationary distribution, | ▶ 00:32 |
and a Markov chain settles to a stationary distribution | ▶ 00:36 |
or sometimes a limit cycle if the transitions are periodic, which we don't care about. | ▶ 00:39 |
And the key to calculating this is to realize that the probability for any t | ▶ 00:44 |
must be the same as the probability at time t minus 1. | ▶ 00:51 |
This can be resolved as follows. | ▶ 00:55 |
We know that P of At is P of At given At minus 1 times P of At minus 1 | ▶ 00:57 |
plus P of At given Bt minus 1 | ▶ 01:05 |
times probability of Bt minus 1. | ▶ 01:12 |
This is just the theorem of total probability or forward propagation rule | ▶ 01:17 |
applied to this case over here, so nothing really new. | ▶ 01:21 |
But if you call this guy over here X, then we now have X | ▶ 01:26 |
equals probability of At given At minus 1 is 0.5 | ▶ 01:32 |
times--and this is the same X as this one over here | ▶ 01:39 |
because you're looking for the stationary distribution, so it's X again. | ▶ 01:41 |
This probability over here, A following B, is 1 in this special case, | ▶ 01:45 |
and the probability of Bt minus 1 is 1 minus At minus 1. | ▶ 01:51 |
And if you plug this in, that's the same as 1 minus X. | ▶ 01:58 |
And we can now solve this for X. | ▶ 02:02 |
Let me just do this. | ▶ 02:05 |
X equals, if you put these 2 Xs together, minus 0.5X plus 1, | ▶ 02:07 |
or, differently, 1.5X equals 1. | ▶ 02:15 |
That means X equals 1 over 1.5, which is 2/3. | ▶ 02:18 |
So the answer here is the stationary distribution will have A occurring with 2/3 chance | ▶ 02:24 |
and B with 1/3 chance. | ▶ 02:31 |
It's still a Markov chain--it flips from A to B-- | ▶ 02:33 |
but these are the frequencies at which A occurs | ▶ 02:35 |
and this is the frequency at which B occurs. | ▶ 02:38 |
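You can also check the 2/3 result numerically: since this chain mixes, simply iterating the total-probability update long enough converges to the stationary distribution. A tiny illustrative sketch:

    p_a_given_a, p_a_given_b = 0.5, 1.0   # the two arcs of the A/B chain
    p_a = 1.0                             # start in A with certainty
    for _ in range(50):
        p_a = p_a_given_a * p_a + p_a_given_b * (1.0 - p_a)
    print(round(p_a, 4))                  # 0.6667, i.e. 2/3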
[Thrun] To see if you understood this, let me look at the rain-sun Markov chain again, | ▶ 00:00 |
and let me ask you for the stationary distribution or the limit distribution | ▶ 00:07 |
for rain to be the case after infinitely many steps. | ▶ 00:10 |
[Thrun] And the answer is 1/3, | ▶ 00:00 |
as you can easily see if you call X the probability of rain at time T | ▶ 00:03 |
and also the probability of rain at time T minus 1. | ▶ 00:09 |
These 2 must be equivalent because we're looking for the stationary distribution. | ▶ 00:12 |
Then we get, by virtue of our expansion of the state at time T, | ▶ 00:16 |
the probability of transitioning from rain to rain is 0.6, | ▶ 00:22 |
the probability of having it rain is X again, | ▶ 00:27 |
the probability of transitioning from sun to rain is 0.2, | ▶ 00:30 |
and the probability of having sun before is 1 minus X, | ▶ 00:36 |
so we get X equals 0.4X plus 0.2. | ▶ 00:40 |
Or, differently, we have 0.6X equals 0.2, | ▶ 00:46 |
and when we work this out, X is 1/3, | ▶ 00:51 |
which is the probability of rain in the asymptote if you wait forever. | ▶ 00:55 |
One of the interesting things to observe here | ▶ 01:01 |
is that the stationary distribution does not depend on the initial distribution. | ▶ 01:04 |
In fact, I didn't even tell you what the initial state was. | ▶ 01:08 |
Markov chains that have that property, which are pretty much all Markov chains, | ▶ 01:12 |
are called ergodic. | ▶ 01:16 |
You can safely forget that word again, but people in the field use this word | ▶ 01:19 |
to express Markov chains that mix. | ▶ 01:23 |
And mix means that the knowledge of the initial distribution fades over time | ▶ 01:26 |
until it disappears in the end. | ▶ 01:32 |
The speed at which it gets lost is called the mixing speed. | ▶ 01:36 |
[Thrun] You can also learn the transition probabilities of a Markov chain like this | ▶ 00:00 |
from actual data. | ▶ 00:07 |
Suppose you look out of the window and see sequences of rainy days | ▶ 00:09 |
followed by sunny days followed by rainy days | ▶ 00:12 |
and you wonder what numbers to put here, here, here, and here. | ▶ 00:15 |
Let me assume you see a sequence rain, sun, sun, sun, rain, sun, and rain. | ▶ 00:24 |
These are, in total, 7 different days, | ▶ 00:34 |
and we wish to estimate all those probabilities over here, | ▶ 00:36 |
including the initial distribution for the first day using maximum likelihood. | ▶ 00:38 |
You might remember all this work with Laplace smoothing, | ▶ 00:46 |
but for now we keep it simple, just maximum likelihood. | ▶ 00:49 |
We find for day 0 we had rain, and maximum likelihood would just say | ▶ 00:52 |
the probability for day 0 is 1. | ▶ 00:57 |
That's the most likely estimate. | ▶ 01:00 |
Then for the transition probability we find we transition from rain | ▶ 01:02 |
to something else twice here. | ▶ 01:07 |
We could either transition to sun or stay in rain. | ▶ 01:11 |
In both of the transitions we go from rain to sun. There is no instance of rain to rain. | ▶ 01:14 |
So maximum likelihood gives us over here a 1 and this over here 0. | ▶ 01:19 |
And finally, we can also ask the question what happens from a sunny state. | ▶ 01:23 |
We transition to a new sunny state or a rainy state, | ▶ 01:27 |
and those distributions are easily calculated. | ▶ 01:31 |
We have 4 transitioning out of a sunny state to something else-- | ▶ 01:33 |
this one, this one, this one, and this one. | ▶ 01:37 |
Twice it goes to sunny over here and over here, | ▶ 01:39 |
twice it goes to rainy over here and over here, | ▶ 01:42 |
so therefore the probability for either transition is 0.5. | ▶ 01:45 |
So we have 0.5 over here, 0.5 over here, 1 over here, and 0 over here | ▶ 01:48 |
for the transition probabilities. | ▶ 01:54 |
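A sketch of the counting behind maximum likelihood, assuming the sequence is given as a string of 'R' and 'S' characters; the code is illustrative, not from the lecture.

    from collections import Counter

    sequence = "RSSSRSR"                           # rain, sun, sun, sun, rain, sun, rain

    pair_counts = Counter(zip(sequence, sequence[1:]))   # counts of each observed transition
    leave_counts = Counter(sequence[:-1])                # how often each state is left

    for prev in "RS":
        for nxt in "RS":
            print(prev, "->", nxt, "=", pair_counts[(prev, nxt)] / leave_counts[prev])
    # prints R -> R = 0.0, R -> S = 1.0, S -> R = 0.5, S -> S = 0.5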
[Thrun] So in this quiz please do the same for me. | ▶ 00:00 |
Here is our sequence. | ▶ 00:03 |
There's a couple of sunny days--5 in total--a rainy day, 3 sunny days, 2 rainy days. | ▶ 00:05 |
Calculate using maximum likelihood the prior probability of rain | ▶ 00:10 |
and then the 4 transition probabilities as before. | ▶ 00:14 |
Please fill in those numbers over here. | ▶ 00:18 |
[Thrun] The initial probability for rain is 0 | ▶ 00:00 |
because we are just encountering 1 initial day and it's sunny. | ▶ 00:03 |
The maximum likelihood estimate is therefore 0. | ▶ 00:06 |
We transition 8 times out of a sunny state--1, 2, 3, 4, 5, 6, 7, 8-- | ▶ 00:09 |
twice into a rainy state, and therefore 6 times we remain in a sunny state, | ▶ 00:15 |
so the probability of sun to sun is ¾, | ▶ 00:20 |
whereas sun to rain is ¼. | ▶ 00:23 |
From a rainy state we have 2 outbound transitions, | ▶ 00:26 |
1 to a sunny state and 1 to a rainy state. | ▶ 00:29 |
The last R over here has no outbound transition, | ▶ 00:32 |
so it doesn't really count in our statistic. | ▶ 00:34 |
The maximum likelihood therefore is 0.5 or ½ for each of those. | ▶ 00:37 |
[Thrun] One of the oddities of the maximum likelihood estimator is overfitting. | ▶ 00:00 |
So for example, we observed that we always have a single first day, | ▶ 00:04 |
and this becomes our prior probability. | ▶ 00:08 |
So in this case the prior probability for rain on day 0 would be 1, | ▶ 00:11 |
which kind of doesn't make sense, really. | ▶ 00:16 |
It should be more like the stationary distribution or something like that. | ▶ 00:18 |
Well, you might remember the work on Laplacian smoothing. | ▶ 00:21 |
This is a great moment where I can test whether you really think | ▶ 00:27 |
like an artificial intelligence person. | ▶ 00:30 |
I'm going to make you apply Laplacian smoothing in this new context | ▶ 00:33 |
of estimating the parameters of this Markov chain | ▶ 00:36 |
using the smoother of K = 1. | ▶ 00:41 |
You might remember you add something to the numerator, like 1, | ▶ 00:45 |
and something to the denominator to make sure things normalize, | ▶ 00:48 |
and then you get different probabilities | ▶ 00:51 |
than you would get with the maximum likelihood estimator. | ▶ 00:53 |
So I'm going to ask you a quiz here, even though I haven't completely shown you | ▶ 00:56 |
the application of Laplacian smoothing in this context. | ▶ 00:59 |
But if you understood Laplacian smoothing, you might want to give it a try. | ▶ 01:02 |
What's the probability of rain on day 0, and what are its conditional probabilities? | ▶ 01:05 |
Sun goes to sun, sun goes to rain, rain goes to sun, and rain stays in rain. | ▶ 01:13 |
The way probabilities work, as you surely know, these 2 things over here | ▶ 01:23 |
have to add up to 1, and these 2 things over here have to add up to 1. | ▶ 01:27 |
[Thrun] So in Laplacian smoothing we look at the relative counts. | ▶ 00:00 |
We know there is 1 instance of rain at time 0. | ▶ 00:04 |
Normally it would be 1. | ▶ 00:07 |
But we add 1 to the numerator and 2 to the denominator, and we get 2/3. | ▶ 00:10 |
Let's look at these numbers again. | ▶ 00:19 |
The count that we have is 1 out of 1 for rain, and 1 out of 1 would give us 1 | ▶ 00:21 |
under the maximum likelihood estimator. | ▶ 00:26 |
But because we're smoothing, we're adding a pseudocount, | ▶ 00:28 |
which is 1 rainy day and 1 sunny day, | ▶ 00:31 |
and we have to compensate for the 2 additional counts with a 2 over here | ▶ 00:34 |
and therefore we get 2/3. | ▶ 00:38 |
So our probability under the Laplacian smoother is 2/3 for the rainy day to be the first day, | ▶ 00:40 |
which is really different from 1. | ▶ 00:46 |
Applying the same logic over here, we transition 3 times out of a sunny state-- | ▶ 00:48 |
1, 2, 3--and each time it's a sunny state. | ▶ 00:53 |
So maximum likelihood would say 3 times out of 3 it's sunny into sunny. | ▶ 00:58 |
We add a pseudo observation of 1, and then there's 2 possible outcomes; | ▶ 01:02 |
hence, we have to count 2 over here. | ▶ 01:07 |
So it's 4/5. | ▶ 01:10 |
And the missing 1/5 shows up over here. | ▶ 01:13 |
We can do the same math as before. | ▶ 01:15 |
Zero of the 3 transitions from a sunny day resulted in a rainy day. | ▶ 01:18 |
In fact, they were all sunny. | ▶ 01:22 |
But we add 1 pseudo observation over here and 2 to the normalizer, which gives 1/5. | ▶ 01:24 |
These 2 things surely add up to 1. | ▶ 01:29 |
The last one is analogous. | ▶ 01:32 |
We have 1 transition out of a rainy state, and it led to a sunny state, so 1/1, | ▶ 01:34 |
but we add 1 over here and 2 on the denominator so you get 2/3. | ▶ 01:38 |
And if you do the math over here, you get 1/3. | ▶ 01:42 |
I really want you to remember Laplacian smoothing. | ▶ 01:45 |
It's applicable to many estimation problems, | ▶ 01:47 |
and it will be important going forward in this class. | ▶ 01:51 |
Here we applied it to the estimation of a Markov chain. | ▶ 01:55 |
Please take a moment and study the logic so you'll be able to apply those things again. | ▶ 01:58 |
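The same counts with Laplace smoothing, K = 1: add K to the numerator and K times the number of outcomes to the denominator. The helper name below is illustrative, but the printed values match the lecture's answers.

    def laplace(count, total, k=1, outcomes=2):
        """Laplace-smoothed estimate of a probability from counts."""
        return (count + k) / (total + k * outcomes)

    print(laplace(3, 3))   # sun -> sun seen 3 times out of 3: smoothed to 4/5
    print(laplace(0, 3))   # sun -> rain seen 0 times out of 3: smoothed to 1/5
    print(laplace(1, 1))   # rain on the single day 0: smoothed to 2/3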
[Thrun] So now let's return to hidden Markov models. | ▶ 00:00 |
Those are really the subject of this class. | ▶ 00:04 |
Let's again use the rainy and sunny example just to keep it simple. | ▶ 00:08 |
These are the transition probabilities as before. | ▶ 00:12 |
Let's assume for now that the initial probability of rain is 0.5; | ▶ 00:15 |
hence, the probability of sun at time 0 is 0.5. | ▶ 00:20 |
The key modification to go to a hidden Markov model is that the state is actually hidden. | ▶ 00:23 |
I cannot see whether it's raining or it's sunny. | ▶ 00:28 |
Instead I get to observe something else. | ▶ 00:32 |
Suppose I can be happy or grumpy | ▶ 00:35 |
and happiness or grumpiness is being caused by the weather. | ▶ 00:38 |
So rain might make me happy or grumpy, | ▶ 00:43 |
and sunshine makes me happy or grumpy | ▶ 00:46 |
but with vastly different probabilities. | ▶ 00:49 |
If it's sunny, I'm just mostly happy, 0.9. | ▶ 00:51 |
There's a 0.1 chance I might still be grumpy for some other reason. | ▶ 00:55 |
If it's rainy, I'm only happy with 0.4 probability and with 0.6 I'm grumpy. | ▶ 00:59 |
In fact, living in California I can attest that these are actually not wrong probabilities. | ▶ 01:05 |
I love the sun over here. | ▶ 01:11 |
Suppose I observe that I'm happy on day 1. | ▶ 01:14 |
A question that we can ask now is what is the so-called posterior probability | ▶ 01:20 |
for it raining on day 1 and what's the posterior probability for it being sunny on day 1? | ▶ 01:27 |
What's the probability of rain on day 1 given that I observed that I was happy on day 1? | ▶ 01:35 |
This is being answered using Bayes rule, | ▶ 01:43 |
so this is the probability of being happy given that it rains | ▶ 01:46 |
times the probability that it rains over the probability of being happy. | ▶ 01:50 |
We know the probability of rain at day 1 based on our Markov state transition model. | ▶ 01:56 |
In fact, let's just calculate it. | ▶ 02:03 |
The probability of rain on day 1 is the probability it was rainy on day 0 | ▶ 02:05 |
and it led to a self transition from rain to rain from day 0 to day 1 | ▶ 02:10 |
plus the probability it was sunny on day 0 times the probability that sun led to rain over here. | ▶ 02:14 |
If you plug in all these numbers, you obtain 0.4; | ▶ 02:20 |
you can easily verify this. | ▶ 02:26 |
So we know this guy over here is 0.4. | ▶ 02:29 |
This guy over here is 0.4 again, but now it's this 0.4 over here. | ▶ 02:32 |
The probability of being happy on a rainy day is 0.4. | ▶ 02:39 |
This guy over here resolves to 0.4 times 0.4 | ▶ 02:44 |
plus the same situation with sunny in time 1 | ▶ 02:51 |
where the prior is 0.6 and the happiness factor is 0.9. | ▶ 02:55 |
And that gives us 0.229 for the entire expression. | ▶ 03:01 |
Let's interpret the 0.229 in the context of the question we asked. | ▶ 03:06 |
We know that at time 0 it was raining with half a chance. | ▶ 03:11 |
If you look at the state transition diagram, it's more likely to be sunny afterwards | ▶ 03:16 |
because it's more likely to flip from rain to sun than sun to rain. | ▶ 03:20 |
In fact, we worked out that the probability of rain at a time step later was only 0.4, | ▶ 03:23 |
so it was 0.6 sunny. | ▶ 03:29 |
But now that I saw myself being happy, my probability of rain was further lowered | ▶ 03:31 |
from 0.4 to 0.229. | ▶ 03:36 |
And the reason why the probability went down is if you look at happiness, | ▶ 03:39 |
happiness is much more likely to occur on a sunny day than it is to occur on a rainy day. | ▶ 03:45 |
And when you work this in using Bayes rule and total probability, | ▶ 03:50 |
you would find that just the fact that I was happy at time 1 | ▶ 03:53 |
makes your belief of it being rainy go down from 0.4 to 0.229. | ▶ 03:57 |
This is a wonderful example of applying Bayes rule | ▶ 04:05 |
in this really relatively complicated hidden Markov model. | ▶ 04:08 |
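Written out compactly, with every number taken from the lecture, the posterior just computed is:

    P(R_1 \mid H_1) = \frac{P(H_1 \mid R_1)\,P(R_1)}{P(H_1 \mid R_1)\,P(R_1) + P(H_1 \mid S_1)\,P(S_1)}
                    = \frac{0.4 \cdot 0.4}{0.4 \cdot 0.4 + 0.9 \cdot 0.6} \approx 0.229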
[Thrun] So let me use exactly the same hidden Markov model where we have rain and sun | ▶ 00:00 |
and happiness and grumpiness with 0.4 and 0.6 | ▶ 00:06 |
and 0.9 and 0.1 probabilities. | ▶ 00:11 |
The only change I will apply is I will tell you that for probability 1 it's raining on day 0; | ▶ 00:14 |
hence, the probability of sunny at day 0 is 0. | ▶ 00:21 |
I now observe another happy face on day 1, | ▶ 00:25 |
and I'd like to know the probability of it raining on day 1 given this observation. | ▶ 00:30 |
This is the same as before with the only difference | ▶ 00:37 |
that we have a different initial probability, | ▶ 00:40 |
but all the other probabilities should just be the same. | ▶ 00:43 |
[Thrun] Once again let's calculate the probability of rain on day 1. | ▶ 00:00 |
This one is easy because we know it is raining on day 0, | ▶ 00:07 |
so it's 0.6, the 0.6 over here. | ▶ 00:10 |
This expression over here is expanded by a Bayes rule as applied over here. | ▶ 00:14 |
Probability of happiness during rain is 0.4, | ▶ 00:20 |
the probability of rain was said to be just 0.6, | ▶ 00:24 |
and we divide by 0.4 times 0.6 plus 0.9 times 0.4, which is 1 minus 0.6. | ▶ 00:28 |
And that resolves simply to 0.4 if you work it all out. | ▶ 00:37 |
So the interesting thing here is if you were just to run the Markov chain, | ▶ 00:42 |
on day 1 we have a 0.6 chance of rain, | ▶ 00:47 |
but the fact that I observed myself to be happy reduces the chance of rain to 0.4. | ▶ 00:51 |
[Thrun] So if you got those questions right, I'm in awe with you--wow-- | ▶ 00:00 |
because you understand the very basics of using a hidden Markov model for 2 things now. | ▶ 00:05 |
One is prediction, and one is called state estimation. | ▶ 00:10 |
In state estimation that's a really fancy word for just computing the probability | ▶ 00:16 |
of the internal or hidden state given measurements. | ▶ 00:21 |
In prediction we predict the next state, and you might also predict the next measurement. | ▶ 00:26 |
[Thrun] I want to show you a little animation of hidden Markov models | ▶ 00:00 |
used for robot localization. | ▶ 00:05 |
This is obviously a little toy robot over here that lives in the grid world, | ▶ 00:07 |
and the grid world is composed of discrete cells where the robot may be located. | ▶ 00:12 |
This robot happens to know where north is at all times. | ▶ 00:16 |
It's given 4 sensors, a wall sensor to the left, to the right, to the top | ▶ 00:20 |
and the bottom over here, and it can sense whether in the adjacent cell there's a wall or not. | ▶ 00:24 |
Initially this robot has no clue where it is. It faces what we call a global localization problem. | ▶ 00:30 |
It now uses its sensors and its actuators to localize itself. | ▶ 00:37 |
So in the very first episode the robot senses a wall north and south of it | ▶ 00:43 |
but none west or east. | ▶ 00:49 |
And look what this does to the probabilities. | ▶ 00:52 |
The posterior probability is now increased | ▶ 00:56 |
in places that are consistent with this measurement, | ▶ 00:58 |
like all of those places have a wall to the north and south, like these guys over here, | ▶ 01:01 |
and free space to the left and the right, | ▶ 01:06 |
yet they have been decreased in places that are inconsistent, like this guy over here. | ▶ 01:09 |
These states over here are interesting. They are shaded gray and lighter gray. | ▶ 01:14 |
What this means is they still have a significant probability | ▶ 01:18 |
but yet not as much as over here, | ▶ 01:21 |
the reason being that this measurement over here would be characteristic | ▶ 01:24 |
for the state over here if there had been exactly 1 measurement error-- | ▶ 01:29 |
if the bottom sensor had erred and erroneously detected a wall. | ▶ 01:33 |
Errors are less likely than no errors, and as a result, the cell over here | ▶ 01:39 |
which is completely consistent ends up to be more likely than the cell over here, | ▶ 01:43 |
yet you can see the HMM does a nice job in understanding the posterior probability. | ▶ 01:47 |
Let's assume the robot moves right and senses again | ▶ 01:53 |
and gets the exact same measurement. | ▶ 01:57 |
Of course it has no clue that it is exactly over here. | ▶ 01:59 |
You can see the probabilities have decayed. | ▶ 02:02 |
Interestingly enough, this guy over here has a lower probability, | ▶ 02:04 |
and the reason is by itself it is very consistent with the most recent measurement, | ▶ 02:08 |
but it's less consistent with the idea of having moved right and measured before | ▶ 02:12 |
a wall to the north and the south. | ▶ 02:17 |
And similarly, these places over here become less consistent. | ▶ 02:19 |
The only ones that are completely consistent are these 3 states over here | ▶ 02:23 |
and the 3 states over here. | ▶ 02:26 |
The robot keeps moving to the right, | ▶ 02:28 |
and now we get to the point where the sequence of measurement | ▶ 02:30 |
really makes 2 states equally likely--the ones over here. | ▶ 02:34 |
They are equally likely with symmetry. | ▶ 02:36 |
Those are still pretty likely, and those over here to the left are gradually less likely. | ▶ 02:38 |
As the robot now moves, it moves into a distinguishing state. | ▶ 02:44 |
It sees a wall in the north but free space in the 3 other directions, | ▶ 02:48 |
and that renders the state over here relatively unlikely, | ▶ 02:52 |
and now it has localized itself. | ▶ 02:55 |
[Thrun] We discussed specific incidents of hidden Markov model inference or filtering | ▶ 00:00 |
in our quizzes. | ▶ 00:05 |
Let me now give you the basic math. | ▶ 00:07 |
We all know hidden Markov model is a chain like this | ▶ 00:09 |
of hidden states that are Markovian | ▶ 00:13 |
and measurements that only depend on the corresponding state. | ▶ 00:17 |
We know that this Bayes network entailed certain independencies. | ▶ 00:22 |
For example, given X2 the past, the future, and the present measurement | ▶ 00:25 |
are all conditionally independent given X2. | ▶ 00:34 |
The nice thing about this structure is it makes it possible to efficiently do inference. | ▶ 00:37 |
I'll give you the equations we used before here in a more explicit form. | ▶ 00:42 |
Let's look at the measurement side, and suppose we wish to know the probability | ▶ 00:49 |
of an internal state variable given a specific measurement, | ▶ 00:55 |
and that by Bayes rule becomes P of Z1 given X1 times P of X1 over P of Z1. | ▶ 00:59 |
When you start doing this, you'll find that the normalizer | ▶ 01:06 |
doesn't depend on the target variable X; | ▶ 01:10 |
therefore, we often write a proportionality sign and get an equation like this. | ▶ 01:13 |
This product over here is the basic measurement update of hidden Markov models. | ▶ 01:19 |
And the thing to remember when you apply it, you have to normalize. | ▶ 01:24 |
We already practiced all of this, so you know all of this. | ▶ 01:27 |
The other equation is the prediction equation, | ▶ 01:30 |
so let's go from X1 to X2. | ▶ 01:33 |
This is called prediction even though sometimes it has nothing to do with prediction. | ▶ 01:36 |
It's the traditional term, but it comes from the fact that we might want to predict | ▶ 01:40 |
the distribution of X2 given that we know the distribution of X1. | ▶ 01:43 |
Here we apply total probability. | ▶ 01:49 |
The probability of X2 is obtained by checking all states we might have come from in X1 | ▶ 01:51 |
and calculating the probability of going from X1 to X2. | ▶ 01:59 |
We also practiced this before. | ▶ 02:03 |
Any probability of X2 being in a certain state must have come from another state, X1, | ▶ 02:06 |
and then transitioned into X2, so we sum over all of those | ▶ 02:12 |
and we get the posterior probability of X2. | ▶ 02:15 |
These 2 equations together form the math of a hidden Markov model | ▶ 02:18 |
where the next state distribution and the measurement distribution | ▶ 02:24 |
and the initial state distribution are all given as the parameters of a hidden Markov model. | ▶ 02:29 |
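To summarize the two equations in one place, in the same notation used above:

    \text{Measurement update:}\quad P(X_1 \mid Z_1) \propto P(Z_1 \mid X_1)\, P(X_1) \quad\text{(then normalize)}
    \text{Prediction:}\quad P(X_2) = \sum_{x_1} P(X_2 \mid X_1 = x_1)\, P(X_1 = x_1)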
[Thrun] Here is the application of HMM to a real robot localization example. | ▶ 00:00 |
This robot is in a world that's 1-dimensional and it is lost. | ▶ 00:05 |
It has initial uncertainty about where it is, | ▶ 00:09 |
and it is actually located next to a door but it doesn't know. | ▶ 00:12 |
It's also given a map of the world, | ▶ 00:16 |
and the distribution of all possible states, here noted as s, is given by this histogram. | ▶ 00:18 |
We bin the world into small bins, and for each bin we assign a single numerical probability | ▶ 00:24 |
of the robot being there. | ▶ 00:31 |
The fact that they all have the same height means that the robot is maximally uncertain | ▶ 00:33 |
as to where it is. | ▶ 00:37 |
Let's assume this robot is going to sense | ▶ 00:39 |
and it senses to be next to a door. | ▶ 00:41 |
The red graph over here is the probability of seeing a door | ▶ 00:43 |
for different locations in the environment. | ▶ 00:47 |
There are 3 different doors, and seeing a door is more likely here | ▶ 00:49 |
than it is in between. | ▶ 00:53 |
It might still see a door here, but it's just less likely. | ▶ 00:55 |
We now apply Bayes rule. | ▶ 00:58 |
We multiply the prior with this measurement probability to obtain the posterior. | ▶ 01:00 |
That was our measurement update. It's that simple. | ▶ 01:07 |
So you can see how all these uniform values over here become nonuniform values over here | ▶ 01:10 |
multiplied by this curve over here. | ▶ 01:17 |
The story progresses by the robot taking an action to the right, | ▶ 01:20 |
and this is now the next state prediction part, what we call the convolution part | ▶ 01:25 |
or state transition part, where these little bumps over here get shifted along with the robot | ▶ 01:30 |
and they are flattened out a little bit just because robot motion adds uncertainty. | ▶ 01:36 |
Again, it's a really simple operation. | ▶ 01:40 |
You shift those to the right and you smooth them out a little bit | ▶ 01:43 |
to account for the control noise in the robot's actuators. | ▶ 01:46 |
And now we get to the point that the robot senses again, | ▶ 01:51 |
and this robot senses a door again. | ▶ 01:53 |
And see what happens. It multiplies. | ▶ 01:56 |
It's now a nonuniform prior over here with the same measurement probability as before, | ▶ 01:59 |
but now we get a distribution that's peaked over here | ▶ 02:06 |
and has smaller bumps at various other places, | ▶ 02:09 |
the reason being the only place where my prior has a higher probability | ▶ 02:12 |
and my measurement probability is also high is the second door, | ▶ 02:17 |
and as a result of our distribution over here, it assumes a much larger value. | ▶ 02:21 |
If you look at that picture, that is really easy to implement, | ▶ 02:24 |
and that's what we did all along when we talked about rain and sun and so on. | ▶ 02:28 |
It's really a very simple algorithm. | ▶ 02:32 |
Measurements are multiplications, and motions become essentially convolutions | ▶ 02:35 |
which are shifts with added noise. | ▶ 02:41 |
[Thrun] This is a great segue to one of the most successful algorithms | ▶ 00:00 |
in artificial intelligence and robotics called particle filters. | ▶ 00:05 |
Again, the topic here is robot localization, | ▶ 00:09 |
and here we're dealing with a real robot with actual sensor data. | ▶ 00:13 |
The robot is lost in this building. | ▶ 00:17 |
You can see different rooms, and you can see corridors, | ▶ 00:20 |
and the robot is equipped with range sensors. | ▶ 00:24 |
These are sound sensors that measure the range to nearby obstacles. | ▶ 00:26 |
Its task is to figure out where it is. | ▶ 00:31 |
The robot will move along the black line over here, but it doesn't know this. | ▶ 00:35 |
It has no clue where it is. | ▶ 00:39 |
It has to figure out where it is. | ▶ 00:41 |
The key thing in particle filters is the representation of the belief. | ▶ 00:43 |
Whereas before we had discrete worlds like our sun and rain example | ▶ 00:48 |
or we had a histogram approach where we cut the space into small bins, | ▶ 00:54 |
particle filters have a very different representation. | ▶ 00:59 |
They represent the space by a collection of points or particles. | ▶ 01:02 |
Each of these small dots over here is a hypothesis where the robot might be. | ▶ 01:07 |
It's a concrete value of its X location and its Y location and its heading direction | ▶ 01:12 |
in this environment. | ▶ 01:19 |
So it's a vector of 3 values. | ▶ 01:21 |
The sum or set of all those vectors together form the belief space. | ▶ 01:23 |
So particle filters approximate a posterior | ▶ 01:30 |
by many, many, many guesses, | ▶ 01:33 |
and the density of those guesses represents the posterior probability | ▶ 01:36 |
of being at a certain location. | ▶ 01:41 |
To illustrate this, let me run the video. | ▶ 01:44 |
You can see in a very short amount of time the range sensors, | ▶ 01:47 |
even though they're very noisy, force the particles to collect in the corridor. | ▶ 01:51 |
There are 2 symmetric particle clouds--this one over here and this one over here-- | ▶ 01:57 |
that come from the fact that the corridor itself is symmetric. | ▶ 02:01 |
But as the robot moves into the office, the symmetry is broken. | ▶ 02:04 |
This office looks very different from this office over here, | ▶ 02:08 |
and those particles die out. | ▶ 02:11 |
What's happening here? | ▶ 02:14 |
Intuitively speaking, each particle is a representation of a possible state, | ▶ 02:16 |
and the more consistent the particle with the measurement, | ▶ 02:21 |
the more the sonar measurement fits into the place where the particle says the robot is, | ▶ 02:24 |
the more likely it is to survive. | ▶ 02:29 |
This is the essence of particle filters. | ▶ 02:31 |
Particle filters use many particles to represent a belief, | ▶ 02:34 |
and they will let those particles survive in proportion to the measurement probability. | ▶ 02:38 |
And the measurement probability here is nothing else but the consistency | ▶ 02:44 |
of the sonar range measurements with the map of the environment | ▶ 02:49 |
given the particle place. | ▶ 02:53 |
Let me play this again. | ▶ 02:55 |
Here's the maze. The robot is lost in space. | ▶ 02:57 |
Again, you can see how within very few steps the particles | ▶ 03:00 |
consistent with the range measurements all accumulate in the corridor. | ▶ 03:05 |
As the robot hits the end of the corridor, only 2 particle clouds survive | ▶ 03:10 |
due to the symmetry of the corridor, and the particles finally die out. | ▶ 03:14 |
This algorithm is beautiful, | ▶ 03:19 |
and you can implement it in less than 10 lines of program code. | ▶ 03:21 |
So given all the difficulty of talking of probabilities and Bayes network | ▶ 03:27 |
and hidden Markov models, you will now find a way | ▶ 03:32 |
to implement one of the most amazing algorithms for filtering and state estimation | ▶ 03:36 |
in less than 10 lines of C code. | ▶ 03:41 |
Isn't that amazing? | ▶ 03:45 |
[Thrun] Here is our 1-dimensional localization example again, | ▶ 00:00 |
this time with particle filters. | ▶ 00:03 |
You can see the particles initially spread out uniformly. | ▶ 00:06 |
This 1-dimensional space of forward locations we're going to use as an example | ▶ 00:09 |
to explain every single step of particle filters. | ▶ 00:13 |
In the very first step, the robot senses a door. | ▶ 00:16 |
Here are its initial particles before sensing the door. | ▶ 00:20 |
It now copies these particles over verbatim but gives them what's called a weight. | ▶ 00:24 |
We call this weight the importance weight, | ▶ 00:30 |
and the importance weight is nothing else but the measurement probability. | ▶ 00:33 |
It's more likely to see a door over here than over here. | ▶ 00:37 |
The red curve over here is the measurement probability, | ▶ 00:41 |
and the particles over here are the same as up here, | ▶ 00:44 |
but they now have an importance weight attached, where the height of the particle | ▶ 00:48 |
illustrates the weight. | ▶ 00:52 |
So you can see the place over here, the place over here, and the place over here | ▶ 00:54 |
carry the most weight because they're the most likely ones. | ▶ 00:57 |
This robot moves and it moves by using its previous particles | ▶ 01:00 |
to create a new random particle set that represents the posterior probability | ▶ 01:07 |
of being at a new location. | ▶ 01:12 |
The key thing here is called resampling. | ▶ 01:15 |
The algorithm works as follows. | ▶ 01:19 |
Pick a particle from the set over here and pick it in proportion to the importance weight. | ▶ 01:21 |
Once you've picked one--and sure enough, you pick those more frequently | ▶ 01:28 |
than those over here--add the motion to it plus a little bit of noise | ▶ 01:32 |
to create a new particle. | ▶ 01:37 |
Repeat this procedure for each particle. | ▶ 01:39 |
Pick them with replacement. | ▶ 01:42 |
You're allowed to pick a particle twice or 3 or 4 times. | ▶ 01:44 |
Sure enough, you pick these more frequently. | ▶ 01:47 |
These are being forward moved to over here, these to over here. | ▶ 01:49 |
You see a higher density of particles over here and over here, | ▶ 01:53 |
than you see, for example, over here. | ▶ 01:56 |
That's your forward prediction step in particle filters. | ▶ 01:58 |
It's really easy to implement. | ▶ 02:01 |
The next step is another measurement step, | ▶ 02:04 |
and here I'm illustrating to you that indeed this nonuniform set of particles | ▶ 02:06 |
leads to a reasonable posterior in this space. | ▶ 02:10 |
We now have a particle set that is nonuniform. | ▶ 02:13 |
We have increased density over here, over here, and over here. | ▶ 02:16 |
You can see how multiplying these particles with the importance weight, | ▶ 02:20 |
which is copying them over verbatim but attaching a vertical importance weight | ▶ 02:25 |
in proportion to the measurement probability, | ▶ 02:31 |
yields a lot of particles over here with big weights, | ▶ 02:33 |
some over here with big weights, lots of particles over here with low weights. | ▶ 02:37 |
They got copied over, but the measurement probability here is low and so on and so on. | ▶ 02:41 |
And if you look at this set of particles, you already understand | ▶ 02:46 |
why the majority of importance weight resides in the correct location | ▶ 02:50 |
given that we had a measurement of a door and motion to the right | ▶ 02:56 |
and another measurement of the door. | ▶ 02:59 |
The nice thing here is that particle filters work in continuous spaces, | ▶ 03:01 |
and, what's often underappreciated, they use your computational resources | ▶ 03:06 |
in proportion to how likely something is. | ▶ 03:13 |
You can see that almost all the computation now resides over here, | ▶ 03:16 |
almost all the memory resides over here, | ▶ 03:19 |
and that's the place that's likely. | ▶ 03:21 |
Stuff over here requires less memory, less computation, and guess what? | ▶ 03:23 |
It's much less likely. | ▶ 03:27 |
So particle filters make use of your computational resources in an intelligent way. | ▶ 03:29 |
They're really nice to implement on something with low compute power. | ▶ 03:34 |
Let me move on to explain the next motion. | ▶ 03:38 |
Here you see our robot moving to the right again, | ▶ 03:42 |
and now the same what we call resampling takes place. | ▶ 03:45 |
We pick, with replacement, particles from over here. | ▶ 03:49 |
Sure enough, these are the ones we pick the most often. | ▶ 03:52 |
And then we add the motion command plus some random noise. | ▶ 03:55 |
If you look at this particle set over here, almost all the particles sit over here. | ▶ 03:59 |
It doesn't really show it very well on this computer screen, | ▶ 04:03 |
but the density of particles over here is significantly higher than anywhere else. | ▶ 04:06 |
There are occurrences over here and over here that correspond to these guys over here, | ▶ 04:10 |
and these guys over here and over here correspond to this guy over here, | ▶ 04:14 |
but the vast majority of probability mass sits over here. | ▶ 04:18 |
So let's dive into how complicated this algorithm really is. | ▶ 04:22 |
[Thrun] So here is our algorithm particle filter. | ▶ 00:00 |
It takes as input a set of particles with associated importance weights, | ▶ 00:04 |
a control, and a measurement vector, | ▶ 00:09 |
and it constructs a new particle set S-prime, | ▶ 00:12 |
and in doing so it also has an auxiliary variable, eta. | ▶ 00:16 |
Here is the algorithm. | ▶ 00:20 |
Initially we go through all new particles of which there are n | ▶ 00:22 |
and we sample an index j according to the distribution | ▶ 00:27 |
defined by the importance weights associated with the particle set over here. | ▶ 00:32 |
Put differently, we have a set of particles over here | ▶ 00:38 |
and we have associated importance factors which we will construct a little bit later on, | ▶ 00:41 |
and now we pick one of these particles with replacement | ▶ 00:45 |
where the probability of picking this particle is exactly the importance weight, w. | ▶ 00:48 |
For this particle we now sample a possible successor state | ▶ 00:55 |
according to the state transition probability using our controls | ▶ 01:01 |
and that specific particle as an input. We call it sj over here. | ▶ 01:06 |
We also compute an importance weight, which is the measurement probability | ▶ 01:11 |
for that specific particle over here. | ▶ 01:16 |
This gives us a new particle, and this gives us a new non-normalized importance weight. | ▶ 01:20 |
For now we just add them into our new particle set S-prime and we reiterate. | ▶ 01:25 |
The only thing missing now is at the very end we have to normalize all the weights. | ▶ 01:32 |
For this we keep our running counter, eta, | ▶ 01:37 |
and we have a For loop in which we take all the weights in the set over here | ▶ 01:40 |
and just normalize them accordingly. | ▶ 01:45 |
This is the entire algorithm. | ▶ 01:48 |
We feed in over here particles with associated importance weights | ▶ 01:51 |
and a control and a measurement, | ▶ 01:55 |
and then we construct the new set of particles by picking particles from our previous set | ▶ 01:58 |
at random with replacement but in accordance to the importance weights, | ▶ 02:05 |
so important particles are picked more frequently. | ▶ 02:10 |
For this particle we then guess its new state. | ▶ 02:13 |
We guess what a new state might be by just sampling it, | ▶ 02:16 |
and we attach to it an importance weight, which we later normalize, | ▶ 02:20 |
that is proportional to the measurement probability for this thing over here. | ▶ 02:23 |
So you're going to upweight the particles that look consistent with the measurements | ▶ 02:28 |
and downweight the ones that are inconsistent. | ▶ 02:31 |
We add all of these things back to our particle sets and reiterate. | ▶ 02:34 |
I promised you it would be an easy algorithm. | ▶ 02:38 |
You can look at this, and you could actually implement this really easily. | ▶ 02:40 |
Just remember how much difficulty we introduced | ▶ 02:45 |
with talking about Bayes networks and hidden Markov models and all that stuff. | ▶ 02:48 |
This is all there is to implement particle filters. | ▶ 02:54 |
[Thrun] Particle filters are really easy to implement. | ▶ 00:00 |
They have some deficiencies. | ▶ 00:03 |
They don't really scale to high-dimensional spaces. | ▶ 00:05 |
That's been recognized because the number of particles you need | ▶ 00:08 |
to fill a high-dimensional space tends to grow exponentially | ▶ 00:11 |
with the dimensionality of the space. | ▶ 00:14 |
So for 100 dimensions it's hard to make work. | ▶ 00:17 |
But there are extensions. | ▶ 00:20 |
They go under really fancy names like Rao-Blackwellized particle filters | ▶ 00:22 |
that can actually do this, but I won't talk about them in any detail here. | ▶ 00:27 |
They also have problems with degenerate conditions. | ▶ 00:31 |
For example, they don't work well if you only have 1 particle or 2 particles. | ▶ 00:36 |
They tend not to work well if you have no noise in your measurement model | ▶ 00:40 |
or no noise in your controls. | ▶ 00:45 |
You kind of need this noise to remix things a little bit. | ▶ 00:47 |
If there is very little noise, you have to deviate from the basic paradigm. | ▶ 00:50 |
But the good news is they work really well in many, many applications. | ▶ 00:55 |
For example, our self-driving cars use particle filters for localization and for mapping | ▶ 01:00 |
and for a number of other things. | ▶ 01:05 |
And the reason why they work so well is they're really easy to implement, | ▶ 01:07 |
they're computationally efficient in the sense that they really put the computational resources | ▶ 01:11 |
where they are needed the most, and they can deal with highly non-monotonic | ▶ 01:16 |
and very complex posterior distributions that have many peaks. | ▶ 01:22 |
And that's important. Many other filters can't. | ▶ 01:25 |
So particle filters are often the method of choice when it comes to building quickly | ▶ 01:28 |
an estimation method for problems where the posterior is complex. | ▶ 01:33 |
[Thrun] Wow! You learned a lot about hidden Markov models and particle filters. | ▶ 00:00 |
The particle filter is the most used algorithm in robotics today | ▶ 00:06 |
when it comes to interpreting sensor data, | ▶ 00:10 |
but these algorithms are applicable in a wide array of applications | ▶ 00:12 |
such as finance, medicine, behavioral studies, time series analysis, speech, | ▶ 00:15 |
language technologies, anything involving time and sensors or uncertainty. | ▶ 00:22 |
And now you know how to use them. | ▶ 00:28 |
You can apply them to all these problems if you listened carefully. | ▶ 00:30 |
It's been a pleasure teaching you in this class. | ▶ 00:35 |
As I told you in the beginning, this is a topic very close to my heart, | ▶ 00:37 |
and I hope it's going to empower you to do better stuff | ▶ 00:41 |
in any domain involving time series and uncertainty. | ▶ 00:45 |
Here is a sequence of MDP questions. | ▶ 00:00 |
We're given a maze environment with 8 fields | ▶ 00:04 |
where we receive +100 over here and -100 in the corner over here. | ▶ 00:08 |
Our agent can go north, south, west, or east, but actions may fail at random. | ▶ 00:16 |
With probability "P," and P is a number between 0 and 1, the action succeeds, | ▶ 00:23 |
and with 1 - P we go into reverse. | ▶ 00:30 |
For example, if we take the action go east into this state over here, | ▶ 00:33 |
with P probability we find ourselves over there. | ▶ 00:38 |
With 1 - P we find ourselves right over here in the exact opposite direction. | ▶ 00:41 |
Here is the east action again. With P we go to the right, and with 1 - P we go to the left. | ▶ 00:46 |
Of course, if we bounce into a wall we stay where we are. | ▶ 00:53 |
For my first question, I'll assume P equals 1. | ▶ 00:56 |
There is no uncertainty in action outcome, and there is no failure. | ▶ 01:00 |
The state transition function is deterministic. | ▶ 01:03 |
I want you to fill in for each state the final value after running value iteration to completion, | ▶ 01:06 |
and please assume the cost is -4 and we use gamma equals 1 as the discount factor. | ▶ 01:12 |
Please fill in those missing six values over here. | ▶ 01:19 |
And the answer is obtained by taking the best neighboring value and subtracting 4, | ▶ 00:00 |
which gives us 96 over here, 92, 88, and 84. | ▶ 00:05 |
Let us now assume that P = 0.8, which means actions fail with probability 0.2. | ▶ 00:00 |
Again, the cost is -4, and gamma equals 1. | ▶ 00:08 |
I want you to run exactly one value calculation for the state in red up here. | ▶ 00:13 |
Assuming that the value function is initialized with 0 everywhere, | ▶ 00:19 |
what will be the value after a single value iteration for the state up here? | ▶ 00:23 |
This is the state a4. | ▶ 00:29 |
The answer is 76. | ▶ 00:00 |
The value over here is maximized for the south action, | ▶ 00:03 |
which we reach with 0.8 chance; with 0.2 chance we'll stay | ▶ 00:07 |
in the same state, for which the initial value was 0, | ▶ 00:11 |
and we subtract the action cost of 4, which is 80 + 0 - 4 = 76. | ▶ 00:15 |
Now using the same premise as before, P equals 0.8, cost equals -4 and gamma equals 1, | ▶ 00:00 |
I'd like you to run value iteration to completion. | ▶ 00:07 |
For the one state over here, a4, I'd like to know what its final value is. | ▶ 00:11 |
Now you might be tempted to write a piece of computer software, | ▶ 00:17 |
but for this specific state, it's actually possible to do it with a relatively simple piece of math. | ▶ 00:20 |
It's not trivial, but give it a try. | ▶ 00:26 |
What is the value of a4 after convergence? | ▶ 00:28 |
The answer is 95. | ▶ 00:00 |
To see this, we observe that after convergence | ▶ 00:03 |
each new iteration doesn't change the value. | ▶ 00:06 |
We kind of know that the optimal policy is to go south over here, | ▶ 00:09 |
so we just write out the value iteration update for going south. | ▶ 00:13 |
Let's call it the value of x. | ▶ 00:16 |
We know that x is updated by 0.8 times 100 plus 0.2 of staying in the same state, | ▶ 00:18 |
whose value is x minus the cost. | ▶ 00:26 |
This invariance must hold true after convergence. We can now solve it for x. | ▶ 00:29 |
We get 0.8x equals 76, so 76 divided by 0.8 is 95. | ▶ 00:34 |
Finally, I'd like to ask you what is the optimal policy for the parameters we just studied. | ▶ 00:00 |
I'm listing here all states--a1, a2, a3, a4, a5, b2, and b3-- | ▶ 00:07 |
and I'd like you to tell me whether you would like to go north, south, west, or east | ▶ 00:14 |
in any of those six states over here. | ▶ 00:21 |
For each of those there is exactly one correct answer. | ▶ 00:23 |
It is easy to see that these three states over here, a1 to a3, you want to go east. | ▶ 00:00 |
In the state a4 up here, you wish to go south. | ▶ 00:07 |
In b2 you wish to go north so as to not risk the 0.2 probability of falling into -100. | ▶ 00:10 |
In b3 it's perfectly fine to go east. | ▶ 00:18 |
In the worst case you find yourself in b2, | ▶ 00:21 |
in which case you can safely escape the -100 by going north, | ▶ 00:23 |
turn right again, and go down over here. | ▶ 00:26 |
This is the correct set of answers over here. | ▶ 00:29 |
In this question I would like you to check all the boxes that are true. | ▶ 00:00 |
First, there exists at least one environment in which every agent is rational, | ▶ 00:04 |
by which I mean optimal. | ▶ 00:10 |
For every agent, there exists (at least) one environment in which the agent is rational. | ▶ 00:13 |
To solve the sliding-tile 15-puzzle, an optimal agent that searches will usually require | ▶ 00:19 |
less memory than an optimal table-lookup reflex agent. | ▶ 00:26 |
By "usually" I mean there are always extreme cases that one can construct | ▶ 00:30 |
where this isn't the case. I ask about the common case. | ▶ 00:34 |
Here is the sliding-tile 15-puzzle, | ▶ 00:38 |
and you can see this is a puzzle where you move these pieces around | ▶ 00:40 |
until all these numbers are in order. | ▶ 00:43 |
It's a somewhat combinatorial search problem. | ▶ 00:45 |
Finally, to solve the sliding-tile 15-puzzle, an agent that searches will always do better-- | ▶ 00:48 |
that means it will always find shorter paths-- | ▶ 00:54 |
than a table-lookup reflex agent. | ▶ 00:56 |
This question is about A* search for the heuristic function h, | ▶ 00:00 |
which is indicated in the graph over here. | ▶ 00:05 |
An action costs 10 per step. | ▶ 00:07 |
Enter into each node of this graph the order in which the node is expanded. | ▶ 00:10 |
That is the same as removed from the queue in A*. | ▶ 00:16 |
Start with a "1" in the start state over here at the top | ▶ 00:19 |
and enter "0" if the node will never be expanded. | ▶ 00:22 |
This is a graph where we have a whole bunch of nodes, | ▶ 00:26 |
and the heuristic function is indicated over here. | ▶ 00:29 |
I'm also asking you is the heuristic h admissible? | ▶ 00:31 |
Here's an easy question. | ▶ 00:00 |
For coin X, we know that the probability of heads is 0.3. | ▶ 00:02 |
What is the probability of tails? | ▶ 00:05 |
In this probability question, we study a potentially loaded or unfair coin, which we flip twice. | ▶ 00:00 |
Say the probability for it coming up heads both times is 0.04. | ▶ 00:07 |
These are independent experiments with the same coin. | ▶ 00:13 |
I wonder what is the probability it comes up tails twice if we flip the same coin twice. | ▶ 00:16 |
We now have two coins--one fair coin for which the probability of heads is 0.5, | ▶ 00:00 |
and a loaded coin for which the probability of heads is 1. | ▶ 00:07 |
This might be a coin where heads is on both sides. | ▶ 00:11 |
We now pick a coin at random with 0.5 chance, | ▶ 00:15 |
and we don't quite know which coin we've picked, | ▶ 00:19 |
but we do flip this coin, and we see "heads." | ▶ 00:22 |
What is the probability that this is the loaded coin? | ▶ 00:25 |
We now flip this coin again (the same coin), and we see "heads" again for a second time. | ▶ 00:28 |
What is now the probability that this is the loaded coin? | ▶ 00:34 |
Here's a Bayes network question. | ▶ 00:00 |
Consider the following Bayes network with variables A all the way to I. | ▶ 00:02 |
A and B connect into E. C and D connect into F. | ▶ 00:05 |
E connects into G and H, and F connects into H but also into I. | ▶ 00:08 |
I'm asking the question whether A is independent of B, | ▶ 00:14 |
A is conditionally independent of B given E. | ▶ 00:18 |
A is conditionally independent of B given G. | ▶ 00:21 |
A is conditionally independent of B given F, | ▶ 00:24 |
and A is conditionally independent of C given G. | ▶ 00:27 |
Check out this Bayes network over here, | ▶ 00:00 |
which is defined by the following conditional probability table: | ▶ 00:03 |
P of A equals 0.5. | ▶ 00:06 |
A connects into B and C. | ▶ 00:08 |
P of B given A is equal to 0.2. | ▶ 00:10 |
P of B given not A is also 0.2. | ▶ 00:12 |
P of C given A is equal to 0.8. | ▶ 00:14 |
P of C given not A is 0.4. | ▶ 00:17 |
You'll find an interesting oddity in this table if you look very carefully. | ▶ 00:19 |
I'd like to ask you what is the probability of B given C, | ▶ 00:23 |
and what is the probability of C given B? | ▶ 00:26 |
In this question, we apply naive Bayes with Laplacian smoothing-- | ▶ 00:00 |
the same as we have learned in class. | ▶ 00:03 |
We have now 2 classes of movies. One is called "old" and one is called "new." | ▶ 00:05 |
There are titles in here. There are three old movies--Top Gun, Shy People, Top Hat. | ▶ 00:11 |
Two new movies--Top Gear, Gun Shy. | ▶ 00:15 |
Use Laplacian smoothing with k=1 to compute the probability of a movie being old-- | ▶ 00:18 |
this is a prior probability, which is just based on class counts-- | ▶ 00:24 |
the probability of the word "top" as a title word in the class of old movies, | ▶ 00:28 |
and the probability that a new movie that we look at-- | ▶ 00:34 |
by new I mean a movie we've never seen before-- | ▶ 00:37 |
that is called "top"--the probability that this movie corresponds | ▶ 00:40 |
to the old movie class versus the new movie class. | ▶ 00:45 |
I recommend you use a single dictionary for smoothing, | ▶ 00:48 |
so look at all the words and see how large the dictionary is. | ▶ 00:51 |
Top occurs here in two different ways. | ▶ 00:55 |
One is a word over here, but one also is a movie title over here. | ▶ 00:57 |
Don't pay too much attention to it, just don't get confused by it. | ▶ 01:01 |
Again, use Laplacian smoothing with k=1. | ▶ 01:05 |
In this question, I'm giving a set of data points--some positive, some negative--and a query point. | ▶ 00:00 |
Given the following labeled data set, I'd like to find the minimum of "k" | ▶ 00:06 |
for which the query point over here becomes negative. | ▶ 00:10 |
Enter "0" if this is impossible. | ▶ 00:14 |
Ties are broken at random, and I'd suggest trying to avoid them, | ▶ 00:16 |
because you might not be able to guarantee that the class is negative. | ▶ 00:21 |
In linear regression we are given a data set where the x's go 1, 3, 4, 5, and 9, | ▶ 00:00 |
and the y's 2, 5.2, 6.8, 8.4, and 14.8. | ▶ 00:07 |
I'd like to find the formula y = w1x + w0 by minimizing the residual quadratic error | ▶ 00:13 |
as we learned in class in linear regression. | ▶ 00:19 |
What will be w1, and what will be w0? | ▶ 00:21 |
K-Means Clustering. | ▶ 00:00 |
We're given a data set indicated by the solid dots over here. | ▶ 00:02 |
There's a total of 9 dots if you count carefully. | ▶ 00:06 |
We have 2 initial cluster centers: C1 and C2 as indicated by those stars. | ▶ 00:09 |
I'd like to run K-Means to completion | ▶ 00:14 |
and wonder what the final location of C1 will be after running K-Means. | ▶ 00:16 |
Ignore the A, B, C, or D over here. | ▶ 00:22 |
Here is our logic question. | ▶ 00:00 |
I would like you to mark each sentence as "valid," which means it is always true-- | ▶ 00:02 |
you can't make it untrue--"satisfiable," which means it is sometimes true | ▶ 00:07 |
but could also be false depending on the variable values, or "unsatisfiable," | ▶ 00:12 |
which means you cannot possibly make it true. | ▶ 00:17 |
The first statement is not A. | ▶ 00:20 |
The second is A or not A. | ▶ 00:22 |
The third one is (A and not A) implies (B implies C). | ▶ 00:25 |
The fourth one is (A implies B) and (B implies C) and (C implies A). | ▶ 00:33 |
The next one is (A implies B) and not (not A or B). | ▶ 00:41 |
The final one is ((A implies B) and (B implies C)) equivalent to (A implies C). | ▶ 00:49 |
Remember, you might use truth tables to find out. | ▶ 00:58 |
The planning question might be a bit hard to read, so let me read the text for you. | ▶ 00:00 |
In the state space below, shown over here, we can travel between locations | ▶ 00:06 |
S, A, B, and G along the roads as shown. | ▶ 00:10 |
For example, SA means we go from S to A. | ▶ 00:15 |
But the world is partially observable and stochastic. | ▶ 00:19 |
There may be a stoplight somewhere between B and G | ▶ 00:23 |
that can prevent passing from B to G. | ▶ 00:26 |
The action might fail, and there might be a flood that sits between A and G, | ▶ 00:29 |
and the flood also makes the action going from A to G fail. | ▶ 00:34 |
If the flood occurs, it will always remain flooded. | ▶ 00:38 |
If the stoplight is red, it will flip green at some point, but we can't predict when. | ▶ 00:42 |
The flood is only visible at A and the stoplight only visible at B. | ▶ 00:49 |
I want you to check all these plans over here and see what the outcome is. | ▶ 00:53 |
There are 3 potential outcomes. | ▶ 00:58 |
One is it always reaches the goal state and does so in a bounded number of steps. | ▶ 01:00 |
By bounded I mean in advance you can tell me a maximum number of steps-- | ▶ 01:06 |
not after the fact. | ▶ 01:09 |
If you can only tell me after the fact, it's really not bounded. | ▶ 01:12 |
The second possibility is always reaches the goal state, | ▶ 01:16 |
but the number of steps cannot be bounded in advance. | ▶ 01:19 |
The third one is it might actually fail to reach the goal state. | ▶ 01:24 |
Look at the following plans: SA followed by AG, SB, step 2 if we can't move go back to 2, | ▶ 01:29 |
then finally proceed to BG, and so on and so on. | ▶ 01:38 |
See which of these plans fall into which category over here. | ▶ 01:42 |
Here is an MDP question. | ▶ 00:00 |
We have a deterministic environment, which means the state transitions are deterministic. | ▶ 00:02 |
There is no probabilistic or stochastic outcome of actions. | ▶ 00:06 |
The cost of motion is -5. The terminal state is worth 100 as indicated. | ▶ 00:11 |
We have four actions: north, south, west, and east. | ▶ 00:15 |
The shaded state can't be entered. | ▶ 00:19 |
Please fill in the final values after value iteration converges. | ▶ 00:21 |
In this final question, I'd like you to learn the parameters of a Markov chain. | ▶ 00:00 |
There's a Markov chain over here with two states. | ▶ 00:07 |
There is an initial state distribution for the time step 0. | ▶ 00:09 |
Then there is conditional state distribution from time T to time T + 1. | ▶ 00:13 |
You might go from A and stay in A. | ▶ 00:17 |
You might go from A to B. | ▶ 00:19 |
You might go from B and stay in B and go from B to A. | ▶ 00:22 |
What we observe is the sequence A, A, A, A, B. | ▶ 00:25 |
This is our sample for the initial state and all these transitions over here | ▶ 00:30 |
are samples for the state transitions in this Markov chain. | ▶ 00:36 |
I want you to compute all the parameters, which is the initial distribution, | ▶ 00:40 |
and the transition distribution out of state A and out of state B. | ▶ 00:43 |
However, I'd like you to do this with Laplacian smoothing with k=1. | ▶ 00:48 |
It is not maximum likelihood. | ▶ 00:53 |
It is Laplacian smoothing, which can be applied just exactly | ▶ 00:55 |
the same way we saw it in class in various contexts. | ▶ 00:58 |
I'd like you to learn these parameters of the Markov chain from the observed sequence. | ▶ 01:02 |
Again, the only sample for the initial state is | ▶ 01:05 |
the very first measurement observation in this sequence over here. | ▶ 01:09 |
This unit is about games. Why games? | ▶ 00:00 |
Well, for one, games are fun. | ▶ 00:03 |
They've captured the imagination of people for thousands of years. | ▶ 00:06 |
They form a well-defined subset of the real world | ▶ 00:09 |
in that they have rules, which we understand and write down, and they are self-contained. | ▶ 00:13 |
They're not as messy as driving a car or flying an autonomous plane | ▶ 00:16 |
and having to worry about everything in the world. | ▶ 00:23 |
In that sense, they form a small-scale model of a specific problem. | ▶ 00:25 |
Namely, the problem of dealing with adversaries. | ▶ 00:30 |
Along the way we've seen a lot of different technologies in this class | ▶ 00:00 |
and a lot of different techniques that are focused on different parts of the agent | ▶ 00:04 |
and environment mix and different difficulties there. | ▶ 00:08 |
Here we have a quiz, and what I want you to tell me is for each of these technologies | ▶ 00:12 |
what do they most address? | ▶ 00:18 |
Some of them address more than one, but give the best answer for each line. | ▶ 00:20 |
Do they address the problem of a stochastic environment-- | ▶ 00:25 |
that is one where the results of actions can vary? | ▶ 00:28 |
Do they address the problem of a partially observable environment-- | ▶ 00:33 |
one where we can't see everything? | ▶ 00:37 |
Do they address the problem of an unknown environment-- | ▶ 00:40 |
one where we don't even know what the various actions are and what they do? | ▶ 00:42 |
Do they address computational limitations-- | ▶ 00:47 |
that is problems of dealing with a very large problem rather than a small one | ▶ 00:49 |
and making approximations to deal with that? | ▶ 00:55 |
Or do they deal with handling adversaries who are working against our goals? | ▶ 00:57 |
And I want you to answer that for MDPs, Markov decision processes; | ▶ 01:04 |
POMDPs, partially observable Markov decision processes and belief space; | ▶ 01:08 |
for reinforcement learning, and for A* algorithm, heuristic function, and Monte Carlo techniques. | ▶ 01:14 |
The answer is the MDPs are designed to do stochastic control. | ▶ 00:00 |
POMDPs are designed to deal with partial observability. | ▶ 00:05 |
Reinforcement learning deals with an unknown environment, | ▶ 00:09 |
and the heuristic function and A* search and Monte Carlo techniques | ▶ 00:13 |
are used to deal with computational limitations. | ▶ 00:19 |
Monte Carlo techniques give us an approximation. | ▶ 00:22 |
The heuristic function, if we use the right one, still gives us the right answer, | ▶ 00:25 |
but deals with the computational complexity. | ▶ 00:30 |
We don't as yet have any technology that's specifically designed to deal with adversaries. | ▶ 00:32 |
What is a game? | ▶ 00:00 |
The philosopher Wittgenstein said that there is no single set of necessary | ▶ 00:02 |
and sufficient conditions that define all games. | ▶ 00:06 |
Rather games have a set of features, and some games share some of them, | ▶ 00:10 |
and other games share others of them. | ▶ 00:14 |
It's a complex overlapping set rather than a simple criterion. | ▶ 00:17 |
Here I've listed six different games, and in some cases sets of similar games: | ▶ 00:22 |
Chess and Go, Robotic Soccer, Poker, | ▶ 00:27 |
hide-and-go-seek played in the real world, | ▶ 00:31 |
Cards Solitaire, and Minesweeper, the computer solitaire game. | ▶ 00:34 |
I want to ask you, for each one, which of these properties they exhibit. | ▶ 00:40 |
Are they stochastic? Are they partially observable? | ▶ 00:44 |
Do they have an unknown environment? Are they adversarial? | ▶ 00:48 |
For each game tell me all that apply. | ▶ 00:53 |
Let me add that your answers may not be the same as mine, | ▶ 00:56 |
because these very terms are not that precise. | ▶ 01:00 |
Sometimes you can analyze a problem in two different ways | ▶ 01:03 |
and flip from one of these attributes to another, depending on how you analyze it. | ▶ 01:07 |
Now, I've chosen to say that only robotic soccer and hide-and-go-seek are stochastic. | ▶ 00:00 |
By that I mean if you have an action like go forward 1 meter, | ▶ 00:06 |
the result of that action is stochastic. You may not go forward exactly 1 meter. | ▶ 00:10 |
You could also analyze games like poker and cards and say that they're stochastic | ▶ 00:15 |
in that the next card is random, and so the action of flipping over the next card is stochastic. | ▶ 00:21 |
You don't know how that action is going to result. | ▶ 00:28 |
I've chosen to model that as partial observability. | ▶ 00:32 |
What I've said is it's not that you pick the next card randomly, | ▶ 00:36 |
it's that the cards are already arranged in some order. | ▶ 00:41 |
It's just that you don't know what that order is. | ▶ 00:45 |
There's partial observability that gives you the next card. | ▶ 00:47 |
Partial observability also shows up in the real-world sports | ▶ 00:50 |
of robot soccer and hide-and-go-seek. | ▶ 00:54 |
Obviously, that's kind of the point of hide-and-go-seek that it's partially observable. | ▶ 00:58 |
Now, in terms of unknown, I've said that only hide-and-go-seek satisfies that. | ▶ 01:03 |
In everything else, the world is well-defined. | ▶ 01:07 |
Even in the real world in an environment like robot soccer, | ▶ 01:10 |
you only have the known field to deal with. | ▶ 01:14 |
Whereas in hide-and-go-seek, someone could be hiding anywhere | ▶ 01:17 |
in a room or location that you don't know about yet. | ▶ 01:20 |
Notice that many games are adversarial, but some games are not. | ▶ 01:25 |
Solitaire games are not adversarial. | ▶ 01:29 |
You could mark that down as saying, well, I'm playing against the game itself, | ▶ 01:31 |
but we don't count that as adversarial, because the game itself is not trying to defeat you. | ▶ 01:37 |
The game itself is passive. | ▶ 01:42 |
Whereas in these games and what adversarial has come to mean is that | ▶ 01:44 |
the opponent is taking into account what you are thinking | ▶ 01:49 |
when the opponent does their own thinking and tries to defeat you that way. | ▶ 01:52 |
Here's a game that we've seen before. | ▶ 00:00 |
We call this a single-player deterministic game. | ▶ 00:03 |
We know how to solve this. | ▶ 00:07 |
We use the techniques of search through a state space--the problems solving techniques. | ▶ 00:09 |
We draw a search tree through the state space, | ▶ 00:13 |
and I'm going to draw the nodes like this with triangles rather than with circles. | ▶ 00:18 |
In any position--in this position here--there are three moves I can make. | ▶ 00:24 |
I can slide this tile, this tile, or this tile. | ▶ 00:28 |
So I have 3 moves, and that gives me 3 more states. | ▶ 00:32 |
I keep on expanding out the states going farther and farther down until I reach one | ▶ 00:36 |
that's a goal state, and then I have a path through there that gets me to a solution. | ▶ 00:42 |
What does it take to describe a game? | ▶ 00:49 |
Well, we have a set of states S, including a distinguished start state S0. | ▶ 00:51 |
We have a set of players P, which can be just one player, as in this game, or two or more. | ▶ 00:58 |
We have a function that gives us the allowable actions in a state, | ▶ 01:03 |
and sometimes we put in a second argument, | ▶ 01:10 |
which is the player making the action in that state, | ▶ 01:13 |
and sometimes it's explicit in the state itself whose turn it is to move. | ▶ 01:17 |
We have a transition function that tells us the result of, | ▶ 01:21 |
in some state, applying an action giving us a new state. | ▶ 01:25 |
And we have a terminal test to say is it the end of the game. | ▶ 01:29 |
That's going to be true or false. | ▶ 01:34 |
Finally, we have terminal utilities saying that for a given state and a given player | ▶ 01:36 |
there is some number which is the value of the game to that player. | ▶ 01:42 |
In simple games that number is a win or a loss, a one or a zero. | ▶ 01:46 |
Sometimes it's denoted as a +1 and a -1. | ▶ 01:52 |
In other games there can be more complicated utilities | ▶ 01:56 |
of you win twice as much or four times as much or whatever. | ▶ 02:00 |
Now let's consider games like chess and checkers, | ▶ 00:00 |
which we define as deterministic, two-player, zero-sum games. | ▶ 00:04 |
The deterministic part is clear. | ▶ 00:09 |
The rules of chess say you make a move, take a piece, and that's it. There's no stochasticity. | ▶ 00:11 |
It's two players, one against another, | ▶ 00:18 |
and zero sum means that the sum of the utilities to the two players is zero. | ▶ 00:20 |
If one player gets a +1 for winning the game, the other player gets a -1 for losing. | ▶ 00:25 |
How do we deal with these types of games? | ▶ 00:30 |
Well, we use a similar type of approach. | ▶ 00:33 |
We have a state-space search. We have a starting state. | ▶ 00:36 |
There are some moves available to player one. | ▶ 00:39 |
Then in the next state there are moves available to player two. | ▶ 00:43 |
We're going to draw them like this, and we're going to give names to our players. | ▶ 00:47 |
The first player we're going to call Max, because it's a nice name, | ▶ 00:51 |
and because player one is trying to maximize the utility to player one. | ▶ 00:55 |
The next player, who operates at this level, we draw with a downward-pointing triangle. | ▶ 01:02 |
We call that player Min, because Min is trying to minimize the utility to Max, | ▶ 01:08 |
which is the same thing as trying to maximize the utility to himself or herself. | ▶ 01:14 |
Then we have a game tree that continues like that, alternating between Max and Min moves. | ▶ 01:19 |
Now, the search tree keeps going and let's say we get to a point where one player, | ▶ 01:26 |
and let's say it's Max, has a choice, and there are two states, | ▶ 01:31 |
and these, rather than being states where it's Min's turn, are states that are terminal. | ▶ 01:36 |
We'll draw them with a square box. | ▶ 01:42 |
Let's say one of them results in +1, a win for Max, | ▶ 01:45 |
and one of them results in -1, a loss for Max. | ▶ 01:50 |
Now if Max is rational, of course, Max is going to make this choice to the +1. | ▶ 01:54 |
What we're going to do now is show we can determine the value of any state in the tree, | ▶ 02:00 |
including the start state up here in terms of the values of the terminal nodes. | ▶ 02:07 |
The tree keeps on going. We assume it's a finite game. | ▶ 02:12 |
After a finite number of moves, every path leads to a terminal state. | ▶ 02:15 |
Then we look at each state and say whose turn is it to make the decision. | ▶ 02:21 |
In this state Max is making the decision, and Max, being rational, | ▶ 02:28 |
will choose the maximum value, saying, "I'd rather have a +1 than a -1, | ▶ 02:32 |
so I'll get a +1 here." | ▶ 02:37 |
We start going back up the tree, and maybe we get up to a point here | ▶ 02:39 |
where Min has a choice, and we've used this type of process to go up the tree, | ▶ 02:46 |
and Min has a choice between a +1 and a -1. | ▶ 02:50 |
Min is going to choose the minimum and will have a -1 here. | ▶ 02:55 |
If we go through all the possibilities, let's say these all result in -1, | ▶ 02:59 |
but this move results in a +1. Then Max will take that move. | ▶ 03:05 |
He'll say, "Out of my four possibilities, I know this is the best one. I'll take that move." | ▶ 03:10 |
Now we've done two things. | ▶ 03:16 |
One, we've assigned a value to every state in the search tree, | ▶ 03:18 |
and secondly, we backed that all the way up to the top. | ▶ 03:22 |
Now we've worked out a path through the tree to say, | ▶ 03:25 |
if all players are rational, here's the choices they would make. | ▶ 03:29 |
The important point here is that we've taken the utility function, | ▶ 03:32 |
which is defined only on terminal states. | ▶ 03:37 |
Here's a state here. The utility of that state was +1. | ▶ 03:40 |
Here's a state here. The utility of that state was -1. | ▶ 03:43 |
We've used those utility values and the definition of available actions | ▶ 03:48 |
to back those utilities up and tell us the utility of every state, including the start state. | ▶ 03:52 |
Now let's define a function value of S, | ▶ 00:00 |
which tells us how to compute the value for a given state, | ▶ 00:03 |
and therefore will allow us to make the best possible move. | ▶ 00:07 |
If S is a terminal state, then the value is just the utility of the state | ▶ 00:10 |
given by the definition of the game. | ▶ 00:15 |
If S is a maximizing state, then we'll return something called max value of S, | ▶ 00:18 |
and if S is a minimizing state, then we'll return min value of S. | ▶ 00:25 |
Now we can define max value to just iterate over all the successors | ▶ 00:32 |
and figure out the values of each of those. | ▶ 00:37 |
We'll initialize a value m equals minus infinity, | ▶ 00:40 |
and then we'll say for all pairs of actions and successor states in successors of S, | ▶ 00:46 |
we'll say the value is--and let's call this S-prime so we don't get confused-- | ▶ 00:54 |
the maximum of m and the value of S-prime, keeping track of the maximum so far and the new value. | ▶ 00:59 |
Then when we're all done we return the M with the maximum value. | ▶ 01:07 |
This will compute the maximum at a maximum node over all the states | ▶ 01:12 |
that we have from all the possible moves. | ▶ 01:16 |
The definition for min value is roughly equivalent but just reversed, | ▶ 01:19 |
taking the minimum instead. | ▶ 01:24 |
With these three recursive routines--value, max value, and min value-- | ▶ 01:26 |
we can determine the value of any node in the tree. | ▶ 01:30 |
Now to do that efficiently, you'd want a little bit of bookkeeping | ▶ 01:33 |
so you aren't recomputing the same thing over and over again, | ▶ 01:37 |
but conceptually, this will answer any two-player, deterministic, finite game. | ▶ 01:40 |
Now we know we have an algorithm that can solve any game tree, | ▶ 00:00 |
that can propagate the terminal values back up to the top | ▶ 00:03 |
and tell us the value for any position. | ▶ 00:07 |
It's theoretically complete, but now we need to know | ▶ 00:10 |
the complexity of the algorithm to figure out if it's practical. | ▶ 00:13 |
Let's look at an analysis of how long it's going to take. | ▶ 00:16 |
Let's say that the average branching factor-- | ▶ 00:20 |
the number of possible moves or actions coming out of a position--is b. | ▶ 00:22 |
Here b would be 4. | ▶ 00:28 |
And let's say that the depth of the tree is m, so b wide and m deep. | ▶ 00:30 |
Now what I want you to tell me is what would be the computational complexity | ▶ 00:37 |
of searching through all the paths and backing the values up to the top. | ▶ 00:41 |
Would it be of the order of b times m or the order of b to the mth power | ▶ 00:46 |
or the order of m to the b power? Choose one of these. | ▶ 00:52 |
The answer is we have b choices at the top level, | ▶ 00:00 |
and for each of those b we have another b at the next level. | ▶ 00:03 |
That would be b squared, b cubed and so on, | ▶ 00:07 |
and all the way to b to the mth power. | ▶ 00:10 |
Now the next thing I want you to tell me is the space complexity. | ▶ 00:00 |
That was the time complexity. | ▶ 00:04 |
The space complexity is how much storage do we need to be able to search this tree. | ▶ 00:06 |
Remember that the value and max value and min value routines that we have defined | ▶ 00:12 |
are doing a depth-first search. | ▶ 00:18 |
Which of these would correctly represent the amount of storage that we would need-- | ▶ 00:20 |
the space complexity? | ▶ 00:24 |
The answer is that we only need b times m space in order to do the search. | ▶ 00:00 |
Even though the entire tree is order b to the mth power of nodes, | ▶ 00:05 |
on any individual path through the tree we only need to look one path at a time | ▶ 00:10 |
in order to do the depth-first search. | ▶ 00:14 |
We generate these b nodes, store them away, look at the first one, generate b more, | ▶ 00:16 |
store those away, and so we're saving only b nodes at each level for m levels, | ▶ 00:22 |
for a total of b times m storage space required. | ▶ 00:27 |
The next question is let's look at the game of chess | ▶ 00:00 |
for which the branching factor is somewhere around 30. It varies from move to move. | ▶ 00:03 |
The length of a game is somewhere around 40. | ▶ 00:08 |
Certainly some games are much longer, but that's an average length of a game. | ▶ 00:11 |
Now let's imagine that you have a computer system, | ▶ 00:15 |
and you want to search through this whole tree for chess, | ▶ 00:19 |
and let's assume that you can evaluate a billion nodes a second on one computer. | ▶ 00:22 |
Let's also say that for the moment somebody lent you every computer in the world. | ▶ 00:27 |
If you have all the computers and they can each do a billion evaluations a second, | ▶ 00:31 |
how long would it take you to search through this whole tree? | ▶ 00:36 |
Would it be on the order of seconds, minutes, days, years, | ▶ 00:39 |
or lifetimes of the universe. Tell me which of these. | ▶ 00:44 |
The answer is that it would take many lifetimes of the universe. | ▶ 00:00 |
Even though you have a lot of computing power at your disposal, | ▶ 00:04 |
30 to the 40th power is just such a huge number | ▶ 00:07 |
that there is no chance of searching through the entire tree for chess. | ▶ 00:10 |
Now our question is how do we deal with the complexity of having a tree | ▶ 00:00 |
with branching factor b and depth m. | ▶ 00:05 |
Here are some possibilities, and I want you to tell me which of these are good approaches. | ▶ 00:08 |
We have the problem of dealing with b to the m. | ▶ 00:13 |
Could we reduce b somehow, that is, reduce the branching factor, | ▶ 00:17 |
reduce m, the depth of the tree, or convert the tree into a graph in some way? | ▶ 00:21 |
Tell me which, if any or all, of these would be good approaches | ▶ 00:27 |
to dealing with the complexity. | ▶ 00:30 |
The answer is that all three are useful approaches, and we'll look at each of them. | ▶ 00:00 |
Let's review just for a second. | ▶ 00:00 |
This is called the minimax routine for evaluating a game tree. | ▶ 00:02 |
Given a particular state we look and see is it a terminal state? | ▶ 00:06 |
Is it a maximizing state? It is a minimum state? | ▶ 00:10 |
In each case we look up the utility from the game. | ▶ 00:13 |
We do the max value routine, or we do the min value routine, which is similar. | ▶ 00:16 |
That gives us the value of each state. | ▶ 00:21 |
Then the action that the agent would take would be just to take the action | ▶ 00:24 |
that results in the maximum state--the state with the best value. | ▶ 00:29 |
Now let's try to apply the minimax routine to this game tree. | ▶ 00:32 |
This is a small game in which Max has three options for his moves, | ▶ 00:36 |
and then Min has three options for its moves, and then the game is over. | ▶ 00:41 |
Here are the terminal values for these states in terms of Max's score. | ▶ 00:45 |
What I want you to do is use minimax to fill in the values of these intermediate states. | ▶ 00:51 |
What are the values of these three states for Min to move, | ▶ 00:56 |
and what is the value of this state for Max to move? | ▶ 01:00 |
The answer is that these are minimizing nodes. | ▶ 00:00 |
The minimum of 3, 12, and 8, is 3. | ▶ 00:04 |
Here the minimum is 2. | ▶ 00:07 |
Here it's 1. | ▶ 00:10 |
Then this is a maximizing move. | ▶ 00:12 |
The max is 3. | ▶ 00:14 |
That means that if both players played rationally, then Max would take this move. | ▶ 00:16 |
Then Min would take this move, and the value of the game would be 3. | ▶ 00:22 |
Now I want to get at the idea of reducing b, the branching factor. | ▶ 00:00 |
How is it that we can cut down on the number of nodes that we expand | ▶ 00:06 |
in the horizontal direction while still getting the right answer for the evaluation of the tree? | ▶ 00:10 |
Let's go back and consider that during our evaluation, if we get to this point, | ▶ 00:16 |
we've expanded these three nodes, we figured out that the value of this one is 3, | ▶ 00:21 |
we looked at this one so far and found its value was 2, | ▶ 00:25 |
and now, without looking at these, what can we say about the value of this node? | ▶ 00:29 |
Well, it's a minimizing node, so the least it could be is 2. | ▶ 00:35 |
If these are less than 2, it'll be less than that, and if these are more, it'll end up being 2. | ▶ 00:39 |
We can say that the value of this node is less than or equal to 2. | ▶ 00:44 |
Now if we look at it from Max's point of view, | ▶ 00:50 |
Max will have this choice here of choosing either this, this, or this, | ▶ 00:53 |
and if this one is 3 and this one is less than or equal to 2, | ▶ 00:58 |
then we know Max will always choose this one. | ▶ 01:02 |
What that tells us is that it doesn't matter what the value is of this node and this node. | ▶ 01:05 |
No matter what those values are this is still going to be less than or equal to 2, | ▶ 01:11 |
and is not going to matter to the total evaluation, | ▶ 01:16 |
because we're going to go this way anyway. | ▶ 01:18 |
We can prune the tree, chop off these nodes here, and never have to evaluate. | ▶ 01:20 |
Now, with this particular case, that doesn't save us very much, | ▶ 01:26 |
because these are terminal nodes, but these could have been large branches-- | ▶ 01:29 |
big parts of the tree, and we still wouldn't have to look at them. | ▶ 01:33 |
We've made a potentially large pruning without affecting the value. | ▶ 01:36 |
We still get the exact correct value for the value of the tree. | ▶ 01:41 |
Now I want you to tell me over here which, if any or all, | ▶ 00:00 |
of the three nodes can be pruned away by this procedure. | ▶ 00:04 |
The answer is when we see the 14 we're not sure what this value is. | ▶ 00:00 |
It has to be less than or equal to 14, | ▶ 00:04 |
which means it might be the right path or it might not. | ▶ 00:08 |
Once we see the one then we know that the value is less than or equal to one, | ▶ 00:11 |
and we know that we have a better alternative here, so we can stop at that point. | ▶ 00:17 |
Then we can prune off the 8. | ▶ 00:21 |
Out of the three, only this node, the right-most, would be the one pruned away. | ▶ 00:23 |
Now I'm going to look at the issue of reducing m, the depth of the tree. | ▶ 00:00 |
Here, I've drawn a game tree and left out some bits, | ▶ 00:05 |
but the idea is that it keeps on going and going. | ▶ 00:08 |
There'll be too many nodes for us to evaluate at all. What can we do? | ▶ 00:11 |
The simplest approach is to just by fiat cut off the search at a certain depth. | ▶ 00:15 |
We'll say we're only going to search to level three, | ▶ 00:20 |
and when we get down to level three, | ▶ 00:23 |
we're going to pretend that these are all terminal nodes. | ▶ 00:25 |
We'll draw them as the square boxes for terminals rather than the max nodes | ▶ 00:28 |
and cut off the search at that point. | ▶ 00:35 |
Now, of course, they aren't terminal, so according to the rules of the game, | ▶ 00:38 |
we haven't either won or lost at this particular point. | ▶ 00:41 |
We can't say for sure what the value is for each of these nodes, | ▶ 00:45 |
but we can estimate it using something called an evaluation function, | ▶ 00:49 |
which is given a state S and returns an estimate of the final value for that state. | ▶ 00:54 |
What do we want out of our evaluation function and how do we get it? | ▶ 01:00 |
We want the evaluation function to be higher for positions that are stronger | ▶ 01:03 |
and lower for positions that are weaker. | ▶ 01:07 |
We can get it one way from experience-- | ▶ 01:10 |
from playing the games before and seeing similar situations | ▶ 01:13 |
and figuring out what their values are. | ▶ 01:16 |
We can try to break that down into components by using experience with the game. | ▶ 01:19 |
For example, in the game of chess it is traditional to say that a pawn is worth 1 point, | ▶ 01:24 |
a knight 3 points, a bishop 3 points, a rook 5, and a queen 9. | ▶ 01:30 |
You could add up all those points. | ▶ 01:34 |
So we could have an evaluation function of S | ▶ 01:36 |
which is equal to this weighted sum of the various weights times the various pieces-- | ▶ 01:40 |
positive weights for your pieces and negative weights for the opponent's pieces. | ▶ 01:48 |
We've seen this idea before when we did machine learning | ▶ 01:52 |
where we have a set of features, which could be the pieces, | ▶ 01:55 |
and they could be other features of the game as well. | ▶ 01:59 |
For example, in chess it's good to control the center, | ▶ 02:02 |
it's good not to have a double pawn, and so on. | ▶ 02:05 |
We could make up as many features as we can think of to represent each individual state | ▶ 02:08 |
and then use machine learning from examples to figure out what the weight should be. | ▶ 02:13 |
Then we have an evaluation function. | ▶ 02:18 |
We apply the evaluation function to each state at the cutoff point | ▶ 02:21 |
rather than doing a long search. | ▶ 02:25 |
Then we have an estimate, and we back those values up just as if they were terminal values. | ▶ 02:28 |
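To make the idea concrete, here is a minimal Python sketch of such a weighted-sum evaluation function (my own illustration, not code from the class); the weights are the traditional chess material values mentioned above, and the state fields my_counts and opponent_counts are hypothetical.

    # Sketch of a weighted-sum evaluation function. The state representation
    # (my_counts, opponent_counts as piece-name -> count dicts) is hypothetical.
    WEIGHTS = {'pawn': 1, 'knight': 3, 'bishop': 3, 'rook': 5, 'queen': 9}

    def evaluate(state):
        # Positive weights for our pieces, negative weights for the opponent's.
        score = 0
        for piece, w in WEIGHTS.items():
            score += w * state.my_counts.get(piece, 0)
            score -= w * state.opponent_counts.get(piece, 0)
        return score

In a real program the features and their weights could be chosen by hand or learned from examples, as described above.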
Now let's see how we can compute the value of a state using these | ▶ 00:00 |
two innovations to work on b and m. | ▶ 00:03 |
I've modified our routine for value in two ways-- | ▶ 00:07 |
one, I've introduced a new line that says if we decide to cut off the search | ▶ 00:10 |
at a particular depth then apply the evaluation function to the state and return that. | ▶ 00:16 |
Then I've also added some bookkeeping variables. | ▶ 00:22 |
One for the current depth, which will get increased as we go along, | ▶ 00:25 |
and then two values called alpha and beta, which are the traditional names, | ▶ 00:29 |
where alpha is the best value found so far for Max along the path | ▶ 00:34 |
that we are currently exploring, and beta is the best value found so far for Min. | ▶ 00:40 |
Then since we have these extra parameters when we start out, | ▶ 00:46 |
we would make the call value of our initial state S0 and we're currently at depth zero | ▶ 00:49 |
in the search tree, and we haven't found the best for Max yet so that would be minus infinity, | ▶ 00:57 |
and the best for Min similarly we haven't found anything there so that would be plus infinity. | ▶ 01:03 |
We call that, and then at each node we would choose one of these four cases. | ▶ 01:10 |
Here's the new definition of maxValue taking the depth | ▶ 01:16 |
and the alpha and beta parameters into account. | ▶ 01:19 |
It's similar to what we had before. | ▶ 01:22 |
We go through all the successors. | ▶ 01:24 |
We take the maximum, and in this case we're incrementing the depth | ▶ 01:26 |
as we call recursively for the value of each node. | ▶ 01:31 |
We get the cutoff here if we exceed beta, | ▶ 01:35 |
and otherwise we retain alpha as the maximum value to Max so far. | ▶ 01:38 |
Then we return the final value. | ▶ 01:44 |
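As a rough sketch of what that routine might look like in code (assuming hypothetical helpers cutoff_test, evaluate, terminal_test, utility, successors, and a mirror-image min_value, since the lecture only describes them verbally), the maxValue function with the depth and alpha-beta bookkeeping could be written like this in Python:

    def max_value(state, depth, alpha, beta):
        # By fiat, stop at the cutoff depth and estimate with the evaluation function.
        if cutoff_test(state, depth):
            return evaluate(state)
        if terminal_test(state):
            return utility(state)
        v = float('-inf')
        for s in successors(state):
            v = max(v, min_value(s, depth + 1, alpha, beta))
            if v >= beta:           # Min already has a better option elsewhere,
                return v            # so the remaining successors can be pruned.
            alpha = max(alpha, v)   # best value found so far for Max on this path
        return v

    # The initial call would be value(S0, 0, float('-inf'), float('inf')).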
Now we said we have three ways to reduce this exponential b to the m-- | ▶ 00:00 |
reducing the branching factor b, reducing the depth of the tree m, | ▶ 00:04 |
and converting the tree to a graph. | ▶ 00:08 |
Let's see how each of those fares. | ▶ 00:11 |
First, for reducing b we came up with this alpha-beta pruning technique. | ▶ 00:13 |
In fact, that is very effective. | ▶ 00:19 |
That takes us from a regime where we're in order b to the m to one where, | ▶ 00:21 |
if we do a good job, we can get to order b to the m/2. | ▶ 00:29 |
Now what do I mean by doing a good job? | ▶ 00:34 |
Well, we get different amounts of pruning depending on the order | ▶ 00:36 |
in which we expand each branch from a node. | ▶ 00:39 |
If we expand the good nodes first, then we get a lot of pruning, | ▶ 00:42 |
because we do a good job of getting to the cutoff points. | ▶ 00:45 |
If we expand the poor nodes first, then we don't do any pruning, | ▶ 00:49 |
because we don't get to that cutoff point until later. | ▶ 00:53 |
But if we can do well, then we get to the square root of the number of nodes. | ▶ 00:56 |
In other words, we get to search twice as deep into the search tree. | ▶ 01:01 |
That's all 100% perfect in terms of not changing the result. | ▶ 01:05 |
We'd still get the exact evaluation. | ▶ 01:12 |
We just stop doing work that we didn't have to do. | ▶ 01:15 |
Now for converting the tree to a graph, we haven't talked about that yet. | ▶ 01:18 |
In fact, it depends on the particular game, but in many games it can be very useful. | ▶ 01:21 |
In games like chess, we have opening books. | ▶ 01:25 |
That is, we look at the past openings | ▶ 01:29 |
and we just memorize those positions and what are the good moves. | ▶ 01:32 |
It doesn't matter how we get to those positions. | ▶ 01:36 |
We can get to them in multiple paths through a tree, | ▶ 01:38 |
and we can just consider it a single graph. | ▶ 01:41 |
We also have closing books, where we can memorize all the positions | ▶ 01:43 |
with five or fewer pieces and know exactly what to do. | ▶ 01:48 |
In the midgame when there are too many positions to memorize all of them, | ▶ 01:52 |
we can still search through a graph if we want to, or we can just do part of that. | ▶ 01:57 |
One thing that has proven effective in games like chess is called the killer-move heuristic. | ▶ 02:04 |
What that says is if there's one really good move in one part of a search tree, | ▶ 02:09 |
then try that same move in the sister branches of that tree. | ▶ 02:14 |
In other words, if I try making one move and I find that the opponent takes my queen, | ▶ 02:19 |
then when I try making another move from that same position, | ▶ 02:25 |
I should also check if the opponent has that response of taking my queen. | ▶ 02:28 |
Converting from a tree to a graph also doesn't lose information. | ▶ 02:32 |
It can just help us make the search go faster. | ▶ 02:35 |
The third possibility was reducing m, the depth of the tree, | ▶ 02:38 |
by just cutting off search and going to an evaluation function. | ▶ 02:42 |
That is imperfect in that it is an estimate of the true value of the tree | ▶ 02:46 |
but won't give you the exact value. | ▶ 02:51 |
We can get into trouble. Let me show you an example of that. | ▶ 02:53 |
Here's a search tree for a version of Pacman in which there's only four squares. | ▶ 00:00 |
There's a little Pacman guy who can move around, | ▶ 00:05 |
and there are food dots that the Pacman can eat. | ▶ 00:08 |
Maybe someplace else in the maze there are opponents, | ▶ 00:14 |
but we're not going to worry about them right here. | ▶ 00:17 |
We're just going to consider the Pacman's actions. | ▶ 00:20 |
He has two actions--to go left or right. | ▶ 00:23 |
If he goes left, he goes over here and eats that food particle | ▶ 00:25 |
and then moves back right--that's his only move from that position. | ▶ 00:28 |
Or if he moves right then he has two other moves. | ▶ 00:32 |
Now let's assume that we cut off the search at this depth, | ▶ 00:36 |
and we want to have an evaluation function, | ▶ 00:40 |
and the goal is for Pacman to eat all the food. | ▶ 00:43 |
The evaluation function will be the number of food particles that he's eaten so far. | ▶ 00:47 |
What I want you to do is tell me in these boxes | ▶ 00:52 |
what the evaluation should be for each of these three states. | ▶ 00:57 |
The answer is here he's eaten 1, here he's eaten 0, and here he's eaten 1. | ▶ 00:00 |
That's fine. The problem arises when we start backing up these numbers. | ▶ 00:06 |
If these are max nodes, we've skipped the opponent's moves, which are the min nodes. | ▶ 00:11 |
We're only looking at the maxes. | ▶ 00:17 |
The max of 1 is 1, so this would also get an evaluation of 1. | ▶ 00:20 |
The max of 0 and 1 is 1, so this would also get an evaluation of 1. | ▶ 00:25 |
This final node would be the max of 1 and 1, so that's also 1. | ▶ 00:31 |
But now when we go to apply the policy, if we're in this position, | ▶ 00:35 |
using these evaluation functions, both of these moves are equally good. | ▶ 00:39 |
The Pacman might choose this one, | ▶ 00:44 |
choosing at random or choosing by some predefined ordering. | ▶ 00:48 |
Then he'd end up in this state. So far he hasn't eaten anything. | ▶ 00:52 |
But this state is just as good because he knows in two moves he can eat one particle | ▶ 00:55 |
going this way just as well as in two moves he can eat one particle going this way. | ▶ 00:59 |
Now he's in this state, but notice that this state is symmetric to this one. | ▶ 01:04 |
On his next turn, if we did another depth-two search, | ▶ 01:08 |
he might just as well go back one position. | ▶ 01:12 |
He would be stuck going back and forth between these two states, | ▶ 01:14 |
because either one of those, if you look ahead only two, is equally good. | ▶ 01:19 |
You have to look ahead one, two, three, four moves to know | ▶ 01:24 |
that one of them is better than the other. | ▶ 01:28 |
This is known as the horizon effect. | ▶ 01:31 |
The idea is that when we cut off search we're specifying a horizon | ▶ 01:34 |
beyond which the agent can't see. | ▶ 01:39 |
If a good thing or a bad thing happens beyond the horizon, we don't see that. | ▶ 01:41 |
All we see is whatever is reflected in the evaluation function. | ▶ 01:45 |
If the evaluation function is imperfect, we don't see beyond the horizon, | ▶ 01:50 |
and we can make mistakes. | ▶ 01:54 |
There is one more thing to deal with when we have to talk about games--and that's chance. | ▶ 00:00 |
We want to move from purely deterministic games to stochastic games | ▶ 00:06 |
like backgammon or other games that introduce dice or other parts of random action. | ▶ 00:10 |
That means that the actions that an agent takes | ▶ 00:16 |
are not specified to have a single result. | ▶ 00:19 |
Let's see how we can deal with stochastic games | ▶ 00:23 |
by looking at our value function and modifying it to allow for this. | ▶ 00:26 |
Here we have our valuation function, and we're dealing with four types of nodes-- | ▶ 00:30 |
one, nodes that we decide to cut off on our own, because we reached a certain depth; | ▶ 00:34 |
second, nodes that are terminal according to the rules of the game; | ▶ 00:38 |
and third, max to move and min to move. | ▶ 00:42 |
Now I'm going to add one more type, which is a chance node. | ▶ 00:45 |
We say if the state is a chance node, then we want to return the expected value of S | ▶ 00:48 |
and carry along these bookkeeping variables. | ▶ 00:57 |
What we're saying here is if it's at the point of the game where it's time to roll the dice, | ▶ 01:00 |
then we're going to roll the dice, and we're going to take the expected value | ▶ 01:04 |
of all the possible results rather than the max or the min. | ▶ 01:08 |
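A minimal sketch of that extra case, assuming a hypothetical helper chance_outcomes(state) that returns (probability, resulting state) pairs and the value routine described above:

    def expected_value(state, depth, alpha, beta):
        # At a chance node, average the values of the possible outcomes,
        # weighted by their probabilities, instead of taking a max or a min.
        total = 0.0
        for probability, outcome in chance_outcomes(state):
            total += probability * value(outcome, depth + 1, alpha, beta)
        return total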
Here we have a schematic diagram for a stochastic game--a game with dice. | ▶ 01:11 |
We start out. The chance node or the dice-rolling node is first. | ▶ 01:15 |
The dice is rolled--one of six possibilities. | ▶ 01:19 |
Then the next player gets his move. | ▶ 01:22 |
In this case, we've let Min move first, and Min has various moves possible to make. | ▶ 01:24 |
For each one, there is then another roll of the dice, and then Max gets to make his move. | ▶ 01:30 |
Here's the game tree for another stochastic game. | ▶ 00:00 |
This game involves flipping a coin. | ▶ 00:03 |
The chance nodes have two results: heads or tails. | ▶ 00:05 |
Then the player Max has two possible moves, A and B, | ▶ 00:09 |
and the player Min has two possible moves, C and D. | ▶ 00:14 |
This game is too small to have any alpha-beta pruning involved, | ▶ 00:18 |
but I've listed all the terminal values for the terminal states of the game. | ▶ 00:22 |
What I want you to do is fill in the non-terminal values | ▶ 00:27 |
for the chance nodes, the max nodes, and the min nodes | ▶ 00:31 |
according to the rules of minimum value and maximum value and expected value. | ▶ 00:35 |
I should say that the probability of the coin flip is 50% heads and 50% tails. | ▶ 00:41 |
To evaluate the game tree, we work from the bottom up. | ▶ 00:00 |
Let's start over here. This is a min node. | ▶ 00:03 |
Min chooses the minimum, which will be 1. | ▶ 00:06 |
In this position, Min would choose 2, the minimum of 2 and 4. | ▶ 00:09 |
Over here Min would choose 0, the minimum of 0 and 10. | ▶ 00:14 |
Now we have some chance nodes, so we have to choose the expected value. | ▶ 00:19 |
Chance, the flip of the coin, doesn't get the choice of one direction or the other. | ▶ 00:23 |
Rather both of them are possibilities. | ▶ 00:27 |
So we just average the results, since the probability of heads and tails are equal. | ▶ 00:30 |
So 7 plus 1 is 8, divided by 2 is 4; and 8 plus 2 is 10, divided by 2 is 5, which is the expected value there. | ▶ 00:35 |
The expected value of 0 and 6 is 3, and the expected value of 0 and 4 is 2. | ▶ 00:44 |
Now we have a maximizing node. The max of 5 and 4 would be 5. | ▶ 00:52 |
The max of 3 and 2 would be 3, and finally, we have another chance node. | ▶ 00:57 |
The average of 5 and 3 would be 4, and that's the value of the final state. | ▶ 01:04 |
Now one more question for this same game tree. | ▶ 00:00 |
I want you to click on all the terminal states that are possible outcomes | ▶ 00:03 |
for this game if both players play rationally. | ▶ 00:07 |
No subtitles... | ▶ 00:00 |
One more quick game tree to evaluate. | ▶ 00:00 |
Here we have terminal values. | ▶ 00:03 |
We have chance nodes where the two options are equiprobable. | ▶ 00:04 |
We have a max node. The two actions A and B. | ▶ 00:09 |
I want you to fill in the values for all the nodes | ▶ 00:11 |
and click on which action, A or B, is the rational action for Max. | ▶ 00:15 |
The average of these two is 2. That's a chance node. | ▶ 00:00 |
The average of these is 2.5. | ▶ 00:03 |
Max will choose the better of 2 and 2.5, which is 2.5. | ▶ 00:07 |
Therefore, B will be the rational action for Max. | ▶ 00:12 |
Now we know that in this game, if these were terminal nodes, | ▶ 00:00 |
then that would be the right action for the game, and there was nothing to argue about. | ▶ 00:03 |
But what if instead of having these be terminal nodes, these were cutoff nodes, | ▶ 00:08 |
and these were evaluation values for those nodes? | ▶ 00:14 |
Furthermore, if it's an evaluation function, then it's an arbitrary function. | ▶ 00:19 |
Suppose if instead of coming up with these values, we used a different evaluation function, | ▶ 00:24 |
which squared these values, and so we came up with evaluations of 0, 16, 4, and 9. | ▶ 00:30 |
With that function, I want you to repeat the problem of filling in the values for each | ▶ 00:39 |
of these nodes and tell me what the rational action is for Max. | ▶ 00:46 |
The answer is this is a chance node so we take the average of 0 and 16. | ▶ 00:00 |
That's no longer 2. It becomes 8. | ▶ 00:04 |
We take the average of 9 and 4. That's no longer 2.5. It becomes 6.5. | ▶ 00:07 |
Notice what's happened now: Max chooses 8 over 6.5, | ▶ 00:14 |
and now the rational action has shifted from B to A. What's gone on here? | ▶ 00:22 |
We notice that just by making a change to the evaluation function, | ▶ 00:28 |
we changed the rational action. | ▶ 00:32 |
Let's summarize what we've done so far. | ▶ 00:00 |
We've built up this valuation function that tells us the value of any state, | ▶ 00:02 |
and therefore we can choose the best action in a state. | ▶ 00:07 |
We started off just having terminal states and max value states. | ▶ 00:10 |
That's good for one-player, deterministic games, | ▶ 00:14 |
and we realized that that's just the same thing as searches we've seen before | ▶ 00:19 |
where we had A* search or depth-first search. | ▶ 00:23 |
Then we added in an opponent player for two-player or multiplayer games, | ▶ 00:26 |
which is trying to minimize rather than maximize. We saw how to do that. | ▶ 00:31 |
Then we optimized by saying at some point we may not be able to search the whole tree, | ▶ 00:35 |
so we're going to have a cutoff depth and an evaluation function. | ▶ 00:42 |
We recognized that that means that we're no longer perfect in terms of | ▶ 00:46 |
evaluating the tree. We now have an estimate. | ▶ 00:50 |
We also tried to be more computationally effective | ▶ 00:53 |
by throwing in the alpha and beta parameters, | ▶ 00:56 |
which keep track of the best value so far for Max and Min | ▶ 01:00 |
and prune off branches of the tree that are outside of that range | ▶ 01:04 |
that are provably not part of the answer for the best value. | ▶ 01:08 |
We kept track of those through these bookkeeping parameters. | ▶ 01:12 |
Then finally we introduced stochastic games, | ▶ 01:15 |
in which there is an element of chance or luck or rolling of the dice. | ▶ 01:19 |
We realized that in order to evaluate those nodes, | ▶ 01:23 |
we have to take the expected value rather than the minimum or the maximum value. | ▶ 01:26 |
Now we have a way to deal with all the popular types of games. | ▶ 01:30 |
The details now go into figuring out when to cut off and what the right evaluation function is. | ▶ 01:34 |
That is a complex area. | ▶ 01:41 |
A lot of research in AI is being done in that, | ▶ 01:44 |
but it's being done for specific games rather than for the theory in general. | ▶ 01:47 |
Hey, welcome back. | ▶ 00:00 |
Hope you enjoyed the last unit. You guys have been doing great. | ▶ 00:02 |
You've been doing amazing work, getting a lot done, | ▶ 00:04 |
doing a really good job of answering the questions. | ▶ 00:07 |
I've been looking at this book here. | ▶ 00:10 |
This is a book from my father's collection. | ▶ 00:12 |
It's called "Introduction to the Theory of Games" by McKinsey. | ▶ 00:14 |
It was published in 1952, | ▶ 00:18 |
4 years before the start of artificial intelligence. | ▶ 00:20 |
And so game theory and AI have kind of grown up together. | ▶ 00:23 |
They've taken different paths, | ▶ 00:27 |
and now they've begun to merge back together. | ▶ 00:29 |
We've talked about games already in a previous unit. | ▶ 00:31 |
We talked about mostly turn-taking games | ▶ 00:35 |
where 1 player moves and then another moves, | ▶ 00:37 |
and the trick is how to work against an adversary | ▶ 00:40 |
who's trying to maximize his own utility | ▶ 00:43 |
and thus minimize your utility. | ▶ 00:46 |
Game theory handles those types of games, but it also really focuses | ▶ 00:49 |
on games where the 2 moves are simultaneous, | ▶ 00:52 |
or another way to think about them is 1 player moves | ▶ 00:56 |
and then the other moves, but the second player doesn't know | ▶ 00:59 |
what choice the first player made, so it's partially observable. | ▶ 01:02 |
And it's this back and forth of trying to figure out what should I move | ▶ 01:05 |
given what I think he's going to move and what does he think about | ▶ 01:09 |
what I'm going to move that gives game theory its special status. | ▶ 01:12 |
We're going to talk about how that works for AI, | ▶ 01:17 |
and 2 problems are studied. | ▶ 01:20 |
The first is agent design. | ▶ 01:22 |
That is, given a game, find the optimal policy. | ▶ 01:24 |
And the second is mechanism design. | ▶ 01:27 |
That is, given utility functions, | ▶ 01:30 |
how can we design a mechanism so that | ▶ 01:32 |
when the agents act rationally the global utility will be maximized in some way? | ▶ 01:35 |
Let's take a look. | ▶ 01:39 |
We're going to talk about game theory, | ▶ 00:00 |
which is the study of finding an optimal policy | ▶ 00:03 |
when that policy can depend on the opponent's policy and vice versa. | ▶ 00:06 |
And let's look at 1 of the most famous games of all, | ▶ 00:10 |
a game called the "Prisoner's Dilemma." | ▶ 00:14 |
And the story is that there are 2 criminals, Alice and Bob, | ▶ 00:17 |
who have a working relationship, and they're both caught | ▶ 00:21 |
at the scene of a crime, but the police don't quite have enough evidence | ▶ 00:24 |
to put them away. | ▶ 00:28 |
They offer each independently a deal saying | ▶ 00:30 |
"If you testify against your cohort, | ▶ 00:33 |
we'll give you a better deal and give you a reduced sentence time." | ▶ 00:37 |
And Alice and Bob both understand what's going on. | ▶ 00:42 |
They're both perfectly rational, | ▶ 00:45 |
and to understand what the situation is, | ▶ 00:47 |
we draw up a matrix in which we have possible outcomes | ▶ 00:50 |
and possible strategies for each side. | ▶ 00:55 |
For Alice, she has 2 strategies. | ▶ 00:57 |
1 is to testify against Bob, | ▶ 01:01 |
and the other is to refuse to testify. | ▶ 01:04 |
And Bob has the same choices, | ▶ 01:07 |
to testify against Alice or to refuse. | ▶ 01:09 |
In general, different agents may have different actions available to them. | ▶ 01:13 |
And now we show the payoff to each agent. | ▶ 01:17 |
Sometimes those payoffs are opposite, | ▶ 01:20 |
as in a game like chess where if 1 player gets a +1, | ▶ 01:23 |
the other gets a -1. | ▶ 01:27 |
In this game, the payoffs are not opposite, | ▶ 01:29 |
so it's a non-zero-sum game. | ▶ 01:32 |
And if they both refuse to testify against each other, | ▶ 01:34 |
then neither can be convicted of the major crime, | ▶ 01:38 |
but the police will get them for a lesser crime. | ▶ 01:42 |
And let's say they each serve 1 year in jail, | ▶ 01:45 |
so that's a -1 for each of them. | ▶ 01:50 |
If Alice testifies and Bob refuses, | ▶ 01:52 |
then the police are grateful to Alice, | ▶ 01:56 |
and she gets off with nothing, and Bob gets | ▶ 01:59 |
the book thrown at him and gets a -10 score. | ▶ 02:03 |
Likewise if the roles are reversed. | ▶ 02:06 |
And if both testify against each other, then they're both guilty, | ▶ 02:09 |
and they split the penalty. | ▶ 02:13 |
Now, the question that both Alice and Bob have to face | ▶ 02:15 |
is what is the strategy going to be? | ▶ 02:19 |
And the first concept we want to talk about | ▶ 02:21 |
is the concept of a dominant strategy. | ▶ 02:24 |
A dominant strategy is one for which a player | ▶ 02:27 |
does better than any other strategy | ▶ 02:31 |
no matter what the other player does. | ▶ 02:34 |
And now the question is, does either Alice or Bob | ▶ 02:36 |
have a dominant strategy? | ▶ 02:41 |
If Alice has a dominant strategy, | ▶ 02:44 |
I want you to check that off, either testify or refuse, | ▶ 02:46 |
and similarly, if Bob has a dominant strategy, | ▶ 02:51 |
check that off. | ▶ 02:54 |
The answer is for Alice, | ▶ 00:00 |
testify is a dominant strategy. | ▶ 00:02 |
Let's see. We have to compare it against all possible strategies for Bob. | ▶ 00:04 |
If Bob does testify, | ▶ 00:08 |
then Alice gets -5 here and -10 here, | ▶ 00:11 |
so testify is better. | ▶ 00:15 |
And if Bob does refuse, | ▶ 00:17 |
then Alice gets 0 here and -1 here, so testify is better. | ▶ 00:19 |
Testify is better for Alice no matter what, | ▶ 00:24 |
and by similar reasoning, | ▶ 00:27 |
testify is better for Bob no matter what, | ▶ 00:29 |
so testify is a dominant strategy for both players. | ▶ 00:31 |
The next concept I want to talk about | ▶ 00:00 |
is the concept of a Pareto optimal outcome. | ▶ 00:03 |
So, this is talking about outcomes rather than strategies. | ▶ 00:08 |
The strategies are in the margins. | ▶ 00:11 |
The outcomes are in the matrix, and the Pareto optimal outcome | ▶ 00:13 |
is one where there's no other outcome | ▶ 00:17 |
that all players would prefer. | ▶ 00:20 |
And this is named after the economist Pareto. | ▶ 00:22 |
What I want you to answer is | ▶ 00:25 |
is there a Pareto optimal outcome in this game? | ▶ 00:27 |
Is there an outcome such that | ▶ 00:30 |
there's no other outcome that all players would prefer? | ▶ 00:32 |
And the answer is that this outcome, | ▶ 00:00 |
A = -1, B = -1, | ▶ 00:02 |
is Pareto optimal because there's no other outcome | ▶ 00:04 |
that all the players would prefer. | ▶ 00:07 |
Sure, B would prefer being up here, | ▶ 00:09 |
and A would prefer being over here, | ▶ 00:11 |
but there's none that both players can agree on. | ▶ 00:14 |
Now, the third concept is the concept of equilibrium. | ▶ 00:00 |
An equilibrium is an outcome such that no player | ▶ 00:05 |
can benefit from switching to a different strategy, | ▶ 00:07 |
assuming that the other players stay the same. | ▶ 00:10 |
And there was a famous result from the economist John Nash, | ▶ 00:13 |
who was portrayed in the movie and book "A Beautiful Mind," | ▶ 00:17 |
proving that every game has at least 1 equilibrium point. | ▶ 00:21 |
The question here is which, if any, of these outcomes | ▶ 00:25 |
are equilibriums in this game? | ▶ 00:31 |
And the answer is only this outcome, | ▶ 00:00 |
with A = -5, B = -5, is an equilibrium point | ▶ 00:02 |
because if A switches, it gets -10. | ▶ 00:06 |
If B switches, it gets -10. | ▶ 00:09 |
Neither player wants to switch away from keeping with that strategy. | ▶ 00:11 |
Over here, the Pareto optimal solution is not an equilibrium point | ▶ 00:16 |
because if B switches, it will do better, | ▶ 00:20 |
and A will do worse. | ▶ 00:26 |
This is where the game turns out to be a dilemma | ▶ 00:28 |
because there's an equilibrium point that it seems like | ▶ 00:31 |
if both players are rational, they're bound to end up | ▶ 00:36 |
in this outcome, | ▶ 00:39 |
whereas the Pareto optimal solution is over here in the other corner. | ▶ 00:41 |
And yet, being rational, neither Alice nor Bob can see a way | ▶ 00:46 |
to get to this preferred outcome. | ▶ 00:50 |
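To see these three definitions side by side, here is a small Python sketch (my own, not from the lecture) that checks them against the Prisoner's Dilemma payoffs above; the indexing convention with 0 = testify and 1 = refuse is just a choice for this example.

    # Payoffs (Alice, Bob) indexed by [alice_strategy][bob_strategy],
    # with strategy 0 = testify and 1 = refuse.
    PAYOFF = [[(-5, -5), (0, -10)],
              [(-10, 0), (-1, -1)]]

    def dominant_for_alice(s):
        # A dominant strategy does at least as well as the alternative
        # no matter what Bob does, and strictly better somewhere.
        other = 1 - s
        at_least = all(PAYOFF[s][b][0] >= PAYOFF[other][b][0] for b in (0, 1))
        better = any(PAYOFF[s][b][0] > PAYOFF[other][b][0] for b in (0, 1))
        return at_least and better

    def is_equilibrium(a, b):
        # Neither player can benefit by switching strategies unilaterally.
        alice_ok = PAYOFF[a][b][0] >= PAYOFF[1 - a][b][0]
        bob_ok = PAYOFF[a][b][1] >= PAYOFF[a][1 - b][1]
        return alice_ok and bob_ok

    print(dominant_for_alice(0))      # True: testify dominates for Alice
    print([(a, b) for a in (0, 1) for b in (0, 1) if is_equilibrium(a, b)])
    # [(0, 0)]: both testify is the only equilibrium, as argued above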
Let's try another example. | ▶ 00:00 |
This one is called the Game Console Game, | ▶ 00:02 |
and the story is that there is a | ▶ 00:05 |
game console manufacturer called Acme, | ▶ 00:08 |
and it has to decide whether its next console | ▶ 00:12 |
is going to play Blu-ray discs or DVD discs. | ▶ 00:15 |
And then there's a game manufacturer called Best, | ▶ 00:19 |
and they similarly have to decide whether to put out their next game | ▶ 00:23 |
on Blu-ray discs or DVD discs. | ▶ 00:26 |
And the payoffs are if they're both on Blu-ray, | ▶ 00:29 |
A gets a +9, and B is also a +9. | ▶ 00:33 |
If they both choose to go with DVD, it's not quite as lucrative. | ▶ 00:40 |
A gets a +5. B gets a +5. | ▶ 00:44 |
And if they disagree, then they'll be in trouble, and they'll take losses. | ▶ 00:48 |
A gets a -4, and B gets a -1, | ▶ 00:52 |
while here A = -3 and B = -1. | ▶ 00:57 |
The first question is: is there a dominant strategy? | ▶ 01:03 |
And is there one for A? | ▶ 01:06 |
Click here if yes. | ▶ 01:09 |
And is there one for B? | ▶ 01:11 |
Click here if yes, and if there's none at all, click here. | ▶ 01:13 |
There may be both A and B. It's your choice. | ▶ 01:17 |
And then the next question is: is there an equilibrium? | ▶ 01:20 |
Click on any of these 4 outcomes | ▶ 01:24 |
to indicate whether there's an equilibrium. | ▶ 01:29 |
The answers are that there's no dominant strategy | ▶ 00:00 |
because for each player, what's best depends on what the other player does. | ▶ 00:03 |
They do best if they match, | ▶ 00:06 |
and so you can't figure out what your own best strategy is | ▶ 00:08 |
unless you know what the other player is going to play. | ▶ 00:11 |
In terms of equilibrium, there's 2 equilibrium points, | ▶ 00:13 |
the +9/+9 and the +5/+5. | ▶ 00:16 |
Both of them are equilibriums because neither of the players | ▶ 00:20 |
can benefit from switching to the other strategy. | ▶ 00:24 |
And now the next question is: | ▶ 00:00 |
are there 1 or more Pareto optimal outcomes? | ▶ 00:02 |
Click on any of the outcomes that you think are Pareto optimal. | ▶ 00:05 |
And the answer is that there's just 1. | ▶ 00:00 |
The +9/+9 is Pareto optimal. | ▶ 00:02 |
Both players would rather be there than anyplace else. | ▶ 00:04 |
And so it seems that if both players are rational | ▶ 00:07 |
they'll both know that there are 2 equilibrium points, | ▶ 00:10 |
but only 1 of them is Pareto optimal. | ▶ 00:13 |
And even though there isn't a dominant strategy, | ▶ 00:16 |
they can both arrive at that happy conclusion. | ▶ 00:18 |
So, we've seen that it's easy to figure out the solution to a game | ▶ 00:00 |
if there's a dominant strategy or if there's a Pareto-optimal equilibrium. | ▶ 00:03 |
Now let's look at a harder game for which such solutions don't exist. | ▶ 00:07 |
This game is called Two Finger Morra, | ▶ 00:12 |
and it's a betting game, and we're going to show a simplified version of it. | ▶ 00:15 |
Again, we have a simple 4-state outcome matrix, | ▶ 00:19 |
and there are 2 players called even and odd. | ▶ 00:23 |
And they both simultaneously show either | ▶ 00:26 |
1 or 2 fingers. | ▶ 00:29 |
And then if the result of the total number of fingers is even, | ▶ 00:31 |
then the even player wins that many dollars from the odd player. | ▶ 00:37 |
And if the total number of fingers is odd, | ▶ 00:44 |
then the odd player wins that number of dollars from the even player. | ▶ 00:47 |
So, 1 and 1 is 2, so that's even, | ▶ 00:52 |
so even gets +2, and we won't bother writing odd getting -2 | ▶ 00:56 |
because it's a zero-sum game, and it will always be the opposite. | ▶ 01:04 |
Similarly, 2 and 2 is 4, | ▶ 01:08 |
so even gets +4 and odd gets -4. | ▶ 01:11 |
2 and 1 is 3, so even loses 3 dollars | ▶ 01:15 |
and pays it to odd and similarly up here. | ▶ 01:19 |
Now, there's no dominant strategy, and it seems kind of tricky | ▶ 01:23 |
to figure out what the right strategy is. | ▶ 01:26 |
We're going to need more complicated techniques, | ▶ 01:29 |
and it turns out that there is no single move | ▶ 01:31 |
that's the best strategy for either player. | ▶ 01:34 |
But there is what's called a mixed strategy, | ▶ 01:36 |
so a single strategy of always playing one or the other | ▶ 01:39 |
is called a pure strategy, and a mixed strategy | ▶ 01:44 |
is when you have a probability distribution over the possible moves. | ▶ 01:47 |
Now, since it seems complicated to solve this game in this form, | ▶ 00:00 |
one way we can address it is to change from this matrix form | ▶ 00:04 |
into the familiar tree form. | ▶ 00:08 |
We'll move this over here, | ▶ 00:11 |
and we'll draw it as a game tree. | ▶ 00:13 |
Max will be the even player, and min will be the odd player, | ▶ 00:15 |
and for the moment, let's look at the game | ▶ 00:20 |
of what would happen if max had to go first | ▶ 00:25 |
rather than having them move simultaneously. | ▶ 00:28 |
So, max would make a move either 1 or 2. | ▶ 00:31 |
And then min--so max is even and min is O-- | ▶ 00:36 |
would also make the move, 1 or 2, 1 or 2. | ▶ 00:41 |
And then the outcome in terms of E would be 2 here | ▶ 00:46 |
-3 here, -3 here and 4 here. | ▶ 00:50 |
And now what does min do? Well, min tries to minimize. | ▶ 00:54 |
So, we choose 2 here, so this node would be -3. | ▶ 00:57 |
We'd choose 1 here, so this node would be -3, | ▶ 01:01 |
and then E tries to maximize. | ▶ 01:06 |
It doesn't matter what he chooses, | ▶ 01:08 |
and we get a -3 up here. | ▶ 01:11 |
So, that's giving E the disadvantage of having to reveal | ▶ 01:14 |
his or her strategy first. | ▶ 01:17 |
What if we did it the other way around? | ▶ 01:21 |
Let's take a look at that. | ▶ 01:23 |
What if O had to go first and reveal a strategy of 1 or 2 | ▶ 01:25 |
and then E as the maximizing player goes second | ▶ 01:29 |
and does a 1 or 2? | ▶ 01:35 |
And then we have these 4 terminal states here, | ▶ 01:37 |
and I want you to fill in the values of the 4 terminal states | ▶ 01:41 |
taken from the table and the intermediate states | ▶ 01:45 |
or the higher up states in the tree as well. | ▶ 01:49 |
And the answer is 1 + 1 is 2, | ▶ 00:00 |
and so that's even, so it'd be a positive payoff to E. | ▶ 00:02 |
1 + 2 is 3, that's odd, so it'd be a -3. | ▶ 00:07 |
Similarly, 2 +1 is 3, which is odd. | ▶ 00:11 |
So, -3, 2 + 2 is 4. | ▶ 00:14 |
That's a positive payoff. | ▶ 00:17 |
Now E is maximizing, | ▶ 00:19 |
so E would prefer 2 here | ▶ 00:21 |
and would prefer 4 here. | ▶ 00:24 |
And now O is minimizing, | ▶ 00:26 |
so O would prefer 2 here. | ▶ 00:29 |
And notice what we've done here is that | ▶ 00:32 |
we're trying to figure out what the utility of the game is | ▶ 00:35 |
to E, and the true game, | ▶ 00:38 |
both players move simultaneously. | ▶ 00:41 |
Over here, we've handicapped E. | ▶ 00:44 |
And over here, we handicapped O. | ▶ 00:47 |
The true value of the game must be somewhere in between there, | ▶ 00:50 |
so we can say that the utility to E must be | ▶ 00:53 |
less than or equal to 2, which is the value here, | ▶ 00:57 |
and greater than or equal to -3, which is the value here. | ▶ 01:01 |
We've narrowed it down to some degree, but we still haven't nailed down | ▶ 01:06 |
exactly what the utility of the game is. | ▶ 01:08 |
Now, 1 reason there's such a wide discrepancy in the outcomes | ▶ 00:00 |
of these 2 versions of the game is that | ▶ 00:03 |
we handicapped E and O so severely | ▶ 00:06 |
that here E had to reveal his entire strategy, | ▶ 00:09 |
whether he's going to play 1 or 2 all the time, | ▶ 00:13 |
and the same thing for O over here. | ▶ 00:16 |
What if we could think of a way where we didn't handicap them quite as much, | ▶ 00:18 |
where they weren't giving away quite as much information? | ▶ 00:21 |
Let's look at a way to do that. | ▶ 00:24 |
Let's look at the situation where E goes first | ▶ 00:26 |
and has to reveal the strategy, | ▶ 00:30 |
but instead of having to reveal that my strategy is | ▶ 00:32 |
to play 1 or to play 2, | ▶ 00:35 |
what if E says "Well, my strategy is | ▶ 00:37 |
with probability P, I'm going to play 1." | ▶ 00:40 |
"And with probability 1 - P, I'm going to play 2." | ▶ 00:44 |
And that's called a mixed strategy. | ▶ 00:48 |
So, E would announce that strategy for some number P. | ▶ 00:50 |
And there could be an infinite number of possibilities, | ▶ 00:55 |
so we should be drawing an infinite number of branches | ▶ 00:58 |
out of this decision point for all the possibilities | ▶ 01:02 |
for values of P that E would come up with. | ▶ 01:05 |
But instead, I'm just going to sort of parameterize that | ▶ 01:07 |
and just draw 1 line coming out. | ▶ 01:09 |
And now O as the minimizing player | ▶ 01:12 |
has to make a choice between 1 and 2, and what are the outcomes? | ▶ 01:15 |
Well, if E plays 1, then 1 + 1 is 2, | ▶ 01:19 |
so with probability P, we get an outcome of 2. | ▶ 01:23 |
That's 2P, but if E plays 2, | ▶ 01:28 |
with probability 1 - P, then 2 + 1 is 3, | ▶ 01:33 |
so with probability 1 - P, we get a -3. | ▶ 01:36 |
So, 2P - 3(1 - P) | ▶ 01:40 |
would be the outcome for this node. | ▶ 01:44 |
And then the outcome over here would be | ▶ 01:48 |
-3P + 4(1 - P). | ▶ 01:51 |
That's the parameterized outcome given the parameterized strategy. | ▶ 01:56 |
And we could do the same thing on the other side. | ▶ 02:00 |
What if O had to go first? | ▶ 02:03 |
With probability Q, O plays 1, | ▶ 02:05 |
and with probability 1 - Q plays 2. | ▶ 02:09 |
Then even is the maximizer | ▶ 02:12 |
and we get 2Q - 3(1 - Q) | ▶ 02:15 |
and -3Q + 4(1 - Q). | ▶ 02:21 |
Now, what value should E choose for P? | ▶ 00:00 |
Remember, you've got an infinite number of choices | ▶ 00:03 |
for the value of P. | ▶ 00:05 |
Well, if E chose a value of P | ▶ 00:07 |
such that this value here | ▶ 00:10 |
was larger than this value here, | ▶ 00:14 |
then O would know to always play 1, | ▶ 00:17 |
and similarly, if this value was larger, | ▶ 00:20 |
then O would know to always play 2. | ▶ 00:22 |
So, it seems that what E wants to do | ▶ 00:25 |
is choose the value of P such that these 2 are equal. | ▶ 00:28 |
So, how much is that? Well, let's see. | ▶ 00:33 |
This is 2P - 3(1 - P). | ▶ 00:35 |
That's 5P - 3, | ▶ 00:37 |
and we want to set that equal to -3P + 4(1 - P). | ▶ 00:40 |
That's -7P + 4. | ▶ 00:46 |
And let's gather the terms together. | ▶ 00:50 |
So, that would be 12P = 7 | ▶ 00:53 |
or P = 7/12. | ▶ 00:56 |
So, if E chooses the value of P = 7/12, | ▶ 01:00 |
so 7/12 of the time play 1, | ▶ 01:05 |
5/12 of the time play 2, | ▶ 01:07 |
then O doesn't know what to do. | ▶ 01:09 |
No matter whether he chooses 1 or 2, he gets the same result. | ▶ 01:12 |
And you can do the same calculation over here, | ▶ 01:15 |
and it turns out that Q also equals 7/12. | ▶ 01:18 |
Now, let's take this strategy of P = 7/12, 1, | ▶ 01:23 |
and feed it back into the matrix for the game, | ▶ 01:28 |
and if E plays this strategy of 7/12, 1, 5/12, 2, | ▶ 01:30 |
then no matter what the strategy O plays, | ▶ 01:36 |
the value of the game to E, the utility to E is -1/12. | ▶ 01:39 |
And then we can do the same computation over here. | ▶ 01:46 |
If O has the strategy 7/12, 1, and 5/12, 2, | ▶ 01:50 |
then we plug that back into here, and no matter what strategy E chooses, | ▶ 01:55 |
the value there is also -1/12. | ▶ 02:01 |
And so now we've shown that the utility to E | ▶ 02:05 |
is greater than or equal to -1/12 | ▶ 02:09 |
and less than or equal to -1/12. | ▶ 02:12 |
In other words, the utility to E | ▶ 02:14 |
is exactly -1/12, so we've solved the game. | ▶ 02:16 |
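Here is a short Python check of that calculation (a sketch, not class code), using the payoffs to E from the Morra matrix above:

    # Payoff to E (the even player) for (E's fingers, O's fingers).
    E_PAYOFF = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}

    # E plays 1 with probability p, chosen so that O's two replies are equal:
    # 2p - 3(1 - p) = -3p + 4(1 - p), which gives 12p = 7.
    p = 7 / 12
    value_if_o_plays_1 = p * E_PAYOFF[(1, 1)] + (1 - p) * E_PAYOFF[(2, 1)]
    value_if_o_plays_2 = p * E_PAYOFF[(1, 2)] + (1 - p) * E_PAYOFF[(2, 2)]
    print(value_if_o_plays_1, value_if_o_plays_2)   # both -1/12, about -0.083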
Now, the introduction of mixed strategy | ▶ 00:00 |
brings us some curious philosophical problems | ▶ 00:04 |
related to the idea of randomness, secrecy, and rationality. | ▶ 00:07 |
We said that sometimes the rational strategy | ▶ 00:14 |
can be a mixed strategy. | ▶ 00:17 |
That is, ones with probability in it. | ▶ 00:19 |
Probability P, I do action A, | ▶ 00:21 |
and with probability 1 - P I do action B. | ▶ 00:25 |
And that suggests that we need some secrecy | ▶ 00:30 |
so that our opponent doesn't know which of these random choices we're making. | ▶ 00:35 |
The curious thing is that that's only true | ▶ 00:40 |
to the extent of an individual play, | ▶ 00:43 |
not to the extent of the strategy itself. | ▶ 00:45 |
So, if this is the optimal strategy, | ▶ 00:48 |
a mixed strategy, it's okay for us to reveal | ▶ 00:51 |
that strategy to our opponent because our opponent | ▶ 00:55 |
can also compute that that's our rational strategy, | ▶ 00:58 |
and so we won't do any worse by revealing to the opponent | ▶ 01:01 |
exactly what our strategy is. | ▶ 01:05 |
However, the actual implementation of that strategy, | ▶ 01:07 |
that is, this is the grand strategy, that in this situation, | ▶ 01:11 |
whenever we're faced with playing this game, this is what we'll do, | ▶ 01:15 |
that part can be revealed, but the actual choice | ▶ 01:19 |
that this time we're going to choose A or we're going to choose B, | ▶ 01:22 |
of course, that has to be kept secret. | ▶ 01:25 |
If we reveal that, if our opponent can somehow discover | ▶ 01:27 |
which choice we're going to make based on this random choice, | ▶ 01:31 |
then our opponent can get an advantage over us. | ▶ 01:36 |
Now, with respect to rationality, | ▶ 01:39 |
we said that a rational agent is one that does the right thing, | ▶ 01:42 |
and that's still true. | ▶ 01:44 |
However, it turns out that there are games | ▶ 01:46 |
in which you can do better if your opponent believes | ▶ 01:48 |
you are not rational, and that has been said about various politicians | ▶ 01:51 |
throughout history, and I won't pick on one or another. | ▶ 01:54 |
But sometimes it has been said that they are intentionally | ▶ 01:59 |
cultivating an image of being crazy | ▶ 02:02 |
so that they can gain an advantage | ▶ 02:05 |
when faced with certain games with opponents. | ▶ 02:07 |
For example, suppose 1 action available to a leader is to go to war, | ▶ 02:10 |
but both sides realize that the strategy of going to war | ▶ 02:15 |
is dominated by other strategies and thus would be irrational. | ▶ 02:18 |
So, a leader who is perceived to be rational and makes a threat | ▶ 02:22 |
of "Give me this concession, or I'll go to war against you," | ▶ 02:26 |
that's not a credible threat. | ▶ 02:29 |
The leader's threat would be dismissed, and it would have no effect. | ▶ 02:31 |
However, if the leader can convince the opponent | ▶ 02:35 |
that he is irrational or crazy, then the threat suddenly becomes credible. | ▶ 02:38 |
And so note that being irrational doesn't help, | ▶ 02:43 |
but appearing irrational can gain you an advantage. | ▶ 02:46 |
Now, let's give you a chance to solve a game. | ▶ 00:00 |
This will be another 2x2 game, | ▶ 00:02 |
and let's just go ahead and call the players max and min, | ▶ 00:04 |
and they each have 2 moves, | ▶ 00:09 |
and we'll make this be a zero-sum game. | ▶ 00:12 |
We'll just show the value to max, | ▶ 00:15 |
and the value to min will be the negation of that. | ▶ 00:19 |
And what I want you to do to solve the game | ▶ 00:22 |
is tell me what the strategy should be | ▶ 00:25 |
in terms of fill in these blanks. | ▶ 00:28 |
What are the percentages that min should play 1 or 2? | ▶ 00:30 |
And in these blanks, the percentages for max to play 1 or 2. | ▶ 00:35 |
And then tell me the final value for the game. | ▶ 00:40 |
The utility or expected value to max equals what? | ▶ 00:43 |
We see in this game each player has a dominant strategy. | ▶ 00:00 |
For max, if min plays 1, | ▶ 00:03 |
then playing strategy 1 is better than playing strategy 2. | ▶ 00:06 |
If min plays 2, then strategy 1 is also better than strategy 2. | ▶ 00:09 |
So strategy 1 is the dominant strategy, | ▶ 00:14 |
and that should have probability 1. | ▶ 00:19 |
Strategy 2 should have probability 0. | ▶ 00:21 |
And it's the same thing for min, | ▶ 00:23 |
that strategy 2 minimizes better than strategy 1 in both cases. | ▶ 00:25 |
So, strategy 2 should have probability 1, | ▶ 00:29 |
and strategy 1 should have probability 0, | ▶ 00:33 |
and that means we're always going to end up with this outcome, | ▶ 00:36 |
and the value of the game is 3. | ▶ 00:39 |
Now, that last one was easy. Let's do one more. | ▶ 00:00 |
Here we have a game, and the payoff to max is 3, | ▶ 00:03 |
6, 5 and 4. | ▶ 00:07 |
And I want you to tell me what the strategy is, | ▶ 00:11 |
whether it's pure or mixed. | ▶ 00:13 |
What are the probabilities that max should play 1 and 2, | ▶ 00:15 |
and what are the probabilities that min should play 1 and 2? | ▶ 00:18 |
And what is the value of the game to max? | ▶ 00:22 |
So, in this case, there's no dominant strategy, | ▶ 00:00 |
so we'll have to go to a mixed strategy. | ▶ 00:02 |
And we'll start by looking at max and saying | ▶ 00:04 |
he has a mixed strategy with a probability P | ▶ 00:07 |
of playing 1, so then if min chooses 1, | ▶ 00:10 |
then we'll have the outcome 3P + 5(1 - P). | ▶ 00:15 |
And we want to set that equal to the outcome | ▶ 00:22 |
if min plays 2, which is 6P + 4(1 - P). | ▶ 00:25 |
And we solve that, and that works out to P = 1/4. | ▶ 00:32 |
So, P, that was a probability of max playing 1. | ▶ 00:37 |
That should be 1/4, which leaves 3/4 over here. | ▶ 00:41 |
And now we go at it from min's direction, | ▶ 00:46 |
and if min has a probability of Q | ▶ 00:49 |
of playing 1, then we want to set 3Q + 6 (1 - Q) | ▶ 00:53 |
equals 5Q + 4 (1 - Q). | ▶ 01:00 |
And you solve that, and you get Q = 1/2. | ▶ 01:05 |
So, 1/2 and 1/2. | ▶ 01:09 |
And then the utility of the game is the expected value, | ▶ 01:12 |
so we look at all the outcomes and the probability of each outcome. | ▶ 01:15 |
So, 3 times 1/8, because it's 1/4 times 1/2 | ▶ 01:19 |
would be the probability, so 3 times 1/8 | ▶ 01:24 |
+ 6 times 1/8 + 5 times 3/8 | ▶ 01:27 |
+ 4 times 3/8, and that works out to 4.5. | ▶ 01:31 |
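A quick Python sketch (my own check, assuming the same matrix orientation as the lecture, with payoffs to max of 3, 6, 5, and 4) confirms those probabilities give an expected value of 4.5:

    # Payoff to max for (max's move, min's move).
    PAYOFF = {(1, 1): 3, (1, 2): 6, (2, 1): 5, (2, 2): 4}

    p = 1 / 4   # probability max plays 1, from 3p + 5(1 - p) = 6p + 4(1 - p)
    q = 1 / 2   # probability min plays 1, from 3q + 6(1 - q) = 5q + 4(1 - q)

    value = sum(PAYOFF[(m, n)]
                * (p if m == 1 else 1 - p)
                * (q if n == 1 else 1 - q)
                for m in (1, 2) for n in (1, 2))
    print(value)   # 4.5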
Here's a geometric interpretation that may help you understand a little better what's going on. | ▶ 00:00 |
Here we've gone back to the two-finger Morra game. | ▶ 00:05 |
Now, remember we looked at the two possibilities of E going first | ▶ 00:08 |
and revealing a strategy of playing one with probability P and two with probability of 1 - P. | ▶ 00:13 |
Now O has a choice of what to do, and O wants to minimize. | ▶ 00:20 |
If O chooses one, he'll be somewhere along this line | ▶ 00:25 |
corresponding to this strategy for different values of P. | ▶ 00:30 |
This graph here is showing the utility to E for different values of P | ▶ 00:34 |
that E chose in E's strategy. | ▶ 00:41 |
Now since O can achieve the minimum | ▶ 00:44 |
since O gets to choose the strategy of doing one or two, | ▶ 00:46 |
O can be anywhere on this frontier. | ▶ 00:50 |
It makes sense for E to push that up as high as possible since E is the maximizer. | ▶ 00:54 |
E will choose this point here, which turns out to be at P = 7/12. | ▶ 01:00 |
The same argument going on this side. | ▶ 01:06 |
Here O has gone first and chosen the strategy, q:one and (1 - q):two. | ▶ 01:08 |
Now E has a choice of what to do. E can either choose two and be along this line, | ▶ 01:14 |
depending on what the value of q is, | ▶ 01:20 |
or can choose one, which will be along this line. | ▶ 01:22 |
It would make sense for O, who is trying to minimize, | ▶ 01:25 |
to get this frontier down as low as possible. | ▶ 01:28 |
O would choose the value of q that puts us right here at this point. | ▶ 01:32 |
It turns out that that also is q = 7/12. | ▶ 01:36 |
We see that each side is trying to maximize or minimize, | ▶ 01:42 |
and we end up at a distinguished point that's the intersection of their two strategies. | ▶ 01:46 |
Now so far we've dealt only with games that take a single turn-- | ▶ 00:00 |
that is there are two players, they both simultaneously reveal their move, | ▶ 00:04 |
and the game is over. | ▶ 00:08 |
But game theory can also deal with more complex games | ▶ 00:10 |
that have multiple rounds of turn taking. | ▶ 00:12 |
Here I'm describing a simple game of poker, | ▶ 00:16 |
the simplest type of poker you've probably ever seen. | ▶ 00:19 |
The deck only has four cards. | ▶ 00:22 |
One card is dealt to each player. | ▶ 00:24 |
There are two rounds. | ▶ 00:26 |
In the first, player one has a choice to either raise--to bet a dollar--or to check. | ▶ 00:28 |
Then in the second round, the second player has the chance to call-- | ▶ 00:34 |
to say I want to see what's up--or to fold. | ▶ 00:40 |
Now this format begins to look very much like the game tree | ▶ 00:44 |
that we talked about in the previous unit. | ▶ 00:48 |
It starts out and there's a chance node. | ▶ 00:51 |
This corresponds to dealing the cards, with probability 1/6 that the first player gets an Ace | ▶ 00:54 |
and the second player gets an Ace. | ▶ 01:00 |
One-third that the first player gets an Ace and the second player gets a King, and so on. | ▶ 01:02 |
Then there are maximizing nodes and minimizing nodes. | ▶ 01:08 |
What this format, which is known as the sequential game format, | ▶ 01:12 |
is especially good at is keeping track of the belief states of the possibilities | ▶ 01:15 |
of what each agent knows and doesn't know. | ▶ 01:22 |
The tree as a whole describes everything that's going on, | ▶ 01:27 |
but each agent doesn't know at which point in the tree they are. | ▶ 01:30 |
So if you're agent number one, you know that you have an Ace, | ▶ 01:35 |
so you know you're in one of these two states denoted by the dotted lines. | ▶ 01:38 |
You're either in the state where you have an Ace and the other player has an Ace, | ▶ 01:43 |
or in the state where you have an Ace and the other player has a King, | ▶ 01:47 |
But you don't know which one you're at. | ▶ 01:51 |
Similarly, over here there is confusion for the second player as to what state they're in. | ▶ 01:53 |
Now, we can solve this game using this game tree approach, | ▶ 01:57 |
and it's not quite the same as the max and the min approach, | ▶ 02:00 |
because where you are in the states, what you know about the partial information, | ▶ 02:05 |
affects your strategy in a way that we haven't dealt with before. | ▶ 02:11 |
One possibility for how you can evaluate a game like this | ▶ 02:15 |
is just to convert it to the other form. | ▶ 02:20 |
The form we've seen before is called the normal form or matrix form. | ▶ 02:22 |
This is the sequential game in extensive form. | ▶ 02:25 |
If we convert from the extensive form, we get something like this. | ▶ 02:29 |
Here for each player, we've denoted by a two-letter strategy | ▶ 02:33 |
what you should do when you have an Ace and what you should do when you have a King. | ▶ 02:36 |
So we end up with an exponentially large search space, | ▶ 02:42 |
but here the game was so simple, that it ends up being rather small, | ▶ 02:46 |
and the game is rather trivial, and you can solve it. | ▶ 02:50 |
It turns out that there are two equilibria corresponding to the strategy for player two, | ▶ 02:53 |
which is he should call when he has an Ace, and he should fold when he has a King, | ▶ 03:00 |
and the strategy for player one is it doesn't matter if he raises or checks when he has an Ace, | ▶ 03:05 |
but he should check when he has a King. | ▶ 03:12 |
That would give the game a value of zero. | ▶ 03:14 |
Now this works fine for the simple version of poker. | ▶ 03:18 |
For real poker, this table would have about 10^18 states, | ▶ 03:21 |
and it would be impossible to deal with. | ▶ 03:26 |
So we need some strategies for getting back down to a reasonable number of states. | ▶ 03:28 |
One of the best strategies is to try abstraction. | ▶ 00:00 |
Instead of dealing with every single possible state of the game, | ▶ 00:04 |
we can take similar states and deal with them as if they were the same. | ▶ 00:07 |
For example, in poker one abstraction that works pretty well is to eliminate the suits. | ▶ 00:10 |
If no player is trying to get a flush, then we can treat all four Aces as if they were identical | ▶ 00:15 |
rather than treating the four of them as being different | ▶ 00:21 |
and similarly with all the other face values. | ▶ 00:25 |
Another thing we can do is lump similar cards together. | ▶ 00:27 |
Rather than saying that 2, 3, 4, and 5 are all different values, | ▶ 00:31 |
if I know that I'm holding a pair of 10s then I can think of the other players' cards | ▶ 00:36 |
as being equal to 10, lower than 10, or higher than 10. | ▶ 00:42 |
and otherwise lump them together as the same. | ▶ 00:47 |
Similarly, I can lump bets together. | ▶ 00:49 |
Rather than thinking of every dollar amount of a bet from $1 to the upper limit, | ▶ 00:52 |
I can lump the bets into small, medium, and large. | ▶ 00:58 |
Then finally another way to do abstraction is rather than considering every possible deal | ▶ 01:02 |
of all the cards, I can just consider a small subset of the deals | ▶ 01:07 |
to do Monte Carlo sampling over the possible deals, | ▶ 01:13 |
rather than considering them all. | ▶ 01:17 |
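As a tiny illustration of that last idea (a sketch with a made-up simulator, not real poker code), Monte Carlo sampling just averages the value of a strategy over randomly drawn deals instead of enumerating every deal:

    import random

    def estimate_value(strategy, deck, num_samples=1000):
        # Estimate the strategy's expected value by sampling deals at random
        # rather than enumerating every possible deal.
        total = 0.0
        for _ in range(num_samples):
            deal = random.sample(deck, 2)        # one card to each player
            total += play_out(strategy, deal)    # play_out is a hypothetical game simulator
        return total / num_samples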
This approach of extensive games can handle quite a lot | ▶ 01:20 |
in terms of dealing with uncertainty, dealing with partial observability, | ▶ 01:24 |
dealing with multiple agents, stochastic, sequential, dynamic. | ▶ 01:30 |
But there's a few things they can't handle very well. | ▶ 01:34 |
They aren't very good at unknown actions. | ▶ 01:36 |
We need to know what all the actions are for either player before we can define the game. | ▶ 01:38 |
Game theory doesn't deal very well with continuous actions, | ▶ 01:44 |
because we have this matrix-like form. | ▶ 01:46 |
It doesn't deal very well with irrational opponents. | ▶ 01:49 |
We can know that we're going to do the best we possibly can against a rational opponent, | ▶ 01:51 |
but it doesn't tell us how to exploit our opponent's weakness | ▶ 01:56 |
if he turns out to be irrational. | ▶ 02:00 |
Then finally, it doesn't deal with unknown utilities. | ▶ 02:02 |
If we don't know what it is we're trying to optimize, | ▶ 02:05 |
game theory isn't going to tell us how to do it. | ▶ 02:07 |
This exercise describes a game played between | ▶ 00:00 |
the federal reserve board and politicians. | ▶ 00:03 |
Now the politicians have a choice whether they want to contract fiscal policy, | ▶ 00:07 |
expand it, or do nothing, and the Fed has the same three choices. | ▶ 00:12 |
Each party has preference for what outcome they would like to see. | ▶ 00:18 |
Here we've ranked them for each party from 1 being the worst outcome | ▶ 00:22 |
to 9 being the best outcome. | ▶ 00:26 |
What I want you to do is find the equilibrium point for this game. | ▶ 00:29 |
There will be one equilibrium point. I want you to find it. | ▶ 00:33 |
The equilibrium point defines a pure strategy for each player. | ▶ 00:37 |
Tell me the pure strategy for the Fed. | ▶ 00:42 |
Is it contract, do nothing, or expand? | ▶ 00:45 |
Click on the right box here. | ▶ 00:48 |
Similarly, for the politicians, click on the right box for their strategy, | ▶ 00:50 |
which leads to the equilibrium point. | ▶ 00:54 |
Then tell me the outcome for the game for each player for that equilibrium point. | ▶ 00:56 |
Then tell me if the equilibrium is Pareto optimal. | ▶ 01:02 |
Now, we could determine the equilibrium point by examining all 9 of the outcomes | ▶ 00:00 |
and checking each one to see if both parties do no better by switching. | ▶ 00:05 |
But instead, I'm going to show an alternative method to analyze games, | ▶ 00:11 |
which is to look for dominated strategies. | ▶ 00:15 |
There are no dominant strategies here, but there are dominated strategies. | ▶ 00:18 |
For example, for the politician, the strategy of contracting is dominated | ▶ 00:22 |
by the strategy of doing nothing. | ▶ 00:28 |
To the politician, 2 is greater than 1, 5 is greater than 4, and 9 is greater than 6. | ▶ 00:30 |
We can say that this strategy is dominated, | ▶ 00:37 |
and we can take it out of consideration. | ▶ 00:40 |
Now, how does that help? | ▶ 00:42 |
Well, now in the other direction, we do have a dominant strategy that we didn't have before. | ▶ 00:44 |
Now for the Fed, the option of contracting gives them 8, which is better than 5 or 4, | ▶ 00:49 |
or 3 which is better than 2 and 1. | ▶ 00:55 |
This a dominant strategy for the Fed, and we can mark that off. | ▶ 00:58 |
Now for the politicians, they know they're going to be in this column, | ▶ 01:03 |
and they have a choice of getting a 2 or a 3. | ▶ 01:07 |
The 3 would be the strategy for the politicians. | ▶ 01:11 |
That leads us to this Nash equilibrium point, | ▶ 01:14 |
and the values of that outcome are 3 for each party. | ▶ 01:18 |
Is that Pareto optimal? | ▶ 01:22 |
Actually, it's more like Pareto pessimal in that this is the worst total. | ▶ 01:24 |
Out of all these outcomes the total is only 6, | ▶ 01:30 |
whereas every other one is better. | ▶ 01:36 |
To answer the question specifically is it Pareto optimal, | ▶ 01:38 |
the answer is no, because any of these four would be better for both parties. | ▶ 01:42 |
That may tell you something about our political system. | ▶ 01:47 |
Next time you get an outcome that you don't like, | ▶ 01:50 |
don't assume that the players are irrational. | ▶ 01:52 |
Just assume that that's the way the game was set up. | ▶ 01:55 |
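That elimination procedure can also be written generically. Here is a small Python sketch (not the lecture's code, and with no payoffs hard-coded, since the Fed and politician numbers are on the slide) that repeatedly removes any strategy that some other strategy of the same player strictly beats against every remaining opposing strategy:

    def eliminate_dominated(rows, cols, row_payoff, col_payoff):
        # rows, cols: iterables of strategy names for each player.
        # row_payoff, col_payoff: dicts mapping (row, col) to that player's payoff.
        rows, cols = set(rows), set(cols)
        changed = True
        while changed:
            changed = False
            for s in list(rows):
                if any(all(row_payoff[(t, c)] > row_payoff[(s, c)] for c in cols)
                       for t in rows if t != s):
                    rows.remove(s)
                    changed = True
            for s in list(cols):
                if any(all(col_payoff[(r, t)] > col_payoff[(r, s)] for r in rows)
                       for t in cols if t != s):
                    cols.remove(s)
                    changed = True
        return rows, cols   # the strategies that survive elimination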
Now let's switch to the other part of game theory, | ▶ 00:00 |
which remember we called mechanism design. | ▶ 00:02 |
It could really be called game design. | ▶ 00:06 |
The idea is that someone is going to be running a game | ▶ 00:08 |
that players are going to be participating in. | ▶ 00:12 |
We want to design the rules of the game such that we get a high outcome | ▶ 00:14 |
or a high expected utility for the people that run the game, | ▶ 00:20 |
for the players who play the game, and for the public at large. | ▶ 00:25 |
Here's an example of a game. | ▶ 00:29 |
This is the advertising game. | ▶ 00:31 |
Here I've shown it on an Internet search engine, where you do a search, | ▶ 00:34 |
and then ads show up, sometimes at the top, sometimes at the right, | ▶ 00:38 |
sometimes at the bottom of the page, depending on the mechanism. | ▶ 00:42 |
This is also done at sites like eBay that sell items, | ▶ 00:46 |
and there's lots of places where auctions are run. | ▶ 00:50 |
The idea of mechanism design is to come up with the rules of the auction | ▶ 00:54 |
that will make it attractive to bidders and/or people who want to respond to the ads, | ▶ 00:58 |
and make a good result for all. | ▶ 01:06 |
Now, one property that you would like an auction to have is to | ▶ 01:09 |
attract more bidders to make it a more competitive market, | ▶ 01:13 |
and you could attract more if it's less work for them. | ▶ 01:17 |
It's easier for the bidders if they have a dominant strategy. | ▶ 01:21 |
You saw how hard it was to work out the value of a game | ▶ 01:25 |
when you didn't have a dominant strategy, | ▶ 01:28 |
and how easy it is to work it out if you did. | ▶ 01:30 |
If you want to save everybody a lot of trouble, design the game | ▶ 01:33 |
so that dominant strategies exist. | ▶ 01:36 |
These strategies have various names in auctions. | ▶ 01:38 |
Sometimes we call an auction strategy-proof | ▶ 01:41 |
if you only need to know your own strategy. | ▶ 01:45 |
You don't have to think about what all the other people are going to be bidding. | ▶ 01:47 |
They also call that truth revealing or incentive compatible. | ▶ 01:51 |
Let's examine a type of auction called the second-price auction. | ▶ 00:00 |
This is popular in various internet search and auction sites. | ▶ 00:05 |
The way it works is that we have a line of possible prices-- | ▶ 00:10 |
higher prices at the top--and bids come in. | ▶ 00:15 |
Different players can bid whatever they want, | ▶ 00:19 |
and whoever bids the highest is the winner, | ▶ 00:23 |
but the price that they pay is the price of the second highest bidder. | ▶ 00:26 |
Now let's say you're participating in this auction, | ▶ 00:30 |
and something is for sale, and you place a value on that. | ▶ 00:33 |
We'll call that value "V", and say V is here. | ▶ 00:36 |
Your bid we'll call "b", and the highest other bid we'll call "c." | ▶ 00:42 |
Now, if your bid is higher than all the others, | ▶ 00:48 |
then the payoff is you get the value of the item, | ▶ 00:54 |
because you won the auction, and you get V, | ▶ 00:57 |
but you have to pay the second highest price, which is c. | ▶ 01:00 |
You get V minus c. Otherwise, you lose the auction. | ▶ 01:03 |
You don't get anything, and you don't pay anything. | ▶ 01:08 |
The value to you of the auction is zero. | ▶ 01:10 |
What I want you to do is fill in this chart to look at different strategies for different possible bids. | ▶ 01:12 |
We'll say that the value to you of the item for sale is V equals 10. | ▶ 01:15 |
You have the option of bidding, say, 12, 10, or 8, | ▶ 01:26 |
and we'll consider the cases where the highest other bid is 7, 9, 11, or 13. | ▶ 01:32 |
What I want you to do is fill in this chart with the value to you | ▶ 01:41 |
of this game according to your strategy and the strategies of the other players. | ▶ 01:46 |
Tell me if one of these strategies is a dominant strategy. | ▶ 01:51 |
Then tell me is that dominant strategy, if there is one, a truth revealing strategy? | ▶ 01:55 |
I should have one note about dominance. | ▶ 02:01 |
When we talked about it before, we glossed over the possibility of ties. | ▶ 02:04 |
If some policy is better everywhere than any other policy, | ▶ 02:09 |
then we say that that policy strictly dominates the others. | ▶ 02:14 |
On the other hand, if there are some ties and some places where it's better | ▶ 02:17 |
but none where it's worse, then we say it weakly dominates. | ▶ 02:22 |
Either way, it's a case of dominance. | ▶ 02:26 |
Now I'll do the first entry to get you started. | ▶ 02:28 |
If you bid 12 and the highest other bid is 7, | ▶ 02:30 |
then you have the high bid, so you win. | ▶ 02:34 |
It's a second-price auction, so you pay 7. | ▶ 02:36 |
The value of the goods is 10, so the total value of the outcome is 10 minus the cost of 7, which is 3. | ▶ 02:39 |
I want you to fill in the rest. | ▶ 02:49 |
Here are the answers. | ▶ 00:00 |
We can see that the strategy of bidding 10, the true value, is weakly dominant. | ▶ 00:02 |
It's the same here, but it's better in these two cases. | ▶ 00:06 |
Let's look at these cases a little bit more carefully and figure out what's going on. | ▶ 00:10 |
If you bid 12--so if b was up here--and if c snuck in between the bid and the valuation, | ▶ 00:14 |
then you'd be paying too much. You'd be paying more than the goods are worth. | ▶ 00:22 |
You'd end up with a negative utility. | ▶ 00:27 |
So you don't want to bid more than what it's really worth to you. | ▶ 00:29 |
On the other hand, if you bid down here, and if c snuck in in between | ▶ 00:33 |
what your bid is and what the valuation is, | ▶ 00:38 |
then you've lost the auction, and you get a zero, | ▶ 00:41 |
but you should have won--or it would have been worth your while to win-- | ▶ 00:44 |
because the price still would have been a bargain to you. | ▶ 00:48 |
That says that the rational strategy, the dominant strategy in a second-price auction, | ▶ 00:50 |
is to bid your true value and that makes it a truth-revealing auction mechanism. | ▶ 00:55 |
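As a minimal sketch of the quiz above (the function name and table layout are illustrative, not from the lecture), the second-price payoff rule of V minus c on a win and 0 on a loss can be tabulated to check that bidding the true value is weakly dominant:

def payoff(value, bid, highest_other_bid):
    # Second-price auction: the winner pays the highest other bid (ties ignored).
    if bid > highest_other_bid:
        return value - highest_other_bid   # win the item, pay the second price
    return 0                               # lose: pay nothing, get nothing

V = 10
for b in (12, 10, 8):
    print("bid", b, [payoff(V, b, c) for c in (7, 9, 11, 13)])
# bid 12: [3, 1, -1, 0]
# bid 10: [3, 1, 0, 0]   <- never worse, sometimes better: weakly dominant
# bid 8:  [3, 0, 0, 0]

Running this reproduces the chart: the row for bidding the true value of 10 ties or beats every other row, which is exactly the truth-revealing property.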
This unit we'll return to the topic of planning, | ▶ 00:00 |
and we'll talk about 4 things that we left out last time we talked about planning. | ▶ 00:05 |
First is time. | ▶ 00:09 |
That is, rather than just saying whether an action occurs before or after another action, | ▶ 00:11 |
we'll talk about actions that persist over a length of time. | ▶ 00:15 |
Second is resources necessary to do a task. | ▶ 00:21 |
Third is active perception-- | ▶ 00:25 |
That is, taking the action of perceiving something. | ▶ 00:27 |
Fourth is hierarchical plans--that is, plans that consist of steps which have substeps. | ▶ 00:31 |
We'll start with time. | ▶ 00:36 |
We'll look at the problem of scheduling a series of tasks, each of which has duration. | ▶ 00:00 |
We'll show a task network which has a start and a finish date, | ▶ 00:05 |
and then has a sequence of tasks, which have to be completed | ▶ 00:09 |
and arrows to indicate precedence of which ones have to go before other ones. | ▶ 00:13 |
This task has to occur before this one, | ▶ 00:18 |
but there's nothing said about the relationship between this task and this task. | ▶ 00:21 |
We'll list for each task their duration. | ▶ 00:25 |
This one takes 30 minutes, 30, 10, 60, 15, and 10. | ▶ 00:28 |
Scheduling then is a process of figuring out a schedule under which | ▶ 00:34 |
we specified the times at which each of these tasks starts | ▶ 00:40 |
such that we can finish as soon as possible. | ▶ 00:44 |
Now schedule is defined in terms of specifying for every task in the network | ▶ 00:00 |
the earliest start time, which we'll call "ES," and the latest possible start time, | ▶ 00:05 |
which we'll call "LS," for which it's possible to complete the task network | ▶ 00:11 |
in the shortest possible total amount of time. | ▶ 00:16 |
We can define these with a set of recursive formulas | ▶ 00:19 |
which can be solved by dynamic programming. | ▶ 00:21 |
The earliest start time of the start state is defined as being zero. | ▶ 00:24 |
The earliest start time of any state B is defined as being the maximum over all As | ▶ 00:30 |
which have an arrow leading into B-- | ▶ 00:36 |
that is all As that are defined to be predecessors of B-- | ▶ 00:38 |
of the earliest start time of A plus the duration of A. | ▶ 00:43 |
For example, the earliest start time of this state here would be the maximum | ▶ 00:49 |
over all the ones that are coming in, which is only this one. | ▶ 00:54 |
The maximum of its start time, which will be here, plus its duration, which would be 30. | ▶ 00:57 |
Then the latest start time is defined by saying the latest start time of the finish, | ▶ 01:03 |
is the same as the earliest start time of the finish, | ▶ 01:09 |
because the finish by itself has no duration-- | ▶ 01:12 |
it's just there to give us a point to end at. | ▶ 01:15 |
The latest start time in general of any node A | ▶ 01:18 |
is the minimum over all B which come after A, | ▶ 01:21 |
of the latest start time of B minus the duration of A. | ▶ 01:26 |
These formulas together define a unique schedule, which is the fastest possible schedule. | ▶ 01:30 |
What I want you to do is fill in for me in the upper left hand the earliest start time, | ▶ 01:37 |
in the upper right the latest start time for each of these nodes. | ▶ 01:43 |
Here I've zoomed in a bit just to give you a little bit more room to fill in the blanks. | ▶ 01:47 |
You can see the earliest and latest start times filled in for all the states. | ▶ 00:00 |
Here's an alternative method for visualizing this. | ▶ 00:05 |
Here I've given names to the various actions-- | ▶ 00:08 |
the three on the top and the three on the bottom. | ▶ 00:12 |
They have a duration along a time line, and we can see the time line there. | ▶ 00:14 |
Notice that these three actions have no slack between them. | ▶ 00:20 |
One has to start after the other. | ▶ 00:23 |
We say that these are on the critical path in that if any of these slip in the schedule, | ▶ 00:25 |
then the whole schedule will slip. | ▶ 00:30 |
Whereas these three actions have some slack. | ▶ 00:33 |
They could occur anywhere within this gray window, | ▶ 00:36 |
and if this action slipped to the right, then the others would slip to the right | ▶ 00:39 |
without affecting the final schedule. | ▶ 00:43 |
I should say that over the years the field of scheduling has moved | ▶ 00:45 |
in and out of artificial intelligence. | ▶ 00:48 |
Some people have worked on it, but most of the work on scheduling | ▶ 00:51 |
has been done in the field of operations research-- | ▶ 00:54 |
a closely related field to artificial intelligence. | ▶ 00:56 |
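Here is a minimal sketch of the two recursions, computed by dynamic programming over a small assumed task network (the node names and durations below are made up for illustration and are not the network in the figure):

# ES(Start) = 0; ES(B) = max over predecessors A of ES(A) + Duration(A)
# LS(Finish) = ES(Finish); LS(A) = min over successors B of LS(B) - Duration(A)
durations = {"Start": 0, "A": 30, "B": 30, "C": 10, "D": 60, "E": 15, "Finish": 0}
successors = {"Start": ["A", "D"], "A": ["B"], "B": ["C"], "C": ["Finish"],
              "D": ["E"], "E": ["Finish"], "Finish": []}
predecessors = {n: [m for m in successors if n in successors[m]] for n in successors}

ES = {"Start": 0}
for n in ["A", "D", "B", "E", "C", "Finish"]:          # any topological order
    ES[n] = max(ES[p] + durations[p] for p in predecessors[n])
LS = {"Finish": ES["Finish"]}
for n in ["C", "E", "B", "D", "A", "Start"]:           # reverse topological order
    LS[n] = min(LS[s] - durations[n] for s in successors[n])

for n in ES:
    print(n, "ES =", ES[n], "LS =", LS[n], "(critical)" if ES[n] == LS[n] else "")

Nodes where ES equals LS have no slack; they form exactly the critical path discussed above.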
The next question I want to address is the one of resources. | ▶ 00:00 |
Resources are things like this pile of nuts and bolts that are used somewhere in a plan. | ▶ 00:04 |
Of course, resources could be handled just in the language of classical planning. | ▶ 00:10 |
Here we have a description of a problem domain in classical planning language. | ▶ 00:14 |
The goal is to get an assembly inspected, | ▶ 00:19 |
and in order to do that, we have the action of inspecting, | ▶ 00:22 |
which looks at an assembly which has five nuts and bolts | ▶ 00:25 |
which each have to be fastened to each other. | ▶ 00:29 |
If that precondition is satisfied, then the effect is that the assembly is inspected. | ▶ 00:32 |
We have an action of fastening a nut and bolt to the assembly, | ▶ 00:38 |
which requires a nut and a bolt, and the result is that they're fastened | ▶ 00:42 |
and that the nut and bolt are no longer available for use. | ▶ 00:47 |
Initially we have four nuts and five bolts. | ▶ 00:52 |
Now the question is with this description of this problem can we achieve the goal? | ▶ 00:56 |
Assuming that we have a depth-first tree search planner, | ▶ 01:02 |
how many paths would that planner have to consider? | ▶ 01:07 |
Would it be 1, 4, 5, 4 +5, 4 * 5, 4! + 5!, or 4! * 5!? | ▶ 01:10 |
The answer is we can't achieve the goal. | ▶ 00:00 |
We're just missing a nut so we can't do it. | ▶ 00:03 |
But we're going to have to consider 4 factorial times 5 factorial paths | ▶ 00:06 |
before we discover that. | ▶ 00:10 |
The reason is because we start out, and we say in order to achieve inspected, | ▶ 00:12 |
we need the precondition of being fastened. | ▶ 00:16 |
In order to achieve fastened, we need some nut and some bolt. | ▶ 00:18 |
We can try N1 and B1, but then we would also, when we end up backtracking, | ▶ 00:24 |
have to try N2 against B1, N3 against B1, and so on, for all these and all these, | ▶ 00:30 |
and we'd have to do that at every step in the backtrack. | ▶ 00:38 |
So we end up trying all combinations of nuts and all combinations of bolts. | ▶ 00:41 |
That seems silly, and so the idea of resources is to treat the nuts and bolts | ▶ 00:46 |
as interchangeable quantities rather than making each one distinct, so we can handle them more efficiently. | ▶ 00:52 |
Here I've shown how to extend the language of classical planning to handle resources. | ▶ 00:00 |
We've added a new type of statement saying that there are resources | ▶ 00:05 |
and how many of each there are. | ▶ 00:09 |
We can say there's five nuts and four bolts, | ▶ 00:11 |
and we're also going to explicitly model inspectors, and we have one of them. | ▶ 00:13 |
Then the actions have two new types of clauses. | ▶ 00:17 |
The fasten action has a consume clause, saying it consumes resources | ▶ 00:21 |
and once it uses them, they're gone forever. | ▶ 00:26 |
Fastening is going to consume one nut and one bolt. | ▶ 00:29 |
The inspect action has a used clause, and that says it's going to use one of the resources, | ▶ 00:32 |
the inspector, while the action is going on. | ▶ 00:37 |
But once the action is completed then the inspector has returned to the pool | ▶ 00:40 |
and is available for use elsewhere. | ▶ 00:44 |
Keeping track of resources this way gets rid of that computational or exponential explosion | ▶ 00:47 |
of looking at different combinations by just treating all of the same resource identically. | ▶ 00:55 |
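As a small sketch of why this helps (the dictionary representation is an assumption for illustration, not the planner's actual data structure), checking feasibility against resource counts is a single comparison rather than a search over which particular nut goes with which particular bolt:

resources = {"nut": 4, "bolt": 5, "inspector": 1}

def can_fasten(times, pool):
    # Each fastening consumes one nut and one bolt.
    return pool["nut"] >= times and pool["bolt"] >= times

# The assembly needs 5 fastenings; with only 4 nuts this fails in one check
# instead of after 4! * 5! combinations of individual nuts and bolts.
print(can_fasten(5, resources))   # False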
The final topic in this unit is called hierarchical planning. | ▶ 00:00 |
The idea here is we want to close the abstraction gap. What do I mean by that? | ▶ 00:06 |
Well, let's think about what you have to do to plan your own lifetime. | ▶ 00:12 |
You live about maybe a couple of billion seconds, | ▶ 00:16 |
and during that time you have a choice of actions to make, | ▶ 00:20 |
and you have maybe around 1,000 muscles, | ▶ 00:25 |
which you can operate maybe around 10 per second. | ▶ 00:30 |
You end up with a lifetime of somewhere around 10^13 actions, | ▶ 00:35 |
give or take an order of magnitude or two. | ▶ 00:41 |
But there's a big gap between 10^13 and the 10^4 or so actions | ▶ 00:44 |
that current planning algorithms or programs can deal with. | ▶ 00:50 |
Part of the problem with such a big gap is that it's just difficult to deal at the level of an individual muscle movement. | ▶ 00:54 |
We'd rather deal with more abstract plans. | ▶ 01:01 |
We're going to introduce the notion of a hierarchical task network, | ▶ 01:04 |
and rather than having a plan be a sequence of individual steps, | ▶ 01:08 |
we can talk about higher-order steps, of which there may be a smaller number, | ▶ 01:13 |
where each individual step can correspond to multiple concrete steps. | ▶ 01:18 |
This idea is called refinement planning. | ▶ 01:22 |
Here's how refinement planning works. | ▶ 00:00 |
In addition to regular actions, we have abstract actions | ▶ 00:02 |
like going from my home to the San Francisco airport. | ▶ 00:06 |
Then we have possible refinements that turn these abstract actions into concrete actions. | ▶ 00:10 |
Here one refinement is I can drive from home to long-term parking | ▶ 00:16 |
and then take the shuttle to the airport. | ▶ 00:21 |
Another refinement is I can just take a taxi. | ▶ 00:23 |
Here's another example of an abstract action, | ▶ 00:26 |
which is if I'm at one point on a grid, ab, and I want to get to point xy, | ▶ 00:29 |
and if I know the grid is all connected, | ▶ 00:35 |
and I have this abstract action of just navigating from ab to xy. | ▶ 00:37 |
One refinement says if I'm already there I do nothing. | ▶ 00:41 |
Another refinement says I can start the journey by going left. | ▶ 00:46 |
Another refinement says I can start the journey by going right and so on. | ▶ 00:50 |
The idea is I can figure out a complex plan that involves | ▶ 00:54 |
navigating around, picking up an object, doing something else, | ▶ 00:59 |
and do that planning just at the level of abstract actions like navigate | ▶ 01:02 |
rather than having to figure out a path from ab to xy. | ▶ 01:08 |
How do we know when we have a solution? | ▶ 01:12 |
A hierarchical task network achieves the goal if for every part, every abstract action, | ▶ 01:14 |
at least one of the refinements achieves the goal. | ▶ 01:20 |
We only need at least one of them, because we're the planner. | ▶ 01:23 |
We get to make the choice. | ▶ 01:26 |
It's like an and/or search where we can make the best possible choices, | ▶ 01:28 |
and if any of the choices work, then the goal can be achieved. | ▶ 01:33 |
Now, in addition to doing an and/or search, | ▶ 00:00 |
sometimes we can solve an abstract hierarchical task network planning problem | ▶ 00:03 |
without going all the way down to the concrete steps. | ▶ 00:08 |
Let's talk about how to do that. | ▶ 00:12 |
Here we have a description of a state space. | ▶ 00:14 |
The start state is here, and the goal state is outlined in gray here. | ▶ 00:16 |
We have one abstract action, and we're shown a set of possible states | ▶ 00:22 |
that can be reached by that abstract action, | ▶ 00:30 |
if we refine the abstract action, using one concrete action or another. | ▶ 00:33 |
This is like when we were dealing with belief states | ▶ 00:37 |
where we would move, because we had a stochastic action, | ▶ 00:41 |
from one state to several possible other states. | ▶ 00:44 |
Here we have several possible states that we'll end up with, | ▶ 00:48 |
not because the actions are stochastic, | ▶ 00:51 |
but because we haven't decided yet which refinement we're going to use. | ▶ 00:54 |
This would be a single step that would bring us to this belief state, | ▶ 00:58 |
and then when we add a second step, we get to this belief state. | ▶ 01:01 |
Now we can check to see if we can achieve the goal with this two-step plan | ▶ 01:07 |
just by checking if there is an intersection between the reachable state and the goal state. | ▶ 01:11 |
In this case, there is. | ▶ 01:17 |
We know that we've achieved the goal, | ▶ 01:19 |
and now if we want to find a refinement that actually works, | ▶ 01:21 |
the way to do it is to search backwards rather than forward. | ▶ 01:25 |
If we search forward we'd have a large tree of possibilities, | ▶ 01:28 |
but if we search backwards, we know the intersections here. | ▶ 01:33 |
What could have brought us to here? Only this refinement. | ▶ 01:36 |
And what could have brought us to this state? Only this refinement. | ▶ 01:40 |
That's the plan that is a refinement of this abstract plan that achieves the goal. | ▶ 01:43 |
Now sometimes it may be very difficult to specify | ▶ 00:00 |
exactly what states can be reachable by an abstract action, | ▶ 00:03 |
because the refinements are complicated. | ▶ 00:08 |
We can go with the notion of an approximate set of reachable states. | ▶ 00:11 |
That's what I've shown schematically here. | ▶ 00:16 |
For this abstract action, I've shown a lower bound and an upper bound | ▶ 00:18 |
on the states that are reachable. | ▶ 00:23 |
What do I mean by that? | ▶ 00:25 |
Consider the abstract action of going to the airport in San Francisco. | ▶ 00:27 |
Now, some things I know are going to be true about the resulting state. | ▶ 00:31 |
I know it's going to take, say, half an hour to get there no matter what way I go. | ▶ 00:35 |
That's always going to be true. | ▶ 00:41 |
Other things depend on which choice I make. | ▶ 00:43 |
I may consume some money if I take a taxi. | ▶ 00:46 |
I may consume some gas if I take a car, | ▶ 00:50 |
but I may not be able to specify exactly which of those combinations hold true. | ▶ 00:53 |
So we approximate the set of reachable states by this lower bound | ▶ 00:59 |
of things that we know we can get to | ▶ 01:03 |
and this upper bound of things that we might be able to get to, | ▶ 01:05 |
but we're not quite sure if all combinations of them will check out depending on the refinement. | ▶ 01:08 |
Here, similarly, there's another set of lower and upper bounds and here as well. | ▶ 01:15 |
These are the goals. | ▶ 01:21 |
What I want you to tell me is for each of these three actions, | ▶ 01:23 |
is it guaranteed, yes, that I can reach the goal state if I choose the right refinement, | ▶ 01:28 |
or is it never possible--no, that I'll never be able to reach the goal state-- | ▶ 01:35 |
or is it uncertain yet because the description of upper and lower bound | ▶ 01:39 |
doesn't tell us enough about whether we can reach the goal state. | ▶ 01:46 |
Answer that for this abstract action here, | ▶ 01:49 |
and for this abstract action here, | ▶ 01:54 |
and for this abstract action here. | ▶ 01:57 |
In the case of this abstract action, | ▶ 00:00 |
we know all the possible outcomes are somewhere within here | ▶ 00:03 |
and none of those intersect with a goal, | ▶ 00:07 |
so there's nothing we can do to make this one work. | ▶ 00:09 |
For this abstract action, we see that there is an intersection, | ▶ 00:12 |
and there's an intersection even in the underestimate of the state. | ▶ 00:16 |
We know that we can reach someplace in here, | ▶ 00:20 |
and since we have the choice of where we want to go, | ▶ 00:23 |
we know we can reach there, | ▶ 00:26 |
so we know that we can always refine this abstract action to achieve the goal. | ▶ 00:28 |
Over here there's an intersection, | ▶ 00:33 |
but it's only in the overestimate--the outside part of the search space. | ▶ 00:36 |
So we're not quite sure. | ▶ 00:41 |
We have to look more carefully at the refinements to see if there is | ▶ 00:43 |
a combination of refinements that allow us to reach this state, | ▶ 00:46 |
or if the combination of refinements leave us somewhere over here, | ▶ 00:49 |
which is not inside that state. | ▶ 00:54 |
So that would be questionable or unknown yet. | ▶ 00:56 |
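The three-way test can be sketched with sets standing in for state descriptions (an assumption for illustration; a real planner would use richer state representations):

def classify(lower_bound, upper_bound, goal):
    if lower_bound & goal:
        return "yes"      # some guaranteed-reachable state is a goal state
    if not (upper_bound & goal):
        return "no"       # not even the optimistic estimate reaches the goal
    return "maybe"        # only the upper bound overlaps: refine further

goal = {"g1", "g2"}
print(classify({"s1"}, {"s1", "s2"}, goal))              # no
print(classify({"g1", "s3"}, {"g1", "s3", "s4"}, goal))  # yes
print(classify({"s5"}, {"s5", "g2"}, goal))              # maybe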
Here's one more topic. | ▶ 00:00 |
We're going to talk about how to extend classical planning to allow active perception | ▶ 00:02 |
to deal with partial observability. | ▶ 00:07 |
Here's a problem description. | ▶ 00:09 |
There's a table and a chair, and there are two cans of paint. | ▶ 00:11 |
The table is within our field of view, | ▶ 00:15 |
and our goal is to have the chair and the table have the same color. | ▶ 00:18 |
Here's the actions. | ▶ 00:23 |
We can remove the lid from a can, making it open. | ▶ 00:25 |
We can paint one thing with that can if the can is open. | ▶ 00:28 |
We also have an active perception action, | ▶ 00:33 |
which is we can look at something. | ▶ 00:37 |
If it's in view, we can look at it, | ▶ 00:39 |
and then we're looking at that one thing, and we're no longer looking | ▶ 00:41 |
at whatever we were looking at before. | ▶ 00:43 |
Now, here's the big extension that in addition to actions, | ▶ 00:45 |
we now have percept schemas. | ▶ 00:50 |
Action schemas and perception schemas, | ▶ 00:52 |
and we can perceive the color of something if it's an object. | ▶ 00:55 |
Here the objects are declared to be the table and the chair, | ▶ 00:59 |
and if it's within our field of view. | ▶ 01:03 |
Notice that here we're introducing a new variable. | ▶ 01:05 |
We never did that before in planning. | ▶ 01:08 |
Before all the actions in planning, all the variables, | ▶ 01:11 |
were predefined by matching against the precondition. | ▶ 01:14 |
Here we're introducing a new variable. | ▶ 01:18 |
We're saying if these preconditions are true, | ▶ 01:20 |
then you can perceive something, and you'll learn something new. | ▶ 01:24 |
You'll learn the value of this variable. | ▶ 01:27 |
Here's a question. How can we achieve this goal? | ▶ 01:29 |
The first thing I want to ask is, without even thinking about the percepts, | ▶ 01:32 |
is there a conformant plan--that is, a plan that doesn't do sensing-- | ▶ 01:37 |
that will allow us to achieve this goal? | ▶ 01:42 |
Is there that type of conformant plan? | ▶ 01:45 |
Tell me yes or no. | ▶ 01:48 |
The answer is, yes, there is. | ▶ 00:00 |
That is we can remove the lid from the can, | ▶ 00:02 |
and we can do that to either can, can one or can two, | ▶ 00:05 |
and then without knowing what color that can is and without knowing | ▶ 00:09 |
what color any of the furniture is, | ▶ 00:12 |
we can first paint the table, and then paint the chair. | ▶ 00:15 |
Then we know they'll both have the same color, | ▶ 00:18 |
and we'll have achieved the goal. | ▶ 00:20 |
Now one of the problems with this plan is | ▶ 00:00 |
say the chair and the table were already the same color. | ▶ 00:02 |
We would've wasted our time painting them when we didn't have to do it. | ▶ 00:05 |
Now the next question is, yes or no, is there a better sensory plan? | ▶ 00:09 |
That is a plan that uses perception and comes up with, | ▶ 00:15 |
in at least some cases, a smaller number of painting actions. | ▶ 00:20 |
We're going to allow these plans to have conditionals in them | ▶ 00:25 |
as well as having perception. | ▶ 00:28 |
The answer is, yes, there is. | ▶ 00:00 |
There's a variety of possibilities. | ▶ 00:02 |
I'll show you a somewhat complex one. | ▶ 00:04 |
This one says we can look at the table and look at the chair. | ▶ 00:07 |
Then if the color of the table and the color of the chair are the same color, c, | ▶ 00:12 |
then do nothing. We're done. We don't have to do any painting. | ▶ 00:17 |
Otherwise, we can remove the lids from can one and look at it, | ▶ 00:21 |
remove the lid from can two and look at it. | ▶ 00:24 |
If the color of the table is the same as the color of the can, | ▶ 00:28 |
then we can paint the chair with that can. | ▶ 00:33 |
This is for any possible can, either can one or can two. | ▶ 00:36 |
Otherwise, if the color of the chair and the color of the can match, | ▶ 00:41 |
then we can paint the table, and otherwise we have to paint both of them. | ▶ 00:48 |
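The same conditional plan can be sketched as ordinary control flow around sensing actions (look_at, remove_lid, and paint are hypothetical stand-ins for the percept and action schemas above):

def conditional_plan(look_at, remove_lid, paint):
    table_color = look_at("table")
    chair_color = look_at("chair")
    if table_color == chair_color:
        return                              # already the same color: done
    can_colors = {}
    for can in ("can1", "can2"):
        remove_lid(can)
        can_colors[can] = look_at(can)
    for can, color in can_colors.items():
        if color == table_color:
            paint("chair", can)             # only the chair needs painting
            return
    for can, color in can_colors.items():
        if color == chair_color:
            paint("table", can)             # only the table needs painting
            return
    paint("table", "can1")                  # otherwise paint both with one can
    paint("chair", "can1")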
In this question I'd like you to fit a Markov model. | ▶ 00:00 |
We have 3 sequences of observation--A, B, C, A, B, C. | ▶ 00:05 |
This is the first state--state0--and these are the other states. | ▶ 00:10 |
Same for A, A, B, B, C, C and A, A, A, C, C, C. | ▶ 00:14 |
These are 3 different observation sequences. | ▶ 00:19 |
The first element here is state0. | ▶ 00:22 |
All the other ones are examples of transitions from state x_t-1 to x_t. | ▶ 00:26 |
Using maximum likelihood, I'd like to know the initial probabilities for state0 | ▶ 00:32 |
and all the transition probabilities. Those are indicated by the arrow. | ▶ 00:38 |
This is the probability that A at time t - 1 goes to A at time t. | ▶ 00:43 |
Here they are. | ▶ 00:48 |
For any of the symbols in the sequence, we see there are three possible outcomes-- | ▶ 00:50 |
A-B-C, A-B-C, and A-B-C. | ▶ 00:55 |
Each has a probability. Obviously, these 3 over here have to add up to 1. | ▶ 00:58 |
These 3 over here have to add up to 1, and these 3 over here have to add up to 1. | ▶ 01:01 |
Initially there is only A, | ▶ 00:00 |
and by maximum likelihood we get that the probability of A being in the first place is 1, | ▶ 00:03 |
and all the other ones are 0. | ▶ 00:08 |
There are 7 transitions out of A as indicated by the lines under the As over here. | ▶ 00:10 |
In 3 cases, A is followed by A--over here, over here, and over here. | ▶ 00:16 |
This gives us 3/7 for the maximum likelihood estimator. | ▶ 00:22 |
A flows into B in another 3 cases--over here, over here, and over here--again, 3/7. | ▶ 00:25 |
There is 1 instance where A moves into C--1/7. | ▶ 00:31 |
There are 4 transitions out of B-- | ▶ 00:36 |
3 go to C, and 1 goes to B over here, | ▶ 00:39 |
which gives us 0, 1/4, and 3/4. | ▶ 00:43 |
Then there are 4 transitions out of C--one, two, three, four-- | ▶ 00:46 |
3 of which result in C, but one goes into A over here. | ▶ 00:51 |
These are the results--1/4, 0, and 3/4. | ▶ 00:55 |
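The same counts can be reproduced with a short maximum-likelihood fit over the three sequences (a sketch; the string encoding of the sequences is just a convenience):

from collections import Counter, defaultdict
from fractions import Fraction

sequences = ["ABCABC", "AABBCC", "AAACCC"]

initial = Counter(seq[0] for seq in sequences)
transitions = defaultdict(Counter)
for seq in sequences:
    for prev, cur in zip(seq, seq[1:]):
        transitions[prev][cur] += 1

print("P(state0):", {s: Fraction(c, len(sequences)) for s, c in initial.items()})
for prev in "ABC":
    total = sum(transitions[prev].values())
    print("P(. |", prev, "):", {cur: Fraction(transitions[prev][cur], total) for cur in "ABC"})

This prints P(A) = 1 for the initial state and the rows 3/7, 3/7, 1/7; 0, 1/4, 3/4; and 1/4, 0, 3/4, matching the answer above.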
Here we're given a Markov chain between A and B, | ▶ 00:00 |
where A transitions to itself with probability 0.9 and transitions to B with probability 0.1. | ▶ 00:03 |
B stays in B with 0.5 probability and transitions back into A with 0.5 probability. | ▶ 00:09 |
I'd like to know the stationary distribution over here. | ▶ 00:15 |
So what's the probability of A in the stationary distribution? | ▶ 00:18 |
Of course correspondingly, what's the probability of B in the stationary distribution? | ▶ 00:23 |
The answer is 5/6 for A and 1/6 for B. | ▶ 00:00 |
To see why, let's call this probability over here X, and we can now solve the equation | ▶ 00:07 |
that in the stationary case, the probability of A is 0.9 times the probability of A before, plus 0.5 | ▶ 00:13 |
times the probability of B, which is 1 - X. | ▶ 00:22 |
If you expand this, it gives us X = 0.4X + 0.5, | ▶ 00:24 |
or put differently, | ▶ 00:29 |
0.6X = 0.5. That is the same as saying X = 5/6, which is what I wrote over here. | ▶ 00:30 |
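A quick sketch that checks this result two ways, by solving the fixed-point equation directly and by iterating the transition update from an arbitrary starting distribution:

# Closed form: X = 0.9X + 0.5(1 - X)  =>  0.6X = 0.5  =>  X = 5/6.
x = 0.5 / 0.6
print(x)                              # 0.8333... = 5/6

# Iterating the chain converges to the same stationary value.
p_a = 0.5
for _ in range(100):
    p_a = 0.9 * p_a + 0.5 * (1 - p_a)
print(p_a, 1 - p_a)                   # about 5/6 for A and 1/6 for B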
I am now asking a hidden Markov model question. | ▶ 00:00 |
We're given the following hidden Markov model with 2 internal states, | ▶ 00:04 |
where the probability of transitioning to the other state is 0.5, | ▶ 00:08 |
and the probability of staying is therefore 0.5. | ▶ 00:11 |
This Hidden Markov Model has 2 possible measurements or observations, X and Y. | ▶ 00:15 |
The probability of observing X and Y depends on what state | ▶ 00:22 |
the hidden Markov model is in. | ▶ 00:25 |
For A, it's 0.1 for X and 0.9 for Y. | ▶ 00:27 |
For B, it's 0.8 for X and 0.2 for Y. | ▶ 00:32 |
Let's assume that the initial probability distribution at x0 is 1/2 for either of the 2 states. | ▶ 00:37 |
I would like to know what's the posterior probability of being in state A at x0 given that we observed | ▶ 00:46 |
X at x0 and then Y; and what's the posterior probability of state A at x1 given the observation of X at x0, | ▶ 00:51 |
No subtitles... | ▶ 00:00 |
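Since the answer segment has no subtitles, here is a minimal filtering sketch under one reading of the question (that it asks for the posterior of state A after observing X at time 0 and then Y at time 1); the dictionaries simply encode the numbers given above:

trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
emit = {"A": {"X": 0.1, "Y": 0.9}, "B": {"X": 0.8, "Y": 0.2}}

belief = {"A": 0.5, "B": 0.5}
for obs in ["X", "Y"]:
    # Measurement update: weight by the emission probability, then normalize.
    belief = {s: belief[s] * emit[s][obs] for s in belief}
    z = sum(belief.values())
    belief = {s: p / z for s, p in belief.items()}
    print(obs, belief)
    # Prediction step with the uniform transitions, before the next observation.
    belief = {s: sum(belief[r] * trans[r][s] for r in belief) for s in belief}

Under this reading, the code gives 1/9 for A after seeing X, and 9/11 for A after additionally seeing Y.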
This is a particle filter question. | ▶ 00:00 |
Suppose we had a world with 4 states. | ▶ 00:03 |
In these 2 states over here, we tend to observe A with 80% probability. | ▶ 00:05 |
The remaining 20%, we'll observe B. | ▶ 00:11 |
In these 2 states, we tend to observe B with 80% probability | ▶ 00:13 |
and with 20% probability, we observe A. | ▶ 00:18 |
Suppose we have 3 particles--1 over here, 1 over here, and 1 over here, | ▶ 00:21 |
and we observe A. | ▶ 00:26 |
Let's call this particle over here lowercase a, | ▶ 00:27 |
this one lowercase b, | ▶ 00:31 |
and this one lowercase c. | ▶ 00:32 |
What is the probability that we sample a, given that we just observed A, | ▶ 00:34 |
which means the true state is more likely to be 1 of these 2 states over here? | ▶ 00:39 |
What's the probability of sampling b? What's the probability of sampling c? | ▶ 00:43 |
The particle a will get an importance weight of 0.8, non-normalized. | ▶ 00:00 |
You have to normalize in a second step. | ▶ 00:06 |
Particle b will get an importance weight of 0.2. Same for particle c. | ▶ 00:08 |
If you add those together, we get 1.2. | ▶ 00:15 |
Then we have to divide everything by 1.2--which is 2/3 for the probability of sampling a, | ▶ 00:18 |
and 1/6 each for b or c. | ▶ 00:25 |
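The arithmetic can be written out in a couple of lines (a sketch, with the particle names as in the quiz):

weights = {"a": 0.8, "b": 0.2, "c": 0.2}   # P(observe A | particle's state)
total = sum(weights.values())              # 1.2 before normalization
print({p: w / total for p, w in weights.items()})
# {'a': 0.666..., 'b': 0.166..., 'c': 0.166...}, i.e. 2/3, 1/6, 1/6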
Here's another particle filter question. | ▶ 00:00 |
Now we're looking at the state transition, | ▶ 00:03 |
beginning with the same state space as before and the same 3 particles-- | ▶ 00:05 |
1 over here, 1 over here, 1 over here, | ▶ 00:09 |
and to give the states names, we're going to call them a1, a2, b1, and b2. | ▶ 00:12 |
Let's assume we take a single random particle with uniform distribution, | ▶ 00:18 |
and we simulate its next state. | ▶ 00:23 |
The state transition works as follows: A particle will move with probability 1 to an adjacent state, | ▶ 00:26 |
where adjacent means north, south, east, or west, but not diagonal. | ▶ 00:33 |
Because every particle has 2 adjacent states, it'll break ties at random, | ▶ 00:37 |
so you're going to pick 1 of the 2 with 50% probability. | ▶ 00:41 |
So for this 1 particle that you've drawn at random | ▶ 00:44 |
and moved to its next position, | ▶ 00:46 |
what's the probability that it finds itself in a1, a2, b1, or b2? | ▶ 00:47 |
If you look at the chances of the particles that can end up in a1, | ▶ 00:00 |
we find that only this particle can go up here, so we count 1. | ▶ 00:04 |
For a2, there are 2 particles that can go there, so we count 2. | ▶ 00:08 |
There are 2 particles that can end up in b1--this one and this guy over here. Again, 2, | ▶ 00:11 |
and one that can make it into b2. | ▶ 00:16 |
Now this is a total of 6. If you normalize, you get these probabilities: | ▶ 00:18 |
a1 is worth 1/6. | ▶ 00:23 |
a2, 1/3, which is 2/6ths. | ▶ 00:25 |
b1, again 1/3, | ▶ 00:28 |
and b2 is 1/6. | ▶ 00:30 |
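The motion update can also be sketched by enumerating the choices (the particle placement below, one particle each in a1, b1, and b2, is an assumption consistent with the counts in this answer, since the actual placement is only shown in the figure):

from fractions import Fraction

adjacent = {"a1": ["a2", "b1"], "a2": ["a1", "b2"],
            "b1": ["a1", "b2"], "b2": ["a2", "b1"]}
particles = ["a1", "b1", "b2"]             # assumed placement of the 3 particles

dist = {cell: Fraction(0) for cell in adjacent}
for p in particles:                        # uniform choice of particle: 1/3
    for nxt in adjacent[p]:                # uniform choice of neighbor: 1/2
        dist[nxt] += Fraction(1, 3) * Fraction(1, 2)
print(dist)                                # a1: 1/6, a2: 1/3, b1: 1/3, b2: 1/6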
So here's a multiple choice question for particle filters. | ▶ 00:00 |
Say we implement a particle filter, such as for mobile robot localization, | ▶ 00:05 |
and we use exactly 1 particle. | ▶ 00:08 |
Which one of the following statements are true? | ▶ 00:10 |
Check any or all of the following statements. | ▶ 00:14 |
Measurements will be ignored? Check this box if you believe that's the case. | ▶ 00:17 |
The result is generally poor. | ▶ 00:21 |
It cannot represent multi-modal distributions. | ▶ 00:24 |
The initial state, if known, is ignored. | ▶ 00:27 |
The state transitions are ignored. | ▶ 00:30 |
If none of the above applies, check the final box down here. | ▶ 00:33 |
And the answer is measurements are indeed ignored which has to do with the following: | ▶ 00:00 |
We do weigh particles by the measurement probability but we normalize them to 1, | ▶ 00:06 |
and if there's only 1 particle, it will always normalize itself back to 1, | ▶ 00:11 |
so the measurement probability has no effect. | ▶ 00:15 |
The results are generally poor. | ▶ 00:18 |
That is, a single particle is just insufficient to represent anything interesting, | ▶ 00:20 |
so that is absolutely the correct answer over here. | ▶ 00:25 |
Clearly, a single particle cannot represent multi-modal distributions, | ▶ 00:27 |
because multiple modes look something like this over here, | ▶ 00:31 |
and it's just 1 particle, so this is actually correct. | ▶ 00:34 |
The initial state, if known, is not necessarily ignored. | ▶ 00:37 |
You might actually place the particle at the initial state | ▶ 00:41 |
and it might consider it in the filtering result. | ▶ 00:44 |
The state transitions are also not ignored because we will still propagate this particle | ▶ 00:48 |
forward according to the state transition, and because 3 of them are true, | ▶ 00:53 |
the final one isn't. | ▶ 00:57 |
Another particle filter question. | ▶ 00:00 |
Check the following statements if they're true. | ▶ 00:03 |
They are usually easy to implement. | ▶ 00:05 |
They scale quadratically with the dimensionality of the state space. | ▶ 00:08 |
They can only be applied to discrete state spaces. | ▶ 00:14 |
Finally, if none of those applies, check the final check mark over here. | ▶ 00:17 |
And only the first answer is correct. | ▶ 00:00 |
They are usually easy to implement compared to any other filter. | ▶ 00:03 |
They do not scale quadratically, in fact, normally they scale exponentially | ▶ 00:07 |
with the dimensionality of the state space because you need | ▶ 00:11 |
exponentially many particles to fill up the state space. | ▶ 00:13 |
The filter that scales quadratically is called a Kalman filter, | ▶ 00:16 |
but we didn't really talk about it in this class. | ▶ 00:21 |
They can only be applied to discrete state spaces is clearly wrong. | ▶ 00:24 |
We saw how to apply it to robot localization, | ▶ 00:27 |
which is a continuous, real-valued state space, | ▶ 00:30 |
and because this first one is true, the last one, none of the above, is clearly false. | ▶ 00:33 |
For this exercise, I want you to solve this game. | ▶ 00:00 |
So it's a 2 x 2 zero-sum game. | ▶ 00:03 |
We're showing the utilities to the player Max, and so tell me what Max's strategy is | ▶ 00:06 |
by putting in probabilities in these 2 boxes for his 2 plays--1 and 2. | ▶ 00:11 |
Tell me what Min's strategy is by putting in probabilities in these boxes, | ▶ 00:17 |
and then tell me the value of the game, the expected utility to Max. | ▶ 00:22 |
The answer is both players have a rational mixed strategy. | ▶ 00:00 |
For Max, he plays 1--8/17ths of the time and 2--9/17ths of the time. | ▶ 00:04 |
Min would play 1--10/17ths and 2--7/17ths, | ▶ 00:10 |
and then the utility of the game to Max turns out to be 5/17ths. | ▶ 00:15 |
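The actual payoff matrix is only shown in the video, but as a sketch, any 2 x 2 zero-sum game with a fully mixed equilibrium can be solved with the indifference principle (the matrix below is a made-up example, not the one in the quiz):

from fractions import Fraction

# payoff[i][j] = utility to Max when Max plays move i and Min plays move j
payoff = [[Fraction(3), Fraction(-1)],
          [Fraction(-2), Fraction(4)]]
(a, b), (c, d) = payoff
denom = a - b - c + d                     # assumes no saddle point (fully mixed)

p = (d - c) / denom                       # probability Max plays move 1
q = (d - b) / denom                       # probability Min plays move 1
value = (a * d - b * c) / denom           # expected utility to Max

print("Max plays 1 with", p, "Min plays 1 with", q, "value", value)

The formulas come from making the opponent indifferent: Max mixes so that both of Min's moves give the same expected result, and vice versa.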
In this scheduling problem, we have a network of actions | ▶ 00:00 |
with the precedence relations between them and the duration of each action | ▶ 00:03 |
shown in each box. | ▶ 00:08 |
What I want you to do is, for each action fill in the earliest start time in the upper left | ▶ 00:09 |
and the latest start time in the upper right. | ▶ 00:15 |
Here we see the start times for each of the actions. | ▶ 00:00 |
Note that the critical path, on which the earliest and latest start times are the same, | ▶ 00:03 |
goes straight down the center. | ▶ 00:08 |
Here's a game tree for a stochastic 2-player game. | ▶ 00:00 |
There are max nodes, min nodes, and chance nodes. | ▶ 00:03 |
What I want you to do is back up all the values, so fill in a value for the value of | ▶ 00:07 |
each of these nodes, and then check off all the nodes that could be pruned away | ▶ 00:13 |
by a procedure that's similar to alpha beta, but updated to handle chance nodes. | ▶ 00:19 |
So what I mean by that is, a node can be pruned away if evaluating the nodes | ▶ 00:26 |
is not necessary to figure out what the best moves are for max and min. | ▶ 00:32 |
For the chance nodes, all the possibilities are equally probable. | ▶ 00:37 |
So here, there's a 1/3 chance of each of these. | ▶ 00:41 |
Here there's a 1/2 chance of each of these. | ▶ 00:43 |
And in this game, the result of every game is either +1, -1, or 0, | ▶ 00:47 |
and all the players know that those are the only possible outcomes for the game. | ▶ 00:55 |
Therefore, the players can take that into account when trying to figure out | ▶ 01:00 |
which nodes to prune away. | ▶ 01:04 |
For backed up values for the min nodes, the backed up value is always the minimum. | ▶ 00:00 |
For the chance nodes, it's the expectation, or the average, | ▶ 00:05 |
and for the max node, it's the maximum. | ▶ 00:10 |
Here are the nodes that can be pruned. | ▶ 00:13 |
This and this can be pruned because in each of these cases min has achieved | ▶ 00:14 |
the best possible play that min can get, and therefore, | ▶ 00:20 |
doesn't need to consider any other possibilities. | ▶ 00:24 |
Once you know you can win the game or do the best you can, | ▶ 00:26 |
you don't need to find another way to do just as well. | ▶ 00:29 |
This node here can be pruned and thus, all the ones below it because | ▶ 00:32 |
at this point, when we're trying to evaluate this node, max knows he can get | ▶ 00:37 |
at least 1/3, and here once we know that this node is worth -1 | ▶ 00:42 |
then we know that regardless of the value here, | ▶ 00:48 |
that has to be somewhere between -1 and +1. | ▶ 00:52 |
So therefore, the expectation has to be between -1 and 0, | ▶ 00:56 |
and if 0 is the best that this can be, max knows he already has 1/3 over here. | ▶ 01:01 |
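Backing up values with chance nodes (expectiminimax, here without the pruning) can be sketched recursively; the tiny tree at the end is a made-up example, not the one from the quiz:

def backup(node):
    kind, rest = node
    if kind == "leaf":
        return rest                            # rest holds the leaf value
    values = [backup(child) for child in rest]
    if kind == "max":
        return max(values)
    if kind == "min":
        return min(values)
    return sum(values) / len(values)           # chance node, equally likely children

tree = ("max", [("chance", [("leaf", 1), ("leaf", -1)]),
                ("chance", [("leaf", 0), ("leaf", 1)])])
print(backup(tree))    # 0.5

The pruning argument above adds one more ingredient: because every outcome is known to lie in [-1, +1], a chance node's value can be bounded before all of its children have been evaluated.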
Here's a 2-player game. Each player has 3 possible moves. | ▶ 00:00 |
What I want you to tell me is, first, does A have a dominant strategy? Yes or no. | ▶ 00:04 |
Second, does B have a dominant strategy? Yes or no. | ▶ 00:09 |
Third, click on all the boxes that are equilibrium points. | ▶ 00:12 |
A does not have a dominant strategy, but B does. | ▶ 00:00 |
If B plays the middle e, then in this case, B will get 5, which is more than 3 or 2. | ▶ 00:03 |
In this case, B will get 7, which is more than 2 or 4, | ▶ 00:11 |
and in this case, B will get 8, which is more than 7 or 5. | ▶ 00:15 |
So that makes this play the dominant strategy for B. | ▶ 00:19 |
Now if B is going to do that, then what should A do? | ▶ 00:22 |
Well, A should try to get the best possible value that A can, | ▶ 00:26 |
and that would be here, and that makes this square the lone equilibrium point. | ▶ 00:29 |
Hi again. It's great to see you again. | ▶ 00:00 |
We talked a lot about basic methods of AI, | ▶ 00:03 |
and from today on we'd like to go into applications. | ▶ 00:07 |
Specifically, today we'll talk about computer vision. | ▶ 00:11 |
Computer vision is a very vibrant field | ▶ 00:15 |
that concerns itself with making sense out of camera images or video. | ▶ 00:18 |
Many devices today are equipped with cameras, such as cell phones or cars, | ▶ 00:24 |
and making sense out of image data has become a really important subfield | ▶ 00:30 |
of artificial intelligence. | ▶ 00:34 |
Today I'll teach you some of the very basics. | ▶ 00:37 |
It's not as deep as my graduate level class on computer vision, | ▶ 00:39 |
and I hope you get a chance to take that in the future, | ▶ 00:42 |
but I hope to enable you to apply some of the very basic methods | ▶ 00:46 |
to, for example, use images and classify them using artificial intelligence technology | ▶ 00:50 |
through feature extraction and other techniques | ▶ 00:56 |
and also to start doing some of the more 3D-oriented tasks | ▶ 00:59 |
such as 3D reconstruction. | ▶ 01:03 |
So let's start with the very, very basics | ▶ 01:06 |
and ask ourselves what is a camera. | ▶ 01:08 |
Cameras come in all sizes and shapes. | ▶ 01:13 |
This is my beautiful Nikon D3 camera [shutter clicks], | ▶ 01:16 |
but I don't use it much because it's very heavy, even though it takes beautiful pictures. | ▶ 01:20 |
This is the camera I use the most. It's a cell phone camera. | ▶ 01:26 |
It's an 8 megapixel camera over here with a flash, | ▶ 01:30 |
and I can start it, and you get to see whatever is underneath, | ▶ 01:33 |
like this pen over here. | ▶ 01:39 |
I can also activate the front camera, and you get to see the way I've been recording | ▶ 01:44 |
all those wonderful online lectures over all those weeks | ▶ 01:50 |
with this little camera over here. | ▶ 01:54 |
In all of those cameras there is a lens and there's a chip, | ▶ 01:58 |
and the light is captured from the environment and focused through the lens on the chip, | ▶ 02:03 |
which raises the question, how does a lens and a chip really work? | ▶ 02:08 |
[Thrun] The science of how images are created using cameras is called image formation, | ▶ 00:01 |
where formation just means the way an image is being captured. | ▶ 00:06 |
Perhaps the easiest model of a camera is called a pinhole camera. | ▶ 00:11 |
In a pinhole camera, the light from within the world | ▶ 00:15 |
goes through a very small hole--ideally it's a really, really small hole-- | ▶ 00:19 |
to project into a camera chip that sits somewhere in the background. | ▶ 00:25 |
So for example, if you had an object that was a person over here, | ▶ 00:29 |
then this person would be projected as follows. | ▶ 00:33 |
The feet would be projected to over here and the head to over here, | ▶ 00:36 |
which gives us this inverted person on the projection plane or the camera chip. | ▶ 00:42 |
There is some very basic math that governs the geometry of a pinhole camera. | ▶ 00:48 |
If we call X the physical height of the object and small x the height of the projection, | ▶ 00:52 |
which I'll call -x because it points in the opposite direction as the original object, | ▶ 01:01 |
then we can also talk about other values | ▶ 01:06 |
such as the distance of the object to the camera plane | ▶ 01:09 |
and f, which is the focal distance of the camera, | ▶ 01:14 |
which is the distance between the pinhole and the projection plane over here. | ▶ 01:19 |
There's a simple piece of math that relates all of those 4 variables over here, | ▶ 01:25 |
and it's easily obtained by what's called similar triangles. | ▶ 01:31 |
In particular, it turns out if I map this triangle over here to right over here-- | ▶ 01:34 |
so these are the same triangles, just flipped, where x is over here and f is over here-- | ▶ 01:41 |
we get that the ratio of upper caps X to Z is the same as lower caps x to f. | ▶ 01:50 |
So I write this as follows. | ▶ 01:58 |
This is a result of similar triangles. | ▶ 02:00 |
So as you take a triangle of a certain shape, | ▶ 02:02 |
when you scale it up to larger triangles, those proportions are retained, | ▶ 02:05 |
so therefore, upper caps X divided by Z is the same as lower caps x divided by f. | ▶ 02:10 |
If we now transform this, I find that the projection of lower caps x, | ▶ 02:16 |
which I might care about, is upper caps X, the physical size of the object itself, | ▶ 02:20 |
times the quotient of the focal length over the distance. | ▶ 02:27 |
That's an interesting equation. | ▶ 02:33 |
The further an object is away, the smaller it appears. | ▶ 02:36 |
The larger the focal length of the camera, the larger the object in its projection. | ▶ 02:40 |
And of course the size of the object itself directly influences | ▶ 02:46 |
how big its image of the object really is. | ▶ 02:50 |
So let's see if you can practice that equation using a quiz. | ▶ 02:53 |
[Thrun] So here is our equation again. | ▶ 00:00 |
Let's say James is 2 meters tall. | ▶ 00:02 |
He is 10 meters away from the camera, and the focal length is 10mm. | ▶ 00:06 |
How large will be James's projection using a pinhole camera | ▶ 00:12 |
on the camera chip with a focal length of 10mm? | ▶ 00:16 |
Please specify your answers in millimeters as a unit. | ▶ 00:21 |
[Thrun] And the answer is 2mm. | ▶ 00:00 |
James, even though he is 2 meters tall, will look like 2mm tall in the camera. | ▶ 00:03 |
The picture will be there's a 2 meter tall person over here who is 10 meters away, | ▶ 00:09 |
there's a pinhole, and the focal plane is only 10mm away from the pinhole. | ▶ 00:14 |
So this projection will be really, really small. | ▶ 00:20 |
Let's do this in math. | ▶ 00:22 |
The upper caps X is 2 meters, the 10 meters over here is the distance, Z, | ▶ 00:24 |
and the focal length, 10mm, is the thing over here, | ▶ 00:30 |
so 2 meters for X divided by 10 meters for Z times 10mm | ▶ 00:34 |
becomes 0.2 times 10mm, which is 2mm. | ▶ 00:42 |
[Thrun] I have another quiz. | ▶ 00:00 |
We are looking at a building that is 10 meters tall, | ▶ 00:03 |
and our camera is 100 meters away. | ▶ 00:08 |
We also know that the projection of the building on the internal chip is 4mm in size. | ▶ 00:11 |
I want to know what is the focal length in millimeters. | ▶ 00:17 |
[Thrun] And the answer is 40mm. | ▶ 00:00 |
With f being the unknown, we can transform this equation as follows, | ▶ 00:03 |
and now we can plug in the things we know. | ▶ 00:08 |
The 10 meters tall is the upper caps X, | ▶ 00:11 |
the distance of 100 meters is the Z, | ▶ 00:14 |
and the 4mm projection goes over here. | ▶ 00:17 |
And if you work this all out, it's 40mm. | ▶ 00:20 |
[Thrun] So in this final quiz we're going to see we can use a camera as a range sensor | ▶ 00:00 |
or as a distance measuring device, | ▶ 00:05 |
provided that we know the size of the object we are looking at. | ▶ 00:08 |
Suppose you're looking at a car, and we happen to know that this car is 160cm tall. | ▶ 00:13 |
Imagine we take a picture of this car using a pinhole camera | ▶ 00:21 |
with a focal length of 40mm, | ▶ 00:25 |
and in our projection the car is 2mm tall. | ▶ 00:29 |
My question is what is the range? How far is this car away? | ▶ 00:35 |
Please answer in centimeters. | ▶ 00:41 |
[Thrun] And the answer is 3200cm or 32 meters. | ▶ 00:00 |
To see why, we transform this equation over here so that the range, Z, is on the left side. | ▶ 00:07 |
With this new equation we can just plug in the known quantities. | ▶ 00:15 |
F is 40mm, x is 2mm, and upper caps X is 160cm. | ▶ 00:19 |
We work this out as 160cm times 20, which gives us 3200cm. | ▶ 00:26 |
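All three quizzes are rearrangements of the same relation x/f = X/Z; as a small sketch (units converted to millimeters, function names just for illustration):

def projection_size(X, Z, f):  return X * f / Z
def focal_length(x, X, Z):     return x * Z / X
def distance(x, X, f):         return X * f / x

print(projection_size(X=2000, Z=10_000, f=10))   # James: 2 mm
print(focal_length(x=4, X=10_000, Z=100_000))    # building: 40 mm
print(distance(x=2, X=1600, f=40))               # car: 32000 mm = 3200 cm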
[Thrun] So we just learned something really important, | ▶ 00:00 |
which is the central law of perspective projection, | ▶ 00:03 |
which basically says that in a pinhole camera, or in fact any camera, | ▶ 00:07 |
the projective size of any object scales with distance. | ▶ 00:13 |
So you have an object that's yea tall over here | ▶ 00:17 |
that looks just about the same as an object yea tall over here. | ▶ 00:21 |
In math x is proportional to the size of the object | ▶ 00:27 |
but inverse proportional to the distance to the object, | ▶ 00:31 |
and the only constant that then governs that relationship is the focal length, f. | ▶ 00:35 |
So if we take an object and move it further away, | ▶ 00:40 |
it'll appear smaller, and we all know this. | ▶ 00:44 |
Just look at this object over here, how large it is. | ▶ 00:48 |
And as I move it away from the camera, it becomes smaller. | ▶ 00:51 |
Large...and small. | ▶ 00:55 |
And that's a function of distance. | ▶ 00:59 |
The law that governs the size change of this object in appearance | ▶ 01:01 |
relative to the camera image is the perspective law we just saw. | ▶ 01:06 |
Actual camera images have 2 dimensions--not just 1. | ▶ 00:00 |
Here's an X and a Y, | ▶ 00:05 |
and the perspective laws apply to both dimensions. | ▶ 00:08 |
The projection of the X-coordinate into the camera plane | ▶ 00:11 |
is governed by the perspective law over here. | ▶ 00:17 |
And the same is true for Y. | ▶ 00:20 |
In both cases, the appearance of an object, of size X and Y, | ▶ 00:23 |
is scaled inversely with the distance of the object to the camera plane. | ▶ 00:28 |
One of the interesting consequences of perspective projection is | ▶ 00:34 |
that parallel lines in the world seem to result in vanishing points over here | ▶ 00:38 |
so that these lines are parallel in the physical world. | ▶ 00:45 |
Because things shrink in inverse proportion to the distance, far away | ▶ 00:49 |
the perceived distance between the lines is much smaller, | ▶ 00:57 |
resulting all the way in a vanishing point that sits somewhere over here. | ▶ 00:59 |
So sometimes there's more than 1 vanishing point. | ▶ 01:04 |
In this specific instance, there's a vanishing point over here, | ▶ 01:06 |
and a vanishing point over here--and those correspond to parallel lines, | ▶ 01:10 |
like the curb over here or the house facades | ▶ 01:13 |
that all shrink, with distance, in their visual appearance. | ▶ 01:16 |
So here's my quiz on vanishing points. | ▶ 00:00 |
The question is: How many vanishing points may exist in a single image? | ▶ 00:03 |
Exactly 1 answer is correct here. | ▶ 00:10 |
We already encountered it in an example with 2. | ▶ 00:12 |
So perhaps there's 3, 4, 6-- | ▶ 00:14 |
or even infinitely many. | ▶ 00:17 |
So what's the maximum number of vanishing points you may possibly encounter in an image? | ▶ 00:20 |
And the answer is infinitely many. | ▶ 00:00 |
If you thought it's 3, then yes--cubes might have 3 vanishing points, | ▶ 00:03 |
like this one over here, which is a cube under perspective projection. | ▶ 00:10 |
It actually has 3 vanishing points: | ▶ 00:14 |
No. 1, No. 2, and No. 3 over here. | ▶ 00:16 |
But that's because the cube has 3 faces. | ▶ 00:21 |
You could have different faces whose Z-distance to the camera varies. | ▶ 00:25 |
And those enclosing lines that might be parallel in the physical space | ▶ 00:29 |
would result in their own vanishing point. | ▶ 00:33 |
So you can theoretically make an object with infinitely many vanishing points. | ▶ 00:36 |
Let me comment on the idea of a lens. | ▶ 00:00 |
A fundamental limitation of a pinhole camera | ▶ 00:03 |
is that only very few rays of light hit the plane of the imager. | ▶ 00:06 |
So suppose we have an object over here | ▶ 00:13 |
and the object emits light in all directions. | ▶ 00:16 |
Then most beams get absorbed by the area outside the pinhole | ▶ 00:19 |
and a very small number of beams make it through the pinhole. | ▶ 00:23 |
Now this is unfortunate because the total amount of light that hits the camera chip | ▶ 00:26 |
is small, so a pinhole camera is really only suitable for very, very bright scenes. | ▶ 00:31 |
And further, as you make this gap smaller and smaller to increase your focus | ▶ 00:36 |
on the image plane, you will eventually run into what's called "light diffraction," | ▶ 00:43 |
which puts a limit on how small you can make this pinhole over here. | ▶ 00:49 |
Now if you use a lens, then all rays will make it to the same point in the image plane. | ▶ 00:54 |
So an example--a ray over here gets projected like this, | ▶ 01:00 |
and a ray over here might make it like this. | ▶ 01:04 |
So any ray in a good lens will eventually meet at the same point over here. | ▶ 01:07 |
The lens collects all the light that hits it, and projects it back to 1 point. | ▶ 01:14 |
Now this specific situation is characterized by only a small plane over here, | ▶ 01:22 |
for which everything is in complete focus. | ▶ 01:27 |
If you move your object back to over here, | ▶ 01:30 |
then what you find is the resulting projections don't match up. | ▶ 01:33 |
Therefore, when you have a camera with a large lens or a large aperture, | ▶ 01:38 |
you have to focus the camera to make sure that the distance between the image plane, | ▶ 01:42 |
the lens itself, and the observed object are in tune. | ▶ 01:47 |
There is an equation that governs all of this, | ▶ 01:51 |
and it looks about as follows: | ▶ 01:54 |
1 over the focal length, f, of the lens | ▶ 01:56 |
equals the sum of 1 over the extrinsic distance, capital Z, | ▶ 01:59 |
plus 1 over the intrinsic distance, lowercase z. | ▶ 02:04 |
I won't derive this equation, | ▶ 02:07 |
but this is the fundamental equation that governs when things are in focus, for a lens. | ▶ 02:10 |
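Written out (a reconstruction of the equation sketched above, with capital Z for the object distance and lowercase z for the image distance):

\frac{1}{f} = \frac{1}{Z} + \frac{1}{z}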
So this is great--we now learned a lot about cameras and images. | ▶ 00:00 |
We learned the Law of Perspective Projection | ▶ 00:04 |
and we also know when things are in focus. | ▶ 00:07 |
Now the first law is really important. | ▶ 00:10 |
and the second law doesn't really matter that much-- | ▶ 00:12 |
but I put it in so you understand what the implications of using a lens really is. | ▶ 00:15 |
Let's now talk about what we do in computer vision | ▶ 00:21 |
and what types of things we can do with images. | ▶ 00:24 |
One of the primary purposes is to extract information from these images, | ▶ 00:26 |
such as classify objects. | ▶ 00:30 |
So here is one of my son's favorite objects. | ▶ 00:32 |
And he might be interested in understanding that this is a car. | ▶ 00:35 |
A second purpose is 3D reconstruction. | ▶ 00:40 |
So you might care to take many images of this object, from different perspectives | ▶ 00:44 |
or with multiple cameras, like a Stereo Camera Rig | ▶ 00:47 |
and ask yourself what is the 3D model that we can reconstruct from these 2D projections. | ▶ 00:51 |
Or you might care about motion analysis. | ▶ 00:58 |
This is a common problem in Computer Vision | ▶ 01:01 |
where things might move and are seen in the video of many images | ▶ 01:05 |
and you might, for example, care how do things move, over time. | ▶ 01:09 |
So I'd like to quiz you a number of times on something that's really essential to | ▶ 00:00 |
the problem of Object Recognition. | ▶ 00:04 |
In Object Recognition, you're given an image of an object | ▶ 00:07 |
and you care to understand what the nature of the object really is-- | ▶ 00:10 |
like, for example, you might look at the image of a plane and say it's a plane. | ▶ 00:15 |
You might look at the image of a person and say it's a person. | ▶ 00:19 |
A key concept in Object Recognition is called invariance, | ▶ 00:22 |
which means there are natural variations of the image | ▶ 00:24 |
that don't affect the nature of the object itself | ▶ 00:27 |
and you wish to be invariant in your software | ▶ 00:30 |
to those natural variations. | ▶ 00:34 |
So what I will do is I'm going to run a couple of | ▶ 00:37 |
invariances by you, and I'd like you to tell me | ▶ 00:39 |
which invariance I'm referring to. | ▶ 00:42 |
And here are the possible invariances: | ▶ 00:45 |
Scale, Illumination, Rotation, Deformation, Occlusion, and View Point. | ▶ 00:47 |
I understand--none of these words you've seen before, | ▶ 00:54 |
so I am really appealing to your intuition here | ▶ 00:57 |
and your sense of the English language. | ▶ 00:59 |
So here is the object, | ▶ 01:02 |
and we wish to recognize this object. | ▶ 01:04 |
And, as I'm illustrating, the object might vary in some important dimension. | ▶ 01:07 |
And I wonder what type of invariance you or I, with it, must possess | ▶ 01:14 |
to be able to recognize this object. | ▶ 01:19 |
So here is the list. | ▶ 01:21 |
And the answer here is Rotation. | ▶ 00:00 |
I rotated the object. | ▶ 00:02 |
So you wish to make sure that any recognition item | ▶ 00:04 |
is invariant to a rotation. | ▶ 00:07 |
Sometimes the object gets closer to the camera | ▶ 00:00 |
and increases in size; | ▶ 00:04 |
and gets further away from the camera and, therefore, becomes smaller. | ▶ 00:06 |
Just watch how the object becomes larger, | ▶ 00:09 |
and smaller. | ▶ 00:12 |
What do you think? What type of invariance is this? | ▶ 00:15 |
And the answer is this is Scale Invariance. | ▶ 00:00 |
Scale means how large the image is, relative to the camera. | ▶ 00:04 |
This is governed by the Perspective Projection Law that we discussed before. | ▶ 00:08 |
The further objects are away, the smaller they appear. | ▶ 00:12 |
We wish any classifier to be invariant to scale | ▶ 00:16 |
so it can recognize objects nearby or really far away. | ▶ 00:19 |
This object over here is interesting because we can change its shape. | ▶ 00:00 |
So you can take this thing over here | ▶ 00:04 |
and move it around, and what you actually see of the object | ▶ 00:06 |
depends on the angle of the rotors. | ▶ 00:11 |
So what do you think? What type of invariance is this? | ▶ 00:14 |
I would call this Deformation Invariance | ▶ 00:00 |
because the object is actually deformable | ▶ 00:03 |
as are many objects that surround us, | ▶ 00:06 |
like clothes and dollar bills and water glasses. | ▶ 00:08 |
This kind of deformation is really important in the recognition of objects. | ▶ 00:13 |
You wish to make sure that a helicopter can be recognized, | ▶ 00:17 |
no matter what angle its rotor blades currently have. | ▶ 00:20 |
So here is one of my favorites. | ▶ 00:00 |
I'm holding in my hand a flashlight, | ▶ 00:02 |
and as I move the flashlight around, | ▶ 00:05 |
you can see that the appearance of the object changes a lot, | ▶ 00:07 |
based on where I hold my flashlight. | ▶ 00:11 |
So my question now is what type of invariance does this realize? | ▶ 00:14 |
And this is clearly an example of Illumination Invariance. | ▶ 00:00 |
Depending how the object's illuminated, | ▶ 00:04 |
it might appear very differently, even though its position to the camera might be identical. | ▶ 00:07 |
Sometimes objects are behind other objects. | ▶ 00:00 |
For example, I can partially cover up this object. | ▶ 00:03 |
You can probably still recognize it. | ▶ 00:06 |
Or I might move a pen in front of it, as shown over here. | ▶ 00:09 |
Now we've almost no choices left. | ▶ 00:12 |
Tell me what type of invariance this was. | ▶ 00:14 |
This is called Occlusion Invariance. | ▶ 00:00 |
Sometimes objects are partially occluded, | ▶ 00:03 |
yet you would wish to be able to recognize them even with a partial occlusion. | ▶ 00:06 |
And because there's only 1 invariance left, | ▶ 00:00 |
let me talk about View Point Invariance | ▶ 00:02 |
as the final invariance. | ▶ 00:04 |
So the appearance of this object depends on | ▶ 00:06 |
from what direction you look, what your view point is. | ▶ 00:09 |
And you can see it's very different, from different view points. | ▶ 00:12 |
So this looks fundamentally different from this, | ▶ 00:15 |
from this. | ▶ 00:19 |
That's called View Point or Vantage Point Invariance, | ▶ 00:21 |
and it's one of the hardest invariances | ▶ 00:24 |
because the appearance of the object | ▶ 00:26 |
might really alter a lot, from different vantage points. | ▶ 00:28 |
[Thrun] The reason why I went through these different invariances | ▶ 00:00 |
is because they are really crucial to computer vision. | ▶ 00:02 |
These and a number of other invariances really matter. | ▶ 00:06 |
When you want to recognize objects, you want to write software | ▶ 00:10 |
that is invariant to scale and illumination and so on | ▶ 00:15 |
and that retains the important information in the image | ▶ 00:19 |
regardless of the present rotation and occlusion and deformation. | ▶ 00:22 |
If you succeed in eliminating the effects of those changes | ▶ 00:27 |
and build a truly invariant computer vision algorithm, | ▶ 00:33 |
I will be very impressed with you. | ▶ 00:36 |
You will have solved a major computer vision problem. | ▶ 00:38 |
[Thrun] So we learned a lot about invariances. | ▶ 00:00 |
Let's now take actual images and do something with these images. | ▶ 00:03 |
This is an image I took a while back in Amsterdam, | ▶ 00:07 |
and it's interesting because there's a lot of interesting features | ▶ 00:10 |
like these line features over here and possible corner features over there. | ▶ 00:13 |
In computer vision we don't use color images very much. | ▶ 00:18 |
We mostly use black and white images like this greyscale image over here, | ▶ 00:22 |
which misses information from the original image, | ▶ 00:27 |
but it turns out that greyscale is more robust to lighting variations than color is. | ▶ 00:30 |
That's a fairly common representation for images for computer vision. | ▶ 00:35 |
So a greyscale image is a matrix typically of several hundred rows | ▶ 00:40 |
and several hundred columns | ▶ 00:46 |
whose entries are numbers that correspond to the greyscale value of each pixel. | ▶ 00:49 |
These values scale between 0 and 255, where 255 is white and 0 is black. | ▶ 00:55 |
You can see how this matrix is full of values that together compose the image. | ▶ 01:03 |
Here is a very small image of size 4 by 5, | ▶ 01:07 |
and based on the numbers I put in, it feels like there's a transition going on. | ▶ 01:12 |
At the top the image is relatively bright. | ▶ 01:18 |
These have values close to 255. | ▶ 01:21 |
And at the bottom it is relatively dark, with values close to 0. | ▶ 01:23 |
This is way too small an image to recognize anything. | ▶ 01:25 |
Picture a matrix much, much larger than this, | ▶ 01:28 |
yet an image that's still just a 2-dimensional matrix of singular brightness values. | ▶ 01:31 |
So at its core, a greyscale image is just a matrix like this. | ▶ 01:37 |
A color image would have 3 different values per pixel | ▶ 01:41 |
which correspond to red, green, and blue, or some other encoding of the color itself. | ▶ 01:45 |
But for now we're going to be content with greyscale images. | ▶ 01:50 |
[Thrun] One of the most basic things we can do with computer vision | ▶ 00:00 |
is to extract features. | ▶ 00:03 |
For example, there is a very strong edge feature over here | ▶ 00:05 |
and a strong corner feature right over here and right over here. | ▶ 00:09 |
Let me tell you how to do this. | ▶ 00:14 |
How can you find in an image like this whether there is an edge, | ▶ 00:16 |
or in an image like this where there is an edge from a bright area on the left | ▶ 00:20 |
to a dark region on the right? | ▶ 00:26 |
Let us write a feature extractor that identifies transitions of this type, | ▶ 00:28 |
and let's start with horizontal transitions. | ▶ 00:33 |
The most obvious feature detector looks like this. | ▶ 00:36 |
We run this little 2-value matrix across the entire image over here, | ▶ 00:39 |
and we add whatever is on the left side and subtract whatever is on the right side. | ▶ 00:45 |
So if both sides are approximately in balance, like these points over here, | ▶ 00:51 |
adding and subtracting here gives approximately 0. | ▶ 00:56 |
But if the left side is significantly larger than the right side, | ▶ 00:59 |
then adding and subtracting yields a very large value, like 212 - 7 over here. | ▶ 01:02 |
So this specific mask gives us edges that run from bright to dark. | ▶ 01:08 |
So here I'm taking the first value and subtracting the second value from it. | ▶ 01:16 |
255 - 212 gives me 43. That's applying this mask over here. | ▶ 01:20 |
From 211 to 237 is -26 and so on. | ▶ 01:26 |
212 - 7 is 205. | ▶ 01:31 |
237 - 3 is 234 and so on. | ▶ 01:35 |
7 - 1 is 6. | ▶ 01:40 |
3 - 9 is -6 and so on. | ▶ 01:43 |
If you look at this result of applying the mask over here, | ▶ 01:46 |
you'll find that this column stands out. | ▶ 01:49 |
It is much, much larger in value than any of the adjacent columns, | ▶ 01:52 |
and that indicates that we have a high likelihood of a horizontal edge feature occurring | ▶ 01:57 |
at the ridge between this column and this column over here. | ▶ 02:03 |
So here we are applying that same trick to the original image, and this is the result. | ▶ 02:07 |
You can see that areas where the original image has a strong transition | ▶ 02:12 |
you get a strong response over here. | ▶ 02:17 |
This is actually showing the absolute value of the difference | ▶ 02:19 |
where we get rid of the minus sign, so you can see any transition from bright to dark | ▶ 02:22 |
or dark to bright horizontally shows up. | ▶ 02:26 |
Now, you can see these lines over here that are vertical show up very strongly. | ▶ 02:29 |
The lines over here don't, and the reason is the way we defined our kernel, | ▶ 02:34 |
it ran actually horizontal, so it finds horizontal edges and not vertical edges. | ▶ 02:39 |
Vertical edges require a different kernel, so let me get to this in a second. | ▶ 02:43 |
[Thrun] So in this quiz I've given you a very small image of 3 by 3, | ▶ 00:00 |
and I'd like you to apply a kernel that's about like the previous one, | ▶ 00:06 |
except I flipped the left and right side over here. | ▶ 00:09 |
And just apply this kernel to this image over here. | ▶ 00:12 |
We're going to receive a 3 by 2 image in this case. | ▶ 00:16 |
So please fill in all these 6 values over here. | ▶ 00:20 |
[Thrun] And here are the results. | ▶ 00:00 |
7 - 255 is -248. | ▶ 00:03 |
3 - 7 is -4. | ▶ 00:06 |
4 - 240 is -236. | ▶ 00:09 |
240 - 212 is 28. | ▶ 00:12 |
230 - 216 is 14. | ▶ 00:15 |
And 216 - 218 is -2. | ▶ 00:18 |
So this would be the image under that specific mask or kernel over here. | ▶ 00:22 |
Now, we already learned something really interesting, | ▶ 00:00 |
which is the special case of a linear filter. | ▶ 00:03 |
We took an image, and we applied a small kernel. | ▶ 00:05 |
The application of a kernel is often denoted | ▶ 00:10 |
with a special symbol over here. | ▶ 00:12 |
And we received a new image | ▶ 00:14 |
that was slightly smaller, and we don't really worry | ▶ 00:17 |
about the fact that it's smaller. | ▶ 00:19 |
There's ways to keep it the same size | ▶ 00:21 |
by assuming everything around the original image is zero. | ▶ 00:23 |
But we did receive a new image that was part of the kernel over here. | ▶ 00:26 |
And the general math of the new image, | ▶ 00:30 |
for any pixel at coordinates x and y, is obtained by summing | ▶ 00:33 |
over all entries of the kernel, indexed by u and v, | ▶ 00:37 |
the original image shifted by u and v | ▶ 00:40 |
times the kernel itself. | ▶ 00:43 |
Now, this will take some time to digest, | ▶ 00:45 |
but what it really does is it does exactly what we did before. | ▶ 00:47 |
We take our kernel, which in this case might be a 2 x 1 kernel. | ▶ 00:51 |
We go over both of these fields | ▶ 00:55 |
or any number of fields that exists over here. | ▶ 00:57 |
We look at the corresponding image field and shift it a little bit. | ▶ 00:59 |
We did this before. We shifted it by 0, 1 pixels. | ▶ 01:03 |
We multiply these 2 things. | ▶ 01:07 |
There was a +1 here and a -1 here before. | ▶ 01:09 |
And we add all these things up | ▶ 01:11 |
to arrive at the resulting image. | ▶ 01:13 |
Think for a moment to realize that this function over here | ▶ 01:15 |
implements what we just did. | ▶ 01:19 |
It's a nice and elegant function. | ▶ 01:21 |
It's called a linear filter. | ▶ 01:24 |
And the reason is the math inside this sum is linear. | ▶ 01:26 |
It's a multiplication. | ▶ 01:30 |
And so is the sum, and the convolution operation itself | ▶ 01:32 |
is often called a linear operation. | ▶ 01:35 |
So, let me ask you another quiz. | ▶ 00:00 |
What type of filter do we need | ▶ 00:03 |
to find horizontal edges? | ▶ 00:05 |
And here are the choices. | ▶ 00:08 |
We have a filter like this, 1 and 1, | ▶ 00:10 |
horizontal filter 1, -1, | ▶ 00:13 |
and a vertical filter, 1, 1 and 1, -1. | ▶ 00:15 |
One of those is actually correct, | ▶ 00:18 |
so pick the one that is best suited to find horizontal edges. | ▶ 00:20 |
And the answer is this one over here. | ▶ 00:00 |
It takes a pixel and subtracts | ▶ 00:03 |
the vertically next pixel from it. | ▶ 00:06 |
And if there's a horizontal edge, | ▶ 00:09 |
if we have an image where the values over here are large, | ▶ 00:11 |
the values over here are small, | ▶ 00:16 |
then the specific filter over here, | ▶ 00:19 |
when applied to the transition between | ▶ 00:21 |
the large and small values will give you a large response. | ▶ 00:23 |
Here's another quiz. | ▶ 00:00 |
Given an image like this one over here | ▶ 00:02 |
with pixel values 12, 18 and 6, | ▶ 00:06 |
2, 1, 7, 100, 140, 130, | ▶ 00:08 |
convolve this image with the vertical -1, +1 filter | ▶ 00:12 |
to arrive at the 6 missing values on the right side. | ▶ 00:17 |
And now the answers will be rather straightforward. | ▶ 00:00 |
100 - 2 is 98, 2 - 12 is -10, | ▶ 00:03 |
140 - 1 is 139, and so on and so on. | ▶ 00:06 |
This is the convolved image, | ▶ 00:10 |
and you can see a relatively large response | ▶ 00:12 |
corresponding to the transition from larger values over here | ▶ 00:15 |
to smaller values over here, so this filter is well suited | ▶ 00:18 |
to find horizontal edges. | ▶ 00:22 |
And you can really see this in the results. | ▶ 00:00 |
So, this is the vertical mask applied to finding horizontal edges. | ▶ 00:02 |
These edges stand out really strongly. | ▶ 00:07 |
This one is nearly invisible. | ▶ 00:09 |
Compare this to the filter of vertical edges | ▶ 00:11 |
where these things now light up, | ▶ 00:15 |
but these have gone missing, and here's the original image again | ▶ 00:17 |
where you'll see all the different edges. | ▶ 00:20 |
Again, the horizontal filter finds the vertical edges, | ▶ 00:22 |
and the vertical filter finds horizontal edges. | ▶ 00:26 |
Now, what I've just shown you is called a gradient image. | ▶ 00:00 |
The gradient image in the horizontal direction | ▶ 00:04 |
is the image convolved with this kernel over here. | ▶ 00:07 |
And the gradient image in the vertical direction | ▶ 00:10 |
is the original image convolved with this kernel over here. | ▶ 00:12 |
This notation should now make sense | ▶ 00:16 |
since we practiced it a number of times. | ▶ 00:19 |
This is called, again, the convolution of the image. | ▶ 00:22 |
Now, if we wish to find edges in any direction, | ▶ 00:25 |
a really easy way to do this is to combine both of these gradient images | ▶ 00:29 |
into a single edge image, and here's how it goes. | ▶ 00:34 |
We take our gradient image in direction x, and we square it. | ▶ 00:38 |
The same with y, | ▶ 00:42 |
and we take the square root. | ▶ 00:44 |
And this response over here tells us | ▶ 00:46 |
in any of the 2 directions how strong the gradient response is. | ▶ 00:49 |
Here is just that gradient image. | ▶ 00:54 |
Compare this to the original image, | ▶ 00:57 |
and you can see that wherever there is a strong transition | ▶ 01:00 |
between a bright and dark color, | ▶ 01:03 |
the gradient magnitude image I just calculated has an edge. | ▶ 01:06 |
It has an edge vertically and an edge horizontally. | ▶ 01:10 |
And again, it's made of these 2 components, | ▶ 01:14 |
vertical edges and horizontal edges. | ▶ 01:17 |
By combining both of them, we get a gradient magnitude image, | ▶ 01:20 |
and we have our very first feature detector, | ▶ 01:23 |
which is a feature detector of any edge in the image. | ▶ 01:26 |
Now, state-of-the-art edge detection is a little bit more advanced | ▶ 00:00 |
than this one over here. | ▶ 00:03 |
This is called a Canny edge detector. | ▶ 00:05 |
You see much more crisp edges over here. | ▶ 00:09 |
What this does, in addition to the gradient magnitude, | ▶ 00:11 |
it traces areas and finds local maxima. | ▶ 00:15 |
And it tries to connect them in a way that there's always just the single edge. | ▶ 00:19 |
When multiple edges meet, the Canny edge detector has a hole, | ▶ 00:23 |
like the area over here or the area over here. | ▶ 00:27 |
But when edges are single edges, | ▶ 00:30 |
the Canny edge detector traces them very, very nicely. | ▶ 00:33 |
This is named after John Canny, a professor at UC Berkeley. | ▶ 00:36 |
And he did one of the most impressive pieces of work | ▶ 00:39 |
on early edge detection. | ▶ 00:43 |
There are a few other common masks I'm going to share. | ▶ 00:00 |
One is the Sobel mask, which is just like the edge detector I showed you, | ▶ 00:05 |
a little bit larger, and you can see it goes from left to right. | ▶ 00:09 |
There's about 8 of them. | ▶ 00:13 |
2 are shown over here, including some diagonal ones. | ▶ 00:15 |
There's something called the Prewitt mask, which is like the Sobel | ▶ 00:18 |
but doesn't emphasize the center line. | ▶ 00:22 |
And the Kirsch mask, like the one over here. | ▶ 00:24 |
In fact, you can invent your own kernel. | ▶ 00:27 |
If you come up with a kernel that finds certain features, | ▶ 00:29 |
name it after yourself, and who knows? | ▶ 00:32 |
Maybe you'll get remembered like Mr. Sobel did or Mr. Prewitt. | ▶ 00:35 |
So, here's a mask that's a special case of a Prewitt mask, | ▶ 00:00 |
and I'd like to ask you a quiz. | ▶ 00:04 |
Will this find horizontal edges, vertical edges, | ▶ 00:07 |
corners, or none or all of the above? | ▶ 00:10 |
Please check exactly 1 of those 3 buttons. | ▶ 00:13 |
And the answer is horizontal edges | ▶ 00:00 |
because it shifts negative mass on the left side, | ▶ 00:03 |
positive mass on the right side. | ▶ 00:06 |
That gives us a horizontal edge. | ▶ 00:08 |
It doesn't find any of the other ones. | ▶ 00:10 |
Now, linear filters can also be applied | ▶ 00:00 |
to very different matrices. | ▶ 00:03 |
This is what's called a Gaussian kernel. | ▶ 00:05 |
You can see it over here. | ▶ 00:08 |
It's a matrix whose value is maximum | ▶ 00:10 |
at the center of the matrix and whose value falls | ▶ 00:14 |
exponentially to the side of this matrix. | ▶ 00:17 |
It's a Gaussian in 2D, as you can see over here. | ▶ 00:21 |
So, what happens if we convolve an image | ▶ 00:25 |
with a Gaussian kernel? | ▶ 00:28 |
Let me ask your intuition on the following quiz, | ▶ 00:30 |
and it's completely okay if you get this wrong. | ▶ 00:33 |
If you convolve an image with a Gaussian kernel, what do we get? | ▶ 00:36 |
An edge detector, a corner detector, | ▶ 00:39 |
a blurred image, or none of the above? | ▶ 00:43 |
And the answer is a blurred image. | ▶ 00:00 |
A Gaussian kernel gives us a blurred image. | ▶ 00:02 |
Let me demonstrate this to you. | ▶ 00:05 |
Here is the original image, | ▶ 00:07 |
and this is the result of convolving with a Gaussian. | ▶ 00:10 |
You can see that the features are much blurred, | ▶ 00:13 |
and the reason is each of these pixels | ▶ 00:16 |
is a weighted sum of its neighboring pixels. | ▶ 00:18 |
And the larger the neighborhood, the more the blurring effect. | ▶ 00:22 |
You can see clearly the difference in sharpness between these 2 images. | ▶ 00:26 |
So, why on earth would we ever want to blur an image? | ▶ 00:00 |
There are generally 2 reasons why you might want to do this. | ▶ 00:04 |
One is for down-sampling. | ▶ 00:06 |
If you have an image of super high resolution, | ▶ 00:08 |
maybe 5,000 x 5,000 pixels, | ▶ 00:11 |
and you'd like to go to a web image of much smaller resolution, | ▶ 00:14 |
it's better to blur with a Gaussian before down-sampling | ▶ 00:17 |
than to just pick each nth pixel. | ▶ 00:21 |
And the reason is called aliasing. | ▶ 00:24 |
If you pick each nth pixel without blurring, | ▶ 00:27 |
you sometimes get very, very funny effects | ▶ 00:30 |
because each nth pixel might by chance | ▶ 00:33 |
correspond to something that's somewhat irregular. | ▶ 00:36 |
For example, if you have a checkerboard and you pick each nth pixel, | ▶ 00:39 |
you might only end up with black pixels. | ▶ 00:43 |
The second reason is called noise reduction. | ▶ 00:45 |
In noise reduction, you suppress pixel noise | ▶ 00:47 |
that might otherwise make it hard to compute things like image gradients. | ▶ 00:51 |
If you blur the image first, | ▶ 00:55 |
you get a smoother result that isn't quite as pronounced | ▶ 00:57 |
but has much less noise in the image. | ▶ 01:02 |
Here's the original gradient magnitude image to find edges, | ▶ 01:05 |
and here's the same applied to the blurred image. | ▶ 01:08 |
And you can see the original one is much crisper, | ▶ 01:11 |
but also it's more subject to noise. | ▶ 01:16 |
Take the area over here, which has lots of image noise, | ▶ 01:19 |
and compare this to the area over here, which has many fewer edges. | ▶ 01:22 |
The same is true over here and over here. | ▶ 01:26 |
I wouldn't really claim this is a much better result. | ▶ 01:28 |
In fact, it looks kind of funny and very coarse, | ▶ 01:31 |
but it does have less noise. | ▶ 01:35 |
Just to complete the issue on blurring, | ▶ 01:37 |
what we just did is we took an image, | ▶ 01:40 |
we blurred it with a Gaussian kernel, | ▶ 01:42 |
and then we applied a gradient kernel. | ▶ 01:44 |
If you dive into the math of convolution, | ▶ 01:47 |
you'll find that convolution is associative, | ▶ 01:49 |
so you could apply this one to the image and then this one over here, | ▶ 01:52 |
or you can combine these 2 guys over here | ▶ 01:56 |
into a Gaussian gradient kernel | ▶ 01:59 |
and apply this Gaussian gradient kernel to the image. | ▶ 02:03 |
So, f convolved with g is this big | ▶ 02:07 |
maybe 9 x 9 Gaussian matrix convolved by a single | ▶ 02:11 |
+1, -1 kernel g. | ▶ 02:15 |
And here's what this Gaussian gradient kernel looks like. | ▶ 02:18 |
It's really interesting. | ▶ 02:22 |
It is the same gradient kernel we had before | ▶ 02:24 |
but smooth now and spread out | ▶ 02:26 |
by Gaussian. | ▶ 02:29 |
And it really responds to an area over here similar to a Sobel operator | ▶ 02:32 |
that might have a strong negative value. | ▶ 02:37 |
And the area over here on the right side | ▶ 02:40 |
has a strong positive value, | ▶ 02:42 |
so you can think of Sobel and many other kernels | ▶ 02:44 |
as a combination of smoothing and taking a gradient. | ▶ 02:47 |
I find this really interesting because | ▶ 02:52 |
we can now devise a single, linear kernel that does both smoothing | ▶ 02:54 |
and find gradients at the same time. | ▶ 02:58 |
Sometimes you wish to find corners, | ▶ 00:00 |
as in this checkerboard over here. | ▶ 00:03 |
Corners have an advantage over edges. | ▶ 00:05 |
Edges aren't localizable. | ▶ 00:08 |
They could be anywhere on an edge. | ▶ 00:10 |
But a corner like this or a corner like this | ▶ 00:12 |
can be localized, which is useful in computer vision. | ▶ 00:15 |
What you see here is a Harris corner detector | ▶ 00:18 |
applied to a checkerboard pattern. | ▶ 00:22 |
And you can see all the points that define the checkerboard | ▶ 00:26 |
are clearly found by a relatively simple algorithm | ▶ 00:29 |
which I'm just about to explain to you. | ▶ 00:33 |
The Harris corner detector is really a simple algorithm. | ▶ 00:36 |
Suppose you wished to find a corner just like this. | ▶ 00:41 |
Then in the small region over here where the corner resides, | ▶ 00:44 |
you will find a lot of horizontal gradients | ▶ 00:48 |
and a lot of vertical gradients. | ▶ 00:51 |
Now, what's our trick of finding gradients? | ▶ 00:53 |
Well, we know about horizontal gradients. | ▶ 00:55 |
We know about vertical gradients. | ▶ 00:58 |
If those summed up over a small window-- | ▶ 01:00 |
as shown right over here--are large, we have a corner. | ▶ 01:03 |
If only 1 of them is large and the other 1 is small, we likely have an edge. | ▶ 01:07 |
We already learned this before. | ▶ 01:11 |
It should be no surprise so far. | ▶ 01:13 |
Now, the Harris corner detector generalizes to images like the following. | ▶ 01:15 |
We might have a corner like this | ▶ 01:20 |
that is rotated from the original corner. | ▶ 01:22 |
In an image like this, the horizontal gradient | ▶ 01:25 |
isn't quite as pronounced as the vertical gradient. | ▶ 01:28 |
But if you were to rotate our coordinate system | ▶ 01:31 |
back into the correct orientation, | ▶ 01:34 |
we could reduce it back to the case over here. | ▶ 01:36 |
The trick that's being applied is to de-rotate | ▶ 01:38 |
this image over here using eigenvalue decomposition. | ▶ 01:43 |
We use a matrix that slightly generalizes these 2 things over here | ▶ 01:47 |
where again we add our small windows. | ▶ 01:50 |
We plug in this statistic over here up here, | ▶ 01:53 |
and this statistic over here down there. | ▶ 01:56 |
And here we have the mixed terms, where we sum over the product | ▶ 01:58 |
of Ix and Iy, in the off-diagonal entries. | ▶ 02:01 |
If we apply eigenvalue decomposition to this matrix over here, | ▶ 02:06 |
we get 2 eigenvalues. | ▶ 02:09 |
And if both eigenvalues are large, | ▶ 02:11 |
we again say we have a corner. | ▶ 02:13 |
So, applying this eigenvalue decomposition | ▶ 02:16 |
to every possible pixel in the original image | ▶ 02:19 |
and then taking the local maxima of that result | ▶ 02:21 |
where both eigenvalues are large gives us exactly | ▶ 02:24 |
the Harris corner detector in a very robust way | ▶ 02:27 |
to find corners in an image. | ▶ 02:30 |
This is exactly what's being done over here, | ▶ 02:33 |
and you can see it's very robust even to small rotations of the image, | ▶ 02:37 |
and of course, to a scale of the image. | ▶ 02:41 |
It's a beautiful way to find stable, | ▶ 02:43 |
localizable features in contrast-rich images. | ▶ 02:47 |
Now, modern feature detectors extend Harris corners | ▶ 00:00 |
into much more advanced features. | ▶ 00:05 |
They are usually localizable, like corners are. | ▶ 00:08 |
They also have unique signatures | ▶ 00:11 |
that summarize the identity of a feature | ▶ 00:13 |
that's typically invariant to lighting, orientation, | ▶ 00:16 |
translation and size variance, | ▶ 00:19 |
as you might find it in the image space. | ▶ 00:22 |
So, common methods that people use are called HOG, | ▶ 00:25 |
for histogram of oriented gradients, | ▶ 00:28 |
or SIFT, for scale invariant feature transform. | ▶ 00:31 |
All of these methods take corners | ▶ 00:35 |
and remove the various variances, like rotational variance, | ▶ 00:38 |
by extracting statistics that are invariant to things like | ▶ 00:43 |
rotation and scale and certain perspective transformation. | ▶ 00:46 |
I took the liberty to apply SIFT features | ▶ 00:51 |
to the bridge image, | ▶ 00:54 |
and what you find here is a myriad of features | ▶ 00:56 |
that are all very localizable. | ▶ 00:58 |
There's features over here, | ▶ 01:00 |
very large ones like the square over here, | ▶ 01:02 |
which is, I guess, very visible, another square over here, | ▶ 01:05 |
and very small, tiny features like the square over here and the square over here | ▶ 01:08 |
that all have a unique signature and can easily be matched across images. | ▶ 01:12 |
This is called a SIFT feature extractor, | ▶ 01:17 |
and it's one of the state-of-the-art methods that are very commonly used. | ▶ 01:20 |
So, if you wish to extract features from an image, | ▶ 01:24 |
I recommend checking out HOG or SIFT. | ▶ 01:27 |
You can download software from the web. | ▶ 01:30 |
They are somewhat involved, and you can learn about them | ▶ 01:32 |
in advanced computer vision classes. | ▶ 01:35 |
So, now you know some of the very basics of computer vision. | ▶ 00:00 |
We talked about images, how images are being formed. | ▶ 00:03 |
We talked about perspective projection as a mathematical tool | ▶ 00:06 |
for understanding how cameras perceive images. | ▶ 00:10 |
And we talked a whole bunch about features. | ▶ 00:14 |
We talked about invariances, the type of things that affect | ▶ 00:16 |
the appearance of a feature in the camera image, | ▶ 00:19 |
and we went through methods for extracting edges, | ▶ 00:23 |
for extracting corners, | ▶ 00:26 |
and for extracting fairly sophisticated features like SIFT features. | ▶ 00:28 |
This is one of the basic processing methods in computer vision. | ▶ 00:32 |
Almost everyone who does computer vision preprocesses images | ▶ 00:36 |
by feature extraction, and now you know | ▶ 00:39 |
quite a bit about how to process images in computer vision. | ▶ 00:42 |
This class is all about 3D vision. | ▶ 00:00 |
There's a scene somewhere here, and there's a camera over here, | ▶ 00:03 |
and the scene is projected into the camera plane. | ▶ 00:08 |
Obviously, the camera image is only 2D. The scene is 3D. | ▶ 00:11 |
3D vision attempts to recover the full 3D information. | ▶ 00:16 |
The most important missing thing is called the range, | ▶ 00:20 |
sometimes called depth or distance to the camera plane. | ▶ 00:23 |
Cameras are deficient in that they can only recover 2D information- | ▶ 00:27 |
a perspective projection of the scene. | ▶ 00:30 |
The question right now is going to be can we possibly recover the full 3D information | ▶ 00:33 |
about the scene outside the camera just from single or multiple camera images. | ▶ 00:38 |
This is our first quiz. | ▶ 00:00 |
Given a single image and given parameters of the camera like the focal length | ▶ 00:02 |
and all the other parameters, can we recover the depth or range of a scene? | ▶ 00:07 |
And I'll give you a couple of possible answers. | ▶ 00:12 |
Yes, always; sometimes; or never. | ▶ 00:15 |
We haven't even talked about this. This requires some thinking on your end. | ▶ 00:19 |
But give it your best try. | ▶ 00:23 |
The correct answer is sometimes. There are actually cases where we can do this. | ▶ 00:00 |
It's not always possible because, as we learned before, | ▶ 00:04 |
the camera doesn't really record the distance. | ▶ 00:08 |
If we look into an arbitrary scene, we normally can't recover the depth from a single image. | ▶ 00:11 |
But in certain cases it's possible. Here's a dollar bill. | ▶ 00:17 |
A dollar bill is a fixed size, and you can know the size. All dollar bills have the same size. | ▶ 00:20 |
And by measuring the size of the projection of this bill in the image, | ▶ 00:25 |
you can actually really recover how far it is away, | ▶ 00:32 |
if you know things like focal length and so on. | ▶ 00:35 |
The answer is yes in cases we know the size of the object | ▶ 00:38 |
and no in cases where you don't know the size. | ▶ 00:43 |
Therefore, sometimes was the correct answer here. | ▶ 00:46 |
One easy way to recover the depth with 3D vision is called Stereo. | ▶ 00:00 |
Humans use stereo all the time. | ▶ 00:07 |
We have two eyes--eye 1 and eye 2-- | ▶ 00:09 |
and these eyes have a so-called displacement, | ▶ 00:15 |
which just means that one eye is further left than the other eye. | ▶ 00:18 |
We're looking at the scene from slightly different angles. | ▶ 00:22 |
Humans can actually recover the depth of the scene | ▶ 00:25 |
in many situations where objects are nearby. | ▶ 00:29 |
Let's look at this in more detail. | ▶ 00:32 |
In stereo vision, we're given two cameras--usually both with identical focal length. | ▶ 00:34 |
Here are the pinholes, and here are the image planes. | ▶ 00:40 |
An object in the scene is being seen by both cameras. | ▶ 00:43 |
If I draw the optical axes over there, | ▶ 00:47 |
which are the axes orthogonal to the image planes that go through the pinholes, | ▶ 00:49 |
you will see that the projection of this point depends on the displacement, | ▶ 00:53 |
or the baseline, of the so-called stereo rig. | ▶ 00:58 |
Clearly, these two images see the point at a different angle, | ▶ 01:02 |
and it reflects itself by different coordinates | ▶ 01:05 |
where this point is being projected onto the image plane. | ▶ 01:10 |
The idea of stereo is to observe objects from both cameras and use the displacement, | ▶ 01:14 |
often called "parallax," of those two different projections to estimate | ▶ 01:20 |
the depth or the range of the object. | ▶ 01:26 |
Let me just ask a simple quiz about stereo. | ▶ 01:29 |
Given two identical cameras | ▶ 00:00 |
for which we know things like focal length and all the other intrinsic parameters, | ▶ 00:02 |
and we also know the baseline, | ▶ 00:07 |
can we now recover the depth of a scene? | ▶ 00:09 |
Here are the answers: yes, always, no matter what the scene is. | ▶ 00:12 |
The second is sometimes, and the third one is never. | ▶ 00:18 |
As before, the answer is sometimes, | ▶ 00:00 |
although more often than before with a single camera image, given that we have two now. | ▶ 00:02 |
To give an intuition of why it's not always possible, | ▶ 00:07 |
let's look at two images where the object of interest is a vertical object | ▶ 00:10 |
and another pair of two images where the object of interest is a horizontal feature, | ▶ 00:17 |
like this one over here. | ▶ 00:22 |
Now, in the vertical case, there would be displacement. | ▶ 00:24 |
This would be slightly further to the left than this guy over here, | ▶ 00:27 |
and we can use the displacement to recover depth in a way I'll tell you in a second. | ▶ 00:30 |
But for the horizontal, it's really hard. | ▶ 00:35 |
If this feature crosses all of the camera image, there is something called "aperture effect." | ▶ 00:37 |
What this really means is we can't really tell which of the little dots on this line | ▶ 00:43 |
correspond to which little dots on this line over here. | ▶ 00:48 |
In cases where the image lacks structure-- | ▶ 00:51 |
or, the worst case, two images of fog. | ▶ 00:54 |
In fog, there is certainly a depth. | ▶ 00:58 |
Each water particle has a certain range, | ▶ 01:01 |
but we can't really recover how far away fog is, | ▶ 01:03 |
because, honestly, both images look alike. | ▶ 01:06 |
There are certain degenerate cases where stereo doesn't work. | ▶ 01:09 |
We are going to focus on this case over here right now | ▶ 01:12 |
where we do get information from the stereo sensor. | ▶ 01:14 |
Let's get back to our stereo rig. | ▶ 00:00 |
We have two pinholes with a known focal length f, | ▶ 00:02 |
and we wish to recover the depth z of a point p. | ▶ 00:06 |
We happen to know that the projection of p on the two image planes is somewhat different. | ▶ 00:12 |
Over here we call it x1 for the first imager. | ▶ 00:18 |
Over here we call it x2 for the second imager. | ▶ 00:21 |
The question is what is the formula that allows us to look at this rig over here | ▶ 00:25 |
with two images with a known baseline b to recover the depth z | ▶ 00:32 |
from the relative displacements x1 and x2. | ▶ 00:38 |
There happens to be a relatively simple answer. | ▶ 00:41 |
If you look at this big triangle over here, that triangle has the same proportions | ▶ 00:44 |
as the triangle put together by this little thing over here and this thing over here. | ▶ 00:49 |
You move these two triangles over here together into a single triangle. | ▶ 00:54 |
It looks like this. | ▶ 00:59 |
The proportions of this triangle over here are the same | ▶ 01:02 |
as the proportions of this triangle over here. | ▶ 01:06 |
Specifically, the length back here is x2 minus x1. | ▶ 01:08 |
This distance over here is f, the length over here in the baseline b, | ▶ 01:13 |
and this length over here is the unknown depth z. | ▶ 01:18 |
If we transform this and solve it for z, | ▶ 01:21 |
we get z equals f times b over x2 minus x1. | ▶ 01:24 |
If we look at the relative displacement of a point in these two different camera images, | ▶ 01:30 |
which is x2 minus x1, you'll find that the actual depth is inversely proportional to it | ▶ 01:33 |
and scales linearly with the focal length f and the baseline b. | ▶ 01:39 |
These are all things we know. The baseline and the focal length are constants. | ▶ 01:46 |
They're called intrinsics. | ▶ 01:50 |
These are measurements, and from this we can actually recover the real depth. | ▶ 01:52 |
Let's just try to practice this. | ▶ 01:55 |
So let me give you another quiz in which we have a stereo rig with baseline B | ▶ 00:00 |
with two measurements, x1 and x2, of the same point P in the scene. | ▶ 00:06 |
We know our focal length f, and I care about our depth z. | ▶ 00:13 |
Here is our formula again to make things a little bit easier. | ▶ 00:20 |
Here assume that my x2 equals 3 mm, my x1 is -1 mm, | ▶ 00:24 |
my focal length is 8 mm, and my baseline B is 20 cm. | ▶ 00:30 |
I'd like to know z in centimeters. | ▶ 00:35 |
The answer is 40. | ▶ 00:00 |
f is 8 mm, and 3 minus -1 is 4 mm, | ▶ 00:02 |
which gives this term over here a factor of 2. Times B, | ▶ 00:09 |
20 centimeters, makes 40 cm for z. | ▶ 00:14 |
Using the same formula, we are going to write x2 minus x1 as delta x | ▶ 00:00 |
just to make it a little bit simpler. | ▶ 00:05 |
Let me see if we can recover other things. | ▶ 00:08 |
Let's assume the range is actually 10 m. We know about the physical world. | ▶ 00:10 |
We know our baseline 1 m. We know that our focal length is 30 mm. | ▶ 00:15 |
Can we possibly recover the delta x? | ▶ 00:20 |
I'd like you to give your answer in mm. | ▶ 00:24 |
The answer is absolutely yes. It's going to be 3 mm. | ▶ 00:00 |
To see this, we transform this equation over here to bring delta x to the left; | ▶ 00:05 |
B over z is 0.1 times 30 mm makes 3 mm over here. | ▶ 00:09 |
Let's now go to a difficult challenging case | ▶ 00:00 |
where I'd like to recover the focal length f from measurements of the type z, delta x, and B. | ▶ 00:03 |
Suppose I happen to know that an object is 100 m away, | ▶ 00:11 |
and my baseline is 0.5 m, which is 50 cm. | ▶ 00:15 |
And suppose my displacement x2 minus x1 is exactly 1 millimeter. | ▶ 00:20 |
What do you think f is expressed in mm? | ▶ 00:24 |
The answer is 200 mm for our focal length. | ▶ 00:00 |
We can transform this expression over here to bring f to the left side. | ▶ 00:05 |
And z over B equals 200, and delta x equals 1 mm. | ▶ 00:09 |
We get 200 mm as an answer for our question. | ▶ 00:14 |
I'd like to say a few words on the issue of correspondence, | ▶ 00:00 |
often also called data association. | ▶ 00:03 |
Supposing we have two camera images, as shown over here, | ▶ 00:06 |
and we see an interesting point P in the left image. | ▶ 00:09 |
The question is where do you search in the right image? | ▶ 00:13 |
Everywhere? Along a line? Or can you already predict the point? | ▶ 00:16 |
So where do you search in the right image? | ▶ 00:22 |
Everywhere would be in 2D--the entire image. | ▶ 00:24 |
Along a line would be 1D, and a fixed point would be 0D. | ▶ 00:28 |
Please check the appropriate box. | ▶ 00:33 |
The answer is 1D. You can actually search along this line over here. | ▶ 00:00 |
You can't really know where along the line the point is, | ▶ 00:05 |
because where it is is a function of the depth of the scene, which you don't know, | ▶ 00:08 |
but it can't be the full image. | ▶ 00:15 |
To illustrate this, let me look a little bit from above. | ▶ 00:18 |
Here we have two image planes from the two cameras. | ▶ 00:21 |
There is a point over here that finds itself in the image plane over there. | ▶ 00:24 |
If we don't know the depth, we know that the point must lie on this ray over here, | ▶ 00:28 |
and each of the points on this ray get projected into this imager along a line. | ▶ 00:34 |
If the point is over here, it might be the projection over there, | ▶ 00:41 |
and as we go out to infinity, it might be the point over here. | ▶ 00:45 |
Now this camera array is a little bit more general than we talked about. | ▶ 00:50 |
The image planes aren't parallel anymore, | ▶ 00:53 |
but even if they're not parallel, each point in the left image corresponds to a potential line | ▶ 00:56 |
of corresponding points in the right image. | ▶ 01:01 |
It makes the search for correspondences much, much easier. | ▶ 01:04 |
Let's talk a little bit more about correspondences. | ▶ 01:07 |
The general correspondence problem is given | ▶ 00:00 |
if there are two identical-looking points in the scene that have different depths. | ▶ 00:03 |
For example, P1 might reflect into the image over here, | ▶ 00:08 |
and P2 will reflect into the image as indicated by these red lines. | ▶ 00:12 |
Now if we understand the correspondence of P1 in both images, | ▶ 00:16 |
that this point corresponds to this point, we are well off, | ▶ 00:20 |
and we can estimate the depth of P1. | ▶ 00:23 |
If we get it wrong, if we correspond this point over here in the image to this guy over here, | ▶ 00:25 |
then what we will see is this point right over here--P1 prime. | ▶ 00:31 |
If we correspond this guy over here with this guy over here, | ▶ 00:36 |
we get P2 prime. | ▶ 00:39 |
These aren't really points in the actual scene, but they'll be phantom points | ▶ 00:41 |
that occur because we got the correspondence wrong. | ▶ 00:46 |
It's really important when we look at two camera images | ▶ 00:48 |
to understand what is the actual correspondence. | ▶ 00:51 |
Here are actually two images from a stereo rig of a scene, | ▶ 00:55 |
and you can see there's a slight displacement. It's actually really hard to see. | ▶ 01:00 |
We're looking at this feature over here for now. | ▶ 01:04 |
I'd like to correspond it to something in the right image. | ▶ 01:07 |
We have already learned that the search will have to be along a line. | ▶ 01:10 |
Here is the green line, which is the corresponding line. | ▶ 01:15 |
It can't be that this point over here shows up somewhere in the sky over here, | ▶ 01:18 |
but even along the point, it's not completely obvious how to do correspondence-- | ▶ 01:22 |
how to match this image over here to this image over there. | ▶ 01:26 |
So my question is how can we possibly find | ▶ 01:29 |
where this feature corresponds to a feature over here? | ▶ 01:33 |
How can we determine correspondence? | ▶ 01:37 |
By matching small image patches, using some of the linear techniques we talked about in | ▶ 01:40 |
the last class to basically compare how similar small image patches look, | ▶ 01:45 |
or by matching features, in particular edge features or corner features | ▶ 01:51 |
that we might extract from the original image. | ▶ 01:54 |
Or maybe neither of those two. Please check any or all of those that apply. | ▶ 01:57 |
The answer is both. You can use image patches and features, and I'll talk about both. | ▶ 00:00 |
They are somewhat similar, and they're not without problems, | ▶ 00:06 |
but both are being used to estimate correspondence. | ▶ 00:08 |
Here is my pair of images again, | ▶ 00:00 |
and my scan line, and I'm extracting from it a very small little window | ▶ 00:04 |
that is the local image of the specific feature over here | ▶ 00:10 |
which happens to have a strong vertical structure, | ▶ 00:13 |
which is nice for localization. | ▶ 00:16 |
Now I'm comparing this little patch with my little patches in the right image, | ▶ 00:18 |
and I'm drawing a sum of square difference error, | ▶ 00:23 |
which is minimized when these two patches look alike. | ▶ 00:27 |
I'll tell you in a second how this looks like mathematically, | ▶ 00:31 |
but intuitively we have to pick the place along the search space in the right image | ▶ 00:34 |
that has the smallest sum of square difference error, | ▶ 00:39 |
which is the one where these two patches just look mostly alike. | ▶ 00:43 |
This is the space along the scan line in which I search, | ▶ 00:47 |
often called disparity, and for one location this is actually being minimized right over here. | ▶ 00:51 |
Here's the basic algorithm for SSD minimization. | ▶ 00:56 |
We take two patches--one from the left image, one from the right image. | ▶ 00:59 |
We normalize, so the average brightness is zero. | ▶ 01:03 |
We then take the normalized image and take the difference. | ▶ 01:06 |
Then we square the difference. That gives us a sum-of-square image. | ▶ 01:09 |
Then we can sum up all the pixels to get a single value. | ▶ 01:13 |
This is our SSD value, our sum-of-square difference value. | ▶ 01:17 |
All of these operations are easily implemented using the material you already know. | ▶ 01:21 |
The smaller the SSD value, the closer these two images correspond. | ▶ 01:26 |
This is a very common technique for comparing what's called image templates, | ▶ 01:31 |
where your left image is a template, | ▶ 01:36 |
and you're searching the left image for the optimal template. | ▶ 01:39 |
As you vary the location in the right image, you can find different SSDs. | ▶ 01:42 |
You tend to get graphs like this for the right image. | ▶ 01:47 |
Comparing with the image template gives you certain errors, | ▶ 01:50 |
and sometimes you get a very small error at one disparity. | ▶ 01:54 |
That's the place you'll pick for the best, most likely alignment. | ▶ 01:57 |
Here is the result of such an operation. | ▶ 00:00 |
We have yet again a left image over here. The right one is missing. | ▶ 00:03 |
Here you can see what's called a disparity map, | ▶ 00:06 |
which is the map of the best match. | ▶ 00:09 |
In the right image, the larger the disparity, the more we have to assume the patch shifted. | ▶ 00:12 |
We extracted every possible patch from this image, did the search on the right image, | ▶ 00:17 |
and we find in the foreground, the disparity is much larger than the background. | ▶ 00:22 |
Sometimes we get a black spot, like over here, | ▶ 00:26 |
where the information itself is not good enough to make any decision. | ▶ 00:29 |
Or in the pathway over here, there are no real features. | ▶ 00:33 |
Same for the sky over here. | ▶ 00:36 |
But in most cases, we can see a nicely shaded gray that decreases with distance | ▶ 00:38 |
where the disparity decreases. | ▶ 00:44 |
This is a very typical stereo vision result. | ▶ 00:46 |
Here is a disparity map from driving in the desert with our DARPA Grand Challenge car, Stanley. | ▶ 00:50 |
We equipped it with two cameras, one on the left and one on the right. | ▶ 00:58 |
You can see the two camera images, and on the right the disparity map. | ▶ 01:02 |
It's not that informative, because there is very little structure in the road surface itself, | ▶ 01:07 |
but by and large you can see things further away end up being darker. | ▶ 01:15 |
The big dominant thing here is lack of texture, | ▶ 01:20 |
which leads to certain areas in the disparity map just being black, | ▶ 01:24 |
which means we don't know. | ▶ 01:28 |
But where it registers, it does a pretty fine job. | ▶ 01:30 |
I'd like to talk a little bit more about correspondence. | ▶ 00:00 |
Specifically, we've learned that searching for correspondence means | ▶ 00:04 |
we search along a single scan line, | ▶ 00:09 |
but I'd like to ask the question whether it's optimal to correspond individual patches | ▶ 00:11 |
which are independent of each other. | ▶ 00:17 |
Would it make sense to look at the context of an entire scan line? | ▶ 00:19 |
Let's look at the following situation. | ▶ 00:24 |
We have a background that's black. | ▶ 00:26 |
We have a foreground that's red, | ▶ 00:28 |
and we have sides of the object that are both blue. | ▶ 00:31 |
In a left image, we might see black, black, | ▶ 00:34 |
and then there is this blue element that is only visible from the left camera, | ▶ 00:37 |
a couple of reds--3 of them--and then we see more blacks. | ▶ 00:44 |
From the right imager we might see black, black. | ▶ 00:48 |
We won't see the blue over here, because it's occluded, | ▶ 00:52 |
but we'll see a couple of reds followed by the blue over here, | ▶ 00:55 |
which is only visible from the right camera, followed by more blacks. | ▶ 01:00 |
When we look at the entire situation, | ▶ 01:05 |
the question is whether we can correspond red pixels to each other | ▶ 01:08 |
irrespective of context or whether it makes sense to look at context. | ▶ 01:13 |
Specifically, take the mid red pixel over here--this guy over here-- | ▶ 01:17 |
and let me ask you does it correspond to the left red, the center red, or the right red? | ▶ 01:22 |
Please check the corresponding box. | ▶ 01:29 |
The answer is it corresponds to the center red, | ▶ 00:00 |
which is the guy over here. | ▶ 00:03 |
Finding this is not easy. | ▶ 00:06 |
This is the fifth pixel on the left camera image, | ▶ 00:08 |
and it's the fourth pixel on the right camera image. | ▶ 00:11 |
To make that correspondence, we have to understand that the best match matches | ▶ 00:14 |
these 2 black pixels over here, followed by an occlusion pixel | ▶ 00:19 |
that's only visible on the left but not on the right, | ▶ 00:24 |
followed by the 3 corresponding red pixels, | ▶ 00:27 |
followed by another occlusion pixel, | ▶ 00:30 |
followed by 2 black pixels that basically correspond. | ▶ 00:32 |
I now want to look into algorithms that can take entire scan lines of the left side | ▶ 00:36 |
and correspond them to entire scan lines on the right side. | ▶ 00:40 |
Let's look at the same problem again, | ▶ 00:00 |
and let me just draw the two scan lines-- | ▶ 00:03 |
the left scan line and the right scan line. | ▶ 00:05 |
As before, we get to see red pixels, black pixels, | ▶ 00:10 |
and the occlusive blue pixels as indicated over here. | ▶ 00:15 |
Now we'll try to match the entire scan line on the top to the entire scan line on the bottom. | ▶ 00:20 |
so we can figure out what the exact correspondence is. | ▶ 00:27 |
We do this by minimizing the cost function. | ▶ 00:31 |
The cost comes in two different flavors. | ▶ 00:34 |
There is the cost of bad matches. | ▶ 00:37 |
Let's assume if the colors match perfectly, we pay zero, | ▶ 00:40 |
but if the colors match very poorly, we pay 20. | ▶ 00:45 |
There is also the cost of occlusion. | ▶ 00:48 |
If in the process of matching these lines we have to assume a pixel is occluded, | ▶ 00:51 |
we're just going to pay 10. | ▶ 00:56 |
The question now is: what is the optimal alignment of the top to the bottom under this cost function? | ▶ 00:58 |
Let's just go through this. Let me look at two different possible alignments. | ▶ 01:04 |
Here is one. We align those black pixels, and we align the red pixels. | ▶ 01:08 |
If we did this, what is the cost of the total match? Please put the answer over here. | ▶ 01:12 |
The cost is 20. | ▶ 00:00 |
The reason being that in this match over here, we get a perfect color match. | ▶ 00:02 |
Black matches to black. Red matches to red. | ▶ 00:08 |
But we have to assume that this pixel over here and the pixel over here | ▶ 00:11 |
are both the result of occlusion. Each costs us 10. | ▶ 00:15 |
So the result is we pay 20 as the total cost. | ▶ 00:18 |
Let me now ask the same question again for a different alignment. | ▶ 00:00 |
Suppose we were to marry the red pixel over here to the blue one over here, | ▶ 00:03 |
the red one to the red one over here, and so on. | ▶ 00:09 |
What would now be the cost of the alignment? | ▶ 00:13 |
We don't have these diagonals, but we match pixel by pixel over here. | ▶ 00:16 |
The answer would be in this case 40, | ▶ 00:00 |
because we pay a 20 penalty of a bad match over here. | ▶ 00:03 |
These are good matches. We end up paying another 20 penalty for a bad match over there. | ▶ 00:09 |
In total we get a penalty of 40. | ▶ 00:14 |
What this teaches us is that in matching pixels to pixels, | ▶ 00:16 |
we match an entire corresponding line in stereo. | ▶ 00:21 |
We can trade off the bad match cost with the occlusion cost. | ▶ 00:25 |
Sometimes it is cheaper to assume occlusion, | ▶ 00:30 |
and sometimes it is cheaper to assume a bad match. | ▶ 00:33 |
The result of this optimization is that it gives us the best association | ▶ 00:37 |
of the scan line over here to the scan line over here. | ▶ 00:41 |
The tricky part is how to compute the best possible alignment. | ▶ 00:00 |
It's usually done by dynamic programming. | ▶ 00:06 |
The recognition here is that in principle there are | ▶ 00:09 |
exponentially many ways to align pixels in the left and right image, | ▶ 00:13 |
but in practice you can get away with an n-squared algorithm | ▶ 00:17 |
where n is the number of pixels in the scan line. | ▶ 00:22 |
Let's write this as n-squared. It's a much, much faster algorithm. | ▶ 00:25 |
Here's the idea. Let's write down both scan lines as shown over here. | ▶ 00:29 |
And let's write down a matrix of size n by n. | ▶ 00:34 |
The neat thing here is that any path from the top left to the bottom right | ▶ 00:38 |
is a specific correspondence of pixels over here on the left scan line | ▶ 00:43 |
to pixels over here on the right scan line. | ▶ 00:50 |
For example, if I take the path that's diagonal, that lines the pixels up with each other. | ▶ 00:52 |
But the best possible path would assume that the first two pixels correspond, | ▶ 00:58 |
and there's a left occlusion afterwards. | ▶ 01:03 |
Then all the red guys correspond. | ▶ 01:06 |
So this red guy over here corresponds to this red guy over here. | ▶ 01:08 |
There's an occlusion over here. | ▶ 01:12 |
Then we go diagonal again. | ▶ 01:14 |
So any path that picks actions that go diagonal, down, or right | ▶ 01:16 |
so that the top left is connected to the bottom right | ▶ 01:21 |
becomes a valid correspondence of the left scan line to the right scan line. | ▶ 01:26 |
How do we find the best one? | ▶ 01:33 |
Well, just like in MDPs, we use the same methodology as in an MDP. | ▶ 01:35 |
We define the value of any of these points in the grid to be the best | ▶ 01:40 |
possible value of getting there. | ▶ 01:44 |
The value of a point i, j in the grid is the best of the match value | ▶ 01:47 |
if we chose the diagonal, which is expressed over here as the match cost of i and j | ▶ 01:54 |
given that we chose the diagonal, added to the value at i minus 1 and j minus 1, | ▶ 01:59 |
or the occlusion penalty plus the value of either cell we could have occluded from, for the left or the right. | ▶ 02:06 |
If we look at these three different things we optimize over here, | ▶ 02:12 |
then each value over here becomes the best of assuming we have no occlusion | ▶ 02:15 |
plus the corresponding match cost, or assuming we did have an occlusion, | ▶ 02:21 |
in either of the two directions, and then we just pay the occlusion penalty | ▶ 02:26 |
and take the value over there. | ▶ 02:31 |
Now, that's not trivial. You have to think about this. | ▶ 02:34 |
Why does this give us the optimal path? | ▶ 02:37 |
But if you think about it and look at the optimal path, | ▶ 02:39 |
we pay no penalty over here because the match is perfect. | ▶ 02:42 |
We pay no penalty over here because the match is perfect again. | ▶ 02:45 |
So, again, the first clause in this formula. | ▶ 02:48 |
Over here we do pay a penalty. | ▶ 02:50 |
We pay a penalty of 10, which is the occlusion penalty, | ▶ 02:53 |
because we assume that between the blue pixel over here and the right scan image | ▶ 02:56 |
there's just no appropriate match. We're going to pay a penalty of 10 over here. | ▶ 03:01 |
Over here we pay no penalty, because the right corresponds perfectly to the red, | ▶ 03:06 |
and we assume it is a perfect match. | ▶ 03:09 |
The same over here and the same over here. | ▶ 03:12 |
Down here we pay a penalty of 10, because we assume an occlusion, | ▶ 03:14 |
and down here we just assume no penalty at all. | ▶ 03:18 |
Now, dynamic programming computes this value for every possible location. | ▶ 03:21 |
For example, this guy over here would have a best optimal path, | ▶ 03:25 |
which might assume we had a perfect match over here and two occlusions over there, | ▶ 03:29 |
but now the penalty is already 20 whereas the penalty over here is 10. | ▶ 03:34 |
So likely this point won't survive. | ▶ 03:37 |
By working out the value function in this really interesting grid over here, | ▶ 03:40 |
we find the value of the final point, which is 20, | ▶ 03:46 |
and we also find the best possible path | ▶ 03:49 |
by tracing the way in which the value propagated through this grid. | ▶ 03:53 |
This becomes the best possible correspondence of the left and the right image | ▶ 03:57 |
by aligning the entire left scan line and the entire right scan line | ▶ 04:02 |
simultaneously using dynamic programming. | ▶ 04:06 |
This is the state of the art in stereo computer vision. | ▶ 04:10 |
Let me see if you understand what I just talked about by the following quiz. | ▶ 00:00 |
Let's assume we have two scan lines of six pixels each. | ▶ 00:05 |
You get to observe the following that I've colorized: | ▶ 00:10 |
two blacks followed by three reds by one black for the left image | ▶ 00:13 |
and one black followed by four reds by one black for the right scan line. | ▶ 00:17 |
Let us also assume the occlusion penalty is 5, and the bad match penalty is 20. | ▶ 00:23 |
Can you mark the location to which you'd like to correspond | ▶ 00:30 |
this specific red pixel over here in the right scan line | ▶ 00:34 |
by minimizing the total cost of occlusion and bad matches. | ▶ 00:38 |
This is a tricky question, | ▶ 00:00 |
because it turns out that the occlusion answer is better than the bad match answer. | ▶ 00:03 |
If you were to correspond each pixel in the left scan line | ▶ 00:09 |
straight to each pixel in the right scan line, | ▶ 00:12 |
you find that the total penalty will be 20, | ▶ 00:15 |
because there's one bad match between the black pixel over here and the red pixel here. | ▶ 00:18 |
However, you can also correspond as follows. | ▶ 00:23 |
You end up paying two occlusion penalties for this guy over here and this guy over here. | ▶ 00:26 |
Because the occlusion penalty is only 5, you pay only a total of 10 as a penalty, | ▶ 00:32 |
which is better than a single bad match penalty. | ▶ 00:38 |
As a result, this would've been the right answer over here. | ▶ 00:41 |
Let me discuss the second quiz here, in which we're given two scan lines again-- | ▶ 00:00 |
a left and a right scan line. | ▶ 00:05 |
The pixels are for the left scan line black, red, black, black, black, black | ▶ 00:07 |
and for the right scan line black, black, black, black, red, and black. | ▶ 00:13 |
Let's assume an occlusion now costs us 10, and a bad match costs us 20. | ▶ 00:17 |
So what pixel should be aligned with the black pixel over here? | ▶ 00:22 |
Check one of those boxes. | ▶ 00:27 |
The answer is this box over here. | ▶ 00:00 |
The idea is that one may correspond | ▶ 00:04 |
each pixel on the left scan line to the pixel on the right scan line with the same index. | ▶ 00:07 |
So this guy corresponds to this guy and so on--just straight up. | ▶ 00:12 |
We're going to pay a penalty of 40, | ▶ 00:16 |
because this red pixel over here is a bad match to the pixel over there, | ▶ 00:19 |
and the same is true over here. | ▶ 00:24 |
So it's a penalty of 40, which I conjecture to be the best. | ▶ 00:26 |
Let me show you the answer that I don't like, | ▶ 00:30 |
which corresponds this black pixel to this black pixel here, | ▶ 00:33 |
the red guy to the guy over here, and then black to black over here, | ▶ 00:35 |
in which case we pay an occlusion penalty for the following six pixels: | ▶ 00:39 |
the guys over here and the guys over here. | ▶ 00:46 |
Each occlusion value is 10, which makes a total occlusion penalty of 60, | ▶ 00:49 |
which is worse than the 40. | ▶ 00:53 |
So I conjecture that the straight-up match like this is superior, | ▶ 00:55 |
and we should check the box over here. | ▶ 01:00 |
As you can see the optimal path for this diagram over here, | ▶ 00:00 |
which determines what is occlusion and what is a bad match, | ▶ 00:04 |
really is a function of those penalties, | ▶ 00:08 |
the costs that we associate with poor matches or the occlusion assumption. | ▶ 00:11 |
So running dynamic programming through this grid over here will give you | ▶ 00:16 |
the best alignment that gives you the best possible total cost | ▶ 00:19 |
that assumes an optimal trade off between occlusion costs and the cost of matching 2 pixels. | ▶ 00:24 |
This segment is my explanation of correspondence in stereo vision. | ▶ 00:00 |
It came a long way. There are a few things that don't work really well. | ▶ 00:06 |
For example, we have two cameras over here, and we have a big object over here | ▶ 00:10 |
with a foreground separate object. | ▶ 00:14 |
Then the ordering constraint is violated, and dynamic programming doesn't hold. | ▶ 00:17 |
That is, an object over here might appear left of the object over here in the left imager | ▶ 00:23 |
but right of the object over here in the right imager. | ▶ 00:29 |
There are other cases where things go wrong. | ▶ 00:32 |
For example, suppose you were imaging a circular object with these two imagers here. | ▶ 00:34 |
Then the occlusion boundary of this object as viewed from the right imager | ▶ 00:39 |
is different from the occlusion boundary of the same object as viewed from the left imager. | ▶ 00:44 |
These are not corresponding points. They correspond to different points on the object. | ▶ 00:49 |
As a result, your stereo calculation will give you a poor result. | ▶ 00:53 |
A final instance where things might go wrong is | ▶ 00:57 |
reflective objects that have specular reflections. | ▶ 01:01 |
This ball over here reflects the ceiling lights, | ▶ 01:05 |
and obviously, where the ceiling lights are being reflected | ▶ 01:08 |
is a function of where an imager is positioned. | ▶ 01:11 |
For these specific features over here, | ▶ 01:15 |
we get a really lousy depth estimate for the object at hand. | ▶ 01:17 |
I'd like to say a few words about how to improve the results of stereo vision. | ▶ 00:00 |
Here is a vision assembly that James David built out of two cameras. | ▶ 00:06 |
In addition to having these two cameras, he also put a projector into the scene | ▶ 00:11 |
that emitted a random light pattern. | ▶ 00:14 |
In fact, it emitted a striped pattern, shown over here on this frog, | ▶ 00:17 |
and by adding texture to the scene, you can make correspondence easier. | ▶ 00:23 |
This is a striped pattern of unequal distances. | ▶ 00:29 |
There's a coding over here, which makes certain stripes larger than others. | ▶ 00:33 |
If you run the same algorithm I just told you, | ▶ 00:38 |
you'll find that stereo vision becomes better, | ▶ 00:41 |
because we can now better disambiguate the correspondence of points. | ▶ 00:44 |
Here is the assembly used for imaging myself. This is me with a sweater on. | ▶ 00:48 |
That's my face. | ▶ 00:53 |
And you can see by emitting structured light, as it is called, | ▶ 00:55 |
you can enhance the performance of stereo | ▶ 01:00 |
and objects that otherwise have very poor texture. | ▶ 01:03 |
Another solution is called the Microsoft Kinect. You're probably familiar with it. | ▶ 01:06 |
It's a new gaming platform that's been sold at record pace. | ▶ 01:11 |
It uses a camera system, together with a laser. | ▶ 01:15 |
The laser adds texture to the scene, | ▶ 01:18 |
and by triangulation using the same method I showed you, | ▶ 01:21 |
it can recover depth. | ▶ 01:24 |
Here's my postdoc Christian using a Kinect-like sensor | ▶ 01:26 |
to do certain poses in front of a depth sensor. | ▶ 01:31 |
You can see in the screen how his pose is being perceived, | ▶ 01:36 |
and you can see Christian trying to do handstands and other acrobatic maneuvers. | ▶ 01:41 |
He's actually pretty good. | ▶ 01:51 |
That's all using effectively stereo vision. | ▶ 01:54 |
There is actually a whole bunch of different types of techniques | ▶ 02:07 |
for sensing range in computer vision. | ▶ 02:10 |
I'm just going to briefly talk about them. | ▶ 02:13 |
They're called laser range finders. | ▶ 02:15 |
They send off beams of light, | ▶ 02:17 |
and they measure the time until the light comes back into the sensor. | ▶ 02:19 |
They're being manufactured by many different companies. | ▶ 02:22 |
In our experiments using robots to drive through the desert and through traffic, | ▶ 02:25 |
we quite extensively used laser range finders as an alternative to stereo vision, | ▶ 02:30 |
because they give us very, very good range estimates. | ▶ 02:35 |
Here is a 3D model constructed by laser range finders of our neighborhood in Palo Alto, | ▶ 02:38 |
and it's easy to see how 3D points can make amazing 3D models, | ▶ 02:45 |
using techniques like stereo vision or like the laser range finders I just briefly talked about. | ▶ 02:52 |
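As a tiny aside, the time-of-flight principle described above reduces to one line of arithmetic: the beam travels to the obstacle and back, so the range is half the round-trip time multiplied by the speed of light. The sketch below is illustrative, and the measured time in it is a made-up value.

```python
# Sketch of the time-of-flight idea behind a laser range finder:
# the beam travels to the obstacle and back, so range = c * t / 2.

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def tof_range(round_trip_time_s):
    """Range in meters for a measured round-trip time in seconds."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

print(tof_range(200e-9))  # a 200 ns round trip corresponds to roughly 30 m
```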
[Thrun] In this very final episode of the computer vision classes, | ▶ 00:00 |
I will teach you about structure from motion. | ▶ 00:03 |
This is a really funny name for something much more intuitive, | ▶ 00:06 |
and it comes from the early days of computer vision | ▶ 00:11 |
where the structure referred to the 3D world. | ▶ 00:14 |
And of course it's impossible to capture the 3D world with the camera itself | ▶ 00:17 |
because the camera only gives 2D projections of the 3D scene. | ▶ 00:21 |
Motion referred to the locations of the camera. | ▶ 00:24 |
So the idea was to take a handheld camera and move it around a 3D structure | ▶ 00:27 |
and be able to recover or estimate the 3D coordinates of all the features in the world | ▶ 00:34 |
based on many 2D images. | ▶ 00:40 |
So suppose you have a scene with 3 features--A, B, and C-- | ▶ 00:43 |
and you're moving a camera around to different positions--1, 2, and 3. | ▶ 00:47 |
Then the different features get projected onto different points in the camera planes, | ▶ 00:52 |
as shown over here. | ▶ 00:57 |
And from the positions of those projected features | ▶ 00:59 |
it may be possible to recover not just where the camera was | ▶ 01:03 |
at the time these images were taken but also where in the world the features are. | ▶ 01:08 |
That's called structure from motion. | ▶ 01:13 |
So here is my first quiz. | ▶ 01:15 |
Is this possible? | ▶ 01:17 |
Given that we look at a number of features in the scene-- | ▶ 01:19 |
maybe 1, maybe 2, maybe more-- | ▶ 01:22 |
and given that we have 1 or more camera positions, | ▶ 01:24 |
can we always, sometimes, or never recover or calculate the 3D position of the features | ▶ 01:26 |
and the 3D position of the cameras simultaneously? | ▶ 01:33 |
Please check always, sometimes, or never. | ▶ 01:36 |
[Thrun] And the answer is sometimes, | ▶ 00:00 |
and it's not entirely obvious whether this is the right answer. | ▶ 00:02 |
Clearly if there is only 1 point feature in the world and 1 image, | ▶ 00:05 |
then you can't recover where the feature is. | ▶ 00:08 |
You already learned this, because the camera can't estimate depth by itself. | ▶ 00:10 |
So it can't be always, | ▶ 00:15 |
but it's also not never. | ▶ 00:17 |
There are cases in which we can actually recover the full scene | ▶ 00:19 |
and all the camera positions, and we will ask ourselves in a minute | ▶ 00:23 |
under what situation this might be possible. | ▶ 00:27 |
[Thrun] Let me first give you a brief quiz to understand how the projection works | ▶ 00:00 |
in structure from motion. | ▶ 00:04 |
Suppose we have 3 point features at known locations. | ▶ 00:06 |
We have a camera over here, camera A, | ▶ 00:09 |
which can see these 3 point features. | ▶ 00:13 |
We have a second camera over here, camera B, | ▶ 00:16 |
a pinhole camera which can see the same features. | ▶ 00:18 |
Suppose camera A on the left sees feature 1, | ▶ 00:21 |
at the center sees feature 3, and on the right side of the camera plane sees feature 2. | ▶ 00:25 |
I would like to know for camera B what will be on the camera plane on the left side, | ▶ 00:30 |
the center, or the right side. | ▶ 00:35 |
Which of the features 1, 2, 3 will be seen left, center, or right? | ▶ 00:37 |
[Thrun] And the answer is 3, 2, 1. Let's just go through this. | ▶ 00:00 |
Clearly the leftmost feature in camera A is 1, which corresponds to this point over here, | ▶ 00:05 |
the center will be 3, and the right one will be 2, | ▶ 00:11 |
as indicated in the table over here. | ▶ 00:17 |
If we now look into imager B, | ▶ 00:19 |
we find that the leftmost projection comes from feature number 3, | ▶ 00:22 |
the center projection from feature number 2, | ▶ 00:27 |
and the rightmost projection from feature number 1. | ▶ 00:31 |
This is not the full structure from motion problem, | ▶ 00:34 |
but it's a good exercise to understand how feature indices, under known camera positions A and B | ▶ 00:37 |
and under known locations of the target features, map to each other, | ▶ 00:43 |
and it's good to understand the complexity of the structure from motion problem. | ▶ 00:48 |
[Thrun] Here is a very early example of structure from motion by Carlo Tomasi | ▶ 00:00 |
and Takeo Kanade. | ▶ 00:05 |
They used Harris corner detectors to find corners in the image of this toy 3D house, | ▶ 00:07 |
and they were able from a number of images to fully recover the 3D structure | ▶ 00:13 |
of every single corner point, as shown in this video. | ▶ 00:18 |
So as they then take this 3D data set and turn it in arbitrary directions, | ▶ 00:22 |
you can see the full 3D structure was recovered. | ▶ 00:26 |
This is work in 1992. | ▶ 00:29 |
It used principal component analysis to solve the problem | ▶ 00:31 |
and is one of the most amazing pieces of early computer vision research. | ▶ 00:34 |
Carlo, who used to be a Stanford professor for many years, | ▶ 00:39 |
then scanned his kitchen and with the same Harris corner detector | ▶ 00:44 |
was able to reconstruct a 3D structure of his kitchen, as shown over here. | ▶ 00:47 |
Again, this is one of the most impressive early computer vision research results I've seen. | ▶ 00:52 |
Here is a flight video of flying over the hills of Pennsylvania. | ▶ 00:58 |
As you can see, using the same technique he was able to recover the 3D structure | ▶ 01:03 |
of the outdoor terrain and build elevation maps, as shown over here. | ▶ 01:09 |
Marc Pollefeys, who presently teaches at ETH Zurich, | ▶ 01:32 |
came up with a beautiful solution to the structure from motion problem, | ▶ 01:35 |
here imaging different buildings in his hometown. | ▶ 01:40 |
From this video you can see multiple snapshots of a single building | ▶ 01:43 |
where the different perspective distortion has an effect on the appearance of the building, | ▶ 01:48 |
quite obviously. | ▶ 01:53 |
Using those images he was able to reconstruct the 3D shape of the building facade, | ▶ 01:55 |
as shown in this video. | ▶ 02:00 |
Again, at the time it was one of the most impressive results ever achieved | ▶ 02:10 |
in structure from motion. | ▶ 02:14 |
You can see amazing detail as he zooms in to his building model. | ▶ 02:29 |
He then moved on to map entire cities, | ▶ 02:38 |
and here is an example of a map that he produced from an entire city block. | ▶ 02:42 |
You can see how he reconstructs the building facades in unprecedented detail. | ▶ 02:48 |
There's also a lot of occlusion gaps where the original imager wasn't able to see anything. | ▶ 02:55 |
Those show up in black, and they look a little bit disturbing in this image over here. | ▶ 03:00 |
But in reality, your camera can't see everything. | ▶ 03:04 |
So even if you do a perfect job with structure from motion, | ▶ 03:07 |
it's really hard to reconstruct every single inch of the environment. | ▶ 03:09 |
Still, this stands out as one of the most impressive results ever | ▶ 03:17 |
in what I would call the Holy Grail of 3D computer vision. | ▶ 03:20 |
[Thrun] The mathematics of the structure from motion problem are involved, | ▶ 00:00 |
and I don't want to go into detail here. | ▶ 00:04 |
Here is our perspective projection model with our well-known equation on the right. | ▶ 00:07 |
Under the assumption that the camera itself might be at a random location | ▶ 00:13 |
and a random orientation, this equation becomes a really complicated composition | ▶ 00:17 |
of original image points in 3D, 3 rotation matrices as shown over here, | ▶ 00:24 |
and an offset over here that relates to the camera coordinates. | ▶ 00:30 |
You do this for X divided by Z over here. | ▶ 00:34 |
This will be the projected camera input coordinates. | ▶ 00:40 |
This is the generative math that specifies how cameras work under arbitrary orientations | ▶ 00:43 |
and arbitrary translations. | ▶ 00:49 |
If you now want to solve it, you can look at the observed measurements | ▶ 00:51 |
minus the predicted measurements, minimize all this, | ▶ 00:57 |
and solve for the translations, the point locations, and the orientations simultaneously. | ▶ 01:02 |
This is entirely nontrivial, and nonlinear optimization techniques | ▶ 01:10 |
have been used extensively to solve this problem. | ▶ 01:15 |
They go by names like gradient descent, conjugate gradient, | ▶ 01:19 |
Gauss-Newton, Levenberg-Marquardt, and other things like singular value decomposition. | ▶ 01:23 |
I won't go into detail--just to give you a flavor of the problem. | ▶ 01:30 |
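To give a flavor of this generative model in code, here is a minimal sketch: a 3D point is rotated and translated into a camera frame and then perspective-projected, and the residual between observed and predicted projections is what a nonlinear optimizer would minimize. The parameterization (yaw, pitch, roll plus a translation and a single focal length f) and all function names are my own assumptions, not the notation from the slide.

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """Compose three elementary rotations into one 3x3 rotation matrix."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def project(point_3d, camera_pose, f):
    """Predicted (x, y) image coordinates of one 3D point in one camera."""
    yaw, pitch, roll, tx, ty, tz = camera_pose  # the 6 unknowns per camera
    p = rotation_matrix(yaw, pitch, roll) @ (np.asarray(point_3d, float) - np.array([tx, ty, tz]))
    return f * p[0] / p[2], f * p[1] / p[2]

def residual(observed_xy, point_3d, camera_pose, f):
    """Observed minus predicted; stacking these over all points and cameras gives
    the error that gradient descent / Gauss-Newton / Levenberg-Marquardt minimize."""
    px, py = project(point_3d, camera_pose, f)
    return observed_xy[0] - px, observed_xy[1] - py

# A point 10 units straight ahead of an un-rotated camera at the origin projects to the center.
print(project([0.0, 0.0, 10.0], (0, 0, 0, 0, 0, 0), f=0.01))   # (0.0, 0.0)
```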
Instead I'd like to ask you a question. | ▶ 01:35 |
[Thrun] I should warn you this question is hard. | ▶ 00:00 |
If you're new to computer vision, you likely won't get it. | ▶ 00:03 |
I'd like to ask it to you anyhow just to see how close you can get | ▶ 00:07 |
and whether you appreciate the answer I'll be giving you. | ▶ 00:12 |
Suppose we have m camera poses. | ▶ 00:15 |
That means m directions from which we take an image. | ▶ 00:19 |
That's called the motion. | ▶ 00:21 |
And suppose we have n 3D points, which is called the structure. | ▶ 00:23 |
Then it's quite obvious that we have 2 times m times n constraints | ▶ 00:29 |
simply because in each of the images we see all the n points | ▶ 00:35 |
and we get an x and a y coordinate for each of the points, | ▶ 00:39 |
which makes 2-m-n constraints. | ▶ 00:42 |
We also have unknowns. | ▶ 00:45 |
Specifically, each camera position is a 6D unknown | ▶ 00:47 |
about the rotation and translation of the camera, | ▶ 00:51 |
and each point itself has a 3D coordinate. | ▶ 00:53 |
So the total number of unknowns is 6m plus 3n. | ▶ 00:55 |
At first glance, to solve the structure from motion problem you would want 6m plus 3n | ▶ 00:59 |
to be smaller or equal to 2mn. | ▶ 01:06 |
And of course if m and n are big enough, this equation will be satisfied. | ▶ 01:09 |
But my question is if you run the structure from motion problem, | ▶ 01:13 |
how many of these unknowns can you actually recover? | ▶ 01:16 |
Or, put differently, how many of those unknowns can you not recover? | ▶ 01:19 |
If you think about it, for example, you won't be able to really recover | ▶ 01:25 |
the absolute coordinates of our system, because you can move the entire system | ▶ 01:28 |
1 meter to the right and you'll still get the same answer. | ▶ 01:32 |
So there's going to be a number over here that I want you to enter | ▶ 01:36 |
that specifies the number of parameters that cannot possibly be recovered | ▶ 01:41 |
in this structure from motion problem. | ▶ 01:45 |
[Thrun] And surprisingly, the answer is 7. | ▶ 00:00 |
You cannot recover the absolute location and orientation of the coordinate system, | ▶ 00:03 |
which are 6 of those parameters, | ▶ 00:07 |
but you can also not recover scale. | ▶ 00:10 |
For example, take a situation like this where you have 3 points over here | ▶ 00:14 |
and now make this situation twice as large | ▶ 00:19 |
with the points spread out twice as widely. | ▶ 00:22 |
Because of perspective math, this over here will be the same answer | ▶ 00:25 |
as this guy over here. | ▶ 00:28 |
So this is 1 scale parameter that you can't recover, | ▶ 00:30 |
so you can only recover 6m plus 3n minus 7 parameters. | ▶ 00:33 |
And as long as this is smaller than 2mn, you have a solution | ▶ 00:38 |
of the structure from motion problem. | ▶ 00:42 |
This was entirely nontrivial. | ▶ 00:44 |
If you got this wrong, I would have gotten this wrong if I hadn't known the solution. | ▶ 00:46 |
But it's fun to think about these things. | ▶ 00:50 |
A lot of computer vision people care about whether the problem is well posed, | ▶ 00:52 |
and you need a certain number of features and a certain number of images | ▶ 00:55 |
to make this equation hold true. | ▶ 00:58 |
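As a small sketch, the counting argument can be written down directly; the function name and the example values of m and n below are illustrative choices of mine.

```python
# Counting check from the quiz: 2*m*n constraints versus 6*m + 3*n unknowns,
# of which 7 (absolute position, orientation, and scale) can never be recovered.

def enough_constraints(m, n):
    """True if m camera poses and n points satisfy the lecture's condition 6m + 3n - 7 < 2mn."""
    return 6 * m + 3 * n - 7 < 2 * m * n

print(enough_constraints(1, 1))    # False: one image of one point is not enough
print(enough_constraints(3, 10))   # True: a few images of a few points suffice by this count
```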
[Thrun] Now, at this point I would love to go deeper into structure from motion | ▶ 00:00 |
and tell you more about how to solve it. | ▶ 00:04 |
But, unfortunately, this is the Introduction to Artificial Intelligence class, | ▶ 00:06 |
so I really want to leave it at a level that covers the typical material I cover at Stanford. | ▶ 00:10 |
If you're interested, take a computer vision class. | ▶ 00:16 |
It's a fascinating subject area. | ▶ 00:18 |
This finishes my survey of computer vision. | ▶ 00:20 |
Congratulations. You made it through the computer vision classes. | ▶ 00:24 |
I think you now understand the very basics of computer vision, | ▶ 00:27 |
you understand how images are being formed, | ▶ 00:30 |
how features are being extracted, | ▶ 00:33 |
and how we can do some very basic 3D inference about the world. | ▶ 00:35 |
This is just a teaser. | ▶ 00:39 |
The field of computer vision is of course much richer. | ▶ 00:41 |
I used to teach the class at Stanford, | ▶ 00:43 |
and I hope to be able to invite you in the near future | ▶ 00:46 |
to an actual online 3D computer vision class. | ▶ 00:48 |
Welcome to the homework assignment on computer vision. | ▶ 00:00 |
I will first ask you a few questions about perspective projection | ▶ 00:03 |
in which you will exercise the math that we explored | ▶ 00:07 |
that relates the size of an object in the scene, uppercase X, | ▶ 00:10 |
with the depth, or the range to the pinhole camera, Z, the focal length f, | ▶ 00:15 |
and the size of the projection, small x. | ▶ 00:21 |
Remember from class the following equation. | ▶ 00:25 |
I'm literally dropping the minus sign. I want all numbers to be positive in this example. | ▶ 00:30 |
So please don't worry about the minus sign that might not occur | ▶ 00:34 |
in this specific version of the equation. | ▶ 00:37 |
I'll give you three values for X, Z, f, | ▶ 00:40 |
and would like you to understand what the missing value is. | ▶ 00:46 |
X is measured in meters and the same with Z. | ▶ 00:49 |
f is in millimeters and so is lowercase x. | ▶ 00:53 |
Here's my first question. Suppose X is 10 m in size. | ▶ 00:58 |
It's 100 meters away. Suppose our focal length is 10 mm. | ▶ 01:02 |
How large is our projection lowercase x in millimeters? | ▶ 01:07 |
Now we're asking you what the focal length is if an object of size 20 m that is 400 meters out | ▶ 01:10 |
is observed to be 1 mm on our projection surface. | ▶ 01:16 |
Suppose we have a 2 m sized object that with a 40 mm focal length | ▶ 01:20 |
appears to be 1 mm in size. What is the distance of this object to the camera? | ▶ 01:24 |
Finally, say an object of unknown size is 300 m away. | ▶ 01:30 |
Our focal length is now 100 mm, and the projection is again 1 mm. | ▶ 01:33 |
How large is this object in meters? | ▶ 01:37 |
The answer can be seen directly from this formula over here. | ▶ 00:00 |
In the first case we plug in X equals 10. | ▶ 00:04 |
F over Z is 10 divided by 100. | ▶ 00:06 |
We multiply 10 by 0.1, and we get 1. | ▶ 00:10 |
It turns out all the units take care of themselves. | ▶ 00:14 |
No matter which way we pose the question, we can effectively ignore those units, | ▶ 00:17 |
because they're the same outside the camera as they are inside the camera. | ▶ 00:21 |
For the second question, we transform this equation over here as follows: | ▶ 00:25 |
We now plug in Z as 400, X as 20, lowercase x as 1, | ▶ 00:29 |
to give us a focal length of 20 mm. | ▶ 00:37 |
We can transform this equation further into this quotient over here. | ▶ 00:40 |
f over x is 40, times uppercase X of 2 m, is 80 meters. | ▶ 00:43 |
Finally we can write this expression over here, and we plug in x over f. | ▶ 00:49 |
We get one hundredth times 300, which is 3 meters over here. | ▶ 00:55 |
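All four questions rearrange the same relation x = X f / Z with the minus sign dropped. Here is a minimal sketch that reproduces the four answers above; the function names are my own.

```python
# Pinhole relation from the homework: x = X * f / Z (minus sign dropped).
# Each question solves this relation for the one missing quantity.

def projection_size(X, Z, f):   # lowercase x, in mm
    return X * f / Z

def focal_length(X, Z, x):      # f, in mm
    return x * Z / X

def object_range(X, f, x):      # Z, in m
    return X * f / x

def object_size(Z, f, x):       # uppercase X, in m
    return x * Z / f

print(projection_size(X=10, Z=100, f=10))   # 1.0 mm
print(focal_length(X=20, Z=400, x=1))       # 20.0 mm
print(object_range(X=2, f=40, x=1))         # 80.0 m
print(object_size(Z=300, f=100, x=1))       # 3.0 m
```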
In this question I'm going to ask you about whether certain image functions are linear. | ▶ 00:00 |
A function is linear if each resulting pixel of the processed image | ▶ 00:06 |
is a linear combination of input pixels. | ▶ 00:11 |
They could be weighted by constants like plus 1 or minus 1, | ▶ 00:14 |
and they could be added up. Addition is linear. | ▶ 00:17 |
But for example, taking the square of a pixel isn't a linear operation. | ▶ 00:20 |
I realize this question goes beyond what we discussed in class, | ▶ 00:24 |
so please think a little bit about it and understand the difference | ▶ 00:28 |
between linear and nonlinear in trying to answer these questions. | ▶ 00:31 |
First is our gradient kernel here: minus 1, 1. | ▶ 00:35 |
Please check if it's linear or nonlinear. | ▶ 00:39 |
Again, the linearity of an output image is given if the output image is a linear function | ▶ 00:41 |
of the pixels in the input image. | ▶ 00:47 |
How about our Gaussian kernel that we discussed in class of size 5 by 5? | ▶ 00:49 |
Is the kernel linear or nonlinear? | ▶ 00:53 |
How about taking the absolute value of each pixel? | ▶ 00:56 |
If pixels are negative, we just ignore the negative sign and map back to the absolute value. | ▶ 00:58 |
Is it linear or nonlinear? | ▶ 01:03 |
We talked about the gradient magnitude kernel, | ▶ 01:05 |
which was defined as the square root of the sum of the squares of the image gradients. | ▶ 01:07 |
Is this a linear or nonlinear operation? | ▶ 01:13 |
Finally, if you were to calculate the absolute brightness of a grey-scale image, | ▶ 01:15 |
or let me call this the average brightness. | ▶ 01:20 |
We have an imager of a certain size and would like to calculate just the average brightness | ▶ 01:23 |
of all the individual image pixels. They are all in greyscale. | ▶ 01:27 |
Is this linear or nonlinear? | ▶ 01:30 |
The answer is every kernel convolution is linear. | ▶ 00:00 |
Each pixel becomes the linear sum of, in this case, 2 pixels | ▶ 00:04 |
that are weighted by plus 1 and minus 1, but in terms of the original variables, | ▶ 00:09 |
which is the original image, this resulting sum, minus the left pixel plus the right pixel, | ▶ 00:12 |
is a linear equation in the original pixel values. | ▶ 00:18 |
The same is true for the Gaussian kernel of size 5 by 5. | ▶ 00:22 |
It is a linear kernel because it just adds up all these values, | ▶ 00:26 |
weighted by the Gaussian kernel. | ▶ 00:30 |
Absolute value is nonlinear. | ▶ 00:32 |
The function that governs absolute value for input and output looks like this, | ▶ 00:34 |
and there is a nonlinear kink over here. | ▶ 00:40 |
The same is true for gradient magnitude. | ▶ 00:43 |
There are squares in there, which are nonlinear, | ▶ 00:45 |
and the square root makes it a nonlinear operation. | ▶ 00:47 |
The absolute brightness is a linear operation. | ▶ 00:50 |
It's just like a Gaussian kernel with a uniform mask. | ▶ 00:52 |
It just adds up all the values and divides them by the number of pixels. | ▶ 00:57 |
It is a linear function in all the input pixels. | ▶ 01:00 |
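If you want to convince yourself numerically, here is a small sketch that checks the linearity definition from the quiz on random test images: T is linear if T(aI1 + bI2) = aT(I1) + bT(I2). The helper names and the random images are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2 = rng.random((5, 5)), rng.random((5, 5))
a, b = 2.0, -3.0

def horizontal_gradient(image):
    """Apply the kernel (-1, 0, 1) per row, treating off-image pixels as zero."""
    padded = np.pad(image, ((0, 0), (1, 1)))      # zero padding left and right
    return padded[:, 2:] - padded[:, :-2]

def is_linear(T):
    """Check T(a*I1 + b*I2) == a*T(I1) + b*T(I2) on the two test images."""
    return np.allclose(T(a * I1 + b * I2), a * T(I1) + b * T(I2))

print(is_linear(horizontal_gradient))   # True: kernel convolution is linear
print(is_linear(np.mean))               # True: average brightness is linear
print(is_linear(np.abs))                # False: absolute value is not linear
```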
In this example, I'd like you to calculate a gradient image. | ▶ 00:00 |
I'm giving you a relatively simple image of size 3 by 3 | ▶ 00:04 |
with the following greyscale pixel values: 2, 0, 2, 4, 100, 102, 2, 4, 2. | ▶ 00:07 |
And for the sake of this exercise, I'd like to obtain another 3 by 3 image, | ▶ 00:15 |
so we'll assume that all the values outside the image are just zero. | ▶ 00:20 |
What I'm asking you is to compute a 3 by 3 matrix | ▶ 00:24 |
that is the result of convolving this image with the following kernel: | ▶ 00:28 |
minus 1 on the left, zero, and 1. | ▶ 00:32 |
Then take the absolute value of each pixel, so you're going to ignore the minus sign, | ▶ 00:35 |
which is clearly a nonlinear operation. | ▶ 00:39 |
Please apply this kernel to the image over here. | ▶ 00:41 |
For each pixel down here, you get a linear combination from applying this kernel | ▶ 00:44 |
to the values over here, assuming these off-image values are all zero. | ▶ 00:49 |
We then take the absolute--drop the minus sign--and please plug in the number over here. | ▶ 00:53 |
This kernel applied in this location over here will give me | ▶ 00:00 |
a minus 1 times zero plus 1 times zero is zero. | ▶ 00:04 |
Shifted to the right, you get minus 1 times 2 plus 1 times 2, which is zero. | ▶ 00:09 |
The absolute value of this is zero. | ▶ 00:14 |
On the right side we get zero again. | ▶ 00:16 |
You get 100 over here, which is minus zero plus 100. | ▶ 00:18 |
We get 98 over here, which is minus 4 plus 102. | ▶ 00:24 |
And we get 100 over here, which is minus 100 over here plus zero, | ▶ 00:31 |
which gives minus 100. We take the absolute, so it's 100. | ▶ 00:36 |
You get 4 over here, which is minus zero plus 4 equals 4. | ▶ 00:40 |
Zero over here. These two balance each other out. | ▶ 00:47 |
Then another 4 over here. Minus 4 plus zero is minus 4. | ▶ 00:51 |
Taking the absolute we get 4 over here. | ▶ 00:55 |
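Here is a short sketch that reproduces this worked example: pad with zeros, apply the kernel as minus the left neighbor plus the right neighbor, then take the absolute value. The numpy formulation is my own; the pixel values are the ones from the question.

```python
import numpy as np

image = np.array([[2,   0,   2],
                  [4, 100, 102],
                  [2,   4,   2]], dtype=float)

padded = np.pad(image, ((0, 0), (1, 1)))            # off-image values are zero
result = np.abs(padded[:, 2:] - padded[:, :-2])     # right neighbor minus left neighbor, then absolute

print(result)
# [[  0.   0.   0.]
#  [100.  98. 100.]
#  [  4.   0.   4.]]
```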
I now have a stereo question. | ▶ 00:00 |
For a valid calibrated stereo rig, we're given two pinhole cameras | ▶ 00:03 |
whose displacement is B, and we observe a point out in the scene. | ▶ 00:08 |
The distance of this point to the image plane is uppercase Z. | ▶ 00:12 |
Our cameras have a focal length of f. | ▶ 00:19 |
Of course, this point is being projected into two different locations | ▶ 00:22 |
for the two different images--x2 and x1. | ▶ 00:25 |
For the sake of this question, we're going to just consider delta x, which is x2 minus x1-- | ▶ 00:29 |
the displacement in the corresponding images. | ▶ 00:36 |
We measure delta x in millimeters, same with the focal length f. | ▶ 00:40 |
B is in meters and so is Z, | ▶ 00:45 |
and I use meters for B and Z so that the units effectively fall out. | ▶ 00:49 |
You don't really have to consider them. | ▶ 00:53 |
Suppose the measured delta x is 4 mm | ▶ 00:55 |
with a focal length of 40, and our displacement is 0.1. | ▶ 00:58 |
How far away is the object in meters? | ▶ 01:03 |
Suppose we have a displacement of 0.05 mm for a focal length of 50. | ▶ 01:07 |
We now care about the baseline B, if we happen to know the object is 100 meters away. | ▶ 01:12 |
Next we have a displacement of 0.1 mm. We don't know our focal length. | ▶ 01:20 |
We know that the baseline is 0.2 mm, and the object is 50 meters away. | ▶ 01:25 |
Finally we don't know the displacement, but we do know that the focal length is 200 mm. | ▶ 01:32 |
We have a baseline of 1 m and the object is 50 meters away. | ▶ 01:39 |
Can you fill in the missing numbers? | ▶ 01:44 |
We answer this question by first writing down the fundamental equation here, | ▶ 00:00 |
which is delta x over f relates to B over Z. | ▶ 00:04 |
To see this, we find by similar triangles that this triangle over here | ▶ 00:10 |
described by Z over B, is the same as these 2 things over here | ▶ 00:14 |
put together into a single triangle, which is delta x over f. | ▶ 00:19 |
This proportionality must be the case. | ▶ 00:24 |
This can now be transformed to solve for Z. | ▶ 00:27 |
Z equals f over delta x, times B. | ▶ 00:31 |
If we plug in f over delta x, we get 10, times 0.1, which is 1. | ▶ 00:33 |
We can resolve B is delta x over f times Z. | ▶ 00:39 |
We plug in delta x, 0.05, divided by 50 is 0.001 times 100 gives us 0.1. | ▶ 00:44 |
We can also resolve it for f, which is Z over B times delta x. | ▶ 00:54 |
Z over B is 10 times delta x as 0.1 gives us 1 over here. | ▶ 01:00 |
Finally, we can resolve it for delta x. | ▶ 01:05 |
B over Z times f. B over Z is 1/50 times f as 200 gives us 4 over here. | ▶ 01:07 |
All the units fall out by themselves. | ▶ 01:14 |
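The whole table rests on the relation Δx / f = B / Z. Here is a minimal sketch that solves it for each missing quantity; the function names are mine, and the printed calls reproduce three of the rows worked out above.

```python
# Calibrated stereo relation from the homework: delta_x / f = B / Z.

def depth(f, delta_x, B):          # Z = (f / delta_x) * B
    return f / delta_x * B

def baseline(f, delta_x, Z):       # B = (delta_x / f) * Z
    return delta_x / f * Z

def focal_length(delta_x, B, Z):   # f = (Z / B) * delta_x
    return Z / B * delta_x

def disparity(f, B, Z):            # delta_x = (B / Z) * f
    return B / Z * f

print(depth(f=40, delta_x=4, B=0.1))         # 1.0 m
print(baseline(f=50, delta_x=0.05, Z=100))   # 0.1 m
print(disparity(f=200, B=1, Z=50))           # 4.0 mm
```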
We will now talk about correspondence in stereo. | ▶ 00:00 |
You might remember our dynamic programming approach for | ▶ 00:03 |
resolving correspondence along an entire scan line. | ▶ 00:06 |
So I'll give you another scan line. This is the left scan line--red, red, blue, blue, blue, red. | ▶ 00:09 |
Then in the right scan line we get to see the following. | ▶ 00:15 |
Obviously there is a shift going on. | ▶ 00:19 |
I'd like to ask you where this little pixel over here will go in the final association. | ▶ 00:22 |
It can go into any of those pixels over here, so please check exactly one of those boxes. | ▶ 00:27 |
Let's assume the cost for a bad match, | ▶ 00:31 |
when we match 2 colors that don't correspond, is 20. | ▶ 00:34 |
The cost of an assumed occlusion or a disocclusion is 10. | ▶ 00:37 |
Try to find the optimal alignment, | ▶ 00:41 |
and then tell me where in the right scan line this 1 pixel corresponds to. | ▶ 00:43 |
Check the exact box to which it corresponds. | ▶ 00:49 |
Here is a second question I'd like to ask you. | ▶ 00:52 |
What if we changed the cost of occlusion to 100? | ▶ 00:54 |
Please answer the exact same question--where does the B over here go-- | ▶ 00:58 |
under this different cost model. | ▶ 01:01 |
In the case where the occlusion costs are low, | ▶ 00:00 |
it is best to assume that those Bs over here correspond as indicated. | ▶ 00:03 |
That means we have an occlusion cost to pay over here | ▶ 00:08 |
and an occlusion cost to pay over here. | ▶ 00:10 |
Our total cost is 20 for 2 occlusions, but we have a perfect match. | ▶ 00:12 |
As a result, this B moves over here. | ▶ 00:16 |
However, there is another viable solution when the occlusion costs are large, | ▶ 00:19 |
because you would pay a total of 200 with the occlusion cost. | ▶ 00:22 |
If we match pixel one to one like this, then you get two mismatches-- | ▶ 00:25 |
one over here and one over here. | ▶ 00:29 |
The cost of those in total are 40. | ▶ 00:31 |
That is still smaller than the 200 occlusion cost we had before. | ▶ 00:33 |
Therefore the B gets matched to this point over here. | ▶ 00:37 |
Notice how the different occlusion costs give different results | ▶ 00:39 |
for the correspondence program in this dynamic programming question. | ▶ 00:42 |
This final question is motivated by structure from motion, | ▶ 00:00 |
and it's not the full-blown structure from motion problem, | ▶ 00:03 |
which is hard to do on a piece of paper here. | ▶ 00:05 |
But it's a variant for which we know the motion but not the structure. | ▶ 00:08 |
Suppose we are given two cameras, | ▶ 00:12 |
and we happen to know there are three features in the scene. | ▶ 00:14 |
All three features can be seen by both cameras. | ▶ 00:16 |
This camera will see a feature on the left, in the center, and on the right--L, C, R. | ▶ 00:19 |
This camera is camera A. | ▶ 00:25 |
In camera B, we also see a feature on the left, center, and right, | ▶ 00:27 |
but I don't know the identity of those features, which I will call 1, 2, and 3. | ▶ 00:30 |
Suppose in camera A we see from left to the right the following sequence: 1, 2, 3. | ▶ 00:35 |
These are the features numbers. | ▶ 00:42 |
So in the left camera we notice that the left-most visible feature is feature 1, | ▶ 00:44 |
the center feature is feature number 2, and the right feature is feature number 3. | ▶ 00:49 |
I'm going to ask what is the order of those pixels in camera B, | ▶ 00:53 |
assuming that the features are located as shown over here and so are the cameras. | ▶ 00:57 |
Please give the index of the features over here that clearly have to be 1, 2, and 3 | ▶ 01:01 |
in some order which you have to determine. | ▶ 01:06 |
For a different configuration let's now assume we get to see 1, 2, 3 in camera B, | ▶ 01:08 |
and we care about the feature indices we see in camera A | ▶ 01:13 |
that corresponds to those features over here. | ▶ 01:17 |
Let's study the first case. | ▶ 00:00 |
We saw left feature number 1. | ▶ 00:02 |
The left one is over here, therefore this one will be feature number 1. | ▶ 00:04 |
Feature number 1 will be seen at the center of camera B. | ▶ 00:09 |
In the center of the left camera, we see feature number 2, | ▶ 00:12 |
which must be this guy over here, because it's the center feature to be seen. | ▶ 00:15 |
This will be projected to the right side of camera B. | ▶ 00:18 |
Even though it's left in the image plane, the way I drew it you can see that | ▶ 00:21 |
the projection over here is on the right side of the camera chip. | ▶ 00:25 |
The same with feature number 3, which is seen in R over here on the right side. | ▶ 00:28 |
Hence, it'll project into the leftmost field over here. | ▶ 00:32 |
Using a different color now, if camera B sees feature number 1 on its L position, | ▶ 00:36 |
then this must be feature number 1, which will appear on the right position for camera A. | ▶ 00:42 |
And feature number 2 is in the center position--this guy over here-- | ▶ 00:47 |
which shows up in the left position over here, | ▶ 00:50 |
and the remaining fits in over here. This is the correct answer. | ▶ 00:52 |
One of the things I've been working on for most of my professional career are self-driving cars. | ▶ 00:00 |
The vision is that in the future cars will drive themselves, | ▶ 00:07 |
and in doing so they can be significantly safer. | ▶ 00:11 |
We lose a little over 1 million people per year in the entire world in traffic accidents. | ▶ 00:14 |
I believe most of these accidents can be avoided by making cars safer. | ▶ 00:21 |
If they drive themselves, they can drive disabled people. | ▶ 00:25 |
They can drive blind people, young children, aging people, | ▶ 00:28 |
and they could drive all of us while we do better things than staring at the road ahead. | ▶ 00:32 |
So one of my life passions has been to develop self-driving cars. | ▶ 00:36 |
Today, I'd like to tell you about those, and also show you some of the basic techniques | ▶ 00:40 |
so you can in principle program your own self-driving car. | ▶ 00:45 |
So for me the work on self-driving cars started in 2004 after the first DARPA Grand Challenge. | ▶ 00:51 |
This was a government-sponsored robot race | ▶ 01:01 |
in which autonomous robots were asked to drive through the Mojave desert from California to Nevada | ▶ 01:04 |
along 141 miles of really punishing desert terrain. | ▶ 01:11 |
Lots of teams from various universities and car companies competed, | ▶ 01:17 |
and also lots of hobbyists that were new to the field, | ▶ 01:21 |
and they built this huge set of different cars. | ▶ 01:25 |
There were over 100 different entries into the first DARPA Grand Challenge. | ▶ 01:28 |
Despite all this work, most robots failed out of the starting gate, | ▶ 01:32 |
like this one over here flipped over less than 100 meters into the race. | ▶ 01:36 |
Some were very, very large. | ▶ 01:42 |
This is a major defense contractor who built this 35,000 pound vehicle, | ▶ 01:44 |
which on the course was rather timid, | ▶ 01:50 |
and some of the teams had very small robots, like the next one by UC Berkeley, | ▶ 01:53 |
which was a motorcycle. | ▶ 02:00 |
So here we go. | ▶ 02:04 |
The first DARPA Grand Challenge came with $1 million of prize money, | ▶ 02:11 |
and despite this prize money, no team made it further than 5% of the total course. | ▶ 02:16 |
In fact, almost all cars stopped for something very stupid, | ▶ 02:21 |
some went up in flames, | ▶ 02:26 |
and the furthest any team made it was this car over here by Carnegie Mellon University, | ▶ 02:28 |
which made it about just below 8 miles of the total distance. | ▶ 02:33 |
So for many of us, this was a massive failure of robotic technology, | ▶ 02:37 |
which motivated me to get involved in this race. | ▶ 02:41 |
My own story is really simple. | ▶ 02:46 |
I started a class at Stanford, and I got about 20 students to work with me | ▶ 02:48 |
on what would become the Stanford racing team that would ultimately go and win this race. | ▶ 02:52 |
We modified a Volkswagen Touareg to put all kinds of sensors onto the roof | ▶ 02:57 |
and actuators into the car that could actuate the steering wheel, the gas pedal, and the brake. | ▶ 03:04 |
The sensors came in multiple versions. | ▶ 03:09 |
Some were related to localization, such as global positioning sensors, | ▶ 03:11 |
and some were related to understanding where obstacles are, like laser-range finders. | ▶ 03:15 |
We talked about computer vision in this class. | ▶ 03:19 |
The actuators were basically a motor on the steering wheel and on the brake pedals and on the gas pedal. | ▶ 03:21 |
Early on, we tested on Stanford's campus. | ▶ 03:27 |
This is the roof of the medical parking garage. | ▶ 03:31 |
Here you can see my students and I performed simple maneuvers. | ▶ 03:34 |
Now, I've got to tell you that this is usually a busy parking garage. | ▶ 03:38 |
It's the medical parking garage at Stanford Hospital, | ▶ 03:41 |
but as we practiced autonomous driving, people would come and pick up their car | ▶ 03:44 |
and ask us what we were doing, so we kept telling them, | ▶ 03:48 |
well, we're building a self-driving car. | ▶ 03:51 |
Within less than a week, people just chose not to park there anymore. | ▶ 03:53 |
Closer to the next version of the Grand Challenge, the second one in 2005, | ▶ 04:01 |
we had built a car that could drive competently on most desert tracks | ▶ 04:06 |
at speeds up to about 60 km per hour through dry river beds, through steep inclines and declines, | ▶ 04:12 |
and would be able to avoid obstacles like this little shrub on the right side over here. | ▶ 04:22 |
It was never really elegant, but it was insanely effective. | ▶ 04:27 |
Now, not all testing went smoothly. | ▶ 04:35 |
This is imagery that the New York Times shot of us when we invited them for a test drive. | ▶ 04:38 |
During this day, we managed to crash into a tree and get stuck in the mud. | ▶ 04:43 |
It was pretty embarrassing. | ▶ 04:47 |
Here is imagery of our laser system mapping out the terrain ahead. | ▶ 00:03 |
We talked a little bit about lasers and range finders in this class. | ▶ 00:07 |
Here you can see all these systems work together on building 3D maps of the environment | ▶ 00:11 |
that our car, Stanley, uses to assess the driving situation. | ▶ 00:16 |
This shows work on machine learning for autonomous driving, | ▶ 00:24 |
where we used the laser to identify driveable terrain at a short range | ▶ 00:27 |
and then extrapolate this out into the long range using a machine-learning technique | ▶ 00:32 |
applied to computer vision. | ▶ 00:36 |
What you see here is a coloring, which is the output of a machine learning algorithm | ▶ 00:39 |
that identifies driveable terrain in the desert. | ▶ 00:43 |
So very briefly, let me tell you about the race, which came with a lot of fame and $2 million. | ▶ 00:47 |
This race started early in the morning. The sun was basically still gone and was just rising. | ▶ 00:54 |
Our car was able to drive itself followed by a human-driven chase vehicle and did quite well. | ▶ 01:01 |
It did so well that it actually passed the front-seeded and first-running vehicle by Carnegie Mellon University. | ▶ 01:07 |
It had to navigate complicated and dangerous mountain trails where destruction lurked on both sides of the car. | ▶ 01:13 |
On the left there was a cliff. On the right side there was a mountain. | ▶ 01:21 |
It is here followed by a human-driven chase vehicle. | ▶ 01:24 |
Our car very carefully ascended this route. | ▶ 01:27 |
You can see it here close before the finishing line, | ▶ 01:30 |
and after just about 7 hours it managed to do what no robot had ever done before. | ▶ 01:33 |
It managed to really finish the DARPA Grand Challenge, do this race, and win Stanford $2 million. | ▶ 01:38 |
We were insanely proud on this day. | ▶ 01:44 |
From this we moved on to build Junior, which competed in the DARPA Urban Challenge. | ▶ 01:49 |
Here you can see Junior's laser perceiving obstacles and being able to detect those, | ▶ 01:57 |
using basically range vision. | ▶ 02:02 |
We will talk today of localization. | ▶ 02:08 |
Junior was able to localize itself using particle filters | ▶ 02:10 |
relative to a given map of the environment, which is essential for navigating safely in traffic. | ▶ 02:15 |
It was able to detect other cars using particle filters | ▶ 02:23 |
and estimate not just where they are and how fast they are moving but also what size they are, how big they are. | ▶ 02:27 |
You can see on the left the detected cars. | ▶ 02:34 |
On the right side, you see our camera view of the same situation. | ▶ 02:36 |
Here again, you can see it detect cars. | ▶ 02:42 |
Here is how it looked from an external observation point. | ▶ 02:49 |
You can see Junior, our vehicle, driving in a fairly busy city street with lots of cars passing. | ▶ 02:52 |
It has to wait for a gap to take a left turn. | ▶ 03:00 |
When the gap finally occurs, it confidently takes the turns and drives. | ▶ 03:03 |
In today's class, I'll teach you how to basically program a car just like that. | ▶ 03:09 |
So this is footage from our Google self-driving car, which you might have heard about. | ▶ 03:18 |
This car was able to drive at speeds as high as a Prius can go. | ▶ 03:23 |
It drives seamlessly in traffic. | ▶ 03:29 |
In fact, we drove over 100,000 miles without anybody noticing | ▶ 03:32 |
that there were self-driving cars in our experiments. | ▶ 03:36 |
This is near Stanford University on University Street in Palo Alto. | ▶ 03:39 |
You can see how the vehicle yields by itself for pedestrians. | ▶ 03:43 |
Of course, there's also a human driver on board just for safety, | ▶ 03:48 |
but this car, you can take my word for it, is really driving itself in traffic. | ▶ 03:50 |
This is image footage from the car itself as it goes onto a highway. | ▶ 03:55 |
This is sped up, I should say. | ▶ 03:58 |
Driving through a toll booth, and driving in Los Angeles. | ▶ 04:01 |
You can see a lot of palm trees here. It's a beautiful environment to drive in. | ▶ 04:07 |
Here you can see some of the inner workings, | ▶ 04:23 |
where you can see a corridor that the vehicle attempts to go. | ▶ 04:26 |
We can see obstacles being flagged using machine-learning techniques, | ▶ 04:28 |
range vision, laser radar, and so on. | ▶ 04:32 |
You can see it is colored by its relation to our car and its nature, | ▶ 04:36 |
and you can see it drives fairly confidently. | ▶ 04:40 |
This is an attempt to drive down Lombard Street in San Francisco--the famous crooked street. | ▶ 04:43 |
It's very curvy, and while this is sped up it gives you a sense of the complexity | ▶ 04:48 |
that is involved in building cars like these. | ▶ 04:53 |
It's actually quite amazing how far technology has come in such a short amount of time. | ▶ 04:55 |
Here is an experiment that my Stanford students did on self-parking using machine learning, | ▶ 05:00 |
reinforcement learning for control, | ▶ 05:06 |
and you can see how agile and how capable these methods are. | ▶ 05:08 |
So today I really want to enable you to write software like this based on lots of what we learned before. | ▶ 05:15 |
We talked a little bit about machine learning, a lot about particle filters, | ▶ 05:21 |
and some about motion planning, which relates to the planning class | ▶ 05:25 |
that Peter taught you quite a while back. | ▶ 05:29 |
Welcome to my class on robotics. | ▶ 00:00 |
In many ways this is applying AI technology to the problem of robotics. | ▶ 00:03 |
You might remember that a robot agent takes in sensor data from its environment. | ▶ 00:07 |
Here is the environment and here is the robot agent. | ▶ 00:11 |
It processes it into controls and actions | ▶ 00:14 |
that it uses to manipulate its actuators. | ▶ 00:17 |
Robotics is the science of bridging the gap between sensor data and actions. | ▶ 00:21 |
Just for the fun of it, let me ask you a quiz that links back to the very first class. | ▶ 00:00 |
Robotics is partially observable, yes or no? | ▶ 00:05 |
It's continuous in its state and action spaces and its measurement, yes or no? | ▶ 00:08 |
The environment may be stochastic, yes or no, | ▶ 00:15 |
and it may be adversarial, yes or no? | ▶ 00:19 |
Specifically, something like the DARPA Grand Challenge might be adversarial or not. | ▶ 00:22 |
Please choose the answer that best fits. | ▶ 00:26 |
I understand there might be multiple choices possible here, | ▶ 00:28 |
but please go with what fits the best. | ▶ 00:30 |
And very clearly robotics is partially observable. | ▶ 00:00 |
We'll talk about this today a little bit more when I apply particle filters. | ▶ 00:02 |
It is continuous--that is all measurements are continuous and all actions tend to be continuous. | ▶ 00:06 |
The environment is clearly stochastic. | ▶ 00:11 |
It's impossible to predict what's happening next with absolute certainty. | ▶ 00:13 |
Then you can argue back and forth whether it's adversarial or not. | ▶ 00:18 |
In most cases we don't treat it as adversarial. We don't think about it. | ▶ 00:21 |
But some things about robotics are indeed adversarial, and to some extent driving is as well. | ▶ 00:24 |
I want to say no for now, but I'm going to accept both answers over here, | ▶ 00:29 |
so you can write whatever you want, | ▶ 00:33 |
because I don't want us to think about robotics as adversarial. | ▶ 00:35 |
At least not in the case of driving cars. | ▶ 00:38 |
One of the fundamental things about robotics is called "perception." | ▶ 00:00 |
The story here is that you get sensor measurements, and you're trying to estimate | ▶ 00:04 |
an internal state such that the internal state is sufficient to determine what to do next. | ▶ 00:07 |
It's usually a recursive method. It's called a "filter." We talked about this at length. | ▶ 00:12 |
I'm going to ask you a few questions about the state itself. So here's a quiz. | ▶ 00:16 |
Suppose we have a mobile robot that is round and lives on a plane, | ▶ 00:21 |
and it can turn on the spot, but its location is given by a two-dimensional coordinate. | ▶ 00:29 |
It might face in a certain direction. | ▶ 00:37 |
We really care about what's called the kinematic state. | ▶ 00:39 |
That is, we care about where it is but not how fast it is actually moving. | ▶ 00:42 |
So what is the dimensionality of the state space for such a robot? | ▶ 00:47 |
I do realize we haven't really talked about this much yet. | ▶ 00:52 |
I'd like you to take a good guess, and I'll tell you the answer once you have made your guess. | ▶ 00:54 |
The answer of this specific example over here is 3, | ▶ 00:00 |
although you could argue it could be something else, | ▶ 00:05 |
but 3 is a convenient and common answer, | ▶ 00:07 |
which is that this robot's state is determined by its xy location and by its heading direction, | ▶ 00:09 |
which is often called theta. | ▶ 00:16 |
Now, you could argue heading doesn't matter because it can turn itself on the spot, | ▶ 00:18 |
and in some examples that is actually correct. | ▶ 00:22 |
You might be able to get away with it with a two-dimensional state, | ▶ 00:24 |
but if you're going to predict what's happened next when you, for example, drive forward, | ▶ 00:27 |
the heading matters greatly. | ▶ 00:30 |
In that sense, it's actually a three-dimensional state, so 3 is the best answer. | ▶ 00:32 |
Let me ask you a few more questions about dimensionality of state spaces. | ▶ 00:37 |
Here is a car, | ▶ 00:00 |
and again we worry about the kinematic state. | ▶ 00:03 |
That is, where in the world is this robot irrespective of its current velocity? | ▶ 00:05 |
What do you think the right answer is for dimensionality? | ▶ 00:11 |
The correct answer is still 3, although you can argue it is more, | ▶ 00:00 |
because maybe the position of its steering wheel matters, | ▶ 00:04 |
but at first approximation it is the same as before. | ▶ 00:07 |
There might be a center point for the car. | ▶ 00:10 |
It has a certain location in the global coordinate system, | ▶ 00:12 |
and it again has a heading direction in which it can go. | ▶ 00:15 |
So 3 is a convenient answer, | ▶ 00:18 |
although technically the steering wheel angle might also be influential for what's happening next. | ▶ 00:20 |
Now let's talk about the dynamic state. | ▶ 00:00 |
The dynamic state includes the velocities of the vehicle itself. | ▶ 00:03 |
I'd like to understand what's a good number of dimensions | ▶ 00:06 |
to encode the dynamic state of this car. | ▶ 00:09 |
The common answer here is 5, | ▶ 00:00 |
although I realize many, many different answers are possible, | ▶ 00:03 |
and there is clearly no single answer that is really correct. | ▶ 00:06 |
But 5 is my favorite. | ▶ 00:09 |
It's the 3 kinematic state variables, and in addition we care about velocities, | ▶ 00:11 |
so there is the forward velocity of the vehicle itself, | ▶ 00:15 |
and the faster the vehicle moves, the further it is going to advance in a given time step. | ▶ 00:18 |
There is also something called a yaw rate. | ▶ 00:23 |
Yaw is one way to name the heading of the car, | ▶ 00:26 |
the orientation of the car, or the bearing as some people call it. | ▶ 00:30 |
The rate is the change over time. | ▶ 00:33 |
This car will not just move forward. It will also turn. | ▶ 00:35 |
That turn has a velocity, and the velocity is often called "yaw rate." | ▶ 00:38 |
We're going to talk about this in a few minutes. | ▶ 00:42 |
Before we do this, let me look into the state of a helicopter. | ▶ 00:00 |
Here's my best depiction of a helicopter. | ▶ 00:05 |
This helicopter can fly anywhere in the xyz space, | ▶ 00:07 |
and it can also point in any possible direction. | ▶ 00:11 |
What's the dimensionality of the kinematic state for such a vehicle? | ▶ 00:15 |
The answer is now 6. | ▶ 00:00 |
You can really see that this helicopter can assume any location in xyz space, | ▶ 00:03 |
but it also has 3 rotation degrees of freedom. | ▶ 00:09 |
It has a yaw, which is its bearing. | ▶ 00:11 |
It can tilt forward and backward. | ▶ 00:14 |
And it can roll left and right. | ▶ 00:17 |
If we look from above, here is the yaw. | ▶ 00:20 |
This is the tilt. | ▶ 00:23 |
If you look from the front, you find there is also a roll variable | ▶ 00:25 |
where the vehicle can turn around its own axis. | ▶ 00:29 |
So the total answer is 6. | ▶ 00:32 |
Here is my most difficult question for the helicopter. | ▶ 00:00 |
What's the dynamic state? What's the right dimensionality here? | ▶ 00:03 |
The answer is commonly 12, | ▶ 00:00 |
which is simply we can have 6 state variables, | ▶ 00:02 |
and in each of those the helicopter might have its own velocity. | ▶ 00:05 |
So for each of these variables, we have the state variable itself | ▶ 00:09 |
and its velocity, which makes a total of 12. | ▶ 00:12 |
This is completely nontrivial and something you probably can't know. | ▶ 00:15 |
You just have to learn it, but when you think about it, you realize this is the most general case | ▶ 00:19 |
of a vehicle that can move in 3D at every possible location, | ▶ 00:23 |
every possible orientation at any velocity. | ▶ 00:26 |
That's called the dynamic state of a free-flying object. | ▶ 00:29 |
Let's talk about localization of a car like our DARPA Urban Challenge car Junior. | ▶ 00:01 |
This car uses a map of the environment-- | ▶ 00:07 |
it knows in advance where the lane markers are-- | ▶ 00:11 |
and uses probabilistic localization to keep track of where it is. | ▶ 00:13 |
The reason for that is it could use GPS, the global positioning system, | ▶ 00:18 |
but that has enormous errors, sometimes in order of 5 or more meters, | ▶ 00:22 |
which is very unsafe for driving. | ▶ 00:27 |
By localizing utilizing particle filters or histogram filters, | ▶ 00:30 |
our car can do the same with about 10 cm error, | ▶ 00:33 |
which means it can really understand where to stay in the lane | ▶ 00:37 |
just by knowing where the lane is in advance and using localization techniques | ▶ 00:40 |
like the ones we'll discuss right now. | ▶ 00:44 |
Let's talk about particle filters for localization | ▶ 00:00 |
that is commonly called Monte Carlo localization. | ▶ 00:03 |
We learned in the particle filter lesson that the state is retained by a set of particles. | ▶ 00:06 |
Each particle is a three-dimensional vector here, | ▶ 00:12 |
comprising x,y, and the heading direction theta, | ▶ 00:15 |
as indicated by these little arrows that I'm going to just draw here. | ▶ 00:18 |
A set of particles like these would be a representation for the distribution at any point in time. | ▶ 00:22 |
Now let me look at the 2 main steps in particle filters. | ▶ 00:28 |
One is the prediction step, and one is the measurement step. Let's start with prediction. | ▶ 00:31 |
Just to make things simpler, let's assume our vehicle has only 2 wheels. | ▶ 00:35 |
It's called a differential-drive robot, and it can navigate by moving both wheels forward, | ▶ 00:40 |
but if 1 wheel moves faster than the other one, it'll turn. | ▶ 00:47 |
Let's understand how to apply a particle filter to a robot of that simplicity. | ▶ 00:51 |
This is simpler than a car, but not much simpler. It's about the same complexity. | ▶ 00:56 |
As I said, the state of this vehicle is given by the following 3 values: x, y, and θ. | ▶ 01:01 |
And to predict the outcome of an action, we need to write a function | ▶ 01:07 |
that predicts the new values based on the old values and a time interval Δt, where Δt might be a 10th of a second. | ▶ 01:11 |
Now the math for this in first approximation is very simple. | ▶ 01:19 |
It turns out this approximation is good enough to do pretty much anything in robotics | ▶ 01:23 |
even though it is not very accurate. | ▶ 01:28 |
Let's assume the robot just keeps moving forward at a fixed velocity v. | ▶ 01:30 |
Then the new x is given by the old x plus the progress it makes along axis x with velocity v. | ▶ 01:35 |
So you get v times Δt, which is the total distance traversed, | ▶ 01:44 |
but the x portion of it is cos θ. | ▶ 01:49 |
Similarly, for the y coordinates, you get the old y plus the distance traversed-- | ▶ 01:53 |
velocity times Δt times sin θ. | ▶ 01:58 |
This is a robot that doesn't really change heading directions, | ▶ 02:03 |
and it'll be sufficient for very small Δt to assume that the robot doesn't change heading directions. | ▶ 02:07 |
These are actually good equations even if the robot is turning. | ▶ 02:11 |
However, to understand the change of heading direction, | ▶ 02:15 |
we also have to assume that there is an angular velocity, | ▶ 02:17 |
and we call this ω [omega], which is a Greek letter. | ▶ 02:20 |
So the new heading direction is the old one plus ω times Δt. | ▶ 02:24 |
These are really nice equations to model relatively complex smaller robots. | ▶ 02:31 |
They're really simple geometry. If you understand cosine and sine, | ▶ 02:36 |
you realize this is basically a robot that moves on a fixed straight trajectory. | ▶ 02:39 |
For time Δt it then applies the rotation and it moves again for fixed time Δt on a straight trajectory, | ▶ 02:46 |
which is an approximation to the actual curve the robot might be taking. | ▶ 02:55 |
Let's exercise these equations over here using the following example. | ▶ 00:00 |
Suppose in the beginning x equals 24, y is 18, | ▶ 00:04 |
and the orientation for now is going to be zero, just to make it simple, | ▶ 00:10 |
and suppose Δt is 1 second, our velocity is 5 units per second, | ▶ 00:15 |
and our rotation velocity is π/8 per second. | ▶ 00:20 |
Can you use this formula to calculate x', y' and θ' after Δt? | ▶ 00:23 |
This is a robot that points in the x direction, because θ equals zero. | ▶ 00:00 |
We have a coordinate system over here where this is 24. | ▶ 00:04 |
That's 18 in the y direction, and it moves forward for a while. | ▶ 00:08 |
In fact we have 1 second, and it moves at 5 units per second, so this is 5. | ▶ 00:12 |
Then it turns its heading into this direction over here, and this is π/8. | ▶ 00:17 |
Again, the real robot would take a curve, | ▶ 00:22 |
but in our approximation we assume it goes in a straight line | ▶ 00:24 |
and then finally does a very discrete turn, which is an approximation, | ▶ 00:27 |
but that's the question I have asked you here. | ▶ 00:30 |
When you plug these values in, you'll find that x advances by 5, | ▶ 00:32 |
which is 29. Y doesn't change. | ▶ 00:36 |
The final turn is π/8, which is about 0.3927. | ▶ 00:39 |
Let me ask you a similar quiz, | ▶ 00:00 |
but this time let's say that x equals zero, y equals 10, and our heading direction is π/4, | ▶ 00:02 |
which is the same as 45 degrees. | ▶ 00:09 |
Again Δt equals 1 second, and we move at 5 units per second. | ▶ 00:11 |
Then we turn by -π/4 per second. | ▶ 00:16 |
Don't worry about the units, seconds. It's just here to make it mathematically consistent. | ▶ 00:19 |
So please plug your best estimates into the boxes over here. | ▶ 00:24 |
This robot is located at 0, 10 and it points at 45 degrees. | ▶ 00:00 |
In doing so, it moves some right and some up. | ▶ 00:06 |
In fact the same ratio in the x dimension as it will be in the y dimension. | ▶ 00:09 |
Now cosine of this θ here is about 0.7071; multiplied by 5 and added to 0, you get 3.5355. | ▶ 00:15 |
It so turns out that this is also the value of sin θ, | ▶ 00:27 |
so you can add the same value over here to the y value of 10, which gives us 13.5355. | ▶ 00:30 |
Finally, we find that within 1 second the initial heading is canceled out by the change of heading, | ▶ 00:38 |
so we'll be facing in this direction over here, and that angle is just zero. | ▶ 00:45 |
If you got those right, this was a somewhat tedious exercise on simple geometry, | ▶ 00:49 |
but this is the kind of math you need to implement in particle filter equations like those over here in robotics. | ▶ 00:56 |
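Here is the prediction math from above as a minimal sketch; the function name predict is my own, and the two calls reproduce the two quizzes just worked out.

```python
from math import cos, sin, pi

def predict(x, y, theta, v, omega, dt):
    """Move straight for dt along the current heading, then apply the turn."""
    return (x + v * dt * cos(theta),
            y + v * dt * sin(theta),
            theta + omega * dt)

print(predict(24, 18, 0.0, v=5, omega=pi / 8, dt=1))      # (29.0, 18.0, 0.3927)
print(predict(0, 10, pi / 4, v=5, omega=-pi / 4, dt=1))   # (3.5355, 13.5355, 0.0)
```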
Let's get back to Monte Carlo localization. | ▶ 00:00 |
Let's look at a single particle that sits over here. | ▶ 00:03 |
There's an x, y, and a θ. | ▶ 00:06 |
Let's assume we happen to know that the robot is moving at velocity v | ▶ 00:09 |
and at angular velocity ω, which comes from the differential of its wheels. | ▶ 00:13 |
And after it moves so far, it will end up exactly over there with a heading pointing in this direction. | ▶ 00:16 |
Now that you worked the math, you know exactly how to implement this. | ▶ 00:23 |
In Monte Carlo localization, we don't predict the exact outcome. We add noise. | ▶ 00:27 |
We add noise to velocity v and to the heading direction ω. | ▶ 00:32 |
As we do so, we might find ourselves with lots of particles, | ▶ 00:36 |
all of which have a slightly different xy coordinate and a slightly different heading outcome. | ▶ 00:40 |
These particles together comprise our estimation after the motion command over here. | ▶ 00:45 |
So, a single particle over here, if drawn multiple times, | ▶ 00:50 |
gives a set of particles like the ones over here. | ▶ 00:53 |
They're kind of hard to see at this point, | ▶ 00:56 |
but you can imagine by varying v and ω with a little bit of noise that we | ▶ 00:58 |
add or subtract from these values, | ▶ 01:03 |
we will get slightly different predictions where the robot might be | ▶ 01:05 |
and as a result get a particle cloud like this one over here. | ▶ 01:08 |
That's really important. | ▶ 01:11 |
We just implemented the prediction step of a particle filter in a real robotics example. | ▶ 01:13 |
This is exactly what's happening when we drive our Google self-driving car and our Stanford car. | ▶ 01:18 |
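Here is one way the prediction step could look in Python, a sketch under the assumption that we perturb v and ω with Gaussian noise (the noise magnitudes are made-up values):

    import random
    from math import sin, cos, pi

    def predict(particles, v, omega, dt, v_noise=0.5, omega_noise=0.05):
        # Move every particle with slightly different, noisy controls.
        result = []
        for (x, y, theta) in particles:
            v_n = v + random.gauss(0.0, v_noise)
            w_n = omega + random.gauss(0.0, omega_noise)
            x += v_n * dt * cos(theta)
            y += v_n * dt * sin(theta)
            theta = (theta + w_n * dt) % (2 * pi)
            result.append((x, y, theta))
        return result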
Now we have a set of predictions that might arise from a single particle, | ▶ 00:00 |
and the other important step in particle filtering is the measurement step. | ▶ 00:06 |
We need to understand at what rate will these particles survive, | ▶ 00:11 |
and that's usually done in proportion to the measurement probabilities. | ▶ 00:14 |
Let's talk about measurements. | ▶ 00:18 |
For the sake of this exercise, let's assume we only have 2 measurements. | ▶ 00:20 |
We would either see something bright or something dark. | ▶ 00:25 |
Its sensor responds differently depending on whether it's on a lane marking. | ▶ 00:29 |
Just for simplicity, let's assume we have certain locations that have lane markings, | ▶ 00:32 |
like this one over here and these over there. | ▶ 00:37 |
If a robot center is aligned with a lane marking, it should see a bright spot | ▶ 00:41 |
because lane markings tend to be bright. | ▶ 00:45 |
But if it's off the lane marking on the regular road, it should see a dark spot. | ▶ 00:48 |
Let's turn this into a probability that's called the measurement probability. | ▶ 00:51 |
The probability of seeing something bright is going to be large when it's on a lane marker, say 0.8. | ▶ 00:56 |
From that we can deduce that the probability of seeing something dark on a lane marker is 0.2. | ▶ 01:04 |
The probability of seeing something dark when off a lane marker is even higher at 0.9, | ▶ 01:10 |
and from that it follows that the probability of seeing something bright | ▶ 01:17 |
on the regular road with regular pavement is going to be 1 minus 0.9 equals 0.1. | ▶ 01:20 |
Here's my quiz for you. This is an entirely nontrivial quiz. | ▶ 01:27 |
If you get this right, you understand particle filters. | ▶ 01:32 |
Suppose we measure bright. | ▶ 01:36 |
The actual sensor told us it saw something bright underneath the robot. | ▶ 01:39 |
I'd like to know what is the importance weight of the particle over here, | ▶ 01:43 |
which we're going to call x1, and the particle over here, which I'll call x2. | ▶ 01:49 |
Tell me what's the weight w of x1 after I apply the measurement probability | ▶ 01:55 |
and the normalization that's common in particle filters. | ▶ 02:03 |
Please do the same for the particle x2 where x1 happens to be on the lane marker, | ▶ 02:07 |
and x2 happens to be off a lane marker. | ▶ 02:13 |
So please put in these two numbers over here. | ▶ 02:16 |
It'll take a while to calculate those, so please take the time. | ▶ 02:18 |
I assure you if you get those right, you really understand particle filters. | ▶ 02:22 |
As promised, the answer is nontrivial. | ▶ 00:00 |
The importance weight for x1 will be 8/27, which is the same as 0.2963, | ▶ 00:03 |
and the one for x2 will be 1/27 or 0.037. | ▶ 00:14 |
How did we get there? | ▶ 00:21 |
Let's look at the non-normalized importance weights before normalization. | ▶ 00:23 |
The guys on the lane markings will all get a 0.8. | ▶ 00:27 |
The guys off the lane markings will get a 0.1. So the three guys over here. | ▶ 00:31 |
The reason is the probability of seeing bright, which is what we saw, | ▶ 00:38 |
off a lane marker is 1 minus 0.9. That's 0.1. | ▶ 00:42 |
Now we have 3 guys that are on the lane markings and 3 off the lane markings. | ▶ 00:46 |
The total weight over here is 2.4, and the total weight over here is 0.3. | ▶ 00:51 |
Our total non-normalized weight over all particles will be 2.7, or 27 tenths. | ▶ 00:57 |
We have to really normalize the weight by dividing by 2.7. | ▶ 01:04 |
0.8 divided by 2.7 is 8/27 or this number over here. | ▶ 01:09 |
0.1 divided by 2.7 is 1/27, which is this value over here. | ▶ 01:16 |
That's how we get to these final weights. | ▶ 01:21 |
If you got this, you understand that the measurement probability affects | ▶ 01:23 |
the weight before normalization--we multiply in the measurement probability-- | ▶ 01:28 |
and you did this correctly. | ▶ 01:31 |
Afterwards we have to normalize to make sure all the weights add up to 1. | ▶ 01:33 |
So we divide by the total weight of all particles, | ▶ 01:37 |
and we get out those probabilities over here. | ▶ 01:40 |
Put differently, this particle x1 that sits on a lane marker | ▶ 01:42 |
is being regenerated in the resampling phase with a probability of 0.2963. | ▶ 01:47 |
The same is true for the 2 other particles that sit on lane markers. | ▶ 01:53 |
For the 3 particles that are off lane markers like the one x2 over here, | ▶ 01:58 |
the resampling probability is as small as 0.037. | ▶ 02:02 |
That's true for x2, but it's true for all 3 particles. | ▶ 02:08 |
In total, these probabilities add up exactly to 1. | ▶ 02:11 |
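In code, the weighting and normalization just computed by hand might look like this sketch; on_marker is an assumed helper that says whether a particle sits on a lane marking:

    def importance_weights(particles, measurement, on_marker):
        # P(bright | on marker) = 0.8, P(bright | off marker) = 0.1
        p_bright = {True: 0.8, False: 0.1}
        weights = []
        for p in particles:
            prob = p_bright[on_marker(p)]
            if measurement == 'dark':
                prob = 1.0 - prob
            weights.append(prob)
        total = sum(weights)
        return [w / total for w in weights]   # normalize so the weights add up to 1

    # 3 particles on markers, 3 off, measurement 'bright':
    print(importance_weights(range(6), 'bright', lambda p: p < 3))
    # -> roughly [0.2963, 0.2963, 0.2963, 0.037, 0.037, 0.037]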
Let's now apply the resampling where the on-lane-marker particles are being resampled for probability 0.2963, | ▶ 00:00 |
and the ones in the middle with probability 0.037. | ▶ 00:08 |
Let's draw with replacement 6 new particles. | ▶ 00:11 |
A typical outcome will be we draw this one over here twice, | ▶ 00:14 |
this one down here twice, and this one over here once. | ▶ 00:18 |
Perhaps we draw once over here. | ▶ 00:20 |
Clearly we draw the particles that sit on the lane markings much more frequently | ▶ 00:22 |
than the ones that sit off the lane markings for a total of 6 new particles. | ▶ 00:26 |
We now apply our resampling method whereby we draw twice from this particle over here. | ▶ 00:31 |
That might give us something over there and over here, given that we add a small amount of noise. | ▶ 00:37 |
The guy over here will be resampled to something over there. | ▶ 00:43 |
Same with this guy over here, and this guy might find itself over here. | ▶ 00:46 |
The set over here of 6 particles in total, will now be the new posterior. | ▶ 00:49 |
As you can see, this posterior is more consistent with the lane marking observation | ▶ 00:53 |
than the one of not being on a lane marking by virtue of the fact that | ▶ 00:58 |
we saw a bright measurement before. | ▶ 01:01 |
Now we just repeat. We look at the next measurement. We weight particles accordingly. | ▶ 01:03 |
We resample. We do forward prediction. | ▶ 01:08 |
That's the basic particle filter algorithm. | ▶ 01:11 |
Look at the measurement, compute weights, sample, and predict | ▶ 01:14 |
where the prediction has a certain amount randomness. | ▶ 01:17 |
If you get that loop implemented, you've implemented an amazing algorithm | ▶ 01:20 |
that's exactly what has enabled many of my robots to localize themselves. | ▶ 01:24 |
Obviously they have more than just 1 pixel sensor that measures bright and dark. | ▶ 01:29 |
They might take an entire road image and use the road image to complete the measurement probability, | ▶ 01:33 |
but the basic mechanics is exactly the same as shown over here. | ▶ 01:38 |
So let me ask you, did you actually understand this? Yes or no? | ▶ 01:42 |
I just hope you answered "yes." | ▶ 00:00 |
If you answered "no," please go through the same sequence again, | ▶ 00:02 |
because the steps end up being relatively straightforward | ▶ 00:06 |
even though they're somewhat uncommon. | ▶ 00:09 |
But if you understand it, you can now go and implement particle filters | ▶ 00:12 |
for a great range of robotics applications. | ▶ 00:15 |
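As one possible summary, here is the whole loop in sketch form, reusing the predict and importance_weights sketches from above; on_marker and the control values are assumed inputs:

    import random

    def resample(particles, weights):
        # Draw len(particles) new particles with replacement,
        # each with probability proportional to its weight.
        return random.choices(particles, weights=weights, k=len(particles))

    def particle_filter_step(particles, v, omega, dt, measurement, on_marker):
        particles = predict(particles, v, omega, dt)                     # motion with noise
        weights = importance_weights(particles, measurement, on_marker)  # measurement update
        return resample(particles, weights)                              # resampling step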
Let's talk a bit about planning. | ▶ 00:00 |
One of the key problems is that these robots have to decide what to do next. | ▶ 00:03 |
I'll address the planning problem at multiple levels of abstraction. | ▶ 00:07 |
The easiest happens to be to look at a city level of abstraction. | ▶ 00:12 |
Suppose we have a road like this over here, | ▶ 00:17 |
and my car is located down here in the beginning. | ▶ 00:20 |
I wish to get to this point up here. | ▶ 00:23 |
Let me draw in an abstraction of the state space, | ▶ 00:25 |
and we just draw the states as shown with these red lines over here. | ▶ 00:29 |
So you see a maze with lots of discrete states. | ▶ 00:34 |
I'm going to ask you, given that traversing from red cell to red cell costs you 1, | ▶ 00:38 |
or -1 using the definition from the MDP lecture before. | ▶ 00:42 |
Suppose the goal state has a value of 100. | ▶ 00:47 |
What's the value of the start state assuming deterministic actions? | ▶ 00:50 |
You probably got it right. It's 86. | ▶ 00:00 |
The reason is that the value counts down from the goal: 100, 99, 98, 97, and so on. | ▶ 00:03 |
It turns out the start state is 14 steps away from the goal, | ▶ 00:08 |
so 100 minus 14 is 86. | ▶ 00:11 |
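A small sketch of this value computation for deterministic actions: because every step costs 1, the values can be filled in by a breadth-first sweep outward from the goal (grid layout, function, and variable names are mine):

    from collections import deque

    def grid_values(goal, walls, width, height, goal_value=100):
        # value(cell) = goal_value minus the shortest number of steps to the goal
        values = {goal: goal_value}
        frontier = deque([goal])
        while frontier:
            x, y = frontier.popleft()
            for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
                if nxt in walls or nxt in values:
                    continue
                if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                    continue
                values[nxt] = values[(x, y)] - 1
                frontier.append(nxt)
        return values

    # A start cell 14 steps from the goal comes out as 100 - 14 = 86.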
The exact same algorithm works beautifully for planning the shortest path | ▶ 00:00 |
to a single mission goal from any possible start location, | ▶ 00:07 |
and the only difference here is in this graph over here of an actual road graph | ▶ 00:11 |
we are also incorporating the heading direction into the measure of distance. | ▶ 00:17 |
Green corresponds to nearby, that is to large values; red to far away. | ▶ 00:21 |
The reason why the area below the mission goal is green is because we expect | ▶ 00:27 |
the car to point up, to point north, at the time it reaches the mission goal. | ▶ 00:31 |
So if it came from the north, it would point in the wrong direction. | ▶ 00:35 |
The state space is augmented correspondingly. | ▶ 00:38 |
Whereas if it comes from over here, it points in the correct direction. | ▶ 00:41 |
If you look at the circle over here, it's interesting. | ▶ 00:45 |
If we came from the left side over here, it could do a right turn, | ▶ 00:47 |
but over here it's forced on this one-way circle to do the entire loop to go around, | ▶ 00:50 |
and that increases the distance as it comes over here. | ▶ 00:55 |
This is value iteration applied to the road graph where we keep track of heading | ▶ 00:58 |
and where the circle over here is a one-way circle. | ▶ 01:03 |
Let's look at dynamic programming again. | ▶ 00:00 |
Specifically, let's look at an environment that has a loop. | ▶ 00:03 |
Here's the environment. | ▶ 00:06 |
The environment possesses 14 states, | ▶ 00:08 |
and here is the loop as indicated, and there is a big intersection in the middle over here. | ▶ 00:11 |
Let's assume this is our start state in the very south, and we're facing north. | ▶ 00:14 |
We wish to reach the goal state in the west, and there we will be facing west. | ▶ 00:20 |
Obviously, there are two ways to get to the goal. | ▶ 00:25 |
We can go north 3 steps and then turn left to the goal, | ▶ 00:27 |
or we can take the entire loop over here, avoid left turns, | ▶ 00:30 |
but eventually find ourselves at the goal as well after more steps. | ▶ 00:34 |
Let's assume there are different costs attached. | ▶ 00:37 |
The cost of motion is -1 per red cell. | ▶ 00:39 |
The cost of right turns is -2. | ▶ 00:43 |
Why would we penalize right turns? | ▶ 00:47 |
Well, as you turn right, you might have to yield for pedestrians. | ▶ 00:49 |
That might cost you some time. | ▶ 00:52 |
Let's assume that, in expectation, -2 accounts for that time relative to the motion cost of -1. | ▶ 00:54 |
What I'd like to ask you now is a tricky question. | ▶ 00:59 |
What is the max cost of a left turn | ▶ 01:01 |
such that we avoid left turns altogether and would much rather take the loop over here? | ▶ 01:04 |
I want the solution where you turn left to be distinctly more expensive, or more negative, | ▶ 01:10 |
than the solution where you turn right. | ▶ 01:16 |
When I say "max", we're dealing with negative numbers. | ▶ 01:18 |
So if you were to look at positive numbers, what's the minimum cost of a left turn? | ▶ 01:21 |
But I want you to enter the negative number over here. | ▶ 01:25 |
[Thrun] And the answer is -15. | ▶ 00:00 |
If you look at the pure motion cost for the short route, there are 6 steps. | ▶ 00:03 |
So if we wanted to turn left, | ▶ 00:08 |
we'd pay a motion penalty of -6 plus the cost of one left turn. | ▶ 00:10 |
The longer route is 14 steps, or -14, and we add a penalty of -6 for 3 right turns. | ▶ 00:12 |
Each is -2. That gives us -20. | ▶ 00:19 |
If we penalize left turns with -15, then the total will be -21, | ▶ 00:23 |
which is smaller or higher cost, so to speak, than the alternative route | ▶ 00:29 |
that we wish to favor. | ▶ 00:36 |
For anything larger than -15, we either have a tie or we just take the shortcut, | ▶ 00:38 |
so that's the correct number over here. | ▶ 00:44 |
It was a really nontrivial question. I'm really proud if you got this right. | ▶ 00:46 |
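A quick check of that arithmetic, under the stated costs (-1 per cell, -2 per right turn, and a candidate left-turn cost):

    def route_costs(left_turn_cost):
        short_route = 6 * -1 + left_turn_cost   # 6 cells plus one left turn
        loop_route = 14 * -1 + 3 * -2           # 14 cells plus 3 right turns = -20
        return short_route, loop_route

    print(route_costs(-15))   # (-21, -20): the loop is now strictly preferred
    print(route_costs(-14))   # (-20, -20): a tie, so -15 is the maximum that works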
[Thrun] So let me give you some examples of this method in action. | ▶ 00:00 |
Here we have an actual planning technique that uses dynamic programming | ▶ 00:03 |
and understands how far things are away. | ▶ 00:08 |
And on top of it, it also considers local rollouts to avoid local obstacles. | ▶ 00:10 |
These local rollouts are continuous trajectories. | ▶ 00:15 |
They are varied by discrete decisions, | ▶ 00:18 |
like whether to change the lane, and by various small discrete nudges around obstacles | ▶ 00:20 |
so we can avoid obstacles. | ▶ 00:26 |
And in rolling out to a certain horizon, like up to here, | ▶ 00:29 |
and then connecting to the dynamic programming value, | ▶ 00:32 |
we can calculate in actual traffic situations what is the cost of going a certain path. | ▶ 00:34 |
Here is an attempt to turn right. | ▶ 00:41 |
You can see the vehicle approaching a stop sign. | ▶ 00:43 |
There is an entire maze of streets. | ▶ 00:46 |
The best way to go right and then into the left lane is to take the right turn | ▶ 00:48 |
and then initiate a lane shift, which is happening right now, | ▶ 00:54 |
to reach a target location that is indicated by this big orange circle | ▶ 00:58 |
that it's crossing right now. | ▶ 01:02 |
If we increase the cost of a lane shift to a much larger value, | ▶ 01:04 |
the answer that emerges is really interesting. | ▶ 01:09 |
It doesn't turn right because of the cost of the subsequent lane shift. | ▶ 01:13 |
Instead this vehicle goes straight, | ▶ 01:16 |
takes a left turn, which happens to be much cheaper than the lane shift. | ▶ 01:19 |
It then follows this left turn, takes another left turn, | ▶ 01:27 |
and eventually takes a third left turn just to get to the left lane. | ▶ 01:38 |
And if you look very carefully, you can now reach the goal location | ▶ 01:44 |
without a lane change maneuver which would have much higher cost. | ▶ 01:49 |
So here it is now in the left lane, and without lane-changing maneuver | ▶ 01:57 |
it manages to reach the goal. | ▶ 02:01 |
This illustrates that dynamic programming in the context of controlling actual cars | ▶ 02:03 |
has a big role to play. | ▶ 02:07 |
Here is the same idea applied to a passing maneuver in normal driving. | ▶ 02:10 |
You see our vehicle following another vehicle. | ▶ 02:13 |
This is actual data in preparation for the Urban Challenge. | ▶ 02:16 |
Now we placed an abandoned vehicle on the left lane, | ▶ 02:19 |
and you can see how trajectories are being generated that incorporate dynamic obstacles | ▶ 02:23 |
by virtue of those rollouts and a dynamic programming value function, | ▶ 02:27 |
shown in the background from green to red, to find the optimal actions. | ▶ 02:31 |
This method has really enabled us to navigate complicated situations with self-driving cars. | ▶ 02:36 |
[Thrun] The last example I want to talk about in this lecture | ▶ 00:00 |
is related to general purpose path planning | ▶ 00:04 |
where we don't have a road network. | ▶ 00:06 |
Here is an example of where this occurs. | ▶ 00:09 |
This is an example of where a blockage occurs. | ▶ 00:11 |
None of the preplanned paths are drivable by our robot, | ▶ 00:15 |
so it has to, after a certain timeout here--20 seconds-- | ▶ 00:20 |
find itself a path anywhere in the environment. | ▶ 00:23 |
In fact, our Urban Challenge car did just this. | ▶ 00:27 |
We don't do this today in traffic. It's a little bit dangerous. | ▶ 00:31 |
But for the Urban Challenge it was perfectly doable, and it was safe. | ▶ 00:35 |
So this car found a route that was outside any pre-given corridor. | ▶ 00:38 |
Here is an even more challenging example | ▶ 00:45 |
where our robot Junior approaches a complete road blockage, | ▶ 00:47 |
but its target location is behind the road blockage. | ▶ 00:52 |
You can see that none of the paths can possibly make it there | ▶ 00:56 |
and the only correct action is to turn around and pick a different road | ▶ 00:59 |
to finally approach the goal location from the opposite side. | ▶ 01:03 |
You can see an attempted lane shift to the opposite lane doesn't function either, | ▶ 01:07 |
and there are timeouts associated with all of those. | ▶ 01:12 |
Eventually, using a general purpose planner | ▶ 01:15 |
of the type that Peter talked about in his class | ▶ 01:18 |
it finds what ends up being a really complicated multi-turn turnaround | ▶ 01:21 |
where the car backs into a driveway a little bit, as you can see over here, | ▶ 01:29 |
and it is all planned completely dynamically without any preconception | ▶ 01:33 |
of how such a multi-point U-turn should look. | ▶ 01:38 |
Then it goes forward, then it goes backward, and does so multiple times | ▶ 01:41 |
until it finally has turned around. | ▶ 01:46 |
It's not particularly efficient or elegant, but it's very, very safe. | ▶ 01:48 |
This vehicle will eventually be able to drive in a different direction | ▶ 01:53 |
and reach the goal point behind the blockage. | ▶ 01:56 |
That was one of the tasks DARPA gave us. | ▶ 01:59 |
So you can see it do its job until it finally breaks free | ▶ 02:01 |
and is able to navigate around this blockage onto a different street, as shown over here. | ▶ 02:05 |
[Thrun] So let's talk about robot path planning or robot motion planning, | ▶ 00:00 |
which is a rich field in itself, and I can't give you a complete survey | ▶ 00:04 |
of all the algorithms involved. | ▶ 00:08 |
But one of the key differences to the planning algorithms we talked about before | ▶ 00:10 |
is that now the world is continuous. | ▶ 00:14 |
For example, we learned about A* in which we discretize the world. | ▶ 00:17 |
We might have a goal location, we might have obstacles, | ▶ 00:21 |
and then A*, a valid action sequence, might look like this. | ▶ 00:24 |
And even though this is a valid solution to the planning problem, | ▶ 00:28 |
a car can't really follow these discrete choices. | ▶ 00:32 |
There are a number of very sharp turns over here that are just irreconcilable | ▶ 02:35 |
with the motion of a car. | ▶ 00:39 |
So the fundamental problem here is A* is discrete, | ▶ 00:42 |
whereas the robotic world is continuous. | ▶ 00:45 |
So the question arises, is there a version of A* that can deal with the continuous nature | ▶ 00:48 |
and give us provably executable paths? | ▶ 00:52 |
This is a big, big question in robot motion planning. | ▶ 00:56 |
Let me just discuss it for this one example | ▶ 00:59 |
and show you what we've done to solve this problem in the DARPA Urban Challenge. | ▶ 01:02 |
The key to solving this with A* has to do with the state transition function. | ▶ 01:07 |
Suppose we have a cell like this and we apply a sequence of very small step simulations | ▶ 01:12 |
using our continuous math from before. | ▶ 01:17 |
Then a state over here might find itself right here in the corner of the next discrete state. | ▶ 01:20 |
Instead of assigning this just to the grid cell, | ▶ 01:27 |
in the algorithm called hybrid A*, it memorizes the exact x prime, y prime, | ▶ 01:29 |
and theta prime and associates it with this grid cell over here | ▶ 01:34 |
the first time the grid cell is expanded. | ▶ 01:38 |
Then when expanding from this cell it uses a specific starting point over here | ▶ 01:40 |
to figure out what the next cell might be. | ▶ 01:45 |
It might happen that the same cell is reached again in A*, maybe from over here, | ▶ 01:47 |
leading to a different set of continuous values of x, y, and theta, | ▶ 01:51 |
but because in A* we tend to expand cells along the shortest path | ▶ 01:55 |
before we look at the longer paths, we now just cut this off | ▶ 01:59 |
and never consider the state over here again. | ▶ 02:03 |
This leads to a lack of completeness, | ▶ 02:06 |
which means there might be solutions to the navigation problem | ▶ 02:09 |
that this algorithm doesn't capture, | ▶ 02:12 |
but it does give us correctness. | ▶ 02:14 |
So as long as our motion equations are correct, the resulting paths can be executed. | ▶ 02:16 |
Now here is a caveat. | ▶ 02:21 |
This is an approximation and is only correct to the extent | ▶ 02:23 |
that these motion equations are correct--which they are not exactly. | ▶ 02:26 |
But nevertheless, our paths that come out are nice, smooth, and curved paths, | ▶ 02:28 |
and every time we expand a grid cell | ▶ 02:34 |
we memorize explicitly the continuous values of x prime, y prime, | ▶ 02:36 |
and theta prime with this grid cell. | ▶ 02:40 |
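Here is a very small sketch of that idea, not the actual Urban Challenge code: grid cells are used only for bookkeeping, and the first continuous (x, y, theta) that reaches a cell is the one remembered and expanded; cost_fn, heuristic, and controls are assumed to be supplied by the caller:

    import heapq
    from math import sin, cos

    def hybrid_a_star(start, goal_cell, cost_fn, heuristic, controls, dt=0.5):
        frontier = [(heuristic(start), 0.0, start)]
        expanded = {}                      # grid cell -> continuous state kept for it
        while frontier:
            f, g, (x, y, theta) = heapq.heappop(frontier)
            cell = (int(x), int(y))        # simple truncation to a grid cell
            if cell in expanded:
                continue                   # cell already expanded: prune (loss of completeness)
            expanded[cell] = (x, y, theta)
            if cell == goal_cell:
                return expanded            # path reconstruction omitted for brevity
            for v, omega in controls:      # e.g. steer left / straight / right
                nx = x + v * dt * cos(theta)
                ny = y + v * dt * sin(theta)
                nt = theta + omega * dt
                ng = g + cost_fn((x, y, theta), (nx, ny, nt))
                heapq.heappush(frontier, (ng + heuristic((nx, ny, nt)), ng, (nx, ny, nt)))
        return None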
[Thrun] Now here is an actual result of applying this A* algorithm | ▶ 00:00 |
for our vehicle that sits over here. | ▶ 00:03 |
Real obstacles--these are laser scans of parked cars-- | ▶ 00:05 |
and a target location over here. | ▶ 00:09 |
And while the curve isn't super smooth, | ▶ 00:11 |
you can still see it is able to find a continuous and drivable curve | ▶ 00:14 |
to the parking location over here | ▶ 00:18 |
by this small but important modification of A*. | ▶ 00:20 |
There are a few other modifications of A* which I can't go into detail, | ▶ 00:24 |
but here you can see a typical attempt of a robot to navigate a parking lot | ▶ 00:30 |
here in simulation. | ▶ 00:35 |
You can see the tree that is being expanded in that search. | ▶ 00:37 |
And every time it gets stuck, it does a new A* search. | ▶ 00:43 |
You can see how the map is being acquired as the robot moves. | ▶ 00:47 |
In the search tree that's in front of the robot, it not only considers the x, y, and heading direction | ▶ 00:51 |
but also allows the robot to go forward and backwards, | ▶ 00:56 |
and driving backwards is just a different state than going forwards. | ▶ 00:59 |
Now you can see how it backs up and finds a new path in an incomplete maze | ▶ 01:03 |
until it finally is able to reach the goal location through an actual opening. | ▶ 01:08 |
We made this maze really hard to test our algorithms. | ▶ 01:13 |
The nice thing is these algorithms work almost real time. | ▶ 01:17 |
It takes less than a tenth of a second to build this entire search tree, | ▶ 01:20 |
and the robot is able to navigate this parking lot really, really efficiently. | ▶ 01:25 |
This was one of the fastest motion planning algorithms that I saw | ▶ 01:30 |
in the DARPA Urban Challenge. | ▶ 01:34 |
In fact, in all of robotics it's been one of the fastest algorithms | ▶ 01:36 |
I've personally seen in my life. | ▶ 01:39 |
Here is the same algorithm applied to an actual parking example using our robot Junior. | ▶ 01:42 |
It's driving over here, it wishes to get over there, | ▶ 01:49 |
and you can see it has backed up into a parking gap over here, | ▶ 01:53 |
which is an amazing precision for a robot, and then moved forward along the line over here. | ▶ 01:57 |
Our state space is, I guess, 4-dimensional. | ▶ 02:04 |
It comprises x, y, heading direction, and whether the car is going forward or backwards. | ▶ 02:08 |
There is a cost to changing directions, so it doesn't change direction too often. | ▶ 02:13 |
You can see it navigate to its target location. | ▶ 02:17 |
Details I am not telling you include that the trajectory that the planner generates | ▶ 02:20 |
is subsequently smoothed using a quadratic smoother | ▶ 02:25 |
so that we get rid of the kinks, | ▶ 02:29 |
and the car drives much nicer as a result. | ▶ 02:31 |
But the workhorse here that does all the work to find the best path | ▶ 02:34 |
is actually A* modified into hybrid A*, as I told you. | ▶ 02:38 |
And in this final video we see the car navigating a parking lot with lots of traffic cones. | ▶ 02:46 |
On the left you see the video imagery, on the right side you can see the internal map | ▶ 02:52 |
and the path planner, | ▶ 02:57 |
and it attempts to park itself in the designated spot on the left. | ▶ 02:59 |
[Thrun] So this finishes my short lecture on robotics. | ▶ 00:00 |
Obviously the field is much, much bigger than what I just showed you, | ▶ 00:04 |
but I gave you examples of the key elements. | ▶ 00:07 |
I gave you an example of perception using particle filters. | ▶ 00:09 |
I gave you an example of planning using MDPs and also A*. | ▶ 00:12 |
These are some of the key methods we've been applying to self-driving cars. | ▶ 00:18 |
There are many other methods. | ▶ 00:21 |
Most notably, there's also reinforcement learning | ▶ 00:24 |
that has recently received a lot of attention. | ▶ 00:26 |
But I don't have time to talk about those. | ▶ 00:28 |
I hope you are able to apply these methods yourself to pretty much any robotics problems | ▶ 00:31 |
that you might be working on. | ▶ 00:36 |
Robotics is really a fascinating area. | ▶ 00:39 |
There's a lot of things to learn-- | ▶ 00:42 |
far more than I can offer in this single class. | ▶ 00:44 |
But what you should have noticed is there's a really strong interplay | ▶ 00:46 |
between artificial intelligence and the methods I showed you before | ▶ 00:49 |
and what we are doing, for example, to make cars drive themselves. | ▶ 00:53 |
Now, in the next class we'll talk about a topic that's equally important, | ▶ 00:57 |
which is natural language processing. | ▶ 01:01 |
So whereas today you learned a little bit about how to build self-driving cars, | ▶ 01:04 |
next lecture you might actually learn how to build the next Google | ▶ 01:08 |
when Peter Norvig tells you all about natural language processing. | ▶ 01:11 |
So please come and see us again when the next class comes up. | ▶ 01:15 |
So, this question I want to test your knowledge about | ▶ 00:00 |
the dimension of the state space of a dynamic system. | ▶ 00:03 |
In all these questions, I'm going to look at a ball, like a soccer ball. | ▶ 00:06 |
The interesting thing about a soccer ball is its orientation | ▶ 00:10 |
is an important variable, whether it's upside down and so on. | ▶ 00:14 |
So, in all of these questions we're going to explore this, | ▶ 00:18 |
and the fact that this object is rotationally invariant. | ▶ 00:21 |
Let me start with a simpler question, | ▶ 00:24 |
which is the kinematic state, which is the state without any velocities | ▶ 00:26 |
of the soccer ball, where it is on the ground. | ▶ 00:30 |
And I'll follow up with a question with the same kinematic state | ▶ 00:34 |
of the ball in mid-air, | ▶ 00:36 |
the difference being between these 2 questions that on the ground | ▶ 00:39 |
it's confined to be in a 2-dimensional ground plane, | ▶ 00:42 |
whereas in mid-air, you might add another dimension or not. | ▶ 00:46 |
There's also the dynamic state on the ground and in mid-air. | ▶ 00:50 |
And for the dynamic state, I wish to ignore things like spin. | ▶ 00:54 |
I just care about velocities as a person really far away | ▶ 00:58 |
could observe them, just to make things clear. | ▶ 01:02 |
And finally, I'd like to include spin, | ▶ 01:05 |
so let me take the most complicated situation | ▶ 01:07 |
of dynamic state in mid-air considering spin. | ▶ 01:11 |
The last one is really a tricky question, | ▶ 01:14 |
so I don't mind at all if you get this wrong. | ▶ 01:17 |
But in all of those, I would like to exploit the fact | ▶ 01:19 |
that I really don't care about the absolute orientation of the soccer ball that is here, | ▶ 01:21 |
so it's invariant to its orientation, but it might still have spin. | ▶ 01:26 |
And the answer is 2, 3, 4, 6, and 9. | ▶ 00:00 |
On the ground, the static state without velocity is just X and Y. | ▶ 00:06 |
That's 2. | ▶ 00:10 |
If we add mid-air, we have height, which adds a third dimension, 3. | ▶ 00:12 |
If we add the dynamic state on the ground, | ▶ 00:17 |
which is the change of X and the change of Y over time, that's 4 in total | ▶ 00:19 |
together with the original X and Y. | ▶ 00:24 |
Same for mid-air: double 3 to get 6. | ▶ 00:26 |
And the last one is the tricky one. | ▶ 00:30 |
Clearly, a helicopter in mid-air | ▶ 00:33 |
that looks at rotational velocities would have 12 dimensions. | ▶ 00:36 |
But again, I don't care about the absolute coordinates | ▶ 00:40 |
of its yaw, roll, and pitch; | ▶ 00:43 |
the ball is invariant to those. | ▶ 00:45 |
The spin variables are 3: the change of roll, of pitch, and of yaw over time. | ▶ 00:47 |
If we add those to the dynamic state in mid-air, | ▶ 00:51 |
we get 9 and not 12. | ▶ 00:54 |
Once again, these are X, Y, and Z; the changes of X, Y, and Z over time; | ▶ 00:56 |
and the velocities in the different rotational directions, | ▶ 01:02 |
which make a total of 9. | ▶ 01:07 |
This will be a dynamic programming question for a robot | ▶ 00:00 |
with 3 coordinates, X, Y and theta, | ▶ 00:03 |
even though in this diagram I just show 2. | ▶ 00:07 |
Suppose our target location is in the top right corner | ▶ 00:09 |
facing east as shown over here. | ▶ 00:13 |
Initially, the robot's location is in the bottom left corner facing north. | ▶ 00:16 |
Suppose our goal is worth 100. | ▶ 00:20 |
Going straight costs us -1, | ▶ 00:22 |
and we can turn on the spot, but turning on the spot | ▶ 00:25 |
costs us -5. | ▶ 00:27 |
What would be the value of the start state? | ▶ 00:29 |
And the answer is 88. | ▶ 00:00 |
It takes 7 steps to go from start to goal | ▶ 00:02 |
if we just count the go straight steps. | ▶ 00:05 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:07 |
And we have to turn once in this spot right over here, | ▶ 00:11 |
which costs an additional -5, so we pay a total of -12. | ▶ 00:14 |
That plus 100 gives us 88. | ▶ 00:18 |
Same situation as before. | ▶ 00:00 |
We'd like to go from here to here. | ▶ 00:02 |
We have a 3-dimensional state space. | ▶ 00:04 |
The goal is worth 100. | ▶ 00:06 |
A straight motion costs -1. | ▶ 00:08 |
Turning on the spot clockwise costs -10, | ▶ 00:10 |
but turning counterclockwise, | ▶ 00:13 |
which is "C-CW," costs us 0. | ▶ 00:15 |
What is now the value of the start state? | ▶ 00:18 |
Please put your answer in here. | ▶ 00:20 |
And the answer is 93, | ▶ 00:00 |
the reason being there's 7 straight steps to the goal. | ▶ 00:03 |
You can go down here or up here. | ▶ 00:06 |
And suppose we go up here and we wanted to turn clockwise, | ▶ 00:09 |
which is the one that gets us oriented towards the goal. | ▶ 00:13 |
But that would cost us -10, so instead we can turn 3 times counterclockwise. | ▶ 00:16 |
We first turn left and down and then right. | ▶ 00:21 |
And the total cost of this is 0 because each counterclockwise turn | ▶ 00:24 |
is worth 0, therefore, we just go straight to the goal, | ▶ 00:27 |
and we only pay the straight motion cost, which is -7 in total, | ▶ 00:31 |
because it's -7 for the straight penalty, | ▶ 00:35 |
and -7 plus 100 is 93. | ▶ 00:39 |
This is a particle filter question where we start with a single particle over here facing east. | ▶ 00:00 |
The particle has an X, a Y, and a heading direction, | ▶ 00:06 |
and this particle is on a checker board with black squares and white squares. | ▶ 00:11 |
Let's assume we draw 5 new particles from this particle for the motion of going right, | ▶ 00:17 |
and they end up as indicated over here. | ▶ 00:22 |
Each of these 5 new particles--1 of which falls into a2, 2 of which fall into b2, | ▶ 00:25 |
1 into c2, and 1 into b3-- | ▶ 00:32 |
each of these particles will obtain an importance weight, | ▶ 00:35 |
given that what we now measure is black. | ▶ 00:40 |
So the measurement is black. | ▶ 00:44 |
To calculate the importance weight, let me tell you that the probability of seeing black | ▶ 00:45 |
on a black square = 0.8, | ▶ 00:50 |
whereas the probability of seeing black on a white square is only 0.1. | ▶ 00:53 |
So I want you to tell me the total importance weight that falls to a2-- | ▶ 00:57 |
here we just have a single particle-- | ▶ 01:02 |
into b2--please add the importance weight of both particles-- | ▶ 01:05 |
c2, and b3. | ▶ 01:09 |
Please add your numbers over here. | ▶ 01:10 |
The answer is 1/19th for a2, c2, and b3, and 16/19th or 0.8421 for b2. | ▶ 00:00 |
To see this, let's associate the nonnormalized probability value with each of the particles. | ▶ 00:10 |
Over here, we have 0.1. Here it is 0.8, but with 2 particles, that makes 1.6. | ▶ 00:16 |
0.1 and 0.1 again. | ▶ 00:23 |
These nonnormalized importance weights add up to 1.9, | ▶ 00:25 |
so the desired result is the division of the original particle weights by 1.9, | ▶ 00:29 |
which is the value as shown on the left. | ▶ 00:35 |
In resampling and the next motion step, let's assume the following 3 particles are used | ▶ 00:00 |
while the other ones are being ignored. | ▶ 00:05 |
2 of them live in b2 and 1 in c2, and they again move right, | ▶ 00:07 |
and we get particles distributed as follows--2 fall into b3, 2 into b4, and 1 in c4. | ▶ 00:12 |
So using the same measurement probability as before, | ▶ 00:20 |
and now a measurement of a white square. | ▶ 00:23 |
Tell me what the cumulative importance weight is for each of the 3 new squares, | ▶ 00:25 |
where 2 of the particles fall into b3, which happens to be a white square, | ▶ 00:30 |
2 into b4, which happens to be a black square, | ▶ 00:35 |
and 1 into c4, which happens to be a white square again. | ▶ 00:37 |
The answer is 18/31, which is approximately 0.5806. | ▶ 00:00 |
4/31 is 0.1290, | ▶ 00:09 |
and 9/31, which is half of the first one, is 0.2903. | ▶ 00:13 |
And again, we look at the same as before. | ▶ 00:18 |
We look at the total nonnormalized measurement probabilities for our particles. | ▶ 00:20 |
In a white square, the probability of seeing white is 1 - 0.1, that is 0.9. | ▶ 00:24 |
Since we have 2 particles, the nonnormalized cumulative particle weight is 1.8. | ▶ 00:32 |
Doing the same for the black square, the probability of seeing white is 0.2, | ▶ 00:37 |
which is 1 - 0.8. | ▶ 00:44 |
We have 2 particles in the black square, | ▶ 00:46 |
so the nonnormalized total importance weight there is 0.4. | ▶ 00:52 |
And finally, the probability of seeing white in the white square is 1 - 0.1 is 0.9, | ▶ 00:57 |
and here we only have 1 particle, so the nonnormalized importance weight is 0.9. | ▶ 01:04 |
If you add those up, we get 3.1. | ▶ 01:09 |
So we have to divide all of those by 3.1. | ▶ 01:12 |
So 18/31 is the 1st one, 4/31 the 2nd, and 9/31 the 3rd, as indicated on the left. | ▶ 01:15 |
Our robot, Stanley, performed as follows in the DARPA Grand Challenge in 2005. | ▶ 00:00 |
He came in 1st, 2nd, 3rd, or 4th or below in the ranking. | ▶ 00:05 |
And oh, my God! Yes, we came in first! | ▶ 00:00 |
It was one of the most amazing events in my entire life. | ▶ 00:03 |
Our robot made it first across the finishing line. | ▶ 00:06 |
In this final question, I'm going to quiz you about our approximate motion model | ▶ 00:00 |
for this robot, which I restate. | ▶ 00:05 |
I'd like you to apply this exact motion model over here even though you might be suspicious | ▶ 00:08 |
of its accuracy. | ▶ 00:13 |
Suppose a time, t = 0, coordinates are 0, 0, and 0 for all 3 variables. | ▶ 00:15 |
Delta t = 4. I'll omit the units over here. | ▶ 00:23 |
v = 10 and omega = pi/8, which is 22.5 degrees. | ▶ 00:27 |
Assuming you run one of these simulations exactly every 4 time steps, | ▶ 00:34 |
I would like to know what the robot's state is after 4 of those updates, | ▶ 00:39 |
or put differently, total time of 16. | ▶ 00:43 |
So please tell me, what x will be, y, and theta. | ▶ 00:47 |
The answer surprisingly is 0, 0, 0. | ▶ 00:00 |
It just survives the way it was before. | ▶ 00:04 |
To see this, note that the robot initially faces east. | ▶ 00:06 |
Next, it'll move forward 40, and then its heading direction changes | ▶ 00:10 |
by 4 x pi/8, which is pi/2, so it's going to start pointing up. | ▶ 00:17 |
It repeats the same action 3 more times, and eventually arrives at the original location | ▶ 00:21 |
and points right again. | ▶ 00:26 |
So this is a square motion. The result is the exact same state the robot started out with. | ▶ 00:28 |
Now in reality, if we didn't use these approximate equations over here, the robot would move on a circle, | ▶ 00:35 |
but even in a circle, it would arrive back at the original location with those | ▶ 00:41 |
parameters shown over here. | ▶ 00:44 |
So the fact that our simulation simulates a square | ▶ 00:46 |
doesn't affect the end result of this question, | ▶ 00:49 |
and I didn't even ask about the circle. | ▶ 00:51 |
I just asked about applying those equations over here, so 0, 0, 0 is the correct answer. | ▶ 00:53 |
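Running those equations four times in Python confirms the answer, apart from tiny floating-point rounding:

    from math import sin, cos, pi

    x, y, theta = 0.0, 0.0, 0.0
    v, omega, dt = 10, pi / 8, 4
    for _ in range(4):                       # four updates, each covering delta t = 4
        x += v * dt * cos(theta)
        y += v * dt * sin(theta)
        theta = (theta + omega * dt) % (2 * pi)
    print(round(x, 6), round(y, 6), round(theta, 6))   # 0.0 0.0 0.0 (up to rounding)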
Welcome back. We're down to our final main unit. | ▶ 00:00 |
This one is on natural language processing-- | ▶ 00:04 |
that is, understand natural languages like English or German or French | ▶ 00:06 |
and figuring out what to do with them. | ▶ 00:11 |
Now, this is a very interesting topic for three reasons. | ▶ 00:13 |
One is a philosophical one--we as humans have defined ourselves much in terms | ▶ 00:16 |
of our ability to speak with each other and understand each other. | ▶ 00:21 |
This ability to use language is something that we feel sets us apart | ▶ 00:26 |
from all the other animals and from the other machines. | ▶ 00:30 |
Second is in terms of applications. | ▶ 00:34 |
We really would like to be able to talk to our computers and use them for various things. | ▶ 00:36 |
Sure there are occasions where clicking with a mouse is the right thing to do, | ▶ 00:43 |
but talking is natural, and we want to be able to communicate with our machines. | ▶ 00:48 |
Then third is in terms of learning. | ▶ 00:53 |
We want our computers to be smarter, | ▶ 00:55 |
and much of human knowledge is written down in terms of paragraphs and sentences of text, | ▶ 00:58 |
and not in terms of formal databases or formal procedures written in code. | ▶ 01:04 |
It's all in text, and if we want our computers to be smart, | ▶ 01:11 |
they'd better be able to read that text and make sense of it. | ▶ 01:14 |
That's what this lesson is all about. | ▶ 01:16 |
We'll start by talking about language models. | ▶ 00:00 |
Historically, there have been two types of models that have been popular | ▶ 00:03 |
for natural language understanding within AI. | ▶ 00:07 |
One of the types of models has to do with sequences of letters or words. | ▶ 00:10 |
These types of models tend to be probabilistic | ▶ 00:16 |
in that we're talking about the probability of a sequence, | ▶ 00:20 |
word based in that mostly what we're dealing with is the surface words themselves, | ▶ 00:24 |
and sometimes letters. | ▶ 00:30 |
But we're dealing with the actual data of what we see | ▶ 00:33 |
rather than some underlying abstractions, | ▶ 00:37 |
and these models are primarily learned from data. | ▶ 00:39 |
Now, in contrast to that is another type of model that you might have seen before, | ▶ 00:44 |
where we're primarily dealing with trees and with abstract structures. | ▶ 00:50 |
So we say we can have a sentence, which is composed of a noun phrase and a verb phrase, | ▶ 00:54 |
and a noun phrase might be a person's name, and that might be "Sam." | ▶ 01:01 |
And the verb phrase might be a verb and we might say "Sam slept"-- | ▶ 01:07 |
a very simple sentence. | ▶ 01:14 |
Now, these types of models have different properties. | ▶ 01:16 |
For one, they tend to be logical rather than probabilistic-- | ▶ 01:20 |
that is whereas on this side, we're talking about the probability of a sequence of words, | ▶ 01:25 |
on this side we're talking about a set of sentences and that set defines the language, | ▶ 01:32 |
and a sentence is either in the language or not. | ▶ 01:40 |
It's a Boolean logical distinction rather than on this side a probabilistic distinction. | ▶ 01:44 |
These models are based on abstraction such as trees and categories-- | ▶ 01:50 |
categories like noun phrase and verb phrase and tree structures like this | ▶ 01:57 |
that don't actually occur in the surface form, so the words that we can observe. | ▶ 02:02 |
An agent can observe the words "Sam" and "slept," | ▶ 02:08 |
but an agent can't directly observe the fact that slept is a verb or that it's part of this tree structure. | ▶ 02:12 |
Traditionally, these types of approaches have been primarily hand-coded. | ▶ 02:19 |
That is, rather than learning this type of structure from data, | ▶ 02:25 |
we learn it by going out and having linguists and other experts write down these rules. | ▶ 02:29 |
Now, these distinctions are not clear-cut. | ▶ 02:35 |
You could have trees and have a probabilistic model of them. | ▶ 02:39 |
You could learn trees. | ▶ 02:45 |
We can go back and forth, but traditionally the two camps have divided up in this way. | ▶ 02:48 |
Now, we've seen probabilistic word models before. | ▶ 00:00 |
If you remember when we were doing machine learning, | ▶ 00:03 |
we talked about the bag-of-words model. | ▶ 00:06 |
What I'm showing you now is a copy of a bumper sticker that my friend | ▶ 00:09 |
Othar Hansson, who is one of the main engineers on the Google search team, came up with. | ▶ 00:13 |
The bumper sticker, of course, says "Honk if you love the bag-of-words model," | ▶ 00:18 |
but it says that in a way where the words are a bag rather than a sequence. | ▶ 00:23 |
This just kind of indicates the power of the model-- | ▶ 00:28 |
that it gets the idea across while losing all notion of sequence, | ▶ 00:31 |
and thus making the probabilistic model simpler to deal with. | ▶ 00:35 |
But we can move on from a bag-of-words model, which we can think of as a unigram-- | ▶ 00:39 |
sometimes also called a naive Bayes model, | ▶ 00:45 |
because every individual word is treated as a separate factor that's unrelated | ▶ 00:48 |
or unconditionally independent of all the other words. | ▶ 00:54 |
We can move beyond those types of models to ones where we do take sequence into account. | ▶ 01:01 |
What we want then is a probabilistic model P over a word sequence, | ▶ 00:00 |
and we can write that sequence word 1, word 2, word 3, all the way up to word n, | ▶ 00:06 |
and we can use an abbreviation for that and write that the sequence of | ▶ 00:14 |
words 1 through n, using the colon. | ▶ 00:19 |
Now the next step is to say we can factor this and take these individual variables | ▶ 00:23 |
write that in terms of conditional probabilities. | ▶ 00:29 |
So, this probability is equal to the product over all i of the probability of word i | ▶ 00:33 |
given all the previous words-- | ▶ 00:43 |
that is, from word 1 up to word i - 1. | ▶ 00:46 |
This is just the definition of conditional probability-- | ▶ 00:51 |
the joint probability of a set of variables can be factored out as the conditional probability | ▶ 00:55 |
of one variable given all the others, | ▶ 01:02 |
and then we can recursively do that until we've got all the variables accounted for. | ▶ 01:05 |
We can make the Markov assumption | ▶ 01:09 |
and that's the assumption that the effect of one variable on another will be local. | ▶ 01:12 |
That is, if we're looking at the nth word, the words that are relevant to that | ▶ 01:17 |
are the ones that have occurred recently and not the ones that occurred a long time ago. | ▶ 00:21 |
What the Markov assumption means is that the probability of a word i, | ▶ 01:25 |
given all the words from 1 up to word i minus 1, | ▶ 01:32 |
we can assume that that's equal or approximately equal to the probability | ▶ 01:38 |
of the word given only the words from i minus k up to i minus 1. | ▶ 01:45 |
Instead of going all the way back to number 1, we only go back k steps. | ▶ 01:52 |
For an order-1 Markov model, that is for k equals 1, this would be equal to | ▶ 01:58 |
the probability of word i given only word i minus 1. | ▶ 02:04 |
Now, the next thing we want to add to our model is called the stationarity assumption. | ▶ 02:10 |
What that says is that the probability of each variable is going to be the same. | ▶ 02:16 |
So the probability distribution over the first word is going to be same | ▶ 02:23 |
as the probability distribution over the nth word. | ▶ 02:27 |
Another way to look at that is if I keep saying sentences, | ▶ 02:31 |
the words that show up in my sentence depend on what the surrounding words are | ▶ 02:35 |
in the sentence, but they don't depend on whether I'm on the first sentence | ▶ 02:38 |
or the second sentence or the third sentence. | ▶ 02:42 |
Stationarity assumption we can write as the probability of a word given | ▶ 02:45 |
the previous word is the same for all variables. | ▶ 02:51 |
For all values of i and j, the probability of word i given the previous word | ▶ 02:56 |
as the same as the probability of word j given the previous word. | ▶ 03:02 |
That gives us all the formalism we need to talk about these word sequence models-- | ▶ 03:06 |
probabilistic word sequence models. | ▶ 03:11 |
In practice there are many tricks. | ▶ 03:14 |
One thing we talked about before, when we were doing the spam filterings and so on, | ▶ 03:16 |
is a necessity of smoothing. | ▶ 03:21 |
That is, if we're going to learn these probabilities from counts, | ▶ 03:24 |
we go out into the world, we observe some data, | ▶ 03:27 |
we figure out how often word i occurs given word i - 1 was the previous word, | ▶ 03:31 |
we're going to find out that a lot of these counts are going to be zero | ▶ 03:38 |
or going to be some small number, and the estimates are not going to be good. | ▶ 03:41 |
And therefore we need some type of smoothing, | ▶ 03:44 |
like the Laplace smoothing that we talked about, | ▶ 03:46 |
and there are many other techniques for doing smoothing to come up with good estimates. | ▶ 03:48 |
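A compact sketch of such a model in Python: bigram counts estimated from a word list, with add-k (Laplace-style) smoothing so unseen pairs don't get probability zero; all names here are mine, not from the lecture:

    from collections import Counter

    def bigram_model(words, k=1):
        # Estimate P(word | previous word) from counts, smoothed by add-k.
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        vocab = len(unigrams)
        def prob(word, prev):
            return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab)
        return prob

    def sequence_probability(words, prob):
        # Order-1 Markov factorization: product of P(w_i | w_{i-1}).
        p = 1.0
        for prev, word in zip(words, words[1:]):
            p *= prob(word, prev)
        return p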
Another thing we can do is augment these models to say | ▶ 03:53 |
maybe we want to deal not just with words but with other data as well. | ▶ 03:57 |
We saw that in the spam-filtering model also. | ▶ 04:01 |
So there you might want to think about who the sender is, | ▶ 04:04 |
what the time of day is and so on, | ▶ 04:07 |
these auxiliary fields like in the header fields of the email messages | ▶ 04:10 |
as well as the words in the message, and that's true for other applications as well. | ▶ 04:15 |
You may want to go beyond the words and consider variables that have to do with context of the words. | ▶ 04:20 |
We may also want to have other properties of words. | ▶ 04:25 |
The great thing about just dealing with an individual word like "dog" | ▶ 04:29 |
is that it's observable in the world. | ▶ 04:33 |
We see this spoken or written text, and we can figure out what it means, | ▶ 04:36 |
and we can start making counts about it and start estimating probabilities, | ▶ 04:41 |
but we also might want to know that, say, "dog" is being used as a noun, | ▶ 04:45 |
and that's not immediately observable in the world, but it is inferable. | ▶ 04:52 |
It's a hidden variable, and we may want to try to recover these hidden variables like parts of speech. | ▶ 04:55 |
We may also want to go to bigger sequences than just individual words, | ▶ 05:01 |
so rather than treat "New York City" as three separate words, | ▶ 05:06 |
we may want a model that allows us to think of it as a single phrase. | ▶ 05:10 |
Or we may want to go smaller than that and look at a model that deals with individual letters | ▶ 05:15 |
rather than dealing with words. | ▶ 05:21 |
So these are all variations, and the type of model you choose depends on the application, | ▶ 05:23 |
but they all follow from this idea of a probabilistic model over sequences. | ▶ 05:28 |
Now, we talked about using language for learning, | ▶ 00:00 |
and this slide is demonstrating the power of language, | ▶ 00:03 |
how it has such a powerful connection to the real world that it allows us to learn things | ▶ 00:07 |
just by observing language use. | ▶ 00:14 |
What I've done here is I've gone to Google Trends and typed in two search terms-- | ▶ 00:16 |
"full moon" and "ice cream" and gotten back a graph of how popular those queries are over time. | ▶ 00:22 |
We also have a graph of the news volume for those terms over time, | ▶ 00:29 |
but that's not so interesting here. | ▶ 00:35 |
What's interesting on this side is the regularity in these patterns. | ▶ 00:37 |
This pattern allows me just by observing language to do amateur astronomy. | ▶ 00:41 |
What do I mean by that? | ▶ 00:47 |
Well, look at the curve for ice cream. | ▶ 00:49 |
Popular in the summer. Not so popular in the winter. | ▶ 00:51 |
What that means is if I wanted to figure out what the rotational period is of the earth around the sun, | ▶ 00:56 |
all I have to do is measure these peaks, and it would come out to 365 days-- | ▶ 01:02 |
a very regular performance of language speakers using the term "ice cream" in the summer | ▶ 01:08 |
repeatedly year after year. | ▶ 01:15 |
Now, there's a little bit of a blip here. | ▶ 01:18 |
What happened in this case? | ▶ 01:20 |
Well, it turns out that a manufacturer of cell phone operating systems | ▶ 01:22 |
decided to call the latest update to their operating system "Ice Cream Sandwich," | ▶ 01:28 |
and so there was a lot of searching for that when it came out. | ▶ 01:33 |
But that just lasted a few days, and then things went back to normal. | ▶ 01:37 |
Similarly for the query "full moon" in blue, we see this period here, | ▶ 01:40 |
and we can measure that period to be 28 days, | ▶ 01:46 |
so we can do amateur astronomy and figure out how the moon works as well, | ▶ 01:49 |
just by observing how people in the real world use language. | ▶ 01:54 |
What can you do with language in the world besides amateur astronomy? | ▶ 00:00 |
Well, I haven't told you yet, but I want to give you a little quiz | ▶ 00:04 |
and allow you to guess. | ▶ 00:07 |
And so for each of these applications here, I want you to tell me | ▶ 00:09 |
whether you think that language models, and specifically these types of word models, | ▶ 00:14 |
would be a major part of an implementation of that task. | ▶ 00:19 |
Almost all of these are great examples of applications that are used everyday, | ▶ 00:00 |
and are primarily based on word models. | ▶ 00:04 |
We've seen classification before for spam and other types of categories, | ▶ 00:07 |
language ID and so on. | ▶ 00:12 |
There's also the idea of clustering, where we don't have categories ahead of time. | ▶ 00:14 |
Yes, you can take news stories and classify them into, say, sports or weather, | ▶ 00:19 |
but you can also cluster them to say here's a cluster of stories | ▶ 00:24 |
that are all related about the latest topic that has maybe never occurred before. | ▶ 00:28 |
There's also input correction of various kinds such as spelling correction, | ▶ 00:33 |
sentiment analysis--taking reviews of products and trying to decide | ▶ 00:38 |
if they're favorable or unfavorable and rate products that way, | ▶ 00:42 |
information retrieval--web search is a problem that I've worked on | ▶ 00:46 |
and is primarily addressed with word models, | ▶ 00:49 |
question answer such as IBM's Watson did in playing the game of Jeopardy. | ▶ 00:52 |
They use a variety of techniques. | ▶ 00:57 |
Much of them are based around word models, but there are other techniques as well. | ▶ 01:00 |
Machine translation--we saw the example of Chinese menus and translating to English from examples. | ▶ 01:04 |
The examples are primarily dealt with by phrases and individual words | ▶ 01:11 |
and some augmentation to that as well. | ▶ 01:16 |
Speech recognition--a similar story. | ▶ 01:19 |
And then finally I threw in one that is not primarily a question for word models. | ▶ 01:21 |
Driving a car autonomously is primarily a question of perception and localization. | ▶ 01:26 |
Yes, you might want to be able to talk to the car to direct it to do something, | ▶ 01:32 |
but that wouldn't be part of the autonomous part. | ▶ 01:37 |
Now, I wanted to show you how powerful n-gram models of language are. | ▶ 00:00 |
That is, if we're only looking at word sequences, | ▶ 00:05 |
what is it that we're giving up and what are we getting? | ▶ 00:08 |
So I read in the complete works of Shakespeare into a small computer program, | ▶ 00:11 |
and then built n-gram models and sampled from that model. | ▶ 00:16 |
That is, generated random sentences that come from the probability distribution defined by that model. | ▶ 00:20 |
And here are samples from the unigram model. | ▶ 00:27 |
That is sampling from words according to frequency in the corpus of Shakespeare text, | ▶ 00:30 |
but not taking into account any relationship between adjacent words. | ▶ 00:37 |
And looking at this, it doesn't make much sense. | ▶ 00:41 |
It doesn't seem like real sentences. | ▶ 00:44 |
You can tell the vocabulary is somewhat archaic. | ▶ 00:46 |
You have words like "o'erthrown" and "thou" and "'tis" and so on, | ▶ 00:49 |
but you aren't really getting very much from this. | ▶ 00:54 |
Now we move to a bigram model, | ▶ 00:00 |
where we're sampling from the probability of a word given the previous word. | ▶ 00:02 |
Now we see a little bit of structure emerge. | ▶ 00:06 |
So you can see at the start of the sentences "I have", "hear you," "hark ye." | ▶ 00:09 |
The words seem to go together, | ▶ 00:15 |
but then as the sentences move on they ramble and don't go in any definitive direction. | ▶ 00:18 |
So the sentences are locally consistent at the level of one or two words, | ▶ 00:24 |
but that consistency doesn't go very far. | ▶ 00:28 |
Now with the trigram models, we're starting to get a little bit more structure. | ▶ 00:00 |
In fact, we get complete sentences that actually make some sense-- | ▶ 00:04 |
"I will never yield," and the exclamation "little pretty ones!" | ▶ 00:08 |
And we get sentences that are fairly coherent-- | ▶ 00:12 |
"I would learn of noble Edward's sons"-- | ▶ 00:15 |
but then break down a little bit--"what thing, avoid!" | ▶ 00:17 |
So we're getting a model that appears to be a little bit closer to actual Shakespeare | ▶ 00:20 |
but it's still incomplete. | ▶ 00:26 |
And finally this example based on 4-grams-- | ▶ 00:00 |
Now we're seeing an even longer structure. | ▶ 00:03 |
Sometimes we have this generate something that makes sense like | ▶ 00:06 |
"betwixt their two estates," | ▶ 00:10 |
and this is not something that appears in Shakespeare, | ▶ 00:13 |
but it was just generated and it made sense. | ▶ 00:16 |
Sometimes we get things that are actually quotes like "even to the frozen ridges of the alps." | ▶ 00:18 |
The model chose to duplicate something that is actually in Shakespeare. | ▶ 00:25 |
Certainly there were lots of choices it could have made. | ▶ 00:28 |
"Alps" appears four or five times. | ▶ 00:31 |
"Even" appears many, many times. | ▶ 00:33 |
"Frozen" appears multiple times. | ▶ 00:35 |
But it just happened to duplicate something that was a quotation from the original. | ▶ 00:37 |
And then there's lots of examples of sentences and phrases that make a lot of sense-- | ▶ 00:42 |
"I know my duty," and "give me some little breath," and so on. | ▶ 00:47 |
So it looks like even though all we know is a sequence of 4 words, | ▶ 00:52 |
we're still capturing quite a bit of what it means to have coherent sentences, but not everything. | ▶ 00:56 |
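For the curious, generating such samples takes only a few lines; this sketch stores, for every (n-1)-word context, the words that followed it and then samples from that list (the corpus file name is hypothetical):

    import random
    from collections import defaultdict

    def build_ngram(tokens, n=2):
        # Map each (n-1)-word context to the list of words that followed it (n >= 2).
        model = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            model[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
        return model

    def sample(model, n=2, length=20):
        # Start from a random context, then repeatedly sample a next word.
        out = list(random.choice(list(model.keys())))
        for _ in range(length):
            followers = model.get(tuple(out[-(n - 1):]))
            if not followers:
                break
            out.append(random.choice(followers))
        return ' '.join(out)

    # tokens = open('shakespeare.txt').read().split()   # hypothetical corpus file
    # print(sample(build_ngram(tokens, 3), n=3))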
Here's a little quiz--for each of these pieces of text, which was generated from an n-gram model, | ▶ 00:00 |
I want you to try to guess if it was generated from a 1-gram model, a 2-gram, or a 3-gram. | ▶ 00:06 |
Now, I know you won't necessarily be able to get these all right. Don't worry about that. | ▶ 00:13 |
The main point is just for you to get some facility in | ▶ 00:17 |
kind of looking at these models and trying to understand them. | ▶ 00:20 |
I'll give you a hint--three of them were generated by a 1-gram model, | ▶ 00:23 |
three by a 2-gram model, and 3 of them by a 3-gram model, | ▶ 00:28 |
and one of them is an actual excerpt from the corpus of Shakespeare's work, | ▶ 00:31 |
And so leave that one blank. Don't mark it at all. | ▶ 00:37 |
Here we see the answers. You can tell that the 3-grams are more consistent, | ▶ 00:00 |
more sentence-like than the 1-gram model generated sentences, | ▶ 00:04 |
and this is an actual sentence or stage direction from the works of Shakespeare. | ▶ 00:08 |
Here's a quiz to make sure you understand how to calculate these probabilistic models. | ▶ 00:00 |
We're going to calculate the probability of the string "woe is me," | ▶ 00:05 |
and we're going to calculate that beginning at the beginning of the sentence, | ▶ 00:11 |
which we'll mark with this dot character, | ▶ 00:14 |
given that we are starting at the beginning of the sentence. | ▶ 00:16 |
I want you to figure that out, put the probability in here, | ▶ 00:20 |
and it's going to be a small number with a lot of zeros to the right of the decimal place. | ▶ 00:24 |
So scale that by a factor of 1 billion--the probability times 10 to -9. | ▶ 00:29 |
Now, of course, I'm going to have to give you some data to make this make sense. | ▶ 00:35 |
I'm going to tell you that the probability that "woe" occurs at position i | ▶ 00:38 |
given that the start-of-sentence marker occurs at position i minus 1. | ▶ 00:44 |
I should say what we're doing here is we're sort of artificially introducing a token | ▶ 00:49 |
into our data of the start-of-sentence marker, | ▶ 00:55 |
which could be either what comes after a period or exclamation point | ▶ 00:58 |
or at the beginning of the file. That all counts as a start-of-sentence marker. | ▶ 01:03 |
That probability is 0.002. | ▶ 01:09 |
The probability that "is" occurs at position i given that "woe" occurred at i minus 1 is 0.07, | ▶ 01:13 |
and the probability that "me" occurs at position i given that "is" occurred at i minus 1 is 0.0005. | ▶ 01:23 |
Tell me the probability of the whole string "woe is me" at the beginning of a sentence | ▶ 01:37 |
given that we're starting at the beginning of a sentence and put your answer in here. | ▶ 01:42 |
The answer is that we just multiply them together, | ▶ 00:00 |
and it works out to 7 parts per billion. | ▶ 00:03 |
I should note that these numbers are small, but that shouldn't bother you. | ▶ 00:07 |
So "woe is me" seems like a fairly common phrase. | ▶ 00:12 |
The reason it's small is because there are so many common phrases of three words. | ▶ 00:15 |
And so even though this one's fairly common, it works out to only a few parts per billion, | ▶ 00:21 |
because of the many other possibilities. | ▶ 00:27 |
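As a sketch of the calculation just described, here is the chain-rule multiplication under a bigram (Markov) model. The bigram_prob function is a stand-in for whatever conditional probabilities you have estimated from your corpus; the numbers in the quiz would simply be plugged in as its values.

```python
def sentence_probability(words, bigram_prob, start='<S>'):
    """P(w1..wn | start) under a bigram model: the product of P(w_i | w_{i-1})."""
    p = 1.0
    previous = start
    for word in words:
        p *= bigram_prob(previous, word)
        previous = word
    return p

# Hypothetical usage with a lookup table of estimated conditional probabilities:
# table = {('<S>', 'woe'): ..., ('woe', 'is'): ..., ('is', 'me'): ...}
# sentence_probability(['woe', 'is', 'me'], lambda prev, w: table[(prev, w)])
```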
Now let's take a step back for a second, and I'm going to talk about | ▶ 00:00 |
probabilistic letter models. | ▶ 00:03 |
Here we have a sequence of letters, | ▶ 00:06 |
and it looks like this sequence is rather infrequent in English. | ▶ 00:09 |
But what can we do with letter models that we can't do with word models, | ▶ 00:15 |
or that we can do better than with word models? | ▶ 00:18 |
The answer is letter models are very good in cases | ▶ 00:22 |
where we're going to be dealing with unique words that maybe we haven't seen before, | ▶ 00:25 |
but they still give you properties of the language. | ▶ 00:30 |
One very interesting task is language identification. Let's see how that would work. | ▶ 00:32 |
Let's take some example phrases--"hello, world," "guten tag, welt," "salam dunya," | ▶ 00:37 |
and let's suppose you have the task of classifying these | ▶ 00:45 |
into the language from which they were sampled, | ▶ 00:49 |
and we'll make this into a quiz, and we'll give you some choices-- | ▶ 00:51 |
English, German, French, Spanish, and Azerbaijani-- | ▶ 00:56 |
and tell me for each of these what your best guess is at the most likely language classification. | ▶ 00:59 |
That didn't seem too hard. | ▶ 00:00 |
This looks like English. This looks like German. | ▶ 00:02 |
I may not be familiar with Azerbaijani, | ▶ 00:05 |
but it doesn't look like English, German, French, or Spanish, | ▶ 00:08 |
so I'll probably choose that, and that would be the right answer. | ▶ 00:11 |
Now, how could I do that? Well, I could do it by recognizing some of the words. | ▶ 00:15 |
But it turns out I can also do it just by looking at letter sequences, | ▶ 00:18 |
the frequency of single letters or pairs of letters or triplets of letters. | ▶ 00:23 |
In fact, you can get about 99% accuracy for language identification just looking at tables of letters. | ▶ 00:28 |
And a great thing about dealing with letter models is that | ▶ 00:35 |
the probability tables you need are much more compact. | ▶ 00:39 |
If you think about triples of words, there may be a million words in the vocabulary, | ▶ 00:42 |
so a table of triples is a million to the 3rd power. | ▶ 00:49 |
That's quite a number of entries. | ▶ 00:53 |
Whereas for letters in the alphabet, most alphabets have about 30 letters or so. | ▶ 00:56 |
So it's very easy and compact to store triples of those. | ▶ 01:01 |
Now, in doing actual language identification, | ▶ 01:05 |
it's also common to add other features, to not look only at the letter combinations. | ▶ 01:08 |
So you might add words as well. | ▶ 01:13 |
You might add a small number of words--the most common words in a language, | ▶ 01:15 |
or it may be even better to add the most discriminative words-- | ▶ 01:18 |
words that show up in one language but not in another language | ▶ 01:23 |
and count the occurrence of those words. | ▶ 01:27 |
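Here is a minimal sketch of a letter-bigram language identifier along those lines. The training filenames are hypothetical, and Laplace smoothing is assumed so that unseen bigrams don't zero out a whole score; a real system would likely add the word features just mentioned.

```python
import math
from collections import Counter

def letter_bigrams(text):
    """All consecutive letter pairs in the text (case folded, spaces included)."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """samples: dict of language name -> training text. Returns bigram counts per language."""
    return {lang: Counter(letter_bigrams(text)) for lang, text in samples.items()}

def log_score(text, counts, k=1):
    """Log-probability of the text's bigrams under one language's counts, with Laplace smoothing."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(math.log((counts[bg] + k) / (total + k * vocab)) for bg in letter_bigrams(text))

def identify(text, models):
    """Pick the language whose bigram model gives the text the highest score."""
    return max(models, key=lambda lang: log_score(text, models[lang]))

# Hypothetical usage, assuming local corpus files per language:
# models = train({'English': open('en.txt').read(),
#                 'German': open('de.txt').read(),
#                 'Azerbaijani': open('az.txt').read()})
# identify('hello, world')   # -> 'English', given enough training text
```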
In this table what I've done is I've taken samples of text in 3 languages | ▶ 00:00 |
and just counted the frequency of the letter bigrams, | ▶ 00:04 |
and then ordered them from top to bottom. | ▶ 00:08 |
And so for language A, TH was the most frequent letter bigram, | ▶ 00:10 |
TE was the second most frequent, and so on. | ▶ 00:15 |
In language B, EN was most popular, ER was the second most popular, and so on. | ▶ 00:18 |
And the same for language C. | ▶ 00:24 |
What I want you to do is just try to guess which language is which. | ▶ 00:26 |
Is language A English, German, or Azerbaijani? | ▶ 00:31 |
And do the same for B and C. | ▶ 00:35 |
If you're familiar with these languages, you probably could have guessed that | ▶ 00:00 |
TH is the most common two-letter sequence in English. | ▶ 00:04 |
These look like German. | ▶ 00:08 |
These are a little bit unfamiliar, and they include a letter that doesn't show up in English or German, | ▶ 00:10 |
and so this one is Azerbaijani. | ▶ 00:15 |
Here I just wanted to show that even very short sequences can be identified quite easily | ▶ 00:00 |
with language ID models based on letter frequencies. | ▶ 00:06 |
So for the 3-letter sequence T-H-E, I have trigram models for 3 different languages. | ▶ 00:10 |
And you can see that in language A there's a 1.1% chance of representing T-H-E, | ▶ 00:17 |
which is 4 times more than B and quite a bit more than C. | ▶ 00:25 |
For the 3-letter sequence D-E-R, that's 10 times more likely to be language B | ▶ 00:30 |
than to be language A and quite a bit more than language C. | ▶ 00:36 |
For the letter sequence R-B-A, that's 50 times more likely to be C than it is to be B and even more so than A. | ▶ 00:40 |
What I want you to tell me is what are each of these languages? | ▶ 00:48 |
Where did this column of numbers come from? | ▶ 00:51 |
Is language A English, German or Azerbaijani, and the same for B and C? | ▶ 00:55 |
You can see that English is a language in which THE is very common. | ▶ 00:00 |
German is a language in which DER is more common. | ▶ 00:04 |
And in Azerbaijani, the sequence RBA is more common. | ▶ 00:07 |
Enough about letters. Now let's use all the tools at our disposal and tackle a new task-- | ▶ 00:00 |
the task of classification into semantic classes. | ▶ 00:05 |
Say we're given a sequence of phrases and want to classify them | ▶ 00:09 |
into one of several categories. | ▶ 00:13 |
Here I've chosen just three--people, places, and drugs, | ▶ 00:16 |
and I have some examples of each. | ▶ 00:20 |
What would you use to do that? | ▶ 00:23 |
Well, you have a number of things at your disposal. | ▶ 00:25 |
One, you could memorize some common parts. | ▶ 00:27 |
So "Steve" and "Bill" are very common as that first word in a phrase which represents people. | ▶ 00:30 |
"San" and "New" are common in places. | ▶ 00:37 |
"City" is common at the end of places. | ▶ 00:43 |
But not all these techniques will be unambiguous or 100% accurate. | ▶ 00:46 |
So for example, if you have a phrase where the last word is "grove" | ▶ 00:52 |
and the first word seems like part of a name, | ▶ 00:56 |
that could be a place, but it could also be a person's name. | ▶ 00:59 |
With drugs, it looks like maybe the letter-based approach is better than the word-based approach. | ▶ 01:03 |
They seem to have a much higher frequency of starting with "z" or ending in "x", for example, | ▶ 01:10 |
but you can imagine a classifier using the techniques that we've seen in machine learning | ▶ 01:16 |
that takes all these features. | ▶ 01:21 |
What's the first word? What's the second word? | ▶ 01:23 |
What's the first letter? What's the last letter or the last two letters? | ▶ 01:26 |
Throw all those features in and build a classifier. | ▶ 01:30 |
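As a sketch of what "throw all those features in" might look like, here is a hypothetical feature extractor for a phrase; any of the standard classifiers could then be trained on these features, given labeled examples. The particular features are illustrative, not the definitive set.

```python
def phrase_features(phrase):
    """Word- and letter-level features for classifying a phrase as a person, place, or drug."""
    words = phrase.lower().split()
    return {
        'first_word': words[0],
        'last_word': words[-1],
        'first_letter': words[0][0],
        'last_letter': words[-1][-1],
        'last_two_letters': words[-1][-2:],
    }

# phrase_features('New York City') ->
# {'first_word': 'new', 'last_word': 'city', 'first_letter': 'n',
#  'last_letter': 'y', 'last_two_letters': 'ty'}
```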
Here's a quick quiz. Which of these would be a good algorithm or technique | ▶ 00:00 |
for doing this classification into things like people, places, and drugs? | ▶ 00:04 |
Could we use Naive Bayes, k-Nearest Neighbors, Support Vector Machines, or Logistic Regression? | ▶ 00:09 |
Could we use the Unix Sort command or the Gzip command? | ▶ 00:18 |
Check all those that you think would be reasonably good algorithms | ▶ 00:22 |
for doing classification. | ▶ 00:27 |
The answer is that all of these are good, except for the sort command. | ▶ 00:00 |
That wouldn't be very good. | ▶ 00:04 |
It would maybe separate out the drugs that begin with "z" near the end of the list, | ▶ 00:06 |
so that would help, but it would probably do about as well as random for everything else. | ▶ 00:10 |
Now, you may be surprised to learn that the gzip command | ▶ 00:15 |
is actually pretty good as a classification algorithm. Let's try to understand that. | ▶ 00:19 |
Here I have 3 files containing a corpus of text in each of the languages that I want to classify into, | ▶ 00:00 |
and imagine these are much longer, so it gives you a good sample of text in | ▶ 00:07 |
English, German, and Azerbaijani. | ▶ 00:11 |
Now I have a new piece of text that I want to classify against each of these possibilities. | ▶ 00:14 |
Well, I can do that using the gzip command. | ▶ 00:20 |
So I could issue this Unix command that says | ▶ 00:23 |
"concatenate together the new file with the English file, | ▶ 00:27 |
gzip them, compress them, then count the number of characters, | ▶ 00:31 |
and do the same for the German and Azerbaijani, | ▶ 00:35 |
and then figure out which one is shortest." | ▶ 00:39 |
In fact, when we do that with the files I've collected, it gives me the right answer. | ▶ 00:43 |
Now how does it do that? | ▶ 00:48 |
Well, you have to understand a little bit about how compression algorithms like gzip work. | ▶ 00:50 |
What they do is they take a file like this and they look for common subsequences, | ▶ 00:55 |
and they represent that in less than 1 byte. | ▶ 01:00 |
For example, I-S-SPACE would be represented by 3 bytes in an ASCII encoding, | ▶ 01:04 |
but in compressed encoding you could say, | ▶ 01:12 |
"Hey, I see that sequence here. I see it here again. It's going to show up many times." | ▶ 01:14 |
So maybe I can represent those 3 bytes just in terms of one, | ▶ 01:18 |
saying this is a common subsequence that I'm going to see again and again. | ▶ 01:22 |
Once we've done that for English, we come up with common subsequences in English. | ▶ 01:26 |
Then if we add in another file that has a lot of the same common sequences, | ▶ 01:31 |
like here it has I-S-SPACE again, | ▶ 01:37 |
then that's going to compress well with respect to this. | ▶ 01:40 |
It's not going to compress very well with respect to the Azerbaijani file, | ▶ 01:43 |
because that won't have built up a code for I-S-SPACE. | ▶ 01:47 |
That will have built up codes for things like R-B-A rather than for I-S-SPACE. | ▶ 01:50 |
So it turns out that the ideas of compression and learning are actually very closely related, | ▶ 01:58 |
and they're related by information theory and this idea of entropy of an expression | ▶ 02:04 |
or the information content. | ▶ 02:10 |
That wasn't discovered until fairly recently. | ▶ 02:13 |
The two fields had developed independently, but now they've come back together, | ▶ 02:16 |
and we understand how they relate. | ▶ 02:20 |
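Here is a minimal sketch of that gzip trick in Python rather than as a shell pipeline. The filenames are hypothetical, and where the lecture compares total compressed sizes, this sketch subtracts out each corpus's own compressed size, a small tweak so corpora of different lengths compare fairly.

```python
import gzip

def compressed_size(text):
    """Number of bytes after gzip-compressing the text."""
    return len(gzip.compress(text.encode('utf-8')))

def classify(new_text, corpora):
    """corpora: dict of language name -> corpus text.
    Picks the language whose corpus grows least, after compression, when the new text is appended."""
    def extra_bytes(corpus):
        return compressed_size(corpus + new_text) - compressed_size(corpus)
    return min(corpora, key=lambda lang: extra_bytes(corpora[lang]))

# Hypothetical usage, assuming local corpus files:
# corpora = {'English': open('en.txt').read(),
#            'German': open('de.txt').read(),
#            'Azerbaijani': open('az.txt').read()}
# classify('the quick brown fox', corpora)
```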
The next topic I want to address is called "Segmentation." | ▶ 00:00 |
This is the problem of given a sequence of language, | ▶ 00:04 |
figure out how to break it up into words. | ▶ 00:07 |
Now, in Chinese we don't have spaces between the words, | ▶ 00:10 |
and so in order to understand if the first word of this message corresponds | ▶ 00:13 |
to a single character or two characters or what, | ▶ 00:17 |
we have to be able to do the process of segmentation and figure out where they are. | ▶ 00:20 |
In English, we don't have that. Words have spaces between them. | ▶ 00:25 |
So we don't have the segmentation problem, | ▶ 00:31 |
but we certainly have it in speech recognition in languages like English, | ▶ 00:33 |
because the speech sounds are sometimes run together without pauses in between them, | ▶ 00:37 |
and there are places where we do have a language without segmentation. | ▶ 00:42 |
For example, in the language of URLs | ▶ 00:47 |
you could have a URL like "choosespain.com", | ▶ 00:50 |
which is the travel site that tries to encourage you to choose Spain as your travel destination, | ▶ 00:56 |
but if you segment it wrong, you'd come up with "chooses pain," | ▶ 01:02 |
which would not be the intended expression for that particular URL. | ▶ 01:07 |
So segmentation is an important problem. Let's talk about how to do it. | ▶ 01:12 |
Let's build a probabilistic word model of segmentation. | ▶ 00:00 |
By definition, the best segmentation, which we'll call S*, | ▶ 00:03 |
is equal to the one which maximizes the joint probability of the segmentation. | ▶ 00:08 |
So we're going to segment the text into a sequence of words-- | ▶ 00:15 |
word 1 through word n-- | ▶ 00:18 |
and find that segmentation into words that maximize the joint probability. | ▶ 00:20 |
By the definition of joint probability, that's the same as maximizing the product over the words | ▶ 00:26 |
of the probability of each word given all the previous words. | ▶ 00:33 |
Now this is going to be a little unwieldy to deal with, so we can make an approximation. | ▶ 00:41 |
We can say that the best segmentation is approximately equal to the one that maximizes, | ▶ 00:46 |
and what we could do here is we could make the Markov assumption | ▶ 00:52 |
and say we're only going to be considering the few previous words. | ▶ 00:56 |
But I'm going to go all the way and make the naive Bayes assumption | ▶ 01:00 |
and say we're going to treat each word independently. | ▶ 01:04 |
We just want to maximize the probability of each individual word | ▶ 01:08 |
regardless of the word that comes before or after it. | ▶ 01:12 |
Now, I know that that assumption is wrong and that the words do depend | ▶ 01:15 |
on the words to the right or the left of them, | ▶ 01:19 |
but I'm going to hope that this simplification is going to make the process of learning easier | ▶ 01:21 |
and will turn out to be good enough. | ▶ 01:27 |
Now for a quick quiz. | ▶ 00:00 |
For a given string--say we have this string with 12 characters-- | ▶ 00:02 |
how many possible segmentations are there? | ▶ 00:07 |
How many ways can we break this up into words? | ▶ 00:09 |
And let's answer that not just for 12 characters, but for n characters in general. | ▶ 00:12 |
With n characters, how many ways of segmenting could there be? | ▶ 00:17 |
Could there be n-1 ways, (n-1) squared, (n-1) factorial, or 2^(n-1)? | ▶ 00:22 |
Tell me which of those you think is right. | ▶ 00:32 |
The answer is 2^(n-1), and the way you can see that is here. | ▶ 00:00 |
With 12 characters, there are 11 spaces in between characters, | ▶ 00:05 |
and we can either place or not place a word boundary in between each pair of characters. | ▶ 00:09 |
And so 11 of them either occur or don't occur, so that's 2 to the 11th. | ▶ 00:17 |
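Here is a tiny sketch that enumerates every segmentation, so you can verify the 2^(n-1) count on short strings. It is exponential on purpose and only meant as a check, not as the algorithm we actually use.

```python
def all_segmentations(text):
    """Every way to split text into nonempty words, in order."""
    if not text:
        return [[]]
    result = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        for seg in all_segmentations(rest):
            result.append([first] + seg)
    return result

# len(all_segmentations('abcd')) == 8 == 2 ** (4 - 1)
```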
Now, 2^n is a lot. | ▶ 00:00 |
For example, if we have 30 characters in our string, then there'd be a billion possible segmentations to deal with. | ▶ 00:02 |
We clearly don't want to have to enumerate them all. | ▶ 00:09 |
We'd like some way of searching through them efficiently | ▶ 00:12 |
without having to consider the probability of every possible segmentation. | ▶ 00:15 |
That's one of the reasons why making this naive Bayes assumption is so helpful. | ▶ 00:19 |
It means that there's no interrelations between the various words, | ▶ 00:25 |
so we can consider them one at a time. | ▶ 00:29 |
That is, here's one thing we can say. | ▶ 00:31 |
We can say that the best segmentation is equal to the argmax | ▶ 00:33 |
over all possible segmentations of the string into a first word and the rest of the words | ▶ 00:39 |
of the probability of that first word times the probability of the best segmentation of the rest of the words. | ▶ 00:45 |
And notice that this is independent. | ▶ 00:53 |
The best segmentation of the rest of the words doesn't depend on the first word. | ▶ 00:55 |
And so that means we don't have to consider all interactions, | ▶ 01:00 |
and we don't need to consider all 2^n possibilities. | ▶ 01:03 |
So now we have two reasons why the naive Bayes assumption is a good thing. | ▶ 01:06 |
One is it makes this computation much more efficient, | ▶ 01:10 |
and secondly, it makes learning easier, | ▶ 01:13 |
because it's easy to come up with a unigram probability. | ▶ 01:16 |
What's the probability of an individual word from our corpus of text? | ▶ 01:19 |
It's much harder to get combinations of multiple word sequences. | ▶ 01:23 |
We're going to have to do more smoothing, more guessing what those probabilities are, | ▶ 01:27 |
because we just won't have the counts for them. | ▶ 01:32 |
So given this formula and given our input string-- | ▶ 00:00 |
let's stick with the familiar one-- | ▶ 00:04 |
we can start enumerating the possibilities for splitting up this string S | ▶ 00:06 |
into a first word and a rest part and figuring out the probabilities. | ▶ 00:10 |
So the first could be "n," could be "no," could be "now," could be "nowi," and so on, | ▶ 00:15 |
and then the rest would be "owis..." or starting with "w" or starting with "is" | ▶ 00:26 |
or starting with "s," and then what's the probability of the first. | ▶ 00:36 |
Well, that we get from our corpus by counting and then smoothing, | ▶ 00:41 |
and in our Shakespeare corpus "n" occurs infrequently-- | ▶ 00:45 |
about one in a million times--"no" occurs fairly frequently--about 0.004, | ▶ 00:51 |
"now" 0.003, and "nowi" doesn't occur at all, | ▶ 00:57 |
and so we'd use some factor based on smoothing. | ▶ 01:02 |
Then if we take the rest and multiply out this whole term, | ▶ 01:06 |
the best segmentation of the rest times the probability of the first that comes from this column, | ▶ 01:11 |
then that column will give us about 10 to -19 for the segmentation that starts with "n," | ▶ 01:16 |
10 to -13 for the one that starts with "no," | ▶ 01:24 |
10 to -10 for the one that starts with "now," | ▶ 01:27 |
and 10 to -18 for the one that starts with "nowi." | ▶ 01:31 |
Again, that depends on exactly what type of smoothing you choose to do. | ▶ 01:35 |
But it turns out that this row here is at least 1,000 times better than any of the other segmentations. | ▶ 01:40 |
That is the segmentation that comes out "now is the time." | ▶ 01:48 |
So this model, simplified though it is, coming up with this naive Bayes assumption, | ▶ 01:52 |
gets this one right, and it does about 99% of the segmentations accurately. | ▶ 01:58 |
Here we have a demonstration that the implementation of this algorithm into actual code | ▶ 00:00 |
is not that much more complicated than the mathematical formulas I just described to you. | ▶ 00:05 |
Here's the function segment, which takes a text, and it does what we just said. | ▶ 00:10 |
So it splits the text up into all possible first and rest components, | ▶ 00:16 |
and then the candidates will be the first word plus the best segmentation of the rest, | ▶ 00:21 |
and then out of all those candidates we just take the maximum | ▶ 00:27 |
according to the probability of the words | ▶ 00:30 |
where the probability of the words is just the product of the probability of each individual word. | ▶ 00:32 |
So that's the naive Bayes assumption coming into this definition, | ▶ 00:37 |
and this is just the definition of how to split something up into a first and rest. | ▶ 00:41 |
And you can follow the links in the note to see the source code for this | ▶ 00:46 |
and play with it on your own if you like. | ▶ 00:50 |
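The actual source code is linked from the notes; here is a minimal sketch in the same spirit, with a hypothetical corpus file and a crude penalty for unseen words standing in for real smoothing.

```python
import re
from collections import Counter
from functools import lru_cache

# Hypothetical training data: unigram counts from some local corpus file.
COUNTS = Counter(re.findall(r"[a-z']+", open('corpus.txt').read().lower()))
TOTAL = sum(COUNTS.values())

def Pword(word):
    """Unigram probability, with a crude penalty for unseen words (longer unknowns are less likely)."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(word))

def Pwords(words):
    """Naive Bayes assumption: the probability of a sequence is the product of unigram probabilities."""
    p = 1.0
    for w in words:
        p *= Pword(w)
    return p

def splits(text, max_word_len=20):
    """All (first, rest) pairs where first is between 1 and max_word_len characters long."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), max_word_len) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Best segmentation of text into words under the unigram model."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=Pwords)

# segment('nowisthetime')  ->  ('now', 'is', 'the', 'time'), given a suitable corpus
```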
Now I want to give you an idea of how well the segmentation program performs. | ▶ 00:00 |
Here I've trained it on a corpus of 4 billion words-- | ▶ 00:04 |
not just the Shakespeare corpus but a larger corpus, | ▶ 00:07 |
and then I give it some test cases to try to find the best segmentation. | ▶ 00:10 |
So I gave it the test case here. The program came up with "base rate sought to," | ▶ 00:14 |
but the correct answer was "base rates ought to." | ▶ 00:19 |
In this case, it just seems somewhat like bad luck that that was the right answer, | ▶ 00:22 |
but both segmentations seem like good segmentations. | ▶ 00:28 |
Next was this trial. | ▶ 00:32 |
My program came up with "small and in significant," | ▶ 00:34 |
but the correct answer was "small and insignificant." | ▶ 00:38 |
Here it seems like it really has erred that "small and insignificant" | ▶ 00:41 |
seems like a much better segmentation than the one my program came up with. | ▶ 00:45 |
What I want you to tell me is what do you think could help us do a better job of getting the right answer. | ▶ 00:49 |
Would it be helpful to gather more data? | ▶ 00:55 |
Check that box if you think that would be helpful. | ▶ 00:59 |
Would it be helpful to make a Markov assumption rather than the naive Bayes assumption? | ▶ 01:02 |
Check here. | ▶ 01:08 |
Or would it be helpful to do a better job with our smoothing algorithm? Check here. | ▶ 01:10 |
And you can check more than one. | ▶ 01:16 |
In this case, the problem really comes down to the fact that the naive Bayes assumption is | ▶ 00:00 |
a weak one, and the Markov assumption would do much better. | ▶ 00:06 |
It wouldn't really help to have more data or to do a better job of smoothing, | ▶ 00:09 |
because I already have good counts for words like "in" and "significant" | ▶ 00:12 |
as well as words like "small" and "and." | ▶ 00:16 |
They're all common enough that I have a good representation of how often they occur | ▶ 00:18 |
as a unigram as a single word. | ▶ 00:22 |
The problem is that we would like to know that the word "small" goes very well | ▶ 00:25 |
with the word "insignificant" but does not goes very well with the word "significant." | ▶ 00:30 |
So if we had a Markov model where the probability of "insignificant" depended | ▶ 00:35 |
on the probability of "small," then we could catch that, | ▶ 00:40 |
and we could get this segmentation correct. | ▶ 00:44 |
Now let's move on, and I want to do just one more example. | ▶ 00:00 |
Here's this input, and my program came up with "g in or mouse go"-- | ▶ 00:04 |
a sequence of common words, but the correct answer was "ginormous ego." | ▶ 00:10 |
Again, what do you think could help us get the right answer this time? | ▶ 00:15 |
More data? Making the Markov assumption rather than naive Bayes assumption? | ▶ 00:20 |
Or doing a better job with smoothing. Check all the ones that you think might apply. | ▶ 00:25 |
Here it seems to be a problem of not enough data and not a very good smoothing algorithm. | ▶ 00:00 |
Now the problem was even though I had 4 billion words from which I trained my probabilistic model, | ▶ 00:08 |
I had never seen the word "ginormous"--not once in those 4 billion. | ▶ 00:13 |
Yet, I should be able to deal with it even if I haven't seen the word before. | ▶ 00:18 |
So having more data might mean that I would've seen "ginormous" | ▶ 00:22 |
and I could have some probability for it rather than just making the Laplace smoothing assumption. | ▶ 00:26 |
And having better smoothing could also help-- | ▶ 00:32 |
maybe something more sophisticated than Laplace, | ▶ 00:35 |
maybe something that looks more carefully at the content of the word. | ▶ 00:37 |
So it might have a letter model to say these letters look common, | ▶ 00:42 |
ending in "ous"--that's a common ending in English--so this looks more like a word, | ▶ 00:47 |
even if I haven't seen it before, than some other combination of letters. | ▶ 00:54 |
Now let's do one more example of a probabilistic problem--this time, spelling correction. | ▶ 00:00 |
That is, given a word that is possibly misspelled, | ▶ 00:05 |
how do we come up with the best correction for that word? | ▶ 00:08 |
We're going to do the same type of analysis. | ▶ 00:12 |
We're saying we're looking for the best possible correction, C*, | ▶ 00:14 |
and that's going to be the argmax over all possible corrections c to maximize | ▶ 00:20 |
the probability of that correction given the word. | ▶ 00:26 |
So that's the definition of what it means to have the best correction. | ▶ 00:30 |
Then we can start the analysis, and we can apply Bayes rule to say | ▶ 00:33 |
that's going to be equal to the probability of the word given the correction | ▶ 00:38 |
times the probability of the correction. | ▶ 00:45 |
Of course, in Bayes rule there's a factor on the bottom, but that cancels out, | ▶ 00:48 |
because it's equal for all possible corrections. | ▶ 00:52 |
So to choose the maximum, we just have to deal with these two probabilities. | ▶ 00:54 |
Now, it may seem like we made a backwards step. | ▶ 00:59 |
Here we had one probability to estimate. | ▶ 01:02 |
Now we've applied Bayes rule and now we have two probabilities we have to estimate, | ▶ 01:05 |
but the hope is that we can come up with data that can help us with this. | ▶ 01:10 |
And certainly, these unigram statistics--what's the probability of a correction?-- | ▶ 01:15 |
those we can get from our document counts, so we look at our corpus. | ▶ 01:20 |
The probability of a correct word is from the data. | ▶ 01:25 |
We just look at those counts and apply whatever smoothing we decided is best. | ▶ 01:30 |
Now, the other part--what's the probability that somebody typed the word w | ▶ 01:35 |
when they meant to type the word c--that's harder. | ▶ 01:41 |
We can't observe that directly by just looking at documents that are typed, | ▶ 01:45 |
because there we only see the words as they were typed. | ▶ 01:51 |
We don't have both the intended word and the typed word, | ▶ 01:54 |
but maybe we can look at lists of spelling corrections. | ▶ 01:56 |
So this is from spelling correction data. | ▶ 02:01 |
Now that kind of data is much harder to come by. | ▶ 02:04 |
It's easy to go out and collect billions of words of regular text and do those counts, | ▶ 02:08 |
but to find spelling correction data--that's harder to do | ▶ 02:14 |
unless you're, say, already running a spelling correction service. | ▶ 02:17 |
If you're a big company that happens to run that, then it's easy to collect the data. | ▶ 02:21 |
But bootstrapping it is hard. | ▶ 02:24 |
There are, however, some sites that will give you on the order of thousands | ▶ 02:26 |
or tens of thousands of examples of misspellings, not billions or trillions. | ▶ 02:30 |
Now, here I show some data that I've gathered from sites that deal with spelling correction, | ▶ 00:00 |
and these are all examples of the correct spelling followed by misspelled words | ▶ 00:06 |
and maybe multiple of them. | ▶ 00:12 |
And from that we want to calculate the probability of a word given the correction. | ▶ 00:15 |
So for example, we would like to know what's the probability of P-L-U-S-E | ▶ 00:22 |
being the word that's spelled when the correct word was "pulse." | ▶ 00:29 |
And we do have examples of that here. We have a single example. | ▶ 00:33 |
But it's clear that we're just not going to have enough to cover all | ▶ 00:38 |
the possible words we want to deal with and all the possible misspellings for those words. | ▶ 00:42 |
With only tens of thousands of examples, | ▶ 00:46 |
there are so many words in English that we're not going to have them all. | ▶ 00:49 |
Instead of trying to deal with word-to-word spelling errors, | ▶ 00:53 |
let's deal with letter-to-letter errors. | ▶ 00:57 |
And so let's not say that this is "pulse" misspelled as "pluse," | ▶ 01:00 |
but rather let's say this is U-L misspelled as L-U. | ▶ 01:06 |
Here, let's say this is the E in "elegant" misspelled as an A. | ▶ 01:12 |
And we'll look at these types of edits from one word to another, | ▶ 01:19 |
a transposition between 2, a replacement, or an insertion or deletion of a single letter. | ▶ 01:24 |
We'll build up probability tables for those rather than probability tables for all the words. | ▶ 01:32 |
That's much easier to do with a smaller amount of data. | ▶ 01:37 |
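Here is a rough sketch of how those letter-level edit counts might be extracted from (correct, misspelled) pairs. It only handles pairs that differ by a single edit, which is a simplifying assumption, not the full procedure.

```python
from collections import Counter

def single_edit(correct, wrong):
    """The one edit that turns `correct` into `wrong`, or None if they differ by more than one edit."""
    if correct == wrong:
        return None
    i = 0
    while i < min(len(correct), len(wrong)) and correct[i] == wrong[i]:
        i += 1                                        # first position where the strings differ
    if len(wrong) == len(correct):
        if correct[i + 1:] == wrong[i + 1:]:
            return ('replace', correct[i], wrong[i])  # e.g. the E in "elegant" typed as an A
        if (i + 1 < len(correct) and correct[i] == wrong[i + 1]
                and correct[i + 1] == wrong[i] and correct[i + 2:] == wrong[i + 2:]):
            return ('transpose', correct[i:i + 2])    # e.g. "pulse" typed as "pluse" (UL -> LU)
    elif len(wrong) == len(correct) + 1 and correct[i:] == wrong[i + 1:]:
        return ('insert', wrong[i])
    elif len(wrong) == len(correct) - 1 and correct[i + 1:] == wrong[i:]:
        return ('delete', correct[i])
    return None

def edit_counts(pairs):
    """Counts of single-letter edits over a list of (correct, misspelled) pairs."""
    counts = Counter()
    for correct, wrong in pairs:
        edit = single_edit(correct, wrong)
        if edit is not None:
            counts[edit] += 1
    return counts

# edit_counts([('pulse', 'pluse'), ('elegant', 'elagant')])
# -> Counter({('transpose', 'ul'): 1, ('replace', 'e', 'a'): 1})
```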
Here's an example of spelling correction in action. | ▶ 00:00 |
Take the word w equals "thew," | ▶ 00:03 |
and we want to find the correction c | ▶ 00:09 |
that maximizes the probability of w given c times the probability of c. | ▶ 00:11 |
We start searching for the possible corrections c | ▶ 00:18 |
that are close to our target word "thew" in terms of edit distance. | ▶ 00:23 |
That is, first we start with all possible c that are one letter away, | ▶ 00:28 |
replacing one letter, deleting one letter, inserting one letter, or transposing two letters. | ▶ 00:34 |
And here we have a list of a few of those possible corrections. | ▶ 00:42 |
So it could be "the" by deleting the "w. | ▶ 00:45 |
We could do no correction at all; we have to consider that as one of the possibilities. | ▶ 00:48 |
We could replace the "e" with an "a." | ▶ 00:53 |
We could add an "r." | ▶ 00:55 |
We could transpose the "w" and the "e." | ▶ 00:57 |
Then we look into our spelling correction tables, | ▶ 01:01 |
and again we reduce them from word-based to letter-based or edit-based, | ▶ 01:05 |
and we say what's the probability of inserting a "w." | ▶ 01:10 |
Here we've conditioned the insertion--not just the absolute probability of inserting a "w" anywhere, | ▶ 01:15 |
but for insertions and deletions, we condition them on the previous letter. | ▶ 01:21 |
So what's the probability of inserting a "w" given that the previous letter was an "e"? | ▶ 01:26 |
It turns out that's what the probability is, | ▶ 01:33 |
and then we go through the list. | ▶ 01:35 |
Here's replacing an "e" with an "a." | ▶ 01:37 |
That's one of the most common edits made in English, | ▶ 01:40 |
one of the most common spelling corrections. | ▶ 01:43 |
A 10th of a percent of all spelling errors are mistaking an "e" for an "a," | ▶ 01:46 |
and similarly down the list. | ▶ 01:50 |
So we get this probability for the probability of w given c, | ▶ 01:52 |
and then the probability of the correction word c, | ▶ 01:56 |
that we just get by looking up in our corpus how many times we have seen this word | ▶ 01:59 |
and applying whatever smoothing we're using. | ▶ 02:04 |
Then we multiply them all out, and I've scaled these by a factor of 1 billion. | ▶ 02:06 |
It turns out with the model I've built that "thew" is most probably corrected to "the." | ▶ 02:11 |
And that makes sense. | ▶ 02:21 |
It's easy to imagine your finger slipping off the "e" key and going over to | ▶ 02:23 |
the "w" since they're next to each other, | ▶ 02:26 |
and "w" is a very common word in English. | ▶ 02:28 |
But it's troubling that the second possibility, | ▶ 02:33 |
namely leaving "thew" alone and keeping it as is has such a high probability. | ▶ 02:37 |
Now, it turns out "thew" is a word. | ▶ 02:44 |
It's rather archaic. It does show up in the Shakespeare corpus. | ▶ 02:48 |
It has to do with muscle tissue, | ▶ 02:52 |
but it's a fairly uncommon word, | ▶ 02:56 |
and how high it ranks depends in large part on the probability that we assign | ▶ 02:58 |
to this edit of doing nothing at all. | ▶ 03:04 |
Here I've assigned it a probability of 0.95. | ▶ 03:07 |
That is, I've said for my probabilistic model, | ▶ 03:11 |
I've made this choice to say I think that about 95% of the words are spelled correctly | ▶ 03:15 |
and 5% are spelled incorrectly. | ▶ 03:22 |
You have to make that choice in order to have a complete model. | ▶ 03:24 |
The probability distribution has to be spread out over all the possibilities, | ▶ 03:27 |
and they have to sum up to one, so I've got to put it somewhere. | ▶ 03:31 |
If I had made another choice, then these two could have been swapped around. | ▶ 03:33 |
So the answer you get depends on the assumptions you make. | ▶ 03:37 |
Still, we can have spelling correctors that are highly accurate. | ▶ 03:41 |
This very simple model of just looking at unigram possibilities | ▶ 03:46 |
and looking at the edits achieves accuracy in the 80% range. | ▶ 03:51 |
If we go beyond that and start dealing with Markov assumptions | ▶ 03:58 |
and looking at multiple word sequences, then we can get up into the high 90%. | ▶ 04:03 |
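Putting the pieces together, here is a minimal noisy-channel spelling corrector sketch. The corpus filename is hypothetical, and a single flat probability stands in for the per-edit tables described above, with 0.95 reserved for "no edit" as in the modelling choice made in the lecture.

```python
import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z']+", open('corpus.txt').read().lower()))  # hypothetical corpus
TOTAL = sum(WORDS.values())
P_NO_EDIT = 0.95    # modelling choice from the lecture: assume 95% of words are typed as intended
P_ONE_EDIT = 0.05   # crude flat stand-in for the per-edit probability tables

def P(word):
    """Unigram probability of a candidate correction, Laplace-smoothed."""
    return (WORDS[word] + 1) / (TOTAL + len(WORDS) + 1)

def edits1(word):
    """All strings one edit away: deletions, transpositions, replacements, insertions."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in letters]
    inserts = [a + c + b for a, b in pairs for c in letters]
    return set(deletes + transposes + replaces + inserts)

def P_w_given_c(w, c):
    """Error model: probability of typing w when c was intended (grossly simplified)."""
    return P_NO_EDIT if w == c else P_ONE_EDIT

def correct(w):
    """argmax over candidate corrections c of P(w | c) * P(c)."""
    candidates = {w} | {c for c in edits1(w) if c in WORDS}
    return max(candidates, key=lambda c: P_w_given_c(w, c) * P(c))

# correct('thew')  ->  'the', with a typical English corpus and these modelling choices
```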
Now, let me back up just for a minute and talk about software engineering in general | ▶ 00:00 |
rather than talking about specific AI techniques. | ▶ 00:05 |
What I'm showing here is a small excerpt from the spelling correction code | ▶ 00:08 |
from a project called Htdig, which is an open-source search engine. It's a great search engine. | ▶ 00:13 |
If you ever have need of one, you might want to check it out. | ▶ 00:18 |
All the code is very straightforward and easy to deal with. | ▶ 00:22 |
It has several thousand lines of code dealing with spelling correction. | ▶ 00:26 |
Here we see a little bit of code. | ▶ 00:32 |
It has the good idea of saying one word might be misspelled for another if they sound alike, | ▶ 00:34 |
and so let's go through each word and figure out what each letter is sounding like | ▶ 00:40 |
and see if there are other words that sound similar. | ▶ 00:44 |
So for example, here it's saying what does a "c" sound like. | ▶ 00:47 |
Well, "c" is ambiguous in English. | ▶ 00:51 |
It has this "x" sound, the "ch" sound, this "s" or "k" sound, | ▶ 00:54 |
and there's all these possibilities about how it can have one sound or another. | ▶ 00:59 |
Now imagine you're in charge of maintaining this program. | ▶ 01:03 |
In order for you to make sure that it's right you have to do several things. | ▶ 01:06 |
First, you could look at this comment and say, well, does this comment | ▶ 01:10 |
accurately reflect the rules for English pronunciation? | ▶ 01:13 |
Here, it's talking about pronouncing a "c" as an "s" in the context of an "i," "e," or "y." | ▶ 01:17 |
What about the other vowels--"a" and "o?" | ▶ 01:24 |
Were they left out by accident or is this correct? | ▶ 01:26 |
So you'd have to do some work to check that out. | ▶ 01:29 |
Then you'd have to do more work to say, if this comment is correct, | ▶ 01:31 |
is the comment correctly implemented in this code here? | ▶ 01:35 |
In fact, just this sort of one page of code just dealing with a couple letters | ▶ 01:39 |
is about the same as all the code that we use to implement the probabilistic model. | ▶ 01:43 |
But I think the most important difficulty in maintaining code like this | ▶ 01:50 |
is that it's so specific to the English language. | ▶ 01:55 |
Imagine you're in charge of maintaining it, and your boss or professor comes to you and says, | ▶ 01:59 |
"Great job. Now I'd like you to make this work for | ▶ 02:05 |
German and French and Azerbaijani and 50 other languages." | ▶ 02:09 |
You'd have to go through and understand the pronunciation rules in each of those languages | ▶ 02:13 |
and edit a version of this code for each particular language. | ▶ 02:18 |
That would be quite tedious. | ▶ 02:22 |
But if you were dealing with a probabilistic model | ▶ 02:24 |
and you were asked to work in another language, | ▶ 02:28 |
all you would have to do is go out and collect a large corpus of words in that language. | ▶ 02:30 |
Then you'd have the probabilities of the individual words. | ▶ 02:35 |
And then find a corpus of spelling errors. | ▶ 02:38 |
Then you'd have the probability of the spelling edits. | ▶ 02:41 |
And so gathering that data is a much faster, much easier software engineering process | ▶ 02:44 |
than writing this code by hand. | ▶ 02:50 |
In a sense, you could say that machine learning over probabilistic models | ▶ 02:52 |
is the ultimate in agile programming. | ▶ 02:58 |
The very first question is a search question. | ▶ 00:00 |
You probably know about the Towers of Hanoi, | ▶ 00:04 |
if you don't then please go and Google them. | ▶ 00:07 |
It's a single-player game in which you try to move the tower of four disks over here, | ▶ 00:11 |
onto the right peg, over here. | ▶ 00:18 |
You can use the middle peg, but the rules are you can only move one disk at a time. | ▶ 00:21 |
And it may never happen that a smaller disk sits below a larger disk. | ▶ 00:27 |
So the way to solve it is to move the disk over here, | ▶ 00:34 |
the second largest disk to the right side, | ▶ 00:38 |
the small one over, | ▶ 00:40 |
and the third largest to the center, and so on. | ▶ 00:42 |
If you know it, you know what I'm talking about. If not, just Google it. | ▶ 00:45 |
So, I would like to know, | ▶ 00:49 |
what is the size of the state space of valid disk configurations in this puzzle. | ▶ 00:51 |
Please enter this here. | ▶ 00:57 |
I'd like to know whether the number of disks on the left peg | ▶ 00:59 |
is an admissible heuristic, if you use A* search. | ▶ 01:04 |
And I'd like to know, what is the number of steps | ▶ 01:09 |
that an optimal solution will require | ▶ 01:13 |
to move all the disks from the left peg, to the right peg. | ▶ 01:16 |
So here's a Bayes Network, with 6 variables, A, B, C, D, E, and F. | ▶ 00:00 |
And I'd like you to count parameters. | ▶ 00:05 |
If this were a binary Bayes network, | ▶ 00:07 |
where each variable can take on two values, | ▶ 00:09 |
then, A would require one independent parameter, | ▶ 00:12 |
and B another one. | ▶ 00:16 |
And C would require four independent parameters, | ▶ 00:18 |
because there are four different ways A and B can come together to condition C. | ▶ 00:22 |
Now in this question I'd like to ask you, | ▶ 00:27 |
What happens if each node can assume three values, not just two? | ▶ 00:30 |
So A can be, A1, A2, A3. | ▶ 00:35 |
And C can be, C1, C2, C3. | ▶ 00:38 |
For each node, specify the number of independent parameters required | ▶ 00:41 |
to state the conditional probability of that node. | ▶ 00:45 |
And I'll tell you this is a tricky question, | ▶ 00:48 |
So for A, the correct answer is two. | ▶ 00:50 |
I won't give you the other ones. | ▶ 00:55 |
And, it's two because A can take three values, | ▶ 00:57 |
but it takes two independent parameters. | ▶ 01:00 |
The last one can be inferred as one minus the sum of the first two. | ▶ 01:03 |
Please fill in the values for all the other variables. | ▶ 01:07 |
This is a true or false set of questions for Machine Learning. | ▶ 00:00 |
Suppose we've trained a machine learning model, | ▶ 00:04 |
and we've found really good values for our parameters in our model. | ▶ 00:07 |
And now, we're going to increase the noise, | ▶ 00:11 |
that affects our data. | ▶ 00:16 |
What should we do to accommodate the increase of noise? | ▶ 00:18 |
Shall we increase k, if we're using k nearest neighbor? | ▶ 00:22 |
True or False? | ▶ 00:27 |
Increase k if we are using the k-means algorithm. | ▶ 00:29 |
True or False? | ▶ 00:33 |
Increase k if we are using Laplacian smoothing. | ▶ 00:35 |
True or False? | ▶ 00:39 |
Use fewer particles if we are using particle filters. | ▶ 00:41 |
True or False? | ▶ 00:44 |
And use more data if available. | ▶ 00:46 |
True or False? | ▶ 00:48 |
So this is a planning question. | ▶ 00:00 |
And I apologize, it's a little bit hard to read. | ▶ 00:03 |
There's a lot of text here. | ▶ 00:06 |
And I ask you to consult the pdf document to read the text. | ▶ 00:08 |
Given the resources on the left, over here, | ▶ 00:12 |
can we reach those five goals; | ▶ 00:16 |
A, B, C, D, E, on the right side? | ▶ 00:19 |
And in looking at those, there's words like 'consume', | ▶ 00:22 |
which means, the action eliminates the resource. | ▶ 00:26 |
Whereas 'use' means, | ▶ 00:31 |
you have to have it, but you retain it after using it. | ▶ 00:33 |
Now initially, you know there's a couple of books; | ▶ 00:38 |
one by Nau, about planning, | ▶ 00:41 |
one by Zweben, about scheduling, | ▶ 00:43 |
and one by Melville, about Whales. | ▶ 00:45 |
And there's also videos. | ▶ 00:48 |
Video 8 is about Planning. | ▶ 00:50 |
And Video 15 is about Scheduling. | ▶ 00:52 |
These might be our in-class videos. | ▶ 00:54 |
That's your initial state. | ▶ 00:56 |
And your goal is that you, as a student, | ▶ 00:58 |
know about planning, | ▶ 01:02 |
and you know about scheduling. | ▶ 01:04 |
So the question is, with certain resources | ▶ 01:06 |
that are available in the beginning, | ▶ 01:10 |
and they differ from question to question, | ▶ 01:13 |
can you attain the state of knowing about planning and scheduling? | ▶ 01:15 |
Now, there's two ways to know about a topic. | ▶ 01:19 |
One is to study it using a book. | ▶ 01:23 |
And one is to view it using a video. | ▶ 01:26 |
In both cases, the outcome is to know about the topic over here. | ▶ 01:29 |
Now either one has a different precondition. | ▶ 01:34 |
In the 'book' case, you have to have the book, | ▶ 01:37 |
and the book has to be about the topic you care about. | ▶ 01:40 |
In which case, the action 'study' | ▶ 01:43 |
lets you understand the book and you know about the topic. | ▶ 01:46 |
So for example, if you have a book about planning, | ▶ 01:49 |
and study it, then you know about planning. | ▶ 01:52 |
In the 'view' case, you have to have a video that's about the topic, | ▶ 01:56 |
and you have to have a certain bandwidth, | ▶ 02:01 |
which happens to be 2.5. | ▶ 02:04 |
If you don't have the bandwidth 2.5, you won't be able to view the video, | ▶ 02:06 |
and you won't be able to know about the topic. | ▶ 02:10 |
That's the way the problem is set up. | ▶ 02:12 |
Now, books can be bought or borrowed. | ▶ 02:15 |
In the buying case, you consume 50 dollars. | ▶ 02:19 |
In the borrowing case, | ▶ 02:23 |
you have to have a privilege, at the library, that's at least '1'. | ▶ 02:25 |
It might be larger but it can't be lower than '1'. | ▶ 02:30 |
And in either case after doing this, you have the book, | ▶ 02:32 |
and you can now plug this into the 'study' action, | ▶ 02:36 |
and you can read about it, | ▶ 02:39 |
and study it, and know the topic. | ▶ 02:41 |
So here are the questions. | ▶ 02:44 |
If your resource is that you have 50 dollars, | ▶ 02:46 |
and you have library privileges of '1', | ▶ 02:49 |
can you then attain the state of | ▶ 02:51 |
knowing about planning and scheduling? | ▶ 02:54 |
Secondly, suppose your resources are no dollars, | ▶ 02:57 |
but you have library privileges of '2', | ▶ 03:01 |
can you attain the same state? | ▶ 03:03 |
Third, what about the same with library privileges of '1'? | ▶ 03:06 |
Can you get here? | ▶ 03:10 |
Fourth, what about if you have 40 dollars, | ▶ 03:12 |
and bandwidth of '3'? Can you get here? | ▶ 03:15 |
And fifth, what about if you have bandwidth of '2', and 95 dollars? | ▶ 03:18 |
Can you get here? | ▶ 03:23 |
Check all, or any, or none | ▶ 03:25 |
of those five questions that apply. | ▶ 03:28 |
This is a question about logic. | ▶ 00:00 |
We have four different statements. | ▶ 00:03 |
Pink is True, | ▶ 00:06 |
Pink or Green is True, | ▶ 00:07 |
Pink and Green is True, | ▶ 00:09 |
and not Pink implies that Green is True. | ▶ 00:11 |
Now these statements could imply each other, | ▶ 00:15 |
and in this matrix over here, | ▶ 00:19 |
I'd like you to select each circle | ▶ 00:21 |
where an implication is necessarily true. It's always true. For example, | ▶ 00:23 |
if you believe that Pink implies | ▶ 00:28 |
that Pink or Green is True, | ▶ 00:31 |
then mark the A implies B circle, over here. | ▶ 00:35 |
If you believe the Pink is True implies | ▶ 00:40 |
Pink and Green must be True, | ▶ 00:43 |
then mark the circle A implies C, over here. | ▶ 00:46 |
And so on for the entire matrix. | ▶ 00:50 |
One hint: D looks complex, | ▶ 00:53 |
but it happens to be the same as one of the previous cases. | ▶ 00:56 |
So if you fill out the matrix for A to C first, | ▶ 01:00 |
and then copy the result over for D, | ▶ 01:04 |
it will be easier, than if you start thinking about D separately. | ▶ 01:09 |
And to find the equivalency of D, | ▶ 01:13 |
just write down the Truth Table of these different things over here, | ▶ 01:15 |
and observe D is already represented among A to C. | ▶ 01:19 |
In this question we study a particle filter. | ▶ 00:00 |
Let's just zoom in for a second. | ▶ 00:04 |
We have eight particles that land on this checkerboard. | ▶ 00:07 |
They are labeled, 'A' all the way to 'H'. | ▶ 00:11 |
And some of them are on black squares. | ▶ 00:15 |
And some of them are on white squares. | ▶ 00:18 |
Given those particles, | ▶ 00:21 |
we'll assume that the probability of measuring 'black', | ▶ 00:23 |
for any particle that falls on a black square, | ▶ 00:26 |
is 0.7. | ▶ 00:29 |
And the probability of measuring 'white', | ▶ 00:31 |
for any particle that falls on a white square, | ▶ 00:34 |
is 0.6. | ▶ 00:37 |
From that you can easily calculate the probability of measuring 'white', | ▶ 00:39 |
if a particle falls on a black square. | ▶ 00:43 |
And the probability of 'black', | ▶ 00:45 |
if the particle falls on a white square. | ▶ 00:47 |
Now I'd like to know what's the normalized importance weight, after normalization, | ▶ 00:50 |
of the particle, labeled 'A', | ▶ 00:55 |
if our measurement happens to be 'white'? | ▶ 00:59 |
That's a number that you put in over here. | ▶ 01:03 |
And I'm going to ask you the same question | ▶ 01:06 |
about the normalized importance weight of particle 'A', | ▶ 01:08 |
if the measurement is 'black'. | ▶ 01:10 |
To calculate this, | ▶ 01:14 |
you will go through these probabilities. | ▶ 01:16 |
For each particle, you will assign the measurement probability. | ▶ 01:19 |
And then you just normalize all of those, | ▶ 01:22 |
so they add up to one. | ▶ 01:25 |
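To make that normalization step concrete, here is a small sketch with made-up particle positions; the actual board layout for this quiz is in the figure, so this is illustration only, not the answer.

```python
def normalized_weights(square_colors, measurement,
                       p_black_given_black=0.7, p_white_given_white=0.6):
    """Importance weights for one measurement, normalized to sum to one.
    square_colors: the color ('black' or 'white') of the square each particle sits on."""
    def likelihood(color):
        if measurement == 'black':
            return p_black_given_black if color == 'black' else 1 - p_white_given_white
        return p_white_given_white if color == 'white' else 1 - p_black_given_black
    raw = [likelihood(c) for c in square_colors]
    total = sum(raw)
    return [w / total for w in raw]

# Example with made-up positions (not the board from this question):
# normalized_weights(['black', 'white', 'white'], 'white')
# -> raw weights [0.3, 0.6, 0.6], normalized [0.2, 0.4, 0.4]
```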
In this question, we assume that a particle, | ▶ 00:00 |
'A', has an already normalized importance | ▶ 00:04 |
weight of 0.2. | ▶ 00:07 |
So there might be other particles, | ▶ 00:09 |
we don't even care how many. | ▶ 00:10 |
But their importance weights add up to 0.8. | ▶ 00:12 |
We now sample 3 new particles, | ▶ 00:15 |
with replacement. | ▶ 00:19 |
What is the probability that this particle, 'A', | ▶ 00:21 |
is sampled at least once? | ▶ 00:24 |
And the way you derive this is by asking | ▶ 00:26 |
the question, what's the probability | ▶ 00:28 |
that particle 'A' is never sampled? | ▶ 00:30 |
And then you take the complement of this. | ▶ 00:33 |
This is a question about Alpha-Beta Pruning, | ▶ 00:00 |
in min-max search, in games. | ▶ 00:03 |
Consider the following tree, | ▶ 00:06 |
where this is the max node, | ▶ 00:08 |
and these are min nodes. | ▶ 00:10 |
We perform alpha-beta pruning. | ▶ 00:12 |
I'd like you to check all leaf nodes, | ▶ 00:14 |
of these 9 leaf nodes over here, in this tree, | ▶ 00:17 |
that will be expanded, assuming | ▶ 00:20 |
that we expand from the left to the right, | ▶ 00:22 |
and we expand in depth first mode, of course, | ▶ 00:25 |
as always in these game trees. | ▶ 00:29 |
These are four True or False questions | ▶ 00:00 |
for computer vision, and specifically, | ▶ 00:03 |
perspective projection. | ▶ 00:06 |
Consider a projective image of an object. | ▶ 00:07 |
Which of the following statements is true? | ▶ 00:10 |
If the object moves closer to the camera, | ▶ 00:13 |
the size of the projected image | ▶ 00:16 |
of the object will increase. | ▶ 00:19 |
Is this True or False? Please just check one. | ▶ 00:21 |
If we use a camera with a longer focal length, | ▶ 00:24 |
as a result of using the longer focal length, | ▶ 00:28 |
the size of the projected image will increase. | ▶ 00:31 |
Check one. | ▶ 00:34 |
If we double the distance to the object, | ▶ 00:36 |
the projected image will be half as large, as before. | ▶ 00:38 |
Check one. | ▶ 00:42 |
And finally, the ratio of the focal length | ▶ 00:44 |
over the distance to the object | ▶ 00:47 |
is the same as the projected size of the object in the camera plane. | ▶ 00:50 |
Please check True or False. | ▶ 00:55 |
A question on stereo vision. | ▶ 00:00 |
An object at range of 100 meters | ▶ 00:03 |
leads to a 2mm displacement for a stereo rig, | ▶ 00:05 |
with focal length 40mm. | ▶ 00:08 |
Now we double the baseline. | ▶ 00:12 |
What will happen to the new displacement, | ▶ 00:14 |
that used to be 2mm, | ▶ 00:18 |
what will it be now? | ▶ 00:20 |
Here's a 'Structure from Motion' type problem, | ▶ 00:00 |
that is similar to what I asked you on a homework assignment. | ▶ 00:03 |
Assume there is a world of 3 point features, | ▶ 00:06 |
that will be named; 1, 2, 3, but I won't tell you which one is which. | ▶ 00:10 |
There are 4 pinhole cameras; A, B, C, and D. | ▶ 00:14 |
And they all have a left, center, and right side. | ▶ 00:17 |
Left, center, and right side. | ▶ 00:20 |
And you should observe, | ▶ 00:22 |
that the perceived order of features in the scene, | ▶ 00:24 |
by virtue of using a pinhole, | ▶ 00:28 |
will be inverted inside the pinhole camera. | ▶ 00:30 |
So camera 'A', sees in the left position, | ▶ 00:33 |
feature '1', | ▶ 00:35 |
on the center position, feature '2', | ▶ 00:36 |
on the right position, feature '3'. | ▶ 00:38 |
I would like to know, | ▶ 00:41 |
for which of the other camera's, | ▶ 00:42 |
is it the case that feature '3' | ▶ 00:44 |
will appear in the leftmost position? | ▶ 00:47 |
So the leftmost position is 'L' over here, | ▶ 00:51 |
'L' over here, 'L' over here. | ▶ 00:53 |
Assuming the optical centers, shown over here. | ▶ 00:55 |
Please check any or all of the following; | ▶ 00:59 |
Camera B, | ▶ 01:03 |
Camera C, | ▶ 01:04 |
Camera D, | ▶ 01:05 |
or None of them. | ▶ 01:06 |
My final question is a simplified self-driving car question, | ▶ 00:00 |
that is usually solved using dynamic programming. | ▶ 00:04 |
But I have to warn you, the state space shown here | ▶ 00:07 |
isn't the full state space. | ▶ 00:10 |
The orientation isn't really made explicit, in this state space. | ▶ 00:12 |
But suppose you have a road environment, | ▶ 00:16 |
that has a straight street over here, | ▶ 00:18 |
you can turn left, go straight, or turn right over here, | ▶ 00:20 |
and similarly you can turn left or right over here. | ▶ 00:23 |
And we assume that moving from one grid cell to the next | ▶ 00:26 |
has a cost of '1'. | ▶ 00:30 |
Turning left has a cost of '14'. | ▶ 00:31 |
And turning right has a cost of '1' as well. | ▶ 00:34 |
Let's assume the robot, when it turns, | ▶ 00:37 |
stays in the same grid cell, but it can only turn once. | ▶ 00:39 |
After it turns, it has to actually move. | ▶ 00:41 |
So it's impossible, for example, to turn right three times | ▶ 00:43 |
just to avoid the cost of a left turn. | ▶ 00:47 |
I would like to know, | ▶ 00:50 |
what is the minimum total cost | ▶ 00:52 |
of going from the start location, over here, | ▶ 00:55 |
to location 'A'. | ▶ 00:58 |
I realize that there are many ways to get there. | ▶ 00:59 |
I'd like to know the minimum. | ▶ 01:02 |
So what's the minimum cost to get to 'A', | ▶ 01:04 |
irrespective of what orientation you assume at 'A'? I don't really care. | ▶ 01:06 |
The same from the start location to 'B'. | ▶ 01:10 |
And from the start location to 'C'. | ▶ 01:13 |
Please enter your best guesses on the right side. | ▶ 01:16 |