Notes

CS886-1: Decision making under uncertainty

University of Waterloo
Winter 2010
Pascal Poupart
TTh 10:00 - 11:20
MC2036
Course website

1.  2010-01-07 (Jan. 7)

Road map

PmGraphViz

Modeling dimensions

  • Uncertainty
  • Time (T)
  • Decision making (D)
  • Partial observability (PO)
  • Learning (L)
  • Multi-agent (MA)

Computational dimensions

Note: Bayesian Networks (BN) are NP-hard, so all of these methods are at least NP-hard.
  1. Approximate reasoning
    • Additive and multiplicative decompositions
    • Monte Carlo techniques (sampling)
  2. Hierarchical modeling
    • Divide-and-conquer, making things more tractable
  3. Relational (first-order) modeling
    • Like logic vs. first-order logic
      • e.g. forall trucks drive(truck)
  4. Continuous modeling
    • Previously, all dicrete
    • Sometimes need continuous variables
      • e.g. time, state, action
  5. Bayesian learning
    • Similar to learning modeling dimension
  6. Kernel methods
  7. Multi-agent issues
    • Decentralized cooperative modeling and mechanism design

Models

Possible project: biological neurons as cooperative agents in a multi-agent MDP?

  1. (BN) Bayesian Networks
    • Compact representation for multi-variate distributions
  2. (DBN) Dynamic Bayesian Networks
    (HMM) Hidden Markov Models
    • Temporal distributions
    • Application: speech recognition
    • Application: behaviour recognition
    • Application: stock market predictions
    • Application: robot localization
  3. (DN) Decision Networks
    • Decision making under uncertainty
    • Application: recommender systems
    • Application: medical decision support
  4. (MDP) Markov Decision Processes
    • Sequential decision making
    • Inventory management
    • Dynamic pricing
    • Application: dynamic airline ticket prices
    • Application: robotic control
    • Application: autonomic computing
  5. (POMDP) Partially Observable MDPs
    • Application: assistive technologies
    • Application: spoken dialog systems
  6. (RL) Reinforcement Learning
    • Simultaneous learning and sequential decision making
    • Application: robotic control
  7. (MA-MDP) Multi-Agent MDPs
    • Sequential decision making in the presence of other agents
    • May be competitive (game theory) or cooperative (decentralized policies)
  8. (SG) Stochastic Games

2.  2010-01-12 (Jan. 12)

Slides
Text: [RN] Chapter 13, Sections 14.1-4

Super important probability rules:

  • Product rule:
  • Summing out rule:
  • Chain rule: (this generalizes to any number of variables)
  • Bayes' rule:

An example using these rules, from the slides:

  • Diseases: {malaria, cold, flu}; Symptom = fever
  • Need
    • But, something like is not natural because the disease is what causes the symptom.
    • This is also not stable; an epidemic would change the value.
  • So, we use Bayes' rule:
Independence:

Conditional independence (given z, in this case):

Most domains are not independent, but most have some conditional independence.

Bayesian networks are directed acyclic graphs plus combinational probability tables (CPTs).

  • Nodes = Variables
  • CPTs:
  • Family = Self and parents
  • Every is conditionally independent of all its non-descendants given its parents.

Note equations on Slide 18-19.

Slide 15:

  • 1) If Z is known, then X and Y are conditionally independent.
  • 2) If Z is known, then X and Y are conditionally independent.
  • 3) If Z or any of its descendants are known, then X and Y are dependent.

See Russel & Norvig chapters on Bayesian Networks.

Slide 23-24: the tables do not list probabilities.

Verify slide 27 by hand. Go over algorithm in slides 29-30.

3.  2010-01-14 (Jan. 14)

Slides
Text: [RN] Chapter 15

D-separation: open = dependent, blocked = independent

Potentials != probabilities

Finding by variable elimination ordering is NP-Hard (note from Pascal: "This is actually quite sad.")

Elimination ordering: lowest degree (i.e. number of neighbours) first (so, singly connected nodes would be first). Because there are loops, we can't guarantee linear time.

Dynamic Bayesian Networks:

  • Slide 7: CPT for each state the same
  • Slide 16: Why?
    • E.g.1: stock market, use prediction to tell if a day is start of boom or bust or whatever
    • E.g.2: Behaviour recognition, walker patients
  • Slide 17: Maxout = take summout and instead of adding, find the max
  • Slide 20: Because laser and sonar has evidence, path is blocked, so all get correlated
  • Slide 22: Position vs speed vs acceleration. If we add acceleration to the model, the process is now stationary.
    • Analogous to approximating by taking derivatives

4.  2010-01-19 (Jan. 19)

Slides
Text: [RN] Chapter 16

Today's dimension: decisions.

Slide 18 has some CPTs for reference.

In Bayesian Network influence diagrams, ellipse = chance node, rectangle = decision node, diamond = value node.

  • Arcs into decision nodes indicate pre-knowledge. We must know them (i.e. we can't have non-known inputs).
  • Decision nodes are totally ordered.
  • There's usually only one value (reward) node that depends only on the state of its parents.
  • Utilities aren't probabilities.

In generating policies, remember that the policy space is where n is the number of combinations. So, for 2 factors c and f you have (c,f) (c,~f) c,f) c,~f), then 16 possibilities for that (4 for each).

5.  2010-01-21 (Jan. 21)

Slides
Text: [RN] Sections 17.1-3
Text: [P] entire book

Decisions + time: Markov Decision Processes (MDPs)

  • Slide 8: Good overview
  • Slide 5: Because of the Markov principle. D-separation blocks S2.
  • Slide 11: Linear equations!
  • Slide 15: Another useful slide
    • Bellman's equations -- link with Linear Bellman Controllers?
  • Slide 17: Do as an exercise

Policy Iteration:

  1. Policy evaluation (as before -- what is the value?)
  2. Policy improvement

    This converges because there are a finite number of actions / policies.

Linear Programming:

  • must hold for the optimal policy.
    • can be any distribution that sums to 1 and is ; i.e.
  • So, in practice, P(1) = 1, all else = 0, and is subject to just the one thing.

MDP complexity in space and time is ; i.e. MDPs are polynomial.

6.  2010-01-26 (Jan. 26)

Slides
Text: [RN] Section 17.4-5

Partially Observable Markov Decision Processes (POMDPs)

Bayes theorem belief update = Bayes filtering = Kalman filtering?

For project: Neurons: actions are only spike or not spike.
Goal depends on neural circuit...
Possible circuit = object tracking with neural integrator?

Policy is mapping from beliefs to actions (like MDP except beliefs instead of explicit states).

Note that policy evaluation process is convex.

For most problems, we can't solve exactly; the process is P-space complete, which is worse than NP-complete. So we approximate.

Belief monitoring: like a belief lookahead?

2nd set, slide 17: like Bellman's equation.

7.  2010-01-28 (Jan. 28)

Slides
Text: [RN] Chapter 21

Temporal difference learning convergence proof:


8.  2010-02-02 (Feb. 2)

Slides
Text: [RN] Section 17.6

Multi-agent stuff

Slide 9: Strategy = policy from before

Slide 16: -i is the other player

Questions from project:
To what degree does a cooperative multi-agent algorithm match what neurons do?
Is it a game? Maybe competing systems, trying to get reward.
Go through the aspects of agent design, mechanism design.
Strategy = tuning curve?

9.  2010-02-04 (Feb. 4)

Multiplicative and additive decompositions for making things more tractible (approximations or sampling)

Types of approximate decompositions:

1. Multiplicative decomposition

Useful to approximate probability distributions.

More project ideas:
Use error instead of utility, or a function of error.

2. Additive decomposition

Useful to approximate utility functions.

Use KL-divergence instead of MSE in the NEF to get only positive weights?

Consider a dynamic bayesian network. Recall that X represents the state variables and O represents the observations.

In theory, all variables get correlated.

What are the best such that is minimal.

Use KL-Divergence to get only positive numbers that add up to 1 (since it's a probability distribution).

KL (Kullback-Leibler) Divergence

Let ; we're looking for the best 's to approximate .

  • This is a convex optimization problem! So, if we take the derivative and set it to 0 we can find a global minimum.

  • such that
  • is the Lagrange multiplier

Strong duality, so no gap; we just take the derivative and set it to 0.

  • In the first step, we're summing out all variables except
  • is like a normalization constant; hence, we have to set it to 1 because we're working with probabilities
  • is the marginal of the original distribution

What if is not a distribution? Then let .

Approximation:

Can we compute the factorization without knowing the joint distribution in the first place? (Will be discussed in the papers!)

In practice, we do not compute and approximate the next, we just compute each directly.

  • where is a normalization constant

Factor as you do the computation.

(Next week's paper explains how to do this efficiently.)

Additive decomposition

Markov decision process (policy evaluation)

  • where

This is the linear system we want to solve. Note that is an exponentially long vector.

Approximate to be .

Let be the euclidean distance. We're dealing with utilities, not probabilities, so we don't need KL-Divergence.

  • This is again convex because the quadratic terms are positive.
  • We'll call the subtraction that we're minimizing over

  • There are no constraints for this equation.

Now we can just solve the linear system (linear in .

This is our linear system -- same as in linear regression and the like.

10.  2010-02-09 (Feb. 9)

Topics for today:

  • Sampling!
  • Rejection sampling
  • Likelihood weighting
  • Importance sampling
  • Markov Chain Monte Carlo (MCMC)

Direct sampling (toy example):

PmGraphViz

What is ?

Steps:
  • Sample a value for A from : true
  • Sample a value for B from : false
  • Sample a value for C from : true
  • Sample a value for D from : false
A Pr(A)
T 0.4
F 0.6

However, if we have evidence, we fail.

Rejection sampling:

What is ?

We do direct sampling like before, but reject all samples that are inconsistent with the evidence.

So, we sample for A, B, C, D; then, if D = false, this is inconsistent, so we reject this sample.

What if:

B C D Pr(D|BC)
T T T 0.01
. . T 0
. . T 0.001
. . T 0.1
. . F 0.99
. . F 1
. . F 0.999
. . F 0.9

Then we're going to have to sample a lot. So this doesn't work.

We usually need exponential numbers of samples based on the number of pieces of evidence.

Likelihood weighting:

  • Sample values for the hidden variables.
  • Reweight each sample according to the probabilities of the evidence.
    • Hidden = no evidence
  • Sample A, B, C like before, but now instead of sampling D, we weight our sample by (e.g. 0.01 for the table above)
  • Looks better, but it turns out to be as bad as rejection sampling.
  • Not rejecting, but each sample is now only giving a fraction of a sample.
    • Essentially rejecting even though it doesn't reject it.

Importance sampling:

Let's say we are trying to estimate . If we sample directly from , then all samples have a weight of 1.

If instead we sample from another distribution , then we need to weight the samples:

In likelihood weighting, we have instead of . So, we get:

You need to have a good approximation of the sample you're choosing from.

Markov Chain Monte Carlo (MCMC) techniques:

  • Used a lot in practice.
  • Iterative technique to sample given a Markov chain.
  • A Markov chain is defined by a conditional distribution that can produce a chain (or sequence of states) by repeatedly sampling from .
  • If the Markov chain is ergotic (i.e. every state can be reached from every state and there are no strictly periodic cycles) then there is a unique distribution such that
    • where is a stationary distribution.
  • Hence, if we repeatedly sample from the sample approximates in the limit.

Generic MCMC:

  1. Select such that is the stationary distribution of .
    1. Set = {}
    2. For each t=1 to k, do:
      1. Sample from
      2. Set = Set union {}

Gibbs sampling for Bayes Networks:

  1. Set = {}
  2. For t = 1 to k do:
    1. For i = 1 to n do:
      1. Sample from the values we already have (supposedly)
      2. Set = Set union {}

For example, is the first state was <T,T,F,T>, then:

  • A ~ P(A|B=t, C=f, D=t)
  • B ~ P(B|A=f, C=f, D=t)
  • C ~ P(C|A=f, B=t, D=t)
  • D ~ P(D|A=f, B=t, C=f)

The second state is thus: <F,T,F,F>

Then keep going!

We can prove that this produces a that is the stationary distribution of .

11.  2010-02-11 (Feb. 11)

Gibbs Sampling will converge to the joint distribution of the Bayes Net!

Proof: show that satisfies the detailed balance condition, which is a sufficient condition for convergence.

Detailed balance: . (Go from one joint assignment to the next.)

Convergence follows because:

  • The left-hand side is the distribution at given that we started at .

For Gibbs sampling: satisfies detailed balance because:

Efficient Gibbs sampling:

Use the fact that is determined by its Markov blanket to simplify .

The Markov Blanket is the smallest set of variables such that knowledge of any other variable won't influence our belief about .

Then sample from this factor when we renormalize, or something like that.

Sequential Monte Carlo sampling:

(Also known as particle filtering, condensation algorithm, importance resampling, etc.)

Think about a Hidden Markov Model; . The process forgets over time, so we can do likelihood weighting -- with a twist!

  1. Start with some samples {} that approximates .
  2. For each {} sample according to . This approximates .
  3. Reweight each by .
  4. Resample some from the weighted sample of the previous step.

Bayes rule:

Or, in another way to look at it: