Talk:Markov decision process

Notational inconsistency[edit]

To adhere to the notation that the article has established, I believe the formula

should be rewritten as

Have I got that wrong? If no-one comments to the contrary, I will make the change. 85.211.24.141 (talk) 14:12, 6 June 2020 (UTC)[reply]

Major rewrite and reorganization suggested[edit]

As someone who has worked on Optimal Control applications and studied Reinforcement Learning, Optimal Control, and hence Markov Decision Processes, I think this article is not well written even from an introductory viewpoint. It feels incomplete and quite random at times. I suggest a major rewrite and reorganization of the material, sticking more closely to: (1) Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto, or (2) Dynamic Programming and Optimal Control, Vol. I, 4th Edition, by Dimitri Bertsekas. — Preceding unsigned comment added by Okankoc (talkcontribs) 14:36, 2 February 2020 (UTC)[reply]

I disagree. As a newcomer, I found it a very high-quality article. I've fully understood the motivation and definition of MDPs and the idea of using dynamic programming to find the optimal assignment of actions to states. 85.211.24.141 (talk) 14:05, 6 June 2020 (UTC)[reply]
Well, it's much better than the reinforcement learning page. While I'm not a complete newcomer, it seems decent, though I think there are some mistakes that need to be fixed. — Preceding unsigned comment added by 76.116.14.33 (talk) 14:36, 19 July 2020 (UTC)[reply]

Untitled[edit]

It would also be nice to have a section on Semi-Markov Decision Processes. (This extension to MDP is particularly important for intrinsically motivated RL and temporal abstraction in RL.) Npdoty 01:33, 24 April 2006 (UTC)[reply]

It would be nice to hear about partially observable MDPs as well! --Michael Stone 22:59, 23 May 2005 (UTC)[reply]

Not to mention a link to Markov chain! I've been meaning to expand this article, but I'm trying to decide how best to do it. Anything that can go in Markov chain should go there, and only stuff specific to decision processes should go here, but there will need to be some frequent cross-reference. I think eventually POMDPs should get their own article, as well, which should similarly avoid duplicating material in Hidden Markov model. --Delirium 06:15, 13 November 2005 (UTC)[reply]

What is γ[edit]

The constant γ is used but never defined. What is it?

Well, at the first usage it's described as "discounting factor γ (usually just under 1)", which pretty much defines it - do you think it needs to be more prominent than that?
This is now fixed in the article
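For later readers, a quick sketch of where γ enters (assuming the article's reward notation R_a(s,s'), which may have drifted since this thread): the quantity being maximized is the expected discounted return

 E[ Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1}) ],   with 0 ≤ γ < 1,

so a reward arriving t steps in the future is weighted by γ^t, and γ < 1 keeps the infinite sum finite for bounded rewards; "just under 1" simply means the process is only mildly short-sighted.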

Invented by Howard?[edit]

"They were invented by Ronald A. Howard in 1960"

Is that right? Is "invent" the proper term? Also, weren't there works on MDPs (even if with other names) before 1960?

Stochastic games were already introduced in [1]. Since they are more general than MDPs, I would be surprised if MDPs had not been used even earlier than that.
  1. ^ Shapley, L. S. (1953). "Stochastic Games". Proceedings of the National Academy of Sciences 39 (10): 1095–1100.

—The preceding unsigned comment was added by Svensandberg (talkcontribs) 13:31, 9 January 2007 (UTC).[reply]

"Invent" may not be the right word. However, Howard's book was very important. In E. V. Denardo's book "Dynamic Programming" he does mention Shapley (1953) but adds "a lovely book by Howard (1960) highlighted policy iteration and aroused great interest in this model". So that book set off a lot of subsequent research. And it is still a classic. Feel free to replace the word "invent" with another more appropriate... Encyclops 22:58, 9 January 2007 (UTC)[reply]
I rewrote that a bit and added a reference to Bellman 1957 (which I found in Howard's book). OK? Svensandberg 16:31, 17 January 2007 (UTC)[reply]

What is R(s)?[edit]

It's probably the immediate reward received after transitioning from state s, but this is not explained in the article currently. There's only R_a(s,s').

To keep the equations simple, you could change the definition to use R(s) and refer to the Minor Extensions section. —Preceding unsigned comment added by 80.221.23.134 (talk) 11:47, 10 November 2008 (UTC)[reply]

Yeah, R(s) is the immediate reward for visiting state s, left over from an earlier version of the article that used R(s) and only mentioned R_a(s,s') as a variant. Fixed. Sabik (talk) 18:46, 22 August 2010 (UTC)[reply]
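To spell out how the two conventions relate (a sketch using the article's P_a(s,s') and discount γ; adjust if the notation has changed): with the transition reward R_a(s,s') the value of a policy π satisfies

 V(s) = Σ_{s'} P_{π(s)}(s,s') [ R_{π(s)}(s,s') + γ V(s') ],

whereas with a state-only reward R(s) it simplifies to

 V(s) = R(s) + γ Σ_{s'} P_{π(s)}(s,s') V(s').

The two forms coincide whenever R_a(s,s') does not actually depend on a or s', which is presumably why older revisions could switch between them.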

Finite state spaces?[edit]

Since P is not treated as a density, I assume A is a finite set, but this is not mentioned. Also, is S a finite set? Note that in Partially observable Markov decision process, both sets are said to be finite. Would be helpful to clarify this. I'm going to make the change, so please correct me if I'm wrong. --Rinconsoleao (talk) 15:12, 26 May 2009 (UTC)[reply]

Infinite state spaces[edit]

The technique for the discounted Markov decision process is valid for an infinite (denumerable) state space and other more general spaces. Shuroo (talk) 12:56, 5 June 2009 (UTC)[reply]

OK, I have added a clarifying note to that effect, but I have also added a 'citation needed' tag. Can you add a good reference book on this case? --Rinconsoleao (talk) 11:08, 6 June 2009 (UTC)[reply]
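In case it helps whoever supplies the citation, the usual argument as I understand it: for bounded rewards and 0 ≤ γ < 1 the Bellman operator

 (TV)(s) = max_a Σ_{s'∈S} P_a(s,s') [ R_a(s,s') + γ V(s') ]

is a γ-contraction in the sup norm on bounded functions over S, and nothing in that argument requires S to be finite, only countable (or, with sums replaced by integrals, a general measurable space under suitable conditions). The Puterman book suggested in the next section treats the countable case explicitly.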

References[edit]

Here is the best book; it won the ORSA Lanchester Prize a few weeks ago: Markov Decision Processes: Discrete Stochastic Dynamic Programming, M. L. Puterman, 1994, John Wiley & Sons, New York, NY. Shuroo (talk) 18:24, 6 June 2009 (UTC)[reply]

Another attractive and easier-to-read one is: Introduction to Stochastic Dynamic Programming by S. M. Ross, 1983, Academic Press. Shuroo (talk) 07:15, 7 June 2009 (UTC)[reply]

Clarifying question (especially for mathematicians)[edit]

A lot of economists apply dynamic programming to stochastic models, without using the terminology 'Markov decision process'. Which of the following, if any, is true?

(1) 'Markov decision process' is a synonym for 'stochastic dynamic programming'
(2) 'Markov decision processes' is a subfield of 'stochastic dynamic programming'
(3) 'Stochastic dynamic programming' is a subfield of 'Markov decision processes'
(4) None of the above is precisely true

Answers, comments, debate, references appreciated. --Rinconsoleao (talk) 11:14, 6 June 2009 (UTC)[reply]

I guess MDP is used for a discrete (finite or denumerable) state space. The dynamic programming technique can also be used for a continuous state space (e.g. Euclidean space) if the Markov property holds. However, I am not aware of the broadest sufficient condition for its validity. In any case, it seems to me that (2) would be correct if you write 'infinite horizon dynamic programming'. Shuroo (talk) 18:38, 6 June 2009 (UTC)[reply]

I vote for (2). MDP is SDP restricted to discrete time and a discrete state space. Encyclops (talk) 21:48, 6 June 2009 (UTC)[reply]
In my own reading, MDPs are not limited to discrete time or a discrete state space. Indeed, in Sutton & Barto's book (Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998), it is clearly said that the discrete description is chosen only for convenience, since it avoids using probability densities and integrals. So IMHO (3) is the right choice. G.Dupont (talk) 21:00, 10 August 2010 (UTC)[reply]
Wouldn't that mean (1) is the correct answer? Rinconsoleao (talk) 09:29, 11 August 2010 (UTC)[reply]

My answer is (4). They are obviously not the same thing, and I don't think the notion of "subfield" makes any sense. (How can you ever say "A" is a subfield of "B"? Once you identify A as interesting, even if you came from B, you already create an emphasis and aspect which distinguishes it from B.) 85.211.24.141 (talk) 14:16, 6 June 2020 (UTC)[reply]

Hamilton Jacobi Bellman not belong to MDP[edit]

These equations lead to optimal control for a DETERMINISTIC state transition, not a STOCHASTIC one. So it belongs to deterministic dynamic programming, not here. Shuroo (talk) 08:27, 1 July 2012 (UTC)[reply]
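For context, the equation being objected to is, roughly and in one common form (not necessarily the article's exact notation), the continuous-time condition

 ∂V/∂t + max_u [ r(x,u) + ∇_x V(x,t) · f(x,u) ] = 0,

where dx/dt = f(x,u) is a deterministic state equation; so this first-order form does come from continuous-time deterministic optimal control rather than the discrete stochastic setting of the rest of the article, which is the point being made above.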

Competitive Markov Decision Processes?[edit]

I was wondering if anyone would mind adding a section or page about competitive Markov Decision Processes? I do not know much about it, but I believe it is just like an MDP with multiple decision makers. — Preceding unsigned comment added by 150.135.222.234 (talk) 00:53, 22 March 2013 (UTC)[reply]

Partial observability[edit]

Is the claim that Burnetas and Katehakis' paper was a major advance in this area supportable? It may be a nice paper but the rest of the paragraph does not make the impact clear and uses some unexplained jargon. Is this more of an advance than the Witness algorithm[1], for example (which isn't mentioned)? Jbrusey (talk) 15:20, 28 January 2016 (UTC)[reply]

Stationary[edit]

In one section the word stationary is used to describe a type of policy, but the word is never defined (and, indeed, stationary policies are probably out of scope).

On the other hand, the stationarity property of MDPs should probably be discussed somewhere. — Preceding unsigned comment added by Cokemonkey11 (talkcontribs) 12:19, 25 April 2016 (UTC)[reply]
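For whoever picks this up, the usual definition (hedging, since the article's notation may differ): a policy is stationary when it depends only on the current state and not on the time step, i.e. a single map π: S → A (or from S to distributions over A) applied at every t, rather than a time-indexed family π_0, π_1, π_2, .... The connection to the article is that for infinite-horizon discounted MDPs an optimal stationary policy exists, which is presumably why the algorithms described there only search over such policies.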

Dr. Johansen's comment on this article[edit]

Dr. Johansen has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:


I have read the note and it sounds very convincing. I am not an expert in the decision theory, si I cannot recommend another reeviewer


We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Johansen has expertise on the topic of this article, since he has published relevant scholarly research:


  • Reference: Søren Johansen & Bent Nielsen, 2014. "Outlier detection algorithms for least squares time series regression," Economics Papers 2014-W04, Economics Group, Nuffield College, University of Oxford.

ExpertIdeasBot (talk) 16:24, 11 July 2016 (UTC)[reply]

Sounds like Dr Johansen could do with a spelling checker (I'd also advise him to concentrate on comments whose subject matter he knows something about). 85.211.24.141 (talk) 14:08, 6 June 2020 (UTC)[reply]

What is ?[edit]

What is in the central formula of Section "Problem"? — Preceding unsigned comment added by 176.207.10.78 (talk) 17:12, 16 November 2019 (UTC)[reply]

Policy iteration steps mislabeled[edit]

The value and policy steps/equations are not labeled, so it looks like the value step is step 1 and the policy step is step 2. I believe this is the normal order of things: first the algorithm updates the value function, and then it computes the new policy. The problem is that the descriptions of the algorithms have them backwards. For example, the policy iteration algorithm talks about how repeating step 2 can be replaced by solving the linear equations, but the linear equations are the value equations, which come from step 1. It looks like this is a consistent error. — Preceding unsigned comment added by Mesterharm (talkcontribs) 14:58, 19 July 2020 (UTC)[reply]
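For anyone checking this against the article, here is a minimal NumPy sketch of the ordering I believe is standard (evaluation first, then improvement); the array layout P[a, s, s'], R[a, s, s'], the function name, and the default γ are my own placeholders, not the article's:

 import numpy as np

 # Minimal policy-iteration sketch for a finite MDP (shapes are assumptions, not the article's):
 #   P[a, s, s2] = probability of moving s -> s2 under action a, shape (A, S, S)
 #   R[a, s, s2] = immediate reward for that transition, shape (A, S, S)
 def policy_iteration(P, R, gamma=0.95):
     A, S, _ = P.shape
     policy = np.zeros(S, dtype=int)                  # arbitrary initial policy
     while True:
         # Step 1 -- policy evaluation: solve the linear system V = r_pi + gamma * P_pi V.
         P_pi = P[policy, np.arange(S)]               # (S, S) transition matrix under the current policy
         r_pi = (P_pi * R[policy, np.arange(S)]).sum(axis=1)    # expected one-step reward in each state
         V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
         # Step 2 -- policy improvement: act greedily with respect to the evaluated V.
         Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)   # (A, S) action values
         new_policy = Q.argmax(axis=0)
         if np.array_equal(new_policy, policy):       # policy stable, hence optimal
             return policy, V
         policy = new_policy

With this labelling, the linear-equation shortcut belongs to step 1 (evaluation), which matches the point above that the article's descriptions appear to be swapped.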