SDSC4001 (Semester B, 2022)
Foundation of Reinforcement Learning
Assignment 2
All questions are weighted equally. For questions that require Python, please submit
the .py file rather than a screenshot of the code.
Question 1. Use the idea of the Bellman operator T_π to solve the following equation
v_π =
3/4 1/4 0
1/4 3/4 v_π,
where v_π ∈ R^3. By applying T_π, solve the above problem (in Python) with all combinations of
γ = 0.9, 0.999 and = 0.5, 0.01. How does convergence differ across these settings? Discuss your
results and please submit your Python code.
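As an illustration of what "applying T_π" can look like in code, here is a minimal sketch of the fixed-point iteration v ← r + γ P v. The matrix P and reward vector r below are hypothetical placeholders, not the data of Question 1, and the code assumes that the second parameter in the question is a stopping tolerance (called tol here).

```python
import numpy as np

def bellman_iterate(P, r, gamma, tol, max_iter=1_000_000):
    """Repeatedly apply T(v) = r + gamma * P v until the sup-norm change is below tol.

    Returns the final iterate and the number of applications of T.
    """
    v = np.zeros(len(r))
    for k in range(1, max_iter + 1):
        v_new = r + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, k
        v = v_new
    return v, max_iter

# Hypothetical placeholder data (NOT the matrix or rewards from the question).
P = np.array([[0.75, 0.25, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.25, 0.75]])
r = np.array([1.0, 2.0, 3.0])

for gamma in (0.9, 0.999):
    exact = np.linalg.solve(np.eye(3) - gamma * P, r)  # direct solve, for comparison
    for tol in (0.5, 0.01):
        v, k = bellman_iterate(P, r, gamma, tol)
        err = np.max(np.abs(v - exact))
        print(f"gamma={gamma}, tol={tol}: {k} iterations, max error {err:.4f}")
```

Since T_π is a γ-contraction in the sup norm, the number of iterations and the gap to the exact solution at the stopping point both depend on how close γ is to 1, which is what the comparison in the loop is meant to expose.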
Question 2. As a stock enthusiast, Tom decides to participate in the stock market and make
extremely risky investments every Friday.
He plans to start with an initial budget of $300, and he will stop when he either (i) loses it all
or (ii) doubles his money or more (i.e. ≥ $600).
On each Friday, he would invest in expiring option contracts that either (a) give him an
immediate reward at the end of that day or (b) lose all the money that he invested on that
day.
He has two strategies. On each Friday, he would choose either
Strategy A: Invest $100. With probability 0.45, it will return $200 (i.e. net gain $100),
and $0 (i.e. net gain −$100) otherwise.
Strategy B: Invest $100. With probability 0.4, it will return $300 (i.e. net gain $200),
and $0 (i.e. net gain −$100) otherwise.
His discount factor is γ = 1 (or you may use γ = 0.99999 as an approximation for computation).
Based on the above information,
(a) Model this problem as an MDP. Define the state space, action space, reward function, and the
transition kernel.
(b) Suppose Tom would choose one strategy (A or B) at the beginning and stick with this strategy
until this process stops. His good friend, Pete, claims that Strategy A is a better choice. Do
you agree with him? Explain your answer. You may use Python to help with the computations;
if you do, please submit your Python code as part of your submission and include
the code that prints/displays your result(s).
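One possible way to set up the computation for part (b) is sketched below. It rests on modelling choices that the question leaves open, so treat them as assumptions: the state is Tom's current money in units of $100, states 0 and ≥ 6 are terminal, γ = 1, and a strategy is scored by expected terminal wealth and by the probability of finishing at ≥ $600. The function name evaluate_strategy is hypothetical.

```python
import numpy as np

def evaluate_strategy(p_win, gain, start=3, target=6):
    """Evaluate a fixed strategy on the absorbing chain (money in $100 units).

    p_win : probability that the $100 stake pays off
    gain  : net gain (in $100 units) on a win; a loss always costs 1 unit
    Returns (expected terminal wealth, probability of ending at >= target).
    """
    n = target - 1                    # transient states 1 .. target-1
    Q = np.zeros((n, n))              # transitions among transient states
    exit_wealth = np.zeros(n)         # expected wealth collected when exiting upward
    exit_success = np.zeros(n)        # probability of exiting upward in one step
    for s in range(1, target):
        i = s - 1
        up, down = s + gain, s - 1
        if up >= target:              # win ends the process at wealth `up`
            exit_wealth[i] += p_win * up
            exit_success[i] += p_win
        else:
            Q[i, up - 1] += p_win
        if down > 0:                  # a loss to 0 is terminal with wealth 0
            Q[i, down - 1] += 1 - p_win
    A = np.eye(n) - Q                 # solve (I - Q) x = b for the transient states
    wealth = np.linalg.solve(A, exit_wealth)
    success = np.linalg.solve(A, exit_success)
    return wealth[start - 1], success[start - 1]

for name, p_win, gain in (("A", 0.45, 1), ("B", 0.40, 2)):
    w, p = evaluate_strategy(p_win, gain)
    print(f"Strategy {name}: expected final wealth ${100 * w:.2f}, "
          f"P(reach >= $600) = {p:.4f}")
```

Solving the linear system directly is equivalent to iterating the Bellman operator to convergence with γ = 1; either route should give the same numbers. Which strategy counts as "better" still depends on the reward function chosen in part (a).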
Note: Questions 3 and 4 require the book “Reinforcement Learning: An Introduction”
(2nd edition), which can be found here: http://incompleteideas.net/book/RLbook2018.pdf
Question 3. Choose either (a) or (b) below.
(a) Write Python code to reproduce the right figure in Example 6.2 (Random Walk) on page 125
from the book. Please submit your code.
(b) Write Python code to reproduce the lower figure in Example 6.6 (Cliff Walking) on page 132
from the book. Please submit your code.
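For option (a), a possible starting point is sketched below: tabular TD(0) on the five-state random walk, recording the RMS error against the true values 1/6, ..., 5/6 after each episode. This covers only the TD part of one learning curve, not the full figure; the function name td0_rms_curve and the default arguments are assumptions made for illustration.

```python
import numpy as np

TRUE_V = np.arange(1, 6) / 6.0        # true values of the five non-terminal states

def td0_rms_curve(alpha, episodes=100, rng=None):
    """Run TD(0) on the random walk and return the RMS error after each episode."""
    rng = np.random.default_rng() if rng is None else rng
    V = np.full(7, 0.5)               # indices 1..5 are states A..E
    V[0] = V[6] = 0.0                 # terminal states have value 0
    errors = []
    for _ in range(episodes):
        s = 3                         # every episode starts in the centre state
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == 6 else 0.0   # +1 only on the right exit
            V[s] += alpha * (reward + V[s_next] - V[s])  # undiscounted TD(0) update
            s = s_next
        errors.append(np.sqrt(np.mean((V[1:6] - TRUE_V) ** 2)))
    return np.array(errors)

# Average several independent runs to smooth the curve.
runs = [td0_rms_curve(0.05, rng=np.random.default_rng(seed)) for seed in range(20)]
print(np.mean(runs, axis=0)[[0, 9, 99]])   # RMS error after 1, 10 and 100 episodes
```

A Monte Carlo counterpart and a sweep over several step sizes would still be needed to reproduce the whole figure; plotting can be done with any plotting library.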
Question 4. Write Python code to reproduce Figure 13.1 on page 328 from the book (softmax
policy is used). Please submit your code.
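A rough sketch of the ingredients for Question 4 follows: the short-corridor environment from the book's Example 13.1 and a Monte Carlo policy-gradient (REINFORCE) update with a two-parameter softmax policy. The initial parameter vector of zeros, the step size used in the usage lines, and the helper names step, softmax and run_reinforce are assumptions for illustration; the book's figure may use different settings.

```python
import numpy as np

def step(state, go_right):
    """Short-corridor dynamics: state 1 reverses the action, state 3 is terminal."""
    if state == 1:
        go_right = not go_right
    return max(state + (1 if go_right else -1), 0)   # moving left in state 0 stays put

def softmax(theta):
    z = np.exp(theta - theta.max())   # subtract the max for numerical stability
    return z / z.sum()

def run_reinforce(alpha, episodes=1000, rng=None):
    """Return the total reward of each episode (reward is -1 per step, gamma = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(2)               # action preferences for [left, right] (assumed init)
    totals = []
    for _ in range(episodes):
        state, actions = 0, []
        while state != 3:
            a = rng.choice(2, p=softmax(theta))      # all states look identical
            actions.append(a)
            state = step(state, a == 1)
        T = len(actions)
        for t, a in enumerate(actions):
            G = -(T - t)              # return from time t onward
            grad = -softmax(theta)    # grad of ln pi(a) is onehot(a) - pi
            grad[a] += 1.0
            theta += alpha * G * grad
        totals.append(-T)
    return np.array(totals)

totals = run_reinforce(2 ** -13, rng=np.random.default_rng(0))
print("mean total reward, last 100 episodes:", totals[-100:].mean())
```

Reproducing the figure would additionally require averaging many independent runs for each step size and plotting the per-episode averages.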