
CSCI 5512: Artificial Intelligence II

(Spring 2024)

Homework 3

(Due Thu, Mar 28, 11:59 PM CT)

In this assignment, we will implement a few reinforcement learning algorithms to compute value functions and policies for a given Markov Decision Process (MDP). Recall that an MDP specifies a set of actions A, a set of states S, state-transition probabilities P(s'|s, a), and a reward function R(s, a, s'). We will consider the MDP specified in Figure 1. The set of actions is A = {up, down, left, right}. The set of states S consists of the empty squares, i.e., S = {(column, row)} \ (2, 2) for column ∈ {1, 2, 3, 4} and row ∈ {1, 2, 3} (in other words, all squares are states except square (2, 2), which is not a valid state and is not in S). The transition probabilities are specified in Figure 1(b): with probability 0.8 the desired action takes the agent into the next state in that direction, with probability 0.1 the agent instead moves to the left of that direction, and with probability 0.1 the agent moves to the right of that direction. The reward function is defined by the amount of reward given to the agent when moving to another state: if the next state is non-terminal, the agent receives a reward of r (specified below), and if the next state is a terminal state, either (4, 3) or (4, 2), the agent receives a reward of +1 or −1, respectively. For moves into a non-terminal state, we will consider three different rewards: (i) r = −2, (ii) r = −0.2, and (iii) r = −0.01.

Figure 1: Grid world MDP.
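Before the individual problems, it may help to fix one concrete encoding of these dynamics. The Python sketch below is one possible layout, not a required design; the names ACTIONS, MOVES, SLIPS, STATES, TERMINALS, next_state, transitions, and reward are illustrative choices, and states are stored as (column, row) tuples to match the required output format.

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# Perpendicular "slip" directions for each intended action.
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

# All squares are states except the blocked square (2,2); (4,3) and (4,2) are terminal.
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

def next_state(s, direction):
    # Deterministic move; bumping into a wall or the blocked square leaves the agent in place.
    dc, dr = MOVES[direction]
    cand = (s[0] + dc, s[1] + dr)
    return cand if cand in STATES else s

def transitions(s, a):
    # (probability, next state) pairs: 0.8 intended direction, 0.1 for each perpendicular slip.
    side1, side2 = SLIPS[a]
    return [(0.8, next_state(s, a)),
            (0.1, next_state(s, side1)),
            (0.1, next_state(s, side2))]

def reward(s_next, r_nonterminal):
    # Reward is collected on entering s_next: +1/-1 for the terminal squares, r otherwise.
    return TERMINALS.get(s_next, r_nonterminal)

Because the two slip probabilities are equal (0.1 each), it does not matter which perpendicular direction is treated as "left of" and which as "right of" the intended move.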

You will implement three algorithms: value iteration, policy iteration, and on-policy first-visit Monte Carlo control, and apply them to the MDP for the three different values of the non-terminal reward r specified above. Your code must print out the value function (state-value v(s) or action-value q(s, a)) and policy π of each state, on separate lines, in the form (column, row): value, policy, where row ∈ {1, 2, 3}, column ∈ {1, 2, 3, 4}, value ∈ R, and policy ∈ {up, down, left, right}. Note that the value function and policy have to be stated for 9 non-terminal states, since (2, 2) is not a valid state, and (4, 2) and (4, 3) are terminal states. Example output is:

State: State-values v(s), Policy

(1,1): 0.71, up

(2,1): 0.65, left

...

(3,3): 0.92, right

1. (30 points) In this problem, you will need to implement the Value Iteration (VI) algorithm to compute the optimal state-value function v∗ and policy π∗. Write Python code (from scratch) to implement the Value Iteration algorithm in file mdpVI.py and run your code for each of the three reward functions above. Your code must take exactly two arguments: reward which determines the reward for every non-terminal state and γ ∈ (0, 1) which is the discount factor. For testing, you can set γ = 0.9. The output should be the state-value and policy of each state as specified above.

Sample input when r = −2 and γ = 0.9: $python mdpVI.py -2 0.9.
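A minimal sketch of how mdpVI.py could be organized is given below, assuming the grid-world helpers (ACTIONS, STATES, TERMINALS, transitions, reward) from the sketch after Figure 1 are available in the same file; the convergence tolerance theta = 1e-6 and the in-place sweep order are arbitrary choices, not requirements.

import sys

def value_iteration(r_nonterminal, gamma, theta=1e-6):
    # Repeated Bellman optimality backups until the largest value change is below theta.
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue  # terminal values stay 0; their reward is collected on entry
            best = max(sum(p * (reward(s2, r_nonterminal) + gamma * V[s2])
                           for p, s2 in transitions(s, a))
                       for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract a greedy policy from the converged values.
    policy = {}
    for s in STATES:
        if s not in TERMINALS:
            policy[s] = max(ACTIONS, key=lambda a: sum(
                p * (reward(s2, r_nonterminal) + gamma * V[s2])
                for p, s2 in transitions(s, a)))
    return V, policy

if __name__ == "__main__":
    r, gamma = float(sys.argv[1]), float(sys.argv[2])
    V, policy = value_iteration(r, gamma)
    print("State: State-values v(s), Policy")
    for s in sorted(policy):
        print(f"({s[0]},{s[1]}): {V[s]:.2f}, {policy[s]}")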

2. (35 points) In this problem, you will need to implement the Policy Iteration (PI) algorithm to compute the optimal state-value function v∗ and policy π∗. Write Python code (from scratch) to implement the Policy Iteration algorithm in file mdpPI.py and run your code for each of the three reward functions above. Your code must take exactly two arguments: reward which determines the reward for every non-terminal state and γ ∈ (0, 1) which is the discount factor. For testing, you can set γ = 0.9. The output should be the state-value and policy of each state as specified above.

Sample input when r = −2 and γ = 0.9: $python mdpPI.py -2 0.9.
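Similarly, one possible structure for mdpPI.py is sketched below, again assuming the grid-world helpers from the sketch after Figure 1; the initial policy (all "up") and the evaluation tolerance theta = 1e-8 are arbitrary choices.

import sys

def policy_iteration(r_nonterminal, gamma, theta=1e-8):
    # Alternate iterative policy evaluation with greedy policy improvement until stable.
    nonterminal = [s for s in STATES if s not in TERMINALS]
    policy = {s: "up" for s in nonterminal}   # arbitrary initial policy
    V = {s: 0.0 for s in STATES}

    def q(s, a):
        # Expected one-step return of taking action a in state s under the current V.
        return sum(p * (reward(s2, r_nonterminal) + gamma * V[s2])
                   for p, s2 in transitions(s, a))

    while True:
        # Policy evaluation: iterate Bellman expectation backups to tolerance theta.
        while True:
            delta = 0.0
            for s in nonterminal:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in nonterminal:
            best_a = max(ACTIONS, key=lambda a: q(s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return V, policy

if __name__ == "__main__":
    r, gamma = float(sys.argv[1]), float(sys.argv[2])
    V, policy = policy_iteration(r, gamma)
    print("State: State-values v(s), Policy")
    for s in sorted(policy):
        print(f"({s[0]},{s[1]}): {V[s]:.2f}, {policy[s]}")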

3. (35 points) In this problem, you will need to implement the on-policy first-visit Monte Carlo (MC) control algorithm to compute the optimal action-value function q∗ and policy π∗. Write Python code (from scratch) to implement the first-visit MC control algorithm in file mdpMC.py and run your code for each of the three reward functions above. Your code must take exactly three arguments: reward which determines the reward for every non-terminal state, γ ∈ (0, 1) which is the discount factor, and ε ∈ [0, 1] which is the exploration parameter for the ε-greedy policy. For testing, you can set γ = 0.9 and ε = 0.1. The output should be the action-value and policy of each state (in a format similar to that specified above), for example:

State: Action-values q(s, a), Policy

(1,1): up 0.71, down 0.23, left 0.23, right 0.58, up

(2,1): up 0.50, down 0.50, left 0.65, right 0.59, left

...

(3,3): up 0.86, down 0.62, left 0.79, right 0.92, right

Sample input when r = −2, γ = 0.9, and ε = 0.1: $python mdpMC.py -2 0.9 0.1.
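One possible structure for mdpMC.py is sketched below, again assuming the grid-world helpers from the sketch after Figure 1; the episode count, the uniformly random start state, and the incremental-mean update for Q are illustrative choices rather than requirements.

import random
import sys

def sample_step(s, a, r_nonterminal):
    # Sample one environment transition according to the 0.8/0.1/0.1 model.
    outcomes = transitions(s, a)
    u, acc = random.random(), 0.0
    for p, s2 in outcomes:
        acc += p
        if u < acc:
            return s2, reward(s2, r_nonterminal)
    s2 = outcomes[-1][1]
    return s2, reward(s2, r_nonterminal)

def mc_control(r_nonterminal, gamma, epsilon, episodes=50_000):
    # On-policy first-visit MC control with an epsilon-greedy behaviour policy.
    nonterminal = [s for s in STATES if s not in TERMINALS]
    Q = {(s, a): 0.0 for s in nonterminal for a in ACTIONS}
    counts = {key: 0 for key in Q}

    def greedy(s):
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        # Generate an episode from a uniformly random non-terminal start state.
        s = random.choice(nonterminal)
        episode = []
        while s not in TERMINALS:
            a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
            s2, rew = sample_step(s, a, r_nonterminal)
            episode.append((s, a, rew))
            s = s2
        # Compute first-visit returns by scanning the episode backwards; the last write
        # for each (state, action) pair corresponds to its first visit in the episode.
        G = 0.0
        first_visit_return = {}
        for s, a, rew in reversed(episode):
            G = gamma * G + rew
            first_visit_return[(s, a)] = G
        for key, ret in first_visit_return.items():
            counts[key] += 1
            Q[key] += (ret - Q[key]) / counts[key]   # incremental mean of first-visit returns

    policy = {s: greedy(s) for s in nonterminal}
    return Q, policy

if __name__ == "__main__":
    r, gamma, epsilon = float(sys.argv[1]), float(sys.argv[2]), float(sys.argv[3])
    Q, policy = mc_control(r, gamma, epsilon)
    print("State: Action-values q(s, a), Policy")
    for s in sorted(policy):
        q_str = ", ".join(f"{a} {Q[(s, a)]:.2f}" for a in ACTIONS)
        print(f"({s[0]},{s[1]}): {q_str}, {policy[s]}")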

Instructions

You must complete this homework assignment individually. You may discuss the homework at a high-level with other students but make sure to include the names of the students in your README file. You may not use any AI tools (like GPT-3, ChatGPT, Copilot, etc.) to complete the homework. Code can only be written in Python 3.10+; no other programming languages will be accepted. One should be able to execute all programs from the Python command prompt or terminal. Make sure to include a requirements.txt, yaml, or other files necessary to set up your environment. Please specify instructions on how to run your program in the README file.

Each function must take the inputs in the order specified in the problem and display the textual output via the terminal; plots/figures, if any, should be included in the PDF report.

In your code, you may use libraries for basic matrix computations and plotting such as numpy, pandas, and matplotlib. Put comments in your code so that one can follow the key parts and steps in your code.

Follow the rules strictly. If we cannot run your code, you will not get any credit.

Things to submit

1. [YOUR NAME] hw3 solution.pdf: A document which contains solutions to all problems.

2. Python code for Problem 1 (must include the required mdpVI.py file).

3. Python code for Problem 2 (must include the required mdpPI.py file).

4. Python code for Problem 3 (must include the required mdpMC.py file).

5. README.txt: README file that contains your name, student ID, email, instructions on how to run your code, any assumptions you are making, and any other necessary details.

6. Any other files, except the data (if applicable), which are necessary for your code.

Homework Policy.   (1) You are encouraged to collaborate with your classmates on homework problems at a high level only. Each person must write up the final solutions individually. You need to list in the README.txt which problems were a collaborative effort and with whom. Please refer to the syllabus for more details. (2) Regarding online resources, you should not:

• Google around for solutions to homework problems,

• Ask for help online,

• Look up things/post on sites like Quora, StackExchange, etc.




