代做Instruction-Level Parallelism Homework #5代写留学生Matlab程序

2024.04.01 - 首页 >> Java编程

Homework #5

Instruction-Level Parallelism (ILP)

**** Remember to SHOW YOUR WORK to get credit, partial or full. ****

Problem 1: Single-Issue ILP - In-Order Issue & In-Order Execution

Assume a single-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles. (A picture for the architecture can be found at the end of this homework or in the appendix of your textbook.)

Now consider the MIPS code sequence below, where r#are integer registers and f#are floating-point registers.

Line #

01 loop: lw f12, 0(r8) 02

mul.s f12, f10, f12

03 div.s f18, f12, f10

04 lw f14, 0(r9)

05 add.s f14, f10, f14

06 add.s f20, f18, f12

07 sw f14, 0(r9)

08 addi r8, r8, #8

09 addi r9, r9, #8

10 sub r22, r4, r8

11 bnz r0, r22, loop # assume r0 = 0

Assume that the MIPS pipeline is in-order issue and in-order execution and that the branch is taken and has the usual one-cycle branch delay slot, answer the following:

(a) Clearly identify all the data hazards (and specify the type of data hazard and on what registers),

structural hazards, and control hazards that the in-order issue and in-order execution architecture must address.

(b) Show how many clock cycles that the code sequence takes to execute IF the loop is executed 10 times and then exits. (To answer this problem, show the execution pipeline as a function of time, at least for the first iteration. Suggestion: Use a Microsoft Excel spreadsheet or Google Sheets.)

Problem 2: Dual-Issue ILP - In-Order Issue & In-Order Execution

Assume a dual-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles:

Now consider the MIPS code sequence below, where r# are integer registers and f# are floating-point registers.

Line #

01 loop: lw f12, 0(r8) 02

mul.s f12, f10, f12

03 div.s f18, f12, f10

04 lw f14, 0(r9)

05 add.s f14, f10, f14

06 add.s f20, f18, f12

07 sw f14, 0(r9)

08 addi r8, r8, #8

09 addi r9, r9, #8

10 sub r22, r4, r8

11 bnz r0, r22, loop # assume r0 = 0

Assume that the MIPS pipeline is in-order issue and in-order execution and that the branch is taken and has the usual one-cycle branch delay slot. If the MIPS pipeline is dual-issue (i.e., two parallel ports to instruction memory) and contains dual ALUs in the EXE stage, how many clock cycles does the code sequence take to execute IF the loop is executed 10 times and then exits? (To answer this question, show the execution pipeline as a function of time, at least for the first iteration. Suggestion: Use Microsoft Excel or Google Sheets.)

Problem 3: Single-Issue ILP - Out-of-Order Execution

Now consider the following MIPS code sequence, where r#are integer registers and f#are floating-point registers. Assume r0 = 0.

Line #

01 lp: lw f12, 0(r8)

02 mul.s f12, f10, f12

03 div.s f18, f12, f10

04 lw f14, 0(r9)

05 add.s f14, f10, f14

06 add.s f20, f18, f12

07 sw f14, 0(r9)

08 addi r8, r8, #8

09 addi r9, r9, #8

10 sub r22, r4, r8

11 bnz r0, r22, lp

Assume that the single-issue MIPS pipeline is modified to allow for ~~in-orde~~r out-of-order issue and out-of- order execution, that the branch is taken and has the usual one-cycle branch delay slot, and that the availability of reservation stations and supporting architecture is as shown below:

Figure 1. Architecture for Tomasulo’s Algorithm

(a) Show the state of Tomasulo’s algorithm, using the sample table format below, when “Line 07: swf14,

0(r9)” has completely finished execution during the first iteration. Relative to sample table, consider the “Issue” phase as being completed when the IF and ID stages are done and the “Write Result” phase as being completed when the MEM and WB stages are done. That is, assume here that “Issue = IF + ID” and thus takes two stages and “Write Result = MEM + WB” and thus takes two stages.

Figure 2. Status Table for Tomasulo’s Algorithm

(b) Show how many clock cycles that the code sequence takes to execute IF the loop is executed 10 times and then exits. (To answer this question, show the execution pipeline as a function of time, at least for the first iteration.)

Problem 4: Dual-Issue ILP - Out-of-Order Execution

Assume a dual-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles:

Now consider the following MIPS code sequence, where r#are integer registers and f#are floating-point registers. Assume r0 = 0 .

Line #

01 lp: lw f12, 0(r8)

02 mul.s f12, f10, f12

03 div.s f18, f12, f10

04 lw f14, 0(r9)

05 add.s f14, f10, f14

06 add.s f20, f18, f12

07 sw f14, 0(r9)

08 addi r8, r8, #8

09 addi r9, r9, #8

10 sub r22, r4, r8

11 bnz r0, r22, lp

Assume that the MIPS pipeline is out-of-order issue and out-of-order completion and that the

branch is taken and has the usual one-cycle branch delay slot. Also, assume that the MIPS pipeline is dual- issue (i.e., two parallel ports to instruction memory and to data memory) and contains dual ALUs in the EXE stage.

(a) Assuming the availability of reservation stations and supporting architecture, as shown in Figure 1 of this homework assignment, show the state of Tomasulo’s algorithm, as shown in Figure 2 from this

homework assignment, when “Line 07: swf14, 0(r9)” has completely finished execution during the first iteration. Relative to Figure 2, consider the “Issue” phase as being completed when the IF and ID

stages are done and the “Write Result” phase as being completed when the MEM and WB stages are done.

(b) If the loop is executed 10 times and then exits, how many clock cycles does the code sequence take to execute? (To answer this question, show the execution pipeline as a function of time, at least for the first iteration.)

Problem 5: Instruction-Level Parallelism in gem5

This problem needs gem5to be installed in your userspace on rlogin , a la the instructions provided in HW #4.

Application Workloads

You are being provided the C code for three applications: matmul.c,nq.c, lts.c

1. matrix multiply (as in HW #4, but you need to modify the loop ordering to be the optimal i-k-j)

2. n-queens – an implementation of the eight-queens problem

(https://en.wikipedia.org/wiki/Eight_queens_puzzle)

3. lower triangular solve to perform. forward substitution

(https://en.wikipedia.org/wiki/Triangular_matrix#Forward_substitution)

Compilation

On rlogin.cs.vt.edu , modify (if needed) and compile the provided codes using the commands below: Matrix

Multiply:

gcc -O0 -ggdb3 -std=c99 -static matmul.c -o matmul86

N-queens:

gcc -O0 -ggdb3 -std=c99 -static nq.c -o nq86

Lower Triangular Solve:

gcc -O0 -ggdb3 -std=c99 -static lts.c -o lts86

For matrix multiply (matmul.c), ensure that the matrix size is set to 128 x 128 and loop order to i-k-j. For

n-queens (nq.c), ensure that the number of queens (n) is set to 10.

For lower triangular solve (lts.c), ensure that the matrix size is set to 128 x 128.

Memory Hierarchy

For this homework assignment, you will use the x86 memory hierarchy of the Intel Broadwell CPU, as you

did in HW #4 and as reproduced below for your convenience. Remember that you need to make appropriate changes in your caches.pyfile to set the respective cache sizes, associativity, and data latencies for the

Intel Broadwell CPU. Assume that tag, data, and response latencies within each cache level are the same.

a. Report the execution time (in simulated seconds and in simulated clock cycles so you gain a better

perspective on just how much difference the performance can be) for the three programs by setting the CPU (i.e., system.cpuin the configuration script) to each of the following three configurations, thus making a total of nine runtime measurement tests that you must run, i.e., 3 apps x 3 CPU configs = 9 combinations.

(1) X86TimingSimpleCPU() :

https://www.gem5.org/documentation/general_docs/cpu_models/SimpleCPU#timingsimplecpu (2) X86MinorCPU() : https://www.gem5.org/documentation/general_docs/cpu_models/minor_cpu

(3) X86O3CPU() : https://www.gem5.org/documentation/general_docs/cpu_models/O3CPU

Rename your stats.txtfiles after each runtime measurement test so that subsequent runs do not overwrite the results of the current run. See the “What to Submit” section for additional guidance. Use the Linux commands below to boost your productivity. 1. grep -rn "text" . ‣ This will check whether "text" is in any files located in the current directory (i.e., ".") 2. nohup ./your_scripts.sh & ‣ This is a way to run your script. ("./your_scripts.sh") so that it is immune to “hanging up” your connection. 3. diff a.txt b.txt ‣ diffprovides the differences between two files. ‣ Output: "<" indicates contents in the left file ("a.txt"), and ">" indicates contents in the right file ("b.txt")

b. For each workload:

1) Rigorously explain the performance differences (if any) between the TimingSimpleCPU

configuration and the MinorCPU configuration using qualitative analysis and quantitative

analysis (i.e., in addition to looking at simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences).

2) Rigorously explain the performance differences (if any) between the MinorCPU configuration

and the O3CPU configuration using qualitative analysis and quantitative analysis (i.e., in

addition to looking at simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences).

c. Which workload benefits the most when moving between the MinorCPU configuration and the O3CPU configuration? Substantiate your answer using quantitative analysis (i.e., in addition to looking at

simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences).

Problem 6: Branch Prediction in Instruction-Level Parallelism (ILP) in gem5

This problem needs gem5to be installed in your userspace on rlogin , a la the instructions provided in HW #4.

Application Workloads

You are being provided the C code for three applications: matmul.c,nq.c, lts.c

1. matrix multiply (as in HW #4, but you need to modify the loop ordering to be the optimal i-k-j)

2. n-queens - an implementation of the eight-queens problem

(https://en.wikipedia.org/wiki/Eight_queens_puzzle)

3. lower triangular solve to perform. forward substitution

(https://en.wikipedia.org/wiki/Triangular_matrix#Forward_substitution)

Compilation

On rlogin.cs.vt.edu , modify (if needed) and compile the provided codes using the commands below: Matrix

Multiply:

gcc -O0 -ggdb3 -std=c99 -static matmul.c -o matmul86

N-queens:

gcc -O0 -ggdb3 -std=c99 -static nq.c -o nq86

Lower Triangular Solve:

gcc -O0 -ggdb3 -std=c99 -static lts.c -o lts86

For matrix multiply (matmul.c), ensure that the matrix size is set to 128 x 128 and loop order to i-k-j. For

n-queens (nq.c), ensure that the number of queens (n) is set to 10.

For lower triangular solve (lts.c), ensure that the matrix size is set to 128 x 128.

CPU

Set the CPU to X86O3CPUin your configuration script in gem5( i.e., system.cpu = X86O3CPU() )

Memory Hierarchy

Use the x86 memory hierarchy of the Intel Broadwell CPU, as you did in HW #4 and as reproduced below for your convenience. Remember that you need to make appropriate changes in your caches.pyfile to set the respective cache sizes, associativity, and data latencies for the Intel Broadwell CPU. Assume that tag, data, and response latencies within each cache level are the same.

Branch Prediction

Evaluate the performance of three branch prediction policies: LocalBP, tournament, and TAGE.

Set the branch prediction policy using one of the lines below at any given time: system.cpu.branchPred =

LocalBP() # sets branch prediction to localBP

system.cpu.branchPred = TournamentBP() # sets branch prediction to tournament

system.cpu.branchPred = TAGE() # sets branch prediction to TAGE

Note:

Simulate each of the three workloads, i.e., matmul, nq, and lts, in gem5for each of the three types of branch

predictors – LocalBP, Tournament, and TAGE. Generate the stats.txtfiles for each simulation, which you will need to rename after each runtime measurement test so that subsequent runs do not overwrite the results of the current

run. See the “What to Submit” section for additional guidance. (You will have nine (9) stats.txtfiles, i.e., three (3) workloads running on each of the three branch predictors for a total of 9 combinations.)

(a) Report the following metrics from stats.txtin a table for all 9 combinations of workloads and branch predictors,

i. system.cpu.branchPred.condPredicted

ii. system.cpu.branchPred.condIncorrect

(b) Is there a predictor which consistently performs better than others in terms of

system.cpu.branchPred.condIncorrect?

(c) Rigorously explain the performance differences (if any) between the 9 different runtime measurement tests, both quantitatively and qualitatively. That is, in addition to looking at simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences.

Problem 7: Strawman Proposal for Research on ILP in gem5

What to Submit

• DOCX writeup for Problems 1 - 4 and Problem 7.

• ZIP Files for Problem 5 and 6 separately:

o caches.py

o matmul.c

o nq.c

o Its.c

o Each of the nine (9) stats.txtfiles, which you should rename appropriately after each runtime measurement test (otherwise your subsequent runs with overwrite the file). Specifically, rename each stats.txtoutput for each combination, as articulated in the enumerated list and table below:

1. stats-matmul-TimingSimpleCPU.txt

2. stats-matmul-MinorCPU.txt

3. stats-matmul-O3CPU.txt

4. stats-nq-TimingSimpleCPU.txt

5. stats-nq-MinorCPU.txt

6. stats-nq-O3CPU.txt

7. stats-lts-TimingSimpleCPU.txt

8. stats-lts-MinorCPU.txt

9. stats-lts-O3CPU.txt

1. stats-matmul-LocalBP.txt

2. stats-matmul-TournamentBP.txt

3. stats-matmul-TAGE.txt

4. stats-nq-LocalBP.txt

5. stats-nq-TournamentBP.txt

6. stats-nq-TAGE.txt

7. stats-lts-LocalBP.txt

8. stats-lts-TournamentBP.txt

9. stats-lts-TAGE.txt

The following MIPS 5-stage pipeline with Forwarding Unit and a Hazard Detection Unit deals with both data and control hazards, respectively. This processor writes to the register file in the first half of the clock cycle and reads from the register file in the second half of the clock cycle.

Specifications of supported MIPS assembly instructions:

add rd, rs, rt # GPR[rd] = GPR[rs] + GPR[rt] // Add

sub rd, rs, rt # GPR[rd] = GPR[rs] - GPR[rt] // Sub

or rd, rs, rt # GPR[rd] = GPR[rs] | GPR[rt] // Or

addi rt, rs, imm16 # GPR[rt] = GPR[rs] + SignExtImm // Add Immediate

lw rt, imm16(rs) # GPR[rt] = Mem[GPR[rs] + SignExtImm] // Load Word

sw rt, imm16(rs) # Mem[GPR[rs] + SignExtImm] = GPR[rt] // Store Word

beq rs, rt, label # if GPR[rs] == GPR[rt], jump to label // Branch On Equal