代做Instruction-Level Parallelism Homework #5代写留学生Matlab程序
- 首页 >> Java编程Homework #5
Instruction-Level Parallelism (ILP)
**** Remember to SHOW YOUR WORK to get credit, partial or full. ****
Problem 1: Single-Issue ILP - In-Order Issue & In-Order Execution
Assume a single-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles. (A picture for the architecture can be found at the end of this homework or in the appendix of your textbook.)
Now consider the MIPS code sequence below, where r#are integer registers and f#are floating-point registers.
Line #
01 loop: lw f12, 0(r8) 02
mul.s f12, f10, f12
03 div.s f18, f12, f10
04 lw f14, 0(r9)
05 add.s f14, f10, f14
06 add.s f20, f18, f12
07 sw f14, 0(r9)
08 addi r8, r8, #8
09 addi r9, r9, #8
10 sub r22, r4, r8
11 bnz r0, r22, loop # assume r0 = 0
Assume that the MIPS pipeline is in-order issue and in-order execution and that the branch is taken and has the usual one-cycle branch delay slot, answer the following:
(a) Clearly identify all the data hazards (and specify the type of data hazard and on what registers),
structural hazards, and control hazards that the in-order issue and in-order execution architecture must address.
(b) Show how many clock cycles that the code sequence takes to execute IF the loop is executed 10 times and then exits. (To answer this problem, show the execution pipeline as a function of time, at least for the first iteration. Suggestion: Use a Microsoft Excel spreadsheet or Google Sheets.)
Problem 2: Dual-Issue ILP - In-Order Issue & In-Order Execution
Assume a dual-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles:
Now consider the MIPS code sequence below, where r# are integer registers and f# are floating-point registers.
Line #
01 loop: lw f12, 0(r8) 02
mul.s f12, f10, f12
03 div.s f18, f12, f10
04 lw f14, 0(r9)
05 add.s f14, f10, f14
06 add.s f20, f18, f12
07 sw f14, 0(r9)
08 addi r8, r8, #8
09 addi r9, r9, #8
10 sub r22, r4, r8
11 bnz r0, r22, loop # assume r0 = 0
Assume that the MIPS pipeline is in-order issue and in-order execution and that the branch is taken and has the usual one-cycle branch delay slot. If the MIPS pipeline is dual-issue (i.e., two parallel ports to instruction memory) and contains dual ALUs in the EXE stage, how many clock cycles does the code sequence take to execute IF the loop is executed 10 times and then exits? (To answer this question, show the execution pipeline as a function of time, at least for the first iteration. Suggestion: Use Microsoft Excel or Google Sheets.)
Problem 3: Single-Issue ILP - Out-of-Order Execution
Assume a single-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles. (A picture for the architecture can be found at the end of this homework or in the appendix of your textbook.)
Now consider the following MIPS code sequence, where r#are integer registers and f#are floating-point registers. Assume r0 = 0.
Line #
01 lp: lw f12, 0(r8)
02 mul.s f12, f10, f12
03 div.s f18, f12, f10
04 lw f14, 0(r9)
05 add.s f14, f10, f14
06 add.s f20, f18, f12
07 sw f14, 0(r9)
08 addi r8, r8, #8
09 addi r9, r9, #8
10 sub r22, r4, r8
11 bnz r0, r22, lp
Assume that the single-issue MIPS pipeline is modified to allow for in-order out-of-order issue and out-of- order execution, that the branch is taken and has the usual one-cycle branch delay slot, and that the availability of reservation stations and supporting architecture is as shown below:
Figure 1. Architecture for Tomasulo’s Algorithm
(a) Show the state of Tomasulo’s algorithm, using the sample table format below, when “Line 07: swf14,
0(r9)” has completely finished execution during the first iteration. Relative to sample table, consider the “Issue” phase as being completed when the IF and ID stages are done and the “Write Result” phase as being completed when the MEM and WB stages are done. That is, assume here that “Issue = IF + ID” and thus takes two stages and “Write Result = MEM + WB” and thus takes two stages.
Figure 2. Status Table for Tomasulo’s Algorithm
(b) Show how many clock cycles that the code sequence takes to execute IF the loop is executed 10 times and then exits. (To answer this question, show the execution pipeline as a function of time, at least for the first iteration.)
Problem 4: Dual-Issue ILP - Out-of-Order Execution
Assume a dual-issue MIPS pipeline with five stages (IF, ID, EXE, MEM, WB), where the EXE stage for an instruction takes the following number of cycles:
Now consider the following MIPS code sequence, where r#are integer registers and f#are floating-point registers. Assume r0 = 0 .
Line #
01 lp: lw f12, 0(r8)
02 mul.s f12, f10, f12
03 div.s f18, f12, f10
04 lw f14, 0(r9)
05 add.s f14, f10, f14
06 add.s f20, f18, f12
07 sw f14, 0(r9)
08 addi r8, r8, #8
09 addi r9, r9, #8
10 sub r22, r4, r8
11 bnz r0, r22, lp
Assume that the MIPS pipeline is out-of-order issue and out-of-order completion and that the
branch is taken and has the usual one-cycle branch delay slot. Also, assume that the MIPS pipeline is dual- issue (i.e., two parallel ports to instruction memory and to data memory) and contains dual ALUs in the EXE stage.
(a) Assuming the availability of reservation stations and supporting architecture, as shown in Figure 1 of this homework assignment, show the state of Tomasulo’s algorithm, as shown in Figure 2 from this
homework assignment, when “Line 07: swf14, 0(r9)” has completely finished execution during the first iteration. Relative to Figure 2, consider the “Issue” phase as being completed when the IF and ID
stages are done and the “Write Result” phase as being completed when the MEM and WB stages are done.
(b) If the loop is executed 10 times and then exits, how many clock cycles does the code sequence take to execute? (To answer this question, show the execution pipeline as a function of time, at least for the first iteration.)
Problem 5: Instruction-Level Parallelism in gem5
This problem needs gem5to be installed in your userspace on rlogin , a la the instructions provided in HW #4.
Application Workloads
You are being provided the C code for three applications: matmul.c,nq.c, lts.c
1. matrix multiply (as in HW #4, but you need to modify the loop ordering to be the optimal i-k-j)
2. n-queens – an implementation of the eight-queens problem
(https://en.wikipedia.org/wiki/Eight_queens_puzzle)
3. lower triangular solve to perform. forward substitution
(https://en.wikipedia.org/wiki/Triangular_matrix#Forward_substitution)
Compilation
On rlogin.cs.vt.edu , modify (if needed) and compile the provided codes using the commands below: Matrix
Multiply:
gcc -O0 -ggdb3 -std=c99 -static matmul.c -o matmul86
N-queens:
gcc -O0 -ggdb3 -std=c99 -static nq.c -o nq86
Lower Triangular Solve:
gcc -O0 -ggdb3 -std=c99 -static lts.c -o lts86
For matrix multiply (matmul.c), ensure that the matrix size is set to 128 x 128 and loop order to i-k-j. For
n-queens (nq.c), ensure that the number of queens (n) is set to 10.
For lower triangular solve (lts.c), ensure that the matrix size is set to 128 x 128.
Memory Hierarchy
For this homework assignment, you will use the x86 memory hierarchy of the Intel Broadwell CPU, as you
did in HW #4 and as reproduced below for your convenience. Remember that you need to make appropriate changes in your caches.pyfile to set the respective cache sizes, associativity, and data latencies for the
Intel Broadwell CPU. Assume that tag, data, and response latencies within each cache level are the same.
a. Report the execution time (in simulated seconds and in simulated clock cycles so you gain a better
perspective on just how much difference the performance can be) for the three programs by setting the CPU (i.e., system.cpuin the configuration script) to each of the following three configurations, thus making a total of nine runtime measurement tests that you must run, i.e., 3 apps x 3 CPU configs = 9 combinations.
(1) X86TimingSimpleCPU() :
https://www.gem5.org/documentation/general_docs/cpu_models/SimpleCPU#timingsimplecpu (2) X86MinorCPU() : https://www.gem5.org/documentation/general_docs/cpu_models/minor_cpu
(3) X86O3CPU() : https://www.gem5.org/documentation/general_docs/cpu_models/O3CPU
Rename your stats.txtfiles after each runtime measurement test so that subsequent runs do not overwrite the results of the current run. See the “What to Submit” section for additional guidance. Use the Linux commands below to boost your productivity. 1. grep -rn "text" . ‣ This will check whether "text" is in any files located in the current directory (i.e., ".") 2. nohup ./your_scripts.sh & ‣ This is a way to run your script. ("./your_scripts.sh") so that it is immune to “hanging up” your connection. 3. diff a.txt b.txt ‣ diffprovides the differences between two files. ‣ Output: "<" indicates contents in the left file ("a.txt"), and ">" indicates contents in the right file ("b.txt")b. For each workload:
1) Rigorously explain the performance differences (if any) between the TimingSimpleCPU
configuration and the MinorCPU configuration using qualitative analysis and quantitative
analysis (i.e., in addition to looking at simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences).
2) Rigorously explain the performance differences (if any) between the MinorCPU configuration
and the O3CPU configuration using qualitative analysis and quantitative analysis (i.e., in
addition to looking at simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences).
c. Which workload benefits the most when moving between the MinorCPU configuration and the O3CPU configuration? Substantiate your answer using quantitative analysis (i.e., in addition to looking at
simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences).
Problem 6: Branch Prediction in Instruction-Level Parallelism (ILP) in gem5
This problem needs gem5to be installed in your userspace on rlogin , a la the instructions provided in HW #4.
Application Workloads
You are being provided the C code for three applications: matmul.c,nq.c, lts.c
1. matrix multiply (as in HW #4, but you need to modify the loop ordering to be the optimal i-k-j)
2. n-queens - an implementation of the eight-queens problem
(https://en.wikipedia.org/wiki/Eight_queens_puzzle)
3. lower triangular solve to perform. forward substitution
(https://en.wikipedia.org/wiki/Triangular_matrix#Forward_substitution)
Compilation
On rlogin.cs.vt.edu , modify (if needed) and compile the provided codes using the commands below: Matrix
Multiply:
gcc -O0 -ggdb3 -std=c99 -static matmul.c -o matmul86
N-queens:
gcc -O0 -ggdb3 -std=c99 -static nq.c -o nq86
Lower Triangular Solve:
gcc -O0 -ggdb3 -std=c99 -static lts.c -o lts86
For matrix multiply (matmul.c), ensure that the matrix size is set to 128 x 128 and loop order to i-k-j. For
n-queens (nq.c), ensure that the number of queens (n) is set to 10.
For lower triangular solve (lts.c), ensure that the matrix size is set to 128 x 128.
CPU
Set the CPU to X86O3CPUin your configuration script in gem5( i.e., system.cpu = X86O3CPU() )
Memory Hierarchy
Use the x86 memory hierarchy of the Intel Broadwell CPU, as you did in HW #4 and as reproduced below for your convenience. Remember that you need to make appropriate changes in your caches.pyfile to set the respective cache sizes, associativity, and data latencies for the Intel Broadwell CPU. Assume that tag, data, and response latencies within each cache level are the same.
Branch Prediction
Evaluate the performance of three branch prediction policies: LocalBP, tournament, and TAGE.
Set the branch prediction policy using one of the lines below at any given time: system.cpu.branchPred =
LocalBP() # sets branch prediction to localBP
system.cpu.branchPred = TournamentBP() # sets branch prediction to tournament
system.cpu.branchPred = TAGE() # sets branch prediction to TAGE
Note:
Rename your stats.txtfiles after each runtime measurement test so that subsequent runs do not overwrite the results of the current run. See the “What to Submit” section for additional guidance. Use the Linux commands below to boost your productivity. 1. grep -rn "text" . ‣ This will check whether "text" is in any files located in the current directory (i.e., ".") 2. nohup ./your_scripts.sh & ‣ This is a way to run your script. ("./your_scripts.sh") so that it is immune to “hanging up” your connection. 3. diff a.txt b.txt ‣ diffprovides the differences between two files. ‣ Output: "<" indicates contents in the left file ("a.txt"), and ">" indicates contents in the right file ("b.txt")Simulate each of the three workloads, i.e., matmul, nq, and lts, in gem5for each of the three types of branch
predictors – LocalBP, Tournament, and TAGE. Generate the stats.txtfiles for each simulation, which you will need to rename after each runtime measurement test so that subsequent runs do not overwrite the results of the current
run. See the “What to Submit” section for additional guidance. (You will have nine (9) stats.txtfiles, i.e., three (3) workloads running on each of the three branch predictors for a total of 9 combinations.)
(a) Report the following metrics from stats.txtin a table for all 9 combinations of workloads and branch predictors,
i. system.cpu.branchPred.condPredicted
ii. system.cpu.branchPred.condIncorrect
(b) Is there a predictor which consistently performs better than others in terms of
system.cpu.branchPred.condIncorrect?
(c) Rigorously explain the performance differences (if any) between the 9 different runtime measurement tests, both quantitatively and qualitatively. That is, in addition to looking at simulated seconds and cycles, “dig deeper” and look at other appropriate metrics from stats.txtthat go towards explaining the performance differences.
Problem 7: Strawman Proposal for Research on ILP in gem5
What to Submit
• DOCX writeup for Problems 1 - 4 and Problem 7.
• ZIP Files for Problem 5 and 6 separately:
o caches.py
o matmul.c
o nq.c
o Its.c
o Each of the nine (9) stats.txtfiles, which you should rename appropriately after each runtime measurement test (otherwise your subsequent runs with overwrite the file). Specifically, rename each stats.txtoutput for each combination, as articulated in the enumerated list and table below:
1. stats-matmul-TimingSimpleCPU.txt
2. stats-matmul-MinorCPU.txt
3. stats-matmul-O3CPU.txt
4. stats-nq-TimingSimpleCPU.txt
5. stats-nq-MinorCPU.txt
6. stats-nq-O3CPU.txt
7. stats-lts-TimingSimpleCPU.txt
8. stats-lts-MinorCPU.txt
9. stats-lts-O3CPU.txt
1. stats-matmul-LocalBP.txt
2. stats-matmul-TournamentBP.txt
3. stats-matmul-TAGE.txt
4. stats-nq-LocalBP.txt
5. stats-nq-TournamentBP.txt
6. stats-nq-TAGE.txt
7. stats-lts-LocalBP.txt
8. stats-lts-TournamentBP.txt
9. stats-lts-TAGE.txt
The following MIPS 5-stage pipeline with Forwarding Unit and a Hazard Detection Unit deals with both data and control hazards, respectively. This processor writes to the register file in the first half of the clock cycle and reads from the register file in the second half of the clock cycle.
Specifications of supported MIPS assembly instructions:
add rd, rs, rt # GPR[rd] = GPR[rs] + GPR[rt] // Add
sub rd, rs, rt # GPR[rd] = GPR[rs] - GPR[rt] // Sub
or rd, rs, rt # GPR[rd] = GPR[rs] | GPR[rt] // Or
addi rt, rs, imm16 # GPR[rt] = GPR[rs] + SignExtImm // Add Immediate
lw rt, imm16(rs) # GPR[rt] = Mem[GPR[rs] + SignExtImm] // Load Word
sw rt, imm16(rs) # Mem[GPR[rs] + SignExtImm] = GPR[rt] // Store Word
beq rs, rt, label # if GPR[rs] == GPR[rt], jump to label // Branch On Equal