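Forwarding: One Step Ahead

A Fierce Struggle

Are we done? Not yet.

The program in the previous test had no data hazards at all, which almost never happens in real code. Let's change it slightly: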

addi x1,x0,1
addi x2,x0,2
addi x3,x0,3
addi x4,x0,4
addi x1,x0,7
addi x2,x0,21
add x7,x1,x2 # introduces a data hazard

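Then run it. What happened? We expect x7 to hold the sum of the updated x1 and x2, which is 28. Instead, the result is 3, the sum of the stale values 1 and 2!

Figure: Wrong output caused by the data hazard

This is the classic read-after-write (RAW) data hazard. Let's look at what happened.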

Cycle  x1  x2  x7   IF          ID             EX          MEM         WB
-------------------------------------------------------------------------------
1      1   2   x    x1=7
2      1   2   x    x2=21       x1=7
3      1   2   x    x7=x1+x2    x2=21          x1=7
4      1   2   x                x7=x1+x2 [R]   x2=21       x1=7
5      7   2   x                               x7=x1+x2    x2=21       x1=7 [W]
6      7   21  x                                           x7=x1+x2    x2=21 [W]
7      7   21  3                                                       x7=3 [W]

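This is the pipeline diagram ([R] marks the register read, [W] the write-back). Because a result is only written back in the WB stage, the third instruction reads stale values in the ID stage. The source registers are not fully updated until the end of cycle 6.

Bubbles and Stalls

To keep the computation from reading wrong values, we can wait. Time will tell. While the earlier results are still in flight, the decode stage simply waits until it can read the correct values from the register file. This requires a check: if two instructions have a data dependency, raise a signal. The program counter in the IF stage uses that signal to decide whether to hold, and the ID/EX pipeline register uses it to flush.

As the example above shows, any two instructions no more than three apart can have a data hazard, so we compare the ID-stage source registers against the destination registers in EX, MEM, and WB. Remember to check that the destination is not x0: writes to x0 are silently discarded, so they never create a dependency. Then define one more signal to control the flush: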

always_comb begin
    // Producer is one stage ahead (EX); ignore writes to x0
    RAW_1_rD1 = (wR_EX == rR1_ID) && rf_we_EX && rs1_used_ID && (wR_EX != 5'b0);
    RAW_1_rD2 = (wR_EX == rR2_ID) && rf_we_EX && rs2_used_ID && (wR_EX != 5'b0);
    // Producer is two stages ahead (MEM)
    RAW_2_rD1 = (wR_MEM == rR1_ID) && rf_we_MEM && rs1_used_ID && (wR_MEM != 5'b0);
    RAW_2_rD2 = (wR_MEM == rR2_ID) && rf_we_MEM && rs2_used_ID && (wR_MEM != 5'b0);
    // Producer is three stages ahead (WB)
    RAW_3_rD1 = (wR_WB == rR1_ID) && rf_we_WB && rs1_used_ID && (wR_WB != 5'b0);
    RAW_3_rD2 = (wR_WB == rR2_ID) && rf_we_WB && rs2_used_ID && (wR_WB != 5'b0);
end
// Flush signal generation
always_comb begin
    RAW_flush_ID = RAW_1_rD1 || RAW_2_rD1 || RAW_3_rD1
                || RAW_1_rD2 || RAW_2_rD2 || RAW_3_rD2;
end
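
The six comparisons above all have the same shape, so the condition can be read as one reusable check. The following is only a minimal sketch of that idea; the module name raw_check_sketch and its port names are hypothetical and not part of the original design.

```systemverilog
// Hypothetical helper: one RAW check between an older instruction's
// destination register and one source register of the instruction in ID.
module raw_check_sketch (
    input  logic [4:0] wR,       // destination register of the older instruction
    input  logic       rf_we,    // older instruction really writes the register file
    input  logic [4:0] rR,       // source register index read in ID
    input  logic       rs_used,  // instruction in ID really reads this source
    output logic       raw_hit   // 1 when a RAW dependency exists
);
    // A hazard needs a real write, a real read, matching indices,
    // and a destination other than x0 (writes to x0 are discarded).
    assign raw_hit = rf_we && rs_used && (wR == rR) && (wR != 5'd0);
endmodule
```

Without the x0 term, an instruction such as addi x0, x0, 100 followed by add x30, x0, x0 would be reported as a dependency even though x0 never changes.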

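Then wire it into the corresponding places in the CPU top-level module. Let's test it!

Figure: Correct result after stalling the pipeline

The pipeline stalls for quite a while, but at least the result is correct.

Better to Stride Forward Than to Stand Still

Stalling does resolve data hazards, but it is too inefficient. Instead, we can send the result computed in the EX stage (and the results still waiting in MEM and WB) straight back to the earlier stage that needs it. The signals above can be reused after renaming.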

// Forwarding data selection
// Priority: EX > MEM > WB
always_comb begin
    // Forwarding data for source operand 1
    if (RAW_1_rD1) fwd_rD1_ID = rf_wd_EX;       // from the EX stage
    else if (RAW_2_rD1) fwd_rD1_ID = rf_wd_MEM; // from the MEM stage
    else if (RAW_3_rD1) fwd_rD1_ID = rf_wd_WB;  // from the WB stage
    else fwd_rD1_ID = 32'b0;

    // Forwarding data for source operand 2
    if (RAW_1_rD2) fwd_rD2_ID = rf_wd_EX;       // from the EX stage
    else if (RAW_2_rD2) fwd_rD2_ID = rf_wd_MEM; // from the MEM stage
    else if (RAW_3_rD2) fwd_rD2_ID = rf_wd_WB;  // from the WB stage
    else fwd_rD2_ID = 32'b0;
end
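
The EX > MEM > WB order matters when several in-flight instructions write the same register: the youngest producer, the one in EX, holds the value the program expects. Below is a minimal sketch of that selection for a single operand. The names fwd_select_sketch, hit_*, wd_* and rf_data are illustrative assumptions, and the sketch folds the fall-back to the register file value into the same mux, which the original code does separately in the ID/EX register.

```systemverilog
// Hypothetical single-operand forwarding mux with explicit priority.
module fwd_select_sketch (
    input  logic        hit_EX,   // RAW match against the instruction in EX
    input  logic        hit_MEM,  // RAW match against the instruction in MEM
    input  logic        hit_WB,   // RAW match against the instruction in WB
    input  logic [31:0] wd_EX,    // pending write data in EX
    input  logic [31:0] wd_MEM,   // pending write data in MEM
    input  logic [31:0] wd_WB,    // pending write data in WB
    input  logic [31:0] rf_data,  // value read from the register file
    output logic [31:0] operand   // value handed on to the ID/EX register
);
    always_comb begin
        if      (hit_EX)  operand = wd_EX;    // youngest producer wins
        else if (hit_MEM) operand = wd_MEM;
        else if (hit_WB)  operand = wd_WB;
        else              operand = rf_data;  // no hazard: use the register file
    end
endmodule
```

Take a chain such as addi x24, x0, 1 / addi x24, x24, 2 / addi x24, x24, 3 / addi x24, x24, 4. When the last instruction is in ID, EX holds 6, MEM holds 3 and WB holds 1, so all three hit flags are set and only the EX value gives the correct final result of 10.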

Then feed the forwarding enables and data into the ID/EX pipeline register:

// Forwarded signals and data
logic [31:0] rD1_forwarded, rD2_forwarded;
// With forwarding in place, the flush is no longer needed
always_comb begin
    rD1_forwarded = fwd_rD1e_ID ? fwd_rD1_ID : rD1_i;
    rD2_forwarded = fwd_rD2e_ID ? fwd_rD2_ID : rD2_i;
end

// Register file data
always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        rD1_o <= 32'b0;
        rD2_o <= 32'b0;
    end else begin
        rD1_o <= rD1_forwarded;
        rD2_o <= rD2_forwarded;
    end
end

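Test again. The result now arrives three cycles earlier, which matches the dependency distance worked out above. In the waveform, x7 changes right after x2, just as expected!

Figure: Correct result after adding data forwarding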

Cycle  x1     x2      x7   IF          ID             EX          MEM         WB
-------------------------------------------------------------------------------
1      1      2       x    x1=7
2      1      2       x    x2=21       x1=7
3      1      2       x    x7=x1+x2    x2=21          x1=7
4      7[F]   21[F]   x                x7=x1+x2 [F]   x2=21       x1=7
5      7      2       x                               x7=x1+x2    x2=21       x1=7 [W]
6      7      21      x                                           x7=x1+x2    x2=21 [W]
7      7      21      28                                                      x7=28 [W]

Load-Use Hazards

Besides the ordinary dependency hazards, there is also the load-use hazard. A simple example:

addi x1,x0,1
addi x2,x0,2
add x3,x1,x2
sw x3,0(x0)
lw x4,0(x0)
addi x5,x4,1 # Load-Use hazard
addi x6,x0,6

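x5 should be 4, but it ends up as 1.

Figure: Wrong output caused by the load-use hazard

The loaded value must first be read from DRAM and written into the register, and only then can it be read back for the computation. Here, however, the value is needed before it has reached the register. What now? We could add forwarding here too.

Or could we?

A load instruction only gets its data back from DRAM in the MEM stage, whereas ordinary forwarding hands the ALU result produced in EX straight to the EX stage of a following instruction. In other words, the load data is not available after EX, only after MEM. Conventional EX-to-EX forwarding alone cannot remove this hazard.

Then why not add a MEM-to-EX forwarding path? Nice idea, but it breaks the pipeline:

  • Load data usually becomes available only late in the MEM stage (here it is the output after the MUX), while EX needs its operands early in the same cycle, so timing would be violated;
  • Inserting a bubble is simpler and makes the pipeline flow easier to control.

So let's just insert a bubble. Why not?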

// Load-use hazard detection
logic load_use_hazard;
assign load_use_hazard = (wd_sel_EX == `WD_SEL_FROM_DRAM) && (RAW_1_rD1 || RAW_1_rD2);

// Pipeline flush and stall
always_comb begin
    keep_pc     = load_use_hazard ? 1'b1 : 1'b0;
    stall_IF_ID = load_use_hazard ? 1'b1 : 1'b0;
    flush_ID_EX = load_use_hazard ? 1'b1 : 1'b0;
end

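Figure: Correct result after stalling the pipeline for one stage

In fact, inserting a bubble is the standard way to handle load-use hazards, and almost every textbook describes it.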
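
The stall and flush outputs only take effect once the pipeline registers consume them, which is not shown here. The following is a minimal sketch of how an IF/ID register might react. It assumes that flush takes priority over stall and that a bubble is encoded as the canonical NOP, addi x0, x0, 0 (0x00000013); the module name if_id_reg_sketch and its ports are hypothetical.

```systemverilog
// Hypothetical IF/ID pipeline register with stall (hold) and flush (bubble).
module if_id_reg_sketch (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        stall,   // hold the current contents, e.g. stall_IF_ID
    input  logic        flush,   // replace the contents with a bubble, e.g. flush_IF_ID
    input  logic [31:0] pc_i,
    input  logic [31:0] inst_i,
    output logic [31:0] pc_o,
    output logic [31:0] inst_o
);
    localparam logic [31:0] NOP = 32'h0000_0013;  // addi x0, x0, 0

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            pc_o   <= 32'b0;
            inst_o <= NOP;
        end else if (flush) begin   // assumption: flush wins over stall
            pc_o   <= 32'b0;
            inst_o <= NOP;
        end else if (!stall) begin  // normal advance
            pc_o   <= pc_i;
            inst_o <= inst_i;
        end                          // stall: keep the old values
    end
endmodule
```

During a load-use stall the dependent instruction stays in IF/ID for one extra cycle while the flushed ID/EX register sends a bubble down the pipeline behind the load.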

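Branches: Jumpin'

B-type instructions never write a register; they only update the PC once the EX stage has decided whether the branch is taken. The next-PC module we wrote for the execute stage already provides this decision on its take_branch output. When a branch is taken, the program counter's next PC must take the branch target address, and the wrong-path instructions in the IF/ID and ID/EX pipeline registers must be flushed. To make it easier to plug in a branch prediction module later, we reserve an input port for it.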

logic branch_predicted_result;
// No prediction for now, so use the EX-stage branch outcome directly
// assign branch_predicted_result = branch_predicted_i;
assign branch_predicted_result = take_branch_NextPC;

// Pipeline flush and stall
always_comb begin
    keep_pc     = load_use_hazard ? 1'b1 : 1'b0;
    stall_IF_ID = load_use_hazard ? 1'b1 : 1'b0;
    flush_IF_ID = branch_predicted_result ? 1'b1 : 1'b0;
    flush_ID_EX = (branch_predicted_result || load_use_hazard) ? 1'b1 : 1'b0;
end
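
On the fetch side, keep_pc and the branch outcome together decide the next PC. A minimal sketch of such a PC register follows. The module name pc_reg_sketch and the port branch_target are assumptions for illustration, and the real design computes the next PC in its own module.

```systemverilog
// Hypothetical PC register: hold on a load-use stall, redirect on a taken branch.
module pc_reg_sketch (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        keep_pc,        // load-use stall: hold the PC
    input  logic        take_branch,    // branch outcome from the EX stage
    input  logic [31:0] branch_target,  // target address from the EX stage
    output logic [31:0] pc
);
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)           pc <= 32'b0;
        else if (take_branch) pc <= branch_target;  // redirect to the target
        else if (!keep_pc)    pc <= pc + 32'd4;     // sequential fetch
        // keep_pc without a branch: the PC keeps its value
    end
endmodule
```

The relative priority of the two conditions should not matter with the hazard logic shown above: a load-use stall requires a load in EX, while a taken branch requires a branch in EX, so both cannot be asserted in the same cycle.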

Then write a simple branch test:

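addi x1, x0, 5 # x1 = 5
addi x2, x0, 5 # x2 = 5
addi x3, x0, 0 # x3 = 0

beq x1, x2, 8 # taken when x1 == x2: jump to PC+8 (skip the next instruction)
addi x3, x3, 1 # runs only if the branch is not taken (should be skipped)

addi x4, x0, 10 # x4 = 10

addi x5, x0, 7
bne x1, x5, 8 # taken when x1 != x5: jump to PC+8 (skip the next instruction)
addi x4, x4, 1 # runs only if the branch is not taken (should be skipped)

addi x6, x0, 3
blt x6, x1, 8 # taken when x6 < x1: jump to PC+8 (skip the next instruction)
addi x4, x4, 2 # runs only if the branch is not taken (should be skipped)

addi x7, x0, 5
bge x1, x7, 8 # taken when x1 >= x7: jump to PC+8 (skip the next instruction)
addi x4, x4, 3 # runs only if the branch is not taken (should be skipped)

addi x8, x0, 100 # executed after the jump

# Final register values:
# x3 = 0 (the +1 was never executed)
# x4 = 10 (none of the +1, +2 or +3 was executed)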

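Run it:

Figure: Branches executing correctly

The program counter receives the branch_op signal and correctly selects the branch target.

Why does x4 become 10 only four cycles after x3 becomes 0? Because we simply stall the pipeline on a branch: operand values can be bypassed ahead of time, but whether the branch is taken is only known once beq has finished executing. The branch target instruction, addi x4, enters the IF stage only in the cycle after beq completes its EX stage: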

Cycle  x3  x4   IF          ID              EX          MEM         WB
------------------------------------------------------------------------
1      x   x    addi 0
2      x   x    beq         addi 0
3      x   x    addi 1      beq             addi 0
4      x   x    addi 1      !== stall ==!   beq         addi 0
5      0   x    addi 10     <======== outcome known ==|             addi 0    -|
6      0   x                addi 10                                            |
7      0   x                                addi 10                            |- four cycles
8      0   x                                            addi 10                |
9      0   10                                                       addi 10   -|

Hazard Control Module: HazardUnit.sv

`include "include/defines.svh"

module HazardUnit (
    input  logic        clk,
    input  logic        rst_n,
    // Load instruction detection
    input  logic [ 1:0] wd_sel_EX,
    // Source register usage flags
    input  logic        rs1_used_ID,
    input  logic        rs2_used_ID,
    // Source register addresses in ID
    input  logic [ 4:0] rR1_ID,
    input  logic [ 4:0] rR2_ID,
    // Destination register address in EX
    input  logic [ 4:0] wR_EX,
    // Destination register address in MEM
    input  logic [ 4:0] wR_MEM,
    // Destination register address in WB
    input  logic [ 4:0] wR_WB,
    // Register write enables
    input  logic        rf_we_EX,
    input  logic        rf_we_MEM,
    input  logic        rf_we_WB,
    // Register write data
    input  logic [31:0] rf_wd_EX,
    input  logic [31:0] rf_wd_MEM,
    input  logic [31:0] rf_wd_WB,
    // Reserved input for the branch prediction result
    input  logic        branch_predicted_i,
    // PC hold signal
    output logic        keep_pc,
    // IF/ID stall signal
    output logic        stall_IF_ID,
    // IF/ID flush signal
    output logic        flush_IF_ID,
    // ID/EX flush signal
    output logic        flush_ID_EX,
    // Forwarding enables
    output logic        fwd_rD1e_ID,
    output logic        fwd_rD2e_ID,
    // Forwarding data
    output logic [31:0] fwd_rD1_ID,
    output logic [31:0] fwd_rD2_ID
);

    // RAW hazard detection
    // verilog_format: off
    logic RAW_1_rD1, RAW_1_rD2;
    logic RAW_2_rD1, RAW_2_rD2;
    logic RAW_3_rD1, RAW_3_rD2;

    always_comb begin
        // Producer is one stage ahead (EX); ignore writes to x0
        RAW_1_rD1 = (wR_EX == rR1_ID) && rf_we_EX && rs1_used_ID && (wR_EX != 5'b0);
        RAW_1_rD2 = (wR_EX == rR2_ID) && rf_we_EX && rs2_used_ID && (wR_EX != 5'b0);
        // Producer is two stages ahead (MEM)
        RAW_2_rD1 = (wR_MEM == rR1_ID) && rf_we_MEM && rs1_used_ID && (wR_MEM != 5'b0);
        RAW_2_rD2 = (wR_MEM == rR2_ID) && rf_we_MEM && rs2_used_ID && (wR_MEM != 5'b0);
        // Producer is three stages ahead (WB)
        RAW_3_rD1 = (wR_WB == rR1_ID) && rf_we_WB && rs1_used_ID && (wR_WB != 5'b0);
        RAW_3_rD2 = (wR_WB == rR2_ID) && rf_we_WB && rs2_used_ID && (wR_WB != 5'b0);
    end

    // Forwarding enable generation
    always_comb begin
        fwd_rD1e_ID = RAW_1_rD1 || RAW_2_rD1 || RAW_3_rD1;
        fwd_rD2e_ID = RAW_1_rD2 || RAW_2_rD2 || RAW_3_rD2;
    end

    // Forwarding data selection
    // Priority: EX > MEM > WB
    always_comb begin
        // case-true statements: the first matching item wins
        // Forwarding data for source operand 1
        case (1'b1)
            RAW_1_rD1: fwd_rD1_ID = rf_wd_EX;   // from the EX stage
            RAW_2_rD1: fwd_rD1_ID = rf_wd_MEM;  // from the MEM stage
            RAW_3_rD1: fwd_rD1_ID = rf_wd_WB;   // from the WB stage
            default:   fwd_rD1_ID = 32'b0;
        endcase
        // Forwarding data for source operand 2
        case (1'b1)
            RAW_1_rD2: fwd_rD2_ID = rf_wd_EX;   // from the EX stage
            RAW_2_rD2: fwd_rD2_ID = rf_wd_MEM;  // from the MEM stage
            RAW_3_rD2: fwd_rD2_ID = rf_wd_WB;   // from the WB stage
            default:   fwd_rD2_ID = 32'b0;
        endcase
    end
    // verilog_format: on

    // Load-use hazard detection
    logic load_use_hazard;
    assign load_use_hazard = (wd_sel_EX == `WD_SEL_FROM_DRAM) && (RAW_1_rD1 || RAW_1_rD2);
    // assign load_use_hazard = 1'b0;

    // [TODO] Static branch prediction
    // [TODO] Dynamic branch prediction
    logic branch_predicted_result;
    // No prediction for now, so this input carries the EX-stage branch outcome
    assign branch_predicted_result = branch_predicted_i;

    // Pipeline flush and stall
    always_comb begin
        keep_pc     = load_use_hazard ? 1'b1 : 1'b0;
        stall_IF_ID = load_use_hazard ? 1'b1 : 1'b0;
        flush_IF_ID = branch_predicted_result ? 1'b1 : 1'b0;
        flush_ID_EX = (branch_predicted_result || load_use_hazard) ? 1'b1 : 1'b0;
    end

endmodule

A Complex Test

At this point our CPU is about 70% complete. Running more complex assembly programs should work without problems, barring surprises.

# ========================================
# RV32I Complete Test Suite (No Pseudo Instructions)
# Testing all RV32I instructions without misaligned access
# ========================================

.text
.globl _start

_start:
# ========================================
# 1. Test ADDI - Add Immediate
# ========================================
addi x1, x0, 100 # x1 = 0 + 100 = 100 (0x64)
addi x2, x0, 200 # x2 = 0 + 200 = 200 (0xC8)
addi x3, x1, 50 # x3 = 100 + 50 = 150 (0x96)
addi x4, x0, -10 # x4 = 0 + (-10) = -10 (0xFFFFFFF6)

# ========================================
# 2. Test ADD - Add Register
# ========================================
add x5, x1, x2 # x5 = 100 + 200 = 300 (0x12C)
add x6, x3, x4 # x6 = 150 + (-10) = 140 (0x8C)

# ========================================
# 3. Test SUB - Subtract
# ========================================
sub x7, x2, x1 # x7 = 200 - 100 = 100 (0x64)
sub x8, x1, x2 # x8 = 100 - 200 = -100 (0xFFFFFF9C)

# ========================================
# 4. Test SLTI - Set Less Than Immediate (Signed)
# ========================================
slti x9, x1, 101 # x9 = (100 < 101) = 1
slti x10, x1, 99 # x10 = (100 < 99) = 0
slti x11, x4, 0 # x11 = (-10 < 0) = 1

# ========================================
# 5. Test SLTIU - Set Less Than Immediate (Unsigned)
# ========================================
sltiu x12, x1, 101 # x12 = (100 < 101) = 1
sltiu x13, x4, 10 # x13 = (0xFFFFFFF6 < 10) = 0 (unsigned compare)

# ========================================
# 6. Test SLT - Set Less Than (Signed)
# ========================================
slt x14, x1, x2 # x14 = (100 < 200) = 1
slt x15, x2, x1 # x15 = (200 < 100) = 0
slt x16, x4, x1 # x16 = (-10 < 100) = 1

# ========================================
# 7. Test SLTU - Set Less Than (Unsigned)
# ========================================
sltu x17, x1, x2 # x17 = (100 < 200) = 1
sltu x18, x4, x1 # x18 = (0xFFFFFFF6 < 100) = 0 (unsigned)

# ========================================
# 8. Test Logical Operations - ANDI, ORI, XORI
# ========================================
addi x19, x0, 255 # x19 = 255 (0xFF)
andi x20, x19, 15 # x20 = 255 & 15 = 15 (0x0F)
ori x21, x20, 240 # x21 = 15 | 240 = 255 (0xFF)
xori x22, x21, 170 # x22 = 255 ^ 170 = 85 (0x55)

# ========================================
# 9. Test Logical Operations - AND, OR, XOR
# ========================================
addi x23, x0, 60 # x23 = 60 (0x3C = 0b00111100)
addi x24, x0, 51 # x24 = 51 (0x33 = 0b00110011)
and x25, x23, x24 # x25 = 60 & 51 = 48 (0x30)
or x26, x23, x24 # x26 = 60 | 51 = 63 (0x3F)
xor x27, x23, x24 # x27 = 60 ^ 51 = 15 (0x0F)

# ========================================
# 10. Test Shift Operations - SLLI, SRLI, SRAI
# ========================================
addi x28, x0, 8 # x28 = 8 (0x08)
slli x29, x28, 2 # x29 = 8 << 2 = 32 (0x20)
srli x30, x29, 1 # x30 = 32 >> 1 = 16 (0x10)
addi x31, x0, -8 # x31 = -8 (0xFFFFFFF8)
srai x1, x31, 1 # x1 = -8 >> 1 = -4 (0xFFFFFFFC, arithmetic shift)

# ========================================
# 11. Test Shift Operations - SLL, SRL, SRA
# ========================================
addi x2, x0, 16 # x2 = 16
addi x3, x0, 2 # x3 = 2 (shift amount)
sll x4, x2, x3 # x4 = 16 << 2 = 64 (0x40)
srl x5, x4, x3 # x5 = 64 >> 2 = 16 (0x10)
addi x6, x0, -16 # x6 = -16 (0xFFFFFFF0)
sra x7, x6, x3 # x7 = -16 >> 2 = -4 (0xFFFFFFFC)

# ========================================
# 12. Test LUI - Load Upper Immediate
# ========================================
lui x8, 0x12345 # x8 = 0x12345000
lui x9, 0xFFFFF # x9 = 0xFFFFF000

# ========================================
# 13. Test AUIPC - Add Upper Immediate to PC
# ========================================
auipc x10, 0 # x10 = current PC + 0
auipc x11, 1 # x11 = current PC + 0x1000

# ========================================
# 14. Test Memory Operations - Store and Load
# Setup memory base address
# ========================================
lui x12, 0x10000 # x12 = 0x10000000 (memory base)

# Store values
lui x15, 0xABCDE # x15 = 0xABCDE000
addi x15, x15, 0x789 # x15 = 0xABCDE789
sw x15, 8(x12) # Store word: mem[0x10000008] = 0xABCDE789

# Load values

lw x26, 8(x12) # x26 = 0xABCDE789

# ========================================
# 15. Test Branch Instructions - BEQ
# ========================================
addi x27, x0, 10 # x27 = 10
addi x28, x0, 10 # x28 = 10
beq x27, x28, branch_eq_taken # Should branch (10 == 10)
addi x29, x0, 99 # x29 = 99 (SHOULD NOT EXECUTE)

branch_eq_taken:
addi x29, x0, 1 # x29 = 1 (branch taken marker)

addi x27, x0, 10 # x27 = 10
addi x28, x0, 20 # x28 = 20
beq x27, x28, branch_eq_not_taken # Should not branch
addi x30, x0, 1 # x30 = 1 (not taken marker)

branch_eq_not_taken:

# ========================================
# 16. Test Branch Instructions - BNE
# ========================================
addi x1, x0, 5 # x1 = 5
addi x2, x0, 10 # x2 = 10
bne x1, x2, branch_ne_taken # Should branch (5 != 10)
addi x3, x0, 99 # x3 = 99 (SHOULD NOT EXECUTE)

branch_ne_taken:
addi x3, x0, 1 # x3 = 1 (branch taken marker)

# ========================================
# 17. Test Branch Instructions - BLT (Signed)
# ========================================
addi x4, x0, 5 # x4 = 5
addi x5, x0, 10 # x5 = 10
blt x4, x5, branch_lt_taken # Should branch (5 < 10)
addi x6, x0, 99 # x6 = 99 (SHOULD NOT EXECUTE)

branch_lt_taken:
addi x6, x0, 1 # x6 = 1 (branch taken marker)

addi x7, x0, -5 # x7 = -5
addi x8, x0, 3 # x8 = 3
blt x7, x8, branch_lt_neg_taken # Should branch (-5 < 3)
addi x9, x0, 99 # x9 = 99 (SHOULD NOT EXECUTE)

branch_lt_neg_taken:
addi x9, x0, 1 # x9 = 1 (branch taken marker)

# ========================================
# 18. Test Branch Instructions - BGE (Signed)
# ========================================
addi x10, x0, 10 # x10 = 10
addi x11, x0, 5 # x11 = 5
bge x10, x11, branch_ge_taken # Should branch (10 >= 5)
addi x12, x0, 99 # x12 = 99 (SHOULD NOT EXECUTE)

branch_ge_taken:
addi x12, x0, 1 # x12 = 1 (branch taken marker)

# ========================================
# 19. Test Branch Instructions - BLTU (Unsigned)
# ========================================
addi x13, x0, 5 # x13 = 5
addi x14, x0, 10 # x14 = 10
bltu x13, x14, branch_ltu_taken # Should branch (5 < 10 unsigned)
addi x15, x0, 99 # x15 = 99 (SHOULD NOT EXECUTE)

branch_ltu_taken:
addi x15, x0, 1 # x15 = 1 (branch taken marker)

# ========================================
# 20. Test Branch Instructions - BGEU (Unsigned)
# ========================================
addi x16, x0, 10 # x16 = 10
addi x17, x0, 5 # x17 = 5
bgeu x16, x17, branch_geu_taken # Should branch (10 >= 5 unsigned)
addi x18, x0, 99 # x18 = 99 (SHOULD NOT EXECUTE)

branch_geu_taken:
addi x18, x0, 1 # x18 = 1 (branch taken marker)

# ========================================
# 21. Test JAL - Jump and Link
# ========================================
jal x19, jal_target # x19 = return address, jump to jal_target
addi x20, x0, 99 # x20 = 99 (SHOULD NOT EXECUTE)

jal_target:
addi x20, x0, 1 # x20 = 1 (jump successful)

# ========================================
# 22. Test JALR - Jump and Link Register
# ========================================
auipc x21, 0 # x21 = current PC
addi x21, x21, 16 # x21 = PC + 16 (target address)
jalr x22, x21, 0 # x22 = return address, jump to x21
addi x23, x0, 99 # x23 = 99 (SHOULD NOT EXECUTE)

jalr_target:
addi x23, x0, 1 # x23 = 1 (jump successful)

# ========================================
# 23. Test Complex Data Dependencies
# ========================================
addi x24, x0, 1 # x24 = 1
addi x24, x24, 2 # x24 = 3
addi x24, x24, 3 # x24 = 6
addi x24, x24, 4 # x24 = 10

# ========================================
# 24. Test Edge Cases
# ========================================
# Maximum positive immediate
addi x25, x0, 2047 # x25 = 2047 (0x7FF, max 12-bit signed)

# Maximum negative immediate
addi x26, x0, -2048 # x26 = -2048 (0xFFFFF800)

# Overflow test
lui x27, 0x7FFFF # x27 = 0x7FFFF000
addi x27, x27, 0x7FF # x27 = 0x7FFFF7FF (near max positive)
addi x28, x0, 1 # x28 = 1
add x29, x27, x28 # x29 = 0x7FFFF800 (no signed overflow at this value)

# Zero register test
addi x0, x0, 100 # x0 should remain 0
add x30, x0, x0 # x30 = 0

# ========================================
# 25. Final Test - Load/Store with Computed Address
# ========================================
lui x31, 0x10000 # x31 = 0x10000000
addi x1, x0, 100 # x1 = 100 (offset)
add x2, x31, x1 # x2 = 0x10000000 + 100
addi x3, x0, 0x55A # x3 = 0x55A
sw x3, 0(x2) # Store at computed address
lw x4, 0(x2) # x4 = 0x55A (load back)

# ========================================
# End of Test - Infinite Loop
# ========================================
end_loop:
beq x0, x0, end_loop # Infinite loop

# ========================================
# Expected Register Values After Test:
# ========================================
# x0 = 0x00000000 (always zero)
# x1 = 100 (0x64)
# x2 = 0x10000064
# x3 = 0x0000055A
# x4 = 0x0000055A
# x5 = 10
# x6 = 1 (branch marker)
# x7 = -5 (0xFFFFFFFB)
# x8 = 3
# x9 = 1 (branch marker)
# x10 = 10
# x11 = 5
# x12 = 1 (branch marker)
# x13 = 5
# x14 = 10
# x15 = 1 (branch marker)
# x16 = 10
# x17 = 5
# x18 = 1 (branch marker)
# x19 = return address from JAL
# x20 = 1
# x21 = target address
# x22 = return address from JALR
# x23 = 1
# x24 = 10
# x25 = 2047 (0x7FF)
# x26 = -2048 (0xFFFFF800)
# x27 = 0x7FFFF7FF
# x28 = 1
# x29 = 0x7FFFF800 (0x7FFFF7FF + 1, no wraparound)
# x30 = 0
# x31 = 0x10000000

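It works very well!

Figure: Full test

Why only 70%? Because we do not yet support instructions such as lb/lh and sb/sh. Before that, there is one legacy problem to solve: the DRAM with synchronous write and asynchronous read, which cannot be synthesized...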