High-performance computer architecture

In the history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional DLX CPU invented for education.

Each of these classic scalar RISC designs fetches and tries to execute one instruction per cycle. The main common concept of each design is a five-stage execution instruction pipeline. During operation, each pipeline stage works on one instruction at a time. Each of these stages consists of a set of flip-flops to hold state, and combinational logic that operates on the outputs of those flip-flops.

Instruction fetch

The instructions reside in memory that takes one cycle to read. This memory can be dedicated SRAM, or an instruction cache. The term “latency” is used often in computer science; it means the time from when an operation starts until it completes. Thus, instruction fetch has a latency of one clock cycle (if using single-cycle SRAM, or if the instruction was in the cache). So, during the Instruction Fetch stage, a 32-bit instruction is fetched from the instruction memory.

The Program Counter, or PC, is a register that holds the address that is presented to the instruction memory. The address is presented to instruction memory at the start of a cycle. Then, during the cycle, the instruction is read out of instruction memory, and at the same time a calculation is done to determine the next PC. The next PC is calculated by incrementing the PC by 4, and by choosing whether to take that as the next PC or to take the result of a branch/jump calculation as the next PC. Note that in classic RISC, all instructions have the same length. (This is one thing that separates RISC from CISC.[1]) In the original RISC designs, the size of an instruction is 4 bytes, so the hardware always adds 4 to the instruction address, but it does not use PC + 4 in the case of a taken branch, jump, or exception (see delayed branches, below). (Note that some modern machines use more complicated algorithms (branch prediction and branch target prediction) to guess the next instruction address.)
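The next-PC selection described above can be sketched as a small function (an illustrative model, not real RTL; the names are hypothetical):

```python
def next_pc(pc, branch_taken, branch_target):
    """Classic RISC next-PC calculation: instructions are 4 bytes,
    so the fall-through address is always pc + 4; a taken branch,
    jump, or exception overrides it with the computed target."""
    fall_through = pc + 4
    return branch_target if branch_taken else fall_through
```

For example, a sequential instruction at address 0x1000 is followed by 0x1004, while a taken branch to 0x2000 overrides the increment.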

Instruction decode

Another thing that separates the first RISC machines from earlier CISC machines is that RISC has no microcode.[2] Once an instruction is fetched from the instruction cache, its bits are shifted down the pipeline, so that simple combinational logic in each pipeline stage can produce the control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage. A consequence of this lack of decoding is that more instruction bits have to be used to specify what the instruction does, which leaves fewer bits for things like register indices.

All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, the indexes of these two registers are identified within the instruction, and the indexes are presented to the register file as addresses. Thus the two named registers are read from the register file. In the MIPS design, the register file had 32 entries.

At the same time the register file is read, instruction issue logic in this stage determines whether the pipeline is ready to execute the instruction in this stage. If not, the issue logic causes both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the input flip-flops do not accept new bits, so no new calculations take place during that cycle.

If the instruction decoded is a branch or jump, the target address of the branch or jump is computed in parallel with reading the register file. The branch condition is computed in the following cycle (after the register file is read), and if the branch is taken or if the instruction is a jump, the PC in the first stage is assigned the branch target, rather than the incremented PC that has been computed. Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput.

The decode stage ended up with quite a lot of hardware: MIPS has the possibility of branching if two registers are equal, so a 32-bit-wide AND tree runs in series after the register file read, making a very long critical path through this stage (which means fewer cycles per second). Also, the branch target computation generally required a 16-bit adder and a 14-bit incrementer. Resolving the branch in the decode stage made it possible to have just a single-cycle branch mispredict penalty. Since branches were very often taken (and thus mispredicted), it was very important to keep this penalty low.

Execute

The Execute stage is where the actual computation occurs. Typically this stage consists of an ALU and a bit shifter. It may also include a multi-cycle multiplier and divider.

The ALU is responsible for performing boolean operations (and, or, not, nand, nor, xor, xnor) and also for performing integer addition and subtraction. Besides the result, the ALU typically provides status bits such as whether or not the result was 0, or if an overflow occurred.
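The status bits mentioned above can be illustrated with a minimal model of a 32-bit add (an illustrative sketch; the function name and flag choices are assumptions, not from any specific design):

```python
MASK32 = 0xFFFFFFFF

def alu_add(a, b):
    """Model of a 32-bit ALU add that reports two common status bits:
    the zero flag and the signed-overflow flag."""
    result = (a + b) & MASK32
    zero = (result == 0)
    # Signed overflow: both operands share a sign bit that the result lacks.
    sign = 0x80000000
    overflow = ((a & sign) == (b & sign)) and ((a & sign) != (result & sign))
    return result, zero, overflow
```

For instance, adding 1 to 0x7FFFFFFF (the largest positive 32-bit signed value) produces 0x80000000 with the overflow flag set.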

The bit shifter is responsible for shifts and rotations.

Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:

Register-Register Operation (Single-cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.
Memory Reference (Two-cycle latency). All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
Multi-cycle Instructions (Many-cycle latency). Integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multi-cycle instructions wrote their results to a separate set of registers.

Memory access

If data memory needs to be accessed, it is done in this stage.

During this stage, single cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both one and two cycle instructions always write their results in the same stage of the pipeline so that just one write port to the register file can be used, and it is always available.

For direct mapped and virtually tagged data caching, the simplest by far of the numerous data cache organizations, two SRAMs are used, one storing data and the other storing tags.

Writeback

During this stage, both single cycle and two cycle instructions write their results into the register file. Note that two different stages are accessing the register file at the same time—the decode stage is reading two source registers, at the same time that the writeback stage is writing a previous instruction’s destination register. On real silicon, this can be a hazard (see below for more on hazards). That is because one of the source registers being read in decode might be the same as the destination register being written in writeback. When that happens, the same memory cells in the register file are being both read and written at the same time. On silicon, many implementations of memory cells will not operate correctly when read and written at the same time.

Hazards

Hennessy and Patterson coined the term hazard for situations where instructions in a pipeline would produce wrong answers.

Structural hazards

Structural hazards occur when two instructions might attempt to use the same resources at the same time. Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch. If the ALU were used in the decode stage for that purpose, an ALU instruction followed by a branch would have seen both instructions attempt to use the ALU simultaneously. It is simple to resolve this conflict by designing a specialized branch target adder into the decode stage.

Data hazards

Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.
In the classic RISC pipeline, data hazards are avoided in one of two ways:

Solution A. Bypassing

Bypassing is also known as operand forwarding.

Suppose the CPU is executing the following piece of code:

SUB r3,r4 -> r10   ; Writes r3 - r4 to r10
AND r10,r3 -> r11  ; Writes r10 & r3 to r11

The instruction fetch and decode stages send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:

[Figure: the SUB and AND instructions flowing down the pipeline stages]

In a naive pipeline, without hazard consideration, the data hazard progresses as follows:

In cycle 3, the SUB instruction calculates the new value for r10. In the same cycle, the AND operation is decoded, and the value of r10 is fetched from the register file. However, the SUB instruction has not yet written its result to r10. Write-back of this normally occurs in cycle 5 (green box). Therefore, the value read from the register file and passed to the ALU (in the Execute stage of the AND operation, red box) is incorrect.

Instead, we must pass the data that was computed by SUB back to the Execute stage (i.e. to the red circle in the diagram) of the AND operation before it is normally written-back. The solution to this problem is a pair of bypass multiplexers. These multiplexers sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexer selects between:

A register file read port (i.e. the output of the decode stage, as in the naive pipeline): red arrow
The current Execute stage pipeline register of the ALU (to bypass by one stage): blue arrow
The current Access stage pipeline register (which is either a loaded value or a forwarded ALU result; this provides bypassing of two stages): purple arrow. Note that this requires the data to be passed backwards in time by one cycle. If this occurs, a bubble must be inserted to stall the AND operation until the data is ready.
Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and causes the multiplexers to select the most recent data. These bypass multiplexers make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexer, and a flip-flop. Without the multiplexers, the latency of writing and then reading the register file would have to be included in the latency of these instructions.
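The selection done by the bypass multiplexers can be sketched as a small priority function (an illustrative model; argument names are assumptions). The instruction in Execute is younger than the one in Access, so its result takes priority; register 0, hardwired to zero on MIPS, is never bypassed:

```python
def bypass_select(src_reg, ex_dest, mem_dest,
                  regfile_value, ex_value, mem_value):
    """Pick the most recent value of src_reg: prefer the result of
    the instruction in Execute (bypass one stage), then the one in
    Memory Access (bypass two stages), else the register file."""
    if src_reg != 0 and src_reg == ex_dest:
        return ex_value        # blue arrow: forwarded from Execute
    if src_reg != 0 and src_reg == mem_dest:
        return mem_value       # purple arrow: forwarded from Access
    return regfile_value       # red arrow: normal register file read
```

In the SUB/AND example above, the AND reads r10 via the Execute bypass path rather than from the stale register file entry.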

Note that the data can only be passed forward in time - the data cannot be bypassed back to an earlier stage if it has not been processed yet. In the case above, the data is passed forward (by the time the AND is ready for the register in the ALU, the SUB has already computed it).


Solution B. Pipeline interlock

However, consider the following instructions:

LD  adr    -> r10
AND r10,r3 -> r11

The data read from the address adr is not present in the data cache until after the Memory Access stage of the LD instruction. By this time, the AND instruction is already through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled - they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.
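The load-use detection done in the decode stage can be sketched as a small predicate (an illustrative model; names are assumptions, not from any real design):

```python
def load_use_stall(decode_srcs, ex_is_load, ex_dest):
    """Return True when decode must stall for one cycle: the
    instruction currently in Execute is a load whose destination
    register is one of the decoding instruction's sources (the
    loaded value is not available until after Memory Access).
    Register 0 is hardwired to zero and never causes a stall."""
    return ex_is_load and ex_dest != 0 and ex_dest in decode_srcs
```

In the LD/AND example, the AND in decode reads r10 while the LD of r10 is in Execute, so the predicate fires and one bubble is inserted.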

This NOP is termed a pipeline bubble since it floats in the pipeline, like an air bubble in a water pipe, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock.

[Figure: bypassing backwards in time]
[Figure: problem resolved using a bubble]

A pipeline interlock does not have to be used with any data forwarding, however. The first example of the SUB followed by AND and the second example of LD followed by AND can be solved by stalling the first stage by three cycles until write-back is achieved, and the data in the register file is correct, causing the correct register value to be fetched by the AND’s Decode stage. This causes quite a performance hit, as the processor spends a lot of time processing nothing, but clock speeds can be increased as there is less forwarding logic to wait for.

This data hazard can be detected quite easily when the program’s machine code is written by the compiler. The Stanford MIPS machine relied on the compiler to add the NOP instructions in this case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline stages. Hence the name MIPS: Microprocessor without Interlocked Pipeline Stages. It turned out that the extra NOP instructions added by the compiler expanded the program binaries enough that the instruction cache hit rate was reduced. The stall hardware, although expensive, was put back into later designs to improve instruction cache hit rate, at which point the acronym no longer made sense.

Control hazards

Control hazards are caused by conditional and unconditional branching. The classic RISC pipeline resolves branches in the decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:

The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32-bit compare on the MIPS CPUs), and the next instruction address multiplexer.
Because branch and jump targets are calculated in parallel to the register read, RISC ISAs typically do not have instructions that branch to a register+offset address. Jump to register is supported.
On any taken branch, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch IPC penalty, which is fairly large.
There are four schemes to solve this performance problem with branches:

Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and one cycle’s opportunity to finish an instruction is lost.

Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch was taken. The compiler can always fill the branch delay slot on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than the previous kind.

Branch Delay Slot: Always fetch the instruction after the branch from the instruction cache, and always execute it, even if the branch is taken. Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty only for those branches into which the compiler could not schedule a useful delay-slot instruction. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs.

Branch Prediction: In parallel with fetching each instruction, guess if the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.
Delayed branches were controversial, first, because their semantics are complicated. A delayed branch specifies that the jump to a new location happens after the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.

Delayed branches have been criticized as a poor short-term choice in ISA design:

Compilers typically have some difficulty finding logically independent instructions to place after the branch (the instruction after the branch is called the delay slot), so that they must insert NOPs into the delay slots.
Superscalar processors, which fetch multiple instructions per cycle and must have some form of branch prediction, do not benefit from delayed branches. The Alpha ISA left out delayed branches, as it was intended for superscalar processors.
The most serious drawback to delayed branches is the additional control complexity they entail. If the delay slot instruction takes an exception, the processor has to be restarted on the branch, rather than that next instruction. Exceptions then have essentially two addresses, the exception address and the restart address, and generating and distinguishing between the two correctly in all cases has been a source of bugs for later designs.

Exceptions

Suppose a 32-bit RISC processes an ADD instruction that adds two large numbers, and the result does not fit in 32 bits.

The simplest solution, provided by most architectures, is wrapping arithmetic. Numbers greater than the maximum possible encoded value have their most significant bits chopped off until they fit. In the usual integer number system, 3000000000+3000000000=6000000000. With unsigned 32 bit wrapping arithmetic, 3000000000+3000000000=1705032704 (6000000000 mod 2^32). This may not seem terribly useful. The largest benefit of wrapping arithmetic is that every operation has a well defined result.
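The arithmetic above can be checked directly: unsigned 32-bit wrapping addition is just addition modulo 2^32, i.e. keeping only the low 32 bits of the result (a minimal sketch):

```python
def wrap_add_u32(a, b):
    """Unsigned 32-bit wrapping add: chop off all bits above bit 31,
    which is the same as reducing the sum modulo 2**32."""
    return (a + b) & 0xFFFFFFFF
```

This reproduces the example from the text: 3000000000 + 3000000000 wraps to 1705032704, which is 6000000000 mod 2^32.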

But the programmer, especially if programming in a language supporting large integers (e.g. Lisp or Scheme), may not want wrapping arithmetic. Some architectures (e.g. MIPS), define special addition operations that branch to special locations on overflow, rather than wrapping the result. Software at the target location is responsible for fixing the problem. This special branch is called an exception. Exceptions differ from regular branches in that the target address is not specified by the instruction itself, and the branch decision is dependent on the outcome of the instruction.

The most common kind of software-visible exception on one of the classic RISC machines is a TLB miss.

Exceptions are different from branches and jumps, because those other control flow changes are resolved in the decode stage. Exceptions are resolved in the writeback stage. When an exception is detected, the following instructions (earlier in the pipeline) are marked as invalid, and as they flow to the end of the pipe their results are discarded. The program counter is set to the address of a special exception handler, and special registers are written with the exception location and cause.

To make it easy (and fast) for the software to fix the problem and restart the program, the CPU must take a precise exception. A precise exception means that all instructions up to the excepting instruction have been executed, and the excepting instruction and everything afterwards have not been executed.

To take precise exceptions, the CPU must commit changes to the software visible state in the program order. This in-order commit happens very naturally in the classic RISC pipeline. Most instructions write their results to the register file in the writeback stage, and so those writes automatically happen in program order. Store instructions, however, write their results to the Store Data Queue in the access stage. If the store instruction takes an exception, the Store Data Queue entry is invalidated so that it is not written to the cache data SRAM later.
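The Store Data Queue behavior described here can be sketched as follows (an illustrative model only; the entry layout and function name are assumptions):

```python
def commit_store(entry, took_exception, cache):
    """A store writes its data into a Store Data Queue entry in the
    Access stage.  The entry is written into the cache data SRAM
    later only if no exception invalidated it first."""
    if took_exception:
        entry["valid"] = False          # squash the store
        return
    if entry["valid"]:
        cache[entry["addr"]] = entry["data"]   # in-order commit
```

A store that takes an exception leaves the cache untouched, preserving the precise-exception guarantee that nothing after the excepting instruction modifies software-visible state.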

Cache miss handling

Occasionally, either the data or instruction cache does not contain a required datum or instruction. In these cases, the CPU must suspend operation until the cache can be filled with the necessary data, and then must resume execution. The problem of filling the cache with the required data (and potentially writing back to memory the evicted cache line) is not specific to the pipeline organization, and is not discussed here.

There are two strategies to handle the suspend/resume problem. The first is a global stall signal. This signal, when activated, prevents instructions from advancing down the pipeline, generally by gating off the clock to the flip-flops at the start of each stage. The disadvantage of this strategy is that there are a large number of flip flops, so the global stall signal takes a long time to propagate. Since the machine generally has to stall in the same cycle that it identifies the condition requiring the stall, the stall signal becomes a speed-limiting critical path.

Another strategy to handle suspend/resume is to reuse the exception logic. The machine takes an exception on the offending instruction, and all further instructions are invalidated. When the cache has been filled with the necessary data, the instruction that caused the cache miss restarts. To expedite data cache miss handling, the instruction can be restarted so that its access cycle happens one cycle after the data cache is filled.