Wednesday, September 27, 2023

Computer Organization 03

September 27, 2023 · Study notes: Principles of Computer Organization


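Lecture 03

Review and addenda

$\text{Three classes of instructions} \begin{cases} \text{arithmetic instructions}\\ \text{control (branch) instructions}\\ \text{data transfer (memory access) instructions} \end{cases}$
Instruction format: an opcode (the high 6 bits) plus operands.

$\text{Kinds of operand} \begin{cases} \text{register}\\ \text{memory address}\\ \text{constant (fastest)} \end{cases}$
The constant 0 is so common that it has a dedicated register, $zero (register 0), so it rarely needs to be encoded as an immediate.
Only a few bits are left for a constant, so constants are usually small. They are signed and stored in two's complement.

For an unsigned number, every bit is a magnitude bit.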

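Sign extension

A signed number is extended by replicating its most significant (sign) bit.
e.g. 4 bits to 8 bits:

1111 -> 1111 1111
0111 -> 0000 0111

An unsigned number is zero-extended: the new high bits are all 0, whatever the original top bit was.

In MIPS, sign extension is applied by:
addi: extends the immediate value
lb, lh: extend the loaded byte/halfword
beq, bne: extend the displacement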
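The two extension rules above can be checked with a few lines of Python. This is only a sketch: the helper names `sign_extend` and `zero_extend` are made up for illustration and are not MIPS instructions.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Widen a two's-complement bit pattern by replicating its sign bit."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask                          # keep only the low from_bits
    if value & (1 << (from_bits - 1)):         # sign bit set: fill new high bits with 1s
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value


def zero_extend(value: int, from_bits: int) -> int:
    """Widen an unsigned bit pattern: the new high bits are simply 0."""
    return value & ((1 << from_bits) - 1)


# The two 4-bit -> 8-bit cases from the notes
print(format(sign_extend(0b1111, 4, 8), "08b"))
print(format(sign_extend(0b0111, 4, 8), "08b"))
print(format(zero_extend(0b1111, 4), "08b"))
```

The first two lines printed reproduce the 1111 and 0111 cases; the third shows that the same pattern 1111 stays a small positive value when treated as unsigned.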

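Instruction formats

R-format instructions

| op | rs | rt | rd | shamt | funct |
| --- | --- | --- | --- | --- | --- |
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

In R-format all three operands are registers.
rs: first source register number
rt: second source register number
rd: destination register number
shamt: shift amount (00000 when the instruction is not a shift). Five bits are enough because a MIPS word is only 32 bits wide.
funct: function code, which extends the opcode. Together op and funct span 12 bits, so in principle they can distinguish up to $2^{12}$ instructions (in practice the R-format arithmetic instructions share opcode 0 and are told apart by funct).
See the example on slide 22 of the lecture slides.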
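As a rough illustration of the field layout, the sketch below packs the six R-format fields into a 32-bit word. The function name `encode_r` is hypothetical; the field values are the standard MIPS encoding of `add $t0, $s1, $s2`.

```python
def encode_r(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack the six R-format fields (6/5/5/5/5/6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $s1, $s2 -> op=0, rs=$s1=17, rt=$s2=18, rd=$t0=8, shamt=0, funct=32
word = encode_r(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")
print(f"0x{word:08x}")  # 0x02324020
```

Note that the destination register comes first in the assembly syntax but sits in the third register field (rd) of the machine word.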

I-format

Exactly one operand is an immediate; the other two are registers.
So an I-format instruction also has three operands.

| op | rs | rt | constant or address |
| --- | --- | --- | --- |
| 6 bits | 5 bits | 5 bits | 16 bits |

rt: destination or source register number
constant: $-2^{15} \sim 2^{15} - 1$
address: offset added to the base address in rs

Compared with R-format:
the upper 16 bits (op, rs, rt) have the same layout
the lower 16 bits differ
The similar layout keeps instruction decoding simple.
Design principle: good design demands good compromises

Different formats complicate decoding, but allow 32-bit instructions uniformly
Keep formats as similar as possible
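The I-format layout can be sketched the same way. `encode_i` is a hypothetical helper, not part of any toolchain; it also enforces the signed 16-bit range given above. The example values are the standard encoding of `addi $s1, $s2, -1`.

```python
def encode_i(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack the I-format fields (6/5/5/16 bits); imm must fit in signed 16 bits."""
    if not -(1 << 15) <= imm <= (1 << 15) - 1:
        raise ValueError("immediate does not fit in 16 bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# addi $s1, $s2, -1 -> op=8, rs=$s2=18, rt=$s1=17, imm=-1 (0xFFFF in two's complement)
word = encode_i(8, 18, 17, -1)
print(f"0x{word:08x}")  # 0x2251ffff
```

A constant outside the range, such as 40000, makes the helper raise an error, which mirrors the fact that such a value cannot be carried in a single I-format instruction.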

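Logical operations

Comparing the shift operations
e.g.
op rs rt rd shamt funct
shamt: how many positions to shift
Shift operations:

shift left logical (sll):
- shift left and fill with 0 bits
- sll by i bits multiplies by 2^i
shift right logical (srl) …

Shifting 1001 1101 right by one bit:
srl (logical): 0100 1110
sra (arithmetic): 1100 1110

nor takes three registers, e.g. nor $t0, $t1, $zero
MIPS has no NOT instruction; NOT is written as a NOR with $zero so that it keeps the same three-register R-format.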
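The difference between the logical and arithmetic right shift, and the NOR-as-NOT trick, can be tried out on 8-bit values. This is a minimal sketch; the function names simply borrow the MIPS mnemonics and the 8-bit width is chosen to match the example above (real MIPS registers are 32 bits).

```python
def srl(value: int, shamt: int, bits: int = 8) -> int:
    """Shift right logical: vacated high bits are filled with 0."""
    return (value & ((1 << bits) - 1)) >> shamt


def sra(value: int, shamt: int, bits: int = 8) -> int:
    """Shift right arithmetic: vacated high bits copy the sign bit."""
    mask = (1 << bits) - 1
    value &= mask
    if value >> (bits - 1):            # sign bit set
        value -= 1 << bits             # reinterpret the pattern as a negative int
    return (value >> shamt) & mask     # Python's >> on negative ints is arithmetic


def nor(a: int, b: int, bits: int = 8) -> int:
    """Bitwise NOR; with b = 0 it acts as NOT a."""
    return ~(a | b) & ((1 << bits) - 1)


x = 0b10011101
print(format(srl(x, 1), "08b"))
print(format(sra(x, 1), "08b"))
print(format(nor(x, 0), "08b"))
```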

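Conditional operations (conditional branch instructions)

beq rs, rt, L1

if (rs == rt) branch to the instruction labeled L1
It is an I-format instruction; L1 is encoded as an immediate (the branch offset).
bne rs, rt, L1 is the opposite: branch if rs != rt.
These two are what an if statement compiles to.

if (i == j) f = g + h;
else f = g - h;

These are all local variables, so they live in registers (f, g, h in $s0, $s1, $s2 and i, j in $s3, $s4).

bne $s3,$s4,Else
add $s0,$s1,$s2
j Exit
Else: sub $s0,$s1,$s2
Exit: ...

j L1: unconditional jump to the instruction labeled L1
It is a J-format instruction:
one opcode plus one operand.
Here is an example of compiling a loop statement.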

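while (save[i] == k) i += 1;

Loop: sll $t1, $s3, 2      # $t1 = i * 4, because an int occupies 4 bytes in memory
add $t1, $t1, $s6        # $t1 = address of save[i]
lw  $t0,  0($t1)         # $t0 = save[i]
bne  $t0,  $s5,  Exit    # leave the loop if save[i] != k
addi  $s3,  $s3,  1      # i += 1
j  Loop
Exit:  ...

For lw, picture memory as the rectangle diagram from the slides, with addresses running from bottom to top.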
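The address arithmetic in the loop can be mirrored in Python. This is a sketch only: `find_exit_index` and the base address `0x1000` are invented for illustration, and the list stands in for the word array `save` in memory.

```python
def find_exit_index(save: list, i: int, k: int, base: int = 0x1000) -> tuple:
    """Advance i while save[i] == k; also return the last byte address lw would read."""
    while True:
        address = base + (i << 2)   # sll by 2 multiplies the index by 4; add the base
        if save[i] != k:            # bne $t0, $s5, Exit
            return i, address
        i += 1                      # addi $s3, $s3, 1 ; j Loop


index, last_address = find_exit_index([7, 7, 7, 3], 0, 7)
print(index, hex(last_address))
```

With the sample array the loop stops at index 3, whose element lives 12 bytes past the (assumed) base address.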

Tips on how compilers optimize high-level code
A basic block is a sequence of instructions with

No embedded branches (except at end)
No branch targets (except at beginning)
A compiler identifies basic blocks for optimization
An advanced processor can accelerate execution of basic blocks

Procedure Calling

j and jal
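How can a 16-bit field express a 32-bit address?
This is where the program counter comes in.
PC: the program counter, a special register that user code cannot access directly.
Its value is the address of the next instruction to fetch, i.e. the instruction right after the one currently executing, not the target of a jump (if there is one).
Stored program $\to$ the program is placed in memory $\to$ the CPU fetches the instructions it needs from memory.
Adjacent instructions are 4 bytes apart, so the PC is automatically incremented by 4 bytes each time.

P.S. Memory is byte-addressed: every address names one byte.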

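So how is the target address computed?
op 6 bits + address 26 bits
The targets of j and jal must be aligned addresses: the address must be divisible by 4, i.e. its lowest two bits are 0.

On MIPS, when lh reads a halfword the memory address must be a multiple of 2; when lw reads a word the address must be a multiple of 4; when sd writes a doubleword the address must be a multiple of 8. An unaligned access raises an exception; typically the system reports a "bus error" and kills the process.

So the two low 00 bits need not be stored; only the higher bits are kept.
A 16-bit field can therefore represent an 18-bit value.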

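bne, beq: target address = PC + (offset × 4)

Tip: the PC has already been incremented by 4 by this time, so relative to the branch itself the target is (branch address + 4) + offset × 4.
offset is a signed number, so it can be negative.
j, jal: target address = $PC_{31\ldots28}$ concatenated with (address × 4), where address is the 26-bit field

Example, p. 62
In the final j Loop:
the addresses in the problem are decimal. Converted to binary, bits 31...28 of 80024 are all 0,
so the 26-bit field × 4 == 80000 (the jump target), i.e. the field holds 20000.

How do we jump even further?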
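Both address rules can be written down as small functions. This is a sketch under assumptions: the names `branch_target` and `jump_target` are made up, the j instruction is taken to sit at address 80020 (so that PC + 4 = 80024 as in the example), and the branch at 80012 with offset 2 is an illustrative value rather than something stated in these notes.

```python
def branch_target(branch_addr: int, offset: int) -> int:
    """beq/bne: PC-relative, where the PC already points at the next instruction."""
    return (branch_addr + 4) + (offset << 2)


def jump_target(jump_addr: int, field26: int) -> int:
    """j/jal: top 4 bits of PC+4 concatenated with the 26-bit field times 4."""
    pc = jump_addr + 4
    return (pc & 0xF0000000) | ((field26 & 0x03FFFFFF) << 2)


print(jump_target(80020, 20000))   # 80000
print(branch_target(80012, 2))     # 80024
print(branch_target(80012, -4))    # a negative offset branches backwards
```

Dropping the two low zero bits is what lets the 26-bit field cover a 28-bit byte range; the top 4 bits always come from the current PC.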

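Procedure calls

1. Place parameters in registers
2. Transfer control to procedure
3. Acquire storage for procedure
4. Perform procedure's operations
5. Place result in register for caller
6. Return to place of call

call

jal ProcedureLabel does the following two things (the call):

It copies the PC into $ra as the return address: the address of the following instruction is put in $ra.
It sets the PC to PC(31-28) concatenated with the 26-bit address field × 4, which is also what j does: it jumps to the target address.

jr $ra (the return): copies $ra back into the PC.

jal and jr carry out steps 2 and 6 of the procedure call.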

Review of the registers used
$a0 – $a3: arguments (reg’s 4-7)
$v0, $v1: result values (reg’s 2 and 3)
$sp: stack pointer (reg 29)
$ra: return address (reg 31)

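Leaf procedure (one that calls no other procedure) example:

int leaf_example(int g, int h, int i, int j) // g, h, i, j arrive in $a0, $a1, $a2, $a3, always in this order
{
int f;
f = (g + h) - (i + j);
return f;
}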

Arguments g, …, j in $a0, …, $a3
f in $s0 (hence, need to save $s0 on stack)
Result in $v0

leaf_example:  

# Save $s0 on stack  
addi  $sp,  $sp, -4  
sw  $s0,  0($sp)  

# Procedure body
add  $t0,  $a0,  $a1  
add  $t1,  $a2,  $a3  
sub  $s0,  $t0,  $t1  

# Result
add  $v0,  $s0,  $zero  

# Restore $s0  
lw  $s0,  0($sp)  
addi  $sp,  $sp,  4  

# Return
jr  $ra

Non-leaf procedure example
Procedures that call other procedures
For a nested call, the caller needs to save on the stack:

Its return address
Any arguments and temporaries needed after the call
Restore from the stack after the call
The example here computes a factorial recursively.

C code:
int fact(int n)
{
if (n < 1) return 1;
else return n * fact(n - 1);
}

Argument n in $a0
Result in $v0

fact:  
addi  $sp,  $sp, -8  #  adjust stack  for  2  items  
sw  $ra,  4($sp)  #  save  return  address  
sw  $a0,  0($sp)  #  save  argument  
slti  $t0,  $a0,  1  #  test  for  n  <  1  
beq  $t0,  $zero, L1  
addi  $v0,  $zero,  1  #  if  so,  result  is  1  
addi  $sp,  $sp,  8  #  pop  2  items  from  stack  
jr  $ra  #  and  return  
L1:  addi  $a0,  $a0, -1  #  else  decrement  n  
jal  fact  #  recursive  call  
lw  $a0,  0($sp)  #  restore  original  n  
lw  $ra,  4($sp)  #  and  return  address  
addi  $sp,  $sp,  8  #  pop  2  items  from  stack  
mul  $v0,  $a0,  $v0  #  multiply  to get  result  
jr  $ra  #  and  return
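To see what the stack is doing in the assembly above, the recursion can be unrolled by hand with an explicit stack. This is only a sketch of the idea, not a translation of the MIPS code instruction by instruction: the Python list plays the role of the saved $a0 values, and the saved return addresses are left implicit.

```python
def fact(n: int) -> int:
    """Factorial with an explicit stack that mirrors the saved arguments."""
    stack = []                     # one entry per active call: its saved n
    while not n < 1:               # slti/beq: keep "calling" while n >= 1
        stack.append(n)            # sw $a0, 0($sp)
        n -= 1                     # addi $a0, $a0, -1 ; jal fact
    result = 1                     # base case: addi $v0, $zero, 1
    while stack:
        result *= stack.pop()      # lw $a0, 0($sp) ; mul $v0, $a0, $v0
    return result


print(fact(5))  # 120
```

The first loop corresponds to the chain of jal calls pushing frames; the second corresponds to the chain of returns popping them and multiplying on the way back.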

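Other remarks

A static variable declared inside a C function is stored in memory rather than in a register.
Also, homework will not be assigned until next week.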

Computer Organization 01

September 21, 2023 · Study notes: Principles of Computer Organization


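Lecture 1

Intro

These are class notes by a computer science undergraduate at Wuhan University taking the Computer Organization and Design course.
Each blog post covers the key points of that day's lecture, plus a walkthrough of typical homework problems.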

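Course introduction

This course aims to help students understand the internal workings, structure and interconnection of the components of a uniprocessor computer system, and to form a complete picture of the computer system as a whole;
 to understand the layered structure of computer systems, become familiar with the hardware/software interface, and master the basics of RISC instruction set architectures, represented by MIPS;
 to calculate and analyze theoretical and practical problems concerning computer hardware systems; to design simple single-cycle, multi-cycle and pipelined datapaths and their controllers from instruction semantics; and to analyze problems in MIPS assembly language programming.

Textbook: Computer Organization and Design: The Hardware/Software Interface, David A. Patterson and John L. Hennessy, 5th edition.
Prerequisites: Digital Logic, C Programming
Assessment: homework 20%, random in-class roll call 10%, final exam 70%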

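Kinds of computers

Personal computers (PCs): general purpose, judged on cost/performance
Server computers: network based, with high capacity, performance and reliability
Supercomputers
Embedded computers: a component of a larger system, with tight constraints on power, performance and cost

Values and names in decimal and binary

Pay particular attention to ZB and YB.

Other important conversions
1 GHz = $1\times10^9$ Hz
1 ns = $1\times10^{-9}$ s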

Components of a computer (von Neumann) and their development

Components of a computer
Same components for all kinds of computer:

- Desktop, server, embedded
Input/output includes
- User-interface devices: display, keyboard, mouse
- Storage devices: hard disk, CD/DVD, flash
- Network adapters: for communicating with other computers

Semiconductor technology

From silicon ingot to chip

Response time and throughput

Response time: how long it takes to do a task
Throughput: total work done per unit time

e.g., tasks/transactions/… per hour

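Definitions and calculations of some quantities

Performance

Definition
$\text{Performance} = \frac{1}{\text{Execution Time}}$
Relative performance
"X is n times faster than Y" means
$\begin{align} \frac{\text{Performance}_X}{\text{Performance}_Y} &= \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} \\ &= n \end{align}$

e.g. time taken to run a program:
10 s on A and 15 s on B
$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = \frac{15}{10} = 1.5$, so A is 1.5 times faster than B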

measuring execution time

Elapsed time
- Total response time, including all aspects
  - Processing, I/O, OS overhead, idle time
- Determines system performance
CPU time
- Time spent processing a given job
  - Discounts I/O time, other jobs’ shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU and system performance

CPU Clocking

Clock period (clock cycle time)
Clock period versus clock cycle
The two terms can generally be used with the same meaning, although there can be a slight difference depending on the context.

Consider the CPU clock, which provides timing signals to coordinate all the hardware. The time interval after which that clock repeats its transition is called its clock period. This holds for any clock that provides a timing signal, not only for a CPU clock. (Think of a wall clock: the period of its second hand is 60 seconds.)

When we talk in terms of CPU architecture (for example pipelining), we use clock cycles as the unit for the time an instruction (or microinstruction) takes. This provides a layer of abstraction over the physical clock: if you change the clock, its period may change, but in CPU terms we still speak of 1 clock cycle.
My own reading: the former is a concrete value, e.g. this CPU's clock period is 1 ns. The latter is a unit of counting, e.g. this instruction takes 3 clock cycles.
Also, the clock period can be called the clock cycle time, abbreviated $T_c$.

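Clock frequency (clock rate)
Formulas

Clock period: the duration of one clock cycle
Clock frequency: the number of cycles per second, the reciprocal of the clock period
$\begin{align} \text{CPU Time} &= \text{CPU Clock Cycles} \times \text{Clock Cycle Time} \\ &= \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}} \end{align}$
Performance can be improved by:

Reducing the number of clock cycles
Increasing the clock rate
Hardware designers often have to trade off clock rate against cycle count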

Instruction Count and CPI

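CPI: cycles per instruction, the number of clock cycles each instruction takes.

IC: used here as the abbreviation for instruction count, not for integrated circuit as it usually is.

$\text{Clock Cycles} = IC \cdot CPI$
$\begin{align} \text{CPU Time} &= IC \cdot CPI \cdot \text{Clock Cycle Time}\\ &= \frac{IC \cdot CPI}{\text{Clock Rate}} \end{align}$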
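The CPU time formula translates directly into code. A minimal sketch follows; the function name `cpu_time` and the sample program (10^9 instructions, CPI 2.0, 4 GHz clock) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = IC * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical program: 10^9 instructions, CPI 2.0, 4 GHz clock (T_c = 250 ps)
print(cpu_time(1e9, 2.0, 4e9))   # 0.5

# Halving the CPI or doubling the clock rate has the same effect on CPU time
print(cpu_time(1e9, 1.0, 4e9) == cpu_time(1e9, 2.0, 8e9))
```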

More about CPI

Average and weighted average
$\text{Clock Cycles} = \sum_{i=1}^{n}{(CPI_i \cdot \text{Instruction Count}_i)}$

$CPI = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n}{\left(CPI_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)}$

Improving performance

Detailed formula
$\text{CPU Time} = \frac {\text{Instructions}} {\text{Program}} \cdot \frac {\text{Clock cycles}} {\text{Instruction}} \cdot \frac{\text{Seconds}} {\text{Clock cycle}}$
Performance depends on:

Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC, CPI, $T_c$

Power trend $\uparrow$ $\rightarrow$ power wall $\to$ multiprocessors

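SPEC (Standard Performance Evaluation Corporation)

SPEC CPU benchmark

SPECratio: a single number used to summarize the 12 integer benchmarks
12 benchmarks
To simplify the results, SPEC decided to summarize all 12 integer benchmarks with a single number. The method is to normalize the execution time of the computer under test, that is, to divide the execution time of the computer under test by that of a reference processor; the result is called the SPECratio. The larger the SPECratio, the faster the performance (because SPECratio is inversely related to execution time). The overall CINT2006 or CFP2006 result is the geometric mean of the SPECratios.

The textbook's Chinese translation is misleading here. Judging from the figure and the results of related exercises,
$\text{SPECratio} = \frac{\text{reference time}}{\text{measured time}}$
which is why a larger SPECratio means faster performance.

The geometric mean is
$\sqrt[n]{\prod_{i=1}^{n} \text{execution time ratio}_i}$
where execution time ratio $_i$ is the execution time of the $i$-th of the $n$ workload programs, normalized to the reference computer.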
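A short sketch of the two definitions above. The helper names are made up, and the list of ratios passed to `geometric_mean` is invented sample data, not SPEC results; only the 9650 s / 750 s pair comes from the exercise later in this post.

```python
import math


def spec_ratio(reference_time: float, measured_time: float) -> float:
    """SPECratio = reference time / measured time (larger is faster)."""
    return reference_time / measured_time


def geometric_mean(ratios: list) -> float:
    """n-th root of the product of n ratios."""
    return math.prod(ratios) ** (1 / len(ratios))


print(round(spec_ratio(9650, 750), 2))
print(geometric_mean([2.0, 8.0]))   # 4.0, whereas the arithmetic mean would be 5.0
```

The geometric mean is used because it gives the same relative ranking no matter which machine is chosen as the reference.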

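SPEC power benchmark

Performance is measured as throughput, in operations completed per second. Again to simplify the results, SPEC summarizes them with a single number, called "overall ssj_ops per watt", computed as:
$\text{overall ssj\_ops per watt} = (\sum_{i=0}^{10} ssj\_ops_i) / (\sum_{i=0}^{10} power_i)$
where ssj_ops $_i$ is the performance of the workload at each 10% load increment and power $_i$ is the corresponding power consumption.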

Concluding Remarks

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance

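Exercises

[Clock cycle calculations]
Consider two different implementations of the same instruction set architecture. The instructions are divided into four classes (A, B, C and D) according to their CPI. P1 has a clock rate of 2.5 GHz and CPIs of 1, 2, 3 and 3; P2 has a clock rate of 3 GHz and CPIs of 2, 2, 2 and 2.
Given a program with 1.0x10 dynamic instructions, divided into the four classes as follows: A, 10%; B, 20%; C, 50%; D, 20%.
a. What is the global CPI for each implementation?
b. Find the total number of clock cycles in both cases.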
1.6

As an aside, this problem only asks for the number of clock cycles, so there is no need to use the clock rate to compute CPU time.
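The weighted-average CPI for this exercise can be worked out as below. This is a sketch with one loud assumption: the exponent of the instruction count is missing from these notes, so `INSTRUCTION_COUNT = 1.0e6` is a placeholder used only to turn the cycle counts into concrete numbers. The CPI values do not depend on it.

```python
# Instruction mix and per-class CPI taken from the exercise statement
MIX = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}
CPI_P1 = {"A": 1, "B": 2, "C": 3, "D": 3}
CPI_P2 = {"A": 2, "B": 2, "C": 2, "D": 2}

# ASSUMPTION: placeholder value, the real exponent is not given in the notes
INSTRUCTION_COUNT = 1.0e6


def global_cpi(cpi_by_class: dict, mix: dict) -> float:
    """Weighted-average CPI: sum of CPI_i times the fraction of class i."""
    return sum(cpi_by_class[c] * mix[c] for c in mix)


for name, table in (("P1", CPI_P1), ("P2", CPI_P2)):
    cpi = global_cpi(table, MIX)
    cycles = cpi * INSTRUCTION_COUNT          # Clock Cycles = IC * CPI
    print(name, "CPI =", round(cpi, 2), "cycles =", round(cycles))
```

P1 comes out at a global CPI of 2.6 and P2 at 2.0, so P2 needs fewer cycles for this mix even though P1 is faster on class A instructions.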

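SPECratio problems
1.11 The bzip2 benchmark of SPEC CPU2006, run on an AMD Barcelona processor, executes a total of $2.38\times10^{12}$ instructions, with an execution time of 750 s and a reference time of 9650 s.
1.11.1 [5] <1.6, 1.9> If the clock cycle time is 0.333 ns, find the CPI.
[5] <1.9> Find the SPECratio.
[5] <1.6, 1.9> If the instruction count of the benchmark increases by 10% and the CPI is unchanged, by how much does the CPU time increase?
[5] <1.6, 1.9> If the instruction count of the benchmark increases by 10% and the CPI increases by 5%, by how much does the CPU time increase?
[5] <1.6, 1.9> Given the changes in instruction count and CPI in the previous question, find the change in the SPECratio.
[10] <1.6> Suppose a new AMD Barcelona processor is developed that runs at 4 GHz. New instructions are added to its instruction set, so that the instruction count of the program drops by 15% and the execution time falls to 700 s. The new SPECratio is 13.7. Find the new CPI.
[10] <1.6> When the clock rate rises from 3 GHz to 4 GHz, the CPI computed in the previous question is higher than the one in 1.11.1. Determine whether the CPI rose by the same proportion as the clock rate. If not, why?
[5] <1.6> By how much has the CPU time been reduced?
[10] <1.6> For a second benchmark, libquantum, assume an execution time of 960 ns, a CPI of 1.61 and a clock rate of 3 GHz. If, at a clock rate of 4 GHz and without affecting the CPI, the execution time is reduced by 10%, find the instruction count.
[10] <1.6> With the instruction count and CPI unchanged, find the clock rate needed to reduce the CPU time by a further 10%.
[10] <1.6> With the instruction count unchanged, find the clock rate needed to lower the CPI by 15% and reduce the CPU time by 20%.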

1.11-1
1.11-2
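The first few parts of 1.11 follow directly from the formulas in this post. The sketch below is one way to work them, not the official solution; the variable names are made up and all input numbers come from the problem statement.

```python
INSTRUCTIONS = 2.38e12
EXEC_TIME_S = 750.0
REFERENCE_TIME_S = 9650.0
CYCLE_TIME_S = 0.333e-9

# 1.11.1: CPU time = IC * CPI * T_c, so CPI = time / (IC * T_c)
cpi = EXEC_TIME_S / (INSTRUCTIONS * CYCLE_TIME_S)

# SPECratio = reference time / measured time
spec_ratio = REFERENCE_TIME_S / EXEC_TIME_S

# IC +10%, CPI unchanged: CPU time scales by 1.10
time_ic_up = EXEC_TIME_S * 1.10

# IC +10% and CPI +5%: CPU time scales by 1.10 * 1.05 = 1.155
time_ic_cpi_up = EXEC_TIME_S * 1.10 * 1.05

# SPECratio after that change
spec_ratio_after = REFERENCE_TIME_S / time_ic_cpi_up

print("CPI:", round(cpi, 3))
print("SPECratio:", round(spec_ratio, 2))
print("CPU time, IC +10%:", round(time_ic_up, 2), "s")
print("CPU time, IC +10% and CPI +5%:", round(time_ic_cpi_up, 2), "s")
print("SPECratio afterwards:", round(spec_ratio_after, 2))
```

Because CPU time is a product of IC, CPI and cycle time, percentage increases multiply rather than add: 10% and 5% together give 15.5%, not 15%.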

Computer Organization 02

September 19, 2023 · Study notes: Principles of Computer Organization


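Lecture 2

Previous post
Computer Organization 01 ~ 绯境之外~Outside of Scarlet (scarletborder.blogspot.com)

Additions to the previous lecture

Quick conversion between cycle time and frequency

| cycle time | clock rate |
| --- | --- |
| 250 ps | 4 GHz |
| 500 ps | 2 GHz |
| 1000 ps | 1 GHz |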

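Common conditions in problems

When a problem says "same ISA" (instruction set architecture), the high-level program compiles to the same assembly code, so the instruction count is the same.
When asked how much faster, express performance as a ratio ("n times faster") rather than as a percentage such as "50% faster".

Addenda

Instruction classes: in some ISAs such as x86, the CPI of different instructions varies enormously, from under one CPU cycle for the shortest to very many cycles for the longest (e.g. matrix multiplication), so instructions are grouped into classes and their time is computed per class. The average CPI is then computed from the classes, which gives a more accurate result.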

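Computer power:
$\text{Power} = \text{capacitive load} \times \text{voltage}^2 \times \text{frequency}$

Threshold: the power wall. It must not be exceeded, or the processor burns out.
But we can no longer lower the voltage further or remove more heat in order to gain performance.

Power wall -> a different route: multiprocessors
Amdahl's law
Speeding up one functional unit n times $\neq$ speeding up the whole processor n times
$T_\text{improved} = \frac {T_\text{affected}} {\text{improvement factor}} + T_\text{unaffected}$

e.g. In a program, the multiplication part takes 80 s out of a total of 100 s.
We want to make the whole program 5 times faster. Then
$20 =\frac {80} n + 20$
This would require n to be infinite, so improving the multiplication part alone can never reach a 5× overall speedup.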
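The 80 s / 100 s example can be explored numerically. A minimal sketch, with `improved_time` as a made-up helper name; the improvement factors tried are arbitrary.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Amdahl's law: only the affected part shrinks by the improvement factor."""
    return t_affected / factor + t_unaffected


TOTAL = 100.0
for n in (2, 10, 1000):
    t = improved_time(80.0, 20.0, n)
    print(f"multiply {n}x faster -> {t:.2f} s, overall speedup {TOTAL / t:.2f}x")
```

However large the factor becomes, the 20 s of unaffected time remains, so the overall speedup approaches 5× but never reaches it.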

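Intro to instructions

Instruction set: the collection of all instructions a computer supports.
Common instruction sets include x86, ARM, MIPS and RISC-V.
Of these, x86 is CISC and the rest are RISC (reduced instruction set computer, which is what this course studies).
But most of what they contain is the same.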

e.g. x86 carries the legacy of its 8-bit predecessors (the Intel 8080 family, with which the Z80 is compatible)

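MIPS is simple and easy to understand and implement, with roughly 50+ instructions.
For instruction syntax, these references are recommended:
The official MIPS website
 Introduction to MIPS Assembly Language (MIPS汇编语言入门) - Sylvain’s Blog (valeeraz.github.io)

MIPS assembly language

Commonly found in embedded processors, e.g. in routers and printers

The instruction set is divided into:

Arithmetic/logic instructions: arithmetic operations and logical operations
Branch instructions: conditional / unconditional
Memory access instructions: read and write, load/store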

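The stored-program concept

An instruction has two parts, the opcode and the operands.
All arithmetic operations have this form.

At minimum an instruction needs an opcode, since the number of operands may be zero.

add a,b,c

| opcode | data operands |
| --- | --- |
| add | a,b,c |

Different instructions take different numbers of operands, roughly in [0, 8].

e.g. Simulating this in C:
break takes 0 operands, goto(label) takes 1 operand, a = b++ takes 2 operands

b and c are the source operands; a is the destination operand.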

Now, let's design.
Design Principle 1: Simplicity favours regularity

Regularity makes implementation simpler
Simplicity enables higher performance at lower cost

How long is a single instruction?
Fixed: 32 bits
bit 31 | opcode <-> operands | bit 0

Most opcodes occupy 6 bits, except for some special ones
Operands: the values that take part in the operation

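Operand design

Design Principle 2: Smaller is faster

Registers
Memory addresses
Immediates (constants)

Register operands

The processor contains an ALU (arithmetic logic unit) and an RF (register file).

Registers hold the operands, data and results of operations, both intermediate and final. Physically they are built from flip-flops or latches that store data.
A 32-bit register is built from 32 such storage elements, one per bit.
The registers are numbered 0 to 31.
In the MIPS ISA, 32 bits of data is called a word.

For example, $t0, $t1, $t2, …, $t9 are for temporary values, and $s0, …, $s7 are for saved variables.

A register operand takes 5 bits in an instruction, which follows from there being 32 registers: $2^5 = 32$.
Declaring a variable in C $\to$ space is allocated in advance.

Assembly language has no variables.
Global variables vs. local variables:
local variables are allocated to registers, which are very fast;
global variables live in main memory, which is comparatively slow.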

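Memory operands (memory addresses)

Memory is byte-addressed; each byte stores 8 bits, i.e. eight binary digits.
Picture memory as a building in which the room numbers are the addresses, numbered consecutively.
With 32 rooms, 5 bits are needed to number them: 00000-11111.
Take the hex number 0x12345678:
0x12 is the most significant byte and 0x78 the least significant byte; the other two are the upper-middle and lower-middle bytes.
This hex number occupies 4 bytes, so it is split across 4 consecutive storage units.

MIPS is big endian: the most significant byte is placed at the lowest address. This differs from x86, which is little endian.

On a read, the lowest address of the datum is used as its address.
How many bytes are read depends on the opcode.
For example, lw means load word, so it reads one word (4 bytes).

In MIPS a word is 4 bytes (the word size depends on the architecture).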
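Byte order is easy to see with Python's standard `struct` module, which can lay out the same 32-bit value in either order. This is just an illustration of the two layouts; the offsets printed are relative to wherever the word would start in memory.

```python
import struct

value = 0x12345678
big = struct.pack(">I", value)      # big endian, as described for MIPS above
little = struct.pack("<I", value)   # little endian, as on x86

print("offset  big  little")
for offset, (b, l) in enumerate(zip(big, little)):
    print(f"+{offset}      {b:02x}   {l:02x}")
```

In the big-endian column the byte at offset +0 is 0x12, the most significant byte; in the little-endian column it is 0x78.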

Registers vs. memory: pros and cons

Registers are faster to access than memory
Operating on memory data requires loads and stores
- More instructions to be executed
Compiler must use registers for variables as much as possible
- Only spill to memory for less frequently used variables
- Register optimization is important!

Immediate operands

e.g.

addi $s3, $s3, 4

There is no subtract-immediate instruction; just use a negative constant.

addi $s3, $s2, -5

Design Principle 3: Make the common case fast
Advantages of immediates

Small constants are common
Immediate operand avoids a load instruction

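If immediates are faster than registers, and give a lower CPI, why not use immediates everywhere and drop registers? Because the range an immediate can express is limited.
Take the instruction

addi $s1, $s2, -1

The op, rs and rt fields take 6, 5 and 5 bits,
leaving only 16 bits for the immediate,
whose range is $-2^{15} \sim 2^{15} - 1$.

As a special case, the constant 0 is normally taken from register $zero rather than written as an immediate.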
MIPS register 0 ($zero) is the constant 0

Cannot be overwritten
Useful for common operations
E.g., move between registers

add  $t2,  $s1,  $zero

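A few more words

In the author's course, homework is assigned every two weeks, so worked problems will not appear until next week's post. This chapter is not finished yet; see you next time.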