Appendix B Flashcards
(38 cards)
Cache performance
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
CPU clock cycles include the time to handle cache hits; the processor stalls during cache misses
Miss penalty
Cost per miss (the number of cycles the processor is stalled)
Memory stall cycles
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
IC - instruction count
Memory stall cycles for reads and writes
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
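A rough sketch of this formula in Python (all parameter values below are made-up, purely for illustration):

    # Memory stall cycles split into read and write components.
    def memory_stall_cycles(ic, reads_per_instr, read_miss_rate, read_penalty,
                            writes_per_instr, write_miss_rate, write_penalty):
        read_stalls = ic * reads_per_instr * read_miss_rate * read_penalty
        write_stalls = ic * writes_per_instr * write_miss_rate * write_penalty
        return read_stalls + write_stalls

    # Illustrative values: 1.3 reads and 0.2 writes per instruction, 1% miss
    # rates, 50-cycle penalty -> 750,000 stall cycles per million instructions.
    print(memory_stall_cycles(1_000_000, 1.3, 0.01, 50, 0.2, 0.01, 50))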
Example 1 - Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 50 clock cycles and the miss rate is 1%, how much faster would the computer be if all instructions were cache hits?
CPU execution time all hits = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x CPI + 0) x Clock cycle time = IC x 1.0 x Clock cycle time
Memory stall cycles = IC x (1 + 0.5) x 0.01 x 50 = 0.75 x IC (one instruction fetch plus 0.5 data accesses per instruction, each with a 1% miss rate and 50-cycle penalty)
CPU execution time with misses = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x 1.0 + IC x 0.75) x Clock cycle time = 1.75 x IC x Clock cycle time
Speedup = CPU execution time with misses / CPU execution time all hits = (1.75 x IC x Clock cycle time) / (1.0 x IC x Clock cycle time) = 1.75
The CPU with no cache misses is 75% faster.
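The arithmetic of Example 1, checked in a short Python sketch:

    cpi_base = 1.0
    accesses_per_instr = 1.0 + 0.5     # one instruction fetch + 0.5 data accesses
    miss_rate, miss_penalty = 0.01, 50

    stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty   # 0.75
    speedup = (cpi_base + stalls_per_instr) / cpi_base
    print(speedup)   # 1.75 -> the all-hits machine is 75% faster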
Misses per instruction
Misses/Instruction = Miss rate x (Memory accesses/Instruction)
Example 2 - The same computer as in Example 1 (CPI of 1.0 with all hits, loads and stores are 50% of instructions, 50-cycle miss penalty), but instead of the miss rate we are given misses per 1000 instructions: with 1.5 memory accesses per instruction and a 1% miss rate, that is 15 misses per 1000 instructions.
What is memory stall time in terms of instruction count?
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = (IC/1000) x (Misses per 1000 instructions) x Miss penalty = (IC/1000) x 15 x 50 = 0.75 x IC
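A quick check of the conversion in Python:

    misses_per_1000 = 1.5 * 0.01 * 1000      # 15 misses per 1000 instructions
    stalls_per_instr = misses_per_1000 * 50 / 1000
    print(stalls_per_instr)                  # 0.75, matching Example 1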
Four memory hierarchy questions
Q1 Where can a block be placed in upper level? (block placement)
Q2 How is a block found if it is in the upper level? (block identification)
Q3 Which block should be replaced on a miss? (block replacement)
Q4 What happens on a write? (write strategy)
Cache - Q1 Where can a block be placed in upper level? (block placement)
Fully associative - block can be placed anywhere in cache
Direct mapped - each block has exactly one place in the cache. Mapping is usually done using (Block address) mod (Number of blocks in cache)
Set associative - a block can be placed anywhere within one set of blocks in the cache. Mapping is usually done using (Block address) mod (Number of sets in cache). For n blocks in a set, the cache is n-way set associative
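A small Python sketch of the index computations (the cache geometry is made-up):

    def direct_mapped_block(block_addr, num_blocks):
        return block_addr % num_blocks       # exactly one legal slot

    def set_associative_set(block_addr, num_sets):
        return block_addr % num_sets         # any way within this set is legal

    # Example: block address 12, cache with 8 blocks.
    print(direct_mapped_block(12, 8))   # 4
    print(set_associative_set(12, 4))   # 0 (2-way: 8 blocks / 2 ways = 4 sets)
    # Fully associative: no index computation - any slot in the cache is legal.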
Cache - Q2 How is a block found if it is in the upper level? (block identification)
- valid bit - indicates whether the entry contains a valid block
- block frame address - tag field and index field
- index field - selects the set (or the block, in a direct-mapped cache)
- tag field - compared against the address on every access to check for a hit. The index bits are not stored in the tag - they are implied by the set the block sits in, which saves hardware
- block offset - selects the desired data within the block; it is not compared
For speed, all tags in the selected set are searched in parallel.
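How a block frame address splits into tag, index, and offset, as a minimal Python sketch (the cache geometry below is an assumption for illustration):

    BLOCK_SIZE = 64    # bytes -> 6 offset bits (assumed)
    NUM_SETS = 256     # -> 8 index bits (assumed)

    def split_address(addr):
        offset = addr % BLOCK_SIZE
        index = (addr // BLOCK_SIZE) % NUM_SETS
        tag = addr // (BLOCK_SIZE * NUM_SETS)   # index bits are not stored in the tag
        return tag, index, offset

    print(split_address(0x12345678))   # (18641, 89, 56)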
Cache - Q3 Which block should be replaced on a miss? (block replacement)
Direct mapped - only one block is checked, so there is no choice
Fully associative and set associative - multiple blocks are candidates, so a policy must choose which one to replace.
Three methods:
Random - the candidate block is selected randomly; using pseudorandom numbers makes the behavior reproducible for debugging
LRU - least recently used - replaces the block that has gone unused for the longest time
FIFO - first in, first out - approximates LRU by replacing the oldest block rather than the least recently used one
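A minimal LRU sketch for a single set, using Python's OrderedDict (set size and tags are illustrative):

    from collections import OrderedDict

    class LRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.blocks = OrderedDict()   # first entry = least recently used

        def access(self, tag):
            if tag in self.blocks:
                self.blocks.move_to_end(tag)     # hit: now most recently used
                return "hit"
            if len(self.blocks) >= self.ways:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[tag] = True
            return "miss"

    s = LRUSet(ways=2)
    print([s.access(t) for t in (1, 2, 1, 3, 2)])
    # ['miss', 'miss', 'hit', 'miss', 'miss']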
Cache - Q4 What happens on a write? (write strategy)
Reads are the most common cache operation in RISC processors - all instruction fetches are reads.
Reads are also simpler to implement: the block can be read at the same time its tag is compared, and on a miss the data just read is simply ignored. Writes cannot do this - the block must not be modified until the tag check confirms a hit - and writes must also specify their size, which makes them more complex.
Techniques for writes with a cache:
write through - information is written both to the cache block and to the lower-level memory
write back - information is written only to the cache and copied to main memory when the block is replaced - tracked with a dirty bit
Write-back writes happen at the speed of the cache and use less memory bandwidth and energy - attractive for embedded systems, servers, and multiprocessor systems
Write through is easier to implement - the cache is always clean, so a read miss never has to write a block back to the lower level, which also simplifies data coherency. This makes write through more viable for multilevel caches
Write stall - in write through, the processor must wait for a write to complete. Solution -> a write buffer, which lets the processor continue while memory is updated
Write-miss options:
Write allocate - the block is allocated on a write miss, followed by the write-hit actions above. In this natural option, write misses act like read misses.
No-write allocate - write misses do not affect the cache; the block is modified only in the lower-level memory.
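A tiny Python sketch contrasting the two options on an illustrative access sequence (assuming a cache that starts empty and never evicts):

    def count_hits(accesses, write_allocate):
        cache, hits = set(), 0
        for op, block in accesses:
            if block in cache:
                hits += 1
            elif op == "read" or write_allocate:
                cache.add(block)   # allocate the block on this miss
            # no-write allocate: a write miss updates only the lower level
        return hits

    seq = [("write", 100), ("write", 100), ("read", 200),
           ("write", 200), ("write", 100)]
    print(count_hits(seq, write_allocate=True))    # 3 hits, 2 misses
    print(count_hits(seq, write_allocate=False))   # 1 hit, 4 misses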
Average memory access time
Average memory access time = Hit time + Miss rate x Miss penalty
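For example, with an assumed 1-cycle hit time, 2% miss rate, and 100-cycle miss penalty: Average memory access time = 1 + 0.02 x 100 = 3 clock cycles (numbers are illustrative).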
Relationship between CPI and fast clock
- The lower the CPI(execution), the higher the relative impact of a fixed number of cache-miss clock cycles
- The cache miss penalty is measured in processor clock cycles. Even if two computers have identical memory hierarchies, the processor with the higher clock rate spends more clock cycles per miss and therefore has a higher memory portion of CPI
This emphasizes the importance of caches for processors with high clock rates and low CPI
Cache optimization categories
Reducing the miss rate - larger block size, larger cache size and higher associativity
Reducing the miss penalty - multilevel caches and giving reads priority over writes
Reducing the time to hit in the cache - avoiding address translation when indexing the cache
Cache miss categories
Compulsory - the very first access to a block can never be a hit (cold-start misses)
Capacity - the cache usually cannot contain all the blocks a program needs, so blocks are discarded and later retrieved
Conflict - with set-associative or direct-mapped placement, a miss can occur because too many blocks map to the same set, even when other sets have room
Cache optimization 1: Larger block size to reduce miss rate
- takes advantage of spatial locality
- increases miss penalty
- may increase conflict misses
- the increase in miss penalty may outweigh the decrease in miss rate
Cache optimization 2: Larger Caches to reduce miss rate
- higher capacity reduces capacity misses
- longer hit time, higher cost and power consumption
Cache optimization 3: Higher associativity to reduce miss rate
Two rules of thumb:
- in practice, an 8-way set associative cache is as effective at reducing misses as a fully associative cache
- 2:1 cache rule of thumb - a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2
- improves miss rate, but increases hit time
Cache optimization 4: Multilevel Caches to reduce miss penalty
- keep the first-level cache small and fast enough to match the processor's clock cycle
- second level large enough to cover most memory accesses
Performance analysis:
Average memory access time = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
Average memory stalls per instruction = Misses per instruction(L1) x Hit time(L2) + Misses per instruction(L2) x Miss penalty(L2)
Miss rates:
- Local miss rate - the number of misses in a cache divided by the total number of accesses to that cache (Miss rate(L1), Miss rate(L2))
- Global miss rate - the number of misses in the cache divided by the total number of memory accesses generated by the processor (Miss rate(L1) for L1, Miss rate(L1) x Miss rate(L2) for L2)
Practice shows that L2 caches must be huge compared to L1 caches
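A quick Python sketch of the two-level formulas and miss rates above (all parameters are made-up):

    hit_time_l1, hit_time_l2 = 1, 10          # cycles (assumed)
    miss_rate_l1, miss_rate_l2 = 0.05, 0.40   # local miss rates (assumed)
    miss_penalty_l2 = 100                     # cycles to main memory (assumed)

    amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_l2)
    global_miss_rate_l2 = miss_rate_l1 * miss_rate_l2

    print(amat)                  # 1 + 0.05 x (10 + 0.40 x 100) = 3.5 cycles
    print(global_miss_rate_l2)   # 0.02 -> 2% of all processor accesses reach memory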
- Multilevel inclusion - data in the L1 cache is always also present in the L2 cache
- Same vs. different block sizes - a smaller block size in L1 can give faster accesses but higher L1 miss penalties, and differing sizes complicate inclusion; keeping the L1 and L2 block sizes equal avoids the problem
- Multilevel exclusion - if the L2 cache is only slightly bigger than L1, exclusion makes sense instead: data lives in at most one of the caches, and on an L1 miss the L1 and L2 blocks are swapped
Cache optimization 5: Giving priority to read misses over writes to reduce miss penalty
- the processor must wait on read misses, so they are served before pending writes complete
- with write through, the write buffer may still hold the updated value of a location needed on a read miss (a read-after-write hazard)
- simplest solution - the read miss waits until the write buffer is empty
- faster solution - check the write buffer contents on a read miss; if there is no conflict, let the read miss proceed before the buffered writes
Cache optimization 6: Avoiding address translation during indexing of the cache to reduce hit time
- should the cache be indexed and tagged with virtual or physical addresses?
- protection - page-level protection must still be enforced
- solution - copy the protection information from the TLB on a miss and keep it with the cache block
- switching processes changes the virtual-to-physical mapping (the virtual addresses may stay the same!), so a virtual cache must be flushed
- solution - use a process identifier (PID) as part of the tag
- duplicate copies - different virtual addresses may refer to the same physical address
- such synonyms or aliases could leave two copies of the same data in a virtual cache
- antialiasing - hardware guarantees every cache block has a unique physical address; I/O, which typically uses physical addresses, also needs a mapping to interact with a virtual cache
- in practice caches are usually physically addressed; virtual caches are harder to implement correctly
- compromise - virtual indexing with physical tagging
- the cache is indexed with the page offset, which is identical in the virtual and the physical address
- the tags use physical addresses, so translation can proceed in parallel with the cache lookup
- drawback -> a direct-mapped cache can be no bigger than the page size, since the index must come entirely from the page offset
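For example, with 4 KB pages the page offset is 12 bits, so a direct-mapped cache indexed only by the page offset can hold at most 4 KB; associativity raises the limit (an 8-way 32 KB cache has 4 KB per way, so its index still fits in the page offset). Numbers are illustrative.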
Virtual memory - why?
- in the past programmers had to divide programs into smaller pieces (overlays) and ensure that a program never used more than the computer's main memory -> too great a burden
- solution - virtual memory, which is managed automatically
Virtual memory - principles
- processor generates virtual addresses
- virtual addresses are mapped to physical addresses
- translating virtual addresses to physical addresses is called memory mapping or address translation
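A minimal sketch of memory mapping in Python, with a toy page table (page size and mappings are made-up):

    PAGE_SIZE = 4096                    # 4 KB pages (assumed)
    page_table = {0: 7, 1: 3, 2: 12}    # virtual page number -> physical frame (toy data)

    def translate(virtual_addr):
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpn not in page_table:
            raise LookupError("page fault")   # the OS would fetch the page from disk
        return page_table[vpn] * PAGE_SIZE + offset

    print(hex(translate(0x1234)))   # virtual page 1 -> frame 3 -> 0x3234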