Lecture 9 - Simulation Flashcards
(20 cards)
Problem (Motivation - Introduction)
Designing Architectures is generally expensive (economically).
- Costs of High-Level Design
- Costs of Verification of Design
- Costs of Low-level Design
It's hard to determine whether a new architectural design works other than by simulating it.
What does a simulator simulate?
- Component Simulator: Simulates an architectural component, e.g., branch predictor, cache.
- Instruction Set Simulator: Simulates just the instruction set of a particular architecture. NOTE: might have to emulate the OS interface.
- Microarchitecture simulator: Simulates the underlying microarchitecture (e.g., cache, BP, ROB, speculation…). NOTE: could be cycle-accurate, or cycle-approximate.
- Full-system simulator: Simulates everything (although not necessarily microarchitecture) and has to deal with IO, MMU, interrupts….
- System-on-chip simulator: Simulates the full system on a chip including CPU, GPU, DSPs, additional processors, network, and IO.
- Electronic circuit simulator: Simulates the electrical components and signals in a circuit.
Terminology
- Host machine: The platform that is running the simulator.
- Guest/target machine: The platform that is being simulated (e.g., AArch64, x86).
- Cross-architecture: Simulating a different architecture than the host (e.g., x86 on AArch64).
- Same-architecture: Simulating the same architecture as the host (e.g., x86 on x86).
Instruction Set Simulator
- Simulates just the instruction set of the guest architecture.
- User-mode simulation: Takes a single application binary as input and executes the guest instructions without needing to simulate system instructions.
- Full-system simulation: Boots an entire operating system as if it were running on a real machine and needs to simulate system instructions.
- Functional: Instructions update architectural state (e.g., register files) as on a real machine, but microarchitectural state is not modelled.
- Cycle-accurate: Model of microarchitecture required to simulate caches, branch predictors, etc.
- Paravirtualization: Program/OS under simulation “knows” it’s being simulated.
Microarchitecture Simulator
- Simulates the underlying microarchitectural components of the platform, e.g., the caches, branch predictor, reorder buffer.
- Typically cycle-accurate but can be slow.
- Trace driven: Updates microarchitectural state based on a fixed sequence of records coming from a trace file.
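As an illustration of the trace-driven idea, here is a minimal sketch that replays a list of memory addresses through a direct-mapped cache model. The cache geometry (NUM_SETS, LINE_BYTES), the trace format, and the function name are assumptions made for this example, not details from the lecture.

```python
# Minimal trace-driven simulator for a direct-mapped cache (illustrative only).
# NUM_SETS, LINE_BYTES and the trace format are assumptions for this sketch.

NUM_SETS = 4      # number of cache lines (hypothetical)
LINE_BYTES = 16   # bytes per cache line (hypothetical)

def simulate_trace(trace):
    """Replay a list of memory addresses and return (hits, misses)."""
    tags = [None] * NUM_SETS          # microarchitectural state: one tag per set
    hits = misses = 0
    for addr in trace:
        line = addr // LINE_BYTES     # memory line the address falls in
        index = line % NUM_SETS       # set that the line maps to
        tag = line // NUM_SETS        # distinguishes lines sharing a set
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag         # fill the line on a miss
    return hits, misses

# In a real tool the records would come from a trace file.
trace = [0x00, 0x04, 0x40, 0x00, 0x10, 0x14]
print(simulate_trace(trace))  # (2, 4)
```

Only the cache tags are updated here; no instructions are executed, which is what makes the approach trace driven rather than execution driven.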
Full-System Simulator
- Simulates not only the instruction set of the architecture but also devices that might be attached to the platform, MMU for virtual memory, and interrupts.
- May or may not model the microarchitecture; a microarchitectural model is needed only if cycle-accurate simulation is desired.
- Instruction Set Simulation extended to include “system” instructions.
- Emulation of devices (e.g., serial, storage, networking, etc.) needs to be implemented.
- Emulation of MMU required.
- Needs the ability to handle interrupts.
System-on-chip Simulator
- Simulates multiple components that might be found on a single chip, i.e., multiple processors (CPUs, GPUs, DSPs), IO, memory, etc.
- Can be used to help application developers start writing software before the hardware becomes available.
- Can be modified to try experimental features - rapid prototyping.
- Gives an insight into how hardware could be adapted to suit the target application, or the firmware/software.
Scope of Simulation
- Cycle-accurate simulation.
- Cycle-approximate simulation.
- Instruction Set Simulation (or Functional Simulation): Just model the functional behavior of the guest architecture’s instructions.
- Hybrid approaches.
- Sampling-based (or statistical) simulation.
Approaches to Simulation
- Static simulation: Translates the application program ahead of time. Rarely used in practice because static binary translation is tricky.
- Dynamic simulation:
  - Interpreter: Can introduce instrumentation at runtime for measuring/debugging.
  - Dynamic Binary Translation: Can introduce instrumentation at runtime for measuring/debugging and is the most popular approach with lots of implementations.
Interpretation
- Fetch/Decode/Execute loop is SLOW but “simple”.
- Each instruction is processed one-at-a-time, in order.
- Execute behavior might be a function, or it might be inline in a big switch statement.
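The fetch/decode/execute loop can be sketched as follows. The toy guest ISA (li, add, bnez, halt, encoded as Python tuples) is invented for this example and does not correspond to any real architecture; the if/elif chain plays the role of the big switch statement.

```python
# Toy interpreter: the guest ISA below is hypothetical, chosen for brevity.

def interpret(program, regs=None):
    """Fetch/decode/execute loop over a toy program; returns the register file."""
    regs = dict(regs or {})
    pc = 0
    while pc < len(program):
        instr = program[pc]            # fetch
        op, *args = instr              # decode
        # execute: one branch per opcode, in "big switch statement" style
        if op == "li":                 # li rd, imm
            regs[args[0]] = args[1]
        elif op == "add":              # add rd, rs1, rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "bnez":             # bnez rs, target: branch if rs != 0
            if regs[args[0]] != 0:
                pc = args[1]
                continue
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs

# Sum 3 + 2 + 1 into r1 using a countdown loop.
program = [
    ("li", "r0", 3),             # 0: counter
    ("li", "r1", 0),             # 1: accumulator
    ("li", "r2", -1),            # 2: constant -1
    ("add", "r1", "r1", "r0"),   # 3: loop body
    ("add", "r0", "r0", "r2"),   # 4: counter -= 1
    ("bnez", "r0", 3),           # 5: loop while counter != 0
    ("halt",),                   # 6
]
print(interpret(program)["r1"])  # 6
```

Every guest instruction pays the fetch and decode cost each time it runs, including on every loop iteration, which is the main reason interpretation is slow.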
Dynamic Binary Translation
- Considers (at least) a guest basic block of instructions at a time.
- Translates the guest instructions in the block, into host instructions.
- Caches the translated code, and executes it.
- Tricky to implement and hard to make fast, but it is SIGNIFICANTLY faster than interpretation.
- Translation granularity can be basic-blocks, traces, or regions.
- Optimisation approaches can be none, intra-basic-block, inter-basic-block, trace or region.
- Compilation strategy can be always just-in-time, synchronous/asynchronous, or a hybrid interpreter/DBT.
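The translate, cache, and execute cycle can be sketched in a few lines. As an assumption for this example, the "host code" is generated Python source rather than machine code, and the guest ISA is an invented toy (li, add, bnez, halt). A real DBT would emit host instructions and handle many more cases.

```python
# Toy dynamic binary translator. The guest ISA and the use of Python source
# as "host code" are assumptions made for illustration.

def translate_block(program, start):
    """Translate guest code from `start` up to the next branch/halt into a function."""
    lines = ["def block(regs):"]
    pc = start
    while True:
        op, *a = program[pc]
        if op == "li":
            lines.append(f"    regs[{a[0]!r}] = {a[1]!r}")
        elif op == "add":
            lines.append(f"    regs[{a[0]!r}] = regs[{a[1]!r}] + regs[{a[2]!r}]")
        elif op == "bnez":    # block terminator: return the next guest PC
            lines.append(f"    return {a[1]} if regs[{a[0]!r}] != 0 else {pc + 1}")
            break
        elif op == "halt":    # block terminator: None means stop
            lines.append("    return None")
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    namespace = {}
    exec("\n".join(lines), namespace)   # "compile" the translated block
    return namespace["block"]

def run_dbt(program):
    """Execute `program` via translated blocks; returns (regs, translation count)."""
    regs = {}
    code_cache = {}      # guest PC -> translated function
    translations = 0
    pc = 0
    while pc is not None:
        block = code_cache.get(pc)
        if block is None:                 # miss: translate once, then reuse
            block = translate_block(program, pc)
            code_cache[pc] = block
            translations += 1
        pc = block(regs)                  # run translated code, get next guest PC
    return regs, translations

program = [
    ("li", "r0", 3),
    ("li", "r1", 0),
    ("li", "r2", -1),
    ("add", "r1", "r1", "r0"),
    ("add", "r0", "r0", "r2"),
    ("bnez", "r0", 3),
    ("halt",),
]
regs, translations = run_dbt(program)
print(regs["r1"], translations)  # 6 3
```

The loop body at guest PC 3 executes three times but is translated only once, since later iterations hit the code cache. This amortisation of decode cost is where the speed-up over interpretation comes from.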
Simulation Speeds
- Varies greatly depending on the scope and implementation.
- Functional Simulation: A really good interpreter reaches 100+ MIPS (2-100x slowdown); a really good DBT reaches 3000 MIPS (sometimes faster than native!).
- Cycle Accurate Simulation: 0.1-1 MIPS (10,000-100,000x slowdown).
- Multicore Cycle Accurate Simulation: 1-5 KIPS (1,000,000x slowdown).
- MIPS: Millions of (guest) Instructions Per Second, a measure of simulation throughput.
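To get a feel for these rates, the following sketch converts them into wall-clock time. The workload of one billion guest instructions is a hypothetical figure, and each rate is a single point picked from the ranges listed above.

```python
# Wall-clock time to simulate a workload at different simulation rates.
# WORKLOAD is a hypothetical figure; rates are points from the ranges above.

def sim_seconds(guest_instructions, guest_ips):
    """Seconds needed to simulate `guest_instructions` at `guest_ips` instr/s."""
    return guest_instructions / guest_ips

WORKLOAD = 10**9   # one billion guest instructions

rates = {
    "DBT, 3000 MIPS": 3_000_000_000,
    "interpreter, 100 MIPS": 100_000_000,
    "cycle-accurate, 1 MIPS": 1_000_000,
    "multicore cycle-accurate, 5 KIPS": 5_000,
}
for name, ips in rates.items():
    print(f"{name}: {sim_seconds(WORKLOAD, ips):,.1f} s")
```

At 1 MIPS the workload takes 1,000 seconds; at 5 KIPS it takes 200,000 seconds, which is more than two days for a workload that a fast DBT handles in well under a second.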
Sampling
- A sampling simulator tries to solve the speed problem (for cycle accurate simulation).
- Only actually fully simulates a small part of the code and runs the rest functionally in some other way (DBT, native implementation, etc.).
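A minimal sketch of the sampling idea: only every period-th interval is run in detail, and the overall CPI is estimated from those samples. The function detailed_cpi is a stand-in for an expensive cycle-accurate run, and its synthetic two-phase behaviour is an assumption made so the example is deterministic.

```python
# Periodic sampling sketch. detailed_cpi() and its synthetic phases are
# hypothetical stand-ins for a real cycle-accurate simulation.

def detailed_cpi(interval):
    """Pretend cycle-accurate measurement of one interval (synthetic phases)."""
    return 1.0 if interval % 4 < 2 else 3.0

def sampled_cpi(num_intervals, period):
    """Simulate every `period`-th interval in detail; skip the rest.

    Returns (estimated CPI, number of detailed intervals).
    """
    samples = [detailed_cpi(i) for i in range(0, num_intervals, period)]
    return sum(samples) / len(samples), len(samples)

print(sampled_cpi(100, 1))   # (2.0, 100): full detailed simulation
print(sampled_cpi(100, 5))   # (2.0, 20): same estimate from 20% of the work
print(sampled_cpi(100, 4))   # (1.0, 25): period aliases with the phases
```

The last call shows why choosing what to sample matters: a period that lines up with the program's phase behaviour produces a biased estimate even though more intervals were simulated in detail.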
Challenges of Sampling Simulation
- What to sample? (Program phase detection)
- How to handle IO?
- How to handle system calls (if emulating the OS layer)?
- Resource sharing (and context switching)
- Building an accurate statistical model.
- The warming problem.
The Warming Problem
- How do you transition from one type of simulation to another?
- For example, if a sampling simulator works in functional mode 90% of the time, but does 10% cycle accurate simulation, how do we restart cycle accurate simulation after running in functional mode? What is the state of the processor?
Some Solutions:
- Restart from the previous state (fast forwarding): parts of the processor could be simulated while fast forwarding, but this is slow/inaccurate.
- Live points.
- Reusable warm architectural/micro-architectural checkpoints. Multiple checkpoints allow simulation parallelism.
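The cold-start effect behind this problem can be illustrated with a small sketch. A detailed window is measured twice: once starting from an empty cache, and once after the cache tags have been updated during the skipped portion of the trace. The direct-mapped cache, its geometry, and the trace are assumptions made for this example.

```python
# Cold vs. warmed cache state at the start of a detailed sample window.
# Cache geometry and the trace are hypothetical.

NUM_SETS = 8
LINE_BYTES = 16

def access(tags, addr):
    """Update direct-mapped cache tags for one access; return True on a hit."""
    line = addr // LINE_BYTES
    index, tag = line % NUM_SETS, line // NUM_SETS
    hit = tags[index] == tag
    tags[index] = tag
    return hit

def misses_in_window(trace, start, warm):
    """Count misses in trace[start:], optionally warming tags on trace[:start]."""
    tags = [None] * NUM_SETS
    if warm:
        for addr in trace[:start]:
            access(tags, addr)        # update state only, collect no statistics
    return sum(not access(tags, addr) for addr in trace[start:])

# A loop that repeatedly walks a 4-line working set.
trace = [i * LINE_BYTES for i in range(4)] * 10

print(misses_in_window(trace, 20, warm=False))  # 4
print(misses_in_window(trace, 20, warm=True))   # 0
```

Starting cold reports four misses that the real machine would not see at that point in the program, so the sample overestimates the miss rate. Warming during fast forward or restoring a warm checkpoint avoids that bias, at a cost in time or storage.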
Checkpointing
Two types of checkpointing:
- High level: Cache and directory tags, complete memory data (~10-200MB).
- Low level: Registers, TLB, branch predictor, cache tags, touched memory data (~150KB).
- Could use both.
- Wenisch et al. propose low-level checkpoints for uniprocessors and high-level checkpoints for multiprocessors.
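The size difference between storing complete memory data and storing only touched memory data can be sketched as below. The state layout, the use of pickle, and the assumption that the touched addresses are known in advance are all choices made for this example; real checkpoints also hold the TLB, branch predictor, and cache tag state listed above.

```python
# Checkpoint sketch: complete memory image vs. only the touched words.
# State layout and function names are hypothetical.
import pickle

def full_checkpoint(regs, memory):
    """Snapshot registers plus the complete memory image."""
    return pickle.dumps({"regs": regs, "mem": memory})

def touched_checkpoint(regs, memory, touched):
    """Snapshot registers plus only the memory words listed in `touched`."""
    return pickle.dumps({"regs": regs, "mem": {a: memory[a] for a in touched}})

def restore(blob):
    """Rebuild (regs, memory) from a checkpoint produced above."""
    state = pickle.loads(blob)
    return state["regs"], state["mem"]

regs = {"pc": 0x1000, "r0": 7}
memory = {addr: addr % 256 for addr in range(0, 40_000, 4)}   # 10,000 words
touched = [0, 4, 8, 400]     # words the sampled window is assumed to access

big = full_checkpoint(regs, memory)
small = touched_checkpoint(regs, memory, touched)
print(len(big), len(small))  # the touched-only checkpoint is far smaller
print(restore(small))
```

Because each checkpoint is self-contained, several of them can be restored and simulated independently, which is what makes simulation parallelism possible.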
Accuracy
- Really hard to know.
- Some simulators are ‘verified’ to perform identically (to a given tolerance) to hardware (usually embedded processors). This usually means they have a certificate from the manufacturer, and they are generally the most accurate.
- Some simulators are the reference platform: the silicon is derived from the simulator.
- Modern microprocessors are complex, and even in a full simulator, there may be shortcuts taken/inaccuracies.
- The real processor might have bugs, and the simulator might have bugs or be incomplete.
Power Modelling
- Some simulators support power/energy modelling, which is an active area of research.
- There is no such thing as ‘cycle-accuracy’ for power.
- It is usually based on some ‘power model’; accuracy is very difficult to determine and is generally assessed by empirical experimentation.
Summary
- It’s a really hard problem.
- Don’t blindly trust simulation; try it on real hardware if possible.
- But it is a super-useful tool for all stages of development/design/debugging.
- Fast simulators can really help to enable rapid prototyping and development, continue running legacy applications, support hardware/software co-design, and bridge the gap between hardware development and application development.
- FPGA implementation is an alternative, but only the high-level design is simulated.