This class covered how modern processors actually execute programs, from single-cycle performance models all the way up to multicore and SMT. The work was mostly analytical “paper and pencil” problems rather than large coding projects, but many of them were multi-step performance investigations that felt like mini case studies.
A big part of the course was performance modeling at the instruction level. Given small C kernels (like a vector add) and their compiled x86, I had to count dynamic instructions, break them down into memory vs. ALU vs. branch operations, and derive total cycles and runtime under different assumptions about CPI and clock rate. Some questions changed the ISA (e.g., splitting combined memory+ALU instructions into separate ops) or raised the clock rate, and then asked for the resulting speedup. We also did classic Amdahl’s Law exercises on parallel programs with fixed overhead: varying the processor count and seeing how much of the ideal speedup is lost to communication and synchronization.
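For flavor, here is a minimal sketch of the arithmetic those problems involved: the iron law of performance (time = instruction count × CPI / clock rate) plus an Amdahl-style speedup with a fixed per-run overhead. All the numbers below are hypothetical stand-ins, not values from the actual assignments.

    #include <stdio.h>

    /* Iron law: time = instruction count * CPI / clock rate. */
    static double cpu_time(double insts, double cpi, double clock_hz) {
        return insts * cpi / clock_hz;
    }

    /* Amdahl-style speedup on p processors when a fraction `parallel`
       of the work scales and each run pays a fixed overhead (seconds)
       for communication and synchronization. */
    static double speedup(double t_serial, double parallel, int p, double overhead) {
        double t_par = t_serial * (1.0 - parallel)
                     + t_serial * parallel / p
                     + overhead;
        return t_serial / t_par;
    }

    int main(void) {
        /* Hypothetical workload: 1B instructions, CPI 1.5, 2 GHz clock. */
        double t1 = cpu_time(1e9, 1.5, 2e9);
        printf("baseline: %.3f s\n", t1);
        /* 90%% parallel fraction, 50 ms fixed overhead per run. */
        for (int p = 1; p <= 16; p *= 2)
            printf("p=%2d  speedup=%.2f\n", p, speedup(t1, 0.9, p, 0.05));
        return 0;
    }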
The course then moved into memory hierarchy and caches, using scenarios like matrix transpose on an NVIDIA Tegra-style cache. I had to derive tag/index/offset bits from cache parameters, trace sequences of loads/stores, and classify each access as a hit or miss under LRU replacement. From those traces, I estimated miss rates and separated compulsory vs. conflict misses, and related those numbers back to things like spatial locality, stride, associativity, and working set size.
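The address-splitting step in those cache problems is mechanical enough to script. Here is a small sketch; the parameters (32 KiB, 64-byte blocks, 4-way) and the example address are made up for illustration and are not the Tegra-style configuration from the handouts.

    #include <stdio.h>

    /* Integer log2 for power-of-two cache parameters. */
    static int log2u(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        unsigned cache_bytes = 32 * 1024;  /* 32 KiB total capacity */
        unsigned block_bytes = 64;
        unsigned ways        = 4;

        /* Derive the tag/index/offset split for a 32-bit address. */
        unsigned sets   = cache_bytes / (block_bytes * ways);
        int offset_bits = log2u(block_bytes);
        int index_bits  = log2u(sets);
        int tag_bits    = 32 - index_bits - offset_bits;
        printf("sets=%u offset=%d index=%d tag=%d\n",
               sets, offset_bits, index_bits, tag_bits);

        /* Split an example address the same way accesses were classified. */
        unsigned addr   = 0x1234ABCD;
        unsigned offset = addr & (block_bytes - 1);
        unsigned index  = (addr >> offset_bits) & (sets - 1);
        unsigned tag    = addr >> (offset_bits + index_bits);
        printf("addr=0x%08X tag=0x%X index=%u offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }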
We also spent time on virtual memory and TLB behavior. Given a small fully associative TLB, a page table, and a sequence of virtual addresses, I had to convert addresses to page numbers, simulate TLB lookups, count TLB misses vs page faults, and update entries using LRU. This made concrete how TLB size and replacement interact with real access patterns, and how many faults you can get even with a small number of pages.
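A toy version of that simulation might look like the following. It models only the TLB itself (hits, misses, LRU eviction), not the page table or fault handling, and the entry count and address stream are invented rather than taken from the problem sets.

    #include <stdio.h>

    #define TLB_ENTRIES 4
    #define PAGE_BITS   12           /* 4 KiB pages */

    static unsigned tlb[TLB_ENTRIES];    /* cached virtual page numbers */
    static unsigned stamp[TLB_ENTRIES];  /* last-use time, for LRU */
    static int      valid[TLB_ENTRIES];

    /* Fully associative lookup; on a miss, fill the LRU (or an invalid) slot. */
    static int tlb_lookup(unsigned vpn, unsigned now) {
        int lru = 0;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (valid[i] && tlb[i] == vpn) { stamp[i] = now; return 1; } /* hit */
            if (!valid[i] || stamp[i] < stamp[lru]) lru = i;
        }
        tlb[lru] = vpn; valid[lru] = 1; stamp[lru] = now;  /* miss: evict LRU */
        return 0;
    }

    int main(void) {
        unsigned vaddrs[] = { 0x0000, 0x1000, 0x2000, 0x0004,
                              0x3000, 0x4000, 0x1008, 0x5000 };
        int misses = 0;
        for (unsigned t = 0; t < sizeof vaddrs / sizeof *vaddrs; t++) {
            unsigned vpn = vaddrs[t] >> PAGE_BITS;
            int hit = tlb_lookup(vpn, t + 1);
            printf("va=0x%04X vpn=%u %s\n", vaddrs[t], vpn, hit ? "hit" : "miss");
            misses += !hit;
        }
        printf("TLB misses: %d\n", misses);
        return 0;
    }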
Another major theme was pipelining, ILP, and CPI estimation. Using an x86 loop, I analyzed steady-state execution on different microarchitectures: a limited-issue in-order core, an out-of-order core with register renaming, and an unrealistically wide machine with “unlimited” functional units. For each setup, I built a timing/dataflow picture across loop iterations, respecting true dependencies and resource constraints, and then derived an average CPI from the number of instructions retired per cycle. This showed how memory latency, issue width, and functional-unit counts cap achievable ILP, even with perfect branch prediction.
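A rough way to capture that analysis in code: steady-state cycles per iteration are bounded by the worst of the issue-width limit, the memory-port limit, and the loop-carried dependence recurrence, and CPI falls out by dividing by the per-iteration instruction count. The machine parameters below are hypothetical stand-ins for the ones we derived by hand.

    #include <stdio.h>

    static double max3(double a, double b, double c) {
        double m = a > b ? a : b; return m > c ? m : c;
    }

    int main(void) {
        double insts_per_iter = 6;  /* e.g., load, add, store, ptr update, cmp, branch */
        double mem_ops        = 2;  /* loads + stores per iteration */

        struct { const char *name; double width, mem_ports, recurrence; } m[] = {
            { "2-wide in-order",     2,   1,   4 },  /* recurrence set by load-use latency */
            { "4-wide OoO + rename", 4,   2,   2 },
            { "unlimited",           1e9, 1e9, 1 },  /* only the dependence chain remains */
        };
        for (int i = 0; i < 3; i++) {
            /* Steady-state cycles/iteration = worst of the three limits. */
            double cyc = max3(insts_per_iter / m[i].width,
                              mem_ops / m[i].mem_ports,
                              m[i].recurrence);
            printf("%-20s cycles/iter=%.2f  CPI=%.2f\n",
                   m[i].name, cyc, cyc / insts_per_iter);
        }
        return 0;
    }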
Near the end of the course we looked at multicore execution, SMT, and memory consistency. I compared how two small threads would run on a 2-core chip versus a single-core SMT design: scheduling instructions from each thread onto shared functional units, seeing where stalls appear, and counting wasted issue slots when dependencies block progress. For shared-memory code snippets, I enumerated possible outcomes of loads/stores under relaxed consistency, then used barriers/fences to reason about how to recover sequentially consistent behavior (or at least enforce specific results). Overall, the work tied together low-level timing details, memory hierarchy behavior, and high-level correctness constraints in parallel programs.
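The fence exercises map directly onto the classic store-buffering litmus test, sketched here with C11 atomics (my own illustration, not code from the course). With only relaxed accesses, both loads can observe 0 in a single run; the seq_cst fences between each thread’s store and load rule that outcome out.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;   /* shared flags, zero-initialized */
    int r1, r2;        /* per-thread observed values */

    void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);           /* the fence in question */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* With the fences, (r1,r2) = (0,0) is forbidden; without them it is allowed. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }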