Lab-based companion to CSE 142 focused on measuring and optimizing real code using hardware performance counters, microbenchmarks, and parallel implementations. All labs were done in C/C++ via Jupyter on UCSD’s DSMLP and dedicated bare-metal servers, using `perfstats`, `make`, `cse142 job run`, and Gradescope autograding.
Project 1 – Performance Microbenchmarks & the Performance Equation
Goal: Understand and apply the performance equation (ET = IC × CPI × CT) by designing and running controlled microbenchmarks.
What I did
- Implemented and analyzed several “baseline” kernels in `microbench.cpp`: `baseline_int`, `baseline_double`, `baseline_float`, `baseline_char`.
- Optimized variants such as `baseline_int_O4` using `__attribute__((optimize(4)))`.
- Used `microbench.exe` + `perfstats` to collect:
  - Dynamic instruction count (IC)
  - Cycles, CPI, ET, and effective MHz
- Systematically varied:
  - Repetitions (`-r`) to scale instruction count
  - Array sizes (`--size`) over multiple orders of magnitude
  - Clock frequency (`-M`) to test CT vs. ET relationships
- Computed speedups by hand (using CSV exports) and compared them with predictions from
the performance equation and Amdahl’s Law.
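For reference, the prediction being checked is Amdahl’s Law in its standard form:

```
Speedup_overall = 1 / ((1 - p) + p / s)
```

where `p` is the fraction of execution time affected by an optimization and `s` is the local speedup of that fraction.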
Representative kernel
```cpp
for (unsigned j = 0; j < 3; ++j) {
    for (unsigned i = 1; i < size; ++i) {
        array[i] += i / (1 + j) + array[i - 1];
    }
}
```
This simple loop was used with different data types and optimization levels to see how IC and CPI change even when the algorithm stays the same.
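For example, a minimal sketch of how an optimized variant could be declared (the name `baseline_int_O4` comes from the repo; the signature here is assumed):

```cpp
#include <cstddef>

// GCC-specific: compile just this function at a higher optimization
// level than the rest of the translation unit, so baseline and
// optimized variants can be measured side by side in one binary.
__attribute__((optimize(4)))
void baseline_int_O4(int* array, size_t size) {
    for (unsigned j = 0; j < 3; ++j) {
        for (unsigned i = 1; i < size; ++i) {
            array[i] += i / (1 + j) + array[i - 1];
        }
    }
}
```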
Project 2 – Branch Prediction, Cache Behavior & Threshold Counting
Goal: Explore how data layout and sorting affect branch prediction and cache behavior.
What I did
Implemented a threshold-counting kernel:
```cpp
long long calculate_sum(int* data, unsigned size, int threshold) {
    long long sum = 0;
    for (unsigned i = 0; i < size; ++i) {
        if (data[i] >= threshold) {  // data-dependent branch: predictability depends on data order
            sum++;
        }
    }
    return sum;
}
```
- In `array_sort.cpp`:
  - Generated large random integer arrays.
  - Optionally sorted the array with `std::sort` depending on a command-line flag.
  - Called `calculate_sum` repeatedly and instrumented the run with `perfstats`.
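A minimal sketch of that driver logic (flag handling, sizes, and repetition counts are simplified; the actual option parsing in `array_sort.cpp` differs):

```cpp
#include <algorithm>
#include <cstdlib>
#include <random>
#include <vector>

long long calculate_sum(int* data, unsigned size, int threshold);  // shown above

int main(int argc, char** argv) {
    const unsigned size = 1u << 24;  // illustrative; sizes were swept in practice
    const bool do_sort = (argc > 1 && std::atoi(argv[1]) != 0);

    std::vector<int> data(size);
    std::mt19937 rng(42);
    for (auto& x : data) x = static_cast<int>(rng());

    if (do_sort) std::sort(data.begin(), data.end());

    long long total = 0;
    for (int rep = 0; rep < 10; ++rep)  // repeated calls, as in the measured runs
        total += calculate_sum(data.data(), size, 0);
    return total != 0 ? 0 : 1;  // keep the result live so it is not optimized away
}
```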
Measurements & analysis
- Collected:
  - IC, cycles, CPI, ET
  - L1 data-cache accesses and misses
  - Branches and branch misses
- Compared:
  - Unsorted data (unpredictable branches, “random” thresholds)
  - Sorted data (more predictable branch behavior and potential prefetch benefits)
- Analyzed how:
  - Sorting improves branch prediction and sometimes cache locality, but adds its own preprocessing cost.
  - The “best” configuration depends on array size, number of iterations, and tolerance for preprocessing time vs. query time.
Project 3 – Memory Hierarchy: TLBs & Pointer-Chasing Miss Machines
Goal: Stress and characterize TLB behavior and memory-level parallelism using synthetic pointer-chasing workloads.
TLB microbenchmarks
- Built a templated memory layout:
```cpp
template <size_t BYTES>
struct Node {
    Node* next;
    uint64_t payload[BYTES / 8 - 1];  // force full cache line / page footprint
};
```
- Allocated large aligned arrays of these nodes using `posix_memalign` and disabled huge pages with `madvise(..., MADV_NOHUGEPAGE)`.
- Randomized the `next` pointers to create a single large cycle of nodes, then walked it:
```cpp
template <class Node>
Node* traverse(Node* start, uint64_t count) {
    for (uint64_t i = 0; i < count; ++i) {
        start = start->next;
    }
    return start;  // return to keep the compiler from optimizing away the loop
}
```
MissMachine & memory-level parallelism
- Used MissMachine to create multiple independent pointer-chasing “machines”:
```cpp
// Pseudocode: follow N independent pointer chains per iteration
for (unsigned i = 0; i < iterations; ++i) {
    a = a->next;
    b = b->next;
    c = c->next;
    // ...
}
```
- Parameterized:
  - Number of independent chains (1, 2, …, 13)
  - Total number of pointer traversals
  - Stride and alignment of underlying memory
- Measured how increasing the number of independent chains increases memory-level parallelism (MLP) and reduces ET despite similar IC.
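The same idea as runnable C++ for a fixed chain count (a sketch; `MissMachine` generalizes the number of chains):

```cpp
#include <cstdint>

// Walk four independent chains in lockstep. The four loads per iteration
// have no data dependence on each other, so their cache/TLB misses can be
// outstanding simultaneously (higher MLP) even though IC barely changes.
template <class Node>
uint64_t traverse4(Node* a, Node* b, Node* c, Node* d, uint64_t count) {
    for (uint64_t i = 0; i < count; ++i) {
        a = a->next;
        b = b->next;
        c = c->next;
        d = d->next;
    }
    // Combine the results so the compiler cannot delete the loops.
    return reinterpret_cast<uint64_t>(a) ^ reinterpret_cast<uint64_t>(b) ^
           reinterpret_cast<uint64_t>(c) ^ reinterpret_cast<uint64_t>(d);
}
```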
Project 4 – Loop Tiling & Convolution Optimization
Goal: Optimize a 1D convolution kernel using loop tiling, loop splitting, and unrolling, and then measure the impact on IC, CPI, and ET.
Baseline implementation
```cpp
void do_convolution(const tensor_t<uint32_t>& src,
                    const tensor_t<uint32_t>& kernel,
                    tensor_t<uint32_t>& dst) {
    for (int i = 0; i < dst.size.x; ++i) {
        for (int j = 0; j < kernel.size.x; ++j) {
            dst(i) += src(i + j) * kernel(j);
        }
    }
}
```
Optimizations implemented
- Loop splitting & manual tiling:
  - Introduced an outer loop over `jj` stepping by `tile_size`, then processed inner tiles.
- Tiled version (loop order interchanged so the tile loop `jj` is outermost):
```cpp
for (int jj = 0; jj < kernel.size.x; jj += tile_size) {
    for (int i = 0; i < dst.size.x; ++i) {
        for (int j = jj; j < kernel.size.x && j < jj + tile_size; ++j) {
            dst(i) += src(i + j) * kernel(j);
        }
    }
}
```
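The intent of hoisting `jj` outermost is locality: one `tile_size`-wide slice of `kernel` stays hot in cache while the entire `i` sweep reuses it, and `src` is streamed sequentially, rather than cycling the whole kernel through the cache for every output element.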
- Further variants:
  - Fixed tile sizes (`convolution_tiled_fixed_tile`)
  - Unrolling via `__attribute__((optimize("unroll-loops")))`
  - Edge-handling split loops (`convolution_tiled_split`)
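A sketch of the unrolled variant (GCC attribute syntax as described above; `tensor_t` comes from the course framework, and the exact signature in the repo is assumed):

```cpp
// GCC-specific: enable loop unrolling for just this function.
// The body is the tiled version shown above.
__attribute__((optimize("unroll-loops")))
void convolution_tiled_unrolled(const tensor_t<uint32_t>& src,
                                const tensor_t<uint32_t>& kernel,
                                tensor_t<uint32_t>& dst, int tile_size) {
    for (int jj = 0; jj < kernel.size.x; jj += tile_size) {
        for (int i = 0; i < dst.size.x; ++i) {
            for (int j = jj; j < kernel.size.x && j < jj + tile_size; ++j) {
                dst(i) += src(i + j) * kernel(j);
            }
        }
    }
}
```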
Measurement & analysis
- Used the existing benchmarking harness to:
  - Sweep over different `tile_size` values.
  - Compare naïve vs. tiled vs. tiled+unrolled implementations.
- Tracked:
  - Changes in IC (extra loop overhead vs. better locality),
  - CPI (better cache behavior vs. more complex control),
  - Overall ET and speedup over baseline.
Project 5 – Parallel Histograms, OpenMP & Hyperthreading
Goal: Explore different ways to parallelize a shared histogram and study ILP/MLP and hyperthreading effects.
Histogram variants
All variants build a 256-bin histogram over 64-bit values by counting every byte:
```cpp
for (uint64_t i = 0; i < size; ++i) {
    for (int k = 0; k < 64; k += 8) {
        uint8_t b = (data[i] >> k) & 0xff;
        // different implementations update histogram[b] in different ways
    }
}
```
I implemented and measured:
- Single-threaded baseline (`run_unthreaded_histogram`)
- Coarse-grain lock (one global `std::mutex`)
- Fine-grain locks (one `std::mutex` per bucket)
- Private per-thread histograms (merge at the end)
- OpenMP with `#pragma omp critical`
- OpenMP with chunked private histograms (local reduction, then global merge)
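For instance, the fine-grain variant serializes updates only when two threads hit the same bucket (a sketch of the idea, with assumed types, not the repo’s exact code):

```cpp
#include <cstdint>
#include <mutex>

std::mutex bucket_locks[256];
uint64_t histogram[256];

// Per-byte update used by each thread: contention occurs only when two
// threads increment the same bucket at the same time.
inline void add_byte(uint8_t b) {
    std::lock_guard<std::mutex> guard(bucket_locks[b]);
    histogram[b]++;
}
```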
Example of the OpenMP + private-histogram variant:
```cpp
#pragma omp parallel for
for (uint64_t chunk = 0; chunk < size; chunk += block) {
    uint64_t local_hist[256] = {0};
    for (uint64_t i = chunk; i < size && i < chunk + block; ++i) {
        for (int k = 0; k < 64; k += 8) {
            uint8_t b = (data[i] >> k) & 0xff;
            local_hist[b]++;
        }
    }
    #pragma omp critical
    for (int b = 0; b < 256; ++b) {
        histogram[b] += local_hist[b];
    }
}
```
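The `omp critical` merge touches only 256 bins once per chunk, so its serialization cost stays negligible next to the eight-per-word bucket updates it replaces.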
ILP/MLP & hyperthreading
- In `hyperthread.cpp`, compared:
  - High-ILP vs. low-ILP integer compute kernels (different numbers of independent accumulators).
  - High-MLP vs. low-MLP pointer-chasing workloads (via `MissMachine` with different numbers of concurrent chains).
- Measured the effect of enabling more threads (including hyperthreads) on:
  - CPI, ET, and throughput.
  - How well the additional hardware contexts hide latency vs. cause extra contention.
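The ILP contrast boils down to breaking the dependence chain between additions, roughly like this (a sketch; accumulator counts were the parameter varied in `hyperthread.cpp`):

```cpp
#include <cstdint>

// Low ILP: every addition depends on the previous one, so the loop runs
// at roughly the latency of one add per element.
uint64_t sum_low_ilp(const uint64_t* data, uint64_t n) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; ++i) s += data[i];
    return s;
}

// High ILP: four independent accumulators let the core issue several
// adds per cycle; the chains merge only once at the end.
uint64_t sum_high_ilp(const uint64_t* data, uint64_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (uint64_t i = 0; i + 3 < n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;  // (tail elements omitted for brevity)
}
```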
Matrix exponent “canary”
- Implemented `_canary` in `matexp_main.cpp` as a heavy OpenMP histogram over a very large array (hundreds of millions of elements) to:
  - Sanity-check the measurement infrastructure under a realistic long-running workload.
  - Exercise memory bandwidth and parallel reduction patterns.