Lab-based companion to CSE 142, focused on measuring and optimizing real code using hardware performance counters, microbenchmarks, and parallel implementations. All labs were done in C/C++ via Jupyter on UCSD’s DSMLP and dedicated bare-metal servers,
using perfstats, make, cse142 job run, and Gradescope autograding.


Project 1 – Performance Microbenchmarks & the Performance Equation

Goal: Understand and apply the performance equation

ET = IC × CPI × CT

by designing and running controlled microbenchmarks.

What I did

Representative kernel

for (unsigned j = 0; j < 3; ++j) {
    for (unsigned i = 1; i < size; ++i) {
        // array[i - 1] creates a loop-carried dependence, and the integer
        // divide adds a long-latency operation to every iteration.
        array[i] += i / (1 + j) + array[i - 1];
    }
}

This simple loop was used with different data types and optimization levels to see how IC and CPI change even when the algorithm stays the same.
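
As a sketch of how the three factors fit together, the snippet below derives CPI from cycle and instruction counts and plugs it back into ET = IC × CPI × CT. The counter values and clock frequency are placeholders; the real numbers came from perfstats runs of the kernel above.

#include <cstdint>
#include <cstdio>

int main() {
    // Placeholder counts for one run of the kernel above; the real values
    // were read from the hardware performance counters.
    uint64_t instructions = 3'000'000'000;  // IC
    uint64_t cycles       = 4'500'000'000;
    double   clock_ghz    = 3.0;

    double cpi = static_cast<double>(cycles) / instructions;    // CPI = cycles / IC
    double ct  = 1.0 / (clock_ghz * 1e9);                       // CT = seconds per cycle
    double et  = static_cast<double>(instructions) * cpi * ct;  // ET = IC * CPI * CT

    std::printf("CPI = %.2f, ET = %.3f s\n", cpi, et);
    return 0;
}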


Project 2 – Branch Prediction, Cache Behavior & Threshold Counting

Goal: Explore how data layout and sorting affect branch prediction and cache behavior.

What I did

Implemented a threshold-counting kernel:

// Despite the name, this counts the elements at or above threshold;
// the data-dependent branch on data[i] is what stresses the predictor.
long long calculate_sum(int* data, unsigned size, int threshold) {
    long long sum = 0;
    for (unsigned i = 0; i < size; ++i) {
        if (data[i] >= threshold) {
            sum++;
        }
    }
    return sum;
}
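
With random data and a mid-range threshold the branch is effectively unpredictable; once the data is sorted, it changes direction only once. A minimal driver sketch for comparing the two cases, assuming the calculate_sum definition above (array size, threshold, and the std::chrono timing are illustrative; the actual measurements in the lab used perfstats):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    auto time_run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long count = calculate_sum(data.data(), data.size(), 128);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: count=%lld, %.1f ms\n", label, count,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    time_run("unsorted");               // branch outcome is effectively random
    std::sort(data.begin(), data.end());
    time_run("sorted");                 // branch direction changes only once
    return 0;
}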

Measurements & analysis


Project 3 – Memory Hierarchy: TLBs & Pointer-Chasing Miss Machines

Goal: Stress and characterize TLB behavior and memory-level parallelism using synthetic pointer-chasing workloads.

TLB microbenchmarks

template <size_t BYTES>
struct Node {
    Node* next;
    uint64_t payload[BYTES / 8 - 1];  // force full cache line / page footprint
};
template <class Node>
Node* traverse(Node* start, uint64_t count) {
    for (uint64_t i = 0; i < count; ++i) {
        start = start->next;
    }
    return start; // return to keep the compiler from optimizing away the loop
}
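
The chain has to be laid out so that each next pointer is effectively random; otherwise the hardware prefetcher hides the misses. A sketch of that kind of setup, assuming the Node template above (build_chain, the node count, and the RNG seed are illustrative):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

template <size_t BYTES>
Node<BYTES>* build_chain(std::vector<Node<BYTES>>& nodes) {
    // Link the nodes in a random order so each ->next points somewhere the
    // prefetcher cannot guess, forcing a miss on (almost) every hop.
    std::vector<size_t> order(nodes.size());
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64(1));

    for (size_t i = 0; i + 1 < order.size(); ++i) {
        nodes[order[i]].next = &nodes[order[i + 1]];
    }
    nodes[order.back()].next = &nodes[order.front()];  // close the cycle
    return &nodes[order.front()];
}

Using BYTES = 64 makes each hop touch a new cache line; BYTES = 4096 spaces the nodes roughly a page apart, turning the same traversal into a TLB stress test.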

MissMachine & memory-level parallelism

// Follow several independent pointer chains per iteration. Because the
// chains do not depend on each other, the out-of-order core can keep
// multiple misses in flight at once (memory-level parallelism).
template <class Node>
void traverse_parallel(Node*& a, Node*& b, Node*& c, uint64_t iterations) {
    for (uint64_t i = 0; i < iterations; ++i) {
        a = a->next;
        b = b->next;
        c = c->next;
        // ... more independent chains -> more misses in flight
    }
}
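
A usage sketch tying the two pieces together (build_chain is the illustrative helper above; the pool sizes and iteration count are arbitrary):

// Assumes the Node template, build_chain, and traverse_parallel above,
// plus <vector>.
int main() {
    std::vector<Node<4096>> pool_a(1 << 14), pool_b(1 << 14), pool_c(1 << 14);
    Node<4096>* a = build_chain(pool_a);
    Node<4096>* b = build_chain(pool_b);
    Node<4096>* c = build_chain(pool_c);

    // With three independent chains, up to three misses can be outstanding
    // at once, so the time per hop can drop versus a single-chain traversal.
    traverse_parallel(a, b, c, 10'000'000);
    return 0;
}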

Project 4 – Loop Tiling & Convolution Optimization

Goal: Optimize a 1D convolution kernel using loop tiling, loop splitting, and unrolling, and then measure the impact on IC, CPI, and ET.

Baseline implementation

void do_convolution(const tensor_t<uint32_t>& src,
                    const tensor_t<uint32_t>& kernel,
                    tensor_t<uint32_t>& dst) {
    // dst is assumed zero-initialized; each dst(i) accumulates a
    // kernel-width dot product over a sliding window of src.
    for (int i = 0; i < dst.size.x; ++i) {
        for (int j = 0; j < kernel.size.x; ++j) {
            dst(i) += src(i + j) * kernel(j);
        }
    }
}

Optimizations implemented

// Process the kernel in tile_size-wide slices; dst(i) accumulates partial
// sums across tiles, so dst must still start at zero.
for (int jj = 0; jj < kernel.size.x; jj += tile_size) {
    for (int i = 0; i < dst.size.x; ++i) {
        for (int j = jj; j < kernel.size.x && j < jj + tile_size; ++j) {
            dst(i) += src(i + j) * kernel(j);
        }
    }
}
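
Tiling is shown above; one way the unrolling and splitting steps could look is sketched below (the unroll factor of four and the local accumulator are illustrative, not the graded implementation). Unrolling the inner kernel loop trims loop-overhead instructions, and the split-off cleanup loop handles the leftover taps:

for (int i = 0; i < dst.size.x; ++i) {
    uint32_t acc = 0;
    int j = 0;
    // Main unrolled loop: one loop branch per four taps, with the four
    // loads and multiplies free to overlap.
    for (; j + 3 < kernel.size.x; j += 4) {
        acc += src(i + j)     * kernel(j)
             + src(i + j + 1) * kernel(j + 1)
             + src(i + j + 2) * kernel(j + 2)
             + src(i + j + 3) * kernel(j + 3);
    }
    for (; j < kernel.size.x; ++j) {  // split-off cleanup loop for leftovers
        acc += src(i + j) * kernel(j);
    }
    dst(i) += acc;
}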

Measurements & analysis


Project 5 – Parallel Histograms, OpenMP & Hyperthreading

Goal: Explore different ways to parallelize a shared histogram and study ILP/MLP and hyperthreading effects.

Histogram variants

All variants build a 256-bin histogram over 64-bit values by counting every byte:

for (uint64_t i = 0; i < size; ++i) {
    for (int k = 0; k < 64; k += 8) {
        uint8_t b = (data[i] >> k) & 0xff;
        // different implementations update histogram[b] in different ways
    }
}

I implemented and measured:

  1. Single-threaded baseline (run_unthreaded_histogram)
  2. Coarse-grain lock (one global std::mutex)
  3. Fine-grain locks (one std::mutex per bucket; sketched below)
  4. Private per-thread histograms (merge at the end)
  5. OpenMP with #pragma omp critical
  6. OpenMP with chunked private histograms (local reduction, then global merge)
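
For contrast with the OpenMP example that follows, here is a sketch of the fine-grain-lock variant from item 3. Each thread would run this over its own slice of data; the global arrays and function name are illustrative:

#include <cstdint>
#include <mutex>

std::mutex bucket_locks[256];
uint64_t histogram[256];

void histogram_fine_grained(const uint64_t* data, uint64_t size) {
    // One mutex per bucket: two threads serialize only when they hit the
    // same byte value at the same time, instead of on every update.
    for (uint64_t i = 0; i < size; ++i) {
        for (int k = 0; k < 64; k += 8) {
            uint8_t b = (data[i] >> k) & 0xff;
            std::lock_guard<std::mutex> guard(bucket_locks[b]);
            histogram[b]++;
        }
    }
}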

Example of the OpenMP + private-histogram variant:

#pragma omp parallel for
for (uint64_t chunk = 0; chunk < size; chunk += block) {
    uint64_t local_hist[256] = {0};
    for (uint64_t i = chunk; i < size && i < chunk + block; ++i) {
        for (int k = 0; k < 64; k += 8) {
            uint8_t b = (data[i] >> k) & 0xff;
            local_hist[b]++;
        }
    }
    // Merge this chunk's private histogram into the shared one.
    #pragma omp critical
    for (int b = 0; b < 256; ++b) {
        histogram[b] += local_hist[b];
    }
}

ILP/MLP & hyperthreading

Matrix exponent “canary”