Cross-listed as CSE 153/253, this course covers machine learning applied to music—across audio (waveforms, spectrograms) and symbolic (MIDI) representations—building from classic DSP and feature engineering up through neural sequence models, generation and inpainting, and the surprisingly hard problem of evaluating generative music systems.
The graded work was a sequence of four homeworks plus two larger assignments, the second of which was an open-ended group project.
Homework 1 — Synthesis from scratch, and a first classifier
A two-part warm-up. Part A builds a small audio synthesizer with nothing beyond NumPy/SciPy: converting note names to frequencies (A4 = 440 Hz), generating sine waves, applying amplitude envelopes and a delay effect, concatenating and mixing tracks, and adding harmonics to turn sine waves into sawtooth waves. Part B trains a scikit-learn classifier to tell piano vs. drum MIDI files apart (from the Tegridy MIDI dataset), hand-engineering features like beat counts, note ranges, unique-note sets, and average pitch.
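The Part A building blocks can be sketched in a few NumPy functions. This is a minimal illustration rather than the homework's actual code: the function names, the 44.1 kHz sample rate, the exponential envelope shape, and the delay parameters are all assumptions. Only the A4 = 440 Hz reference comes from the assignment.

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed; the write-up does not state a sample rate

# Semitone offset of each natural note relative to C.
_SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_frequency(note: str) -> float:
    """Convert a name such as 'A4' or 'C#5' to Hz in equal temperament (A4 = 440 Hz)."""
    letter, accidental, octave = note[0], note[1:-1], int(note[-1])
    semitone = _SEMITONES[letter] + accidental.count("#") - accidental.count("b")
    midi = 12 * (octave + 1) + semitone  # MIDI note number; A4 maps to 69
    return 440.0 * 2.0 ** ((midi - 69) / 12)


def _time_axis(duration: float) -> np.ndarray:
    return np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE


def sine_wave(freq: float, duration: float) -> np.ndarray:
    return np.sin(2 * np.pi * freq * _time_axis(duration))


def decay_envelope(signal: np.ndarray, rate: float = 3.0) -> np.ndarray:
    """Apply an exponentially decaying amplitude envelope (one possible shape)."""
    t = np.arange(len(signal)) / SAMPLE_RATE
    return signal * np.exp(-rate * t)


def add_delay(signal: np.ndarray, delay_seconds: float = 0.25, gain: float = 0.5) -> np.ndarray:
    """Mix the signal with a single attenuated, delayed copy of itself."""
    offset = int(SAMPLE_RATE * delay_seconds)
    out = np.zeros(len(signal) + offset)
    out[: len(signal)] += signal
    out[offset:] += gain * signal
    return out


def sawtooth_from_harmonics(freq: float, duration: float, n_harmonics: int = 20) -> np.ndarray:
    """Approximate a sawtooth with a truncated Fourier series:
    (2/pi) * sum_k (-1)^(k+1) * sin(2*pi*k*f*t) / k."""
    t = _time_axis(duration)
    wave = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        wave += ((-1) ** (k + 1)) * np.sin(2 * np.pi * k * freq * t) / k
    return (2 / np.pi) * wave
```

Concatenating notes is then `np.concatenate([...])`, and mixing tracks is an element-wise sum of equal-length arrays. More harmonics give a sharper sawtooth edge, at the cost of aliasing once `k * freq` passes half the sample rate.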
Homework 2 — Instrument classification from spectrograms
I built audio-classification pipelines for instrument recognition on a subset of NSynth. I loaded waveforms with librosa, computed MFCC summary features (the mean and standard deviation of each of 13 coefficients, 26 values per clip), and then moved up to mel-spectrogram inputs for small neural classifiers, experimenting with data augmentation to improve generalization.
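The 26-dimensional MFCC feature can be sketched as follows. The function names are hypothetical, and the librosa settings (native sample rate, default frame parameters) are assumptions rather than the homework's actual configuration.

```python
import numpy as np


def summarize_mfcc(mfcc: np.ndarray) -> np.ndarray:
    """Collapse an (n_mfcc, n_frames) matrix into [per-coefficient means, per-coefficient stds]."""
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def extract_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load one audio file and return its 2 * n_mfcc summary vector."""
    import librosa  # imported lazily so summarize_mfcc can be used without librosa installed

    y, sr = librosa.load(path, sr=None)  # sr=None keeps the file's native sample rate (assumed choice)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return summarize_mfcc(mfcc)
```

Averaging over time discards temporal structure, which is one reason the mel-spectrogram models were the natural next step.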
Homework 3 — Symbolic generation with Markov chains
Symbolic music generation on a subset of the PDMX dataset (monophonic melodies in 4/4). I trained a REMI tokenizer with MidiTok—where each note becomes a Position, Pitch, Velocity, Duration token group—and built Markov-chain models over those token sequences to generate new melodies.
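A first-order Markov chain over token sequences can be sketched as below. This is a simplified illustration, not the homework's model: it treats tokens as plain strings, conditions only on the previous token, and uses hypothetical function names. The homework's models may have used longer contexts or per-token-type tables.

```python
import random
from collections import Counter, defaultdict


def train_markov(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng=None):
    """Sample up to `length` tokens, drawing each successor in proportion to its count."""
    rng = random.Random() if rng is None else rng
    out = [start]
    while len(out) < length:
        options = counts.get(out[-1])
        if not options:  # dead end: this token was never followed by anything in training
            break
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out
```

With REMI-style tokens, the training sequences would look something like `["Position_0", "Pitch_60", "Velocity_95", "Duration_1.0.8", ...]` (illustrative strings), so the chain learns, for example, which pitches tend to follow which positions.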
Homework 4 — Flow-matching audio generation
A generative-modeling homework using flow matching: a velocity-field network is trained to transport noise to data, and sampling integrates the learned ODE from a noise sample toward a data sample.
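The two halves of flow matching, the regression target and the sampler, can be sketched in NumPy. This assumes the common straight-line path between a noise sample `x0` and a data sample `x1`, with time running from t = 0 (noise) to t = 1 (data). The write-up does not say which path or time convention the homework used, and `velocity_fn` stands in for the trained network.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Return the point on the straight path at time t and its target velocity.

    The network would be trained to predict v_target from (x_t, t),
    typically with a mean-squared-error loss.
    """
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0  # constant along a straight-line path
    return x_t, v_target


def euler_sample(velocity_fn, x0, n_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

More steps (or a higher-order integrator) trade compute for a more accurate solution of the ODE when the learned field is not perfectly straight.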
Assignment 1 — Three music-ML tasks
A larger, competition-style assignment spanning three problems, each scored against held-out baselines:
- Composer classification (symbolic, multiclass) — multiple windowed views per piece, key-aware token streams plus dense musical statistics, fed to a TF-IDF + classical-ensemble stack (logistic regression, Complement Naive Bayes, calibrated SVM, HistGradientBoosting, LightGBM).
- Temporal order prediction (audio, binary) — handcrafted boundary and segment-comparison features with a gradient-boosted-tree ensemble (LightGBM + CatBoost + XGBoost) and mirror test-time augmentation.
- Music tagging (audio, multilabel, mAP) — 3-channel log-mel + delta + delta-delta spectrograms into fine-tuned ResNet-50s with SpecAugment, mixup, and a five-seed ensemble.
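Of the techniques listed above, mixup is compact enough to sketch. This is the generic batch-level formulation, not the assignment's training code: the function name and the `alpha` value are assumptions.

```python
import numpy as np


def mixup_batch(x, y, alpha=0.4, rng=None):
    """Blend each example (and its label vector) with a randomly chosen partner.

    x: array of inputs, e.g. (batch, channels, mels, frames) spectrograms.
    y: array of multi-hot labels, shape (batch, n_tags).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)   # mixing weight in [0, 1]
    perm = rng.permutation(len(x))  # partner index for each example
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed, lam
```

For multilabel tagging the mixed labels become soft targets, which pair naturally with a binary cross-entropy loss.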
Group project — conditioned music generation
For the open-ended final assignment, our group built two complementary tasks: Image-to-Music (continuous)—turning an image into audio that matches its mood and scene via BLIP captioning, a semantic music-prompt bridge, and MusicGen—and MIDI Repair (symbolic)—filling in 2–6 deleted bars of a piano MIDI clip with a beat/bar-aware note-event Transformer. Full write-up here.
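To make the MIDI Repair setup concrete, here is one way a training example could be built by deleting a span of bars. Everything in it is a hypothetical illustration: the note tuple format, the 4/4 assumption, and the function name are not taken from the project.

```python
from typing import List, Tuple

# Hypothetical note format: (onset in beats, MIDI pitch, duration in beats).
Note = Tuple[float, int, float]


def delete_bars(
    notes: List[Note], first_bar: int, n_bars: int, beats_per_bar: float = 4.0
) -> Tuple[List[Note], List[Note]]:
    """Split notes into (context, target), where target holds the notes
    whose onsets fall inside the deleted bars."""
    if not 2 <= n_bars <= 6:
        raise ValueError("the task deletes between 2 and 6 bars")
    start = first_bar * beats_per_bar
    end = start + n_bars * beats_per_bar
    target = [n for n in notes if start <= n[0] < end]
    context = [n for n in notes if not (start <= n[0] < end)]
    return context, target
```

A model would then be given `context` plus the gap location and trained to produce `target`. Notes that start before the gap but sustain into it are kept in the context here, which is a simplification.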