This project builds and evaluates machine learning models that predict a song’s genre from Spotify-style audio features (e.g., danceability, energy, acousticness). The workflow follows a full ML pipeline: dataset identification → exploratory data analysis (EDA) → baseline → model iteration + tuning → ensemble comparison.
- Dataset: 15,150 tracks, 18 columns, 19 genres. No missing values; duplicates were removed before modeling.
- Main challenge: Class imbalance (e.g., Pop much larger than smaller genres like World/Gospel), addressed during training with SMOTETomek.
- Feature notes from EDA: strong correlation between Energy and Loudness, so Loudness was dropped in the main modeling notebook to reduce multicollinearity; skewed features (Speechiness, Acousticness, Instrumentalness, Liveness) were log-transformed and then standardized.
- Baseline: DummyClassifier accuracy ≈ 0.06
- Best single model: Tuned Random Forest accuracy ≈ 0.373
- Best overall model: Soft-voting ensemble (Random Forest + XGBoost + Extra Trees + QDA) accuracy ≈ 0.376
- Model takeaway: Tree-based methods (Random Forest, XGBoost, Extra Trees, CatBoost) performed best; ensembling gave only a small additional lift (accuracy ≈ 0.373 → ≈ 0.376).
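
The preprocessing, baseline, and soft-voting steps summarized above could look roughly like the sketch below. It is not the project's notebook code, and several things in it are assumptions: the data is a small synthetic stand-in generated by a hypothetical `make_toy_tracks` helper (3 genres rather than 19), the column names are guessed from the feature names listed above, `np.log1p` stands in for the unspecified log transform, and `DummyClassifier` uses the `most_frequent` strategy, which the summary does not state. To keep the example dependent on scikit-learn, NumPy, and pandas only, XGBoost is left out of the ensemble and the SMOTETomek resampling step is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Assumed column names, based on the features named in the summary.
SKEWED = ["Speechiness", "Acousticness", "Instrumentalness", "Liveness"]
OTHER = ["Danceability", "Energy"]


def make_toy_tracks(n_rows=600, n_genres=3, seed=0):
    """Hypothetical helper: synthetic stand-in for the real track table."""
    rng = np.random.default_rng(seed)
    genre = rng.integers(0, n_genres, size=n_rows)
    data = {}
    for i, name in enumerate(SKEWED):
        # Right-skewed, non-negative values whose scale depends on the genre.
        data[name] = rng.exponential(scale=0.1 + 0.05 * ((genre + i) % n_genres))
    data["Danceability"] = rng.normal(0.5 + 0.1 * genre, 0.1)
    data["Energy"] = rng.normal(0.6 - 0.1 * genre, 0.1)
    # Loudness is built to correlate strongly with Energy.
    data["Loudness"] = 20 * data["Energy"] - 20 + rng.normal(0, 1, n_rows)
    frame = pd.DataFrame(data)
    frame["Genre"] = genre
    return frame


def build_preprocessor():
    """Log-transform then standardize skewed columns; standardize the rest."""
    log_then_scale = Pipeline([
        ("log", FunctionTransformer(np.log1p)),  # log1p is an assumption
        ("scale", StandardScaler()),
    ])
    # Columns not listed here (Loudness) are dropped by the default remainder.
    return ColumnTransformer([
        ("skewed", log_then_scale, SKEWED),
        ("other", StandardScaler(), OTHER),
    ])


def build_ensemble(seed=0):
    """Soft-voting ensemble; XGBoost is omitted in this sketch."""
    voters = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=seed)),
        ("qda", QuadraticDiscriminantAnalysis()),
    ]
    return Pipeline([
        ("prep", build_preprocessor()),
        ("vote", VotingClassifier(voters, voting="soft")),
    ])


tracks = make_toy_tracks()
X = tracks.drop(columns=["Genre"])
y = tracks["Genre"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predicts the most frequent training class (assumed strategy).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
ensemble = build_ensemble().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
ensemble_acc = ensemble.score(X_test, y_test)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"ensemble accuracy: {ensemble_acc:.3f}")
```

Soft voting averages the predicted class probabilities of the member models, so every member must expose `predict_proba`; all three classifiers used here do. Accuracies on this toy data say nothing about the real dataset and are only useful for checking that the pipeline runs end to end.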