This course introduced modern NLP with a focus on deep learning methods and foundation models. Topics included neural and transformer architectures, pre-trained and instruction-tuned LLMs, parameter-efficient fine-tuning (PEFT), prompting, and post-training techniques. Core problems emphasized semantic understanding and generation (question answering, machine translation, etc.).
Project 1 — Sentiment Analysis with Neural Networks
Built progressively more sophisticated models for binary sentiment classification on movie reviews.
I:
- Implemented a Bag-of-Words baseline using CountVectorizer and 2/3-layer feedforward networks in PyTorch
- Built a Deep Averaging Network (DAN) that averages word embeddings and feeds them through an MLP, comparing GloVe-initialized vs. randomly initialized embeddings
- Implemented Byte Pair Encoding (BPE) from scratch following Sennrich et al. (2016), learning subword merges iteratively by frequency and testing different vocabulary sizes
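The BPE learner described above boils down to counting adjacent symbol pairs and repeatedly merging the most frequent one. A minimal sketch of that loop (the corpus format and the `learn_bpe` name are illustrative, not from the project code):

```python
import re
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a word-frequency table (Sennrich et al., 2016).
    corpus maps space-separated symbol strings to counts, e.g. {"l o w": 5}."""
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Merge the pair everywhere, respecting symbol boundaries
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges
```

Growing the vocabulary is then just a matter of running more merge iterations, which is how the different vocabulary sizes were tested.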
```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    # __init__ reconstructed for completeness; dimensions are illustrative
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.network = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, num_classes))
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x, lengths):
        embedded = self.embedding(x)              # (B, T, D)
        mask = (x != 0).unsqueeze(-1).float()     # zero out pad tokens (id 0)
        averaged = (embedded * mask).sum(dim=1) / lengths.unsqueeze(-1)
        return self.log_softmax(self.network(averaged))
```
Project 2 — Transformer Encoder & Decoder
Implemented transformer architectures from scratch for classification and language modeling.
I:
- Built multi-head self-attention with scaled dot-product, causal masking, and attention dropout
- Implemented transformer encoder blocks (pre-norm architecture) for 3-way political speech classification (Obama/Bush/Wbush)
- Built a causal decoder for character-level language modeling, generating text continuations
- Explored architectural variants: ALiBi (Attention with Linear Biases: fixed, head-specific penalties proportional to query-key distance, added to attention scores in place of positional embeddings) and local window attention (restricting each token's attention to a fixed span of nearby tokens)
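The ALiBi variant needs a helper that builds the per-head linear distance biases; the attention code below calls one named `build_alibi_bias`. A sketch of such a helper, assuming the power-of-two slope schedule from Press et al. (2022), which is exact when the head count is a power of two:

```python
import torch

def build_alibi_bias(n_head, seq_len, device=None):
    # Per-head slopes: 2^(-8/n_head), 2^(-16/n_head), ... (Press et al., 2022)
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_head) for i in range(n_head)],
                          device=device)
    # Signed distance j - i between key position j and query position i
    pos = torch.arange(seq_len, device=device)
    distance = pos[None, :] - pos[:, None]          # (T, T), negative to the left
    # Bias grows more negative with distance; shape (1, n_head, T, T)
    return (slopes[:, None, None] * distance[None, :, :]).unsqueeze(0)
```

With causal masking, only the non-positive left-of-diagonal biases ever survive the softmax, which is what makes the penalty purely distance-based.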
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # __init__ reconstructed for completeness; dimensions are illustrative
    def __init__(self, d_model, n_head, max_len=512, causal=True, use_alibi=False):
        super().__init__()
        self.n_head, self.head_dim = n_head, d_model // n_head
        self.causal, self.use_alibi = causal, use_alibi
        self.query, self.key, self.value = (nn.Linear(d_model, d_model) for _ in range(3))
        self.out_proj = nn.Linear(d_model, d_model)
        # Lower-triangular (causal) mask, broadcast over batch and heads
        self.register_buffer("mask",
            torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, _ = x.shape
        q = self.query(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.key(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.value(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # scaled dot-product
        if self.use_alibi:
            attn = attn + build_alibi_bias(self.n_head, T, x.device)  # per-head linear distance biases
        if self.causal:
            attn = attn.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        out = (F.softmax(attn, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)  # merge heads
        return self.out_proj(out)
```
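The local-window variant can be implemented as a boolean mask applied in place of (or intersected with) the causal mask; `local_window_mask` here is a hypothetical helper, not the project's code:

```python
import torch

def local_window_mask(seq_len, window):
    """Causal local-attention mask: position i may attend to j in [i-window+1, i].
    Returns a bool (T, T) tensor; True means "attention allowed"."""
    i = torch.arange(seq_len)[:, None]   # query positions, column vector
    j = torch.arange(seq_len)[None, :]   # key positions, row vector
    return (j <= i) & (j > i - window)
```

Scores at `False` positions are filled with `-inf` before the softmax, exactly as with the causal mask above.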
Project 3 — Decoding & Retrieval-Augmented Generation
Explored decoding strategies and modern RAG systems.
Part A: Beam Search Decoding
- Implemented beam search for text generation from GPT-2 using Hugging Face Transformers
- Explored various logits processors for controlling generation (repetition penalty, length normalization)
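Separately from the Hugging Face implementation, the beam search loop itself is compact: keep the `num_beams` highest-scoring partial sequences, expand each by every candidate next token, and re-prune. A toy sketch over a made-up bigram table (all names and probabilities are illustrative):

```python
import math

def beam_search(start, next_logprobs, num_beams=3, max_len=5):
    """Generic beam search; next_logprobs(token) -> {token: log_prob}.
    Returns (score, sequence) pairs sorted best-first."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_logprobs(seq[-1]).items():
                candidates.append((score + lp, seq + [tok]))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

# Toy bigram "model" (made-up probabilities for illustration)
table = {"a": {"b": 0.6, "c": 0.4}, "b": {"d": 1.0},
         "c": {"d": 1.0}, "d": {"e": 1.0}, "e": {"e": 1.0}}
lm = lambda tok: {t: math.log(p) for t, p in table.get(tok, {}).items()}
best_score, best_seq = beam_search("a", lm, num_beams=2, max_len=3)[0]
```

A repetition penalty or length normalization would simply rescale the candidate scores before the pruning step, which is where the HF logits processors hook in.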
Part B: RAG System
- Built a Retrieval-Augmented Generation pipeline using LangChain
- Used a FAISS vector index with sentence-transformer embeddings to store and retrieve meeting transcripts
- Implemented the retriever component for the QMSum dataset (question answering on meeting transcripts)
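Independent of LangChain and FAISS, the retriever's core step is embed-and-rank: embed the query, score every stored chunk by similarity, and return the top k. A toy stand-in using bag-of-words counts in place of sentence-transformer vectors (the function names and sample transcript chunks are made up for illustration):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the top-k transcript chunks by cosine similarity to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "the committee discussed the quarterly budget",
    "action item: schedule a follow-up meeting",
    "the budget proposal was approved by vote",
]
top = retrieve("what happened with the budget", chunks, k=2)
```

In the real pipeline, FAISS replaces the linear scan with an approximate nearest-neighbor index, and the retrieved chunks are passed to the LLM as context for answering QMSum questions.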