This course introduced modern NLP with a focus on deep learning methods and foundation models. Topics included neural and transformer architectures, pre-trained and instruction-tuned LLMs, parameter-efficient fine-tuning (PEFT), prompting, and post-training techniques. Core problems emphasized semantic understanding and generation (question answering, machine translation, etc.).
Project 1 — Sentiment Analysis with Neural Networks
Built progressively more sophisticated models for binary sentiment classification on movie reviews.
I:
- Implemented a Bag-of-Words baseline using CountVectorizer and 2/3-layer feedforward networks in PyTorch
- Built a Deep Averaging Network (DAN) that averages word embeddings and feeds them through an MLP, comparing GloVe-initialized vs. randomly initialized embeddings
- Implemented Byte Pair Encoding (BPE) from scratch following Sennrich et al. (2016), learning subword merges iteratively by frequency and testing different vocabulary sizes
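The BPE learner described above boils down to counting adjacent symbol pairs and repeatedly merging the most frequent one. A minimal sketch of that loop (the corpus format and the `learn_bpe` name are illustrative, not from the project code):

```python
import re
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a word-frequency table (Sennrich et al., 2016).
    corpus maps space-separated symbol strings to counts, e.g. {"l o w": 5}."""
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Merge the pair everywhere, respecting symbol boundaries
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges
```

Growing the vocabulary is then just a matter of running more merge iterations, which is how the different vocabulary sizes were tested.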
```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    # __init__ reconstructed for completeness; dimensions are illustrative
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.network = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, num_classes))
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x, lengths):
        embedded = self.embedding(x)              # (B, T, D)
        mask = (x != 0).unsqueeze(-1).float()     # zero out pad tokens (id 0)
        averaged = (embedded * mask).sum(dim=1) / lengths.unsqueeze(-1)
        return self.log_softmax(self.network(averaged))
```
Project 2 — Transformer Encoder & Decoder
Implemented transformer architectures from scratch for classification and language modeling.
I:
- Built multi-head self-attention with scaled dot-product, causal masking, and attention dropout
- Implemented transformer encoder blocks (pre-norm architecture) for 3-way political speech classification (Obama/Bush/Wbush)
- Built a causal decoder for character-level language modeling, generating text continuations
- Explored architectural variants: ALiBi (Attention with Linear Biases: fixed, head-specific penalties proportional to query-key distance, added to attention scores in place of positional embeddings) and local window attention (restricting each token's attention to a fixed span of nearby tokens)
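The ALiBi variant needs a helper that builds the per-head linear distance biases; the attention code below calls one named `build_alibi_bias`. A sketch of such a helper, assuming the power-of-two slope schedule from Press et al. (2022), which is exact when the head count is a power of two:

```python
import torch

def build_alibi_bias(n_head, seq_len, device=None):
    # Per-head slopes: 2^(-8/n_head), 2^(-16/n_head), ... (Press et al., 2022)
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_head) for i in range(n_head)],
                          device=device)
    # Signed distance j - i between key position j and query position i
    pos = torch.arange(seq_len, device=device)
    distance = pos[None, :] - pos[:, None]          # (T, T), negative to the left
    # Bias grows more negative with distance; shape (1, n_head, T, T)
    return (slopes[:, None, None] * distance[None, :, :]).unsqueeze(0)
```

With causal masking, only the non-positive left-of-diagonal biases ever survive the softmax, which is what makes the penalty purely distance-based.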
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # __init__ reconstructed for completeness; dimensions are illustrative
    def __init__(self, d_model, n_head, max_len=512, causal=True, use_alibi=False):
        super().__init__()
        self.n_head, self.head_dim = n_head, d_model // n_head
        self.causal, self.use_alibi = causal, use_alibi
        self.query, self.key, self.value = (nn.Linear(d_model, d_model) for _ in range(3))
        self.out_proj = nn.Linear(d_model, d_model)
        # Lower-triangular (causal) mask, broadcast over batch and heads
        self.register_buffer("mask",
            torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, _ = x.shape
        q = self.query(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.key(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.value(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # scaled dot-product
        if self.use_alibi:
            attn = attn + build_alibi_bias(self.n_head, T, x.device)  # per-head linear distance biases
        if self.causal:
            attn = attn.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        out = (F.softmax(attn, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)  # merge heads
        return self.out_proj(out)
```
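The local-window variant can be implemented as a boolean mask applied in place of (or intersected with) the causal mask; `local_window_mask` here is a hypothetical helper, not the project's code:

```python
import torch

def local_window_mask(seq_len, window):
    """Causal local-attention mask: position i may attend to j in [i-window+1, i].
    Returns a bool (T, T) tensor; True means "attention allowed"."""
    i = torch.arange(seq_len)[:, None]   # query positions, column vector
    j = torch.arange(seq_len)[None, :]   # key positions, row vector
    return (j <= i) & (j > i - window)
```

Scores at `False` positions are filled with `-inf` before the softmax, exactly as with the causal mask above.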
Project 3 — Decoding & Retrieval-Augmented Generation
Explored decoding strategies and modern RAG systems.
Part A: Beam Search Decoding
- Implemented beam search for text generation from GPT-2 using Hugging Face Transformers
- Explored various logits processors for controlling generation (repetition penalty, length normalization)
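Separately from the Hugging Face implementation, the beam search loop itself is compact: keep the `num_beams` highest-scoring partial sequences, expand each by every candidate next token, and re-prune. A toy sketch over a made-up bigram table (all names and probabilities are illustrative):

```python
import math

def beam_search(start, next_logprobs, num_beams=3, max_len=5):
    """Generic beam search; next_logprobs(token) -> {token: log_prob}.
    Returns (score, sequence) pairs sorted best-first."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_logprobs(seq[-1]).items():
                candidates.append((score + lp, seq + [tok]))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

# Toy bigram "model" (made-up probabilities for illustration)
table = {"a": {"b": 0.6, "c": 0.4}, "b": {"d": 1.0},
         "c": {"d": 1.0}, "d": {"e": 1.0}, "e": {"e": 1.0}}
lm = lambda tok: {t: math.log(p) for t, p in table.get(tok, {}).items()}
best_score, best_seq = beam_search("a", lm, num_beams=2, max_len=3)[0]
```

A repetition penalty or length normalization would simply rescale the candidate scores before the pruning step, which is where the HF logits processors hook in.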
Part B: RAG System
- Built a Retrieval-Augmented Generation pipeline using LangChain
- Used a FAISS vector index with sentence-transformer embeddings to store and retrieve meeting transcripts
- Implemented the retriever component for the QMSum dataset (question answering on meeting transcripts)
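Independent of LangChain and FAISS, the retriever's core step is embed-and-rank: embed the query, score every stored chunk by similarity, and return the top k. A toy stand-in using bag-of-words counts in place of sentence-transformer vectors (the function names and sample transcript chunks are made up for illustration):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the top-k transcript chunks by cosine similarity to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "the committee discussed the quarterly budget",
    "action item: schedule a follow-up meeting",
    "the budget proposal was approved by vote",
]
top = retrieve("what happened with the budget", chunks, k=2)
```

In the real pipeline, FAISS replaces the linear scan with an approximate nearest-neighbor index, and the retrieved chunks are passed to the LLM as context for answering QMSum questions.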