Archives

All works, in reverse chronological order.

2026 ¹³

April ³

Agentic Pipelines, Code Gen & RAG Evaluation
Building 1 Apr 2026

Three active LLM application projects: multi-agent document processing, domain-adapted code generation, and systematic RAG quality measurement.
RLHF, Sparse MoE & Rust for Inference
Learning 1 Apr 2026

Deepening expertise in three areas — alignment techniques for LLMs, sparse Mixture of Experts scaling, and systems-level inference serving with Rust.
ML Systems Design, KV-Cache Research & Staff Engineering
Reading 1 Apr 2026

Books and papers shaping how I think about production ML — system design principles, efficient long-context inference, and technical leadership beyond the IC track.

March ³

Multi-Agent Document Understanding
Project 15 Mar 2026

Building multi-agent systems that decompose complex documents into structured knowledge using specialized LLM agents for extraction, reasoning, and validation.
Prompt Engineering for Production Systems
Talk 5 Mar 2026

Workshop on writing reliable, testable prompts for production LLM applications — covering structured outputs, guardrails, and prompt versioning.
Deep Dive: RLHF & Alignment Techniques
Writing 1 Mar 2026

Studying reinforcement learning from human feedback — from reward modeling to PPO and DPO, understanding how modern LLMs are aligned to human preferences.

February ³

Exploring Mixture of Experts Architectures
Writing 15 Feb 2026

Research notes on Mixture of Experts — how sparse activation enables scaling model capacity without proportional compute, from Switch Transformer to Mixtral.
DeepAgents — Multi-Agent Orchestration Research
OSS 1 Feb 2026

Contributing to DeepAgents, a framework for building hierarchical multi-agent systems with planning, tool use, and memory.
Domain-Specific Code Generation with Llama 3
Project 1 Feb 2026

Fine-tuning Llama 3 on proprietary codebases for domain-specific code generation — internal APIs, conventions, and patterns the base model doesn't know.

January ⁴

Vector Database Benchmarks — Qdrant vs Pinecone vs Weaviate
Writing 25 Jan 2026

Comprehensive benchmark comparing vector databases for production RAG workloads — latency, recall, cost, and operational complexity.
Learning Rust for High-Performance Inference
Writing 20 Jan 2026

Learning Rust with a focus on building high-performance ML inference servers — async runtimes, zero-copy deserialization, and ONNX runtime bindings.
Multi-Agent RAG System with LangGraph
Project 10 Jan 2026

Production agentic RAG system using LangGraph for multi-step reasoning over enterprise knowledge bases. Handles 10K+ queries/day with sub-2s latency.
Open-Source RAG Evaluation Framework
OSS 10 Jan 2026

Building an open-source framework to systematically evaluate RAG pipeline quality — retrieval relevance, answer faithfulness, and end-to-end correctness.

2025 ¹²

December ¹

Production LLM Inference with vLLM
Writing 5 Dec 2025

How we cut LLM serving latency roughly 3x using vLLM's continuous batching, PagedAttention, and quantized model deployment.

November ¹

Domain-Specific LLM Fine-Tuning with Unsloth
Project 20 Nov 2025

Fine-tuned Llama 3 and Mistral models for domain-specific tasks using Unsloth + QLoRA, achieving 40% faster training with 60% less VRAM.

October ¹

Evaluating RAG Systems — Beyond Vibes
Talk 18 Oct 2025

Conference talk on systematic RAG evaluation using RAGAS metrics, human preference ranking, and automated regression testing.

September ¹

HuggingFace Transformers — Core Contributions
OSS 15 Sept 2025

Contributed model implementations and training optimizations to HuggingFace's Transformers library, used by 100K+ developers worldwide.

August ¹

Reproducible ML Pipelines with DVC
Writing 10 Aug 2025

A practical guide to building reproducible, version-controlled ML data pipelines using DVC, from dataset versioning to automated retraining.

July ¹

embed-cache — Persistent Embedding Cache
OSS 12 Jul 2025

Python library that caches OpenAI/Cohere embedding API calls to SQLite, cutting costs by 80% for iterative RAG development.

June ¹

ML Experiment Tracking Platform with MLflow
Project 15 Jun 2025

Built a centralized MLflow-based experiment tracking and model registry platform serving 15+ ML engineers across 3 teams.

May ²

Open Source Mentorship — First-Time Contributors Program
Talk 15 May 2025

Organized and led a 4-week open source mentorship program helping 20+ developers make their first meaningful contributions to ML/AI projects.
AWS Machine Learning — Specialty
Cert 10 May 2025

AWS Specialty-level certification covering ML workloads: SageMaker, model training, feature engineering, and ML solution architecture.

April ¹

LLM Monitoring Dashboard with W&B
Project 20 Apr 2025

Real-time LLM monitoring system tracking token costs, latency distributions, hallucination rates, and model drift using Weights & Biases.

March ¹

Migrating 50 Services to Kubernetes — A Retrospective
Writing 20 Mar 2025

What went right, what broke, and what we'd do differently migrating a monolith-era fleet to Kubernetes over six months.

February ¹

Real-Time Feature Store Architecture
Project 10 Feb 2025

Designed a dual-layer feature store with offline batch features in Parquet/S3 and online real-time features in Redis, serving 50M+ predictions/day.

2024 ⁹

December ¹

NeurIPS 2024 — Spotlight Poster Presentation
Conf 10 Dec 2024

Presented poster on efficient fine-tuning methods for domain-specific LLMs at NeurIPS 2024 in Vancouver.

November ²

MLOps Community Meetup — Speaker & Organizer
Meetup 20 Nov 2024

Organized and spoke at the monthly MLOps Community meetup in San Francisco on production LLM monitoring patterns.
taskr — Developer Task Runner CLI
OSS 8 Nov 2024

A fast, opinionated task runner for monorepos — parallel execution, dependency graphs, and smart caching. Written in Go.

September ²

LangChain AI Agents Hackathon — 2nd Place
Hack 15 Sept 2024

Built an autonomous code review agent in 48 hours that analyzes PRs, suggests fixes, and auto-generates test cases. Won 2nd place out of 200+ teams.
Certified Kubernetes Application Developer (CKAD)
Cert 15 Sept 2024

CNCF certification covering Kubernetes application design, deployment, configuration, and observability patterns.

July ¹

SaaS Analytics Dashboard — Full-Stack Build
Project 22 Jul 2024

Self-hosted analytics dashboard with real-time event streaming, custom SQL queries, and team collaboration. React + FastAPI + PostgreSQL.

April ¹

terraform-modules — Reusable Cloud Infrastructure
OSS 18 Apr 2024

Collection of production-tested Terraform modules for AWS — VPC, ECS, Lambda, IAM, and monitoring with security-first defaults.

January ²

AI Engineer at Acme AI
Role 15 Jan 2024

Building LLM-powered applications and agentic workflows. Leading fine-tuning, RAG, and inference optimization for enterprise AI products.
API Gateway Redesign — From Monolith to Microservices
Project 10 Jan 2024

Redesigned the API gateway layer to support 200+ microservices with rate limiting, auth delegation, and circuit breakers.

Ran a production-realistic benchmark of three popular vector databases for RAG workloads, using real embedding distributions from enterprise documents rather than synthetic vectors.

Methodology

Dataset — 2M embeddings from real enterprise documents (1536-dim, OpenAI ada-002)
Queries — 10K real user queries from production RAG system
Metrics — recall@10, p95 latency, cost/1M queries, operational burden
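The two quantitative metrics above can be sketched in a few lines. This is a minimal illustration, not the benchmark harness itself: it assumes recall@10 is measured against exact (brute-force) nearest neighbours and that p95 uses the nearest-rank definition, neither of which the write-up specifies. The ids and latencies are made up.

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the ground-truth neighbours found in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical data: exact neighbours vs. what an ANN index returned for one query.
exact_top10 = list(range(10))                 # ids 0..9
ann_top10 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]   # one true neighbour missed
latencies_ms = [8, 9, 9, 10, 10, 11, 11, 12, 12, 30]

print(recall_at_k(ann_top10, exact_top10))  # 0.9
print(percentile(latencies_ms, 95))         # 30
```

In a real run, recall would be averaged over all 10K queries, and the p95 would be taken over per-query latencies rather than ten samples.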

Results Summary

Database	Recall@10	p95 Latency	Cost/Month (2M vectors)
Qdrant (self-hosted)	98.2%	12ms	$150
Pinecone (managed)	97.8%	18ms	$420
Weaviate (self-hosted)	97.5%	22ms	$180

Recommendation

Qdrant wins on performance and cost for teams comfortable with self-hosting. Pinecone wins on operational simplicity. Weaviate’s multi-tenancy support is best for SaaS use cases.

A production postmortem on migrating from naive HuggingFace pipeline() inference to vLLM — and the 3x latency improvement that came with it.

The Problem

Our Llama 3 8B model was serving at 800ms p95 latency with HuggingFace’s default inference. At 500 concurrent users, GPU utilization was only 40%, with most of the time spent in memory allocation and batch scheduling.

The Fix

vLLM’s PagedAttention — eliminated KV cache fragmentation, GPU memory utilization jumped to 90%+
Continuous batching — no more waiting for the slowest request in a batch
AWQ quantization — 4-bit quantized model with negligible quality loss, 2x throughput
Tensor parallelism — split model across 2x A10G for headroom
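To make the continuous batching point concrete, here is a toy simulation comparing it with static batching. It is a sketch under strong assumptions, not vLLM's scheduler: every request needs a fixed, positive number of decode steps, prefill and memory limits are ignored, and the request lengths are invented.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch must wait for its slowest request."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total

def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a finished request's slot is refilled immediately."""
    queue = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # One decode step for every in-flight request; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical workload: two long generations mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

In this toy workload the short requests no longer sit behind the long ones, so the same work finishes in 110 steps instead of 200. Real gains depend on the length distribution of the traffic.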

Results

Metric	Before	After
p95 Latency	800ms	250ms
Throughput	15 req/s	48 req/s
GPU Utilization	40%	92%
Cost/1K requests	$0.12	$0.04

Takeaway

vLLM is production-ready. The continuous batching alone is worth the migration. If you’re still using transformers.pipeline() for serving, you’re leaving 3x performance on the table.

Fine-tuned open-weight LLMs for enterprise use cases — legal document summarization, code review, and customer support triage.

Approach

Base Models — Llama 3 8B, Mistral 7B, Phi-3 Mini
Method — QLoRA (4-bit quantization + Low-Rank Adaptation) via Unsloth
Data — curated instruction datasets (5K-20K examples per domain)
Evaluation — custom benchmarks + human preference ranking
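A quick back-of-the-envelope calculation shows why QLoRA is cheap to train. The numbers below are illustrative assumptions (a single 4096 x 4096 projection, rank 16, a round 8B parameter count), not figures from these runs, and the memory estimate covers frozen weights only, not activations, optimizer state, or quantization overhead.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Illustrative: adapting one 4096 x 4096 projection matrix with rank 16.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, rank=16)
print(full, adapter, f"{adapter / full:.2%}")  # 16777216 131072 0.78%

# Frozen base weights of a roughly 8B-parameter model at 16-bit vs 4-bit precision.
print(weight_memory_gb(8e9, 16))  # 16.0
print(weight_memory_gb(8e9, 4))   # 4.0
```

The adapter trains under 1% of the parameters of the matrix it modifies, and 4-bit storage cuts the frozen weights to about a quarter of their 16-bit size.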

Why Unsloth

Unsloth’s fused kernels and memory optimizations let us fine-tune 8B models on a single A100 in under 2 hours — compared to 5+ hours with vanilla PEFT. The 4-bit training path kept VRAM under 24GB.

Results

Model	Task	Accuracy	vs Base
Llama 3 8B	Legal Summarization	91.3%	+18.7%
Mistral 7B	Code Review	87.5%	+22.1%
Phi-3 Mini	Support Triage	94.0%	+15.3%

Deployment

Models were exported to GGUF format for llama.cpp inference, and also served via vLLM behind a FastAPI gateway with streaming support.

A deep-dive into using DVC (Data Version Control) for production ML workflows — going beyond basic file tracking to full pipeline orchestration.

What’s Covered

Dataset versioning — track large datasets in S3/GCS without bloating Git
Pipeline DAGs — define training pipelines as reproducible dvc.yaml stages
Experiment tracking — dvc exp for hyperparameter sweeps without branch pollution
CI integration — automated retraining triggers on data drift detection

Key Insight

The biggest win from DVC isn’t version control — it’s reproducibility. When a model degrades in production, you can dvc checkout the exact data + code + params that produced the last good model and diff against current state.
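The "diff against current state" step can be pictured as a comparison of two parameter snapshots. The helper below is a hypothetical illustration in plain Python, not DVC's implementation; DVC ships its own commands for this, such as `dvc params diff`. The parameter names and values are invented.

```python
def diff_params(last_good, current):
    """Return {key: (old, new)} for every parameter that was added, removed, or changed.

    A missing key is reported as None, so this sketch cannot tell a missing
    parameter from one explicitly set to None.
    """
    changes = {}
    for key in sorted(set(last_good) | set(current)):
        old, new = last_good.get(key), current.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes

# Hypothetical snapshots of params.yaml at the last good commit and at HEAD.
last_good = {"lr": 0.001, "epochs": 10}
current = {"lr": 0.01, "epochs": 10, "warmup_steps": 500}

for key, (old, new) in diff_params(last_good, current).items():
    print(f"{key}: {old} -> {new}")
# lr: 0.001 -> 0.01
# warmup_steps: None -> 500
```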

Code Examples

# dvc.yaml: define a two-stage training pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [data/raw, src/prepare.py]  # rerun when the raw data or the script changes
    outs: [data/processed]
  train:
    # ${lr} and ${epochs} are interpolated from params.yaml
    cmd: python src/train.py --lr ${lr} --epochs ${epochs}
    deps: [data/processed, src/train.py]
    params: [lr, epochs]
    outs: [models/latest]
    metrics: [metrics.json]

Who This Is For

ML engineers tired of model_v2_final_FINAL.pkl and data scientists who want git bisect for their training data.

A zero-config embedding cache that sits between your code and the embedding API. Every embedding is stored in a local SQLite database — identical inputs return cached results instantly.

Why

During RAG development, you re-embed the same documents hundreds of times while iterating on chunking strategies, metadata, and retrieval logic. Each call costs money and adds latency.

Usage

from embed_cache import CachedEmbeddings

embedder = CachedEmbeddings(model="text-embedding-3-small")
# First call: hits the embedding API and stores the results in SQLite
vectors = embedder.embed(["document chunk 1", "document chunk 2"])
# Second call with identical inputs: instant, free (served from the cache)
vectors = embedder.embed(["document chunk 1", "document chunk 2"])

Features

Drop-in replacement for OpenAI and Cohere embedding clients
SQLite backend — no infrastructure needed
Cache hit rate tracking and cost savings reporting
TTL support for cache invalidation
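The features above (SQLite storage, hit tracking, TTL) can be sketched with the standard library alone. This is a toy illustration of the idea, not embed-cache's actual code: the class name, the table schema, the key scheme (SHA-256 of model plus text), and the `fake_embed` stand-in for a real API client are all assumptions.

```python
import hashlib
import json
import sqlite3
import time

class SqliteEmbeddingCache:
    """Toy cache: key = SHA-256 of (model, text); value = JSON-encoded vector."""

    def __init__(self, embed_fn, model, path=":memory:", ttl_seconds=None):
        self.embed_fn = embed_fn  # hypothetical callable: list[str] -> list[list[float]]
        self.model = model
        self.ttl_seconds = ttl_seconds
        self.hits = 0    # texts served without an API call
        self.misses = 0  # texts sent to embed_fn
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, vector TEXT NOT NULL, created REAL NOT NULL)"
        )

    def _key(self, text):
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def _lookup(self, key):
        row = self.db.execute(
            "SELECT vector, created FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        vector, created = row
        if self.ttl_seconds is not None and time.time() - created > self.ttl_seconds:
            return None  # expired: treat as a miss and re-embed
        return json.loads(vector)

    def embed(self, texts):
        keys = [self._key(t) for t in texts]
        found = {k: self._lookup(k) for k in set(keys)}
        # Texts not in the cache, de-duplicated, in first-seen order.
        missing = list(dict.fromkeys(t for t, k in zip(texts, keys) if found[k] is None))
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                key = self._key(text)
                found[key] = vector
                self.db.execute(
                    "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                    (key, json.dumps(vector), time.time()),
                )
            self.db.commit()
        self.misses += len(missing)
        self.hits += len(texts) - len(missing)
        return [found[k] for k in keys]

# Stand-in for a paid embedding API; records every call it receives.
calls = []
def fake_embed(texts):
    calls.append(list(texts))
    return [[float(len(t)), 0.0] for t in texts]

cache = SqliteEmbeddingCache(fake_embed, model="demo-model")
first = cache.embed(["chunk 1", "chunk 2"])
second = cache.embed(["chunk 1", "chunk 2"])
print(first == second, len(calls), cache.hits, cache.misses)  # True 1 2 2
```

Including the model name in the key matters: the same text embedded with a different model must not return a stale vector.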

A CLI tool born from frustration with slow CI builds in large monorepos.

Features

Parallel execution — runs independent tasks concurrently with configurable concurrency limits
Dependency graph — DAG-based task ordering, only runs what’s needed
Smart caching — content-addressable cache skips tasks when inputs haven’t changed
Simple config — YAML task definitions, no DSL to learn
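The dependency graph and content-addressable cache can be illustrated together. This is a Python sketch of the general technique, not taskr's Go implementation: it uses literal file paths instead of globs, keeps file contents in a dict, and assumes that a task's cache key also covers the keys of its upstream tasks, which taskr may or may not do.

```python
import hashlib
from graphlib import TopologicalSorter

def plan(tasks, target):
    """Return the target and its transitive dependencies in a valid execution order."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(tasks[name].get("deps", []))
    graph = {name: tasks[name].get("deps", []) for name in needed}
    return list(TopologicalSorter(graph).static_order())

def cache_key(task, input_contents, dep_keys):
    """Content-addressed key: command + input contents + keys of upstream tasks."""
    digest = hashlib.sha256(task["cmd"].encode())
    for path in sorted(input_contents):
        digest.update(path.encode())
        digest.update(input_contents[path])
    for key in dep_keys:
        digest.update(key.encode())
    return digest.hexdigest()

def run(tasks, target, files, cache):
    """Walk the plan for target and return the names of tasks that were not cached."""
    ran, keys = [], {}
    for name in plan(tasks, target):
        task = tasks[name]
        inputs = {p: files[p] for p in task.get("inputs", [])}
        keys[name] = cache_key(task, inputs, [keys[d] for d in task.get("deps", [])])
        if keys[name] not in cache:
            ran.append(name)  # a real runner would execute task["cmd"] here
            cache.add(keys[name])
    return ran

# Hypothetical three-task pipeline and in-memory "files".
tasks = {
    "lint": {"cmd": "eslint src/", "inputs": ["src/app.ts"]},
    "test": {"cmd": "pytest tests/", "deps": ["lint"], "inputs": ["tests/test_app.py"]},
    "build": {"cmd": "docker build -t app .", "deps": ["test"]},
}
files = {"src/app.ts": b"v1", "tests/test_app.py": b"v1"}
cache = set()

print(run(tasks, "build", files, cache))  # ['lint', 'test', 'build']
print(run(tasks, "build", files, cache))  # []
files["tests/test_app.py"] = b"v2"
print(run(tasks, "build", files, cache))  # ['test', 'build']
```

Because each key folds in the upstream keys, changing a test file reruns test and build but leaves the lint result cached.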

Why Go

Single binary distribution, fast startup, excellent concurrency primitives. Users download one binary — no runtime dependencies.

Usage

# taskr.yaml
tasks:
  lint:
    cmd: eslint src/
    inputs: ["src/**/*.ts"]
  test:
    cmd: pytest tests/
    deps: [lint]
    inputs: ["src/**/*.py", "tests/**/*.py"]
  build:
    cmd: docker build -t app .
    deps: [test]

taskr run build  # runs lint → test → build, skips cached steps
