Redesigned the API gateway layer to support 200+ microservices with rate limiting, auth delegation, and circuit breakers.
Building
Agentic Pipelines, Code Gen & RAG Evaluation
Three active projects in applied LLM engineering: multi-agent document processing, domain-adapted code generation, and systematic RAG quality measurement.
Date 1 Apr 2026 Active
#agents #fine-tuning #rag #llm
Multi-Agent Document Understanding
A supervisor agent routes documents to specialized sub-agents — layout parsing, entity extraction, cross-reference resolution, and fact validation. Each agent has access to different tools and retrieval sources, allowing the system to handle multi-modal documents (text + tables + figures) at scale.
Domain-Specific Code Generation with Llama 3
Base LLMs generate generic code. Fine-tuning Llama 3 on ~50K internal code samples using QLoRA + Unsloth to teach it proprietary SDKs, API patterns, and coding conventions. Early results: 73% pass rate on internal API tests vs 12% for the base model.
Open-Source RAG Evaluation Framework
Moving beyond vibes-based RAG evaluation. A CI-friendly framework measuring retrieval precision/recall, answer faithfulness, hallucination detection, and end-to-end correctness — works with any retriever and any LLM, tracks regressions over time.
Tech Stack
LangGraph, Unsloth, QLoRA, vLLM, Python, Pinecone
Learning
RLHF, Sparse MoE & Rust for Inference
Deepening expertise in three areas — alignment techniques for LLMs, sparse Mixture of Experts scaling, and systems-level inference serving with Rust.
Date 1 Apr 2026 Active
#rlhf #moe #rust #research
Reinforcement Learning from Human Feedback
Studying the full alignment pipeline — reward modeling from pairwise preferences, PPO vs DPO tradeoffs, constitutional AI, and open problems around reward hacking and scalable oversight. DPO is simpler and more stable for most cases, but PPO gives finer control. The real bottleneck is always preference data quality.
Mixture of Experts Architectures
How sparse activation scales model capacity without proportional compute cost. Working through Switch Transformers, ST-MoE, Mixtral, and DeepSeek-MoE. Key insight: expert load balancing is the critical implementation challenge, and sparse models need fundamentally different serving infrastructure.
Rust for High-Performance Inference
Python is ML’s lingua franca, but inference serving is a systems problem. Building lightweight servers with Tokio async runtime, ONNX Runtime Rust bindings, and zero-copy tensor handling — targeting sub-millisecond overhead where Python’s GIL is the bottleneck.
Tech Stack
PyTorch, TRL, DeepSpeed, Megablocks, Tokio, ONNX Runtime
Reading
ML Systems Design, KV-Cache Research & Staff Engineering
Books and papers shaping how I think about production ML — system design principles, efficient long-context inference, and technical leadership beyond the IC track.
Date 1 Apr 2026
#books #papers #ml-systems
Designing Machine Learning Systems — Chip Huyen
The definitive guide to production ML. Covers data engineering, feature stores, training pipelines, model serving, monitoring, and the organizational patterns that make ML teams effective. Required reading for anyone moving models from notebooks to production.
KV-Cache Optimization Papers
KV-cache memory is a key bottleneck for long-context LLM inference, so managing it efficiently matters. Studying PagedAttention (vLLM), multi-query attention, grouped-query attention, and sliding window approaches. Understanding these tradeoffs directly informs inference infrastructure decisions.
The Staff Engineer’s Path — Tanya Reilly
Technical leadership beyond writing code. Architecture decisions that compound, mentoring that scales, and organizational influence through technical judgment. Rethinking what “impact” means at senior levels.
Case Study
Multi-Agent Document Understanding
Building multi-agent systems that decompose complex documents into structured knowledge using specialized LLM agents for extraction, reasoning, and validation.
Date 15 Mar 2026 Active
#agents #llm #rag #langgraph
Designing an agentic pipeline where specialized agents handle different document understanding tasks — layout parsing, entity extraction, cross-reference resolution, and fact validation.
Architecture
The system uses a supervisor agent that routes documents to specialized sub-agents based on document type and complexity. Each agent has access to different tools and retrieval sources.
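As a rough illustration of the routing described above, the supervisor can be sketched as a plain function that picks an ordered list of sub-agents per document. Everything here is an assumption made for the example: the `Document` fields, the routing rules, the 10-page threshold, and the stub agents, which stand in for LLM-backed workers with their own tools and retrieval sources.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    doc_id: str
    doc_type: str            # hypothetical labels such as "contract" or "memo"
    page_count: int
    has_tables: bool = False
    findings: List[str] = field(default_factory=list)


# A sub-agent is modeled as a function; a real one would wrap an LLM call.
Agent = Callable[[Document], str]

AGENTS: Dict[str, Agent] = {
    "layout": lambda doc: f"layout parsed for {doc.doc_id}",
    "entities": lambda doc: f"entities extracted from {doc.doc_id}",
    "xref": lambda doc: f"cross-references resolved in {doc.doc_id}",
    "validate": lambda doc: f"facts validated for {doc.doc_id}",
}


def plan_route(doc: Document) -> List[str]:
    """Supervisor policy: decide which sub-agents run, and in what order."""
    route: List[str] = []
    if doc.has_tables:
        route.append("layout")        # tables and figures need layout parsing first
    route.append("entities")          # every document gets entity extraction
    if doc.page_count > 10:           # illustrative complexity threshold
        route.append("xref")
    if doc.doc_type == "contract":    # illustrative document-type rule
        route.append("validate")
    return route


def run_supervisor(doc: Document) -> Document:
    """Run the planned sub-agents in order and collect their outputs."""
    for name in plan_route(doc):
        doc.findings.append(AGENTS[name](doc))
    return doc
```

Keeping the routing policy in one small function makes it easy to unit-test separately from the agents themselves.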
A hands-on workshop for engineers moving from ChatGPT-style prompting to production-grade prompt engineering.
Topics Covered
Structured outputs — JSON mode, function calling, and Pydantic validation
Prompt versioning — treating prompts as code with version control and A/B testing
Guardrails — input validation, output filtering, and hallucination detection
Testing prompts — unit tests for LLM outputs using semantic similarity and rubric grading
Cost optimization — prompt caching, token budgeting, and model routing
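To make the structured-output and guardrail topics above concrete, here is a minimal sketch of validating a model's JSON reply before it reaches downstream code. It uses only the standard library so it runs anywhere; in practice a Pydantic model would likely replace the manual checks. The `Ticket` schema and the allowed priority values are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_PRIORITIES = {"low", "medium", "high"}  # hypothetical schema constraint


@dataclass
class Ticket:
    title: str
    priority: str


def parse_ticket(raw: str) -> Optional[Ticket]:
    """Return a Ticket if the model output is valid JSON matching the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    title = data.get("title")
    priority = data.get("priority")
    if not isinstance(title, str) or not title.strip():
        return None
    # Check the type before the membership test so unhashable values cannot raise.
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        return None
    return Ticket(title=title.strip(), priority=priority)
```

A caller can treat `None` as a signal to retry the prompt or fall back to a default, which is where the production error handling starts.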
Key Takeaway
The gap between a demo prompt and a production prompt is the same as the gap between a script and a service — error handling, testing, monitoring, and versioning.
Tech Stack
Python, OpenAI, Claude, Pydantic, LangChain
Article
Deep Dive: RLHF & Alignment Techniques
Studying reinforcement learning from human feedback — from reward modeling to PPO and DPO, understanding how modern LLMs are aligned to human preferences.
Date 1 Mar 2026
#rlhf #alignment #llm #research
Documenting my learning journey through RLHF and alignment techniques. Covering the full pipeline from preference data collection to reward model training to policy optimization.
Topics Covered
Reward modeling from pairwise human preferences
PPO vs DPO — tradeoffs in practice
Constitutional AI and self-supervised alignment
Open questions: reward hacking, goodharting, scalable oversight
Key Takeaways So Far
DPO is simpler and more stable than PPO for most use cases, but PPO gives more fine-grained control when you need it. The real bottleneck is always preference data quality.
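Part of why DPO is simpler shows up in its objective: for a single preference pair it is a logistic loss on the gap between policy and reference log-probabilities, with no separate reward model or rollout loop. The sketch below computes that per-pair loss in plain Python. The inputs are assumed to be summed sequence log-probabilities, and `beta=0.1` is only an illustrative default.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)
```

When the policy equals the reference the margin is zero and the loss is log 2. It falls as the policy shifts probability toward the chosen response relative to the reference.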
Tech Stack
PyTorch, TRL, DeepSpeed
Article
Exploring Mixture of Experts Architectures
Research notes on Mixture of Experts — how sparse activation enables scaling model capacity without proportional compute, from Switch Transformer to Mixtral.
Date 15 Feb 2026
#moe #architecture #llm #research
MoE is how the industry is scaling LLMs beyond dense transformer limits. Studying the key architectures and implementation details.
Reading List
Switch Transformers (Fedus et al., 2021)
ST-MoE (Zoph et al., 2022)
Mixtral (Jiang et al., 2024)
DeepSeek-MoE and fine-grained expert design
Key Insights
Expert load balancing is the critical implementation challenge
Token-choice vs expert-choice routing has major throughput implications
Sparse models need different serving infrastructure than dense models
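The load-balancing point above can be made concrete with a small sketch of token-choice top-1 routing plus the auxiliary balancing loss in the style of Switch Transformers, N * sum_i(f_i * P_i), where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. This is plain Python for readability, not how a real training framework would implement it, and the scaling coefficient `alpha` is left as a parameter.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Token-choice top-1 routing: each token picks its highest-probability expert."""
    probs = [softmax(row) for row in router_logits]
    assignments = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, assignments


def load_balance_loss(
    probs: List[List[float]],
    assignments: List[int],
    num_experts: int,
    alpha: float = 1.0,
) -> float:
    """Auxiliary loss alpha * N * sum_i f_i * P_i; it is smallest when routing is uniform."""
    n_tokens = len(assignments)
    total = 0.0
    for i in range(num_experts):
        f_i = sum(1 for a in assignments if a == i) / n_tokens   # dispatch fraction
        p_i = sum(p[i] for p in probs) / n_tokens                # mean router probability
        total += f_i * p_i
    return alpha * num_experts * total
```

With two experts, perfectly alternating tokens give a loss of 1.0 (at `alpha=1`), while sending every token to one expert pushes it higher. That difference is the signal the router is trained to reduce.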
Tech Stack
PyTorch, Megablocks
Open Source
DeepAgents — Multi-Agent Orchestration Research
Contributing to DeepAgents, a framework for building hierarchical multi-agent systems with planning, tool use, and memory.
Contributing to DeepAgents, a research framework exploring how multiple LLM-powered agents can collaborate on complex tasks through hierarchical planning and shared memory.
Tool registry — built a dynamic tool discovery and registration system
Evaluation harness — added benchmarks for multi-agent task completion on SWE-bench
Research Questions
How do agents decompose ambiguous tasks into sub-plans?
When should agents delegate vs. execute directly?
What memory architectures minimize hallucination in long-horizon tasks?
Learnings
The biggest insight: agent reliability scales better with structured state machines (like LangGraph) than with pure prompt-driven autonomy. Explicit control flow + LLM reasoning at decision nodes beats end-to-end agent prompting.
Tech Stack
Python, DeepAgents, LangGraph, GPT-4, Claude
Case Study
Domain-Specific Code Generation with Llama 3
Fine-tuning Llama 3 on proprietary codebases for domain-specific code generation — internal APIs, conventions, and patterns the base model doesn't know.
Date 1 Feb 2026 Active
#fine-tuning #llm #code-gen #llama
Base LLMs generate generic code. When your team has internal SDKs, specific API patterns, and coding conventions, the model needs domain knowledge it was never trained on.
Approach
Curated training dataset from internal repos (~50K code samples with docstrings)
QLoRA fine-tuning with Unsloth for memory-efficient training on a single A100
Custom evaluation harness testing function correctness, API usage accuracy, and style compliance
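A quick back-of-the-envelope calculation shows why the LoRA part of QLoRA fits on a single GPU: a rank-r adapter on a d_out x d_in weight matrix trains only r * (d_in + d_out) parameters, while the frozen base weights stay quantized to 4 bits. The numbers below use the 4096 hidden size of Llama 3 8B. The rank of 16 is an assumption for illustration, since the notes do not state the rank used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameter count of the dense weight matrix being adapted."""
    return d_in * d_out


hidden = 4096   # hidden size of Llama 3 8B
rank = 16       # illustrative rank, not taken from the project

adapter = lora_trainable_params(hidden, hidden, rank)
dense = full_params(hidden, hidden)
print(f"adapter params: {adapter:,}")          # trainable parameters for one square projection
print(f"dense params:   {dense:,}")            # frozen parameters in the same projection
print(f"trainable share: {adapter / dense:.4%}")
```

For one 4096 x 4096 projection this comes to under 1% of the dense parameter count, which is where most of the optimizer-state and gradient memory savings come from.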
Results So Far
73% pass rate on internal API usage tests (vs 12% for base Llama 3)
3x faster inference with vLLM serving + speculative decoding
Tech Stack
Unsloth, QLoRA, PyTorch, Weights & Biases, vLLM
Article
Vector Database Benchmarks — Qdrant vs Pinecone vs Weaviate
Comprehensive benchmark comparing vector databases for production RAG workloads — latency, recall, cost, and operational complexity.
Ran a production-realistic benchmark of the three most popular vector databases for RAG workloads — not just synthetic benchmarks, but real embedding distributions from enterprise documents.
Methodology
Dataset — 2M embeddings from real enterprise documents (1536-dim, OpenAI ada-002)
Queries — 10K real user queries from production RAG system
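For a benchmark like this, the two core measurements are recall against ground-truth neighbors and tail latency. The sketch below shows one way to compute both. It assumes the ground truth comes from an exact (brute-force) search and uses nearest-rank percentiles; the function names are illustrative and not tied to any of the three databases.

```python
import math
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of ground-truth neighbors that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting p95 or p99 alongside the mean matters because the averages of different vector stores can look similar while their tails diverge under load.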
Qdrant wins on performance and cost for teams comfortable with self-hosting. Pinecone wins on operational simplicity. Weaviate’s multi-tenancy support is best for SaaS use cases.
Tech Stack
Qdrant, Pinecone, Weaviate, Python, Docker
Article
Learning Rust for High-Performance Inference
Learning Rust with a focus on building high-performance ML inference servers — async runtimes, zero-copy deserialization, and ONNX runtime bindings.
Date 20 Jan 2026
#rust #inference #performance #learning
Python is the lingua franca of ML, but inference serving is a systems problem. Rust gives you the performance of C++ with memory safety guarantees.
Async with Tokio — building concurrent HTTP/gRPC servers
ONNX Runtime Rust bindings — running models without Python overhead
Zero-copy tensor handling — minimizing allocations in the hot path
Goal
Build a lightweight inference server that can serve ONNX models with sub-millisecond overhead, suitable for real-time applications where Python’s GIL is the bottleneck.
Tech Stack
Rust, Tokio, ONNX Runtime, Tonic gRPC
Case Study
Multi-Agent RAG System with LangGraph
Production agentic RAG system using LangGraph for multi-step reasoning over enterprise knowledge bases. Handles 10K+ queries/day with sub-2s latency.
Org Acme AI Role Lead AI Engineer Date 10 Jan 2026 Active
Designed multi-agent graph with specialized retriever, reasoner, and validator nodes
Feb 2026: RAG Pipeline v1
Hybrid search with dense embeddings + BM25 over Qdrant vector store
Mar 2026: Agent Loop Optimization
Added self-reflection and retry nodes — accuracy jumped from 82% to 94%
Present: Production Deployment
Serving 10K+ queries/day with streaming responses and citation grounding
Built an end-to-end agentic RAG system that goes beyond simple retrieval — the LangGraph agent reasons over multiple sources, self-corrects, and provides grounded citations.
Architecture
Retriever Agent — hybrid search (dense + sparse) across Qdrant vector store
Reasoner Agent — multi-step chain-of-thought with GPT-4 / Claude
Validator Agent — fact-checks responses against source documents
Orchestrator — LangGraph state machine managing agent handoffs and retries
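The notes do not say how the Retriever Agent merges dense and sparse results, so the following is only one common option: reciprocal rank fusion (RRF), which combines ranked lists without needing comparable scores. The constant `k=60` is the conventional default for RRF, and the document IDs are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; each document scores sum(1 / (k + rank)) across lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]    # hypothetical dense-embedding ranking
sparse_hits = ["b", "d", "a"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that rank well in both lists rise to the top, which is the behavior hybrid search is after when exact keyword matches and semantic matches disagree.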
Key Decisions
Chose LangGraph over vanilla LangChain for explicit control over agent state transitions
Qdrant over Pinecone for self-hosted deployment and cost control
Streaming responses via Server-Sent Events for perceived latency improvement
Evaluation
Built a custom evaluation harness using RAGAS metrics — faithfulness, answer relevancy, and context precision tracked per-query in MLflow.
Building an open-source framework to systematically evaluate RAG pipeline quality — retrieval relevance, answer faithfulness, and end-to-end correctness.
Date 10 Jan 2026 Active
#rag #evaluation #open-source #llm
Most RAG pipelines are evaluated on vibes alone. This framework brings structured, repeatable evaluation to retrieval-augmented generation.
What It Measures
Retrieval quality — precision, recall, and MRR of retrieved chunks
Answer faithfulness — does the answer actually follow from the retrieved context?
Hallucination detection — claims in the answer that aren’t grounded in any source
End-to-end correctness — compared against golden test sets
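The retrieval-quality metrics listed above are simple enough to sketch directly. This is not the framework's actual API, just a minimal version of precision@k and mean reciprocal rank under the assumption that each query comes with a set of relevant chunk IDs from a golden test set.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved chunks that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk in top if chunk in relevant) / len(top)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

In a CI setting these values would typically be compared against a stored baseline, with the build failing when a metric drops by more than an agreed tolerance.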
Design Principles
Works with any retriever and any LLM
CI-friendly — runs as part of your test suite
Tracks metrics over time to catch regressions
Tech Stack
Python, RAGAS, LangSmith, pytest, DuckDB
Article
Production LLM Inference with vLLM
How we cut LLM serving latency by roughly 3x using vLLM's continuous batching, PagedAttention, and quantized model deployment.
A production postmortem on migrating from naive HuggingFace pipeline() inference to vLLM — and the 3x latency improvement that came with it.
The Problem
Our Llama 3 8B model was serving at 800ms p95 with HuggingFace’s default inference. At 500 concurrent users, GPU utilization was only 40% — most time was spent in memory allocation and batch scheduling.
Continuous batching — no more waiting for the slowest request in a batch
AWQ quantization — 4-bit quantized model with negligible quality loss, 2x throughput
Tensor parallelism — split model across 2x A10G for headroom
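To see why continuous batching helps, here is a toy simulation that counts decode steps for a queue of requests with different output lengths. It is a deliberately simplified model and not how vLLM is implemented: one step generates one token per active request, and prefill cost and KV-cache memory limits are ignored.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished request frees its slot immediately for the next queued request."""
    queue = deque(lengths)
    active: List[int] = []   # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Refill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


lengths = [2, 8, 2, 8]   # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these illustrative lengths the static schedule needs 16 steps and the continuous one 12, because short requests no longer wait for the slowest request in their batch.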
Results
Metric | Before | After
p95 Latency | 800ms | 250ms
Throughput | 15 req/s | 48 req/s
GPU Utilization | 40% | 92%
Cost/1K requests | $0.12 | $0.04
Takeaway
vLLM is production-ready. The continuous batching alone is worth the migration. If you’re still using transformers.pipeline() for serving, you’re leaving 3x performance on the table.
Tech Stack
vLLM, Python, CUDA, Docker, Kubernetes, NVIDIA A100
Case Study
Domain-Specific LLM Fine-Tuning with Unsloth
Fine-tuned Llama 3 and Mistral models for domain-specific tasks using Unsloth + QLoRA, achieving 40% faster training with 60% less VRAM.
Org Acme AI Role AI Engineer Date 20 Nov 2025 Maintained
Fine-tuned open-weight LLMs for enterprise use cases — legal document summarization, code review, and customer support triage.
Approach
Base Models — Llama 3 8B, Mistral 7B, Phi-3 Mini
Method — QLoRA (4-bit quantization + Low-Rank Adaptation) via Unsloth
Data — curated instruction datasets (5K-20K examples per domain)
Evaluation — custom benchmarks + human preference ranking
Why Unsloth
Unsloth’s fused kernels and memory optimizations let us fine-tune 8B models on a single A100 in under 2 hours — compared to 5+ hours with vanilla PEFT. The 4-bit training path kept VRAM under 24GB.
Results
Model | Task | Accuracy | vs Base
Llama 3 8B | Legal Summarization | 91.3% | +18.7%
Mistral 7B | Code Review | 87.5% | +22.1%
Phi-3 Mini | Support Triage | 94.0% | +15.3%
Deployment
Models exported to GGUF format for llama.cpp inference and served via vLLM behind a FastAPI gateway with streaming support.
Conference talk on systematic RAG evaluation using RAGAS metrics, human preference ranking, and automated regression testing.
Org AI Engineer Summit Date 18 Oct 2025
#rag #evaluation #llm #testing #conference
Most RAG systems are evaluated by “vibes” — someone reads a few outputs and says “looks good.” This talk presents a systematic evaluation framework used in production.
Contributed to HuggingFace Transformers, one of the most widely used libraries for state-of-the-art NLP and LLM inference.
Contributions
Flash Attention integration — added FlashAttention-2 support for Mistral and Phi model families
Quantization improvements — optimized GPTQ and AWQ quantization paths for faster loading
Training utilities — improved gradient checkpointing for multi-GPU fine-tuning workflows
Documentation — rewrote fine-tuning guides for the PEFT + Transformers integration
Impact
These optimizations reduced inference latency by 30-40% for affected model families and are now part of the default pipeline for millions of daily API calls on HuggingFace Hub.
Tech Stack
Python, PyTorch, Hugging Face Transformers, CUDA, Accelerate
Article
Reproducible ML Pipelines with DVC
A practical guide to building reproducible, version-controlled ML data pipelines using DVC, from dataset versioning to automated retraining.
A deep-dive into using DVC (Data Version Control) for production ML workflows — going beyond basic file tracking to full pipeline orchestration.
What’s Covered
Dataset versioning — track large datasets in S3/GCS without bloating Git
Pipeline DAGs — define training pipelines as reproducible dvc.yaml stages
Experiment tracking — dvc exp for hyperparameter sweeps without branch pollution
CI integration — automated retraining triggers on data drift detection
Key Insight
The biggest win from DVC isn’t version control — it’s reproducibility. When a model degrades in production, you can dvc checkout the exact data + code + params that produced the last good model and diff against current state.
A zero-config embedding cache that sits between your code and the embedding API. Every embedding is stored in a local SQLite database — identical inputs return cached results instantly.
Why
During RAG development, you re-embed the same documents hundreds of times while iterating on chunking strategies, metadata, and retrieval logic. Each call costs money and adds latency.
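The core idea can be sketched in a few lines with the standard library. This is not the project's actual code: the class name, the table layout, and the choice to key the cache on a SHA-256 hash of model name plus text are all assumptions for illustration, and `embed_fn` stands in for whatever embedding API call is being wrapped.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class EmbeddingCache:
    """Wraps an embedding function and stores results in SQLite keyed by model + text."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model: str, path: str = ":memory:"):
        self.embed_fn = embed_fn
        self.model = model
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    def _key(self, text: str) -> str:
        # Include the model name so switching models never returns stale vectors.
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        row = self.db.execute("SELECT vector FROM embeddings WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])          # cache hit: no API call
        vector = self.embed_fn(text)           # cache miss: call the real embedder once
        self.db.execute(
            "INSERT INTO embeddings (key, vector) VALUES (?, ?)", (key, json.dumps(vector))
        )
        self.db.commit()
        return vector
```

Passing a file path instead of `":memory:"` would persist the cache across runs, which is where the savings come from when iterating on chunking strategies.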
Self-hosted MLflow on ECS with PostgreSQL backend and S3 artifact store
Aug 2025: Model Registry
Standardized model versioning with automated staging → production promotion
Oct 2025: CI Integration
GitHub Actions pipeline for automated model evaluation on PR merge
Dec 2025: Team Adoption
All 3 ML teams onboarded, 200+ experiments tracked weekly
Designed and deployed a centralized ML experiment tracking platform that brought reproducibility and visibility to the team’s ML workflows.
Architecture
MLflow Tracking Server — ECS-hosted with PostgreSQL metadata store
Artifact Store — S3 with lifecycle policies for cost management
Model Registry — versioned model artifacts with stage transitions (Staging → Production)
Auth — OIDC integration with company SSO
Key Features
Auto-logging for PyTorch, sklearn, and HuggingFace training runs
Custom MLflow plugins for GPU utilization and cost tracking
Slack notifications on model promotion events
Grafana dashboards for experiment trends and compute usage
Tech Stack
MLflow, Python, PostgreSQL, MinIO, Docker, Nginx, Grafana
Talk
Open Source Mentorship — First-Time Contributors Program
Organized and led a 4-week open source mentorship program helping 20+ developers make their first meaningful contributions to ML/AI projects.
Org Local Dev Community Date 15 May 2025
#open-source #mentorship #community #education
Designed a structured program to lower the barrier to open source contribution, specifically targeting ML/AI repositories.
Program Structure
Week 1 — Git workflows, finding good first issues, reading codebases
Week 2 — Paired PR sessions on real HuggingFace/LangChain issues
Week 3 — Writing tests, documentation, and review etiquette
Week 4 — Solo contributions with async mentor support
Results
20+ participants, 15 shipped their first merged PR
6 continued contributing after the program ended
3 became regular contributors to HuggingFace ecosystem projects
Takeaway
The biggest blocker for first-time contributors isn’t skill — it’s confidence. Pairing someone with a mentor for their first PR changes the trajectory.
Tech Stack
Git, GitHub, Python, HuggingFace, LangChain
Certification
AWS Machine Learning — Specialty
AWS Specialty-level certification covering ML workloads: SageMaker, model training, feature engineering, and ML solution architecture.
We propose a parameter-efficient fine-tuning approach that reduces compute requirements by 4x while maintaining 97% of full fine-tuning performance on domain-specific benchmarks. Our method combines adaptive rank selection with gradient-aware layer freezing.
Key Contributions
Adaptive LoRA rank selection — dynamically adjusts rank per layer based on gradient magnitude during training
Layer-wise freezing scheduler — progressively freezes converged layers to redirect compute to under-trained parameters
Domain benchmark suite — released evaluation suite covering legal, medical, and financial domains
Takeaways
The conference provided excellent networking with teams from DeepMind, Meta FAIR, and several university labs working on similar efficiency problems. Led to two follow-up collaborations.
Tech Stack
PyTorch, LoRA, DeepSpeed, Weights & Biases
Meetup
MLOps Community Meetup — Speaker & Organizer
Organized and spoke at the monthly MLOps Community meetup in San Francisco on production LLM monitoring patterns.
Org MLOps Community Role Speaker & Organizer Date 20 Nov 2024
Talk: “Beyond Accuracy — Monitoring LLMs in Production”
Most teams ship LLMs with basic latency/error monitoring and call it done. This talk covered the monitoring patterns that actually catch problems before users complain.
Topics Covered
Semantic drift detection — embedding-based monitoring to catch when input distributions shift
Response quality scoring — lightweight LLM-as-judge pipelines that run async on sampled outputs
Cost attribution — tracing token usage back to features and user segments
Alert fatigue prevention — adaptive thresholds that account for natural usage pattern changes
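As one simple way to picture the semantic drift detection topic above, the sketch below compares the centroid of a baseline window of input embeddings with the centroid of a recent window, using cosine distance. The talk may have used a different statistic. The centroid approach and the `0.1` threshold are assumptions for illustration, and a fixed threshold like this is exactly what adaptive thresholds would replace.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0   # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def drift_score(baseline: Sequence[Sequence[float]], current: Sequence[Sequence[float]]) -> float:
    """Cosine distance between the baseline and current embedding centroids."""
    return cosine_distance(centroid(baseline), centroid(current))


def has_drifted(baseline, current, threshold: float = 0.1) -> bool:
    """Flag drift when the centroid distance exceeds an illustrative threshold."""
    return drift_score(baseline, current) > threshold
```

Centroid distance is cheap enough to run on every monitoring window, though it can miss shifts that change the spread of inputs without moving the mean.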
Organizer Role
This was the 24th edition of our monthly meetup. As co-organizer I handle venue logistics, speaker sourcing, and post-event content publishing. We’ve grown from 20 to 85 regular attendees over two years.
Tech Stack
Prometheus, Grafana, OpenTelemetry, LangSmith, Python
Open Source
taskr — Developer Task Runner CLI
A fast, opinionated task runner for monorepos — parallel execution, dependency graphs, and smart caching. Written in Go.
Suggests fixes — generates corrected code with explanations
Writes tests — auto-generates unit tests covering the changed code paths
Architecture
The system uses a LangGraph state machine with four specialized agents:
Analyst — parses diffs and builds dependency graphs
Reviewer — identifies potential issues using RAG over best practices
Fixer — generates code suggestions using few-shot examples
Tester — creates test cases using mutation testing principles
What I Learned
Building under pressure forces you to make ruthless scope decisions. We cut three planned features on the evening of Day 1 to focus on making the core flow bulletproof, and that focus is what won over the judges.