Vector Database Benchmarks — Qdrant vs Pinecone vs Weaviate
Jan 2026
Comprehensive benchmark comparing vector databases for production RAG workloads — latency, recall, cost, and operational complexity.
Writing
Learning Rust for High-Performance Inference
Jan 2026 — Present
Learning Rust with a focus on building high-performance ML inference servers — async runtimes, zero-copy deserialization, and ONNX runtime bindings.
Writing
Production LLM Inference with vLLM
Dec 2025
How we optimized LLM serving latency by 3x using vLLM's continuous batching, PagedAttention, and quantized model deployment.
Agentic Pipelines, Code Gen & RAG Evaluation
Building · 1 Apr 2026
#agents #fine-tuning #rag #llm
Multi-Agent Document Understanding
A supervisor agent routes documents to specialized sub-agents — layout parsing, entity extraction, cross-reference resolution, and fact validation. Each agent has access to different tools and retrieval sources, allowing the system to handle multi-modal documents (text + tables + figures) at scale.
Domain-Specific Code Generation with Llama 3
Base LLMs generate generic code. Fine-tuning Llama 3 on ~50K internal code samples using QLoRA + Unsloth to teach it proprietary SDKs, API patterns, and coding conventions. Early results: 73% pass rate on internal API tests vs 12% for the base model.
Open-Source RAG Evaluation Framework
Moving beyond vibes-based RAG evaluation. A CI-friendly framework measuring retrieval precision/recall, answer faithfulness, hallucination detection, and end-to-end correctness — works with any retriever and any LLM, tracks regressions over time.
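A minimal sketch of the retrieval side of such a framework, assuming each query comes with a labeled set of relevant document ids and that retrieved ids are unique. The function names and the 0.02 tolerance are illustrative, not the framework's actual API.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of unique document ids returned by the retriever.
    relevant:  set of ids labeled relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def has_regressed(current, baseline, tolerance=0.02):
    """True when a metric dropped more than `tolerance` below its baseline."""
    return current < baseline - tolerance
```

In CI, a check like `has_regressed` could fail the build when recall@k falls below the last recorded baseline, which is one way to track regressions over time.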
RLHF, Sparse MoE & Rust for Inference
Learning · 1 Apr 2026
#rlhf #moe #rust #research
Reinforcement Learning from Human Feedback
Studying the full alignment pipeline — reward modeling from pairwise preferences, PPO vs DPO tradeoffs, constitutional AI, and open problems around reward hacking and scalable oversight. DPO is simpler and more stable for most cases, but PPO gives finer control. The real bottleneck is always preference data quality.
Mixture of Experts Architectures
How sparse activation scales model capacity without proportional compute cost. Working through Switch Transformers, ST-MoE, Mixtral, and DeepSeek-MoE. Key insight: expert load balancing is the critical implementation challenge, and sparse models need fundamentally different serving infrastructure.
Rust for High-Performance Inference
Python is ML’s lingua franca, but inference serving is a systems problem. Building lightweight servers with Tokio async runtime, ONNX Runtime Rust bindings, and zero-copy tensor handling — targeting sub-millisecond overhead where Python’s GIL is the bottleneck.
ML Systems Design, KV-Cache Research & Staff Engineering
Reading · 1 Apr 2026
#books #papers #ml-systems
Designing Machine Learning Systems — Chip Huyen
The definitive guide to production ML. Covers data engineering, feature stores, training pipelines, model serving, monitoring, and the organizational patterns that make ML teams effective. Required reading for anyone moving models from notebooks to production.
KV-Cache Optimization Papers
KV-cache memory is a key bottleneck for long-context LLM inference, so managing it efficiently matters. Studying PagedAttention (vLLM), multi-query attention, grouped-query attention, and sliding window approaches. Understanding these tradeoffs directly informs inference infrastructure decisions.
The Staff Engineer’s Path — Tanya Reilly
Technical leadership beyond writing code. Architecture decisions that compound, mentoring that scales, and organizational influence through technical judgment. Rethinking what “impact” means at senior levels.
Multi-Agent Document Understanding
Project · 15 Mar 2026
#agents #llm #rag #langgraph
Designing an agentic pipeline where specialized agents handle different document understanding tasks — layout parsing, entity extraction, cross-reference resolution, and fact validation.
Architecture
The system uses a supervisor agent that routes documents to specialized sub-agents based on document type and complexity. Each agent has access to different tools and retrieval sources.
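The routing idea can be sketched as a plain dispatch table. Everything here is hypothetical: the document types, the agent functions, and their toy logic stand in for real LLM-backed agents with their own tools and retrieval sources.

```python
# Hypothetical sub-agents. Each takes a document dict and returns its findings.
def parse_layout(doc):
    return {"layout": f"{len(doc['pages'])} page(s)"}


def extract_entities(doc):
    words = " ".join(doc["pages"]).split()
    return {"entities": [w for w in words if w.istitle()]}


def resolve_cross_references(doc):
    return {"cross_refs": [p for p in doc["pages"] if "see section" in p.lower()]}


def validate_facts(doc):
    return {"validated": True}  # placeholder: a real agent would check sources


# Routing table keyed on document type; the types are illustrative only.
ROUTES = {
    "form": [parse_layout, extract_entities],
    "contract": [parse_layout, extract_entities,
                 resolve_cross_references, validate_facts],
}
DEFAULT_ROUTE = [parse_layout]


def supervise(doc):
    """Run the sub-agents selected for this document type and merge findings."""
    findings = {"type": doc["type"]}
    for agent in ROUTES.get(doc["type"], DEFAULT_ROUTE):
        findings.update(agent(doc))
    return findings
```

A real supervisor would likely also weigh document complexity, but a static table keeps the routing decision easy to inspect and test.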
A hands-on workshop for engineers moving from ChatGPT-style prompting to production-grade prompt engineering.
Topics Covered
Structured outputs — JSON mode, function calling, and Pydantic validation
Prompt versioning — treating prompts as code with version control and A/B testing
Guardrails — input validation, output filtering, and hallucination detection
Testing prompts — unit tests for LLM outputs using semantic similarity and rubric grading
Cost optimization — prompt caching, token budgeting, and model routing
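As one illustration of the structured-output and guardrail topics, here is a standard-library sketch that validates a JSON reply and retries with the error fed back into the prompt. In practice a Pydantic model would replace the hand-written checks. The `TriageResult` schema, the allowed categories, and the `llm` callable are all assumptions made for the example.

```python
import json
from dataclasses import dataclass

ALLOWED_CATEGORIES = {"billing", "bug", "other"}  # hypothetical label set


@dataclass
class TriageResult:
    category: str
    priority: int


def parse_triage(raw):
    """Parse and validate one model reply; raise on any schema violation."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    category = data["category"]
    priority = data["priority"]
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise ValueError("priority must be an integer")
    if not 1 <= priority <= 5:
        raise ValueError("priority must be between 1 and 5")
    return TriageResult(category, priority)


def call_with_retries(llm, prompt, max_attempts=3):
    """Call `llm` (any str -> str callable) until its reply validates."""
    last_error = None
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            return parse_triage(raw)
        except (ValueError, KeyError, TypeError) as exc:
            last_error = exc
            prompt = f"{prompt}\n\nPrevious reply was invalid ({exc}). Reply with JSON only."
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error
```

Because `llm` is just a callable, the same function can be unit-tested with a stub that returns canned replies.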
Key Takeaway
The gap between a demo prompt and a production prompt is the same as the gap between a script and a service — error handling, testing, monitoring, and versioning.
Deep Dive: RLHF & Alignment Techniques
Writing · 1 Mar 2026
#rlhf #alignment #llm #research
Documenting my learning journey through RLHF and alignment techniques. Covering the full pipeline from preference data collection to reward model training to policy optimization.
Topics Covered
Reward modeling from pairwise human preferences
PPO vs DPO — tradeoffs in practice
Constitutional AI and self-supervised alignment
Open questions: reward hacking, goodharting, scalable oversight
Key Takeaways So Far
DPO is simpler and more stable than PPO for most use cases, but PPO gives more fine-grained control when you need it. The real bottleneck is always preference data quality.
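Part of why DPO is simpler shows up in its objective: for one preference pair it is a logistic loss on the difference of log-probability ratios between the policy and a frozen reference model, with no separate reward model or rollout loop. A small sketch for a single pair, assuming the four sequence log-probabilities are already computed (the argument names and the `beta=0.1` default are illustrative):

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) == softplus(-x), written to avoid overflow in exp().
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference, the loss is log 2. It falls as the policy shifts probability toward the chosen response relative to the reference.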
Exploring Mixture of Experts Architectures
Writing · 15 Feb 2026
#moe #architecture #llm #research
MoE is how the industry is scaling LLMs beyond dense transformer limits. Studying the key architectures and implementation details.
Reading List
Switch Transformers (Fedus et al., 2021)
ST-MoE (Zoph et al., 2022)
Mixtral (Jiang et al., 2024)
DeepSeek-MoE and fine-grained expert design
Key Insights
Expert load balancing is the critical implementation challenge
Token-choice vs expert-choice routing has major throughput implications
Sparse models need different serving infrastructure than dense models
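To make the load-balancing point concrete, here is a pure-Python sketch of the auxiliary loss described in the Switch Transformers paper for top-1 routing: alpha × N × Σ f_i · P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. The `alpha=0.01` default is an assumption for the example, and real implementations operate on tensors per batch.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_load_balance_loss(router_logits, alpha=0.01):
    """Auxiliary loss for top-1 routing over a batch of token router logits."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    fractions = [c / num_tokens for c in counts]

    # P_i: mean router probability assigned to expert i.
    mean_probs = [sum(p[i] for p in probs) / num_tokens
                  for i in range(num_experts)]

    return alpha * num_experts * sum(f * p for f, p in zip(fractions, mean_probs))
```

With perfectly uniform routing the loss equals `alpha`. It grows as tokens collapse onto a few experts, which is the failure mode the term is meant to discourage.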
DeepAgents — Multi-Agent Orchestration Research
OSS · 1 Feb 2026
#multi-agent #llm #agents #research #open-source
Contributing to DeepAgents, a research framework exploring how multiple LLM-powered agents can collaborate on complex tasks through hierarchical planning and shared memory.
Tool registry — built a dynamic tool discovery and registration system
Evaluation harness — added benchmarks for multi-agent task completion on SWE-bench
Research Questions
How do agents decompose ambiguous tasks into sub-plans?
When should agents delegate vs. execute directly?
What memory architectures minimize hallucination in long-horizon tasks?
Learnings
The biggest insight: agent reliability scales better with structured state machines (like LangGraph) than with pure prompt-driven autonomy. Explicit control flow + LLM reasoning at decision nodes beats end-to-end agent prompting.
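A framework-free sketch of that pattern: the control flow is an explicit transition table, and the model is consulted only to choose among the transitions a state allows. This is not LangGraph's API. The states, the `decide` callable, and the stub below are hypothetical.

```python
# Allowed transitions: the model can only pick among these at each state.
TRANSITIONS = {
    "plan": {"execute"},
    "execute": {"validate"},
    "validate": {"execute", "done"},
}


def run_state_machine(context, decide, max_steps=10):
    """Drive the workflow; `decide(state, context)` proposes the next state."""
    state = "plan"
    trace = []
    while state != "done":
        if len(trace) >= max_steps:
            raise RuntimeError("step budget exhausted")
        trace.append(state)
        proposed = decide(state, context)
        if proposed not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {proposed}")
        state = proposed
    return trace


def stub_decide(state, context):
    """Stand-in for an LLM call: retry execution once before accepting."""
    if state == "plan":
        return "execute"
    if state == "execute":
        context["attempts"] = context.get("attempts", 0) + 1
        return "validate"
    return "done" if context["attempts"] >= 2 else "execute"
```

The step budget and the transition check are what keep a misbehaving model from looping or skipping validation, regardless of what it outputs.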
Base LLMs generate generic code. When your team has internal SDKs, specific API patterns, and coding conventions, the model needs domain knowledge it was never trained on.
Approach
Curated training dataset from internal repos (~50K code samples with docstrings)
QLoRA fine-tuning with Unsloth for memory-efficient training on a single A100
Custom evaluation harness testing function correctness, API usage accuracy, and style compliance
Results So Far
73% pass rate on internal API usage tests (vs 12% for base Llama 3)
3x faster inference with vLLM serving + speculative decoding
Vector Database Benchmarks — Qdrant vs Pinecone vs Weaviate
Writing · 25 Jan 2026
#vector-db #rag #benchmarks #performance #comparison
Ran a production-realistic benchmark of the three most popular vector databases for RAG workloads — not just synthetic benchmarks, but real embedding distributions from enterprise documents.
Methodology
Dataset — 2M embeddings from real enterprise documents (1536-dim, OpenAI text-embedding-ada-002)
Queries — 10K real user queries from production RAG system
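For a setup like this, the two headline numbers per database would typically be recall against exact nearest-neighbor results and tail latency. A small sketch of both, assuming exact (brute-force) neighbor ids are available as ground truth and using the nearest-rank percentile definition. Neither choice is confirmed by the write-up.

```python
import math


def ann_recall_at_k(approx_ids, exact_ids, k):
    """Share of the true top-k neighbors that the index actually returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting p95 or p99 rather than the mean matters here, since a few slow queries are what users of a RAG system actually notice.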
Qdrant wins on performance and cost for teams comfortable with self-hosting. Pinecone wins on operational simplicity. Weaviate’s multi-tenancy support is best for SaaS use cases.
Async with Tokio — building concurrent HTTP/gRPC servers
ONNX Runtime Rust bindings — running models without Python overhead
Zero-copy tensor handling — minimizing allocations in the hot path
Goal
Build a lightweight inference server that can serve ONNX models with sub-millisecond overhead, suitable for real-time applications where Python’s GIL is the bottleneck.
Multi-Agent RAG System with LangGraph
Project · 10 Jan 2026 at Acme AI
#rag #llm #agents #agentic-workflows #production
Built an end-to-end agentic RAG system that goes beyond simple retrieval — the LangGraph agent reasons over multiple sources, self-corrects, and provides grounded citations.
Architecture
Retriever Agent — hybrid search (dense + sparse) across Qdrant vector store
Reasoner Agent — multi-step chain-of-thought with GPT-4 / Claude
Validator Agent — fact-checks responses against source documents
Orchestrator — LangGraph state machine managing agent handoffs and retries
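The write-up does not say how the dense and sparse result lists are merged. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the system's actual method. The constant `k=60` is the value usually quoted for RRF, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

RRF only needs ranks, not scores, so it sidesteps the problem of dense similarity and sparse scores such as BM25 living on different scales.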
Key Decisions
Chose LangGraph over vanilla LangChain for explicit control over agent state transitions
Qdrant over Pinecone for self-hosted deployment and cost control
Streaming responses via Server-Sent Events for perceived latency improvement
Evaluation
Built a custom evaluation harness using RAGAS metrics — faithfulness, answer relevancy, and context precision tracked per-query in MLflow.
A production postmortem on migrating from naive HuggingFace pipeline() inference to vLLM — and the 3x latency improvement that came with it.
The Problem
Our Llama 3 8B model was serving at 800ms p95 with HuggingFace’s default inference. At 500 concurrent users, GPU utilization was only 40% — most time was spent in memory allocation and batch scheduling.
Continuous batching — no more waiting for the slowest request in a batch
AWQ quantization — 4-bit quantized model with negligible quality loss, 2x throughput
Tensor parallelism — split model across 2x A10G for headroom
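The continuous batching gain can be illustrated with a toy scheduler model. It counts decode steps only and ignores prefill, memory limits, and everything else vLLM actually manages, so the output is an intuition aid and not a performance prediction.

```python
import heapq


def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest request ends."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A finished request frees its slot for the next queued request at once."""
    finish_times = [0] * slots
    heapq.heapify(finish_times)
    for length in lengths:
        start = heapq.heappop(finish_times)  # earliest slot to free up
        heapq.heappush(finish_times, start + length)
    return max(finish_times)
```

For output lengths `[10, 2, 2, 2, 10, 2, 2, 2]` on two slots, the static schedule takes 24 steps and the continuous one takes 16, because short requests no longer wait behind a long one in the same batch.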
Results
Metric           | Before   | After
p95 Latency      | 800ms    | 250ms
Throughput       | 15 req/s | 48 req/s
GPU Utilization  | 40%      | 92%
Cost/1K requests | $0.12    | $0.04
Takeaway
vLLM is production-ready. The continuous batching alone is worth the migration. If you’re still using transformers.pipeline() for serving, you’re leaving 3x performance on the table.
Fine-tuned open-weight LLMs for enterprise use cases — legal document summarization, code review, and customer support triage.
Approach
Base Models — Llama 3 8B, Mistral 7B, Phi-3 Mini
Method — QLoRA (4-bit quantization + Low-Rank Adaptation) via Unsloth
Data — curated instruction datasets (5K-20K examples per domain)
Evaluation — custom benchmarks + human preference ranking
Why Unsloth
Unsloth’s fused kernels and memory optimizations let us fine-tune 8B models on a single A100 in under 2 hours — compared to 5+ hours with vanilla PEFT. The 4-bit training path kept VRAM under 24GB.
Results
Model      | Task                | Accuracy | vs Base
Llama 3 8B | Legal Summarization | 91.3%    | +18.7%
Mistral 7B | Code Review         | 87.5%    | +22.1%
Phi-3 Mini | Support Triage      | 94.0%    | +15.3%
Deployment
Models were exported to GGUF for llama.cpp inference, and also served via vLLM behind a FastAPI gateway with streaming support.
Most RAG systems are evaluated by “vibes” — someone reads a few outputs and says “looks good.” This talk presents a systematic evaluation framework used in production.
Contributed to HuggingFace Transformers, one of the most widely used libraries for state-of-the-art NLP and LLM inference.
Contributions
Flash Attention integration — added FlashAttention-2 support for Mistral and Phi model families
Quantization improvements — optimized GPTQ and AWQ quantization paths for faster loading
Training utilities — improved gradient checkpointing for multi-GPU fine-tuning workflows
Documentation — rewrote fine-tuning guides for the PEFT + Transformers integration
Impact
These optimizations reduced inference latency by 30-40% for affected model families and are now part of the default pipeline for millions of daily API calls on HuggingFace Hub.
A deep-dive into using DVC (Data Version Control) for production ML workflows — going beyond basic file tracking to full pipeline orchestration.
What’s Covered
Dataset versioning — track large datasets in S3/GCS without bloating Git
Pipeline DAGs — define training pipelines as reproducible dvc.yaml stages
Experiment tracking — dvc exp for hyperparameter sweeps without branch pollution
CI integration — automated retraining triggers on data drift detection
Key Insight
The biggest win from DVC isn’t version control; it’s reproducibility. When a model degrades in production, you can git checkout the commit that produced the last good model, run dvc checkout to restore the matching data, and diff that exact data + code + params against the current state.
A zero-config embedding cache that sits between your code and the embedding API. Every embedding is stored in a local SQLite database — identical inputs return cached results instantly.
Why
During RAG development, you re-embed the same documents hundreds of times while iterating on chunking strategies, metadata, and retrieval logic. Each call costs money and adds latency.
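The core mechanism can be sketched in a few lines of standard-library Python. This is not the tool's actual code: the class name, table schema, JSON storage format, and the `embed_fn` callable (standing in for the embedding API) are all assumptions.

```python
import hashlib
import json
import sqlite3


class EmbeddingCache:
    """Wraps an embedding function and caches its results in SQLite."""

    def __init__(self, embed_fn, path=":memory:"):
        self.embed_fn = embed_fn  # callable: text -> list of floats
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(model, text):
        # Key on model + text so switching models never returns stale vectors.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def embed(self, text, model="example-model"):
        key = self._key(model, text)
        row = self.db.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call
        vector = self.embed_fn(text)
        self.db.execute(
            "INSERT INTO embeddings (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self.db.commit()
        return vector
```

Passing a file path instead of `":memory:"` makes the cache survive across runs, which is where the savings during chunking experiments would come from.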
Open Source Mentorship — First-Time Contributors Program
Talk · 15 May 2025 at Local Dev Community
#open-source #mentorship #community #education
Designed a structured program to lower the barrier to open source contribution, specifically targeting ML/AI repositories.
Program Structure
Week 1 — Git workflows, finding good first issues, reading codebases
Week 2 — Paired PR sessions on real HuggingFace/LangChain issues
Week 3 — Writing tests, documentation, and review etiquette
Week 4 — Solo contributions with async mentor support
Results
20+ participants, 15 shipped their first merged PR
6 continued contributing after the program ended
3 became regular contributors to HuggingFace ecosystem projects
Takeaway
The biggest blocker for first-time contributors isn’t skill — it’s confidence. Pairing someone with a mentor for their first PR changes the trajectory.
Built the team’s first feature store to solve the train-serve skew problem — ensuring ML models see the same features in training and production.
Architecture
Offline store — Parquet files on S3, computed via Airflow batch jobs
Online store — Redis cluster with sub-10ms reads for real-time serving
Feature registry — centralized catalog with lineage, ownership, and freshness SLAs
SDK — Python client for consistent feature retrieval in notebooks, training, and serving
Key Design Decisions
Chose Redis over DynamoDB for online store — 3x lower p99 latency at our scale
Parquet over Delta Lake for offline — simpler, team already familiar, good enough for batch
Built custom registry instead of adopting Feast — our schema requirements didn’t fit
Impact
Eliminated train-serve skew for all production models. Feature reuse across teams went from 0% to 60%, cutting the cost of duplicate computation by ~$2K/month.
NeurIPS 2024 — Spotlight Poster Presentation
Conf · 10 Dec 2024 at NeurIPS
#ml-research #community
Abstract
We propose a parameter-efficient fine-tuning approach that reduces compute requirements by 4x while maintaining 97% of full fine-tuning performance on domain-specific benchmarks. Our method combines adaptive rank selection with gradient-aware layer freezing.
Key Contributions
Adaptive LoRA rank selection — dynamically adjusts rank per layer based on gradient magnitude during training
Layer-wise freezing scheduler — progressively freezes converged layers to redirect compute to under-trained parameters
Domain benchmark suite — released evaluation suite covering legal, medical, and financial domains
Takeaways
The conference provided excellent networking with teams from DeepMind, Meta FAIR, and several university labs working on similar efficiency problems. Led to two follow-up collaborations.
Talk: “Beyond Accuracy — Monitoring LLMs in Production”
Most teams ship LLMs with basic latency/error monitoring and call it done. This talk covered the monitoring patterns that actually catch problems before users complain.
Topics Covered
Semantic drift detection — embedding-based monitoring to catch when input distributions shift
Response quality scoring — lightweight LLM-as-judge pipelines that run async on sampled outputs
Cost attribution — tracing token usage back to features and user segments
Alert fatigue prevention — adaptive thresholds that account for natural usage pattern changes
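One simple way to implement embedding-based drift detection is to compare the centroid of a baseline window of input embeddings with the centroid of a recent window. The talk summary does not specify the method, so treat this as an illustrative assumption. Any alerting threshold would need tuning per workload.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / norm


def drift_score(baseline_embeddings, recent_embeddings):
    """0 means the windows point the same way; larger means more drift."""
    return cosine_distance(centroid(baseline_embeddings),
                           centroid(recent_embeddings))
```

Centroid distance is cheap enough to run on every window, but it misses shifts that keep the mean fixed, so it is better treated as a first-line signal than as a complete detector.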
Organizer Role
This was the 24th edition of our monthly meetup. As co-organizer I handle venue logistics, speaker sourcing, and post-event content publishing. We’ve grown from 20 to 85 regular attendees over two years.
Suggests fixes — generates corrected code with explanations
Writes tests — auto-generates unit tests covering the changed code paths
Architecture
The system uses a LangGraph state machine with four specialized agents:
Analyst — parses diffs and builds dependency graphs
Reviewer — identifies potential issues using RAG over best practices
Fixer — generates code suggestions using few-shot examples
Tester — creates test cases using mutation testing principles
What I Learned
Building under pressure forces you to make ruthless scope decisions. We cut three planned features on Day 1 evening to focus on making the core flow bulletproof, and that focus is what won over the judges.