Tag: performance

Multi-Agent Document Understanding

A supervisor agent routes documents to specialized sub-agents — layout parsing, entity extraction, cross-reference resolution, and fact validation. Each agent has access to different tools and retrieval sources, allowing the system to handle multi-modal documents (text + tables + figures) at scale.
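The routing idea can be sketched roughly as below. This is only an illustration under stated assumptions: the `Document` fields, the stub agents, and the rule that layout parsing runs only for documents with tables or figures are all hypothetical, since the post does not describe how routing decisions are made.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical document record; field names are illustrative only.
    text: str
    has_tables: bool = False
    has_figures: bool = False
    results: dict = field(default_factory=dict)


def make_stub(name):
    """Build a placeholder sub-agent that just records that it ran."""
    def agent(doc):
        doc.results[name] = "done"
    return agent


# The four sub-agent roles named in the text, backed by stubs here.
AGENTS = {
    name: make_stub(name)
    for name in (
        "layout_parsing",
        "entity_extraction",
        "cross_reference_resolution",
        "fact_validation",
    )
}


def route(doc):
    """Pick the sub-agents a document needs, in execution order (assumed rule)."""
    plan = []
    if doc.has_tables or doc.has_figures:
        plan.append("layout_parsing")
    plan += ["entity_extraction", "cross_reference_resolution", "fact_validation"]
    return plan


def supervise(doc):
    """Run every routed sub-agent against the document."""
    for name in route(doc):
        AGENTS[name](doc)
    return doc


report = supervise(Document("Q3 filing", has_tables=True))
```

A real supervisor would more likely be an LLM call that chooses agents dynamically; the dispatch table just makes the control flow visible.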

Domain-Specific Code Generation with Llama 3

Base LLMs generate generic code. This project fine-tunes Llama 3 on ~50K internal code samples using QLoRA + Unsloth to teach it proprietary SDKs, API patterns, and coding conventions. Early results: 73% pass rate on internal API tests vs 12% for the base model.

Open-Source RAG Evaluation Framework

Moving beyond vibes-based RAG evaluation: a CI-friendly framework covering retrieval precision/recall, answer faithfulness, hallucination detection, and end-to-end correctness. It works with any retriever and any LLM, and tracks regressions over time.
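To make the retrieval metrics concrete, here is a minimal sketch of precision and recall at k plus a regression check a CI job could run. The function names, the document IDs, and the 0.02 tolerance are assumptions for illustration, not the framework's actual API.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved IDs against a labelled relevant set."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def regressed(current, baseline, tolerance=0.02):
    """True when a metric dropped below the stored baseline by more than the tolerance."""
    return current < baseline - tolerance


# Hypothetical query: the retriever returned four chunks, three are labelled relevant.
retrieved = ["d1", "d4", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
```

In CI, the baseline would come from a stored results file for the previous commit, and the job would fail when `regressed(...)` returns true for any tracked metric.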

Vector Database Benchmarks: Qdrant vs Pinecone vs Weaviate

A production-realistic benchmark of the three most popular vector databases for RAG workloads, run on real embedding distributions from enterprise documents rather than synthetic data.

Methodology

Dataset — 2M embeddings from real enterprise documents (1536-dim, OpenAI ada-002)
Queries — 10K real user queries from production RAG system
Metrics — recall@10, p95 latency, cost/1M queries, operational burden
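For reference, the two headline metrics could be computed along these lines. This is a sketch under assumptions: it treats exact brute-force neighbours as the recall ground truth and uses the nearest-rank definition of a percentile, neither of which the post specifies.

```python
import math


def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbours that the ANN index also returned."""
    truth = set(exact_ids[:k])
    return len(truth & set(approx_ids[:k])) / len(truth)


def percentile(values, pct):
    """Nearest-rank percentile (integer pct in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical single query: the index missed one of the ten true neighbours.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 99]
query_recall = recall_at_k(approx, exact, k=10)

# Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
```

Reported recall@10 would then be the mean of `recall_at_k` over all benchmark queries.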

Results Summary

Database	Recall@10	p95 Latency	Cost/Month (2M vectors)
Qdrant (self-hosted)	98.2%	12ms	$150
Pinecone (managed)	97.8%	18ms	$420
Weaviate (self-hosted)	97.5%	22ms	$180

Recommendation

Qdrant wins on performance and cost for teams comfortable with self-hosting. Pinecone wins on operational simplicity. Weaviate’s multi-tenancy support is best for SaaS use cases.

Production LLM Inference with vLLM

A production postmortem on migrating from naive HuggingFace pipeline() inference to vLLM, and the roughly 3x latency improvement that came with it.

The Problem

Our Llama 3 8B model was serving at 800ms p95 with HuggingFace’s default inference. At 500 concurrent users, GPU utilization was only 40% — most time was spent in memory allocation and batch scheduling.

The Fix

vLLM’s PagedAttention — eliminated KV cache fragmentation, GPU memory utilization jumped to 90%+
Continuous batching — no more waiting for the slowest request in a batch
AWQ quantization — 4-bit quantized model with negligible quality loss, 2x throughput
Tensor parallelism — split model across 2x A10G for headroom
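The continuous batching point is easiest to see in a toy simulation. The sketch below is not vLLM's scheduler: it ignores prefill, KV cache limits, and arrival times, and simply counts decode iterations when each request needs a given number of output tokens. The request lengths and batch size are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its slowest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request immediately."""
    waiting = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or active:
        # Fill any free slots before running the next decode iteration.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now; the rest advance by one.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: short and long generations interleaved.
lengths = [2, 8, 2, 2, 8, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

On this workload the static scheduler needs 18 iterations and the continuous one 14, because short requests no longer hold a slot idle while a long neighbour finishes.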

Results

Metric	Before	After
p95 Latency	800ms	250ms
Throughput	15 req/s	48 req/s
GPU Utilization	40%	92%
Cost/1K requests	$0.12	$0.04
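As a quick check on the "3x" framing, the ratios implied by the table can be worked out directly. The numbers are copied from the table; only the variable names are new.

```python
before = {"p95_latency_ms": 800, "throughput_rps": 15, "cost_per_1k_usd": 0.12}
after = {"p95_latency_ms": 250, "throughput_rps": 48, "cost_per_1k_usd": 0.04}

# Lower is better for latency and cost, so divide before by after.
latency_speedup = before["p95_latency_ms"] / after["p95_latency_ms"]
cost_reduction = before["cost_per_1k_usd"] / after["cost_per_1k_usd"]

# Higher is better for throughput, so divide after by before.
throughput_gain = after["throughput_rps"] / before["throughput_rps"]

print(f"latency {latency_speedup:.1f}x, throughput {throughput_gain:.1f}x, cost {cost_reduction:.1f}x")
```

Latency and throughput both come out at 3.2x and cost at 3.0x, so the three rows tell a consistent story.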

Takeaway

vLLM is production-ready. The continuous batching alone is worth the migration. If you’re still using transformers.pipeline() for serving, you’re leaving 3x performance on the table.

Domain-Specific LLM Fine-Tuning with Unsloth

Open-weight LLMs fine-tuned for three enterprise use cases: legal document summarization, code review, and customer support triage.

Approach

Base Models — Llama 3 8B, Mistral 7B, Phi-3 Mini
Method — QLoRA (4-bit quantization + Low-Rank Adaptation) via Unsloth
Data — curated instruction datasets (5K-20K examples per domain)
Evaluation — custom benchmarks + human preference ranking
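Why LoRA is cheap comes down to arithmetic: the pretrained weight matrix stays frozen (stored in 4-bit under QLoRA) and only two small low-rank factors are trained. The sketch below counts parameters for one square projection at Llama 3 8B's hidden size of 4096. The rank of 16 is an assumption; the post does not state the rank used.

```python
def full_params(d_in, d_out):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


hidden = 4096  # hidden size of Llama 3 8B
rank = 16      # assumed LoRA rank, for illustration only

frozen = full_params(hidden, hidden)
trainable = lora_trainable_params(hidden, hidden, rank)
fraction = trainable / frozen

print(f"{trainable:,} trainable vs {frozen:,} frozen ({fraction:.2%})")
```

For this one matrix that is 131,072 trainable parameters against 16,777,216 frozen ones, under 1% of the original.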

Why Unsloth

Unsloth’s fused kernels and memory optimizations let us fine-tune 8B models on a single A100 in under 2 hours — compared to 5+ hours with vanilla PEFT. The 4-bit training path kept VRAM under 24GB.

Results

Model	Task	Accuracy	vs Base
Llama 3 8B	Legal Summarization	91.3%	+18.7%
Mistral 7B	Code Review	87.5%	+22.1%
Phi-3 Mini	Support Triage	94.0%	+15.3%

Deployment

Models exported to GGUF format for llama.cpp inference and served via vLLM behind a FastAPI gateway with streaming support.

Reproducible ML Pipelines with DVC

A deep-dive into using DVC (Data Version Control) for production ML workflows, going beyond basic file tracking to full pipeline orchestration.

What’s Covered

Dataset versioning — track large datasets in S3/GCS without bloating Git
Pipeline DAGs — define training pipelines as reproducible dvc.yaml stages
Experiment tracking — dvc exp for hyperparameter sweeps without branch pollution
CI integration — automated retraining triggers on data drift detection
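The drift-triggered retraining item could look something like the sketch below, which uses the Population Stability Index on a single numeric feature. This is an assumption for illustration: the post does not say which drift statistic is used, and the 0.2 threshold is only a common rule of thumb.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width for constant data

    def fractions(values):
        counts = [0] * bins
        for value in values:
            idx = int((value - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(count / len(values), eps) for count in counts]

    ref, new = fractions(expected), fractions(actual)
    return sum((n - r) * math.log(n / r) for r, n in zip(ref, new))


def should_retrain(expected, actual, threshold=0.2):
    """Hypothetical CI gate: retrain when drift exceeds the threshold."""
    return psi(expected, actual) > threshold


reference = [i / 100 for i in range(100)]        # training-time feature values
shifted = [0.5 + i / 200 for i in range(100)]    # production values, shifted upward
```

A CI job could run this check against each new data snapshot and call dvc repro only when `should_retrain` returns true.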

Key Insight

The biggest win from DVC isn’t version control but reproducibility. When a model degrades in production, you can git checkout the commit that produced the last good model, run dvc checkout to restore the matching data, and diff that exact data + code + params against the current state.

Code Examples

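# dvc.yaml: define a training pipeline
# ${lr} and ${epochs} are filled in from params.yaml, DVC's default params file
stages:
  prepare:
    cmd: python src/prepare.py
    # list the script itself so the stage reruns when the code changes
    deps: [data/raw, src/prepare.py]
    outs: [data/processed]
  train:
    cmd: python src/train.py --lr ${lr} --epochs ${epochs}
    deps: [data/processed, src/train.py]
    params: [lr, epochs]
    outs: [models/latest]
    metrics: [metrics.json]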

Who This Is For

ML engineers tired of model_v2_final_FINAL.pkl and data scientists who want git bisect for their training data.

embed-cache: Persistent Embedding Cache

A zero-config embedding cache that sits between your code and the embedding API. Every embedding is stored in a local SQLite database, so identical inputs return cached results instantly.

Why

During RAG development, you re-embed the same documents hundreds of times while iterating on chunking strategies, metadata, and retrieval logic. Each call costs money and adds latency.

Usage

from embed_cache import CachedEmbeddings

embedder = CachedEmbeddings(model="text-embedding-3-small")
vectors = embedder.embed(["document chunk 1", "document chunk 2"])
# Second call: instant, free
vectors = embedder.embed(["document chunk 1", "document chunk 2"])

Features

Drop-in replacement for OpenAI and Cohere embedding clients
SQLite backend — no infrastructure needed
Cache hit rate tracking and cost savings reporting
TTL support for cache invalidation
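To show how such a cache can work, here is a minimal sketch built on the standard library. It is not embed-cache's actual implementation: the class name, table schema, hashing scheme, and the fake embedding function are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3
import time


class SqliteEmbeddingCache:
    """Toy cache mapping (model, text) to a stored vector in SQLite."""

    def __init__(self, embed_fn, model, path=":memory:", ttl_seconds=None):
        self.embed_fn = embed_fn  # callable: list of str -> list of vectors
        self.model = model
        self.ttl = ttl_seconds
        self.hits = 0
        self.misses = 0
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, vector TEXT, created REAL)"
        )

    def _key(self, text):
        # Include the model name so different models never share an entry.
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def _lookup(self, key):
        row = self.db.execute(
            "SELECT vector, created FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        if self.ttl is not None and time.time() - row[1] > self.ttl:
            return None  # an expired entry is treated as a miss
        return json.loads(row[0])

    def embed(self, texts):
        keys = [self._key(text) for text in texts]
        found = {key: self._lookup(key) for key in keys}
        missing = [text for text, key in zip(texts, keys) if found[key] is None]
        if missing:
            # Only uncached texts reach the (paid) embedding function.
            for text, vector in zip(missing, self.embed_fn(missing)):
                key = self._key(text)
                found[key] = vector
                self.db.execute(
                    "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                    (key, json.dumps(vector), time.time()),
                )
            self.db.commit()
        self.hits += len(texts) - len(missing)
        self.misses += len(missing)
        return [found[key] for key in keys]


api_calls = []


def fake_embed(texts):
    """Stand-in for a real embedding API; records each call it receives."""
    api_calls.append(list(texts))
    return [[float(len(text)), 0.0] for text in texts]


cache = SqliteEmbeddingCache(fake_embed, model="demo-model")
first = cache.embed(["document chunk 1", "document chunk 2"])
second = cache.embed(["document chunk 1", "document chunk 2"])
```

The second call never reaches `fake_embed`, and the hit and miss counters are what a cost savings report would be built from.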

taskr is a CLI task runner born from frustration with slow CI builds in large monorepos.

Features

Parallel execution — runs independent tasks concurrently with configurable concurrency limits
Dependency graph — DAG-based task ordering, only runs what’s needed
Smart caching — content-addressable cache skips tasks when inputs haven’t changed
Simple config — YAML task definitions, no DSL to learn
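The dependency graph and caching features can be sketched in a few lines. The tool itself is written in Go; this Python version is only an illustration, and the function names, the task dictionary shape, and the exact hashing scheme are assumptions.

```python
import hashlib
from graphlib import TopologicalSorter


def execution_order(tasks, target):
    """Return the target plus its transitive dependencies, dependencies first."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(tasks[name].get("deps", []))
    # TopologicalSorter expects each node mapped to its predecessors.
    graph = {name: set(tasks[name].get("deps", [])) for name in needed}
    return list(TopologicalSorter(graph).static_order())


def cache_key(cmd, input_contents):
    """Hash the command and the bytes of every input file in a stable order."""
    digest = hashlib.sha256(cmd.encode("utf-8"))
    for path in sorted(input_contents):
        digest.update(b"\x00" + path.encode("utf-8") + b"\x00")
        digest.update(input_contents[path])
    return digest.hexdigest()


# Hypothetical task graph; "docs" is unrelated to "build" and must be skipped.
tasks = {
    "lint": {"cmd": "eslint src/"},
    "test": {"cmd": "pytest tests/", "deps": ["lint"]},
    "build": {"cmd": "docker build -t app .", "deps": ["test"]},
    "docs": {"cmd": "mkdocs build"},
}
order = execution_order(tasks, "build")

unchanged = cache_key("pytest tests/", {"tests/test_a.py": b"assert True"})
edited = cache_key("pytest tests/", {"tests/test_a.py": b"assert False"})
```

A task whose key matches a stored one can be skipped. Parallel execution would use the sorter's `prepare()`, `get_ready()`, and `done()` methods instead of `static_order()` to run independent tasks concurrently.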

Why Go

Single binary distribution, fast startup, excellent concurrency primitives. Users download one binary — no runtime dependencies.

Usage

# taskr.yaml
tasks:
  lint:
    cmd: eslint src/
    inputs: ["src/**/*.ts"]
  test:
    cmd: pytest tests/
    deps: [lint]
    inputs: ["src/**/*.py", "tests/**/*.py"]
  build:
    cmd: docker build -t app .
    deps: [test]

taskr run build  # runs lint → test → build, skips cached steps

