Hi, I'm @sandeep,

building production-grade AI systems

with Evals + LLMOps + HITL

AI Engineer · New Delhi, India

4+ years building everything from scratch.

From fine-tuning LLMs to shipping agentic workflows handling real-world traffic — I've owned the full lifecycle.

One day, I stopped chasing titles. I started chasing clarity.

What drives me doesn't fit on a resume.

Building systems that last.

12+ open-source contributions. 3 production ML pipelines. Millions of inference requests served.

Training, experiment tracking, inference at scale, and the tooling that holds it all together.

Bigger problems. Harder systems. End-to-end.

Ready for what's next.

Claude Code Power User
High-Agency · AI-Fluency

Building production systems with AI-assisted development — using Claude Code for architecture design, complex refactors, and shipping full-stack features from terminal to deployment.

›Built this entire portfolio site with Claude Code as AI pair programmer
›Custom MCP servers, multi-agent workflows, and agentic tool chains
›Deep expertise in prompt engineering and AI-native development patterns

Claude Code · Anthropic API

Active

Multi-Agent RAG System with LangGraph

Acme AI · Lead AI Engineer

Production agentic RAG system using LangGraph for multi-step reasoning over enterprise knowledge bases. Handles 10K+ queries/day with sub-2s latency.

10K+
Queries/Day

94.2%
Accuracy

<2s
Latency p95

50+
Knowledge Sources

Python · LangGraph · LangChain · Qdrant · GPT-4 · Claude +2

Architecture Doc
Maintained

Domain-Specific LLM Fine-Tuning with Unsloth

Acme AI · AI Engineer

Fine-tuned Llama 3 and Mistral models for domain-specific tasks using Unsloth + QLoRA, achieving 40% faster training with 60% less VRAM.

40%
Training Speedup

60%
VRAM Savings

3
Models Shipped

PyTorch · Unsloth · PEFT · QLoRA · Hugging Face · Weights & Biases +4

Training Notebooks
Active

HuggingFace Transformers — Core Contributions

Hugging Face · Core Contributor

Contributed model implementations and training optimizations to HuggingFace's Transformers library, used by 100K+ developers worldwide.

12
PRs Merged

2M+
Downloads Impacted

5
Models Touched

Python · PyTorch · Hugging Face Transformers · CUDA · Accelerate

Pull Requests · Documentation
Reproducible ML Pipelines with DVC

A practical guide to building reproducible, version-controlled ML data pipelines using DVC, from dataset versioning to automated retraining.

DVC · Python · S3 · Git · Make

Read Article
NeurIPS 2024 — Spotlight Poster Presentation

NeurIPS · Poster Presenter

Presented poster on efficient fine-tuning methods for domain-specific LLMs at NeurIPS 2024 in Vancouver.

16,000+
Attendees

Spotlight
Poster Session

12
Citations

PyTorch · LoRA · DeepSpeed · Weights & Biases

Conference Poster PDF
LangChain AI Agents Hackathon — 2nd Place

LangChain · Team Lead

Built an autonomous code review agent in 48 hours that analyzes PRs, suggests fixes, and auto-generates test cases. Won 2nd place out of 200+ teams.

2nd / 200+
Placement

48 hrs
Duration

3
Team Size

$5,000
Prize

LangGraph · GPT-4 · GitHub API · FastAPI · Redis · Docker

Project Repo · Demo Video
Building Active
Apr 2026

Agentic Pipelines, Code Gen & RAG Evaluation

Three active projects at the intersection of LLM applications — multi-agent document processing, domain-adapted code generation, and systematic RAG quality measurement.

#agents #fine-tuning #rag +1
Learning Active
Apr 2026

RLHF, Sparse MoE & Rust for Inference

Deepening expertise in three areas — alignment techniques for LLMs, sparse Mixture of Experts scaling, and systems-level inference serving with Rust.

#rlhf #moe #rust +1
Reading
Apr 2026

ML Systems Design, KV-Cache Research & Staff Engineering

Books and papers shaping how I think about production ML — system design principles, efficient long-context inference, and technical leadership beyond the IC track.

#books #papers #ml-systems
Project Active
Mar 2026

Multi-Agent Document Understanding

Building multi-agent systems that decompose complex documents into structured knowledge using specialized LLM agents for extraction, reasoning, and validation.

#agents #llm #rag +1
Talk
Mar 2026

Prompt Engineering for Production Systems
MLOps Community
Workshop on writing reliable, testable prompts for production LLM applications — covering structured outputs, guardrails, and prompt versioning.

#prompt-engineering #llm #workshop +2
Writing
Mar 2026

Deep Dive: RLHF & Alignment Techniques

Studying reinforcement learning from human feedback — from reward modeling to PPO and DPO, understanding how modern LLMs are aligned to human preferences.

#rlhf #alignment #llm +1

Writing
Feb 2026

Exploring Mixture of Experts Architectures

Research notes on Mixture of Experts — how sparse activation enables scaling model capacity without proportional compute, from Switch Transformer to Mixtral.

#moe #architecture #llm +1
OSS Active
Feb 2026

DeepAgents — Multi-Agent Orchestration Research

Contributing to DeepAgents, a framework for building hierarchical multi-agent systems with planning, tool use, and memory.

#multi-agent #llm #agents +2
Project Active
Feb 2026

Domain-Specific Code Generation with Llama 3

Fine-tuning Llama 3 on proprietary codebases for domain-specific code generation — internal APIs, conventions, and patterns the base model doesn't know.

#fine-tuning #llm #code-gen +1
Writing
Jan 2026

Vector Database Benchmarks — Qdrant vs Pinecone vs Weaviate

Comprehensive benchmark comparing vector databases for production RAG workloads — latency, recall, cost, and operational complexity.

#vector-db #rag #benchmarks +2

Writing Learning Rust for High-Performance Inference Jan 2026
OSS Open-Source RAG Evaluation Framework Active Jan 2026
Writing Production LLM Inference with vLLM Dec 2025
Talk Evaluating RAG Systems — Beyond Vibes at AI Engineer Summit Oct 2025
OSS embed-cache — Persistent Embedding Cache Active Jul 2025
Project ML Experiment Tracking Platform with MLflow at DataCorp In Production Jun 2025
Talk Open Source Mentorship — First-Time Contributors Program at Local Dev Community May 2025
Cert AWS Machine Learning — Specialty at Amazon Web Services May 2025
Project LLM Monitoring Dashboard with W&B at DataCorp Apr 2025
Writing Migrating 50 Services to Kubernetes — A Retrospective Mar 2025
Project Real-Time Feature Store Architecture at DataCorp Feb 2025
Meetup MLOps Community Meetup — Speaker & Organizer at MLOps Community Nov 2024
OSS taskr — Developer Task Runner CLI Active Nov 2024
Cert Certified Kubernetes Application Developer (CKAD) at CNCF Sep 2024
Project SaaS Analytics Dashboard — Full-Stack Build In Production Jul 2024
OSS terraform-modules — Reusable Cloud Infrastructure Maintained Apr 2024
Role AI Engineer at Acme AI Jan 2024
Project API Gateway Redesign — From Monolith to Microservices at TechStart Jan 2024

Open Source

huggingface/transformers · Contributor

Added efficient batch decoding for streaming inference pipelines.

Python · 120K · 24K

vllm-project/vllm · Contributor

Implemented custom sampling strategies for domain-specific generation.

Python · 35K · 5.2K

sandeepyadav1478/ml-pipeline-kit · Author

Opinionated ML pipeline toolkit for rapid experimentation and deployment.

Python · 1.2K · 180

Featured Models

sandeepyadav1478/llama3-medical-qa

Llama 3 fine-tuned on medical QA datasets for clinical decision support.

Question Answering · 5K+

sandeepyadav1478/code-reviewer-7b

7B parameter model fine-tuned for automated code review and suggestions.

Code Generation · 2K+

Work Experience

LLM Application Development

RAG pipelines, prompt engineering, multi-model orchestration, production inference

Model Fine-Tuning & Training

LoRA/QLoRA, domain adaptation, dataset curation, distributed training

Agentic Workflows

Multi-agent systems, tool use, HITL handoff, orchestration with LangGraph

ML Infrastructure & MLOps

Feature stores, experiment tracking, model registry, CI/CD for ML

Production Inference

vLLM, quantization, batching strategies, latency optimization, autoscaling

Technical Leadership

Architecture design, code review, mentoring, cross-team collaboration

AI Engineer

Acme AI · Jan 2024 – Present

Current

Building LLM-powered applications and agentic workflows. Deploying inference pipelines on AWS.

Multi-Agent Document Understanding

Extracting structured data from unstructured documents using specialized LLM agents.

LangGraph · GPT-4 · FastAPI

Shipped 3 production LLM apps serving 100K+ daily users

Reduced inference latency by 40% with custom vLLM deployment

PyTorch · vLLM · LangGraph · AWS SageMaker

ML Engineer

DataCorp · Mar 2022 – Dec 2023

Designed ML pipelines and built real-time feature stores serving 50M+ predictions/day.

Built real-time feature store serving 50M+ predictions/day

Reduced model training time by 60% with distributed training

MLflow · DVC · Ray · Kubernetes

Software Engineer

TechStart · Jun 2020 – Feb 2022

Full-stack development with Python and React. Led migration of monolith to microservices on Kubernetes.

Led monolith to microservices migration on Kubernetes

Built CI/CD pipelines reducing deploy time from hours to minutes

Python · React · Docker · K8s

Trusted By

Google

HuggingFace

NVIDIA

Skills & Stack

Soft Skills

Technical Writing · System Design · Team Leadership · Mentoring · Cross-functional Collaboration · Agile / Scrum

Spoken Languages

Tech Stack

ML / AI

PyTorch · HuggingFace · LangChain · LangGraph · Unsloth · vLLM · ONNX · LoRA / QLoRA · RAG · Agents

MLOps & Data

MLflow · DVC · Weights & Biases · Ray · Airflow · Kubeflow · Feature Stores · Vector DBs

Programming

Python · TypeScript · Go · SQL · Bash · C++

Infra & Cloud

Docker · Kubernetes · AWS · GCP · Terraform · GitHub Actions · FastAPI · gRPC

Education & Certifications

Education

M.S. Computer Science

2020

Stanford University

Focus on Machine Learning and Natural Language Processing.

B.Tech. Computer Science

2018

IIT Delhi

Graduated with honors. Thesis on deep learning for medical imaging.

AI Product Management Bootcamp

2024

Maven

Led by Dr. Marily Nika (ex-Google PM). Completed capstone project.

Awards & Certifications

Best AI Application — HuggingFace Hackathon

2024

HuggingFace

Built a multi-agent document understanding pipeline in 48 hours.

Top 10 Open Source Contributors

2023

GitHub

Recognized for sustained contributions to ML ecosystem projects.

Outstanding Graduate Thesis Award

2018

IIT Delhi

Publications

Efficient Multi-Agent Architectures for Document Understanding
Sandeep Yadav, Alice Park, Bob Liu
arXiv preprint, 2024
Scaling Retrieval-Augmented Generation for Enterprise Knowledge Bases
Sandeep Yadav, Carol Zhang
NeurIPS Workshop, 2023
Low-Rank Adaptation Strategies for Domain-Specific LLMs
Alice Park, Sandeep Yadav, David Kim
EMNLP, 2023

Speaking & Appearances

Presentation

Building Reliable LLM Applications in Production

AI Engineer Summit 2024 · Oct 2024

Presentation

Fine-Tuning at Scale: Lessons from the Trenches

MLOps Community Meetup · Jul 2024

Podcast

The Practical Guide to RAG Systems

The ML Podcast · May 2024

Resources & Roadmaps

Curated paths and guides I maintain for the community.

Getting Started with LLMs

LLM Fundamentals

From transformers to RLHF — the essential building blocks.

Prompt Engineering Guide

Systematic techniques for reliable LLM outputs.

Fine-Tuning Playbook

When, why, and how to fine-tune open-weight models.

ML Engineering in Production

ML System Design

Patterns for building maintainable ML-powered products.

Inference Optimization

Quantization, batching, and serving at scale.

Monitoring & Evaluation

Keeping models honest after deployment.

Curated Lists

People to Follow

Essential Tools

What People Say

One of the most thoughtful engineers I've worked with. Takes complex ML problems and delivers clean, production-ready solutions.
Jane Smith, Engineering Manager, Acme AI

Their open-source contributions to our inference pipeline saved us weeks of work. Clear code, excellent documentation.
Alex Chen, Staff Engineer, DataCorp

Rare combination of deep ML knowledge and strong engineering fundamentals. Ships reliable systems, not just notebooks.
Sam Patel, CTO, TechStart

Frequently Asked Questions

Are you open to freelance or consulting work?

Yes — I take on select projects involving LLM applications, ML infrastructure, and AI strategy. Reach out via email to discuss.

What's your tech stack for most projects?

Python + PyTorch for ML, HuggingFace for models, FastAPI for serving, Docker + K8s for deployment, and AWS for cloud infrastructure.

Do you contribute to open source?

Actively. I contribute to HuggingFace Transformers, vLLM, and maintain a few of my own tools. Check the Open Source section above.

How do I book time with you?

Use the Calendly link on the contact page, or send me an email. I typically respond within 48 hours.

Vector Database Benchmarks: Qdrant vs Pinecone vs Weaviate

I ran a production-realistic benchmark of three of the most popular vector databases for RAG workloads, using real embedding distributions from enterprise documents rather than synthetic data.

Methodology

Dataset — 2M embeddings from real enterprise documents (1536-dim, OpenAI ada-002)
Queries — 10K real user queries from production RAG system
Metrics — recall@10, p95 latency, cost/1M queries, operational burden
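To make the two headline metrics concrete, here is a minimal sketch of how recall@10 and p95 latency can be computed. It assumes recall is scored against a known set of relevant (or exact nearest-neighbor) IDs per query and uses the nearest-rank percentile definition. The function names are illustrative and are not taken from the benchmark harness.

```python
import math


def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(per_query_recall, latencies_ms):
    """Aggregate per-query measurements into the two reported numbers."""
    return {
        "recall@10": sum(per_query_recall) / len(per_query_recall),
        "p95_latency_ms": percentile(latencies_ms, 95),
    }
```

Averaging recall per query, rather than pooling all hits, keeps queries with many relevant documents from dominating the score.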

Results Summary

Database	Recall@10	p95 Latency	Cost/Month (2M vectors)
Qdrant (self-hosted)	98.2%	12ms	$150
Pinecone (managed)	97.8%	18ms	$420
Weaviate (self-hosted)	97.5%	22ms	$180

Recommendation

Qdrant wins on performance and cost for teams comfortable with self-hosting. Pinecone wins on operational simplicity. Weaviate’s multi-tenancy support is best for SaaS use cases.

Production LLM Inference with vLLM

A production postmortem on migrating from naive Hugging Face pipeline() inference to vLLM, and the roughly 3x latency improvement (800ms to 250ms p95) that came with it.

The Problem

Our Llama 3 8B model was serving at 800ms p95 with HuggingFace’s default inference. At 500 concurrent users, GPU utilization was only 40% — most time was spent in memory allocation and batch scheduling.

The Fix

vLLM’s PagedAttention — eliminated KV cache fragmentation, GPU memory utilization jumped to 90%+
Continuous batching — no more waiting for the slowest request in a batch
AWQ quantization — 4-bit quantized model with negligible quality loss, 2x throughput
Tensor parallelism — split model across 2x A10G for headroom
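The continuous batching point is easiest to see with a toy model. The sketch below counts decode steps for a queue of requests under static batching (each batch waits for its longest request) and under continuous batching (a freed slot is refilled immediately). It is a simplification that ignores prefill, memory limits, and scheduling overhead. It is not vLLM's scheduler, and the request lengths are made up.

```python
import heapq


def static_batching_steps(lengths, batch_size):
    """Total decode steps when every batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a finished request's slot is refilled right away."""
    finish_times = []  # min-heap holding the finish time of each busy slot
    for n in lengths:
        if len(finish_times) < batch_size:
            start = 0  # a slot is still free at time zero
        else:
            start = heapq.heappop(finish_times)  # wait for the earliest slot to free up
        heapq.heappush(finish_times, start + n)
    return max(finish_times) if finish_times else 0


# Hypothetical workload: mostly short generations with a few long ones mixed in.
request_lengths = [10, 200, 10, 200, 10, 10, 10, 10]
static_total = static_batching_steps(request_lengths, batch_size=2)
continuous_total = continuous_batching_steps(request_lengths, batch_size=2)
print(static_total, continuous_total)
```

With this workload the short requests no longer sit idle behind the 200-step ones, which is the effect described in the list above.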

Results

Metric	Before	After
p95 Latency	800ms	250ms
Throughput	15 req/s	48 req/s
GPU Utilization	40%	92%
Cost/1K requests	$0.12	$0.04

Takeaway

vLLM is production-ready. The continuous batching alone is worth the migration. If you’re still using transformers.pipeline() for serving, you’re leaving 3x performance on the table.
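For readers who want a starting point, here is a hedged sketch of an offline vLLM setup that combines the pieces listed under The Fix: AWQ quantization, tensor parallelism across two GPUs, and a high GPU memory target. The checkpoint name is a placeholder and the sampling values are assumptions, not the settings from this postmortem. Running `generate` requires vLLM to be installed and two GPUs to be available.

```python
def build_engine_kwargs(model_id, num_gpus=2, gpu_memory_utilization=0.90):
    """Collect the engine options discussed in the postmortem in one place."""
    return {
        "model": model_id,
        "quantization": "awq",              # load a 4-bit AWQ checkpoint
        "tensor_parallel_size": num_gpus,   # shard the model across GPUs
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def generate(prompts, engine_kwargs, max_tokens=256):
    """Run a batch of prompts; vLLM schedules them with continuous batching."""
    # Imported lazily so the config helper works on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)  # assumed values
    return [result.outputs[0].text for result in llm.generate(prompts, params)]


# "your-org/llama-3-8b-awq" is a hypothetical checkpoint name, not a real repository.
ENGINE_KWARGS = build_engine_kwargs("your-org/llama-3-8b-awq")
```

For online serving, vLLM also ships an OpenAI-compatible HTTP server that accepts the same engine options as command-line flags.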

Domain-Specific LLM Fine-Tuning with Unsloth

Fine-tuned open-weight LLMs for three enterprise use cases: legal document summarization, code review, and customer support triage.

Approach

Base Models — Llama 3 8B, Mistral 7B, Phi-3 Mini
Method — QLoRA (4-bit quantization + Low-Rank Adaptation) via Unsloth
Data — curated instruction datasets (5K-20K examples per domain)
Evaluation — custom benchmarks + human preference ranking

Why Unsloth

Unsloth’s fused kernels and memory optimizations let us fine-tune 8B models on a single A100 in under 2 hours — compared to 5+ hours with vanilla PEFT. The 4-bit training path kept VRAM under 24GB.
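A rough back-of-the-envelope calculation helps explain why QLoRA fits comfortably on one GPU. The sketch below counts the trainable parameters LoRA adds to a single weight matrix and estimates the memory taken by the frozen base weights at different precisions. The hidden size of 4096 matches Llama 3 8B; the rank of 16 and the round figure of 8 billion parameters are assumptions for illustration, since the rank actually used is not stated here.

```python
def lora_trainable_params(d_out, d_in, rank):
    """LoRA replaces a d_out x d_in update with two factors: (d_out x r) and (r x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gib(n_params, bits_per_param):
    """Memory for the stored weights alone, ignoring activations and optimizer state."""
    return n_params * bits_per_param / 8 / 2**30


hidden = 4096   # Llama 3 8B hidden size
rank = 16       # assumed LoRA rank

full = hidden * hidden
adapter = lora_trainable_params(hidden, hidden, rank)
print(f"one 4096x4096 projection: {full:,} frozen vs {adapter:,} trainable "
      f"({adapter / full:.2%})")

n_params = 8_000_000_000  # rounded parameter count
print(f"base weights at 16-bit: {weight_memory_gib(n_params, 16):.1f} GiB")
print(f"base weights at 4-bit:  {weight_memory_gib(n_params, 4):.1f} GiB")
```

The 4-bit base weights take only a few GiB, which leaves most of the 24GB budget mentioned above for activations, adapter gradients, and optimizer state.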

Results

Model	Task	Accuracy	vs Base
Llama 3 8B	Legal Summarization	91.3%	+18.7%
Mistral 7B	Code Review	87.5%	+22.1%
Phi-3 Mini	Support Triage	94.0%	+15.3%

Deployment

Models are exported to GGUF format for llama.cpp inference and served via vLLM behind a FastAPI gateway with streaming support.

Reproducible ML Pipelines with DVC

A deep dive into using DVC (Data Version Control) for production ML workflows, going beyond basic file tracking to full pipeline orchestration.

What’s Covered

Dataset versioning — track large datasets in S3/GCS without bloating Git
Pipeline DAGs — define training pipelines as reproducible dvc.yaml stages
Experiment tracking — dvc exp for hyperparameter sweeps without branch pollution
CI integration — automated retraining triggers on data drift detection
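As an illustration of the experiment-tracking item, here is a small sketch that queues a hyperparameter grid with `dvc exp run --queue` and then starts the queue. It assumes a DVC project whose params.yaml defines `lr` and `epochs`; the grid values are made up. Only `run_sweep` touches the DVC CLI, so the command builder can be inspected on its own.

```python
import itertools
import subprocess


def sweep_commands(grid):
    """Build one `dvc exp run --queue` command per combination in the grid."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


def run_sweep(grid, jobs=1):
    """Queue every combination, then let DVC work through the queue."""
    for cmd in sweep_commands(grid):
        subprocess.run(cmd, check=True)
    subprocess.run(["dvc", "queue", "start", "--jobs", str(jobs)], check=True)


# Hypothetical grid; results can be compared afterwards with `dvc exp show`.
example_grid = {"lr": [0.001, 0.01], "epochs": [5, 10]}
example_commands = sweep_commands(example_grid)
```

Because queued experiments live outside regular branches, the sweep leaves the Git history untouched until a winning run is applied.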

Key Insight

The biggest win from DVC isn’t version control, it’s reproducibility. When a model degrades in production, you can git checkout the commit that produced the last good model, run dvc checkout to restore the matching data and model files, and diff against the current state.
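A rollback investigation along these lines might look like the sketch below. It first compares the current workspace against a known-good Git revision, then restores that revision's code, data, and model files. The tag name is hypothetical, and the helpers only assemble and run standard Git and DVC commands.

```python
import subprocess


def compare_commands(good_rev):
    """Diff parameters and metrics between a known-good revision and the workspace."""
    return [
        ["dvc", "params", "diff", good_rev],
        ["dvc", "metrics", "diff", good_rev],
    ]


def restore_commands(good_rev):
    """Check out the code and DVC metadata, then sync the tracked data and models."""
    return [
        ["git", "checkout", good_rev],
        ["dvc", "checkout"],
    ]


def run_all(commands):
    """Run each command in order and stop on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# "last-good-model" is a hypothetical Git tag used only for illustration.
GOOD_REV = "last-good-model"
plan = compare_commands(GOOD_REV) + restore_commands(GOOD_REV)
```

Running the diffs before the restore matters: once the old revision is checked out, the workspace and that revision are identical and the diffs come back empty.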

Code Examples

# dvc.yaml: define a two-stage training pipeline
# ${lr} and ${epochs} are interpolated from params.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    # Track the script too, so editing it re-runs the stage
    deps: [data/raw, src/prepare.py]
    outs: [data/processed]
  train:
    cmd: python src/train.py --lr ${lr} --epochs ${epochs}
    deps: [data/processed, src/train.py]
    params: [lr, epochs]
    outs: [models/latest]
    metrics: [metrics.json]

Who This Is For

ML engineers tired of model_v2_final_FINAL.pkl and data scientists who want git bisect for their training data.

embed-cache: Persistent Embedding Cache

A zero-config embedding cache that sits between your code and the embedding API. Every embedding is stored in a local SQLite database, so identical inputs return cached results instantly.

Why

During RAG development, you re-embed the same documents hundreds of times while iterating on chunking strategies, metadata, and retrieval logic. Each call costs money and adds latency.

Usage

from embed_cache import CachedEmbeddings

embedder = CachedEmbeddings(model="text-embedding-3-small")
vectors = embedder.embed(["document chunk 1", "document chunk 2"])
# Second call: instant, free
vectors = embedder.embed(["document chunk 1", "document chunk 2"])

Features

Drop-in replacement for OpenAI and Cohere embedding clients
SQLite backend — no infrastructure needed
Cache hit rate tracking and cost savings reporting
TTL support for cache invalidation
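The core idea can be sketched in a few dozen lines. The class below is an illustrative stand-in, not embed-cache's actual implementation: it keys each text by a hash of the model name plus the content, stores vectors as JSON in SQLite, and honors an optional TTL. The `embed_fn` argument and the fake embedder in the demo are assumptions made so the example runs without an API key.

```python
import hashlib
import json
import sqlite3
import time


class TinyEmbeddingCache:
    """Minimal SQLite-backed cache in front of any batch embedding function."""

    def __init__(self, embed_fn, model, path=":memory:", ttl_seconds=None):
        self.embed_fn = embed_fn  # callable: list[str] -> list[list[float]]
        self.model = model
        self.ttl = ttl_seconds
        self.hits = 0
        self.misses = 0
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, vector TEXT NOT NULL, created REAL NOT NULL)"
        )

    def _key(self, text):
        # Include the model name so switching models never returns stale vectors.
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def _lookup(self, key):
        row = self.db.execute(
            "SELECT vector, created FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        if self.ttl is not None and time.time() - row[1] > self.ttl:
            return None  # expired entries are treated as misses
        return json.loads(row[0])

    def embed(self, texts):
        keys = [self._key(t) for t in texts]
        vectors, to_fetch = {}, {}
        for key, text in zip(keys, texts):
            if key in vectors or key in to_fetch:
                continue  # duplicate input within this call
            cached = self._lookup(key)
            if cached is None:
                to_fetch[key] = text
            else:
                vectors[key] = cached
        self.hits += len(vectors)
        self.misses += len(to_fetch)
        if to_fetch:
            fresh = self.embed_fn(list(to_fetch.values()))
            now = time.time()
            for key, vec in zip(to_fetch, fresh):
                vectors[key] = list(vec)
                self.db.execute(
                    "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                    (key, json.dumps(vectors[key]), now),
                )
            self.db.commit()
        return [vectors[k] for k in keys]


def fake_embed(texts):
    """Stand-in for a real embedding API call; returns one-dimensional vectors."""
    return [[float(len(t))] for t in texts]


cache = TinyEmbeddingCache(fake_embed, model="demo-model")
cache.embed(["document chunk 1", "document chunk 2"])  # two misses, one upstream call
cache.embed(["document chunk 1", "document chunk 2"])  # two hits, no upstream call
```

Hashing the model name together with the text is the design choice that makes the cache safe to share across experiments with different embedding models.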

taskr: Developer Task Runner CLI

A CLI tool born from frustration with slow CI builds in large monorepos.

Features

Parallel execution — runs independent tasks concurrently with configurable concurrency limits
Dependency graph — DAG-based task ordering, only runs what’s needed
Smart caching — content-addressable cache skips tasks when inputs haven’t changed
Simple config — YAML task definitions, no DSL to learn

Why Go

Single binary distribution, fast startup, excellent concurrency primitives. Users download one binary — no runtime dependencies.

Usage

# taskr.yaml
tasks:
  lint:
    cmd: eslint src/
    inputs: ["src/**/*.ts"]
  test:
    cmd: pytest tests/
    deps: [lint]
    inputs: ["src/**/*.py", "tests/**/*.py"]
  build:
    cmd: docker build -t app .
    deps: [test]

taskr run build  # runs lint → test → build, skips cached steps

ML Systems Design, KV-Cache Research & Staff Engineering

Designing Machine Learning Systems — Chip Huyen

KV-Cache Optimization Papers