# Welcome to KServe
Deploy and scale AI models effortlessly – from cutting-edge generative AI and large language models to traditional ML models – with enterprise-grade reliability across any cloud or on-premises environment.
## Why KServe?
KServe eliminates the complexity of productionizing AI models. Whether you're a data scientist, DevOps engineer, or platform architect, KServe provides a unified solution that works across clouds and scales with your needs.
- Deploy GenAI services and ML models with simple YAML – no complex infrastructure setup required.
- Run anywhere: AWS, Azure, GCP, on-premises, or hybrid environments with consistent behavior.
- Scale to zero when idle, handle traffic spikes automatically, and manage hundreds of models efficiently.
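The "simple YAML" deployment model can be seen in a minimal InferenceService; the example below uses the scikit-learn iris sample model from the KServe documentation:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Sample model published for the KServe quickstart
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Applying this single manifest is enough to get a routable, autoscaled prediction endpoint.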
## Key Benefits
**Generative AI**
| Feature | Description |
|---|---|
| LLM Multi-framework | Deploy LLMs from Hugging Face, vLLM, and custom generative models |
| OpenAI-Compatible APIs | Chat completion, streaming, and embedding endpoints out of the box |
| LocalModelCache | Reduce LLM startup time from 15–20 minutes to ~1 minute |
| KV Cache Offloading | Optimized memory management for long conversations and large contexts |
| Multi-node Inference | Distributed LLM serving across multiple nodes |
| Envoy AI Gateway | Enterprise-grade API management and routing for AI workloads |
| Metric-based Autoscaling | Scale on token throughput, queue depth, and GPU utilization |
| Canary Deployments | A/B testing and canary rollouts for LLM experiments |
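As a sketch of the Hugging Face LLM support listed above, a deployment might look like the following; the service name, model id, and resource values are illustrative, not prescriptive:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llm        # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface      # Hugging Face serving runtime
      args:
        - --model_name=llm
        - --model_id=Qwen/Qwen2.5-0.5B-Instruct  # assumption: any HF model id works here
      resources:
        limits:
          nvidia.com/gpu: "1"  # illustrative GPU request
```

Once running, the service exposes OpenAI-compatible chat completion and embedding endpoints as described in the table.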
**Predictive AI**

| Feature | Description |
|---|---|
| Multi-framework Serving | TensorFlow, PyTorch, Scikit-Learn, XGBoost, ONNX, and more |
| InferenceGraph | Chain and ensemble multiple models for complex workflows |
| Batch Prediction | Efficient large-dataset processing with batch inference |
| Pre/Post Processing | Built-in data transformation pipelines and feature engineering |
| Real-time Scoring | Low-latency prediction serving for real-time applications |
| ML Monitoring | Drift detection, outlier detection, and explainability |
| Standard Protocols | Open Inference Protocol (V1/V2) support across frameworks |
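An Open Inference Protocol (V2) prediction request is a small JSON document POSTed to `/v2/models/<name>/infer`; the body below (rendered as YAML for readability) shows its shape, with illustrative tensor values for a four-feature model:

```yaml
# Open Inference Protocol (V2) request body, normally sent as JSON
# to POST /v2/models/<model-name>/infer
inputs:
  - name: input-0        # tensor name (model-specific)
    shape: [1, 4]        # one row of four features
    datatype: FP32
    data: [6.8, 2.8, 4.8, 1.4]
```

The same request shape works across frameworks, which is what makes the protocol "standard" in the table above.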
**Universal**

| Feature | Description |
|---|---|
| Serverless Inference | Automatic scaling including scale-to-zero on CPU and GPU |
| High Scalability | Intelligent routing and density packing using ModelMesh |
| Enterprise Operations | Production monitoring, logging, and observability out of the box |
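Scale-to-zero and bounded scale-out are configured directly on the predictor; a minimal sketch (the replica counts are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0   # scale to zero when idle (serverless mode)
    maxReplicas: 5   # cap scale-out under traffic spikes
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```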
## Architecture Overview
KServe consists of two main planes:
**Control Plane**

- InferenceService CRD – Manages the model serving lifecycle
- InferenceGraph CRD – Orchestrates model ensembles and chaining
- Serving Runtime – Pluggable model runtime implementations
- ClusterServingRuntime – Cluster-wide model runtime definitions
- LocalModelCache CRD – Caches large models locally for fast startup
- Model Storage – S3, GCS, Azure, HuggingFace, PVC, and more
**Data Plane**

- Predictor – Serves model predictions
- Transformer – Pre/post processing logic
- Explainer – Model explanations and interpretability
KServe extends Kubernetes with custom resources for AI/ML workloads, handling load balancing, autoscaling, canary deployments, and monitoring automatically. Pluggable runtimes let you use the best engine per model type: vLLM for LLMs, TorchServe for PyTorch, or custom containers.
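The data-plane components map directly onto the InferenceService spec; a sketch pairing a predictor with a custom transformer (the container image and storage URI below are hypothetical placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: image-classifier
spec:
  transformer:                 # pre/post processing step in the data plane
    containers:
      - name: kserve-container
        image: example.com/my-image-transformer:latest  # hypothetical image
  predictor:                   # serves the actual model predictions
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://example-bucket/models/classifier  # hypothetical path
```

Requests flow through the transformer for preprocessing, then to the predictor, and back through the transformer for postprocessing.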
## Supported Frameworks
- Predictive
- Generative AI
- Multi-Framework
## Get Started
- **Serve an LLM** – Deploy an LLM using InferenceService with Qwen
- **Serve a Predictive Model** – Deploy a scikit-learn model using InferenceService
- **Installation Guide** – Set up KServe on your Kubernetes cluster
- **Core Concepts** – Learn about serving patterns, control plane, and data plane
Learning path: Tutorial → Core concepts → Production setup → API reference
## Community & Support
| Channel | Link |
|---|---|
| GitHub | github.com/kserve/kserve – issues, PRs, releases |
| Slack | CNCF Slack #kserve – questions and discussion |
| Community Meetings | Monthly calendar – open to all |
| Adopters | See who's using KServe |