# Welcome to KServe
Deploy and scale AI models effortlessly – from cutting-edge generative AI and large language models to traditional ML models – with enterprise-grade reliability across any cloud or on-premises environment.
## Why KServe?
KServe eliminates the complexity of productionizing AI models. Whether you're a data scientist, DevOps engineer, or platform architect, KServe provides a unified solution that works across clouds and scales with your needs.
- Deploy GenAI services and ML models with simple YAML – no complex infrastructure setup required.
- Run anywhere: AWS, Azure, GCP, on-premises, or hybrid environments with consistent behavior.
- Scale to zero when idle, handle traffic spikes automatically, and manage hundreds of models efficiently.
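The "simple YAML" deployment model can be seen in a minimal InferenceService; the example below uses the scikit-learn iris sample model from the KServe documentation:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Sample model published for the KServe quickstart
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Applying this single manifest is enough to get a routable, autoscaled prediction endpoint.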
## Key Benefits
**Generative AI**
| Feature | Description |
|---|---|
| LLM Multi-framework | Deploy LLMs from Hugging Face, vLLM, and custom generative models |
| OpenAI-Compatible APIs | Chat completion, streaming, and embedding endpoints out of the box |
| LocalModelCache | Reduce LLM startup time from 15–20 minutes to ~1 minute |
| KV Cache Offloading | Optimized memory management for long conversations and large contexts |
| Multi-node Inference | Distributed LLM serving across multiple nodes |
| Envoy AI Gateway | Enterprise-grade API management and routing for AI workloads |
| Metric-based Autoscaling | Scale on token throughput, queue depth, and GPU utilization |
| Canary Deployments | A/B testing and canary rollouts for LLM experiments |
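As a sketch of the Hugging Face LLM support listed above, a deployment might look like the following; the service name, model id, and resource values are illustrative, not prescriptive:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llm        # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface      # Hugging Face serving runtime
      args:
        - --model_name=llm
        - --model_id=Qwen/Qwen2.5-0.5B-Instruct  # assumption: any HF model id works here
      resources:
        limits:
          nvidia.com/gpu: "1"  # illustrative GPU request
```

Once running, the service exposes OpenAI-compatible chat completion and embedding endpoints as described in the table.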
**Predictive AI**

| Feature | Description |
|---|---|
| Multi-framework Serving | TensorFlow, PyTorch, Scikit-Learn, XGBoost, ONNX, and more |
| InferenceGraph | Chain and ensemble multiple models for complex workflows |
| Batch Prediction | Efficient large-dataset processing with batch inference |
| Pre/Post Processing | Built-in data transformation pipelines and feature engineering |
| Real-time Scoring | Low-latency prediction serving for real-time applications |
| ML Monitoring | Drift detection, outlier detection, and explainability |
| Standard Protocols | Open Inference Protocol (V1/V2) support across frameworks |
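An Open Inference Protocol (V2) prediction request is a small JSON document POSTed to `/v2/models/<name>/infer`; the body below (rendered as YAML for readability) shows its shape, with illustrative tensor values for a four-feature model:

```yaml
# Open Inference Protocol (V2) request body, normally sent as JSON
# to POST /v2/models/<model-name>/infer
inputs:
  - name: input-0        # tensor name (model-specific)
    shape: [1, 4]        # one row of four features
    datatype: FP32
    data: [6.8, 2.8, 4.8, 1.4]
```

The same request shape works across frameworks, which is what makes the protocol "standard" in the table above.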
**Universal**

| Feature | Description |
|---|---|
| Serverless Inference | Automatic scaling including scale-to-zero on CPU and GPU |
| High Scalability | Intelligent routing and density packing using ModelMesh |
| Enterprise Operations | Production monitoring, logging, and observability out of the box |
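Scale-to-zero and bounded scale-out are configured directly on the predictor; a minimal sketch (the replica counts are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0   # scale to zero when idle (serverless mode)
    maxReplicas: 5   # cap scale-out under traffic spikes
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```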
## Architecture Overview
KServe consists of two main planes:
**Control Plane**

- InferenceService CRD – Manages the model serving lifecycle
- InferenceGraph CRD – Orchestrates model ensembles and chaining
- Serving Runtime – Pluggable model runtime implementations
- ClusterServingRuntime – Cluster-wide model runtime definitions
- LocalModelCache CRD – Caches large models locally for fast startup
- Model Storage – S3, GCS, Azure, HuggingFace, PVC, and more
**Data Plane**

- Predictor – Serves model predictions
- Transformer – Pre/post processing logic
- Explainer – Model explanations and interpretability
KServe extends Kubernetes with custom resources for AI/ML workloads, handling load balancing, autoscaling, canary deployments, and monitoring automatically. Pluggable runtimes let you use the best engine per model type: vLLM for LLMs, TorchServe for PyTorch, or custom containers.
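The data-plane components map directly onto the InferenceService spec; a sketch pairing a predictor with a custom transformer (the container image and storage URI below are hypothetical placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: image-classifier
spec:
  transformer:                 # pre/post processing step in the data plane
    containers:
      - name: kserve-container
        image: example.com/my-image-transformer:latest  # hypothetical image
  predictor:                   # serves the actual model predictions
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://example-bucket/models/classifier  # hypothetical path
```

Requests flow through the transformer for preprocessing, then to the predictor, and back through the transformer for postprocessing.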
## Supported Frameworks
- Predictive
- Generative AI
- Multi-Framework
## Get Started
- **Serve an LLM** – Deploy an LLM using InferenceService with Qwen
- **Serve a Predictive Model** – Deploy a scikit-learn model using InferenceService
- **Installation Guide** – Set up KServe on your Kubernetes cluster
- **Core Concepts** – Learn about serving patterns, control plane, and data plane
Learning path: Tutorial → Core concepts → Production setup → API reference
## Community & Support
| Channel | Link |
|---|---|
| GitHub | github.com/kserve/kserve – issues, PRs, releases |
| Slack | CNCF Slack #kserve – questions and discussion |
| Community Meetings | Monthly calendar – open to all |
| Adopters | See who's using KServe |