Why KServe?
The open-source standard for self-hosted AI, providing a unified platform for both Generative and Predictive AI inference on Kubernetes. Simple enough for quick deployments, yet powerful enough for the most demanding enterprise workloads.
🤖 Generative AI
🧮 Optimized Backends
vLLM and llm-d for high-performance LLM serving
📌 OpenAI Compatible
Drop-in OpenAI protocol for seamless LLM integration
🚅 GPU Acceleration
High-performance serving with optimized GPU memory management
💾 Model Caching
Reduce load times and improve latency for frequently used models
🗂️ KV Cache Offloading
Offload KV cache to CPU/disk for longer sequence handling
📈 Autoscaling
Request-based autoscaling tuned for generative workloads
🔧 Hugging Face Ready
Native Hugging Face model support with streamlined deployment
📊 Predictive AI
🧮 Multi-Framework
TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and more (see the example after this list)
🔀 Intelligent Routing
Seamless routing between predictor, transformer, and explainer
🔄 Advanced Deployments
Canary rollouts, pipelines, and ensembles with InferenceGraph
⚡ Autoscaling
Scale-to-zero for predictive workloads on any autoscaler
🔍 Model Explainability
Built-in feature attribution to understand prediction reasoning
📊 Advanced Monitoring
Payload logging, outlier, adversarial, and drift detection
💰 Cost Efficient
Scale-to-zero on expensive resources when not in use
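As a sketch of how framework selection works, the modelFormat field tells KServe which serving runtime to use; for example, a scikit-learn model could be deployed like this (the name and storage path are illustrative):
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"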
Simple and Powerful API
KServe provides a Kubernetes Custom Resource Definition for serving predictive and generative machine learning models. It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features to your ML deployments.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llm-service"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
      storageUri: "hf://meta-llama/Llama-3.1-8B-Instruct"
How KServe Works
Under the hood, KServe serves models from arbitrary frameworks through a few cooperating components that handle autoscaling, networking, health checking, and server configuration:

Control Plane
Manages the lifecycle of ML models, providing model revision tracking, canary rollouts, and A/B testing
Data Plane
Standardized inference protocol for model servers with request/response APIs, supporting both predictive and generative models
InferenceService
Core Kubernetes custom resource that simplifies ML model deployment with automatic scaling, networking, and health checks
Inference Graph
Enables advanced deployments with pipelines for pre/post processing, ensembles, and multi-model workflows
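For illustration, a minimal sequential InferenceGraph that chains two existing InferenceServices (the service names below are hypothetical) might look like this sketch:
apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "preprocess-then-predict"
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - name: preprocess
          serviceName: image-transformer    # hypothetical InferenceService
        - name: predict
          serviceName: image-classifier     # hypothetical InferenceService
          data: $response                   # forward the previous step's output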
Quick Start
Get started with KServe in minutes. Follow these simple steps to deploy your first model.
Install KServe
Install KServe on your Kubernetes cluster:
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.0/kserve.yaml
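This manifest installs the KServe CRDs and controller. To verify the controller is running (assuming the default kserve namespace):
kubectl get pods -n kserve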
Create an InferenceService
Deploy a pre-trained model with a simple YAML configuration:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        # extended resources such as nvidia.com/gpu need a matching limit
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
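Assuming the manifest above is saved as qwen-llm.yaml, apply it and check that the InferenceService reports ready:
kubectl apply -f qwen-llm.yaml
kubectl get inferenceservice qwen-llm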
Send Inference Requests
Send a chat completion request to the deployed model:
curl -v -H "Host: qwen-llm.default.example.com" \
  -H "Content-Type: application/json" \
  http://localhost:8080/openai/v1/chat/completions -d @./prompt.json
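The request body follows the OpenAI chat completions schema. A minimal prompt.json might look like the following sketch, where the model field is assumed to match the InferenceService name (the default served model name):
{
  "model": "qwen-llm",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why run LLMs on Kubernetes?"}
  ],
  "max_tokens": 128,
  "stream": false
}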
Trusted by Industry Leaders
KServe is used in production by organizations across various industries, providing reliable model inference at scale.
Ready to Transform Your ML Deployment?
Simplify your journey from model development to production with KServe's standardized inference platform for both predictive and generative AI models.