# Performance and sizing
This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.
## Resource requirements

### Baseline resources
Minimal deployment (development/testing):
- CPU: 100m (0.1 cores)
- Memory: 128Mi
Production deployment (recommended):
- CPU: 500m (0.5 cores)
- Memory: 512Mi
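These baselines map onto the container resources in the VirtualMCPServer pod template, using the same `podTemplateSpec` field shown later in this guide for high request volumes. A minimal sketch of just the resource-related fields (the rest of your VirtualMCPServer spec stays as-is; the limit values are illustrative assumptions):

```yaml
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m      # production baseline from above
              memory: 512Mi
            limits:
              cpu: '1'       # assumed headroom above the baseline
              memory: 1Gi
```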
### Scaling factors
Resource needs increase based on:
- Number of backends: Each backend adds minimal overhead (~10-20Mi memory)
- Request volume: Higher traffic requires more CPU for request processing
- Data volume: Large inputs and tool responses increase memory usage and network bandwidth
- Composite tool complexity: Workflows with many parallel steps consume more memory
- Token caching: Authentication token cache grows with unique client count
### Backend scale recommendations
vMCP performs well across different scales:
| Backend Count | Use Case | Notes |
|---|---|---|
| 1-5 | Small teams, focused toolsets | Minimal resource overhead |
| 5-15 | Medium teams, diverse tools | Recommended range for most use cases |
| 15-30 | Large teams, comprehensive | Increase health check interval |
| 30+ | Enterprise-scale deployments | Consider multiple vMCP instances |
## Scaling strategies

### Horizontal scaling
Horizontal scaling is possible for stateless use cases where MCP sessions can be resumed on any vMCP instance. However, stateful backends (e.g., Playwright browser sessions, database connections) complicate horizontal scaling because requests must be routed to the same vMCP instance that established the session.
Session considerations:
- vMCP uses MCP session IDs to cache routing tables and maintain consistency
- Some backends maintain persistent state that requires session affinity
- Clients must be able to disconnect and resume sessions for horizontal scaling to work reliably
When horizontal scaling works well:
- Stateless backends (fetch, search, read-only operations)
- Short-lived sessions with no persistent state
- Use cases where session affinity can be reliably maintained
When horizontal scaling is challenging:
- Stateful backends (Playwright, database connections, file system operations)
- Long-lived sessions requiring persistent state
- Complex session interdependencies
#### Configuration

The VirtualMCPServer CRD does not have a `replicas` field. The operator creates a Deployment with 1 replica initially and intentionally preserves the `replicas` count during reconciliation, enabling you to manage scaling separately through HPA/VPA or kubectl.
Operator-created resource names:
- Deployment: Same name as the VirtualMCPServer (e.g., `my-vmcp` → `my-vmcp`)
- Service: Prefixed with `vmcp-` (e.g., `my-vmcp` → `vmcp-my-vmcp`)
Option 1: Manual scaling with kubectl
Scale the Deployment (same name as your VirtualMCPServer):
# Example: If VirtualMCPServer is named "my-vmcp"
kubectl scale deployment my-vmcp -n <NAMESPACE> --replicas=3
The operator will not overwrite this change - it preserves the replicas field.
Option 2: Autoscaling with HPA (recommended)
For dynamic scaling based on load, use Horizontal Pod Autoscaler:
# Example: If VirtualMCPServer is named "my-vmcp"
kubectl autoscale deployment my-vmcp -n <NAMESPACE> \
--min=2 --max=5 --cpu-percent=70
HPA adjusts replicas automatically, and the operator preserves HPA's scaling decisions.
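If you manage manifests declaratively (GitOps, Helm), the equivalent of the `kubectl autoscale` command above is a standard `autoscaling/v2` HorizontalPodAutoscaler; a sketch, assuming the VirtualMCPServer (and therefore the Deployment) is named `my-vmcp`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-vmcp
  namespace: <NAMESPACE>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-vmcp          # operator-created Deployment (same name as the VirtualMCPServer)
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # matches --cpu-percent=70 in the command above
```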
VirtualMCPServer intentionally omits a spec.replicas field to avoid conflicts
with HPA/VPA autoscaling. This design allows you to choose between static
scaling (kubectl) or dynamic autoscaling (HPA/VPA) without operator
interference.
For static replica counts, scale the Deployment after creating the VirtualMCPServer. The operator will preserve your scaling configuration.
When scaling vMCP horizontally, the backend MCP servers will also see increased load. Ensure your backend deployments (MCPServer resources) are also scaled appropriately to handle the additional traffic.
Session affinity is required when using multiple replicas. Clients must be routed to the same vMCP instance for the duration of their session. Configure based on your deployment:
- Kubernetes Service: Use `sessionAffinity: ClientIP` for basic client-to-pod stickiness
  - Note: This is IP-based and may not work well behind proxies or with changing client IPs
- Ingress Controller: Configure cookie-based sticky sessions (recommended; see the sketch after this list)
  - nginx: Use `nginx.ingress.kubernetes.io/affinity: cookie`
  - Other controllers: Consult your Ingress controller documentation
- Gateway API: Use appropriate session affinity configuration based on your Gateway implementation
- For stateless backends: Cookie-based sticky sessions work well and provide reliable routing through proxies
- For stateful backends (Playwright, databases): Consider vertical scaling or dedicated vMCP instances instead of horizontal scaling with session affinity, as session resumption may not work reliably
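As a sketch of the recommended cookie-based option, here is an nginx Ingress routing to the operator-created Service (`vmcp-my-vmcp` for a VirtualMCPServer named `my-vmcp`). The host, cookie name, and port are placeholders to adapt to your deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-vmcp
  namespace: <NAMESPACE>
  annotations:
    nginx.ingress.kubernetes.io/affinity: cookie                    # enable sticky sessions
    nginx.ingress.kubernetes.io/session-cookie-name: vmcp-affinity  # placeholder cookie name
    nginx.ingress.kubernetes.io/session-cookie-max-age: '3600'      # keep affinity for 1 hour
spec:
  ingressClassName: nginx
  rules:
    - host: vmcp.example.com          # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmcp-my-vmcp    # operator-created Service (vmcp-<name>)
                port:
                  number: 8080        # placeholder; use the port exposed by the Service
```

For the Service-level option, setting `sessionAffinity: ClientIP` on the operator-created Service carries the IP-based limitations noted above.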
### Vertical scaling
Vertical scaling (increasing CPU/memory per instance) provides the simplest scaling story and works for all use cases, including stateful backends. However, it has limits and may not provide high availability since a single instance failure affects all sessions.
Recommended approach:
- Start with vertical scaling for simplicity
- Add horizontal scaling with session affinity when vertical limits are reached
- For stateful backends, consider dedicated vMCP instances per team/use case
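If the Vertical Pod Autoscaler controller is installed in your cluster, it can manage per-instance resources for you; a minimal sketch targeting the operator-created Deployment (assumed name `my-vmcp`). Note that `Auto` mode applies new requests by evicting pods, which drops in-flight sessions:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-vmcp
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-vmcp        # operator-created Deployment (same name as the VirtualMCPServer)
  updatePolicy:
    updateMode: Auto     # or "Off" to get recommendations without automatic eviction
```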
## When to scale

### Scale up (increase resources)
Increase CPU and memory when you observe:
- High CPU usage (>70% sustained) during normal operations
- Memory pressure or OOM (out-of-memory) kills
- Slow response times (>1 second) for simple tool calls
- Health check timeouts or frequent backend unavailability
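Standard kubectl checks are usually enough to spot these conditions; a sketch, assuming the VirtualMCPServer (and Deployment) is named `my-vmcp` and metrics-server is installed:

```bash
# Current CPU/memory usage of pods in the namespace (requires metrics-server)
kubectl top pod -n <NAMESPACE>

# Restart counts and recent events for the vMCP Deployment
kubectl describe deployment my-vmcp -n <NAMESPACE>

# Look for OOMKilled in a pod's last terminated state (replace the pod name)
kubectl get pod <VMCP_POD_NAME> -n <NAMESPACE> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```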
### Scale out (increase replicas)
Add more vMCP instances when:
- CPU usage remains high despite increasing resources
- You need higher availability and fault tolerance
- Request volume exceeds capacity of a single instance
- You want to distribute load across multiple availability zones
### Scale configuration
Large backend counts (15+):
spec:
config:
operational:
failureHandling:
healthCheckInterval: 60s # Reduce overhead (20 backends = 40 checks/min at default 30s)
unhealthyThreshold: 5 # Prevent transient issues from removing backends
Tradeoff: Detection time increases to 5 minutes (60s × 5). For latency-sensitive operations, keep the 30s interval and raise only the threshold.
High request volumes:
spec:
podTemplateSpec:
spec:
containers:
- name: vmcp
resources:
requests:
cpu: '1'
memory: 1Gi
limits:
cpu: '2'
memory: 2Gi
Monitor for:
- CPU throttling: Check `throttled_time` metrics
- Memory growth: Token caching with many unique clients or large resource payloads
## Performance optimization

### Backend discovery
Discovered mode (default):
- Startup: 1-3 seconds latency for 10-20 backends
- Runtime: Continuous monitoring, updates automatically without pod restarts
- Watches: MCPServer and MCPRemoteProxy resource changes
Inline mode:
- Startup: Near-instantaneous (no Kubernetes API calls)
- Runtime: Static backend list, no automatic updates
- Changes: Require VirtualMCPServer update (triggers pod restart)
Optimization: Group related tools in fewer backends to reduce discovery overhead.
### Authentication

| Method | Latency Impact | Notes |
|---|---|---|
| Token caching | Minimal | Enabled by default |
| Unauthenticated | 50-200ms saved | Internal/trusted backends only |
| Token expiration | Varies | 15-30 min recommended |
### Composite workflows
Parallelism: Up to 10 concurrent steps by default.
Optimizations:
- Minimize dependencies: Fetching from 3 backends → 3 parallel steps, then aggregate
- Optional enrichment: Use `onError.action: continue` (e.g., user profile + activity data)
- Explicit timeouts: If a backend typically responds in 5s, set a 10s timeout (prevents 30-60s hangs)
## Monitoring
Track these metrics via telemetry integration:
| Metric | Healthy State | Action Threshold |
|---|---|---|
| Backend request latency | P95 < SLO | Alert on spikes |
| Backend error rate | < 1% | Investigate > 5% |
| Health check success rate | > 95% | Early warning |
| Workflow execution time | Varies | Check for serial execution |
Setup: Create dashboards for trend analysis and configure alerts for anomalies. This catches degradation before users notice it.