TL;DR
Pool connections to external dependencies (databases, RPC nodes) and size pools based on concurrent tool calls, not total MCP sessions.
Cache tool results with TTLs matched to data freshness needs: minutes for static config, seconds for balances, and never for write operations.
Scale HTTP transport horizontally using sticky sessions or externalized state in Redis, and monitor protocol, dependency, and system metrics to find bottlenecks.
Production MCP servers eventually hit the same scaling challenges as any backend system: too many concurrent connections, slow tool handlers blocking other requests, external dependencies becoming bottlenecks, and memory growing without bound. The difference is that MCP servers have unique scaling characteristics: each connection maintains a stateful session, and tool calls often depend on external systems with their own latency profiles.
This post covers the practical patterns for scaling MCP servers from handling a single developer’s AI assistant to serving an entire engineering organization.
Understanding MCP Connection Patterns
MCP connections are stateful. Each client maintains a persistent connection to the server, and the server maintains session state for each client. This is fundamentally different from stateless REST APIs where each request is independent.
For stdio transport, each connection is a separate server process. This scales naturally but consumes more memory (one process per client). For HTTP transport, multiple clients can connect to a single server process, but the server must manage session state for each.
The scaling challenge: a team of 50 engineers, each running an AI assistant with 3 MCP servers connected, means 150 concurrent sessions. Each session might generate 10-50 tool calls per hour. That is 1,500-7,500 tool calls per hour, each potentially hitting external databases, APIs, or blockchain nodes.
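The back-of-the-envelope math above can be written out directly; all figures are the illustrative assumptions from the example, not measurements:

```typescript
// Rough capacity estimate for sizing an MCP deployment.
const engineers = 50;
const serversPerAssistant = 3;
const sessions = engineers * serversPerAssistant; // concurrent stateful sessions

const callsPerSessionPerHourLow = 10;
const callsPerSessionPerHourHigh = 50;
const callsPerHourLow = sessions * callsPerSessionPerHourLow;
const callsPerHourHigh = sessions * callsPerSessionPerHourHigh;

console.log(sessions, callsPerHourLow, callsPerHourHigh); // 150 1500 7500
```

The key takeaway for sizing: downstream capacity must be planned against tool calls per hour, not session count.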
Connection Pooling for External Dependencies
The most common bottleneck in MCP servers is not the protocol layer but the external systems tools connect to. A database MCP server that opens a new connection for every tool call will exhaust the database’s connection limit under moderate load.
Connection pooling is non-negotiable for production MCP development services. Maintain a pool of pre-established connections to each external system (database, RPC node, API service) and check connections out for each tool call.
For database connections, use a pool library appropriate to your stack: pg-pool for PostgreSQL in Node.js, sqlx with connection pooling in Rust, or connection pool middleware for any ORM. Size the pool based on expected concurrent tool calls, not concurrent MCP sessions (sessions are mostly idle between tool calls).
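As a minimal sketch of the checkout pattern, here is a generic pool sized by concurrent tool calls. In production you would use pg-pool or your ORM's pooling; the `connect` factory and `Conn` type here are placeholder assumptions for whatever your driver provides:

```typescript
type Conn = { id: number };

class Pool {
  private idle: Conn[] = [];
  private waiters: ((c: Conn) => void)[] = [];
  private created = 0;

  // Size `max` by expected concurrent tool calls, not total MCP sessions:
  // sessions are mostly idle between calls.
  constructor(private max: number, private connect: () => Conn) {}

  async acquire(): Promise<Conn> {
    if (this.idle.length > 0) return this.idle.pop()!;
    if (this.created < this.max) {
      this.created++;
      return this.connect();
    }
    // Pool exhausted: queue until another tool call releases a connection.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(c: Conn): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(c); // hand straight to a queued tool call
    else this.idle.push(c);
  }
}

// Example: 150 sessions but only ~10 concurrent tool calls → pool of 10.
const pool = new Pool(10, () => ({ id: Math.random() }));
```

Each tool handler should acquire at the start of the call and release in a finally block, so a thrown error never leaks a connection.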
For blockchain RPC connections, pooling is less relevant because most RPC calls are stateless HTTP requests. Instead, implement client-side load balancing across multiple RPC endpoints. If one endpoint is slow or rate-limited, route to another. For Solana specifically, maintain connections to multiple RPC providers and implement fallback logic.
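A minimal sketch of that fallback logic, with the transport injected so it stays provider-agnostic (the endpoint strings and error handling policy are assumptions; real code would also track endpoint health and back off):

```typescript
// Try each RPC endpoint in order; the first one that answers wins.
async function callWithFallback<T>(
  endpoints: string[],
  call: (endpoint: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (const endpoint of endpoints) {
    try {
      return await call(endpoint);
    } catch (err) {
      lastError = err; // slow, down, or rate-limited: try the next provider
    }
  }
  throw lastError; // every endpoint failed
}
```

Rotating the starting index per call turns the same loop into round-robin load balancing rather than pure failover.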
Caching Strategies for MCP Tool Results
Many tool calls return the same data when called with the same arguments within a short window. A “get account balance” tool called five times in a minute for the same address does not need five RPC calls. Caching tool results reduces load on external systems and improves response time.
Implement a simple TTL (time-to-live) cache keyed by tool name plus serialized arguments. The cache duration depends on the data’s freshness requirements:
Static data (configuration, schemas, documentation): cache for minutes to hours. These change rarely and caching eliminates redundant reads.
Semi-dynamic data (account balances, token prices): cache for 5-30 seconds. Fresh enough for conversation context without hammering the data source.
Real-time data (live metrics, active trades): do not cache, or cache for 1-2 seconds at most. Stale data here leads to incorrect decisions.
Write operations (deployments, transactions, record creation): never cache. These must always execute.
For resource endpoints, MCP has built-in support for change notifications. When the underlying data behind a resource changes, the server sends a notifications/resources/updated message to subscribed clients. This lets clients refresh their context without polling.
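The notification is a standard JSON-RPC 2.0 message; a sketch of its shape, with an illustrative resource URI:

```typescript
// MCP resource-update notification (JSON-RPC 2.0). No `id` field:
// notifications expect no response. The URI is an example.
const notification = {
  jsonrpc: "2.0",
  method: "notifications/resources/updated",
  params: { uri: "db://accounts/123" },
};
console.log(JSON.stringify(notification));
```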
Concurrency and Request Isolation
MCP servers must handle concurrent tool calls safely. An AI model might call multiple tools in rapid succession, or multiple clients might call the same tool simultaneously. Your handlers must be concurrent-safe.
In Node.js, concurrency happens at the event loop level. Multiple tool calls run concurrently through async/await, sharing the same thread. This is safe for I/O-bound handlers but problematic for CPU-bound work. Use worker threads for computationally intensive handlers to avoid blocking the event loop.
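A sketch of the worker-thread offload using node:worker_threads, with a naive Fibonacci as a stand-in for any CPU-bound handler (the inline `eval` worker keeps the example self-contained; real code would load a separate worker file):

```typescript
import { Worker } from "node:worker_threads";

// Run a CPU-bound computation off the event loop so concurrent
// tool calls from other clients are not blocked.
function fibInWorker(n: number): Promise<number> {
  const source = `
    const { parentPort, workerData } = require("node:worker_threads");
    const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
    parentPort.postMessage(fib(workerData));
  `;
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true, workerData: n });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}
```

For sustained load, spawn a fixed pool of workers at startup rather than one per call, since worker creation itself is expensive.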
In Rust with Tokio, concurrency is multi-threaded by default. Tool handlers run on the Tokio thread pool, and the runtime handles scheduling. Use Arc and Mutex/RwLock for shared state (as covered in the Rust MCP post). Avoid holding locks across await points, which can cause deadlocks in async code.
Request isolation is equally important. A slow tool call from one client should not delay tool calls from other clients. This happens naturally with async I/O (each request releases the thread while waiting for I/O), but CPU-bound handlers can starve other requests. Monitor handler execution times and offload heavy computation to background workers.
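Monitoring handler execution times can be as simple as a timing wrapper around every tool handler; the 200ms threshold here is an illustrative assumption:

```typescript
// Wrap a tool handler to record how long it held the event loop or
// thread, flagging candidates for offload to background workers.
async function timed<T>(name: string, handler: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await handler();
  } finally {
    const ms = performance.now() - start;
    if (ms > 200) {
      console.warn(`slow tool handler: ${name} took ${ms.toFixed(1)}ms`);
    }
  }
}
```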
Horizontal Scaling with Load Balancing
When a single server instance cannot handle the load, horizontal scaling is the next step. MCP’s stateful sessions make this more complex than scaling a REST API.
For HTTP transport, the key challenge is session affinity. Each client’s session must route to the same server instance because the session state lives in server memory. Use sticky sessions (based on the session ID in the Mcp-Session-Id header) at the load balancer level. Nginx, HAProxy, and cloud load balancers all support header-based sticky routing.
Alternatively, externalize session state to a shared store (Redis, for example). This lets any server instance handle any request, but adds latency for session reads/writes. The tradeoff depends on your availability requirements: sticky sessions are simpler but create hot spots; shared state is more resilient but slower.
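The externalized-state option reduces to a small store interface keyed by the session ID. A sketch with an in-memory stand-in; in production the same interface would wrap Redis GET/SET calls, and the field names are illustrative:

```typescript
// Any server instance can serve any request if session state lives
// behind this interface instead of in process memory.
interface SessionStore {
  load(sessionId: string): Promise<Record<string, unknown> | undefined>;
  save(sessionId: string, state: Record<string, unknown>): Promise<void>;
}

// In-memory stand-in for Redis, useful for tests and single-instance runs.
class InMemoryStore implements SessionStore {
  private data = new Map<string, Record<string, unknown>>();
  async load(id: string) {
    return this.data.get(id);
  }
  async save(id: string, state: Record<string, unknown>) {
    this.data.set(id, state);
  }
}
```

The session ID from the Mcp-Session-Id header becomes the store key, so the load balancer no longer needs sticky routing.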
For stdio transport, horizontal scaling is inherent. Each client spawns its own server process. The host machine’s CPU and memory are the limits. Use container orchestration (Kubernetes, ECS) to distribute server processes across machines if a single host runs out of resources.
Monitoring and Observability
You cannot scale what you cannot measure. MCP servers need observability at three levels:
Protocol metrics: connections active, tool calls per second, error rate, average response time per tool.
Dependency metrics: database query latency, RPC call duration, cache hit rate, connection pool utilization.
System metrics: CPU usage, memory consumption, event loop lag (Node.js), thread pool saturation (Rust).
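The protocol-level metrics can start as simple in-process counters; a sketch, with names as illustrative assumptions (in production you would export these via prom-client, a Datadog agent, or CloudWatch):

```typescript
// Per-tool call counts, error count, and latency samples.
const metrics = {
  toolCalls: new Map<string, number>(),
  errors: 0,
  latenciesMs: [] as number[],
};

// Call this from every tool handler after it completes.
function recordCall(tool: string, ms: number, ok: boolean): void {
  metrics.toolCalls.set(tool, (metrics.toolCalls.get(tool) ?? 0) + 1);
  metrics.latenciesMs.push(ms);
  if (!ok) metrics.errors++;
}

function avgLatencyMs(): number {
  const xs = metrics.latenciesMs;
  return xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
}
```

Averages hide tail latency, so a real exporter should also track percentiles (p95/p99) per tool.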
Export these metrics to your existing monitoring stack (Prometheus, Datadog, CloudWatch). Set alerts on error rate increases, latency spikes, and resource exhaustion. The goal is to identify bottlenecks before they impact AI model performance.
Exo provides MCP development services for teams that need production-grade, scalable AI infrastructure. We have scaled systems supporting over $1B in TVL and apply the same engineering discipline to MCP server architecture. Ready to build? Reach out at founders@exotechnologies.xyz
