TL;DR ⚡
What you’ll learn: How to isolate CPU, manage memory under pressure, avoid runtime stalls, and benchmark RPC callbacks correctly.
Why it matters: Most high-throughput servers don’t fail from a lack of hardware. They fail due to poor runtime configuration and hidden blocking.
Exo Edge: These techniques materially increased throughput and reduced contention in a production blockchain RPC server.
Introduction
Some server designs are not built to scale. Sometimes, even common scaling methods can hurt scaling. This blog highlights a few of these pitfalls. If you build servers and want to improve resource usage or increase throughput without raising infra costs, this is for you.
This blog also shares benchmarking tips to help you get better results. We used these methods while improving an RPC server for a client building a world-class blockchain.
The discussion is divided into three sections:
Resource usage optimization
Bottlenecks and pitfalls in server callbacks
Benchmarking tips
Resource usage optimization:
CPU cores, memory, and disk I/O are the most common and important resources that should be well contained. If not properly scoped, they can stall operations in other threads, cause an OOM crash, wear out disks, and more. Let's take them one by one:
CPU cores constraint:
The goal is to isolate the cores and fully utilize them. But when async runtimes are used to handle concurrent requests (which is most of the time), this is not straightforward. Tokio's runtime builder provides knobs like worker_threads() and max_blocking_threads(), but they don't guarantee your server will not affect other services.
For example, if you have multiple runtimes in your system handling different tasks, you want to prevent them from influencing each other through a shared CPU resource. You can limit the worker thread count per runtime, but this does not prevent those threads from being scheduled on the same cores.
To handle the isolation part, you can “pin” (bind threads to specific cores) all the threads to specific cores. This way, you have contained the workload to specific cores and can also enjoy the concurrent nature of async.
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::sync::{Arc, Mutex};

use core_affinity::set_for_current;
use tokio::runtime::Builder;

let rpc_index = Arc::new(Mutex::new(0));
let fail_to_acquire_lock = Arc::new(AtomicBool::new(false));

// Pins each worker thread to a specific core, round-robin over `cores`
let make_on_thread_start = |cores: Vec<_>, index: Arc<Mutex<usize>>, label: &str| {
    let label = label.to_string();
    let fail_to_acquire_lock = fail_to_acquire_lock.clone();
    move || {
        let mut idx = match index.lock() {
            Ok(guard) => guard,
            Err(_) => {
                fail_to_acquire_lock.store(true, Relaxed);
                return; // just return from the closure
            }
        };
        let core = cores[*idx % cores.len()];
        set_for_current(core);
        tracing::debug!("{} thread pinned to core {:?}", label, core.id);
        *idx += 1;
    }
};

let rpc_runtime = Builder::new_multi_thread()
    .worker_threads(num_cores.len() * 2)
    .thread_name("rpc-worker")
    .on_thread_start(make_on_thread_start(rpc_cores, rpc_index.clone(), "RPC"))
    .enable_all()
    .build()?;
Disclaimer: This works best when the overall system is well-scoped. If sub-services like ingress, logging, or gossip use default runtimes or global thread pools such as Rayon, they may still use all cores. In that case, pinning alone will not prevent resource sharing.
After isolation, you want to make sure the isolated resources are fully utilized. You can do that by tuning worker_threads() to an appropriate value, identified through benchmarks. If you do not set it, the Tokio runtime defaults to one worker per core, which is not ideal in most cases; you can set it to a number greater than the core count.
Memory constraint:
Memory needs to be managed properly; otherwise, the program can crash due to OOM, and sometimes memory saturation can cause the whole system to crash as well.
The goal is to add a safety check. If the system is under memory load, then avoid processing any further requests that can worsen the situation. You can use a per-request middleware for this purpose. Whenever a request arrives, you check the system memory stats and then forward it if it’s safe to process.
Two cautions need to be kept in mind when introducing this extra step. First, make sure the memory-stat parsing is not heavy; reading the /proc/meminfo virtual file is sufficient for that. Second, read the right memory stat. It sounds trivial, but reading the wrong stat can be devastating. For example, if you read free memory but do not account for reclaimable memory, your check will conclude that memory is saturated when in fact it can be reclaimed, a false positive leading to false throttling.
impl<'a, S> RpcServiceT<'a> for RpcMemManagerMiddleware<S>
where
    S: RpcServiceT<'a> + Send + Sync + Clone + 'static,
{
    type Future = BoxFuture<'a, MethodResponse>;

    fn call(&self, req: Request<'a>) -> Self::Future {
        let service = self.service.clone();
        let threshold = self.rpc_memory_threshold_pct;
        // Check memory stats before the request is forwarded
        let is_memory_safe = self.check_memory_usage(threshold, None);
        Box::pin(async move {
            match is_memory_safe {
                // Safe: forward the request to the inner service
                Ok(()) => service.call(req).await,
                // Under memory pressure: reject with an error response
                Err(err_msg) => MethodResponse::error(
                    req.id(),
                    ErrorObject::owned(
                        jsonrpsee_types::error::ErrorCode::InternalError.code(),
                        err_msg,
                        None::<()>,
                    ),
                ),
            }
        })
    }
}
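As a sketch of what a check like check_memory_usage might do (the exact threshold semantics here are assumed), the available-memory percentage can be parsed from /proc/meminfo using MemAvailable, which already accounts for reclaimable pages and so avoids the false-positive throttling described above:

```rust
use std::fs;

// Parse MemTotal/MemAvailable (in kB) out of /proc/meminfo text.
fn mem_available_pct(meminfo: &str) -> Option<f64> {
    let mut total = None;
    let mut avail = None;
    for line in meminfo.lines() {
        let mut parts = line.split_whitespace();
        match parts.next() {
            Some("MemTotal:") => total = parts.next()?.parse::<f64>().ok(),
            Some("MemAvailable:") => avail = parts.next()?.parse::<f64>().ok(),
            _ => {}
        }
    }
    Some(avail? / total? * 100.0)
}

// Hypothetical check: reject requests when available memory drops below the threshold.
fn check_memory_usage(threshold_pct: f64) -> Result<(), String> {
    let meminfo = fs::read_to_string("/proc/meminfo").map_err(|e| e.to_string())?;
    match mem_available_pct(&meminfo) {
        Some(pct) if pct >= threshold_pct => Ok(()),
        Some(pct) => Err(format!("memory saturated: only {pct:.1}% available")),
        None => Err("could not parse /proc/meminfo".into()),
    }
}

fn main() {
    // Sample text keeps the sketch portable off Linux.
    let sample = "MemTotal: 16384000 kB\nMemAvailable: 8192000 kB\n";
    assert_eq!(mem_available_pct(sample), Some(50.0));
}
```

Parsing a short virtual file like this is cheap enough to run per request.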
IO operations constraint:
OSes have different IO metrics, but congestion metrics (Indicators that show delays or backlog in storage) are most relevant when managing resources. One such metric is iowait, which is the time the CPU waits for some IO operations. High iowait often means the system is bottlenecked on disk or network.
File I/O is blocking by default: the calling thread waits until the operation completes. In async Rust, you can use tokio::fs for file I/O. This module runs the blocking I/O on dedicated blocking threads and returns the result as a future, which lets the main async workers keep running other tasks. On Linux, iowait can be read from the aggregate cpu line of the /proc/stat virtual file.
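A minimal sketch of reading iowait, assuming the standard /proc/stat layout where the first line is "cpu user nice system idle iowait ..." in USER_HZ ticks:

```rust
// Extract the aggregate iowait counter from /proc/stat contents.
// Tokens on the first line: ["cpu", user, nice, system, idle, iowait, ...],
// so iowait is the token at index 5.
fn iowait_ticks(proc_stat: &str) -> Option<u64> {
    let cpu_line = proc_stat.lines().next()?;
    cpu_line.split_whitespace().nth(5)?.parse().ok()
}

fn main() {
    let sample = "cpu  100 0 50 1000 30 0 0 0 0 0\n";
    assert_eq!(iowait_ticks(sample), Some(30));
    // On a real system, read std::fs::read_to_string("/proc/stat") twice and
    // difference the two samples to get iowait over the interval.
}
```

The counters are cumulative since boot, so a single reading is meaningless; always sample twice and compare the delta against elapsed time.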
Bottlenecks and pitfalls in server callbacks
There are two common problems in this regard. First, choosing the right callback setup for the workload. Second, following the right rules inside the callback.
Callback registration rules
Callback registration defines how and where your callback runs. Most frameworks offer several options. Each option fits a different type of workload. The sections below describe these options and when to use them. The RPC crate used here is for reference.
| Method | Description | Suitable for | API in the jsonrpsee crate |
|---|---|---|---|
| Blocking | Handles the request on the same thread, as soon as it arrives, in a blocking fashion. | Very lightweight methods for which the cost of even scheduling and awaiting a task on the runtime is greater than the task itself. | `register_method` |
| Non-blocking but synchronous | Creates a new task for each request on a separate thread inside the runtime, but that thread can only perform sync operations. | Methods for which the cost of scheduling a task on the runtime is comparable to or smaller than the computation itself. This is the general go-to option. | `register_blocking_method` |
| Non-blocking and asynchronous | Creates a new task for each request inside the runtime, and that task can perform async operations. | Methods that do heavy computation (which you can offload with `spawn_blocking()`) or that use async APIs like `tokio::fs::read()`. | `register_async_method` |
The exact setting for the specific RPC call can be found using benches. For example, let’s walk through a structured approach to debugging a callback registration issue using benchmarks. The problem can be analyzed at three levels of granularity, each isolating a different class of issues. Taken together, these benchmarks provide useful insight into where time is being spent.
Benchmark the internal callback logic: This isolates the pure computation performed inside the callback.
Benchmark the callback invoked directly: This measures the overhead of the RPC module’s orchestration and execution path, excluding network and client-side costs.
End-to-end benchmark: This measures the full request path by sending an RPC request to the server using an RPC client, as in a real deployment.
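Level (1) can be approximated with a plain timing loop before reaching for a full benchmarking harness; the summation workload here is just a stand-in for the callback's pure computation:

```rust
use std::time::Instant;

// Minimal timing harness: run `f` for `iters` iterations and report ns/iter.
fn bench<F: FnMut()>(label: &str, iters: u32, mut f: F) -> u128 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let per_iter = start.elapsed().as_nanos() / iters as u128;
    println!("{label}: {per_iter} ns/iter");
    per_iter
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    // black_box prevents the compiler from optimizing the workload away.
    bench("internal logic", 10_000, || {
        let _sum: u64 = std::hint::black_box(&data).iter().sum();
    });
}
```

For the published numbers you would still use Criterion (as in the main function shown later), which adds warm-up, outlier detection, and statistical sampling that a raw loop lacks.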
When analyzing the benchmark results, there are a few key signals to watch for:
If the first benchmark shows high variance, it likely indicates lock contention or contention over a shared underlying resource.
If the difference between benchmarks (1) and (2) is significant and comparable to the internal logic cost, the issue is likely runtime-related. Further micro-benchmarking often reveals that the additional time is spent either during task scheduling or when awaiting the future’s completion. This typically occurs when the runtime is overwhelmed by task-scheduling pressure. In such cases:
If the callback is very small, consider using a blocking execution model.
If the internal computation is non-trivial, consider increasing the runtime's worker_threads.
If the difference between benchmarks (2) and (3) is large, verify server-side configuration. Ensure that middleware is not introducing excessive latency, the async RPC client is not being throttled, and the server is using the intended (custom) runtime configuration.
Tokio is not good for everything
Some RPC calls perform blocking operations. These blocking operations generally fall into two categories:
Expensive CPU-bound computation: such as hashing, signature verification, or data compression.
Synchronous I/O: such as reading or writing files, logging, or performing database queries using blocking APIs.
Running these operations directly on Tokio’s runtime worker threads blocks the runtime, potentially delaying or stalling other incoming RPC requests.
To avoid blocking the runtime, blocking operations should be moved to threads outside of Tokio’s main worker pool. There are several approaches:
tokio::task::spawn_blocking: ideal for CPU-bound tasks. It executes blocking work on a dedicated blocking thread pool managed by Tokio. Its downside is per-task overhead, which makes it poorly suited for sustained high-throughput loads.
Rayon crate: useful for CPU-intensive batch processing where the same operation is applied independently over many items. Batched RPC methods like send_batch_transactions and get_multiple_accounts currently process items sequentially; since each item is independent, Rayon can parallelize this work across CPU cores to reduce overall processing time.
Dedicated threads: suitable for long-running or isolated tasks, such as sending transactions or where you want full control over thread behaviour.
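The dedicated-thread approach can be sketched with standard-library channels; the checksum function below is a stand-in for real CPU-bound work such as signature verification:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for an expensive CPU-bound computation.
fn expensive_checksum(payload: &[u8]) -> u64 {
    payload.iter().map(|&b| b as u64).sum()
}

// Spawn one long-lived worker thread, kept entirely off the async runtime's
// worker pool. Jobs arrive on a channel; each carries its own reply channel.
fn spawn_worker() -> mpsc::Sender<(Vec<u8>, mpsc::Sender<u64>)> {
    let (tx, rx) = mpsc::channel::<(Vec<u8>, mpsc::Sender<u64>)>();
    thread::spawn(move || {
        for (payload, reply) in rx {
            let _ = reply.send(expensive_checksum(&payload));
        }
    });
    tx
}

fn main() {
    let worker = spawn_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    worker.send((vec![1, 2, 3], reply_tx)).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), 6);
}
```

In an async callback you would use a oneshot channel from an async-aware crate for the reply so the task can await it without blocking; the std channel here keeps the sketch dependency-free.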
How to avoid blocking calls in the callbacks:
First, avoid using Mutex on read-heavy shared resources; use RwLock instead so reads can proceed in parallel, since Mutex::lock() allows only one holder at a time.
Second, use non-blocking variants of APIs where possible: instead of calling RwLock::read(), call try_read(). This works when fallback options are available. For example, suppose there is contention on the cache lock due to high load. Rather than waiting on the lock, you may serve the request faster by fetching the data from persistent storage.
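The fallback pattern might look like this sketch, where the cache and storage names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Try the cache without blocking; on contention or miss, fall back to storage.
fn get_account(cache: &RwLock<HashMap<u64, String>>, key: u64) -> String {
    if let Ok(cache) = cache.try_read() {
        if let Some(v) = cache.get(&key) {
            return v.clone(); // fast path: cache hit, no waiting
        }
    }
    // Slow path: lock contended or cache miss; serve from persistent storage.
    load_from_storage(key)
}

// Stand-in for a database or disk read.
fn load_from_storage(key: u64) -> String {
    format!("account-{key}")
}

fn main() {
    let cache = RwLock::new(HashMap::from([(1, "cached-account".to_string())]));
    assert_eq!(get_account(&cache, 1), "cached-account");
    assert_eq!(get_account(&cache, 2), "account-2");
}
```

The trade-off is occasionally doing the slower storage read even when the cached value exists; under heavy contention that is usually still faster than queueing on the lock.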
Third, use async APIs, or write async versions of low-level functions where possible, so that the server remains responsive under heavy load. For example, use tokio::fs::read instead of std::fs::read where possible.
Benchmarking tips:
To create large sets of randomized mock data, always use a seeded randomizer for deterministic results.
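A deterministic generator can be as simple as this xorshift sketch (in practice a seeded generator from the rand crate works too); the point is that the same seed always yields the same mock data:

```rust
// Tiny seeded xorshift64 generator; seed must be nonzero.
struct SeededRng(u64);

impl SeededRng {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// Hypothetical mock-data helper: n pseudo-random account IDs from a seed.
fn mock_accounts(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = SeededRng(seed);
    (0..n).map(|_| rng.next()).collect()
}

fn main() {
    // Same seed, identical mock data: benchmark runs become reproducible.
    assert_eq!(mock_accounts(42, 5), mock_accounts(42, 5));
    assert_ne!(mock_accounts(42, 5), mock_accounts(7, 5));
}
```

Reproducible inputs mean a regression shown by a benchmark is attributable to the code change, not to a different random dataset.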
Use Rayon for mock-data creation when the generation loop is straightforward; it gets messier when shared state is involved. Parallelizing generation significantly reduces bench and stress-test times, and thus CI/CD time.
If you want to avoid discouraged patterns in Rust that can cause panics, such as .unwrap() or .expect(), use proper error handling. To return a custom error, you have to write your own main function like the one below instead of using the predefined macros:
fn main() -> Result<(), RuntimeBenchError> {
    let mut c = Criterion::default()
        .configure_from_args()
        .sample_size(SAMPLE_SIZE);

    // Call benches manually
    rpc_read_account_info_latency_under_load(&mut c)?;

    // Finalize all groups
    c.final_summary();
    Ok(())
}
Conclusion
Building a high-throughput server requires careful management of CPU, memory, and I/O, as well as thoughtful callback design. Applying these techniques helps avoid bottlenecks, improves resource utilization, and keeps your server responsive under load.
Next, try applying these ideas to your own server or RPC service. Use benchmarks to validate and tune the config parameters for your heavy workloads.
