TL;DR ⚡
What you’ll learn: How to isolate CPU, manage memory under pressure, avoid runtime stalls, and benchmark RPC callbacks correctly.
Why it matters: Most high-throughput servers don’t fail from a lack of hardware. They fail due to poor runtime configuration and hidden blocking.
Exo Edge: These techniques materially increased throughput and reduced contention in a production blockchain RPC server.
Introduction
Some server designs are not built to scale. Sometimes, even common scaling methods can hurt scaling. This blog highlights a few of these pitfalls. If you build servers and want to improve resource usage or increase throughput without raising infra costs, this is for you.
This blog also shares benchmarking tips to help you get better results. We used these methods while improving an RPC server for a client building a world-class blockchain.
The discussion is divided into three sections:
Resource usage optimization
Bottlenecks and pitfalls in server callbacks
Benchmarking tips
Resource usage optimization:
CPU cores, memory, and disk I/O are the most common and important resources that should be well contained. If not properly scoped, they can stall operations in other threads, cause an OOM crash, wear out disks, and more. Let's take them one by one:
CPU cores constraint:
The goal is to isolate the cores and fully utilize them. But when async runtimes are used to handle concurrent requests (which is most of the time), this is not straightforward. Tokio's runtime builder provides knobs like worker_threads() and max_blocking_threads(), but they don't guarantee your server will not affect other services.
For example, if you have multiple runtimes in your system handling different tasks, you want to prevent them from influencing each other through a shared CPU resource. You can limit the worker thread count per runtime, but this does not prevent those threads from being scheduled on the same cores.
To handle the isolation part, you can “pin” (bind threads to specific cores) all the threads to specific cores. This way, you have contained the workload to specific cores and can also enjoy the concurrent nature of async.
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::sync::{Arc, Mutex};

use core_affinity::set_for_current;
use tokio::runtime::Builder;

let rpc_index = Arc::new(Mutex::new(0));
let fail_to_acquire_lock = Arc::new(AtomicBool::new(false));

// Pins each worker thread to a specific core, round-robin over `cores`
let make_on_thread_start = |cores: Vec<_>, index: Arc<Mutex<usize>>, label: &str| {
    let label = label.to_string();
    let fail_to_acquire_lock = fail_to_acquire_lock.clone();
    move || {
        let mut idx = match index.lock() {
            Ok(guard) => guard,
            Err(_) => {
                fail_to_acquire_lock.store(true, Relaxed);
                return; // just return from the closure
            }
        };
        let core = cores[*idx % cores.len()];
        set_for_current(core);
        tracing::debug!("{} thread pinned to core {:?}", label, core.id);
        *idx += 1;
    }
};

let rpc_runtime = Builder::new_multi_thread()
    .worker_threads(num_cores.len() * 2)
    .thread_name("rpc-worker")
    .on_thread_start(make_on_thread_start(rpc_cores, rpc_index.clone(), "RPC"))
    .enable_all()
    .build()?;
Disclaimer: This works best when the overall system is well-scoped. If sub-services like ingress, logging, or gossip use default runtimes or global thread pools such as Rayon, they may still use all cores. In that case, pinning alone will not prevent resource sharing.
After isolation, you want to make sure the isolated resources are fully utilized. You can do that by tuning worker_threads() to an appropriate value, identified through benchmarks. If you do not set it, the Tokio runtime defaults to one worker per core, which is not ideal in most cases; you can set it to a number greater than the core count.
Memory constraint:
Memory needs to be managed properly; otherwise, the program can crash due to OOM, and sometimes memory saturation can cause the whole system to crash as well.
The goal is to add a safety check. If the system is under memory load, then avoid processing any further requests that can worsen the situation. You can use a per-request middleware for this purpose. Whenever a request arrives, you check the system memory stats and then forward it if it’s safe to process.
Two cautions need to be kept in mind when introducing this extra step. First, make sure the memory-stat parsing is not heavy; reading the /proc/meminfo virtual file is sufficient for that. Second, read the right memory stat. It sounds trivial, but reading the wrong stat can be devastating. For example, if you read free memory but do not account for reclaimable memory, your check will conclude that memory is saturated when in fact it can be reclaimed, a false positive leading to false throttling.
impl<'a, S> RpcServiceT<'a> for RpcMemManagerMiddleware<S>
where
    S: RpcServiceT<'a> + Send + Sync + Clone + 'static,
{
    type Future = BoxFuture<'a, MethodResponse>;

    fn call(&self, req: Request<'a>) -> Self::Future {
        let service = self.service.clone();
        let threshold = self.rpc_memory_threshold_pct;
        // Check memory stats before the request is forwarded
        let is_memory_safe = self.check_memory_usage(threshold, None);
        Box::pin(async move {
            match is_memory_safe {
                // Safe: forward the request to the inner service
                Ok(()) => service.call(req).await,
                // Under memory pressure: reject with an error response
                Err(err_msg) => MethodResponse::error(
                    req.id(),
                    ErrorObject::owned(
                        jsonrpsee_types::error::ErrorCode::InternalError.code(),
                        err_msg,
                        None::<()>,
                    ),
                ),
            }
        })
    }
}
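As a sketch of what a check like check_memory_usage might do (the exact threshold semantics here are assumed), the available-memory percentage can be parsed from /proc/meminfo using MemAvailable, which already accounts for reclaimable pages and so avoids the false-positive throttling described above:

```rust
use std::fs;

// Parse MemTotal/MemAvailable (in kB) out of /proc/meminfo text.
fn mem_available_pct(meminfo: &str) -> Option<f64> {
    let mut total = None;
    let mut avail = None;
    for line in meminfo.lines() {
        let mut parts = line.split_whitespace();
        match parts.next() {
            Some("MemTotal:") => total = parts.next()?.parse::<f64>().ok(),
            Some("MemAvailable:") => avail = parts.next()?.parse::<f64>().ok(),
            _ => {}
        }
    }
    Some(avail? / total? * 100.0)
}

// Hypothetical check: reject requests when available memory drops below the threshold.
fn check_memory_usage(threshold_pct: f64) -> Result<(), String> {
    let meminfo = fs::read_to_string("/proc/meminfo").map_err(|e| e.to_string())?;
    match mem_available_pct(&meminfo) {
        Some(pct) if pct >= threshold_pct => Ok(()),
        Some(pct) => Err(format!("memory saturated: only {pct:.1}% available")),
        None => Err("could not parse /proc/meminfo".into()),
    }
}

fn main() {
    // Sample text keeps the sketch portable off Linux.
    let sample = "MemTotal: 16384000 kB\nMemAvailable: 8192000 kB\n";
    assert_eq!(mem_available_pct(sample), Some(50.0));
}
```

Parsing a short virtual file like this is cheap enough to run per request.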
IO operations constraint:
OSes have different IO metrics, but congestion metrics (Indicators that show delays or backlog in storage) are most relevant when managing resources. One such metric is iowait, which is the time the CPU waits for some IO operations. High iowait often means the system is bottlenecked on disk or network.
File I/O is blocking by default: the calling thread waits until the operation completes. In async Rust, you can use tokio::fs for file I/O. This module runs the blocking I/O on dedicated blocking threads and returns the result as a future, which lets the main async workers keep running other tasks. On Linux, iowait can be read from the aggregate cpu line of the /proc/stat virtual file.
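A minimal sketch of reading iowait, assuming the standard /proc/stat layout where the first line is "cpu user nice system idle iowait ..." in USER_HZ ticks:

```rust
// Extract the aggregate iowait counter from /proc/stat contents.
// Tokens on the first line: ["cpu", user, nice, system, idle, iowait, ...],
// so iowait is the token at index 5.
fn iowait_ticks(proc_stat: &str) -> Option<u64> {
    let cpu_line = proc_stat.lines().next()?;
    cpu_line.split_whitespace().nth(5)?.parse().ok()
}

fn main() {
    let sample = "cpu  100 0 50 1000 30 0 0 0 0 0\n";
    assert_eq!(iowait_ticks(sample), Some(30));
    // On a real system, read std::fs::read_to_string("/proc/stat") twice and
    // difference the two samples to get iowait over the interval.
}
```

The counters are cumulative since boot, so a single reading is meaningless; always sample twice and compare the delta against elapsed time.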
Bottlenecks and pitfalls in server callbacks
There are two common problems in this regard. First, choosing the right callback setup for the workload. Second, following the right rules inside the callback.
Callback registration rules
Callback registration defines how and where your callback runs. Most frameworks offer several options. Each option fits a different type of workload. The sections below describe these options and when to use them. The RPC crate used here is for reference.
| Method | Description | Suitable for | API in the jsonrpsee crate |
|---|---|---|---|
| Blocking | Handles the request on the same thread, as soon as it arrives, in a blocking fashion. | Very lightweight methods for which the cost of even scheduling and awaiting a task on the runtime is greater than the task itself. | `register_method` |
| Non-blocking but synchronous | Creates a new task for each request on a separate thread inside the runtime, but that thread can only perform sync operations. | Methods for which the cost of scheduling a task on the runtime is comparable to or smaller than the computation itself. This is the general go-to option. | `register_blocking_method` |
| Non-blocking and asynchronous | Creates a new task for each request inside the runtime, and that task can perform async operations. | Methods that do heavy computation (which you can offload with `spawn_blocking()`) or that use async APIs like `tokio::fs::read()`. | `register_async_method` |
The exact setting for the specific RPC call can be found using benches. For example, let’s walk through a structured approach to debugging a callback registration issue using benchmarks. The problem can be analyzed at three levels of granularity, each isolating a different class of issues. Taken together, these benchmarks provide useful insight into where time is being spent.
Benchmark the internal callback logic: This isolates the pure computation performed inside the callback.
Benchmark the callback invoked directly: This measures the overhead of the RPC module’s orchestration and execution path, excluding network and client-side costs.
End-to-end benchmark: This measures the full request path by sending an RPC request to the server using an RPC client, as in a real deployment.
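Level (1) can be approximated with a plain timing loop before reaching for a full benchmarking harness; the summation workload here is just a stand-in for the callback's pure computation:

```rust
use std::time::Instant;

// Minimal timing harness: run `f` for `iters` iterations and report ns/iter.
fn bench<F: FnMut()>(label: &str, iters: u32, mut f: F) -> u128 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let per_iter = start.elapsed().as_nanos() / iters as u128;
    println!("{label}: {per_iter} ns/iter");
    per_iter
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    // black_box prevents the compiler from optimizing the workload away.
    bench("internal logic", 10_000, || {
        let _sum: u64 = std::hint::black_box(&data).iter().sum();
    });
}
```

For the published numbers you would still use Criterion (as in the main function shown later), which adds warm-up, outlier detection, and statistical sampling that a raw loop lacks.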
When analyzing the benchmark results, there are a few key signals to watch for:
If the first benchmark shows high variance, it likely indicates lock contention or contention over a shared underlying resource.
If the difference between benchmarks (1) and (2) is significant and comparable to the internal logic cost, the issue is likely runtime-related. Further micro-benchmarking often reveals that the additional time is spent either during task scheduling or when awaiting the future’s completion. This typically occurs when the runtime is overwhelmed by task-scheduling pressure. In such cases:
If the callback is very small, consider using a blocking execution model.
If the internal computation is non-trivial, consider increasing the runtime's worker_threads.
If the difference between benchmarks (2) and (3) is large, verify server-side configuration. Ensure that middleware is not introducing excessive latency, the async RPC client is not being throttled, and the server is using the intended (custom) runtime configuration.
Tokio is not good for everything
Some RPC calls perform blocking operations. These blocking operations generally fall into two categories:
Expensive CPU-bound computation: such as hashing, signature verification, or data compression.
Synchronous I/O: such as reading or writing files, logging, or performing database queries using blocking APIs.
Running these operations directly on Tokio’s runtime worker threads blocks the runtime, potentially delaying or stalling other incoming RPC requests.
To avoid blocking the runtime, blocking operations should be moved to threads outside of Tokio’s main worker pool. There are several approaches:
tokio::task::spawn_blocking: ideal for CPU-bound tasks. It executes blocking work on a dedicated blocking thread pool managed by Tokio. Its downside is per-task overhead, which makes it poorly suited for sustained high-throughput loads.
Rayon crate: useful for CPU-intensive batch processing where the same operation is applied independently over many items. Batched RPC methods like send_batch_transactions and get_multiple_accounts currently process items sequentially; since each item is independent, Rayon can parallelize this work across CPU cores to reduce overall processing time.
Dedicated threads: suitable for long-running or isolated tasks, such as sending transactions or where you want full control over thread behaviour.
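The dedicated-thread approach can be sketched with standard-library channels; the checksum function below is a stand-in for real CPU-bound work such as signature verification:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for an expensive CPU-bound computation.
fn expensive_checksum(payload: &[u8]) -> u64 {
    payload.iter().map(|&b| b as u64).sum()
}

// Spawn one long-lived worker thread, kept entirely off the async runtime's
// worker pool. Jobs arrive on a channel; each carries its own reply channel.
fn spawn_worker() -> mpsc::Sender<(Vec<u8>, mpsc::Sender<u64>)> {
    let (tx, rx) = mpsc::channel::<(Vec<u8>, mpsc::Sender<u64>)>();
    thread::spawn(move || {
        for (payload, reply) in rx {
            let _ = reply.send(expensive_checksum(&payload));
        }
    });
    tx
}

fn main() {
    let worker = spawn_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    worker.send((vec![1, 2, 3], reply_tx)).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), 6);
}
```

In an async callback you would use a oneshot channel from an async-aware crate for the reply so the task can await it without blocking; the std channel here keeps the sketch dependency-free.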
How to avoid blocking calls in the callbacks:
First, avoid using Mutex on read-heavy shared resources; use RwLock instead so reads can proceed in parallel, since Mutex::lock() allows only one holder at a time.
Second, use non-blocking variants of APIs where possible: instead of calling RwLock::read(), call try_read(). This works when fallback options are available. For example, suppose there is contention on the cache lock due to high load. Rather than waiting on the lock, you may serve the request faster by fetching the data from persistent storage.
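The fallback pattern might look like this sketch, where the cache and storage names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Try the cache without blocking; on contention or miss, fall back to storage.
fn get_account(cache: &RwLock<HashMap<u64, String>>, key: u64) -> String {
    if let Ok(cache) = cache.try_read() {
        if let Some(v) = cache.get(&key) {
            return v.clone(); // fast path: cache hit, no waiting
        }
    }
    // Slow path: lock contended or cache miss; serve from persistent storage.
    load_from_storage(key)
}

// Stand-in for a database or disk read.
fn load_from_storage(key: u64) -> String {
    format!("account-{key}")
}

fn main() {
    let cache = RwLock::new(HashMap::from([(1, "cached-account".to_string())]));
    assert_eq!(get_account(&cache, 1), "cached-account");
    assert_eq!(get_account(&cache, 2), "account-2");
}
```

The trade-off is occasionally doing the slower storage read even when the cached value exists; under heavy contention that is usually still faster than queueing on the lock.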
Third, use async APIs, or write async versions of low-level functions where possible, so that the server remains responsive under heavy load. For example, use tokio::fs::read instead of std::fs::read where possible.
Benchmarking tips:
To create large sets of randomized mock data, always use a seeded randomizer for deterministic results.
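A deterministic generator can be as simple as this xorshift sketch (in practice a seeded generator from the rand crate works too); the point is that the same seed always yields the same mock data:

```rust
// Tiny seeded xorshift64 generator; seed must be nonzero.
struct SeededRng(u64);

impl SeededRng {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// Hypothetical mock-data helper: n pseudo-random account IDs from a seed.
fn mock_accounts(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = SeededRng(seed);
    (0..n).map(|_| rng.next()).collect()
}

fn main() {
    // Same seed, identical mock data: benchmark runs become reproducible.
    assert_eq!(mock_accounts(42, 5), mock_accounts(42, 5));
    assert_ne!(mock_accounts(42, 5), mock_accounts(7, 5));
}
```

Reproducible inputs mean a regression shown by a benchmark is attributable to the code change, not to a different random dataset.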
Use Rayon for mock-data creation when the generation loop is straightforward; it gets messier when shared state is involved. Parallelizing generation significantly reduces bench and stress-test times, and thus CI/CD time.
If you want to avoid discouraged patterns in Rust that can cause panics, such as .unwrap() or .expect(), use proper error handling. To return a custom error, you have to write your own main function like the one below instead of using the predefined macros:
fn main() -> Result<(), RuntimeBenchError> {
    let mut c = Criterion::default()
        .configure_from_args()
        .sample_size(SAMPLE_SIZE);

    // Call benches manually
    rpc_read_account_info_latency_under_load(&mut c)?;

    // Finalize all groups
    c.final_summary();
    Ok(())
}
Conclusion
Building a high-throughput server requires careful management of CPU, memory, and I/O, as well as thoughtful callback design. Applying these techniques helps avoid bottlenecks, improves resource utilization, and keeps your server responsive under load.
Next, try applying these ideas to your own server or RPC service. Use benchmarks to validate and tune the config parameters for your heavy workloads.
