In a previous post, we discussed how multi-threading may not be the solution that you’re looking for.
This time, it is. (Almost.)
We described a system where multiple endpoints (all connected to copies of the same data) run to feed a data processor. The processor doesn’t want to be kept waiting, so it consumes from many endpoints simultaneously.
Part of the viability test is to see how different source machines cope in a multi-threaded situation. To this end, we have had each endpoint run a realistic benchmark (using the same access functions as the real system). The benchmarks are incremental, starting at 1 core and working up to every core the machine has.
In this video, the microservice is started. There are no existing benchmarks, so they are generated and written to a JSON file; the benchmarking then doesn’t need to be run again.
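As an illustration, here’s a minimal sketch of that benchmark-and-cache step in Python. The names (`read_records`, `benchmark_cores`) and the JSON layout are placeholders for this example, not the actual microservice code:

```python
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor

CACHE_FILE = "benchmarks.json"  # hypothetical cache location


def read_records(batch: int) -> int:
    """Stand-in for the real access functions: fetch one batch of records."""
    time.sleep(0.001)  # simulate the IO cost of a read
    return batch


def benchmark_cores(cores: int, duration: float = 5.0) -> float:
    """Measure records/second with `cores` threads hammering the access functions."""
    deadline = time.monotonic() + duration

    def worker(_: int) -> int:
        done = 0
        while time.monotonic() < deadline:
            done += read_records(100)
        return done

    with ThreadPoolExecutor(max_workers=cores) as pool:
        total = sum(pool.map(worker, range(cores)))
    return total / duration


def load_or_generate() -> dict:
    """Return cached figures, generating and caching them on first run."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    results = {n: benchmark_cores(n) for n in range(1, (os.cpu_count() or 1) + 1)}
    with open(CACHE_FILE, "w") as f:
        json.dump(results, f, indent=2)
    return results
```

Writing the figures out means a restart skips straight to serving them, as in the video.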
Here are the figures generated by an assortment of machines on the network.
The M-series Apple device is leagues ahead.
When we query a machine, it hands its benchmarked speeds to the caller, so the caller can make informed decisions.
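On the microservice side, serving those figures could be as simple as this, assuming a Flask-style endpoint (the `/benchmarks` route and payload shape are illustrative, not the real API):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Cumulative records/sec by core count, as produced by the benchmarking step.
# (Figures derived from the Ryzen example below: 6,780, +4,386, +3,344.)
BENCHMARKS = {"1": 6780, "2": 11166, "3": 14510}


@app.get("/benchmarks")
def benchmarks():
    # The caller compares these profiles across machines to decide where to schedule work.
    return jsonify(BENCHMARKS)
```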
An example
A general-purpose Linux desktop: 16GB RAM, a 2TB Samsung SSD and an AMD Ryzen 5 3400G. The graph shows how many records are processed per second per core; the first core manages 6,780 records per second.
Add another core, and you would hope that the rate stays flat (i.e. the second core contributes the same as the first), alas no.
The second core helps, but only increases throughput by 4,386 records per second. The third adds 3,344, and by the time we’re at core 8, the marginal gain is down to 1,568, a drop of 77% from the first core’s contribution.
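The 77% figure is just the ratio of the eighth core’s marginal gain to the first core’s:

```python
first, eighth = 6_780, 1_568  # marginal records/sec for cores 1 and 8
print(f"drop: {1 - eighth / first:.0%}")  # -> drop: 77%
```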
Why do we see the degradation?
It’s IO! Whilst the cores are in use, they can’t pull data from the file system fast enough to keep themselves fed.
This graph shows how much of each machine’s performance is lost when running at full chat.
What can we do?
Assuming that we have to use whatever hardware hand we’re dealt, we’ll need to write a thread scheduler that gives the best returns based on the hardware profiles that supply it. In our dataset, the choice would be:
M3 Core 1, M3 Core 2, M3 Core 3, M3 Core 4, M3 Core 5, M3 Core 6, i5 Core 1, Ryzen5 Core 1, XPS13 Core 1, M3 Core 7, etc.
Of course, it could be that by the time i5 Core 1 has been scheduled, M3 Core 1 has finished and should be utilised before moving on to Ryzen5 Core 1. It’s tricky and needs accurate thread tracking to make the best decisions.
The scheduler will also need to take into account the performance of the data processor itself, which draws on these sources. If the scheduler overwhelms it, then adding lesser-performing cores becomes detrimental.
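Here’s a minimal sketch of such a scheduler, under a few assumptions: each machine exposes its per-core marginal throughput (as benchmarked above), the pool is a priority queue keyed on those figures, and a single capacity number stands in for what the data processor can absorb. The Ryzen figures for cores 1–3 and 8 come from this post; everything else is invented to match the ordering described above.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class CoreSlot:
    """One schedulable core, ordered so the fastest pops first."""
    neg_rate: float                       # marginal records/sec, negated for heapq
    machine: str = field(compare=False)
    core: int = field(compare=False)


def build_pool(profiles: dict[str, list[float]]) -> list[CoreSlot]:
    """Flatten per-machine marginal throughput profiles into a priority queue."""
    pool = [
        CoreSlot(-rate, machine, index + 1)
        for machine, rates in profiles.items()
        for index, rate in enumerate(rates)
    ]
    heapq.heapify(pool)
    return pool


def schedule(pool: list[CoreSlot], processor_capacity: float) -> list[CoreSlot]:
    """Greedily take the fastest remaining core until the processor is saturated."""
    chosen, total = [], 0.0
    while pool:
        rate = -pool[0].neg_rate
        if total + rate > processor_capacity:
            break  # adding lesser-performing cores past this point is detrimental
        chosen.append(heapq.heappop(pool))
        total += rate
    return chosen


# Marginal records/sec per successive core. Ryzen cores 1-3 and 8 are from this
# post (4-7 interpolated); the M3, i5 and XPS13 columns are purely illustrative.
profiles = {
    "M3":     [18_000, 17_500, 17_000, 16_500, 16_000, 15_500, 6_400, 6_100],
    "i5":     [7_500, 5_200, 3_900, 2_700],
    "Ryzen5": [6_780, 4_386, 3_344, 2_900, 2_500, 2_100, 1_800, 1_568],
    "XPS13":  [6_500, 4_100, 3_000, 2_200],
}

for slot in schedule(build_pool(profiles), processor_capacity=120_000):
    print(f"{slot.machine} Core {slot.core}: {-slot.neg_rate:,.0f} records/sec")
```

In a live system you’d push a slot back onto the heap when its work finishes, so a freed M3 Core 1 naturally outranks a not-yet-used Ryzen5 Core 1, which is exactly the thread tracking described above.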
What have we learned?
Squeezing maximum performance from the available hardware isn’t easy. Some designs (hi, Apple!) are far more efficient, both in performance and in power consumption (not previously mentioned). Mixing the different types for the greatest performance will be a complex task.