As systems become complex, our ability to understand, and more importantly foresee why things behave in a certain way diminishes. Hence we resort to symptomatic treatment as a tactic to resolve problems. This is commonplace when it comes to tuning the performance of large scale systems. Our willingness to disproportionately value benchmarks over theory is proof that we are losing ability to reason about emergent behaviour of non-trivial systems. This article is hopefully the first of a series wherein we hope to uncover some of the important aspects that helps us to better understand and anticipate how things will pan out in the real wold. We shall now focus on the relationship between throughput and turnaround time. A poor understanding of this relation causes all sorts of wrong expectations on what would happen should more hardware be provided to applications, especially those that deal with online the request-response paradigm such as web servers.

Concurrency:parallelism::throughput:turnaround-time

Almost everyone dealing with scaling systems has heard of the expression “concurrency is not parallelism”. There are a couple of links in the footnotes^[1] that will help you understand it better if you have never heard of it. The intent of these articles and talks are to help novices and expert beginners navigate through the differences of what seems like equivalent concepts. A similar fallacy seems to exist when it comes to the unit of measurement used in web application benchmarks. The two common metrics that I have encountered are requests per second and response time per request. In other words, we are measuring either throughput or turnaround time respectively. There does exist a very specific scenario wherein these two are near equivalent. It is based on a singular assumption which is rarely true in present day server environments

Exactly one logical unit of every resource

If we have just a single compute core and our workload was purely compute bound, then it is obvious that the following equation holds true:

requests per unit time = 1 / (response time per request)

One might wonder what happens if things are not purely compute bound. Let us say for example that we append a fixed amount of data to a log file after processing each request. We have added an I/O operation here but this does not change the equation; all it does is increase the time needed to process a single request. The same holds true if we add a single, proverbial database server that gets accessed as part of serving the request. The equation continue to remain valid as long the database server & connecting network is used exclusively for the purpose of serving this application. In this scenario, we can assume all the associated parts to be a part of the single web application setup and continue with our assumption around the equivalence of throughput and turnaround time.

Adding more servers to make code run fast

One of common observations is as rate of request arrival (requests per unit time) increases, then the time taken to respond to a single request goes up. Adding more servers in those situations sometimes helps in reduction of the response time of a single request. Unfortunately, this leads to a superstition that adding more servers will always help in making the server respond faster.

Interaction effects of concurrency

Processing requests concurrently (i.e. with interleaving) on even a single core machine can slow things down due to a variety of reasons.

For example, if the working memory needed for a single request is say 100 MB a server has 4GB of RAM, then one can expect deterioration in the response time if more than a certain number of requests are getting processed concurrently. We can try and reduce the concurrency levels to increase the response time if this were a pure compute workload. However, if we do need higher concurrency to offset any I/O waits i.e. continue to use the CPU for handling newer requests as the existing ones are in wait state, then throttling concurrency levels down would not be useful. The memory pressure in the example above can be relieved by some mechanism to improve the situation. This can be achieved either by adding more RAM to the existing server or by adding a new server of the same configuration and splitting the workload (which effectively is adding more RAM to handle the same workload).

Similarly, if the request processing involved random access of data on a rotational medium like a hard disk, then one can see how an increased concurrency can stress this part of the system more and how replicating the entire server setup to reduce concurrency would speed things up.

Workload parallelism is actually concurrency

If you were around the time when multi-core CPU processors became the norm in servers that claimed to run on commodity hardware, then you would remember how moving to a dual core server would never give twice the amount of throughput. A lot of the problems back then was caused to do the presence of locks in the code all the way from the operating system to application logic in itself. Over time, these limitations have been removed in almost all places but we still do run into disappointments. The ideal world would have been one where the following equation was true:

requests per unit time = (number of cpu cores) / (response time per request)

Here are some common speculations that people make when trying to answer for the deficit.

There is a non trivial amount of CPU scheduling cost to run “n” execution threads in parallel
“Database servers” cannot make use of multiple cores effectively.
The network is slowing down

While some of these might end up being legitimate concerns, the real reason why they fail is because of an assumption we made earlier on.

Throughput as a linear function of overall parallelism

The original equation was designed for a pure CPU bound workload. We then extended it to workloads involving other resources as well with the assumption that they are available in fixed quantities for handling the same set of requests. What one misses out usually is that not all resources are parallelised by a factor of “n” when we go from a single compute core to “n” compute cores. This requires the network bandwidth must go up by a similar factor if we were using a network in our application; the number of hard drives need to go by a similar factor, the number of cores on the proverbial database server needs to go up the same factor and so does the number of simultaneous communication channels (connections) between the two machines. Failing to consistently scale up everything by the same factor implies that we shall be limited by parts that have been left behind. This is a corollary of Amdahl’s law.

Put it another way, if you were to increase parallelism in only certain parts of the system, then the overall improvement is limited by the parts whose parallelism you did not improve; unless those parts were severely under-utilized to begin with.

Request level parallelism

If we were to view our workload as a large amount of requests arriving at very close intervals of time and we commence processing of a request while at least one another request is being processed; then the system is a concurrent system. Additionally, it also a parallel processing system at the level of entire workload if more than one request processing is active at the same time. This could happen either because more than one instance of the set of hardware resources need to process a request is available within a server or because the load is being distributed across multiple identical set of servers. An extreme case of this would be one wherein application farm is defined as a collection of heterogeneous servers comprised of say 3 web servers, 5 cache servers, 2 database servers and the workload is distributed across multiple such identical farms. Such workload level parallelisms helps in increasing throughput while maintaining a turnaround time. The truth that often gets missed out is that such tactics are merely helping in maintain a baseline turnaround time. Any perceived improvements are just a matter of clawing back of degradation caused due to either concurrent or parallel execution of multiple requests. What it cannot possibly do is improve the baseline.

more servers y u noo make my code run fast?

So, if the fundamental quantity of work that needs to be done to service a single request has gone up, one cannot hope to regain the turnaround time by merely making “n” additional copies of the hardware available. The real choices are as follows:

Make the sequential part of the code run faster
Find ways to parallelise the some of the steps

The latter is an aspect that is often overlooked by most practitioners of web application development. The one narrow field where it is considered a common practice is the problem of searching a large corpus of data to find a match for a given set of criteria. This is problem is solved by dividing the data across multiple servers, issuing the search request to each data partition and then merging the results. The historical motivation for doing this has been limits on data storage & retrieval times of a single server.

What gets missed out here is we could adopt the same technique in cases where the data set is small enough but the computation parallelism exceeds what a single server has to offer. The one place where this is the norm is GPU programming. GPU designs have been entrenched in the symmetric parallel processing paradigm for while. Current mainstream programming techniques are weak in asymmetric parallel processing and also when it comes to symmetric parallel processing. By weak, we mean that programming language level constructs to achieve these are far and few. More often than not, one has to usually resolve to specialized programming kits to exploit parallelism and these tend to feel like orthogonal bolt-ons on top of the underlying language.

The ability to tap into intra-request parallelism becomes essential in a world wherein sequential processing speed of processors (most notably clock speeds) are not going up but the level of parallelism available on a single server is going up. One no longer has to think of a divide-conquer-merge problem as something that necessarily involves multiple servers. That being said, there will continue to remain a subset of problems that warrant the usage multiple servers where the nature of problem places a strain on components that can’t be scaled within a given server.

Parallel processing as a technique to speedup turnaround time in handling a given request should no longer be seen as an esoteric problem confined to HPC. This requires change in the mindset of large base of programmers and also an evolution of programming languages and more importantly execution run-time layers to embrace this as a core concept.

Conclusions

All things being identical, an increase in turnaround time would reduce the throughput of a system (corollary of Little’s law).
An increase in concurrency of the entire workload can slow down the turnaround time and thus potentially reducing overall throughput as well.
Concurrency induced slowdowns can be recouped by adding more hardware. This helps either by increasing workload level parallelism or simply by reducing the stress on some strained sub-systems. This does not hold true when concurrency limiting constructs such as cross-request locks are a part of implementation.
Adding more hardware cannot speed up the processing of a single request once concurrency led slowdowns are mitigated.
Speedup of the turnaround time of a single request can be increased either by improving the sequential speed of execution or by exploiting parallelism within the scope of processing a single request.

Footnotes

Golang/Rob Pike’s take on concurrency is not parallelism
Haskell’s take on concurrency is not parallelism
Symmetric parallel processing in Java using streams.

Statistically incorrect

Understanding web server cluster performance from first principles