A long time ago, people used to worry about the efficiency of the software they wrote. Then came a time when processors kept getting faster every month, and the pace did not slow down even after crossing the 500MHz mark. Somewhere around this time, people started writing exceptionally bloated software, and the bloat grew at a phenomenal pace. Then came the new catchphrase: “hardware is cheap, we can throw more hardware at it”. In one fell swoop, all bloatware became perfectly acceptable since the bloat now seemed affordable. This was precisely the point at which most people forgot their CS fundamentals. If you have done a course on CPU scheduling, you would know these metrics:
- CPU utilisation
- Turnaround time
- Waiting time
- Response time
I will use the web application space as the example for the rest of this discussion, since it has a fairly large development community and is littered with the “bloatware plus hardware is cheap” mentality. In web applications, the consumer usually cares about response times and turnaround times. Say solution A takes a full second on the server to process a single web request, while solution B takes 50 milliseconds. The number people tend to chase instead is requests/second, and it is chased with the now infamous “throw more hardware at it” approach: give A enough servers and it matches B on requests/second, yet every user of A still waits a full second. A focus on throughput works for businesses whose consumers have nowhere else to go and whose notion of growing the business is growing volume. You don’t hear of people switching banks because of how fast (or slow) their websites load, because the main product is the banking service and not the website, i.e. you would worry more about interest rates than about website response times. Businesses whose primary offering is the website itself cannot take such liberties.
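To put rough numbers on solutions A and B, here is a minimal sketch. It assumes each server handles one request at a time and that the load is a hypothetical 100 requests/second; neither figure comes from a real system, and `servers_needed` is a name made up for this illustration.

```python
import math

def servers_needed(target_rps: int, service_time_ms: int) -> int:
    """Servers required to sustain target_rps, assuming (for illustration)
    that each server processes exactly one request at a time."""
    # One server completes 1000 / service_time_ms requests per second.
    return math.ceil(target_rps * service_time_ms / 1000)

TARGET_RPS = 100  # hypothetical load, not taken from any real system

for name, service_time_ms in (("A", 1000), ("B", 50)):
    count = servers_needed(TARGET_RPS, service_time_ms)
    print(f"solution {name}: {count} servers, each user still waits {service_time_ms} ms")
```

Under these assumptions A needs 100 servers and B needs 5 to reach the same requests/second, and the extra 95 servers do nothing for the one second each user of A spends waiting.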
Turnaround time is the total time taken to service a request. So, if you have a slow web page, you can keep adding hardware to take on more volume (assuming the solution can be scaled out infinitely), but the experience of each individual user is not going to improve. Real world experience also suggests that, left to themselves, things start to slow down as you scale out. A knee-jerk fix is to do things in parallel using threads. That usually doesn’t get you far either, thanks to Amdahl’s law: the part of the work that must run serially caps the speedup no matter how many threads you add. This is where all those classes on algorithms and architecture, and abstinence from bloatware, begin to make a difference.
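For anyone who has not met Amdahl’s law in a while, the sketch below evaluates the usual formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The 90% parallel fraction is an assumed figure chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Best-case speedup predicted by Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

PARALLEL_FRACTION = 0.9  # assumed: 90% of the request handling parallelizes

for workers in (2, 8, 64, 1024):
    speedup = amdahl_speedup(PARALLEL_FRACTION, workers)
    print(f"{workers:5d} workers -> {speedup:.2f}x")
```

With 10% of the work stuck being serial, the speedup can never reach 10x however many threads or cores are added.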
Response time is what is usually called time to first byte in the internet world. In trying to solve the turnaround time problem, one of the areas people work on is minimizing the context switches between user space and kernel space. Zero copy is one example of such an optimization. The most common example, however, is buffered files (or streams if you are from the Java world). Some people (and their software creations) take this to the extreme and try to send out the entire HTTP response in one shot, hoping to minimize the number of system calls needed to get the job done. It turns out that this makes for a worse user experience. Put another way, it is better to start sending something to the user after 200 milliseconds (ms) and finish over the next 4 seconds than to start sending 2 seconds after the request was issued and finish in the next 500 ms. In fact, this is a harder problem to solve for two reasons:
- Left to themselves, most web servers aren’t eager to push out small chunks of data (the easier problem to solve)
- Dynamic pages, especially the ones generated by MVC frameworks, do not make the response available to the web server until they have fully constructed the response body. Some of these solutions offer no straightforward way to push out data in parts, while others have explicit mechanisms for achieving this effect.
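As one example of an explicit mechanism, a Python WSGI application may return an iterable, and each yielded block is meant to be handed to the client as it is produced. The sketch below is only an illustration: `app`, `render_slow_body` and `/site.css` are hypothetical names, the sleep stands in for slow page generation, and the small loop at the end calls the application directly instead of going through a real server.

```python
import time

def render_slow_body() -> bytes:
    """Hypothetical stand-in for the expensive part of building the page."""
    time.sleep(0.05)  # pretend this is a slow query or template render
    return b"<p>report</p>"

def app(environ, start_response):
    """WSGI application that yields the response in parts."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    # The head goes out first, so the browser can see the stylesheet
    # reference before the slow part of the page has been generated.
    yield b"<html><head><link rel='stylesheet' href='/site.css'></head><body>"
    yield render_slow_body()
    yield b"</body></html>"

# Drive the app by hand to see when each chunk becomes available.
start = time.perf_counter()
for chunk in app({}, lambda status, headers: None):
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{elapsed_ms:6.1f} ms  {len(chunk)} bytes")
```

The same `app` could be served with `wsgiref.simple_server.make_server`. Whether the early chunk actually reaches the browser early still depends on the server and on any proxies in between, which is the first of the two problems listed above.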
For those of you who are still wondering why something that puts extra load on the server and takes longer to finish is considered better by the user, there are two reasons:
- Psychological: Giving the user an early indication of progress creates some incentive to wait, which sending no information does not. Even getting the status bar to say “receiving from …” as opposed to “sending request to …” makes a difference.
- Pipeline effect: An average web page has references to various resources (images, external CSS files, etc.) that are needed to completely render the page. It turns out that most browsers can initiate the retrieval of those resources before the page has loaded completely. Pushing out a partial response early gives the browser a chance to get started on those other things sooner. So while the additional flushes done on the server side might slow down the turnaround time for the basic page transmission, the overall turnaround time as seen by the user can still drop with this technique.
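The pipeline effect can be sketched with a toy timing model. Every number below is hypothetical, and the model assumes that all resource references sit in the first flushed chunk and that the browser fetches them in parallel with the rest of the HTML; `page_done_ms` is a made-up helper.

```python
def page_done_ms(first_flush_ms: int, html_done_ms: int, resource_fetch_ms: int) -> int:
    """Time at which both the HTML and its referenced resources have arrived.

    Assumes the browser starts fetching resources as soon as the first
    chunk arrives and overlaps that with receiving the remaining HTML.
    """
    resources_done_ms = first_flush_ms + resource_fetch_ms
    return max(html_done_ms, resources_done_ms)

RESOURCE_FETCH_MS = 1500  # hypothetical time to fetch images, CSS and so on

# One-shot response: nothing is visible to the browser until 1000 ms.
buffered = page_done_ms(first_flush_ms=1000, html_done_ms=1000,
                        resource_fetch_ms=RESOURCE_FETCH_MS)
# Early flush: head at 200 ms, HTML finishes slightly later at 1100 ms.
flushed = page_done_ms(first_flush_ms=200, html_done_ms=1100,
                       resource_fetch_ms=RESOURCE_FETCH_MS)

print(f"one-shot response: page complete at {buffered} ms")
print(f"early flush:       page complete at {flushed} ms")
```

In this model the flushed page finishes its HTML 100 ms later, yet the page as a whole completes at 1700 ms instead of 2500 ms because the resource fetches started 800 ms earlier.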
Since throughput signifies the total amount of work done in a unit of time, throwing more hardware at the problem can sort of solve it. As I mentioned earlier, if your solution scales infinitely, then the hardware addition technique works. The reasons why things do not scale infinitely are:
- There end up being some components that are hard to scale infinitely, such as the top level load balancer and the pipes it is connected to
- Amdahl’s law
One of the most common fixes, and a borderline superstition, is to run more threads. In a CPU bound world, having more threads of execution than the number of compute units slows things down. In a NUMA based world, certain workloads can suffer even when the number of threads matches the number of compute units available. For workloads that are I/O bound, however, threads do help as long as the different threads are not contending for the same underlying I/O resource. The one exception to the contention rule is rotating storage media, where the amortized performance increases as concurrent requests increase, but only up to a certain point.
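The I/O bound case is easy to try out. In the sketch below, `time.sleep` stands in for a blocking network or disk wait, and every call waits on its own independent "resource", so there is no contention. `fake_io_call` and `run` are hypothetical helpers and the 50 ms delay is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_call(delay_s: float = 0.05) -> float:
    """Stand-in for a blocking I/O wait; no CPU work is done here."""
    time.sleep(delay_s)
    return delay_s

def run(tasks: int, threads: int) -> float:
    """Run `tasks` fake I/O calls on `threads` threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces all results to be collected before the clock stops.
        list(pool.map(lambda _: fake_io_call(), range(tasks)))
    return time.perf_counter() - start

serial = run(tasks=8, threads=1)
threaded = run(tasks=8, threads=8)
print(f"1 thread:  {serial:.2f} s")
print(f"8 threads: {threaded:.2f} s")
```

The threaded run finishes sooner only because the waits overlap. Point all eight threads at one saturated resource, or replace the sleep with real computation, and the advantage shrinks or disappears.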
In effect, the reasons why throughput stops increasing beyond a certain point, whether you raise the concurrency of task execution or throw in more hardware, are very real.
We are now in an age where people believe not only that hardware is cheap but also in cloud computing, which promises provisioning of infinite hardware (i.e. more than you can afford). The thing to remember is that you might extend the life of a given solution for quite some time by throwing in hardware (at a diminishing rate of return), but if you are chasing response times, you will have to keep improving your design as opposed to relying on hardware.