Strings, lists & gatherers – Statistically incorrect

Problem statement: You need to send a dynamically constructed stream of bytes over some I/O channel.

An illustrative example

As always, I shall take an example in the internet space and more specifically in HTML. Let us say that you have a classical two dimensional array whose dimensions are not known at compile time. The contents of this array happens to be strings. This 2d array needs to be printed in the form of a valid HTML table. To make things more interesting, each cell in the table may be given a CSS class based on the content of the cell. The output could look as follows:
<table> <tr> <td class="ve">one</td> <td class="ve">two</td> <td class="ve">three</td> </tr> <tr> <td>four</td> <td class="ve">five</td> <td>six</td> </tr> <tr> <td>seven</td> <td>eight</td> <td class="ve">nine</td> </tr> </table>
The rule for having the optional CSS class is left as a puzzle for the bored reader.

A naïve approach

The most straightforward technique is to keep writing one chunk at a time, peeking into the the array as necessary. The reality is that very few people do it since everyone is taught that unbuffered I/O is expensive. The origin of this problem comes the fact that in most mainstream operating systems, only the kernel is allowed to do actual I/O and hence a system call needs to be made. The cost of switching from user space to kernel space and back is considered to be high.

The “fix”

Hence, the strategy is to accumulate a fair amount bytes that needs to be transmitted and then do fewer system calls. I/O APIs in programming languages usually provide transparent APIs where this accumulation happens. It usually comes with a burden of expecting the programmer to indicate the end of the stream so that any left over bytes can be flushed by making one last syscall.

Memory copy

One of sources of overhead (not the largest) of making syscall arises from having to copy data from user space to kernel space. This data copy operation happens ever so often in user space when string are concatenated. If either the user program directly concatenates or the buffering implementation does so in the end, part of the syscall overhead is incurred. The plausible reason is that the most common form of output APIs that programmers are exposed to takes a single stream/string.

Fewer copies

A far less known technique known as scatter/gather I/O exists that sort of addresses this problem. The gather operation is used for writing output in a single shot from multiple input byte buffers. This API (in both POSIX and Windows) accepts an array of buffers over which is sequentially iterates and writes the output. The problem is now reduced to having an array with pointers/references to all the buffers. If you have come down to having to operate at this level, chances are, you might not want to deal with magically expanding arrays. Your option at that point would be to use a linked list to accumulate all the references to the buffers and then turn it into an array just before performing the write operation.

But why all this you ask

… because unless you are coding at what is considered fairly low levels these days (such a POSIX & C), you do not realize how many times you are ripping up and recreating little byte streams in multiple layers of code using both direct and indirect constructs. This problem becomes very evident when trying to generate outputs in formats such as XML or JSON where there is a mix of a lot of what I call gluing bytes sprinkled liberally between the actual payload. Given an arbitrarily nested variable loosely typed languages like say the ones available in perl/php/python/javascript, I am wondering what is the most elegant way to arrive a representation like JSON and perform an output operation without mindless string concatenation. Thoughts are welcome.