It has been close to a decade since I last had to code in C/C++ as part of my job. The only occasions on which I indulged in C++ were the various algorithmic coding contests that I used to take part in a few years ago. The only thing that helps me keep in touch with C is writing modules for the nginx web server. I happen to be the creator and maintainer of the nginx module that lets you sign requests when speaking to AWS S3 as the backend. AWS introduced the V4 version of its signature generation mechanism around three years ago and I had to update my module to support it. I managed to code up the bare minimum support for this specification earlier this month. The change took way longer than I expected and this post tries to capture some of the reasons. The difficulties that I faced are attributable either to the C programming language itself or to how the nginx code is organized. This blog post focuses on some of the idiosyncrasies of having to work with C. Old time programmers may scoff at the notion of C being called out as the odd man out, but that is the truth in the present day and age, wherein less than one percent of all professional software development happens in this language. I shall make another post to divulge the specific reasons because of which I needed multiple years to write what was essentially less than 1000 lines of C code.

Null terminated strings

Anyone who has coded in C has a recollection of null terminated strings and the innumerable buffer overflows that come with not storing the length of a string as an explicit value and using it for bounds checks. Any sane real world program would represent a string as a pair of a pointer and its associated length. Nginx is no exception to this and has its own struct named ngx_str_t to represent this pair. The traditional usage of this length was to serve as a defensive measure to guard against the absence of the null byte, as witnessed in the API design of functions such as strnlen(), strncpy(), etc. Storing the length explicitly also lets us think of strings in unexpected ways. For example, we can now represent a prefix of a given string by returning the same pointer and a smaller associated length. In addition, this frees us from the need to have the null byte present anywhere in the character array.

In short, a pair of a pointer to a character along with an unsigned length “n” can either mean a null terminated string with a length safety check of n, or simply a character array whose first n characters represent a string (without the last character having to be the null byte). These two are not interchangeable and hence one needs to be aware of which style is being used and ensure that all pieces of code dealing with the string follow the style in use. It is quite possible in a large project that the various libraries used in the code base use different styles, and one must put in the necessary effort to uphold the invariants.
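The difference between the two styles can be sketched as follows. This is a minimal illustration and not nginx code: the str_view type and the functions capped_length, prefix and view_equals are hypothetical names made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string, similar in spirit to ngx_str_t. */
typedef struct {
    const char *data;  /* not necessarily null terminated */
    size_t      len;   /* number of valid bytes at data */
} str_view;

/* Style 1: data is null terminated and n is only a cap on the scan. */
static size_t capped_length(const char *data, size_t n)
{
    size_t i = 0;
    while (i < n && data[i] != '\0') {
        i++;
    }
    return i;
}

/* Style 2: exactly len bytes are valid, so a prefix is the same
 * pointer with a smaller length. Nothing is copied or terminated. */
static str_view prefix(str_view s, size_t n)
{
    str_view p;
    p.data = s.data;
    p.len = n < s.len ? n : s.len;
    return p;
}

/* Compares a counted string against a null terminated literal. */
static int view_equals(str_view a, const char *literal)
{
    size_t n = strlen(literal);
    return a.len == n && memcmp(a.data, literal, n) == 0;
}
```

Taking the first 13 bytes of "Authorization: AWS4" with prefix() yields a view equal to "Authorization", yet handing that same pointer to capped_length() with a generous cap reports 19, because the null byte sits at the end of the original buffer. Printing such a view with printf needs the "%.*s" format with the length passed as an int.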

Recursive descent memory allocation

Languages with automatic memory management receive a lot of attention around memory deallocation schemes, specifically the virtues of having a good garbage collector. It is not until you start coding in C after years of working with such languages that you realize that memory allocation is also a loquacious affair. Consider the struct used to represent strings as mentioned in the earlier section.

struct string {
 char* data;
 unsigned len;
};

Allocating such a string on the heap would entail first allocating the struct itself, followed by allocating the character array. Now imagine what it would take to represent HTTP headers. These would have to be represented as a map of strings to strings. Creating and populating such an entry would require one to allocate the map, followed by a tuple representing the key-value pair instance, followed by the allocation of each string. Note that we have not touched upon the code that actually maintains the underlying hash implementation, as there would be some library code for the data structure manipulation. What we called out here is the work that would have to be done to interact with such a library.

Dealing with any form of nested structures that need to be managed on the heap requires explicit allocation of memory for every non-scalar field at some point in the code. This starts to feel like drudgery after a while.
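To make the drudgery concrete, here is a rough sketch of what creating a single header entry might look like with plain malloc. The struct header type and the string_create, string_destroy, header_create and header_destroy functions are hypothetical and exist only for this illustration. The map itself is left out.

```c
#include <stdlib.h>
#include <string.h>

struct string {
    char    *data;
    unsigned len;
};

/* Hypothetical key-value pair for one HTTP header. */
struct header {
    struct string *name;
    struct string *value;
};

/* Two allocations per string: the struct and the character array. */
static struct string *string_create(const char *src)
{
    struct string *s = malloc(sizeof *s);
    if (s == NULL) {
        return NULL;
    }
    s->len = (unsigned)strlen(src);
    s->data = malloc(s->len + 1);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->data, src, s->len + 1);
    return s;
}

static void string_destroy(struct string *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}

/* Five allocations in total for a single header entry. */
static struct header *header_create(const char *name, const char *value)
{
    struct header *h = malloc(sizeof *h);
    if (h == NULL) {
        return NULL;
    }
    h->name = string_create(name);
    h->value = string_create(value);
    if (h->name == NULL || h->value == NULL) {
        string_destroy(h->name);
        string_destroy(h->value);
        free(h);
        return NULL;
    }
    return h;
}

static void header_destroy(struct header *h)
{
    if (h != NULL) {
        string_destroy(h->name);
        string_destroy(h->value);
        free(h);
    }
}
```

Every allocation has its own failure path, and every one of them needs a matching free in the reverse order.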

Functions that ask for memory

One of the peculiar things about working with C is that well written libraries, and even functions in applications, never allocate the memory they need. Instead, they expect to be given a memory location where the results should be placed. For example, a well written function that provides a base 64 encoded version of a given string will not allocate the memory needed to represent the result. Instead, it would expect the caller to provide the destination where the result needs to be populated. This would seem awkward if you are used to designing APIs in almost any other language, including C++. The unsaid rule in C seems to be that no public API/library is permitted to either allocate or deallocate memory. This is a control that always needs to be in the hands of the caller. We shall see a very good rationale in the next section. However, that would not be the sole reason for following this rule.
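A small sketch of this calling convention follows. It uses hex encoding rather than base 64 purely to keep the example short. The names hex_encoded_len and hex_encode are hypothetical and do not come from nginx or any other library.

```c
#include <stddef.h>

/* Tells the caller how many bytes the output will occupy. */
static size_t hex_encoded_len(size_t n)
{
    return n * 2;
}

/* Writes the hex form of src into dst, which the caller owns.
 * Returns 0 on success and -1 if dst is too small.
 * The output is not null terminated. */
static int hex_encode(const unsigned char *src, size_t n,
                      char *dst, size_t dst_cap)
{
    static const char digits[] = "0123456789abcdef";
    size_t i;

    if (dst_cap < hex_encoded_len(n)) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        dst[2 * i]     = digits[src[i] >> 4];
        dst[2 * i + 1] = digits[src[i] & 0x0f];
    }
    return 0;
}
```

The caller decides whether the destination lives on the stack, on the heap or inside some allocator of its own. The function never needs to know.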

Semi-automatic garbage collection

We shall now touch upon the topic of arena based memory allocators, as nginx uses this concept to simplify memory management. Apache httpd was the first industrial grade web server that also happened to be written in C. It uses the notion of application developers never having to free memory allocated on the heap, provided they are certain that the memory region in question will never be accessed beyond the lifetime of a given HTTP request. This is a very robust and convenient mechanism that ensures that there are no memory leaks and also inhibits the dereferencing of dangling pointers. The only cognitive overhead imposed by this mechanism is that every allocation on the heap must be requested against a given arena object as opposed to a global allocator.

The nginx function for allocating memory on an arena is void *ngx_palloc(ngx_pool_t *pool, size_t size). Contrast this with malloc, whose signature is void *malloc(size_t size). Note that the datatype representing an arena is named a pool. This misnomer is consistent with the nomenclature used by Apache httpd. The term pooling has taken on the connotation of reuse over the years; however, no reuse happens in these arenas. The memory is released by nginx at the end of the request by freeing all the allocations that were made against this arena. Most arena based allocators internally rely on a single large chunk of memory (or a list of chunks) from which individual allocations are carved out. The entire chunk is freed as a single operation in the deallocation phase, as opposed to freeing the individual allocations that were performed on the arena.
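The core idea can be sketched in a few lines. This is a deliberately simplified, single chunk arena and not how ngx_pool_t is implemented. The arena type, the arena_init, arena_alloc and arena_destroy functions and the 16 byte alignment are all assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment that suits common types on typical platforms. */
#define ARENA_ALIGN 16

/* Hypothetical single chunk arena. */
typedef struct {
    unsigned char *base;  /* the one large chunk */
    size_t         cap;   /* size of the chunk in bytes */
    size_t         used;  /* bytes handed out so far */
} arena;

static int arena_init(arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL ? 0 : -1;
}

/* Carves size bytes out of the chunk; there is no per-allocation free. */
static void *arena_alloc(arena *a, size_t size)
{
    /* Round the next offset up to a multiple of ARENA_ALIGN. */
    size_t start = (a->used + (ARENA_ALIGN - 1)) & ~(size_t)(ARENA_ALIGN - 1);

    if (start > a->cap || size > a->cap - start) {
        return NULL;  /* a real arena would chain another chunk here */
    }
    a->used = start + size;
    return a->base + start;
}

/* One free releases everything that was allocated on the arena. */
static void arena_destroy(arena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = 0;
    a->used = 0;
}
```

A request handler would create one arena when the request arrives, pass it to everything that needs memory, and destroy it when the response has been sent.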

The absence of arena based memory management would expose us to the full perils of the low level memory management that the C language imposes on its users. I managed to dodge this hassle thanks to coding an nginx module whose operations required no state management beyond the lifecycle of a single request.

Unit testing frameworks

Unit testing as a formalized endeavour and a commonplace practice came about many years after programming in C picked up. Unit testing frameworks specifically targeting C were, historically speaking, almost unheard of. One often thinks of a unit testing framework as no more than a library that provides variants of the equality assertion check and some fancy reporting around it. Things get a lot more complicated in C. A big concession that one needs to make is that the code under test can be buggy, and bugs in C can have hazardous side effects.

Imagine a test suite that contains five test cases. It is quite possible that the third test that gets executed passes its various assertions but corrupts a portion of memory. The next test case might rely on the correctness of the contents of that area of memory. The fourth test case might then fail not due to any bug in the code under test but due to the inadvertent memory modification performed by an earlier test case. This scenario is plausible in any language, as the code under test might be manipulating more variables than we expect it to. The odds of this happening in a language like C are higher, as we can end up manipulating variables that we are not even aware of due to problems such as buffer overflows, stack overflows, dereferencing of dangling pointers, etc. Another complication is that the testing framework would have to abruptly stop the execution of the test suite once a segmentation fault is encountered. Unhandled exceptions of any sort in other languages would simply result in a single test failing and would not bring the entire test suite run to a grinding halt.

Unit testing a piece of C may be no walk in the park, but it is still worthwhile, as all the conceptual goodness of having a unit test suite associated with a piece of code still remains.

Lack of runtime type safety

Type safety is one of my favourite topics. The most liberal type safety that comes to mind is weak dynamic typing. An example of this can be found in Perl. You can assign an integer value to a scalar, pass it around the codebase without knowing what type is flowing, and finally use it in a context that expects a string. The language has some known rules for how to meaningfully cast a numeric value into a string and you can then work off it. In fact, the language also defines rules by which every string can be cast to an integer. While I might not personally agree with such liberal data transformations, there is at least a defined rule. JavaScript is another such language with fairly crazy rules when it comes to how types are handled. C, however, is the most evil of them all. The notorious void pointer is essential in cases where we are trying to emulate polymorphism, generics and other such dynamic type aware features. The well known problem is that you can accidentally dereference a pointer pointing to the root of a B-tree as though it were a string and, with some luck, you will end up getting some string without triggering a segmentation fault. There would be no language or runtime level indication that this has happened unless and until you examine the contents of the string and find it to contain something totally unexpected. One almost forgets the need to deal with such surprises if one has been working in other high level languages.
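Here is a contrived sketch of that accident. The struct btree_node layout and the text_length function are hypothetical. The node is deliberately laid out so that its first bytes happen to form a terminated string, which keeps the read inside the object. With most real layouts the same call would read past the node and the behaviour would be undefined.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical B-tree node; it is not a string in any sense. */
struct btree_node {
    char keys[4];
    int  nkeys;
};

/* Expects a C string but accepts void *, so it cannot check
 * what it was actually given. */
static size_t text_length(const void *p)
{
    return strlen((const char *)p);
}
```

Passing the address of a struct btree_node to text_length compiles without a cast and, for a node whose keys hold 'a', 'b', 'c' and a null byte, returns 3 at runtime as though nothing were wrong.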

None of this is new news

It is true that there is no revelation in this post and that C programmers have lived with all of this all along. However, it feels cumbersome to deal with these issues after having spent years not dealing with them. I have written this piece to serve as a record of the actual things that I had to put in an effort to get right when moving back to C.