Is indexing a form of caching? – Statistically incorrect

A good amount of software engineering goes into an activity that people describe as “make code run fast“. What people usually mean by that statement actually is “make code run inexpensively“. The reason why I say this is performance improvements are usually attempts at better overall resource utilization. There are a few situations wherein the objective is to actually make the code run faster even at a higher cost but those are few and far between.

The two most common techniques that comes to mind when trying to make code running better are (a) algorithmic efficiency, and (b) caching. What is usually not obvious is that these two feed off each other more often than one realizes. For example, memoization is a form of caching, but the classical case where one encounters it is while studying dynamic programming as an algorithmic approach. The idea of this article is to expand our outlook on caching beyond a term that gets flung around as pseudo-intellectual way of saying “let’s use memcached” (or redis if you are from a younger generation).

When do we consider the usage of caching

An opportunity to speed things up by caching is characterised by the following:

There exists an function or an operation that is deemed to be fairly expensive (either in terms of compute or I/O or both)
The output of this operation is not just deterministic but also immutable for a given set of inputs
The operation is requested multiple times with the same set of inputs

The most obvious improvement that can be brought about in such situation is to avoid the expensive re-computation by retaining results after performing step #1 in the hope that the same operation would be requested again. We shall confine ourselves to solutions that also have an additional constraint of having to be less expensive (after amortisation) than, repeatedly performing step #1. The recognition of this aspect is critical in trying to understand the generalised problem, as most real world caching solution usually impose restrictions on cache sizes as a cost saving tactic. One should not confuse this with cache invalidation; as the need for invalidation arises when we are trying to cache mutable results.

Enter indexing

This article is just not about caching; it is about indexing. We start first try and establish their similarities despite the seeming differences between the two. Both caching & indexing can be thought of as solutions being implemented over a collection of items. Let us denote the abstract expensive operation as f(x) -> y. We commonly refer to the situation where “x” happens to be the identity of the object and “y” is the object in itself; as caching. Indexing on the other hand involves “x” being a value that one of the attributes of the object can take and “y” being the set of objects (or just their identities) that match it. Now, if we were to think of the universe of search results as the collection of objects with its associated search query being it identity, then indexing is identical to traditional caching. Trait #3 that a problem should posses to make it seem like a good candidate for caching (same operation requested multiple times) isn’t directly obvious in the case of indexing as it manifests itself slightly differently. One usually does not build indices unless there is an expectation of frequent access that would utilise the index, as there is non trivial cost in maintaining indices.

The challenges in maintaining an index

The hard part of maintaining indices is identical to that of caching: being able to provide a consistent view if the underlying dataset is mutable. The first question that needs to be answered is in trying to understand the expectations around consistency of the indices when the underlying data mutates. While it is possible to have an internal index such as a tree based scheme for collections, thus trivially solving the consistency problem, this approach may not be viable when more than one index needs is needed on the dataset. All hell breaks loose the moment we concede to have a representation of the data in more than one place. Indices are usually managed by the same entity that acts as the sole gateway for mutating the collection. The update strategy is analogous to write-through caches i.e. both the primary store and the cache (index in this case) are updated as part of the same operation before signalling completion to the update initiator.

It seems obvious to almost any cache designer (or user) that the entire universe of the underlying collection need not be represented in the cache. This is usually not equally obvious to both index designers and users. Put it another way, an index over attributes of items in a collection need not be exhaustive i.e. contain references to every item and every value of the chosen attribute in a given collection. This is well recognized in the case of textual indices over documents as it takes the form of stop words. Such a phenomenon is equally true even in the case of highly normalised collections such as RDBMS tables on even finite value types such as integers. The candidates for exclusion in such situations happens to be high frequency values. We shall not dive into the details of why this strategy is commonplace. What is of interest here is that the decision to include or exclude items in indexing in such cases stems from intrinsic properties of the contents of the collection and not the access pattern. Again, there is nothing that precludes an index implementation strategy that looks at access patterns as well. We can adopt a hierarchical indexing solution much like a hierarchical cache and also have the option to make the layers either inclusive or exclusive of each other.

Wrapping up

I’d like to summarize everything that has been said thus far as follows:

Both caching & indexing are data retrieval speed up techniques.
Both caching & indexing gets tricky if the underlying data mutates. The challenges can be reduced if these solutions are under the same control domain as the underlying data set
Both caching & indexing need not be exhaustive. “Return on investment” acts as the driving force in helping to decide what stays in and what goes out.

Caching usually means identity based full item/entity caching. If we are to throw that narrow definition away, then hopefully we can now see that the problem is very similar to that of indexing to the point wherein indexing can be considered as a specialised style of caching.