
Practices of Reliable Software Design

Table of Contents

The Practices

1. Use Off-The-Shelf
2. Cost And Reliability Over Features
3. Idea To Production Quickly
4. Simple Data Structures
5. Reserve Resources Early
6. Set Maximums
7. Make Testing Easy
8. Embed Performance Counters

Other practices

I was nerd-sniped. Out of the blue, a friend asked me,

If you would build an in-memory cache, how would you do it?

It should have good performance and be able to hold many entries. Reads are more
common than writes. I know how I would do it already, but I’m curious about your
approach.

I couldn’t not take the bait.

In the process of answering the question and writing the code, I discovered
that a lot of what happened in my thought process has come with experience.
These are things that make software engineering easier, but that I know I
wouldn’t have considered when I was less experienced.

I started writing a long and sprawling article on it, but it didn’t quite hit
the right notes, so this is a much abbreviated version. Time allowing, I might
expand on some of these practices in separate articles that can be more focused
and concise.

The Practices

Here are the eight practices, adopted with experience, that I found useful
during the exercise of writing a fast, small, in-memory cache.

1. Use Off-The-Shelf

Our first response to the question should be something like, can we use
Redis?

As long as we’re not dealing with a very expensive or complicated component,
or a part of the software that generates most of its value, we should default
to off-the-shelf solutions. I have written before about
why we want to build expensive and complicated
stuff.

2. Cost And Reliability Over Features

If we find out we can’t use an off-the-shelf solution, we need to build
something cheap and reliable. That usually means not having all the bells and
whistles, but that’s a trade-off worth making. One of my favourite ways of
phrasing this is

It is much easier to add features to reliable software than it is to add
reliability to featureful software.

Besides, it’s very easy to accidentally think you need features that you
don’t actually need.

I’m not a big proponent of a design phase. Sometimes you need it, but I think
if you need to design your software up front, it will cost more to develop
and take longer to finish. On the other hand, sometimes a little up-front
analysis allows you to eliminate large portions of the design space before
writing a single line of code.

In the case of this cache, we may ask questions about item durability
requirements, request rates, sizes, eviction requirements, and so on. When we
do, we would in this case find out that we can get by with a single thread
accessing a single data structure, and no active eviction process. That’s a
huge win! It simplifies the design a lot.
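A one-minute analysis of required capacity might look like the following
sketch, which uses Little’s law from queueing theory. The write rate and TTL
are hypothetical numbers I made up for illustration, not the article’s:

```python
import math

# Back-of-the-napkin sizing via Little's law: entries in the system equal
# the write (arrival) rate times the time each entry stays (its TTL).
# Both input figures here are hypothetical, not from the article.
writes_per_second = 2_000
ttl_seconds = 300

entries_in_flight = writes_per_second * ttl_seconds   # 600,000 entries
bits_of_address_space = math.log2(entries_in_flight)  # roughly 19.2 bits
```

An answer in the hundreds of thousands of entries is the kind of result that
justifies a plain preallocated array later on.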

3. Idea To Production Quickly

Part of the reasoning behind the previous practice was that it’s easy to
think you need features you don’t. How, then, do you know which features you
need? For the most part, the cheapest and most reliable way to find out is in
production.

If we deploy the bare minimum feature set, we will very quickly find out
which additional features are most commonly requested, and they’re not going
to be the ones we think. And even if they are, there are usually other
requirements on them that we wouldn’t have caught ahead of time.

This practice ties in strongly with the previous one: use analysis to pare
the requirements down to the bare minimum, then write the bare minimum of
code to support them, and get it into production to find out more about the
problem we’re trying to solve.

4. Simple Data Structures

Complicated data structures are often tempting. Especially when there are
libraries that handle them for us. The problem with complicated data
structures is that it’s easy to misuse them due to a lack of understanding,
which leads to performance problems and bugs.

In the case of this specific cache, the number of entries that needs to be
stored (we can use queueing theory to calculate this from the TTL and the
rate of writes) would comfortably be addressed by 19.9 bits of information –
in other words, we can just use a plain array to store each item, and hash
the key to address it. We found out earlier that we didn’t need to actively
evict expired items. If we had needed to, we could have kept a separate index
of the item nearest to expiry using a min-heap. This is cheaper and simpler
than something fully sorted, which it might otherwise be tempting to reach
for.

Instead of a separate eviction process, we can rely on the high request rate
to trigger eviction of items, thus trading a little latency to avoid
designing for concurrency up front. Another alternative is to provide a
separate API that triggers eviction of the oldest expired entry, letting the
caller decide how much time to spend on eviction. This means the caller can
vary that time to make dynamic latency–resource tradeoffs.
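As a concrete illustration of how simple this can be, here is a minimal
Python sketch of the idea – a preallocated array, the key hashed to an index,
and expiry checked lazily on read. The names, capacity, and choice of hash
function are my assumptions, not the article’s actual code:

```python
import time
import zlib

class ArrayCache:
    """A plain preallocated array addressed by hashing the key.

    Collision handling is deliberately omitted here (a colliding write
    simply overwrites the slot); bounded probing comes up later.
    """

    def __init__(self, capacity=1 << 20, ttl=300.0):
        self.capacity = capacity
        self.ttl = ttl
        # One flat array of (key, value, expires_at) slots, reserved up front.
        self.slots = [None] * capacity

    def _index(self, key):
        return zlib.crc32(key.encode()) % self.capacity

    def write(self, key, value):
        self.slots[self._index(key)] = (key, value, time.monotonic() + self.ttl)

    def read(self, key):
        slot = self.slots[self._index(key)]
        if slot is None:
            return None
        k, v, expires_at = slot
        # Lazy eviction: an expired entry is simply treated as a miss.
        if k != key or time.monotonic() > expires_at:
            return None
        return v
```

The expiry check on read is what lets us skip a background eviction process
entirely: the high request rate does the work for us.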

5. Reserve Resources Early

Another potentially controversial decision I made for the cache I designed
was that it allocated the full index array up front. This sounds wasteful,
but we can tell from back-of-the-napkin analysis that it would need most of
this space anyway during continuous operation, so nothing would be saved by
postponing the allocation.

What early allocation does is allow the software to crash early if the
required resources are not available. This is a desirable trait – it saves
the operator the frustration of finding that out only once they’ve gone to
bed and received a call from PagerDuty. It also makes it easier to do
capacity planning and reason about performance.
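A minimal sketch of the idea, with a hypothetical capacity figure: do the
allocation in the startup path, so a resource shortfall is an immediate and
visible failure rather than a 3 a.m. surprise.

```python
# Reserve the full index up front. If the machine cannot satisfy the
# allocation, the process dies at startup, not hours into operation.
# CAPACITY is a hypothetical figure, not the article's actual number.
CAPACITY = 1_000_000

def start_cache():
    try:
        # The whole index is allocated here, in the boot path.
        return [None] * CAPACITY
    except MemoryError:
        raise SystemExit("cache: cannot reserve index memory, refusing to start")

slots = start_cache()
```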

6. Set Maximums

For handling hash collisions when addressing items, I chose a sloppy method
based on open addressing with linear probing. It is simple and has very good
performance in the typical case (again, only very rudimentary arithmetic is
needed to verify this for any specific use case), but it can have very bad
performance in the worst case: it can end up iterating through the entire
array of items on every cache miss.

Since perfect recall turned out not to be necessary (this is again why asking
questions about requirements is so important!), it was trivial to set a
maximum iteration limit of e.g. 500 steps. This makes sure the worst case is
not too far from the average case, at the cost of a few more cache misses.
Does 500 steps optimise this tradeoff? No. This is one thing I don’t know how
to compute ahead of time, so production experience will have to inform future
changes to the maximum number of steps.

The important thing is usually not what the maximum is, just that there is
some maximum, so we’re not waiting surprisingly long times for things to
complete.
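A sketch of what such a cap might look like in a probing lookup. The constant
and helper names here are illustrative, not the article’s code:

```python
MAX_PROBES = 500  # a bound on the worst case; tune from production experience

def probe_read(slots, key, index_of):
    """Linear probing with a hard iteration cap: after MAX_PROBES slots we
    give up and report a miss instead of scanning the whole array."""
    start = index_of(key)
    n = len(slots)
    for step in range(min(MAX_PROBES, n)):
        slot = slots[(start + step) % n]
        if slot is None:
            return None        # empty slot: the key cannot be further along
        if slot[0] == key:
            return slot[1]     # found the entry
    return None                # cap reached: accept a false cache miss
```

Capping the probe count turns a rare full-array scan into, at worst, a
bounded number of extra cache misses.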

Given the static allocation requirements, we might also want to set a limit
on how much data can be stored in the cache at any given time, so we don’t
accidentally blow memory limits in the middle of the night. In general, limit
all the things! If someone is unhappy with the limit, revise the limit –
don’t remove it.

7. Make Testing Easy

To avoid regressions and ensure expected behaviour, I made the cache accept
commands on standard input. This means we can start the cache and type write
hello world in the terminal window to store “world” for the key “hello”.

But it also means we can write up a long string of commands in a file, and
pipe that into the cache. If we then also give the cache a command that
asserts the last value read is something specific, we can write up a whole
test protocol in a plain text file, and pipe it to the program to verify its
functionality. (The cache also had a few more commands related to testing,
such as a sleep command that forces it to believe time has passed, for
testing expiration, and a dump command that prints out everything it is
currently storing and whether each entry is deleted, expired, or active.) The
input file might look like


# Cache empty, should not find anything.
read abc
assert undef

# Write and then immediately read the same thing back.
write abc 123
read abc
assert 123

# Different key should get nothing.
read xyz
assert undef

# Write different value, then read, should match.
write abc 567
read abc
assert 567

# Write one more key, abc should still be around.
write xyz 444
read abc
assert 567

In isolation, each of these things is very simple (accept commands at the
CLI, allow asserting the previous read), but it’s the combination of them,
together with the tools already at hand (shell input redirection), that
enables faster cycle times and thus more reliable software.
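A minimal interpreter for such a protocol might look like the sketch below.
The exact command set (write, read, assert, with undef for missing keys) is
my guess at the interface, modelled on the example file above:

```python
def run_commands(lines):
    """Tiny interpreter for a test protocol of the kind shown above:
    'write K V', 'read K', and 'assert V' (with 'undef' for missing keys).
    Backed by a plain dict here to keep the sketch self-contained."""
    store = {}
    last_read = "undef"
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        cmd, *args = line.split()
        if cmd == "write":
            store[args[0]] = args[1]
        elif cmd == "read":
            last_read = store.get(args[0], "undef")
        elif cmd == "assert":
            if last_read != args[0]:
                raise AssertionError(f"expected {args[0]}, got {last_read}")
    return store
```

Hooked up to standard input (run_commands(sys.stdin)), this is what lets
shell redirection turn a plain text file into a regression test.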

8. Embed Performance Counters

Finally, we will want to know how our program spends its time. We can do that
with profiling, or deduce it from logs, but the easiest way is to embed
performance counters. These are variables that accumulate how much of
something has happened. They can be things like

How much time is spent reading keys?
How much time is spent writing key–value pairs?
How much time is spent on I/O?
How many cache misses have there been?
How many keys have we had to linearly search over?
How many times has the iteration limit of 500 been reached?

Note that all of these are cumulative, so we don’t ask “how much storage
space is currently allocated?” – instead we ask the pair of questions

How much storage space have we allocated in total from startup until now?
How much storage space have we freed in total from startup until now?

By breaking the measurement apart into two time-dependent totals we can learn
significantly more about how our system behaves.
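A sketch of such a pair of counters, with hypothetical counter names:

```python
from collections import Counter

# Cumulative totals only: these counters never decrease.
counters = Counter()

def allocate(n_bytes):
    counters["bytes_allocated_total"] += n_bytes

def free(n_bytes):
    counters["bytes_freed_total"] += n_bytes

allocate(4096)
allocate(4096)
free(4096)

# Current usage is derived, not measured. The pair of totals also reveals
# churn (how fast we allocate and free), which a single gauge would hide.
in_use = counters["bytes_allocated_total"] - counters["bytes_freed_total"]
```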

How these numbers evolve over time is also a useful indicator. For example,
the value of “how many keys have we had to linearly search over?” might
increase steadily (indicating a relatively even distribution of hash
collisions) or jump up a lot at times (indicating sudden bursts of
collisions). These are useful things to be able to tell apart. In combination
with “how many times has the iteration limit been reached?”, it tells us a
lot about the pressures put on the system.

Other practices

There are surely many more of these insights gained from years of software
engineering, but these are the ones I thought of during this exercise. I’d be
happy to hear from you about others you are thinking about! If nothing else, I
hope that could teach me to be a better software engineer.
