02/03/2021
In Learning Tree Course 468, Introduction to Cybersecurity, we talk about three core elements of cybersecurity: confidentiality, integrity, and availability. The last of these probably gets the least attention. It gets limited attention in Course 450, Introduction to Networking, too. Since availability is an essential pillar of cybersecurity, we'll look at three important availability metrics here:
Uptime
Expressed as a percentage, this is the most common of the availability metrics. It is used for systems, networks, and even non-computer measurements. It is fundamentally the time something is available divided by total time. A system you can use three-quarters of the time has 75% availability. Likewise, a system you can use 99% of the time is unavailable almost 88 hours a year! Telephone companies strive for "five nines" uptime, or 99.999%, which allows only a little over five minutes of downtime per year.
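The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes a 365-day year of 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, 8,760 hours


def availability_percent(uptime_hours: float, total_hours: float) -> float:
    """Availability is the time something is available divided by total time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * uptime_hours / total_hours


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a system is unavailable at a given availability."""
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for pct in (75.0, 99.0, 99.999):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) down per year")
```

Run as written, 99% works out to about 87.6 hours of downtime a year, and 99.999% to about 5.26 minutes.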
Higher percentages are clearly better. We'd all like to have systems and networks available all the time, but stuff happens. Sometimes a lack of availability (downtime) can be due to malice and other times it can be due to equipment failures, cable breaks, configuration, or other issues. We can work to prevent attackers from shutting down services, but we also need to be mindful of other issues that may cause the downtime.
Uptime is a highly visible measure, but there are other important metrics, too.
Mean-time-to-recovery or repair (MTTR)
Expressed in hours or occasionally minutes, MTTR represents the time it takes an organization to recover from a failure. If there are three failures in a reporting period (e.g. monthly) the MTTR is the average of the three recovery times.
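As a quick sketch of that average, here is a small Python example. The recovery times are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_time_to_recovery(recovery_hours: Sequence[float]) -> float:
    """Average the recovery times of the failures in one reporting period."""
    if not recovery_hours:
        raise ValueError("no failures recorded in this reporting period")
    return mean(recovery_hours)


if __name__ == "__main__":
    # Hypothetical month with three failures, recovery times in hours.
    monthly_recoveries = [2.0, 0.5, 3.5]
    print(f"MTTR: {mean_time_to_recovery(monthly_recoveries)} hours")
```

With these made-up numbers, the three recoveries total six hours, so the MTTR for the month is two hours.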
MTTR is generally seen as an indication of the responsiveness or efficiency of the team(s) maintaining systems. While this may be true in the long term, malicious actors can cause failures from which recovery is particularly difficult.
A closely related concept is Mean-down-time. This is generally longer than MTTR and includes administrative times and other non-technical factors.
Mean-time-between-failures (MTBF)
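Also expressed in hours or minutes, MTBF is the average time a system or network runs between one failure and the next, so it reflects how resilient that system or network is. As a measure of resilience it is a bit inaccurate, though, as downtime caused by malicious actors is generally unpredictable.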
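One common way to compute MTBF is to divide the time the system was actually running by the number of failures in the period. The Python sketch below assumes that convention (others exist), and the function name and figures are hypothetical.

```python
def mean_time_between_failures(
    total_hours: float, downtime_hours: float, failure_count: int
) -> float:
    """Operating time divided by number of failures (one common convention)."""
    if failure_count <= 0:
        raise ValueError("failure_count must be at least 1")
    if downtime_hours > total_hours:
        raise ValueError("downtime cannot exceed total time")
    return (total_hours - downtime_hours) / failure_count


if __name__ == "__main__":
    # Hypothetical 30-day month (720 hours) with 3 failures and 6 hours of downtime.
    mtbf = mean_time_between_failures(720.0, 6.0, 3)
    print(f"MTBF: {mtbf} hours")
```

In this invented example the system ran for 714 hours and failed three times, giving an MTBF of 238 hours.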
Each of these metrics provides a valuable measurement of the availability of a system or network. Each must be considered carefully, however: none provides true insight into the cause of downtime, per se, but rather into the operation of the system or network and its response to incidents.
An organization can use these measures to help steer funding appropriately. A long downtime following a particular incident, for example, may be an indicator of a need for increased response training or funding. This is one reason MTTR and MTBF can be valuable: they can show the response over time and thus give a fuller picture.
I have limited these descriptions to non-mathematical fundamental concepts in order to be more readable to a broader audience. Interested readers may want to read e.g. https://en.wikipedia.org/wiki/Mean_time_between_failures