While I have heard DevOps folks talk about observability, I haven’t found a definition for observability in the DevOps space. Also, while I have heard folks say observability is better than or different from logging and monitoring, I haven’t found concrete differences.
So, this post is my attempt to define and differentiate logging, monitoring, and observability. It is also an attempt to answer whether and when a solution can automatically make existing systems observable.
Consider Wilma’s workflow.
At the end of each day, Wilma writes about the activities she did on her job during the day into a journal. Since she spends no more than 30 minutes on writing about her activities, she may not capture all activities or all details about an activity. She chooses to write about only those activities and details that she deems will be useful in the future.
On Saturday mornings, Wilma reads through the activities described in her journal and reflects on them. If she identifies an interesting and possibly relevant pattern, then she tries to identify activities related to various instances of the pattern, figure out how the activities contributed to the pattern, and figure out how the activities came to be. The point of this exercise is to identify triggers and activities that lead to a pattern and use them to help repeat (or prevent) the pattern.
In the above situation, the act of writing information about activities into a journal is logging. Each sentence/paragraph in the journal is a log statement and the journal is a log — a collection of log statements. [In programming land, the statement that writes a log statement is a logging statement.]
As described, Wilma is doing batched logging as she batches details about activities and writes them into the journal at the end of the day. Also, since she does this when she is not performing the activities, we could say Wilma is doing offline logging. The alternative would be to do online logging by logging details about each activity during or at the end of the activity.
Logging is the act of recording information about a state/activity/object/action.
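To make the online vs. batched distinction concrete, here is a minimal Python sketch using the standard `logging` module. The logger name `wilma`, the `BatchedJournal` class, and the activity names are made up for illustration; the journal is an in-memory buffer so the example is self-contained.

```python
import logging
from io import StringIO

# An in-memory "journal" so the example needs no files.
journal = StringIO()
handler = logging.StreamHandler(journal)
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("wilma")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep output confined to the journal


def do_activity_online(name):
    """Online logging: record the activity while it is happening."""
    logger.info("did %s", name)  # this line is a logging statement


class BatchedJournal:
    """Batched/offline logging: remember activities now, write them later."""

    def __init__(self):
        self.pending = []

    def note(self, name):
        self.pending.append(name)

    def write_up(self):
        # Everything is written in one sitting, after the activities are over.
        for name in self.pending:
            logger.info("did %s (written at end of day)", name)
        self.pending.clear()


do_activity_online("code review")

day = BatchedJournal()
day.note("standup")
day.note("deploy")
day.write_up()

# Each line in the journal is a log statement; the journal is the log.
log_statements = journal.getvalue().splitlines()
```

In this sketch, the batched variant trades timeliness for fewer interruptions, which mirrors Wilma writing everything in one 30-minute session.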
The act of identifying instances of patterns by reading the journal on Saturday mornings is monitoring. Monitoring is all about looking for the presence or absence of patterns in information sources, e.g., sources that capture the behavior of a system. Logs are common information sources and, hence, logging enables monitoring.
As is, the effectiveness of monitoring is seldom concerned with the polarity of the patterns — good or bad. Instead, the effectiveness of monitoring is often concerned with the relevance and novelty of patterns. In this sense, the effectiveness of monitoring (in finding relevant patterns) is heavily dependent on the (template of) patterns that are being searched for. In other words, if Wilma is not looking for a specific important pattern, then she will not discover it even if it occurs in the log; hence, monitoring will be ineffective in this case.
Monitoring is the act of searching (looking out) for the presence/absence of patterns in information sources.
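As a rough illustration of monitoring as pattern search, the following Python sketch scans a small log for named patterns. The log contents, the pattern names, and the `monitor` function are all invented for this example.

```python
import re

# A made-up log: one log statement per entry.
log = [
    "Mon: deploy failed after late code review",
    "Tue: deploy succeeded",
    "Wed: deploy failed after late code review",
    "Thu: wrote documentation",
]


def monitor(log_statements, patterns):
    """For each named pattern, return the log statements that match it."""
    return {
        name: [s for s in log_statements if re.search(regex, s)]
        for name, regex in patterns.items()
    }


# Only these patterns are searched for.
patterns = {
    "failed_deploy": r"deploy failed",
    "outage": r"outage",
}

findings = monitor(log, patterns)

# Absence of a pattern is also a finding.
absent = [name for name, hits in findings.items() if not hits]
```

Note that "late code review" occurs twice in the log but is never reported, because no pattern looks for it. This is the sense in which monitoring is only as effective as the patterns being searched for.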
A closely related but distinct concept is alerting: the act of informing concerned parties about findings from monitoring. Just as logging enables monitoring, monitoring enables alerting.
Unlike logging and monitoring, observability is the ability to observe the behavior/state of a system.
According to Wikipedia, in control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. To me, this makes sense, and here’s my interpretation of this in the context of software systems.
A behavior X of a system is (externally) observable if we can observe the system is exhibiting behavior X; typically, without peeking under the hood of the system. If behavior X is internal to the system, then X is observable if certain outputs of the system imply the system is exhibiting behavior X. As observers, our ability to observe the behavior/state of a system is constrained by what the system (the observed) is willing to expose to us. Hence, the observability of a system is the ability to observe the behavior/state of the system and this is dictated by what the system is willing to expose about itself. The extent/measure of observability is the number of behaviors of the system that can be observed.
In this context, precision and accuracy of observation are two closely related concepts. Suppose observability of a behavior is defined by a set of associated outputs. The precision of observing behavior X is the fraction of the outputs uniquely associated with behavior X, i.e., how certain are we that we observed behavior X? The accuracy of observing behavior X is the fraction of manifestations of behavior X that are associated with some output, i.e., how certain are we that we will be able to observe all instances of behavior X? [For CS folks, precision and accuracy are similar to soundness and completeness.]
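A small worked example may help here. The Python sketch below computes both measures under one possible reading of the definitions above: precision is taken over the outputs associated with X, counting those associated with no other behavior, and accuracy is taken over manifestations of X, counting those that produced at least one output. The outputs, behaviors, and manifestations are made-up data.

```python
from fractions import Fraction

# Hypothetical mapping: each output -> the behaviors that can produce it.
output_to_behaviors = {
    "log: retrying request": {"X"},
    "HTTP 503 response": {"X", "Y"},
    "log: cache cleared": {"Y"},
    "latency spike": {"X"},
}

# Hypothetical record: each manifestation of X and the outputs it produced.
manifestations_of_x = [
    {"log: retrying request"},
    {"HTTP 503 response", "latency spike"},
    set(),  # X occurred but produced no output
    {"latency spike"},
]


def precision(behavior, mapping):
    """Fraction of the behavior's outputs that no other behavior shares."""
    associated = [bs for bs in mapping.values() if behavior in bs]
    if not associated:
        return Fraction(0)
    unique = [bs for bs in associated if bs == {behavior}]
    return Fraction(len(unique), len(associated))


def accuracy(manifestations):
    """Fraction of manifestations that produced at least one output."""
    if not manifestations:
        return Fraction(0)
    observed = [m for m in manifestations if m]
    return Fraction(len(observed), len(manifestations))


precision_x = precision("X", output_to_behaviors)  # 2 of 3 outputs are unique to X
accuracy_x = accuracy(manifestations_of_x)         # 3 of 4 manifestations are visible
```

With this data, seeing "HTTP 503 response" alone does not tell us whether X or Y occurred, and one manifestation of X goes unnoticed entirely.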
I think observability is strongly related to testability — the ability to test various behaviors of a system. Without logging statements, testability defines the extent of observability, i.e., without logging statements, only a subset of testable behaviors can be observed.
While the above description focuses on the behavior of a system, it applies to states of a system as well if we map internal behavior to internal parts of the system state and external behavior to external parts of the system state.
Observability of a system is the ability to observe the behavior/state of the system, and this is dictated by what the system exposes about itself.
Just as logging can enable monitoring if logs contain information relevant to the system property being monitored, logging can enable observability if logs contain information relevant to the system behavior/state being observed, i.e., we can infer system behavior/state from the data in logs.
Can a solution automatically make a system observable?
I have often read statements claiming that solution X can make an existing system observable, i.e., solution X enables us to observe Y about an existing system when Y could not be observed previously.
I think such statements are true only if one of the following is true.
- Solution X modifies the system to collect information that was not being collected earlier. This may entail static changes or dynamic changes (e.g., instrumentation, interception) to the system. The reasoning is, if information Z about a system is needed to observe Y but was not exposed by the system, then solutions need to expose/extract information Z from the system to observe Y.
- Solution X analyzes the collected data in ways it was not analyzed earlier, i.e., solution X searches for the presence/absence of new patterns. The reason is, if information Z needed to observe Y is exposed by a system but is not being considered to observe Y, then solutions need to consider information Z to observe Y. In this sense, observability is no different than monitoring.
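The first option, a dynamic change that collects information not collected earlier, could look roughly like the Python sketch below. The `Inventory` class stands in for an existing system, and `instrument` and `observations` are hypothetical names; this is one simple way to instrument a method, not a description of how any particular product works.

```python
import functools

observations = []  # hypothetical sink for the newly exposed information


def instrument(func):
    """Wrap func so that every call exposes its arguments and result."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        observations.append((func.__name__, args, kwargs, result))
        return result

    return wrapper


class Inventory:
    """Stand-in for an existing system that records nothing about itself."""

    def __init__(self, stock):
        self._stock = stock

    def reserve(self, n):
        if n > self._stock:
            return False
        self._stock -= n
        return True


# Dynamic change: patch the existing class without editing its source.
Inventory.reserve = instrument(Inventory.reserve)

inv = Inventory(stock=3)
inv.reserve(2)  # succeeds, 1 item left
inv.reserve(2)  # rejected, not enough stock

# Information Z (every reservation attempt) is now available, so
# Y (how often reservations are rejected) can be observed.
rejected = [obs for obs in observations if obs[3] is False]
```

Before the patch, rejected reservations left no trace outside the immediate caller. After it, they can be counted and searched like any other log.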
A solution that merely gathers data that was already being exposed by the system (e.g., by redirecting logs to a different storage space) cannot, on its own, improve the observability of a system.
A solution can only improve observability of a system by collecting more data from the system or performing new analysis on currently collected data.