Anomaly Detection in Greymatter.io Sense
Unveiling some of the log anomaly detection capabilities built into greymatter.io.
March 15, 2021
Yuval Dror’s July 28 blog post “The Importance of Anomaly Detection for Service Mesh Monitoring” presents an excellent example of the needs, challenges, and potential of service mesh technology. In the article, Yuval calls out the importance of anomaly detection to service mesh operations at scale. We at greymatter.io completely agree. So, with Yuval’s article in mind, we decided to take this opportunity to unveil some of the log anomaly detection capabilities we’re building into greymatter.io.
greymatter.io’s log anomaly detector is a platform-agnostic log monitoring artificial intelligence for the greymatter.io service mesh. Part of what we refer to as greymatter.io Sense, it has no external dependencies beyond whatever delivers the logs. It usually requires no user intervention, operating in the background as an unobtrusive overwatch for the microservice architecture. Once deployed, it receives logs from services in a greymatter.io deployment, automatically training a neural network for each service to recognize its usual look and feel. It then passively monitors the entire firehose of logs for any records that somehow seem “off”.
Sense AI picks out incidents in the logs the same way a human would: it first establishes what “normal logs” look like. It then watches to see if anything “feels weird” given that understanding. Unlike a human, it can consume and monitor haystacks of raw data in search of the odd needle at volumes well beyond human capacity.
Here’s how it works:
- All microservices, all of their sidecars, and any other auxiliary software in a deployment produce logs in the usual way, without knowledge of the Sense subsystem, and the logs are delivered to Sense by a delivery mechanism such as Fluent Bit or Logstash. Each new set of logs is treated separately and automatically, and no new configuration is required to begin accepting logs from a new microservice.
- Each log stream gets its own anomaly detector, backed by a recurrent autoencoder. The neural network is trained to “compress” the incoming logs efficiently using their semantic characteristics so that anomalous logs won’t compress as well. Anomalous logs have different semantic characteristics than normal logs, and the degree of “compression” gives us an accurate measure of this difference.
- The first part is the encoder, which goes over each log line letter by letter and, during training, learns to pack all relevant information about the log line into a small, fixed-length set of numbers.
- This set of numbers is called the latent vector because it contains the potential for reconstructing the log line. Crucially, it is too small to “memorize” the log line letter by letter: the encoder is forced to abstract high-level features of the log line into this latent vector if the decoder is to have any hope of reconstructing it later.
- The last part is the decoder, which has the task of generating a log line from only the latent vector for that log line. It learns to do a good job of reconstructing the original “normal” logs, letter by letter.
- Once training is complete, every log line is run through the system, and any line whose reconstruction error exceeds a threshold is flagged as an anomaly. The greymatter.io Dashboard presents anomalies to the user in conjunction with other system events, as part of the larger operational picture.
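The flow described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Sense's implementation: the recurrent autoencoder is swapped for a much simpler stand-in (a character-bigram model whose average surprise per character plays the role of reconstruction error), and the names `StreamDetector`, `detectors`, the sample log lines, and the fixed threshold of 2.0 are all hypothetical.

```python
import math
from collections import Counter, defaultdict

START = "\x02"    # start-of-line marker (assumption)
ALPHABET = 256    # assumed number of distinct symbols, used for smoothing
SMOOTHING = 0.01  # small pseudo-count given to unseen character pairs


class StreamDetector:
    """Hypothetical per-stream detector.

    A character-bigram model stands in for the recurrent autoencoder:
    lines that look like the training data score low, unfamiliar lines
    score high.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.pairs = Counter()  # counts of (previous char, current char)
        self.prevs = Counter()  # counts of previous char

    def train(self, lines):
        for line in lines:
            padded = START + line
            for prev, cur in zip(padded, padded[1:]):
                self.pairs[(prev, cur)] += 1
                self.prevs[prev] += 1

    def score(self, line):
        """Average surprise (negative log-probability) per character."""
        padded = START + line
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            p = (self.pairs[(prev, cur)] + SMOOTHING) / (
                self.prevs[prev] + SMOOTHING * ALPHABET
            )
            total -= math.log(p)
        return total / max(len(line), 1)

    def is_anomaly(self, line):
        return self.score(line) > self.threshold


# One detector per log stream, created on first use: a new service needs
# no configuration of its own.
detectors = defaultdict(lambda: StreamDetector(threshold=2.0))
detectors["catalog"].train([f"GET /catalog/{i} 200 OK" for i in range(50)])

normal = "GET /catalog/77 200 OK"  # unseen, but shaped like the training data
odd = "\u2588\u2588\u2588\u2588 42%"  # nothing like the training data

print(detectors["catalog"].is_anomaly(normal))  # False
print(detectors["catalog"].is_anomaly(odd))     # True
```

The stand-in keeps the shape of the approach: learn what a stream's normal lines look like, score every line by how badly it fits, and flag scores above a threshold. A real autoencoder generalizes far better than bigram counts, which is the point of using one.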
This simple system provides a profound benefit: The otherwise vastly overwhelming firehose of log messages from myriad microservices becomes operationally useful for a change, and a human operator is warned of unusual behavior as it occurs.
The operator can then keep up, passively apprised of the state of the fabric, and able to fix issues before users experience them. This is very difficult without automated anomaly detection. Today, the human operator relies on much shallower signals: health checks (which often falsely report that everything is fine), user feedback (subject to human failure and limitation), manual checks (how fast can you type?), and mere optimism that silence is good. None of these scale to thousands of services and none of these bring peace of mind.
Sense AI also has quality-of-life features essential for trust and real-world applicability:
- It’s adaptable: Anomaly detection requires no configuration and incorporates new service logs as they’re received. The per-service sensitivity is set automatically using Sense AI’s own training statistics.
- It’s adjustable: Sensitivity and other post-training parameters are exposed to the user for optional fine-tuning. Sense can also be signaled to restart the training process on a fresh log stream if new versions of a microservice produce different logs.
- It’s correctable: If a handful of normal lines are marked anomalous, it will integrate your feedback incrementally.
- It’s observable: Training status and many other statistics are available for situational awareness and troubleshooting.
- It’s unobtrusive: Anomaly detection within Sense is usually silent. If it surfaces an anomaly, you will be glad it did.
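As a rough illustration of the "adaptable" and "adjustable" points above, a threshold can be derived from the reconstruction errors seen during training and then tuned with a single parameter. The rule used here (mean plus a number of standard deviations) is a common choice and an assumption on our part, not necessarily what Sense does; `calibrate_threshold`, `num_stdevs`, and the sample error values are hypothetical.

```python
import statistics


def calibrate_threshold(training_errors, num_stdevs=3.0):
    """Derive a detection threshold from training-time reconstruction errors.

    Assumed rule: mean + num_stdevs * standard deviation. A lower
    num_stdevs makes the detector more sensitive; a higher one makes it
    quieter.
    """
    mean = statistics.mean(training_errors)
    spread = statistics.pstdev(training_errors)
    return mean + num_stdevs * spread


# Made-up reconstruction errors from one service's training run.
training_errors = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8]

default_threshold = calibrate_threshold(training_errors)
stricter_threshold = calibrate_threshold(training_errors, num_stdevs=1.0)

print(round(default_threshold, 3))   # 1.387
print(round(stricter_threshold, 3))  # 1.129
```

Because the threshold comes from each service's own statistics, a chatty service and a quiet one each get a cutoff suited to their logs, and an operator who wants fewer or more alerts only has to move one number.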
Even during development, this feature has proven useful for rooting out bugs. The following are real-world examples where the Sense AI anomaly detection capability surfaced unforeseen internal problems in our own development work:
- While we were investigating early possible “mistakes” in the anomaly detection output, the greymatter.io catalog service appeared to produce periodic anomalies every few hours. We initially believed these were spurious anomalies occurring just above our statistically inferred detection threshold. The actual cause was a periodic service restart triggered by a bug in an old version of the catalog service. Sense surfaced a problem we would otherwise not have noticed amidst the sea of services in the deployment.
- In an instance of surprising self-reference, Sense AI initially ran anomaly detection on its own logs. (This would later become non-default behavior for reasons beyond the scope of this post.) There were only a handful of anomalies in its logs, but they had a similarly strange pattern: an unexpected character (0xe29688; recognize it?) in the Sense logging output. It turned out that segments of a progress bar we used to track training progress were interleaved with the logs. The progress bar is invisible in the output itself, so we might never have noticed without Sense.
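For readers who want to check the byte sequence from the second example themselves, it decodes cleanly as UTF-8. The snippet below uses only the Python standard library; the variable names are ours.

```python
import unicodedata

raw = bytes.fromhex("e29688")  # the three bytes quoted above
char = raw.decode("utf-8")     # interpret them as UTF-8 text

print(char, f"U+{ord(char):04X}", unicodedata.name(char))
# prints: █ U+2588 FULL BLOCK
```

The full block is the solid rectangle that many terminal progress bars are drawn with, which is consistent with what we found.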
All indications suggest that this serendipity is typical of Sense’s anomaly detection capability and that developers and operators will derive many useful insights from its hunches.
The anomaly detection service itself is about logs, but the underlying AI and infrastructure apply to any time series, such as greymatter.io observables and metrics streams. Anomaly detection will soon be a fundamental part of greymatter.io, and an unmatched feature in the service mesh market.
Greymatter.io produces a wealth of information about a microservice deployment, and Sense weaponizes it for the war on downtime.