Anatomy of an Outage: How `loglens stats` Reduces MTTR from 45 Minutes to 5

It's 7:30 AM. An alert fires: the P95 latency for the critical `checkout-api` has breached its Service Level Objective (SLO). High-level dashboards confirm the symptom, a spike in response times, but say nothing about the cause. For any on-call engineer, this is the start of a race against time. That race is measured in Mean Time To Resolution (MTTR), and the biggest variable is how quickly you can pull actionable signals out of your logs.

This is a story of two timelines. One involves the slow, cumbersome process of traditional command-line tools. The other demonstrates the speed and clarity that `loglens stats` brings to incident response.

The Old Way: A 45-Minute Investigation

Without a specialized tool, the process is painfully manual:

  1. (10-15 min) Download & Decompress: Pull down a multi-gigabyte compressed log archive from cloud storage and wait for `gunzip` to churn through it.
  2. (20-30 min) Iterative Grepping: Start a frustrating cycle of piping `grep`, `awk`, and `sed` together, slowly building a complex one-liner to isolate error logs, extract latency values, and manually calculate an average. It's slow, error-prone, and provides only surface-level insights.

After nearly 45 minutes of stressful scripting, you might have a rough idea of the problem, but you lack the statistical confidence to declare a root cause.
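
For readers who haven't lived this, the manual approach usually ends up looking something like the sketch below. It assumes JSON-formatted log lines with `status_code` and `latency_ms` fields; the exact patterns always need tweaking for your real format, which is part of what makes the process so slow.

# A representative version of the manual pipeline: decompress, filter for
# server errors, extract the latency field, and compute a crude average
zcat checkout-api.log.gz \
  | grep -E '"status_code": 5[0-9][0-9]' \
  | grep -oE '"latency_ms": [0-9.]+' \
  | awk -F': ' '{ sum += $2; n++ } END { if (n) printf "%.2f ms avg across %d errors\n", sum / n, n }'

And even once it works, all it yields is a single average.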

The LogLens Way: A 5-Minute Diagnosis

Let's restart the clock. The alert fires. The log archive is downloaded. Instead of reaching for `gunzip` and `grep`, our engineer, Alex, turns to LogLens. Everything that follows takes under five minutes.

Step 1: Broad Triage with `stats legacy`

Alex's first hypothesis is that one of the API's downstream dependencies is failing. The logs contain a `downstream_service` field. A single `stats legacy` command can test this hypothesis by grouping all failed checkouts by the service they were calling.
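
For this walkthrough, assume each log entry is a structured JSON line. The exact schema below is hypothetical, but the three fields the commands lean on are `downstream_service`, `latency_ms`, and `status_code`:

{"timestamp": "2025-06-03T07:28:41Z", "downstream_service": "inventory-service", "status_code": 503, "latency_ms": 250.41, "message": "checkout failed"}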

# For all failed checkouts, group by downstream service and find the average latency
loglens stats ./checkout-api.log.gz legacy --count-by "downstream_service" --avg "latency_ms" --where 'status_code >= 500'

In under ten seconds, Alex gets a clear result:

Count by Field: "downstream_service"
===================================================================
Value                     | Count     | Average (latency_ms)
-------------------------------------------------------------------
"inventory-service"       | 12        | 250.41
"shipping-calculator"     | 8         | 450.90
"payment-gateway"         | 157       | 18535.23
"user-database"           | 4         | 150.22
-------------------------------------------------------------------

The signal is unmistakable. While other services have minor failures, the `payment-gateway` is the source of the vast majority of errors, with a staggering average latency of over 18 seconds. The investigation is now focused.

Step 2: Deep Analysis with `stats describe`

An 18-second average is bad, but *why* is it so high? Is it consistently slow, or are there timeouts? A simple average can't answer this. For a complete statistical picture, Alex uses `stats describe`.

# Get a full statistical breakdown for the failing payment gateway calls
loglens stats ./checkout-api.log.gz describe latency_ms --where 'downstream_service == "payment-gateway" AND status_code >= 500'

This command provides the critical "aha!" moment:

Statistics for 'latency_ms'
===========================
Metric       |              Value
---------------------------------
Count        |                157
Sum          |         2910031.71
Min          |             854.60
Average      |           18535.23
p50 (Median) |             890.11
p95          |           30000.00
p99          |           30000.00
Max          |           30000.00

This data tells a rich story. The median failure (`p50`) is slow at around 890ms, but the `p95`, `p99`, and `Max` values are all exactly 30,000ms. That is the smoking gun: the payment gateway integration has a 30-second timeout, and at least 5% of the failing requests are running straight into it. The problem isn't general slowness; it's a hard timeout.
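
If Alex wants to quantify exactly how many calls ran out that clock, the same `--where` filter can be narrowed one step further (the additional `latency_ms` comparison is an assumption about the filter syntax):

# How many failing payment-gateway calls hit the 30-second ceiling?
loglens stats ./checkout-api.log.gz describe latency_ms --where 'downstream_service == "payment-gateway" AND latency_ms >= 30000'

The `Count` line of that output tells Alex precisely how many requests burned the full 30 seconds.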

From Hours to Minutes, From Guesswork to Certainty

In less than five minutes, using just two commands, Alex has moved from a high-level P95 alert to a precise, actionable diagnosis: "The checkout API is failing because calls to the payment gateway are hitting a 30-second network timeout." The team now knows exactly where to look—not at their own code, but at the network configuration or the health of the payment gateway itself.

This is the strategic advantage of a purpose-built tool. By providing instant statistical analysis directly on the command line, LogLens drastically reduces Mean Time To Resolution, turning a stressful, 45-minute ordeal into a calm, 5-minute investigation. It bridges the gap between raw logs and observability, empowering engineers to stop digging for data and start solving problems.
