In modern distributed systems, an incident is rarely confined to a single service. A user-facing error is often the final symptom of a complex chain reaction spanning multiple microservices. For on-call engineers, the primary challenge in reducing Mean Time To Resolution (MTTR) is correlating disparate log streams to reconstruct the event's lifecycle. Without the right tools, this process is a time-consuming exercise in manual data wrangling.
Effective incident response requires a workflow that can rapidly pivot from a high-level symptom to a low-level root cause. This post details a professional workflow using LogLens to trace a single transaction through a complex system, demonstrating how its distinct commands work in concert to accelerate root cause analysis.
The Scenario: A Failing Logistics Shipment
A major client of a Ghent-based logistics platform reports that a critical shipment has failed to schedule, providing a `shipment_id`. The platform's backend consists of several services, including an `api-gateway`, a `routing-engine`, `partner-integrations`, and a `database-service`. Crucially, these services do not log in a consistent format:
```
# Some logs are well-structured JSON:
{"level":"info", "service":"api-gateway", "shipment_id":"shipment_id_abc123", "trace_id":"trace-xyz"}

# Others are simple logfmt from a legacy service:
service=routing-engine msg="Route calculated" trace_id=trace-xyz postal_code=9000

# And some are just unstructured text from a third-party library:
ERROR: Partner API call failed for trace-xyz. Reason: Invalid postal code.
```
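Correlating these streams by hand means writing a separate pattern for each format. The sketch below illustrates the problem with plain `grep`; the exact patterns depend on how each service serializes its fields.

```
# JSON entries: the field and value are quoted
grep -rh '"trace_id":"trace-xyz"' /var/log/

# logfmt entries: bare key=value pairs
grep -rh 'trace_id=trace-xyz' /var/log/

# Unstructured text: fall back to the bare token (this also matches the lines above)
grep -rh 'trace-xyz' /var/log/
```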
Phase 1: Initial Triage with `loglens search`
The incident starts with a single identifier. At this stage, you don't care about structure; you need raw speed to find every occurrence of that `shipment_id`. The free `loglens search` command is the perfect tool for this initial reconnaissance, as it operates like a massively parallel `grep` across all files.
```
# Find the initial log entry to get the master trace_id
loglens search /var/log/ "shipment_id_abc123"
```
This command instantly returns the API Gateway log, revealing the crucial piece of correlating information: `"trace_id":"trace-xyz"`. Now you have the thread to pull.
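If you want to capture that identifier in a script rather than eyeball it, one option is to pipe the matches through `jq`. This is a sketch that assumes the matching entry is one of the JSON-formatted lines (as it is here) and that `--raw` prints matches verbatim to stdout, as the redirect in Phase 3 suggests.

```
# Extract the trace_id field from any JSON-formatted match; non-JSON lines are skipped
loglens search /var/log/ "shipment_id_abc123" --raw \
  | jq -Rr 'fromjson? | .trace_id // empty'
```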
Phase 2: Contextual Analysis with `loglens query`
With the `trace_id`, you can perform a more precise, structured search for the error. The `query` command understands your log data, allowing for complex filtering. This lets you immediately identify which service first reported an error for this specific trace.
```
# Find the first service to report an error for this trace
loglens query /var/log/services/ --query 'trace_id == "trace-xyz" and level is "error"'
```
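For comparison, you could approximate this filter for the JSON-formatted entries alone with `grep` and `jq`; the sketch below skips the logfmt and plain-text lines that `loglens query` would also understand.

```
# Narrow to the trace first, then keep only JSON entries whose level is "error"
grep -rh 'trace-xyz' /var/log/services/ \
  | jq -Rc 'fromjson? | select(.level == "error")'
```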
The `loglens query` result points directly to a log from the `partner-integrations` service, but the message is generic: "Partner API returned an error." To understand the root cause, you need to see what happened immediately before this error occurred.
Phase 3: Root Cause Pinpointing with `loglens tui`
For the final step, you need the full context. You'll collate every log line for the trace—including INFO and DEBUG messages—into a temporary file and then open it with the interactive TUI.
```
# First, create a focused file with the full trace history
loglens search /var/log/ "trace-xyz" --raw > trace_context.log

# Now, open this file for an interactive investigation
loglens tui trace_context.log
```
Inside the TUI, you see the entire event lifecycle in chronological order. By examining the logs right before the error, you discover the root cause: an INFO log from the `routing-engine` shows it processed a request with a malformed postal code. This invalid data was then passed to the `partner-integrations` service, causing the API call to fail. The problem wasn't in the service that reported the error, but one step upstream.
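Even without the TUI, the collated file makes the same before-the-error context easy to pull with standard tools, for example:

```
# Show each error line together with the five lines that precede it
grep -n -i -B 5 'error' trace_context.log | less
```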
The Resolution and Business Impact
The issue is identified: a data sanitization bug in the `routing-engine`. Within minutes, the team deploys a hotfix. MTTR is drastically reduced, and the issue is resolved before it impacts a significant number of shipments. This workflow demonstrates a clear path from a high-level customer report to a specific line of code, turning a potentially hour-long investigation into a focused, five-minute task.