Production runbook

Grafana Tempo

Prometheus metrics

Operations

Operate the live system with health, metrics, logs, and traces.

This page is for the live system. Start with health, move to metrics, correlate logs, and only then drill into Tempo traces. The template already ships with the wiring, so the operator workflow stays simple.

Local observability stack

Tempo and Grafana endpoints

Grafana

http://127.0.0.1:3001

Tempo API

http://127.0.0.1:3200

OTLP HTTP

http://127.0.0.1:4318

Expected span tree

incoming request
  -> app.route span
  -> subscribers.create span
  -> db.subscribers.select span
  -> db.subscribers.insert span

The request trace should stay readable: route span, service span, repository span, then database child spans with stable names.
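One way to check that a trace keeps this shape is to rebuild the tree from parent links and print it. The sketch below is a minimal illustration, not the template's code: the `SpanRecord` shape, the span IDs, and the nesting of the DB spans under `subscribers.create` are assumptions made for the example.

```typescript
// Minimal span shape for illustration; real OTLP/Tempo spans carry many more fields.
interface SpanRecord {
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
}

// Render spans as an indented tree, keeping children in input order.
function renderSpanTree(spans: SpanRecord[]): string[] {
  const children = new Map<string, SpanRecord[]>();
  const roots: SpanRecord[] = [];
  const knownIds = new Set(spans.map((s) => s.spanId));

  for (const span of spans) {
    const parent = span.parentSpanId;
    if (parent === undefined || !knownIds.has(parent)) {
      roots.push(span); // root span, or an orphan whose parent is missing
    } else {
      const siblings = children.get(parent) ?? [];
      siblings.push(span);
      children.set(parent, siblings);
    }
  }

  const lines: string[] = [];
  const walk = (span: SpanRecord, depth: number): void => {
    lines.push(`${"  ".repeat(depth)}${span.name}`);
    for (const child of children.get(span.spanId) ?? []) {
      walk(child, depth + 1);
    }
  };
  roots.forEach((root) => walk(root, 0));
  return lines;
}

// Hypothetical IDs and nesting; only the span names come from the tree above.
const exampleSpans: SpanRecord[] = [
  { spanId: "1", name: "incoming request" },
  { spanId: "2", parentSpanId: "1", name: "app.route" },
  { spanId: "3", parentSpanId: "2", name: "subscribers.create" },
  { spanId: "4", parentSpanId: "3", name: "db.subscribers.select" },
  { spanId: "5", parentSpanId: "3", name: "db.subscribers.insert" },
];
const exampleTree = renderSpanTree(exampleSpans);
```

If more than one root appears in the output, a span lost its parent somewhere between the route handler and the database call, which usually points at broken context propagation.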

Health first

Health JSON gives a fast answer on reachability, DB mode, and tracing state.

Metrics ready

Prometheus metrics expose the request surface and database activity without extra glue code.

Trace continuity

Tempo keeps route, service, repository, and DB spans in one request tree.

Correlated logs

Request IDs and trace IDs stay in logs so incidents can be reconstructed quickly.

HTTP metrics

Route counters and durations are exposed through prom-client at /metrics.

Structured logs

Pino logs include request IDs and active trace IDs for correlation during incidents.

Distributed traces

Route, service, repository, and DB spans are exported through OTLP into Tempo.

Database visibility

The sample subscriber feature shows that DB spans sit under the same parent request trace.

Signals

What good coverage looks like

These defaults are already in the template and should remain part of the baseline as you add features.

Structured JSON logging with request IDs and active trace IDs.
Prometheus-compatible metrics for Node.js, HTTP routes, and database operations.
OpenTelemetry spans that flow from route handlers into services and repositories.
Configurable ignore paths so that noisy requests such as /metrics or static assets do not flood tracing.

Tracing controls

Ignore-path example

Ignore low-value traffic so Tempo keeps the spans that actually help during incident analysis.

OTEL_TRACE_IGNORE_PATHS

/metrics,_next/static,_next/image,favicon.ico

Keep `/metrics`, static assets, and image optimizer requests out of the tracing pipeline unless you explicitly need them for a debugging session.
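A rough sketch of how a comma-separated ignore list like this could be applied is shown below. The function names and the substring matching rule are assumptions for illustration; the template's actual matching logic may differ (for example, it could use prefixes or patterns instead).

```typescript
// Split the raw variable value into clean entries; an unset variable yields an empty list.
function parseIgnorePaths(raw: string | undefined): string[] {
  if (raw === undefined) return [];
  return raw
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Assumed rule: a request is skipped when its path contains any ignore entry.
function shouldTrace(requestPath: string, ignorePaths: string[]): boolean {
  return !ignorePaths.some((entry) => requestPath.includes(entry));
}

const ignoreList = parseIgnorePaths("/metrics,_next/static,_next/image,favicon.ico");
const tracedApi = shouldTrace("/api/subscribers", ignoreList);
const tracedMetrics = shouldTrace("/metrics", ignoreList);
```

With substring matching, an entry such as /metrics would also suppress a path like /api/metrics-report, so short entries deserve a second look before they go into the list.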

Commands

Tempo and Grafana workflow

These commands are the most useful for validating observability locally.

pnpm e2e
  Migrate the local database, build the app, and run Playwright.
pnpm observability:up
  Boot Grafana and Tempo locally with ready-to-use provisioning.
pnpm observability:test
  Start the stack, run the app, and verify spans reach Tempo.

Sequence

From symptom to span

Use this order during an incident so you move from coarse signal to precise signal without wasting time.

1. Check health

Call /api/health first. It confirms the app is reachable, shows the database mode, and tells you whether tracing is enabled.
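As an illustration of this step, the sketch below turns a health response body into a short list of findings. The field names (`status`, `databaseMode`, `tracingEnabled`) are hypothetical; the real /api/health payload may be shaped differently, so treat this as a pattern rather than the template's contract.

```typescript
// Hypothetical response shape; adjust the field names to the real payload.
interface HealthReport {
  status: string;        // e.g. "ok"
  databaseMode: string;  // e.g. "postgres"
  tracingEnabled: boolean;
}

// Return a list of findings; an empty list means nothing stood out.
function checkHealth(rawBody: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return ["health response is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["health response is not a JSON object"];
  }

  const report = parsed as Partial<HealthReport>;
  const findings: string[] = [];
  if (report.status !== "ok") {
    findings.push(`status is ${String(report.status)}`);
  }
  if (typeof report.databaseMode !== "string") {
    findings.push("database mode is missing");
  }
  if (report.tracingEnabled !== true) {
    findings.push("tracing is not enabled, so no spans will reach Tempo");
  }
  return findings;
}

const healthyFindings = checkHealth(
  '{"status":"ok","databaseMode":"postgres","tracingEnabled":true}'
);
```

Checking the tracing flag here saves time later: if tracing is off, the trace step at the end of the sequence will come up empty no matter how the request behaved.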

2. Check metrics

Open /metrics or your Prometheus target and look for route counters, latency, and DB-related measurements before digging deeper.
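For a quick look without a Prometheus server, the text served at /metrics can be summarized directly. The sketch below sums samples per metric name from the Prometheus text exposition format. The metric names in the sample are illustrative assumptions, not necessarily the names the template exports.

```typescript
// Sum all samples per metric name, ignoring labels, comments, and blank lines.
function sumSamplesByName(exposition: string): Map<string, number> {
  const totals = new Map<string, number>();
  // name, optional {labels}, then the sample value
  const sampleLine = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/;

  for (const line of exposition.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue; // HELP/TYPE comments
    const match = sampleLine.exec(trimmed);
    if (match === null) continue;
    const value = Number(match[3]);
    if (!Number.isFinite(value)) continue; // skip NaN and infinite samples
    totals.set(match[1], (totals.get(match[1]) ?? 0) + value);
  }
  return totals;
}

// Hypothetical sample; metric and label names are placeholders.
const sampleExposition = [
  "# HELP http_requests_total Total HTTP requests",
  "# TYPE http_requests_total counter",
  'http_requests_total{route="/api/health",method="GET"} 12',
  'http_requests_total{route="/api/subscribers",method="POST"} 3',
  "process_cpu_seconds_total 0.5",
].join("\n");
const metricTotals = sumSamplesByName(sampleExposition);
```

Summing across labels is only meaningful for counters; for histograms and gauges, compare individual series instead of the total.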

3. Correlate logs

Use request IDs or trace IDs in logs to identify the exact request window that matters.
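Because the logs are structured JSON, narrowing them to one request can be a simple filter. The sketch below assumes one JSON object per line and field names `traceId` and `requestId`; both names are assumptions, so match them to whatever the logger actually emits.

```typescript
// Assumed log fields; only the ones used for correlation are listed.
interface LogEntry {
  time?: number;      // epoch milliseconds
  msg?: string;
  requestId?: string; // assumed field name
  traceId?: string;   // assumed field name
  [key: string]: unknown;
}

// Return all entries for one trace ID, oldest first.
function findByTraceId(ndjson: string, traceId: string): LogEntry[] {
  const matches: LogEntry[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON, such as raw stack traces
    }
    if (
      typeof entry === "object" &&
      entry !== null &&
      (entry as LogEntry).traceId === traceId
    ) {
      matches.push(entry as LogEntry);
    }
  }
  return matches.sort((a, b) => (a.time ?? 0) - (b.time ?? 0));
}

// Hypothetical log lines for illustration.
const sampleLogs = [
  '{"time":2,"msg":"insert done","requestId":"r1","traceId":"abc"}',
  "plain text noise",
  '{"time":1,"msg":"request started","requestId":"r1","traceId":"abc"}',
  '{"time":3,"msg":"other request","requestId":"r2","traceId":"def"}',
].join("\n");
const traceLogs = findByTraceId(sampleLogs, "abc");
```

The first and last timestamps in the result give the request window to look at in metrics and in Tempo.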

4. Inspect traces

Open Tempo and follow the request from route span to service span to repository span so slow points are visible in one tree.
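When a trace ID is already known from the logs, the trace can also be fetched straight from Tempo's HTTP API, which serves a trace by ID at /api/traces/<traceID>. The helper below is a small sketch with a hypothetical function name; it uses the local Tempo API address listed on this page as its default base URL.

```typescript
// Local Tempo API address from the endpoint list on this page.
const TEMPO_BASE_URL = "http://127.0.0.1:3200";

// Build the URL for Tempo's trace-by-ID endpoint.
function tempoTraceUrl(traceId: string, baseUrl: string = TEMPO_BASE_URL): string {
  // OpenTelemetry trace IDs are 16 bytes, written as 32 hex characters.
  if (!/^[0-9a-f]{32}$/i.test(traceId)) {
    throw new Error(`not a 32-character hex trace ID: ${traceId}`);
  }
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${trimmedBase}/api/traces/${traceId.toLowerCase()}`;
}

// Example trace ID for illustration only.
const exampleTraceUrl = tempoTraceUrl("4bf92f3577b34da6a3ce929d0e0e4736");
```

Validating the ID before calling Tempo catches the common mistake of pasting a request ID or a span ID where a trace ID is expected.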