Observability

Farm includes a fully integrated observability stack for monitoring API performance, tracking errors, inspecting distributed traces, and aggregating logs from all containers. The stack is opt-in and runs alongside the main application via a Docker Compose override.

Architecture

```mermaid
graph LR
    A[Farm API] -->|/api/v1/metrics| P[Prometheus]
    A -->|OTLP HTTP| T[Tempo]
    W[Farm Web] -->|/api/v1/traces/ingest proxy| A
    A --> T
    D[Docker containers] -->|stdout/stderr| PT[Promtail]
    PT -->|push| L[Loki]
    H[Host OS] -->|cpu/mem/disk/net| NE[Node Exporter]
    NE --> P
    P --> G[Grafana]
    T --> G
    L --> G
```
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Tempo | Distributed trace storage | 3200 |
| Loki | Log aggregation and storage | 3100 |
| Promtail | Log collector (Docker container logs) | — |
| Node Exporter | Host infrastructure metrics | 9100 |
| Grafana | Dashboards and visualization | 3002 |

Quick Start

Start the full observability stack with one command:

```shell
make up-observability
```

This launches the base stack (PostgreSQL, Redis, API) plus Prometheus, Tempo, Loki, Promtail, Node Exporter, and Grafana. The API container is automatically configured with:

  • OTEL_ENABLED=true — enables OpenTelemetry trace export
  • OTEL_EXPORTER_ENDPOINT=http://tempo:4318/v1/traces — sends traces to Tempo
  • LOKI_URL=http://loki:3100 — Loki endpoint for the Farm log proxy
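If you need to reproduce this configuration outside the Makefile, the same variables can be set in a compose override. A minimal sketch (the service name and file name are assumptions based on the compose layout described later in this page):

```yaml
# Illustrative override mirroring what `make up-observability` configures
services:
  api:
    environment:
      OTEL_ENABLED: "true"
      OTEL_EXPORTER_ENDPOINT: "http://tempo:4318/v1/traces"
      LOKI_URL: "http://loki:3100"
```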

Accessing the Dashboards

| Service | URL |
|---|---|
| Grafana | http://localhost:3002 |
| Prometheus | http://localhost:9090 |
| Tempo | http://localhost:3200 |
| Loki | http://localhost:3100 |
| Node Exporter | http://localhost:9100 |
| Farm API | http://localhost:3000/api |

Grafana starts with anonymous access enabled (no login required) for local development convenience.

Stopping the Stack

```shell
make down-observability
```

Pre-Configured Dashboards

The stack ships with three dashboards that are provisioned automatically.

Farm API Overview

Monitors the NestJS application in real time.

Request Rate

  • Total request rate (requests per second) across all endpoints
  • Error request rate (5xx responses per second)
  • Per-route breakdown by HTTP method and route pattern

Latency

  • p50, p95, p99 percentiles of request duration
  • Average response time as a stat panel
  • Duration heatmap showing the distribution of response times

Error Rate

  • Error rate percentage (5xx / total) with color-coded thresholds:
    • Green: < 1%
    • Yellow: 1% — 5%
    • Red: > 5%

Traces

  • Recent traces table from Tempo, showing the latest 20 traces for farm-api and farm-web services. Click a trace ID to view the full span waterfall.

Business Metrics

  • Pipeline Executions Rate — pipeline_executions_total by status (success / failure / cancelled)
  • Component Operations Rate — component_operations_total by operation (create / update / delete)
  • Deployment Operations Total — deployment_operations_total stat panel by operation and status

Farm — Application Logs

Aggregates container logs collected by Promtail from all running Farm services.

| Panel | Description |
|---|---|
| Log Rate (lines/min) | Time series of log throughput per container |
| Error Rate (lines/min) | Time series of level=error logs per container |
| Error Count (last 1h) | Stat panel — total error log lines in the last hour |
| Warn Count (last 1h) | Stat panel — total warn log lines in the last hour |
| Total Logs (last 1h) | Stat panel — all log lines in the last hour |
| API Logs — farm-api | Live log panel filtered to container="farm-api" |
| All Container Logs | Live log panel showing all Farm containers |

Logs are labeled with project=farm, container, service, level, and context (Winston field). Use these labels in LogQL to filter by container or log level.


Farm — Infrastructure

Host-level metrics from Node Exporter.

| Panel | Description |
|---|---|
| CPU Usage % | Time series of total host CPU utilization |
| Memory Usage % | Time series of host memory utilization |
| CPU — Current | Stat panel — current CPU % |
| Memory — Current | Stat panel — current memory % |
| Total Memory | Stat panel — total RAM installed |
| Disk Read/Write (bytes/s) | Time series of disk I/O throughput |
| Network Traffic (bytes/s) | Time series of network receive/transmit |
| Disk Usage % | Bar gauge per filesystem mount point |

Log Collection

Promtail uses Docker service discovery (docker_sd_configs) to automatically collect stdout/stderr from every running Farm container. Logs are pushed to Loki in real time.
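As an illustration of how such a scrape job is wired up, the following is a hedged sketch of a docker_sd_configs block (not the project's actual Promtail config; the relabeling shown produces the container label described below):

```yaml
# Illustrative Promtail scrape job using Docker service discovery.
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker prefixes container names with "/"; strip it for the label.
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```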

Labels applied to all log streams

| Label | Source |
|---|---|
| container | Docker container name (e.g. farm-api) |
| service | Docker Compose service name |
| project | Always farm |
| job | Always docker |
| level | Extracted from Winston JSON field (when present) |
| context | Extracted from Winston JSON field (when present) |

Querying logs in Grafana Explore

```logql
# All logs from the API
{container="farm-api"}

# Error logs from all containers
{project="farm", level="error"}

# Logs from a specific NestJS context
{container="farm-api", context="HttpException"}
```
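LogQL also supports metric queries over log streams, which is how panels like the per-container error rate can be built. For example, a query along these lines approximates error log throughput:

```logql
# Error log lines per second, per container, over the last 5 minutes
sum by (container) (rate({project="farm", level="error"}[5m]))
```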

Metrics Reference

The API exposes these custom Prometheus metrics at GET /api/v1/metrics:

HTTP Metrics (auto-instrumented)

| Metric | Type | Labels | Description |
|---|---|---|---|
| http_requests_total | Counter | method, route, status_code | Total HTTP requests received |
| http_request_duration_seconds | Histogram | method, route, status_code | Request duration in seconds |

Business Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| pipeline_executions_total | Counter | status, pipeline_id | Pipeline runs completed |
| component_operations_total | Counter | operation | Component create/update/delete operations |
| deployment_operations_total | Counter | operation, status | Deployment create/update operations |
| team_operations_total | Counter | operation | Team create/update/delete operations |

In addition, all default Node.js process metrics are exposed (CPU, memory, event loop lag, GC).

Example PromQL Queries

Request rate over the last 5 minutes:

```promql
sum(rate(http_requests_total{job="farm-api"}[5m]))
```

95th percentile latency by route:

```promql
histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket{job="farm-api"}[5m])))
```

Error rate as a fraction of all requests (multiply by 100 for a percentage):

```promql
sum(rate(http_requests_total{job="farm-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="farm-api"}[5m]))
```

Host CPU usage:

```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)
```

Host memory usage:

```promql
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```
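The Disk Usage % panel on the infrastructure dashboard relies on the standard node_filesystem metrics; an equivalent ad-hoc query might look like the following (the fstype filter is an assumption — adjust it to your mounts):

```promql
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})
```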

Tracing

When OTEL_ENABLED=true, the API exports distributed traces via OpenTelemetry (OTLP HTTP protocol). The auto-instrumentations cover:

  • HTTP -- incoming and outgoing HTTP requests
  • Express -- route-level spans for each controller handler
  • TypeORM -- database query spans with SQL statements

Log-Trace Correlation

In production mode, Winston log entries automatically include trace_id and span_id fields. You can use these values to jump from a log entry in your aggregator directly to the corresponding trace in Grafana/Tempo.

Configuring a Custom Trace Backend

To send traces to a different backend (Jaeger, Datadog, New Relic), update the OTEL_EXPORTER_ENDPOINT environment variable:

```yaml
# docker-compose.observability.yml
services:
  api:
    environment:
      OTEL_EXPORTER_ENDPOINT: "http://your-collector:4318/v1/traces"
```

Configuration Reference

| Environment Variable | Default | Description |
|---|---|---|
| OTEL_ENABLED | false | Enable OpenTelemetry trace export |
| OTEL_EXPORTER_ENDPOINT | http://localhost:4318/v1/traces | OTLP HTTP endpoint for traces |
| OTEL_SERVICE_NAME | farm-api | Service name in trace metadata |

Frontend Tracing

The Next.js frontend (farm-web) ships with a complete browser-side OpenTelemetry setup.

Browser SDK

apps/web/src/lib/tracing.ts bootstraps the @opentelemetry/sdk-trace-web SDK on first page load via the <TracingInit /> component placed in the root layout. It registers auto-instrumentations for:

  • Fetch / XHR — automatically injects traceparent headers into all outgoing API calls, linking browser spans to backend spans in Tempo
  • Document load — captures navigation timing as a root span

Spans are exported via OTLP HTTP to POST /api/v1/traces/ingest, a lightweight NestJS proxy that forwards them to Tempo. This avoids CORS issues since the browser posts to the same origin as the API.
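The forwarding step of such a proxy amounts to re-posting the OTLP payload to Tempo. A stand-alone sketch of that idea (the function name and default URL are illustrative, not the actual controller code; it assumes the global fetch available in Node 18+):

```typescript
// Hypothetical forwarder: re-posts an OTLP JSON payload to Tempo and
// returns Tempo's HTTP status code to the caller.
async function forwardTraces(
  payload: string,
  tempoUrl = "http://tempo:4318/v1/traces",
): Promise<number> {
  const res = await fetch(tempoUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: payload,
  });
  return res.status;
}
```

Because the browser only ever talks to the API's origin, no CORS headers are needed on Tempo itself.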

Web Vitals

apps/web/src/lib/web-vitals.ts reports five Web Vitals metrics (LCP, CLS, FID, TTFB, INP) as OTel spans:

| Span name | Metric | Attributes |
|---|---|---|
| web_vitals.lcp | LCP | web_vital.value, web_vital.rating |
| web_vitals.cls | CLS | web_vital.value, web_vital.rating |
| web_vitals.fid | FID | web_vital.value, web_vital.rating |
| web_vitals.ttfb | TTFB | web_vital.value, web_vital.rating |
| web_vitals.inp | INP | web_vital.value, web_vital.rating |
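The web_vital.rating attribute follows the rating buckets published by the web-vitals library. As a hedged illustration for LCP (the helper name is hypothetical; the thresholds are the library's documented defaults of 2500ms and 4000ms):

```typescript
// Maps a raw LCP value in milliseconds to the standard rating buckets:
// "good" <= 2500ms, "needs-improvement" <= 4000ms, otherwise "poor".
type Rating = "good" | "needs-improvement" | "poor";

function lcpRating(valueMs: number): Rating {
  if (valueMs <= 2500) return "good";
  if (valueMs <= 4000) return "needs-improvement";
  return "poor";
}
```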

Manual Spans

apps/web/src/lib/otel-spans.ts provides two typed helpers used throughout the frontend:

  • startSpan(name, attributes?) — returns a Span for manual lifecycle control
  • recordSpan<T>(name, fn, attributes?) — wraps a sync/async function, auto-sets SpanStatusCode.OK/ERROR, and always ends the span
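The contract of the recordSpan helper can be sketched independently of the OTel SDK. This illustrative version uses a simplified stand-in for the real @opentelemetry/api Span type and shows the set-status-and-always-end pattern:

```typescript
// Simplified stand-in for the OTel Span interface, for illustration only.
interface SpanLike {
  setStatus(status: { code: "OK" | "ERROR"; message?: string }): void;
  end(): void;
}

// Wraps a sync/async function in a span: status OK on success, ERROR on
// throw (which is re-thrown), and the span is always ended.
async function recordSpanSketch<T>(
  span: SpanLike,
  fn: () => T | Promise<T>,
): Promise<T> {
  try {
    const result = await fn();
    span.setStatus({ code: "OK" });
    return result;
  } catch (err) {
    span.setStatus({ code: "ERROR", message: String(err) });
    throw err;
  } finally {
    span.end();
  }
}
```

The finally block is the important part: even when the wrapped function throws, the span is ended, so no trace is left dangling.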

Key spans currently instrumented:

| Span name | Triggered by | Attributes |
|---|---|---|
| auth.login | Login form submission | auth.method, result, error.message |
| catalog.search | Search term change in catalog | search.query, search.results_count |
| catalog.component.view | Component detail page load | component.id, component.kind |
| pipeline.run.trigger | Trigger run button click | pipeline.id |
| org.switch | Organization switcher selection | org.id, org.name |

User Context

apps/web/src/lib/otel-context.ts stores the authenticated user identity and sets it as span attributes:

```typescript
// Called after successful login or org switch
setUserContext(userId, username, orgId?)

// Called on logout
clearUserContext()
```
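A minimal sketch of how such a context store can work (this is not the actual otel-context.ts implementation, just the attribute-merging idea it is built on):

```typescript
// Illustrative user-context store: holds the current identity and exposes
// it as span attributes for any span started afterwards.
let userAttrs: Record<string, string> = {};

function setUserContext(userId: string, username: string, orgId?: string): void {
  userAttrs = { "user.id": userId, "user.name": username };
  if (orgId) userAttrs["org.id"] = orgId;
}

function clearUserContext(): void {
  userAttrs = {};
}

// Called when starting any span to stamp the current user onto it.
function currentUserAttributes(): Record<string, string> {
  return { ...userAttrs };
}
```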

This links all browser spans to the authenticated user, enabling filtering by user in Grafana/Tempo.

Adding Custom Dashboards

Place JSON dashboard files in observability/grafana/provisioning/dashboards/. They are automatically loaded by Grafana on startup.

Adding Alert Rules

Create an observability/grafana/provisioning/alerting/ directory and add alert rule YAML files. See the Grafana provisioning docs for the schema.

Using External Prometheus

If you already have a Prometheus instance, point it at http://<farm-host>:3000/api/v1/metrics with a scrape config:

```yaml
scrape_configs:
  - job_name: "farm-api"
    metrics_path: "/api/v1/metrics"
    static_configs:
      - targets: ["<farm-host>:3000"]
```

Observability Proxy Module (FARM-E27)

The ObservabilityModule (apps/api/src/common/observability/) exposes thin HTTP proxies that forward requests from the Farm UI to external observability tools (Prometheus, Jaeger/Tempo, Loki). It does not store or process data.

Design principles

  • Thin proxy: Forwards requests and returns responses as-is.
  • Graceful degradation: Every method catches all errors and returns { error: '<tool> not available', data: null } with HTTP 200.
  • Admin-only: All endpoints require @Roles('admin').

Adding a new proxy method

```typescript
// 1. Add to ObservabilityService
async queryNewTool(params: Record<string, string>): Promise<unknown> {
  const baseUrl = this.configService.get<string>('newtool.url');
  try {
    const { data } = await firstValueFrom(
      this.httpService.get(`${baseUrl}/api/v1/endpoint`, { params }),
    );
    return data;
  } catch {
    return { error: 'NewTool not available', data: null };
  }
}

// 2. Add to ObservabilityController
@Get('newtool/query')
@Roles('admin')
@UseGuards(JwtAuthGuard, RolesGuard)
async queryNewTool(@Query() query: Record<string, string>) {
  return this.observabilityService.queryNewTool(query);
}
```

AlertingRule module

apps/api/src/modules/alerting/ — CRUD for PromQL-based alerting rules linked to components or environments. Registered as plugin core-alerting. Migration: 1773684432000-add-alerting-rules.ts.

WebSocket event broadcasting

Inject EventsGateway with @Optional() and call this.eventsGateway?.server?.emit(FarmEvent.EVENT_NAME, payload).
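The optional-chaining guard means a service keeps working when the gateway is not registered. A stand-alone sketch of that pattern (the interfaces are simplified stand-ins for EventsGateway and the socket.io server, not the real types):

```typescript
// Simplified stand-ins: the gateway may be absent (undefined when the
// WebSocket module is disabled), and its server may not be ready yet.
interface ServerLike {
  emit(event: string, payload: unknown): void;
}
interface GatewayLike {
  server?: ServerLike;
}

// Emits an event if both gateway and server exist; otherwise a silent no-op.
// Returns whether the event was actually broadcast.
function broadcast(
  gateway: GatewayLike | undefined,
  event: string,
  payload: unknown,
): boolean {
  gateway?.server?.emit(event, payload);
  return Boolean(gateway?.server);
}
```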

The FarmEvent enum must be kept in sync between packages/types/src/index.ts and apps/api/src/common/events/events.interfaces.ts.