Monitoring & Observability - BIO-QMS Platform

Overview

This document defines the monitoring and observability architecture for the BIO-QMS regulated SaaS platform. The architecture is designed to support compliance with FDA 21 CFR Part 11, HIPAA, and SOC 2. The platform combines Google Cloud Platform's native observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace) with OpenTelemetry instrumentation to cover metrics, logs, traces, and alerts.

Observability Pillars

Pillar	Technology	Purpose	Compliance Impact
Metrics	Cloud Monitoring	Performance, availability, SLO tracking	SOC 2 availability controls
Logs	Cloud Logging	Audit trail, debugging, compliance	FDA 21 CFR Part 11 §11.10(e)
Traces	Cloud Trace + OpenTelemetry	Request flow, latency analysis	Performance verification
Alerts	Cloud Monitoring Alerting	Proactive incident detection	HIPAA breach notification

Regulatory Requirements

FDA 21 CFR Part 11 §11.10(e): Use of secure, computer-generated, time-stamped audit trails
HIPAA Security Rule: Audit controls (§164.312(b)), integrity controls (§164.312(c)(1))
SOC 2 CC7.2: System monitoring to detect security incidents
SOC 2 CC7.3: System availability monitoring and alerting

E.3.1: Cloud Monitoring Dashboards

Dashboard Architecture

┌─────────────────────────────────────────────────────────────┐
│                  Cloud Monitoring Workspace                  │
├──────────────────┬──────────────────┬──────────────────────┤
│  API Operations  │  Database Layer  │  QMS Business KPIs   │
│   Dashboard      │    Dashboard     │     Dashboard        │
├──────────────────┼──────────────────┼──────────────────────┤
│ - Request Rate   │ - Connection     │ - Documents Signed   │
│ - Latency (p50,  │   Pool Usage     │ - CAPA Resolution    │
│   p95, p99)      │ - Query Latency  │   Time               │
│ - Error Rate     │ - Replication    │ - Audit Events/Hour  │
│ - HTTP Status    │   Lag            │ - Active Sessions    │
│   Distribution   │ - Disk I/O       │ - Compliance Score   │
├──────────────────┼──────────────────┼──────────────────────┤
│  Cache Layer     │ Infrastructure   │  Security Metrics    │
│   Dashboard      │    Dashboard     │     Dashboard        │
├──────────────────┼──────────────────┼──────────────────────┤
│ - Hit Rate       │ - CPU Usage      │ - Failed Logins      │
│ - Miss Rate      │ - Memory Usage   │ - Auth Token Issues  │
│ - Eviction Rate  │ - Network I/O    │ - Access Violations  │
│ - Command/sec    │ - Pod Restarts   │ - Certificate Expiry │
└──────────────────┴──────────────────┴──────────────────────┘
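The dashboard inventory in the diagram can also be kept as data, which makes it easy to check where a given signal is charted. This is a minimal sketch: the panel names are transcribed from the diagram, while `dashboardPanels` and `findDashboards` are illustrative names and not part of any Cloud Monitoring API.

```typescript
// Dashboard inventory transcribed from the architecture diagram.
const dashboardPanels: Record<string, string[]> = {
  "API Operations": ["Request Rate", "Latency (p50, p95, p99)", "Error Rate", "HTTP Status Distribution"],
  "Database Layer": ["Connection Pool Usage", "Query Latency", "Replication Lag", "Disk I/O"],
  "QMS Business KPIs": ["Documents Signed", "CAPA Resolution Time", "Audit Events/Hour", "Active Sessions", "Compliance Score"],
  "Cache Layer": ["Hit Rate", "Miss Rate", "Eviction Rate", "Command/sec"],
  "Infrastructure": ["CPU Usage", "Memory Usage", "Network I/O", "Pod Restarts"],
  "Security Metrics": ["Failed Logins", "Auth Token Issues", "Access Violations", "Certificate Expiry"],
};

// Returns the dashboards that contain at least one panel whose name
// includes the keyword (case-insensitive), in diagram order.
function findDashboards(keyword: string): string[] {
  const needle = keyword.toLowerCase();
  return Object.entries(dashboardPanels)
    .filter(([, panels]) => panels.some((p) => p.toLowerCase().includes(needle)))
    .map(([name]) => name);
}

console.log(findDashboards("latency"));
```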

Dashboard 1: API Operations Dashboard

{
  "displayName": "BIO-QMS API Operations",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "API Request Rate (requests/sec)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"bio-qms-api\" AND metric.type=\"run.googleapis.com/request_count\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": ["resource.labels.service_name"]
                    }
                  }
                },
                "plotType": "LINE",
                "targetAxis": "Y1"
              }
            ],
            "timeshiftDuration": "0s",
            "yAxis": {
              "label": "Requests/sec",
              "scale": "LINEAR"
            }
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "API Latency Percentiles (ms)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_DELTA",
                      "crossSeriesReducer": "REDUCE_PERCENTILE_50"
                    }
                  }
                },
                "plotType": "LINE",
                "legendTemplate": "p50"
              },
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_DELTA",
                      "crossSeriesReducer": "REDUCE_PERCENTILE_95"
                    }
                  }
                },
                "plotType": "LINE",
                "legendTemplate": "p95"
              },
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_DELTA",
                      "crossSeriesReducer": "REDUCE_PERCENTILE_99"
                    }
                  }
                },
                "plotType": "LINE",
                "legendTemplate": "p99"
              }
            ],
            "thresholds": [
              {
                "value": 200.0,
                "color": "YELLOW",
                "direction": "ABOVE",
                "label": "SLO Target (p95 < 200ms)"
              },
              {
                "value": 500.0,
                "color": "RED",
                "direction": "ABOVE",
                "label": "Critical Threshold"
              }
            ]
          }
        }
      },
      {
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Error Rate (%)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilterRatio": {
                    "numerator": {
                      "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class=\"5xx\"",
                      "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_RATE",
                        "crossSeriesReducer": "REDUCE_SUM"
                      }
                    },
                    "denominator": {
                      "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
                      "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_RATE",
                        "crossSeriesReducer": "REDUCE_SUM"
                      }
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 0.01,
                "color": "YELLOW",
                "direction": "ABOVE",
                "label": "Warning (1%)"
              },
              {
                "value": 0.05,
                "color": "RED",
                "direction": "ABOVE",
                "label": "Critical (5%)"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "HTTP Status Distribution",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": ["metric.labels.response_code_class"]
                    }
                  }
                },
                "plotType": "STACKED_BAR"
              }
            ]
          }
        }
      },
      {
        "yPos": 8,
        "width": 12,
        "height": 4,
        "widget": {
          "title": "API Availability (SLO: 99.9%)",
          "scorecard": {
            "timeSeriesQuery": {
              "timeSeriesFilterRatio": {
                "numerator": {
                  "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class!=\"5xx\"",
                  "aggregation": {
                    "alignmentPeriod": "3600s",
                    "perSeriesAligner": "ALIGN_SUM",
                    "crossSeriesReducer": "REDUCE_SUM"
                  }
                },
                "denominator": {
                  "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
                  "aggregation": {
                    "alignmentPeriod": "3600s",
                    "perSeriesAligner": "ALIGN_SUM",
                    "crossSeriesReducer": "REDUCE_SUM"
                  }
                }
              }
            },
            "sparkChartView": {
              "sparkChartType": "SPARK_LINE"
            },
            "thresholds": [
              {
                "value": 0.999,
                "color": "YELLOW",
                "direction": "BELOW"
              },
              {
                "value": 0.995,
                "color": "RED",
                "direction": "BELOW"
              }
            ]
          }
        }
      }
    ]
  }
}
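The error-rate and availability tiles both reduce to simple ratios over request counts. The sketch below works through that arithmetic and applies the thresholds from the dashboard (1% and 5% for errors, 99.9% and 99.5% for availability). The function names, the `RequestCounts` shape, and the strict comparisons are assumptions made for illustration.

```typescript
type Status = "OK" | "WARNING" | "CRITICAL";

interface RequestCounts {
  total: number;        // all requests in the window
  serverErrors: number; // requests answered with a 5xx status
}

// Fraction of requests that failed with a 5xx status.
function errorRatio(c: RequestCounts): number {
  return c.total === 0 ? 0 : c.serverErrors / c.total;
}

// Fraction of requests that did not fail with a 5xx status.
function availability(c: RequestCounts): number {
  return c.total === 0 ? 1 : (c.total - c.serverErrors) / c.total;
}

// Thresholds from the "Error Rate" tile: warning above 1%, critical above 5%.
function errorStatus(ratio: number): Status {
  if (ratio > 0.05) return "CRITICAL";
  return ratio > 0.01 ? "WARNING" : "OK";
}

// Thresholds from the availability scorecard: warning below 99.9%, critical below 99.5%.
function availabilityStatus(value: number): Status {
  if (value < 0.995) return "CRITICAL";
  return value < 0.999 ? "WARNING" : "OK";
}

const lastHour: RequestCounts = { total: 10_000, serverErrors: 20 };
console.log(errorStatus(errorRatio(lastHour)), availabilityStatus(availability(lastHour)));
```

With 20 server errors in 10,000 requests the error ratio (0.2%) stays under the warning line, yet availability (99.8%) is already below the 99.9% SLO, which is why the two tiles are tracked separately.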

Dashboard 2: Database Performance Dashboard

{
  "displayName": "BIO-QMS Database Performance",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Database Connection Pool Usage",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/postgresql/num_backends\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                  }
                },
                "plotType": "LINE",
                "legendTemplate": "Active Connections"
              }
            ],
            "thresholds": [
              {
                "value": 80.0,
                "color": "YELLOW",
                "direction": "ABOVE",
                "label": "Warning (80 connections)"
              },
              {
                "value": 95.0,
                "color": "RED",
                "direction": "ABOVE",
                "label": "Critical (95 connections)"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Query Latency (ms)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"logging.googleapis.com/user/prisma_query_duration\" AND metric.labels.operation=\"query\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_DELTA",
                      "crossSeriesReducer": "REDUCE_PERCENTILE_95"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ]
          }
        }
      },
      {
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Disk Utilization (%)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/disk/utilization\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 0.8,
                "color": "YELLOW",
                "direction": "ABOVE"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Replication Lag (seconds)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/replication/replica_lag\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MAX"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 10.0,
                "color": "YELLOW",
                "direction": "ABOVE"
              },
              {
                "value": 60.0,
                "color": "RED",
                "direction": "ABOVE"
              }
            ]
          }
        }
      },
      {
        "yPos": 8,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Transaction Rate (tx/sec)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/postgresql/transaction_count\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "yPos": 8,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Database CPU Utilization (%)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/cpu/utilization\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 0.7,
                "color": "YELLOW",
                "direction": "ABOVE"
              },
              {
                "value": 0.9,
                "color": "RED",
                "direction": "ABOVE"
              }
            ]
          }
        }
      }
    ]
  }
}

Dashboard 3: Cache Performance Dashboard

{
  "displayName": "BIO-QMS Cache Layer (Memorystore Redis)",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Cache Hit Rate (%)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/cache_hit_ratio\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 0.8,
                "color": "YELLOW",
                "direction": "BELOW",
                "label": "Target Hit Rate (80%)"
              },
              {
                "value": 0.6,
                "color": "RED",
                "direction": "BELOW"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Commands/sec",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/commands/calls\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": ["metric.labels.cmd"]
                    }
                  }
                },
                "plotType": "STACKED_AREA"
              }
            ]
          }
        }
      },
      {
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Memory Usage (bytes)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/memory/usage\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Evicted Keys/sec",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/evicted_keys\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 100.0,
                "color": "YELLOW",
                "direction": "ABOVE",
                "label": "High Eviction Rate"
              }
            ]
          }
        }
      }
    ]
  }
}
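The cache dashboard mixes both threshold directions: hit rate is a problem when it drops BELOW a value, evictions when they rise ABOVE one. The following sketch shows one way to evaluate such thresholds against a sample value. The `Threshold` shape mirrors the JSON fields used above; `worstBreach` and the strict comparisons are assumptions, not Cloud Monitoring behavior.

```typescript
type Direction = "ABOVE" | "BELOW";
type Color = "YELLOW" | "RED";

interface Threshold {
  value: number;
  color: Color;
  direction: Direction;
  label?: string;
}

// True when the sample is on the "bad" side of the threshold.
function breached(t: Threshold, sample: number): boolean {
  return t.direction === "ABOVE" ? sample > t.value : sample < t.value;
}

// Returns the most severe color among breached thresholds, or null if none is breached.
function worstBreach(thresholds: Threshold[], sample: number): Color | null {
  const colors = thresholds.filter((t) => breached(t, sample)).map((t) => t.color);
  if (colors.includes("RED")) return "RED";
  return colors.includes("YELLOW") ? "YELLOW" : null;
}

// Values copied from the cache dashboard definition.
const hitRateThresholds: Threshold[] = [
  { value: 0.8, color: "YELLOW", direction: "BELOW", label: "Target Hit Rate (80%)" },
  { value: 0.6, color: "RED", direction: "BELOW" },
];
const evictionThresholds: Threshold[] = [
  { value: 100.0, color: "YELLOW", direction: "ABOVE", label: "High Eviction Rate" },
];

console.log(worstBreach(hitRateThresholds, 0.72), worstBreach(evictionThresholds, 250));
```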

Dashboard 4: QMS Business KPIs

{
  "displayName": "BIO-QMS Business Metrics",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 4,
        "height": 4,
        "widget": {
          "title": "Documents Signed (per hour)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"logging.googleapis.com/user/qms_document_signed\"",
                    "aggregation": {
                      "alignmentPeriod": "3600s",
                      "perSeriesAligner": "ALIGN_SUM",
                      "crossSeriesReducer": "REDUCE_SUM"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ]
          }
        }
      },
      {
        "xPos": 4,
        "width": 4,
        "height": 4,
        "widget": {
          "title": "CAPA Resolution Time (hours)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"logging.googleapis.com/user/capa_resolution_duration\"",
                    "aggregation": {
                      "alignmentPeriod": "3600s",
                      "perSeriesAligner": "ALIGN_DELTA",
                      "crossSeriesReducer": "REDUCE_PERCENTILE_95"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "thresholds": [
              {
                "value": 72.0,
                "color": "YELLOW",
                "direction": "ABOVE",
                "label": "Target (72h)"
              }
            ]
          }
        }
      },
      {
        "xPos": 8,
        "width": 4,
        "height": 4,
        "widget": {
          "title": "Audit Events (per hour)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"logging.googleapis.com/user/audit_event_count\"",
                    "aggregation": {
                      "alignmentPeriod": "3600s",
                      "perSeriesAligner": "ALIGN_SUM",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": ["metric.labels.event_type"]
                    }
                  }
                },
                "plotType": "STACKED_BAR"
              }
            ]
          }
        }
      },
      {
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Active User Sessions",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"logging.googleapis.com/user/active_sessions\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_MEAN"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "yPos": 4,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Document Approval Workflow Duration (hours)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"logging.googleapis.com/user/document_approval_duration\"",
                    "aggregation": {
                      "alignmentPeriod": "3600s",
                      "perSeriesAligner": "ALIGN_DELTA",
                      "crossSeriesReducer": "REDUCE_PERCENTILE_95"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ]
          }
        }
      }
    ]
  }
}
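The CAPA and approval-workflow tiles chart a 95th percentile of durations. As a worked illustration of what that number means, the sketch below computes a nearest-rank percentile over raw samples. Cloud Monitoring estimates percentiles from histogram buckets, so its results will generally differ slightly; the `percentile` helper and the sample durations are made up for the example.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile of an empty sample set is undefined");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical CAPA resolution durations in hours.
const capaHours = [12, 30, 48, 60, 70, 71, 80, 24, 36, 96];
const p95 = percentile(capaHours, 95);
const targetHours = 72; // threshold from the dashboard tile

console.log(`p95 = ${p95}h, over target: ${p95 > targetHours}`);
```

In this sample the median is well inside the 72-hour target while the p95 is not, which is the kind of tail behavior the percentile chart is meant to expose.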

Custom Metrics Implementation

// src/monitoring/metrics.service.ts
import { Injectable } from '@nestjs/common';
import { MetricServiceClient } from '@google-cloud/monitoring';

// Metrics written through the Monitoring API need the custom.googleapis.com/ prefix.
// The logging.googleapis.com/user/ prefix is reserved for log-based metrics that
// Cloud Logging populates, and createTimeSeries rejects writes to it. Charts of
// these API-written series must filter on the custom.googleapis.com type.
const METRIC_PREFIX = 'custom.googleapis.com';

type PointValue = { int64Value: number } | { doubleValue: number };

@Injectable()
export class MetricsService {
  private readonly client: MetricServiceClient;
  private readonly projectId: string;

  constructor() {
    const projectId = process.env.GCP_PROJECT_ID;
    if (!projectId) {
      throw new Error('GCP_PROJECT_ID must be set to write custom metrics');
    }
    this.client = new MetricServiceClient();
    this.projectId = projectId;
  }

  /**
   * Record document signature event
   * @compliance FDA 21 CFR Part 11 - Electronic signatures tracking
   */
  async recordDocumentSigned(documentId: string, userId: string, orgId: string): Promise<void> {
    await this.writePoint('qms_document_signed', { int64Value: 1 }, {
      document_id: documentId,
      organization_id: orgId,
    });
  }

  /**
   * Record CAPA resolution duration
   * @compliance SOC 2 - Performance monitoring
   */
  async recordCapaResolution(capaId: string, durationHours: number): Promise<void> {
    await this.writePoint('capa_resolution_duration', { doubleValue: durationHours }, {
      capa_id: capaId,
    });
  }

  /**
   * Record audit event count
   * @compliance FDA 21 CFR Part 11 §11.10(e) - Audit trail
   */
  async recordAuditEvent(eventType: string, userId: string, resourceType: string): Promise<void> {
    await this.writePoint('audit_event_count', { int64Value: 1 }, {
      event_type: eventType,
      resource_type: resourceType,
    });
  }

  /**
   * Update active session count
   */
  async updateActiveSessions(count: number): Promise<void> {
    await this.writePoint('active_sessions', { int64Value: count });
  }

  /**
   * Record document approval workflow duration
   */
  async recordApprovalDuration(workflowId: string, durationHours: number): Promise<void> {
    await this.writePoint('document_approval_duration', { doubleValue: durationHours }, {
      workflow_id: workflowId,
    });
  }

  /**
   * Write a single point, timestamped now, to the named custom metric.
   */
  private async writePoint(
    name: string,
    value: PointValue,
    labels: Record<string, string> = {},
  ): Promise<void> {
    await this.client.createTimeSeries({
      name: this.client.projectPath(this.projectId),
      timeSeries: [
        {
          metric: { type: `${METRIC_PREFIX}/${name}`, labels },
          // cloud_run_revision is not a writable resource type for custom metrics,
          // so points are attached to the project-wide "global" resource.
          resource: { type: 'global', labels: { project_id: this.projectId } },
          points: [
            {
              interval: { endTime: { seconds: Math.floor(Date.now() / 1000) } },
              value,
            },
          ],
        },
      ],
    });
  }
}
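The request body passed to `createTimeSeries` can be assembled by a pure function, which keeps the payload shape testable without GCP credentials. The sketch below is one way to do that under stated assumptions: `buildTimeSeries` and `TimeSeriesPayload` are hypothetical names, the metric type and resource are supplied by the caller, and the value kind is passed explicitly because a metric declared as DOUBLE must always receive `doubleValue`, even for whole numbers.

```typescript
type ValueKind = "int64" | "double";

interface TimeSeriesPayload {
  metric: { type: string; labels: Record<string, string> };
  resource: { type: string; labels: Record<string, string> };
  points: Array<{
    interval: { endTime: { seconds: number } };
    value: { int64Value: number } | { doubleValue: number };
  }>;
}

// Builds a single-point time series. The clock is injectable so tests are deterministic.
function buildTimeSeries(
  metricType: string,
  resource: { type: string; labels: Record<string, string> },
  kind: ValueKind,
  value: number,
  labels: Record<string, string> = {},
  now: Date = new Date(),
): TimeSeriesPayload {
  if (kind === "int64" && !Number.isInteger(value)) {
    throw new RangeError(`int64 metric received non-integer value ${value}`);
  }
  return {
    metric: { type: metricType, labels },
    resource,
    points: [
      {
        interval: { endTime: { seconds: Math.floor(now.getTime() / 1000) } },
        value: kind === "int64" ? { int64Value: value } : { doubleValue: value },
      },
    ],
  };
}

// Example: a 72-hour CAPA resolution recorded as a DOUBLE point.
const payload = buildTimeSeries(
  "custom.googleapis.com/capa_resolution_duration", // assumed metric type
  { type: "global", labels: { project_id: "example-project" } }, // placeholder project
  "double",
  72,
  { capa_id: "CAPA-001" }, // placeholder identifier
  new Date("2024-01-01T00:00:00Z"),
);
console.log(JSON.stringify(payload.points[0]));
```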

SLO Definitions

# slo-definitions.yaml
# Service Level Objectives for BIO-QMS Platform

apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
  name: api-availability-slo
spec:
  displayName: "API Availability SLO"
  serviceLevelIndicator:
    requestBased:
      goodTotalRatio:
        totalServiceFilter: >
          resource.type="cloud_run_revision"
          AND resource.labels.service_name="bio-qms-api"
          AND metric.type="run.googleapis.com/request_count"
        goodServiceFilter: >
          resource.type="cloud_run_revision"
          AND resource.labels.service_name="bio-qms-api"
          AND metric.type="run.googleapis.com/request_count"
          AND metric.labels.response_code_class!="5xx"
  goal: 0.999  # 99.9% availability
  rollingPeriod: "2592000s"  # 30 days
  complianceNote: "SOC 2 CC7.1 - System availability commitment"

---
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
  name: api-latency-slo
spec:
  displayName: "API Latency SLO (p95 < 200ms)"
  serviceLevelIndicator:
    requestBased:
      distributionCut:
        distributionFilter: >
          resource.type="cloud_run_revision"
          AND metric.type="run.googleapis.com/request_latencies"
        range:
          max: 200.0  # milliseconds
  goal: 0.95  # 95% of requests under 200ms
  rollingPeriod: "2592000s"
  complianceNote: "SOC 2 CC7.2 - Performance monitoring"

---
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
  name: database-query-latency-slo
spec:
  displayName: "Database Query Latency SLO (p95 < 100ms)"
  serviceLevelIndicator:
    requestBased:
      distributionCut:
        distributionFilter: >
          metric.type="logging.googleapis.com/user/prisma_query_duration"
        range:
          max: 100.0
  goal: 0.95
  rollingPeriod: "2592000s"
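Each SLO goal implies an error budget: the share of the rolling period, or of requests, that may fail before the objective is missed. The sketch below works through that arithmetic for the 99.9% availability goal over 30 days. The helper names are illustrative and the request counts are made-up inputs.

```typescript
// Time-based view: seconds of full outage the goal tolerates per rolling period.
function errorBudgetSeconds(goal: number, rollingPeriodSeconds: number): number {
  return (1 - goal) * rollingPeriodSeconds;
}

// Request-based view: fraction of the budget still unspent (negative once overspent).
function budgetRemaining(goal: number, totalRequests: number, badRequests: number): number {
  const allowedBad = (1 - goal) * totalRequests;
  return allowedBad === 0 ? 0 : 1 - badRequests / allowedBad;
}

// Burn rate: how fast the budget is consumed relative to plan.
// 1.0 means the budget lasts exactly one rolling period.
function burnRate(goal: number, observedErrorRatio: number): number {
  return observedErrorRatio / (1 - goal);
}

const THIRTY_DAYS = 2_592_000; // matches rollingPeriod: "2592000s"
const budget = errorBudgetSeconds(0.999, THIRTY_DAYS);
console.log(`budget: ${(budget / 60).toFixed(1)} minutes per 30 days`);
console.log(`remaining: ${(budgetRemaining(0.999, 1_000_000, 400) * 100).toFixed(0)}%`);
console.log(`burn rate at 0.5% errors: ${burnRate(0.999, 0.005).toFixed(1)}x`);
```

A 99.9% goal over 30 days leaves roughly 43 minutes of total unavailability, so a sustained 0.5% error ratio would exhaust the budget in about six days.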

E.3.2: Alerting Policies

Alert Severity Levels

Severity	Response Time	Channels	Escalation
Critical	Immediate (5 min)	PagerDuty, Slack, Email	On-call engineer → Manager (15 min)
Warning	30 minutes	Slack, Email	Team channel → On-call (60 min)
Informational	Next business day	Email	None
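The severity table can be expressed as a small routing policy. The sketch below encodes the channels and escalation steps from the table and answers who an alert escalates to after a given time without acknowledgement. Reading the parenthesized times as "escalate after N minutes unacknowledged" is an assumption, as are the type and function names.

```typescript
type Severity = "critical" | "warning" | "informational";

interface SeverityPolicy {
  channels: string[];
  // Escalation step from the table; null where the table says "None".
  escalation: { to: string; afterMinutes: number } | null;
}

const policies: Record<Severity, SeverityPolicy> = {
  critical: {
    channels: ["PagerDuty", "Slack", "Email"],
    escalation: { to: "Manager", afterMinutes: 15 },
  },
  warning: {
    channels: ["Slack", "Email"],
    escalation: { to: "On-call", afterMinutes: 60 },
  },
  informational: {
    channels: ["Email"],
    escalation: null,
  },
};

// Returns the escalation target once the alert has gone unacknowledged
// for long enough, or null if no escalation applies yet (or at all).
function escalationTarget(severity: Severity, minutesUnacknowledged: number): string | null {
  const step = policies[severity].escalation;
  return step !== null && minutesUnacknowledged >= step.afterMinutes ? step.to : null;
}

console.log(policies.critical.channels, escalationTarget("critical", 20));
```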

Critical Alerts

Alert 1: API Service Down

# alerts/critical/api-down.yaml
displayName: "CRITICAL: API Service Down"
documentation:
  content: |
    The BIO-QMS API service is completely down.

    **Compliance Impact:** FDA 21 CFR Part 11, HIPAA - System unavailable

    **Runbook:** https://wiki.bioqms.com/runbooks/api-down

    **Steps:**
    1. Check Cloud Run service status: gcloud run services describe bio-qms-api
    2. Check recent deployments: gcloud run revisions list
    3. Review logs: gcloud logging read "resource.type=cloud_run_revision" --limit 50
    4. Verify database connectivity
    5. If necessary, rollback to previous revision
  mimeType: text/markdown

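combiner: OR

conditions:
  - displayName: "No successful requests in 5 minutes"
    # A threshold condition is evaluated only against data points that exist.
    # If the service stops reporting this metric entirely, a separate
    # conditionAbsent condition is needed to detect it.
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND resource.labels.service_name = "bio-qms-api"
        AND metric.type = "run.googleapis.com/request_count"
        AND metric.labels.response_code_class = "2xx"
      comparison: COMPARISON_LT
      thresholdValue: 1
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_DELTA  # requests per minute; ALIGN_RATE would compare requests per second
          crossSeriesReducer: REDUCE_SUM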

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/slack-incidents
  - projects/bio-qms-prod/notificationChannels/email-oncall

alertStrategy:
  autoClose: 1800s  # 30 minutes
  notificationRateLimit:
    period: 300s  # Re-alert every 5 minutes if unacknowledged

Alert 2: Database Unreachable

# alerts/critical/database-unreachable.yaml
displayName: "CRITICAL: Database Unreachable"
documentation:
  content: |
    Cloud SQL database is unreachable or connection pool exhausted.

    **Compliance Impact:** FDA 21 CFR Part 11 - Data integrity risk

    **Runbook:** https://wiki.bioqms.com/runbooks/database-unreachable

    **Steps:**
    1. Check Cloud SQL instance status
    2. Verify network connectivity
    3. Check connection pool metrics
    4. Review database error logs
    5. Consider scaling instance if connection pool exhausted

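conditions:
  - displayName: "Database connection errors > 10/min"
    conditionThreshold:
      # Threshold conditions filter on metrics, not on raw log fields. This
      # ASSUMES a user-defined log-based counter metric (hypothetical name
      # below) built from the log filter:
      #   resource.type="cloud_run_revision" AND jsonPayload.level="error"
      #   AND jsonPayload.context.error=~"database.*connection"
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "logging.googleapis.com/user/database_connection_errors"
      comparison: COMPARISON_GT
      thresholdValue: 10
      duration: 60s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_DELTA  # errors per 60s window, matching "10/min"
          crossSeriesReducer: REDUCE_SUM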

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/slack-incidents

Alert 3: Certificate Expiry

# alerts/critical/certificate-expiry.yaml
displayName: "CRITICAL: TLS Certificate Expiring Soon"
documentation:
  content: |
    TLS certificate expires in less than 7 days.

    **Compliance Impact:** HIPAA Security Rule - Encryption in transit

    **Runbook:** https://wiki.bioqms.com/runbooks/certificate-renewal

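conditions:
  - displayName: "Certificate expires in < 7 days"
    conditionThreshold:
      # ASSUMES an HTTPS uptime check is configured for the API endpoint. The
      # API runs on Cloud Run, so an App Engine (gae_app) metric does not apply;
      # the uptime-check metric reports the time to expiry in days.
      filter: |
        resource.type = "uptime_url"
        AND metric.type = "monitoring.googleapis.com/uptime_check/time_until_ssl_cert_expires"
      comparison: COMPARISON_LT
      thresholdValue: 7  # days
      duration: 0s
      aggregations:
        - alignmentPeriod: 3600s
          perSeriesAligner: ALIGN_MIN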

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/email-security-team

Alert 4: High Error Rate

# alerts/critical/high-error-rate.yaml
displayName: "CRITICAL: High API Error Rate (>5%)"
documentation:
  content: |
    API error rate exceeds 5%.

    **Compliance Impact:** SOC 2 - Service availability degradation

    **Runbook:** https://wiki.bioqms.com/runbooks/high-error-rate

conditions:
  - displayName: "5xx errors > 5% for 5 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
        AND metric.labels.response_code_class = "5xx"
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
      denominatorFilter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/slack-incidents

Warning Alerts

Alert 5: Elevated Error Rate

# alerts/warning/elevated-error-rate.yaml
displayName: "WARNING: Elevated Error Rate (>1%)"
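conditions:
  - displayName: "5xx errors > 1% for 10 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
        AND metric.labels.response_code_class = "5xx"
      comparison: COMPARISON_GT
      thresholdValue: 0.01
      duration: 600s
      # Ratio of the 5xx rate to the total request rate. Without a denominator
      # the 0.01 threshold would be compared against the raw 5xx count.
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
      denominatorFilter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM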

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring
  - projects/bio-qms-prod/notificationChannels/email-team
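
Alerts 4 and 5 apply two thresholds (5% and 1%) to the same quantity: the 5xx request rate divided by the total request rate. A minimal sketch of that classification, useful for unit-testing the thresholds outside Cloud Monitoring; `classifyErrorRatio` is a hypothetical helper and the sustained-duration windows (5 and 10 minutes) are deliberately left out.

```typescript
// Hypothetical helper mirroring the ratio thresholds of Alerts 4 and 5.
type ErrorRateLevel = 'critical' | 'warning' | 'ok';

function classifyErrorRatio(errorsPerSecond: number, totalPerSecond: number): ErrorRateLevel {
  // With no traffic the ratio is undefined; a full outage is covered by Alert 1.
  if (totalPerSecond <= 0) {
    return 'ok';
  }
  const ratio = errorsPerSecond / totalPerSecond;
  if (ratio > 0.05) return 'critical'; // Alert 4: > 5%
  if (ratio > 0.01) return 'warning';  // Alert 5: > 1%
  return 'ok';
}

console.log(classifyErrorRatio(2, 100));
```

Both comparisons are strict, matching `COMPARISON_GT` in the policies: a ratio of exactly 1% does not raise a warning.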

Alert 6: High Latency

# alerts/warning/high-latency.yaml
displayName: "WARNING: High API Latency (p95 > 500ms)"
conditions:
  - displayName: "p95 latency > 500ms for 10 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_latencies"
      comparison: COMPARISON_GT
      thresholdValue: 500.0
      duration: 600s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_DELTA
          crossSeriesReducer: REDUCE_PERCENTILE_95

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring

Alert 7: High Disk Usage

# alerts/warning/high-disk-usage.yaml
displayName: "WARNING: Database Disk Usage > 80%"
conditions:
  - displayName: "Disk usage > 80%"
    conditionThreshold:
      filter: |
        resource.type = "cloudsql_database"
        AND metric.type = "cloudsql.googleapis.com/database/disk/utilization"
      comparison: COMPARISON_GT
      thresholdValue: 0.8
      duration: 300s

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring
  - projects/bio-qms-prod/notificationChannels/email-devops

Alert 8: Low Cache Hit Rate

# alerts/warning/low-cache-hit-rate.yaml
displayName: "WARNING: Cache Hit Rate < 60%"
conditions:
  - displayName: "Cache hit rate < 60% for 30 minutes"
    conditionThreshold:
      filter: |
        resource.type = "redis_instance"
        AND metric.type = "redis.googleapis.com/stats/cache_hit_ratio"
      comparison: COMPARISON_LT
      thresholdValue: 0.6
      duration: 1800s

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring

Notification Channels Configuration

// src/monitoring/notification-channels.service.ts
import { Injectable } from '@nestjs/common';
// The package exports the service clients directly; there is no `Monitoring` namespace export.
import { NotificationChannelServiceClient } from '@google-cloud/monitoring';

@Injectable()
export class NotificationChannelsService {
  private readonly client: NotificationChannelServiceClient;
  private readonly projectId: string;

  constructor() {
    const projectId = process.env.GCP_PROJECT_ID;
    if (!projectId) {
      throw new Error('GCP_PROJECT_ID must be set');
    }
    this.client = new NotificationChannelServiceClient();
    this.projectId = projectId;
  }

  async createPagerDutyChannel(): Promise<string> {
    const [channel] = await this.client.createNotificationChannel({
      name: this.client.projectPath(this.projectId),
      notificationChannel: {
        type: 'pagerduty',
        displayName: 'PagerDuty - Critical Incidents',
        labels: {
          service_key: process.env.PAGERDUTY_SERVICE_KEY,
        },
        enabled: true,
      },
    });
    return channel.name;
  }

  // Cloud Monitoring's `slack` channel type is configured with a channel name
  // and an OAuth `auth_token`, not an incoming-webhook URL.
  async createSlackChannel(authToken: string, channelName: string): Promise<string> {
    const [channel] = await this.client.createNotificationChannel({
      name: this.client.projectPath(this.projectId),
      notificationChannel: {
        type: 'slack',
        displayName: `Slack - ${channelName}`,
        labels: {
          auth_token: authToken,
          channel_name: channelName,
        },
        enabled: true,
      },
    });
    return channel.name;
  }

  async createEmailChannel(emailAddress: string, displayName: string): Promise<string> {
    const [channel] = await this.client.createNotificationChannel({
      name: this.client.projectPath(this.projectId),
      notificationChannel: {
        type: 'email',
        displayName: displayName,
        labels: {
          email_address: emailAddress,
        },
        enabled: true,
      },
    });
    return channel.name;
  }
}

Alert Policy Deployment Script

// scripts/deploy-alert-policies.ts
import { AlertPolicyServiceClient } from '@google-cloud/monitoring';
import * as fs from 'fs';
import * as path from 'path';
import * as yaml from 'js-yaml';

async function deployAlertPolicies(): Promise<void> {
  const projectId = process.env.GCP_PROJECT_ID;
  if (!projectId) {
    throw new Error('GCP_PROJECT_ID must be set');
  }

  const client = new AlertPolicyServiceClient();
  const alertsDir = path.join(__dirname, '../config/alerts');

  const categories = ['critical', 'warning', 'informational'];

  for (const category of categories) {
    const categoryDir = path.join(alertsDir, category);
    // Skip categories that have no policy directory yet
    if (!fs.existsSync(categoryDir)) continue;
    const files = fs.readdirSync(categoryDir).filter(f => f.endsWith('.yaml'));

    for (const file of files) {
      const filePath = path.join(categoryDir, file);
      const content = fs.readFileSync(filePath, 'utf8');
      const policy = yaml.load(content) as any;

      console.log(`Deploying ${category} alert: ${policy.displayName}`);

      try {
        // createAlertPolicy always creates a new policy, so re-running this
        // script produces duplicates unless existing policies are removed first.
        const [createdPolicy] = await client.createAlertPolicy({
          name: client.projectPath(projectId),
          alertPolicy: policy,
        });
        console.log(`✓ Created: ${createdPolicy.name}`);
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        console.error(`✗ Failed to create ${file}:`, message);
        process.exitCode = 1; // Surface the failure to CI
      }
    }
  }
}

deployAlertPolicies().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
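
Because the deployment script sends each YAML file to the API as-is, a local pre-flight check can catch incomplete policies before any API call is made. The sketch below is an assumption about which fields are worth checking, not a full schema validation; `validateAlertPolicy` is a hypothetical helper. In particular, the `combiner` check reflects the understanding that the AlertPolicy API expects a combiner such as `OR`, which should be confirmed against the API reference.

```typescript
// Hypothetical pre-flight check for parsed alert policy files.
// Returns a list of human-readable problems; an empty list means the policy passed.
function validateAlertPolicy(policy: unknown): string[] {
  if (typeof policy !== 'object' || policy === null) {
    return ['policy is not an object'];
  }
  const p = policy as Record<string, unknown>;
  const problems: string[] = [];

  const displayName = p['displayName'];
  if (typeof displayName !== 'string' || displayName.trim() === '') {
    problems.push('displayName is missing');
  }

  const conditions = p['conditions'];
  if (!Array.isArray(conditions) || conditions.length === 0) {
    problems.push('conditions must be a non-empty array');
  }

  // Assumption: the API expects an explicit combiner (for example "OR").
  if (typeof p['combiner'] !== 'string') {
    problems.push('combiner is missing');
  }

  const channels = p['notificationChannels'];
  if (!Array.isArray(channels) || channels.length === 0) {
    problems.push('notificationChannels is empty, the alert would notify nobody');
  }

  return problems;
}

console.log(validateAlertPolicy({ displayName: 'CRITICAL: API Service Down' }));
```

Calling this on the result of `yaml.load` and skipping files with a non-empty problem list keeps a single malformed file from producing a partially deployed alert set.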

E.3.3: Structured Logging with Cloud Logging

Logging Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                        │
├─────────────────────────────────────────────────────────────┤
│  NestJS Logger → Winston → Logging Interceptor              │
│         ↓                                                   │
│  JSON Structured Logs + Correlation IDs                    │
└────────────────────┬────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────────────┐
│              Google Cloud Logging                           │
├──────────────────┬──────────────────┬──────────────────────┤
│  Hot Storage     │  BigQuery Export │  Cloud Storage       │
│  (30 days)       │  (1 year)        │  (Long-term)         │
└──────────────────┴──────────────────┴──────────────────────┘

Log Structure

// src/logging/interfaces/structured-log.interface.ts
export interface StructuredLog {
  // Standard fields
  timestamp: string;           // ISO 8601 UTC
  severity: LogSeverity;       // DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY
  message: string;

  // Request context
  request_id: string;          // Unique per request (UUID v4)
  trace_id?: string;           // OpenTelemetry trace ID
  span_id?: string;            // OpenTelemetry span ID

  // User context
  user_id?: string;            // Authenticated user ID
  org_id?: string;             // Organization/tenant ID
  session_id?: string;         // Session identifier

  // Application context
  service: string;             // 'bio-qms-api'
  environment: string;         // 'production', 'staging', 'development'
  version: string;             // Application version (from package.json)

  // Action context
  action: string;              // API endpoint or operation
  resource_type?: string;      // 'document', 'capa', 'training', etc.
  resource_id?: string;        // ID of affected resource

  // Performance metrics
  duration_ms?: number;        // Operation duration

  // Error context (if severity >= ERROR)
  error?: {
    name: string;
    message: string;
    stack?: string;
    code?: string;
  };

  // Compliance metadata
  compliance?: {
    regulation: string[];      // ['FDA-21-CFR-Part-11', 'HIPAA', 'SOC-2']
    audit_event_type?: string; // 'electronic_signature', 'data_modification', etc.
    pii_logged: boolean;       // Flag if PII is in logs
  };

  // Additional context
  metadata?: Record<string, any>;
}

export enum LogSeverity {
  DEBUG = 'DEBUG',
  INFO = 'INFO',
  NOTICE = 'NOTICE',
  WARNING = 'WARNING',
  ERROR = 'ERROR',
  CRITICAL = 'CRITICAL',
  ALERT = 'ALERT',
  EMERGENCY = 'EMERGENCY',
}
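
`LogSeverity` is a string enum, so comparisons such as "severity >= ERROR" cannot rely on `>=` directly (string comparison would order `'WARNING'` after `'ERROR'`). A minimal sketch of an explicit ordering follows; `SEVERITY_ORDER`, `severityAtLeast`, and `requiresErrorContext` are hypothetical helpers that use the same severity names as the enum.

```typescript
// Hypothetical ordering helper for the severity names used by LogSeverity.
const SEVERITY_ORDER = [
  'DEBUG',
  'INFO',
  'NOTICE',
  'WARNING',
  'ERROR',
  'CRITICAL',
  'ALERT',
  'EMERGENCY',
] as const;

type SeverityName = (typeof SEVERITY_ORDER)[number];

// True when `actual` is at least as severe as `threshold`.
function severityAtLeast(actual: SeverityName, threshold: SeverityName): boolean {
  return SEVERITY_ORDER.indexOf(actual) >= SEVERITY_ORDER.indexOf(threshold);
}

// The `error` block of StructuredLog is expected from ERROR upwards.
function requiresErrorContext(severity: SeverityName): boolean {
  return severityAtLeast(severity, 'ERROR');
}

console.log(requiresErrorContext('WARNING'), requiresErrorContext('CRITICAL'));
```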

Winston Logger Configuration

// src/logging/winston.config.ts
import * as winston from 'winston';
import { LoggingWinston } from '@google-cloud/logging-winston';

const loggingWinston = new LoggingWinston({
  projectId: process.env.GCP_PROJECT_ID,
  keyFilename: process.env.GCP_KEY_FILE,
  serviceContext: {
    service: 'bio-qms-api',
    version: process.env.APP_VERSION || '1.0.0',
  },
});

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
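    // Default format is new Date().toISOString(): ISO 8601 in UTC. A custom
    // format string would render local time with an offset instead.
    winston.format.timestamp(),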
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'bio-qms-api',
    environment: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
  },
  transports: [
    // Cloud Logging transport (production)
    loggingWinston,

    // Console transport (development)
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      ),
    }),
  ],
});

NestJS Logging Interceptor

// src/logging/logging.interceptor.ts
import {
  Injectable,
  NestInterceptor,
  ExecutionContext,
  CallHandler,
  Logger,
} from '@nestjs/common';
import { Observable, throwError } from 'rxjs';
import { tap, catchError } from 'rxjs/operators';
import { v4 as uuidv4 } from 'uuid';
import { logger } from './winston.config';
import { StructuredLog, LogSeverity } from './interfaces/structured-log.interface';

@Injectable()
export class LoggingInterceptor implements NestInterceptor {
  private readonly nestLogger = new Logger(LoggingInterceptor.name);

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const response = context.switchToHttp().getResponse();

    // Generate or extract correlation IDs
    const requestId = request.headers['x-request-id'] || uuidv4();
    const traceId = request.headers['x-cloud-trace-context']?.split('/')[0];

    // Attach to request for downstream use
    request.requestId = requestId;
    request.traceId = traceId;

    // Set response header
    response.setHeader('X-Request-ID', requestId);

    const startTime = Date.now();
    // Request body, query, and params are deliberately not read here so they cannot leak into logs
    const { method, url } = request;
    const userId = request.user?.id;
    const orgId = request.user?.organizationId;
    const sessionId = request.session?.id;

    // Log incoming request
    this.logRequest(requestId, traceId, method, url, userId, orgId);

    return next.handle().pipe(
      tap((data) => {
        const duration = Date.now() - startTime;
        this.logResponse(
          requestId,
          traceId,
          method,
          url,
          response.statusCode,
          duration,
          userId,
          orgId,
        );
      }),
      catchError((error) => {
        const duration = Date.now() - startTime;
        this.logError(
          requestId,
          traceId,
          method,
          url,
          error,
          duration,
          userId,
          orgId,
        );
        return throwError(() => error);
      }),
    );
  }

  private logRequest(
    requestId: string,
    traceId: string | undefined,
    method: string,
    url: string,
    userId?: string,
    orgId?: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: LogSeverity.INFO,
      message: `Incoming ${method} ${url}`,
      request_id: requestId,
      trace_id: traceId,
      user_id: userId,
      org_id: orgId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: `${method} ${url}`,
      compliance: {
        regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
        pii_logged: false,
      },
    };

    logger.info(log);
  }

  private logResponse(
    requestId: string,
    traceId: string | undefined,
    method: string,
    url: string,
    statusCode: number,
    duration: number,
    userId?: string,
    orgId?: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: statusCode >= 400 ? LogSeverity.WARNING : LogSeverity.INFO,
      message: `${method} ${url} ${statusCode}`,
      request_id: requestId,
      trace_id: traceId,
      user_id: userId,
      org_id: orgId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: `${method} ${url}`,
      duration_ms: duration,
      metadata: {
        status_code: statusCode,
      },
      compliance: {
        regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
        pii_logged: false,
      },
    };

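    // Keep the winston level consistent with the computed severity
    if (statusCode >= 400) {
      logger.warn(log);
    } else {
      logger.info(log);
    }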
  }

  private logError(
    requestId: string,
    traceId: string | undefined,
    method: string,
    url: string,
    error: any,
    duration: number,
    userId?: string,
    orgId?: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: LogSeverity.ERROR,
      message: `${method} ${url} failed: ${error.message}`,
      request_id: requestId,
      trace_id: traceId,
      user_id: userId,
      org_id: orgId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: `${method} ${url}`,
      duration_ms: duration,
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
        code: error.code,
      },
      compliance: {
        regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
        pii_logged: false,
      },
    };

    logger.error(log);
  }
}
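
The interceptor extracts the trace ID by splitting the `X-Cloud-Trace-Context` header on `/`. The header has the form `TRACE_ID/SPAN_ID;o=OPTIONS`, so a slightly stricter parser can also recover the span ID and the sampling flag and reject malformed values. The sketch below is one way to do that; `parseCloudTraceContext` and `CloudTraceContext` are hypothetical names.

```typescript
// Hypothetical parser for the X-Cloud-Trace-Context header: TRACE_ID/SPAN_ID;o=OPTIONS
interface CloudTraceContext {
  traceId: string;   // 32 hex characters
  spanId?: string;   // decimal span ID, when present
  sampled: boolean;  // true when the header carries o=1
}

function parseCloudTraceContext(
  header: string | string[] | undefined,
): CloudTraceContext | undefined {
  // Node may deliver repeated headers as an array; use the first value
  const value = Array.isArray(header) ? header[0] : header;
  if (!value) {
    return undefined;
  }
  const match = /^([0-9a-f]{32})(?:\/(\d+))?(?:;o=(\d))?$/i.exec(value.trim());
  if (!match) {
    return undefined; // malformed header: treat as no trace context
  }
  return {
    traceId: match[1].toLowerCase(),
    spanId: match[2],
    sampled: match[3] === '1',
  };
}

console.log(parseCloudTraceContext('105445aa7843bc8bf206b12000100000/1;o=1'));
```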

Audit Logging Service

// src/logging/audit-log.service.ts
import { Injectable } from '@nestjs/common';
import { logger } from './winston.config';
import { StructuredLog, LogSeverity } from './interfaces/structured-log.interface';

export enum AuditEventType {
  ELECTRONIC_SIGNATURE = 'electronic_signature',
  DATA_MODIFICATION = 'data_modification',
  DATA_DELETION = 'data_deletion',
  USER_LOGIN = 'user_login',
  USER_LOGOUT = 'user_logout',
  FAILED_LOGIN = 'failed_login',
  PASSWORD_CHANGE = 'password_change',
  PERMISSION_CHANGE = 'permission_change',
  DOCUMENT_APPROVAL = 'document_approval',
  CAPA_STATUS_CHANGE = 'capa_status_change',
  TRAINING_COMPLETION = 'training_completion',
  SYSTEM_CONFIGURATION_CHANGE = 'system_configuration_change',
}

/**
 * Audit logging service for FDA 21 CFR Part 11 compliance
 * @compliance FDA 21 CFR Part 11 §11.10(e)
 */
@Injectable()
export class AuditLogService {
  /**
   * Log electronic signature event
   * @compliance FDA 21 CFR Part 11 §11.50, §11.70
   */
  logElectronicSignature(
    userId: string,
    documentId: string,
    signatureMeaning: string,
    requestId: string,
    traceId?: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: LogSeverity.NOTICE,
      message: `Electronic signature applied: ${signatureMeaning}`,
      request_id: requestId,
      trace_id: traceId,
      user_id: userId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: 'electronic_signature',
      resource_type: 'document',
      resource_id: documentId,
      compliance: {
        regulation: ['FDA-21-CFR-Part-11'],
        audit_event_type: AuditEventType.ELECTRONIC_SIGNATURE,
        pii_logged: false,
      },
      metadata: {
        signature_meaning: signatureMeaning,
      },
    };

    logger.info(log);
  }

  /**
   * Log data modification event
   * @compliance FDA 21 CFR Part 11 §11.10(e)
   */
  logDataModification(
    userId: string,
    resourceType: string,
    resourceId: string,
    changes: Record<string, any>,
    requestId: string,
    traceId?: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: LogSeverity.NOTICE,
      message: `Data modified: ${resourceType}/${resourceId}`,
      request_id: requestId,
      trace_id: traceId,
      user_id: userId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: 'data_modification',
      resource_type: resourceType,
      resource_id: resourceId,
      compliance: {
        regulation: ['FDA-21-CFR-Part-11'],
        audit_event_type: AuditEventType.DATA_MODIFICATION,
        pii_logged: false,
      },
      metadata: {
        changes: this.sanitizeChanges(changes),
      },
    };

    logger.info(log);
  }

  /**
   * Log failed login attempt
   * @compliance HIPAA Security Rule §164.312(b)
   */
  logFailedLogin(
    username: string,
    ipAddress: string,
    reason: string,
    requestId: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: LogSeverity.WARNING,
      message: `Failed login attempt: ${username}`,
      request_id: requestId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: 'failed_login',
      compliance: {
        regulation: ['HIPAA', 'SOC-2'],
        audit_event_type: AuditEventType.FAILED_LOGIN,
        pii_logged: true,  // Username may contain PII
      },
      metadata: {
        username,
        ip_address: ipAddress,
        reason,
      },
    };

    logger.warn(log);
  }

  /**
   * Log successful user login
   * @compliance HIPAA Security Rule §164.312(b)
   */
  logUserLogin(
    userId: string,
    username: string,
    ipAddress: string,
    requestId: string,
  ): void {
    const log: StructuredLog = {
      timestamp: new Date().toISOString(),
      severity: LogSeverity.INFO,
      message: `User login successful: ${username}`,
      request_id: requestId,
      user_id: userId,
      service: 'bio-qms-api',
      environment: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
      action: 'user_login',
      compliance: {
        regulation: ['HIPAA', 'SOC-2'],
        audit_event_type: AuditEventType.USER_LOGIN,
        pii_logged: true,
      },
      metadata: {
        username,
        ip_address: ipAddress,
      },
    };

    logger.info(log);
  }

  /**
   * Sanitize changes to remove sensitive data from logs
   */
  private sanitizeChanges(changes: Record<string, any>): Record<string, any> {
    const sanitized = { ...changes };
    const sensitiveFields = ['password', 'ssn', 'credit_card', 'api_key', 'token'];

    for (const field of sensitiveFields) {
      if (field in sanitized) {
        sanitized[field] = '[REDACTED]';
      }
    }

    return sanitized;
  }
}
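
`sanitizeChanges` only inspects top-level keys and matches them case-sensitively, so a value such as `{ profile: { Password: '...' } }` would pass through unredacted. A recursive variant can be sketched as follows; `deepSanitize` is a hypothetical helper that reuses the same sensitive field names and is not the service's actual implementation.

```typescript
// Hypothetical recursive variant of sanitizeChanges (same field list, case-insensitive).
const SENSITIVE_KEYS = new Set(['password', 'ssn', 'credit_card', 'api_key', 'token']);

function deepSanitize(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => deepSanitize(item));
  }
  if (value instanceof Date) {
    return value.toISOString(); // keep dates readable instead of flattening them to {}
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, nested] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : deepSanitize(nested);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

console.log(
  JSON.stringify(deepSanitize({ title: 'SOP-12', profile: { Password: 'hunter2' } })),
);
```

The function returns a new structure and leaves its input untouched, which matters when the same `changes` object is also persisted to the database.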

Log Retention & Export Configuration

# config/log-retention.yaml
# Log retention and export configuration for compliance

sinks:
  # BigQuery export for long-term analysis (1 year)
  - name: bigquery-export
    destination: bigquery.googleapis.com/projects/bio-qms-prod/datasets/audit_logs
    filter: |
      severity >= NOTICE
      OR jsonPayload.compliance.audit_event_type:*
    bigqueryOptions:
      usePartitionedTables: true
      usesTimestampColumnPartitioning: true

  # Cloud Storage archive (7 years for FDA compliance)
  - name: gcs-archive
    destination: storage.googleapis.com/bio-qms-audit-logs-archive
    filter: |
      jsonPayload.compliance.regulation =~ ".*FDA.*"
      OR jsonPayload.compliance.audit_event_type:*
    includeChildren: true

  # Security events to dedicated dataset
  - name: security-events
    destination: bigquery.googleapis.com/projects/bio-qms-prod/datasets/security_logs
    filter: |
      jsonPayload.compliance.audit_event_type = "failed_login"
      OR jsonPayload.compliance.audit_event_type = "permission_change"
      OR severity >= ERROR

exclusions:
  # Exclude health check logs from long-term storage
  - name: exclude-health-checks
    filter: |
      jsonPayload.action = "GET /health"
      OR jsonPayload.action = "GET /readiness"

retention:
  # Hot storage: 30 days in Cloud Logging
  default: 30d

  # Compliance buckets: extended retention
  audit_logs: 2555d  # 7 years (FDA requirement)
  security_logs: 2555d
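
The `2555d` value is 7 × 365 days, which ignores leap days and therefore falls one or two days short of seven calendar years. A loader for this file could validate the retention strings before applying them; the sketch below assumes the `Nd` format used above and a 1 to 3650 day range for Cloud Logging bucket retention, which should be checked against current limits. `retentionDays` is a hypothetical helper.

```typescript
// Hypothetical validator for retention values such as "30d" or "2555d".
const MIN_RETENTION_DAYS = 1;
const MAX_RETENTION_DAYS = 3650; // assumed Cloud Logging bucket limit; verify before relying on it

function retentionDays(value: string): number {
  const match = /^(\d+)d$/.exec(value.trim());
  if (!match) {
    throw new Error(`Unsupported retention format: "${value}" (expected e.g. "30d")`);
  }
  const days = Number(match[1]);
  if (days < MIN_RETENTION_DAYS || days > MAX_RETENTION_DAYS) {
    throw new RangeError(`Retention of ${days} days is outside the supported range`);
  }
  return days;
}

// 7 years expressed in 365-day years, as used in the configuration above
const SEVEN_YEARS_IN_DAYS = 7 * 365;

console.log(retentionDays('2555d') === SEVEN_YEARS_IN_DAYS);
```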

E.3.4: Distributed Tracing

OpenTelemetry Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Application Code                          │
│  (NestJS Controllers, Services, Repositories)               │
└────────────────────┬────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────────────┐
│            OpenTelemetry Instrumentation                    │
├──────────────────┬──────────────────┬──────────────────────┤
│  HTTP Tracing    │  Database        │  External APIs       │
│  (@opentelemetry │  (Prisma)        │  (fetch, axios)      │
│  /instrumentation│                  │                      │
│  -http)          │                  │                      │
└──────────────────┴──────────────────┴──────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────────────┐
│              OpenTelemetry Collector                        │
│  - Sampling (100% errors, 10% normal)                      │
│  - Batching                                                │
│  - Enrichment (resource attributes)                        │
└────────────────────┬────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────────────┐
│               Google Cloud Trace                            │
│  - Trace storage & visualization                           │
│  - Latency analysis                                        │
│  - Service dependency mapping                              │
└─────────────────────────────────────────────────────────────┘

OpenTelemetry Configuration

// src/tracing/tracing.config.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator } from '@opentelemetry/core';

/**
 * Custom head sampler: always keep spans flagged at creation time, 10% otherwise.
 *
 * Head sampling runs when a span starts, before the response status is known,
 * so `http.status_code` is only visible here if the caller passes it as a start
 * attribute. Keeping 100% of error traces in general requires tail-based
 * sampling in the OpenTelemetry Collector.
 */
class AdaptiveSampler extends ParentBasedSampler {
  constructor() {
    super({
      root: new TraceIdRatioBasedSampler(0.1), // 10% base sampling
    });
  }

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[],
  ): SamplingResult {
    // Always sample spans that start with an error status or an audit event type
    const statusCode = Number(attributes['http.status_code']);
    if (statusCode >= 400 || attributes['audit.event_type']) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Use parent-based sampling for everything else
    return super.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  }
}

export const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'bio-qms-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
    'service.namespace': 'bio-qms',
    'cloud.provider': 'gcp',
    'cloud.platform': 'gcp_cloud_run',
    'cloud.region': process.env.GCP_REGION || 'us-central1',
  }),
  traceExporter: new TraceExporter({
    projectId: process.env.GCP_PROJECT_ID,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
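      '@opentelemetry/instrumentation-http': {
        enabled: true,
        // Newer releases of the HTTP instrumentation dropped `ignoreIncomingPaths`
        // in favour of a request hook.
        ignoreIncomingRequestHook: (req) =>
          ['/health', '/readiness'].includes(req.url ?? ''),
      },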
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true,
        enhancedDatabaseReporting: true,
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
    }),
  ],
  sampler: new AdaptiveSampler(),
  textMapPropagator: new CompositePropagator({
    propagators: [
      new W3CTraceContextPropagator(),
      new W3CBaggagePropagator(),
    ],
  }),
});

// Start tracing
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.error('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
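
Ratio-based sampling is deterministic per trace: the decision is derived from the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below illustrates the idea with a simplified rule (the first 32 bits of the trace ID compared against the ratio). It is not the algorithm used by `TraceIdRatioBasedSampler`, and `sampledByRatio` is a hypothetical name.

```typescript
// Simplified illustration of deterministic, trace-ID-based ratio sampling.
// NOT the OpenTelemetry implementation; it only demonstrates the principle.
function sampledByRatio(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/i.test(traceId)) {
    return false; // invalid trace IDs are never sampled
  }
  const clamped = Math.min(1, Math.max(0, ratio));
  // Interpret the first 8 hex characters as an unsigned 32-bit bucket
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < clamped * 2 ** 32;
}

const exampleTraceId = '19999999000000000000000000000000';
console.log(sampledByRatio(exampleTraceId, 0.1));
```

Because the outcome depends only on the trace ID and the ratio, raising the ratio from 10% to 20% keeps every previously sampled trace and adds new ones, which makes rate changes easy to reason about.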

Custom Span Creation

// src/tracing/tracing.service.ts
import { Injectable } from '@nestjs/common';
import { trace, context, SpanStatusCode, Span } from '@opentelemetry/api';

@Injectable()
export class TracingService {
  private readonly tracer = trace.getTracer('bio-qms-api');

  /**
   * Create a custom span for business operations
   */
  async withSpan<T>(
    name: string,
    operation: (span: Span) => Promise<T>,
    attributes?: Record<string, any>,
  ): Promise<T> {
    return this.tracer.startActiveSpan(name, async (span) => {
      try {
        if (attributes) {
          span.setAttributes(attributes);
        }

        const result = await operation(span);

        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error.message,
        });
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    });
  }

  /**
   * Add event to current span
   */
  addEvent(name: string, attributes?: Record<string, any>): void {
    const currentSpan = trace.getActiveSpan();
    if (currentSpan) {
      currentSpan.addEvent(name, attributes);
    }
  }

  /**
   * Set attribute on current span
   */
  setAttribute(key: string, value: any): void {
    const currentSpan = trace.getActiveSpan();
    if (currentSpan) {
      currentSpan.setAttribute(key, value);
    }
  }

  /**
   * Get current trace ID (for log correlation)
   */
  getCurrentTraceId(): string | undefined {
    const currentSpan = trace.getActiveSpan();
    return currentSpan?.spanContext().traceId;
  }

  /**
   * Get current span ID (for log correlation)
   */
  getCurrentSpanId(): string | undefined {
    const currentSpan = trace.getActiveSpan();
    return currentSpan?.spanContext().spanId;
  }
}
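
The trace and span IDs returned by `getCurrentTraceId()` and `getCurrentSpanId()` only link logs to traces in the Cloud Console when they are written to the special structured-log fields that Cloud Logging recognises. A minimal sketch of building those fields follows; `traceCorrelationFields` is a hypothetical helper, and how the result is merged into the log entry depends on the transport in use.

```typescript
// Hypothetical helper that builds Cloud Logging trace-correlation fields
// for a structured JSON log entry.
function traceCorrelationFields(
  projectId: string,
  traceId?: string,
  spanId?: string,
  sampled = false,
): Record<string, string | boolean> {
  if (!traceId) {
    return {}; // nothing to correlate without a trace ID
  }
  const fields: Record<string, string | boolean> = {
    // Cloud Logging expects the fully qualified trace resource name
    'logging.googleapis.com/trace': `projects/${projectId}/traces/${traceId}`,
    'logging.googleapis.com/trace_sampled': sampled,
  };
  if (spanId) {
    fields['logging.googleapis.com/spanId'] = spanId;
  }
  return fields;
}

console.log(
  traceCorrelationFields('bio-qms-prod', '105445aa7843bc8bf206b12000100000', '00f067aa0ba902b7', true),
);
```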

NestJS Tracing Interceptor

// src/tracing/tracing.interceptor.ts
import {
  Injectable,
  NestInterceptor,
  ExecutionContext,
  CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap, catchError, finalize } from 'rxjs/operators';
import { TracingService } from './tracing.service';
import { trace, SpanStatusCode } from '@opentelemetry/api';

@Injectable()
export class TracingInterceptor implements NestInterceptor {
  constructor(private readonly tracingService: TracingService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const http = context.switchToHttp();
    const request = http.getRequest();
    const response = http.getResponse();
    const { method, url } = request;
    const controllerName = context.getClass().name;
    const handlerName = context.getHandler().name;

    const tracer = trace.getTracer('bio-qms-api');
    const spanName = `${controllerName}.${handlerName}`;

    return tracer.startActiveSpan(spanName, (span) => {
      // Set span attributes
      span.setAttributes({
        'http.method': method,
        'http.url': url,
        'http.route': request.route?.path,
        'controller.name': controllerName,
        'handler.name': handlerName,
        'user.id': request.user?.id,
        'org.id': request.user?.organizationId,
      });

      return next.handle().pipe(
        tap(() => {
          span.setStatus({ code: SpanStatusCode.OK });
          // Record the actual status (201, 204, ...) rather than assuming 200
          span.setAttribute('http.status_code', response.statusCode);
        }),
        catchError((error) => {
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error.message,
          });
          span.recordException(error);
          span.setAttribute('http.status_code', error.status || 500);
          throw error;
        }),
        // finalize runs on completion, error, and unsubscribe, so the span
        // always ends; a trailing tap() would be skipped on the error path.
        finalize(() => span.end()),
      );
    });
  }
}
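
The interceptor marks every thrown exception as a span error, including client errors such as 404. The OpenTelemetry HTTP semantic conventions suggest a narrower rule for server spans: 5xx responses are errors, while 4xx responses leave the status unset because the fault lies with the caller. The sketch below illustrates that mapping under stated assumptions: the `StatusCode` enum is a local mirror of `SpanStatusCode` so the snippet runs without the SDK, and `spanStatusForServerResponse` is a hypothetical helper. Marking successful responses `OK` follows the interceptor above, which application code is allowed to do.

```typescript
// Local mirror of SpanStatusCode from @opentelemetry/api
// (UNSET = 0, OK = 1, ERROR = 2), declared here only so the
// sketch runs without the SDK installed.
enum StatusCode {
  UNSET = 0,
  OK = 1,
  ERROR = 2,
}

// Hypothetical helper: maps an HTTP response status to a span status
// for a server-side span.
function spanStatusForServerResponse(httpStatus: number): StatusCode {
  if (httpStatus >= 500) {
    return StatusCode.ERROR; // server fault
  }
  if (httpStatus >= 400) {
    return StatusCode.UNSET; // client fault: not an error of this service
  }
  return StatusCode.OK; // 1xx-3xx: request handled successfully
}

for (const status of [200, 201, 404, 503]) {
  console.log(status, StatusCode[spanStatusForServerResponse(status)]);
}
```

Applying a rule like this keeps validation failures and missing records from inflating the error-rate query results for an endpoint.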

Prisma Query Tracing

// src/tracing/prisma-tracing.middleware.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { TracingService } from './tracing.service';
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

@Injectable()
export class PrismaTracingService implements OnModuleInit {
  constructor(
    private readonly prisma: PrismaClient,
    private readonly tracingService: TracingService,
  ) {}

  async onModuleInit() {
    // Middleware for query tracing.
    // Note: $use is deprecated in recent Prisma releases in favor of
    // client extensions ($extends); check the installed Prisma version.
    this.prisma.$use(async (params, next) => {
      const tracer = trace.getTracer('bio-qms-api');

      return tracer.startActiveSpan(
        `prisma.${params.model}.${params.action}`,
        {
          kind: SpanKind.CLIENT,
          attributes: {
            'db.system': 'postgresql',
            'db.name': process.env.DATABASE_NAME,
            'db.operation': params.action,
            'db.model': params.model,
          },
        },
        async (span) => {
          const startTime = Date.now();

          try {
            const result = await next(params);

            const duration = Date.now() - startTime;
            span.setAttribute('db.duration_ms', duration);
            // Use the enum: the numeric value 0 is UNSET, not OK.
            span.setStatus({ code: SpanStatusCode.OK });

            return result;
          } catch (error) {
            span.recordException(error);
            span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
            throw error;
          } finally {
            span.end();
          }
        },
      );
    });
  }
}

External API Call Tracing

// src/tracing/http-client.service.ts
import { Injectable } from '@nestjs/common';
import { HttpService } from '@nestjs/axios'; // HttpService lives here since NestJS 8
import { firstValueFrom } from 'rxjs';
import { TracingService } from './tracing.service';
import {
  trace,
  SpanKind,
  SpanStatusCode,
  propagation,
  context,
} from '@opentelemetry/api';
import { AxiosRequestConfig } from 'axios';

@Injectable()
export class TracedHttpService {
  constructor(
    private readonly httpService: HttpService,
    private readonly tracingService: TracingService,
  ) {}

  /**
   * Make HTTP request with automatic tracing
   */
  async request<T>(config: AxiosRequestConfig): Promise<T> {
    const tracer = trace.getTracer('bio-qms-api');
    const url = `${config.baseURL || ''}${config.url}`;

    return tracer.startActiveSpan(
      `HTTP ${config.method?.toUpperCase()} ${url}`,
      {
        kind: SpanKind.CLIENT,
        attributes: {
          'http.method': config.method?.toUpperCase(),
          'http.url': url,
          'http.target': config.url,
        },
      },
      async (span) => {
        try {
          // Inject trace context into headers
          const carrier: Record<string, string> = {};
          propagation.inject(context.active(), carrier);

          config.headers = {
            ...config.headers,
            ...carrier,
          };

          // firstValueFrom replaces the deprecated Observable.toPromise()
          const response = await firstValueFrom(
            this.httpService.request<T>(config),
          );

          span.setAttribute('http.status_code', response.status);
          // Use the enum: the numeric value 0 is UNSET, not OK.
          span.setStatus({ code: SpanStatusCode.OK });

          return response.data;
        } catch (error) {
          span.setAttribute('http.status_code', error.response?.status || 0);
          span.recordException(error);
          span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
          throw error;
        } finally {
          span.end();
        }
      },
    );
  }
}
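
When the W3C Trace Context propagator is registered (assumed here, since the propagator configuration is not shown in this section), `propagation.inject` writes a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below formats and parses that header so the propagated values can be inspected in tests or downstream logs. The function names and sample IDs are hypothetical; the real propagator handles this internally.

```typescript
// Fields of a W3C Trace Context `traceparent` header.
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: serializes IDs into a version-00 traceparent value.
function formatTraceParent(traceId: string, parentId: string, sampled: boolean): string {
  return `00-${traceId}-${parentId}-${sampled ? '01' : '00'}`;
}

// Hypothetical helper: returns undefined for malformed or invalid headers.
function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) {
    return undefined;
  }
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return undefined;
  }
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}

const header = formatTraceParent(
  '4bf92f3577b34da6a3ce929d0e0e4736',
  '00f067aa0ba902b7',
  true,
);
console.log(header, parseTraceParent(header));
```

A check like this is useful when verifying that an external service received the same trace ID that appears in the platform's own logs.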

Application Bootstrap with Tracing

// src/main.ts
import './tracing/tracing.config'; // MUST be first import
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { LoggingInterceptor } from './logging/logging.interceptor';
import { TracingInterceptor } from './tracing/tracing.interceptor';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Apply global interceptors
  app.useGlobalInterceptors(
    app.get(LoggingInterceptor),
    app.get(TracingInterceptor),
  );

  await app.listen(process.env.PORT || 8080);
  console.log(`Application is running on: ${await app.getUrl()}`);
}

bootstrap();

Trace Analysis Queries

-- BigQuery queries for trace analysis (exported from Cloud Trace)

-- Query 1: p95 latency by endpoint
SELECT
  span_name,
  APPROX_QUANTILES(duration_ms, 100)[OFFSET(95)] AS p95_latency_ms,
  COUNT(*) AS request_count
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
  DATE(start_time) = CURRENT_DATE()
  AND span_kind = 'SERVER'
GROUP BY span_name
ORDER BY p95_latency_ms DESC
LIMIT 20;

-- Query 2: Error rate by endpoint
SELECT
  span_name,
  COUNTIF(status_code = 2) AS error_count,
  COUNT(*) AS total_count,
  ROUND(COUNTIF(status_code = 2) / COUNT(*) * 100, 2) AS error_rate_pct
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
  DATE(start_time) = CURRENT_DATE()
  AND span_kind = 'SERVER'
GROUP BY span_name
HAVING error_count > 0
ORDER BY error_rate_pct DESC;

-- Query 3: Slowest database queries
SELECT
  span_name,
  AVG(duration_ms) AS avg_duration_ms,
  MAX(duration_ms) AS max_duration_ms,
  COUNT(*) AS execution_count
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
  DATE(start_time) = CURRENT_DATE()
  AND span_name LIKE 'prisma.%'
GROUP BY span_name
ORDER BY avg_duration_ms DESC
LIMIT 20;

-- Query 4: Trace dependency graph (service calls)
SELECT
  parent_span_name,
  span_name,
  COUNT(*) AS call_count,
  AVG(duration_ms) AS avg_duration_ms
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
  DATE(start_time) = CURRENT_DATE()
  AND parent_span_id IS NOT NULL
GROUP BY parent_span_name, span_name
ORDER BY call_count DESC;
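
`APPROX_QUANTILES(duration_ms, 100)[OFFSET(95)]` returns an approximate 95th percentile. To spot-check a result against a small sample of exported durations, an exact nearest-rank percentile can be computed locally. This is a sketch for validation only, not the algorithm BigQuery uses; the function name and the sample durations are made up for illustration.

```typescript
// Exact nearest-rank percentile: the smallest value such that at least
// p percent of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('percentile of an empty sample is undefined');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in the range (0, 100]');
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Illustrative span durations in milliseconds.
const durationsMs = [12, 15, 18, 22, 25, 31, 40, 55, 120, 480];
console.log('p50:', percentile(durationsMs, 50));
console.log('p95:', percentile(durationsMs, 95));
```

On small samples the exact and approximate values can differ noticeably, so comparisons are more meaningful on endpoints with a high request count.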

Compliance Mapping

FDA 21 CFR Part 11

Requirement	Implementation	Evidence
§11.10(e) Audit trails	Structured logging with Cloud Logging, 7-year retention	`AuditLogService`, BigQuery exports
§11.50 Electronic signatures	`logElectronicSignature()` captures all signature events	Audit logs with `electronic_signature` event type
§11.70 Signature linking	Trace ID correlates signature to document modification	`trace_id` field in structured logs

HIPAA Security Rule

Requirement	Implementation	Evidence
§164.312(b) Audit controls	Cloud Logging with tamper-proof timestamps	Structured logs exported to BigQuery
§164.312(c)(1) Integrity controls	Hash verification in audit logs	`metadata.data_hash` in modification logs
§164.308(a)(1)(ii)(D) Information system activity review	Cloud Monitoring dashboards + alerts	Dashboard JSON configs, alert policies
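
The integrity control above relies on a hash stored in `metadata.data_hash`. The sketch below shows one way such a hash could be computed and verified, assuming SHA-256 over a key-sorted JSON serialization. The hash algorithm, the canonicalization rule, and the function names are assumptions for illustration; the platform's audit logging service may use a different scheme.

```typescript
import { createHash } from 'node:crypto';

// Serialize with sorted object keys so that logically equal records
// produce the same string regardless of property order.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([key, inner]) => `${JSON.stringify(key)}:${canonicalize(inner)}`);
    return `{${entries.join(',')}}`;
  }
  // JSON.stringify returns undefined for undefined input; treat it as null.
  return JSON.stringify(value) ?? 'null';
}

// Hypothetical helper: SHA-256 hex digest of the canonical form.
function computeDataHash(record: unknown): string {
  return createHash('sha256').update(canonicalize(record), 'utf8').digest('hex');
}

// Hypothetical helper: true when the record still matches the stored hash.
function verifyDataHash(record: unknown, expectedHash: string): boolean {
  return computeDataHash(record) === expectedHash;
}

const original = { documentId: 'doc-123', status: 'approved', revision: 4 };
const storedHash = computeDataHash(original);
console.log(verifyDataHash({ revision: 4, status: 'approved', documentId: 'doc-123' }, storedHash));
console.log(verifyDataHash({ ...original, status: 'draft' }, storedHash));
```

Sorting keys before hashing matters because two serializations of the same record can otherwise differ only in property order and fail verification.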

SOC 2

Trust Service Criteria	Implementation	Evidence
CC7.2 System monitoring	Cloud Monitoring dashboards, 24/7 alerting	Dashboard configs, PagerDuty integration
CC7.3 Incident detection	Critical alerts with 5-minute response SLA	Alert policies with escalation
CC7.4 Incident response	Runbooks linked to alerts, incident tracking	Alert documentation fields
CC8.1 Backup and recovery monitoring	Database replication lag alerts	Replication lag dashboard widget

Deployment Instructions

1. Deploy Dashboards

#!/bin/bash
# scripts/deploy-dashboards.sh

PROJECT_ID="bio-qms-prod"

for dashboard in config/dashboards/*.json; do
  echo "Deploying $(basename "$dashboard")..."
  gcloud monitoring dashboards create --config-from-file="$dashboard" \
    --project="$PROJECT_ID"
done

2. Deploy Alert Policies

#!/bin/bash
# scripts/deploy-alerts.sh

PROJECT_ID="bio-qms-prod"

# Create notification channels first
gcloud alpha monitoring channels create \
  --display-name="PagerDuty - Critical" \
  --type=pagerduty \
  --channel-labels="service_key=$PAGERDUTY_KEY" \
  --project="$PROJECT_ID"

# Deploy alert policies
for policy in config/alerts/*/*.yaml; do
  echo "Deploying $(basename "$policy")..."
  gcloud alpha monitoring policies create --policy-from-file="$policy" \
    --project="$PROJECT_ID"
done

3. Configure Log Sinks

#!/bin/bash
# scripts/configure-log-sinks.sh

PROJECT_ID="bio-qms-prod"

# BigQuery sink
gcloud logging sinks create bigquery-audit-logs \
  bigquery.googleapis.com/projects/$PROJECT_ID/datasets/audit_logs \
  --log-filter='severity >= NOTICE OR jsonPayload.compliance.audit_event_type:*' \
  --project="$PROJECT_ID"

# Cloud Storage archive
gcloud logging sinks create gcs-audit-archive \
  storage.googleapis.com/bio-qms-audit-logs-archive \
  --log-filter='jsonPayload.compliance.regulation =~ ".*FDA.*"' \
  --project="$PROJECT_ID"

4. Enable OpenTelemetry

// package.json additions
{
  "dependencies": {
    "@google-cloud/opentelemetry-cloud-trace-exporter": "^2.3.0",
    "@opentelemetry/api": "^1.8.0",
    "@opentelemetry/sdk-node": "^0.49.1",
    "@opentelemetry/auto-instrumentations-node": "^0.42.0",
    "@opentelemetry/instrumentation-http": "^0.49.1",
    "@opentelemetry/instrumentation-express": "^0.37.0",
    "@opentelemetry/instrumentation-pg": "^0.40.0"
  }
}

# Install dependencies
npm install

# Update main.ts to import tracing config first
# (see src/main.ts example above)

# Deploy with tracing enabled
PROJECT_ID="bio-qms-prod"
gcloud run deploy bio-qms-api \
  --image=us-central1-docker.pkg.dev/$PROJECT_ID/bio-qms/api:latest \
  --region=us-central1 \
  --project="$PROJECT_ID" \
  --set-env-vars="GOOGLE_CLOUD_PROJECT=$PROJECT_ID,NODE_ENV=production"

Testing & Validation

1. Metrics Validation

# Test custom metric creation
curl -X POST https://api.bioqms.com/documents/123/sign \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"meaning": "Approved by QA Manager"}'

# Verify metric in Cloud Monitoring
gcloud monitoring time-series list \
  --filter='metric.type="logging.googleapis.com/user/qms_document_signed"' \
  --format=json

2. Alert Testing

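# Trigger test alert (critical error rate)
for i in {1..100}; do
  curl -s -o /dev/null https://api.bioqms.com/test/error &
done
wait  # let all background requests finish

# Confirm the policy exists and is enabled. This command lists policies,
# not incidents: check Monitoring > Alerting in the Cloud Console (or
# PagerDuty) to confirm that the alert actually fired.
gcloud alpha monitoring policies list \
  --filter='displayName:"CRITICAL: High API Error Rate"' \
  --format=json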

3. Log Verification

# Query structured logs
gcloud logging read "jsonPayload.request_id=\"$REQUEST_ID\"" \
  --format=json \
  --limit=10

# Verify audit log
gcloud logging read \
  "jsonPayload.compliance.audit_event_type=\"electronic_signature\"" \
  --limit=1 \
  --format=json

4. Trace Verification

# Generate traced request
curl -X POST https://api.bioqms.com/documents \
  -H "Authorization: Bearer $TOKEN" \
  -v 2>&1 | grep -i trace

# View trace in Cloud Console
# https://console.cloud.google.com/traces/list

Runbooks

Runbook 1: API Down

Symptoms: No successful API requests for 5+ minutes, PagerDuty alert fired

Diagnosis:

# 1. Check service status
gcloud run services describe bio-qms-api --region=us-central1

# 2. Check recent logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit=50 --format=json

# 3. Check database connectivity
gcloud sql instances describe bio-qms-db

# 4. Check recent deployments
gcloud run revisions list --service=bio-qms-api --limit=5

Resolution:

# If bad deployment: rollback
PREVIOUS_REVISION=$(gcloud run revisions list --service=bio-qms-api --format="value(name)" --limit=2 | tail -n1)
gcloud run services update-traffic bio-qms-api --to-revisions=$PREVIOUS_REVISION=100

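# If database issue: roll out a new revision so instances start with fresh
# connection pools. The API runs on Cloud Run, so there is no Kubernetes
# deployment to restart. RESTARTED_AT is an arbitrary marker variable used
# only to force a new revision.
gcloud run services update bio-qms-api --region=us-central1 \
  --update-env-vars="RESTARTED_AT=$(date +%s)"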

# If Cloud Run issue: scale to zero and back
gcloud run services update bio-qms-api --min-instances=0
sleep 10
gcloud run services update bio-qms-api --min-instances=2

Runbook 2: High Database Latency

Symptoms: p95 query latency > 500ms

Diagnosis:

# Check slow queries
gcloud logging read \
  "jsonPayload.duration_ms>500 AND jsonPayload.action=\"prisma.query\"" \
  --limit=20 --format=json

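# List recent instance operations (restarts, maintenance, backups).
# This does not report CPU; CPU utilization is exposed as the
# cloudsql.googleapis.com/database/cpu/utilization metric.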
gcloud sql operations list --instance=bio-qms-db --limit=10

# Check connection pool
gcloud monitoring time-series list \
  --filter='metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'

Resolution:

# Add missing index (example)
psql -h $DB_HOST -U $DB_USER -d bio_qms -c \
  "CREATE INDEX CONCURRENTLY idx_documents_org_created ON documents(organization_id, created_at);"

# Scale database instance
gcloud sql instances patch bio-qms-db --tier=db-custom-4-16384

# Analyze and vacuum
psql -h $DB_HOST -U $DB_USER -d bio_qms -c "VACUUM ANALYZE;"

Maintenance

Monthly Tasks

Review SLO compliance and error budget consumption
Analyze top 20 slowest endpoints and optimize
Review alert noise and tune thresholds
Archive old traces (Cloud Trace auto-retention: 30 days)
Validate BigQuery audit log exports
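
The monthly error budget review reduces to simple arithmetic: the budget is the share of requests the SLO allows to fail, and consumption is the ratio of actual failures to that allowance. A minimal sketch follows; the 99.9% target and the request counts are placeholder values, not the platform's defined SLOs, and the function name is hypothetical.

```typescript
interface ErrorBudgetReport {
  allowedFailures: number; // failures the SLO tolerates in the window
  consumedFraction: number; // 1.0 means the budget is fully spent
  exhausted: boolean;
}

// Hypothetical helper: computes error budget consumption for one window.
function errorBudget(
  sloTarget: number, // for example 0.999 for a 99.9% availability SLO
  totalRequests: number,
  failedRequests: number,
): ErrorBudgetReport {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be between 0 and 1 (exclusive)');
  }
  const allowedFailures = (1 - sloTarget) * totalRequests;
  // With no traffic there is no budget; any failure counts as exhaustion.
  const consumedFraction =
    allowedFailures === 0
      ? failedRequests > 0 ? Infinity : 0
      : failedRequests / allowedFailures;
  return {
    allowedFailures,
    consumedFraction,
    exhausted: consumedFraction >= 1,
  };
}

// Placeholder figures for a 30-day window.
const report = errorBudget(0.999, 1_000_000, 250);
console.log(
  `Budget consumed: ${(report.consumedFraction * 100).toFixed(1)}%`,
  report.exhausted ? '(exhausted)' : '(within budget)',
);
```

Tracking the consumed fraction month over month makes it easier to decide when to pause feature releases in favor of reliability work.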

Quarterly Tasks

Review and update dashboards based on new features
Conduct alert fire drill (test PagerDuty escalation)
Audit log retention compliance check (7-year FDA requirement)
Review and optimize trace sampling rates
Update runbooks based on recent incidents

Annual Tasks

Full observability stack audit
Review compliance mapping (FDA, HIPAA, SOC 2)
Evaluate new Cloud Monitoring features
Disaster recovery test (log export restoration)

References

Document Version: 1.0.0
Last Updated: 2026-02-17
Owner: DevOps Team
Reviewers: Security Team, Compliance Team, QA Team
Next Review: 2026-05-17
