
Monitoring & Observability - BIO-QMS Platform

Overview

This document defines the monitoring and observability architecture for the BIO-QMS regulated SaaS platform, ensuring compliance with FDA 21 CFR Part 11, HIPAA, and SOC 2 requirements. The platform leverages Google Cloud Platform's native observability stack combined with OpenTelemetry for comprehensive system visibility.

Observability Pillars

| Pillar | Technology | Purpose | Compliance Impact |
|---|---|---|---|
| Metrics | Cloud Monitoring | Performance, availability, SLO tracking | SOC 2 availability controls |
| Logs | Cloud Logging | Audit trail, debugging, compliance | FDA 21 CFR Part 11 §11.10(e) |
| Traces | Cloud Trace + OpenTelemetry | Request flow, latency analysis | Performance verification |
| Alerts | Cloud Monitoring Alerting | Proactive incident detection | HIPAA breach notification |

Regulatory Requirements

  • FDA 21 CFR Part 11 §11.10(e): Use of secure, computer-generated, time-stamped audit trails
  • HIPAA Security Rule: Audit controls (§164.312(b)), integrity controls (§164.312(c)(1))
  • SOC 2 CC7.2: System monitoring to detect security incidents
  • SOC 2 CC7.3: System availability monitoring and alerting

E.3.1: Cloud Monitoring Dashboards

Dashboard Architecture

┌────────────────────────────────────────────────────────────┐
│                 Cloud Monitoring Workspace                 │
├──────────────────┬──────────────────┬──────────────────────┤
│ API Operations   │ Database Layer   │ QMS Business KPIs    │
│ Dashboard        │ Dashboard        │ Dashboard            │
├──────────────────┼──────────────────┼──────────────────────┤
│ - Request Rate   │ - Connection     │ - Documents Signed   │
│ - Latency (p50,  │   Pool Usage     │ - CAPA Resolution    │
│   p95, p99)      │ - Query Latency  │   Time               │
│ - Error Rate     │ - Replication    │ - Audit Events/Hour  │
│ - HTTP Status    │   Lag            │ - Active Sessions    │
│   Distribution   │ - Disk I/O       │ - Compliance Score   │
├──────────────────┼──────────────────┼──────────────────────┤
│ Cache Layer      │ Infrastructure   │ Security Metrics     │
│ Dashboard        │ Dashboard        │ Dashboard            │
├──────────────────┼──────────────────┼──────────────────────┤
│ - Hit Rate       │ - CPU Usage      │ - Failed Logins      │
│ - Miss Rate      │ - Memory Usage   │ - Auth Token Issues  │
│ - Eviction Rate  │ - Network I/O    │ - Access Violations  │
│ - Command/sec    │ - Pod Restarts   │ - Certificate Expiry │
└──────────────────┴──────────────────┴──────────────────────┘

Dashboard 1: API Operations Dashboard

{
"displayName": "BIO-QMS API Operations",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "API Request Rate (requests/sec)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"bio-qms-api\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["resource.label.service_name"]
}
}
},
"plotType": "LINE",
"targetAxis": "Y1"
}
],
"timeshiftDuration": "0s",
"yAxis": {
"label": "Requests/sec",
"scale": "LINEAR"
}
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "API Latency Percentiles (ms)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_50"
}
}
},
"plotType": "LINE",
"legendTemplate": "p50"
},
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE",
"legendTemplate": "p95"
},
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_99"
}
}
},
"plotType": "LINE",
"legendTemplate": "p99"
}
],
"thresholds": [
{
"value": 200.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "SLO Target (p95 < 200ms)"
},
{
"value": 500.0,
"color": "RED",
"direction": "ABOVE",
"label": "Critical Threshold"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Error Rate (%)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilterRatio": {
"numerator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class=\"5xx\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM"
}
},
"denominator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM"
}
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.01,
"color": "YELLOW",
"direction": "ABOVE",
"label": "Warning (1%)"
},
{
"value": 0.05,
"color": "RED",
"direction": "ABOVE",
"label": "Critical (5%)"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "HTTP Status Distribution",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["metric.label.response_code_class"]
}
}
},
"plotType": "STACKED_BAR"
}
]
}
}
},
{
"yPos": 8,
"width": 12,
"height": 4,
"widget": {
"title": "API Availability (SLO: 99.9%)",
"scorecard": {
"timeSeriesQuery": {
"timeSeriesFilterRatio": {
"numerator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class!=\"5xx\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_SUM",
"crossSeriesReducer": "REDUCE_SUM"
}
},
"denominator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_SUM",
"crossSeriesReducer": "REDUCE_SUM"
}
}
}
},
"sparkChartView": {
"sparkChartType": "SPARK_LINE"
},
"thresholds": [
{
"value": 0.999,
"color": "YELLOW",
"direction": "BELOW"
},
{
"value": 0.995,
"color": "RED",
"direction": "BELOW"
}
]
}
}
}
]
}
}
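
The tiles above sit on a 12-column grid and are placed with xPos, yPos, width, and height, where an omitted position is read as 0. A minimal sketch of a pre-deployment check for tiles that overflow the grid or overlap each other follows. The names `Tile` and `findLayoutProblems` are hypothetical and not part of any Google Cloud library.

```typescript
// Hypothetical helper: sanity-check mosaicLayout tiles before deploying a dashboard.
interface Tile {
  xPos?: number; // omitted positions are treated as 0, as in the JSON above
  yPos?: number;
  width: number;
  height: number;
}

function findLayoutProblems(tiles: Tile[], columns = 12): string[] {
  const problems: string[] = [];
  const boxes = tiles.map((t) => ({
    x: t.xPos ?? 0,
    y: t.yPos ?? 0,
    w: t.width,
    h: t.height,
  }));

  boxes.forEach((a, i) => {
    // A tile must end at or before the last grid column.
    if (a.x + a.w > columns) {
      problems.push(`tile ${i} exceeds ${columns} columns`);
    }
    // Two tiles overlap when their ranges intersect on both axes.
    for (let j = i + 1; j < boxes.length; j++) {
      const b = boxes[j];
      const overlapX = a.x < b.x + b.w && b.x < a.x + a.w;
      const overlapY = a.y < b.y + b.h && b.y < a.y + a.h;
      if (overlapX && overlapY) {
        problems.push(`tiles ${i} and ${j} overlap`);
      }
    }
  });
  return problems;
}

// Tile geometry copied from the API Operations dashboard above.
const apiTiles: Tile[] = [
  { width: 6, height: 4 },
  { xPos: 6, width: 6, height: 4 },
  { yPos: 4, width: 6, height: 4 },
  { xPos: 6, yPos: 4, width: 6, height: 4 },
  { yPos: 8, width: 12, height: 4 },
];
console.log(findLayoutProblems(apiTiles)); // []
```

Running a check like this in CI is one way to catch layout mistakes before the dashboard JSON reaches the API.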

Dashboard 2: Database Performance Dashboard

{
"displayName": "BIO-QMS Database Performance",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Database Connection Pool Usage",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/postgresql/num_backends\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE",
"legendTemplate": "Active Connections"
}
],
"thresholds": [
{
"value": 80.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "Warning (80 connections)"
},
{
"value": 95.0,
"color": "RED",
"direction": "ABOVE",
"label": "Critical (95 connections)"
}
]
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "Query Latency (ms)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/prisma_query_duration\" AND metric.labels.operation=\"query\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Disk Utilization (fraction of quota)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/disk/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.8,
"color": "YELLOW",
"direction": "ABOVE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Replication Lag (seconds)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/replication/replica_lag\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MAX"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 10.0,
"color": "YELLOW",
"direction": "ABOVE"
},
{
"value": 60.0,
"color": "RED",
"direction": "ABOVE"
}
]
}
}
},
{
"yPos": 8,
"width": 6,
"height": 4,
"widget": {
"title": "Transaction Rate (tx/sec)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/postgresql/transaction_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 8,
"width": 6,
"height": 4,
"widget": {
"title": "Database CPU Utilization (%)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/cpu/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.7,
"color": "YELLOW",
"direction": "ABOVE"
},
{
"value": 0.9,
"color": "RED",
"direction": "ABOVE"
}
]
}
}
}
]
}
}

Dashboard 3: Cache Performance Dashboard

{
"displayName": "BIO-QMS Cache Layer (Memorystore Redis)",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Cache Hit Rate (%)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/cache_hit_ratio\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.8,
"color": "YELLOW",
"direction": "BELOW",
"label": "Target Hit Rate (80%)"
},
{
"value": 0.6,
"color": "RED",
"direction": "BELOW"
}
]
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "Commands/sec",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/commands/calls\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["metric.label.cmd"]
}
}
},
"plotType": "STACKED_AREA"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Memory Usage (bytes)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/memory/usage\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Evicted Keys/sec",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/evicted_keys\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 100.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "High Eviction Rate"
}
]
}
}
}
]
}
}

Dashboard 4: QMS Business KPIs

{
"displayName": "BIO-QMS Business Metrics",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 4,
"height": 4,
"widget": {
"title": "Documents Signed (per hour)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/qms_document_signed\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 4,
"width": 4,
"height": 4,
"widget": {
"title": "CAPA Resolution Time (hours)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/capa_resolution_duration\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 72.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "Target (72h)"
}
]
}
}
},
{
"xPos": 8,
"width": 4,
"height": 4,
"widget": {
"title": "Audit Events (per hour)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/audit_event_count\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["metric.label.event_type"]
}
}
},
"plotType": "STACKED_BAR"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Active User Sessions",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/active_sessions\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Document Approval Workflow Duration (hours)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/document_approval_duration\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE"
}
]
}
}
}
]
}
}

Custom Metrics Implementation

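// src/monitoring/metrics.service.ts
import { Injectable } from '@nestjs/common';
import { MetricServiceClient } from '@google-cloud/monitoring';

// NOTE: createTimeSeries normally accepts only user-writable metric types such as
// custom.googleapis.com/... The logging.googleapis.com/user/ prefix used below is
// reserved for log-based metrics, so verify how these metrics are produced
// before relying on this service.
@Injectable()
export class MetricsService {
private readonly client: MetricServiceClient;
private readonly projectId: string;

constructor() {
this.client = new MetricServiceClient();
// Fail fast instead of sending requests for an undefined project.
const projectId = process.env.GCP_PROJECT_ID;
if (!projectId) {
throw new Error('GCP_PROJECT_ID is not set');
}
this.projectId = projectId;
}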

/**
* Record document signature event
* @compliance FDA 21 CFR Part 11 - Electronic signatures tracking
*/
async recordDocumentSigned(documentId: string, userId: string, orgId: string): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
int64Value: 1,
},
};

const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/qms_document_signed',
labels: {
document_id: documentId,
organization_id: orgId,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};

await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}

/**
* Record CAPA resolution duration
* @compliance SOC 2 - Performance monitoring
*/
async recordCapaResolution(capaId: string, durationHours: number): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
doubleValue: durationHours,
},
};

const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/capa_resolution_duration',
labels: {
capa_id: capaId,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};

await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}

/**
* Record audit event count
* @compliance FDA 21 CFR Part 11 §11.10(e) - Audit trail
*/
async recordAuditEvent(eventType: string, userId: string, resourceType: string): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
int64Value: 1,
},
};

const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/audit_event_count',
labels: {
event_type: eventType,
resource_type: resourceType,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};

await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}

/**
* Update active session count
*/
async updateActiveSessions(count: number): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
int64Value: count,
},
};

const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/active_sessions',
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};

await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}

/**
* Record document approval workflow duration
*/
async recordApprovalDuration(workflowId: string, durationHours: number): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
doubleValue: durationHours,
},
};

const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/document_approval_duration',
labels: {
workflow_id: workflowId,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};

await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}
}
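
The five methods above repeat the same point and time-series structure. A possible refactoring, shown here only as a sketch, moves that structure into one pure builder so that each method reduces to a single call. `buildTimeSeries`, `PointValue`, and `TimeSeriesPayload` are hypothetical names; the metric prefix and resource labels are copied from the service above.

```typescript
// Hypothetical refactoring sketch: one builder for the repeated time-series payload.
type PointValue = { int64Value: number } | { doubleValue: number };

interface TimeSeriesPayload {
  metric: { type: string; labels: Record<string, string> };
  resource: { type: string; labels: Record<string, string> };
  points: Array<{ interval: { endTime: { seconds: number } }; value: PointValue }>;
}

function buildTimeSeries(
  projectId: string,
  metricName: string,
  value: PointValue,
  labels: Record<string, string> = {},
  nowMs: number = Date.now(),
): TimeSeriesPayload {
  return {
    // Same metric prefix as the methods above.
    metric: { type: `logging.googleapis.com/user/${metricName}`, labels },
    resource: {
      type: "cloud_run_revision",
      labels: { project_id: projectId, service_name: "bio-qms-api" },
    },
    // The API expects whole seconds, so the millisecond clock is floored.
    points: [{ interval: { endTime: { seconds: Math.floor(nowMs / 1000) } }, value }],
  };
}

const series = buildTimeSeries(
  "demo-project", // placeholder project ID
  "capa_resolution_duration",
  { doubleValue: 36.5 },
  { capa_id: "CAPA-001" },
  1700000000500,
);
console.log(series.points[0].interval.endTime.seconds); // 1700000000
```

Because the builder takes the clock as a parameter, the payload can be unit-tested without calling Cloud Monitoring.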

SLO Definitions

# slo-definitions.yaml
# Service Level Objectives for BIO-QMS Platform

apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
  name: api-availability-slo
spec:
  displayName: "API Availability SLO"
  serviceLevelIndicator:
    requestBased:
      goodTotalRatio:
        totalServiceFilter: >
          resource.type="cloud_run_revision"
          AND resource.labels.service_name="bio-qms-api"
          AND metric.type="run.googleapis.com/request_count"
        goodServiceFilter: >
          resource.type="cloud_run_revision"
          AND resource.labels.service_name="bio-qms-api"
          AND metric.type="run.googleapis.com/request_count"
          AND metric.labels.response_code_class!="5xx"
  goal: 0.999 # 99.9% availability
  rollingPeriod: "2592000s" # 30 days
  complianceNote: "SOC 2 CC7.1 - System availability commitment"

---
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
  name: api-latency-slo
spec:
  displayName: "API Latency SLO (p95 < 200ms)"
  serviceLevelIndicator:
    requestBased:
      distributionCut:
        distributionFilter: >
          resource.type="cloud_run_revision"
          AND metric.type="run.googleapis.com/request_latencies"
        range:
          max: 200.0 # milliseconds
  goal: 0.95 # 95% of requests under 200ms
  rollingPeriod: "2592000s"
  complianceNote: "SOC 2 CC7.2 - Performance monitoring"

---
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
  name: database-query-latency-slo
spec:
  displayName: "Database Query Latency SLO (p95 < 100ms)"
  serviceLevelIndicator:
    requestBased:
      distributionCut:
        distributionFilter: >
          metric.type="logging.googleapis.com/user/prisma_query_duration"
        range:
          max: 100.0
  goal: 0.95
  rollingPeriod: "2592000s"
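
Each goal above implies an error budget: the share of requests, or the equivalent time, that may fail within the rolling period before the SLO is breached. The sketch below works through that arithmetic for the 99.9% availability goal. The function names are hypothetical, and the time-based figure is only an equivalent for a request-based SLO, assuming evenly spread traffic.

```typescript
// Work in parts-per-million so floating-point noise in (1 - goal) is rounded away.
function badPartsPerMillion(goal: number): number {
  return Math.round((1 - goal) * 1_000_000);
}

// Number of failed requests the SLO tolerates for a given request volume.
function allowedBadRequests(goal: number, totalRequests: number): number {
  return Math.floor((badPartsPerMillion(goal) * totalRequests) / 1_000_000);
}

// Time equivalent of the budget over the rolling period.
function errorBudgetSeconds(goal: number, rollingPeriodSeconds: number): number {
  return (badPartsPerMillion(goal) * rollingPeriodSeconds) / 1_000_000;
}

// API availability SLO: goal 0.999 over a 30-day (2592000 s) rolling period.
console.log(errorBudgetSeconds(0.999, 2592000)); // 2592 seconds, i.e. 43.2 minutes
console.log(allowedBadRequests(0.999, 1000000)); // 1000 failed requests per million
```

For the latency SLO the same functions apply with goal 0.95: up to 5% of requests in the window may exceed 200 ms.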

E.3.2: Alerting Policies

Alert Severity Levels

| Severity | Response Time | Channels | Escalation |
|---|---|---|---|
| Critical | Immediate (5 min) | PagerDuty, Slack, Email | On-call engineer → Manager (15 min) |
| Warning | 30 minutes | Slack, Email | Team channel → On-call (60 min) |
| Informational | Next business day | Email | None |
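
The severity levels above can be captured as data so that routing and escalation decisions are made in one place. This is a minimal sketch under that assumption; `ROUTING` and `shouldEscalate` are hypothetical names, and the channel identifiers are shorthand rather than real notification channel resource names.

```typescript
type Severity = "critical" | "warning" | "informational";

interface Routing {
  channels: string[];
  responseMinutes: number | null; // null: next business day, no fixed minute target
  escalateAfterMinutes: number | null; // null: no escalation
}

// Values transcribed from the severity levels listed above.
const ROUTING: Record<Severity, Routing> = {
  critical: { channels: ["pagerduty", "slack", "email"], responseMinutes: 5, escalateAfterMinutes: 15 },
  warning: { channels: ["slack", "email"], responseMinutes: 30, escalateAfterMinutes: 60 },
  informational: { channels: ["email"], responseMinutes: null, escalateAfterMinutes: null },
};

// An alert escalates once it has stayed unacknowledged for the configured time.
function shouldEscalate(severity: Severity, minutesUnacknowledged: number): boolean {
  const limit = ROUTING[severity].escalateAfterMinutes;
  return limit !== null && minutesUnacknowledged >= limit;
}

console.log(shouldEscalate("critical", 20)); // true
console.log(shouldEscalate("informational", 20)); // false
```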

Critical Alerts

Alert 1: API Service Down

# alerts/critical/api-down.yaml
displayName: "CRITICAL: API Service Down"
combiner: OR # required field of the AlertPolicy API
documentation:
  content: |
    The BIO-QMS API service is completely down.

    **Compliance Impact:** FDA 21 CFR Part 11, HIPAA - System unavailable

    **Runbook:** https://wiki.bioqms.com/runbooks/api-down

    **Steps:**
    1. Check Cloud Run service status: gcloud run services describe bio-qms-api
    2. Check recent deployments: gcloud run revisions list
    3. Review logs: gcloud logging read "resource.type=cloud_run_revision" --limit 50
    4. Verify database connectivity
    5. If necessary, rollback to previous revision
  mimeType: text/markdown

conditions:
  - displayName: "No successful requests in 5 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND resource.labels.service_name = "bio-qms-api"
        AND metric.type = "run.googleapis.com/request_count"
        AND metric.labels.response_code_class = "2xx"
      comparison: COMPARISON_LT
      thresholdValue: 1
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          # ALIGN_DELTA counts requests per 60 s window; ALIGN_RATE would
          # compare a per-second rate against 1 and fire on low traffic.
          perSeriesAligner: ALIGN_DELTA
          crossSeriesReducer: REDUCE_SUM

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/slack-incidents
  - projects/bio-qms-prod/notificationChannels/email-oncall

alertStrategy:
  autoClose: 1800s # 30 minutes
  # The API reference describes notificationRateLimit as honored only for
  # log-based (LogMatch) policies, so it may have no effect on this one.
  notificationRateLimit:
    period: 300s

Alert 2: Database Unreachable

# alerts/critical/database-unreachable.yaml
displayName: "CRITICAL: Database Unreachable"
combiner: OR # required field of the AlertPolicy API
documentation:
  content: |
    Cloud SQL database is unreachable or connection pool exhausted.

    **Compliance Impact:** FDA 21 CFR Part 11 - Data integrity risk

    **Runbook:** https://wiki.bioqms.com/runbooks/database-unreachable

    **Steps:**
    1. Check Cloud SQL instance status
    2. Verify network connectivity
    3. Check connection pool metrics
    4. Review database error logs
    5. Consider scaling instance if connection pool exhausted

conditions:
  - displayName: "Database connection errors > 10/min"
    conditionThreshold:
      # NOTE: conditionThreshold filters must select a metric type. The
      # jsonPayload fields below are log-filter syntax, so this expression
      # has to back a log-based metric (or a conditionMatchedLog condition)
      # before the policy will validate.
      filter: |
        resource.type = "cloud_run_revision"
        AND jsonPayload.level = "error"
        AND jsonPayload.context.error =~ ".*database.*connection.*"
      comparison: COMPARISON_GT
      thresholdValue: 10
      duration: 60s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/slack-incidents

Alert 3: Certificate Expiry

# alerts/critical/certificate-expiry.yaml
displayName: "CRITICAL: TLS Certificate Expiring Soon"
combiner: OR # required field of the AlertPolicy API
documentation:
  content: |
    TLS certificate expires in less than 7 days.

    **Compliance Impact:** HIPAA Security Rule - Encryption in transit

    **Runbook:** https://wiki.bioqms.com/runbooks/certificate-renewal

conditions:
  - displayName: "Certificate expires in < 7 days"
    conditionThreshold:
      filter: |
        resource.type = "gae_app"
        AND metric.type = "appengine.googleapis.com/http/server/certificate_expiry_time"
      comparison: COMPARISON_LT
      thresholdValue: 604800 # 7 days in seconds
      duration: 0s
      aggregations:
        - alignmentPeriod: 3600s
          perSeriesAligner: ALIGN_MIN

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/email-security-team

Alert 4: High Error Rate

# alerts/critical/high-error-rate.yaml
displayName: "CRITICAL: High API Error Rate (>5%)"
combiner: OR # required field of the AlertPolicy API
documentation:
  content: |
    API error rate exceeds 5%.

    **Compliance Impact:** SOC 2 - Service availability degradation

    **Runbook:** https://wiki.bioqms.com/runbooks/high-error-rate

conditions:
  - displayName: "5xx errors > 5% for 5 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
        AND metric.labels.response_code_class = "5xx"
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
      denominatorFilter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/pagerduty-critical
  - projects/bio-qms-prod/notificationChannels/slack-incidents
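
The error-rate condition above combines a ratio (5xx requests over all requests) with a duration: the ratio has to stay above the threshold for the whole 300 s, which is five consecutive 60 s alignment periods. The sketch below is a simplified model of that rule, useful for reasoning about when the alert fires. It is not Cloud Monitoring's actual evaluator, and `RequestWindow` and `ratioConditionMet` are hypothetical names.

```typescript
// One entry per 60 s alignment period.
interface RequestWindow {
  errors: number; // 5xx responses in the window
  total: number; // all responses in the window
}

function ratioConditionMet(
  windows: RequestWindow[],
  threshold: number,
  requiredWindows: number,
): boolean {
  // Not enough history yet to cover the full duration.
  if (windows.length < requiredWindows) return false;
  // Every one of the most recent windows must exceed the threshold.
  return windows
    .slice(-requiredWindows)
    .every((w) => w.total > 0 && w.errors / w.total > threshold);
}

const sustained: RequestWindow[] = Array.from({ length: 5 }, () => ({ errors: 6, total: 100 }));
const recovered: RequestWindow[] = [...sustained.slice(0, 4), { errors: 4, total: 100 }];

console.log(ratioConditionMet(sustained, 0.05, 5)); // true: 6% in all five windows
console.log(ratioConditionMet(recovered, 0.05, 5)); // false: the last window dropped to 4%
```

In this model, a single window below the threshold resets the condition, which is why a short duration makes an alert more sensitive to brief spikes.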

Warning Alerts

Alert 5: Elevated Error Rate

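# alerts/warning/elevated-error-rate.yaml
displayName: "WARNING: Elevated Error Rate (>1%)"
combiner: OR # required field of the AlertPolicy API
conditions:
  - displayName: "5xx errors > 1% for 10 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
        AND metric.labels.response_code_class = "5xx"
      comparison: COMPARISON_GT
      thresholdValue: 0.01
      duration: 600s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
      # Without a denominator the threshold would be compared with the raw
      # 5xx count instead of the 5xx share of all requests.
      denominatorFilter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring
  - projects/bio-qms-prod/notificationChannels/email-team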

Alert 6: High Latency

# alerts/warning/high-latency.yaml
displayName: "WARNING: High API Latency (p95 > 500ms)"
combiner: OR # required field of the AlertPolicy API
conditions:
  - displayName: "p95 latency > 500ms for 10 minutes"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision"
        AND metric.type = "run.googleapis.com/request_latencies"
      comparison: COMPARISON_GT
      thresholdValue: 500.0
      duration: 600s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_DELTA
          crossSeriesReducer: REDUCE_PERCENTILE_95

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring

Alert 7: High Disk Usage

# alerts/warning/high-disk-usage.yaml
displayName: "WARNING: Database Disk Usage > 80%"
combiner: OR # required field of the AlertPolicy API
conditions:
  - displayName: "Disk usage > 80%"
    conditionThreshold:
      filter: |
        resource.type = "cloudsql_database"
        AND metric.type = "cloudsql.googleapis.com/database/disk/utilization"
      comparison: COMPARISON_GT
      thresholdValue: 0.8
      duration: 300s

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring
  - projects/bio-qms-prod/notificationChannels/email-devops

Alert 8: Low Cache Hit Rate

# alerts/warning/low-cache-hit-rate.yaml
displayName: "WARNING: Cache Hit Rate < 60%"
combiner: OR # required field of the AlertPolicy API
conditions:
  - displayName: "Cache hit rate < 60% for 30 minutes"
    conditionThreshold:
      filter: |
        resource.type = "redis_instance"
        AND metric.type = "redis.googleapis.com/stats/cache_hit_ratio"
      comparison: COMPARISON_LT
      thresholdValue: 0.6
      duration: 1800s

notificationChannels:
  - projects/bio-qms-prod/notificationChannels/slack-monitoring

Notification Channels Configuration

// src/monitoring/notification-channels.service.ts
import { Injectable } from '@nestjs/common';
import { NotificationChannelServiceClient } from '@google-cloud/monitoring';

@Injectable()
export class NotificationChannelsService {
private readonly client: NotificationChannelServiceClient;
private readonly projectId: string;

constructor() {
this.client = new NotificationChannelServiceClient();
// Fail fast instead of sending requests for an undefined project.
const projectId = process.env.GCP_PROJECT_ID;
if (!projectId) {
throw new Error('GCP_PROJECT_ID is not set');
}
this.projectId = projectId;
}

async createPagerDutyChannel(): Promise<string> {
const [channel] = await this.client.createNotificationChannel({
name: this.client.projectPath(this.projectId),
notificationChannel: {
type: 'pagerduty',
displayName: 'PagerDuty - Critical Incidents',
labels: {
service_key: process.env.PAGERDUTY_SERVICE_KEY,
},
enabled: true,
},
});
return channel.name;
}

async createSlackChannel(webhookUrl: string, channelName: string): Promise<string> {
const [channel] = await this.client.createNotificationChannel({
name: this.client.projectPath(this.projectId),
notificationChannel: {
type: 'slack',
displayName: `Slack - ${channelName}`,
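labels: {
// Assumption to verify: the built-in 'slack' channel type is documented with `auth_token` and `channel_name` labels; a plain webhook URL normally uses the 'webhook_tokenauth' type.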
url: webhookUrl,
channel_name: channelName,
},
enabled: true,
},
});
return channel.name;
}

async createEmailChannel(emailAddress: string, displayName: string): Promise<string> {
const [channel] = await this.client.createNotificationChannel({
name: this.client.projectPath(this.projectId),
notificationChannel: {
type: 'email',
displayName: displayName,
labels: {
email_address: emailAddress,
},
enabled: true,
},
});
return channel.name;
}
}

Alert Policy Deployment Script

// scripts/deploy-alert-policies.ts
import { AlertPolicyServiceClient } from '@google-cloud/monitoring';
import * as fs from 'fs';
import * as path from 'path';
import * as yaml from 'js-yaml';

async function deployAlertPolicies() {
const client = new AlertPolicyServiceClient();
const projectId = process.env.GCP_PROJECT_ID;
if (!projectId) {
throw new Error('GCP_PROJECT_ID is not set');
}
const alertsDir = path.join(__dirname, '../config/alerts');

const categories = ['critical', 'warning', 'informational'];

for (const category of categories) {
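const categoryDir = path.join(alertsDir, category);
// Skip categories without a directory; readdirSync throws on a missing path.
if (!fs.existsSync(categoryDir)) continue;
const files = fs.readdirSync(categoryDir).filter(f => f.endsWith('.yaml'));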

for (const file of files) {
const filePath = path.join(categoryDir, file);
const content = fs.readFileSync(filePath, 'utf8');
const policy = yaml.load(content) as any;

console.log(`Deploying ${category} alert: ${policy.displayName}`);

try {
const [createdPolicy] = await client.createAlertPolicy({
name: client.projectPath(projectId),
alertPolicy: policy,
});
console.log(`✓ Created: ${createdPolicy.name}`);
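} catch (error) {
// `error` is typed as unknown under strict TypeScript settings.
const message = error instanceof Error ? error.message : String(error);
console.error(`✗ Failed to create ${file}:`, message);
}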
}
}
}

deployAlertPolicies().catch(console.error);
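
Because the script calls createAlertPolicy for every file on every run, a second run would presumably add another copy of each policy. One way to make deployments repeatable is to first compare the desired policies with the ones already in the project and then decide per policy whether to create or update. The sketch below shows only that planning step. `PolicyRef` and `planDeployment` are hypothetical names, matching by displayName is an assumption, and the existing list is assumed to come from the client's listAlertPolicies call.

```typescript
interface PolicyRef {
  name?: string; // resource name, present only for policies that already exist
  displayName: string;
}

interface DeploymentPlan {
  toCreate: PolicyRef[];
  toUpdate: PolicyRef[];
}

function planDeployment(desired: PolicyRef[], existing: PolicyRef[]): DeploymentPlan {
  // Index existing policies by display name (assumed unique within the project).
  const existingNames = new Map<string, string>();
  for (const policy of existing) {
    if (policy.name) existingNames.set(policy.displayName, policy.name);
  }

  const plan: DeploymentPlan = { toCreate: [], toUpdate: [] };
  for (const policy of desired) {
    const name = existingNames.get(policy.displayName);
    if (name) {
      // Carry the resource name over so the update targets the right policy.
      plan.toUpdate.push({ ...policy, name });
    } else {
      plan.toCreate.push(policy);
    }
  }
  return plan;
}

const plan = planDeployment(
  [{ displayName: "CRITICAL: API Service Down" }, { displayName: "WARNING: High Disk Usage" }],
  [{ name: "projects/demo/alertPolicies/123", displayName: "CRITICAL: API Service Down" }],
);
console.log(plan.toCreate.length, plan.toUpdate.length); // 1 1
```

Entries in `toUpdate` would then go to updateAlertPolicy and entries in `toCreate` to createAlertPolicy.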

E.3.3: Structured Logging with Cloud Logging

Logging Architecture

┌────────────────────────────────────────────────────────────┐
│                     Application Layer                      │
├────────────────────────────────────────────────────────────┤
│       NestJS Logger → Winston → Logging Interceptor        │
│                             ↓                              │
│           JSON Structured Logs + Correlation IDs           │
└────────────────────┬───────────────────────────────────────┘
                     ↓
┌────────────────────────────────────────────────────────────┐
│                    Google Cloud Logging                    │
├──────────────────┬──────────────────┬──────────────────────┤
│ Hot Storage      │ BigQuery Export  │ Cloud Storage        │
│ (30 days)        │ (1 year)         │ (Long-term)          │
└──────────────────┴──────────────────┴──────────────────────┘
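
The three storage tiers above imply a simple lookup rule when an auditor asks for a log entry of a given age. The sketch below encodes that rule, assuming that "1 year" means 365 days and that each tier is the preferred source until its retention window ends. `LogTier` and `tierForLogAge` are hypothetical names.

```typescript
type LogTier = "cloud-logging" | "bigquery" | "cloud-storage";

// Retention windows taken from the architecture above; 365 days is an assumption for "1 year".
const HOT_RETENTION_DAYS = 30;
const BIGQUERY_RETENTION_DAYS = 365;

// Pick the first tier that still holds an entry of the given age.
function tierForLogAge(ageDays: number): LogTier {
  if (ageDays <= HOT_RETENTION_DAYS) return "cloud-logging";
  if (ageDays <= BIGQUERY_RETENTION_DAYS) return "bigquery";
  return "cloud-storage";
}

console.log(tierForLogAge(7)); // cloud-logging
console.log(tierForLogAge(90)); // bigquery
console.log(tierForLogAge(800)); // cloud-storage
```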

Log Structure

// src/logging/interfaces/structured-log.interface.ts
export interface StructuredLog {
// Standard fields
timestamp: string; // ISO 8601 UTC
severity: LogSeverity; // DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY
message: string;

// Request context
request_id: string; // Unique per request (UUID v4)
trace_id?: string; // OpenTelemetry trace ID
span_id?: string; // OpenTelemetry span ID

// User context
user_id?: string; // Authenticated user ID
org_id?: string; // Organization/tenant ID
session_id?: string; // Session identifier

// Application context
service: string; // 'bio-qms-api'
environment: string; // 'production', 'staging', 'development'
version: string; // Application version (from package.json)

// Action context
action: string; // API endpoint or operation
resource_type?: string; // 'document', 'capa', 'training', etc.
resource_id?: string; // ID of affected resource

// Performance metrics
duration_ms?: number; // Operation duration

// Error context (if severity >= ERROR)
error?: {
name: string;
message: string;
stack?: string;
code?: string;
};

// Compliance metadata
compliance?: {
regulation: string[]; // ['FDA-21-CFR-Part-11', 'HIPAA', 'SOC-2']
audit_event_type?: string; // 'electronic_signature', 'data_modification', etc.
pii_logged: boolean; // Flag if PII is in logs
};

// Additional context
metadata?: Record<string, any>;
}

export enum LogSeverity {
DEBUG = 'DEBUG',
INFO = 'INFO',
NOTICE = 'NOTICE',
WARNING = 'WARNING',
ERROR = 'ERROR',
CRITICAL = 'CRITICAL',
ALERT = 'ALERT',
EMERGENCY = 'EMERGENCY',
}
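
To make the field layout concrete, the sketch below builds one log entry for an electronic-signature event. It declares a trimmed copy of the interface so that it compiles on its own. The helper name `buildSignatureAuditLog`, the action string `document.sign`, and all sample IDs are hypothetical.

```typescript
// Trimmed copy of the StructuredLog fields used in this example.
interface SignatureAuditLog {
  timestamp: string;
  severity: "INFO";
  message: string;
  request_id: string;
  user_id: string;
  org_id: string;
  service: string;
  environment: string;
  version: string;
  action: string;
  resource_type: string;
  resource_id: string;
  compliance: { regulation: string[]; audit_event_type: string; pii_logged: boolean };
}

function buildSignatureAuditLog(
  ctx: { requestId: string; userId: string; orgId: string; documentId: string },
  now: Date = new Date(),
): SignatureAuditLog {
  return {
    timestamp: now.toISOString(), // ISO 8601, UTC
    severity: "INFO",
    message: `Document ${ctx.documentId} signed`,
    request_id: ctx.requestId,
    user_id: ctx.userId,
    org_id: ctx.orgId,
    service: "bio-qms-api",
    environment: "production",
    version: "1.0.0",
    action: "document.sign", // hypothetical action name
    resource_type: "document",
    resource_id: ctx.documentId,
    compliance: {
      regulation: ["FDA-21-CFR-Part-11"],
      audit_event_type: "electronic_signature",
      pii_logged: false, // only opaque IDs are logged, no names or emails
    },
  };
}

const entry = buildSignatureAuditLog(
  { requestId: "req-123", userId: "user-42", orgId: "org-7", documentId: "DOC-001" },
  new Date("2024-01-15T10:30:00.000Z"),
);
console.log(entry.timestamp); // 2024-01-15T10:30:00.000Z
```

Logging opaque identifiers rather than names keeps `pii_logged` false, which simplifies retention and access rules for the audit trail.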

Winston Logger Configuration

// src/logging/winston.config.ts
import * as winston from 'winston';
import { LoggingWinston } from '@google-cloud/logging-winston';

const loggingWinston = new LoggingWinston({
projectId: process.env.GCP_PROJECT_ID,
keyFilename: process.env.GCP_KEY_FILE,
serviceContext: {
service: 'bio-qms-api',
version: process.env.APP_VERSION || '1.0.0',
},
});

export const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
// With no format argument, the timestamp defaults to new Date().toISOString() (ISO 8601, UTC).
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
},
transports: [
// Cloud Logging transport (production)
loggingWinston,

// Console transport (development)
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
),
}),
],
});

NestJS Logging Interceptor

// src/logging/logging.interceptor.ts
import {
Injectable,
NestInterceptor,
ExecutionContext,
CallHandler,
Logger,
} from '@nestjs/common';
import { Observable, throwError } from 'rxjs';
import { tap, catchError } from 'rxjs/operators';
import { v4 as uuidv4 } from 'uuid';
import { logger } from './winston.config';
import { StructuredLog, LogSeverity } from './interfaces/structured-log.interface';

@Injectable()
export class LoggingInterceptor implements NestInterceptor {
private readonly nestLogger = new Logger(LoggingInterceptor.name);

intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
const request = context.switchToHttp().getRequest();
const response = context.switchToHttp().getResponse();

// Generate or extract correlation IDs
const requestId = request.headers['x-request-id'] || uuidv4();
const traceId = request.headers['x-cloud-trace-context']?.split('/')[0];

// Attach to request for downstream use
request.requestId = requestId;
request.traceId = traceId;

// Set response header
response.setHeader('X-Request-ID', requestId);

const startTime = Date.now();
const { method, url, body, query, params } = request;
const userId = request.user?.id;
const orgId = request.user?.organizationId;
const sessionId = request.session?.id;

// Log incoming request
this.logRequest(requestId, traceId, method, url, userId, orgId);

return next.handle().pipe(
tap((data) => {
const duration = Date.now() - startTime;
this.logResponse(
requestId,
traceId,
method,
url,
response.statusCode,
duration,
userId,
orgId,
);
}),
catchError((error) => {
const duration = Date.now() - startTime;
this.logError(
requestId,
traceId,
method,
url,
error,
duration,
userId,
orgId,
);
return throwError(() => error);
}),
);
}

private logRequest(
requestId: string,
traceId: string | undefined,
method: string,
url: string,
userId?: string,
orgId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.INFO,
message: `Incoming ${method} ${url}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
org_id: orgId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: `${method} ${url}`,
compliance: {
regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
pii_logged: false,
},
};

logger.info(log);
}

private logResponse(
requestId: string,
traceId: string | undefined,
method: string,
url: string,
statusCode: number,
duration: number,
userId?: string,
orgId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: statusCode >= 400 ? LogSeverity.WARNING : LogSeverity.INFO,
message: `${method} ${url} ${statusCode}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
org_id: orgId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: `${method} ${url}`,
duration_ms: duration,
metadata: {
status_code: statusCode,
},
compliance: {
regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
pii_logged: false,
},
};

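// Emit at the winston level that matches the severity chosen above
if (statusCode >= 400) {
logger.warn(log);
} else {
logger.info(log);
}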
}

private logError(
requestId: string,
traceId: string | undefined,
method: string,
url: string,
error: any,
duration: number,
userId?: string,
orgId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.ERROR,
message: `${method} ${url} failed: ${error.message}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
org_id: orgId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: `${method} ${url}`,
duration_ms: duration,
error: {
name: error.name,
message: error.message,
stack: error.stack,
code: error.code,
},
compliance: {
regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
pii_logged: false,
},
};

logger.error(log);
}
}
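
The interceptor takes everything before the first `/` of the `X-Cloud-Trace-Context` header as the trace ID. That header has the form `TRACE_ID/SPAN_ID;o=OPTIONS`, so a stricter parser can also recover the span ID and the sampling flag and reject malformed values. The following is a minimal sketch; `parseCloudTraceContext` and `CloudTraceContext` are hypothetical names, not part of the original code.

```typescript
interface CloudTraceContext {
  traceId: string;    // 32-character hex trace ID
  spanId?: string;    // decimal span ID, if present
  sampled?: boolean;  // true when the header carries o=1
}

// Header format: TRACE_ID/SPAN_ID;o=OPTIONS (span and options are optional).
function parseCloudTraceContext(
  header: string | undefined,
): CloudTraceContext | undefined {
  if (!header) return undefined;
  const match = /^([0-9a-fA-F]{32})(?:\/(\d+))?(?:;o=(\d))?$/.exec(header.trim());
  if (!match) return undefined; // malformed header: treat as "no trace"
  return {
    traceId: match[1].toLowerCase(),
    spanId: match[2],
    sampled: match[3] === undefined ? undefined : match[3] === '1',
  };
}
```

Returning `undefined` for malformed input keeps a bad client header from being written into `trace_id` verbatim.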

Audit Logging Service

// src/logging/audit-log.service.ts
import { Injectable } from '@nestjs/common';
import { logger } from './winston.config';
import { StructuredLog, LogSeverity } from './interfaces/structured-log.interface';

export enum AuditEventType {
ELECTRONIC_SIGNATURE = 'electronic_signature',
DATA_MODIFICATION = 'data_modification',
DATA_DELETION = 'data_deletion',
USER_LOGIN = 'user_login',
USER_LOGOUT = 'user_logout',
FAILED_LOGIN = 'failed_login',
PASSWORD_CHANGE = 'password_change',
PERMISSION_CHANGE = 'permission_change',
DOCUMENT_APPROVAL = 'document_approval',
CAPA_STATUS_CHANGE = 'capa_status_change',
TRAINING_COMPLETION = 'training_completion',
SYSTEM_CONFIGURATION_CHANGE = 'system_configuration_change',
}

/**
* Audit logging service for FDA 21 CFR Part 11 compliance
* @compliance FDA 21 CFR Part 11 §11.10(e)
*/
@Injectable()
export class AuditLogService {
/**
* Log electronic signature event
* @compliance FDA 21 CFR Part 11 §11.50, §11.70
*/
logElectronicSignature(
userId: string,
documentId: string,
signatureMeaning: string,
requestId: string,
traceId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.NOTICE,
message: `Electronic signature applied: ${signatureMeaning}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'electronic_signature',
resource_type: 'document',
resource_id: documentId,
compliance: {
regulation: ['FDA-21-CFR-Part-11'],
audit_event_type: AuditEventType.ELECTRONIC_SIGNATURE,
pii_logged: false,
},
metadata: {
signature_meaning: signatureMeaning,
},
};

logger.info(log);
}

/**
* Log data modification event
* @compliance FDA 21 CFR Part 11 §11.10(e)
*/
logDataModification(
userId: string,
resourceType: string,
resourceId: string,
changes: Record<string, any>,
requestId: string,
traceId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.NOTICE,
message: `Data modified: ${resourceType}/${resourceId}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'data_modification',
resource_type: resourceType,
resource_id: resourceId,
compliance: {
regulation: ['FDA-21-CFR-Part-11'],
audit_event_type: AuditEventType.DATA_MODIFICATION,
pii_logged: false,
},
metadata: {
changes: this.sanitizeChanges(changes),
},
};

logger.info(log);
}

/**
* Log failed login attempt
* @compliance HIPAA Security Rule §164.312(b)
*/
logFailedLogin(
username: string,
ipAddress: string,
reason: string,
requestId: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.WARNING,
message: `Failed login attempt: ${username}`,
request_id: requestId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'failed_login',
compliance: {
regulation: ['HIPAA', 'SOC-2'],
audit_event_type: AuditEventType.FAILED_LOGIN,
pii_logged: true, // Username may contain PII
},
metadata: {
username,
ip_address: ipAddress,
reason,
},
};

logger.warn(log);
}

/**
* Log successful user login
* @compliance HIPAA Security Rule §164.312(b)
*/
logUserLogin(
userId: string,
username: string,
ipAddress: string,
requestId: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.INFO,
message: `User login successful: ${username}`,
request_id: requestId,
user_id: userId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'user_login',
compliance: {
regulation: ['HIPAA', 'SOC-2'],
audit_event_type: AuditEventType.USER_LOGIN,
pii_logged: true,
},
metadata: {
username,
ip_address: ipAddress,
},
};

logger.info(log);
}

/**
* Sanitize changes to remove sensitive data from logs
*/
private sanitizeChanges(changes: Record<string, any>): Record<string, any> {
const sanitized = { ...changes };
const sensitiveFields = ['password', 'ssn', 'credit_card', 'api_key', 'token'];

for (const field of sensitiveFields) {
if (field in sanitized) {
sanitized[field] = '[REDACTED]';
}
}

return sanitized;
}
}
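
`sanitizeChanges` only inspects top-level keys, so a sensitive field nested inside an object or array would still reach the log. A recursive variant is sketched below under the assumption that change payloads are plain JSON-like data; `redactDeep` is a hypothetical helper and the key list simply mirrors the one above.

```typescript
const SENSITIVE_KEYS = new Set(['password', 'ssn', 'credit_card', 'api_key', 'token']);

// Recursively copies a value, replacing sensitive keys at any depth.
// Assumes plain JSON-like data; class instances other than Date are
// copied as plain objects.
function redactDeep(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactDeep(item));
  }
  if (value instanceof Date) {
    return value;
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : redactDeep(inner);
    }
    return out;
  }
  return value;
}
```

Matching on the lower-cased key is a design choice made for this sketch; it also catches variants such as `Password` that the exact-match version would miss.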

Log Retention & Export Configuration

# config/log-retention.yaml
# Log retention and export configuration for compliance

sinks:
  # BigQuery export for long-term analysis (1 year)
  - name: bigquery-export
    destination: bigquery.googleapis.com/projects/bio-qms-prod/datasets/audit_logs
    # ":*" tests that the field is present
    filter: |
      severity >= NOTICE
      OR jsonPayload.compliance.audit_event_type:*
    bigqueryOptions:
      usePartitionedTables: true
      usesTimestampColumnPartitioning: true

  # Cloud Storage archive (7 years for FDA compliance)
  - name: gcs-archive
    destination: storage.googleapis.com/bio-qms-audit-logs-archive
    filter: |
      jsonPayload.compliance.regulation =~ ".*FDA.*"
      OR jsonPayload.compliance.audit_event_type:*
    includeChildren: true

  # Security events to dedicated dataset
  - name: security-events
    destination: bigquery.googleapis.com/projects/bio-qms-prod/datasets/security_logs
    filter: |
      jsonPayload.compliance.audit_event_type = "failed_login"
      OR jsonPayload.compliance.audit_event_type = "permission_change"
      OR severity >= ERROR

exclusions:
  # Exclude health check logs from long-term storage
  - name: exclude-health-checks
    filter: |
      jsonPayload.action = "GET /health"
      OR jsonPayload.action = "GET /readiness"

retention:
  # Hot storage: 30 days in Cloud Logging
  default: 30d

  # Compliance buckets: extended retention
  audit_logs: 2555d # 7 years (FDA requirement)
  security_logs: 2555d
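
The `2555d` value equals 7 × 365 and does not account for leap days. Whether that matters depends on how the retention obligation is interpreted, which this document does not specify. The sketch below compares the configured value with the calendar length of a seven-year window; `calendarRetentionDays` is a hypothetical helper and the start date is only an example.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Days between a UTC start date and the same calendar date `years` later.
function calendarRetentionDays(startUtc: Date, years: number): number {
  const end = new Date(Date.UTC(
    startUtc.getUTCFullYear() + years,
    startUtc.getUTCMonth(),
    startUtc.getUTCDate(),
  ));
  return Math.round((end.getTime() - startUtc.getTime()) / MS_PER_DAY);
}

const configuredDays = 7 * 365; // 2555, the value used in log-retention.yaml
// Example window: 2026-02-17 to 2033-02-17 spans two leap days (2028, 2032).
const calendarDays = calendarRetentionDays(new Date(Date.UTC(2026, 1, 17)), 7);

console.log({ configuredDays, calendarDays, shortfall: calendarDays - configuredDays });
// For this start date: calendarDays is 2557, so the shortfall is 2 days.
```

If the requirement is read as seven full calendar years, rounding the configured value up (for example to 2557 days) would cover any start date.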

E.3.4: Distributed Tracing

OpenTelemetry Architecture

┌────────────────────────────────────────────────────────────┐
│                      Application Code                      │
│        (NestJS Controllers, Services, Repositories)        │
└────────────────────┬───────────────────────────────────────┘
                     ▼
┌────────────────────────────────────────────────────────────┐
│               OpenTelemetry Instrumentation                │
├──────────────────┬──────────────────┬──────────────────────┤
│ HTTP Tracing     │ Database         │ External APIs        │
│ (@opentelemetry  │ (Prisma)         │ (fetch, axios)       │
│ /instrumentation │                  │                      │
│ -http)           │                  │                      │
└──────────────────┴──────────────────┴──────────────────────┘
                     ▼
┌────────────────────────────────────────────────────────────┐
│                  OpenTelemetry Collector                   │
│ - Sampling (100% errors, 10% normal)                       │
│ - Batching                                                 │
│ - Enrichment (resource attributes)                         │
└────────────────────┬───────────────────────────────────────┘
                     ▼
┌────────────────────────────────────────────────────────────┐
│                     Google Cloud Trace                     │
│ - Trace storage & visualization                            │
│ - Latency analysis                                         │
│ - Service dependency mapping                               │
└────────────────────────────────────────────────────────────┘
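
The "100% errors, 10% normal" rule in the Collector box is a tail-based decision: whether a trace contains an error is only known once its spans have finished, so the choice is made after the fact (for example by a tail-sampling processor in the Collector) rather than when the root span starts. The sketch below illustrates only the decision logic. `CompletedTrace`, `keepByRatio`, and `shouldKeepTrace` are hypothetical names, and the bucketing is a simplified illustration, not the algorithm used by the OpenTelemetry SDK or Collector.

```typescript
interface CompletedTrace {
  traceId: string;         // 32-character hex trace ID
  hasError: boolean;       // any span ended with an error status
  hasAuditEvent: boolean;  // any span carries an audit.event_type attribute
}

// Deterministic keep/drop: the same trace ID always yields the same answer,
// so every service evaluating the rule agrees on which traces survive.
function keepByRatio(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000; // in [0, 1)
  return bucket < ratio;
}

function shouldKeepTrace(t: CompletedTrace, normalRatio = 0.1): boolean {
  if (t.hasError || t.hasAuditEvent) return true; // keep all errors and audit events
  return keepByRatio(t.traceId, normalRatio);     // keep roughly 10% of the rest
}
```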

OpenTelemetry Configuration

// src/tracing/tracing.config.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator } from '@opentelemetry/core';
import { Context, SpanKind, Attributes, Link } from '@opentelemetry/api';

/**
 * Custom head sampler: always samples spans that are flagged as audit events
 * when they start; everything else uses parent-based sampling with a 10% root
 * ratio.
 *
 * A head sampler runs when a span starts, before the HTTP status code is known,
 * so it cannot select spans by error status. Keeping 100% of error traces
 * requires tail-based sampling downstream (for example in the OpenTelemetry
 * Collector).
 */
class AdaptiveSampler implements Sampler {
  private readonly fallback = new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // 10% base sampling
  });

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[],
  ): SamplingResult {
    // Always sample audit events (the attribute must be set at span creation)
    if (attributes?.['audit.event_type']) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Use parent-based sampling for everything else
    return this.fallback.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  }

  toString(): string {
    return 'AdaptiveSampler';
  }
}

export const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'bio-qms-api',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
'service.namespace': 'bio-qms',
'cloud.provider': 'gcp',
'cloud.platform': 'gcp_cloud_run',
'cloud.region': process.env.GCP_REGION || 'us-central1',
}),
traceExporter: new TraceExporter({
projectId: process.env.GCP_PROJECT_ID,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': {
enabled: true,
ignoreIncomingRequestHook: (req) => ['/health', '/readiness'].includes(req.url ?? ''),
},
'@opentelemetry/instrumentation-express': {
enabled: true,
},
'@opentelemetry/instrumentation-pg': {
enabled: true,
enhancedDatabaseReporting: true,
},
'@opentelemetry/instrumentation-redis': {
enabled: true,
},
}),
],
sampler: new AdaptiveSampler(),
textMapPropagator: new CompositePropagator({
propagators: [
new W3CTraceContextPropagator(),
new W3CBaggagePropagator(),
],
}),
});

// Start tracing
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.error('Error terminating tracing', error))
.finally(() => process.exit(0));
});

Custom Span Creation

// src/tracing/tracing.service.ts
import { Injectable } from '@nestjs/common';
import { trace, context, SpanStatusCode, Span } from '@opentelemetry/api';

@Injectable()
export class TracingService {
private readonly tracer = trace.getTracer('bio-qms-api');

/**
* Create a custom span for business operations
*/
async withSpan<T>(
name: string,
operation: (span: Span) => Promise<T>,
attributes?: Record<string, any>,
): Promise<T> {
return this.tracer.startActiveSpan(name, async (span) => {
try {
if (attributes) {
span.setAttributes(attributes);
}

const result = await operation(span);

span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}

/**
* Add event to current span
*/
addEvent(name: string, attributes?: Record<string, any>): void {
const currentSpan = trace.getActiveSpan();
if (currentSpan) {
currentSpan.addEvent(name, attributes);
}
}

/**
* Set attribute on current span
*/
setAttribute(key: string, value: any): void {
const currentSpan = trace.getActiveSpan();
if (currentSpan) {
currentSpan.setAttribute(key, value);
}
}

/**
* Get current trace ID (for log correlation)
*/
getCurrentTraceId(): string | undefined {
const currentSpan = trace.getActiveSpan();
return currentSpan?.spanContext().traceId;
}

/**
* Get current span ID (for log correlation)
*/
getCurrentSpanId(): string | undefined {
const currentSpan = trace.getActiveSpan();
return currentSpan?.spanContext().spanId;
}
}
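
`getCurrentTraceId()` and `getCurrentSpanId()` exist so that log entries can be linked to traces. In Cloud Logging, a structured JSON entry is associated with a trace through the special `logging.googleapis.com/trace`, `logging.googleapis.com/spanId`, and `logging.googleapis.com/trace_sampled` fields. The sketch below builds those fields from the IDs; `buildTraceCorrelation` is a hypothetical helper and is not part of the original service.

```typescript
// The keys are the special fields Cloud Logging recognises in structured JSON
// payloads; the helper itself is illustrative only.
interface TraceCorrelation {
  'logging.googleapis.com/trace'?: string;
  'logging.googleapis.com/spanId'?: string;
  'logging.googleapis.com/trace_sampled'?: boolean;
}

function buildTraceCorrelation(
  projectId: string,
  traceId: string | undefined,
  spanId?: string,
  sampled?: boolean,
): TraceCorrelation {
  if (!traceId) return {}; // no active span: emit the log without correlation
  const fields: TraceCorrelation = {
    'logging.googleapis.com/trace': `projects/${projectId}/traces/${traceId}`,
  };
  if (spanId) fields['logging.googleapis.com/spanId'] = spanId;
  if (sampled !== undefined) fields['logging.googleapis.com/trace_sampled'] = sampled;
  return fields;
}
```

The returned object can be spread into a log payload next to the existing `trace_id` field, so both the application-level and the Cloud Logging correlation keys are present.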

NestJS Tracing Interceptor

// src/tracing/tracing.interceptor.ts
import {
  Injectable,
  NestInterceptor,
  ExecutionContext,
  CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap, catchError, finalize } from 'rxjs/operators';
import { TracingService } from './tracing.service';
import { trace, SpanStatusCode } from '@opentelemetry/api';

@Injectable()
export class TracingInterceptor implements NestInterceptor {
  constructor(private readonly tracingService: TracingService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const response = context.switchToHttp().getResponse();
    const { method, url } = request;
    const controllerName = context.getClass().name;
    const handlerName = context.getHandler().name;

    const tracer = trace.getTracer('bio-qms-api');
    const spanName = `${controllerName}.${handlerName}`;

    return tracer.startActiveSpan(spanName, (span) => {
      // Set span attributes
      span.setAttributes({
        'http.method': method,
        'http.url': url,
        'http.route': request.route?.path,
        'controller.name': controllerName,
        'handler.name': handlerName,
        'user.id': request.user?.id,
        'org.id': request.user?.organizationId,
      });

      return next.handle().pipe(
        tap(() => {
          span.setStatus({ code: SpanStatusCode.OK });
          // Record the status the handler actually produced (e.g. 201 or 204)
          span.setAttribute('http.status_code', response.statusCode);
        }),
        catchError((error) => {
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error.message,
          });
          span.recordException(error);
          span.setAttribute('http.status_code', error.status || 500);
          throw error;
        }),
        // finalize runs on completion, error, and unsubscribe, so the span
        // is ended on the error path as well
        finalize(() => {
          span.end();
        }),
      );
    });
  }
}

Prisma Query Tracing

// src/tracing/prisma-tracing.middleware.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { TracingService } from './tracing.service';
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

@Injectable()
export class PrismaTracingService implements OnModuleInit {
  constructor(
    private readonly prisma: PrismaClient,
    private readonly tracingService: TracingService,
  ) {}

  async onModuleInit() {
    // Middleware for query tracing
    this.prisma.$use(async (params, next) => {
      const tracer = trace.getTracer('bio-qms-api');

      return tracer.startActiveSpan(
        `prisma.${params.model}.${params.action}`,
        {
          kind: SpanKind.CLIENT,
          attributes: {
            'db.system': 'postgresql',
            'db.name': process.env.DATABASE_NAME,
            'db.operation': params.action,
            'db.model': params.model,
          },
        },
        async (span) => {
          const startTime = Date.now();

          try {
            const result = await next(params);

            const duration = Date.now() - startTime;
            span.setAttribute('db.duration_ms', duration);
            // Use the enum: the numeric value 0 is UNSET, not OK
            span.setStatus({ code: SpanStatusCode.OK });

            return result;
          } catch (error) {
            span.recordException(error);
            span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
            throw error;
          } finally {
            span.end();
          }
        },
      );
    });
  }
}

External API Call Tracing

// src/tracing/http-client.service.ts
import { Injectable } from '@nestjs/common';
import { HttpService } from '@nestjs/axios'; // HttpService moved out of @nestjs/common in NestJS 8
import { firstValueFrom } from 'rxjs';
import { TracingService } from './tracing.service';
import { trace, SpanKind, SpanStatusCode, propagation, context } from '@opentelemetry/api';
import { AxiosRequestConfig } from 'axios';

@Injectable()
export class TracedHttpService {
  constructor(
    private readonly httpService: HttpService,
    private readonly tracingService: TracingService,
  ) {}

  /**
   * Make HTTP request with automatic tracing
   */
  async request<T>(config: AxiosRequestConfig): Promise<T> {
    const tracer = trace.getTracer('bio-qms-api');
    const url = `${config.baseURL || ''}${config.url}`;

    return tracer.startActiveSpan(
      `HTTP ${config.method?.toUpperCase()} ${url}`,
      {
        kind: SpanKind.CLIENT,
        attributes: {
          'http.method': config.method?.toUpperCase(),
          'http.url': url,
          'http.target': config.url,
        },
      },
      async (span) => {
        try {
          // Inject trace context into headers
          const carrier: Record<string, string> = {};
          propagation.inject(context.active(), carrier);

          config.headers = {
            ...config.headers,
            ...carrier,
          };

          // firstValueFrom replaces the deprecated toPromise()
          const response = await firstValueFrom(this.httpService.request<T>(config));

          span.setAttribute('http.status_code', response.status);
          // Use the enum: the numeric value 0 is UNSET, not OK
          span.setStatus({ code: SpanStatusCode.OK });

          return response.data;
        } catch (error) {
          span.setAttribute('http.status_code', error.response?.status || 0);
          span.recordException(error);
          span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
          throw error;
        } finally {
          span.end();
        }
      },
    );
  }
}
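
With the W3C Trace Context propagator configured earlier, the context injected into outgoing headers travels in a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below formats and parses version `00` of that header to show what the downstream service receives. `TraceParent`, `formatTraceParent`, and `parseTraceParent` are hypothetical names for illustration; in the real service the propagator does this work.

```typescript
interface TraceParent {
  traceId: string;   // 32 lowercase hex characters
  spanId: string;    // 16 lowercase hex characters
  sampled: boolean;  // bit 0 of the trace-flags byte
}

function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.spanId}-${tp.sampled ? '01' : '00'}`;
}

// Parses version 00 only; other versions and malformed input yield undefined.
function parseTraceParent(header: string): TraceParent | undefined {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return undefined;
  const [, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```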

Application Bootstrap with Tracing

// src/main.ts
import './tracing/tracing.config'; // MUST be first import
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { LoggingInterceptor } from './logging/logging.interceptor';
import { TracingInterceptor } from './tracing/tracing.interceptor';

async function bootstrap() {
const app = await NestFactory.create(AppModule);

// Apply global interceptors
app.useGlobalInterceptors(
app.get(LoggingInterceptor),
app.get(TracingInterceptor),
);

await app.listen(process.env.PORT || 8080);
console.log(`Application is running on: ${await app.getUrl()}`);
}

bootstrap();

Trace Analysis Queries

-- BigQuery queries for trace analysis (exported from Cloud Trace)

-- Query 1: p95 latency by endpoint
SELECT
span_name,
APPROX_QUANTILES(duration_ms, 100)[OFFSET(95)] AS p95_latency_ms,
COUNT(*) AS request_count
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND span_kind = 'SERVER'
GROUP BY span_name
ORDER BY p95_latency_ms DESC
LIMIT 20;

-- Query 2: Error rate by endpoint
SELECT
span_name,
COUNTIF(status_code = 2) AS error_count,
COUNT(*) AS total_count,
ROUND(COUNTIF(status_code = 2) / COUNT(*) * 100, 2) AS error_rate_pct
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND span_kind = 'SERVER'
GROUP BY span_name
HAVING error_count > 0
ORDER BY error_rate_pct DESC;

-- Query 3: Slowest database queries
SELECT
span_name,
AVG(duration_ms) AS avg_duration_ms,
MAX(duration_ms) AS max_duration_ms,
COUNT(*) AS execution_count
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND span_name LIKE 'prisma.%'
GROUP BY span_name
ORDER BY avg_duration_ms DESC
LIMIT 20;

-- Query 4: Trace dependency graph (service calls)
SELECT
parent_span_name,
span_name,
COUNT(*) AS call_count,
AVG(duration_ms) AS avg_duration_ms
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND parent_span_id IS NOT NULL
GROUP BY parent_span_name, span_name
ORDER BY call_count DESC;
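
For readers less familiar with the aggregates in Queries 1 and 2, the sketch below computes the same two quantities over an in-memory sample. It uses the exact nearest-rank definition of a percentile, whereas `APPROX_QUANTILES` returns an approximation, so results on real data may differ slightly. `percentile` and `errorRatePct` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of the
// samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('percentile of empty sample');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Error rate as a percentage rounded to two decimals, as in Query 2.
function errorRatePct(errorCount: number, totalCount: number): number {
  if (totalCount === 0) return 0;
  return Math.round((errorCount / totalCount) * 10000) / 100;
}

const durationsMs = Array.from({ length: 100 }, (_, i) => i + 1); // 1..100 ms
console.log(percentile(durationsMs, 95)); // 95
console.log(errorRatePct(3, 200));        // 1.5
```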

Compliance Mapping

FDA 21 CFR Part 11

| Requirement | Implementation | Evidence |
| --- | --- | --- |
| §11.10(e) Audit trails | Structured logging with Cloud Logging, 7-year retention | AuditLogService, BigQuery exports |
| §11.50 Electronic signatures | logElectronicSignature() captures all signature events | Audit logs with electronic_signature event type |
| §11.70 Signature linking | Trace ID correlates signature to document modification | trace_id field in structured logs |

HIPAA Security Rule

| Requirement | Implementation | Evidence |
| --- | --- | --- |
| §164.312(b) Audit controls | Cloud Logging with tamper-proof timestamps | Structured logs exported to BigQuery |
| §164.312(c)(1) Integrity controls | Hash verification in audit logs | metadata.data_hash in modification logs |
| §164.308(a)(1)(ii)(D) Information system activity review | Cloud Monitoring dashboards + alerts | Dashboard JSON configs, alert policies |
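
The logging code in this document does not show how `metadata.data_hash` is produced. One possible approach, offered purely as an assumption, is a SHA-256 digest over a canonical JSON form of the modified record, so that the same content always yields the same hash regardless of key order. `canonicalJson` and `dataHash` are hypothetical helpers.

```typescript
import { createHash } from 'node:crypto';

// Serialises with object keys sorted, so key order does not change the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`)
      .join(',');
    return `{${body}}`;
  }
  // JSON.stringify returns undefined for undefined input; encode that as null.
  return JSON.stringify(value) ?? 'null';
}

// Hex-encoded SHA-256 digest of the canonical form (64 characters).
function dataHash(record: unknown): string {
  return createHash('sha256').update(canonicalJson(record), 'utf8').digest('hex');
}
```

A verifier can recompute `dataHash` over the stored record and compare it with the logged value; a mismatch indicates that the record or the log entry changed after the event.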

SOC 2

| Trust Service Criteria | Implementation | Evidence |
| --- | --- | --- |
| CC7.2 System monitoring | Cloud Monitoring dashboards, 24/7 alerting | Dashboard configs, PagerDuty integration |
| CC7.3 Incident detection | Critical alerts with 5-minute response SLA | Alert policies with escalation |
| CC7.4 Incident response | Runbooks linked to alerts, incident tracking | Alert documentation fields |
| A1.2 Backup and recovery monitoring | Database replication lag alerts | Replication lag dashboard widget |

Deployment Instructions

1. Deploy Dashboards

#!/bin/bash
# scripts/deploy-dashboards.sh
set -euo pipefail

PROJECT_ID="bio-qms-prod"

for dashboard in config/dashboards/*.json; do
  echo "Deploying $(basename "$dashboard")..."
  gcloud monitoring dashboards create --config-from-file="$dashboard" \
    --project="$PROJECT_ID"
done

2. Deploy Alert Policies

#!/bin/bash
# scripts/deploy-alerts.sh
set -euo pipefail

PROJECT_ID="bio-qms-prod"

# Create notification channels first (PAGERDUTY_KEY must be set in the environment)
gcloud alpha monitoring channels create \
  --display-name="PagerDuty - Critical" \
  --type=pagerduty \
  --channel-labels="service_key=$PAGERDUTY_KEY" \
  --project="$PROJECT_ID"

# Deploy alert policies
for policy in config/alerts/*/*.yaml; do
  echo "Deploying $(basename "$policy")..."
  gcloud alpha monitoring policies create --policy-from-file="$policy" \
    --project="$PROJECT_ID"
done

3. Configure Log Sinks

#!/bin/bash
# scripts/configure-log-sinks.sh

PROJECT_ID="bio-qms-prod"

# BigQuery sink (":*" tests that the field is present)
gcloud logging sinks create bigquery-audit-logs \
  "bigquery.googleapis.com/projects/$PROJECT_ID/datasets/audit_logs" \
  --log-filter='severity >= NOTICE OR jsonPayload.compliance.audit_event_type:*' \
  --use-partitioned-tables \
  --project="$PROJECT_ID"

# Cloud Storage archive (same filter as gcs-archive in config/log-retention.yaml)
gcloud logging sinks create gcs-audit-archive \
  storage.googleapis.com/bio-qms-audit-logs-archive \
  --log-filter='jsonPayload.compliance.regulation =~ ".*FDA.*" OR jsonPayload.compliance.audit_event_type:*' \
  --project="$PROJECT_ID"

4. Enable OpenTelemetry

// package.json additions
{
"dependencies": {
"@google-cloud/opentelemetry-cloud-trace-exporter": "^2.3.0",
"@opentelemetry/api": "^1.8.0",
"@opentelemetry/sdk-node": "^0.49.1",
"@opentelemetry/auto-instrumentations-node": "^0.42.0",
"@opentelemetry/instrumentation-http": "^0.49.1",
"@opentelemetry/instrumentation-express": "^0.37.0",
"@opentelemetry/instrumentation-pg": "^0.40.0"
}
}

# Install dependencies
npm install

# Update main.ts to import tracing config first
# (see src/main.ts example above)

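# Deploy with tracing enabled
# (GCP_PROJECT_ID is the variable read by tracing.config.ts and winston.config.ts)
gcloud run deploy bio-qms-api \
  --image=us-central1-docker.pkg.dev/$PROJECT_ID/bio-qms/api:latest \
  --region=us-central1 \
  --set-env-vars="GCP_PROJECT_ID=$PROJECT_ID,NODE_ENV=production"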

Testing & Validation

1. Metrics Validation

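# Test custom metric creation
curl -X POST https://api.bioqms.com/documents/123/sign \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"meaning": "Approved by QA Manager"}'

# Verify metric in Cloud Monitoring (Monitoring API v3, timeSeries.list;
# the `date -d` syntax is GNU-specific)
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="logging.googleapis.com/user/qms_document_signed"' \
  --data-urlencode "interval.startTime=$(date -u -d '-10 minutes' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"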

2. Alert Testing

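# Trigger test alert (critical error rate)
for i in {1..100}; do
  curl -s -o /dev/null https://api.bioqms.com/test/error &
done
wait

# Confirm the policy exists and is enabled; this command does not show whether
# it fired. Open incidents appear in the Cloud Console alerting page and in PagerDuty.
gcloud alpha monitoring policies list \
  --filter='displayName:"CRITICAL: High API Error Rate"' \
  --format=json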

3. Log Verification

# Query structured logs
gcloud logging read "jsonPayload.request_id=\"$REQUEST_ID\"" \
--format=json \
--limit=10

# Verify audit log
gcloud logging read \
"jsonPayload.compliance.audit_event_type=\"electronic_signature\"" \
--limit=1 \
--format=json

4. Trace Verification

# Generate traced request
curl -X POST https://api.bioqms.com/documents \
-H "Authorization: Bearer $TOKEN" \
-v 2>&1 | grep -i trace

# View trace in Cloud Console
# https://console.cloud.google.com/traces/list

Runbooks

Runbook 1: API Down

Symptoms: No successful API requests for 5+ minutes, PagerDuty alert fired

Diagnosis:

# 1. Check service status
gcloud run services describe bio-qms-api --region=us-central1

# 2. Check recent logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=50 --format=json

# 3. Check database connectivity
gcloud sql instances describe bio-qms-db

# 4. Check recent deployments
gcloud run revisions list --service=bio-qms-api --limit=5

Resolution:

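# If bad deployment: roll back to the previous revision
PREVIOUS_REVISION=$(gcloud run revisions list --service=bio-qms-api --region=us-central1 \
  --format="value(name)" --limit=2 | tail -n1)
gcloud run services update-traffic bio-qms-api --region=us-central1 \
  --to-revisions="$PREVIOUS_REVISION=100"

# If database issue: recycle instances (and their connection pools) by rolling
# out a new revision; the service runs on Cloud Run, so kubectl does not apply
gcloud run services update bio-qms-api --region=us-central1 \
  --update-env-vars="RESTARTED_AT=$(date +%s)"

# If Cloud Run issue: scale to zero and back
gcloud run services update bio-qms-api --region=us-central1 --min-instances=0
sleep 10
gcloud run services update bio-qms-api --region=us-central1 --min-instances=2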

Runbook 2: High Database Latency

Symptoms: p95 query latency > 500ms

Diagnosis:

# Check slow queries
gcloud logging read \
"jsonPayload.duration_ms>500 AND jsonPayload.action=\"prisma.query\"" \
--limit=20 --format=json

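# Check recent instance operations (restarts, maintenance, backups)
gcloud sql operations list --instance=bio-qms-db --limit=10

# Check database CPU and connection count (Monitoring API v3, timeSeries.list;
# the `date -d` syntax is GNU-specific)
for metric in cpu/utilization postgresql/num_backends; do
  curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=metric.type=\"cloudsql.googleapis.com/database/$metric\"" \
    --data-urlencode "interval.startTime=$(date -u -d '-15 minutes' +%Y-%m-%dT%H:%M:%SZ)" \
    --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
done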

Resolution:

# Add missing index (example)
psql -h $DB_HOST -U $DB_USER -d bio_qms -c \
"CREATE INDEX CONCURRENTLY idx_documents_org_created ON documents(organization_id, created_at);"

# Scale database instance
gcloud sql instances patch bio-qms-db --tier=db-custom-4-16384

# Analyze and vacuum
psql -h $DB_HOST -U $DB_USER -d bio_qms -c "VACUUM ANALYZE;"

Maintenance

Monthly Tasks

  • Review SLO compliance and error budget consumption
  • Analyze top 20 slowest endpoints and optimize
  • Review alert noise and tune thresholds
  • Archive old traces (Cloud Trace auto-retention: 30 days)
  • Validate BigQuery audit log exports
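
For the SLO review in the list above, error budget consumption follows directly from the SLO target and the request counts for the window. The sketch below shows the arithmetic. The 99.9% target and the request numbers are illustrative assumptions, not values defined in this section, and `errorBudget` is a hypothetical helper.

```typescript
interface ErrorBudget {
  allowedFailures: number; // failed requests the SLO tolerates in the window
  consumedPct: number;     // share of that budget already used
}

function errorBudget(
  sloTarget: number,       // e.g. 0.999 for a 99.9% availability SLO
  totalRequests: number,
  failedRequests: number,
): ErrorBudget {
  const allowedFailures = (1 - sloTarget) * totalRequests;
  if (allowedFailures === 0) {
    return { allowedFailures, consumedPct: failedRequests > 0 ? Infinity : 0 };
  }
  return { allowedFailures, consumedPct: (failedRequests / allowedFailures) * 100 };
}

// Assumed example: 99.9% SLO, 1,000,000 requests this month, 400 failures.
// The budget is about 1,000 failed requests, of which about 40% is consumed.
const monthly = errorBudget(0.999, 1000000, 400);
console.log(monthly.allowedFailures.toFixed(0), monthly.consumedPct.toFixed(1));
```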

Quarterly Tasks

  • Review and update dashboards based on new features
  • Conduct alert fire drill (test PagerDuty escalation)
  • Audit log retention compliance check (7-year FDA requirement)
  • Review and optimize trace sampling rates
  • Update runbooks based on recent incidents

Annual Tasks

  • Full observability stack audit
  • Review compliance mapping (FDA, HIPAA, SOC 2)
  • Evaluate new Cloud Monitoring features
  • Disaster recovery test (log export restoration)


Document Version: 1.0.0 | Last Updated: 2026-02-17 | Owner: DevOps Team | Reviewers: Security Team, Compliance Team, QA Team | Next Review: 2026-05-17