Monitoring & Observability - BIO-QMS Platform
Overview
This document defines the monitoring and observability architecture for the BIO-QMS regulated SaaS platform. The architecture is designed to support compliance with FDA 21 CFR Part 11, HIPAA, and SOC 2 requirements. It combines Google Cloud Platform's native observability stack (Cloud Monitoring, Cloud Logging, and Cloud Trace) with OpenTelemetry instrumentation to cover metrics, logs, traces, and alerts.
Observability Pillars
| Pillar | Technology | Purpose | Compliance Impact |
|---|---|---|---|
| Metrics | Cloud Monitoring | Performance, availability, SLO tracking | SOC 2 availability controls |
| Logs | Cloud Logging | Audit trail, debugging, compliance | FDA 21 CFR Part 11 §11.10(e) |
| Traces | Cloud Trace + OpenTelemetry | Request flow, latency analysis | Performance verification |
| Alerts | Cloud Monitoring Alerting | Proactive incident detection | HIPAA breach notification |
Regulatory Requirements
- FDA 21 CFR Part 11 §11.10(e): Use of secure, computer-generated, time-stamped audit trails
- HIPAA Security Rule: Audit controls (§164.312(b)), integrity controls (§164.312(c)(1))
- SOC 2 CC7.2: System monitoring to detect security incidents
- SOC 2 CC7.3: System availability monitoring and alerting
E.3.1: Cloud Monitoring Dashboards
Dashboard Architecture
┌─────────────────────────────────────────────────────────────┐
│ Cloud Monitoring Workspace │
├──────────────────┬──────────────────┬──────────────────────┤
│ API Operations │ Database Layer │ QMS Business KPIs │
│ Dashboard │ Dashboard │ Dashboard │
├──────────────────┼──────────────────┼──────────────────────┤
│ - Request Rate │ - Connection │ - Documents Signed │
│ - Latency (p50, │ Pool Usage │ - CAPA Resolution │
│ p95, p99) │ - Query Latency │ Time │
│ - Error Rate │ - Replication │ - Audit Events/Hour │
│ - HTTP Status │ Lag │ - Active Sessions │
│ Distribution │ - Disk I/O │ - Compliance Score │
├──────────────────┼──────────────────┼──────────────────────┤
│ Cache Layer │ Infrastructure │ Security Metrics │
│ Dashboard │ Dashboard │ Dashboard │
├──────────────────┼──────────────────┼──────────────────────┤
│ - Hit Rate │ - CPU Usage │ - Failed Logins │
│ - Miss Rate │ - Memory Usage │ - Auth Token Issues │
│ - Eviction Rate │ - Network I/O │ - Access Violations │
│ - Command/sec │ - Pod Restarts │ - Certificate Expiry │
└──────────────────┴──────────────────┴──────────────────────┘
Dashboard 1: API Operations Dashboard
{
"displayName": "BIO-QMS API Operations",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "API Request Rate (requests/sec)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"bio-qms-api\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["resource.labels.service_name"]
}
}
},
"plotType": "LINE",
"targetAxis": "Y1"
}
],
"timeshiftDuration": "0s",
"yAxis": {
"label": "Requests/sec",
"scale": "LINEAR"
}
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "API Latency Percentiles (ms)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_50"
}
}
},
"plotType": "LINE",
"legendTemplate": "p50"
},
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE",
"legendTemplate": "p95"
},
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_99"
}
}
},
"plotType": "LINE",
"legendTemplate": "p99"
}
],
"thresholds": [
{
"value": 200.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "SLO Target (p95 < 200ms)"
},
{
"value": 500.0,
"color": "RED",
"direction": "ABOVE",
"label": "Critical Threshold"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Error Rate (fraction of requests)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilterRatio": {
"numerator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class=\"5xx\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM"
}
},
"denominator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM"
}
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.01,
"color": "YELLOW",
"direction": "ABOVE",
"label": "Warning (1%)"
},
{
"value": 0.05,
"color": "RED",
"direction": "ABOVE",
"label": "Critical (5%)"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "HTTP Status Distribution",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["metric.labels.response_code_class"]
}
}
},
"plotType": "STACKED_BAR"
}
]
}
}
},
{
"yPos": 8,
"width": 12,
"height": 4,
"widget": {
"title": "API Availability (SLO: 99.9%)",
"scorecard": {
"timeSeriesQuery": {
"timeSeriesFilterRatio": {
"numerator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class!=\"5xx\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_SUM",
"crossSeriesReducer": "REDUCE_SUM"
}
},
"denominator": {
"filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_SUM",
"crossSeriesReducer": "REDUCE_SUM"
}
}
}
},
"sparkChartView": {
"sparkChartType": "SPARK_LINE"
},
"thresholds": [
{
"value": 0.999,
"color": "YELLOW",
"direction": "BELOW"
},
{
"value": 0.995,
"color": "RED",
"direction": "BELOW"
}
]
}
}
}
]
}
}
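The three latency data sets in the dashboard above differ only in their reducer and legend. A minimal sketch of generating them from one helper follows; `latencyDataSet`, `LATENCY_FILTER`, and the `DataSet` type are hypothetical names introduced here, and the field names simply mirror the JSON above. Nothing in this sketch calls the Monitoring API.

```typescript
// Hypothetical helper: builds one xyChart data set for a latency percentile.
// The shape mirrors the dashboard JSON above; names are illustrative only.
type Percentile = 50 | 95 | 99;

interface DataSet {
  timeSeriesQuery: {
    timeSeriesFilter: {
      filter: string;
      aggregation: {
        alignmentPeriod: string;
        perSeriesAligner: string;
        crossSeriesReducer: string;
      };
    };
  };
  plotType: string;
  legendTemplate: string;
}

const LATENCY_FILTER =
  'resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_latencies"';

function latencyDataSet(percentile: Percentile, alignmentPeriod = '60s'): DataSet {
  return {
    timeSeriesQuery: {
      timeSeriesFilter: {
        filter: LATENCY_FILTER,
        aggregation: {
          alignmentPeriod,
          perSeriesAligner: 'ALIGN_DELTA',
          crossSeriesReducer: `REDUCE_PERCENTILE_${percentile}`,
        },
      },
    },
    plotType: 'LINE',
    legendTemplate: `p${percentile}`,
  };
}

// One data set per percentile shown on the latency tile.
const percentiles: Percentile[] = [50, 95, 99];
const latencyDataSets: DataSet[] = percentiles.map((p) => latencyDataSet(p));
console.log(JSON.stringify(latencyDataSets, null, 2));
```

Generating the data sets this way keeps the filter and alignment period in one place, so a change to either cannot leave the p50, p95, and p99 lines out of step.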
Dashboard 2: Database Performance Dashboard
{
"displayName": "BIO-QMS Database Performance",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Database Connection Pool Usage",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/postgresql/num_backends\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE",
"legendTemplate": "Active Connections"
}
],
"thresholds": [
{
"value": 80.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "Warning (80 connections)"
},
{
"value": 95.0,
"color": "RED",
"direction": "ABOVE",
"label": "Critical (95 connections)"
}
]
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "Query Latency (ms)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/prisma_query_duration\" AND metric.labels.operation=\"query\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Disk Utilization (fraction of quota)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/disk/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.8,
"color": "YELLOW",
"direction": "ABOVE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Replication Lag (seconds)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/replication/replica_lag\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MAX"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 10.0,
"color": "YELLOW",
"direction": "ABOVE"
},
{
"value": 60.0,
"color": "RED",
"direction": "ABOVE"
}
]
}
}
},
{
"yPos": 8,
"width": 6,
"height": 4,
"widget": {
"title": "Transaction Rate (tx/sec)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/postgresql/transaction_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 8,
"width": 6,
"height": 4,
"widget": {
"title": "Database CPU Utilization (%)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloudsql_database\" AND metric.type=\"cloudsql.googleapis.com/database/cpu/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.7,
"color": "YELLOW",
"direction": "ABOVE"
},
{
"value": 0.9,
"color": "RED",
"direction": "ABOVE"
}
]
}
}
}
]
}
}
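The `xPos`, `yPos`, `width`, and `height` values in the dashboards follow a regular grid on the 12-column mosaic. As a rough sketch, the positions can be computed rather than typed by hand; `layoutTiles` and `TilePosition` are hypothetical names, and the dashboard JSON above omits `xPos` or `yPos` where the value is 0 whereas this sketch always emits them.

```typescript
// Hypothetical helper: assigns mosaicLayout positions to equally sized tiles,
// filling each row left to right on a grid of `columns` columns.
interface TilePosition {
  xPos: number;
  yPos: number;
  width: number;
  height: number;
}

function layoutTiles(
  count: number,
  tilesPerRow: number,
  height = 4,
  columns = 12,
): TilePosition[] {
  if (tilesPerRow <= 0 || columns % tilesPerRow !== 0) {
    throw new Error('tilesPerRow must divide the column count evenly');
  }
  const width = columns / tilesPerRow;
  const positions: TilePosition[] = [];
  for (let i = 0; i < count; i++) {
    positions.push({
      xPos: (i % tilesPerRow) * width,
      yPos: Math.floor(i / tilesPerRow) * height,
      width,
      height,
    });
  }
  return positions;
}

// Six tiles, two per row: the same grid as the database dashboard above.
const databaseLayout = layoutTiles(6, 2);
console.log(databaseLayout);
```

The last tile lands at `xPos: 6, yPos: 8`, matching the "Database CPU Utilization" tile in the JSON.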
Dashboard 3: Cache Performance Dashboard
{
"displayName": "BIO-QMS Cache Layer (Memorystore Redis)",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Cache Hit Rate (%)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/cache_hit_ratio\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 0.8,
"color": "YELLOW",
"direction": "BELOW",
"label": "Target Hit Rate (80%)"
},
{
"value": 0.6,
"color": "RED",
"direction": "BELOW"
}
]
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "Commands/sec",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/commands/calls\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["metric.labels.cmd"]
}
}
},
"plotType": "STACKED_AREA"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Memory Usage (bytes)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/memory/usage\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Evicted Keys/sec",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"redis_instance\" AND metric.type=\"redis.googleapis.com/stats/evicted_keys\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 100.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "High Eviction Rate"
}
]
}
}
}
]
}
}
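The dashboards mix `ABOVE` thresholds (latency, evictions) with `BELOW` thresholds (cache hit rate, availability). The sketch below shows one way to read a value against such a list; it is an illustration of the intended interpretation, not a description of how Cloud Monitoring renders thresholds internally, and `classify`, `Threshold`, and `Status` are hypothetical names.

```typescript
// Illustrative reading of dashboard thresholds: a value breaches an ABOVE
// threshold when it exceeds it and a BELOW threshold when it falls under it.
type Direction = 'ABOVE' | 'BELOW';
type Color = 'YELLOW' | 'RED';
type Status = 'OK' | Color;

interface Threshold {
  value: number;
  color: Color;
  direction: Direction;
}

// Returns the most severe color among the breached thresholds, or 'OK'.
function classify(value: number, thresholds: Threshold[]): Status {
  let status: Status = 'OK';
  for (const t of thresholds) {
    const breached = t.direction === 'ABOVE' ? value > t.value : value < t.value;
    if (!breached) continue;
    if (t.color === 'RED') return 'RED';
    status = 'YELLOW';
  }
  return status;
}

// Thresholds copied from the "Cache Hit Rate" tile above.
const cacheHitThresholds: Threshold[] = [
  { value: 0.8, color: 'YELLOW', direction: 'BELOW' },
  { value: 0.6, color: 'RED', direction: 'BELOW' },
];

console.log(classify(0.92, cacheHitThresholds)); // OK
console.log(classify(0.7, cacheHitThresholds)); // YELLOW
console.log(classify(0.5, cacheHitThresholds)); // RED
```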
Dashboard 4: QMS Business KPIs
{
"displayName": "BIO-QMS Business Metrics",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 4,
"height": 4,
"widget": {
"title": "Documents Signed (per hour)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/qms_document_signed\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_SUM",
"crossSeriesReducer": "REDUCE_SUM"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 4,
"width": 4,
"height": 4,
"widget": {
"title": "CAPA Resolution Time (hours)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/capa_resolution_duration\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE"
}
],
"thresholds": [
{
"value": 72.0,
"color": "YELLOW",
"direction": "ABOVE",
"label": "Target (72h)"
}
]
}
}
},
{
"xPos": 8,
"width": 4,
"height": 4,
"widget": {
"title": "Audit Events (per hour)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/audit_event_count\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_SUM",
"crossSeriesReducer": "REDUCE_SUM",
"groupByFields": ["metric.labels.event_type"]
}
}
},
"plotType": "STACKED_BAR"
}
]
}
}
},
{
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Active User Sessions",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/active_sessions\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
},
"plotType": "LINE"
}
]
}
}
},
{
"xPos": 6,
"yPos": 4,
"width": 6,
"height": 4,
"widget": {
"title": "Document Approval Workflow Duration (hours)",
"xyChart": {
"dataSets": [
{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/document_approval_duration\"",
"aggregation": {
"alignmentPeriod": "3600s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
},
"plotType": "LINE"
}
]
}
}
}
]
}
}
Custom Metrics Implementation
// src/monitoring/metrics.service.ts
import { Injectable } from '@nestjs/common';
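import { MetricServiceClient } from '@google-cloud/monitoring';
// Note: the logging.googleapis.com/user/ prefix used below is the namespace of
// log-based metrics. Metrics written directly with createTimeSeries conventionally
// use the custom.googleapis.com/ prefix, so confirm which mechanism populates each metric.
@Injectable()
export class MetricsService {
private readonly client: MetricServiceClient;
private readonly projectId: string;
constructor() {
this.client = new MetricServiceClient();
// Fail fast when the project ID is missing instead of sending malformed requests.
const projectId = process.env.GCP_PROJECT_ID;
if (!projectId) {
throw new Error('GCP_PROJECT_ID is not set');
}
this.projectId = projectId;
}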
/**
* Record document signature event
* @compliance FDA 21 CFR Part 11 - Electronic signatures tracking
*/
async recordDocumentSigned(documentId: string, userId: string, orgId: string): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
int64Value: 1,
},
};
const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/qms_document_signed',
labels: {
document_id: documentId,
organization_id: orgId,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};
await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}
/**
* Record CAPA resolution duration
* @compliance SOC 2 - Performance monitoring
*/
async recordCapaResolution(capaId: string, durationHours: number): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
doubleValue: durationHours,
},
};
const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/capa_resolution_duration',
labels: {
capa_id: capaId,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};
await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}
/**
* Record audit event count
* @compliance FDA 21 CFR Part 11 §11.10(e) - Audit trail
*/
async recordAuditEvent(eventType: string, userId: string, resourceType: string): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
int64Value: 1,
},
};
const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/audit_event_count',
labels: {
event_type: eventType,
resource_type: resourceType,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};
await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}
/**
* Update active session count
*/
async updateActiveSessions(count: number): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
int64Value: count,
},
};
const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/active_sessions',
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};
await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}
/**
* Record document approval workflow duration
*/
async recordApprovalDuration(workflowId: string, durationHours: number): Promise<void> {
const dataPoint = {
interval: {
endTime: {
seconds: Math.floor(Date.now() / 1000),
},
},
value: {
doubleValue: durationHours,
},
};
const timeSeries = {
metric: {
type: 'logging.googleapis.com/user/document_approval_duration',
labels: {
workflow_id: workflowId,
},
},
resource: {
type: 'cloud_run_revision',
labels: {
project_id: this.projectId,
service_name: 'bio-qms-api',
},
},
points: [dataPoint],
};
await this.client.createTimeSeries({
name: this.client.projectPath(this.projectId),
timeSeries: [timeSeries],
});
}
}
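Each `record*` method above assembles the same payload by hand, differing only in metric name, labels, and value. A possible refactoring is sketched below as a pure function so that it can be unit tested without a Monitoring client; `buildTimeSeries`, `PointValue`, and `TimeSeriesPayload` are hypothetical names, and the project and CAPA identifiers are placeholders.

```typescript
// Hypothetical refactoring sketch: one pure function builds the payload that
// the record* methods above construct individually. It performs no API calls.
type PointValue = { int64Value: number } | { doubleValue: number };

interface TimeSeriesPayload {
  metric: { type: string; labels: Record<string, string> };
  resource: { type: string; labels: Record<string, string> };
  points: Array<{ interval: { endTime: { seconds: number } }; value: PointValue }>;
}

function buildTimeSeries(
  projectId: string,
  metricName: string,
  value: PointValue,
  labels: Record<string, string> = {},
  now: Date = new Date(),
): TimeSeriesPayload {
  return {
    // Same metric prefix and monitored resource as the service above.
    metric: { type: `logging.googleapis.com/user/${metricName}`, labels },
    resource: {
      type: 'cloud_run_revision',
      labels: { project_id: projectId, service_name: 'bio-qms-api' },
    },
    points: [
      {
        interval: { endTime: { seconds: Math.floor(now.getTime() / 1000) } },
        value,
      },
    ],
  };
}

const capaSeries = buildTimeSeries(
  'example-project', // placeholder project ID
  'capa_resolution_duration',
  { doubleValue: 36.5 },
  { capa_id: 'CAPA-0001' }, // placeholder CAPA ID
  new Date('2024-01-01T00:00:00Z'),
);
console.log(JSON.stringify(capaSeries, null, 2));
```

Passing `now` as a parameter keeps the function deterministic in tests; the service methods would then reduce to a call to this builder followed by `createTimeSeries`.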
SLO Definitions
# slo-definitions.yaml
# Service Level Objectives for BIO-QMS Platform
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
name: api-availability-slo
spec:
displayName: "API Availability SLO"
serviceLevelIndicator:
requestBased:
goodTotalRatio:
totalServiceFilter: >
resource.type="cloud_run_revision"
AND resource.labels.service_name="bio-qms-api"
AND metric.type="run.googleapis.com/request_count"
goodServiceFilter: >
resource.type="cloud_run_revision"
AND resource.labels.service_name="bio-qms-api"
AND metric.type="run.googleapis.com/request_count"
AND metric.labels.response_code_class!="5xx"
goal: 0.999 # 99.9% availability
rollingPeriod: "2592000s" # 30 days
complianceNote: "SOC 2 CC7.1 - System availability commitment"
---
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
name: api-latency-slo
spec:
displayName: "API Latency SLO (p95 < 200ms)"
serviceLevelIndicator:
requestBased:
distributionCut:
distributionFilter: >
resource.type="cloud_run_revision"
AND metric.type="run.googleapis.com/request_latencies"
range:
max: 200.0 # milliseconds
goal: 0.95 # 95% of requests under 200ms
rollingPeriod: "2592000s"
complianceNote: "SOC 2 CC7.2 - Performance monitoring"
---
apiVersion: monitoring.googleapis.com/v1
kind: ServiceLevelObjective
metadata:
name: database-query-latency-slo
spec:
displayName: "Database Query Latency SLO (p95 < 100ms)"
serviceLevelIndicator:
requestBased:
distributionCut:
distributionFilter: >
metric.type="logging.googleapis.com/user/prisma_query_duration"
range:
max: 100.0
goal: 0.95
rollingPeriod: "2592000s"
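An SLO goal is easier to reason about when expressed as an error budget. The sketch below converts the goal and rolling period from the definitions above into a budget; `errorBudget` and `budgetConsumed` are hypothetical helpers, and the downtime figure is only an intuition for a request-based SLO because it assumes a total outage with uniform traffic.

```typescript
// Sketch: translate an SLO goal and rolling period into an error budget.
interface ErrorBudget {
  allowedBadFraction: number; // share of requests that may fail
  allowedDowntimeMinutes: number; // equivalent full-outage time in the period
}

function errorBudget(goal: number, rollingPeriodSeconds: number): ErrorBudget {
  if (goal <= 0 || goal >= 1) {
    throw new Error('goal must be strictly between 0 and 1');
  }
  const allowedBadFraction = 1 - goal;
  return {
    allowedBadFraction,
    allowedDowntimeMinutes: (allowedBadFraction * rollingPeriodSeconds) / 60,
  };
}

// Fraction of the budget already spent, given request counts in the window.
function budgetConsumed(goal: number, totalRequests: number, badRequests: number): number {
  if (totalRequests === 0) return 0;
  const allowedBad = (1 - goal) * totalRequests;
  return badRequests / allowedBad;
}

// API availability SLO: 99.9% over 30 days (2592000 seconds).
const availabilityBudget = errorBudget(0.999, 2592000);
console.log(availabilityBudget.allowedDowntimeMinutes.toFixed(1)); // "43.2"

// 250 failed requests out of 1,000,000 spends a quarter of the budget.
console.log(budgetConsumed(0.999, 1000000, 250).toFixed(2)); // "0.25"
```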
E.3.2: Alerting Policies
Alert Severity Levels
| Severity | Response Time | Channels | Escalation |
|---|---|---|---|
| Critical | Immediate (5 min) | PagerDuty, Slack, Email | On-call engineer → Manager (15 min) |
| Warning | 30 minutes | Slack, Email | Team channel → On-call (60 min) |
| Informational | Next business day | None | |
Critical Alerts
Alert 1: API Service Down
# alerts/critical/api-down.yaml
displayName: "CRITICAL: API Service Down"
documentation:
content: |
The BIO-QMS API service is completely down.
**Compliance Impact:** FDA 21 CFR Part 11, HIPAA - System unavailable
**Runbook:** https://wiki.bioqms.com/runbooks/api-down
**Steps:**
1. Check Cloud Run service status: gcloud run services describe bio-qms-api
2. Check recent deployments: gcloud run revisions list
3. Review logs: gcloud logging read "resource.type=cloud_run_revision" --limit 50
4. Verify database connectivity
5. If necessary, rollback to previous revision
mimeType: text/markdown
conditions:
- displayName: "No successful requests in 5 minutes"
conditionThreshold:
filter: |
resource.type = "cloud_run_revision"
AND resource.labels.service_name = "bio-qms-api"
AND metric.type = "run.googleapis.com/request_count"
AND metric.labels.response_code_class = "2xx"
comparison: COMPARISON_LT
thresholdValue: 1
duration: 300s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_DELTA # Count per minute, so "< 1" means no successful requests
crossSeriesReducer: REDUCE_SUM
notificationChannels:
- projects/bio-qms-prod/notificationChannels/pagerduty-critical
- projects/bio-qms-prod/notificationChannels/slack-incidents
- projects/bio-qms-prod/notificationChannels/email-oncall
alertStrategy:
autoClose: 1800s # 30 minutes
notificationRateLimit:
period: 300s # At most one notification every 5 minutes (a rate limit, not a re-alert timer)
Alert 2: Database Unreachable
# alerts/critical/database-unreachable.yaml
displayName: "CRITICAL: Database Unreachable"
documentation:
content: |
Cloud SQL database is unreachable or connection pool exhausted.
**Compliance Impact:** FDA 21 CFR Part 11 - Data integrity risk
**Runbook:** https://wiki.bioqms.com/runbooks/database-unreachable
**Steps:**
1. Check Cloud SQL instance status
2. Verify network connectivity
3. Check connection pool metrics
4. Review database error logs
5. Consider scaling instance if connection pool exhausted
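conditions:
- displayName: "Database connection errors > 10/min"
conditionThreshold:
# A threshold condition needs a metric filter rather than a log filter.
# This assumes a log-based counter metric named database_connection_errors
# (hypothetical) defined from the log filter:
# resource.type="cloud_run_revision" AND jsonPayload.level="error"
# AND jsonPayload.context.error=~".*database.*connection.*"
filter: |
resource.type = "cloud_run_revision"
AND metric.type = "logging.googleapis.com/user/database_connection_errors"
comparison: COMPARISON_GT
thresholdValue: 10
duration: 60s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_DELTA # Errors per minute, matching the 10/min threshold
crossSeriesReducer: REDUCE_SUM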
notificationChannels:
- projects/bio-qms-prod/notificationChannels/pagerduty-critical
- projects/bio-qms-prod/notificationChannels/slack-incidents
Alert 3: Certificate Expiry
# alerts/critical/certificate-expiry.yaml
displayName: "CRITICAL: TLS Certificate Expiring Soon"
documentation:
content: |
TLS certificate expires in less than 7 days.
**Compliance Impact:** HIPAA Security Rule - Encryption in transit
**Runbook:** https://wiki.bioqms.com/runbooks/certificate-renewal
conditions:
- displayName: "Certificate expires in < 7 days"
conditionThreshold:
# Assumes an uptime check is configured for the API endpoint. Uptime checks
# report the remaining certificate lifetime in days.
filter: |
resource.type = "uptime_url"
AND metric.type = "monitoring.googleapis.com/uptime_check/time_until_ssl_cert_expires"
comparison: COMPARISON_LT
thresholdValue: 7 # days
duration: 0s
aggregations:
- alignmentPeriod: 3600s
perSeriesAligner: ALIGN_MIN
notificationChannels:
- projects/bio-qms-prod/notificationChannels/pagerduty-critical
- projects/bio-qms-prod/notificationChannels/email-security-team
Alert 4: High Error Rate
# alerts/critical/high-error-rate.yaml
displayName: "CRITICAL: High API Error Rate (>5%)"
documentation:
content: |
API error rate exceeds 5%.
**Compliance Impact:** SOC 2 - Service availability degradation
**Runbook:** https://wiki.bioqms.com/runbooks/high-error-rate
conditions:
- displayName: "5xx errors > 5% for 5 minutes"
conditionThreshold:
filter: |
resource.type = "cloud_run_revision"
AND metric.type = "run.googleapis.com/request_count"
AND metric.labels.response_code_class = "5xx"
comparison: COMPARISON_GT
thresholdValue: 0.05
duration: 300s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_RATE
crossSeriesReducer: REDUCE_SUM
denominatorFilter: |
resource.type = "cloud_run_revision"
AND metric.type = "run.googleapis.com/request_count"
denominatorAggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_RATE
crossSeriesReducer: REDUCE_SUM
notificationChannels:
- projects/bio-qms-prod/notificationChannels/pagerduty-critical
- projects/bio-qms-prod/notificationChannels/slack-incidents
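The error-rate alerts combine a ratio with a `duration`: the ratio must stay above the threshold for every aligned point in the window before an incident opens. A simplified sketch of that evaluation follows, assuming one aligned point per `alignmentPeriod`; `isSustainedBreach` and `AlignedPoint` are hypothetical names and the real evaluation happens inside Cloud Monitoring.

```typescript
// Simplified sketch of threshold-with-duration semantics: the condition is met
// only when every aligned point inside the duration window breaches the threshold.
interface AlignedPoint {
  errors: number; // 5xx requests in the alignment period
  total: number; // all requests in the alignment period
}

function errorRatio(p: AlignedPoint): number {
  return p.total === 0 ? 0 : p.errors / p.total;
}

function isSustainedBreach(
  points: AlignedPoint[], // oldest first, one per alignment period
  threshold: number,
  durationSeconds: number,
  alignmentPeriodSeconds: number,
): boolean {
  const needed = Math.ceil(durationSeconds / alignmentPeriodSeconds);
  if (points.length < needed) return false;
  return points.slice(-needed).every((p) => errorRatio(p) > threshold);
}

// Five consecutive minutes at 6% errors breach the 5% critical threshold.
const sustained: AlignedPoint[] = Array.from({ length: 5 }, () => ({ errors: 6, total: 100 }));
console.log(isSustainedBreach(sustained, 0.05, 300, 60)); // true

// A single minute back at 2% resets the window.
const recovered: AlignedPoint[] = [...sustained.slice(0, 4), { errors: 2, total: 100 }];
console.log(isSustainedBreach(recovered, 0.05, 300, 60)); // false
```

This is why a short error spike does not page the on-call engineer: a single healthy minute inside the window prevents the condition from being met.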
Warning Alerts
Alert 5: Elevated Error Rate
# alerts/warning/elevated-error-rate.yaml
displayName: "WARNING: Elevated Error Rate (>1%)"
conditions:
- displayName: "5xx errors > 1% for 10 minutes"
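conditionThreshold:
filter: |
resource.type = "cloud_run_revision"
AND metric.type = "run.googleapis.com/request_count"
AND metric.labels.response_code_class = "5xx"
comparison: COMPARISON_GT
thresholdValue: 0.01
duration: 600s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_RATE
crossSeriesReducer: REDUCE_SUM
# Without a denominator the threshold would compare the raw 5xx rate, not a ratio.
denominatorFilter: |
resource.type = "cloud_run_revision"
AND metric.type = "run.googleapis.com/request_count"
denominatorAggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_RATE
crossSeriesReducer: REDUCE_SUM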
notificationChannels:
- projects/bio-qms-prod/notificationChannels/slack-monitoring
- projects/bio-qms-prod/notificationChannels/email-team
Alert 6: High Latency
# alerts/warning/high-latency.yaml
displayName: "WARNING: High API Latency (p95 > 500ms)"
conditions:
- displayName: "p95 latency > 500ms for 10 minutes"
conditionThreshold:
filter: |
resource.type = "cloud_run_revision"
AND metric.type = "run.googleapis.com/request_latencies"
comparison: COMPARISON_GT
thresholdValue: 500.0
duration: 600s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_DELTA
crossSeriesReducer: REDUCE_PERCENTILE_95
notificationChannels:
- projects/bio-qms-prod/notificationChannels/slack-monitoring
Alert 7: High Disk Usage
# alerts/warning/high-disk-usage.yaml
displayName: "WARNING: Database Disk Usage > 80%"
conditions:
- displayName: "Disk usage > 80%"
conditionThreshold:
filter: |
resource.type = "cloudsql_database"
AND metric.type = "cloudsql.googleapis.com/database/disk/utilization"
comparison: COMPARISON_GT
thresholdValue: 0.8
duration: 300s
notificationChannels:
- projects/bio-qms-prod/notificationChannels/slack-monitoring
- projects/bio-qms-prod/notificationChannels/email-devops
Alert 8: Low Cache Hit Rate
# alerts/warning/low-cache-hit-rate.yaml
displayName: "WARNING: Cache Hit Rate < 60%"
conditions:
- displayName: "Cache hit rate < 60% for 30 minutes"
conditionThreshold:
filter: |
resource.type = "redis_instance"
AND metric.type = "redis.googleapis.com/stats/cache_hit_ratio"
comparison: COMPARISON_LT
thresholdValue: 0.6
duration: 1800s
notificationChannels:
- projects/bio-qms-prod/notificationChannels/slack-monitoring
Notification Channels Configuration
// src/monitoring/notification-channels.service.ts
import { Injectable } from '@nestjs/common';
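import { NotificationChannelServiceClient } from '@google-cloud/monitoring';
@Injectable()
export class NotificationChannelsService {
private readonly client: NotificationChannelServiceClient;
private readonly projectId: string;
constructor() {
this.client = new NotificationChannelServiceClient();
// Fail fast when the project ID is missing instead of sending malformed requests.
const projectId = process.env.GCP_PROJECT_ID;
if (!projectId) {
throw new Error('GCP_PROJECT_ID is not set');
}
this.projectId = projectId;
}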
async createPagerDutyChannel(): Promise<string> {
const [channel] = await this.client.createNotificationChannel({
name: this.client.projectPath(this.projectId),
notificationChannel: {
type: 'pagerduty',
displayName: 'PagerDuty - Critical Incidents',
labels: {
service_key: process.env.PAGERDUTY_SERVICE_KEY,
},
enabled: true,
},
});
return channel.name;
}
async createSlackChannel(webhookUrl: string, channelName: string): Promise<string> {
const [channel] = await this.client.createNotificationChannel({
name: this.client.projectPath(this.projectId),
notificationChannel: {
type: 'slack',
displayName: `Slack - ${channelName}`,
labels: {
url: webhookUrl,
channel_name: channelName,
},
enabled: true,
},
});
return channel.name;
}
async createEmailChannel(emailAddress: string, displayName: string): Promise<string> {
const [channel] = await this.client.createNotificationChannel({
name: this.client.projectPath(this.projectId),
notificationChannel: {
type: 'email',
displayName: displayName,
labels: {
email_address: emailAddress,
},
enabled: true,
},
});
return channel.name;
}
}
Alert Policy Deployment Script
// scripts/deploy-alert-policies.ts
import { AlertPolicyServiceClient } from '@google-cloud/monitoring';
import * as fs from 'fs';
import * as path from 'path';
import * as yaml from 'js-yaml';
async function deployAlertPolicies() {
const client = new AlertPolicyServiceClient();
const projectId = process.env.GCP_PROJECT_ID;
if (!projectId) {
throw new Error('GCP_PROJECT_ID is not set');
}
const alertsDir = path.join(__dirname, '../config/alerts');
const categories = ['critical', 'warning', 'informational'];
for (const category of categories) {
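const categoryDir = path.join(alertsDir, category);
// Skip categories that have no directory yet (for example, informational).
if (!fs.existsSync(categoryDir)) continue;
const files = fs.readdirSync(categoryDir).filter(f => f.endsWith('.yaml'));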
for (const file of files) {
const filePath = path.join(categoryDir, file);
const content = fs.readFileSync(filePath, 'utf8');
const policy = yaml.load(content) as any;
console.log(`Deploying ${category} alert: ${policy.displayName}`);
try {
const [createdPolicy] = await client.createAlertPolicy({
name: client.projectPath(projectId),
alertPolicy: policy,
});
console.log(`✓ Created: ${createdPolicy.name}`);
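} catch (error) {
// `error` is typed `unknown` under strict TypeScript, so narrow it before reading `.message`.
const message = error instanceof Error ? error.message : String(error);
console.error(`✗ Failed to create ${file}:`, message);
}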
}
}
}
deployAlertPolicies().catch(console.error);
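The script above calls `createAlertPolicy` on every run, so repeated deployments would likely leave several policies with the same display name. One way to make the deployment repeatable is to compare the desired policies against the existing ones first. The sketch below shows only that planning step as a pure function; `planDeployment`, `PolicyRef`, and `Action` are hypothetical names, matching on `displayName` is an assumption about how policies are identified, and the existing list would presumably come from the client's `listAlertPolicies` call.

```typescript
// Hypothetical planning step for an idempotent deployment: decide, per desired
// policy, whether to create it or update the existing policy with the same
// display name. No API calls are made here.
interface PolicyRef {
  name?: string; // resource name assigned by Cloud Monitoring, absent before creation
  displayName: string;
}

type Action =
  | { kind: 'create'; policy: PolicyRef }
  | { kind: 'update'; policy: PolicyRef };

function planDeployment(existing: PolicyRef[], desired: PolicyRef[]): Action[] {
  const nameByDisplayName = new Map<string, string>();
  for (const p of existing) {
    if (p.name) nameByDisplayName.set(p.displayName, p.name);
  }
  return desired.map((policy): Action => {
    const name = nameByDisplayName.get(policy.displayName);
    return name
      ? { kind: 'update', policy: { ...policy, name } }
      : { kind: 'create', policy };
  });
}

const plan = planDeployment(
  [{ name: 'projects/example-project/alertPolicies/123', displayName: 'CRITICAL: API Service Down' }],
  [
    { displayName: 'CRITICAL: API Service Down' },
    { displayName: 'CRITICAL: Database Unreachable' },
  ],
);
console.log(plan.map((a) => `${a.kind}: ${a.policy.displayName}`));
```

Each `update` action carries the existing resource name, which is what an update call needs to target the right policy; `create` actions proceed as in the script above.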
E.3.3: Structured Logging with Cloud Logging
Logging Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
├─────────────────────────────────────────────────────────────┤
│ NestJS Logger → Winston → Logging Interceptor │
│ ↓ │
│ JSON Structured Logs + Correlation IDs │
└────────────────────┬────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Google Cloud Logging │
├──────────────────┬──────────────────┬──────────────────────┤
│ Hot Storage │ BigQuery Export │ Cloud Storage │
│ (30 days) │ (1 year) │ (Long-term) │
└──────────────────┴──────────────────┴──────────────────────┘
Log Structure
// src/logging/interfaces/structured-log.interface.ts
export interface StructuredLog {
// Standard fields
timestamp: string; // ISO 8601 UTC
severity: LogSeverity; // DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY
message: string;
// Request context
request_id: string; // Unique per request (UUID v4)
trace_id?: string; // OpenTelemetry trace ID
span_id?: string; // OpenTelemetry span ID
// User context
user_id?: string; // Authenticated user ID
org_id?: string; // Organization/tenant ID
session_id?: string; // Session identifier
// Application context
service: string; // 'bio-qms-api'
environment: string; // 'production', 'staging', 'development'
version: string; // Application version (from package.json)
// Action context
action: string; // API endpoint or operation
resource_type?: string; // 'document', 'capa', 'training', etc.
resource_id?: string; // ID of affected resource
// Performance metrics
duration_ms?: number; // Operation duration
// Error context (if severity >= ERROR)
error?: {
name: string;
message: string;
stack?: string;
code?: string;
};
// Compliance metadata
compliance?: {
regulation: string[]; // ['FDA-21-CFR-Part-11', 'HIPAA', 'SOC-2']
audit_event_type?: string; // 'electronic_signature', 'data_modification', etc.
pii_logged: boolean; // Flag if PII is in logs
};
// Additional context
metadata?: Record<string, any>;
}
export enum LogSeverity {
DEBUG = 'DEBUG',
INFO = 'INFO',
NOTICE = 'NOTICE',
WARNING = 'WARNING',
ERROR = 'ERROR',
CRITICAL = 'CRITICAL',
ALERT = 'ALERT',
EMERGENCY = 'EMERGENCY',
}
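The interface comment "if severity >= ERROR" implies an ordering, but the enum values are strings, so a direct `>=` comparison would sort them alphabetically (for example, `'ERROR' >= 'CRITICAL'` is true). A minimal sketch of an explicit ranking follows; `SEVERITY_ORDER`, `severityRank`, and `isAtLeast` are hypothetical names, and the sketch uses plain string literals so that it stands alone.

```typescript
// Severity names in increasing order of urgency, in the same order as the enum above.
const SEVERITY_ORDER = [
  'DEBUG',
  'INFO',
  'NOTICE',
  'WARNING',
  'ERROR',
  'CRITICAL',
  'ALERT',
  'EMERGENCY',
] as const;

type SeverityName = (typeof SEVERITY_ORDER)[number];

// Position in the ordered list; higher means more severe.
function severityRank(severity: SeverityName): number {
  return SEVERITY_ORDER.indexOf(severity);
}

// True when `severity` is at least as severe as `floor`.
function isAtLeast(severity: SeverityName, floor: SeverityName): boolean {
  return severityRank(severity) >= severityRank(floor);
}

console.log(isAtLeast('CRITICAL', 'ERROR')); // true
console.log(isAtLeast('WARNING', 'ERROR')); // false
```

A logger could use `isAtLeast(severity, 'ERROR')` to decide whether the `error` block of a structured log entry must be populated.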
Winston Logger Configuration
// src/logging/winston.config.ts
import * as winston from 'winston';
import { LoggingWinston } from '@google-cloud/logging-winston';
const loggingWinston = new LoggingWinston({
projectId: process.env.GCP_PROJECT_ID,
keyFilename: process.env.GCP_KEY_FILE,
serviceContext: {
service: 'bio-qms-api',
version: process.env.APP_VERSION || '1.0.0',
},
});
export const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(), // Defaults to an ISO 8601 UTC string, as StructuredLog.timestamp requires
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
},
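transports: [
// Cloud Logging transport (production only)
...(process.env.NODE_ENV === 'production' ? [loggingWinston] : []),
// Console transport (all other environments); enabling both in production
// would duplicate entries, because Cloud Run also forwards stdout to Cloud Logging
...(process.env.NODE_ENV !== 'production'
? [
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
),
}),
]
: []),
],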
});
NestJS Logging Interceptor
// src/logging/logging.interceptor.ts
import {
Injectable,
NestInterceptor,
ExecutionContext,
CallHandler,
Logger,
} from '@nestjs/common';
import { Observable, throwError } from 'rxjs';
import { tap, catchError } from 'rxjs/operators';
import { v4 as uuidv4 } from 'uuid';
import { logger } from './winston.config';
import { StructuredLog, LogSeverity } from './interfaces/structured-log.interface';
@Injectable()
export class LoggingInterceptor implements NestInterceptor {
private readonly nestLogger = new Logger(LoggingInterceptor.name);
intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
const request = context.switchToHttp().getRequest();
const response = context.switchToHttp().getResponse();
// Generate or extract correlation IDs
const requestId = request.headers['x-request-id'] || uuidv4();
const traceId = request.headers['x-cloud-trace-context']?.split('/')[0];
// Attach to request for downstream use
request.requestId = requestId;
request.traceId = traceId;
// Set response header
response.setHeader('X-Request-ID', requestId);
const startTime = Date.now();
const { method, url, body, query, params } = request;
const userId = request.user?.id;
const orgId = request.user?.organizationId;
const sessionId = request.session?.id;
// Log incoming request
this.logRequest(requestId, traceId, method, url, userId, orgId);
return next.handle().pipe(
tap((data) => {
const duration = Date.now() - startTime;
this.logResponse(
requestId,
traceId,
method,
url,
response.statusCode,
duration,
userId,
orgId,
);
}),
catchError((error) => {
const duration = Date.now() - startTime;
this.logError(
requestId,
traceId,
method,
url,
error,
duration,
userId,
orgId,
);
return throwError(() => error);
}),
);
}
private logRequest(
requestId: string,
traceId: string | undefined,
method: string,
url: string,
userId?: string,
orgId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.INFO,
message: `Incoming ${method} ${url}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
org_id: orgId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: `${method} ${url}`,
compliance: {
regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
pii_logged: false,
},
};
logger.info(log);
}
private logResponse(
requestId: string,
traceId: string | undefined,
method: string,
url: string,
statusCode: number,
duration: number,
userId?: string,
orgId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: statusCode >= 400 ? LogSeverity.WARNING : LogSeverity.INFO,
message: `${method} ${url} ${statusCode}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
org_id: orgId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: `${method} ${url}`,
duration_ms: duration,
metadata: {
status_code: statusCode,
},
compliance: {
regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
pii_logged: false,
},
};
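// Keep the winston level consistent with the severity field for 4xx/5xx responses
if (statusCode >= 400) {
logger.warn(log);
} else {
logger.info(log);
}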
}
private logError(
requestId: string,
traceId: string | undefined,
method: string,
url: string,
error: any,
duration: number,
userId?: string,
orgId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.ERROR,
message: `${method} ${url} failed: ${error.message}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
org_id: orgId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: `${method} ${url}`,
duration_ms: duration,
error: {
name: error.name,
message: error.message,
stack: error.stack,
code: error.code,
},
compliance: {
regulation: ['FDA-21-CFR-Part-11', 'HIPAA'],
pii_logged: false,
},
};
logger.error(log);
}
}
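The interceptor takes everything before the first `/` in `X-Cloud-Trace-Context` as the trace ID without validating it. The header is documented as `TRACE_ID/SPAN_ID;o=OPTIONS`, where the trace ID is 32 hexadecimal characters. A minimal sketch of a stricter parser follows; `parseCloudTraceContext` and its return shape are hypothetical.

```typescript
// Hypothetical helper; the name and return shape are illustrative assumptions.
interface CloudTraceContext {
  traceId: string;   // 32-character hexadecimal trace ID
  spanId?: string;   // decimal span ID, when present
  sampled: boolean;  // true when the header carries o=1
}

function parseCloudTraceContext(header: string | undefined): CloudTraceContext | undefined {
  if (!header) return undefined;
  // Expected shape: TRACE_ID/SPAN_ID;o=OPTIONS (span ID and options are optional)
  const match = /^([0-9a-fA-F]{32})(?:\/(\d+))?(?:;o=(\d))?$/.exec(header.trim());
  if (!match) return undefined;
  return {
    traceId: match[1].toLowerCase(),
    spanId: match[2],
    sampled: match[3] === '1',
  };
}
```

Returning `undefined` for malformed input keeps arbitrary client-supplied strings out of the `trace_id` field of the audit trail.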
Audit Logging Service
// src/logging/audit-log.service.ts
import { Injectable } from '@nestjs/common';
import { logger } from './winston.config';
import { StructuredLog, LogSeverity } from './interfaces/structured-log.interface';
export enum AuditEventType {
ELECTRONIC_SIGNATURE = 'electronic_signature',
DATA_MODIFICATION = 'data_modification',
DATA_DELETION = 'data_deletion',
USER_LOGIN = 'user_login',
USER_LOGOUT = 'user_logout',
FAILED_LOGIN = 'failed_login',
PASSWORD_CHANGE = 'password_change',
PERMISSION_CHANGE = 'permission_change',
DOCUMENT_APPROVAL = 'document_approval',
CAPA_STATUS_CHANGE = 'capa_status_change',
TRAINING_COMPLETION = 'training_completion',
SYSTEM_CONFIGURATION_CHANGE = 'system_configuration_change',
}
/**
* Audit logging service for FDA 21 CFR Part 11 compliance
* @compliance FDA 21 CFR Part 11 §11.10(e)
*/
@Injectable()
export class AuditLogService {
/**
* Log electronic signature event
* @compliance FDA 21 CFR Part 11 §11.50, §11.70
*/
logElectronicSignature(
userId: string,
documentId: string,
signatureMeaning: string,
requestId: string,
traceId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.NOTICE,
message: `Electronic signature applied: ${signatureMeaning}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'electronic_signature',
resource_type: 'document',
resource_id: documentId,
compliance: {
regulation: ['FDA-21-CFR-Part-11'],
audit_event_type: AuditEventType.ELECTRONIC_SIGNATURE,
pii_logged: false,
},
metadata: {
signature_meaning: signatureMeaning,
},
};
logger.info(log);
}
/**
* Log data modification event
* @compliance FDA 21 CFR Part 11 §11.10(e)
*/
logDataModification(
userId: string,
resourceType: string,
resourceId: string,
changes: Record<string, any>,
requestId: string,
traceId?: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.NOTICE,
message: `Data modified: ${resourceType}/${resourceId}`,
request_id: requestId,
trace_id: traceId,
user_id: userId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'data_modification',
resource_type: resourceType,
resource_id: resourceId,
compliance: {
regulation: ['FDA-21-CFR-Part-11'],
audit_event_type: AuditEventType.DATA_MODIFICATION,
pii_logged: false,
},
metadata: {
changes: this.sanitizeChanges(changes),
},
};
logger.info(log);
}
/**
* Log failed login attempt
* @compliance HIPAA Security Rule §164.312(b)
*/
logFailedLogin(
username: string,
ipAddress: string,
reason: string,
requestId: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.WARNING,
message: `Failed login attempt: ${username}`,
request_id: requestId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'failed_login',
compliance: {
regulation: ['HIPAA', 'SOC-2'],
audit_event_type: AuditEventType.FAILED_LOGIN,
pii_logged: true, // Username may contain PII
},
metadata: {
username,
ip_address: ipAddress,
reason,
},
};
logger.warn(log);
}
/**
* Log successful user login
* @compliance HIPAA Security Rule §164.312(b)
*/
logUserLogin(
userId: string,
username: string,
ipAddress: string,
requestId: string,
): void {
const log: StructuredLog = {
timestamp: new Date().toISOString(),
severity: LogSeverity.INFO,
message: `User login successful: ${username}`,
request_id: requestId,
user_id: userId,
service: 'bio-qms-api',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
action: 'user_login',
compliance: {
regulation: ['HIPAA', 'SOC-2'],
audit_event_type: AuditEventType.USER_LOGIN,
pii_logged: true,
},
metadata: {
username,
ip_address: ipAddress,
},
};
logger.info(log);
}
/**
* Sanitize changes to remove sensitive data from logs
*/
private sanitizeChanges(changes: Record<string, any>): Record<string, any> {
const sanitized = { ...changes };
const sensitiveFields = ['password', 'ssn', 'credit_card', 'api_key', 'token'];
for (const field of sensitiveFields) {
if (field in sanitized) {
sanitized[field] = '[REDACTED]';
}
}
return sanitized;
}
}
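`sanitizeChanges` only inspects top-level keys and matches them case-sensitively, so a nested `{ owner: { Password: ... } }` would pass through unredacted. The sketch below shows one possible recursive variant under the same field list; `sanitizeDeep` is a hypothetical name, and the function assumes plain JSON-like input (objects, arrays, primitives).

```typescript
// Hypothetical variant of sanitizeChanges; the field list is copied from the service above.
const SENSITIVE_FIELDS = new Set(['password', 'ssn', 'credit_card', 'api_key', 'token']);

function sanitizeDeep(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => sanitizeDeep(item));
  }
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      // Compare case-insensitively so "Password" and "API_KEY" are also caught.
      result[key] = SENSITIVE_FIELDS.has(key.toLowerCase())
        ? '[REDACTED]'
        : sanitizeDeep(inner);
    }
    return result;
  }
  // Primitives are returned unchanged.
  return value;
}
```

The function builds new objects rather than mutating its input, so the original change set can still be persisted after logging.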
Log Retention & Export Configuration
# config/log-retention.yaml
# Log retention and export configuration for compliance
sinks:
  # BigQuery export for long-term analysis (1 year)
  - name: bigquery-export
    destination: bigquery.googleapis.com/projects/bio-qms-prod/datasets/audit_logs
    # ":*" is the Logging query language test for "field is present"
    filter: |
      severity >= NOTICE
      OR jsonPayload.compliance.audit_event_type:*
    bigqueryOptions:
      usePartitionedTables: true
      usesTimestampColumnPartitioning: true
  # Cloud Storage archive (7 years for FDA compliance)
  - name: gcs-archive
    destination: storage.googleapis.com/bio-qms-audit-logs-archive
    filter: |
      jsonPayload.compliance.regulation =~ ".*FDA.*"
      OR jsonPayload.compliance.audit_event_type:*
    includeChildren: true
  # Security events to dedicated dataset
  - name: security-events
    destination: bigquery.googleapis.com/projects/bio-qms-prod/datasets/security_logs
    filter: |
      jsonPayload.compliance.audit_event_type = "failed_login"
      OR jsonPayload.compliance.audit_event_type = "permission_change"
      OR severity >= ERROR
exclusions:
  # Exclude health check logs from long-term storage
  - name: exclude-health-checks
    filter: |
      jsonPayload.action = "GET /health"
      OR jsonPayload.action = "GET /readiness"
retention:
  # Hot storage: 30 days in Cloud Logging
  default: 30d
  # Compliance buckets: extended retention
  audit_logs: 2555d # 7 years (FDA requirement)
  security_logs: 2555d
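The `2555d` value is 7 × 365 and does not count leap days, so it falls one or two days short of seven calendar years. A small sketch for checking a retention value against calendar years is shown below; `retentionDays` is a hypothetical helper, and whether the shortfall matters depends on how the retention obligation is worded.

```typescript
// Hypothetical helper: number of days between a start date and the same date N years later.
function retentionDays(startIsoDate: string, years: number): number {
  const start = new Date(`${startIsoDate}T00:00:00Z`);
  const end = new Date(start.getTime());
  end.setUTCFullYear(start.getUTCFullYear() + years);
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.round((end.getTime() - start.getTime()) / msPerDay);
}

// A log written on 2026-02-17 spans two leap days (2028 and 2032) within seven years.
const sevenYears = retentionDays('2026-02-17', 7);
console.log(sevenYears, sevenYears - 7 * 365);
```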
E.3.4: Distributed Tracing
OpenTelemetry Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application Code │
│ (NestJS Controllers, Services, Repositories) │
└────────────────────┬────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ OpenTelemetry Instrumentation │
├──────────────────┬──────────────────┬──────────────────────┤
│ HTTP Tracing │ Database │ External APIs │
│ (@opentelemetry │ (Prisma) │ (fetch, axios) │
│ /instrumentation│ │ │
│ -http) │ │ │
└──────────────────┴──────────────────┴──────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ OpenTelemetry Collector │
│ - Sampling (100% errors, 10% normal) │
│ - Batching │
│ - Enrichment (resource attributes) │
└────────────────────┬────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Google Cloud Trace │
│ - Trace storage & visualization │
│ - Latency analysis │
│ - Service dependency mapping │
└─────────────────────────────────────────────────────────────┘
OpenTelemetry Configuration
// src/tracing/tracing.config.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import {
ParentBasedSampler,
TraceIdRatioBasedSampler,
Sampler,
SamplingDecision,
SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator } from '@opentelemetry/core';
/**
* Custom head sampler: 100% for audit events, 10% for other root spans.
* The HTTP status code is not known when a span starts, so keeping 100% of
* failed requests has to happen after the fact (tail sampling in the Collector).
*/
class AdaptiveSampler implements Sampler {
// Respect the parent's decision; sample 10% of new root traces
private readonly fallback = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1),
});
shouldSample(
context: Context,
traceId: string,
spanName: string,
spanKind: SpanKind,
attributes: Attributes,
links: Link[],
): SamplingResult {
// Always sample audit events (the attribute must be set at span creation)
if (attributes['audit.event_type']) {
return { decision: SamplingDecision.RECORD_AND_SAMPLED };
}
// Use parent-based sampling for everything else
return this.fallback.shouldSample(context, traceId, spanName, spanKind, attributes, links);
}
toString(): string {
return 'AdaptiveSampler';
}
}
export const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'bio-qms-api',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
'service.namespace': 'bio-qms',
'cloud.provider': 'gcp',
'cloud.platform': 'gcp_cloud_run',
'cloud.region': process.env.GCP_REGION || 'us-central1',
}),
traceExporter: new TraceExporter({
projectId: process.env.GCP_PROJECT_ID,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': {
enabled: true,
// Hook form of the deprecated ignoreIncomingPaths option
ignoreIncomingRequestHook: (req) => ['/health', '/readiness'].includes(req.url ?? ''),
},
'@opentelemetry/instrumentation-express': {
enabled: true,
},
'@opentelemetry/instrumentation-pg': {
enabled: true,
// Attaches query parameter values to spans; confirm they cannot contain PHI/PII before enabling
enhancedDatabaseReporting: true,
},
'@opentelemetry/instrumentation-redis': {
enabled: true,
},
}),
],
sampler: new AdaptiveSampler(),
textMapPropagator: new CompositePropagator({
propagators: [
new W3CTraceContextPropagator(),
new W3CBaggagePropagator(),
],
}),
});
// Start tracing
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.error('Error terminating tracing', error))
.finally(() => process.exit(0));
});
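A head sampler decides when a span is created, using only the trace ID and the attributes supplied at that moment. The response status is not yet known, which is why the architecture above places the "100% errors" rule in the Collector. The sketch below is a simplified, hypothetical illustration of a deterministic ratio decision; it is not the OpenTelemetry implementation, and `sampleByTraceId` and `shouldSampleSpan` are illustrative names.

```typescript
// Simplified, hypothetical illustration of ratio-based head sampling.
function sampleByTraceId(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Interpret the last 8 hex characters of the trace ID as a 32-bit number.
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace when its bucket falls in the lowest `ratio` share of the range.
  return bucket < ratio * 0x100000000;
}

function shouldSampleSpan(
  traceId: string,
  attributes: Record<string, unknown>,
  ratio = 0.1,
): boolean {
  // Only attributes passed at span creation are visible to a head sampler.
  if (attributes['audit.event_type'] !== undefined) return true;
  return sampleByTraceId(traceId, ratio);
}
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict without coordination.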
Custom Span Creation
// src/tracing/tracing.service.ts
import { Injectable } from '@nestjs/common';
import { trace, context, SpanStatusCode, Span } from '@opentelemetry/api';
@Injectable()
export class TracingService {
private readonly tracer = trace.getTracer('bio-qms-api');
/**
* Create a custom span for business operations
*/
async withSpan<T>(
name: string,
operation: (span: Span) => Promise<T>,
attributes?: Record<string, any>,
): Promise<T> {
return this.tracer.startActiveSpan(name, async (span) => {
try {
if (attributes) {
span.setAttributes(attributes);
}
const result = await operation(span);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
/**
* Add event to current span
*/
addEvent(name: string, attributes?: Record<string, any>): void {
const currentSpan = trace.getActiveSpan();
if (currentSpan) {
currentSpan.addEvent(name, attributes);
}
}
/**
* Set attribute on current span
*/
setAttribute(key: string, value: any): void {
const currentSpan = trace.getActiveSpan();
if (currentSpan) {
currentSpan.setAttribute(key, value);
}
}
/**
* Get current trace ID (for log correlation)
*/
getCurrentTraceId(): string | undefined {
const currentSpan = trace.getActiveSpan();
return currentSpan?.spanContext().traceId;
}
/**
* Get current span ID (for log correlation)
*/
getCurrentSpanId(): string | undefined {
const currentSpan = trace.getActiveSpan();
return currentSpan?.spanContext().spanId;
}
}
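`getCurrentTraceId()` and `getCurrentSpanId()` return bare IDs. For Cloud Logging to link a log entry to its trace, the trace is usually expressed as `projects/PROJECT_ID/traces/TRACE_ID` under the special `logging.googleapis.com/*` keys. A minimal sketch of building those fields follows; `traceCorrelationFields` is a hypothetical helper, and `projectId` is assumed to come from `GCP_PROJECT_ID`.

```typescript
// Hypothetical helper; merges into the log metadata alongside trace_id.
function traceCorrelationFields(
  projectId: string,
  traceId: string | undefined,
  spanId: string | undefined,
  sampled = false,
): Record<string, string | boolean> {
  // Without a trace ID there is nothing to correlate.
  if (!traceId) return {};
  const fields: Record<string, string | boolean> = {
    'logging.googleapis.com/trace': `projects/${projectId}/traces/${traceId}`,
    'logging.googleapis.com/trace_sampled': sampled,
  };
  if (spanId) {
    fields['logging.googleapis.com/spanId'] = spanId;
  }
  return fields;
}
```

The result can be spread into a log object before it is passed to the logger, for example `{ ...log, ...traceCorrelationFields(projectId, traceId, spanId) }`.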
NestJS Tracing Interceptor
// src/tracing/tracing.interceptor.ts
import {
Injectable,
NestInterceptor,
ExecutionContext,
CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap, catchError, finalize } from 'rxjs/operators';
import { TracingService } from './tracing.service';
import { trace, SpanStatusCode } from '@opentelemetry/api';
@Injectable()
export class TracingInterceptor implements NestInterceptor {
constructor(private readonly tracingService: TracingService) {}
intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
const request = context.switchToHttp().getRequest();
const response = context.switchToHttp().getResponse();
const { method, url } = request;
const controllerName = context.getClass().name;
const handlerName = context.getHandler().name;
const tracer = trace.getTracer('bio-qms-api');
const spanName = `${controllerName}.${handlerName}`;
return tracer.startActiveSpan(spanName, (span) => {
// Set span attributes
span.setAttributes({
'http.method': method,
'http.url': url,
'http.route': request.route?.path,
'controller.name': controllerName,
'handler.name': handlerName,
'user.id': request.user?.id,
'org.id': request.user?.organizationId,
});
return next.handle().pipe(
tap(() => {
span.setStatus({ code: SpanStatusCode.OK });
// Record the status set on the response (e.g. 201 for POST) instead of a fixed 200
span.setAttribute('http.status_code', response.statusCode);
}),
catchError((error) => {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
span.setAttribute('http.status_code', error.status || 500);
throw error;
}),
// finalize runs on completion, error, and unsubscribe, so the span always ends
finalize(() => span.end()),
);
});
}
}
Prisma Query Tracing
// src/tracing/prisma-tracing.middleware.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { TracingService } from './tracing.service';
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
@Injectable()
export class PrismaTracingService implements OnModuleInit {
constructor(
private readonly prisma: PrismaClient,
private readonly tracingService: TracingService,
) {}
async onModuleInit() {
// Middleware for query tracing
// Note: $use is deprecated in newer Prisma versions in favor of client extensions
this.prisma.$use(async (params, next) => {
const tracer = trace.getTracer('bio-qms-api');
return tracer.startActiveSpan(
`prisma.${params.model}.${params.action}`,
{
kind: SpanKind.CLIENT,
attributes: {
'db.system': 'postgresql',
'db.name': process.env.DATABASE_NAME,
'db.operation': params.action,
'db.model': params.model,
},
},
async (span) => {
const startTime = Date.now();
try {
const result = await next(params);
const duration = Date.now() - startTime;
span.setAttribute('db.duration_ms', duration);
// Use the enum: numeric 0 is UNSET, not OK
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
throw error;
} finally {
span.end();
}
},
);
});
}
}
External API Call Tracing
// src/tracing/http-client.service.ts
import { Injectable } from '@nestjs/common';
// HttpService lives in @nestjs/axios since NestJS 8
import { HttpService } from '@nestjs/axios';
import { firstValueFrom } from 'rxjs';
import { TracingService } from './tracing.service';
import { trace, SpanKind, SpanStatusCode, propagation, context } from '@opentelemetry/api';
import { AxiosRequestConfig } from 'axios';
@Injectable()
export class TracedHttpService {
constructor(
private readonly httpService: HttpService,
private readonly tracingService: TracingService,
) {}
/**
* Make HTTP request with automatic tracing
*/
async request<T>(config: AxiosRequestConfig): Promise<T> {
const tracer = trace.getTracer('bio-qms-api');
const url = `${config.baseURL || ''}${config.url}`;
return tracer.startActiveSpan(
`HTTP ${config.method?.toUpperCase()} ${url}`,
{
kind: SpanKind.CLIENT,
attributes: {
'http.method': config.method?.toUpperCase(),
'http.url': url,
'http.target': config.url,
},
},
async (span) => {
try {
// Inject trace context into headers
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
config.headers = {
...config.headers,
...carrier,
};
// firstValueFrom replaces the deprecated toPromise()
const response = await firstValueFrom(this.httpService.request<T>(config));
span.setAttribute('http.status_code', response.status);
// Use the enum: numeric 0 is UNSET, not OK
span.setStatus({ code: SpanStatusCode.OK });
return response.data;
} catch (error) {
span.setAttribute('http.status_code', error.response?.status || 0);
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
throw error;
} finally {
span.end();
}
},
);
}
}
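With the W3C Trace Context propagator configured earlier, `propagation.inject` writes a `traceparent` header into the carrier. Its layout is `version-traceid-spanid-flags`, all lowercase hexadecimal. The sketch below formats and parses that header by hand to show what downstream services receive; `formatTraceParent` and `parseTraceParent` are hypothetical helpers, and in practice the propagator does this work.

```typescript
// Hypothetical helpers illustrating the W3C traceparent header layout.
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

function formatTraceParent({ traceId, spanId, sampled }: TraceParent): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceParent(header: string): TraceParent | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```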
Application Bootstrap with Tracing
// src/main.ts
import './tracing/tracing.config'; // MUST be first import
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { LoggingInterceptor } from './logging/logging.interceptor';
import { TracingInterceptor } from './tracing/tracing.interceptor';
async function bootstrap() {
const app = await NestFactory.create(AppModule);
// Apply global interceptors
app.useGlobalInterceptors(
app.get(LoggingInterceptor),
app.get(TracingInterceptor),
);
await app.listen(process.env.PORT || 8080);
console.log(`Application is running on: ${await app.getUrl()}`);
}
bootstrap();
Trace Analysis Queries
-- BigQuery queries for trace analysis (exported from Cloud Trace)
-- Query 1: p95 latency by endpoint
SELECT
span_name,
APPROX_QUANTILES(duration_ms, 100)[OFFSET(95)] AS p95_latency_ms,
COUNT(*) AS request_count
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND span_kind = 'SERVER'
GROUP BY span_name
ORDER BY p95_latency_ms DESC
LIMIT 20;
-- Query 2: Error rate by endpoint
SELECT
span_name,
COUNTIF(status_code = 2) AS error_count,
COUNT(*) AS total_count,
ROUND(COUNTIF(status_code = 2) / COUNT(*) * 100, 2) AS error_rate_pct
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND span_kind = 'SERVER'
GROUP BY span_name
HAVING error_count > 0
ORDER BY error_rate_pct DESC;
-- Query 3: Slowest database queries
SELECT
span_name,
AVG(duration_ms) AS avg_duration_ms,
MAX(duration_ms) AS max_duration_ms,
COUNT(*) AS execution_count
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND span_name LIKE 'prisma.%'
GROUP BY span_name
ORDER BY avg_duration_ms DESC
LIMIT 20;
-- Query 4: Trace dependency graph (service calls)
SELECT
parent_span_name,
span_name,
COUNT(*) AS call_count,
AVG(duration_ms) AS avg_duration_ms
FROM `bio-qms-prod.cloud_trace.traces`
WHERE
DATE(start_time) = CURRENT_DATE()
AND parent_span_id IS NOT NULL
GROUP BY parent_span_name, span_name
ORDER BY call_count DESC;
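For spot-checking the queries above against a small exported sample, the p95 and error-rate calculations can be reproduced locally. This is a minimal sketch using the nearest-rank percentile definition; `APPROX_QUANTILES` is approximate, so results on large data sets may differ slightly. `percentile` and `errorRatePct` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('values must not be empty');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Error rate as a percentage rounded to two decimals, mirroring Query 2.
function errorRatePct(errorCount: number, totalCount: number): number {
  if (totalCount === 0) return 0;
  return Math.round((errorCount * 10000) / totalCount) / 100;
}
```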
Compliance Mapping
FDA 21 CFR Part 11
| Requirement | Implementation | Evidence |
|---|---|---|
| §11.10(e) Audit trails | Structured logging with Cloud Logging, 7-year retention | AuditLogService, BigQuery exports |
| §11.50 Electronic signatures | logElectronicSignature() captures all signature events | Audit logs with electronic_signature event type |
| §11.70 Signature linking | Trace ID correlates signature to document modification | trace_id field in structured logs |
HIPAA Security Rule
| Requirement | Implementation | Evidence |
|---|---|---|
| §164.312(b) Audit controls | Cloud Logging with tamper-proof timestamps | Structured logs exported to BigQuery |
| §164.312(c)(1) Integrity controls | Hash verification in audit logs | metadata.data_hash in modification logs |
| §164.308(a)(1)(ii)(D) Information system activity review | Cloud Monitoring dashboards + alerts | Dashboard JSON configs, alert policies |
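The integrity row cites `metadata.data_hash`, but the logging code in this section does not show how that value is produced. One possible approach, sketched here as an assumption rather than the platform's actual method, is a SHA-256 digest over a canonical (key-sorted) JSON form of the change set. `canonicalJson` and `dataHash` are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

// Serialize with sorted object keys so equal content always yields the same string.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`)
      .join(',');
    return `{${body}}`;
  }
  // Primitives; undefined has no JSON form, so it is written as null.
  return JSON.stringify(value) ?? 'null';
}

// Hex-encoded SHA-256 digest of the canonical form (assumed hash choice).
function dataHash(changes: Record<string, unknown>): string {
  return createHash('sha256').update(canonicalJson(changes), 'utf8').digest('hex');
}
```

A reviewer could recompute the digest from the stored change set and compare it with the logged value to detect later modification of either copy.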
SOC 2
| Trust Service Criteria | Implementation | Evidence |
|---|---|---|
| CC7.2 System monitoring | Cloud Monitoring dashboards, 24/7 alerting | Dashboard configs, PagerDuty integration |
| CC7.3 Incident detection | Critical alerts with 5-minute response SLA | Alert policies with escalation |
| CC7.4 Incident response | Runbooks linked to alerts, incident tracking | Alert documentation fields |
| A1.2 Backup and recovery monitoring | Database replication lag alerts | Replication lag dashboard widget |
Deployment Instructions
1. Deploy Dashboards
#!/bin/bash
# scripts/deploy-dashboards.sh
PROJECT_ID="bio-qms-prod"
for dashboard in config/dashboards/*.json; do
echo "Deploying $(basename "$dashboard")..."
gcloud monitoring dashboards create --config-from-file="$dashboard" \
--project="$PROJECT_ID"
done
2. Deploy Alert Policies
#!/bin/bash
# scripts/deploy-alerts.sh
PROJECT_ID="bio-qms-prod"
# Create notification channels first
gcloud alpha monitoring channels create \
--display-name="PagerDuty - Critical" \
--type=pagerduty \
--channel-labels=service_key="$PAGERDUTY_KEY" \
--project="$PROJECT_ID"
# Deploy alert policies
for policy in config/alerts/*/*.yaml; do
echo "Deploying $(basename "$policy")..."
gcloud alpha monitoring policies create --policy-from-file="$policy" \
--project="$PROJECT_ID"
done
3. Configure Log Sinks
#!/bin/bash
# scripts/configure-log-sinks.sh
PROJECT_ID="bio-qms-prod"
# BigQuery sink
gcloud logging sinks create bigquery-audit-logs \
"bigquery.googleapis.com/projects/$PROJECT_ID/datasets/audit_logs" \
--log-filter='severity >= NOTICE OR jsonPayload.compliance.audit_event_type:*' \
--project="$PROJECT_ID"
# Cloud Storage archive
gcloud logging sinks create gcs-audit-archive \
storage.googleapis.com/bio-qms-audit-logs-archive \
--log-filter='jsonPayload.compliance.regulation =~ ".*FDA.*"' \
--project="$PROJECT_ID"
4. Enable OpenTelemetry
// package.json additions
{
"dependencies": {
"@google-cloud/opentelemetry-cloud-trace-exporter": "^2.3.0",
"@opentelemetry/api": "^1.8.0",
"@opentelemetry/sdk-node": "^0.49.1",
"@opentelemetry/auto-instrumentations-node": "^0.42.0",
"@opentelemetry/instrumentation-http": "^0.49.1",
"@opentelemetry/instrumentation-express": "^0.37.0",
"@opentelemetry/instrumentation-pg": "^0.40.0"
}
}
# Install dependencies
npm install
# Update main.ts to import tracing config first
# (see src/main.ts example above)
# Deploy with tracing enabled
gcloud run deploy bio-qms-api \
--image=us-central1-docker.pkg.dev/$PROJECT_ID/bio-qms/api:latest \
--region=us-central1 \
--set-env-vars="GCP_PROJECT_ID=$PROJECT_ID,GOOGLE_CLOUD_PROJECT=$PROJECT_ID,NODE_ENV=production"
Testing & Validation
1. Metrics Validation
# Test custom metric creation
curl -X POST https://api.bioqms.com/documents/123/sign \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"meaning": "Approved by QA Manager"}'
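# Verify metric in Cloud Monitoring (Monitoring API v3 timeSeries.list; GNU date syntax)
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="logging.googleapis.com/user/qms_document_signed"' \
--data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"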
2. Alert Testing
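# Trigger test alert (critical error rate)
for i in {1..100}; do
curl -s -o /dev/null https://api.bioqms.com/test/error &
done
wait # let all background requests finish
# Confirm the policy exists and is enabled; open incidents are shown on the
# Cloud Console Alerting page and in PagerDuty
gcloud alpha monitoring policies list \
--filter='displayName:"CRITICAL: High API Error Rate"' \
--format=json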
3. Log Verification
# Query structured logs
gcloud logging read "jsonPayload.request_id=\"$REQUEST_ID\"" \
--format=json \
--limit=10
# Verify audit log
gcloud logging read \
"jsonPayload.compliance.audit_event_type=\"electronic_signature\"" \
--limit=1 \
--format=json
4. Trace Verification
# Generate traced request
curl -X POST https://api.bioqms.com/documents \
-H "Authorization: Bearer $TOKEN" \
-v 2>&1 | grep -i trace
# View trace in Cloud Console
# https://console.cloud.google.com/traces/list
Runbooks
Runbook 1: API Down
Symptoms: No successful API requests for 5+ minutes, PagerDuty alert fired
Diagnosis:
# 1. Check service status
gcloud run services describe bio-qms-api --region=us-central1
# 2. Check recent logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=50 --format=json
# 3. Check database connectivity
gcloud sql instances describe bio-qms-db
# 4. Check recent deployments
gcloud run revisions list --service=bio-qms-api --limit=5
Resolution:
# If bad deployment: rollback
PREVIOUS_REVISION=$(gcloud run revisions list --service=bio-qms-api --format="value(name)" --limit=2 | tail -n1)
gcloud run services update-traffic bio-qms-api --to-revisions=$PREVIOUS_REVISION=100
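# If database issue: restart connection pools by rolling out a new Cloud Run revision
# (RESTARTED_AT is a placeholder variable used only to force a new revision)
gcloud run services update bio-qms-api --region=us-central1 \
--update-env-vars=RESTARTED_AT="$(date +%s)"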
# If Cloud Run issue: scale to zero and back
gcloud run services update bio-qms-api --min-instances=0
sleep 10
gcloud run services update bio-qms-api --min-instances=2
Runbook 2: High Database Latency
Symptoms: p95 query latency > 500ms
Diagnosis:
# Check slow queries
gcloud logging read \
"jsonPayload.duration_ms>500 AND jsonPayload.action=\"prisma.query\"" \
--limit=20 --format=json
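# Check recent instance operations (restarts, backups, maintenance);
# CPU utilization is reported by the cloudsql.googleapis.com/database/cpu/utilization metric
gcloud sql operations list --instance=bio-qms-db --limit=10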
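# Check connection pool (Monitoring API v3 timeSeries.list; GNU date syntax)
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"' \
--data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"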
Resolution:
# Add missing index (example)
psql -h $DB_HOST -U $DB_USER -d bio_qms -c \
"CREATE INDEX CONCURRENTLY idx_documents_org_created ON documents(organization_id, created_at);"
# Scale database instance
gcloud sql instances patch bio-qms-db --tier=db-custom-4-16384
# Analyze and vacuum
psql -h $DB_HOST -U $DB_USER -d bio_qms -c "VACUUM ANALYZE;"
Maintenance
Monthly Tasks
- Review SLO compliance and error budget consumption
- Analyze top 20 slowest endpoints and optimize
- Review alert noise and tune thresholds
- Archive old traces (Cloud Trace auto-retention: 30 days)
- Validate BigQuery audit log exports
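For the SLO review, error budget consumption can be computed from request counts over the review window. The sketch below assumes a request-based availability SLO; the 99.9% target in the usage line is a placeholder, not a value defined in this section, and `errorBudget` is a hypothetical helper.

```typescript
interface ErrorBudget {
  allowedFailures: number;   // failures the SLO tolerates in the window
  consumedFraction: number;  // 1.0 means the budget is fully spent
  remainingFailures: number; // negative when the budget is overspent
}

function errorBudget(sloTarget: number, totalRequests: number, failedRequests: number): ErrorBudget {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new Error('sloTarget must be between 0 and 1 (exclusive)');
  }
  const allowedFailures = (1 - sloTarget) * totalRequests;
  return {
    allowedFailures,
    consumedFraction: allowedFailures === 0 ? 0 : failedRequests / allowedFailures,
    remainingFailures: allowedFailures - failedRequests,
  };
}

// Placeholder numbers: 1,000,000 requests in the month, 250 failed, assumed 99.9% target.
const monthly = errorBudget(0.999, 1_000_000, 250);
console.log(monthly.consumedFraction.toFixed(2));
```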
Quarterly Tasks
- Review and update dashboards based on new features
- Conduct alert fire drill (test PagerDuty escalation)
- Audit log retention compliance check (7-year FDA requirement)
- Review and optimize trace sampling rates
- Update runbooks based on recent incidents
Annual Tasks
- Full observability stack audit
- Review compliance mapping (FDA, HIPAA, SOC 2)
- Evaluate new Cloud Monitoring features
- Disaster recovery test (log export restoration)
Document Version: 1.0.0
Last Updated: 2026-02-17
Owner: DevOps Team
Reviewers: Security Team, Compliance Team, QA Team
Next Review: 2026-05-17