Cloud Build & Logging Improvements
Created: 2025-10-27 Problem: 11 failed builds, 4-6 hours wasted, insufficient error visibility Goal: Reduce build-fix-rebuild cycle from 45 min → 10 min
Current Problems
1. Incomplete Error Visibility
gcloud builds submitonly shows Cloud Storage logs (basic)- Missing detailed Cloud Logging (stderr, build context)
- Error often buried in 10,000+ lines of output
- Have to manually run
gcloud builds log BUILD_IDafter failure
2. No Early Error Detection
- Wait 10-45 minutes for full build before seeing errors
- Can't detect dependency issues until Cargo resolves them
- No pre-flight checks for common mistakes
3. Poor Log Aggregation
- Mix of 6 stages (frontend, theia, v5-backend, codi2, monitor, runtime)
- Hard to find which stage failed
- No structured error extraction
4. No Dependency Pre-Validation
- Could check cargo.toml for edition2024 issues before building
- Could verify all system packages available in base images
- Could detect missing bindgen dependencies
5. No Incremental Builds
- Every failure requires FULL rebuild (30-45 min)
- No Docker layer caching between builds
- Dockerfile changes invalidate all downstream layers
Proposed Solutions
🚀 Phase 1: Immediate Wins (10 min implementation)
A. Enhanced Logging Script
File: scripts/build-with-enhanced-logging.sh
#!/bin/bash
set -euo pipefail
# Enhanced Cloud Build submission with real-time error detection
BUILD_NUM=${1:-"auto"}
LOG_FILE="/tmp/build-log-v${BUILD_NUM}.txt"
ERROR_FILE="/tmp/build-errors-v${BUILD_NUM}.txt"
echo "🚀 Starting Build #${BUILD_NUM} with enhanced logging..."
echo "📋 Log: ${LOG_FILE}"
echo "❌ Errors: ${ERROR_FILE}"
# Submit build and capture Build ID immediately
BUILD_OUTPUT=$(gcloud builds submit \
--config cloudbuild-combined.yaml \
--project serene-voltage-464305-n2 \
2>&1 | tee "${LOG_FILE}")
BUILD_ID=$(echo "$BUILD_OUTPUT" | grep -oP 'builds/\K[a-f0-9-]+' | head -1)
if [ -z "$BUILD_ID" ]; then
echo "❌ Failed to extract Build ID"
exit 1
fi
echo "🆔 Build ID: $BUILD_ID"
echo "🔗 Console: https://console.cloud.google.com/cloud-build/builds/${BUILD_ID}?project=serene-voltage-464305-n2"
# Monitor build with real-time error extraction
echo ""
echo "👀 Monitoring for errors (Ctrl+C to stop monitoring, build continues)..."
gcloud builds log "$BUILD_ID" --stream --project serene-voltage-464305-n2 2>&1 | while IFS= read -r line; do
echo "$line" >> "${LOG_FILE}"
# Extract errors in real-time
if echo "$line" | grep -qiE "(error|ERROR|failed|FAILED|panicked)"; then
echo "$line" | tee -a "${ERROR_FILE}"
fi
done
# After build completes, extract structured errors
echo ""
echo "📊 Build Complete - Extracting Errors..."
# Rust compilation errors
echo "=== RUST COMPILATION ERRORS ===" >> "${ERROR_FILE}"
grep -A 10 "error: " "${LOG_FILE}" >> "${ERROR_FILE}" 2>/dev/null || echo "No Rust errors" >> "${ERROR_FILE}"
# Missing dependencies
echo "" >> "${ERROR_FILE}"
echo "=== MISSING DEPENDENCIES ===" >> "${ERROR_FILE}"
grep -A 5 "Unable to find\|couldn't find\|No such file" "${LOG_FILE}" >> "${ERROR_FILE}" 2>/dev/null || echo "No missing deps" >> "${ERROR_FILE}"
# Bindgen errors
echo "" >> "${ERROR_FILE}"
echo "=== BINDGEN ERRORS ===" >> "${ERROR_FILE}"
grep -A 10 "bindgen\|libclang" "${LOG_FILE}" >> "${ERROR_FILE}" 2>/dev/null || echo "No bindgen errors" >> "${ERROR_FILE}"
# Edition2024 errors
echo "" >> "${ERROR_FILE}"
echo "=== EDITION2024 ERRORS ===" >> "${ERROR_FILE}"
grep -A 5 "edition2024" "${LOG_FILE}" >> "${ERROR_FILE}" 2>/dev/null || echo "No edition2024 errors" >> "${ERROR_FILE}"
echo ""
echo "✅ Error extraction complete"
echo "📄 Full log: ${LOG_FILE}"
echo "❌ Errors only: ${ERROR_FILE}"
echo ""
echo "🔍 Quick error view:"
tail -50 "${ERROR_FILE}"
Usage:
chmod +x scripts/build-with-enhanced-logging.sh
./scripts/build-with-enhanced-logging.sh 12
Benefits:
- Real-time error detection while build runs
- Structured error extraction by category
- Console link for visual monitoring
- Separate error file for quick debugging
B. Pre-Flight Validation Script
File: scripts/preflight-build-check.sh
#!/bin/bash
set -euo pipefail
echo "🛫 Pre-flight Build Validation"
echo "=============================="
echo ""
FAIL_COUNT=0
# Check 1: cargo.toml edition2024 dependencies
echo "✓ Checking cargo.toml for edition2024 dependencies..."
if grep -r "edition.*2024" backend/cargo.toml archive/coditect-v4/codi2/cargo.toml 2>/dev/null; then
echo " ❌ FAIL: edition2024 dependency found"
((FAIL_COUNT++))
else
echo " ✅ PASS: No edition2024 dependencies"
fi
# Check 2: Verify base64ct pin
echo "✓ Checking base64ct pin..."
if grep -q 'base64ct = "=1.6.0"' backend/cargo.toml; then
echo " ✅ PASS: base64ct pinned to 1.6.0"
else
echo " ❌ FAIL: base64ct not pinned"
((FAIL_COUNT++))
fi
# Check 3: Verify codi2 dependency pins
echo "✓ Checking codi2 dependency pins..."
REQUIRED_PINS=("notify = \"=6.0.0\"" "ignore = \"=0.4.23\"" "globset = \"=0.4.15\"")
for pin in "${REQUIRED_PINS[@]}"; do
if grep -q "$pin" archive/coditect-v4/codi2/cargo.toml; then
echo " ✅ PASS: $pin"
else
echo " ❌ FAIL: Missing $pin"
((FAIL_COUNT++))
fi
done
# Check 4: Verify Dockerfile has clang for codi2-builder
echo "✓ Checking Dockerfile for clang in codi2-builder..."
if grep -A 10 "FROM rust.*AS codi2-builder" dockerfile.combined-fixed | grep -q "clang"; then
echo " ✅ PASS: clang present in codi2-builder"
else
echo " ❌ FAIL: clang missing from codi2-builder"
((FAIL_COUNT++))
fi
# Check 5: Verify FoundationDB clients in codi2-builder
echo "✓ Checking Dockerfile for FoundationDB in codi2-builder..."
if grep -A 20 "FROM rust.*AS codi2-builder" dockerfile.combined-fixed | grep -q "foundationdb-clients"; then
echo " ✅ PASS: FoundationDB clients in codi2-builder"
else
echo " ❌ FAIL: FoundationDB missing from codi2-builder"
((FAIL_COUNT++))
fi
# Check 6: Verify frontend build exists (required for combined image)
echo "✓ Checking for frontend build..."
if [ -d "dist" ]; then
echo " ✅ PASS: dist/ directory exists"
else
echo " ❌ FAIL: dist/ missing - run 'npm run build' first"
((FAIL_COUNT++))
fi
# Check 7: Verify .gcloudignore to reduce upload size
echo "✓ Checking .gcloudignore..."
if [ -f ".gcloudignore" ]; then
echo " ✅ PASS: .gcloudignore exists"
else
echo " ⚠️ WARN: .gcloudignore missing (builds will be slower)"
fi
# Check 8: Estimate upload size
echo "✓ Estimating upload size..."
UPLOAD_SIZE=$(du -sh . 2>/dev/null | awk '{print $1}')
echo " 📦 Upload size: ${UPLOAD_SIZE}"
if [ -f ".gcloudignore" ]; then
ACTUAL_SIZE=$(tar --exclude-from=.gcloudignore -czf - . 2>/dev/null | wc -c | awk '{printf "%.1f MB", $1/1024/1024}')
echo " 📦 After .gcloudignore: ${ACTUAL_SIZE}"
fi
echo ""
echo "=============================="
if [ $FAIL_COUNT -eq 0 ]; then
echo "✅ Pre-flight PASSED - Safe to build"
exit 0
else
echo "❌ Pre-flight FAILED - Fix ${FAIL_COUNT} issues before building"
exit 1
fi
Usage:
chmod +x scripts/preflight-build-check.sh
./scripts/preflight-build-check.sh && ./scripts/build-with-enhanced-logging.sh 12
Benefits:
- Catches 80% of common errors before 45-minute build
- Fast validation (< 1 second)
- Clear checklist of required fixes
🔄 Phase 2: Incremental Builds (30 min implementation)
C. Docker Layer Caching via Artifact Registry
Problem: Every build rebuilds ALL 6 stages from scratch
Solution: Enable remote Docker cache in Cloud Build
File: cloudbuild-combined-cached.yaml
steps:
# Build with cache from Artifact Registry
- name: 'gcr.io/cloud-builders/docker'
args:
- 'build'
- '--file=dockerfile.combined-fixed'
- '--cache-from=us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined-cache:latest'
- '--build-arg=BUILDKIT_INLINE_CACHE=1'
- '--tag=us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined:$BUILD_ID'
- '--tag=us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined-cache:latest'
- '.'
timeout: 7200s
images:
- 'us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined:$BUILD_ID'
- 'us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined-cache:latest'
options:
machineType: 'E2_HIGHCPU_32'
diskSizeGb: 200
logging: CLOUD_LOGGING_ONLY
Benefits:
- Reuses unchanged layers from previous builds
- Only rebuilds changed stages
- Can reduce 45 min build → 10 min if only 1 stage changed
One-time setup:
# Create initial cache image (run once)
gcloud builds submit --config cloudbuild-combined.yaml --substitutions=_CACHE_FROM=""
D. Stage-Specific Build Scripts
Problem: Can't test individual stages without full build
Solution: Separate build scripts per stage
File: scripts/build-stage-v5-backend.sh
#!/bin/bash
# Build ONLY v5-backend stage for testing
docker build \
--file dockerfile.combined-fixed \
--target v5-backend-builder \
--tag coditect-v5-backend:test \
.
echo "✅ V5 backend stage built successfully"
echo "🔍 Checking binary..."
docker run --rm coditect-v5-backend:test ls -lh /build/backend/target/release/api-server
Similar scripts: build-stage-codi2.sh, build-stage-theia.sh, etc.
Benefits:
- Test single stage in 5-10 minutes vs 45 minutes
- Iterate faster on fixes
- Identify exactly which stage has issues
📊 Phase 3: Monitoring & Alerting (1 hour implementation)
E. Build Metrics Dashboard
File: scripts/analyze-build-history.sh
#!/bin/bash
# Analyze build history and generate metrics
echo "📊 Build History Analysis"
echo "========================"
gcloud builds list \
--project serene-voltage-464305-n2 \
--limit=20 \
--format="table(id,status,createTime,duration,logUrl)" \
> /tmp/build-history.txt
cat /tmp/build-history.txt
echo ""
echo "📈 Statistics:"
TOTAL=$(gcloud builds list --project serene-voltage-464305-n2 --limit=20 --format="value(status)" | wc -l)
SUCCESS=$(gcloud builds list --project serene-voltage-464305-n2 --limit=20 --format="value(status)" | grep SUCCESS | wc -l)
FAILURE=$(gcloud builds list --project serene-voltage-464305-n2 --limit=20 --format="value(status)" | grep FAILURE | wc -l)
SUCCESS_RATE=$(echo "scale=1; $SUCCESS * 100 / $TOTAL" | bc)
echo " Total Builds: $TOTAL"
echo " Success: $SUCCESS ($SUCCESS_RATE%)"
echo " Failure: $FAILURE"
echo ""
# Identify common failure patterns
echo "🔍 Common Failure Patterns:"
for BUILD_ID in $(gcloud builds list --project serene-voltage-464305-n2 --filter="status=FAILURE" --limit=10 --format="value(id)"); do
ERROR=$(gcloud builds log "$BUILD_ID" 2>&1 | grep -oiE "(edition2024|libclang|base64ct|bindgen)" | head -1)
if [ -n "$ERROR" ]; then
echo " $BUILD_ID: $ERROR"
fi
done
Benefits:
- Track success rate over time
- Identify recurring error patterns
- Measure improvement from these changes
Implementation Timeline
| Phase | Tasks | Time | Priority |
|---|---|---|---|
| Phase 1 | Enhanced logging + pre-flight checks | 10 min | 🔴 P0 |
| Phase 2 | Docker caching + stage-specific builds | 30 min | 🟡 P1 |
| Phase 3 | Monitoring dashboard | 1 hour | 🟢 P2 |
Immediate action: Implement Phase 1 (10 minutes) before Build #12
Expected Impact
Before Improvements:
- ❌ Build → Fail → Debug → Fix → Rebuild: 45-60 minutes per cycle
- ❌ 11 failed builds × 45 min = 8.25 hours wasted
- ❌ Error visibility: Poor (hunt through 10K lines)
After Phase 1:
- ✅ Pre-flight catches 80% of errors: < 1 minute
- ✅ Enhanced logging shows errors immediately: ~10 seconds
- ✅ Build → Fail → Debug cycle: 45 minutes (no improvement yet)
After Phase 2:
- ✅ Incremental builds: 10-15 minutes (3-4x faster)
- ✅ Stage-specific testing: 5-10 minutes per stage
- ✅ Build → Fail → Fix → Rebuild: 10-20 minutes (60-70% faster)
After Phase 3:
- ✅ Trend analysis prevents repeated failures
- ✅ Success rate tracking shows improvement
- ✅ Common error detection for proactive fixes
Recommended: Implement Phase 1 NOW
Before submitting Build #12, let's implement the enhanced logging and pre-flight checks:
# 1. Create scripts
mkdir -p scripts
# (create scripts from Phase 1 above)
# 2. Run pre-flight check
./scripts/preflight-build-check.sh
# 3. If pass, build with enhanced logging
./scripts/build-with-enhanced-logging.sh 12
# 4. Monitor errors in real-time
tail -f /tmp/build-errors-v12.txt
Time investment: 10 minutes Time saved per build: 20-30 minutes (catching errors faster) ROI: Pays for itself after 1 build
Long-Term: Multi-Stage Cloud Build
Future optimization: Split single cloudbuild.yaml into parallel stages
steps:
# Build all stages in parallel
- id: 'build-frontend'
name: 'gcr.io/cloud-builders/docker'
args: ['build', '--target=frontend-builder', ...]
waitFor: ['-'] # Start immediately
- id: 'build-theia'
name: 'gcr.io/cloud-builders/docker'
args: ['build', '--target=theia-builder', ...]
waitFor: ['-'] # Start immediately
- id: 'build-v5-backend'
name: 'gcr.io/cloud-builders/docker'
args: ['build', '--target=v5-backend-builder', ...]
waitFor: ['-'] # Start immediately
# ... (all builders in parallel)
# Final runtime stage waits for all builders
- id: 'build-runtime'
name: 'gcr.io/cloud-builders/docker'
args: ['build', '--target=runtime', ...]
waitFor: ['build-frontend', 'build-theia', 'build-v5-backend', ...]
Benefits:
- Parallel compilation: 45 min → 15-20 min
- Faster error detection (failing stage fails immediately)
- Better resource utilization
Complexity: Medium (requires Dockerfile restructuring) Time to implement: 2-3 hours
Last Updated: 2025-10-27 Status: Phase 1 ready to implement Next: Create scripts before Build #12