Query Optimization Skill

When to Use This Skill

Use this skill when implementing query optimization patterns in your codebase.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Comprehensive query optimization patterns for identifying and resolving database performance bottlenecks.

EXPLAIN ANALYZE

Reading Execution Plans

-- Basic EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

-- With execution statistics
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 123;

-- JSON format for tools
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 123;

Key Metrics to Watch

Seq Scan        - Full table scan (often bad for large tables)
Index Scan      - Uses an index (good)
Index Only Scan - Reads from the index alone (best)
Bitmap Scan     - Combines multiple indexes
Nested Loop     - O(n*m); watch for large row counts
Hash Join       - Good for large datasets
Sort            - Check whether an index could provide the ordering
Rows Removed    - Filter efficiency indicator
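
When EXPLAIN output is requested with FORMAT JSON, these checks can be automated. A minimal sketch that walks the nested plan dict and flags suspect nodes — the thresholds and the sample plan are illustrative, while the key names ("Node Type", "Plans", "Actual Rows", "Rows Removed by Filter") follow PostgreSQL's JSON plan output:

```python
# Sketch: walk an EXPLAIN (FORMAT JSON) plan tree and flag suspect nodes.
def flag_plan_nodes(node, flags=None):
    """Recursively collect plan nodes worth a closer look."""
    if flags is None:
        flags = []
    node_type = node.get("Node Type", "")
    # Large sequential scans (threshold is illustrative)
    if node_type == "Seq Scan" and node.get("Actual Rows", 0) > 10_000:
        flags.append(f"Seq Scan on {node.get('Relation Name')} "
                     f"({node['Actual Rows']} rows)")
    # Filters that throw away more rows than they keep
    if node.get("Rows Removed by Filter", 0) > node.get("Actual Rows", 0):
        flags.append(f"{node_type}: filter discards more rows than it keeps")
    for child in node.get("Plans", []):
        flag_plan_nodes(child, flags)
    return flags

# Hand-built sample plan fragment for illustration
plan = {
    "Node Type": "Seq Scan",
    "Relation Name": "logs",
    "Actual Rows": 100_000,
    "Rows Removed by Filter": 4_900_000,
}
print(flag_plan_nodes(plan))
```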

Example Analysis

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01'
AND o.status = 'pending';

-- Output interpretation:
/*
Hash Join  (cost=1.23..45.67 rows=100 width=50) (actual time=0.5..2.3 rows=95 loops=1)
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=50 read=10
  ->  Index Scan using idx_orders_date on orders o  (cost=0.29..30.00 rows=100 width=20)
        Index Cond: (order_date > '2024-01-01')
        Filter: (status = 'pending')
        Rows Removed by Filter: 50
  ->  Hash  (cost=1.10..1.10 rows=10 width=30)
        ->  Seq Scan on customers c  (cost=0.00..1.10 rows=10 width=30)
Planning Time: 0.2 ms
Execution Time: 2.5 ms
*/
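
The Buffers line doubles as a quick cache check: shared hit counts pages found in shared buffers, read counts pages fetched from disk. A tiny sketch using the numbers from the plan above:

```python
# Cache hit ratio from "Buffers: shared hit=50 read=10" in the plan above.
shared_hit, shared_read = 50, 10
hit_ratio = shared_hit / (shared_hit + shared_read)
print(f"cache hit ratio: {hit_ratio:.0%}")  # → cache hit ratio: 83%
```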

Common Issues

Sequential Scan on Large Table

-- Problem
EXPLAIN SELECT * FROM logs WHERE created_at > '2024-01-01';
-- Seq Scan on logs (cost=0.00..100000.00 rows=5000000)

-- Solution: Add index
CREATE INDEX CONCURRENTLY idx_logs_created ON logs (created_at);

Rows Removed by Filter Too High

-- Problem: Index returning too many rows, then filtering
-- Rows Removed by Filter: 99000

-- Solution: More selective index or composite index
CREATE INDEX idx_orders_status_date ON orders (status, order_date)
WHERE status IN ('pending', 'processing');

Nested Loop with Large Tables

-- Problem: O(n*m) complexity
-- Nested Loop (actual rows=1000000)

-- Solution: Ensure join columns are indexed
CREATE INDEX idx_order_items_order_id ON order_items (order_id);

Index Optimization

Index Selection Strategy

-- Analyze query patterns
SELECT
    query,
    calls,
    mean_exec_time,
    rows
FROM pg_stat_statements
WHERE query LIKE '%orders%'
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Find missing indexes
SELECT
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    seq_tup_read / NULLIF(seq_scan, 0) AS avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND seq_tup_read / NULLIF(seq_scan, 0) > 1000
ORDER BY seq_tup_read DESC;

Composite Index Design

-- Query pattern
SELECT * FROM orders
WHERE customer_id = ?
AND status = ?
ORDER BY order_date DESC
LIMIT 10;

-- Optimal index (equality columns first, then range/sort)
CREATE INDEX idx_orders_cust_status_date
ON orders (customer_id, status, order_date DESC);

-- Covering index to avoid table access
CREATE INDEX idx_orders_covering
ON orders (customer_id, status, order_date DESC)
INCLUDE (total, shipping_address_id);

Index-Only Scans

-- Enable index-only scan by including all selected columns
-- Query
SELECT customer_id, SUM(total) FROM orders
WHERE order_date > '2024-01-01'
GROUP BY customer_id;

-- Index that enables index-only scan
CREATE INDEX idx_orders_date_cust_total
ON orders (order_date, customer_id, total);

Partial Indexes

-- Index only active/relevant rows
CREATE INDEX idx_orders_pending
ON orders (customer_id, order_date)
WHERE status = 'pending';

-- Expression index for computed values
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));

Query Rewriting

Subquery to JOIN

-- Slow subquery
SELECT * FROM orders
WHERE customer_id IN (
    SELECT id FROM customers WHERE country = 'US'
);

-- Faster JOIN
SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'US';

EXISTS vs IN

-- IN with large subquery (slower)
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);

-- EXISTS (often faster, stops at first match)
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id AND o.total > 1000
);

Avoiding OR in WHERE

-- Slow (can't use single index)
SELECT * FROM products
WHERE category = 'electronics' OR category = 'computers';

-- Faster (uses index)
SELECT * FROM products
WHERE category IN ('electronics', 'computers');

-- Or use UNION for complex OR
SELECT * FROM products WHERE category = 'electronics'
UNION ALL
SELECT * FROM products WHERE category = 'computers';

LIMIT Optimization

-- Slow: Sorts all rows then limits
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;

-- Fast: Index provides ordering
CREATE INDEX idx_orders_created ON orders (created_at DESC);

-- For pagination, use keyset instead of OFFSET
-- Slow
SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 10000;

-- Fast (keyset pagination)
SELECT * FROM orders
WHERE id > 10000
ORDER BY id
LIMIT 20;
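
Driven from application code, keyset pagination carries the last-seen key forward between pages. A minimal sketch over an in-memory list standing in for the orders table — `fetch_page` is a hypothetical stand-in for the parameterized query:

```python
# Fake orders table for illustration
ROWS = [{"id": i, "total": i * 10} for i in range(1, 101)]

def fetch_page(last_id, limit):
    # Stand-in for: SELECT * FROM orders WHERE id > :last_id ORDER BY id LIMIT :limit
    return [r for r in ROWS if r["id"] > last_id][:limit]

def iter_all(limit=20):
    """Walk the whole table one keyset page at a time."""
    last_id = 0
    while True:
        page = fetch_page(last_id, limit)
        if not page:
            break
        yield from page
        last_id = page[-1]["id"]  # cursor for the next page

all_rows = list(iter_all())
```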

Batch Operations

-- Slow: many individual queries, one round-trip each (pseudocode)
FOR each_id IN ids LOOP
    SELECT * FROM products WHERE id = each_id;
END LOOP;

-- Fast: Single batch query
SELECT * FROM products WHERE id = ANY(ARRAY[1,2,3,4,5]);
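
When the id list can be large, it is common to chunk it in application code and issue one ANY(...) query per chunk. A small helper sketch (names are illustrative):

```python
def chunked(ids, size):
    """Yield fixed-size slices of ids, one per batched ANY(...) query."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Each batch would be bound as: SELECT * FROM products WHERE id = ANY(:batch)
batches = list(chunked(list(range(1, 13)), 5))
```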

ORM Optimization

N+1 Query Problem

# Problem: N+1 queries
users = User.query.all()
for user in users:
    print(user.orders)  # Each access triggers a separate query

# Solution: eager loading
from sqlalchemy.orm import joinedload, load_only

users = User.query.options(
    joinedload(User.orders)
).all()

# Or select specific columns
users = User.query.options(
    load_only(User.id, User.name),
    joinedload(User.orders).load_only(Order.id, Order.total)
).all()

SQLAlchemy Optimization

# Use exists() for boolean checks
has_orders = db.session.query(
    Order.query.filter_by(customer_id=customer_id).exists()
).scalar()

# Use func.count() instead of len()
order_count = db.session.query(
    func.count(Order.id)
).filter_by(customer_id=customer_id).scalar()

# Bulk operations
db.session.bulk_insert_mappings(User, [
    {'name': 'Alice', 'email': 'alice@example.com'},
    {'name': 'Bob', 'email': 'bob@example.com'},
])

# Bulk update
db.session.query(User).filter(
    User.status == 'inactive'
).update({'status': 'archived'}, synchronize_session=False)

Django ORM Optimization

# Select related (foreign keys)
orders = Order.objects.select_related('customer').all()

# Prefetch related (reverse foreign keys, many-to-many)
customers = Customer.objects.prefetch_related('orders').all()

# Only load needed fields
orders = Order.objects.only('id', 'total', 'customer_id')

# Defer heavy fields
orders = Order.objects.defer('description', 'notes')

# Use values/values_list for simple queries
totals = Order.objects.values_list('customer_id', 'total')

Performance Monitoring

pg_stat_statements

-- Enable extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total time
SELECT
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
    query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
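
The pct column above is a simple share-of-total computation; when post-processing pg_stat_statements rows in application code instead, it looks like this (the rows and timings are made up for illustration):

```python
# Share of total execution time per query, as the SQL window expression
# 100 * total_exec_time / sum(total_exec_time) OVER () computes.
rows = [
    ("SELECT * FROM orders ...", 1500.0),    # (query, total_exec_time ms)
    ("SELECT * FROM customers ...", 400.0),
    ("UPDATE orders SET ...", 100.0),
]
grand_total = sum(t for _, t in rows)
report = [(q, t, round(100 * t / grand_total, 2)) for q, t in rows]
```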

-- Most frequent queries
SELECT
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;

-- Reset statistics
SELECT pg_stat_statements_reset();

auto_explain

-- Enable in postgresql.conf
-- shared_preload_libraries = 'auto_explain'

-- Or per-session
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '100ms';
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;

Slow Query Log

# postgresql.conf
log_min_duration_statement = 1000 # Log queries > 1 second
log_statement = 'none' # Don't log all statements
log_duration = off

Connection and Memory

Work Memory Tuning

-- Check current settings
SHOW work_mem;
SHOW maintenance_work_mem;

-- Increase for complex queries (per-session)
SET work_mem = '256MB';

-- For maintenance operations
SET maintenance_work_mem = '1GB';
REINDEX TABLE orders;

Connection Pooling Impact

-- Monitor connection count
SELECT count(*) FROM pg_stat_activity;

-- Per-application connections
SELECT
    application_name,
    count(*),
    array_agg(DISTINCT state)
FROM pg_stat_activity
GROUP BY application_name;

Optimization Checklist

## Query Optimization Checklist

### Before Optimization
- [ ] Capture baseline metrics (time, rows, buffers)
- [ ] Identify specific queries to optimize
- [ ] Check current indexes on involved tables

### Analysis
- [ ] Run EXPLAIN (ANALYZE, BUFFERS)
- [ ] Check for sequential scans on large tables
- [ ] Identify high "Rows Removed by Filter"
- [ ] Look for nested loops with large row counts
- [ ] Check join conditions have indexes

### Optimization
- [ ] Add missing indexes (CONCURRENTLY)
- [ ] Consider composite indexes for multi-column filters
- [ ] Evaluate partial indexes for filtered queries
- [ ] Rewrite subqueries as JOINs where beneficial
- [ ] Use EXISTS instead of IN for existence checks
- [ ] Implement keyset pagination over OFFSET

### Verification
- [ ] Compare EXPLAIN plans before/after
- [ ] Measure actual query time improvement
- [ ] Check index is actually being used
- [ ] Monitor pg_stat_statements for changes

Success Output

When this skill is successfully applied, you MUST output:

✅ SKILL COMPLETE: query-optimization

Completed:
- [x] Query performance analysis complete (EXPLAIN ANALYZE executed)
- [x] Performance bottlenecks identified (seq scans, high filter removal, etc.)
- [x] Index optimization implemented (indexes created)
- [x] Query rewriting applied (subqueries → JOINs, etc.)
- [x] Performance improvement verified (before/after metrics)

Outputs:
- Performance analysis: [baseline metrics, bottleneck report]
- Index recommendations: [file path or SQL statements]
- Optimized queries: [file paths with rewritten queries]
- Benchmark results: [before/after execution times, improvement %]

Performance improvement: [X]% faster (e.g., 95% faster: 2000ms → 100ms)

Completion Checklist

Before marking this skill as complete, verify:

  • EXPLAIN ANALYZE executed for target queries (baseline captured)
  • Sequential scans on large tables identified
  • High "Rows Removed by Filter" values identified
  • Missing indexes identified and created (use CONCURRENTLY)
  • Query rewrites applied (subqueries → JOINs, IN → EXISTS, etc.)
  • N+1 query problems fixed with eager loading (ORM)
  • Performance re-measured (EXPLAIN ANALYZE after optimization)
  • Actual query time improvement verified (not just plan estimates)
  • pg_stat_statements monitoring configured (if PostgreSQL)
  • Index usage confirmed (indexes actually being used)

Failure Indicators

This skill has FAILED if:

  • ❌ No baseline metrics captured before optimization
  • ❌ Indexes created but not used by queries (check EXPLAIN output)
  • ❌ Query performance did not improve or got worse
  • ❌ N+1 queries still present after "fixing"
  • ❌ Sequential scans still occurring on large tables
  • ❌ High "Rows Removed by Filter" persists (poor index selectivity)
  • ❌ New indexes cause write performance degradation (too many indexes)
  • ❌ No monitoring configured to track ongoing performance

When NOT to Use

Do NOT use this skill when:

  • Queries are already fast (<100ms) and there are no performance complaints
  • The database is very small (<10k rows total), where optimization overhead outweighs the benefit
  • Read-heavy queries run against a data warehouse (use materialized views/aggregates instead)
  • Real-time analytics are required (use a specialized OLAP database instead)
  • The schema cannot be modified (no ability to add indexes)

Use alternatives:

  • For data warehouses: Materialized views, columnar storage, OLAP databases
  • For real-time analytics: ClickHouse, BigQuery, Snowflake
  • For small databases: No optimization needed, focus on application logic
  • For read-only reporting: Pre-aggregated reports, caching layer

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| Optimizing without measuring | Don't know if it helped | Always capture a baseline with EXPLAIN ANALYZE |
| Creating indexes blindly | May not be used, or slows writes | Verify index usage in EXPLAIN output |
| Over-indexing | Write performance degraded | Only index high-selectivity columns used in queries |
| Ignoring N+1 queries | Hundreds of queries instead of 1 | Use eager loading (joinedload, select_related) |
| Using OFFSET for pagination | Slow on deep pages | Use keyset pagination (WHERE id > ?) |
| Fetching unnecessary columns | More I/O than needed | SELECT specific columns, not SELECT * |
| Not using connection pooling | New-connection overhead per request | Implement connection pooling |
| Forgetting to ANALYZE after indexing | Query planner doesn't know about the new index | Run ANALYZE on the table after creating indexes |

Principles

This skill embodies:

  • #2 First Principles - Understand query execution before optimizing
  • #8 No Assumptions - Measure before and after, verify improvement
  • #9 Research When in Doubt - Study EXPLAIN plans, consult database docs
  • #3 Keep It Simple - Start with simplest fix (add index) before complex rewrites
  • Performance Mindset - Benchmark-driven optimization, not premature optimization

Full Standard: CODITECT-STANDARD-AUTOMATION.md

Usage Examples

Analyze Slow Query

Apply query-optimization skill to analyze slow EXPLAIN output and recommend index improvements for orders query

ORM Performance Audit

Apply query-optimization skill to identify N+1 queries in Django application and implement eager loading

Database Performance Baseline

Apply query-optimization skill to setup pg_stat_statements monitoring and identify top optimization targets