Query Optimization Skill
When to Use This Skill
Use this skill when implementing query optimization patterns in your codebase.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Comprehensive query optimization patterns for identifying and resolving database performance bottlenecks.
EXPLAIN ANALYZE
Reading Execution Plans
-- Basic EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- With execution statistics
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 123;
-- JSON format for tools
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 123;
Key Metrics to Watch
- Seq Scan: full table scan (often bad for large tables)
- Index Scan: uses an index (good)
- Index Only Scan: reads from the index alone, no heap access (best)
- Bitmap Scan: combines multiple indexes
- Nested Loop: O(n*m); watch for large row counts on either side
- Hash Join: good for large datasets
- Sort: check whether an index could provide the ordering instead
- Rows Removed by Filter: indicator of poor filter selectivity
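These warning signs can also be checked programmatically. Below is a minimal sketch (function name and thresholds are illustrative, not a library API) that walks the tree returned by EXPLAIN (FORMAT JSON) and flags sequential scans and heavy filter removal; the key names follow PostgreSQL's JSON plan format:

```python
import json

def find_problem_nodes(plan_node, issues=None):
    """Recursively walk an EXPLAIN (FORMAT JSON) plan tree and
    collect nodes matching the warning signs listed above."""
    if issues is None:
        issues = []
    node_type = plan_node.get("Node Type", "")
    rows = plan_node.get("Actual Rows", plan_node.get("Plan Rows", 0))
    removed = plan_node.get("Rows Removed by Filter", 0)
    if node_type == "Seq Scan" and rows > 10_000:  # illustrative threshold
        issues.append(f"Seq Scan on {plan_node.get('Relation Name')} ({rows} rows)")
    if removed and removed > rows:  # filter discards more than it keeps
        issues.append(f"{node_type}: {removed} rows removed by filter")
    for child in plan_node.get("Plans", []):  # recurse into child nodes
        find_problem_nodes(child, issues)
    return issues

# Trimmed sample of what EXPLAIN (ANALYZE, FORMAT JSON) returns
plan = json.loads("""
[{"Plan": {"Node Type": "Hash Join", "Actual Rows": 95,
           "Plans": [
             {"Node Type": "Index Scan", "Actual Rows": 100,
              "Rows Removed by Filter": 50000, "Plans": []},
             {"Node Type": "Seq Scan", "Relation Name": "logs",
              "Actual Rows": 5000000, "Plans": []}]}}]
""")
issues = find_problem_nodes(plan[0]["Plan"])  # flags both child nodes
```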
Example Analysis
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01'
AND o.status = 'pending';
-- Output interpretation:
/*
Hash Join  (cost=1.23..45.67 rows=100 width=50) (actual time=0.5..2.3 rows=95 loops=1)
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=50 read=10
  ->  Index Scan using idx_orders_date on orders o  (cost=0.29..30.00 rows=100 width=20)
        Index Cond: (order_date > '2024-01-01')
        Filter: (status = 'pending')
        Rows Removed by Filter: 50
  ->  Hash  (cost=1.10..1.10 rows=10 width=30)
        ->  Seq Scan on customers c  (cost=0.00..1.10 rows=10 width=30)
Planning Time: 0.2 ms
Execution Time: 2.5 ms
*/
Common Issues
Sequential Scan on Large Table
-- Problem
EXPLAIN SELECT * FROM logs WHERE created_at > '2024-01-01';
-- Seq Scan on logs (cost=0.00..100000.00 rows=5000000)
-- Solution: Add index
CREATE INDEX CONCURRENTLY idx_logs_created ON logs (created_at);
Rows Removed by Filter Too High
-- Problem: Index returning too many rows, then filtering
-- Rows Removed by Filter: 99000
-- Solution: More selective index or composite index
CREATE INDEX idx_orders_status_date ON orders (status, order_date)
WHERE status IN ('pending', 'processing');
Nested Loop with Large Tables
-- Problem: O(n*m) complexity
-- Nested Loop (actual rows=1000000)
-- Solution: Ensure join columns are indexed
CREATE INDEX idx_order_items_order_id ON order_items (order_id);
Index Optimization
Index Selection Strategy
-- Analyze query patterns
SELECT
query,
calls,
mean_exec_time,
rows
FROM pg_stat_statements
WHERE query LIKE '%orders%'
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Find missing indexes
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
seq_tup_read / NULLIF(seq_scan, 0) as avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 100
AND seq_tup_read / NULLIF(seq_scan, 0) > 1000
ORDER BY seq_tup_read DESC;
Composite Index Design
-- Query pattern
SELECT * FROM orders
WHERE customer_id = ?
AND status = ?
ORDER BY order_date DESC
LIMIT 10;
-- Optimal index (equality columns first, then range/sort)
CREATE INDEX idx_orders_cust_status_date
ON orders (customer_id, status, order_date DESC);
-- Covering index to avoid table access
CREATE INDEX idx_orders_covering
ON orders (customer_id, status, order_date DESC)
INCLUDE (total, shipping_address_id);
Index-Only Scans
-- Enable index-only scans by including all selected columns in the index
-- (also requires an up-to-date visibility map, so VACUUM regularly)
-- Query
SELECT customer_id, SUM(total) FROM orders
WHERE order_date > '2024-01-01'
GROUP BY customer_id;
-- Index that enables index-only scan
CREATE INDEX idx_orders_date_cust_total
ON orders (order_date, customer_id, total);
Partial Indexes
-- Index only active/relevant rows
CREATE INDEX idx_orders_pending
ON orders (customer_id, order_date)
WHERE status = 'pending';
-- Expression index for computed values
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));
Query Rewriting
Subquery to JOIN
-- Subquery form (modern planners often rewrite this, but it can be slower)
SELECT * FROM orders
WHERE customer_id IN (
SELECT id FROM customers WHERE country = 'US'
);
-- Faster JOIN
SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'US';
EXISTS vs IN
-- IN with a large subquery (can be slower)
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);
-- EXISTS (often faster, stops at first match)
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id AND o.total > 1000
);
Avoiding OR in WHERE
-- Slower on some planners: OR conditions may not use a single index
SELECT * FROM products
WHERE category = 'electronics' OR category = 'computers';
-- Faster (uses index)
SELECT * FROM products
WHERE category IN ('electronics', 'computers');
-- Or use UNION for complex OR
SELECT * FROM products WHERE category = 'electronics'
UNION ALL
SELECT * FROM products WHERE category = 'computers';
LIMIT Optimization
-- Slow: Sorts all rows then limits
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Fast: Index provides ordering
CREATE INDEX idx_orders_created ON orders (created_at DESC);
-- For pagination, use keyset instead of OFFSET
-- Slow
SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 10000;
-- Fast: keyset pagination (10000 = last id seen on the previous page)
SELECT * FROM orders
WHERE id > 10000
ORDER BY id
LIMIT 20;
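In application code, keyset pagination amounts to remembering the last key from the previous page and passing it into the next query. A small Python sketch that simulates the SQL above over an in-memory list (the function name and row shape are hypothetical):

```python
def keyset_page(rows, last_id=None, page_size=20):
    """Return one page of rows ordered by id, keyset style.
    `rows` stands in for the orders table; in SQL this is
    WHERE id > :last_id ORDER BY id LIMIT :page_size."""
    ordered = sorted(rows, key=lambda r: r["id"])
    if last_id is not None:
        ordered = [r for r in ordered if r["id"] > last_id]
    return ordered[:page_size]

orders = [{"id": i} for i in range(1, 101)]
page1 = keyset_page(orders, page_size=20)                    # ids 1..20
page2 = keyset_page(orders, last_id=page1[-1]["id"])         # ids 21..40
```

Unlike OFFSET, the cost of fetching page N does not grow with N, because the index seek starts directly at the last seen key.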
Batch Operations
-- Slow: many individual queries (pseudocode)
FOR each_id IN ids LOOP
    SELECT * FROM products WHERE id = each_id;
END LOOP;
-- Fast: Single batch query
SELECT * FROM products WHERE id = ANY(ARRAY[1,2,3,4,5]);
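For very large id lists, it is common to cap the batch size rather than send one enormous ANY(...) array. `chunked` below is a hypothetical helper sketching that idea; each yielded batch would be bound as the array parameter of a single query:

```python
def chunked(ids, size=1000):
    """Split a list of ids into fixed-size batches so each batch can be
    sent as one query, e.g. SELECT * FROM products WHERE id = ANY(%s)."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# 2500 ids become 3 queries instead of 2500
batches = list(chunked(list(range(1, 2501)), size=1000))
```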
ORM Optimization
N+1 Query Problem
# Problem: N+1 queries
users = User.query.all()
for user in users:
    print(user.orders)  # each access triggers a separate query

# Solution: eager loading
users = User.query.options(
    joinedload(User.orders)
).all()

# Or select specific columns
users = User.query.options(
    load_only(User.id, User.name),
    joinedload(User.orders).load_only(Order.id, Order.total)
).all()
SQLAlchemy Optimization
# Use exists() for boolean checks
has_orders = db.session.query(
    Order.query.filter_by(customer_id=customer_id).exists()
).scalar()

# Use func.count() instead of len()
order_count = db.session.query(
    func.count(Order.id)
).filter_by(customer_id=customer_id).scalar()

# Bulk insert
db.session.bulk_insert_mappings(User, [
    {'name': 'Alice', 'email': 'alice@example.com'},
    {'name': 'Bob', 'email': 'bob@example.com'},
])

# Bulk update
db.session.query(User).filter(
    User.status == 'inactive'
).update({'status': 'archived'}, synchronize_session=False)
Django ORM Optimization
# Select related (foreign keys)
orders = Order.objects.select_related('customer').all()
# Prefetch related (reverse foreign keys, many-to-many)
customers = Customer.objects.prefetch_related('orders').all()
# Only load needed fields
orders = Order.objects.only('id', 'total', 'customer_id')
# Defer heavy fields
orders = Order.objects.defer('description', 'notes')
# Use values/values_list for simple queries
totals = Order.objects.values_list('customer_id', 'total')
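One way to confirm that an N+1 fix actually worked is to count the queries issued. The sketch below uses a stand-in counter rather than a real connection (in practice, Django exposes `connection.queries` and `CaptureQueriesContext`, and SQLAlchemy offers the `before_cursor_execute` event for this):

```python
class QueryCounter:
    """Minimal stand-in for a query-counting hook."""
    def __init__(self):
        self.count = 0
    def execute(self, sql):
        self.count += 1  # stand-in for actually running the SQL

# N+1 pattern: one query for users, then one per user for orders
counter = QueryCounter()
counter.execute("SELECT * FROM users")
for user_id in range(10):
    counter.execute(f"SELECT * FROM orders WHERE user_id = {user_id}")
n_plus_one = counter.count  # 11 queries

# Eager-loading pattern: one joined query
counter = QueryCounter()
counter.execute("SELECT * FROM users JOIN orders ON orders.user_id = users.id")
eager = counter.count  # 1 query
```

Asserting on the query count in a test suite catches regressions when someone later adds a lazy-loaded attribute access inside a loop.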
Performance Monitoring
pg_stat_statements
-- Enable extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top queries by total time
SELECT
round(total_exec_time::numeric, 2) as total_ms,
calls,
round(mean_exec_time::numeric, 2) as mean_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER())::numeric, 2) as pct,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Most frequent queries
SELECT
calls,
round(mean_exec_time::numeric, 2) as mean_ms,
query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
-- Reset statistics
SELECT pg_stat_statements_reset();
auto_explain
-- Enable in postgresql.conf
-- shared_preload_libraries = 'auto_explain'
-- Or per-session
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '100ms';
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
Slow Query Log
# postgresql.conf
log_min_duration_statement = 1000 # Log queries > 1 second
log_statement = 'none' # Don't log all statements
log_duration = off
Connection and Memory
Work Memory Tuning
-- Check current settings
SHOW work_mem;
SHOW maintenance_work_mem;
-- Increase for complex queries (per-session)
SET work_mem = '256MB';
-- For maintenance operations
SET maintenance_work_mem = '1GB';
REINDEX TABLE orders;
Connection Pooling Impact
-- Monitor connection count
SELECT count(*) FROM pg_stat_activity;
-- Per-application connections
SELECT
application_name,
count(*),
array_agg(DISTINCT state)
FROM pg_stat_activity
GROUP BY application_name;
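The same per-application summary can be computed client-side from pg_stat_activity rows, which is handy when aggregating across several databases. A sketch assuming rows are fetched as dictionaries (the row shape is an assumption about your driver configuration):

```python
from collections import defaultdict

def summarize_connections(activity_rows):
    """Group connection rows (application_name, state) by application,
    mirroring the GROUP BY query above."""
    summary = defaultdict(lambda: {"count": 0, "states": set()})
    for row in activity_rows:
        app = summary[row["application_name"]]
        app["count"] += 1
        app["states"].add(row["state"])
    return dict(summary)

# Rows as they might come back from pg_stat_activity
rows = [
    {"application_name": "api", "state": "active"},
    {"application_name": "api", "state": "idle"},
    {"application_name": "worker", "state": "idle"},
]
summary = summarize_connections(rows)
```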
Optimization Checklist
### Before Optimization
- [ ] Capture baseline metrics (time, rows, buffers)
- [ ] Identify specific queries to optimize
- [ ] Check current indexes on involved tables
### Analysis
- [ ] Run EXPLAIN (ANALYZE, BUFFERS)
- [ ] Check for sequential scans on large tables
- [ ] Identify high "Rows Removed by Filter"
- [ ] Look for nested loops with large row counts
- [ ] Check join conditions have indexes
### Optimization
- [ ] Add missing indexes (CONCURRENTLY)
- [ ] Consider composite indexes for multi-column filters
- [ ] Evaluate partial indexes for filtered queries
- [ ] Rewrite subqueries as JOINs where beneficial
- [ ] Use EXISTS instead of IN for existence checks
- [ ] Implement keyset pagination over OFFSET
### Verification
- [ ] Compare EXPLAIN plans before/after
- [ ] Measure actual query time improvement
- [ ] Check index is actually being used
- [ ] Monitor pg_stat_statements for changes
Success Output
When this skill is successfully applied, you MUST output:
✅ SKILL COMPLETE: query-optimization
Completed:
- [x] Query performance analysis complete (EXPLAIN ANALYZE executed)
- [x] Performance bottlenecks identified (seq scans, high filter removal, etc.)
- [x] Index optimization implemented (indexes created)
- [x] Query rewriting applied (subqueries → JOINs, etc.)
- [x] Performance improvement verified (before/after metrics)
Outputs:
- Performance analysis: [baseline metrics, bottleneck report]
- Index recommendations: [file path or SQL statements]
- Optimized queries: [file paths with rewritten queries]
- Benchmark results: [before/after execution times, improvement %]
Performance improvement: [X]% faster (e.g., 95% faster: 2000ms → 100ms)
Completion Checklist
Before marking this skill as complete, verify:
- EXPLAIN ANALYZE executed for target queries (baseline captured)
- Sequential scans on large tables identified
- High "Rows Removed by Filter" values identified
- Missing indexes identified and created (use CONCURRENTLY)
- Query rewrites applied (subqueries → JOINs, IN → EXISTS, etc.)
- N+1 query problems fixed with eager loading (ORM)
- Performance re-measured (EXPLAIN ANALYZE after optimization)
- Actual query time improvement verified (not just plan estimates)
- pg_stat_statements monitoring configured (if PostgreSQL)
- Index usage confirmed (indexes actually being used)
Failure Indicators
This skill has FAILED if:
- ❌ No baseline metrics captured before optimization
- ❌ Indexes created but not used by queries (check EXPLAIN output)
- ❌ Query performance did not improve or got worse
- ❌ N+1 queries still present after "fixing"
- ❌ Sequential scans still occurring on large tables
- ❌ High "Rows Removed by Filter" persists (poor index selectivity)
- ❌ New indexes cause write performance degradation (too many indexes)
- ❌ No monitoring configured to track ongoing performance
When NOT to Use
Do NOT use this skill when:
- Queries already fast (<100ms) and no performance complaints
- Database very small (<10k rows total) where optimization overhead > benefit
- Analytical queries on a data warehouse (use materialized views/aggregates instead)
- Real-time analytics required (use specialized OLAP database instead)
- Schema cannot be modified (no ability to add indexes)
Use alternatives:
- For data warehouses: Materialized views, columnar storage, OLAP databases
- For real-time analytics: ClickHouse, BigQuery, Snowflake
- For small databases: No optimization needed, focus on application logic
- For read-only reporting: Pre-aggregated reports, caching layer
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Optimizing without measuring | Don't know if it helped | Always capture baseline with EXPLAIN ANALYZE |
| Creating indexes blindly | May not be used or slow writes | Verify index usage in EXPLAIN output |
| Over-indexing | Write performance degraded | Only index high-selectivity columns used in queries |
| Ignoring N+1 queries | Hundreds of queries instead of 1 | Use eager loading (joinedload, select_related) |
| Using OFFSET for pagination | Slow on deep pages | Use keyset pagination (WHERE id > ?) |
| Fetching unnecessary columns | More I/O than needed | Use SELECT specific columns, not SELECT * |
| Not using connection pooling | New connection overhead per request | Implement connection pooling |
| Forgetting to ANALYZE after index | Query planner doesn't know about new index | Run ANALYZE table after creating indexes |
Principles
This skill embodies:
- #2 First Principles - Understand query execution before optimizing
- #8 No Assumptions - Measure before and after, verify improvement
- #9 Research When in Doubt - Study EXPLAIN plans, consult database docs
- #3 Keep It Simple - Start with simplest fix (add index) before complex rewrites
- Performance Mindset - Benchmark-driven optimization, not premature optimization
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Usage Examples
Analyze Slow Query
Apply query-optimization skill to analyze slow EXPLAIN output and recommend index improvements for orders query
ORM Performance Audit
Apply query-optimization skill to identify N+1 queries in Django application and implement eager loading
Database Performance Baseline
Apply query-optimization skill to setup pg_stat_statements monitoring and identify top optimization targets