Search Indexing for Thousands of Markdown Files
Design the index so that full-text content and compliance metadata are both first-class search targets.
Ingestion Pipeline
Step 1: Parse
- Read frontmatter → structured metadata
- Render Markdown to text
- Extract:
  - headings (H1-H3)
  - sections[] with anchors
  - code_blocks
  - tables
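The parse step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not a definitive implementation: it assumes frontmatter is a flat block of `key: value` lines between `---` delimiters (a real pipeline would likely use a YAML parser), and the names `parse_markdown` and `slugify`, along with the anchor convention, are hypothetical. Table extraction is left out for brevity.

```python
import re

FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
HEADING_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")  # H1-H3 only
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def slugify(text):
    """Turn heading text into an anchor (assumed convention: lowercase, hyphens)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def parse_markdown(source):
    """Split frontmatter from the body and collect headings and code blocks."""
    metadata = {}
    body = source
    match = FRONTMATTER_RE.match(source)
    if match:
        for line in match.group(1).splitlines():
            # Simplification: flat "key: value" pairs only, no nested YAML.
            key, sep, value = line.partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
        body = source[match.end():]

    headings, code_blocks = [], []
    in_code, current = False, []
    for line in body.splitlines():
        if line.startswith(FENCE):
            if in_code:  # closing fence: store the collected block
                code_blocks.append("\n".join(current))
                current = []
            in_code = not in_code
            continue
        if in_code:
            current.append(line)
            continue
        found = HEADING_RE.match(line)
        if found:
            title = found.group(2)
            headings.append(
                {"level": len(found.group(1)), "text": title, "anchor": slugify(title)}
            )
    return {"metadata": metadata, "headings": headings, "code_blocks": code_blocks}
```

Lines inside fenced code blocks are skipped when looking for headings, so a `#` comment in a shell snippet is not mistaken for an H1.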
Step 2: Enrich
- Normalize regulation codes
- Normalize jurisdictions
- Normalize retention buckets
- Generate additional fields:
  - h1_h2_text
  - keywords
  - embeddings (for semantic search)
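A minimal sketch of the normalization part of this step is shown below. The alias table, the canonical `HIPAA-164.316` style for regulation codes, and the retention thresholds are all assumptions made for illustration; the real values would come from whoever owns the compliance taxonomy.

```python
import re

# Hypothetical alias table; extend it with the variants seen in real frontmatter.
JURISDICTION_ALIASES = {
    "usa": "US",
    "united states": "US",
    "u.s.": "US",
    "uk": "GB",
    "united kingdom": "GB",
}


def normalize_regulation(code):
    """Canonicalize e.g. 'hipaa 164.316' to 'HIPAA-164.316' (assumed format)."""
    parts = re.split(r"[\s_\-]+", code.strip())
    return "-".join([parts[0].upper()] + parts[1:])


def normalize_jurisdiction(value):
    """Map known aliases to a short code; otherwise just uppercase the input."""
    cleaned = value.strip()
    return JURISDICTION_ALIASES.get(cleaned.lower(), cleaned.upper())


def retention_bucket(years):
    """Map a retention period in years to a coarse bucket (thresholds assumed)."""
    if years <= 1:
        return "short"
    if years <= 7:
        return "medium"
    return "long"
```

Normalizing before indexing means a filter such as `regulations:HIPAA-164.316` matches regardless of how the author typed the code in the frontmatter.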
Step 3: Index
Create one document per Markdown file with:
Text Fields:
- title
- body
- headings
- tags
- regulation_text (codes + human labels)
Facet/Filterable Fields:
- domain
- document_type
- jurisdiction
- regulations
- security_classification
- status
- retention_category
- business_unit
- owner
Date Fields (Sortable):
- effective_date
- review_due_date
- retain_until
- last_modified_at
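One way to assemble the per-file document from the field groups listed above is sketched here. The function name `build_index_document` and its signature are hypothetical, and the sketch assumes date values arrive either as `date` objects or as date-only `YYYY-MM-DD` strings.

```python
from datetime import date

TEXT_META_FIELDS = ("title", "tags", "regulation_text")
FACET_FIELDS = (
    "domain", "document_type", "jurisdiction", "regulations",
    "security_classification", "status", "retention_category",
    "business_unit", "owner",
)
DATE_FIELDS = ("effective_date", "review_due_date", "retain_until", "last_modified_at")


def to_iso_date(value):
    """Accept a date object or a 'YYYY-MM-DD' string and return an ISO string."""
    if isinstance(value, date):
        return value.isoformat()
    return date.fromisoformat(value).isoformat()  # raises ValueError on bad input


def build_index_document(doc_id, metadata, body, headings):
    """Build one index document per Markdown file from parsed parts."""
    doc = {"doc_id": doc_id, "body": body, "headings": headings}
    for field in TEXT_META_FIELDS + FACET_FIELDS:
        if field in metadata:
            doc[field] = metadata[field]
    for field in DATE_FIELDS:
        if field in metadata:
            doc[field] = to_iso_date(metadata[field])
    # Frontmatter keys outside the known field groups are dropped on purpose.
    return doc
```

Dropping unknown keys keeps stray frontmatter from silently creating new fields in the index.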
Index Schema Example (OpenSearch/Elasticsearch)
{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "title": { "type": "text", "boost": 2.0 },
      "headings": { "type": "text", "boost": 1.5 },
      "body": { "type": "text" },
      "tags": { "type": "keyword" },
      "domain": { "type": "keyword" },
      "document_type": { "type": "keyword" },
      "jurisdiction": { "type": "keyword" },
      "regulations": { "type": "keyword" },
      "security_classification": { "type": "keyword" },
      "status": { "type": "keyword" },
      "retention_category": { "type": "keyword" },
      "business_unit": { "type": "keyword" },
      "owner": { "type": "keyword" },
      "effective_date": { "type": "date" },
      "review_due_date": { "type": "date" },
      "retain_until": { "type": "date" },
      "last_modified_at": { "type": "date" },
      "contains_phi": { "type": "boolean" },
      "contains_pii": { "type": "boolean" },
      "contains_financial": { "type": "boolean" }
    }
  }
}
Search Behavior
Default Search
- Full-text over title + headings + body
- Boosted matches on title, headings, regulations
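The default search could be expressed as a query-time `multi_match`, sketched below as a plain Python dict in OpenSearch/Elasticsearch query DSL. The boosts for title (2.0) and headings (1.5) mirror the schema example; the boost on regulations is an assumed value, and `build_default_query` is a hypothetical helper name.

```python
def build_default_query(text):
    """Return a query body that searches title, headings, regulations and body."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # "field^N" applies a query-time boost of N to that field.
                # The regulations boost is an assumption, not from the schema.
                "fields": ["title^2", "headings^1.5", "regulations^2", "body"],
            }
        }
    }
```

Because regulations is a keyword field, it only contributes when the query text matches a stored code exactly, for example `HIPAA-164.316`.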
Filter Patterns
| Filter Type | Example |
|---|---|
| Classification | security_classification:confidential |
| Regulation | regulations:HIPAA-164.316 |
| Jurisdiction | jurisdiction:US |
| Lifecycle | status:effective |
| Business Unit | business_unit:Compliance |
| PHI/PII | contains_phi:true |
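As a rough sketch, filter expressions like those in the table can be translated into `term` clauses inside a `bool` query's `filter` context, which restricts results without affecting relevance scores. The helper names and the `field:value` parsing convention are assumptions for illustration.

```python
BOOLEAN_FIELDS = {"contains_phi", "contains_pii", "contains_financial"}
FILTERABLE_FIELDS = BOOLEAN_FIELDS | {
    "domain", "document_type", "jurisdiction", "regulations",
    "security_classification", "status", "retention_category",
    "business_unit", "owner",
}


def parse_filter(expr):
    """Turn 'field:value' into a term clause, rejecting unknown fields."""
    field, sep, value = expr.partition(":")
    if not sep or not value or field not in FILTERABLE_FIELDS:
        raise ValueError(f"unsupported filter: {expr!r}")
    if field in BOOLEAN_FIELDS:
        if value not in ("true", "false"):
            raise ValueError(f"expected true or false: {expr!r}")
        return {"term": {field: value == "true"}}
    return {"term": {field: value}}


def build_filtered_query(text, filters):
    """Combine a full-text match with exact-match filters."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": ["title", "headings", "body"]}}
                ],
                "filter": [parse_filter(f) for f in filters],
            }
        }
    }
```

For example, `build_filtered_query("access review", ["status:effective", "contains_phi:true"])` limits results to effective documents flagged as containing PHI.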
Semantic Search Layer
- Retrieval-augmented generation (RAG) over Markdown content
- Answers always cite doc/section IDs for auditability
- Hybrid: combine vector similarity with keyword matching
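The notes above do not say how the keyword and vector results should be combined. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the prescribed method. Each input is a ranked list of doc/section IDs, best first; the ID format and the constant `k=60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["pol-001#scope", "pol-007#roles", "pol-003#audit"]
vector_hits = ["pol-007#roles", "pol-009#intro", "pol-001#scope"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Since the fused list still carries doc/section IDs, an answer generated from the top results can cite them directly.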
Maintenance
Incremental Updates
- Trigger on Git commits or file changes
- Use webhooks or filesystem events
- Delta indexing (only changed documents)
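Delta indexing can be approximated by comparing a content hash of each file against the hash recorded at the last indexing run. The sketch below assumes those hashes are stored somewhere retrievable (for instance as a field on each index document); `plan_delta` and its inputs are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(current_files, indexed_hashes):
    """Decide what to reindex and what to remove.

    current_files: path -> current file text
    indexed_hashes: path -> hash recorded when the file was last indexed
    """
    to_index = sorted(
        path
        for path, text in current_files.items()
        if indexed_hashes.get(path) != content_hash(text)
    )
    to_delete = sorted(path for path in indexed_hashes if path not in current_files)
    return to_index, to_delete
```

When updates are triggered by Git commits, the changed paths from the commit can narrow `current_files` further, with the hash check acting as a guard against reindexing files whose content did not change.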
Integrity Checks
Periodic scans for:
- Documents with missing mandatory metadata
- Un-indexable content
- Stale index entries
Report exceptions to the compliance team.
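A periodic scan along these lines might look like the sketch below. The list of mandatory fields is an assumption, and "not indexed" is used here as a rough proxy for un-indexable content, since a source document that never reaches the index usually failed somewhere in the pipeline.

```python
# Assumed mandatory fields; the real list is a compliance decision.
MANDATORY_FIELDS = ("title", "owner", "security_classification", "status", "retain_until")


def integrity_report(source_docs, indexed_ids):
    """Compare source metadata against the index and collect exceptions.

    source_docs: doc_id -> metadata dict parsed from frontmatter
    indexed_ids: set of doc IDs currently present in the index
    """
    missing = {}
    for doc_id, metadata in source_docs.items():
        absent = [field for field in MANDATORY_FIELDS if not metadata.get(field)]
        if absent:
            missing[doc_id] = absent
    source_ids = set(source_docs)
    return {
        "missing_metadata": missing,
        "stale_entries": sorted(indexed_ids - source_ids),  # indexed, source gone
        "not_indexed": sorted(source_ids - indexed_ids),    # source exists, no entry
    }
```

The returned dict can be serialized as-is into whatever report the compliance team receives.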
Performance Considerations
| Scale | Recommendation |
|---|---|
| <10K docs | Single-node Meilisearch or Postgres FTS |
| 10K-100K docs | Meilisearch or single OpenSearch node |
| 100K-1M docs | OpenSearch/Elasticsearch cluster |
| 1M+ docs | Distributed cluster with sharding |