Search Indexing for Thousands of Markdown Files

Design the index so that full-text content and compliance metadata are both first-class: every document must be searchable by its text and filterable by its compliance fields.

Ingestion Pipeline

Step 1: Parse

Read frontmatter → structured metadata
Render Markdown to text
Extract:
- headings (H1-H3)
- sections[] with anchors
- code_blocks
- tables
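
The parse step above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not a full Markdown parser: `split_frontmatter`, `extract_sections`, and `slugify` are hypothetical helper names, and the frontmatter reader assumes flat `key: value` lines (a real pipeline would more likely use a YAML parser and a proper Markdown renderer).

```python
import re

FENCE = re.compile(r"^(```|~~~)")
HEADING = re.compile(r"^(#{1,3})\s+(.+?)\s*#*\s*$")  # H1-H3 only


def slugify(text):
    """Lowercase, drop punctuation, join words with hyphens."""
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return re.sub(r"[\s_]+", "-", text).strip("-")


def split_frontmatter(source):
    """Return (metadata, body). Assumes flat `key: value` frontmatter lines."""
    lines = source.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, source
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, source  # unterminated frontmatter: treat everything as body


def extract_sections(body):
    """Collect H1-H3 headings with anchors, skipping fenced code blocks."""
    sections, in_fence = [], False
    for line in body.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            title = match.group(2)
            sections.append(
                {"level": len(match.group(1)), "title": title, "anchor": slugify(title)}
            )
    return sections
```

Skipping fenced blocks matters because a `#` comment inside a code sample would otherwise be indexed as a heading.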

Step 2: Enrich

Normalize regulation codes
Normalize jurisdictions
Normalize retention buckets
Generate additional fields:
- h1_h2_text
- keywords
- embeddings (for semantic search)
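
A sketch of the normalization part of this step, under stated assumptions: the alias table, the retention thresholds, and the `retention_years` input field are all hypothetical, and the canonical regulation format (`HIPAA-164.316`) is inferred from the filter example later in this page. A real deployment would presumably load these vocabularies from a source the compliance team controls.

```python
import re

# Hypothetical lookup tables, for illustration only.
JURISDICTION_ALIASES = {
    "usa": "US",
    "united states": "US",
    "uk": "GB",
    "united kingdom": "GB",
}
RETENTION_BUCKETS = [(1, "short"), (7, "standard"), (25, "long")]  # (max years, bucket)


def normalize_regulation(code):
    """Map 'hipaa 164.316' or 'HIPAA_164.316' to 'HIPAA-164.316'."""
    parts = re.split(r"[\s_\-]+", code.strip())
    return "-".join([parts[0].upper()] + parts[1:])


def normalize_jurisdiction(value):
    """Resolve known aliases; otherwise fall back to the upper-cased input."""
    key = value.strip().lower()
    return JURISDICTION_ALIASES.get(key, value.strip().upper())


def retention_bucket(years):
    """Pick the first bucket whose upper bound covers the retention period."""
    for max_years, bucket in RETENTION_BUCKETS:
        if years <= max_years:
            return bucket
    return "permanent"


def enrich(meta):
    """Return a copy of the metadata with normalized compliance fields."""
    out = dict(meta)
    out["regulations"] = sorted(
        {normalize_regulation(r) for r in meta.get("regulations", [])}
    )
    out["jurisdiction"] = normalize_jurisdiction(meta.get("jurisdiction", ""))
    if "retention_years" in meta:
        out["retention_category"] = retention_bucket(meta["retention_years"])
    return out
```

Normalizing before indexing means a filter on one canonical value finds every document, however its author spelled the code.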

Step 3: Index

Create one document per Markdown file with:

Text Fields:

title
body
headings
tags
regulation_text (codes + human labels)

Facet/Filterable Fields:

domain
document_type
jurisdiction
regulations
security_classification
status
retention_category
business_unit
owner

Date Fields (Sortable):

effective_date
review_due_date
retain_until
last_modified_at
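
One way to ship these per-file documents is the bulk endpoint that both OpenSearch and Elasticsearch expose, which takes a newline-delimited body of action and document pairs. The sketch below only builds that body; the index name `compliance-docs` and the function name are assumptions, and sending the payload is left to whichever client the deployment uses.

```python
import json


def to_bulk_ndjson(docs, index_name="compliance-docs"):
    """Serialize documents into the newline-delimited body used by _bulk."""
    lines = []
    for doc in docs:
        # Using doc_id as _id makes re-indexing a file an overwrite,
        # not a duplicate entry.
        action = {"index": {"_index": index_name, "_id": doc["doc_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk body must be terminated by a final newline.
    return "\n".join(lines) + "\n"
```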

Index Schema Example (OpenSearch/Elasticsearch)

{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "title": { "type": "text" },
      "headings": { "type": "text" },
      "body": { "type": "text" },
      "regulation_text": { "type": "text" },
      "tags": { "type": "keyword" },
      "domain": { "type": "keyword" },
      "document_type": { "type": "keyword" },
      "jurisdiction": { "type": "keyword" },
      "regulations": { "type": "keyword" },
      "security_classification": { "type": "keyword" },
      "status": { "type": "keyword" },
      "retention_category": { "type": "keyword" },
      "business_unit": { "type": "keyword" },
      "owner": { "type": "keyword" },
      "effective_date": { "type": "date" },
      "review_due_date": { "type": "date" },
      "retain_until": { "type": "date" },
      "last_modified_at": { "type": "date" },
      "contains_phi": { "type": "boolean" },
      "contains_pii": { "type": "boolean" },
      "contains_financial": { "type": "boolean" }
    }
  }
}

Search Behavior

Default Search

Full-text over title + headings + body
Boost matches on title, headings, and regulations at query time (mapping-level boosts were removed in Elasticsearch 8)
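
As a rough illustration of the default search, the function below builds a `multi_match` query body with per-field boosts written in the `field^weight` syntax. The weights and the function name are illustrative assumptions, not tuned values.

```python
def default_search_query(text, size=10):
    """Build a query body that weights title and heading matches above body text."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": [
                    "title^2",
                    "headings^1.5",
                    # `regulations` is a keyword field, so it only matches when
                    # the query is exactly a regulation code.
                    "regulations^1.5",
                    "body",
                ],
            }
        },
    }
```

Keeping the weights in the query rather than in the mapping lets them be adjusted without reindexing.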

Filter Patterns

Filter Type	Example
Classification	`security_classification:confidential`
Regulation	`regulations:HIPAA-164.316`
Jurisdiction	`jurisdiction:US`
Lifecycle	`status:effective`
Business Unit	`business_unit:Compliance`
PHI/PII	`contains_phi:true`
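
The `field:value` patterns in the table can be translated into `term` clauses in the `filter` context of a `bool` query, where they narrow the results without affecting relevance scores. This is a sketch: the function names are hypothetical, and the allowed-field list simply mirrors the keyword and boolean fields from the schema example.

```python
FILTERABLE = {
    "domain", "document_type", "jurisdiction", "regulations",
    "security_classification", "status", "retention_category",
    "business_unit", "owner",
    "contains_phi", "contains_pii", "contains_financial",
}
BOOLEAN_FIELDS = {"contains_phi", "contains_pii", "contains_financial"}


def parse_filters(expressions):
    """Turn ['status:effective', 'jurisdiction:US'] into term filter clauses."""
    clauses = []
    for expr in expressions:
        field, sep, value = expr.partition(":")
        if not sep or field not in FILTERABLE:
            raise ValueError(f"unsupported filter: {expr!r}")
        if field in BOOLEAN_FIELDS:
            value = value.lower() == "true"
        clauses.append({"term": {field: value}})
    return clauses


def filtered_query(text, expressions):
    """Combine a full-text match on body with exact-match metadata filters."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                "filter": parse_filters(expressions),
            }
        }
    }
```

For example, `filtered_query("access logs", ["regulations:HIPAA-164.316", "contains_phi:true"])` restricts a body search to PHI-bearing documents tagged with that regulation.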

Semantic Search Layer

RAG over Markdown content
Answers always cite doc/section IDs for auditability
Hybrid: combine vector similarity with keyword matching
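
The page does not say how the vector and keyword results are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the prescribed method. Each input is a list of IDs in ranked order, and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

If the IDs are section-level (for instance a document ID plus a heading anchor), the fused list can feed directly into the citations that answers are expected to carry.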

Maintenance

Incremental Updates

Trigger on Git commits or file changes
Use webhooks or filesystem events
Delta indexing (only changed documents)
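
A minimal sketch of delta detection, assuming the indexer keeps a manifest of path-to-content-hash pairs from its previous run (the manifest and the function names are hypothetical). A Git-based pipeline could derive the same two lists from the commit diff instead.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(current_files, manifest):
    """Compare current files (path -> text) with the last manifest (path -> hash).

    Returns (paths to index or re-index, paths to delete from the index).
    """
    to_index = sorted(
        path
        for path, text in current_files.items()
        if manifest.get(path) != content_hash(text)
    )
    to_delete = sorted(path for path in manifest if path not in current_files)
    return to_index, to_delete
```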

Integrity Checks

Periodic scans for:

Documents with missing mandatory metadata
Un-indexable content
Stale index entries
Report exceptions to compliance team
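
The periodic scan might look roughly like this. The mandatory-field list is an assumption (the page does not define one), `source_docs` maps each document ID to its parsed metadata, and `indexed_ids` is the set of IDs currently in the index. Documents present in the source but absent from the index are used here as a proxy for un-indexable content.

```python
# Assumed mandatory fields; adjust to the organization's metadata policy.
MANDATORY_FIELDS = ("title", "status", "security_classification", "owner")


def integrity_report(source_docs, indexed_ids):
    """Summarize metadata gaps and mismatches between source and index."""
    missing = {
        doc_id: [f for f in MANDATORY_FIELDS if not meta.get(f)]
        for doc_id, meta in source_docs.items()
    }
    return {
        "missing_metadata": {k: v for k, v in missing.items() if v},
        # In the index but no longer in the source tree.
        "stale_index_entries": sorted(set(indexed_ids) - set(source_docs)),
        # In the source tree but never made it into the index.
        "not_indexed": sorted(set(source_docs) - set(indexed_ids)),
    }
```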

Performance Considerations

Scale	Recommendation
<10K docs	Single-node Meilisearch or Postgres FTS
10K-100K	Meilisearch or single OpenSearch node
100K-1M	OpenSearch/Elasticsearch cluster
1M+	Distributed cluster with sharding
