
Search Indexing for Thousands of Markdown Files

Design the index so that full-text content and compliance metadata carry equal weight: every document should be searchable by its text and filterable by its metadata.

Ingestion Pipeline

Step 1: Parse

  • Read frontmatter → structured metadata
  • Render Markdown to text
  • Extract:
    • headings (H1-H3)
    • sections[] with anchors
    • code_blocks
    • tables
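The parse step above can be sketched with the Python standard library alone. This is a minimal illustration, not a definitive parser: it assumes frontmatter is a flat block of `key: value` lines between `---` markers (a real pipeline would likely use a YAML parser), it only recognizes ATX-style `#` headings, and it skips table extraction. The names `parse_markdown` and `slugify` are hypothetical.

```python
import re

FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n?", re.DOTALL)
HEADING_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")  # H1-H3 only
FENCE = "`" * 3  # opening/closing marker of a fenced code block


def slugify(text):
    """Turn heading text into a URL-style anchor, e.g. 'Data Retention' -> 'data-retention'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def parse_markdown(raw):
    """Split one Markdown file into metadata, headings, sections, and code blocks."""
    metadata = {}
    match = FRONTMATTER_RE.match(raw)
    if match:
        # Assumption: flat "key: value" pairs only, no nested YAML.
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
        raw = raw[match.end():]

    headings, sections, code_blocks = [], [], []
    current = {"anchor": "", "heading": "", "text": []}
    in_fence, fence_lines = False, []
    for line in raw.splitlines():
        if line.startswith(FENCE):
            if in_fence:  # closing fence: store the collected block
                code_blocks.append("\n".join(fence_lines))
                fence_lines = []
            in_fence = not in_fence
            continue
        if in_fence:
            fence_lines.append(line)  # '#' lines inside code are not headings
            continue
        m = HEADING_RE.match(line)
        if m:
            sections.append(current)
            title = m.group(2)
            headings.append({"level": len(m.group(1)), "text": title})
            current = {"anchor": slugify(title), "heading": title, "text": []}
        else:
            current["text"].append(line)
    sections.append(current)

    # Join section text and drop an empty preamble section, if any.
    sections = [
        {**s, "text": "\n".join(s["text"]).strip()}
        for s in sections
        if s["heading"] or any(t.strip() for t in s["text"])
    ]
    return {
        "metadata": metadata,
        "headings": headings,
        "sections": sections,
        "code_blocks": code_blocks,
    }
```

Each section keeps its own anchor, which is what later allows search results and generated answers to point at a specific part of a document rather than the whole file.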

Step 2: Enrich

  • Normalize regulation codes
  • Normalize jurisdictions
  • Normalize retention buckets
  • Generate additional fields:
    • h1_h2_text
    • keywords
    • embeddings (for semantic search)
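As a rough sketch of the enrichment step, the functions below normalize the three metadata values and derive `h1_h2_text` and `keywords`. The alias table, the retention thresholds, the stopword list, and the `retention_years` input field are all assumptions made for illustration; real values would come from the compliance team. Embedding generation is left out because it depends on whichever model the deployment uses.

```python
import re
from collections import Counter

# Hypothetical lookup tables, for illustration only.
JURISDICTION_ALIASES = {
    "usa": "US", "united states": "US", "u.s.": "US",
    "uk": "GB", "united kingdom": "GB",
}
RETENTION_BUCKETS = [(1, "short"), (7, "standard"), (float("inf"), "long")]
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "all", "must"}


def normalize_regulation(code):
    """'hipaa 164.316' or 'HIPAA_164.316' -> 'HIPAA-164.316'."""
    parts = re.split(r"[\s_\-]+", code.strip())
    return "-".join(p.upper() for p in parts if p)


def normalize_jurisdiction(value):
    """Map known aliases to a code; otherwise upper-case the input."""
    key = value.strip().lower()
    return JURISDICTION_ALIASES.get(key, value.strip().upper())


def retention_bucket(years):
    """Return the first bucket whose upper limit (in years) covers the value."""
    for limit, name in RETENTION_BUCKETS:
        if years <= limit:
            return name


def top_keywords(text, n=5):
    """Most frequent words of three or more letters, excluding stopwords."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]


def enrich(metadata, headings, body):
    """Build the normalized and generated fields for one parsed document."""
    return {
        "regulations": sorted(
            {normalize_regulation(c) for c in metadata.get("regulations", [])}
        ),
        "jurisdiction": normalize_jurisdiction(metadata.get("jurisdiction", "")),
        "retention_category": retention_bucket(metadata.get("retention_years", 0)),
        "h1_h2_text": " ".join(h["text"] for h in headings if h["level"] <= 2),
        "keywords": top_keywords(body),
    }
```

Normalizing before indexing matters because the facet fields are exact-match: `hipaa 164.316` and `HIPAA-164.316` would otherwise land in two different filter buckets.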

Step 3: Index

Create one document per Markdown file with:

Text Fields:

  • title
  • body
  • headings
  • tags
  • regulation_text (codes + human labels)

Facet/Filterable Fields:

  • domain
  • document_type
  • jurisdiction
  • regulations
  • security_classification
  • status
  • retention_category
  • business_unit
  • owner

Date Fields (Sortable):

  • effective_date
  • review_due_date
  • retain_until
  • last_modified_at
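The field lists above could be assembled into one index document per file roughly as follows. This is a sketch under assumptions: `build_index_document` is a hypothetical helper, `doc_id` is derived from the file path (the source does not say how IDs are assigned), dates are assumed to be ISO 8601 strings, and the regulation label table is illustrative.

```python
import hashlib
from datetime import datetime

FACET_FIELDS = (
    "domain", "document_type", "jurisdiction", "regulations",
    "security_classification", "status", "retention_category",
    "business_unit", "owner",
)
DATE_FIELDS = ("effective_date", "review_due_date", "retain_until", "last_modified_at")

# Illustrative label table (an assumption); real labels need a maintained registry.
REGULATION_LABELS = {"HIPAA-164.316": "HIPAA Security Rule documentation requirements"}


def build_index_document(path, metadata, headings, body):
    """Assemble the single index document for one Markdown file."""
    regulations = metadata.get("regulations", [])
    doc = {
        # Assumption: a stable ID derived from the repository path.
        "doc_id": hashlib.sha1(path.encode("utf-8")).hexdigest(),
        "title": metadata.get("title") or (headings[0] if headings else path),
        "body": body,
        "headings": headings,
        "tags": metadata.get("tags", []),
        # Codes plus human labels, so either form is findable by text search.
        "regulation_text": " ".join(
            f"{code} {REGULATION_LABELS.get(code, '')}".strip() for code in regulations
        ),
    }
    for field in FACET_FIELDS:
        if field in metadata:
            doc[field] = metadata[field]
    for field in DATE_FIELDS:
        if field in metadata:
            value = str(metadata[field])
            datetime.fromisoformat(value)  # raises ValueError on a malformed date
            doc[field] = value
    return doc
```

Metadata keys outside the known field lists are dropped rather than indexed, which keeps the mapping from growing unintentionally when authors add ad hoc frontmatter.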

Index Schema Example (OpenSearch/Elasticsearch)

{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "title": { "type": "text", "boost": 2.0 },
      "headings": { "type": "text", "boost": 1.5 },
      "body": { "type": "text" },
      "tags": { "type": "keyword" },
      "domain": { "type": "keyword" },
      "document_type": { "type": "keyword" },
      "jurisdiction": { "type": "keyword" },
      "regulations": { "type": "keyword" },
      "security_classification": { "type": "keyword" },
      "status": { "type": "keyword" },
      "retention_category": { "type": "keyword" },
      "business_unit": { "type": "keyword" },
      "owner": { "type": "keyword" },
      "effective_date": { "type": "date" },
      "review_due_date": { "type": "date" },
      "retain_until": { "type": "date" },
      "last_modified_at": { "type": "date" },
      "contains_phi": { "type": "boolean" },
      "contains_pii": { "type": "boolean" },
      "contains_financial": { "type": "boolean" }
    }
  }
}

Search Behavior

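  • Full-text search over title, headings, and body
  • Boosted matches on title, headings, and regulations; apply the boosts at query time where the engine rejects mapping-level boost (Elasticsearch removed it in 8.0)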
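One way to express this behavior is a `multi_match` query with per-field boosts, built here as a plain Python dictionary that can be serialized and sent to the search endpoint. The helper name `build_text_query` is hypothetical, and the boost of 1.5 on `regulations` is an assumption (the schema only gives values for `title` and `headings`).

```python
def build_text_query(user_query, size=10):
    """Build a query-DSL body that searches text fields with query-time boosts."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": user_query,
                # "field^N" multiplies that field's score contribution by N.
                "fields": ["title^2", "headings^1.5", "regulations^1.5", "body"],
            }
        },
    }
```

Because `regulations` is a keyword field, it contributes only when the query matches a whole code such as `HIPAA-164.316`, not individual words within it.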

Filter Patterns

| Filter Type | Example |
|---|---|
| Classification | security_classification:confidential |
| Regulation | regulations:HIPAA-164.316 |
| Jurisdiction | jurisdiction:US |
| Lifecycle | status:effective |
| Business Unit | business_unit:Compliance |
| PHI/PII | contains_phi:true |
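The `field:value` patterns above could be translated into exact-match `term` filters inside a `bool` query, as in the sketch below. The helper names `parse_filter` and `build_filtered_query` are hypothetical, and the allow-list simply mirrors the keyword and boolean fields from the schema.

```python
KEYWORD_FIELDS = {
    "domain", "document_type", "jurisdiction", "regulations",
    "security_classification", "status", "retention_category",
    "business_unit", "owner", "tags",
}
BOOLEAN_FIELDS = {"contains_phi", "contains_pii", "contains_financial"}


def parse_filter(expr):
    """Convert 'field:value' into a term clause, rejecting unknown fields."""
    field, sep, value = expr.partition(":")  # split on the first colon only
    if not sep or field not in KEYWORD_FIELDS | BOOLEAN_FIELDS:
        raise ValueError(f"unsupported filter: {expr!r}")
    if field in BOOLEAN_FIELDS:
        if value not in ("true", "false"):
            raise ValueError(f"expected true or false in {expr!r}")
        return {"term": {field: value == "true"}}
    return {"term": {field: value}}


def build_filtered_query(text, filters):
    """Combine scored full-text matching with non-scoring metadata filters."""
    return {
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": text, "fields": ["title", "headings", "body"]}}],
                "filter": [parse_filter(f) for f in filters],
            }
        }
    }
```

Keyword fields match exactly and are case-sensitive, so `security_classification:Confidential` would not match a document stored as `confidential`; normalizing values at ingestion avoids this.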

Semantic Search Layer

  • Retrieval-augmented generation (RAG) over Markdown content
  • Answers always cite doc/section IDs for auditability
  • Hybrid: combine vector similarity with keyword matching
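The source does not say how the keyword and vector results are combined. One common option is reciprocal rank fusion (RRF), sketched below as a swapped-in technique rather than the prescribed method. It assumes each retriever returns an ordered list of section IDs in a hypothetical `doc_id#anchor` form, so the fused results can still be cited at section level.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit adds 1 / (k + rank) to its ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, section_id in enumerate(ranking, start=1):
            scores[section_id] = scores.get(section_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))


# Hypothetical result lists from the two retrievers.
keyword_hits = ["retention#scope", "access#roles", "backup#schedule"]
vector_hits = ["access#roles", "incident#reporting", "retention#scope"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF uses only rank positions, so it avoids having to calibrate keyword relevance scores against vector similarity scores, which are on different scales.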

Maintenance

Incremental Updates

  • Trigger on Git commits or file changes
  • Use webhooks or filesystem events
  • Delta indexing (only changed documents)
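Delta indexing can be sketched as a comparison between current file contents and the hashes recorded at the last indexing run. This assumes the pipeline stores a path-to-hash map somewhere durable; `plan_delta` and `content_hash` are hypothetical names. With Git as the trigger, the changed paths could instead come from the commit diff.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(current_files, indexed_hashes):
    """Decide which paths to (re)index and which to delete from the index.

    current_files: path -> current file text
    indexed_hashes: path -> content hash recorded at the last indexing run
    """
    to_index = sorted(
        path for path, text in current_files.items()
        if indexed_hashes.get(path) != content_hash(text)  # new or changed
    )
    to_delete = sorted(path for path in indexed_hashes if path not in current_files)
    return to_index, to_delete
```

Unchanged files are skipped entirely, so a commit touching three documents triggers three index writes regardless of how large the corpus is.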

Integrity Checks

Run periodic scans for:

  • Documents with missing mandatory metadata
  • Un-indexable content
  • Stale index entries

Report exceptions to the compliance team.
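A periodic scan along these lines could produce the exception report. This is a sketch: the list of mandatory fields is an assumption (the source does not enumerate them), a document that failed to parse is represented as `None`, and `integrity_report` is a hypothetical name.

```python
# Assumed set of mandatory fields; the real list is a compliance decision.
MANDATORY_FIELDS = ("title", "owner", "status", "security_classification", "retention_category")


def integrity_report(source_docs, indexed_ids):
    """Compare the source repository with the index and collect exceptions.

    source_docs: doc_id -> metadata dict, or None if the file could not be parsed
    indexed_ids: iterable of doc_ids currently present in the search index
    """
    report = {"missing_metadata": {}, "unindexable": [], "stale_index_entries": []}
    for doc_id, metadata in sorted(source_docs.items()):
        if metadata is None:
            report["unindexable"].append(doc_id)
            continue
        missing = [f for f in MANDATORY_FIELDS if not metadata.get(f)]
        if missing:
            report["missing_metadata"][doc_id] = missing
    # Index entries whose source document no longer exists.
    report["stale_index_entries"] = sorted(set(indexed_ids) - set(source_docs))
    return report
```

The returned dictionary groups exceptions by kind, so it can be serialized as-is and routed to whoever owns each category.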

Performance Considerations

| Scale | Recommendation |
|---|---|
| <10K docs | Single-node Meilisearch or Postgres FTS |
| 10K-100K docs | Meilisearch or single OpenSearch node |
| 100K-1M docs | OpenSearch/Elasticsearch cluster |
| 1M+ docs | Distributed cluster with sharding |
