Scalable Index Schema for 100K Markdown Files

At 100K documents, a Markdown corpus is moderate scale. A workable design is one indexed document per Markdown file, with structured metadata fields to support compliance filters.

Logical Index Document

For Meilisearch/OpenSearch/Elasticsearch:

Document Structure

{
"id": "doc_id",

"title": "HIPAA Privacy Officer Policy",
"headings": ["Purpose", "Scope", "Responsibilities"],
"body": "Plain text of Markdown content...",
"sections": [
{ "anchor": "purpose", "heading": "Purpose", "text_snippet": "..." },
{ "anchor": "scope", "heading": "Scope", "text_snippet": "..." }
],

"domain": "security-privacy",
"document_type": "policy",
"jurisdiction": ["US"],
"regulations": ["HIPAA-164.316"],
"security_classification": "confidential",
"contains_phi": true,
"contains_pii": true,
"contains_financial_data": false,
"status": "effective",
"retention_category": "HIPAA-6Y",
"business_unit": "Compliance",
"desk": null,
"facility": "Hospital-A",
"owner_role": "Privacy Officer",
"owner_user_id": "u123",

"created_at": "2025-01-01T10:00:00Z",
"effective_date": "2025-01-01",
"review_due_date": "2027-01-01",
"retain_until": "2031-01-01",
"last_modified_at": "2025-01-10T11:00:00Z",

"content_hash": "sha256:a1b2c3...",
"worm_location_id": "worm-2025-001"
}
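As a rough illustration of how one Markdown file could be turned into the document shape above, here is a minimal sketch using only the Python standard library. The names `build_index_document` and `slugify`, the rule that the first level-1 heading becomes the title, and the 200-character snippet length are all assumptions made for this example, not part of the schema. The heading detection is naive: it does not skip fenced code blocks or strip inline markup.

```python
import hashlib
import re

# ATX-style headings only ("# Title", "## Section"); an assumption of this sketch.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def slugify(heading):
    """Lowercase the heading, drop punctuation, join words with hyphens."""
    slug = re.sub(r"[^\w\s-]", "", heading.lower())
    return re.sub(r"[\s_]+", "-", slug).strip("-")


def collapse(lines):
    """Join non-empty lines into a single whitespace-normalized string."""
    return " ".join(line.strip() for line in lines if line.strip())


def build_index_document(doc_id, markdown, metadata, snippet_chars=200):
    """Build one index document from raw Markdown plus compliance metadata."""
    title = None
    headings, sections, body_lines = [], [], []
    current = None  # lines of the section currently being read
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, text = len(match.group(1)), match.group(2)
            if level == 1 and title is None:
                title = text  # first level-1 heading is treated as the title
                current = None
            else:
                headings.append(text)
                current = []
                sections.append((text, current))
            continue
        body_lines.append(line)
        if current is not None:
            current.append(line)

    document = {
        "id": doc_id,
        "title": title or doc_id,
        "headings": headings,
        "body": collapse(body_lines),
        "sections": [
            {
                "anchor": slugify(heading),
                "heading": heading,
                "text_snippet": collapse(lines)[:snippet_chars],
            }
            for heading, lines in sections
        ],
    }
    document.update(metadata)  # domain, status, retention fields, and so on
    digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    document["content_hash"] = "sha256:" + digest
    return document
```

Hashing the raw Markdown, rather than the extracted body, means any change to the source file produces a new `content_hash`, including formatting-only edits.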

Meilisearch Index Settings

{
"uid": "documents",
"primaryKey": "id",

"searchableAttributes": [
"title",
"headings",
"body"
],

"filterableAttributes": [
"domain",
"document_type",
"jurisdiction",
"regulations",
"security_classification",
"contains_phi",
"contains_pii",
"contains_financial_data",
"status",
"retention_category",
"business_unit",
"desk",
"facility",
"owner_role",
"owner_user_id",
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"last_modified_at"
],

"sortableAttributes": [
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"last_modified_at"
]
}
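In Meilisearch, `uid` and `primaryKey` are supplied when the index is created, while the attribute lists belong to the index settings. The sketch below splits a combined object like the one above into the two request payloads and builds (without sending) the corresponding HTTP requests. The helper names, the base URL, and the API key are hypothetical, and the use of `PATCH` for the settings route assumes a reasonably recent Meilisearch version.

```python
import json
import urllib.request

# Keys that describe the index itself rather than its settings.
INDEX_KEYS = {"uid", "primaryKey"}


def split_index_config(config):
    """Separate index-creation fields from settings fields."""
    index_payload = {k: v for k, v in config.items() if k in INDEX_KEYS}
    settings_payload = {k: v for k, v in config.items() if k not in INDEX_KEYS}
    return index_payload, settings_payload


def build_requests(base_url, api_key, config):
    """Build the create-index and update-settings requests (not sent here)."""
    index_payload, settings_payload = split_index_config(config)
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    create = urllib.request.Request(
        f"{base_url}/indexes",
        data=json.dumps(index_payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    update = urllib.request.Request(
        f"{base_url}/indexes/{index_payload['uid']}/settings",
        data=json.dumps(settings_payload).encode("utf-8"),
        headers=headers,
        method="PATCH",  # assumption: older releases used a different verb
    )
    return create, update


if __name__ == "__main__":
    config = {
        "uid": "documents",
        "primaryKey": "id",
        "searchableAttributes": ["title", "headings", "body"],
        "filterableAttributes": ["domain", "status", "contains_phi"],
        "sortableAttributes": ["effective_date", "last_modified_at"],
    }
    # Hypothetical local instance and key; pass each request to
    # urllib.request.urlopen() to actually send it.
    for request in build_requests("http://localhost:7700", "MASTER_KEY", config):
        print(request.get_method(), request.full_url)
```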

OpenSearch/Elasticsearch Mapping

{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"title": {
"type": "text",
"boost": 2.0,
"analyzer": "standard"
},
"headings": {
"type": "text",
"boost": 1.5
},
"body": { "type": "text" },
"sections": {
"type": "nested",
"properties": {
"anchor": { "type": "keyword" },
"heading": { "type": "text" },
"text_snippet": { "type": "text" }
}
},
"domain": { "type": "keyword" },
"document_type": { "type": "keyword" },
"jurisdiction": { "type": "keyword" },
"regulations": { "type": "keyword" },
"security_classification": { "type": "keyword" },
"contains_phi": { "type": "boolean" },
"contains_pii": { "type": "boolean" },
"contains_financial_data": { "type": "boolean" },
"status": { "type": "keyword" },
"retention_category": { "type": "keyword" },
"business_unit": { "type": "keyword" },
"desk": { "type": "keyword" },
"facility": { "type": "keyword" },
"owner_role": { "type": "keyword" },
"owner_user_id": { "type": "keyword" },
"effective_date": { "type": "date" },
"review_due_date": { "type": "date" },
"retain_until": { "type": "date" },
"created_at": { "type": "date" },
"last_modified_at": { "type": "date" },
"content_hash": { "type": "keyword" },
"worm_location_id": { "type": "keyword" }
}
}
}
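Mapping-level `boost` has been deprecated since Elasticsearch 5.0 and was removed in 8.0, so a mapping like the one above may be rejected on newer clusters. A portable alternative is to apply the same weights at query time. The sketch below builds such a query body as a plain Python dict. The function name `build_search_query` and its parameters are hypothetical; the weights mirror the 2.0 and 1.5 values in the mapping.

```python
import json


def build_search_query(text, filters=None, size=20):
    """Build a query body that weights title and headings above body text.

    Structured fields go in the bool "filter" clause, so they narrow the
    result set without affecting relevance scoring.
    """
    term_filters = [
        {"term": {field: value}} for field, value in (filters or {}).items()
    ]
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            # "^" applies a per-field boost at query time.
                            "fields": ["title^2", "headings^1.5", "body"],
                        }
                    }
                ],
                "filter": term_filters,
            }
        },
    }


if __name__ == "__main__":
    body = build_search_query(
        "breach notification",
        filters={"status": "effective", "contains_phi": True},
    )
    print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the JSON body of a `_search` request against the index.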

Indexing Choices

Meilisearch

  • Mark all compliance metadata as filterableAttributes
  • searchableAttributes: title, headings, body only
  • sortableAttributes: date fields for temporal queries

Batch Processing

  • Batch inserts and updates in chunks of 1,000-5,000 documents
  • At that chunk size, a full load of 100K documents takes 20-100 batches
  • Use incremental indexing for ongoing updates
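The chunking itself can be sketched as a small generator. This is a minimal example that assumes documents arrive as any Python iterable; the name `batched` is chosen for illustration, and sending each batch to the search engine is left out.

```python
from itertools import islice


def batched(documents, batch_size):
    """Yield successive lists of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder documents; a real pipeline would stream parsed Markdown files.
    docs = ({"id": f"doc_{i}"} for i in range(100_000))
    for number, batch in enumerate(batched(docs, 5_000), start=1):
        # Each batch would be posted to the engine's bulk/documents endpoint here.
        print(f"batch {number}: {len(batch)} documents")
```

Because the generator consumes its input lazily, only one batch is held in memory at a time.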

Triggers

  • Git commit webhooks
  • Filesystem events
  • Scheduled full reindex (weekly)
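Whichever trigger fires, an incremental run needs to decide which documents actually changed. One possible approach, sketched below, compares the `content_hash` stored in the index with freshly computed hashes. The function name `plan_incremental_update` is hypothetical, and how the two hash maps are obtained (for example from an index export and a repository scan) is an assumption left outside this example.

```python
def plan_incremental_update(indexed_hashes, current_hashes):
    """Compare stored and fresh content hashes.

    Both arguments map document id -> content hash. Returns a pair
    (to_upsert, to_delete) of sorted id lists: new or changed documents,
    and documents that no longer exist in the source.
    """
    to_upsert = sorted(
        doc_id
        for doc_id, content_hash in current_hashes.items()
        if indexed_hashes.get(doc_id) != content_hash
    )
    to_delete = sorted(set(indexed_hashes) - set(current_hashes))
    return to_upsert, to_delete


if __name__ == "__main__":
    indexed = {"a": "sha256:111", "b": "sha256:222", "c": "sha256:333"}
    current = {"a": "sha256:111", "b": "sha256:999", "d": "sha256:444"}
    upserts, deletes = plan_incremental_update(indexed, current)
    print("upsert:", upserts)
    print("delete:", deletes)
```

A scheduled full reindex then acts as a safety net for any events the webhook or filesystem watcher missed.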

Query Patterns

Full-Text Search

POST /indexes/documents/search
{
"q": "privacy policy HIPAA",
"limit": 20,
"attributesToHighlight": ["title", "body"]
}

Compliance Filter

POST /indexes/documents/search
{
"q": "breach notification",
"filter": "regulations = 'HIPAA-164.316' AND status = 'effective'",
"limit": 20
}

PHI Access Query

POST /indexes/documents/search
{
"q": "",
"filter": "contains_phi = true AND facility = 'Hospital-A'",
"sort": ["last_modified_at:desc"],
"limit": 50
}
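The filter strings in these queries can be assembled from structured conditions rather than written by hand. The sketch below is one way to do that for equality filters joined with AND. The helper names are hypothetical, and the quoting rule (single quotes, with embedded single quotes escaped by a backslash) is an assumption that should be checked against the Meilisearch filter expression reference before use with untrusted input.

```python
def format_value(value):
    """Render a Python value as a filter literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("'", "\\'")  # assumed escaping rule
    return f"'{escaped}'"


def build_filter(conditions):
    """Join field = value conditions with AND, in insertion order."""
    return " AND ".join(
        f"{field} = {format_value(value)}" for field, value in conditions.items()
    )


def build_search_payload(q, conditions=None, sort=None, limit=20):
    """Assemble a search request body; empty q means 'match everything'."""
    payload = {"q": q, "limit": limit}
    if conditions:
        payload["filter"] = build_filter(conditions)
    if sort:
        payload["sort"] = list(sort)
    return payload


if __name__ == "__main__":
    print(build_search_payload(
        "",
        conditions={"contains_phi": True, "facility": "Hospital-A"},
        sort=["last_modified_at:desc"],
        limit=50,
    ))
```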

Scale Recommendations

Document Count   Engine                        Configuration
< 10K            Meilisearch or Postgres FTS   Single node
10K - 100K       Meilisearch                   Single node, 8 GB RAM
100K - 500K      OpenSearch                    Single node, 16 GB RAM
500K - 1M        OpenSearch                    3-node cluster
1M+              OpenSearch/Elasticsearch      Distributed cluster
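For capacity planning scripts, the table can be encoded as a simple lookup. This is a sketch only: the function name `recommend` is hypothetical, and because the table's ranges share their boundary values, the code assumes each tier's lower bound is inclusive (so exactly 100K documents falls in the 100K - 500K tier), which matches the "< 10K" and "1M+" rows.

```python
# (exclusive upper bound, engine, configuration), in ascending order.
TIERS = [
    (10_000, "Meilisearch or Postgres FTS", "Single node"),
    (100_000, "Meilisearch", "Single node, 8 GB RAM"),
    (500_000, "OpenSearch", "Single node, 16 GB RAM"),
    (1_000_000, "OpenSearch", "3-node cluster"),
]
TOP_TIER = ("OpenSearch/Elasticsearch", "Distributed cluster")


def recommend(document_count):
    """Return (engine, configuration) for a given corpus size."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    for upper_bound, engine, configuration in TIERS:
        if document_count < upper_bound:
            return engine, configuration
    return TOP_TIER


if __name__ == "__main__":
    for count in (5_000, 50_000, 100_000, 750_000, 2_000_000):
        engine, configuration = recommend(count)
        print(f"{count:>9,} documents: {engine} ({configuration})")
```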
