React DMS GUI Specification
An ideal React GUI for a sophisticated DMS looks like a task-focused “workspace” rather than just a file browser: opinionated navigation, powerful search + filters, rich preview, and inline workflows (versioning, approvals, annotations).
Below is a concrete breakdown you can turn into components and routes.
Core layout and navigation
Aim for a 3‑pane, app-like layout using something like MUI, Tailwind, or a dashboard kit as a base.
- Left sidebar (global navigation):
- Sections: “My Work”, “Documents”, “Projects/Spaces”, “Tasks”, “Admin”.
- Saved searches / smart folders (e.g., “Needs my review”, “Recently viewed”, “Expiring soon”).
- Tenant / workspace switcher for multi-tenant SaaS.
- Top bar:
- Global search bar with quick actions (search across title, content, tags, people).
- User menu: profile, notifications, theme toggle, organization context.
- Quick-create button (“New document”, “New folder”, “New request”).
- Main content:
- Swappable views (table, grid, split view) for document lists.
- Right-side contextual panel for metadata, activity, and comments.
For implementation speed, starting with a React admin/dashboard kit (e.g., Material UI dashboard templates, TailAdmin, etc.) gives a solid layout and theming foundation.
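The shell described above can be captured as plain data before any components are built. A minimal sketch of a navigation config, with illustrative labels and routes (not tied to any specific UI kit):

```typescript
// Hypothetical shape for the app shell's navigation; names are illustrative.
interface NavItem {
  label: string;
  route: string;
  children?: NavItem[];
}

interface ShellConfig {
  sidebar: NavItem[];
  savedSearches: { label: string; query: string }[];
}

const shellConfig: ShellConfig = {
  sidebar: [
    { label: "My Work", route: "/dashboard" },
    { label: "Documents", route: "/documents" },
    { label: "Projects/Spaces", route: "/spaces" },
    { label: "Tasks", route: "/tasks" },
    { label: "Admin", route: "/admin" },
  ],
  savedSearches: [
    { label: "Needs my review", query: "status:pending assignee:me" },
    { label: "Recently viewed", query: "viewed:me sort:recent" },
    { label: "Expiring soon", query: "retain_until:<90d" },
  ],
};
```

Driving the sidebar from a config like this keeps per-tenant customization (hiding sections, pinning saved searches) a data change rather than a component change.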
Document browsing and search
The “Documents” area should feel like a smarter file manager, not just a tree. Open-source React file manager components and explorers can act as references.
Key elements:
- Primary list view:
- Virtualized table with columns: Name, Type, Owner, Modified, Status, Tags, Version, Retention.
- Multi-select with bulk operations (move, tag, change state, share).
- Toggleable views: table, card grid, hierarchy/tree.
- Faceted search sidebar:
- Filters by: type, owner, date, lifecycle state, classification, tags, workspace, retention policy.
- Saved filter sets as user-defined “smart folders”.
- Search UX:
- One global search bar (with typeahead and quick filters) + “advanced search” modal.
- Support for query building (e.g., owner:me AND status:pending AND tag:contract).
- Recent searches and pinned searches.
You can borrow patterns from existing React file managers (e.g., @cubone/react-file-manager, react-file-manager repos) for interactions like drag-and-drop, breadcrumb navigation, and split panes.
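The query-building syntax mentioned above (owner:me AND status:pending AND tag:contract) can be turned into structured filters with a small parser. A minimal sketch that only handles flat `field:value` tokens; a real implementation would also need quoting, OR, and negation:

```typescript
// Parse a simple query string into field filters plus free-text terms.
type Filters = Record<string, string[]>;

function parseQuery(q: string): { filters: Filters; text: string[] } {
  const filters: Filters = {};
  const text: string[] = [];
  // Split on " AND " connectors or plain whitespace.
  for (const token of q.split(/\s+AND\s+|\s+/)) {
    const m = token.match(/^(\w+):(.+)$/);
    if (m) {
      (filters[m[1]] ??= []).push(m[2]);
    } else if (token) {
      text.push(token);
    }
  }
  return { filters, text };
}

// parseQuery("owner:me AND status:pending AND tag:contract")
// → { filters: { owner: ["me"], status: ["pending"], tag: ["contract"] }, text: [] }
```

The same structure can back both the typeahead quick filters and the advanced search modal, so the two UIs stay consistent.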
Document detail, preview, and lifecycle
The document detail view is a core screen: think of it as a “control panel” for one document.
- Layout:
- Center: preview pane (PDF/Doc viewer, images, text, code, etc.).
- Left or top: document title, key status badges (state, classification, retention).
- Right sidebar: metadata, activity, and workflow.
- Preview & interaction:
- In-place viewing for common formats; open in new tab when needed.
- Zoom, page navigation, thumbnails, search within document (if OCR/text available).
- Section-based comments or anchored annotations for documents that support it.
- Metadata panel:
- Versioning and history:
- Version list with diff metadata (who, when, what changed), plus restore and compare.
- Activity log: views, edits, approvals, permission changes, external shares.
- Lifecycle controls:
Collaboration, permissions, and workflows
A sophisticated DMS GUI should make collaboration and security discoverable without being noisy.
- Sharing & permissions drawer:
- People/groups with roles (view, comment, edit, manage, restricted).
- Link sharing options (organization-only, specific groups, external with expiry).
- Quick presets that align with internal policies (e.g., “Internal read-only”, “Legal-only”).
- Comments and tasks:
- Threaded comments, mentions, and notifications.
- Inline tasks tied to documents (“Review by X date”, “Sign-off by Y”).
- “My Tasks” view aggregating approvals and review actions across documents.
- Workflow dashboards:
- Kanban or table view of documents by lifecycle stage (Draft, In review, Approved, Published, Archived).
- Filter by workflow type (contract approval, policy review, etc.).
- Admin & governance:
Concrete component/route structure and next-step prompt
A pragmatic React app structure for this GUI:
- Routes:
- /login, /onboarding
- /dashboard – personalized overview (recent docs, tasks, approvals).
- /documents – global library with filters and views.
- /spaces/:spaceId – project/department spaces with scoped docs.
- /documents/:documentId – document detail with preview, metadata, activity.
- /tasks – approvals, review requests.
- /admin/* – policies, roles, audit, tenants.
- Key components:
- AppShell (sidebar, topbar, routing outlet)
- GlobalSearchBar, SavedSearchList, FacetFilterPanel
- DocumentTable, DocumentGrid, Breadcrumb, FileTree
- DocumentPreview, MetadataPanel, VersionTimeline, ActivityFeed
- CommentsPanel, TaskList, SharePermissionsDrawer
- WorkflowBoard, AdminAuditLog, PolicyList
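The routes and components above can be expressed as a plain route table. A sketch using the component names from the list where they exist; page-level names such as DashboardPage, DocumentDetail, and SpacePage are hypothetical additions:

```typescript
// Route table as data; components are referenced by name for the sketch.
interface RouteDef {
  path: string;
  component: string;
  children?: RouteDef[];
}

const routes: RouteDef[] = [
  { path: "/login", component: "LoginPage" },
  { path: "/onboarding", component: "OnboardingFlow" },
  {
    path: "/",
    component: "AppShell", // sidebar + topbar + routing outlet
    children: [
      { path: "dashboard", component: "DashboardPage" },
      { path: "documents", component: "DocumentTable" },
      { path: "documents/:documentId", component: "DocumentDetail" },
      { path: "spaces/:spaceId", component: "SpacePage" },
      { path: "tasks", component: "TaskList" },
      { path: "admin/*", component: "AdminRoutes" },
    ],
  },
];
```

Nesting everything authenticated under AppShell keeps the sidebar and topbar mounted across navigation, which matches the 3‑pane layout described earlier.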
A focused follow-up prompt you can reuse:
“Design a React component architecture (with TypeScript interfaces) for a multi-tenant document management GUI that includes: global search, faceted document browsing, document detail with preview/versioning, collaboration (comments/tasks), and lifecycle workflows (review/approval/retention). Assume a separate backend API; focus on React components, state management, and routing patterns.”
If you describe your backend shape (APIs, events, AI capabilities), a next iteration can map this UI to specific data contracts and interaction flows.
The document management system is designed for a regulated industry (healthcare, finance) and will be used to search, find, manage, and interact with the enterprise knowledge base, which consists of thousands of Markdown files.
An ideal React GUI here is an enterprise knowledge console for regulated content: everything is Markdown under the hood, but users experience compliant search, review, and governance workflows tailored to healthcare/finance.
Top-level mental model
For HIPAA/SEC-type environments the UI must surface access control, classification, retention, and auditability as first-class citizens, not hidden settings.
- Primary objects:
- Knowledge items (Markdown docs) with type, classification, lifecycle state, owner, and retention.
- Collections/spaces (e.g., “Clinical Protocols”, “Policies”, “Product Knowledge”) mapping to business domains.
- Workflows (review, approval, periodic re-certification, legal hold).
- Primary views:
- My Work (tasks, reviews, assigned items).
- Knowledge Explorer (search + browse).
- Compliance & Governance (retention, holds, audits).
- Admin (policies, roles, mappings to regulations).
Knowledge explorer (Markdown-centric)
Treat thousands of Markdown files like a GitBook/Docmost/Document360-style structured knowledge base, not a raw file system.
- Left:
- Hierarchical navigation (spaces → sections → pages) derived from folder paths / frontmatter.
- Pinned collections (e.g., “Clinical Policies”, “Risk Procedures”, “KYC Playbooks”).
- Center:
- Right:
Search in this view should be hybrid: full-text over Markdown, plus filters over metadata and regulatory properties.
Search and discovery UX
Regulated KB search must let a compliance officer answer “who can see what, and why?” and a practitioner quickly find the right guidance.
- Global search bar:
- Query across title, headings, body, tags, and “regulation tags” (e.g., hipaa:breach-notification, sec:17a-4).
- Typeahead sections: “Documents”, “Spaces”, “People/Owners”.
- Advanced search panel:
- Facets: document type (policy, SOP, clinical guideline, risk procedure), classification level, retention category, jurisdiction, business unit.
- Status: draft / in review / approved / deprecated / on hold.
- Time-based filters keyed to retention (creation, last review, next re-cert date).
- Saved searches:
- “HIPAA policies in review”, “KYC procedures expiring this quarter”, “High-risk procedures without current attestation”.
Integrate RAG-style semantic search in the results panel, but always anchored back to specific Markdown docs and sections for auditability.
Compliance-first document view
Each Markdown document’s detail view should foreground compliance context.
- Header strip (always visible):
- Metadata tab:
- Workflow tab:
- Review chain (author → SME → Compliance → Approver), current assignee, due dates.
- Attestation history (who signed off when).
- Audit tab:
Tasks, workflows, and lifecycle
Regulated industries require structured, repeatable document lifecycle handling. The UI should make this explicit.
- My Work dashboard:
- Tiles: “Reviews to complete”, “Attestations due”, “Content to update”, “Legal hold notices”.
- List of assigned items with priority and SLA indicators.
- Workflow board:
- Columns by lifecycle state (Draft, SME Review, Compliance, Approved, Effective, Obsolete).
- Filters by regulation, space, owner, risk rating.
- Lifecycle actions (buttons on doc view):
Governance & admin views
Provide specialized views for compliance / records management teams.
- Retention & classification console:
- Records monitor:
- Items approaching review/expiry.
- Items on legal hold.
- Exceptions (content missing classification, retention, or regulatory mapping).
- Audit & reporting:
- Markdown source management:
Suggested focused follow-up prompt
“Design a React + TypeScript component architecture for a regulated-industry knowledge console that manages thousands of Markdown documents, including: (1) semantic and faceted search over Markdown + metadata, (2) a compliance-focused document view (classification, retention, audit), (3) workflow views for review/approval/attestation, and (4) governance consoles for retention schedules and audit logs. Assume APIs already expose Markdown content, metadata, and audit trails; focus on component boundaries, state management, and routing.”
How should I structure markdown folders for regulatory compliance?
- What metadata fields are essential for healthcare and finance documents?
- How to design search indexing for thousands of markdown files?
- Which access control model fits regulated document systems best?
- How to implement audit trails and retention policies automatically?
For a regulated, Markdown-based knowledge base, treat folders, metadata, indexing, access control, and audit/retention as one coherent “compliance fabric.”
1. Folder structure for regulatory compliance
Use folders for human orientation and broad access boundaries, but push most compliance logic into metadata and policy engines.
- High-level pattern (top 2–3 levels):
  - /domain/ – e.g., clinical, operations, risk, trading, kyc.
  - /jurisdiction/ – e.g., us, eu, uk, br, or multi-jurisdiction.
  - /document-type/ – e.g., policy, procedure, sop, form, runbook, guideline.
- Examples:
  - clinical/us/policy/hipaa-privacy-officer.md
  - risk/eu/procedure/aml-kyc-review.md
  - trading/us/policy/sec-17a4-electronic-records.md
- Separate “working” vs “controlled” content:
- Enforce:
2. Essential metadata fields (healthcare + finance)
Most compliance capabilities should be driven by metadata, not paths.
Minimum metadata (frontmatter + index DB):
- Identity:
doc_id (stable), title, description, language.
- Classification & sensitivity:
- Regulatory mapping:
- Lifecycle & retention:
- Ownership & context:
- Versioning:
version, supersedes, superseded_by, change_reason.
- Access policy hints:
This can map cleanly to YAML frontmatter in each Markdown file, plus a normalized relational or graph store for querying.
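A minimal sketch of extracting that frontmatter into a metadata record; it only handles flat `key: value` pairs, so a real pipeline would use a YAML library such as js-yaml or gray-matter:

```typescript
// Split a Markdown file into frontmatter metadata and body text.
function parseFrontmatter(md: string): {
  meta: Record<string, string>;
  body: string;
} {
  const m = md.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!m) return { meta: {}, body: md };
  const meta: Record<string, string> = {};
  for (const line of m[1].split("\n")) {
    const kv = line.match(/^(\w+):\s*(.*)$/);
    if (kv) meta[kv[1]] = kv[2];
  }
  return { meta, body: md.slice(m[0].length) };
}

const doc = `---
doc_id: HSP-POL-001
title: Privacy Officer Policy
retention_category: HIPAA-6Y
---
# Purpose
...`;

// parseFrontmatter(doc).meta.retention_category → "HIPAA-6Y"
```

The parsed record is what gets normalized into the relational store; the Markdown body itself stays untouched as the content source of truth.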
3. Search indexing for thousands of Markdown files
Design indexing so that full-text and compliance metadata are equally important.
- Parsing & enrichment pipeline:
- Index schema (e.g., OpenSearch/Solr/Typesense/pg_trgm):
- Search behavior:
- Maintenance:
4. Access control model for regulated documents
Use RBAC as the backbone, with classification- and attribute-based constraints; many regulated shops layer MAC-like behavior on top for high-sensitivity content.
- Base model:
- Classification-aware rules:
- Least privilege and SoD:
- Implementation detail:
For highly sensitive subsets (e.g., some financial records, special PHI), you can approximate MAC with system-enforced clearances and non-bypassable rules layered on top of RBAC.
5. Audit trails and automatic retention
Automate logging and lifecycle transitions so compliance is enforced “by default,” not by convention.
- Audit trails:
- Log all security- and compliance-relevant events:
- Required fields per event:
- Storage requirements:
- Retention policies:
- Automation patterns:
If you want, the next step can be a concrete schema + pipeline sketch (frontmatter YAML spec, Postgres schema for metadata and events, and an indexing + policy-eval flow) tailored to your current stack.
Create a compliant folder hierarchy template for healthcare documents.
- Recommended metadata schema for HIPAA and FINRA documents
- Indexing strategy for fast search across 100k markdown files
- How to store and query metadata separately from markdown files
- Best open source search engines for markdown repositories
For a HIPAA/FINRA-style Markdown corpus, use folders only for high‑signal orientation and boundaries, with metadata + index as the real control plane.
1. Folder hierarchy template (healthcare-focused, extendable to finance)
Use a consistent 3–4 level hierarchy; drive compliance from metadata, but keep folders human-readable and audit-friendly.
Top-level:
- clinical/ – care delivery policies, protocols, order sets.
- administrative/ – HR, operations, facilities.
- security-privacy/ – HIPAA, GDPR, security policies.
- billing-revenue/ – coding, billing, collections.
- research/ – IRB, study procedures.
- finance/ – trading, products, risk, disclosures (for FINRA/SEC overlap).
Within each domain:
- us/, eu/, uk/, br/, global/ (jurisdiction).
- policy/, procedure/, sop/, work-instruction/, form/, guideline/ (document type).
- controlled/, draft/, archive/ to distinguish official vs. working vs. obsolete content.
Example paths:
- clinical/us/policy/controlled/CLN-001-hipaa-privacy-officer-v3.2.md
- security-privacy/us/procedure/draft/SEC-17a4-electronic-records-v0.9.md
- finance/us/policy/controlled/FINRA-4511-recordkeeping-v2.1.md
Use IDs + short slugs in filenames to help eDiscovery and cross-system referencing.
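A regex sketch for pulling the ID, slug, and version back out of that filename convention (assuming the `<ID>-<slug>-v<major.minor>.md` pattern shown above):

```typescript
// Parse filenames like "CLN-001-hipaa-privacy-officer-v3.2.md" into parts.
function parseRecordFilename(
  name: string,
): { id: string; slug: string; version: string } | null {
  // ID: one or two uppercase segments plus a number, e.g. CLN-001 or HSP-POL-001.
  const m = name.match(/^([A-Z]+-(?:[A-Z]+-)?\d+)-(.+)-v(\d+\.\d+)\.md$/);
  if (!m) return null;
  return { id: m[1], slug: m[2], version: m[3] };
}

// parseRecordFilename("CLN-001-hipaa-privacy-officer-v3.2.md")
// → { id: "CLN-001", slug: "hipaa-privacy-officer", version: "3.2" }
```

Validating filenames like this at commit time (e.g., in a CI check) keeps the repo consistent with the IDs stored in metadata.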
2. Recommended metadata schema (HIPAA + FINRA)
Metadata should cover descriptive, structural, administrative, technical, and provenance aspects, with explicit regulatory and retention signals.
Core fields (YAML frontmatter + DB):
- Identity:
doc_id (stable identifier), title, summary, language.
- Domain & type:
- Regulatory mapping:
- Sensitivity & classification:
- Lifecycle & retention:
- Ownership & access:
- Provenance & versioning:
For HIPAA/FINRA, treat metadata (timestamps, authorship, classification, retention, lineage) as part of the “record” and preserve it immutably with content for WORM-style compliance.
3. Indexing strategy for 100k Markdown files
100k Markdown docs are well within range for a serious full-text engine; focus on a content pipeline and rich fields.
Ingestion pipeline:
- Step 1 – Parse:
- Read frontmatter → structured metadata.
- Render Markdown to plain text; extract:
headings (H1–H3), sections with anchors; code_blocks and tables if relevant.
- Step 2 – Enrich:
- Step 3 – Index document (per file) with fields:
- Text:
title, headings, body, tags, regulation_text (e.g., codes + human labels).
- Facets/filterable fields:
- Sortable/date:
- Indexing performance considerations:
- Batch inserts/updates in chunks of thousands (depending on engine) to speed up indexing and reduce overhead.
- Prefer bigger payloads over many small ones; 100k documents is generally safe to index in a few batches.
- Use incremental indexing triggered by VCS hooks or filesystem events for continuous updates.
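The chunking advice above can be sketched in a few lines; the batch size is an illustrative default to tune per engine:

```typescript
// Split a document list into fixed-size batches for bulk indexing.
function toBatches<T>(docs: T[], batchSize = 5000): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// 12,000 docs with a batch size of 5,000 → 3 batches (5000, 5000, 2000).
```

Each batch then goes to the engine’s bulk endpoint (Meilisearch documents-add, OpenSearch _bulk, etc.), with retries applied per batch rather than per document.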
Query model:
- Default query = full-text over title + headings + body, with boosts on title, headings, and regulations.
- Filters:
- For compliance and explainability, always return:
4. Storing and querying metadata separately from Markdown
Keep Markdown as the source of truth for content; use a database for metadata, joins, and analytics.
Recommended split:
- Markdown:
- Metadata store:
- Relational DB (PostgreSQL is ideal) with:
  - documents table (doc_id, path, hash, timestamps).
  - document_metadata (doc_id FK, normalized fields like domain, type, regulation codes, classifications, retention, business unit).
  - document_regulations (doc_id, regulation_code) for many-to-many mappings if needed.
  - document_tags, document_facilities, etc., as junction tables.
- Query patterns:
- Synchronization:
5. Open source search engines for Markdown repositories
Several engines work well with Markdown once you provide a parsing pipeline.
Good fits:
- Elasticsearch / OpenSearch:
- Mature full-text, faceting, highlighting; widely used for document search at 100k+ scale.
- Works well for mixed text + structured metadata; supports analyzers for multiple languages.
- Meilisearch:
- Lightweight, fast, simple HTTP API; great relevance for document-style search; good for 100k docs.
- Supports searchable attributes, filterable attributes, synonyms, and typo tolerance.
- Typesense:
- Similar to Meilisearch; strong for instant search and faceted UIs.
- Solr:
- Traditional, battle-tested; strong document indexing, used in many enterprise DMS setups.
- PostgreSQL full-text (tsvector/tsquery):
- If you want to avoid extra infrastructure, can work well with 100k documents using FTS + GIN indexes.
- Markdown-specific examples and inspiration:
- GitHub project “Search engine for markdown files with tagging” (Whoosh-based), which boosts filenames and Markdown structure (headlines, emphasis).
- Guides on building Markdown search APIs (e.g., ZeroEntropy’s approach) show indexing flows specific to .md repositories.
- Static-site ecosystems (MkDocs, Docusaurus) bundle client-side search (Lunr/Algolia-like) that can be reused or adapted for server-side search pipelines.
For a regulated KB with 100k Markdown docs, a typical architecture is:
- PostgreSQL (metadata + audit).
- OpenSearch/Elasticsearch or Meilisearch (text + facets).
- Markdown in Git/object store.
- Event-driven sync to maintain metadata DB and search index integrity.
If useful, next step can be: concrete Postgres schema + Meilisearch index JSON definition + a small ingestion pipeline sketch tailored to your RBAC/ABAC model.
Related
- Show a HIPAA-focused folder template with retention rules
- List mandatory metadata fields for FINRA audits only
- Design a scalable index schema for 100k markdown files
- Compare storing metadata in sidecar files versus a DB
- How to implement immutable WORM storage for records and metadata
A compliant HIPAA/FINRA Markdown repo needs: (1) a retention-aware folder pattern, (2) audit-ready metadata, (3) an index schema that separates content from metadata, and (4) an immutable storage layer for records + metadata.
1. HIPAA-focused folder template with retention hints
HIPAA generally requires retaining HIPAA-related documentation (policies, procedures, notices, complaints) at least six years from creation or last effective date.
Example top-level layout:
clinical/, administrative/, security-privacy/, billing-revenue/, research/
Within each:
- us/, state-<xx>/ (where state law drives longer retention), global/.
- policy/, procedure/, sop/, form/, notice/, log/.
- controlled/, draft/, archive/.
Concrete template with retention category encoded in folder name (for ops clarity, while actual enforcement is via metadata & jobs):
- security-privacy/us/policy/ret-6y/
- clinical/us/record/ret-6y-plus-state/
- billing-revenue/us/record/ret-7y/ – if internal policy aligns with common 7‑year practices for financial records.
Filename pattern:
- HSP-POL-001-privacy-notice-v3.0.md
- CLN-SOP-010-medication-reconciliation-v1.4.md
The folder name ret-6y is advisory; the authoritative retention lives in metadata and a central retention table.
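Deriving the authoritative retain_until date from a retention category can be sketched as below; the mapping table here is illustrative, since the real mapping belongs in the central retention table:

```typescript
// Illustrative retention-category → years mapping (authoritative source
// would be the central retention_policies table, not code).
const RETENTION_YEARS: Record<string, number> = {
  "ret-6y": 6,
  "ret-6y-plus-state": 6, // floor; state law may extend this
  "ret-7y": 7,
};

// Compute retain_until = start date + retention period, in UTC.
function retainUntil(category: string, from: Date): Date {
  const years = RETENTION_YEARS[category];
  if (years === undefined) {
    throw new Error(`unknown retention category: ${category}`);
  }
  const d = new Date(from);
  d.setUTCFullYear(d.getUTCFullYear() + years);
  return d;
}

// retainUntil("ret-6y", new Date("2025-01-01T00:00:00Z")) → 2031-01-01T00:00:00Z
```

Failing loudly on unknown categories (rather than defaulting) matters here: a silently too-short retention period is a compliance defect.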
2. Mandatory metadata fields for FINRA audits (documents only)
FINRA Rule 4511 points to SEC Rule 17a‑4 for how records must be made and preserved: accurate, complete, immutable (WORM or equivalent), and retained for specified periods.
For Markdown-based “books and records,” minimum per-record metadata should include:
- Identity:
- Business context:
- Regulatory mapping:
- Authorship and timestamps:
- Retention:
- Integrity and storage:
- Status & lineage:
Auditors will focus heavily on: accurate timestamps, clear mapping to books/records rules, retention duration, and demonstrable immutability of the record and its metadata.
3. Scalable index schema for 100k Markdown files
100k Markdown docs are moderate scale; a good design is “one indexed document per Markdown file, plus structured fields for compliance filters.”
Logical index document (for Meilisearch/OpenSearch/etc.):
- id: doc_id.
- Content:
  - title.
  - headings: array of strings (H1–H3).
  - body: plain text of the Markdown.
  - sections: array of { anchor, heading, text_snippet } for section-level highlighting.
- Compliance & metadata fields (filterable/faceted):
  - domain (clinical, security-privacy, finance, etc.).
  - document_type (policy, sop, record, communication).
  - jurisdiction (us, state-ca, eu, etc.).
  - regulations (array of codes).
  - security_classification; contains_phi, contains_financial_data.
  - status, retention_category.
  - business_unit, desk, facility.
  - owner_role, owner_user_id.
- Dates (sortable & filterable):
- Integrity & storage (for linking to WORM store):
Indexing choices:
- Mark all metadata fields (domain, regulations, status, retention_category, etc.) as filterable/faceted.
- For Meilisearch:
- Use batched indexing and incremental updates based on Git commits or FS events; 100k docs is well within Meilisearch/OpenSearch abilities with correct batching.
4. Sidecar metadata files vs DB
Both patterns are used; regulators care that metadata is preserved immutably and consistently, not where it lives.
| Aspect | Sidecar/YAML-only (per .md) | Central DB (e.g., Postgres) |
|---|---|---|
| Source of truth | Metadata travels with content in Git or object store. | Clear, queryable system of record for metadata. |
| Human review | Easy to view/edit in editors; good for dev workflows. | Requires UI/tooling to see/edit; less diff-friendly. |
| Query & analytics | Needs an indexing step; no ad‑hoc SQL. | Natural for dashboards, compliance queries, joins. |
| Immutability (WORM) | Commit history can act as an immutable log, but Git alone may not satisfy WORM. | DB must be backed by WORM/append-only logs or replicated into an immutable archive. |
| Complexity | Simpler infra; more complex analytics. | More infra; simpler analytics and API queries. |
| Regulatory expectations | Must still prove metadata wasn’t altered independently; strong story if the entire repo is archived in WORM snapshots. | Stronger central controls for mandatory fields, validations, consistency. |
For HIPAA/FINRA‑grade systems, a hybrid is typically best:
- Authoring: metadata in frontmatter/sidecar files tightly coupled with the Markdown.
- Operational: parsed metadata normalized into Postgres (or similar) and archived along with content to WORM storage so both can be shown as preserved together.
5. Implementing immutable WORM storage for records + metadata
SEC 17a‑4 and FINRA 4511 require “non‑rewriteable, non‑erasable” records storage (classic WORM or equivalent) for specified periods; HIPAA requires retrievable, safeguarded records and auditable activity logs (often 6+ years).
Core principles:
- Immutability:
- Integrity & verification:
- Separation of duties:
Practical pattern:
- Use cloud or on-prem WORM-capable storage:
- Object storage with immutable buckets / object lock (S3 Object Lock, similar features on other providers) configured in compliance mode for required retention periods.
- Store both Markdown content and a serialized metadata blob (e.g., JSON with all fields) as a single logical object or as tightly coupled objects.
- Write path:
  - Application validates metadata and calculates retain_until from the retention category.
  - Serializes {markdown, metadata} with a hash and writes to the WORM store with a retention setting at least as long as policy requires.
  - Records a pointer (WORM object ID, hash) in Postgres for fast lookup and search indexing.
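The serialize-and-hash portion of that write path can be sketched as below; the actual WORM write (e.g., an S3 PutObject with Object Lock retention) is left as an assumed downstream call:

```typescript
import { createHash } from "crypto";

// The record handed to the WORM store, plus the fields the Postgres
// pointer row needs (object hash, retention date).
interface WormRecord {
  payload: string;      // serialized {markdown, metadata}
  contentHash: string;  // SHA-256 over the payload
  retainUntil: string;  // ISO date passed to the store's retention setting
}

function buildWormRecord(
  markdown: string,
  metadata: Record<string, unknown>,
  retainUntil: Date,
): WormRecord {
  // Content and metadata are serialized together so both are provably
  // preserved as one immutable object.
  const payload = JSON.stringify({ markdown, metadata });
  const contentHash = createHash("sha256").update(payload).digest("hex");
  return {
    payload,
    contentHash,
    retainUntil: retainUntil.toISOString().slice(0, 10),
  };
}
```

Storing the hash both in the WORM object’s surroundings and in Postgres lets a later audit verify that neither side drifted.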
- Update / versioning:
- Deletion:
  - A scheduled process identifies records whose retain_until has passed and which are not on legal hold, then:
- Audit logs:
6. Next step: Postgres + Meilisearch + ingestion pipeline (sketch)
Below is a compact, implementation-ready sketch you can expand.
6.1 Postgres schema (core tables)
-- documents: logical docs; markdown lives in content store
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
path TEXT NOT NULL, -- repo path
content_hash TEXT NOT NULL, -- SHA-256
current_version INT NOT NULL,
worm_object_id TEXT NOT NULL, -- pointer into WORM
created_at TIMESTAMPTZ NOT NULL,
created_by TEXT NOT NULL,
last_modified_at TIMESTAMPTZ NOT NULL,
last_modified_by TEXT NOT NULL
);
CREATE TABLE document_versions (
doc_id UUID NOT NULL REFERENCES documents(doc_id),
version INT NOT NULL,
worm_object_id TEXT NOT NULL,
content_hash TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
created_by TEXT NOT NULL,
supersedes_version INT,
PRIMARY KEY (doc_id, version)
);
-- core metadata (flattened for easy indexing)
CREATE TABLE document_metadata (
doc_id UUID PRIMARY KEY REFERENCES documents(doc_id),
title TEXT NOT NULL,
summary TEXT,
domain TEXT NOT NULL,
document_type TEXT NOT NULL,
jurisdiction TEXT[] NOT NULL,
regulations TEXT[] NOT NULL,
security_class TEXT NOT NULL,
contains_phi BOOLEAN NOT NULL DEFAULT FALSE,
contains_financial BOOLEAN NOT NULL DEFAULT FALSE,
status TEXT NOT NULL,
effective_date DATE,
review_due_date DATE,
expiry_date DATE,
retention_category TEXT NOT NULL,
retain_until DATE NOT NULL,
legal_hold BOOLEAN NOT NULL DEFAULT FALSE,
legal_hold_reason TEXT,
business_unit TEXT,
desk TEXT,
facility TEXT,
owner_user_id TEXT NOT NULL,
owner_role TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
last_modified_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE document_tags (
doc_id UUID NOT NULL REFERENCES documents(doc_id),
tag TEXT NOT NULL,
PRIMARY KEY (doc_id, tag)
);
This schema keeps metadata queryable and ties each document to immutable WORM objects via worm_object_id.
6.2 Meilisearch index settings (JSON)
{
"uid": "documents",
"primaryKey": "doc_id",
"searchableAttributes": [
"title",
"headings",
"body"
],
"filterableAttributes": [
"domain",
"document_type",
"jurisdiction",
"regulations",
"security_class",
"contains_phi",
"contains_financial",
"status",
"retention_category",
"business_unit",
"desk",
"facility",
"owner_role",
"owner_user_id",
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"last_modified_at"
],
"sortableAttributes": [
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"last_modified_at"
]
}
Each indexed document you send would look like:
{
"doc_id": "uuid",
"title": "HIPAA Privacy Officer Policy",
"headings": ["Purpose", "Scope", "Responsibilities"],
"body": "Plain text…",
"domain": "security-privacy",
"document_type": "policy",
"jurisdiction": ["US"],
"regulations": ["HIPAA-164.316"],
"security_class": "confidential",
"contains_phi": true,
"contains_financial": false,
"status": "effective",
"retention_category": "HIPAA-6Y",
"business_unit": "Compliance",
"desk": null,
"facility": "Hospital-A",
"owner_role": "Privacy Officer",
"owner_user_id": "u123",
"effective_date": "2025-01-01",
"review_due_date": "2027-01-01",
"retain_until": "2031-01-01",
"created_at": "2025-01-01T10:00:00Z",
"last_modified_at": "2025-01-10T11:00:00Z"
}
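Because all the compliance fields above are filterable, a user’s RBAC/ABAC attributes can be projected into a Meilisearch filter expression so results are pre-filtered server-side. A sketch; the clearance ordering is an illustrative assumption:

```typescript
// Illustrative classification ladder, lowest to highest sensitivity.
const CLASS_ORDER = ["public", "internal", "confidential", "restricted"];

interface UserCtx {
  clearance: string;      // highest security_class the user may see
  businessUnits: string[];
  canViewPhi: boolean;
}

// Build a Meilisearch filter string from the user's attributes.
function buildFilter(user: UserCtx): string {
  const allowed = CLASS_ORDER.slice(0, CLASS_ORDER.indexOf(user.clearance) + 1);
  const clauses = [
    `security_class IN [${allowed.join(", ")}]`,
    `business_unit IN [${user.businessUnits.join(", ")}]`,
  ];
  if (!user.canViewPhi) clauses.push("contains_phi = false");
  return clauses.join(" AND ");
}

// buildFilter({ clearance: "internal", businessUnits: ["Compliance"], canViewPhi: false })
// → "security_class IN [public, internal] AND business_unit IN [Compliance] AND contains_phi = false"
```

Building the filter server-side (never trusting a client-supplied filter) is what makes the search index safe to expose to end users in a regulated environment.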
6.3 Ingestion pipeline (RBAC/ABAC-aware)
High-level steps:
- Discovery & parsing:
- Retention and policy enrichment:
- RBAC/ABAC projection:
- WORM write:
- Database upsert:
- Search indexing:
- Audit logging:
If you want, next iteration can be: concrete RBAC/ABAC policy examples (e.g., OPA/Cedar snippets) wired to the Postgres schema and Meilisearch filters for HIPAA + FINRA scenarios.
Related
- Provide a concrete Postgres schema for records and metadata
- Show Meilisearch index JSON definition for HIPAA and FINRA fields
- Sketch a small ingestion pipeline for markdown files to Postgres and Meilisearch
- How to map RBAC and ABAC attributes into the Postgres schema
- Retention automation rules and SQL procedures for record expiry and destruction
Below is a concrete, implementation-ready next step that ties together Postgres, Meilisearch, RBAC/ABAC, and retention automation for a HIPAA + FINRA Markdown repo.
1. Postgres schema for records, metadata, RBAC/ABAC
This focuses on: documents, versions, metadata, RBAC roles/permissions, ABAC attributes, and retention.
-- Core documents (logical record)
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
path TEXT NOT NULL, -- repo path to .md
current_version INT NOT NULL,
content_hash TEXT NOT NULL, -- hash of current content
worm_object_id TEXT NOT NULL, -- immutable storage pointer
created_at TIMESTAMPTZ NOT NULL,
created_by TEXT NOT NULL,
last_modified_at TIMESTAMPTZ NOT NULL,
last_modified_by TEXT NOT NULL
);
-- Immutable versions (link to WORM objects)
CREATE TABLE document_versions (
doc_id UUID NOT NULL REFERENCES documents(doc_id),
version INT NOT NULL,
worm_object_id TEXT NOT NULL,
content_hash TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
created_by TEXT NOT NULL,
supersedes_version INT,
PRIMARY KEY (doc_id, version)
);
-- Flattened metadata (HIPAA + FINRA)
CREATE TABLE document_metadata (
doc_id UUID PRIMARY KEY REFERENCES documents(doc_id),
title TEXT NOT NULL,
summary TEXT,
domain TEXT NOT NULL, -- clinical, security-privacy, finance, etc.
document_type TEXT NOT NULL, -- policy, sop, record, communication, etc.
jurisdiction TEXT[] NOT NULL, -- ["US"], ["US","EU"], etc.
regulations TEXT[] NOT NULL, -- ["HIPAA-164.316","FINRA-4511",...]
security_class TEXT NOT NULL, -- public/internal/confidential/restricted
contains_phi BOOLEAN NOT NULL DEFAULT FALSE,
contains_pii BOOLEAN NOT NULL DEFAULT FALSE,
contains_financial BOOLEAN NOT NULL DEFAULT FALSE,
status TEXT NOT NULL, -- draft/in_review/effective/obsolete
effective_date DATE,
review_due_date DATE,
expiry_date DATE,
retention_category TEXT NOT NULL, -- HIPAA-6Y, FINRA-6Y, etc.
retention_period_y INT NOT NULL, -- in years
retain_until DATE NOT NULL,
legal_hold BOOLEAN NOT NULL DEFAULT FALSE,
legal_hold_reason TEXT,
business_unit TEXT,
desk TEXT, -- trading desk, etc.
facility TEXT, -- hospital/facility
owner_user_id TEXT NOT NULL,
owner_role TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
last_modified_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE document_tags (
doc_id UUID NOT NULL REFERENCES documents(doc_id),
tag TEXT NOT NULL,
PRIMARY KEY (doc_id, tag)
);
-- Retention catalog (policy table)
CREATE TABLE retention_policies (
retention_category TEXT PRIMARY KEY,
description TEXT NOT NULL,
period_years_default INT NOT NULL,
legal_basis TEXT NOT NULL, -- "HIPAA-164.316", "FINRA-4511/SEC-17a-4", etc.
min_years INT NOT NULL -- safeguard against too-short periods
);
-- RBAC: users, roles, permissions
CREATE TABLE users (
user_id TEXT PRIMARY KEY,
display_name TEXT NOT NULL,
email TEXT NOT NULL,
active BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE TABLE roles (
role_id TEXT PRIMARY KEY, -- "clinician", "privacy_officer", "trader", "finra_compliance"
description TEXT NOT NULL
);
CREATE TABLE user_roles (
user_id TEXT NOT NULL REFERENCES users(user_id),
role_id TEXT NOT NULL REFERENCES roles(role_id),
PRIMARY KEY (user_id, role_id)
);
CREATE TABLE permissions (
permission_id TEXT PRIMARY KEY, -- "doc.read", "doc.write", "doc.approve", "doc.view_phi", ...
description TEXT NOT NULL
);
CREATE TABLE role_permissions (
role_id TEXT NOT NULL REFERENCES roles(role_id),
permission_id TEXT NOT NULL REFERENCES permissions(permission_id),
PRIMARY KEY (role_id, permission_id)
);
-- ABAC-like policy hints stored per doc (consumed by PDP)
CREATE TABLE document_access_attributes (
doc_id UUID PRIMARY KEY REFERENCES documents(doc_id),
required_roles TEXT[] DEFAULT '{}', -- e.g., ["privacy_officer"]
allowed_business_units TEXT[] DEFAULT '{}', -- e.g., ["Compliance","Oncology"]
allowed_desks TEXT[] DEFAULT '{}', -- for finance
allowed_facilities TEXT[] DEFAULT '{}', -- site-level scoping
max_class_for_role TEXT -- optional (e.g., define a ceiling on classification for a role)
);
This schema mirrors the RBAC/ABAC patterns described for OPA/Cedar-style evaluation, where user roles and resource attributes are fed into a policy engine.136137138139140141142143
2. Meilisearch index JSON (HIPAA + FINRA filter fields)
Index settings for a documents index:
{
"uid": "documents",
"primaryKey": "doc_id",
"searchableAttributes": [
"title",
"headings",
"body"
],
"filterableAttributes": [
"domain",
"document_type",
"jurisdiction",
"regulations",
"security_class",
"contains_phi",
"contains_pii",
"contains_financial",
"status",
"retention_category",
"business_unit",
"desk",
"facility",
"owner_role",
"owner_user_id",
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"last_modified_at"
],
"sortableAttributes": [
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"last_modified_at"
]
}
- searchableAttributes limits text search to the Markdown content fields.144145146147
- filterableAttributes drives compliance filters and fine-grained access filtering (e.g., contains_phi = true AND facility = "Hospital-A").145148149144
- You can also configure attribute patterns if you want families of fields filterable without enumerating all names.144145
Each indexed doc should map 1:1 from documents + document_metadata rows plus parsed content, with doc_id as PK.
3. Ingestion pipeline sketch (Markdown → Postgres + Meilisearch)
High-level, idempotent pipeline:
- Discover changed files
  - From Git (commits) or filesystem events, collect changed .md paths and their content hashes.
- Parse Markdown
  - Extract YAML frontmatter → metadata object (title, domain, document_type, regulations, retention_category, etc.).
  - Render Markdown → plain text for body.
  - Extract headings and section anchors for headings/sections.
- Enrich metadata
- Write to WORM storage
- Upsert Postgres
  - If doc_id is new:
    - INSERT INTO documents (doc_id, path, current_version = 1, content_hash, worm_object_id, timestamps).
    - INSERT INTO document_versions with version 1.
  - Else:
    - Increment current_version, add a row to document_versions.
    - Update documents.content_hash, worm_object_id, last_modified_at, last_modified_by.
  - Upsert into document_metadata with enriched values.
  - Upsert tags / document_access_attributes if present.
- Index in Meilisearch
- Build index doc:
{
"doc_id": "...",
"title": "...",
"headings": ["..."],
"body": "plain text…",
"domain": "clinical",
"document_type": "policy",
"jurisdiction": ["US"],
"regulations": ["HIPAA-164.316"],
"security_class": "confidential",
"contains_phi": true,
"contains_pii": false,
"contains_financial": false,
"status": "effective",
"retention_category": "HIPAA-6Y",
"business_unit": "Compliance",
"desk": null,
"facility": "Hospital-A",
"owner_role": "Privacy Officer",
"owner_user_id": "u123",
"effective_date": "2025-01-01",
"review_due_date": "2027-01-01",
"retain_until": "2031-01-01",
"created_at": "2025-01-01T10:00:00Z",
"last_modified_at": "2025-01-10T11:00:00Z"
}
- `POST /indexes/documents/documents` in batches.[^6_13][^6_14][^6_12][^6_24]
- Audit the pipeline
- For each create/update, insert into an audit_events table (append-only) and periodically archive that table into WORM as well.155156158159
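The parse step above can be sketched as a small function. This is a minimal sketch: it hand-rolls the frontmatter split on `---` fences (a real pipeline would use a proper YAML parser), and the returned keys are only assumed to match the frontmatter fields described earlier.

```python
import re

def parse_markdown(raw: str) -> tuple[dict, str, list[str]]:
    """Split a Markdown file into (frontmatter dict, plain-text body, headings).

    Sketch only: assumes simple `key: value` frontmatter between `---` fences.
    """
    meta: dict = {}
    body = raw
    if raw.startswith("---"):
        try:
            _, fm, body = raw.split("---", 2)
        except ValueError:
            fm = ""
        for line in fm.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    # Headings feed the `headings` index field; the stripped text feeds `body`
    headings = [m.group(1).strip()
                for m in re.finditer(r"^#+\s+(.*)$", body, re.MULTILINE)]
    plain = re.sub(r"^#+\s+", "", body, flags=re.MULTILINE)
    plain = re.sub(r"[*_`]", "", plain).strip()
    return meta, plain, headings
```

The metadata dict then drives the document_metadata upsert, while plain and headings go into the Meilisearch index document.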
4. Mapping RBAC + ABAC into the schema
RBAC mapping
- Users ↔ users table.
- Roles ↔ roles table (e.g., clinician, privacy_officer, trader, finra_compliance).
- Permissions ↔ permissions table (doc.read, doc.write, doc.approve, doc.view_phi, doc.view_financial).
- Role-permission assignments ↔ role_permissions.137138139140142136
At decision time, your PDP (OPA/Cedar) receives:
{
"subject": {
"user_id": "u123",
"roles": ["privacy_officer", "clinician"],
"business_unit": "Compliance",
"facility": "Hospital-A"
},
"action": "doc.read",
"resource": {
"doc_id": "d0001",
"domain": "security-privacy",
"document_type": "policy",
"jurisdiction": ["US"],
"regulations": ["HIPAA-164.316"],
"security_class": "confidential",
"contains_phi": true,
"contains_financial": false,
"business_unit": "Compliance",
"facility": "Hospital-A"
}
}
User and resource attributes are loaded from Postgres into OPA or into Cedar policies.138139140141136137
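Assembling that input document from the schema rows can be sketched as below. Plain dicts stand in for fetched rows, and it assumes the user profile carries business_unit and facility as in the example payload (the users DDL above would need those columns or a join to supply them).

```python
def build_pdp_input(user: dict, roles: list[str], doc_meta: dict, action: str) -> dict:
    """Assemble the OPA/Cedar input from a users row, the user's roles, and a
    document_metadata row. Field names follow the schema and payload above."""
    return {
        "subject": {
            "user_id": user["user_id"],
            "roles": roles,
            "business_unit": user.get("business_unit"),
            "facility": user.get("facility"),
        },
        "action": action,
        "resource": {
            "doc_id": doc_meta["doc_id"],
            "domain": doc_meta["domain"],
            "document_type": doc_meta["document_type"],
            "jurisdiction": doc_meta["jurisdiction"],
            "regulations": doc_meta["regulations"],
            "security_class": doc_meta["security_class"],
            "contains_phi": doc_meta["contains_phi"],
            "contains_financial": doc_meta["contains_financial"],
            "business_unit": doc_meta.get("business_unit"),
            "facility": doc_meta.get("facility"),
        },
    }
```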
ABAC mapping
Typical ABAC conditions:
- contains_phi = true ⇒ user must have the doc.view_phi permission and a matching facility.
- contains_financial = true AND regulations contains FINRA-4511 ⇒ user must have finra_compliance or an approved front-office role.
- domain = "security-privacy" ⇒ only privacy_officer or security_officer roles can update.140141142136137138
These attributes are directly read from document_metadata and user profile tables, then turned into input.resource and input.subject attributes in Rego/Cedar policy.
At query time, you can also pre-filter Meilisearch results by ABAC attributes (e.g., filter: 'facility = "Hospital-A" AND security_class != "restricted"') in combination with PDP decisions.149146145148144
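Deriving such a pre-filter from subject attributes can be sketched as a string builder. The specific rules here (PHI hidden without doc.view_phi, restricted class reserved for privacy officers) are illustrative assumptions and must mirror whatever your Rego/Cedar policies actually encode.

```python
def abac_prefilter(subject: dict) -> str:
    """Translate subject attributes into a Meilisearch filter string that
    narrows results before the PDP makes per-document decisions.
    Sketch only: rule set is hypothetical."""
    clauses = []
    # Site-level scoping: only show docs from the user's own facility
    if subject.get("facility"):
        clauses.append(f'facility = "{subject["facility"]}"')
    # Hide PHI unless the user holds the dedicated permission
    if "doc.view_phi" not in subject.get("permissions", []):
        clauses.append("contains_phi = false")
    # Restricted classification stays invisible to ordinary roles
    if "privacy_officer" not in subject.get("roles", []):
        clauses.append('security_class != "restricted"')
    return " AND ".join(clauses)
```

The resulting string is passed as the `filter` parameter of the search request; the PDP still makes the final per-document call on anything that slips through.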
5. Retention automation rules and SQL procedures
Retention enforcement belongs in a scheduled job or background worker that uses retain_until and legal_hold.151152153150
Retention update helper
Whenever metadata changes effective_date or retention_category, recompute retain_until:
CREATE OR REPLACE FUNCTION recompute_retain_until()
RETURNS TRIGGER AS $$
DECLARE
pol retention_policies;
years INT;
BEGIN
SELECT * INTO pol
FROM retention_policies
WHERE retention_category = NEW.retention_category;
IF NOT FOUND THEN
RAISE EXCEPTION 'Unknown retention_category: %', NEW.retention_category;
END IF;
years := GREATEST(pol.period_years_default, pol.min_years);
IF NEW.effective_date IS NULL THEN
RAISE EXCEPTION 'effective_date required for retention computation';
END IF;
NEW.retention_period_y := years;
NEW.retain_until := (NEW.effective_date + (years || ' years')::INTERVAL)::DATE;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_recompute_retain_until
BEFORE INSERT OR UPDATE OF effective_date, retention_category
ON document_metadata
FOR EACH ROW
EXECUTE FUNCTION recompute_retain_until();
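The trigger's arithmetic can be unit-tested outside the database with an equivalent function. A sketch assuming Postgres-style year-interval addition, where Feb 29 clamps to Feb 28 in a non-leap target year:

```python
from datetime import date

def compute_retain_until(effective_date: date,
                         period_years_default: int,
                         min_years: int) -> tuple[int, date]:
    """Mirror of recompute_retain_until(): pick the safer (longer) period,
    then add whole years to effective_date."""
    years = max(period_years_default, min_years)
    try:
        retain_until = effective_date.replace(year=effective_date.year + years)
    except ValueError:  # Feb 29 landing in a non-leap year
        retain_until = effective_date.replace(year=effective_date.year + years, day=28)
    return years, retain_until
```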
This mirrors guidance on automating retention windows at the DB level.152153150151
Scheduled expiry job (logical destruction queue)
A daily job (cron or PG background worker) can mark records for destruction:
-- Table capturing destruction events (logical delete)
CREATE TABLE destruction_queue (
id BIGSERIAL PRIMARY KEY,
doc_id UUID NOT NULL REFERENCES documents(doc_id),
queued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
processed_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'queued', -- queued|processing|completed|failed
reason TEXT NOT NULL -- "retention_expired"
);
-- Query to enqueue eligible docs (no legal hold, retention passed)
INSERT INTO destruction_queue (doc_id, reason)
SELECT dm.doc_id, 'retention_expired'
FROM document_metadata dm
LEFT JOIN destruction_queue dq ON dq.doc_id = dm.doc_id AND dq.status IN ('queued','processing','completed')
WHERE dq.doc_id IS NULL
AND dm.legal_hold = FALSE
AND dm.retain_until <= CURRENT_DATE;
The worker that processes destruction_queue would:
- Read queued items.
- Issue delete/expiry operations against WORM storage if allowed, or mark them logically deleted.157158154155
- Update destruction_queue.status and processed_at.
- Optionally set a status = 'archived' or similar in document_metadata / documents.
- Log events into audit_events and (optionally) Meilisearch (e.g., status = "expired").156158159155
For performance, this can be batched and run on a schedule (e.g., daily, off-peak), consistent with common practices for DB- and storage-level retention jobs.153150151152
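The enqueue query's eligibility rule can also be mirrored in application code, which is handy for testing the worker without a database. A sketch over plain dict rows standing in for document_metadata:

```python
from datetime import date

def enqueue_expired(rows: list[dict], already_queued: set, today: date) -> list[dict]:
    """Pure-Python mirror of the enqueue query's WHERE clause: retention
    elapsed, no legal hold, and not already present in destruction_queue."""
    queue = []
    for meta in rows:
        if meta["doc_id"] in already_queued:
            continue
        if meta["legal_hold"]:
            continue  # legal hold always overrides retention expiry
        if meta["retain_until"] <= today:
            queue.append({"doc_id": meta["doc_id"], "reason": "retention_expired"})
    return queue
```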
If you want, next iteration can be specific Rego or Cedar policy snippets that consume these tables (e.g., allow_read for HIPAA PHI vs FINRA records, plus example Meilisearch filter expressions bound to PDP results).
CODITECT Document Management API
Version 1.0.0 (OpenAPI 3.1, spec at /openapi.json)
Enterprise-grade document management with AI-powered semantic search.
Features
Semantic Search: Vector similarity search with pgvector
Hybrid Search: Combined vector + keyword using RRF fusion
GraphRAG: Graph-based relationship traversal
Document Management: Full CRUD with chunking and embeddings
Analytics: Real-time metrics and usage tracking
Authentication
All endpoints (except health checks) require JWT authentication. Include the token in the Authorization header:
Authorization: Bearer <token>
Rate Limiting
- Standard tier: 100 requests/minute
- Premium tier: 1000 requests/minute
- Enterprise tier: 10000 requests/minute
Health
- GET /health - Health Check
- GET /health/ready - Readiness Check
- GET /health/live - Liveness Check
- GET /health/version - Version Check
Search
- POST /api/v1/search - Search Documents
- POST /api/v1/search/hybrid - Hybrid Search
- POST /api/v1/search/graphrag - GraphRAG Search
- GET /api/v1/search/suggest - Get Suggestions
- GET /api/v1/search/modes - Get Search Modes
Documents
- GET /api/v1/documents - List Documents
- POST /api/v1/documents - Create Document
- POST /api/v1/documents/upload - Upload Document. Accepts Markdown (.md), Text (.txt), PDF (.pdf), and Word (.docx) files as multipart/form-data with a required binary file field. Returns 201 with a DocumentResponse:
{
  "created_at": "2025-12-30T08:07:07.245Z",
  "updated_at": "2025-12-30T08:07:07.245Z",
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "string",
  "filepath": "string",
  "mime_type": "string",
  "file_size": 0,
  "file_hash": "string",
  "status": "pending",
  "document_type": "reference",
  "title": "string",
  "version": "string",
  "summary": "string",
  "keywords": ["string"],
  "tags": ["string"],
  "chunk_count": 0,
  "processing_error": "string"
}
or 422 with an HTTPValidationError:
{
  "detail": [
    { "loc": ["string", 0], "msg": "string", "type": "string" }
  ]
}
- GET /api/v1/documents/{doc_id} - Get Document
- PUT /api/v1/documents/{doc_id} - Update Document
- DELETE /api/v1/documents/{doc_id} - Delete Document
- GET /api/v1/documents/{doc_id}/chunks - Get Document Chunks
- POST /api/v1/documents/{doc_id}/reprocess - Reprocess Document
- GET /api/v1/documents/{doc_id}/status - Get Processing Status
Analytics
- GET /api/v1/analytics/dashboard - Get Dashboard
- POST /api/v1/analytics/metrics - Query Metrics
- GET /api/v1/analytics/documents - Get Document Metrics
- GET /api/v1/analytics/search - Get Search Analytics
- GET /api/v1/analytics/usage - Get Usage Metrics
- GET /api/v1/analytics/costs - Get Cost Metrics
Schemas
Body_upload_document_api_v1_documents_upload_post, ChunkResponse, CostMetrics, DashboardResponse, DocumentChunksResponse, DocumentCreate, DocumentListResponse, DocumentMetrics, DocumentProcessingStatus, DocumentResponse, DocumentStatus, DocumentType, DocumentUpdate, GraphRAGNode, GraphRAGRelationship, GraphRAGRequest, GraphRAGResponse, HTTPValidationError, HealthResponse, MetricSeries, MetricType, MetricsRequest, MetricsResponse, ProcessingMetrics, SearchAnalytics, SearchMetrics, SearchMode, SearchRequest, SearchResponse, SearchResultItem, SuggestResponse, TimeGranularity, TimeSeriesPoint, TopSearchQuery, UsageMetrics, ValidationError
The existing CODITECT DMS API already gives you most of the primitives you need; the missing pieces are (a) richer compliance metadata, (b) retention state, and (c) a policy layer that ties RBAC/ABAC to search and CRUD.
Below is a concise, implementation-oriented next step tailored to what you posted.
1. Concrete Postgres schema aligned to CODITECT API
Map the existing /api/v1/documents schema into relational tables and extend with compliance/ABAC fields.
-- Raw file + high-level doc info (maps to DocumentResponse)
CREATE TABLE documents (
id UUID PRIMARY KEY,
filename TEXT NOT NULL,
filepath TEXT NOT NULL,
mime_type TEXT NOT NULL,
file_size BIGINT NOT NULL,
file_hash TEXT NOT NULL,
status TEXT NOT NULL, -- pending, processed, error, deleted
document_type TEXT NOT NULL, -- reference, policy, record, etc.
title TEXT NOT NULL,
version TEXT NOT NULL,
summary TEXT,
chunk_count INT NOT NULL DEFAULT 0,
processing_error TEXT,
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE document_keywords (
document_id UUID NOT NULL REFERENCES documents(id),
keyword TEXT NOT NULL,
PRIMARY KEY (document_id, keyword)
);
CREATE TABLE document_tags (
document_id UUID NOT NULL REFERENCES documents(id),
tag TEXT NOT NULL,
PRIMARY KEY (document_id, tag)
);
-- Compliance-focused metadata for HIPAA + FINRA
CREATE TABLE document_metadata (
document_id UUID PRIMARY KEY REFERENCES documents(id),
domain TEXT NOT NULL, -- clinical, security-privacy, finance, etc.
jurisdiction TEXT[] NOT NULL, -- ["US"], ["US","EU"], etc.
regulations TEXT[] NOT NULL, -- ["HIPAA-164.316","FINRA-4511"]
security_class TEXT NOT NULL, -- public/internal/confidential/restricted
contains_phi BOOLEAN NOT NULL DEFAULT FALSE,
contains_pii BOOLEAN NOT NULL DEFAULT FALSE,
contains_financial BOOLEAN NOT NULL DEFAULT FALSE,
status TEXT NOT NULL, -- draft/in_review/effective/obsolete
effective_date DATE,
review_due_date DATE,
expiry_date DATE,
retention_category TEXT NOT NULL, -- HIPAA-6Y, FINRA-6Y, etc.
retention_period_y INT NOT NULL,
retain_until DATE NOT NULL,
legal_hold BOOLEAN NOT NULL DEFAULT FALSE,
legal_hold_reason TEXT,
business_unit TEXT,
desk TEXT,
facility TEXT,
owner_user_id TEXT NOT NULL,
owner_role TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
last_modified_at TIMESTAMPTZ NOT NULL
);
-- Optional: chunk metadata (aligned with /documents/{id}/chunks)
CREATE TABLE document_chunks (
id UUID PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id),
chunk_index INT NOT NULL, -- position of the chunk within the document
content TEXT NOT NULL,
embedding VECTOR, -- pgvector
created_at TIMESTAMPTZ NOT NULL
);
This lets you hydrate both /api/v1/documents and compliance-aware UIs from one schema, while still using pgvector for semantic search chunks.163164165
2. Meilisearch index definition for HIPAA + FINRA
You already have pgvector for semantic; Meilisearch (or similar) can own the text + metadata faceting for UI search filters.
{
"uid": "documents",
"primaryKey": "id",
"searchableAttributes": [
"title",
"summary",
"body",
"keywords",
"tags"
],
"filterableAttributes": [
"document_type",
"domain",
"jurisdiction",
"regulations",
"security_class",
"contains_phi",
"contains_pii",
"contains_financial",
"status",
"retention_category",
"business_unit",
"desk",
"facility",
"owner_role",
"owner_user_id",
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"updated_at"
],
"sortableAttributes": [
"effective_date",
"review_due_date",
"retain_until",
"created_at",
"updated_at"
]
}
- Frontend can pass filters like contains_phi = true AND security_class = "confidential" AND jurisdiction = "US".165166167168
3. Ingestion pipeline (Markdown → Postgres + Meilisearch + pgvector)
Core steps, assuming you already have the chunking/embedding logic for /api/v1/documents/upload:
- File upload (via POST /api/v1/documents/upload):
  - Store file, compute hash.
  - Insert into documents with status = 'pending', basic fields from request.
- Processor worker (kicks off on pending docs):
  - Parse Markdown:
    - YAML frontmatter → compliance metadata (domain, regulations, etc.).
    - Body → plain text.
    - Headings, keywords, tags.
  - Enrich metadata.
  - Chunk + embed:
    - Split body; store chunks in document_chunks with pgvector embeddings.
  - Upsert metadata:
    - Fill document_metadata for HIPAA/FINRA semantics.
  - Index into Meilisearch:
    - Build a doc like:
{
"id": "uuid",
"title": "HIPAA Privacy Officer Policy",
"summary": "...",
"body": "plain text...",
"keywords": ["hipaa","privacy","officer"],
"tags": ["policy","hipaa"],
"document_type": "policy",
"domain": "security-privacy",
"jurisdiction": ["US"],
"regulations": ["HIPAA-164.316"],
"security_class": "confidential",
"contains_phi": true,
"contains_pii": false,
"contains_financial": false,
"status": "effective",
"retention_category": "HIPAA-6Y",
"business_unit": "Compliance",
"facility": "Hospital-A",
"owner_role": "Privacy Officer",
"owner_user_id": "u123",
"effective_date": "2025-01-01",
"review_due_date": "2027-01-01",
"retain_until": "2031-01-01",
"created_at": "2025-01-01T10:00:00Z",
"updated_at": "2025-01-10T11:00:00Z"
}
- Update `documents.status = 'processed'`, `chunk_count`, `updated_at`.
- Search orchestration (for /api/v1/search/hybrid):
- Use pgvector for k‑NN over document_chunks.embedding to get candidate doc IDs.
- Query Meilisearch with filter incorporating RBAC/ABAC constraints (see below) and candidate IDs.
- Fuse scores (you already use RRF).
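The fusion step can be sketched with standard Reciprocal Rank Fusion over the two ranked ID lists (k = 60 is the conventional constant; the API's actual scoring may differ):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over ranked ID lists, e.g. pgvector k-NN
    results and Meilisearch results. score(d) = sum over lists of
    1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents appearing near the top of both lists dominate, which is why RRF works well without tuning the two engines' incompatible raw scores.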
4. Mapping RBAC/ABAC attributes into Postgres and filters
Extend with user + policy tables; your policy engine (OPA/Cedar/permit.io/etc.) reads from there.172173174175176177178
CREATE TABLE users (
id TEXT PRIMARY KEY,
display_name TEXT NOT NULL,
email TEXT NOT NULL,
business_unit TEXT,
desk TEXT,
facility TEXT
);
CREATE TABLE roles (
id TEXT PRIMARY KEY,
description TEXT NOT NULL
);
CREATE TABLE user_roles (
user_id TEXT NOT NULL REFERENCES users(id),
role_id TEXT NOT NULL REFERENCES roles(id),
PRIMARY KEY (user_id, role_id)
);
CREATE TABLE permissions (
id TEXT PRIMARY KEY, -- "doc.read", "doc.view_phi", ...
description TEXT NOT NULL
);
CREATE TABLE role_permissions (
role_id TEXT NOT NULL REFERENCES roles(id),
permission_id TEXT NOT NULL REFERENCES permissions(id),
PRIMARY KEY (role_id, permission_id)
);
ABAC lives in document_metadata and the user record:
- Resource attributes: domain, document_type, jurisdiction, regulations, security_class, contains_phi, contains_financial, business_unit, desk, facility, owner_role.
- Subject attributes: user roles + business_unit, desk, facility.
Authorization flow per request:
- Decode JWT → user_id, tenant, maybe roles.
- Load user + roles + perms from Postgres; build subject claims.
- For each candidate doc: build resource attributes from document_metadata and ask the PDP for an allow/deny decision.
- For list/search APIs: pre-filter Meilisearch with subject-derived constraints, then post-filter the results against PDP decisions.
This gives you a single, consistent attribute model used both in Meilisearch filters and in fine-grained policy decisions.
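A toy in-process version of that decision combines an RBAC permission check with the ABAC conditions listed earlier. In production this call is delegated to OPA/Cedar, so the rules below are illustrative assumptions only:

```python
def allow(subject: dict, action: str, resource: dict) -> bool:
    """Toy decision: RBAC gate (does the subject hold the permission?)
    plus ABAC conditions read from document_metadata attributes."""
    perms = subject.get("permissions", set())
    if action not in perms:
        return False  # RBAC: no role grants this action
    # ABAC: PHI requires the dedicated permission and a facility match
    if resource.get("contains_phi"):
        if "doc.view_phi" not in perms or subject.get("facility") != resource.get("facility"):
            return False
    # ABAC: FINRA books-and-records need the compliance role
    if resource.get("contains_financial") and "FINRA-4511" in resource.get("regulations", []):
        if "finra_compliance" not in subject.get("roles", []):
            return False
    return True
```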
5. Retention automation: rules + SQL
You already have retention_category, retention_period_y, retain_until, and legal_hold in document_metadata.
Retention policy table
CREATE TABLE retention_policies (
category TEXT PRIMARY KEY, -- "HIPAA-6Y", "FINRA-6Y"
description TEXT NOT NULL,
period_years INT NOT NULL,
legal_basis TEXT NOT NULL, -- "HIPAA-164.316", "FINRA-4511/SEC-17a-4"
min_years INT NOT NULL -- defensive lower bound
);
Trigger to compute retain_until
CREATE OR REPLACE FUNCTION set_retention_fields()
RETURNS TRIGGER AS $$
DECLARE
pol retention_policies;
years INT;
BEGIN
SELECT * INTO pol
FROM retention_policies
WHERE category = NEW.retention_category;
IF NOT FOUND THEN
RAISE EXCEPTION 'Unknown retention_category: %', NEW.retention_category;
END IF;
years := GREATEST(pol.period_years, pol.min_years);
IF NEW.effective_date IS NULL THEN
RAISE EXCEPTION 'effective_date required for retention computation';
END IF;
NEW.retention_period_y := years;
NEW.retain_until := (NEW.effective_date + (years || ' years')::INTERVAL)::DATE;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_set_retention
BEFORE INSERT OR UPDATE OF effective_date, retention_category
ON document_metadata
FOR EACH ROW
EXECUTE FUNCTION set_retention_fields();
Scheduled expiry queue
CREATE TABLE destruction_queue (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id),
queued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
processed_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'queued', -- queued|processing|completed|failed
reason TEXT NOT NULL
);
-- daily job:
INSERT INTO destruction_queue (document_id, reason)
SELECT dm.document_id, 'retention_expired'
FROM document_metadata dm
LEFT JOIN destruction_queue dq
ON dq.document_id = dm.document_id
AND dq.status IN ('queued','processing','completed')
WHERE dq.document_id IS NULL
AND dm.legal_hold = FALSE
AND dm.retain_until <= CURRENT_DATE;
A worker then:
- Talks to your WORM/object store to delete/expire,
- Marks status → completed,
- Updates documents.status = 'deleted',
- Logs an audit event for regulators.180181182183184169170
If you want a follow-up, the next iteration can be: concrete Rego (OPA) policies that consume document_metadata + users + roles and example Meilisearch filter strings derived from those policies for HIPAA PHI vs FINRA record access.