Purpose
Two use cases, same framework:
- Onboarding: A power user with an existing vault (Obsidian, Logseq, any markdown-based system) migrates their knowledge into the Ora architecture without starting over.
- Landfill Rehabilitation: A user with a vault that has become unnavigable through accumulated noise extracts the signal into a clean Ora structure and archives the original.
Dependencies
This framework depends on components that must be built first:
- Document Processing Framework (Appendix C-25) — the three-pass extraction pipeline, quality gates, grammar rules, deduplication
- Ora YAML schema finalized — Reference — YAML Property Specification (core: nexus, type, tags, dates; standard RAG properties applied by AI when appropriate)
- ChromaDB ingestion pipeline — the indexing system that makes extracted notes retrievable
- 13-type relationship taxonomy implemented — the typed relationship system that relationship seeding populates
This framework wraps the Document Processing Framework. It does not duplicate extraction logic. Stages 4-7 call C-25 directly.
The Ten Stages
Stage 1 — Inventory & Classification
Input: Path to source vault root directory.
Process:
- Recursive directory scan
- Classify every file by type:
  - Markdown note: `.md` files with prose content
  - MOC/index: `.md` files where link density exceeds prose density (threshold configurable, default: >60% of non-blank lines contain wikilinks)
  - Template: files in a templates directory or containing template syntax (Templater, core templates)
  - Attachment/binary: images, PDFs, audio, video, other non-text files
  - Configuration: `.obsidian/` directory contents, `.gitignore`, dotfiles
  - Other: anything not classified above
Output: Inventory report:
- Total file count by type
- Estimated markdown files requiring processing
- Estimated processing scope (token count for API-based extraction)
- List of skipped directories/file types
- Ready/not-ready assessment
The user reviews this report before Stage 2 runs. No processing begins without explicit confirmation.
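The classification rules above can be sketched as a pure function over a vault-relative path and the file's text. This is a minimal illustration, not the framework's implementation: the type labels, the `BINARY_SUFFIXES` set, and the template-syntax regex (Templater `<% %>` tags and core-template `{{date}}`-style placeholders) are assumptions chosen for the example.

```python
import re
from pathlib import PurePosixPath

WIKILINK = re.compile(r"\[\[[^\]]+\]\]")
# Assumed template markers: Templater tags and core-template placeholders.
TEMPLATE_SYNTAX = re.compile(r"<%.*?%>|\{\{(?:date|time|title)[^}]*\}\}", re.DOTALL)
# Illustrative, not exhaustive.
BINARY_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".pdf", ".mp3", ".wav", ".mp4", ".mov"}
MOC_LINK_DENSITY = 0.60  # default from the text: >60% of non-blank lines contain wikilinks


def link_density(text: str) -> float:
    """Fraction of non-blank lines that contain at least one wikilink."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(1 for ln in lines if WIKILINK.search(ln)) / len(lines)


def classify(rel_path: str, text: str = "") -> str:
    """Return a Stage 1 file type label for a vault-relative path."""
    path = PurePosixPath(rel_path)
    parts = [p.lower() for p in path.parts]
    # Dot-directories and dotfiles (.obsidian/, .gitignore) count as configuration.
    if any(p.startswith(".") for p in parts):
        return "configuration"
    if path.suffix.lower() != ".md":
        return "attachment" if path.suffix.lower() in BINARY_SUFFIXES else "other"
    if "templates" in parts[:-1] or TEMPLATE_SYNTAX.search(text):
        return "template"
    if link_density(text) > MOC_LINK_DENSITY:
        return "moc"
    return "markdown_note"
```

Keeping the function free of file system access makes the threshold easy to test and recalibrate; the recursive scan would simply call it once per file.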
Stage 2 — Metadata Harvesting
Input: All markdown files from Stage 1 inventory.
Process:
- Extract YAML frontmatter from every markdown file (preserve all properties as metadata)
- Extract all wikilinks per file (outgoing `[[link]]` and `[[link|alias]]`)
- Extract all tags per file (`#tag` inline and `tags:` in YAML)
- Map folder structure as organizational metadata (folder path → implicit category)
- Identify MOCs by structural signature (refine Stage 1 classification)
- Build source link graph: directed graph where nodes are files and edges are wikilinks
Output: Three artifacts:
- Metadata index — per-file record of all harvested YAML properties, tags, and link targets
- Source link graph — complete directed graph of all inter-file connections
- MOC hierarchy map — MOCs identified with their linked children, representing the source vault’s navigational structure
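A minimal sketch of the regex-based harvesting step follows, using only the standard library. The record layout, the function names, and the exact regular expressions are assumptions for illustration. The frontmatter is kept as raw text here; a real implementation would hand it to a YAML parser to read `tags:` and other properties.

```python
import re
from collections import defaultdict

FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
# Captures the link target, dropping any "#heading" or "|alias" suffix.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")
# "#tag" not preceded by a word character, "/" or "#", so headings and
# "[[Note#Heading]]" anchors are not mistaken for tags.
INLINE_TAG = re.compile(r"(?<![\w/#])#([A-Za-z][\w/-]*)")


def harvest(name: str, text: str) -> dict:
    """Build one metadata-index record for a markdown file."""
    match = FRONTMATTER.match(text)
    body = text[match.end():] if match else text
    return {
        "file": name,
        "frontmatter_raw": match.group(1) if match else "",
        "links": sorted({target.strip() for target in WIKILINK.findall(body)}),
        "tags": sorted(set(INLINE_TAG.findall(body))),
    }


def build_link_graph(records: list) -> dict:
    """Directed graph as an adjacency mapping: file -> sorted link targets."""
    graph = defaultdict(set)
    for rec in records:
        graph[rec["file"]].update(rec["links"])
    return {node: sorted(targets) for node, targets in graph.items()}
```

The adjacency mapping can be loaded into a graph library later if traversal or MOC hierarchy analysis needs more than simple lookups.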
Stage 3 — Content Triage
Input: All markdown notes (excluding templates, config, attachments) plus metadata index from Stage 2.
Process: Classify each file into a processing track:
| Track | Criteria | Processing Path |
|---|---|---|
| (a) Already atomic | Single idea, well-formed, <500 words, one clear claim or concept | Light reformat → schema application → quality gate (skip full extraction) |
| (b) Compound | Multiple ideas, long-form, >500 words or multiple distinct sections | Full three-pass extraction (Stages 4-7) |
| (c) Fragment | Below minimum content threshold (<50 words of prose, no clear claim) | Skip with flag for human decision |
| (d) MOC/index | Primarily link lists with navigational prose | Harvest link structure for Stage 8 relationship seeding; skip content extraction |
| (e) Template | Template syntax, placeholder content | Skip entirely |
| (f) Daily note/journal | Date-formatted filename, journal-style content | Content density assessment; extract only if substantive content detected (>200 words of non-routine prose) |
Output: Processing manifest — every file assigned to a track, with estimated token cost for extraction-track files.
Configurable thresholds: Word count boundaries, content density ratios, and the MOC link-density threshold are all configurable. Defaults are provided but should be calibrated during the Stage 0 calibration run (see Calibration Protocol).
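The rule-based part of the triage table can be sketched as below. The names (`TriageThresholds`, `triage`, the `file_type` labels, the `section_count` input) are hypothetical, and two things are assumptions rather than part of the specification: a note of exactly 500 words is treated as atomic, and daily notes are recognized only by an ISO-style `YYYY-MM-DD` filename. Judgments such as "one clear claim" are not captured here and would still need the model call or human review.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageThresholds:
    """Defaults taken from the triage table; all are meant to be tuned."""
    atomic_max_words: int = 500
    fragment_min_words: int = 50
    journal_min_words: int = 200  # used later by the track (f) density assessment


# Assumed daily-note filename pattern.
DATE_STEM = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def triage(stem: str, word_count: int, file_type: str,
           section_count: int = 1,
           t: TriageThresholds = TriageThresholds()) -> str:
    """Assign a processing track letter (a-f) from structural signals only."""
    if file_type == "template":
        return "e"
    if file_type == "moc":
        return "d"
    if DATE_STEM.match(stem):
        return "f"
    if word_count < t.fragment_min_words:
        return "c"
    if word_count > t.atomic_max_words or section_count > 1:
        return "b"
    return "a"
```

Passing a different `TriageThresholds` instance is how a calibration run would try alternative boundaries without touching the rules.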
Stages 4-7 — Document Processing Framework
These stages call Appendix C-25 directly. No duplication of extraction logic.
Applied only to files on tracks (a) and (b):
- Track (a) files get a simplified pass: schema validation, grammar rule application, quality gate — but skip full three-pass extraction since the content is already atomic
- Track (b) files get the full pipeline: Pass A (signal identification), Pass B (note generation in subtype schemas with grammar rules), Pass C (quality pre-screening), quality gate with three-queue routing, deduplication with merge-with-provenance
Stage 8 — Relationship Seeding
Input: Extracted notes from Stages 4-7 + source link graph from Stage 2 + MOC hierarchy map from Stage 2 + tag metadata from Stage 2.
Process: Merge three relationship signal sources into typed relationship candidates:
Source 1 — Link-derived relationships:
- Each wikilink A→B becomes a candidate relationship between the extracted notes from file A and file B
- MOC hierarchy: MOC→linked_note becomes a `parent/child` relationship candidate
- Mutual links (A→B and B→A): flagged for bidirectional relationship assessment (likely `supports`, `extends`, or `related_to`)
- Link context: where the wikilink appears in a sentence, the surrounding text provides evidence for relationship type classification
Source 2 — Extraction-derived relationships:
- Pass 1 relationship discovery from the Document Processing Framework (standard)
- Entity co-occurrence from NLP entity extraction (standard)
- Post-extraction question matching (standard)
Source 3 — Tag co-occurrence:
- Notes sharing 2+ tags are weak relationship candidates
- Shared tags suggest thematic connection but not specific relationship type
- Lowest confidence source — used to supplement, not override, the other two sources
Output: Typed relationship candidate list:
- Each candidate: source note, target note, proposed relationship type (from 13-type taxonomy), confidence score, source attribution (which of the three sources generated it)
- Candidates above confidence threshold auto-applied
- Candidates below threshold queued for human review
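One way to picture the merge and routing step is the sketch below. It is not the framework's scoring method: the per-source base confidences, the 0.7 threshold, and the noisy-OR combination are placeholders standing in for whatever the confidence-scoring model call produces. What it does illustrate is that agreement between sources raises confidence, tag co-occurrence alone stays weak, and every candidate keeps its source attribution.

```python
from dataclasses import dataclass

# Hypothetical base confidence per signal source (tag co-occurrence is weakest).
BASE_CONFIDENCE = {"link": 0.6, "extraction": 0.5, "tag": 0.2}
AUTO_APPLY_THRESHOLD = 0.7  # assumed value, to be calibrated


@dataclass(frozen=True)
class Candidate:
    source: str
    target: str
    rel_type: str  # one of the taxonomy's relationship types
    origin: str    # "link", "extraction", or "tag"


def merge_candidates(candidates: list) -> dict:
    """Combine signals for the same (source, target, type) using noisy-OR."""
    merged = {}
    for c in candidates:
        key = (c.source, c.target, c.rel_type)
        entry = merged.setdefault(key, {"miss": 1.0, "origins": set()})
        if c.origin not in entry["origins"]:  # count each source once
            entry["miss"] *= 1.0 - BASE_CONFIDENCE[c.origin]
            entry["origins"].add(c.origin)
    return {
        key: (round(1.0 - e["miss"], 3), sorted(e["origins"]))
        for key, e in merged.items()
    }


def route(merged: dict, threshold: float = AUTO_APPLY_THRESHOLD):
    """Split candidates into auto-applied and human-review queues."""
    auto = [k for k, (conf, _) in merged.items() if conf > threshold]
    review = [k for k, (conf, _) in merged.items() if conf <= threshold]
    return auto, review
```

In this sketch a candidate sitting exactly at the threshold goes to human review, which errs on the side of caution.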
Stage 9 — Schema Application & Ingestion
Input: Quality-gate-approved notes from Stages 4-7 + relationship candidates from Stage 8.
Process:
- Apply Ora YAML schema to all extracted notes:
  - `nexus`: assigned from source folder structure or source YAML metadata where determinable; `unassigned` if not determinable
  - `type`: `working` (all notes enter as incubator; no automatic engram promotion)
  - `writing`: `no` (default; adjusted if source metadata indicates creative/fiction content)
  - `tags`: merged from source tags + extraction-derived tags
  - `date created`: original file creation date from source metadata
  - `date modified`: current processing date
- Write approved notes to Ora vault directory
- Ingest into ChromaDB knowledge collection at incubator provenance weighting (0.2 — working type)
- Apply approved relationship candidates to the relationship graph
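The schema application rules can be sketched as a small pure function. The input shapes (`source_meta` with `folder`, `nexus`, `tags`, and `created` keys, and a `nexus_by_folder` lookup) are hypothetical, and the hand-rolled frontmatter writer only covers the flat properties listed above; it is not a general YAML serializer.

```python
from datetime import date


def apply_schema(source_meta: dict, extracted_tags: list,
                 nexus_by_folder: dict, today: date) -> dict:
    """Build the Ora frontmatter properties for one extracted note."""
    folder = source_meta.get("folder", "")
    # Source YAML wins, then the folder mapping, else "unassigned".
    nexus = source_meta.get("nexus") or nexus_by_folder.get(folder, "unassigned")
    tags = sorted(set(source_meta.get("tags", [])) | set(extracted_tags))
    return {
        "nexus": nexus,
        "type": "working",  # every note enters as incubator
        "writing": "no",    # default; not adjusted in this sketch
        "tags": tags,
        "date created": source_meta["created"],
        "date modified": today.isoformat(),
    }


def to_frontmatter(props: dict) -> str:
    """Render flat properties (strings and lists of strings) as a YAML block."""
    lines = ["---"]
    for key, value in props.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)
```

Passing `today` in explicitly keeps the function deterministic, which helps when comparing calibration runs.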
Stage 10 — Report
Output: Comprehensive processing summary:
- Inventory: total files scanned, files per classification type
- Triage: files per processing track, files skipped (with reasons per skip category)
- Extraction: total notes extracted, notes per atomic subtype (fact, process principle, definition, causal claim, analogy, evaluative)
- Quality gate: notes auto-approved, auto-rejected, queued for human review
- Relationships: relationship candidates generated (count per type from 13-type taxonomy), candidates auto-applied vs. queued for review
- Nexus assignment: notes with determined nexus vs. notes assigned `unassigned`
- Human review workload: estimated hours based on calibrated review rate (default: 30 notes/hour for quality gate review, 20 candidates/hour for relationship review)
- Source vault status: confirmation that source vault is unmodified
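The human review workload figure is simple arithmetic over the two review queues. A sketch using the default rates stated above; the function name and parameters are illustrative:

```python
def review_hours(notes_queued: int, candidates_queued: int,
                 notes_per_hour: float = 30.0,
                 candidates_per_hour: float = 20.0) -> float:
    """Estimated human review time in hours for both review queues."""
    return notes_queued / notes_per_hour + candidates_queued / candidates_per_hour
```

For example, 90 queued notes and 40 queued relationship candidates come to 3 + 2 = 5 hours at the default rates.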
Calibration Protocol
Before full vault processing, run a calibration sample of up to 20 files (the counts below total 15-20):
Select files across all triage tracks:
- 3-4 already-atomic notes
- 5-6 compound/long-form notes
- 2-3 fragments
- 3-4 MOCs
- 2-3 daily notes
Run the full pipeline on this sample. Evaluate:
- Triage accuracy: Did files land on the correct track? Adjust thresholds if not.
- Extraction quality: Are the extracted atomic notes well-formed? Do they pass the “would I retrieve this and find it useful?” test?
- Relationship candidate relevance: Are link-derived relationships producing sensible typed candidates? Is the confidence threshold calibrated correctly?
- Schema application: Are nexus assignments reasonable? Are tags merging correctly?
Adjust thresholds and parameters before committing to full vault processing.
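Drawing the sample can be automated from the Stage 3 processing manifest. The sketch below assumes the manifest is a mapping from file name to track letter and uses the upper bound of each range above as the per-track quota; both are illustrative choices. A fixed seed keeps the sample reproducible, so the same files can be re-run after thresholds are adjusted.

```python
import random

# Upper bounds of the per-track ranges in the calibration protocol (total: 20).
SAMPLE_QUOTA = {"a": 4, "b": 6, "c": 3, "d": 4, "f": 3}


def calibration_sample(manifest: dict, quota: dict = SAMPLE_QUOTA,
                       seed: int = 0) -> dict:
    """Pick up to quota[track] files per track, reproducibly."""
    rng = random.Random(seed)
    by_track = {}
    for name, track in sorted(manifest.items()):
        by_track.setdefault(track, []).append(name)
    sample = {}
    for track, limit in quota.items():
        pool = by_track.get(track, [])
        # Tracks with fewer files than the quota contribute everything they have.
        sample[track] = sorted(rng.sample(pool, min(limit, len(pool))))
    return sample
```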
Implementation Notes
- Batch processing with progress tracking: Large vaults (1,000+ files) require queue management, checkpoint/resume capability, and progress reporting. Integrate with the sleep-wake cycle for overnight processing.
- Token cost estimation: The processing manifest from Stage 3 provides estimated token costs before processing begins. The user approves the budget before extraction runs.
- Incremental processing: Support for processing a vault in batches — run 100 files, review results, adjust, run the next 100. Not all-or-nothing.
- Source vault format support: Primary target is Obsidian vaults (markdown + wikilinks). Logseq vaults (markdown + block references) require a format adapter in Stage 2. Notion exports (markdown + database properties) require a different metadata harvesting path. Plain markdown directories work with Stages 1-2 producing minimal metadata.
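Checkpoint/resume and incremental batches can share one mechanism: a persisted set of completed files. The sketch below is one possible shape under stated assumptions; the JSON checkpoint file, the function name, and the `process` callback are hypothetical. Each call handles at most one batch and then returns, leaving room for review and threshold adjustment before the next call.

```python
import json
from pathlib import Path


def process_in_batches(files: list, process, checkpoint_path: str,
                       batch_size: int = 100) -> list:
    """Process the next batch of not-yet-done files and return that batch."""
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    batch = [f for f in files if f not in done][:batch_size]
    for f in batch:
        process(f)
        done.add(f)
        # Persist after every file so an interrupted run resumes where it stopped.
        ckpt.write_text(json.dumps(sorted(done)))
    return batch
```

An empty return value signals that the whole manifest has been processed.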
Build Sequence
- Stage 1 (Inventory) — standalone Python script; no AI calls required; pure file system operations
- Stage 2 (Metadata Harvesting) — standalone Python script; regex-based extraction of YAML, wikilinks, tags; graph construction with NetworkX or equivalent
- Stage 3 (Content Triage) — requires lightweight model call for content density assessment on ambiguous files; most classification is rule-based
- Stages 4-7 — calls Document Processing Framework (build C-25 first)
- Stage 8 (Relationship Seeding) — requires 13-type taxonomy implementation; link-to-relationship conversion is rule-based; confidence scoring requires lightweight model call
- Stages 9-10 — schema application is rule-based; ChromaDB ingestion calls existing ingestion pipeline; report generation is template-based
Stages 1-3 can be built and tested independently before the Document Processing Framework exists. They produce useful output (inventory, metadata map, processing manifest) even without extraction.