Purpose

Two use cases, same framework:

  1. Onboarding: A power user with an existing vault (Obsidian, Logseq, any markdown-based system) migrates their knowledge into the Ora architecture without starting over.
  2. Landfill Rehabilitation: A user with a vault that has become unnavigable through accumulated noise extracts the signal into a clean Ora structure and archives the original.

Dependencies

This framework depends on components that must be built first:

  • Document Processing Framework (Appendix C-25) — the three-pass extraction pipeline, quality gates, grammar rules, deduplication
  • Ora YAML schema finalized — Reference — YAML Property Specification (core: nexus, type, tags, dates; standard RAG properties applied by AI when appropriate)
  • ChromaDB ingestion pipeline — the indexing system that makes extracted notes retrievable
  • 13-type relationship taxonomy implemented — the typed relationship system that relationship seeding populates

This framework wraps the Document Processing Framework. It does not duplicate extraction logic. Stages 4-7 call C-25 directly.


The Ten Stages

Stage 1 — Inventory & Classification

Input: Path to source vault root directory.

Process:

  • Recursive directory scan
  • Classify every file by type:
    • Markdown note: .md files with prose content
    • MOC/index: .md files where link density exceeds prose density (threshold configurable, default: >60% of non-blank lines contain wikilinks)
    • Template: files in a templates directory or containing template syntax (Templater, core templates)
    • Attachment/binary: images, PDFs, audio, video, other non-text files
    • Configuration: .obsidian/ directory contents, .gitignore, dotfiles
    • Other: anything not classified above

Output: Inventory report:

  • Total file count by type
  • Estimated markdown files requiring processing
  • Estimated processing scope (token count for API-based extraction)
  • List of skipped directories/file types
  • Ready/not-ready assessment

The user reviews this report before Stage 2 runs. No processing begins without explicit confirmation.
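
The Stage 1 classification rules can be sketched as a pure file-system script. This is a minimal sketch, not a definitive implementation: the function names, the category labels, the binary suffix list, and the template regex (Templater `<% %>` tags and the core `{{date}}`, `{{time}}`, `{{title}}` placeholders) are illustrative assumptions. Only the 60% link-density default comes from the rules above.

```python
from collections import Counter
from pathlib import Path
import re

# Assumed patterns; extend them to match the source vault's actual plugins.
WIKILINK = re.compile(r"\[\[[^\]]+\]\]")
TEMPLATE_SYNTAX = re.compile(r"<%.*?%>|\{\{(?:date|time|title)[^}]*\}\}", re.DOTALL)
BINARY_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".pdf", ".mp3", ".wav", ".mp4", ".mov"}


def classify_file(path: Path, root: Path, moc_threshold: float = 0.60) -> str:
    """Return one of the six Stage 1 categories (label strings are hypothetical)."""
    parts = [p.lower() for p in path.relative_to(root).parts]
    # Configuration: anything inside a dot-directory (.obsidian/) or any dotfile.
    if any(p.startswith(".") for p in parts):
        return "configuration"
    if path.suffix.lower() in BINARY_SUFFIXES:
        return "attachment"
    if path.suffix.lower() != ".md":
        return "other"
    text = path.read_text(encoding="utf-8", errors="replace")
    if "templates" in parts[:-1] or TEMPLATE_SYNTAX.search(text):
        return "template"
    # Link density: share of non-blank lines that contain at least one wikilink.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    linked = sum(1 for ln in lines if WIKILINK.search(ln))
    if lines and linked / len(lines) > moc_threshold:
        return "moc"
    return "markdown_note"


def inventory(root: Path) -> Counter:
    """Recursive scan returning the file count per category."""
    return Counter(classify_file(p, root) for p in sorted(root.rglob("*")) if p.is_file())
```

Calling `inventory(Path("/path/to/vault"))` yields the "total file count by type" line of the report. The sketch only reads files, which is consistent with leaving the source vault unmodified.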


Stage 2 — Metadata Harvesting

Input: All markdown files from Stage 1 inventory.

Process:

  • Extract YAML frontmatter from every markdown file (preserve all properties as metadata)
  • Extract all wikilinks per file (outgoing [[link]] and [[link|alias]])
  • Extract all tags per file (#tag inline and tags: in YAML)
  • Map folder structure as organizational metadata (folder path → implicit category)
  • Identify MOCs by structural signature (refine Stage 1 classification)
  • Build source link graph: directed graph where nodes are files and edges are wikilinks

Output: Three artifacts:

  1. Metadata index — per-file record of all harvested YAML properties, tags, and link targets
  2. Source link graph — complete directed graph of all inter-file connections
  3. MOC hierarchy map — MOCs identified with their linked children, representing the source vault’s navigational structure
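
As an illustration of the regex-based harvesting described above, the sketch below pulls the frontmatter block, wikilink targets, and inline tags from one note, then builds the directed link graph as a plain adjacency mapping. The regexes are simplified assumptions (for example, inline tags are assumed to start with a letter), the function names are hypothetical, and the YAML block is kept as raw text on the assumption that a real YAML parser handles it downstream.

```python
import re
from collections import defaultdict

# Frontmatter: a leading block delimited by lines of three dashes.
FRONTMATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", re.DOTALL)
# Matches [[target]], [[target|alias]], and [[target#heading]]; captures target and alias.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")
# Inline #tag, skipping markdown headings ("# Title") and URL fragments.
INLINE_TAG = re.compile(r"(?<![\w/#])#([A-Za-z][\w/-]*)")


def harvest(text: str) -> dict:
    """Return raw frontmatter, outgoing link targets, and inline tags for one note."""
    match = FRONTMATTER.match(text)
    raw_yaml = match.group(1) if match else ""
    body = text[match.end():] if match else text
    links = [target.strip() for target, _alias in WIKILINK.findall(body)]
    tags = sorted(set(INLINE_TAG.findall(body)))
    return {"frontmatter_raw": raw_yaml, "links": links, "tags": tags}


def build_link_graph(notes: dict[str, str]) -> dict[str, set[str]]:
    """Directed graph: note name -> set of note names it links to."""
    graph = defaultdict(set)
    for name, text in notes.items():
        graph[name]  # register the node even if it has no outgoing links
        for target in harvest(text)["links"]:
            graph[name].add(target)
    return dict(graph)
```

The adjacency mapping is enough to detect mutual links (A in graph[B] and B in graph[A]); a graph library can replace it if richer queries are needed.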

Stage 3 — Content Triage

Input: All markdown notes (excluding templates, config, attachments) plus metadata index from Stage 2.

Process: Classify each file into a processing track:

| Track | Criteria | Processing Path |
|---|---|---|
| (a) Already atomic | Single idea, well-formed, <500 words, one clear claim or concept | Light reformat → schema application → quality gate (skip full extraction) |
| (b) Compound | Multiple ideas, long-form, >500 words or multiple distinct sections | Full three-pass extraction (Stages 4-7) |
| (c) Fragment | Below minimum content threshold (<50 words of prose, no clear claim) | Skip with flag for human decision |
| (d) MOC/index | Primarily link lists with navigational prose | Harvest link structure for Stage 8 relationship seeding; skip content extraction |
| (e) Template | Template syntax, placeholder content | Skip entirely |
| (f) Daily note/journal | Date-formatted filename, journal-style content | Content density assessment; extract only if substantive content detected (>200 words of non-routine prose) |

Output: Processing manifest — every file assigned to a track, with estimated token cost for extraction-track files.

Configurable thresholds: Word count boundaries, content density ratios, and the MOC link-density threshold are all configurable. Defaults are provided but should be calibrated during the calibration run (see Calibration Protocol) before full processing.
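
The rule-based part of triage might look like the sketch below. It assumes the Stage 1 category, a prose word count, and a section count are already available per file; the `TriageThresholds` class, the `YYYY-MM-DD` daily-note filename pattern, and the order in which the rules are checked are illustrative assumptions. Judgments such as "single idea" or "no clear claim" are not captured here and would fall to the lightweight model call on ambiguous files.

```python
import re
from dataclasses import dataclass

# Assumed daily-note naming convention; adjust to the source vault.
DAILY_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}$")


@dataclass(frozen=True)
class TriageThresholds:
    atomic_max_words: int = 500    # below this, a single-section note is track (a)
    fragment_min_words: int = 50   # below this, the note is track (c)
    daily_min_words: int = 200     # non-routine prose needed to extract a daily note


def triage(stem: str, stage1_type: str, word_count: int, section_count: int,
           t: TriageThresholds = TriageThresholds()) -> str:
    """Return the track letter for one file using only rule-based signals."""
    if stage1_type == "template":
        return "e"
    if stage1_type == "moc":
        return "d"
    if DAILY_NAME.match(stem):
        return "f"
    if word_count < t.fragment_min_words:
        return "c"
    if word_count < t.atomic_max_words and section_count <= 1:
        return "a"
    return "b"
```

Keeping the thresholds in one object means a calibration run can adjust them in a single place, for example `triage("idea", "markdown_note", 120, 1, TriageThresholds(atomic_max_words=100))`.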


Stages 4-7 — Document Processing Framework

These stages call Appendix C-25 directly. No duplication of extraction logic.

Applied to files on tracks (a) and (b), and to track (f) daily notes that the Stage 3 content density assessment marks for extraction:

  • Track (a) files get a simplified pass: schema validation, grammar rule application, quality gate — but skip full three-pass extraction since the content is already atomic
  • Track (b) files get the full pipeline: Pass A (signal identification), Pass B (note generation in subtype schemas with grammar rules), Pass C (quality pre-screening), quality gate with three-queue routing, deduplication with merge-with-provenance

Stage 8 — Relationship Seeding

Input: Extracted notes from Stages 4-7 + source link graph from Stage 2 + MOC hierarchy map from Stage 2 + tag metadata from Stage 2.

Process: Merge three relationship signal sources into typed relationship candidates:

Source 1 — Link-derived relationships:

  • Each wikilink A→B becomes a candidate relationship between the extracted notes from file A and file B
  • MOC hierarchy: MOC→linked_note becomes a parent/child relationship candidate
  • Mutual links (A→B and B→A): flagged for bidirectional relationship assessment (likely supports, extends, or related_to)
  • Link context: where the wikilink appears in a sentence, the surrounding text provides evidence for relationship type classification

Source 2 — Extraction-derived relationships:

  • Pass 1 relationship discovery from the Document Processing Framework (standard)
  • Entity co-occurrence from NLP entity extraction (standard)
  • Post-extraction question matching (standard)

Source 3 — Tag co-occurrence:

  • Notes sharing 2+ tags are weak relationship candidates
  • Shared tags suggest thematic connection but not specific relationship type
  • Lowest confidence source — used to supplement, not override, the other two sources

Output: Typed relationship candidate list:

  • Each candidate: source note, target note, proposed relationship type (from 13-type taxonomy), confidence score, source attribution (which of the three sources generated it)
  • Candidates above confidence threshold auto-applied
  • Candidates below threshold queued for human review
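
One way to picture the merge and the threshold routing is the sketch below. The `Candidate` fields mirror the candidate description above, but everything else is an assumption: the confidence values (0.3 for tag co-occurrence, a 0.75 threshold), the rule of keeping the highest-confidence candidate per note pair, and treating a score exactly at the threshold as auto-applied. The framework does not specify a merge policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source: str
    target: str
    rel_type: str      # one of the 13 taxonomy types; values here are placeholders
    confidence: float
    origin: str        # "link", "extraction", or "tag"


def tag_candidates(note_tags: dict[str, set[str]], min_shared: int = 2,
                   confidence: float = 0.3) -> list[Candidate]:
    """Source 3: note pairs sharing at least `min_shared` tags become weak candidates."""
    names = sorted(note_tags)
    out = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(note_tags[a] & note_tags[b]) >= min_shared:
                out.append(Candidate(a, b, "related_to", confidence, "tag"))
    return out


def merge_and_route(candidates: list[Candidate], threshold: float = 0.75):
    """Keep the strongest candidate per (source, target) pair, then split by threshold."""
    best: dict[tuple[str, str], Candidate] = {}
    for c in candidates:
        key = (c.source, c.target)
        if key not in best or c.confidence > best[key].confidence:
            best[key] = c
    applied = [c for c in best.values() if c.confidence >= threshold]
    review = [c for c in best.values() if c.confidence < threshold]
    return applied, review
```

Because tag-derived candidates carry the lowest assumed confidence, a link-derived or extraction-derived candidate for the same pair always wins the merge, which matches the "supplement, not override" rule for Source 3.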

Stage 9 — Schema Application & Ingestion

Input: Quality-gate-approved notes from Stages 4-7 + relationship candidates from Stage 8.

Process:

  • Apply Ora YAML schema to all extracted notes:
    • nexus: assigned from source folder structure or source YAML metadata where determinable; unassigned if not determinable
    • type: working (all notes enter as incubator — no automatic engram promotion)
    • writing: no (default; adjusted if source metadata indicates creative/fiction content)
    • tags: merged from source tags + extraction-derived tags
    • date created: original file creation date from source metadata
    • date modified: current processing date
  • Write approved notes to Ora vault directory
  • Ingest into ChromaDB knowledge collection at incubator provenance weighting (0.2 — working type)
  • Apply approved relationship candidates to the relationship graph
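
The property defaults above can be sketched as a small rule-based function. The property names and default values come from the list above; the function names, the ISO date format, and the hand-rolled renderer are assumptions for illustration. The renderer handles only flat strings and lists of strings, so a real YAML library would be the safer choice in practice.

```python
from datetime import date
from typing import Iterable, Optional


def build_frontmatter(source_tags: Iterable[str], extracted_tags: Iterable[str],
                      created: date, nexus: Optional[str] = None,
                      processed: Optional[date] = None) -> dict:
    """Apply the Stage 9 defaults to one extracted note."""
    return {
        "nexus": nexus if nexus else "unassigned",
        "type": "working",   # every note enters as incubator
        "writing": "no",     # default; override for creative/fiction content
        "tags": sorted(set(source_tags) | set(extracted_tags)),
        "date created": created.isoformat(),
        "date modified": (processed or date.today()).isoformat(),
    }


def render_frontmatter(props: dict) -> str:
    """Render flat properties as a frontmatter block (not a general YAML writer)."""
    lines = ["---"]
    for key, value in props.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Merging tags through a set removes duplicates between source tags and extraction-derived tags, and sorting keeps the output stable across runs.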

Stage 10 — Report

Output: Comprehensive processing summary:

  • Inventory: total files scanned, files per classification type
  • Triage: files per processing track, files skipped (with reasons per skip category)
  • Extraction: total notes extracted, notes per atomic subtype (fact, process principle, definition, causal claim, analogy, evaluative)
  • Quality gate: notes auto-approved, auto-rejected, queued for human review
  • Relationships: relationship candidates generated (count per type from 13-type taxonomy), candidates auto-applied vs. queued for review
  • Nexus assignment: notes with a determined nexus vs. notes left unassigned
  • Human review workload: estimated hours based on calibrated review rate (default: 30 notes/hour for quality gate review, 20 candidates/hour for relationship review)
  • Source vault status: confirmation that source vault is unmodified
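
The human review workload estimate is simple arithmetic on the two default review rates. A minimal sketch, with a hypothetical function name:

```python
def review_hours(notes_queued: int, candidates_queued: int,
                 notes_per_hour: float = 30, candidates_per_hour: float = 20) -> float:
    """Estimated review time: quality-gate notes plus relationship candidates."""
    return notes_queued / notes_per_hour + candidates_queued / candidates_per_hour


# 300 queued notes and 200 queued candidates: 300/30 + 200/20 = 20.0 hours
print(review_hours(300, 200))
```

The two rates are keyword arguments so that rates measured during calibration can replace the defaults.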

Calibration Protocol

Before full vault processing, run a calibration sample of 15-20 files:

Select files across all triage tracks:

  • 3-4 already-atomic notes
  • 5-6 compound/long-form notes
  • 2-3 fragments
  • 3-4 MOCs
  • 2-3 daily notes

Run the full pipeline on this sample. Evaluate:

  1. Triage accuracy: Did files land on the correct track? Adjust thresholds if not.
  2. Extraction quality: Are the extracted atomic notes well-formed? Do they pass the “would I retrieve this and find it useful?” test?
  3. Relationship candidate relevance: Are link-derived relationships producing sensible typed candidates? Is the confidence threshold calibrated correctly?
  4. Schema application: Are nexus assignments reasonable? Are tags merging correctly?

Adjust thresholds and parameters before committing to full vault processing.
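
Sample selection could be automated from the Stage 3 processing manifest. The sketch below is a hypothetical helper that assumes the manifest is a mapping from file path to track letter. It uses the upper end of each range above (20 files in total) and a seeded random draw so that the same sample can be reproduced after thresholds are adjusted.

```python
import random

# Upper ends of the per-track ranges; track (e) templates are not sampled.
SAMPLE_PLAN = {"a": 4, "b": 6, "c": 3, "d": 4, "f": 3}


def pick_calibration_sample(manifest: dict[str, str], plan: dict[str, int] = SAMPLE_PLAN,
                            seed: int = 0) -> list[str]:
    """Draw up to plan[track] files per track; small vaults simply yield fewer."""
    rng = random.Random(seed)
    by_track: dict[str, list[str]] = {}
    for path, track in manifest.items():
        by_track.setdefault(track, []).append(path)
    sample = []
    for track, wanted in plan.items():
        pool = sorted(by_track.get(track, []))
        sample.extend(rng.sample(pool, min(wanted, len(pool))))
    return sample
```

Sorting each pool before sampling makes the draw independent of the order in which the manifest was built.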


Implementation Notes

  • Batch processing with progress tracking: Large vaults (1,000+ files) require queue management, checkpoint/resume capability, and progress reporting. Integrate with sleep-wake cycle for overnight processing.
  • Token cost estimation: The processing manifest from Stage 3 provides estimated token costs before processing begins. The user approves the budget before extraction runs.
  • Incremental processing: Support for processing a vault in batches — run 100 files, review results, adjust, run the next 100. Not all-or-nothing.
  • Source vault format support: Primary target is Obsidian vaults (markdown + wikilinks). Logseq vaults (markdown + block references) require a format adapter in Stage 2. Notion exports (markdown + database properties) require a different metadata harvesting path. Plain markdown directories work without an adapter, though Stages 1-2 produce only minimal metadata for them.
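
Checkpoint/resume and incremental batches can share one mechanism. The sketch below is an assumption-laden illustration: the checkpoint is a JSON list of finished file names, `process` stands in for whatever per-file pipeline call is used, and `run_batch` is a hypothetical name. Each call processes at most one batch and returns the files it handled, so the user can review results before the next call.

```python
import json
from pathlib import Path
from typing import Callable


def run_batch(manifest: list[str], checkpoint: Path, process: Callable[[str], None],
              batch_size: int = 100) -> list[str]:
    """Process the next `batch_size` unfinished files and record progress."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    pending = [name for name in manifest if name not in done][:batch_size]
    for name in pending:
        process(name)
        done.add(name)
        # Persist after every file so an interrupted run resumes where it stopped.
        checkpoint.write_text(json.dumps(sorted(done)))
    return pending
```

An empty return value signals that the manifest is complete. The checkpoint file lives outside the source vault, which keeps the vault unmodified.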

Build Sequence

  1. Stage 1 (Inventory) — standalone Python script; no AI calls required; pure file system operations
  2. Stage 2 (Metadata Harvesting) — standalone Python script; regex-based extraction of YAML, wikilinks, tags; graph construction with NetworkX or equivalent
  3. Stage 3 (Content Triage) — requires lightweight model call for content density assessment on ambiguous files; most classification is rule-based
  4. Stages 4-7 — calls Document Processing Framework (build C-25 first)
  5. Stage 8 (Relationship Seeding) — requires 13-type taxonomy implementation; link-to-relationship conversion is rule-based; confidence scoring requires lightweight model call
  6. Stages 9-10 — schema application is rule-based; ChromaDB ingestion calls existing ingestion pipeline; report generation is template-based

Stages 1-3 can be built and tested independently before the Document Processing Framework exists. They produce useful output (inventory, metadata map, processing manifest) even without extraction.