Vault Conversion Pipeline

Purpose

Two use cases, same framework:

Onboarding: A power user with an existing vault (Obsidian, Logseq, any markdown-based system) migrates their knowledge into the Ora architecture without starting over.
Landfill Rehabilitation: A user with a vault that has become unnavigable through accumulated noise extracts the signal into a clean Ora structure and archives the original.

Dependencies

This framework depends on components that must be built first:

Document Processing Framework (Appendix C-25) — the three-pass extraction pipeline, quality gates, grammar rules, deduplication
Ora YAML schema finalized — Reference — YAML Property Specification (core: nexus, type, tags, dates; standard RAG properties applied by AI when appropriate)
ChromaDB ingestion pipeline — the indexing system that makes extracted notes retrievable
13-type relationship taxonomy implemented — the typed relationship system that relationship seeding populates

This framework wraps the Document Processing Framework. It does not duplicate extraction logic. Stages 4-7 call C-25 directly.

The Ten Stages

Stage 1 — Inventory & Classification

Input: Path to source vault root directory.

Process:

Recursive directory scan
Classify every file by type:
- Markdown note — .md files with prose content
- MOC/index — .md files where link density exceeds prose density (threshold configurable, default: >60% of non-blank lines contain wikilinks)
- Template — files in a templates directory or containing template syntax (Templater, core templates)
- Attachment/binary — images, PDFs, audio, video, other non-text files
- Configuration — .obsidian/ directory contents, .gitignore, dotfiles
- Other — anything not classified above

Output: Inventory report:

Total file count by type
Estimated markdown files requiring processing
Estimated processing scope (token count for API-based extraction)
List of skipped directories/file types
Ready/not-ready assessment

The user reviews this report before Stage 2 runs. No processing begins without explicit confirmation.
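
The classification rules above can be sketched as a small rule-based pass over the vault. This is a minimal illustration, not the framework's implementation: the function names, the extension list, the template folder names, and the template-syntax pattern are all assumptions, and frontmatter lines are counted as ordinary lines when computing link density.

```python
from collections import Counter
from pathlib import Path
import re

WIKILINK = re.compile(r"\[\[[^\]]+\]\]")
# Assumed markers: Templater tags (<% ... %>) and core-template variables.
TEMPLATE_SYNTAX = re.compile(r"<%.*?%>|\{\{(?:date|time|title)[^}]*\}\}", re.S)
# Assumed, non-exhaustive extension list for attachments.
BINARY_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf",
               ".mp3", ".wav", ".mp4", ".mov"}
TEMPLATE_DIRS = {"templates", "_templates"}  # hypothetical folder names


def classify(path: Path, vault_root: Path, moc_threshold: float = 0.60) -> str:
    """Return one of: configuration, attachment, other, template, moc, note."""
    parts = path.relative_to(vault_root).parts
    if any(p.startswith(".") for p in parts):  # .obsidian/, .gitignore, dotfiles
        return "configuration"
    suffix = path.suffix.lower()
    if suffix in BINARY_EXTS:
        return "attachment"
    if suffix != ".md":
        return "other"
    text = path.read_text(encoding="utf-8", errors="replace")
    in_template_dir = any(p.lower() in TEMPLATE_DIRS for p in parts[:-1])
    if in_template_dir or TEMPLATE_SYNTAX.search(text):
        return "template"
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines:
        linked = sum(1 for ln in lines if WIKILINK.search(ln))
        if linked / len(lines) > moc_threshold:  # default: >60% of non-blank lines
            return "moc"
    return "note"


def inventory(vault_root: Path) -> Counter:
    """Count files per classification type for the inventory report."""
    return Counter(classify(p, vault_root)
                   for p in vault_root.rglob("*") if p.is_file())
```

Calling `inventory(Path("/path/to/vault"))` yields the per-type counts for the report; token estimates and the ready/not-ready assessment would be layered on top.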

Stage 2 — Metadata Harvesting

Input: All markdown files from Stage 1 inventory.

Process:

Extract YAML frontmatter from every markdown file (preserve all properties as metadata)
Extract all wikilinks per file (outgoing [[link]] and [[link|alias]])
Extract all tags per file (#tag inline and tags: in YAML)
Map folder structure as organizational metadata (folder path → implicit category)
Identify MOCs by structural signature (refine Stage 1 classification)
Build source link graph: directed graph where nodes are files and edges are wikilinks

Output: Three artifacts:

Metadata index — per-file record of all harvested YAML properties, tags, and link targets
Source link graph — complete directed graph of all inter-file connections
MOC hierarchy map — MOCs identified with their linked children, representing the source vault’s navigational structure
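
A regex-based harvest of the kind described above might look like the following sketch. The patterns and function names are illustrative assumptions. The frontmatter is kept as raw text here and would be handed to a YAML parser in practice, and the graph is a plain adjacency dict standing in for NetworkX or an equivalent.

```python
import re

FRONTMATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", re.S)
# Captures the link target only, dropping any #heading and |alias parts.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")
# Inline tags: '#' followed directly by a letter, not preceded by a word char.
INLINE_TAG = re.compile(r"(?<![\w/#])#([A-Za-z][\w/-]*)")


def harvest(text: str) -> dict:
    """Extract raw frontmatter, outgoing link targets, and inline tags."""
    m = FRONTMATTER.match(text)
    raw_yaml = m.group(1) if m else ""
    body = text[m.end():] if m else text
    links = sorted({target.strip() for target in WIKILINK.findall(body)})
    tags = sorted(set(INLINE_TAG.findall(body)))
    return {"frontmatter_raw": raw_yaml, "links": links, "tags": tags}


def build_link_graph(notes: dict[str, str]) -> dict[str, set[str]]:
    """Directed graph: note name -> set of wikilink targets.

    `notes` maps a note name (assumed to be the filename stem) to its text.
    """
    return {name: set(harvest(text)["links"]) for name, text in notes.items()}
```

Tags declared under `tags:` in the YAML block are not handled by `INLINE_TAG`; they would come from the parsed frontmatter and be merged with the inline set.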

Stage 3 — Content Triage

Input: All markdown notes (excluding templates, config, attachments) plus metadata index from Stage 2.

Process: Classify each file into a processing track:

Track	Criteria	Processing Path
(a) Already atomic	Single idea, well-formed, <500 words, one clear claim or concept	Light reformat → schema application → quality gate (skip full extraction)
(b) Compound	Multiple ideas, long-form, >500 words or multiple distinct sections	Full three-pass extraction (Stages 4-7)
(c) Fragment	Below minimum content threshold (<50 words of prose, no clear claim)	Skip with flag for human decision
(d) MOC/index	Primarily link lists with navigational prose	Harvest link structure for Stage 8 relationship seeding; skip content extraction
(e) Template	Template syntax, placeholder content	Skip entirely
(f) Daily note/journal	Date-formatted filename, journal-style content	Content density assessment; extract only if substantive content detected (>200 words of non-routine prose)

Output: Processing manifest — every file assigned to a track, with estimated token cost for extraction-track files.

Configurable thresholds: Word count boundaries, content density ratios, and the MOC link-density threshold are all configurable. Defaults are provided but should be tuned during the calibration run described under Calibration Protocol.
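
The rule-based part of triage can be sketched as below, with the table's defaults held in one configurable object. The names `TriageThresholds` and `triage` are hypothetical. The sketch takes pre-computed features as input and does not judge "single idea" or "clear claim", which the framework leaves to a lightweight model call on ambiguous files. A note of exactly 500 words is treated as atomic here, which is an assumption since the table does not specify the boundary.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class TriageThresholds:
    atomic_max_words: int = 500     # track (a) vs. (b) boundary
    fragment_min_words: int = 50    # below this: track (c)
    journal_min_words: int = 200    # daily notes extracted only above this
    moc_link_density: float = 0.60  # share of non-blank lines with wikilinks


# Assumed daily-note filename convention (ISO date stem, e.g. 2024-01-05).
DAILY_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def triage(stem: str, word_count: int, link_density: float,
           section_count: int, is_template: bool,
           t: TriageThresholds = TriageThresholds()) -> str:
    """Assign a processing track from simple, rule-based features."""
    if is_template:
        return "e"
    if link_density > t.moc_link_density:
        return "d"
    if DAILY_NAME.match(stem):
        # "f-extract" / "f-skip" are illustrative sub-labels for track (f).
        return "f-extract" if word_count > t.journal_min_words else "f-skip"
    if word_count < t.fragment_min_words:
        return "c"
    if word_count > t.atomic_max_words or section_count > 1:
        return "b"
    return "a"
```

Passing a different `TriageThresholds` instance is how a calibration run would adjust the boundaries without touching the rules.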

Stages 4-7 — Document Processing Framework

These stages call Appendix C-25 directly. No duplication of extraction logic.

Applied only to files on tracks (a) and (b):

Track (a) files get a simplified pass: schema validation, grammar rule application, quality gate — but skip full three-pass extraction since the content is already atomic
Track (b) files get the full pipeline: Pass A (signal identification), Pass B (note generation in subtype schemas with grammar rules), Pass C (quality pre-screening), quality gate with three-queue routing, deduplication with merge-with-provenance

Stage 8 — Relationship Seeding

Input: Extracted notes from Stages 4-7 + source link graph from Stage 2 + MOC hierarchy map from Stage 2 + tag metadata from Stage 2.

Process: Merge three relationship signal sources into typed relationship candidates:

Source 1 — Link-derived relationships:

Each wikilink A→B becomes a candidate relationship between the extracted notes from file A and file B
MOC hierarchy: MOC→linked_note becomes a parent/child relationship candidate
Mutual links (A→B and B→A): flagged for bidirectional relationship assessment (likely supports, extends, or related_to)
Link context: where the wikilink appears in a sentence, the surrounding text provides evidence for relationship type classification

Source 2 — Extraction-derived relationships:

Pass 1 relationship discovery from the Document Processing Framework (standard)
Entity co-occurrence from NLP entity extraction (standard)
Post-extraction question matching (standard)

Source 3 — Tag co-occurrence:

Notes sharing 2+ tags are weak relationship candidates
Shared tags suggest thematic connection but not specific relationship type
Lowest confidence source — used to supplement, not override, the other two sources

Output: Typed relationship candidate list:

Each candidate: source note, target note, proposed relationship type (from 13-type taxonomy), confidence score, source attribution (which of the three sources generated it)
Candidates above confidence threshold auto-applied
Candidates below threshold queued for human review
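
One way to picture the merge and routing step is the sketch below. Everything named here is an assumption for illustration: the `Candidate` record, the fixed 0.3 confidence for tag-derived candidates, the 0.75 threshold, and the choice to auto-apply a candidate that sits exactly at the threshold. Tag candidates are dropped whenever a link-derived or extraction-derived candidate already covers the same pair of notes, which is one reading of "supplement, not override".

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    source: str
    target: str
    rel_type: str      # one of the 13 taxonomy types
    confidence: float
    origin: str        # "link", "extraction", or "tag"


def tag_candidates(note_tags: dict[str, set[str]], min_shared: int = 2,
                   confidence: float = 0.3) -> list[Candidate]:
    """Notes sharing min_shared+ tags become weak related_to candidates."""
    out = []
    for a, b in combinations(sorted(note_tags), 2):
        if len(note_tags[a] & note_tags[b]) >= min_shared:
            out.append(Candidate(a, b, "related_to", confidence, "tag"))
    return out


def merge(candidates: list[Candidate]) -> list[Candidate]:
    """Keep tag candidates only for pairs no stronger source has covered."""
    covered = {frozenset((c.source, c.target))
               for c in candidates if c.origin != "tag"}
    return [c for c in candidates
            if c.origin != "tag"
            or frozenset((c.source, c.target)) not in covered]


def route(candidates: list[Candidate], threshold: float = 0.75):
    """Split into (auto-applied, queued for human review)."""
    auto = [c for c in candidates if c.confidence >= threshold]
    review = [c for c in candidates if c.confidence < threshold]
    return auto, review
```

The `origin` field carries the source attribution through to the report, so review queues can be sorted by which signal produced each candidate.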

Stage 9 — Schema Application & Ingestion

Input: Quality-gate-approved notes from Stages 4-7 + relationship candidates from Stage 8.

Process:

Apply Ora YAML schema to all extracted notes:
- nexus: assigned from source folder structure or source YAML metadata where determinable; unassigned if not determinable
- type: working (all notes enter as incubator — no automatic engram promotion)
- writing: no (default; adjusted if source metadata indicates creative/fiction content)
- tags: merged from source tags + extraction-derived tags
- date created: original file creation date from source metadata
- date modified: current processing date
Write approved notes to Ora vault directory
Ingest into ChromaDB knowledge collection at incubator provenance weighting (0.2 — working type)
Apply approved relationship candidates to the relationship graph
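
As a rough illustration of the schema application step, the sketch below renders a frontmatter block from the fields listed above. The function name and output layout are assumptions; the authoritative property names and formats live in the YAML Property Specification, and a real implementation would likely emit YAML through a library rather than string formatting.

```python
from datetime import date
from typing import Iterable, Optional


def build_frontmatter(source_tags: Iterable[str], extracted_tags: Iterable[str],
                      created: date, processed: date,
                      nexus: Optional[str] = None, writing: bool = False) -> str:
    """Render an Ora-style frontmatter block for one extracted note."""
    tags = sorted(set(source_tags) | set(extracted_tags))  # merged, de-duplicated
    lines = [
        "---",
        f"nexus: {nexus if nexus else 'unassigned'}",
        "type: working",  # all notes enter as working; no automatic promotion
        f"writing: {'yes' if writing else 'no'}",
    ]
    if tags:
        lines.append("tags:")
        lines += [f"  - {t}" for t in tags]
    else:
        lines.append("tags: []")
    lines += [
        f"date created: {created.isoformat()}",    # from source metadata
        f"date modified: {processed.isoformat()}",  # processing date
        "---",
    ]
    return "\n".join(lines) + "\n"
```

For example, `build_frontmatter(["a", "b"], ["b", "c"], date(2021, 3, 4), date.today())` produces a block with three merged tags and `nexus: unassigned`.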

Stage 10 — Report

Output: Comprehensive processing summary:

Inventory: total files scanned, files per classification type
Triage: files per processing track, files skipped (with reasons per skip category)
Extraction: total notes extracted, notes per atomic subtype (fact, process principle, definition, causal claim, analogy, evaluative)
Quality gate: notes auto-approved, auto-rejected, queued for human review
Relationships: relationship candidates generated (count per type from 13-type taxonomy), candidates auto-applied vs. queued for review
Nexus assignment: notes with a determined nexus vs. notes left unassigned
Human review workload: estimated hours based on calibrated review rate (default: 30 notes/hour for quality gate review, 20 candidates/hour for relationship review)
Source vault status: confirmation that source vault is unmodified
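
The human review workload figure is simple arithmetic over the two review queues. A minimal sketch, using the default rates stated above (the function name and the example counts are hypothetical):

```python
def estimate_review_hours(notes_queued: int, candidates_queued: int,
                          notes_per_hour: float = 30,
                          candidates_per_hour: float = 20) -> float:
    """Estimated hours = quality-gate review time + relationship review time."""
    return notes_queued / notes_per_hour + candidates_queued / candidates_per_hour


# Example: 450 queued notes and 200 queued relationship candidates.
hours = estimate_review_hours(450, 200)  # 450/30 + 200/20 = 15 + 10 = 25.0
```

The rates would be replaced with the reviewer's measured pace once the calibration run has produced one.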

Calibration Protocol

Before full vault processing, run a calibration sample of roughly 20 files (15-20, depending on how many of each kind the vault offers):

Select files across the triage tracks (templates excluded, since they are skipped entirely):

3-4 already-atomic notes
5-6 compound/long-form notes
2-3 fragments
3-4 MOCs
2-3 daily notes

Run the full pipeline on this sample. Evaluate:

Triage accuracy: Did files land on the correct track? Adjust thresholds if not.
Extraction quality: Are the extracted atomic notes well-formed? Do they pass the “would I retrieve this and find it useful?” test?
Relationship candidate relevance: Are link-derived relationships producing sensible typed candidates? Is the confidence threshold calibrated correctly?
Schema application: Are nexus assignments reasonable? Are tags merging correctly?

Adjust thresholds and parameters before committing to full vault processing.
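
Drawing the sample from a Stage 3 manifest could be done with a small stratified pick like the one below. This is a sketch under stated assumptions: the manifest is taken to be a mapping from track letter to file paths, the quotas use the upper end of each range listed above, and a fixed seed keeps the sample reproducible between runs.

```python
import random

# Upper end of each range from the protocol; the total is at most 20 files.
SAMPLE_QUOTA = {"a": 4, "b": 6, "c": 3, "d": 4, "f": 3}


def pick_calibration_sample(manifest: dict[str, list[str]],
                            quota: dict[str, int] = SAMPLE_QUOTA,
                            seed: int = 0) -> dict[str, list[str]]:
    """Randomly pick up to quota[track] files from each triage track."""
    rng = random.Random(seed)
    sample = {}
    for track, n in quota.items():
        files = manifest.get(track, [])
        # Take everything when a track has fewer files than its quota.
        sample[track] = rng.sample(files, min(n, len(files)))
    return sample
```

If the triage thresholds change after the first calibration pass, re-running with a different seed gives a fresh sample to confirm the adjustment.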

Implementation Notes

Batch processing with progress tracking: Large vaults (1,000+ files) require queue management, checkpoint/resume capability, and progress reporting. Integrate with the sleep-wake cycle so long runs can proceed overnight.
Token cost estimation: The processing manifest from Stage 3 provides estimated token costs before processing begins. The user approves the budget before extraction runs.
Incremental processing: Support for processing a vault in batches — run 100 files, review results, adjust, run the next 100. Not all-or-nothing.
Source vault format support: Primary target is Obsidian vaults (markdown + wikilinks). Logseq vaults (markdown + block references) require a format adapter in Stage 2. Notion exports (markdown + database properties) require a different metadata harvesting path. Plain markdown directories work with Stages 1-2 producing minimal metadata.
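
Checkpoint/resume and incremental batches can be combined in one small loop. The sketch below is illustrative only: the function name, the JSON checkpoint format, and the `process` callback are assumptions, and a production version would add error handling and queue management.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def run_in_batches(files: Sequence[str], process: Callable[[str], None],
                   checkpoint: Path, batch_size: int = 100) -> int:
    """Process the next batch of unprocessed files; return how many remain.

    The checkpoint file records completed files, so an interrupted or
    deliberately paused run resumes where it stopped.
    """
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    pending = [f for f in files if f not in done]
    batch = pending[:batch_size]
    for f in batch:
        process(f)
        done.add(f)
        # Persist after every file so a crash loses at most one item.
        checkpoint.write_text(json.dumps(sorted(done)))
    return len(pending) - len(batch)
```

Each call handles one batch, which leaves room for the review-and-adjust step between batches; calling it again with the same checkpoint picks up the next 100 files.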

Build Sequence

Stage 1 (Inventory) — standalone Python script; no AI calls required; pure file system operations
Stage 2 (Metadata Harvesting) — standalone Python script; regex-based extraction of YAML, wikilinks, tags; graph construction with NetworkX or equivalent
Stage 3 (Content Triage) — requires lightweight model call for content density assessment on ambiguous files; most classification is rule-based
Stages 4-7 — calls Document Processing Framework (build C-25 first)
Stage 8 (Relationship Seeding) — requires 13-type taxonomy implementation; link-to-relationship conversion is rule-based; confidence scoring requires lightweight model call
Stages 9-10 — schema application is rule-based; ChromaDB ingestion calls existing ingestion pipeline; report generation is template-based

Stages 1-3 can be built and tested independently before the Document Processing Framework exists. They produce useful output (inventory, metadata map, processing manifest) even without extraction.