Normalization Pipeline
The normalize step is where Cortex’s interesting work happens. Capture is trivial; recall is just an indexed query. Normalize is where unstructured events become structured records.
Why normalize is its own step
Capture is a single file write. The hook that triggered it has already returned. The session it came from may already be over.
Normalize, by contrast, may need to:
- Decide whether an event yields one record or many.
- Slugify content for stable IDs.
- Look up dedup keys against the existing store.
- Update multiple indexes in the SQLite sidecar.
- Walk related-to references between records.
- Optionally commit the new records to git.
Putting all of that in the capture path would couple session lifecycle to the slowest, most failure-prone part of the system. Splitting them out lets capture stay trivial and lets normalize take whatever time it needs.
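Capture's half of the contract is small enough to show in full. A minimal sketch, assuming events land as single files in inbox/pending/ (the filename scheme and the `capture` helper are illustrative, not Cortex's actual API):

```python
import os
import time
import uuid
from pathlib import Path

def capture(event_body: str, inbox: Path) -> Path:
    """One atomic file write; no parsing, no indexing, no locking.
    The filename only needs to be unique and roughly sortable."""
    inbox.mkdir(parents=True, exist_ok=True)
    name = f"evt-{time.time_ns()}-{uuid.uuid4().hex[:8]}.md"
    tmp = inbox / (name + ".tmp")
    tmp.write_text(event_body)
    final = inbox / name
    os.replace(tmp, final)  # atomic on POSIX: the event appears all-or-nothing
    return final
```

Everything slower than this belongs to normalize.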
Single-writer guarantee
The normalizer holds an exclusive lockfile (~/.cortex/.normalize.lock) for the duration of a pass. Concurrent invocations either wait or bail out depending on configuration.
This matters because the inbox is a producer/consumer queue with multiple producers (capture scripts) and a single consumer (normalize). Producers can race each other safely — they write distinct files into a shared directory. The consumer cannot race itself; the lockfile ensures it.
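A minimal sketch of the lockfile protocol, assuming O_CREAT|O_EXCL exclusive creation and a polling wait (the function names and timeout are illustrative, not Cortex's actual API):

```python
import os
import time
from pathlib import Path

def acquire_normalize_lock(lock_path: Path, wait: bool, timeout: float = 30.0):
    """Create the lockfile exclusively; O_EXCL makes creation atomic.
    Returns an open fd on success, or None if another normalizer holds
    the lock and we are configured to bail out rather than wait."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            return fd
        except FileExistsError:
            if not wait:
                return None          # bail-out mode: another pass is running
            if time.monotonic() >= deadline:
                raise TimeoutError("normalize lock busy")
            time.sleep(0.1)

def release_normalize_lock(lock_path: Path, fd: int) -> None:
    os.close(fd)
    lock_path.unlink()
```

The producers never touch this lock; only the single consumer does.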
What the normalizer does, per event
For each event in inbox/pending/, in arrival order:
- Parse. Read the file, split on the YAML frontmatter delimiters, parse the frontmatter, hold the body as a string.
- Decide cardinality. Most event types yield one record. Meeting events yield many: one meeting record, plus one record per action item, plus one record per decision, plus context records as needed.
- Resolve scope. Use the project hint from the frontmatter if present, otherwise fall back to the event type’s default scope.
- Slugify. Generate a slug from a summary line in the body. Bound the length, strip punctuation, collapse whitespace. The slug is deterministic per content but not unique — uniqueness comes from the sequence number.
- Generate IDs. Build rec-<scope>-<slug>-<seq>, retrying the seq number if the file already exists.
- Write the record. Atomic temp + rename into the appropriate PARA bucket.
- Index it. Insert rows into the records table, the FTS index, the related-to table if applicable, and the provenance table.
- Move the event. Atomic move from pending/ to processed/. On any error, move to failed/ instead and log the cause.
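The slugify, ID-generation, and atomic-write steps above can be sketched as follows (helper names are hypothetical; the exists-then-retry loop on seq is only safe because the single-writer lock guarantees no concurrent record writers):

```python
import os
import re
from pathlib import Path

def slugify(summary: str, max_len: int = 40) -> str:
    """Deterministic per content: strip punctuation, collapse whitespace,
    bound the length. Uniqueness comes from the seq number, not from here."""
    s = re.sub(r"[^\w\s-]", "", summary.lower())
    s = re.sub(r"\s+", "-", s.strip())
    return s[:max_len].rstrip("-")

def write_record(bucket: Path, scope: str, slug: str, body: str) -> Path:
    """Build rec-<scope>-<slug>-<seq>, retrying seq on collision, then
    write via atomic temp + rename into the given PARA bucket."""
    bucket.mkdir(parents=True, exist_ok=True)
    seq = 1
    while True:
        final = bucket / f"rec-{scope}-{slug}-{seq}.md"
        if not final.exists():
            break
        seq += 1
    tmp = final.with_name(final.name + ".tmp")
    tmp.write_text(body)
    os.replace(tmp, final)  # record appears fully written or not at all
    return final
```
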
If --auto-commit is enabled, the normalizer stages and commits the new
records to the Cortex git repository as a final step.
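The final step might look like this, assuming the normalizer shells out to git (the helper name and commit-message format are illustrative, not Cortex's actual behavior):

```python
import subprocess
from pathlib import Path

def auto_commit(repo: Path, new_records: list[Path], message: str) -> None:
    """Stage only the records written this pass, then commit them.
    Runs only when --auto-commit is enabled; a failure here surfaces as
    an exception but the events stay processed -- the store is already
    consistent without the commit."""
    rels = [str(p.relative_to(repo)) for p in new_records]
    subprocess.run(["git", "-C", str(repo), "add", "--", *rels], check=True)
    subprocess.run(["git", "-C", str(repo), "commit", "-m", message], check=True)
```
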
Deduplication happens in two layers, depending on event type:
Manifest-based dedup
For meeting ingest, the normalizer maintains a manifest of source filenames already processed. A re-import of the same source file is a no-op.
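A minimal sketch of the manifest check, assuming the manifest is a JSON list of seen source names (the on-disk format is an assumption; the behavior is what matters):

```python
import json
from pathlib import Path

def already_imported(manifest_path: Path, source_name: str) -> bool:
    """Return True (caller makes the re-import a no-op) if this source
    file was processed before; otherwise record it and return False."""
    if manifest_path.exists():
        seen = set(json.loads(manifest_path.read_text()))
    else:
        seen = set()
    if source_name in seen:
        return True
    seen.add(source_name)
    manifest_path.write_text(json.dumps(sorted(seen)))
    return False
```
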
Canonical-key dedup
For records derived from meetings, a canonical key is computed from title|date. Two events whose canonical keys collide produce a single record with combined provenance, not two.
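The collision behavior can be sketched as follows (the record shape and helper names are hypothetical; only the title|date key and the merged-provenance outcome come from the design above):

```python
def canonical_key(title: str, date: str) -> str:
    """Canonical key for meeting-derived records: title|date, normalized
    so trivially different spellings of the same item collide."""
    return f"{title.strip().lower()}|{date.strip()}"

def merge_by_key(records: list[dict]) -> list[dict]:
    """Collapse records whose canonical keys collide into one record
    with combined provenance (event_ids unioned), not two records."""
    merged: dict[str, dict] = {}
    for rec in records:
        key = canonical_key(rec["title"], rec["date"])
        if key in merged:
            merged[key]["provenance"]["event_ids"].extend(
                rec["provenance"]["event_ids"])
        else:
            merged[key] = rec
    return list(merged.values())
```
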
These layers are conservative on purpose. The cost of duplicates in the store is real (noisy recalls), and re-imports happen frequently in practice.
Related-to handling
Records can declare relationships to other records via a related_to
frontmatter field. The normalizer:
- Resolves the listed IDs against the existing record set.
- Stores the relationships in a dedicated table, encoded as a CSV-bracketed string (,id1,id2,) so SQLite LIKE matching works in both directions.
The CSV encoding is a small concession to keep the relationship table flat and queryable without a recursive join. It keeps recall-by-relation cheap.
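The encoding and the LIKE query it enables, as a runnable sketch (the table and column names are assumptions):

```python
import sqlite3

def encode_related(ids: list[str]) -> str:
    """Encode as ,id1,id2, so a LIKE '%,<id>,%' match hits exact IDs only."""
    return "," + ",".join(ids) + "," if ids else ""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (record_id TEXT, related_to TEXT)")
conn.execute("INSERT INTO relations VALUES (?, ?)",
             ("rec-a", encode_related(["rec-b", "rec-c"])))

# Recall-by-relation: which records point at rec-b? The surrounding
# commas prevent rec-b from also matching a longer ID like rec-bb.
rows = conn.execute(
    "SELECT record_id FROM relations WHERE related_to LIKE ?",
    ("%,rec-b,%",)).fetchall()
```
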
Provenance
Every record’s frontmatter carries a provenance block:
```yaml
provenance:
  event_ids: [evt-...]
  source_session_id: <id>
```

The normalizer fills these in from the event’s own frontmatter. This means every record can be traced back to the exact event that produced it, and every event can be traced back to the session that captured it.
Provenance is what makes supersede safe: a record can be replaced with confidence because its lineage is explicit.
Failure modes
Normalize is allowed to fail per event without taking down the pass. A failed event:
- Stays in failed/ with the original content untouched.
- Has a sibling .error.log written next to it explaining what went wrong.
- Is excluded from re-processing on subsequent passes (the file is no longer in pending/).
Re-processing failed events is a manual operation: move the file back to
pending/ and re-run normalize. This is intentional friction — if a
class of events keeps failing, the right response is usually to fix the
cause, not to silently retry.