Normalization Pipeline
The normalize step is where Cortex’s interesting work happens. Capture is trivial; recall is just an indexed query. Normalize is where unstructured events become structured records.
Why normalize is its own step
Capture is a single file write. The hook that triggered it has already returned. The session it came from may already be over.
Normalize, by contrast, may need to:
- Decide whether an event yields one record or many.
- Slugify content for stable IDs.
- Look up dedup keys against the existing store.
- Update multiple indexes in the SQLite sidecar.
- Walk related-to references between records.
- Optionally commit the new records to git.
Putting all of that in the capture path would couple session lifecycle to the slowest, most failure-prone part of the system. Splitting them out lets capture stay trivial and lets normalize take whatever time it needs.
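Capture's half of the contract is small enough to show in full. A minimal sketch, assuming events land as single files in inbox/pending/ (the filename scheme and the `capture` helper are illustrative, not Cortex's actual API):

```python
import os
import time
import uuid
from pathlib import Path

def capture(event_body: str, inbox: Path) -> Path:
    """One atomic file write; no parsing, no indexing, no locking.
    The filename only needs to be unique and roughly sortable."""
    inbox.mkdir(parents=True, exist_ok=True)
    name = f"evt-{time.time_ns()}-{uuid.uuid4().hex[:8]}.md"
    tmp = inbox / (name + ".tmp")
    tmp.write_text(event_body)
    final = inbox / name
    os.replace(tmp, final)  # atomic on POSIX: the event appears all-or-nothing
    return final
```

Everything slower than this belongs to normalize.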
Single-writer guarantee
The normalizer holds an exclusive lockfile (~/.cortex/.normalize.lock) for the duration of a pass. Concurrent invocations either wait or bail out depending on configuration.
This matters because the inbox is a producer/consumer queue with multiple producers (capture scripts) and a single consumer (normalize). Producers can race each other safely — they write distinct files into a shared directory. The consumer cannot race itself; the lockfile ensures it.
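A minimal sketch of the lockfile protocol, assuming O_CREAT|O_EXCL exclusive creation and a polling wait (the function names and timeout are illustrative, not Cortex's actual API):

```python
import os
import time
from pathlib import Path

def acquire_normalize_lock(lock_path: Path, wait: bool, timeout: float = 30.0):
    """Create the lockfile exclusively; O_EXCL makes creation atomic.
    Returns an open fd on success, or None if another normalizer holds
    the lock and we are configured to bail out rather than wait."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            return fd
        except FileExistsError:
            if not wait:
                return None          # bail-out mode: another pass is running
            if time.monotonic() >= deadline:
                raise TimeoutError("normalize lock busy")
            time.sleep(0.1)

def release_normalize_lock(lock_path: Path, fd: int) -> None:
    os.close(fd)
    lock_path.unlink()
```

The producers never touch this lock; only the single consumer does.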
What the normalizer does, per event
For each event in inbox/pending/, in arrival order:
- Parse. Read the file, split on the YAML frontmatter delimiters, parse the frontmatter, hold the body as a string.
- Decide cardinality. Most event types yield one record. Meeting events yield many: one meeting record, plus one record per action item, plus one record per decision, plus context records as needed.
- Resolve scope. Use the project hint from the frontmatter if present, otherwise fall back to the event type’s default scope.
- Slugify. Generate a slug from a summary line in the body. Bound the length, strip punctuation, collapse whitespace. The slug is deterministic per content but not unique — uniqueness comes from the sequence number.
- Generate IDs. Build rec-<scope>-<slug>-<seq>, retrying the seq number if the file already exists.
- Write the record. Atomic temp + rename into the appropriate PARA bucket.
- Index it. Insert rows into the records table, the FTS index, the related-to table if applicable, and the provenance table.
- Move the event. Atomic move from pending/ to processed/. On any error, move to failed/ instead and log the cause.
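The slugify, ID-generation, and atomic-write steps above can be sketched as follows (helper names are hypothetical; the exists-then-retry loop on seq is only safe because the single-writer lock guarantees no concurrent record writers):

```python
import os
import re
from pathlib import Path

def slugify(summary: str, max_len: int = 40) -> str:
    """Deterministic per content: strip punctuation, collapse whitespace,
    bound the length. Uniqueness comes from the seq number, not from here."""
    s = re.sub(r"[^\w\s-]", "", summary.lower())
    s = re.sub(r"\s+", "-", s.strip())
    return s[:max_len].rstrip("-")

def write_record(bucket: Path, scope: str, slug: str, body: str) -> Path:
    """Build rec-<scope>-<slug>-<seq>, retrying seq on collision, then
    write via atomic temp + rename into the given PARA bucket."""
    bucket.mkdir(parents=True, exist_ok=True)
    seq = 1
    while True:
        final = bucket / f"rec-{scope}-{slug}-{seq}.md"
        if not final.exists():
            break
        seq += 1
    tmp = final.with_name(final.name + ".tmp")
    tmp.write_text(body)
    os.replace(tmp, final)  # record appears fully written or not at all
    return final
```
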
If --auto-commit is enabled, the normalizer stages and commits the new
records to the Cortex git repository as a final step.
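The final step might look like this, assuming the normalizer shells out to git (the helper name and commit-message format are illustrative, not Cortex's actual behavior):

```python
import subprocess
from pathlib import Path

def auto_commit(repo: Path, new_records: list[Path], message: str) -> None:
    """Stage only the records written this pass, then commit them.
    Runs only when --auto-commit is enabled; a failure here surfaces as
    an exception but the events stay processed -- the store is already
    consistent without the commit."""
    rels = [str(p.relative_to(repo)) for p in new_records]
    subprocess.run(["git", "-C", str(repo), "add", "--", *rels], check=True)
    subprocess.run(["git", "-C", str(repo), "commit", "-m", message], check=True)
```
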
Deduplication happens in two layers, depending on event type:
Manifest-based dedup
For meeting ingest, the normalizer maintains a manifest of source filenames already processed. A re-import of the same source file is a no-op.
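A minimal sketch of the manifest check, assuming the manifest is a JSON list of seen source names (the on-disk format is an assumption; the behavior is what matters):

```python
import json
from pathlib import Path

def already_imported(manifest_path: Path, source_name: str) -> bool:
    """Return True (caller makes the re-import a no-op) if this source
    file was processed before; otherwise record it and return False."""
    if manifest_path.exists():
        seen = set(json.loads(manifest_path.read_text()))
    else:
        seen = set()
    if source_name in seen:
        return True
    seen.add(source_name)
    manifest_path.write_text(json.dumps(sorted(seen)))
    return False
```
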
Canonical-key dedup
For records derived from meetings, a canonical key is computed from title|date. Two events whose canonical keys collide produce a single record with combined provenance, not two.
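The collision behavior can be sketched as follows (the record shape and helper names are hypothetical; only the title|date key and the merged-provenance outcome come from the design above):

```python
def canonical_key(title: str, date: str) -> str:
    """Canonical key for meeting-derived records: title|date, normalized
    so trivially different spellings of the same item collide."""
    return f"{title.strip().lower()}|{date.strip()}"

def merge_by_key(records: list[dict]) -> list[dict]:
    """Collapse records whose canonical keys collide into one record
    with combined provenance (event_ids unioned), not two records."""
    merged: dict[str, dict] = {}
    for rec in records:
        key = canonical_key(rec["title"], rec["date"])
        if key in merged:
            merged[key]["provenance"]["event_ids"].extend(
                rec["provenance"]["event_ids"])
        else:
            merged[key] = rec
    return list(merged.values())
```
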
These layers are conservative on purpose. The cost of duplicates in the store is real (noisy recalls), and re-imports happen frequently in practice.
Related-to handling
Records can declare relationships to other records via a related_to
frontmatter field. The normalizer:
- Resolves the listed IDs against the existing record set.
- Stores the relationships in a dedicated table, encoded as a CSV-bracketed string (,id1,id2,) so SQLite LIKE matching works in both directions.
The CSV encoding is a small concession to keep the relationship table flat and queryable without a recursive join. It keeps recall-by-relation cheap.
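The encoding and the LIKE query it enables, as a runnable sketch (the table and column names are assumptions):

```python
import sqlite3

def encode_related(ids: list[str]) -> str:
    """Encode as ,id1,id2, so a LIKE '%,<id>,%' match hits exact IDs only."""
    return "," + ",".join(ids) + "," if ids else ""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (record_id TEXT, related_to TEXT)")
conn.execute("INSERT INTO relations VALUES (?, ?)",
             ("rec-a", encode_related(["rec-b", "rec-c"])))

# Recall-by-relation: which records point at rec-b? The surrounding
# commas prevent rec-b from also matching a longer ID like rec-bb.
rows = conn.execute(
    "SELECT record_id FROM relations WHERE related_to LIKE ?",
    ("%,rec-b,%",)).fetchall()
```
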
Provenance
Every record’s frontmatter carries a provenance block:
```yaml
provenance:
  event_ids: [evt-...]
  source_session_id: <id>
```

The normalizer fills these in from the event’s own frontmatter. This means every record can be traced back to the exact event that produced it, and every event can be traced back to the session that captured it.
Provenance is what makes supersede safe: a record can be replaced with confidence because its lineage is explicit.
Failure modes
Normalize is allowed to fail per event without taking down the pass. A failed event:
- Stays in failed/ with the original content untouched.
- Has a sibling .error.log written next to it explaining what went wrong.
- Is excluded from re-processing on subsequent passes (the file is no longer in pending/).
Re-processing failed events is a manual operation: move the file back to
pending/ and re-run normalize. This is intentional friction — if a
class of events keeps failing, the right response is usually to fix the
cause, not to silently retry.