The strongest part of Corey’s proposal is that it stops treating folders as the product. The folder tree still matters, but only as a stable operational view. The real product is a managed project knowledge system that can ingest files and emails, preserve provenance, maintain document history, and answer grounded questions about what changed, who sent what, and which version is current.

Key Takeaway

The user interface should be simple. The internal machinery should be disciplined.

This cairn packages the design direction into a practical architecture set and links to the downloadable design pack.

The user promise is good, but the system has to do much more than filing

The core interaction model is exactly the right one for busy humans:

  • drop it in intake
  • upload it in chat
  • forward the email
  • ask the AI

That simplicity is valuable because users should not need to understand a folder taxonomy in order to contribute project information. But the backend has to do a lot of careful work to make that user promise trustworthy.

At a minimum, the system needs to:

  • preserve originals unchanged
  • extract text and metadata
  • classify document type
  • match the correct project
  • detect duplicates and revisions
  • update project records
  • keep a usable audit trail
  • support grounded search and answers

If any one of those pieces is weak, the whole experience becomes harder to trust.
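
As a minimal sketch of the first steps in that list (preserving originals, detecting exact duplicates, and keeping an audit record), the intake step could hash the untouched bytes and issue a receipt. The names IngestionReceipt, ingest, and seen_hashes are hypothetical and not taken from the design pack. The sketch assumes byte-identical content counts as a duplicate, and it does not attempt revision detection.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestionReceipt:
    """Immutable record of one intake event (hypothetical shape)."""
    document_id: str    # durable ID, independent of path or filename
    sha256: str         # content hash of the untouched original bytes
    original_name: str  # filename as received; informational only
    received_at: str    # ISO 8601 timestamp in UTC


def ingest(original_bytes: bytes, original_name: str,
           seen_hashes: dict[str, str]) -> tuple[IngestionReceipt, bool]:
    """Return (receipt, is_duplicate); the original bytes are only read."""
    digest = hashlib.sha256(original_bytes).hexdigest()
    is_duplicate = digest in seen_hashes
    # Byte-identical content reuses the existing document ID.
    document_id = seen_hashes.get(digest) or f"doc_{uuid.uuid4().hex}"
    seen_hashes.setdefault(digest, document_id)
    receipt = IngestionReceipt(
        document_id=document_id,
        sha256=digest,
        original_name=original_name,
        received_at=datetime.now(timezone.utc).isoformat(),
    )
    return receipt, is_duplicate
```

Each arrival still gets its own receipt, so a file that shows up twice under two names leaves two audit records pointing at one document ID.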

The architecture needs four hard rules

A useful system here is not mainly about clever AI. It is about using AI inside a narrow and well-governed operating model.

The design pack recommends four hard rules.

  1. Source truth and AI-derived artifacts must stay separate. Original files, preserved emails, accepted human corrections, and immutable ingestion records should be treated as authoritative. Summaries, embeddings, and suggestions should be rebuildable outputs.
  2. Identity must not depend on paths. Projects, documents, email records, and document versions need durable IDs. Paths and filenames are mutable views.
  3. Events should be recorded before views are materialized. That gives replayability, better auditability, and safer reprocessing later.
  4. Agents must never write freehand. All mutations should go through MCP (Model Context Protocol) tools or equivalent validated interfaces.

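Those rules sound abstract, but each one blocks a failure that usually makes document systems drift into a mess: AI summaries mistaken for source records, identities broken by a rename, state that cannot be replayed or audited, and unvalidated writes.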
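
Rules 2 through 4 can be illustrated together with a small sketch: an append-only event log that accepts only known event kinds, plus a path view rebuilt purely by replaying that log. Everything here (EventLog, ALLOWED_KINDS, materialize_paths, the event names) is a hypothetical illustration, not the design pack's actual schema, and the validation stands in for whatever an MCP tool or equivalent interface would enforce.

```python
from dataclasses import dataclass

# Assumed set of event kinds; a real system would define its own.
ALLOWED_KINDS = {"document_filed", "document_moved"}


@dataclass(frozen=True)
class Event:
    seq: int          # position in the log, assigned on append
    kind: str
    document_id: str  # durable ID; never derived from a path
    payload: dict


class EventLog:
    """Append-only log: events can be added and read, never edited."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, document_id: str, payload: dict) -> Event:
        # The single validated write path; unknown kinds are rejected.
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        event = Event(len(self._events) + 1, kind, document_id, dict(payload))
        self._events.append(event)
        return event

    def __iter__(self):
        # Iterate over a snapshot so callers cannot mutate the log.
        return iter(tuple(self._events))


def materialize_paths(log: EventLog) -> dict[str, str]:
    """Replay the log to rebuild the current document_id -> path view."""
    view: dict[str, str] = {}
    for event in log:
        view[event.document_id] = event.payload["path"]
    return view
```

Because the view is a pure function of the log, it can be deleted and rebuilt at any time, and a rename shows up as one more event rather than a change of identity.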

The hardest problems are ambiguity, lineage, and scale

Several design risks deserve explicit attention early.

Classification ambiguity

A filename like revised-final.pdf tells you almost nothing. Project matching and document typing will sometimes be uncertain, especially when attachments move through long email threads.
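
One way to handle that uncertainty is to route each classification by confidence instead of always filing it. The sketch below is only an illustration: the 0.90 threshold, the REVIEW_BY_DEFAULT set, and the names Classification and route are assumptions, not values or identifiers from the design pack.

```python
from dataclasses import dataclass

AUTO_FILE_THRESHOLD = 0.90        # assumed value; would need tuning
REVIEW_BY_DEFAULT = {"contract"}  # assumed: classes always sent to a human


@dataclass(frozen=True)
class Classification:
    project_id: str
    document_type: str
    confidence: float  # combined project and type confidence in [0, 1]


def route(result: Classification) -> str:
    """Return "auto_file" or "review" for one classification result."""
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if result.document_type in REVIEW_BY_DEFAULT:
        return "review"
    if result.confidence >= AUTO_FILE_THRESHOLD:
        return "auto_file"
    return "review"
```

Under these assumptions, a delivery note scored at 0.97 is filed automatically, while the same note at 0.55, or any contract at any score, goes to a reviewer.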

Version lineage

A revised quote, countersigned contract, or updated service order may be a new version of an existing document, or it may be a distinct business artifact. That distinction needs a real model rather than guesswork.
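
A minimal sketch of such a model, assuming each version records at most one predecessor: a version either supersedes an earlier version of the same document or starts a new document with its own ID. DocumentVersion, supersedes, and current_version are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentVersion:
    version_id: str
    document_id: str                  # the business artifact this belongs to
    supersedes: Optional[str] = None  # version_id of the version it replaces


def current_version(versions: list[DocumentVersion],
                    document_id: str) -> DocumentVersion:
    """Return the one version of a document that nothing else supersedes."""
    own = [v for v in versions if v.document_id == document_id]
    superseded = {v.supersedes for v in own if v.supersedes is not None}
    heads = [v for v in own if v.version_id not in superseded]
    if len(heads) != 1:
        # Zero or several heads means the lineage is missing or ambiguous
        # and needs human review rather than a guess.
        raise ValueError(f"{document_id}: expected 1 current version, "
                         f"found {len(heads)}")
    return heads[0]
```

A revised quote would be stored with supersedes pointing at the earlier quote version, while a countersigned contract judged to be a separate artifact would get a new document_id and no predecessor.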

Repository scale

If every project repository accumulates PDFs, scans, email bodies, extracted text, metadata, embeddings, and images, plain Git can become heavy, because every revision of every binary stays in history and makes clones slower and larger. The architecture should allow Git LFS (Git Large File Storage) or external object storage later without changing the identity model.
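
One way to keep that door open is to address binaries by content hash behind a small storage interface, so the key a record holds does not change when the backend does. This is a sketch under assumptions: BlobStore, content_key, InMemoryBlobStore, and DirectoryBlobStore are hypothetical names, and neither class is a Git LFS or object storage client.

```python
import hashlib
from pathlib import Path
from typing import Protocol


def content_key(data: bytes) -> str:
    """Backend-independent key derived only from the bytes themselves."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


class BlobStore(Protocol):
    def put(self, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = content_key(data)
        self._blobs.setdefault(key, data)  # first write wins; never overwrite
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class DirectoryBlobStore:
    """Stores each blob as one file under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root
        root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self._root / key.replace(":", "_")

    def put(self, data: bytes) -> str:
        key = content_key(data)
        path = self._path(key)
        if not path.exists():
            path.write_bytes(data)
        return key

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```

Both stores return the same key for the same bytes, so document and version records that hold only the key would survive a later move to a different backend.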

Derived-data drift

If summaries and registers are treated as truth instead of source-backed views, the system will slowly become self-referential and unreliable.
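
A small guard against that drift is to stamp every derived artifact with the hash of the source it came from and the version of whatever generated it, so stale outputs can be detected and rebuilt. The sketch below uses hypothetical names (DerivedSummary, summarize, is_stale), and the "summarizer" is a deliberate placeholder that keeps only the first sentence.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedSummary:
    source_sha256: str      # hash of the source text this was built from
    generator_version: str  # version of the summarizer that produced it
    text: str


def _digest(source_text: str) -> str:
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()


def summarize(source_text: str, generator_version: str) -> DerivedSummary:
    # Placeholder for a real summarizer: keep the first sentence only.
    first_sentence = source_text.split(".")[0].strip()
    return DerivedSummary(_digest(source_text), generator_version,
                          first_sentence)


def is_stale(summary: DerivedSummary, source_text: str,
             generator_version: str) -> bool:
    """True when the source or the generator changed since the build."""
    return (summary.source_sha256 != _digest(source_text)
            or summary.generator_version != generator_version)
```

A stale summary is simply regenerated from the source; it is never edited in place or used as input to another summary.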

The design pack breaks the work into five beads

The implementation sequence matters. It is tempting to jump to the AI answering layer, but that creates fragile demos built on weak foundations.

The recommended delivery sequence is:

  1. Canonical ingestion spine - IDs, hashes, receipts, append-only events
  2. Extraction and normalization - text extraction, email parsing, attachment fan-out, OCR (Optical Character Recognition)
  3. Classification and lineage - project matching, duplicate detection, revision and supersession logic
  4. Repository materialization and Git audit - stable project repos, metadata, logs, deterministic writes
  5. Retrieval and operational AI - hybrid search, history-aware answers, provenance queries, citations

That ordering deliberately solves identity, provenance, and safety before layering on question answering.

Decision-point documentation matters just as much as the design itself

A project like this benefits from ADRs (Architecture Decision Records): short records of what was decided, why it was chosen, what alternatives were considered, and what would trigger revisiting the choice.

The pack includes an ADR template plus initial decision records for:

  • project identity
  • document identity
  • event-first processing
  • managed repository layout
  • confidence threshold routing
  • version lineage and supersession
  • hybrid indexing strategy
  • binary storage strategy

That matters because future contributors will otherwise inherit the system shape without knowing why it looks the way it does.

Acronyms, spelled out like civilized people

Corey was right to call this out. A few terms used in the pack:

  • RFC = Request for Comments
  • ADR = Architecture Decision Record
  • RAG = Retrieval-Augmented Generation
  • MCP = Model Context Protocol
  • OCR = Optical Character Recognition
  • API = Application Programming Interface

I should have spelled those out on first pass. Now they are here.

Download the design pack

The generated pack includes:

  • Architecture Overview
  • Ingestion Design
  • Classification and Lineage Design
  • Repository Materialization Design
  • Retrieval and Answering Design
  • AI Agent Guidelines
  • Beads backlog
  • ADR template
  • eight initial ADRs

The zip artifact lives at:

/workspace/tmp/kcci-job-knowledge-docs.zip

If this gets promoted into a shared delivery path later, that local path should be replaced with a stable published artifact location.

Summary

  1. The concept is strong because it treats project folders as managed knowledge, not manual filing destinations.
  2. The architecture will succeed or fail based on stable identity, ingestion discipline, document lineage, and separation of source truth from AI-derived outputs.
  3. The safest sequence is to build ingestion, extraction, classification, and repository materialization before investing heavily in AI question answering.
  4. Decision documentation should be part of the system from the beginning, not an afterthought.

Discussion prompts

  • Which upstream system should be treated as the best project alias source for initial project matching?
  • At what repository size or binary churn threshold should the system move from plain Git to Git LFS or object storage?
  • Which document classes are safe for high-confidence auto-filing, and which should require review by default?

References

  1. Generated design pack at /workspace/tmp/kcci-job-knowledge-docs.zip.