tricu/docs/content-store-and-module-format.md
James Eversole fdebb6c13d Tricu 2.0.0
2026-05-25 12:44:24 -05:00


# Content Store and Module Format Design
Status: concrete design draft.
This document narrows the higher-level module-system direction into concrete
format and storage decisions. It intentionally avoids source/provenance details:
modules export usable portable artifacts, not edit history.
Related design overview: `docs/module-system-design.md`.
## 1. Scope
This document specifies the first target shape for:
- a neutral filesystem-backed content-addressed store;
- Arboricx Merkle node persistence;
- indexed Arboricx bundle import/export as transport;
- module manifests as immutable export maps;
- workspace aliases as mutable human-facing references;
- View Contract artifact attachment to module exports.
It does not specify:
- package manager semantics;
- dependency solving;
- source-level rebuild/provenance metadata;
- final import syntax;
- garbage collection;
- registry/sync protocol.
## 2. Non-Negotiable Boundaries
The content store is not `tricu`-specific and is not Haskell-specific.
The store may contain objects produced by `tricu`, Haskell, Tree Calculus tools,
Arboricx tooling, or future frontends. The store core only knows object bytes,
object kinds, hashes, aliases, and optionally structural references for known
portable formats.
View Contracts may be first-class artifact references because they are portable
Tree Calculus data checked by pure Tree Calculus code. They are not
Haskell-private semantics.
Source and build provenance are intentionally excluded from the first module
manifest format. A module manifest answers:
```text
What portable artifacts does this module export, and what portable contracts are
paired with them?
```
It does not answer:
```text
Which source file, parser, frontend, or build command produced these artifacts?
```
## 3. Hashing Convention
Objects are content-addressed by SHA-256 over domain-separated canonical bytes.
General rule:
```text
hash = SHA256(domainUtf8 || 0x00 || canonicalPayloadBytes)
```
This matches the existing Merkle node convention in `Research.nodeHash`:
```text
SHA256("arboricx.merkle.node.v1" || 0x00 || nodePayload)
```
The domain string is part of the object format. It prevents identical payload
bytes in different formats from accidentally sharing identity.
Hashes are represented externally as 64 lowercase hexadecimal characters.
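A minimal Python sketch of the general rule, using only the standard library. The function name `object_hash` is hypothetical, and using the `arboricx.tree-term.v1` kind as a hash domain is an assumption made here only to illustrate domain separation.

```python
import hashlib


def object_hash(domain: str, payload: bytes) -> str:
    """Return the lowercase hex SHA-256 of domainUtf8 || 0x00 || payload."""
    preimage = domain.encode("utf-8") + b"\x00" + payload
    return hashlib.sha256(preimage).hexdigest()


# The same payload bytes hashed under two domains get two distinct identities.
# (Treating the tree-term kind as a domain string is an assumption.)
leaf_as_node = object_hash("arboricx.merkle.node.v1", b"\x00")
leaf_as_term = object_hash("arboricx.tree-term.v1", b"\x00")
```

Because the domain is part of the preimage, two formats that happen to share payload bytes never collide on identity.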
## 4. Filesystem Store Layout
The canonical filesystem store layout is:
```text
store/
  objects/
    abc/
      abc123...   -- object bytes, sharded by first 3 hex chars
  aliases/
    names/
    modules/
    packages/
  manifests/
  tmp/
```
The three-character shard follows the existing `lib/arboricx/server.tri`
convention.
### 4.1 Object paths
For object hash:
```text
abc123...
```
object bytes live at:
```text
store/objects/abc/abc123...
```
The object filename is the full hash. The shard directory is the first three hex
characters.
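The path rule can be sketched in a few lines of Python. The helper name `object_path` and the strict input validation are assumptions for illustration; the layout itself is the one described above.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def object_path(store_root: str, hash_hex: str) -> PurePosixPath:
    """Map a 64-character lowercase hex hash to its sharded object path."""
    if len(hash_hex) != 64 or not set(hash_hex) <= HEX_DIGITS:
        raise ValueError("expected 64 lowercase hexadecimal characters")
    # Shard directory is the first three hex characters; filename is the full hash.
    return PurePosixPath(store_root) / "objects" / hash_hex[:3] / hash_hex


# Placeholder hash used only to show the resulting shape.
example = object_path("store", "abc123" + "0" * 58)
```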
### 4.2 Atomic writes
Writers should use:
```text
store/tmp/<hash>.<nonce>.tmp
```
then atomically rename into:
```text
store/objects/<shard>/<hash>
```
Writing an existing object is idempotent if the existing bytes match the hash.
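One way a writer might follow this protocol is sketched below in Python. The function name `put_object`, the random nonce, the `fsync` call, and the choice to store only the payload bytes (without the domain prefix) are all assumptions, not decisions taken by this document.

```python
import hashlib
import os
import secrets


def put_object(store_root: str, domain: str, payload: bytes) -> str:
    """Write payload into the store via tmp/ plus rename; return its hex hash."""
    digest = hashlib.sha256(domain.encode("utf-8") + b"\x00" + payload).hexdigest()
    final_dir = os.path.join(store_root, "objects", digest[:3])
    final_path = os.path.join(final_dir, digest)
    if os.path.exists(final_path):
        # Idempotent only when the bytes already on disk are the expected ones.
        with open(final_path, "rb") as existing:
            if existing.read() != payload:
                raise OSError(f"existing object does not match its hash: {final_path}")
        return digest
    tmp_dir = os.path.join(store_root, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)
    tmp_path = os.path.join(tmp_dir, f"{digest}.{secrets.token_hex(8)}.tmp")
    with open(tmp_path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    # The rename is atomic only if tmp/ and objects/ are on the same filesystem.
    os.replace(tmp_path, final_path)
    return digest
```

Readers never observe a partially written object under `objects/`, because the file only appears there once the rename completes.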
### 4.3 Store core metadata
The minimal filesystem store does not require sidecar metadata for every object.
Object kind can be known by context or by manifest references.
A later index may cache:
```text
hash -> kind
hash -> size
hash -> references
hash -> createdAt
```
but this index is not semantic identity.
## 5. Arboricx Merkle Node Object Format
The persistent Tree Calculus representation is a Merkle DAG of node objects.
Domain:
```text
arboricx.merkle.node.v1
```
Canonical payloads:
```text
Leaf            = 0x00
Stem child      = 0x01 || childHashRaw32
Fork left right = 0x02 || leftHashRaw32 || rightHashRaw32
```
Where `childHashRaw32`, `leftHashRaw32`, and `rightHashRaw32` are the raw 32-byte
SHA-256 digests corresponding to child node hashes.
This is already implemented conceptually by:
```text
Research.Node
Research.serializeNode
Research.deserializeNode
Research.nodeHash
```
The filesystem CAS should use this payload/hash convention directly.
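For readers outside the Haskell codebase, the payload and hash rules above can be sketched in Python. This is an independent illustration of the byte layout, not a port of `Research.serializeNode`; all function names are hypothetical.

```python
import hashlib

DOMAIN = b"arboricx.merkle.node.v1"


def node_hash(payload: bytes) -> bytes:
    """Raw 32-byte digest of a node payload under the Merkle node domain."""
    return hashlib.sha256(DOMAIN + b"\x00" + payload).digest()


def leaf_payload() -> bytes:
    return b"\x00"


def stem_payload(child: bytes) -> bytes:
    if len(child) != 32:
        raise ValueError("child hash must be a raw 32-byte digest")
    return b"\x01" + child


def fork_payload(left: bytes, right: bytes) -> bytes:
    if len(left) != 32 or len(right) != 32:
        raise ValueError("child hashes must be raw 32-byte digests")
    return b"\x02" + left + right


# Hash a small tree bottom-up: Fork Leaf (Stem Leaf).
leaf = node_hash(leaf_payload())
stem = node_hash(stem_payload(leaf))
fork = node_hash(fork_payload(leaf, stem))
```

Payload sizes are fixed per node shape: 1 byte for a leaf, 33 for a stem, and 65 for a fork.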
## 6. Tree Roots
A Tree Calculus value stored in the CAS is identified by the hash of its root
Merkle node.
```text
treeRootHash = hash(rootNodePayload)
```
The complete tree is reconstructed by recursively loading node objects reachable
from the root.
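The following Python sketch shows storing and reloading a tree through its root hash. It assumes an in-memory `dict` in place of the filesystem store and represents trees as tagged tuples; both choices, and the names `put_tree` and `load_tree`, are hypothetical.

```python
import hashlib

DOMAIN = b"arboricx.merkle.node.v1"
LEAF = ("leaf",)


def put_tree(nodes: dict, tree: tuple) -> bytes:
    """Store every node of tree in nodes and return the raw root hash."""
    if tree[0] == "leaf":
        payload = b"\x00"
    elif tree[0] == "stem":
        payload = b"\x01" + put_tree(nodes, tree[1])
    else:
        payload = b"\x02" + put_tree(nodes, tree[1]) + put_tree(nodes, tree[2])
    digest = hashlib.sha256(DOMAIN + b"\x00" + payload).digest()
    nodes[digest] = payload  # identical subtrees collapse onto one entry
    return digest


def load_tree(nodes: dict, root: bytes) -> tuple:
    """Rebuild a plain tree by following child hashes from root."""
    payload = nodes[root]
    tag = payload[0]
    if tag == 0x00:
        return LEAF
    if tag == 0x01:
        return ("stem", load_tree(nodes, payload[1:33]))
    if tag == 0x02:
        return ("fork", load_tree(nodes, payload[1:33]), load_tree(nodes, payload[33:65]))
    raise ValueError(f"unknown node tag: {tag:#04x}")


nodes = {}
tree = ("fork", ("stem", LEAF), ("stem", LEAF))
root = put_tree(nodes, tree)
```

The example tree has five nodes but only three distinct ones, so `nodes` ends up with three entries. Plain recursion is used for brevity; a real loader would need an explicit stack for deep trees.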
Hydration is an interpretation step, not part of object identity. A client may
hydrate a root as a plain tree, a graph with explicit sharing, or another runtime
representation as long as the observable Tree Calculus value is the same. The
filesystem CAS provides structural dedupe and portable identity; it does not by
itself guarantee that a hydrated runtime value is the cheapest representation for
all workloads.
Merkle nodes are useful for explicit DAG-oriented tooling, audit, and bundle
packing. They are not the default representation for module executable exports:
storing every subtree as a separate filesystem object is pathologically slow for
large normal forms.
For module-backed evaluation and imports, a complete normalized named term is
stored as one canonical object:
```text
kind: arboricx.tree-term.v1
hash: <whole-term object hash>
abi: arboricx.abi.tree.v1
```
The `arboricx.tree-term.v1` payload is a prefix encoding:
```text
Leaf = 0x00
Stem t = 0x01 Tree
Fork l r = 0x02 Tree Tree
```
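A small Python sketch of this prefix encoding, with a decoder that rejects truncated or trailing input. Trees are represented as tagged tuples, and the function names are hypothetical.

```python
LEAF = ("leaf",)


def encode_term(tree: tuple) -> bytes:
    """Prefix-encode a tree: one tag byte followed by the encoded children."""
    if tree[0] == "leaf":
        return b"\x00"
    if tree[0] == "stem":
        return b"\x01" + encode_term(tree[1])
    return b"\x02" + encode_term(tree[1]) + encode_term(tree[2])


def _decode_at(data: bytes, offset: int):
    if offset >= len(data):
        raise ValueError("truncated term")
    tag = data[offset]
    if tag == 0x00:
        return LEAF, offset + 1
    if tag == 0x01:
        child, offset = _decode_at(data, offset + 1)
        return ("stem", child), offset
    if tag == 0x02:
        left, offset = _decode_at(data, offset + 1)
        right, offset = _decode_at(data, offset)
        return ("fork", left, right), offset
    raise ValueError(f"unknown tag: {tag:#04x}")


def decode_term(data: bytes) -> tuple:
    """Decode exactly one tree; trailing bytes are an error."""
    tree, offset = _decode_at(data, 0)
    if offset != len(data):
        raise ValueError("trailing bytes after term")
    return tree


encoded = encode_term(("fork", LEAF, ("stem", LEAF)))
```

Each node costs exactly one byte, which is why a whole normalized term fits in a single object instead of one object per subtree.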
## 7. Arboricx Indexed Bundles
Indexed `.arboricx` bundles remain the transport/execution format.
They are:
- compact;
- self-contained;
- deterministic;
- suitable for restricted runtimes;
- suitable for HTTP serving and deployment.
They are not the canonical long-lived deduplicated store representation.
### 7.1 Pack
Packing converts one or more CAS tree roots into an indexed bundle:
```text
CAS tree roots -> indexed Arboricx bundle
```
The packer traverses reachable Merkle nodes, emits a compact indexed node table,
and writes a bundle manifest with export names and root indices.
### 7.2 Unpack
Unpacking converts a bundle into CAS nodes:
```text
indexed Arboricx bundle -> CAS tree roots
```
The unpacker verifies the bundle structure, reconstructs each exported tree, and
stores the corresponding Merkle nodes. It returns the tree root hash for each
bundle export.
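The pack and unpack directions can be illustrated with a toy indexed node table in Python. This is not the `.arboricx` wire format: the dictionary layout, the tuple-based trees, and the names `pack` and `unpack` are assumptions, and this sketch returns plain trees where a real unpacker would store Merkle nodes and return root hashes.

```python
LEAF = ("leaf",)


def pack(exports: dict) -> dict:
    """Flatten named trees into a shared node table plus export root indices."""
    table, index_of = [], {}

    def visit(tree: tuple) -> int:
        # Children are visited first, so every entry refers only to earlier indices.
        entry = (tree[0],) + tuple(visit(child) for child in tree[1:])
        if entry not in index_of:
            index_of[entry] = len(table)
            table.append(entry)
        return index_of[entry]

    # Sorting export names keeps the output deterministic.
    roots = {name: visit(tree) for name, tree in sorted(exports.items())}
    return {"nodes": table, "exports": roots}


def unpack(bundle: dict) -> dict:
    """Verify index order and rebuild each exported tree."""
    built = []
    for tag, *children in bundle["nodes"]:
        if any(i < 0 or i >= len(built) for i in children):
            raise ValueError("node refers to a missing or later entry")
        built.append((tag,) + tuple(built[i] for i in children))
    return {name: built[i] for name, i in bundle["exports"].items()}


exports = {"a": ("fork", ("stem", LEAF), ("stem", LEAF)), "b": ("stem", LEAF)}
bundle = pack(exports)
```

Shared subtrees appear once in the table, so both exports above are covered by three entries.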
## 8. Module Manifest v1
A module is an immutable manifest object. The module identity is the hash of its
canonical manifest bytes.
A module name is not identity. It is a workspace alias to a module manifest hash.
### 8.1 Domain
Proposed domain:
```text
arboricx.module-manifest.v1
```
### 8.2 Purpose
A module manifest pairs human-facing export names with portable content objects
and optional portable contracts.
It exists to support:
- reproducible import resolution;
- executable export discovery;
- View Contract lookup for imported symbols;
- module-to-module reference tracking;
- transport/store interop.
It does not describe source provenance.
### 8.3 Conceptual shape
```text
moduleManifestV1:
imports:
- alias: <text>
kind: <object kind>
hash: <object hash>
exports:
- name: <text>
object:
kind: <object kind>
hash: <object hash>
abi: <abi identifier>
view: optional
kind: <view artifact kind>
hash: <view artifact hash>
catalog: optional
kind: <view catalog kind>
hash: <view catalog hash>
metadata: optional human-facing fields
```
### 8.4 Imports/references
The `imports` section is a manifest reference graph, not a store-level language
dependency graph.
Each entry records direct content-addressed references used by the module:
```text
alias: Prelude
kind: arboricx.module-manifest.v1
hash: <module hash>
```
This supports reproducibility, partial fetch, and audit. The content store core
stores this object but does not need to understand `Prelude` or import
semantics.
### 8.5 Exports
Each export is a record, not a single hash. This is required so executable
objects and advertised contracts cannot drift apart.
Minimal executable export:
```text
name: "id"
object:
kind: arboricx.tree-term.v1
hash: <whole-term hash>
abi: arboricx.abi.tree.v1
```
Export with View Contract:
```text
name: "map"
object:
kind: arboricx.tree-term.v1
hash: <whole-term hash>
abi: arboricx.abi.tree.v1
view:
kind: arboricx.view-contract.type.v1
hash: <view type hash>
```
The manifest preserves the pairing between exported executable and exported
contract. For workspace modules built from local source, annotated exports are
checked before the manifest is published; only exports that pass producer-side
View Contract checking receive direct `arboricx.view-contract.type.v1` refs.
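As an in-memory illustration of why an export is a record, the Python sketch below models the two examples above. The class and field names are hypothetical, the hashes are placeholders, and placing `abi` beside the object reference is an assumption; no canonical manifest encoding is implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ref:
    kind: str
    hash_hex: str


@dataclass(frozen=True)
class Export:
    name: str
    object_ref: Ref
    abi: str
    view: Optional[Ref] = None  # absent for exports without a View Contract


def resolve(exports: list, name: str) -> Export:
    """Return the export record for name, carrying object and contract together."""
    for export in exports:
        if export.name == name:
            return export
    raise KeyError(name)


# Placeholder hashes; real values are 64-character SHA-256 hex digests of real objects.
exports = [
    Export("id", Ref("arboricx.tree-term.v1", "a" * 64), "arboricx.abi.tree.v1"),
    Export(
        "map",
        Ref("arboricx.tree-term.v1", "b" * 64),
        "arboricx.abi.tree.v1",
        view=Ref("arboricx.view-contract.type.v1", "c" * 64),
    ),
]
```

Because the record is immutable and resolved as a unit, a consumer cannot pick up the executable for `map` without also seeing the contract it was published with.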
### 8.6 Metadata
Metadata is optional and human-facing. Initial fields may include:
```text
package
version
description
license
createdBy
```
Metadata is not source provenance and is not required for execution or checking.
## 9. View Contract Artifacts
View Contract artifacts are portable Arboricx-layer data. They may be stored
as content objects and referenced by module exports. `tricu` may emit these
objects, but the object kind is not tricu-specific.
Current artifact kind:
```text
arboricx.view-contract.type.v1
```
`arboricx.view-contract.type.v1` is the direct export-view artifact. Its
payload is a canonical prefix binary encoding of the syntactic ViewType:
```text
Name = 0x00 u32be(byte-length) utf8-name
Ref = 0x01 u32be(byte-length) utf8-ref
List = 0x02 ViewType
Maybe = 0x03 ViewType
Pair = 0x04 ViewType ViewType
Result = 0x05 ViewType ViewType
Fn = 0x06 u32be(argument-count) ViewType* ViewType
```
`utf8-ref` is tagged text:
```text
i:<decimal-integer> numeric/legacy ref
s:<text> symbolic user ref
```
Symbolic refs are the preferred user-authored form; numeric refs remain useful
for generated code, fixtures, and old low-level examples.
The object hash domain is the object kind:
```text
SHA256("arboricx.view-contract.type.v1" || 0x00 || payload)
```
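The encoding table and hash rule for view types can be sketched in Python as follows. The tuple representation of a ViewType and the function names are hypothetical; the tags and field order follow the table above.

```python
import hashlib
import struct

KIND = "arboricx.view-contract.type.v1"


def _text(tag: int, text: str) -> bytes:
    raw = text.encode("utf-8")
    return bytes([tag]) + struct.pack(">I", len(raw)) + raw  # u32be byte length


def encode_view(view: tuple) -> bytes:
    """Encode a ViewType given as a tagged tuple, e.g. ("list", ("name", "Bool"))."""
    kind = view[0]
    if kind == "name":
        return _text(0x00, view[1])
    if kind == "ref":
        return _text(0x01, view[1])  # tagged text such as "s:a" or "i:0"
    if kind == "list":
        return b"\x02" + encode_view(view[1])
    if kind == "maybe":
        return b"\x03" + encode_view(view[1])
    if kind == "pair":
        return b"\x04" + encode_view(view[1]) + encode_view(view[2])
    if kind == "result":
        return b"\x05" + encode_view(view[1]) + encode_view(view[2])
    if kind == "fn":
        args, result = view[1], view[2]
        body = b"".join(encode_view(arg) for arg in args)
        return b"\x06" + struct.pack(">I", len(args)) + body + encode_view(result)
    raise ValueError(f"unknown view type: {kind}")


def view_hash(view: tuple) -> str:
    payload = encode_view(view)
    return hashlib.sha256(KIND.encode("utf-8") + b"\x00" + payload).hexdigest()


# A one-argument function from a symbolic ref to a list of that ref.
example = ("fn", [("ref", "s:a")], ("list", ("ref", "s:a")))
```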
### 9.1 Export-level pairing
The module manifest is the canonical pairing of an executable export and its
advertised contract:
```text
export name -> tree-term hash + optional view artifact hash
```
This avoids drift such as:
```text
map -> tree A
map.view -> contract B
```
where aliases might be retargeted independently.
### 9.2 Import checking
When a source file imports a module, a frontend can resolve an imported export,
decode its direct `arboricx.view-contract.type.v1` ref, and emit typed program
evidence locally:
```text
imported List.map has view Fn [...]
```
For locally built workspace modules, this evidence is backed by producer-side
checking that runs before the module manifest alias is published. That check also
covers view facts imported from dependencies used by the producer source.
External or prebuilt manifests are treated as trusted boundary declarations for
now; they are not accompanied by proof objects.
The checker still consumes only local numeric symbols and typed-program evidence.
Global content hashes do not become checker symbols.
Correct split:
```text
local checker symbol: 3
presentation label: "List.map"
resolved object: sha256:...
exported view: Fn [...]
```
### 9.3 Execution hydration versus contract evidence
Execution imports should use a narrow, demand-driven path:
```text
module import -> selected executable exports -> hydrate selected tree-term objects
```
This path should not compute a dependency closure over other module exports.
Each selected executable export is already a complete Tree Calculus value.
Contract-aware checking may use a broader path:
```text
module import -> selected exports -> exported view type refs -> typed-program evidence
```
That path emits portable evidence and leaves compatibility policy decisions to
the Tree Calculus checker. Typed programs and reusable catalogs do not need their
own binary object kinds today: they are ordinary Tree Calculus data and can be
stored as `arboricx.tree-term.v1` when persistence is useful.
## 10. Workspace Aliases
A workspace is mutable human-facing state over immutable content.
Examples:
```text
List -> module manifest hash
Prelude -> module manifest hash
map -> tree-term hash
httpServer -> bundle hash
```
Aliases should live under:
```text
store/aliases/
```
Initial categories:
```text
store/aliases/modules/<name>
store/aliases/names/<name>
store/aliases/packages/<name>
```
Alias file contents should be simple and explicit, for example:
```text
kind: arboricx.module-manifest.v1
hash: abc123...
```
Exact encoding can be decided with the first implementation. The important rule
is that aliases are mutable pointers, not content identity.
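Since the alias encoding is still open, the Python sketch below simply uses the two-line `kind:` / `hash:` form shown above. The helper names, the temporary-file naming, and the use of an atomic rename for retargeting are assumptions; alias names containing path separators are not handled.

```python
import os
import secrets


def write_alias(store_root: str, category: str, name: str, kind: str, hash_hex: str) -> None:
    """Point an alias at a content hash, replacing any previous target."""
    alias_dir = os.path.join(store_root, "aliases", category)
    os.makedirs(alias_dir, exist_ok=True)
    tmp_path = os.path.join(alias_dir, f".{name}.{secrets.token_hex(4)}.tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(f"kind: {kind}\nhash: {hash_hex}\n")
    # Retargeting swaps the pointer file; the referenced objects are never touched.
    os.replace(tmp_path, os.path.join(alias_dir, name))


def read_alias(store_root: str, category: str, name: str) -> tuple:
    """Return (kind, hash) for an alias, raising if either field is missing."""
    fields = {}
    path = os.path.join(store_root, "aliases", category, name)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields["kind"], fields["hash"]
```

Retargeting `List` to a new manifest hash changes only the alias file, so anything that recorded the old hash still resolves to the same immutable content.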
## 11. Existing Convention Alignment
This design intentionally preserves existing conventions where they already fit:
- SHA-256 domain-separated Merkle node hashing;
- `Leaf` / `Stem` / `Fork` node payload tags `0x00`, `0x01`, `0x02`;
- three-character object sharding from `lib/arboricx/server.tri`;
- indexed Arboricx bundles as compact transport objects;
- optional human-facing export names in manifests;
- View Contract checker evidence as portable Tree Calculus data.
It replaces or demotes conventions that do not fit:
- SQLite `terms.names` comma-separated aliases become workspace aliases/indexes;
- SQLite `terms.tags` comma-separated tags become optional metadata/indexes;
- file imports as AST flattening become transitional behavior;
- names cease to be semantic identity.
## 12. Implementation Sketch
A staged implementation can proceed as follows:
1. Add filesystem CAS helpers alongside the existing SQLite store.
2. Store/load Arboricx Merkle nodes using the filesystem layout.
3. Implement tree-term storage and reconstruction from filesystem CAS.
4. Implement pack from CAS tree terms/Merkle roots to indexed Arboricx bundle.
5. Implement unpack from indexed Arboricx bundle to CAS tree terms/Merkle roots.
6. Define a concrete module manifest encoding.
7. Store/load module manifests as content-addressed objects.
8. Add workspace alias read/write helpers.
9. Teach import resolution to target module manifests/exports.
10. Attach exported View Contract artifacts to module exports.
11. Gradually migrate existing `!import` users.
## 13. Deferred Decisions
These are intentionally left out of the first concrete format:
- package version solving;
- registry/remotes protocol;
- garbage collection/reachability;
- source/provenance/build-record objects;
- editor/update workflows;
- rich visibility/export rules;
- final import syntax;
- whether module manifests also need a tree-native encoding.
## 14. Summary
The concrete v1 direction is:
```text
Store:
  filesystem-backed content-addressed objects
Hashing:
  SHA256(domain || 0x00 || canonical payload)
Tree persistence:
  Arboricx Merkle nodes
Transport:
  indexed .arboricx bundles, packable from and unpackable to CAS roots
Modules:
  immutable manifests pairing export names with object refs and optional View
  Contract refs
Workspace:
  mutable aliases from human names to immutable content hashes
```
This keeps the store portable, preserves Arboricx's compact transport role,
restores Merkle DAGs as the persistence model, and gives View Contracts a stable
module/export attachment point without making the store `tricu`-specific.