tricu/MERKLE.md
2026-05-05 14:07:30 -05:00

# TRICU MERKLE CONTENT STORE — HANDOFF DOC
## Objective
Replace the current **whole-term content store** with a **Merkle DAG-based content store** for Tree Calculus terms.
Goal:
* Canonical, cross-language, content-addressed representation
* Maximal structural deduplication
* Clean separation of:
  * identity (hash)
  * storage (nodes)
  * transport (packages)
  * execution (runtime graph)
---
## Current State (contentstore branch)
You currently have:
```text
Term (T)
-> serializeTerm (Cereal)
-> sha256(bytes)
-> store full term blob
```
This is:
* canonical at whole-term level
* NOT deduplicated internally
* NOT Merkle
---
## Target Architecture
### Core Concept
Each Tree Calculus node becomes a content-addressed object:
```text
Leaf:
hash = H( tag_leaf )
Stem:
hash = H( tag_stem || child_hash )
Fork:
hash = H( tag_fork || left_hash || right_hash )
```
Content store:
```text
Hash -> Node(tag, child_hashes)
```
A program is:
```text
root_hash
```
---
## Data Model (Introduce)
Define a new canonical node type:
```haskell
data Node
= NLeaf
| NStem Hash
| NFork Hash Hash
```
Define:
```haskell
type Hash = ByteString -- SHA-256
```
---
## Canonical Serialization (CRITICAL)
Define a **strict, minimal, cross-language spec**:
```text
Node payload:
Leaf: 0x00
Stem: 0x01 || child_hash
Fork: 0x02 || left_hash || right_hash
Node hash:
SHA256( UTF8("tricu.merkle.node.v1") || 0x00 || node_payload )
Store:
node_hash -> node_payload
```
Do not store the version inside every node payload unless every node must be self-describing. Put it in the hash preimage and in the store/package metadata instead. That gives versioning without bloating every node.
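A minimal sketch of the payload encoding and hash preimage specified above. The SHA-256 function is passed in as a parameter (`sha256`) so the sketch depends only on `bytestring`; which SHA-256 library to use is left open here. The names `encodeNode`, `decodeNode`, `hashNodeWith`, and `domainTag` are illustrative assumptions, not existing tricu functions.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

type Hash = ByteString -- 32-byte SHA-256 digest

data Node
  = NLeaf
  | NStem Hash
  | NFork Hash Hash
  deriving (Eq, Show)

-- | Hash preimage prefix: UTF8("tricu.merkle.node.v1") || 0x00.
-- The tag is pure ASCII, so Char8 packing equals its UTF-8 encoding.
domainTag :: ByteString
domainTag = BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00

-- | Node payload: one tag byte followed by the raw child hashes.
encodeNode :: Node -> ByteString
encodeNode NLeaf       = BS.singleton 0x00
encodeNode (NStem c)   = BS.cons 0x01 c
encodeNode (NFork l r) = BS.cons 0x02 (l <> r)

-- | Strict decoder: unknown tags and wrong lengths are rejected.
decodeNode :: ByteString -> Maybe Node
decodeNode bs = case BS.uncons bs of
  Just (0x00, rest) | BS.null rest         -> Just NLeaf
  Just (0x01, rest) | BS.length rest == 32 -> Just (NStem rest)
  Just (0x02, rest) | BS.length rest == 64 ->
    let (l, r) = BS.splitAt 32 rest in Just (NFork l r)
  _ -> Nothing

-- | Node hash. The first argument is assumed to be SHA-256.
hashNodeWith :: (ByteString -> ByteString) -> Node -> Hash
hashNodeWith sha256 n = sha256 (domainTag <> encodeNode n)
```

Rejecting unknown tags and wrong lengths on decode keeps exactly one valid encoding per node.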
---
## Required Invariants
These MUST hold:
1. **Determinism**
```text
same tree → same hashes everywhere
```
2. **Structural identity**
```text
identical subtrees → identical hashes
```
3. **No dependence on DAG shape**
Tree identity must not depend on construction order.
4. **Hash correctness**
```text
lookup(hash) -> node
hash(node) == hash
```
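Invariant 4 can be enforced on every read rather than only assumed. The sketch below re-hashes the stored payload before returning it. It assumes an in-memory `Map` as the store and takes the SHA-256 function as a parameter; `lookupVerified` and `LookupError` are hypothetical names.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map

type Hash  = ByteString
type Store = Map.Map Hash ByteString -- node_hash -> node_payload

-- | Hash preimage prefix from the serialization spec.
domainTag :: ByteString
domainTag = BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00

data LookupError
  = Missing Hash -- no entry under this hash
  | Corrupt Hash -- entry exists but does not hash back to its key
  deriving (Eq, Show)

-- | Fetch a payload and check that it hashes back to the requested key.
-- The first argument is assumed to be SHA-256.
lookupVerified
  :: (ByteString -> ByteString) -> Store -> Hash -> Either LookupError ByteString
lookupVerified sha256 store h =
  case Map.lookup h store of
    Nothing -> Left (Missing h)
    Just payload
      | sha256 (domainTag <> payload) == h -> Right payload
      | otherwise                          -> Left (Corrupt h)
```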
---
## Core Functions to Implement
### 1. Convert Tree → Merkle DAG
```haskell
buildMerkle :: T -> State Store Hash
```
Behavior:
* recursively compute child hashes
* create Node
* store if not exists
* return hash
This is the entry point replacing current storage.
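The behavior listed above could look roughly like this. It is a sketch under several assumptions: `T` is taken to be `Leaf | Stem T | Fork T T`, the store is an in-memory `Map` from hash to payload, `State` comes from `mtl`, and the SHA-256 function is passed in as a parameter. `buildMerkleWith`, `putNodeWith`, and `storeTerm` are illustrative names.

```haskell
import           Control.Monad.State.Strict (State, modify', runState)
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map

type Hash  = ByteString
type Store = Map.Map Hash ByteString -- node_hash -> node_payload

data T = Leaf | Stem T | Fork T T deriving (Eq, Show) -- assumed shape of T
data Node = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)

domainTag :: ByteString
domainTag = BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00

encodeNode :: Node -> ByteString
encodeNode NLeaf       = BS.singleton 0x00
encodeNode (NStem c)   = BS.cons 0x01 c
encodeNode (NFork l r) = BS.cons 0x02 (l <> r)

-- | Hash a node and store its payload unless the hash is already present.
putNodeWith :: (ByteString -> ByteString) -> Node -> State Store Hash
putNodeWith sha256 n = do
  let payload = encodeNode n
      h       = sha256 (domainTag <> payload)
  modify' (Map.insertWith (\_new old -> old) h payload)
  pure h

-- | Bottom-up: hash the children first, then the node that refers to them.
buildMerkleWith :: (ByteString -> ByteString) -> T -> State Store Hash
buildMerkleWith sha256 = go
  where
    go Leaf       = putNodeWith sha256 NLeaf
    go (Stem t)   = go t >>= putNodeWith sha256 . NStem
    go (Fork l r) = do
      hl <- go l
      hr <- go r
      putNodeWith sha256 (NFork hl hr)

-- | Run against an existing store; returns the root hash and the updated store.
storeTerm :: (ByteString -> ByteString) -> T -> Store -> (Hash, Store)
storeTerm sha256 t = runState (buildMerkleWith sha256 t)
```

Storing `Fork a a` with `a = Stem Leaf` into an empty store leaves three entries (Leaf, Stem, Fork), because both children resolve to the same hash. Storing the same term again returns the same root hash and leaves the store unchanged.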
---
### 2. Store Interface
```haskell
putNode :: Node -> StoreM Hash
getNode :: Hash -> StoreM Node
```
Store layout can be:
```text
/data/<hash>
```
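One possible reading of the `/data/<hash>` layout is one file per node, named by the lowercase hex encoding of its hash. The hex naming is an assumption of this sketch, as are the helper names `hexHash`, `nodePath`, `putPayload`, and `getPayload`. It uses the `directory` and `filepath` packages.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import           System.Directory (createDirectoryIfMissing, doesFileExist)
import           System.FilePath ((</>))
import           Text.Printf (printf)

type Hash = ByteString

-- | Lowercase hex rendering of a hash, two characters per byte.
hexHash :: Hash -> String
hexHash = concatMap (printf "%02x") . BS.unpack

-- | File that holds the payload for a given hash under the store root.
nodePath :: FilePath -> Hash -> FilePath
nodePath root h = root </> hexHash h

-- | Write the payload only if no file exists for this hash yet.
putPayload :: FilePath -> Hash -> ByteString -> IO ()
putPayload root h payload = do
  createDirectoryIfMissing True root
  exists <- doesFileExist (nodePath root h)
  if exists then pure () else BS.writeFile (nodePath root h) payload

-- | Read a payload back, or Nothing if the hash is not in the store.
getPayload :: FilePath -> Hash -> IO (Maybe ByteString)
getPayload root h = do
  exists <- doesFileExist (nodePath root h)
  if exists then Just <$> BS.readFile (nodePath root h) else pure Nothing
```

Because the file name is derived from the content, an existing file never needs to be overwritten.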
---
### 3. Reconstruct Tree (for execution)
```haskell
loadTree :: Hash -> StoreM T
```
Recursive:
* fetch node
* rebuild T
* optionally cache
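The recursive steps above can be sketched as follows. For brevity the store maps hashes to already-decoded nodes and is passed explicitly instead of living in `StoreM`; a missing node is reported as `Left` with the offending hash. `T` is assumed to be `Leaf | Stem T | Fork T T`, and the keys in `exampleStore` are made-up labels, not real hashes.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map

type Hash  = ByteString
data T     = Leaf | Stem T | Fork T T deriving (Eq, Show) -- assumed shape of T
data Node  = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)
type Store = Map.Map Hash Node -- decoded nodes, for brevity

-- | Rebuild the full tree below a hash. Left carries the first missing hash.
loadTree :: Store -> Hash -> Either Hash T
loadTree store = go
  where
    go h = case Map.lookup h store of
      Nothing          -> Left h
      Just NLeaf       -> Right Leaf
      Just (NStem c)   -> Stem <$> go c
      Just (NFork l r) -> Fork <$> go l <*> go r

-- | Hand-built store for Fork (Stem Leaf) Leaf, with placeholder keys.
exampleStore :: Store
exampleStore = Map.fromList
  [ (BC.pack "leaf", NLeaf)
  , (BC.pack "stem", NStem (BC.pack "leaf"))
  , (BC.pack "root", NFork (BC.pack "stem") (BC.pack "leaf"))
  ]
```

This version does no caching, so a subtree shared by many parents is rebuilt once per reference. A memo table keyed by hash would avoid that and also preserve sharing in memory.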
---
### 4. Execution
Reuse existing evaluator:
```haskell
eval :: T -> T
```
No change required.
---
## Phase Plan
### Phase 1 — Minimal Merkle Store
* Implement Node type
* Implement canonical serialization
* Implement `buildMerkle`
* Replace current `put` logic
* Add `loadTree`
Goal: roundtrip correctness
---
### Phase 2 — Dedup Verification
Add diagnostics:
```haskell
countNodes :: Hash -> StoreM Int
```
Test:
* repeated structures only stored once
* identical subtrees share hash
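A sketch of the diagnostic, assuming a store of decoded nodes passed explicitly. `countNodes` counts distinct nodes reachable from a root. `expandedSize` is a hypothetical companion that counts nodes of the fully expanded tree, so the two numbers can be compared. Keys in `exampleStore` are made-up labels, not real hashes.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Hash  = ByteString
data Node  = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)
type Store = Map.Map Hash Node

children :: Node -> [Hash]
children NLeaf       = []
children (NStem c)   = [c]
children (NFork l r) = [l, r]

-- | Number of distinct stored nodes reachable from the root.
countNodes :: Store -> Hash -> Int
countNodes store root = Set.size (go [root] Set.empty)
  where
    go [] seen = seen
    go (h : rest) seen
      | h `Set.member` seen = go rest seen
      | otherwise = case Map.lookup h store of
          Nothing -> go rest seen -- missing node: ignored in this sketch
          Just n  -> go (children n ++ rest) (Set.insert h seen)

-- | Node count of the expanded tree (no sharing). Naive recursion,
-- intended for small diagnostic inputs only.
expandedSize :: Store -> Hash -> Integer
expandedSize store h = case Map.lookup h store of
  Nothing -> 0
  Just n  -> 1 + sum (map (expandedSize store) (children n))

-- | Fork A A with A = Stem Leaf, using placeholder keys.
exampleStore :: Store
exampleStore = Map.fromList
  [ (BC.pack "leaf", NLeaf)
  , (BC.pack "a",    NStem (BC.pack "leaf"))
  , (BC.pack "root", NFork (BC.pack "a") (BC.pack "a"))
  ]
```

For `exampleStore`, `countNodes` gives 3 while `expandedSize` gives 5. The gap between the two is the deduplication this phase is meant to verify.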
---
### Phase 3 — Wire Format
Define transport:
```text
bundle = compress(
list of (hash, serialized_node)
)
```
Implement:
```haskell
exportClosure :: Hash -> StoreM Bundle
importBundle :: Bundle -> StoreM ()
```
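A sketch of the two functions, with the store passed explicitly and the compression step left out (it would wrap the list in whatever codec is chosen). Import re-hashes every entry and rejects the bundle on the first mismatch, so a bundle cannot introduce a node under the wrong hash. The SHA-256 function is a parameter, and `childHashes` is a hypothetical helper that assumes 32-byte hashes.

```haskell
import           Control.Monad (foldM)
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map

type Hash   = ByteString
type Store  = Map.Map Hash ByteString -- node_hash -> node_payload
type Bundle = [(Hash, ByteString)]    -- uncompressed list of (hash, serialized_node)

domainTag :: ByteString
domainTag = BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00

-- | Child hashes referenced by a payload, assuming 32-byte hashes.
childHashes :: ByteString -> [Hash]
childHashes payload = case BS.uncons payload of
  Just (0x01, rest) -> [rest]
  Just (0x02, rest) -> let (l, r) = BS.splitAt 32 rest in [l, r]
  _                 -> []

-- | Every node reachable from the root, each exactly once, ordered by hash.
exportClosure :: Store -> Hash -> Bundle
exportClosure store root = Map.toList (go [root] Map.empty)
  where
    go [] acc = acc
    go (h : rest) acc
      | h `Map.member` acc = go rest acc
      | otherwise = case Map.lookup h store of
          Nothing      -> go rest acc -- missing node: skipped in this sketch
          Just payload -> go (childHashes payload ++ rest) (Map.insert h payload acc)

-- | Verify and insert each entry. Left carries the first hash that fails.
-- The first argument is assumed to be SHA-256.
importBundle :: (ByteString -> ByteString) -> Bundle -> Store -> Either Hash Store
importBundle sha256 bundle store0 = foldM step store0 bundle
  where
    step store (h, payload)
      | sha256 (domainTag <> payload) == h = Right (Map.insert h payload store)
      | otherwise                          = Left h
```

Emitting entries in hash order makes the bundle itself deterministic for a given root, which fits the "determinism first" priority.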
---
### Phase 4 — Runtime Optimization
Optional:
* memoized load
* DAG-preserving runtime
* step counter in evaluator
---
## What NOT To Do
Do NOT:
* hash whole serialized terms anymore (the root hash is derived from node hashes only)
* store serialized `T` directly
* allow multiple encodings
* include runtime state in nodes
* depend on evaluation for hashing
---
## Testing Requirements
Add tests for:
### Identity
```text
same term -> same hash
```
### Deduplication
```text
Fork A A stores A once
```
### Roundtrip
```text
T -> hash -> loadTree -> T (equal)
```
### Cross-run stability
Hashes must not change between runs.
---
## Optional Enhancements
Not required for initial implementation:
* lazy loading
* partial fetch (networked store)
* compression at storage layer
* typed wrappers
* DAG-aware evaluator
---
## Key Insight
You are not storing programs anymore.
You are storing:
```text
a canonical graph of computation
```
Everything else (execution, wire, language) sits on top.
---
## Success Criteria
You know this is working when:
* identical subtrees collapse globally
* hashes are stable across runs
* small programs reuse large portions of structure
* runtime can reconstruct and execute correctly
* wire bundles can reconstruct store elsewhere
---
## Final Mental Model
```text
Authoring: tricu source
Lowering: Tree Calculus (T)
Identity: Merkle hash(root)
Storage: Merkle DAG (node store)
Wire: compressed node bundles
Execution: reconstructed graph → reduce
```
---
If anything is unclear during implementation, prioritize:
```text
determinism > simplicity > performance
```
To run the tests, use `nix build .#`. All existing tests must pass without modification.