tricu/MERKLE.md

# TRICU MERKLE CONTENT STORE — HANDOFF DOC

## Objective

Replace the current **whole-term content store** with a **Merkle DAG–based content store** for Tree Calculus terms.

Goal:

* Canonical, cross-language, content-addressed representation
* Maximal structural deduplication
* Clean separation of:

  * identity (hash)
  * storage (nodes)
  * transport (packages)
  * execution (runtime graph)

---

## Current State (contentstore branch)

You currently have:

```text
Term (T)
  -> serializeTerm (Cereal)
  -> sha256(bytes)
  -> store full term blob
```

This is:

* canonical at whole-term level
* NOT deduplicated internally
* NOT Merkle

---

## Target Architecture

### Core Concept

Each Tree Calculus node becomes a content-addressed object:

```text
Leaf:
  hash = H( tag_leaf )

Stem:
  hash = H( tag_stem || child_hash )

Fork:
  hash = H( tag_fork || left_hash || right_hash )
```

Content store:

```text
Hash -> Node(tag, child_hashes)
```

A program is:

```text
root_hash
```

---

## Data Model (Introduce)

Define a new canonical node type:

```haskell
data Node
  = NLeaf
  | NStem Hash
  | NFork Hash Hash
```

Define:

```haskell
type Hash = ByteString  -- SHA-256
```

---

## Canonical Serialization (CRITICAL)

Define a **strict, minimal, cross-language spec**:

```text
Node payload:
  Leaf: 0x00
  Stem: 0x01 || child_hash
  Fork: 0x02 || left_hash || right_hash

Node hash:
  SHA256( UTF8("tricu.merkle.node.v1") || 0x00 || node_payload )

Store:
  node_hash -> node_payload
```

Do not store the version inside every node payload unless each node must be self-describing. Keep it in the hash preimage (the `tricu.merkle.node.v1` prefix above) and in the store/package metadata. That gives versioning without bloating every node.
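The spec above can be sketched in Haskell roughly as follows. This is a minimal illustration, not the project's implementation: it assumes the `cryptohash-sha256` package for `Crypto.Hash.SHA256.hash`, and the names `domainTag`, `encodeNode`, `decodeNode`, and `hashNode` are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Crypto.Hash.SHA256 as SHA256 -- assumption: cryptohash-sha256 package
import           Data.ByteString    (ByteString)
import qualified Data.ByteString    as BS

type Hash = ByteString -- SHA-256 digest, 32 bytes

data Node
  = NLeaf
  | NStem Hash
  | NFork Hash Hash
  deriving (Eq, Show)

-- Hash preimage prefix: UTF8("tricu.merkle.node.v1") || 0x00.
-- The literal is pure ASCII, so its bytes equal its UTF-8 encoding.
domainTag :: ByteString
domainTag = "tricu.merkle.node.v1" <> BS.singleton 0x00

-- Canonical payload: one tag byte followed by the child hashes.
encodeNode :: Node -> ByteString
encodeNode NLeaf       = BS.singleton 0x00
encodeNode (NStem c)   = BS.cons 0x01 c
encodeNode (NFork l r) = BS.cons 0x02 (l <> r)

-- Strict decoder: accepts exactly the three shapes above and nothing else,
-- so a second encoding of the same node cannot be read back.
decodeNode :: ByteString -> Maybe Node
decodeNode bs = case BS.uncons bs of
  Just (0x00, rest) | BS.null rest         -> Just NLeaf
  Just (0x01, rest) | BS.length rest == 32 -> Just (NStem rest)
  Just (0x02, rest) | BS.length rest == 64 ->
    let (l, r) = BS.splitAt 32 rest in Just (NFork l r)
  _ -> Nothing

hashNode :: Node -> Hash
hashNode n = SHA256.hash (domainTag <> encodeNode n)

main :: IO ()
main = do
  let leaf = hashNode NLeaf
      fork = NFork leaf leaf
  print (BS.length leaf)                            -- digest length in bytes
  print (decodeNode (encodeNode fork) == Just fork) -- payload roundtrip
```

Rejecting trailing bytes and wrong-length hashes in the decoder is one way to enforce the "no multiple encodings" rule at the store boundary.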

---

## Required Invariants

These MUST hold:

1. **Determinism**

```text
same tree → same hashes everywhere
```

2. **Structural identity**

```text
identical subtrees → identical hashes
```

3. **No dependence on DAG shape**

A tree's hash must depend only on its structure, never on the order in which its nodes were constructed or inserted into the store.

4. **Hash correctness**

```text
lookup(hash) -> node
hash(node) == hash
```

---

## Core Functions to Implement

### 1. Convert Tree → Merkle DAG

```haskell
buildMerkle :: T -> State Store Hash
```

Behavior:

* recursively compute child hashes
* create Node
* store if not exists
* return hash

This is the entry point replacing current storage.
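A minimal sketch of the behavior listed above, under several assumptions: `T` is a local stand-in for the project's term type, `Store` is an in-memory `Map` rather than the real store, hashing uses the `cryptohash-sha256` package, and `putNode` here is a simplified pure variant of the store interface.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Monad.State.Strict (State, modify', runState)
import qualified Crypto.Hash.SHA256         as SHA256 -- assumption: cryptohash-sha256
import           Data.ByteString            (ByteString)
import qualified Data.ByteString            as BS
import qualified Data.Map.Strict            as Map

-- Stand-in for the project's term type (assumed shape).
data T = Leaf | Stem T | Fork T T deriving (Eq, Show)

type Hash  = ByteString
data Node  = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)
type Store = Map.Map Hash Node -- in-memory stand-in for the real store

hashNode :: Node -> Hash
hashNode n = SHA256.hash ("tricu.merkle.node.v1" <> BS.singleton 0x00 <> payload)
  where
    payload = case n of
      NLeaf     -> BS.singleton 0x00
      NStem c   -> BS.cons 0x01 c
      NFork l r -> BS.cons 0x02 (l <> r)

-- Store a node under its own hash; an existing entry is left untouched.
putNode :: Node -> State Store Hash
putNode n = do
  let h = hashNode n
  modify' (Map.insertWith (\_new old -> old) h n)
  pure h

-- Children are hashed first, then the parent node is built from their hashes.
buildMerkle :: T -> State Store Hash
buildMerkle Leaf       = putNode NLeaf
buildMerkle (Stem t)   = buildMerkle t >>= putNode . NStem
buildMerkle (Fork l r) = do
  hl <- buildMerkle l
  hr <- buildMerkle r
  putNode (NFork hl hr)

main :: IO ()
main = do
  let a              = Stem (Stem Leaf)
      (_root, store) = runState (buildMerkle (Fork a a)) Map.empty
  -- 4 distinct nodes: Leaf, Stem Leaf, Stem (Stem Leaf), Fork a a
  print (Map.size store)
```

Because the hash is a function of the node alone, the resulting store contents do not depend on whether the left or right child is visited first.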

---

### 2. Store Interface

```haskell
putNode :: Node -> StoreM Hash
getNode :: Hash -> StoreM Node
```

Store layout can be:

```text
/data/<hash>
```
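One possible reading of this layout is one file per node, named by the hex-encoded hash. The sketch below assumes that reading; `hexName`, `putBlob`, and `getBlob` are hypothetical helpers that take an already-computed hash and payload, and the store root is passed in rather than fixed to `/data`.

```haskell
module Main where

import           Control.Monad              (unless)
import           Data.ByteString            (ByteString)
import qualified Data.ByteString            as BS
import qualified Data.ByteString.Builder    as B
import qualified Data.ByteString.Lazy.Char8 as BL8
import           System.Directory           (createDirectoryIfMissing,
                                             doesFileExist,
                                             getTemporaryDirectory)
import           System.FilePath            ((</>))

type Hash = ByteString

-- Lowercase hex rendering of a digest, used as the file name.
hexName :: Hash -> FilePath
hexName = BL8.unpack . B.toLazyByteString . B.byteStringHex

-- Write the payload to <root>/<hex hash> unless a file is already there.
putBlob :: FilePath -> Hash -> ByteString -> IO ()
putBlob root h payload = do
  createDirectoryIfMissing True root
  let path = root </> hexName h
  exists <- doesFileExist path
  unless exists (BS.writeFile path payload)

-- Read the payload back, or Nothing if the node is not in the store.
getBlob :: FilePath -> Hash -> IO (Maybe ByteString)
getBlob root h = do
  let path = root </> hexName h
  exists <- doesFileExist path
  if exists then Just <$> BS.readFile path else pure Nothing

main :: IO ()
main = do
  tmp <- getTemporaryDirectory
  let root = tmp </> "tricu-merkle-demo" -- hypothetical demo directory
      h    = BS.replicate 32 0xab        -- placeholder bytes, not a real digest
  putBlob root h (BS.singleton 0x00)
  getBlob root h >>= print
```

The check-then-write in `putBlob` is not atomic. A real `putNode` would also compute the hash from the payload itself, and `getNode` would re-verify it on read to uphold the hash-correctness invariant.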

---

### 3. Reconstruct Tree (for execution)

```haskell
loadTree :: Hash -> StoreM T
```

Recursive:

* fetch node
* rebuild T
* optionally cache
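The recursion above might look like the following sketch. It assumes an in-memory `Map` store and a local stand-in for `T`, and it reports a missing node as `Left` rather than running in `StoreM`. The keys in the demo are placeholder byte strings, not real digests.

```haskell
module Main where

import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict       as Map

-- Stand-in for the project's term type (assumed shape).
data T = Leaf | Stem T | Fork T T deriving (Eq, Show)

type Hash  = ByteString
data Node  = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)
type Store = Map.Map Hash Node

-- Rebuild the full tree below a hash.
-- Left carries the first hash that could not be found in the store.
loadTree :: Store -> Hash -> Either Hash T
loadTree store h = case Map.lookup h store of
  Nothing          -> Left h
  Just NLeaf       -> Right Leaf
  Just (NStem c)   -> Stem <$> loadTree store c
  Just (NFork l r) -> Fork <$> loadTree store l <*> loadTree store r

main :: IO ()
main = do
  let key   = BC.pack -- placeholder keys standing in for SHA-256 digests
      store = Map.fromList
        [ (key "h-leaf", NLeaf)
        , (key "h-stem", NStem (key "h-leaf"))
        , (key "h-fork", NFork (key "h-stem") (key "h-stem"))
        ]
  print (loadTree store (key "h-fork"))    -- Right (Fork (Stem Leaf) (Stem Leaf))
  print (loadTree store (key "h-missing")) -- Left "h-missing"
```

This version re-expands every shared subtree, so the rebuilt `T` can be much larger than the stored DAG. A cache keyed by hash would avoid repeated fetches and could let shared subtrees be shared in memory as well.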

---

### 4. Execution

Reuse existing evaluator:

```haskell
eval :: T -> T
```

No change required.

---

## Phase Plan

### Phase 1 — Minimal Merkle Store

* Implement Node type
* Implement canonical serialization
* Implement `buildMerkle`
* Replace current `put` logic
* Add `loadTree`

Goal: roundtrip correctness

---

### Phase 2 — Dedup Verification

Add diagnostics:

```haskell
-- Number of distinct nodes reachable from the given root (needs store access).
countNodes :: Hash -> StoreM Int
```

Test:

* repeated structures only stored once
* identical subtrees share hash
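A hedged sketch of such a diagnostic: it counts distinct reachable nodes and, for comparison, the size of the fully expanded tree. It takes an in-memory `Map` store as an explicit argument; `children`, `reachable`, and `treeSize` are hypothetical helpers, and the demo keys are placeholders rather than real digests.

```haskell
module Main where

import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict       as Map
import qualified Data.Set              as Set

type Hash  = ByteString
data Node  = NLeaf | NStem Hash | NFork Hash Hash
type Store = Map.Map Hash Node

children :: Node -> [Hash]
children NLeaf       = []
children (NStem c)   = [c]
children (NFork l r) = [l, r]

-- Distinct stored nodes reachable from the root; each hash is visited once.
reachable :: Store -> Hash -> Set.Set Hash
reachable store root = go Set.empty [root]
  where
    go seen [] = seen
    go seen (h : rest)
      | h `Set.member` seen = go seen rest
      | otherwise = case Map.lookup h store of
          Nothing -> go seen rest -- missing node: not counted
          Just n  -> go (Set.insert h seen) (children n ++ rest)

countNodes :: Store -> Hash -> Int
countNodes store = Set.size . reachable store

-- Size of the fully expanded tree: shared subtrees are counted every time.
treeSize :: Store -> Hash -> Integer
treeSize store h = case Map.lookup h store of
  Nothing -> 0
  Just n  -> 1 + sum (map (treeSize store) (children n))

main :: IO ()
main = do
  let key   = BC.pack -- placeholder keys standing in for SHA-256 digests
      store = Map.fromList
        [ (key "leaf", NLeaf)
        , (key "stem", NStem (key "leaf"))
        , (key "fork", NFork (key "stem") (key "stem"))
        ]
  print (countNodes store (key "fork")) -- 3 stored nodes
  print (treeSize store (key "fork"))   -- 5 nodes when expanded
```

The gap between the two numbers is a direct measure of how much deduplication the store achieves for a given root.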

---

### Phase 3 — Wire Format

Define transport:

```text
bundle = compress(
  list of (hash, serialized_node)
)
```

Implement:

```haskell
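-- Collect every node reachable from the root (needs store access).
exportClosure :: Hash -> StoreM Bundle
-- Verify and insert every node of a bundle into the local store.
importBundle :: Bundle -> StoreM ()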
```
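The export/import pair could be sketched as below. Assumptions: the store maps hashes to raw payloads in an in-memory `Map`, compression is left out, hashing uses the `cryptohash-sha256` package, and errors are reported as `Left` with the offending hash. `hashPayload` and `childHashes` are hypothetical helpers.

```haskell
module Main where

import           Control.Monad         (foldM)
import qualified Crypto.Hash.SHA256    as SHA256 -- assumption: cryptohash-sha256
import           Data.ByteString       (ByteString)
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict       as Map

type Hash    = ByteString
type Payload = ByteString
type Store   = Map.Map Hash Payload -- node_hash -> node_payload
type Bundle  = [(Hash, Payload)]    -- uncompressed in this sketch

hashPayload :: Payload -> Hash
hashPayload p =
  SHA256.hash (BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00 <> p)

-- Child hashes read straight from a canonical payload.
childHashes :: Payload -> [Hash]
childHashes p = case BS.uncons p of
  Just (0x01, rest) -> [rest]
  Just (0x02, rest) -> let (l, r) = BS.splitAt 32 rest in [l, r]
  _                 -> []

-- Every node reachable from the root, each exactly once, ordered by hash.
-- Left carries a hash that is referenced but missing from the store.
exportClosure :: Store -> Hash -> Either Hash Bundle
exportClosure store root = Map.toList <$> go Map.empty [root]
  where
    go acc [] = Right acc
    go acc (h : rest)
      | h `Map.member` acc = go acc rest
      | otherwise = case Map.lookup h store of
          Nothing -> Left h
          Just p  -> go (Map.insert h p acc) (childHashes p ++ rest)

-- Admit an entry only if its payload really hashes to the claimed key.
importBundle :: Bundle -> Store -> Either Hash Store
importBundle bundle store0 = foldM step store0 bundle
  where
    step s (h, p)
      | hashPayload p == h = Right (Map.insert h p s)
      | otherwise          = Left h

main :: IO ()
main = do
  let leafP = BS.singleton 0x00
      leafH = hashPayload leafP
      forkP = BS.cons 0x02 (leafH <> leafH)
      forkH = hashPayload forkP
      store = Map.fromList [(leafH, leafP), (forkH, forkP)]
  case exportClosure store forkH of
    Left missing -> putStrLn ("missing node: " ++ show missing)
    Right bundle -> do
      print (length bundle)                              -- 2 entries
      print (importBundle bundle Map.empty == Right store) -- True
```

Emitting entries in hash order keeps the bundle bytes deterministic for a given root. A fuller importer would also reject malformed payloads and check that the bundle's closure is complete.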

---

### Phase 4 — Runtime Optimization

Optional:

* memoized load
* DAG-preserving runtime
* step counter in evaluator

---

## What NOT To Do

Do NOT:

* hash whole serialized terms; hash per node instead
* store serialized `T` directly
* allow multiple encodings
* include runtime state in nodes
* depend on evaluation for hashing

---

## Testing Requirements

Add tests for:

### Identity

```text
same term -> same hash
```

### Deduplication

```text
Fork A A stores A once
```

### Roundtrip

```text
T -> hash -> loadTree -> T (equal)
```

### Cross-run stability

The hash of a given term must not change between runs.

---

## Optional Enhancements

Not required for initial implementation:

* lazy loading
* partial fetch (networked store)
* compression at storage layer
* typed wrappers
* DAG-aware evaluator

---

## Key Insight

You are not storing programs anymore.

You are storing:

```text
a canonical graph of computation
```

Everything else (execution, wire, language) sits on top.

---

## Success Criteria

You know this is working when:

* identical subtrees collapse globally
* hashes are stable across runs
* small programs reuse large portions of structure
* runtime can reconstruct and execute correctly
* wire bundles can reconstruct store elsewhere

---

## Final Mental Model

```text
Authoring:        tricu source
Lowering:         Tree Calculus (T)

Identity:         Merkle hash(root)

Storage:          Merkle DAG (node store)

Wire:             compressed node bundles

Execution:        reconstructed graph → reduce
```

---

If anything is unclear during implementation, prioritize:

```text
determinism > simplicity > performance
```

To run the tests, run `nix build .#`. All tests must pass without modification.