TRICU MERKLE CONTENT STORE — HANDOFF DOC

Objective

Replace the current whole-term content store with a Merkle DAG-based content store for Tree Calculus terms.

Goals:

  • Canonical, cross-language, content-addressed representation

  • Maximal structural deduplication

  • Clean separation of:

    • identity (hash)
    • storage (nodes)
    • transport (packages)
    • execution (runtime graph)

Current State (contentstore branch)

You currently have:

Term (T)
  -> serializeTerm (Cereal)
  -> sha256(bytes)
  -> store full term blob

This is:

  • canonical at whole-term level
  • NOT deduplicated internally
  • NOT Merkle

Target Architecture

Core Concept

Each Tree Calculus node becomes a content-addressed object:

Leaf:
  hash = H( tag_leaf )

Stem:
  hash = H( tag_stem || child_hash )

Fork:
  hash = H( tag_fork || left_hash || right_hash )

Content store:

Hash -> Node(tag, child_hashes)

A program is:

root_hash

Data Model (Introduce)

Define a new canonical node type:

data Node
  = NLeaf
  | NStem Hash
  | NFork Hash Hash

Define:

type Hash = ByteString  -- SHA-256

Canonical Serialization (CRITICAL)

Define a strict, minimal, cross-language spec:

Node payload:
  Leaf: 0x00
  Stem: 0x01 || child_hash
  Fork: 0x02 || left_hash || right_hash

Node hash:
  SHA256( UTF8("tricu.merkle.node.v1") || 0x00 || node_payload )

Store:
  node_hash -> node_payload
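
The spec above can be sketched in Haskell as follows. This is a minimal illustration, not the project's implementation: it assumes SHA-256 comes from the cryptohash-sha256 package (Crypto.Hash.SHA256.hash), that child hashes are raw 32-byte digests, and the names encodeNode, decodeNode, domainTag and hashNode are hypothetical.

```haskell
import qualified Crypto.Hash.SHA256 as SHA256  -- assumed: cryptohash-sha256 package
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

type Hash = ByteString  -- assumed: raw 32-byte SHA-256 digest

data Node
  = NLeaf
  | NStem Hash
  | NFork Hash Hash
  deriving (Eq, Show)

-- | Canonical payload: one tag byte followed by the raw child hashes.
encodeNode :: Node -> ByteString
encodeNode NLeaf       = BS.singleton 0x00
encodeNode (NStem c)   = BS.cons 0x01 c
encodeNode (NFork l r) = BS.cons 0x02 (l <> r)

-- | Strict decoder: unknown tags, trailing bytes and wrong hash lengths
-- are rejected, so each node has exactly one accepted encoding.
decodeNode :: ByteString -> Maybe Node
decodeNode bs = case BS.uncons bs of
  Just (0x00, rest) | BS.null rest         -> Just NLeaf
  Just (0x01, rest) | BS.length rest == 32 -> Just (NStem rest)
  Just (0x02, rest) | BS.length rest == 64 ->
    let (l, r) = BS.splitAt 32 rest in Just (NFork l r)
  _ -> Nothing

-- | Domain-separation prefix. The string is ASCII, so its UTF-8 bytes
-- equal its Char8 bytes; the 0x00 separator is appended here.
domainTag :: ByteString
domainTag = BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00

-- | Node hash as defined in the spec: SHA256(tag || 0x00 || payload).
hashNode :: Node -> Hash
hashNode n = SHA256.hash (domainTag <> encodeNode n)
```

Under these assumptions the payload sizes are fixed at 1, 33 and 65 bytes for Leaf, Stem and Fork, which makes the decoder's length checks sufficient to rule out alternative encodings.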

Do not store the version inside every node payload unless every node needs to be self-describing. Put it in the hash preimage and in the store/package metadata instead; that gives versioning without bloating every node.


Required Invariants

These MUST hold:

  1. Determinism
     same tree → same hashes everywhere
  2. Structural identity
     identical subtrees → identical hashes
  3. No dependence on DAG shape
     Tree identity must not depend on construction order.
  4. Hash correctness
     lookup(hash) -> node
     hash(node) == hash

Core Functions to Implement

1. Convert Tree → Merkle DAG

buildMerkle :: T -> State Store Hash

Behavior:

  • recursively compute child hashes
  • create Node
  • store if not exists
  • return hash

This is the entry point replacing current storage.
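
The behavior listed above could look roughly like this. It is a sketch under assumptions: the store is modeled as an in-memory Data.Map, T is assumed to have the constructors Leaf, Stem and Fork, SHA-256 is taken from the cryptohash-sha256 package, and hashNode and putNode are illustrative helpers.

```haskell
import           Control.Monad.State (State, modify')
import qualified Crypto.Hash.SHA256 as SHA256  -- assumed: cryptohash-sha256 package
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map

-- Assumed shape of the existing term type.
data T = Leaf | Stem T | Fork T T deriving (Eq, Show)

type Hash = ByteString
data Node = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)

-- Hypothetical in-memory store: node hash -> node.
type Store = Map.Map Hash Node

encodeNode :: Node -> ByteString
encodeNode NLeaf       = BS.singleton 0x00
encodeNode (NStem c)   = BS.cons 0x01 c
encodeNode (NFork l r) = BS.cons 0x02 (l <> r)

hashNode :: Node -> Hash
hashNode n =
  SHA256.hash (BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00 <> encodeNode n)

-- | Store the node if it is not present yet and return its hash.
putNode :: Node -> State Store Hash
putNode n = do
  let h = hashNode n
  modify' (Map.insertWith (\_new old -> old) h n)
  pure h

-- | Hash children first, then the parent, storing each node once.
buildMerkle :: T -> State Store Hash
buildMerkle Leaf       = putNode NLeaf
buildMerkle (Stem t)   = buildMerkle t >>= putNode . NStem
buildMerkle (Fork l r) = do
  hl <- buildMerkle l
  hr <- buildMerkle r
  putNode (NFork hl hr)
```

Running buildMerkle on Fork (Stem Leaf) (Stem Leaf) from an empty store leaves three entries (leaf, stem, fork), because both children hash to the same key. Note that this version still walks every occurrence of a repeated subtree; only storage is deduplicated, not traversal work.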


2. Store Interface

putNode :: Node -> StoreM Hash
getNode :: Hash -> StoreM Node

Store layout can be:

/data/<hash>
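
One way to realize this layout is a directory with one file per node, named by the hex form of the hash. The sketch below works at the payload level and passes the store root explicitly instead of using StoreM; putPayload, getPayload, hexName and the hex naming are all assumptions, and SHA-256 is taken from the cryptohash-sha256 package.

```haskell
import           Control.Monad (unless)
import qualified Crypto.Hash.SHA256 as SHA256  -- assumed: cryptohash-sha256 package
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import           System.Directory (createDirectoryIfMissing, doesFileExist)
import           System.FilePath ((</>))
import           Text.Printf (printf)

type Hash = ByteString

-- | Hash of an already-serialized node payload.
hashPayload :: ByteString -> Hash
hashPayload p =
  SHA256.hash (BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00 <> p)

-- | Lowercase hex file name for a hash (hypothetical naming choice).
hexName :: Hash -> FilePath
hexName = concatMap (printf "%02x") . BS.unpack

-- | Write a payload under <root>/<hex hash> unless it is already there.
putPayload :: FilePath -> ByteString -> IO Hash
putPayload root payload = do
  let h    = hashPayload payload
      path = root </> hexName h
  createDirectoryIfMissing True root
  exists <- doesFileExist path
  unless exists (BS.writeFile path payload)
  pure h

-- | Read a payload back and re-check it against the requested hash.
getPayload :: FilePath -> Hash -> IO (Either String ByteString)
getPayload root h = do
  let path = root </> hexName h
  exists <- doesFileExist path
  if not exists
    then pure (Left ("missing node " ++ hexName h))
    else do
      payload <- BS.readFile path
      pure $ if hashPayload payload == h
        then Right payload
        else Left ("hash mismatch for " ++ hexName h)
```

Re-hashing on read enforces the hash-correctness invariant at the store boundary. The write here is not atomic; a production store would probably write to a temporary file and rename it into place.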

3. Reconstruct Tree (for execution)

loadTree :: Hash -> StoreM T

Recursive:

  • fetch node
  • rebuild T
  • optionally cache
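
A minimal version of the recursive steps above, assuming an in-memory Data.Map store and a T type with constructors Leaf, Stem and Fork. It returns Either so that a missing node is reported rather than crashing; that error handling is an assumption, not something the spec prescribes.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.Map.Strict as Map

-- Assumed shape of the existing term type.
data T = Leaf | Stem T | Fork T T deriving (Eq, Show)

type Hash = ByteString
data Node = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)

-- Hypothetical in-memory store: node hash -> node.
type Store = Map.Map Hash Node

-- | Fetch the node for a hash and rebuild the term below it.
-- No caching: a subtree shared k times is rebuilt k times.
loadTree :: Store -> Hash -> Either String T
loadTree store h = case Map.lookup h store of
  Nothing          -> Left ("missing node: " ++ show h)
  Just NLeaf       -> Right Leaf
  Just (NStem c)   -> Stem <$> loadTree store c
  Just (NFork l r) -> Fork <$> loadTree store l <*> loadTree store r
```

Because the result is a plain tree, a heavily shared DAG can expand to a term far larger than the number of stored nodes. That is the case the optional cache (and later the DAG-preserving runtime) is meant to address.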

4. Execution

Reuse existing evaluator:

eval :: T -> T

No change required.


Phase Plan

Phase 1 — Minimal Merkle Store

  • Implement Node type
  • Implement canonical serialization
  • Implement buildMerkle
  • Replace current put logic
  • Add loadTree

Goal: roundtrip correctness


Phase 2 — Dedup Verification

Add diagnostics:

countNodes :: Hash -> StoreM Int  -- distinct nodes reachable from the given root

Test:

  • repeated structures only stored once
  • identical subtrees share hash
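
One possible shape for this diagnostic, assuming an in-memory Data.Map store passed explicitly. countNodes counts distinct reachable hashes; expandedSize is a hypothetical companion that counts nodes of the fully expanded tree, so the two numbers can be compared to see how much sharing the store achieves.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Hash = ByteString
data Node = NLeaf | NStem Hash | NFork Hash Hash deriving (Eq, Show)

-- Hypothetical in-memory store: node hash -> node.
type Store = Map.Map Hash Node

-- | Hashes of all stored nodes reachable from a root.
-- Hashes missing from the store are skipped, not counted.
reachable :: Store -> Hash -> Set.Set Hash
reachable store = go Set.empty
  where
    go seen h
      | h `Set.member` seen = seen
      | otherwise = case Map.lookup h store of
          Nothing          -> seen
          Just NLeaf       -> Set.insert h seen
          Just (NStem c)   -> go (Set.insert h seen) c
          Just (NFork l r) -> go (go (Set.insert h seen) l) r

-- | Number of distinct nodes stored for a root.
countNodes :: Store -> Hash -> Int
countNodes store = Set.size . reachable store

-- | Node count of the fully expanded tree, counting a shared subtree
-- once per occurrence (for comparison only).
expandedSize :: Store -> Hash -> Integer
expandedSize store h = case Map.lookup h store of
  Nothing          -> 0
  Just NLeaf       -> 1
  Just (NStem c)   -> 1 + expandedSize store c
  Just (NFork l r) -> 1 + expandedSize store l + expandedSize store r
```

For a root representing Fork (Stem Leaf) (Stem Leaf), countNodes gives 3 while expandedSize gives 5; a growing gap between the two indicates that deduplication is taking effect.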

Phase 3 — Wire Format

Define transport:

bundle = compress(
  list of (hash, serialized_node)
)

Implement:

exportClosure :: Hash -> StoreM Bundle
importBundle :: Bundle -> StoreM ()
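
A sketch of the two functions over an in-memory store of payloads. Several things here are assumptions: the store is a Data.Map passed explicitly, the bundle is an uncompressed list (the compress step is omitted), payloads in the store are taken to be well-formed, import rejects any entry whose payload does not hash to its key, and SHA-256 comes from the cryptohash-sha256 package.

```haskell
import           Control.Monad (foldM)
import qualified Crypto.Hash.SHA256 as SHA256  -- assumed: cryptohash-sha256 package
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as Map

type Hash   = ByteString
type Store  = Map.Map Hash ByteString   -- node_hash -> node_payload
type Bundle = [(Hash, ByteString)]      -- uncompressed; compression omitted

hashPayload :: ByteString -> Hash
hashPayload p =
  SHA256.hash (BC.pack "tricu.merkle.node.v1" <> BS.singleton 0x00 <> p)

-- | Child hashes referenced by a (well-formed) payload.
children :: ByteString -> [Hash]
children p = case BS.uncons p of
  Just (0x01, rest) -> [rest]
  Just (0x02, rest) -> let (l, r) = BS.splitAt 32 rest in [l, r]
  _                 -> []

-- | Every node reachable from the root, each exactly once,
-- in ascending hash order (so the bundle is deterministic).
exportClosure :: Store -> Hash -> Bundle
exportClosure store root = Map.toList (go Map.empty root)
  where
    go acc h
      | h `Map.member` acc = acc
      | otherwise = case Map.lookup h store of
          Nothing -> acc  -- missing nodes are skipped in this sketch
          Just p  -> foldl go (Map.insert h p acc) (children p)

-- | Add bundle entries to a store, verifying each hash first.
importBundle :: Bundle -> Store -> Either String Store
importBundle bundle store = foldM step store bundle
  where
    step s (h, p)
      | hashPayload p == h = Right (Map.insertWith (\_new old -> old) h p s)
      | otherwise          = Left "bundle entry does not match its hash"
```

Verifying on import means a receiver never has to trust the sender: a tampered or corrupted entry fails the hash check before it reaches the store.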

Phase 4 — Runtime Optimization

Optional:

  • memoized load
  • DAG-preserving runtime
  • step counter in evaluator

What NOT To Do

Do NOT:

  • hash a whole serialized term as a single blob anymore
  • store serialized T directly
  • allow multiple encodings
  • include runtime state in nodes
  • depend on evaluation for hashing

Testing Requirements

Add tests for:

Identity

same term -> same hash

Deduplication

Fork A A stores A once

Roundtrip

T -> hash -> loadTree -> T (equal)

Cross-run stability

Hash must not change between runs


Optional Enhancements

Not required for initial implementation:

  • lazy loading
  • partial fetch (networked store)
  • compression at storage layer
  • typed wrappers
  • DAG-aware evaluator

Key Insight

You are not storing programs anymore.

You are storing:

a canonical graph of computation

Everything else (execution, wire, language) sits on top.


Success Criteria

You know this is working when:

  • identical subtrees collapse globally
  • hashes are stable across runs
  • small programs reuse large portions of structure
  • runtime can reconstruct and execute correctly
  • wire bundles can reconstruct store elsewhere

Final Mental Model

Authoring:        tricu source
Lowering:         Tree Calculus (T)

Identity:         Merkle hash(root)

Storage:          Merkle DAG (node store)

Wire:             compressed node bundles

Execution:        reconstructed graph → reduce

If anything is unclear during implementation, prioritize:

determinism > simplicity > performance

To run the tests:

nix build .#

All tests must pass without modification.