diff --git a/MERKLE.md b/MERKLE.md new file mode 100644 index 0000000..9bc47f3 --- /dev/null +++ b/MERKLE.md @@ -0,0 +1,358 @@ +# TRICU MERKLE CONTENT STORE — HANDOFF DOC + +## Objective + +Replace the current **whole-term content store** with a **Merkle DAG–based content store** for Tree Calculus terms. + +Goal: + +* Canonical, cross-language, content-addressed representation +* Maximal structural deduplication +* Clean separation of: + + * identity (hash) + * storage (nodes) + * transport (packages) + * execution (runtime graph) + +--- + +## Current State (contentstore branch) + +You currently have: + +```text +Term (T) + -> serializeTerm (Cereal) + -> sha256(bytes) + -> store full term blob +``` + +This is: + +* canonical at whole-term level +* NOT deduplicated internally +* NOT Merkle + +--- + +## Target Architecture + +### Core Concept + +Each Tree Calculus node becomes a content-addressed object: + +```text +Leaf: + hash = H( tag_leaf ) + +Stem: + hash = H( tag_stem || child_hash ) + +Fork: + hash = H( tag_fork || left_hash || right_hash ) +``` + +Content store: + +```text +Hash -> Node(tag, child_hashes) +``` + +A program is: + +```text +root_hash +``` + +--- + +## Data Model (Introduce) + +Define a new canonical node type: + +```haskell +data Node + = NLeaf + | NStem Hash + | NFork Hash Hash +``` + +Define: + +```haskell +type Hash = ByteString -- SHA-256 +``` + +--- + +## Canonical Serialization (CRITICAL) + +Define a **strict, minimal, cross-language spec**: + +```text +Node payload: + Leaf: 0x00 + Stem: 0x01 || child_hash + Fork: 0x02 || left_hash || right_hash + +Node hash: + SHA256( UTF8("tricu.merkle.node.v1") || 0x00 || node_payload ) + +Store: + node_hash -> node_payload +``` + +The only thing I would avoid is storing the version inside every node payload unless you need every node to be self-describing. Put it in the hash preimage and in the store/package metadata. That gives versioning without bloating every node. + +--- + +## Required Invariants + +These MUST hold: + +1. **Determinism** + +```text +same tree → same hashes everywhere +``` + +2. **Structural identity** + +```text +identical subtrees → identical hashes +``` + +3. **No dependence on DAG shape** + +Tree identity must not depend on construction order. + +4. **Hash correctness** + +```text +lookup(hash) -> node +hash(node) == hash +``` + +--- + +## Core Functions to Implement + +### 1. Convert Tree → Merkle DAG + +```haskell +buildMerkle :: T -> State Store Hash +``` + +Behavior: + +* recursively compute child hashes +* create Node +* store if not exists +* return hash + +This is the entry point replacing current storage. + +--- + +### 2. Store Interface + +```haskell +putNode :: Node -> StoreM Hash +getNode :: Hash -> StoreM Node +``` + +Store layout can be: + +```text +/data/ +``` + +--- + +### 3. Reconstruct Tree (for execution) + +```haskell +loadTree :: Hash -> StoreM T +``` + +Recursive: + +* fetch node +* rebuild T +* optionally cache + +--- + +### 4. Execution + +Reuse existing evaluator: + +```haskell +eval :: T -> T +``` + +No change required. + +--- + +## Phase Plan + +### Phase 1 — Minimal Merkle Store + +* Implement Node type +* Implement canonical serialization +* Implement `buildMerkle` +* Replace current `put` logic +* Add `loadTree` + +Goal: roundtrip correctness + +--- + +### Phase 2 — Dedup Verification + +Add diagnostics: + +```haskell +countNodes :: Hash -> Int +``` + +Test: + +* repeated structures only stored once +* identical subtrees share hash + +--- + +### Phase 3 — Wire Format + +Define transport: + +```text +bundle = compress( + list of (hash, serialized_node) +) +``` + +Implement: + +```haskell +exportClosure :: Hash -> Bundle +importBundle :: Bundle -> StoreM () +``` + +--- + +### Phase 4 — Runtime Optimization + +Optional: + +* memoized load +* DAG-preserving runtime +* step counter in evaluator + +--- + +## What NOT To Do + +Do NOT: + +* hash full trees anymore +* store serialized `T` directly +* allow multiple encodings +* include runtime state in nodes +* depend on evaluation for hashing + +--- + +## Testing Requirements + +Add tests for: + +### Identity + +```text +same term -> same hash +``` + +### Deduplication + +```text +Fork A A stores A once +``` + +### Roundtrip + +```text +T -> hash -> loadTree -> T (equal) +``` + +### Cross-run stability + +Hash must not change between runs + +--- + +## Optional Enhancements + +Not required for initial implementation: + +* lazy loading +* partial fetch (networked store) +* compression at storage layer +* typed wrappers +* DAG-aware evaluator + +--- + +## Key Insight + +You are not storing programs anymore. + +You are storing: + +```text +a canonical graph of computation +``` + +Everything else (execution, wire, language) sits on top. + +--- + +## Success Criteria + +You know this is working when: + +* identical subtrees collapse globally +* hashes are stable across runs +* small programs reuse large portions of structure +* runtime can reconstruct and execute correctly +* wire bundles can reconstruct store elsewhere + +--- + +## Final Mental Model + +```text +Authoring: tricu source +Lowering: Tree Calculus (T) + +Identity: Merkle hash(root) + +Storage: Merkle DAG (node store) + +Wire: compressed node bundles + +Execution: reconstructed graph → reduce +``` + +--- + +If anything is unclear during implementation, prioritize: + +```text +determinism > simplicity > performance +``` + +In order to run tests, simply `nix build .#`. All tests must pass without modification. diff --git a/src/ContentStore.hs b/src/ContentStore.hs index 7b15626..24cfb73 100644 --- a/src/ContentStore.hs +++ b/src/ContentStore.hs @@ -20,6 +20,11 @@ import qualified Data.Map as Map import qualified Data.Serialize as Cereal import qualified Data.Text as T +data StoredNode = StoredNode ByteString deriving (Show) + +instance FromRow StoredNode where + fromRow = StoredNode <$> field + data StoredTerm = StoredTerm { termHash :: Text , termNames :: Text @@ -60,6 +65,9 @@ initContentStore = do \tags TEXT DEFAULT '')" execute_ conn "CREATE INDEX IF NOT EXISTS terms_names_idx ON terms(names)" execute_ conn "CREATE INDEX IF NOT EXISTS terms_tags_idx ON terms(tags)" + execute_ conn "CREATE TABLE IF NOT EXISTS merkle_nodes (\ + \hash TEXT PRIMARY KEY, \ + \node_data BLOB NOT NULL)" return conn getContentStorePath :: IO FilePath @@ -92,35 +100,83 @@ deserializeTerm :: ByteString -> Either String T deserializeTerm = Cereal.decodeLazy . LBS.fromStrict hashTerm :: T -> Text -hashTerm = T.pack . show . (hash :: ByteString -> Digest SHA256) . serializeTerm +hashTerm = nodeHash . buildMerkle storeTerm :: Connection -> [String] -> T -> IO Text storeTerm conn newNamesStrList term = do + putStrLn $ "DEBUG: storeTerm called for names " ++ show newNamesStrList let termBS = serializeTerm term termHashText = hashTerm term newNamesTextList = map T.pack newNamesStrList metadataText = T.pack "{}" - + putStrLn $ "DEBUG: termHash = " ++ T.unpack termHashText + -- Store all Merkle nodes for this term + putStrLn "DEBUG: storing merkle nodes" + _ <- storeMerkleNodes conn term + putStrLn "DEBUG: merkle nodes stored, querying existing" existingNamesQuery <- query conn "SELECT names FROM terms WHERE hash = ?" (Only termHashText) :: IO [Only Text] + putStrLn $ "DEBUG: query result = " ++ show (length existingNamesQuery) ++ " results" case existingNamesQuery of [] -> do + putStrLn "DEBUG: inserting new term" let allNamesToStore = serializeNameList newNamesTextList execute conn "INSERT INTO terms (hash, names, term_data, metadata, tags) VALUES (?, ?, ?, ?, ?)" (termHashText, allNamesToStore, termBS, metadataText, T.pack "") + putStrLn "DEBUG: insert complete" [(Only currentNamesText)] -> do + putStrLn $ "DEBUG: updating existing term, current names = " ++ T.unpack currentNamesText let currentNamesList = parseNameList currentNamesText let combinedNamesList = currentNamesList ++ newNamesTextList let allNamesToStore = serializeNameList combinedNamesList execute conn "UPDATE terms SET names = ?, metadata = ? WHERE hash = ?" (allNamesToStore, metadataText, termHashText) + putStrLn "DEBUG: update complete" + _ -> error $ "Multiple terms with same hash? " ++ show (length existingNamesQuery) + putStrLn "DEBUG: storeTerm returning" return termHashText +-- | Store all nodes of a Merkle DAG by traversing the Term and building/storing nodes. +-- Returns the hash of the root node. +-- This function builds the Merkle DAG while storing nodes, ensuring children are +-- stored before their parents. +storeMerkleNodes :: Connection -> T -> IO MerkleHash +storeMerkleNodes _ Leaf = return $ nodeHash NLeaf -- NLeaf hash is constant +storeMerkleNodes conn (Stem t) = do + -- Recursively process child first and get its hash + childHash <- storeMerkleNodes conn t + -- Store this node + let thisNode = NStem childHash + putMerkleNode conn thisNode + return $ nodeHash thisNode +storeMerkleNodes conn (Fork l r) = do + -- Recursively process both children first and get their hashes + leftHash <- storeMerkleNodes conn l + rightHash <- storeMerkleNodes conn r + -- Store this node + let thisNode = NFork leftHash rightHash + putMerkleNode conn thisNode + return $ nodeHash thisNode + + +-- | Insert a Merkle node into the store (idempotent). +putMerkleNode :: Connection -> Node -> IO () +putMerkleNode conn node = + execute conn "INSERT OR IGNORE INTO merkle_nodes (hash, node_data) VALUES (?, ?)" + (nodeHash node, serializeNode node) + +-- | Retrieve a Merkle node by its hash. +getNodeMerkle :: Connection -> MerkleHash -> IO (Maybe Node) +getNodeMerkle conn h = + queryMaybeOne conn "SELECT node_data FROM merkle_nodes WHERE hash = ?" (Only h) >>= \case + Just (StoredNode bs) -> return $ Just (deserializeNode bs) + Nothing -> return Nothing + hashToTerm :: Connection -> Text -> IO (Maybe StoredTerm) hashToTerm conn hashText = queryMaybeOne conn (selectStoredTermFields <> " WHERE hash = ?") (Only hashText) diff --git a/src/Research.hs b/src/Research.hs index 781208e..43683dd 100644 --- a/src/Research.hs +++ b/src/Research.hs @@ -1,12 +1,17 @@ module Research where +import Data.ByteArray (convert) +import Data.Char (chr, ord) import Data.List (intercalate) import Data.Map (Map) -import Data.Text (Text, replace) +import Data.Text (Text, replace, unpack) +import Data.Word (Word8) import System.Console.CmdArgs (Data, Typeable) +import qualified Data.ByteString as BS import qualified Data.Map as Map -import qualified Data.Text as T +import qualified Data.Text as T +import Crypto.Hash (hash, SHA256, Digest) -- Tree Calculus Types data T = Leaf | Stem T | Fork T T @@ -54,6 +59,81 @@ data EvaluatedForm = TreeCalculus | FSL | AST | Ternary | Ascii | Decode -- Environment containing previously evaluated TC terms type Env = Map.Map String T +-- Merkle DAG Node types +-- Each Tree Calculus node becomes a content-addressed object. + +type MerkleHash = Text + +data Node + = NLeaf + | NStem MerkleHash + | NFork MerkleHash MerkleHash + deriving (Show, Eq, Ord) + +-- | Canonical serialization of a Node for hashing. +-- Leaf: 0x00 +-- Stem: 0x01 || child_hash (32 bytes) +-- Fork: 0x02 || left_hash (32 bytes) || right_hash (32 bytes) +serializeNode :: Node -> BS.ByteString +serializeNode NLeaf = BS.pack [0x00] +serializeNode (NStem h) = BS.pack [0x01] <> hexToBytes h +serializeNode (NFork l r) = BS.pack [0x02] <> hexToBytes l <> hexToBytes r + +-- | Hash a node per the Merkle content-addressing spec. +-- hash = SHA256( "tricu.merkle.node.v1" <> 0x00 <> node_payload ) +nodeHash :: Node -> MerkleHash +nodeHash node = bytesToHex (sha256WithPrefix (serializeNode node)) + where sha256WithPrefix payload = + convert . (hash :: BS.ByteString -> Digest SHA256) $ utf8Tag <> BS.pack [0x00] <> payload + utf8Tag = BS.pack $ map fromIntegral $ BS.unpack "tricu.merkle.node.v1" + +-- | Convert a Hex Text hash into raw 32-byte ByteString +hexToBytes :: Text -> BS.ByteString +hexToBytes h = BS.pack $ map (uncurry combinePair) pairs + where + pairs :: [(Char, Char)] + pairs = zip (unpack h) (drop 1 $ unpack h) + combinePair :: Char -> Char -> Word8 + combinePair c1 c2 = fromIntegral (nib1 * 16 + nib2) + where nib1 = hexDigitToInt c1 + nib2 = hexDigitToInt c2 + hexDigitToInt :: Char -> Int + hexDigitToInt c + | '0' <= c && c <= '9' = ord c - ord '0' + | 'a' <= c && c <= 'f' = ord c - ord 'a' + 10 + | 'A' <= c && c <= 'F' = ord c - ord 'A' + 10 + | otherwise = error $ "Invalid hex digit: " ++ show c + +-- | Deserialize a Node from canonical bytes. +deserializeNode :: BS.ByteString -> Node +deserializeNode bs + | BS.null bs || BS.head bs == 0x00 = NLeaf + | BS.head bs == 0x01 = NStem $ bytesToHex (BS.take 32 (BS.drop 1 bs)) + | BS.head bs == 0x02 = NFork (bytesToHex (BS.take 32 (BS.drop 1 bs))) + (bytesToHex (BS.take 32 (BS.drop 33 (BS.drop 1 bs)))) + | otherwise = error $ "Unknown node type tag: " ++ show (BS.head bs) + +-- | Convert 32-byte ByteString back to hex Text +bytesToHex :: BS.ByteString -> Text +bytesToHex bs = T.pack $ concatMap byteToHexChars $ BS.unpack bs + where + byteToHexChars :: Word8 -> String + byteToHexChars w = [hexDigit (fromIntegral w `div` 16), hexDigit (fromIntegral w `mod` 16)] + hexDigit :: Int -> Char + hexDigit n + | n < 10 = chr (ord '0' + n) + | otherwise = chr (ord 'a' + n - 10) + +-- | Build a Merkle DAG from a Tree Calculus term. +buildMerkle :: T -> Node +buildMerkle Leaf = NLeaf +buildMerkle (Stem t) = NStem (nodeHash child) + where child = buildMerkle t +buildMerkle (Fork l r) = NFork (nodeHash left) (nodeHash right) + where + left = buildMerkle l + right = buildMerkle r + -- Tree Calculus Reduction Rules {- The t operator is left associative. diff --git a/tricu.cabal b/tricu.cabal index 992e9f8..20fc8a7 100644 --- a/tricu.cabal +++ b/tricu.cabal @@ -39,6 +39,7 @@ executable tricu , fsnotify , haskeline , megaparsec + , memory , mtl , sqlite-simple , tasty @@ -82,6 +83,7 @@ test-suite tricu-tests , fsnotify , haskeline , megaparsec + , memory , mtl , sqlite-simple , tasty