PathMap as a Collection

A simplified way to begin describing [PathMap] is as a collection type with an interface and behaviors similar to std collections such as [HashMap]. PathMap keys can be any type that is interpretable as a string of bytes, and its value type is a generic type parameter implementing Clone + Send + Sync. A value can be set at any unique path and later retrieved.

Therefore PathMap<V> has a superset of the functionality offered by HashMap<Vec<u8>, V>.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;
let mut map = PathMap::<usize>::new();
map.insert(b"arrow", 0);
map.insert(b"bow", 1);
map.insert(b"cannon", 2);
}

The Power of Paths

Namespaces are one honking great idea - let's do more of those! - Tim Peters, The Zen of Python

Unlike hashable keys, paths have the property that they are fractal. That is, one path may be a prefix to another path, and therefore any path may prefix the contents of an entire tree (or trie) of paths beneath it.

This property allows for composition: for example, one map can be embedded (aka grafted) within another, or a subtrie can be removed or copied from a map into a stand-alone map with a single operation.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};
let mut sub_map = PathMap::new();
sub_map.insert(b"arrow", ());
sub_map.insert(b"bow", ());
sub_map.insert(b"cannon", ());

let mut map = PathMap::new();
map.insert(b"armor:shield", ());
map.insert(b"armor:helmet", ());
let mut wz = map.write_zipper();
wz.descend_to(b"weapons:");
wz.graft_map(sub_map);
}

Structural Sharing

Storage inside PathMap makes use of references that can point to data within the same map or across multiple maps. This means that it is possible to construct a map with far more values than could fit in memory if they were all stored individually.

For example, below is a diagram representing a map that contains every possible 4-byte-long path composed of the characters ['a', 'b', 'c', 'd']. As you can see in the graph, only 16 path bytes are stored within the structure. Without structural sharing, it would be necessary to store 256 bytes (4^4) to represent the same tree.
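
As a concrete illustration, the sketch below builds that same map using only the insert API introduced earlier (how much sharing the resulting in-memory representation actually achieves is an internal detail):

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;

// Insert all 256 (4^4) four-byte paths over ['a', 'b', 'c', 'd']
let mut map = PathMap::<()>::new();
for a in b"abcd" {
    for b in b"abcd" {
        for c in b"abcd" {
            for d in b"abcd" {
                map.insert(&[*a, *b, *c, *d], ());
            }
        }
    }
}
assert!(map.path_exists_at(b"abcd"));
}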

For a real-world example, see the OEIS section in the [Examples section (GOAT, fix this link)].

NOTE: The viz feature of the crate provides tools to visualize and inspect tries with their structural sharing.

PathMap Structure

The intro chapter described [PathMap] as a collection of values, but crucially it is actually a tree of paths. In other words, paths can exist by themselves in the absence of values.

A structure of paths is required to host values, but values are entirely optional. A path that terminates without any value or downstream children is still a valid path, and is known as a "dangling" path.

Creating and Removing Paths

PathMap provides several methods to manage path structure independently of values.

Creating Paths

[create_path] creates a dangling path (without a value) in the trie. This is useful when you need to establish path structure before adding values, or when you want to mark certain paths as existing without storing data at those locations.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;

let mut map = PathMap::<i32>::new();
assert!(!map.path_exists_at(b"path/to/data"));

// Create the path structure without a value
map.create_path(b"path/to/data");
assert!(map.path_exists_at(b"path/to/data"));
assert_eq!(map.get_val_at(b"path/to/data"), None); // No value stored
}

Checking Path Existence

[path_exists_at] checks whether a specific path exists in the trie, regardless of whether it has an associated value.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;

let mut map = PathMap::new();
map.insert(b"existing/path", 42);

assert!(map.path_exists_at(b"existing/path"));
assert!(map.path_exists_at(b"existing"));      // Parent path also exists
assert!(!map.path_exists_at(b"nonexistent"));
}

Removing Path Structure

[remove_branches_at] removes all child paths below the specified path. The prune parameter controls whether the path itself should be removed if it becomes dangling.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;

let mut map = PathMap::new();
map.insert(b"base/branch1/leaf", 1);
map.insert(b"base/branch2/leaf", 2);

// Remove all branches under "base", keeping "base" as dangling path
map.remove_branches_at(b"base", false);
assert!(map.path_exists_at(b"base"));
assert!(!map.path_exists_at(b"base/branch1"));
}

[prune_path] removes dangling path segments, working upward from the specified path until it reaches a path with a value or multiple children. This is useful for cleaning up empty path structure.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;

let mut map = PathMap::<()>::new();
map.create_path(b"long/dangling/path/chain");

// Prune the dangling path - removes the entire chain since no values exist
let removed_bytes = map.prune_path(b"long/dangling/path/chain");
assert_eq!(removed_bytes, 24); // Length of the entire path
assert!(!map.path_exists_at(b"long"));
}

Algebraic Operations on Whole Maps

[PathMap] supports efficient operations to modify maps or construct new maps, using existing maps as operands. From these primitive operations, it is possible to construct and evaluate complex queries over data represented in a PathMap. For a real-world example, see the [Aunt Knowledge Graph (GOAT, fix this link)] example section.

These operations, along with the rules and theorems that apply to them, are referred to as the "path algebra".

Join (Union)

join creates the union of multiple tries, so a path present in any operand will be present in the result.

Example

operand_0:

books:don_quixote
books:great_gatsby,the
movies:casablanca

operand_1:

books:moby_dick
movies:star_wars
music:take_the_a_train

result:

books:don_quixote
books:great_gatsby,the
books:moby_dick
movies:casablanca
movies:star_wars
music:take_the_a_train
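
As a sketch in code, the same join can be performed with the [join_map] method described later under Algebra on Subtries, applied through a write zipper parked at the map root (the exact signatures here are assumptions based on those descriptions):

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut operand_0 = PathMap::<()>::new();
operand_0.insert(b"books:don_quixote", ());
operand_0.insert(b"movies:casablanca", ());

let mut operand_1 = PathMap::<()>::new();
operand_1.insert(b"books:moby_dick", ());
operand_1.insert(b"music:take_the_a_train", ());

// Join operand_1 into operand_0 at the root; paths from either
// operand will be present in the result
let mut wz = operand_0.write_zipper();
wz.join_map(operand_1);
drop(wz);

assert!(operand_0.path_exists_at(b"books:moby_dick"));
assert!(operand_0.path_exists_at(b"movies:casablanca"));
}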

Meet (Intersection)

meet intersects multiple tries, so a path present in all operands will be present in the result.

Example

operand_0:

books:great_gatsby,the
books:moby_dick
movies:casablanca
music:take_the_a_train

operand_1:

books:don_quixote
books:great_gatsby,the
movies:casablanca
movies:star_wars

result:

books:great_gatsby,the
movies:casablanca
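
A corresponding sketch using [meet_into] from the Algebra on Subtries section; passing the second operand as a borrowed read zipper is an assumption, and the value type must support the Lattice operations discussed below:

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut operand_0 = PathMap::<()>::new();
operand_0.insert(b"books:moby_dick", ());
operand_0.insert(b"movies:casablanca", ());

let mut operand_1 = PathMap::<()>::new();
operand_1.insert(b"books:don_quixote", ());
operand_1.insert(b"movies:casablanca", ());

// Meet operand_1 into operand_0; only paths present in both remain
let mut wz = operand_0.write_zipper();
wz.meet_into(&operand_1.read_zipper());
drop(wz);

assert!(operand_0.path_exists_at(b"movies:casablanca"));
assert!(!operand_0.path_exists_at(b"books:moby_dick"));
}

The subtract and restrict operations below admit the same zipper-based pattern, via [subtract_into] and [restrict] respectively.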

Subtract

subtract removes all paths in one trie from another, so a path present in the lvalue trie that is not present in the rvalue trie will be present in the result. Conceptually, subtract acts like a meet between the lvalue and the inverse of the rvalue - although currently there is no inverse operation for PathMap.

Example

lvalue:

books:don_quixote
books:great_gatsby,the
books:moby_dick
movies:casablanca
movies:star_wars
music:take_the_a_train

rvalue:

books:don_quixote
books:moby_dick
movies:star_wars

result:

books:great_gatsby,the
movies:casablanca
music:take_the_a_train

Restrict

restrict removes paths from one trie that do not have a corresponding prefix in another trie. You can conceptualize restrict as a generalization of meet, where every path in the rvalue ends in a "wildcard".

Example

lvalue:

books:fiction:don_quixote
books:fiction:great_gatsby,the
books:fiction:moby_dick
books:non-fiction:brief_history_of_time
movies:classic:casablanca
movies:sci-fi:star_wars
music:take_the_a_train

rvalue:

books:fiction:
movies:sci-fi:

result:

books:fiction:don_quixote
books:fiction:great_gatsby,the
books:fiction:moby_dick
movies:sci-fi:star_wars

Join K Path, aka Drop Head

join_k_path collapses the first k bytes from all paths, joining together the sub-tries as it proceeds.

Example

lvalue:

books:don_quixote
books:great_gatsby,the
books:moby_dick

result of join_k_path(6) (dropping the first 6 bytes: books:):

don_quixote
great_gatsby,the
moby_dick

Lattice and DistributiveLattice trait

The Lattice trait and related traits in the ring module are used to define join and meet behaviors on values within a PathMap. So two values at the same path may interact with each other as part of these operations.

WARNING: This mechanism is planned for rework, to allow a "policy" to be provided instead of forcing a Lattice implementation to be associated to the value type. Therefore the current behaviors will not be explained further.

A Zipper is a Cursor in a Trie

Zippers are persistent pointers that refer to a location within a trie, called the zipper's focus. When an operation is applied through a zipper, it accesses and/or acts on the focus. The advantage of zippers is that they don't require re-traversal of the in-memory data structure for each operation.

A zipper is most often used to point to a location within a [PathMap]; however, a zipper may also operate within other kinds of tries. One example is the [ACT] on-disk file format, which allows a trie to be explored without requiring the entire data structure to be loaded into memory. Another example is a virtualized trie, such as an infinite trie that is defined parametrically.

Zipper Traits

Zipper capabilities are defined across a number of traits. There are many different zipper objects depending on the function of the zipper and the underlying data structure it accesses. Therefore it makes sense to treat zippers as generic types with the feature-specific trait bounds you require.

Trait                                 Purpose / Description
Zipper                                Inspect the trie structure
ZipperConcrete                        Inspect structural sharing
ZipperSubtries                        Make a map from a subtrie, or provide an operand to algebraic ops
ZipperReadOnlySubtries                Create [TrieRef] objects
ZipperForking                         Create temporary zippers
ZipperValues                          Access values
ZipperReadOnlyValues                  Access values with extended lifetime
ZipperReadOnlyConditionalValues       Access values with witness pattern
ZipperAbsolutePath                    Get more complete path information
ZipperPathBuffer                      Control zipper's internal buffer allocation
ZipperMoving                          Move the zipper's focus within the trie
ZipperIteration                       Traverse / explore the trie
ZipperReadOnlyIteration               Iterate and borrow values in the trie
ZipperReadOnlyConditionalIteration    Iterate trie values with witness pattern
ZipperWriting                         Modify the trie structure and values, and perform algebraic ops

Zipper and Zipper-like Objects

The pathmap crate provides a number of zipper and zipper-like objects with different capabilities and performance characteristics. It is also permissible to implement most (but not all) of the zipper traits on outside objects. Therefore this is not a complete set of zippers, but includes the core types that work directly on [PathMap] tries.

Object               Lifetime   Writing   Moving   Cost to Create   Primary Use Case
[PathMap]            'static    Yes       No       Med              Root data structure, stand-alone trie
[TrieRefBorrowed]    Borrowed   No        No       Ultra Low        Quick focus access, no navigation
[TrieRefOwned]       'static    No        No       Low              Focus access without lifetime, no navigation
[ReadZipper]         Borrowed   No        Yes      Med              Reading, navigation and iteration
[ReadZipperOwned]    'static    No        Yes      High             Reading, navigation and iteration, send across threads
[WriteZipper]        Borrowed   Yes       Yes      Med              Trie modifications
[WriteZipperOwned]   'static    Yes       Yes      High             Trie modifications, send across threads

The Base Zipper Trait

The [Zipper] trait forms the foundation of all zipper types in PathMap. It provides the minimal interface needed to inspect the structure of the trie at the zipper's focus. It is implemented on all zipper types.

The base trait defines three essential types of operations which are fundamental to all pathmap tries:

Path Existence

  • [path_exists] returns whether the zipper's focus is positioned on a valid path within the trie. This is fundamental since zippers can be positioned on non-existent paths - locations that could potentially hold data but currently don't.

Value Presence

  • [is_val] indicates whether there is a value stored at the current focus position. Note that a path can exist in the trie structure without necessarily having a value, and the trie may continue deeper from this point regardless of whether a value is present or not.

Child Branches

Two methods provide information about the trie structure below the current focus:

  • [child_count] returns the number of child branches extending from the current node.
  • [child_mask] returns a 256-bit mask indicating exactly which byte values have corresponding child branches.

ZipperConcrete

The [ZipperConcrete] trait is implemented on zippers that traverse in-memory trie structures, as opposed to virtual tries or abstract projections. It primarily provides the method to inspect structural sharing.

  • [is_shared] returns whether the zipper's focus is at a location that may be accessed via multiple distinct paths (including from other maps). This information is useful for optimization but should never be relied upon for correctness, as sharing behavior can change due to internal operations like thread isolation or representation changes.

ZipperSubtries

The [ZipperSubtries] trait provides methods for zippers that can access concrete subtries within [PathMap]s. This trait is required to use a zipper as a source for graft or to provide arguments to the zipper algebraic operations.

  • [make_map] creates a new [PathMap] containing everything below the zipper's focus. This allows extraction of the subtrie as an independent data structure.

ZipperReadOnlySubtries

The [ZipperReadOnlySubtries] trait extends [ZipperSubtries] for read-only zippers, providing a mechanism to create [TrieRef] objects.

  • [trie_ref_at_path] returns a [TrieRef] for a specified path relative to the current focus.

ZipperForking

The [ZipperForking] trait enables creating temporary zippers from existing zippers.

  • [fork_read_zipper] creates a new read-only zipper with its root at the current focus. The returned zipper implements [ZipperAbsolutePath] + [ZipperIteration] + [ZipperValues]. This is useful for constructing child zippers with different roots, creating read zippers from other zipper types, or creating zippers to pass to functions that take ownership. However, cloning the parent zipper is often preferable when you need to maintain the original zipper.

Value Access

Several traits provide access to values stored in the trie; the difference between them is the duration for which the value reference must be borrowed.

  • [ZipperValues] is available on most zipper types, and provides short-lived access to values.
  • [ZipperReadOnlyValues] is only available on zippers that cannot modify the trie, and provides access to the same values but with a lifetime that allows the zipper to be moved while the borrow is still active.
  • [ZipperReadOnlyConditionalValues] makes lifetime guarantees somewhere in between. It allows safe access and some amount of zipper movement, while ensuring it's impossible to modify or delete a value while it is actively borrowed.

ZipperValues

The [ZipperValues] trait provides most zippers with the ability to access values of type V at the focus position.

Basic Value Access

The [val] method returns an Option<&V> - Some containing a reference to the value if one exists at the focus, or None if the path exists but has no associated value.

The returned reference borrows the zipper itself, which means you may not modify the zipper (e.g. move its focus) while the borrow is active. For longer-lived references, use [ZipperReadOnlyValues] or [ZipperReadOnlyConditionalValues].
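
A short sketch of the basic access pattern:

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut map = PathMap::new();
map.insert(b"greeting", "hello");

let mut z = map.read_zipper();
z.descend_to(b"greeting");
assert_eq!(z.val(), Some(&"hello"));

// The parent path exists, but holds no value
z.ascend(1);
assert_eq!(z.val(), None);
}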

ZipperReadOnlyValues

The [ZipperReadOnlyValues] trait provides the same value access as [ZipperValues] but with a lifetime parameter that is tied to the source of the value, irrespective of the zipper. The zipper may be modified or even dropped, and the borrow remains valid.

This trait is implemented on read-only zipper types where the underlying data structure's lifetime can be guaranteed, and will never be implemented alongside [ZipperWriting] on the same type.

ZipperReadOnlyConditionalValues

The [ZipperReadOnlyConditionalValues] trait provides an intermediate approach between [ZipperValues] and [ZipperReadOnlyValues]. It uses a witness object to ensure the underlying values remain intact, while still allowing zipper modification.

Witness-Based Access

The trait introduces a [witness] method that returns a [WitnessT] type, which acts as a proof that the underlying data remains valid. The [get_val_with_witness] method then provides a reference with the witness's lifetime rather than relying on a borrow from the zipper.

This design allows the zipper to be moved while ensuring that any borrowed values remain valid as long as the witness remains active.

Example Usage

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

// Create a PathMap and populate it
let mut map = PathMap::new();
map.insert(b"hello", "world");
map.insert(b"hello/nested", "value");

// Convert to a read-only zipper
let mut zipper = map.into_read_zipper(b"");

// Move to a position with a value
zipper.descend_to(b"hello");
assert!(zipper.is_val());

// Create a witness to enable extended lifetime access
let witness = zipper.witness();

// Get a reference using the witness
let value_ref = zipper.get_val_with_witness(&witness).unwrap();

// The zipper can now be moved while the value reference remains valid
zipper.descend_to(b"/nested");
assert!(zipper.is_val());

// The original value reference is still valid because the witness is alive
assert_eq!(*value_ref, "world");

// Get another value using the same witness
let nested_ref = zipper.get_val_with_witness(&witness).unwrap();
assert_eq!(*nested_ref, "value");

// Both references remain valid until the witness is dropped, even after the
// zipper is dropped
drop(zipper);
assert_eq!(*value_ref, "world");
assert_eq!(*nested_ref, "value");
}

Paths and Absolute Paths

Some zippers maintain a path buffer and can expose their focus position within the trie as a path. The following traits expose access to the path buffer in some way.

Zipper Relative Path

The [path] method in [ZipperMoving] returns the current path from the zipper's root to its focus as a byte slice. This represents the sequence of bytes traversed to reach the current focus position.

Note that this path is relative to the zipper's root, which may not be the same as the absolute path from the original data structure's root if the zipper was created with a prefix or from a subtrie.

ZipperAbsolutePath

The [ZipperAbsolutePath] trait provides methods to access a more complete path slice, the [origin_path], which may extend above the zipper's root.

Origin and Root Prefix Paths

The trait provides two key methods:

  • [origin_path] returns the entire path from the zipper's origin to its current focus. The origin depends on how the zipper was created - for zippers created directly from a [PathMap], this starts at the map's root.

  • [root_prefix_path] returns the path from the zipper's origin to the zipper's root. This value remains constant throughout the zipper's life and is unaffected by the focus position.

The relationship between these paths is: origin_path() == root_prefix_path() + path().

After calling [reset] to return to the zipper's root, root_prefix_path() equals origin_path().
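
The sketch below demonstrates the relationship, assuming a read_zipper_at_path constructor that mirrors the write_zipper_at_path method used elsewhere in this book:

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut map = PathMap::new();
map.insert(b"config:net:port", 8080);

// A zipper whose root sits at "config:" inside the larger map
let mut z = map.read_zipper_at_path(b"config:");
z.descend_to(b"net:port");

assert_eq!(z.path(), b"net:port");               // relative to the zipper's root
assert_eq!(z.root_prefix_path(), b"config:");    // constant for the zipper's life
assert_eq!(z.origin_path(), b"config:net:port"); // root_prefix_path() + path()
}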

ZipperPathBuffer

The [ZipperPathBuffer] trait provides low-level control over the zipper's internal path buffer management. This trait is primarily used for performance optimization.

Buffer Management

The trait offers methods for direct buffer manipulation:

  • [prepare_buffers] ensures the path buffer is allocated to facilitate zipper movement. Normally buffer allocation is done lazily.
  • [reserve_buffers] grows the internal buffers to accommodate larger paths and deeper node stacks, but never shrinks them.
  • [origin_path_assert_len] provides unsafe access to the path buffer beyond its current length.

The buffer management is designed to minimize allocations during zipper operations by pre-allocating the buffers for expected path lengths and traversal depths, so that the buffer can be sized once, instead of incurring multiple reallocs.

ZipperMoving

Some zippers allow the focus to be moved, which is done using the methods in the [ZipperMoving] trait. It is almost always faster to move a zipper's focus than to create a new zipper. Moving to a longer path is called descending while moving to a shorter path is called ascending.

The ZipperMoving trait provides methods to navigate through the trie. It provides both fine-grained single-step movement and higher-level jumping methods.

Descending is not limited in any way, and moving the focus to non-existent paths is one of the ways new paths are created by zippers with writing capability. When ascending, the zipper's focus cannot be moved above the zipper's root. The root is the highest possible position for the zipper's focus.

Reset and the Root Position

  • [reset] moves the focus back to the zipper's root.
  • [at_root] checks if the zipper is at its root position.

Absolute Positioning

  • [move_to_path] moves the zipper's focus directly to a specific path relative to the zipper's root.

Stepping vs. Jumping

Zippers may be descended and ascended either by stepping an absolute number of elements, or by jumping to features such as branches, values, or the end of the path. In general, moving by jumping will be faster than stepping.

Stepping Methods

These methods move the zipper by specific byte distances:

  • [descend_to_byte] descends down one level to the specified child byte
  • [descend_indexed_byte] descends one byte to the nth child branch (by index in the child mask)
  • [descend_first_byte] moves down to the first child branch
  • [ascend] moves up a specified number of bytes toward the root
  • [ascend_byte] moves up exactly one byte (equivalent to ascend(1))

Jumping Methods

These methods move the zipper to positions based on trie features:

  • [descend_to] moves to a path relative to the current focus position
  • [descend_to_existing] follows a path, stopping if a non-existent path byte is encountered
  • [descend_to_val] similar to descend_to_existing but also stops if a value is encountered
  • [descend_until] descends until a position with multiple children or a value is encountered
  • [ascend_until] ascends until reaching a position with multiple children, a value, or the root
  • [ascend_until_branch] ascends until reaching a position with multiple children, skipping values

Sibling Navigation

  • [to_next_sibling_byte] moves to the next sibling at the current level
  • [to_prev_sibling_byte] moves to the previous sibling at the current level
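
The sketch below combines stepping and jumping methods from the lists above (the return values, which typically signal whether the move succeeded, are ignored here):

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut map = PathMap::new();
map.insert(b"car", 0);
map.insert(b"cat", 1);

let mut z = map.read_zipper();
z.descend_to(b"ca");        // jump along a multi-byte path
z.descend_first_byte();     // step to the first child branch ('r')
assert_eq!(z.path(), b"car");

z.to_next_sibling_byte();   // move sideways to the 't' branch
assert_eq!(z.path(), b"cat");

z.ascend(1);                // step back up one byte
assert_eq!(z.path(), b"ca");
}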

Iteration using a Zipper

Zippers can facilitate several different iteration strategies to systematically explore a trie.

ZipperIteration

The [ZipperIteration] trait extends [ZipperMoving] to provide methods for traversal of the trie structure.

Path-Byte Based Iteration

  • [to_next_step] moves the zipper to the next existing position in the trie structure, regardless of whether it contains a value. Calling this method repeatedly will cycle through all existing positions in the trie, visiting each position once before visiting any position twice. If the focus is not at a leaf, this method descends. If it is at a leaf, it ascends repeatedly until finding a sibling, then moves to that sibling.

Value Based Iteration

  • [to_next_val] advances the zipper to the next position containing a value, using depth-first traversal. Calling this method repeatedly will visit each position containing a value once, before visiting any position twice.
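
For example, the loop below visits every value in the map (assuming to_next_val returns a bool indicating whether another value was found):

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut map = PathMap::new();
map.insert(b"fruit:apple", 1);
map.insert(b"fruit:banana", 2);

// Depth-first traversal over every value in the trie
let mut z = map.read_zipper();
while z.to_next_val() {
    println!("{} = {:?}", std::str::from_utf8(z.path()).unwrap(), z.val());
}
}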

K-Path Iteration

The k_path methods allow iterating over children at depth k relative to the focus (or starting focus). This enables efficiently visiting all paths at a specific distance from a common root, which is useful for certain algorithmic patterns. For example, if your trie contains a number of 4-byte data types (e.g. record IDs, etc.), you can visit the subtrie below each one using a k_path iteration.

  • [descend_first_k_path] descends k bytes from the current focus, following the first child at each branch until reaching the target depth.
  • [to_next_k_path] moves to the next position at the same depth level, useful for systematically exploring all paths of a given length.

Example: Iterating Fixed-Length Records

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

// Create a map with various 4-byte keys
let mut map = PathMap::new();
map.insert(b"abcd:subtrie1", ());
map.insert(b"abce:subtrie2", ());
map.insert(b"abxy:subtrie3", ());
map.insert(b"wxyz:subtrie4", ());
map.insert(b"ab:", ());  // shorter than 4 bytes

let mut zipper = map.read_zipper();

// Find and iterate all 4-byte paths
if zipper.descend_first_k_path(4) {
    println!("Visited 4-byte path: {:?}",
             std::str::from_utf8(zipper.path()).unwrap());

    // Continue to next 4-byte paths
    while zipper.to_next_k_path(4) {
        println!("Visited 4-byte path: {:?}",
                 std::str::from_utf8(zipper.path()).unwrap());
    }
}

// Output:
// Visited 4-byte path: "abcd"
// Visited 4-byte path: "abce"
// Visited 4-byte path: "abxy"
// Visited 4-byte path: "wxyz"
// Note: "ab:" is skipped (too short), longer paths also skipped
}

ZipperReadOnlyIteration

The [ZipperReadOnlyIteration] trait combines [ZipperReadOnlyValues] with [ZipperIteration] to provide iteration over values, exposing the same pattern as the std Rust Iterator trait.

  • [to_next_get_val] combines the functionality of [to_next_val] with [get_val], returning Some(&'a V) if positioned at a value, or None if iteration is complete.
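
A brief sketch, assuming the map's default read zipper implements this trait:

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut map = PathMap::new();
map.insert(b"a", 1);
map.insert(b"b", 2);

// Iterate the values in the style of a std Iterator
let mut z = map.read_zipper();
let mut sum = 0;
while let Some(v) = z.to_next_get_val() {
    sum += *v;
}
assert_eq!(sum, 3);
}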

ZipperReadOnlyConditionalIteration

The [ZipperReadOnlyConditionalIteration] trait provides the same functionality as ZipperReadOnlyIteration but for zipper types that require a witness type. See [ZipperReadOnlyConditionalValues] for more discussion on the use of witness types.

  • [to_next_get_val_with_witness] behaves like [ZipperReadOnlyIteration::to_next_get_val], but requires a witness to guarantee the lifetime of the returned references.

Modifying the Trie

The [ZipperWriting] trait provides methods to modify the trie structure and values. This trait provides comprehensive functionality for creating, updating, and removing data within the trie.

Direct Value Access

  • [get_val_mut] returns a mutable reference to the value at the focus, or None if no value exists
  • [set_val] sets the value at the focus, returning any previously existing value
  • [remove_val] removes the value at the focus, optionally pruning dangling paths

Upsert Value Access (Create-Or-Update)

  • [get_val_or_set_mut] returns a mutable reference to the value at the focus, inserting the provided default if no value exists
  • [get_val_or_set_mut_with] similar to above, but uses a closure to generate the default value when needed

Path Creation and Removal

  • [create_path] ensures a path exists to the current focus, creating intermediate nodes as needed
  • [prune_path] removes dangling path segments above the focus that have no values or branches
  • [prune_ascend] combines pruning with ascension, moving to the first non-dangling path

Subtrie Shifting

  • [insert_prefix] adds a prefix to all downstream paths from the focus
  • [remove_prefix] removes the specified number of bytes from paths above the focus

Subtrie Moving and Copying

  • [graft] replaces the subtrie below the focus with content from another zipper's focus
  • [graft_map] replaces the subtrie below the focus with the contents of a [PathMap]
  • [take_map] extracts the subtrie below the focus into a new [PathMap], removing it from the original

Branch Removal

  • [remove_branches] removes all child branches below the focus
  • [remove_unmasked_branches] selectively removes branches based on a byte mask

Pruning Behavior

Many operations accept a prune parameter that controls whether dangling paths should be automatically cleaned up. When prune is true, the operation will remove any path segments that become empty, aka "dangling" (no values, and no further downstream branches) as a result of the modification.

Automatic pruning helps maintain a compact trie structure by removing unnecessary nodes, but can be counter-productive when you plan to perform additional operations that might reuse those paths.
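
The sketch below exercises a few of the methods above (the prune flag on remove_val follows the description in this section; exact signatures are assumptions):

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let mut map = PathMap::<i32>::new();
let mut wz = map.write_zipper();

// Descending to a non-existent path and setting a value creates the path
wz.descend_to(b"counters:hits");
wz.set_val(1);
if let Some(v) = wz.get_val_mut() {
    *v += 1; // update in place
}

// Remove the value, pruning the now-dangling path
wz.remove_val(true);
drop(wz);
assert!(!map.path_exists_at(b"counters:hits"));
}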

Algebra on Subtries

The [ZipperWriting] trait includes algebraic operations that operate on subtries below a zipper's focus.

These methods provide the same fundamental operations as the whole-map operations described in Algebraic Operations on Whole Maps, but applied to specific subtries within a larger trie structure. This enables surgical modification of specific portions of large data structures, while leaving the rest of the structure intact.

Join Operations

  • [join_into] performs a join on the subtrie at the current focus with a subtrie from another zipper
  • [join_map] joins a [PathMap] into the current subtrie, consuming the map
  • [join_into_take] joins and consumes the source subtrie from another write zipper
  • [join_k_path_into] performs join_k_path operation on the current subtrie

Meet Operations

  • [meet_into] performs a meet on the subtrie at the current focus with a subtrie from another zipper
  • [meet_2] is an experimental operation that meets the current subtrie with two other subtries simultaneously

Subtract Operations

  • [subtract_into] performs subtract on the subtrie at the current focus, subtracting a subtrie from another zipper

Restrict Operations

  • [restrict] performs restrict on the subtrie at the current focus using a subtrie from another zipper
  • [restricting] populates "stem" paths in the current subtrie with corresponding subtries from the source

Creating Multiple Zippers in the Same Map

Sometimes you need a write-enabled zipper alongside one or more other zippers in the same map. This primarily happens if you want to:

  • Use an algebraic operation to write results into the same map that also holds the arguments
  • Use zippers to parallelize work over multiple threads

Creating multiple zippers in the same map is possible and very handy. That's what the [ZipperHead] object is designed to enable.

Making Multiple Zippers

The normal [PathMap] methods to create write-enabled zippers, i.e. write_zipper and write_zipper_at_path, require an &mut borrow of the map. This prevents a write-enabled zipper from being created while any other zipper simultaneously exists in that map (for reading or writing). This can be very obnoxious, especially when you are working within a single large map.

[ZipperHead] provides an alternative API to create zippers, that allows the exclusivity rules to be checked at runtime, rather than statically by the Rust borrow checker.

NOTE: Using [ZipperHead] methods comes with a high runtime cost relative to using the [PathMap] methods. So don't use a ZipperHead unless you need to.

WARNING: Creating a write zipper from a ZipperHead has the side effect of creating the path up to the zipper's root. Then dropping the write zipper will leave a dangling path if no further trie structure was added by the write zipper. If this is unacceptable, call [cleanup_write_zipper].

Zipper Exclusivity Rules

A zipper is a cursor to read and/or write into a location in a trie. By extension it is also a permission to perform that reading / writing. Analogous to the borrow checker's rules, zippers under a ZipperHead must obey the following:

For any reachable path, there can either be one write-enabled zipper, or many read-only zippers, but never both at the same time.

An example trie containing the paths: ["internal", "internet", "interval", "integer", "integral", "integration", "integrity", "intolerable", "intone"] looks like the diagram below.

Let's create a [ReadZipper] at the path: "integ" and a [WriteZipper] at the path "intern". Notice in the diagram that paths in the trie accessible to the ReadZipper are yellow. Additional ReadZippers may be created on any of these nodes.

Paths accessible to the WriteZipper are red, and they are reserved strictly for this zipper, while it exists.

The paths in blue are free for a zipper of any type to be created.

The paths in grey are not accessible to any zippers. This is because any zipper created at those paths could descend to the exclusive red paths held by the WriteZipper.

Example: Creating Multiple Zippers in a PathMap

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

// Create and populate a map
let mut map = PathMap::new();
map.set_val_at(b"data:0000:value", 100);
map.set_val_at(b"data:0001:value", 200);

// Create a ZipperHead at the map root
let zh = map.zipper_head();

// Now we can create multiple write zippers at different exclusive paths
let mut reader_0 = zh.read_zipper_at_path(b"data:0000:value").unwrap();
let mut reader_1 = zh.read_zipper_at_path(b"data:0001:value").unwrap();
let mut writer_0 = zh.write_zipper_at_exclusive_path(b"data:0000:result").unwrap();
let mut writer_1 = zh.write_zipper_at_exclusive_path(b"data:0001:result").unwrap();

// Each zipper can now independently modify its exclusive region
writer_0.set_val(*reader_0.val().unwrap() * 2);
writer_1.set_val(*reader_1.val().unwrap() * 2);

// Clean up - zippers must be dropped before the ZipperHead
drop(writer_0);
drop(writer_1);
drop(reader_0);
drop(reader_1);
drop(zh);

// Verify the results
assert_eq!(map.get_val_at(b"data:0000:result"), Some(&200));
assert_eq!(map.get_val_at(b"data:0001:result"), Some(&400));
}

Multi-threading Patterns

Many zipper types implement [Send] and/or [Sync], meaning it is legal to send zippers across channels to distribute work to multiple threads working in the same underlying trie. The zipper exclusivity rules uphold the guarantees that prevent data races.

NOTE: Rust's default allocator tends to become the bottleneck when creating and editing lots of trie paths in parallel. Experimentally we have found that jemalloc alleviates this issue, and the jemalloc crate feature can be enabled to switch the default allocator for the process.

Parallel Example

The code below will spawn parallel threads, and make a deep copy of every item in the "in" subtrie into the "out" subtrie. This illustrates just one possible approach to distributing a workload across multiple threads; many other patterns are possible.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let elements = 65535;
let thread_cnt: usize = 4;
let elements_per_thread = elements / thread_cnt;

//Pre-initialize the data in the `PathMap`, for demonstration purposes
let mut map = PathMap::<usize>::new();
let mut zipper = map.write_zipper_at_path(b"in");
for n in 0..thread_cnt {
    for i in (n * elements_per_thread)..((n+1) * elements_per_thread) {
        zipper.descend_to_byte(n as u8);
        zipper.descend_to(i.to_be_bytes());
        zipper.set_val(i);
        zipper.reset();
    }
}
drop(zipper);

let zipper_head = map.zipper_head();

std::thread::scope(|scope| {

    //Allocate channels to send the zippers
    let mut zipper_senders: Vec<std::sync::mpsc::Sender<(ReadZipperTracked<'_, '_, usize>, WriteZipperTracked<'_, '_, usize>)>> = Vec::with_capacity(thread_cnt);
    let mut signal_receivers: Vec<std::sync::mpsc::Receiver<bool>> = Vec::with_capacity(thread_cnt);

    //Spawn all the threads
    for _thread_idx in 0..thread_cnt {
        let (zipper_tx, zipper_rx) = std::sync::mpsc::channel();
        zipper_senders.push(zipper_tx);
        let (signal_tx, signal_rx) = std::sync::mpsc::channel::<bool>();
        signal_receivers.push(signal_rx);

        //=====================================================================================
        // Worker Thread Body
        //=====================================================================================
        scope.spawn(move || {
            loop {

                //The thread will block here waiting for the zippers to be sent
                match zipper_rx.recv() {
                    Ok((mut reader_z, mut writer_z)) => {

                        //We got the zippers, do the stuff
                        let witness = reader_z.witness();
                        while let Some(val) = reader_z.to_next_get_val_with_witness(&witness) {
                            writer_z.descend_to(reader_z.path());
                            writer_z.set_val(*val);
                            writer_z.reset();
                        }

                        //Tell the main thread we're done
                        signal_tx.send(true).unwrap();
                    },
                    Err(_) => {
                        //The zipper_sender channel is closed, meaning it's time to shut down
                        break;
                    }
                }
            }
        });
        //=====================================================================================
        // End of Worker Thread Body
        //=====================================================================================
    }

    let mut writer_z = zipper_head.write_zipper_at_exclusive_path(b"out").unwrap();
    writer_z.remove_branches(true);
    drop(writer_z);

    //Make a ReadZipper and a WriteZipper for each thread, and send them through the channel
    for n in 0..thread_cnt {
        let path = vec![b'o', b'u', b't', n as u8];
        let writer_z = zipper_head.write_zipper_at_exclusive_path(path).unwrap();
        let path = vec![b'i', b'n', n as u8];
        let reader_z = zipper_head.read_zipper_at_path(path).unwrap();

        zipper_senders[n].send((reader_z, writer_z)).unwrap();
    };

    //Wait for the threads to all be done
    for n in 0..thread_cnt {
        assert_eq!(signal_receivers[n].recv().unwrap(), true);
    };
});
drop(zipper_head);
}

Zipper Combinators and Abstract Tries

In the prior Zipper section, we introduced zipper types that move over a concrete trie in memory. However, zippers may also reference (and often move within) tries that are defined parametrically. These are called "abstract" or "virtual" tries.

Many abstract zipper types work using other zippers as parameters. This allows zippers to be combined to create virtual tries with the desired characteristics.

NOTE: The current set of abstract zippers reflects what we needed to build MORK, as opposed to a deliberate process to create a complete API. If you have ideas or needs that aren't addressed, please consider adding other abstract zippers, or reach out and discuss your use case.

PrefixZipper

[PrefixZipper] wraps another zipper and prepends an arbitrary path prefix to the wrapped zipper's space. This allows creating a virtual trie that appears to be rooted at a different location.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let map: PathMap<()> = [(b"A", ()), (b"B", ())].into_iter().collect();
let mut rz = PrefixZipper::new(b"origin.prefix.", map.read_zipper());
rz.set_root_prefix_path(b"origin.").unwrap();

rz.descend_to(b"prefix.A");
assert_eq!(rz.path_exists(), true);
assert_eq!(rz.origin_path(), b"origin.prefix.A");
assert_eq!(rz.path(), b"prefix.A");
}

OverlayZipper

[OverlayZipper] traverses a virtual trie formed by fusing two other zippers. It combines the path structures of both source zippers, and has configurable value mapping to reconcile the values in the final virtual trie.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let map_a: PathMap<&str> = [(b"shared", "from_a"), (b"only_a", "a_value")].into_iter().collect();
let map_b: PathMap<&str> = [(b"shared", "from_b"), (b"only_b", "b_value")].into_iter().collect();

// Default mapping prefers values from the first zipper
let mut overlay = OverlayZipper::new(map_a.read_zipper(), map_b.read_zipper());

// Access value that exists in both - first zipper takes precedence
overlay.descend_to(b"shared");
assert_eq!(overlay.val(), Some(&"from_a"));

// Access value that exists only in first zipper
overlay.reset();
overlay.descend_to(b"only_a");
assert_eq!(overlay.val(), Some(&"a_value"));

// Access value that exists only in second zipper
overlay.reset();
overlay.descend_to(b"only_b");
assert_eq!(overlay.val(), Some(&"b_value"));
}

PolyZipper

[PolyZipper] is a derive macro that enables creating polymorphic zippers - enum-based zippers that can represent different underlying zipper types and dispatch to the appropriate implementation at runtime. This is conceptually similar to using &dyn trait objects, but with enum-based dispatch.

The macro automatically generates implementations for most zipper traits ([Zipper], [ZipperMoving], [ZipperIteration], etc.) by dispatching each method call to the appropriate variant. Note that [ZipperForking] is intentionally not implemented, as the mapping between child zipper types and output types is not always straightforward and should be implemented manually when needed.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

#[derive(PolyZipper)]
enum MyPolyZipper<'trie, V: Clone + Send + Sync + Unpin = ()> {
    Tracked(ReadZipperTracked<'trie, 'trie, V>),
    Untracked(ReadZipperUntracked<'trie, 'trie, V>),
}

let map: PathMap<&str> = [(b"foo", "value1"), (b"bar", "value2")].into_iter().collect();

// Create a PolyZipper from either variant
let mut poly_z = MyPolyZipper::from(map.read_zipper());

// Use like any other zipper
poly_z.descend_to(b"foo");
assert_eq!(poly_z.val(), Some(&"value1"));
}

ProductZipper

[ProductZipper] creates a Cartesian product trie by extending each path in the primary trie with the root of the next secondary trie, recursively for all secondary zippers.

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::{PathMap, zipper::*};

let primary: PathMap<()> = [(b"base-", ())].into_iter().collect();
let secondary: PathMap<()> = [(b"nextA-", ()), (b"nextB-", ()), (b"nextC-", ())].into_iter().collect();

let mut pz = ProductZipper::new(primary.read_zipper(), [secondary.read_zipper(), secondary.read_zipper()]);

// Navigate to value in the third factor
pz.descend_to(b"base-nextB-nextA-");
assert_eq!(pz.focus_factor(), 2);
assert!(pz.is_val());
}

ProductZipper Virtual Trie from Example Above

Morphisms to Construct and Transform a Trie

GOAT

GOAT TODO

Implementing a Database

Behind every database is just somebody else's map structure - Unknown

GOAT: Trie segmentation, schemas, abstracting away the PathMap

GOAT: Managing indices

GOAT: permissions (access rights, but also concurrency)

Concurrency and Multi-Threading

GOAT: Certainly a lot to say. Try to centralize best practices in one place while not repeating too much of what's already in the ZipperHead section.

Design your Encodings Carefully

GOAT Overview on encoding,

GOAT Encoding is not just about representation but also about search. Think through the use cases

GOAT Paths can be thought of as the conceptual equivalent of an index in a database

GOAT Overall structural encoding for heterogeneous data; header data, e.g. length prefix

Canonical Representations for Primitive Types

GOAT: Maybe this doesn't need to be its own section... But it seems like we'll want to say stuff about primitive encoding that transcends a specific primitive

Encoding for Integers

GOAT, Query types (eq, gt, lt, etc.)

GOAT discuss byte count
GOAT discuss signed vs unsigned
GOAT discuss endianness

GOAT sample code to make number paths (basically just atoi)
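
In the meantime, a minimal sketch of one plausible encoding: fixed-width big-endian bytes (the same to_be_bytes encoding used in the parallel example earlier), which makes lexicographic path order match numeric order for unsigned integers:

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;

// Fixed-width big-endian bytes: lexicographic order == numeric order
fn u32_path(n: u32) -> [u8; 4] {
    n.to_be_bytes()
}

let mut map = PathMap::<()>::new();
for n in [1u32, 2, 10, 256, 65536] {
    map.insert(&u32_path(n), ());
}
assert!(map.path_exists_at(&u32_path(256)));
}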

GOAT populating subtrie with range of numbers having maximal sharing
GOAT: explanation of why this is useful
GOAT: sample code to generate sub-trie
GOAT: sample to use a populated sub-trie to query another trie containing values

Encoding Floats

GOAT, overview of what a float is. Separate sign bit, exponent field, mantissa (significand) field

GOAT, Query types (gt, lt, etc.)

GOAT LP: I haven't given this a ton of thought. Adam seemed to think it was very involved. It didn't seem too bad to me, but he's probably right.

GOAT: Mirror the integer section's overview (what to explain), but obviously explain the tricky stuff around floats

Encoding Point Data in Low-Dimensional Space

GOAT, Query types (KNN, (count, sum, avg) in bounds, clustering,)

GOAT, separate indices for each component

Efficient String and Symbol Representation

GOAT, Query types (exact match. What other queries to DBs offer that isn't a full regex (which requires iteration?))

GOAT Interning

GOAT, Intern within another subtrie or map? In an external structure?

GOAT, Maybe package up Remy's interning crate stand-alone

Indexing for Partial and Fuzzy String Search

GOAT, go through different query types

GOAT make paths from transformed representations

GOAT case insensitive...
GOAT substring is done by truncating prefixes
GOAT symspell for any fuzzy match

Discussion on Map Root Values, get_value, set_value, and Zippers

Quite a lot of complexity (and a not-insignificant cost) comes from the outgrowth of the [set_value] / [get_value] API on [PathMap] and the zipper types. This is because it is possible to call them with a zero-length path or with the zipper focus positioned at the root.

Conceptually, the PathMap abstracts a [CoFree; 256] at each byte in the map, which means there is no conceptual location for a root value. In the implementation, it means each PathMap structure also needs to hold an Option<V> in addition to a root node, and each [Zipper] type needs to have a pointer to the root value in addition to the root node.

This means PathMap structures are bigger in memory and zippers are more expensive to create, not to mention all the special-case code paths dealing with root values.

Older versions of the code had the concept of rooted and un-rooted zippers, but this was a perennial foot gun and only partially addressed the problem. It became unworkable with the introduction of the ZipperHead API, and was ultimately removed.

Alternative: Redoing the public API to rework value access

As with many tradeoffs in software, making something simpler in one place often pushes the complexity elsewhere, and this situation is no exception.

So, let's explore what would be involved in getting rid of the root_value (which I take, a priori, to have very little actual utility beyond making the API behave in a consistent and usable way).

Beginning with the API, we would first need to change the get_value / set_value zipper methods to require a path (or at least a byte), as opposed to simply operating on the focus. A zero-length path would need to be disallowed and become a usage error leading to an immediate panic.

But we still need to be able to access a value at every location in the trie (save the absolute map root). So we would do this by positioning the zipper's focus one level above the location the user needs to access.

But how does the user access a value at the root of a zipper (not the map root, but a zipper rooted at a path within the map)? The zipper would need to be lifted up one byte, but the same problem exists at that level too.

This becomes tricky when the [ZipperHead] assigns exclusive paths to zippers. The only way this can work is if multiple zippers are allowed to be positioned at the same focus path, but able to access exclusive (disjoint) sub-paths below that focus.

A simplification of that idea is that a given zipper is allowed to access one and only one specific byte below its root position. This solves the problem, but leads to questions around the correct behavior of the child_mask API, and what should happen if the client attempts to modify an invalid path relative to the zipper's root.

This may still be preferable to the currently-implemented design, however, as it's a more faithful abstraction of the underlying structure and leads to a tighter fast-path.

Alternative 2: Redo the abstract node interface to change CoFree assumption

The node contract expressed through the [TrieNode] trait dictates that a node never holds a value at its root; instead, an associated value is owned by the parent. However, this arrangement is in tension with the external API.

I think my preferred fix would be to change the trait methods to reflect the implications of values associated with node roots.

For example, a ByteNode currently is defined (conceptually) as:

    pub struct ByteNode<V> {
        mask: [u64; 4],
        values: Vec<CoFree<V>>,
    }
    pub struct CoFree<V> {
        rec: Option<TrieNodeODRc<V>>,
        value: Option<V>
    }

Under the new proposal, it'd be defined (conceptually) as:

    pub struct ByteNode<V> {
        root: Option<V>,
        mask: [u64; 4],
        values: Vec<ValOrChild<V>>,
    }
    pub enum ValOrChild<V> {
        Val(V),
        Child(TrieNodeODRc<V>)
    }

Path Forward

Either of these changes has the promise to greatly simplify the code, and probably can reduce zipper movement overheads by a bit (guessing 10% or so, since they don't really affect memory coherence or density, and those are the big wins).

I personally believe Alternative 2 is a better option, not least because I think the external API is actually ok as it is and avoiding big changes to that limits the disruption to the internals of PathMap. However both options imply a massive amount of churn.

Therefore, I think much more complete fuzz tests are a prerequisite to embarking on either change.

Discussion on Inter-Node References & Internal Resource Management

The TrieNodeODRc type is used for all inter-node links. However, it currently is just an Arc<dyn TrieNode<V> + 'static> and has a number of undesirable properties.

  • It is a 128-bit fat pointer. Links comprise a substantial amount of the in-memory footprint of the database, so shrinking this to 64 bits will be worth a lot.
  • The Arc header in the node structure eats a further 16 bytes: 8 bytes for the weak count, which is totally unused, and 8 bytes for the strong count, which has far more range than we need.
  • Keeping a separate refcount on each node means we need to load the cache line for the node's header (which is often the whole node) in order to check the refcount prior to freeing the node. Sometimes this means freeing a trie is nearly as expensive as initializing it.
  • There is no concept of availability that is separate from the refcount.

Ideally we can solve all of these issues together, and lay the groundwork for persistence (writable on-disk tries) at the same time.

Desiderata

  • Support for multiple (polymorphic) node data structures.
  • Refcounts on each (inter-node) link, and maintain the "clone the pointer to increment the refcount" paradigm.
  • Support for the make_mut copy-on-write pattern. I.e. if a referenced node's refcount is 1, then allow mutation, otherwise clone the node to a new location for mutation.
  • Support for multiple "node regions" in which nodes may be referenced and created. (More on node regions in its own section)
  • Refcounts stored centrally so reading or modifying them doesn't thrash the cache.

NOTE(igorm): Storing refcounts separately from the node itself requires storing an additional pointer (or an index at minimum) for each child. My prediction is that any savings from storing it separately will be offset by increasing node or child size by including this pointer. Additionally, it might be difficult to store a freelist of refcounts. We can experiment on this, not sure how it goes.

LP: That's certainly not the design I intended to suggest. A design where a node could point at any arbitrary refcount-storage object, and thus required a pointer, would indeed be very silly. The design I had in mind would infer the location of the refcount from the node location or pointer. For example, one idea is to store all refcounts in a section at the head of each NodeBlock.

Nice to Have, but Ok if it's not possible

  • Clean separation between node format code and resource management code.

Proposal: Node Regions

Structures to own nodes and track dependencies. A node region is a logical object responsible for node life cycles, that manifests in the API.

Features Enabled by Node Regions

  • It should be possible to mmap a file containing nodes, and work with those nodes directly, and create new nodes in the same file if the file is writable.

  • It should also be possible to create new nodes in memory (not backed by a file) that reference nodes from a file. (For example a map that stores intermediate results)

  • It should be possible to copy nodes from memory to a file, and from a file to memory.

NOTE(igorm): what happens if we try to copy in-memory nodes that reference nodes on file?

LP: It depends on where we are copying them to (the destination region). Before going through all the cases, I want to be clear on the distinction between copy (deep copy) and graft (create shallow reference). I also want to establish that I see node region dependencies as being transitive, i.e. if A depends on B, and B depends on C, then A also inherits a dependency on C.

The algorithm must proceed recursively from the source node. Each time a node is encountered that resides in a region among the destination region's dependencies, a simple reference is enough. However, if the node resides in a foreign region (not among the dest region's dependencies), then a deep copy is needed to bring the node into the destination region.

In the shallow graft-by-reference case, no recursive descent is needed. When a deep copy occurs, each child node is visited and, depending on where it resides, it is either referenced or copied.

  • It should be possible to work with multiple files at the same time, and potentially create in-memory structures that reference nodes in multiple files.

Design Ideas for Node Regions

  • Each node region is an object that handles its own resource management. (Tracks all refcounts, Owns memory that stores all nodes).

  • Some portion of a TrieNodeODRc smart pointer would point to a node region, and some other part of the pointer would identify the node within the region.

  • A node region can have dependencies on other node regions, which are created when a graft or similar operation makes a cross-region node reference. There is something like an Arc that keeps a region from deallocating / closing if one of its dependent regions is still active.

  • Node regions have properties that determine the behavior of operations that create references (graft-like operations). For example, a file-backed region could not reference a node in a volatile memory region, so a graft-like operation would force a copy. On the other hand a volatile in-memory region could reference a node in a file region.

It should probably be possible to inspect any node at runtime to determine its region. To avoid bloating the node storage excessively, this would probably be accomplished using a global table of regions, and then we could either put a field on a NodeBlock or potentially use some of the spare bits of the TrieNodeODRc pointer to specify which node_region a given node belongs to.

In addition to nodes, it is also necessary to track node region associations for maps and zippers for several reasons. Primarily, it may be necessary to create new nodes in different regions from the source. For example when making writable copies of nodes on a read-only medium. Also the empty node pointer is a sentinel constant, and therefore not associated with any region.

Types of Node Regions

  • A Transient Region does not track its nodes and thus does not have the ability to cleanup individual nodes. It is used for temporary storage of nodes for intermediate operations. Freeing a Transient Region can be done by dropping all of the backing memory. CAVEAT Non-Copy value types will cause a problem with Transient Regions. We either need to forbid non-Copy value types from Transient Regions, or engineer a values table that permits dropping of values.

  • A Memory Region is intended to be long-lived, but not file-backed. Individual nodes in a Memory Region can be freed, making room for new future nodes. An entire Memory Region can be freed by dropping the backing memory (if the value type is Copy), but a more common usage pattern will be to have one persistent Memory Region for the life of the program.

  • As the name suggests, a File Region is backed by a file. We may want read-write and read-only flavors of File Regions. TBD

Inter-Region Dependencies Allowed

X -> Y       Transient   Memory   File
Transient    Yes         Yes      Yes
Memory       No          Yes      Yes
File         No          No       Yes

In English: "Any region can depend on a file region (and reference its nodes). A file region can only reference nodes within itself."

Node Regions and Allocators

When the nightly feature is enabled, it is possible to use PathMap with custom allocators. In concept, a node region is an extension of the allocator concept, so the right design is probably to forgo the _in flavors of the various methods that take an allocator, and instead pass a node region to the PathMap methods. A node region could then be backed by a custom allocator.

However, the pattern that is pervasive throughout the Rust stdlib, i.e. making the Allocator an optional type parameter with a default value, may not be quite right for node regions. The Rust type system could enforce that a hybrid object is never created; for example, it could prevent creating a new map by joining together two maps with different allocators. But that use case is actually desirable for node regions, so the region dependency needs to be maintained and checked at runtime.

Mixing allocators within the same map, on the other hand, is likely to lead to nothing but pain & UB. So if a region implied an allocator, that would be a problem. Currently the types are configured so it just isn't possible to perform operations that would cause nodes from different allocators to mingle.

So it's an open question what the relationship between allocators and regions needs to be. The most sensible idea, to me, is that allocators are checked at compile time, node regions are checked at runtime, and the type system enforces that allocators are never mingled. The alternative would involve dispatching to allocators dynamically, which sounds like a terrible idea for many reasons.

Proposal: Node Blocks

A structure that lays out multiple nodes in a contiguous block of memory. A node block is a physical object that is not visible at the API level. A node region could use a single node block, or be composed of multiple node blocks.

A node block would have:

  • Partial paths and node information for multiple nodes
  • A table of owned values
  • A table of external links (links that point outside the node block)

Internal links (links from nodes within the node block to other nodes in the same block) could take the form of simple indices, since there may be no need to bump a refcount, and they can be heavily compressed; perhaps 8 bits would be enough to describe any value, external link, or internal node within a 4K block.

NOTE(igorm): there's an alternative we can pursue: use a custom allocator in each node region, which would be transparent over the API, i.e. allocate a specific node type. As for having pointers within blocks, I don't know a good strategy (and I don't know if one exists) to separate nodes into blocks such that they're densely linked to each other. I'd have to find/make an algorithm to see if this is even possible. Intuitively, it seems the optimal distribution of nodes is NP-hard, greedy solutions are potentially many times worse, and maintaining the distribution of nodes while editing the tree/graph (specifically adding + removing nodes) is an unsolved problem.

...use a custom allocator in each node region. LP: I did several days' worth of work on an ultimately abandoned custom allocator at the beginning of this year. I had 2 reasons for leaving it. A. The Rust allocator API isn't stable, and it's actually churning pretty rapidly; for example, the allocator-api2 crate is kinda bit-rotten already. But the bigger reason was B. I felt like the custom allocator approach only solved some of the problems / features outlined in this document, and avoiding the memory-allocator paradigm altogether is probably where we want to get to in order to ultimately achieve all of the features.

I don't know a good strategy... LP: I'll diagram up what I have in mind and we can discuss where it might fail. We don't need optimal, just good enough, and we have an existence proof in the system allocator (or jemalloc) that something workable is possible. The other point is that usage patterns will play a big role in efficiency. It will certainly be possible to fragment the node blocks badly. Especially when we layer in the requirement for atomic updating and robustness against write failure.

maintaining the distribution of nodes while editing the tree/graph (specifically adding+removing nodes) is an unsolved problem LP: This will be partially ameliorated by in-memory regions, which let the implementation skip writing a lot of the intermediate work. It is certainly still a problem that requires careful consideration. I think there are a lot of good ideas in the design of LMDB.

Open Questions

  • Do refcounts live with node blocks, or somewhere else? If they live with node blocks, how do we make sure that circular references between blocks don't prevent deallocation?

NOTE(igorm): I think this is not possible to do (1) using purely reference counting (2) FAST (3) in generic case. The alternative is to have a garbage collector of some sort.

Let's look at the following tree:

a - b - c
  \ d - e

if you have references to c, d, b, and e, and you graft d onto c, then b onto e, this will create the following graph:

a - b - c
 \    X
  \ d - e

which has a circular path b → c → d → e → (b). It seems to me that detecting this loop is not possible without visiting all subtrees of b and d when they get grafted.

LP: Circular references at the node level are illegal, so this is not a problem. Without that behavior we could get infinite paths, and break our isolation guarantees, and a bunch of other horrible things. So, if you performed those operations described above, here is what would happen...

"graft d onto c" leads to:

a - b - c
 \    /
  \ d - e

then "b onto e" will cause a copy-on-write of d and e, leading to:

a - b - c - d - e
 \   ^------<.
  \ d' - e' -^

My fear of circular references was referring to circularity at the node-block level. Although I think there is a credit scheme that can fix this too, I haven't fully thought it through yet.

LP: For resource management, we may well want something that looks more like a GC than straight reference counts. However, we will still need the refcounts on the nodes (or at least an atomic is_shared field), because they are a critical part of the copy-on-write semantic used to maintain structural sharing in the trie. So refcounts shouldn't be written off entirely.

Proposal for action

  1. We need a set of benchmarks and metrics to optimize for. Mork has a few benchmarks, and it would be nice to include some which measure deallocation.
  2. As for additional metrics, we could measure the memory per node/path.
  3. We can implement a set of cheap experiments:
  • replace global allocator with never-freeing allocator

    • What will we learn from this experiment? Are you thinking you want to separate how much of the cleanup time is spent on refcount check-and-decrement vs. how much is simply freeing the allocations?

    Many of the more complex tests and benchmarks simply forget the maps, and that does result in instant "cleanup". However, if we implement a NodeBlock approach, we get both centralized refcounts and big contiguous allocations, so it will cut both costs.

  • replace Arc with Asc

    • Asc is probably worth a test. I did a prototype last year based on triomphe, and it actually got slower in spite of losing the weak counts, likely because some of the pointer metadata isn't available without the unstable APIs.

    My main hesitation here was just that I felt like addressing all the issues holistically is going to ultimately be less work and lead to a simpler and better-factored system. For example, the ODRc is a &dyn ptr, but I think there are probably only 8 or 16 different node types we actually need to support, so we can probably find 3 or 4 bits out of the 64 to use as a tag. What we definitely don't want is a double indirection (i.e. &TaggedNodeRef), so moving away from &dyn should happen at the same time as moving to a different smart-pointer type.

  • track allocated memory and created nodes, see the distribution

    • Right. I'm imagining different NodeBlock formats will exist for different purposes. We could even support native NodeBlock formats for some data file formats (not JSON because it doesn't contain lengths and offsets, but maybe beve)
  • record profiling information during benchmarks and see differences

    • 100% agreed. The counters feature provides a bit of that. It's pretty ad-hoc and gets dusted off for each experiment when we want to run it. But I'm definitely open to creating more tools and a better framework for the tools we want.
  4. Instead of having refcounting, we could have a GC of some sort.
  • Yes. This is an option we shouldn't discount.

Specific Format Proposals

Pointer Structure

The new TrieNodeODRc would be a 64-bit type, in which we want to encode the following:

  • type_tag: 3 bits. Which code path to use to interpret the node.
  • block_addr: 44 bits (could be made smaller if necessary, by imposing restrictions on the computer / operating system architecture). The pointer to any valid 4K node block in the user address space can be reconstructed from this field. See https://github.com/irrustible/ointers/blob/main/src/lib.rs for details and code pertaining to this.
  • node_id: 7 bits. An id of a node, used to interpret the node's location within a block. Usually corresponds to a node block chunk index.
  • spare: 10 bits. Currently unused, but may permit tries that span multiple computers in the future.
       ┌07     ┌0F     ┌17     ┌1F     ┌27     ┌2F     ┌37     ┌3F
----------------------------------------------------------------

Precise layout TBD
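To make the bit budget concrete, here is a minimal packing/unpacking sketch; the field order and exact bit positions are placeholders, per the TBD above:

const TAG_BITS: u32 = 3;
const ADDR_BITS: u32 = 44;
const NODE_ID_BITS: u32 = 7; // the remaining 10 bits stay zero (spare)

fn pack(type_tag: u64, block_addr: u64, node_id: u64) -> u64 {
    debug_assert!(type_tag < (1 << TAG_BITS));
    debug_assert!(block_addr < (1 << ADDR_BITS));
    debug_assert!(node_id < (1 << NODE_ID_BITS));
    type_tag | (block_addr << TAG_BITS) | (node_id << (TAG_BITS + ADDR_BITS))
}

fn unpack(ptr: u64) -> (u64, u64, u64) {
    let type_tag = ptr & ((1 << TAG_BITS) - 1);
    let block_addr = (ptr >> TAG_BITS) & ((1 << ADDR_BITS) - 1);
    let node_id = (ptr >> (TAG_BITS + ADDR_BITS)) & ((1 << NODE_ID_BITS) - 1);
    (type_tag, block_addr, node_id)
}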

NodeBlock Format

A NodeBlock would be a 4KB aligned block of memory. It would contain the following:

  • ref_counts_table: 96 x 4-Bytes = 384 Bytes. A ref-count for each node (each chunk) in the block. 4 Bytes seems like the right size for a refcount: it provides a saturating (race-proof with relaxed access) counter up to 2 billion, and only an extreme minority of nodes will ever have >2B referencers, so saturation will almost never happen. 8 Bytes is therefore overkill; on the other hand, 2 Bytes allows only 32K before saturation occurs, which could be quite common. (A sketch of the saturating increment follows this list.)

  • payload_meta_table: 16 Bytes. Stores 1 occupied bit per slot in the payloads_table.

    GOAT OUTDATED. IGNORE THIS PARAGRAPH. It's better if nodes store the key-or-val info inside the individual chunk formats. * payload_meta_table: 32 Bytes. Stores 2 bits per slot in the payloads_table: 1 bit for whether the slot is occupied, and 1 bit for whether it is a value or an external link. We could increase the payloads_table capacity by an additional value by skipping the "value or link" bitmask, and instead counting links from 0 up and values from MAX down, then only storing the (8-bit) index where values switch to links. However, this increases the expense of finding an empty slot and of determining whether a given payload_ref is a value or a link.

    GOAT OUTDATED. IGNORE THIS PARAGRAPH. THE REFCOUNTS CAN DOUBLE AS OCCUPIED MARKER. * chunks_meta_table: 16 Bytes. Stores 1 occupied bit per slot in the chunks_table.

  • payloads_table: 126 * 8-Bytes = 1008 Bytes. Stores external links inter-mixed with values. External links are full 64-bit TrieNodeODRc pointers. Values are either the value type stored natively (if V is Sized and smaller than 64 bits) or another TBD indirection to a key-value store or external allocation.

  • chunks_table: 84 * 32-Bytes = 2688 Bytes. Blocks of memory representing the nodes themselves. Layout depends on the respective Chunk Format.
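Here is a sketch of the kind of saturating, relaxed-ordering increment the ref_counts_table entry alludes to; this is an assumed approach, not settled design. Saturating at ~2 billion leaves ~2 billion slots of headroom, so racing relaxed increments cannot wrap the counter before one of them pins it:

use std::sync::atomic::{AtomicU32, Ordering};

const SATURATION_POINT: u32 = u32::MAX / 2; // ~2 billion

fn inc_ref(rc: &AtomicU32) {
    if rc.fetch_add(1, Ordering::Relaxed) >= SATURATION_POINT {
        // Pin the counter; a saturated node is simply never freed, and the
        // decrement path must treat counts >= SATURATION_POINT as immortal.
        rc.store(SATURATION_POINT, Ordering::Relaxed);
    }
}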

Discussion

A NodeBlock is full when it runs out of either chunks or payloads.

The ratio of chunks-to-payloads can be changed depending on what we find when encoding the data.

We may determine there is a locality benefit to interleaving payload table entries and chunks within a block. Specifically in the case where we are traversing "through" a block, e.g. in a descend operation. We might only want to load a single chunk from a block, and if the payload(s) referenced by that chunk were in the same cache line then it would limit the number of fetches.

Deallocating node blocks may best be done with a GC algorithm, because, while we can prove nodes will not create circular references, we cannot prove the same thing about node blocks. If we do go with the GC route, we may want to move the child_or_value flags from the individual chunks to a central location in the node block, in order to know which payload entries represent links to other blocks that need to be marked for GC.

PayloadRef

A PayloadRef is an 8-bit type used within a Chunk Format that refers either to another chunk within the same node block (a local link) or to a slot within the payloads_table (an external link or value).

The simplest scheme would be for the high bit of a PayloadRef to indicate a local link if set, and an external link or value if clear. The remaining 7 bits then become an index into either the chunks_table or the payloads_table. If we end up wanting more than 128 slots in the payloads_table, we can compare the 8-bit value against another constant and subtract that constant to turn the PayloadRef into an index.

A PayloadRef can have several sentinel values. These specific values were chosen to avoid interference with valid mappings onto indices in the other tables. If those tables change size, we can tweak these sentinels.

  • ZERO_SIZED_VALUE = 0x7F Means "a zero-sized value exists here", and saves the need to actually fetch anything from the payloads_table.

  • NULL = 0xFF Means "no payload or value exists here". Can be useful to encode information within a ChunkFormat.
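A decoding sketch folding in the two sentinels above (names hypothetical; note that 0x7F and 0xFF fall outside the 126-slot payloads_table and 84-slot chunks_table respectively, which is why they are free to serve as sentinels):

const ZERO_SIZED_VALUE: u8 = 0x7F;
const NULL: u8 = 0xFF;

enum PayloadLoc {
    Chunk(usize),   // local link: index into chunks_table
    Payload(usize), // external link or value: slot in payloads_table
    ZeroSizedValue, // value exists, nothing to fetch
    Null,           // no payload or value here
}

fn decode(r: u8) -> PayloadLoc {
    match r {
        NULL => PayloadLoc::Null,
        ZERO_SIZED_VALUE => PayloadLoc::ZeroSizedValue,
        _ if r & 0x80 != 0 => PayloadLoc::Chunk((r & 0x7F) as usize),
        _ => PayloadLoc::Payload(r as usize),
    }
}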

ByteUnsaturated Chunk Format

The ByteUnsaturated format represents a location in the trie, with up to 24 descending children, preceded by up to 4 straight prefix path bytes.

The prefix path bytes address a common pattern observed in the data, where large fan-outs are often preceded by reasonably straight paths. This happens, for example, because a given encoding might construct paths to represent key = value, and a common key might have many different values branching from the constant prefix formed by key. Tag data, such as arity markers, will also manifest as low-variance prefixes.

The ByteUnsaturated format spans two contiguous 32-Byte chunks, with the child_mask field occupying the entire second chunk.

Fields:

  • common_header: u8 = 1 Byte. 3 bits of tag, a 3-bit integer indicating the used length of prefix_path, a has_root_val flag, and a saturated_ff_val_or_child flag.

  • val_or_child_bitfield: 24-element bitfield = 3 Bytes. Indicates whether each payload in payloads is a value or a link.

  • prefix_path: [u8; 4], 4 Bytes. Provides a short prefix path between the node's root and the multi-branching location.

  • payloads: [PayloadRef; 24], 24 Bytes. References to all child links or values contained within the node. If the node contains a root value (has_root_val == 1), its PayloadRef will be stored in payloads[23].

  • child_mask: ByteMask, 32 Bytes. Contains the bitfield that will be returned by the child_mask method.
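As a size sanity check, the fields above can be laid out as a #[repr(C)] struct spanning the two chunks (a sketch with hypothetical names; all fields are byte arrays, so there is no padding):

#[repr(C)]
struct ByteUnsaturatedChunks {
    common_header: u8,              // tag, prefix length, flags
    val_or_child_bitfield: [u8; 3], // 24 value-vs-link bits
    prefix_path: [u8; 4],           // straight run before the fan-out
    payloads: [u8; 24],             // PayloadRefs; [23] doubles as root val
    child_mask: [u8; 32],           // 256-bit mask, the entire second chunk
}

// 1 + 3 + 4 + 24 + 32 = 64 Bytes = two 32-Byte chunks
const _: () = assert!(core::mem::size_of::<ByteUnsaturatedChunks>() == 64);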

Discussion

QUESTION: We can trade a few prefix path bytes for payload table slots or vice versa, depending on what the data tells us.

QUESTION: Do we want to stash the number of utilized slots in payloads? Allows for fast paths for a few functions that would otherwise require 4 popcnt operations (or a 256-bit compare in the case of is_empty). But we'd need to give up a child or a path prefix byte to get it.

QUESTION: We might want a "medium-saturated" format that could store up to 55ish or 88ish children before going all the way to a saturated node. Depending on the utilization we find.

ByteSaturatedHead Chunk Format

The ByteSaturatedHead format is needed to represent a fully saturated byte node. One issue with a straightforward format in the same spirit as the others, however, is that a single saturated location (i.e. a location at which all or most children are occupied) would exceed the capacity of the NodeBlock's payloads_table. Therefore the chunk format needs to store the full 64-bit values or TrieNodeODRc links. Also, to avoid the nasty requirement to copy and reformat large amounts of data in order to add or remove children in the middle of a heavily saturated node, a single logical node can be broken up across multiple chunks.

A ByteSaturatedHead chunk begins life as a ByteUnsaturated chunk, and has all the same fields. However, when it needs to be upgraded to a ByteSaturatedHead, the payloads are migrated from the NodeBlock into new ByteSaturatedBody nodes, and the payloads field in this ByteSaturatedHead node is updated to reference the new ByteSaturatedBody chunks.

Allocation can be skipped for ByteSaturatedBody chunks that have no set values. Since encodings often cluster child bytes together (e.g. integers, printable ASCII, etc.), it is common to observe large runs of unused children even for nodes that have 60+ children.

Mapping from a set bit in the child_mask to the location of a child node or value is accomplished with the following algorithm:

if (byte & 0xF) != 0xF {
  // Low nibble != 0xF: payloads[0..16] reference 16 ByteSaturatedBody
  // chunks, one per high nibble, each holding 15 children.
  let body_node = get_node_from_payload_ref(self.payloads[byte >> 4]);
  body_node.payloads[byte & 0xF]
} else {
  let idx = byte >> 4;
  if idx != 0xF {
    // Bytes 0x0F..=0xEF: a 17th body chunk, referenced from payloads[16],
    // holds these 15 children, indexed by high nibble.
    let body_node = get_node_from_payload_ref(self.payloads[16]);
    body_node.payloads[idx]
  } else {
    // byte == 0xFF is stored directly via payloads[17]; a header flag
    // records whether it is a value or a child link.
    let is_child = self.header.get_bit(saturated_ff_val_or_child);
    get_child_or_value_from_payload_ref(self.payloads[17], is_child)
  }
}

ByteSaturatedBody Chunk Format

A ByteSaturatedBody spans 4 contiguous chunks and can represent 15 child nodes. A ByteSaturatedBody is always part of one and only one ByteSaturatedHead, and therefore its refcount must be 1; anything else indicates a bug.

A ByteSaturatedBody contains the following fields:

  • payloads: [ValOrChild; 15] = 120 Bytes. Links to 15 full values or full TrieNodeODRc nodes.

  • metadata: 15-element bitfield = 2 Bytes. Stores a bit indicating whether each payload is a value or a child.

  • unused: 6 Bytes. Could contain an active bitfield, used for debug asserts. Otherwise unused, for now.

Discussion

An alternative implementation is to break a ByteSaturatedBody into 4 non-contiguous 32-Byte section chunks, and use the unused bytes in section 0 to specify the locations of sections 1..3. This is equally dense, but trades the requirement to find 4 contiguous chunks for extra indirection.

ByteSaturatedBody chunks are the only place, aside from a NodeBlock's payloads_table, where references to another node can be stored. This means a GC on node blocks needs to be aware of them in order to function, which probably means we need to make a node block aware of all the chunk slots that hold ByteSaturatedBody chunks. Maybe this could be done with a sentinel value in the associated ref_counts_table entries.

StraightPath Chunk Format

The StraightPath format is the most efficient way to represent a long uninterrupted path. It packs 29 key bytes into a single 32 Byte chunk with the following fields.

  • common_header: u8 = 1 Byte. 3 bits of tag, Bit 3 == end_is_used, Bit 4 == end_val_or_child, Bit 5 == has_root_val (only needed once the changes described in [A.0001_map_root_values.md] are implemented), Bits 6 & 7 unused.

  • length: 1 Byte. Bits [0..5] == path_len.

  • end_payload: PayloadRef, 1 Byte. The payload at the end of the path.

  • path_bytes: [u8; 29]. The segment of the key. If the has_root_val flag is set, then path_bytes[28] is reinterpreted as the root value PayloadRef.
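The same kind of size sanity check applies to StraightPath: packed as byte fields, it fills exactly one 32-Byte chunk (a sketch with hypothetical names):

#[repr(C)]
struct StraightPathChunk {
    common_header: u8,    // tag, end_is_used, end_val_or_child, has_root_val
    length: u8,           // bits [0..5] == path_len
    end_payload: u8,      // PayloadRef at the end of the path
    path_bytes: [u8; 29], // key segment; [28] reinterpreted as root-val ref
}

// 1 + 1 + 1 + 29 = 32 Bytes = one chunk
const _: () = assert!(core::mem::size_of::<StraightPathChunk>() == 32);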

Discussion

QUESTION: Should we allow the end_payload space to be used as an extra path byte if V is a zero-sized type? We can use the end_is_used flag to distinguish a dangling path from a value.

LOUDS Chunk Format

The LOUDS chunk format represents a subtrie with up to 8 payloads and up to 18 total path bytes, arranged throughout the tree. A crate to interpret LOUDS bit strings can be found here: https://crates.io/crates/louds-rs. I don't know how efficient that crate is, but it's probably a good starting point.

The format consists of the following fields packed into a single 32 Byte chunk.

  • common_header: u8 = 1 Byte. 3 bits of tag, a has_root_val flag, and 4 unused bits.

  • path_bytes: [u8; 18], 18 Bytes. A 1-1 mapping with the nodes encoded in the louds_bit_string.

  • endpoint_metadata: An 8-element bitfield = 1 Byte. Each bit in this bitfield indicates whether the associated endpoint is a value or a link.

    GOAT OUTDATED. IGNORE THIS PARAGRAPH. endpoint_metadata: 8 2-bit tags = 2 Bytes. Each tag encodes one of the following meanings: [val, link, both, root], from which a mapping between an end-point in the LOUDS structure and an index in endpoints_table can be derived. A val or link tag means 1 idx in endpoints_table is occupied; a both tag means 2 are occupied. Therefore the mapping function needs to take the whole endpoint_metadata u16 into account; however, this can be done cheaply with some bitwise arithmetic and a LUT.

  • louds_bit_string: [u8; 4], 4 Bytes. A LOUDS bit string representing the structure of the subtrie.

  • endpoints_table: [PayloadRef; 8], 8 Bytes. Stores the PayloadRefs for the values or links associated with each endpoint in the LOUDS tree. If has_root_val is set, then the PayloadRef pointing to the root value will be stored at endpoints_table[7].

    GOAT OUTDATED. IGNORE THIS PART. If endpoint_metadata[0] == root, then endpoints_table[0] will be a value associated with the root (zero-length path), once the changes described in [A.0001_map_root_values.md] are implemented.

Discussion

QUESTION: The format described above does not allow values to be expressed along the path, only at path endpoints, with the root being a special case. However, we could modify the LOUDS format to encode values as virtual children that don't get their own path bytes. The limited experimentation I have done makes me think this is more complicated, but perhaps I am wrong.

QUESTION: Do we want to sacrifice a little bit of capacity for some pre-computed information that would speed up access without parsing the whole louds_bit_string?

QUESTION: Should we encode the existence of a root value in the louds_bit_string rather than in the endpoint_metadata? It takes zero space in the endpoint_metadata.

QUESTION: Do we need the length (in bits) of the louds_bit_string? It looks like we can infer it from the last endpoint node having no children. But perhaps there is an optimization to knowing the length.

QUESTION: We could represent the bit_string in a depth-first encoding, rather than LOUDS' breadth-first. This will make traversal more efficient at the cost of making computing a child mask more expensive. My feeling is that LOUDS' breadth-first encoding probably wins. But we'd need to try some experiments.

Examples of the LOUDS Chunk Format

Example 1: Pair

(h)
 |-----+
(e)   (o)
 |     |
(l)   (w)
 |     |
(l)   (d)
 |     |
(o)   (y)

path_bytes: [h, e, o, l, w, l, d, o, y]
endpoint_metadata: [link, val, ...]
endpoints_table: [Link_o, Val_y, ...]
louds_bit_string: `1,0 | 1,1,0 | 1,0 | 1,0 | 1,0 | 1,0 | 1,0 | 1,0 | 0 | 0`
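To check the example, the bit string can be decoded by reading each "1...10" run as the child count of the next node in breadth-first order; a small sketch (the leading run belongs to LOUDS' super-root):

fn degrees(bits: &[u8]) -> Vec<usize> {
    let (mut out, mut run) = (Vec::new(), 0);
    for &b in bits {
        if b == 1 { run += 1; } else { out.push(run); run = 0; }
    }
    out
}

fn main() {
    // 1,0 | 1,1,0 | 1,0 | 1,0 | 1,0 | 1,0 | 1,0 | 1,0 | 0 | 0
    let bits: [u8; 19] = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0];
    // Super-root has one child (h); h has two (e, o); then a run of
    // single-child nodes; the two trailing zeros are the leaves o and y.
    assert_eq!(degrees(&bits), vec![1, 2, 1, 1, 1, 1, 1, 1, 0, 0]);
}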

Pair Chunk Format

Probably unneeded, as it only allows a handful more path bytes than the LOUDS format. A dedicated pair format could support a total path length of up to 29 bytes, while the LOUDS format is limited to 19 path bytes. But a pair format probably doesn't have much reason to exist, between the flexibility of the LOUDS format and the simplicity of the StraightPath format.

Star Chunk Format

A format appropriate for zero-sized V types, able to express dangling paths or paths terminating in values with only 2 bits per path.

GOAT TODO... We should probably measure whether a Star node would ever get used. In addition to identifying other node formats we could design to address frequently observed patterns.

Separation of algebraic policy from value type

Problems with current design.

Proposal

Values policy

Subtrie policy

octree example

Restrict & restricting as meet policies

Invalidated Assumptions

Commutativity

Meet and join are assumed to be commutative, and therefore some implementations simply flip the order of arguments for code reuse.

Strict increase / decrease

Several parts of the code currently assume that join will strictly increase the number of elements / the size of a trie, and likewise that meet and subtract will strictly reduce it. Neither assumption holds with configurable policies.

Open Questions

Is the value policy separate from the subtrie policy, or an argument to it?

GOAT

Are join and meet (and subtract) still fundamentally different?

Do we need a separate API for a join policy, a meet policy, and a subtract policy? Or is every operation with the same argument signature just a policy, with a single set of entry points on the zippers (and on PathMap) to perform that operation?

Is there an ergonomics reason to keep policy types separate?

Even if we conclude that it is only the function signature that differentiates policies, it still sounds mind-meltingly wrong to do something silly like using a join subtrie policy with a meet value policy. So should the API have type-level opinions about this?

Is "policy" the right word? Or have we come full circle back to "lattice"?

That's the question. Do we need more terminology?

Scouting write zipper

The problem

Current WriteZipper movement changes the trie more than is necessary. For example, descending a WZ past a share point will cause the in-memory nodes to become preemptively unique, even before any changes are made through the methods that would be expected to modify the trie.

The reason for this is that the zipper keeps a stack of mutable node references, and obtaining one of them requires calling make_mut on the TrieNodeODRc.

Example

Adam asks: "Why is re-traversal necessary? Why can't we just coerce the read-ptr to a write ptr?"

Consider the trie produced by this code:

#![allow(unused)]
fn main() {
extern crate pathmap;
use pathmap::PathMap;
let paths = &["atropine", "botox", "colchicine", "digitalis"];
let mut shared = PathMap::<()>::new();

let mut wz = shared.write_zipper();
for path in paths {
    wz.move_to_path(b"compounds:");
    wz.descend_to(path.as_bytes());
    wz.set_val(());
}
drop(wz); //Release the mutable borrow of `shared` before cloning it below

let mut map = PathMap::<()>::new();
let mut wz = map.write_zipper();

//Most medicines need special care
wz.descend_to("keep_in_the_pharmacy:");
wz.graft_map(shared.clone());
wz.move_to_path("handle_with_care:");
wz.graft_map(shared.clone());

//Add some more poisons
wz.move_to_path("handle_with_care:compounds:endrin");
wz.set_val(());
wz.move_to_path("handle_with_care:compounds:fluorine");
wz.set_val(());
wz.move_to_path("handle_with_care:compounds:gyromitrin");
wz.set_val(());

//Watch what happens to the trie sharing before and after adding the additional poisons
let mut out_buf = Vec::new();
use pathmap::viz::{viz_maps, DrawConfig};
let cfg = DrawConfig{ ascii: true, hide_value_paths: false, minimize_values: true, logical: true };
viz_maps(&[map], &cfg, &mut out_buf).unwrap();
println!("```mermaid\n{}```", std::str::from_utf8(&out_buf).unwrap());
}

The sharing happens at the node associated with compounds:. But the write zipper's focus node would be somewhere around the split to each compound, so it must go up the trie to unique the nodes. Otherwise we would have just added a bunch of poisons to our pharmacy.

But it's actually even worse. If there were another zipper on keep_in_the_pharmacy: (because we used a ZipperHead at the map root), it is possible the writing in handle_with_care: could have modified the trie in such a way that the zipper in keep_in_the_pharmacy is now holding a pointer to garbage.

Solution(s)

My current proposed solution is to essentially postpone the make_mut operation until it's needed. This would entail traversing the zipper using ordinary as_tagged accessors, and maintaining a stack of tagged (read-only) node refs for the top (most descended) portion of the path. Then, when a trie modification was actually needed, the implementation would re-traverse those nodes to obtain unique mutable pointers along the path.

This sounds like twice as much traversal work - and it is, but at least the re-traversal is likely to hit cache. More importantly, if it saves even a little bit of unnecessary node duplication (and thus breaking sharing) then I feel that it will be a net win.

I'm open to alternative ideas, however.

Cached Structural Catamorphism (partial path visibility)

The problem

A full path argument is incompatible with a cached catamorphism, because any path information incorporated into the generation of a W value means that W value is no longer suitable for reuse in another place with a different prefix path.

However, eliminating the path argument entirely (as is currently done for the non-debug cached cata methods) might be overly strict and limit some legitimate use cases.

Discussion

Adam said:

I'm currently using the path argument in the experimental MORKL interpreter; Specifically, MORKL programs are represented variable-free as a trie where everything is inlined. This is a huge exponential blow-up of the program, but it's mitigated by the sharing in the space.

Now I re-introduced variables as absolute paths in the program, because we don't have a serialization format that maintains sharing. This works because of two invariants: the caching cata folding in order, and the absolute paths in the program referring to the first subtrie.

So let's find a way around this... alternatively, I can look into formats that maintain sharing again.

LP said: I have 3 answers, depending on timeframe.

  1. The _debug method preserves the old behavior. And, in spite of "debug" in the name, it's still available in release builds. It's basically identical to the old behavior. But I hate that answer. Although it keeps this PR merge from creating a functionality regression in the short term.
  2. I think the ACT format absolutely should be made to support sharing, as opposed to providing yet another format. I'm not sure about the challenges associated with making that work, but from a user-facing perspective the ACT format should fill the niche of "compact read-only format", meaning we should engineer sharing into the format even if that means creating a v2, because sharing is pretty damn important for compactness. (Aside: I think we should engineer compression into ACT as well, possibly through a 2-tier addressing scheme that compresses blocks individually. But that's not relevant to this PR.)
  3. But this use case makes me realize we want a cata API that enables both caching and paths.

So... caching and paths... How are those not fundamentally incompatible?

Consider this reframe: if instead of a trie of bytes, we have a trie of expressions, where each expression is conceptually defined as:

#![allow(unused)]
fn main() {
enum Expr {
    Sym(String),
    Var(usize),
    Expr(Vec<Expr>)
}
}

In other words, a caching cata absolutely should be able to work at the Expr level and cache entire exprs, including all nested children. But for us the issue is that the arity (and other header) info is potentially stored above the share point, making it impossible to access. And even when it is accessible, it's just not convenient to assemble it from the child_masks and jump sub_paths provided by the current byte-wise API.

There are two directions I see here:

  1. Make a "structure-aware" catamorphism (cata at first, but this will likely set the precedent for other morphisms as well). Something that traverses the trie as if it were the above structure (or any structure, controllable via a traversal closure that can access paths)... I haven't thought through the details here, but I already see some hairy issues. Not saying it won't work, but it needs a lot more thinking.
  2. Make a "hybrid" cata. In this implementation, the closure would get a full path arg, but the closure must return (W, usize), to inform the implementation how many trailing path bytes were used to construct the W. Then, the partial path could be hashed along with the node_id to determine if the result was appropriate for sharing.

The biggest downside I see to API choice 2 is that it sounds easy to misunderstand and therefore misuse. And misuses might be subtle and hard to detect.

However, API 1 could be implemented in terms of API 2, and through that lens, API 1 would be higher runtime overhead for not much practical gain.

Another consideration is whether we want a solution that extends beyond catamorphism to other morphisms that may take advantage of caching in a more nuanced way; e.g. a situation where the paths within a subtrie are parameterized by information from earlier in the path (but not by a strict prefix of the subtrie), yet there are still multiple instances of the same parameter-subtrie combo. My instinct is that this situation is too complicated to capture with a morphism and people should just use zipper ops at that point, but it's not a settled question.
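To make option 2's contract concrete, here is a toy sketch of the closure shape (the names and the W type are placeholders, not the crate's actual API):

fn main() {
    // Hypothetical shape of the hybrid cata closure: it returns (W, used),
    // where `used` is how many trailing path bytes contributed to W. The
    // cache key then becomes (node_id, used-byte suffix), not the full path.
    let alg = |path: &[u8], children: &[u64]| -> (u64, usize) {
        let used = 1; // suppose only the final byte (e.g. a tag) mattered
        let w = children.iter().sum::<u64>() + *path.last().unwrap() as u64;
        (w, used)
    };
    // The same (suffix, children) pair under two different prefixes yields
    // the same result, so a cached W would be reusable across both sites:
    assert_eq!(alg(b"prefix_a:x", &[1, 2]), alg(b"prefix_b:x", &[1, 2]));
}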

Current Thinking

  • Eventually support a logical_cached_cata (name subject to debate) that provides a path arg to the closure, and the prefix that becomes part of the cached value's hash (option 2).

  • Update ACT format and zipper to support structural sharing.