zerotrie/builder/
mod.rs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! # ZeroTrie Builder
//!
//! There are two implementations of the ZeroTrie Builder:
//!
//! - [konst::ZeroTrieBuilderConst] allows for human-readable const construction
//! - [nonconst::ZeroTrieBuilder] has the full feaure set but requires `alloc`
//!
//! The two builders follow the same algorithm but have different capabilities.
//!
//! ## Builder Algorithm Overview
//!
//! The tries are built backwards, from the last node to the first node. The key step of the
//! algorithm is **determining what is the next node to prepend.**
//!
//! In the simple case of [`ZeroTrieSimpleAscii`], all nodes are binary-search, so if the input
//! strings are provided in lexicographic order, there is a simple, deterministic method for
//! identifying the next node. This insight is what enables us to make the const builder.
//!
//! The builder works with the following intermediate state variables:
//!
//! - `prefix_len` indicates the byte index we are currently processing.
//! - `i` and `j` bracket a window of strings in the input that share the same prefix.
//! - `current_len` is the length in bytes of the current self-contained trie.
//! - `lengths_stack` contains metadata for branch nodes.
//!
//! What follows is a verbal explanation of the build steps for a trie containing:
//!
//! - "" → 11
//! - "ad" → 22
//! - "adef" → 33
//! - "adghk" → 44
//!
//! When a node is prepended, it is shown in **boldface**.
//!
//! 1. Initialize the builder by setting `i=3`, `j=4`, `prefix_len=5` (the last string),
//!    `current_len=0`, and `lengths_stack` empty. Start the main loop.
//! 2. Top of loop. The string at `i` is equal in length to `prefix_len`, so we prepend
//!    our first node: a **value node 44**, which requires a 2-byte varint. Increase
//!    `current_len` to 2.
//! 3. Reduce `prefix_len` to 4, read our `key_ascii="k"`, and recalculate `i` and `j`
//!    _(this calculation is a long chunk of code in the builder impls)_. Since there is no
//!    other string with the prefix "adgh", `i` and `j` stay the same, we prepend an
//!    **ASCII node "k"**, increase `current_len` to 3, and continue the main loop.
//! 4. Top of loop. The string at `i` is of length 5, but `prefix_len` is 4, so there is
//!    no value node to prepend.
//! 5. Reduce `prefix_len` to 3, read our `key_ascii="h"`, and recalculate `i` and `j`.
//!    There are no other strings sharing the prefix "abg", so we prepend an
//!    **ASCII node "h"**, increase `current_len` to 4, and continue the main loop.
//! 6. Top of loop. There is still no value node to prepend.
//! 7. Reduce `prefix_len` to 2, read our `key_ascii="g"`, and recalculate `i` and `j`.
//!    We find that `i=1` and `j=4`, the range of strings sharing the prefix "ad". Since
//!    `i` or `j` changed, proceed to evaluate the branch node.
//! 8. The last branch byte `ascii_j` for this prefix is "g", which is the same as `key_ascii`,
//!    so we are the _last_ target of a branch node. Push an entry onto `lengths_stack`:
//!    `BranchMeta { ascii: "g", cumulative_length: 4, local_length: 4, count: 1 }`.
//! 9. The first branch byte `ascii_i` for this prefix is "e", which is NOT equal to `key_ascii`,
//!    so we are _not the first_ target of a branch node. We therefore start evaluating the
//!    string preceding where we were at the top of the current loop. We set `i=2`, `j=3`,
//!    `prefix_len=4` (length of the string at `i`), and continue the main loop.
//! 10. Top of loop. Since the string at `i` is equal in length to `prefix_len`, we prepend a
//!     **value node 33** (which requires a 2-byte varint) and increase `current_len` to 2.
//! 11. Reduce `prefix_len` to 3, read our `key_ascii="f"`, and recalculate `i` and `j`.
//!     They stay the same, so we prepend an **ASCII node "f"**, increase `current_len` to 3,
//!     and continue the main loop.
//! 12. Top of loop. No value node this time.
//! 13. Reduce `prefix_len` to 2, read our `key_ascii="e"`, and recalculate `i` and `j`.
//!     They go back to `i=1` and `j=4`.
//! 14. The last branch byte `ascii_j` for this prefix is "g", which is NOT equal to `key_ascii`,
//!     so we are _not the last_ target of a branch node. We peek at the entry at the front of
//!     the lengths stack and use it to push another entry onto the stack:
//!     `BranchMeta { ascii: "e", cumulative_length: 7, local_length: 3, count: 2 }`
//! 15. The first branch byte `ascii_i` for this prefix is "e", which is the same as `key_ascii`,
//!     wo we are the _first_ target of a branch node. We can therefore proceed to prepend the
//!     metadata for the branch node. We peek at the top of the stack and find that there are 2
//!     tries reachable from this branch and they have a total byte length of 5. We then pull off
//!     2 entries from the stack into a local variable `branch_metas`. From here, we write out
//!     the **offset table**, **lookup table**, and **branch head node**, which are determined
//!     from the metadata entries. We set `current_len` to the length of the two tries plus the
//!     metadata, which happens to be 11. Then we return to the top of the main loop.
//! 16. Top of loop. The string at `i` is length 2, which is the same as `prefix_len`, so we
//!     prepend a **value node 22** (2-byte varint) and increase `current_len` to 13.
//! 17. Reduce `prefix_len` to 1, read our `key_ascii="d"`, and recalculate `i` and `j`.
//!     They stay the same, so we prepend an **ASCII node "d"**, increase `current_len` to 14,
//!     and continue the main loop.
//! 18. Top of loop. No value node this time.
//! 19. Reduce `prefix_len` to 0, read our `key_ascii="a"`, and recalculate `i` and `j`.
//!     They change to `i=0` and `j=4`, since all strings have the empty string as a prefix.
//!     However, `ascii_i` and `ascii_j` both equal `key_ascii`, so we prepend **ASCII node "a"**,
//!     increase `current_len` to 15, and continue the main loop.
//! 16. Top of loop. The string at `i` is length 0, which is the same as `prefix_len`, so we
//!     prepend a **value node 11** and increase `current_len` to 16.
//! 17. We can no longer reduce `prefix_len`, so our trie is complete.
//!
//! ## Perfect Hash Reordering
//!
//! When the PHF is added to the mix, the main change is that the strings are no longer in sorted
//! order when they are in the trie. To resolve this issue, when adding a branch node, the target
//! tries are rearranged in-place in the buffer to be in the correct order for the PHF.
//!
//! ## Example
//!
//! Here is the output of the trie described above.
//!
//! ```
//! use zerotrie::ZeroTrieSimpleAscii;
//!
//! const DATA: [(&str, usize); 4] =
//!     [("", 11), ("ad", 22), ("adef", 33), ("adghk", 44)];
//!
//! // As demonstrated above, the required capacity for this trie is 16 bytes
//! const TRIE: ZeroTrieSimpleAscii<[u8; 16]> =
//!     ZeroTrieSimpleAscii::from_sorted_str_tuples(&DATA);
//!
//! assert_eq!(
//!     TRIE.as_bytes(),
//!     &[
//!         0x8B, // value node 11
//!         b'a', // ASCII node 'a'
//!         b'd', // ASCII node 'd'
//!         0x90, // value node 22 lead byte
//!         0x06, // value node 22 trail byte
//!         0xC2, // branch node 2
//!         b'e', // first target of branch
//!         b'g', // second target of branch
//!         3,    // offset
//!         b'f', // ASCII node 'f'
//!         0x90, // value node 33 lead byte
//!         0x11, // value node 33 trail byte
//!         b'h', // ASCII node 'h'
//!         b'k', // ASCII node 'k'
//!         0x90, // value node 44 lead byte
//!         0x1C, // value node 44 trail byte
//!     ]
//! );
//!
//! assert_eq!(TRIE.get(b""), Some(11));
//! assert_eq!(TRIE.get(b"ad"), Some(22));
//! assert_eq!(TRIE.get(b"adef"), Some(33));
//! assert_eq!(TRIE.get(b"adghk"), Some(44));
//! assert_eq!(TRIE.get(b"unknown"), None);
//! ```

mod branch_meta;
pub(crate) mod bytestr;
pub(crate) mod konst;
#[cfg(feature = "litemap")]
mod litemap;
#[cfg(feature = "alloc")]
pub(crate) mod nonconst;

use bytestr::ByteStr;

use super::ZeroTrieSimpleAscii;

impl<const N: usize> ZeroTrieSimpleAscii<[u8; N]> {
    /// **Const Constructor:** Creates an [`ZeroTrieSimpleAscii`] from a sorted slice of keys and values.
    ///
    /// This function needs to know the exact length of the resulting trie at compile time. To
    /// figure out `N`, first set `N` to be too large (say 0xFFFF), then look at the resulting
    /// compile error which will tell you how to set `N`, like this:
    ///
    /// > the evaluated program panicked at 'Buffer too large. Size needed: 17'
    ///
    /// That error message says you need to set `N` to 17.
    ///
    /// Also see [`Self::from_sorted_str_tuples`].
    ///
    /// # Panics
    ///
    /// Panics if `items` is not sorted or if `N` is not correct.
    ///
    /// # Examples
    ///
    /// Create a `const` ZeroTrieSimpleAscii at compile time:
    ///
    /// ```
    /// use zerotrie::ZeroTrieSimpleAscii;
    ///
    /// // The required capacity for this trie happens to be 17 bytes
    /// const TRIE: ZeroTrieSimpleAscii<[u8; 17]> =
    ///     ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
    ///         (b"bar", 2),
    ///         (b"bazzoo", 3),
    ///         (b"foo", 1),
    ///     ]);
    ///
    /// assert_eq!(TRIE.get(b"foo"), Some(1));
    /// assert_eq!(TRIE.get(b"bar"), Some(2));
    /// assert_eq!(TRIE.get(b"bazzoo"), Some(3));
    /// assert_eq!(TRIE.get(b"unknown"), None);
    /// ```
    ///
    /// Panics if strings are not sorted:
    ///
    /// ```compile_fail
    /// # use zerotrie::ZeroTrieSimpleAscii;
    /// const TRIE: ZeroTrieSimpleAscii<[u8; 17]> = ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
    ///     (b"foo", 1),
    ///     (b"bar", 2),
    ///     (b"bazzoo", 3),
    /// ]);
    /// ```
    ///
    /// Panics if capacity is too small:
    ///
    /// ```compile_fail
    /// # use zerotrie::ZeroTrieSimpleAscii;
    /// const TRIE: ZeroTrieSimpleAscii<[u8; 15]> = ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
    ///     (b"bar", 2),
    ///     (b"bazzoo", 3),
    ///     (b"foo", 1),
    /// ]);
    /// ```
    ///
    /// Panics if capacity is too large:
    ///
    /// ```compile_fail
    /// # use zerotrie::ZeroTrieSimpleAscii;
    /// const TRIE: ZeroTrieSimpleAscii<[u8; 20]> = ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
    ///     (b"bar", 2),
    ///     (b"bazzoo", 3),
    ///     (b"foo", 1),
    /// ]);
    /// ```
    pub const fn from_sorted_u8_tuples(tuples: &[(&[u8], usize)]) -> Self {
        use konst::*;
        let byte_str_slice = ByteStr::from_byte_slice_with_value(tuples);
        let result = ZeroTrieBuilderConst::<N>::from_tuple_slice::<100>(byte_str_slice);
        match result {
            Ok(s) => Self::from_store(s.build_or_panic()),
            Err(_) => panic!("Failed to build ZeroTrie"),
        }
    }

    /// **Const Constructor:** Creates an [`ZeroTrieSimpleAscii`] from a sorted slice of keys and values.
    ///
    /// This function needs to know the exact length of the resulting trie at compile time. To
    /// figure out `N`, first set `N` to be too large (say 0xFFFF), then look at the resulting
    /// compile error which will tell you how to set `N`, like this:
    ///
    /// > the evaluated program panicked at 'Buffer too large. Size needed: 17'
    ///
    /// That error message says you need to set `N` to 17.
    ///
    /// Also see [`Self::from_sorted_u8_tuples`].
    ///
    /// # Panics
    ///
    /// Panics if `items` is not sorted, if `N` is not correct, or if any of the strings contain
    /// non-ASCII characters.
    ///
    /// # Examples
    ///
    /// Create a `const` ZeroTrieSimpleAscii at compile time:
    ///
    /// ```
    /// use zerotrie::ZeroTrieSimpleAscii;
    ///
    /// // The required capacity for this trie happens to be 17 bytes
    /// const TRIE: ZeroTrieSimpleAscii<[u8; 17]> =
    ///     ZeroTrieSimpleAscii::from_sorted_str_tuples(&[
    ///         ("bar", 2),
    ///         ("bazzoo", 3),
    ///         ("foo", 1),
    ///     ]);
    ///
    /// assert_eq!(TRIE.get(b"foo"), Some(1));
    /// assert_eq!(TRIE.get(b"bar"), Some(2));
    /// assert_eq!(TRIE.get(b"bazzoo"), Some(3));
    /// assert_eq!(TRIE.get(b"unknown"), None);
    /// ```
    ///
    /// Panics if the strings are not ASCII:
    ///
    /// ```compile_fail
    /// # use zerotrie::ZeroTrieSimpleAscii;
    /// const TRIE: ZeroTrieSimpleAscii<[u8; 100]> = ZeroTrieSimpleAscii::from_sorted_str_tuples(&[
    ///     ("bár", 2),
    ///     ("båzzöo", 3),
    ///     ("foo", 1),
    /// ]);
    /// ```
    pub const fn from_sorted_str_tuples(tuples: &[(&str, usize)]) -> Self {
        use konst::*;
        let byte_str_slice = ByteStr::from_str_slice_with_value(tuples);
        // 100 is the value of `K`, the size of the lengths stack. If compile errors are
        // encountered, this number may need to be increased.
        let result = ZeroTrieBuilderConst::<N>::from_tuple_slice::<100>(byte_str_slice);
        match result {
            Ok(s) => Self::from_store(s.build_or_panic()),
            Err(_) => panic!("Failed to build ZeroTrie"),
        }
    }
}