Fast histograms
In programming, you will often have to take a histogram of some data. Maybe you want to compress some data, equalize an image, crack a bad cipher, or identify a language. Since Unicode is the standard for text, let's count Unicode codepoints! I'll write all of the examples in Zig, as it is my preferred programming language, but they should be easy to adapt to any other low-level language.
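To make the task concrete before dropping down to Zig, here is the computation we're after, sketched in Python purely for illustration (everything benchmarked below is the Zig code):

```python
from collections import Counter

def codepoint_histogram(text: str) -> Counter:
    """Map each Unicode codepoint in `text` to its number of occurrences."""
    return Counter(ord(ch) for ch in text)

counts = codepoint_histogram("привет, hello")
# counts[ord("l")] == 2, counts[ord("п")] == 1
```

A high-level language hides all the interesting decisions here; the rest of the post is about which data structure to count into when you have to choose one yourself.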
Using an array
Let's start with the naive approach: an array with one bucket per symbol to hold the counts. Since a Unicode codepoint fits in 21 bits, we'll need \(2^{21}\) buckets. Because the array is very large, we will allocate it dynamically. In Zig, this means passing an allocator around. With an allocator, you can ask for some memory and later give it back.
```zig
pub fn histogramNaive(allocator: std.mem.Allocator, input: []const u8) !void {
    // make a heap-allocated array of appropriate size
    var histogram = try allocator.create([1 << 21]u32);
    // when the function returns, give the memory used for the array back
    defer allocator.destroy(histogram);
    // initialize it to all zeroes
    @memset(histogram, 0);
    // get an iterator. if the utf-8 is invalid, return an error
    var input_it = (try std.unicode.Utf8View.init(input)).iterator();
    // iterate over each codepoint
    while (input_it.nextCodepoint()) |cp| {
        // increase the count. return an error on overflow
        histogram[cp] = try std.math.add(u32, histogram[cp], 1);
    }
    // iterate the histogram
    for (0.., histogram) |codepoint, count| {
        // check whether we had any instance of this codepoint
        if (count != 0) {
            // discard the values so that the compiler doesn't optimize them out
            std.mem.doNotOptimizeAway(codepoint);
            std.mem.doNotOptimizeAway(count);
        }
    }
}
```
Using a hash map
When most programmers want to associate keys with values, they immediately think of hash maps. They work by using a hash function to transform keys into integers, which are then used as indices into an array. Zig has two hash maps in the standard library: std.HashMap and std.ArrayHashMap. Let's create a function for each of them. They only differ by one line.
```zig
pub fn histogramHashMap(allocator: std.mem.Allocator, input: []const u8) !void {
    // create a hash map that allocates with `allocator`
    // the hash function will be chosen ✨ automagically ✨
    var hash_map = std.AutoHashMap(u21, u32).init(allocator);
    // when the function returns, give the memory used for the hash map back
    defer hash_map.deinit();
    // get an iterator. if the utf-8 is invalid, return an error
    var input_it = (try std.unicode.Utf8View.init(input)).iterator();
    // iterate over each codepoint
    while (input_it.nextCodepoint()) |cp| {
        // insert the codepoint into the hash map or get an existing entry
        // return an error if there's no more memory
        const gop = try hash_map.getOrPut(cp);
        if (gop.found_existing) {
            // if there's already an entry for this codepoint, try to add one to it
            gop.value_ptr.* = try std.math.add(u32, gop.value_ptr.*, 1);
        } else {
            gop.value_ptr.* = 1;
        }
    }
    // get an iterator for the hash map
    var hash_map_it = hash_map.iterator();
    // iterate the hash map
    while (hash_map_it.next()) |entry| {
        // the key is the codepoint
        const codepoint = entry.key_ptr.*;
        // the value is the count
        const count = entry.value_ptr.*;
        // discard the values so that the compiler doesn't optimize them out
        std.mem.doNotOptimizeAway(codepoint);
        std.mem.doNotOptimizeAway(count);
    }
}

pub fn histogramArrayHashMap(allocator: std.mem.Allocator, input: []const u8) !void {
    // ...
    var hash_map = std.AutoArrayHashMap(u21, u32).init(allocator);
    // ...
}
```
Putting it all together
We've created implementations of two ways to build a histogram. Now it's time to benchmark them. My preferred method of benchmarking is to write a program that performs the task, then time it with an external tool like poop or hyperfine. So, let's write one!
```zig
pub fn main() !u8 {
    // initialize the state for our allocator
    var allocator_state = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // get our allocator
    const allocator = allocator_state.allocator();
    // get command-line arguments
    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);
    if (args.len < 3) {
        // user provided too few arguments
        usage(args) catch {};
        return 1;
    }
    const input = read_input: {
        // open the file for reading
        const in_file = try std.fs.cwd().openFile(args[1], .{});
        defer in_file.close();
        // read as much as we can from the file, but not more than 100 megs
        const buffer = try in_file.readToEndAlloc(allocator, 100 << 20);
        break :read_input buffer;
    };
    defer allocator.free(input);
    // get the mode
    const mode = std.meta.stringToEnum(enum { naive, hash_map, array_hash_map }, args[2]) orelse {
        usage(args) catch {};
        return 1;
    };
    switch (mode) {
        .naive => try histogramNaive(allocator, input),
        .hash_map => try histogramHashMap(allocator, input),
        .array_hash_map => try histogramArrayHashMap(allocator, input),
    }
    return 0;
}

fn usage(args: [][:0]const u8) !void {
    const stderr = std.io.getStdErr().writer();
    try stderr.print(
        \\Usage: {s} <file> <mode>
        \\Mode must be one of naive, hash_map, array_hash_map.
        \\
    ,
        .{if (args.len == 0) "benchmark" else args[0]},
    );
}
```
The final source file is available here.
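For anyone wanting to reproduce the measurements, a build-and-benchmark session might look roughly like this (the file name `hist.zig` and the build flags are my assumptions, not taken from the original setup):

```shell
# build an optimized binary (file name and flags assumed)
zig build-exe hist.zig -O ReleaseFast

# poop runs each command repeatedly and reports deltas against the first one
poop './hist text_ascii.txt naive' \
     './hist text_ascii.txt hash_map' \
     './hist text_ascii.txt array_hash_map'
```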
Results
Now we can finally run our program on some test inputs and benchmark it with poop.
Some ASCII text
```
Benchmark 1 (1038 runs): ./hist text_ascii.txt naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         4.44ms ± 490us     3.45ms … 6.85ms      6 ( 1%)   0%
  peak_rss          9.61MB ± 594KB     8.54MB … 10.7MB      0 ( 0%)   0%
  cpu_cycles        5.80M ± 333K       5.49M … 9.54M       58 ( 6%)   0%
  instructions      13.1M ± 145        13.1M … 13.1M        0 ( 0%)   0%
  cache_references  508K ± 18.2K       474K … 551K          0 ( 0%)   0%
  cache_misses      181K ± 14.6K       149K … 231K          3 ( 0%)   0%
  branch_misses     154 ± 25.2         110 … 229          107 (10%)   0%
Benchmark 2 (6655 runs): ./hist text_ascii.txt hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         446us ± 69.4us     376us … 2.97ms     488 ( 7%)   ⚡- 90.0% ± 0.3%
  peak_rss          1.18MB ± 1.08KB    1.11MB … 1.18MB      4 ( 0%)   ⚡- 87.7% ± 0.1%
  cpu_cycles        112K ± 4.52K       97.3K … 152K       428 ( 6%)   ⚡- 98.1% ± 0.1%
  instructions      139K ± 0.42        139K … 139K       1347 (20%)   ⚡- 98.9% ± 0.0%
  cache_references  749 ± 57.0         622 … 1.34K        290 ( 4%)   ⚡- 99.9% ± 0.1%
  cache_misses      52.2 ± 63.0        0 … 558            712 (11%)   ⚡-100.0% ± 0.2%
  branch_misses     544 ± 85.1         398 … 809          251 ( 4%)   💩+253.9% ± 3.4%
Benchmark 3 (6650 runs): ./hist text_ascii.txt array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         457us ± 69.6us     385us … 2.66ms     434 ( 7%)   ⚡- 89.7% ± 0.3%
  peak_rss          1.18MB ± 1.84KB    1.05MB … 1.18MB      3 ( 0%)   ⚡- 87.7% ± 0.1%
  cpu_cycles        127K ± 5.01K       97.8K … 183K       567 ( 9%)   ⚡- 97.8% ± 0.1%
  instructions      172K ± 0.40        172K … 172K       1120 (17%)   ⚡- 98.7% ± 0.0%
  cache_references  825 ± 59.3         692 … 1.44K        387 ( 6%)   ⚡- 99.8% ± 0.1%
  cache_misses      82.0 ± 103         0 … 621            832 (13%)   ⚡-100.0% ± 0.2%
  branch_misses     460 ± 63.7         347 … 653            4 ( 0%)   💩+199.3% ± 2.5%
```
Some Cyrillic text
```
Benchmark 1 (1041 runs): ./hist text_cyrillic.txt naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         4.43ms ± 506us     3.53ms … 9.59ms      5 ( 0%)   0%
  peak_rss          9.63MB ± 603KB     8.56MB … 10.7MB      0 ( 0%)   0%
  cpu_cycles        5.79M ± 246K       5.53M … 8.78M       30 ( 3%)   0%
  instructions      13.1M ± 147        13.1M … 13.1M        0 ( 0%)   0%
  cache_references  506K ± 18.4K       473K … 549K          0 ( 0%)   0%
  cache_misses      179K ± 14.7K       148K … 230K          4 ( 0%)   0%
  branch_misses     1.52K ± 390        414 … 1.80K        212 (20%)   0%
Benchmark 2 (6594 runs): ./hist text_cyrillic.txt hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         454us ± 59.8us     393us … 1.88ms     486 ( 7%)   ⚡- 89.7% ± 0.3%
  peak_rss          1.18MB ± 353       1.15MB … 1.18MB      1 ( 0%)   ⚡- 87.8% ± 0.2%
  cpu_cycles        167K ± 6.76K       118K … 211K        403 ( 6%)   ⚡- 97.1% ± 0.1%
  instructions      195K ± 0.44        195K … 195K       1323 (20%)   ⚡- 98.5% ± 0.0%
  cache_references  767 ± 58.9         628 … 1.37K        305 ( 5%)   ⚡- 99.8% ± 0.1%
  cache_misses      53.0 ± 66.9        0 … 565            735 (11%)   ⚡-100.0% ± 0.2%
  branch_misses     2.05K ± 187        806 … 2.42K        679 (10%)   💩+ 35.1% ± 1.0%
Benchmark 3 (6657 runs): ./hist text_cyrillic.txt array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         458us ± 62.6us     396us … 2.73ms     484 ( 7%)   ⚡- 89.7% ± 0.3%
  peak_rss          1.18MB ± 50.2      1.18MB … 1.18MB      1 ( 0%)   ⚡- 87.7% ± 0.2%
  cpu_cycles        178K ± 8.26K       119K … 246K        569 ( 9%)   ⚡- 96.9% ± 0.1%
  instructions      222K ± 0.41        222K … 222K       1140 (17%)   ⚡- 98.3% ± 0.0%
  cache_references  821 ± 55.9         702 … 1.42K        370 ( 6%)   ⚡- 99.8% ± 0.1%
  cache_misses      47.6 ± 68.6        0 … 643            625 ( 9%)   ⚡-100.0% ± 0.2%
  branch_misses     1.92K ± 179        655 … 2.24K        652 (10%)   💩+ 27.0% ± 0.9%
```
All valid codepoints
```
Benchmark 1 (70 runs): ./hist uniform naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         71.2ms ± 2.21ms    68.3ms … 80.6ms      1 ( 1%)   0%
  peak_rss          24.2MB ± 535KB     22.9MB … 24.9MB     10 (14%)   0%
  cpu_cycles        190M ± 6.19M       185M … 217M          2 ( 3%)   0%
  instructions      335M ± 160         335M … 335M          0 ( 0%)   0%
  cache_references  1.45M ± 24.7K      1.40M … 1.50M        1 ( 1%)   0%
  cache_misses      760K ± 7.61K       743K … 796K          2 ( 3%)   0%
  branch_misses     168K ± 220K        932 … 965K           1 ( 1%)   0%
Benchmark 2 (24 runs): ./hist uniform hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         211ms ± 10.8ms     200ms … 254ms        1 ( 4%)   💩+196.0% ± 3.8%
  peak_rss          55.2MB ± 721KB     53.6MB … 56.3MB      0 ( 0%)   💩+127.9% ± 1.1%
  cpu_cycles        585M ± 29.6M       561M … 709M          1 ( 4%)   💩+208.2% ± 3.9%
  instructions      949M ± 189         949M … 949M          0 ( 0%)   💩+183.1% ± 0.0%
  cache_references  17.8M ± 79.0K      17.6M … 17.9M        0 ( 0%)   💩+1124.9% ± 1.5%
  cache_misses      8.47M ± 226K       8.25M … 9.21M        2 ( 8%)   💩+1013.5% ± 7.0%
  branch_misses     3.47M ± 334K       3.09M … 4.23M        0 ( 0%)   💩+1964.7% ± 70.9%
Benchmark 3 (20 runs): ./hist uniform array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         251ms ± 5.48ms     241ms … 260ms        0 ( 0%)   💩+252.3% ± 2.3%
  peak_rss          79.9MB ± 1.23MB    78.6MB … 82.8MB      0 ( 0%)   💩+230.0% ± 1.5%
  cpu_cycles        692M ± 12.9M       672M … 718M          0 ( 0%)   💩+264.1% ± 2.2%
  instructions      824M ± 194         824M … 824M          0 ( 0%)   💩+145.8% ± 0.0%
  cache_references  16.7M ± 86.5K      16.6M … 17.0M        1 ( 5%)   💩+1050.1% ± 1.6%
  cache_misses      10.4M ± 220K       10.1M … 11.1M        2 (10%)   💩+1273.2% ± 6.8%
  branch_misses     2.09M ± 215K       1.78M … 2.63M        0 ( 0%)   💩+1144.5% ± 65.8%
```
Codepoints from 0 to 127 repeated until there are \(2^{21}\) of them
```
Benchmark 1 (95 runs): ./hist nonuniform naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         52.3ms ± 1.39ms    49.9ms … 56.8ms      6 ( 6%)   0%
  peak_rss          15.0MB ± 850KB     12.0MB … 16.3MB      4 ( 4%)   0%
  cpu_cycles        139M ± 2.51M       136M … 147M          5 ( 5%)   0%
  instructions      147M ± 99.5        147M … 147M          0 ( 0%)   0%
  cache_references  741K ± 15.4K       717K … 784K          0 ( 0%)   0%
  cache_misses      329K ± 9.15K       319K … 384K          6 ( 6%)   0%
  branch_misses     242 ± 32.8         188 … 346            3 ( 3%)   0%
Benchmark 2 (67 runs): ./hist nonuniform hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         75.2ms ± 1.65ms    71.5ms … 83.0ms      2 ( 3%)   💩+ 43.8% ± 0.9%
  peak_rss          4.66MB ± 609KB     3.72MB … 5.82MB      0 ( 0%)   ⚡- 68.9% ± 1.6%
  cpu_cycles        208M ± 4.38M       203M … 233M          1 ( 1%)   💩+ 50.0% ± 0.8%
  instructions      372M ± 98.8        372M … 372M          0 ( 0%)   💩+152.2% ± 0.0%
  cache_references  200K ± 13.9K       184K … 228K          0 ( 0%)   ⚡- 73.0% ± 0.6%
  cache_misses      109K ± 15.6K       70.7K … 133K         0 ( 0%)   ⚡- 66.8% ± 1.2%
  branch_misses     250K ± 12.7K       235K … 291K          4 ( 6%)   💩+102855.9% ± 1053.8%
Benchmark 3 (56 runs): ./hist nonuniform array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         89.6ms ± 3.45ms    85.7ms … 106ms       4 ( 7%)   💩+ 71.2% ± 1.5%
  peak_rss          4.58MB ± 618KB     3.65MB … 5.82MB      0 ( 0%)   ⚡- 69.4% ± 1.7%
  cpu_cycles        249M ± 9.23M       241M … 290M          4 ( 7%)   💩+ 79.2% ± 1.4%
  instructions      501M ± 101         501M … 501M          0 ( 0%)   💩+239.7% ± 0.0%
  cache_references  201K ± 13.4K       184K … 231K          0 ( 0%)   ⚡- 72.9% ± 0.7%
  cache_misses      111K ± 13.7K       76.4K … 132K         4 ( 7%)   ⚡- 66.1% ± 1.1%
  branch_misses     653K ± 4.71K       634K … 661K          2 ( 4%)   💩+269198.2% ± 390.0%
```
So what do these values mean?
Each row corresponds to one metric measured by poop:

- wall_time is the total real time taken by the program.
- peak_rss is the largest amount of memory the program used at any point.
- cpu_cycles is the number of CPU cycles performed.
- instructions is the number of instructions executed.
- cache_references is the number of times the CPU cache was accessed.
- cache_misses is the number of times the data was not in the CPU cache and had to be fetched from main memory. This is expensive.
- branch_misses is the number of times the branch predictor predicted a branch incorrectly, so the CPU had to roll back its state to the point of the branch and take the other path. This is also expensive.
Each column shows you information about the measurements taken of this metric:

- mean is the arithmetic mean of all measured values.
- σ is the standard deviation of all measured values.
- min is the smallest value measured.
- max is the largest value measured.
- outliers is the number of measurements more than one standard deviation away from the mean.
- delta is the difference relative to the first benchmark.
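To pin down these definitions, here is a small Python sketch (my own illustration, not poop's actual implementation) that computes the summary columns for a list of wall-time samples, counting outliers as samples more than one standard deviation from the mean:

```python
import statistics

def summarize(samples: list[float]) -> dict:
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return {
        "mean": mean,
        "sigma": sigma,
        "min": min(samples),
        "max": max(samples),
        # measurements more than one standard deviation away from the mean
        "outliers": sum(1 for x in samples if abs(x - mean) > sigma),
    }

summary = summarize([10.0, 11.0, 10.5, 10.2, 15.0])
# the 15.0 sample is the only one more than one σ from the mean
```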