Fast histograms
In programming, you will often have to take a histogram of some data. Maybe you want to compress some data, equalize an image, crack a bad cipher, or identify a language. Since Unicode is the standard for text, let's count Unicode codepoints! I'll write all of the examples in Zig, as it is my preferred programming language, but they should be easy to adapt to any other low-level language.
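To make the task concrete before dropping down to Zig, here is the computation we're after, sketched in Python purely for illustration (everything benchmarked below is the Zig code):

```python
from collections import Counter

def codepoint_histogram(text: str) -> Counter:
    """Map each Unicode codepoint in `text` to its number of occurrences."""
    return Counter(ord(ch) for ch in text)

counts = codepoint_histogram("привет, hello")
# counts[ord("l")] == 2, counts[ord("п")] == 1
```

A high-level language hides all the interesting decisions here; the rest of the post is about which data structure to count into when you have to choose one yourself.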
Using an array
Let's start with the naive approach: an array with one bucket per symbol to hold the counts. Since a Unicode codepoint fits in 21 bits, we'll need \(2^{21}\) buckets. Because the array is very large, we will allocate it dynamically. In Zig, this means passing an allocator around. With an allocator, you can ask for some memory and later give it back.
```zig
pub fn histogramNaive(allocator: std.mem.Allocator, input: []const u8) !void {
    // make a heap-allocated array of appropriate size
    var histogram = try allocator.create([1 << 21]u32);
    // when the function returns, give the memory used for the array back
    defer allocator.destroy(histogram);
    // initialize it to all zeroes
    @memset(histogram, 0);
    // get an iterator. if the utf-8 is invalid, return an error
    var input_it = (try std.unicode.Utf8View.init(input)).iterator();
    // iterate over each codepoint
    while (input_it.nextCodepoint()) |cp| {
        // increase the count. return an error on overflow
        histogram[cp] = try std.math.add(u32, histogram[cp], 1);
    }
    // iterate the histogram
    for (0.., histogram) |codepoint, count| {
        // check whether we had any instance of this codepoint
        if (count != 0) {
            // discard the values so that the compiler doesn't optimize them out
            std.mem.doNotOptimizeAway(codepoint);
            std.mem.doNotOptimizeAway(count);
        }
    }
}
```
Using a hash map
When most programmers want to associate keys with values, they immediately think of hash maps. They work by using a hash function to transform keys into integers, which are then used as indices into an array. Zig has two hash maps in the standard library: std.HashMap and std.ArrayHashMap. Let's create a function for each of them. They only differ by one line.
```zig
pub fn histogramHashMap(allocator: std.mem.Allocator, input: []const u8) !void {
    // create a hash map that allocates with `allocator`
    // the hash function will be chosen ✨ automagically ✨
    var hash_map = std.AutoHashMap(u21, u32).init(allocator);
    // when the function returns, give the memory used for the hash map back
    defer hash_map.deinit();
    // get an iterator. if the utf-8 is invalid, return an error
    var input_it = (try std.unicode.Utf8View.init(input)).iterator();
    // iterate over each codepoint
    while (input_it.nextCodepoint()) |cp| {
        // insert the codepoint into the hash map or get an existing entry
        // return an error if there's no more memory
        const gop = try hash_map.getOrPut(cp);
        if (gop.found_existing) {
            // if there's already an entry for this codepoint, try to add one to it
            gop.value_ptr.* = try std.math.add(u32, gop.value_ptr.*, 1);
        } else {
            gop.value_ptr.* = 1;
        }
    }
    // get an iterator for the hash map
    var hash_map_it = hash_map.iterator();
    // iterate the hash map
    while (hash_map_it.next()) |entry| {
        // the key is the codepoint
        const codepoint = entry.key_ptr.*;
        // the value is the count
        const count = entry.value_ptr.*;
        // discard the values so that the compiler doesn't optimize them out
        std.mem.doNotOptimizeAway(codepoint);
        std.mem.doNotOptimizeAway(count);
    }
}

pub fn histogramArrayHashMap(allocator: std.mem.Allocator, input: []const u8) !void {
    // ...
    var hash_map = std.AutoArrayHashMap(u21, u32).init(allocator);
    // ...
}
```
Putting it all together
We've created implementations of two ways to build a histogram. Now it's time to benchmark them. My preferred method of benchmarking is to write a program that performs the task, then time it with an external tool like poop or hyperfine. So, let's write one!
```zig
pub fn main() !u8 {
    // initialize the state for our allocator
    var allocator_state = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // get our allocator
    const allocator = allocator_state.allocator();
    // get command-line arguments
    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);
    if (args.len < 3) {
        // user provided too few arguments
        usage(args) catch {};
        return 1;
    }
    const input = read_input: {
        // open the file for reading
        const in_file = try std.fs.cwd().openFile(args[1], .{});
        defer in_file.close();
        // read as much as we can from the file, but not more than 100 megs
        const buffer = try in_file.readToEndAlloc(allocator, 100 << 20);
        break :read_input buffer;
    };
    defer allocator.free(input);
    // get the mode
    const mode = std.meta.stringToEnum(enum { naive, hash_map, array_hash_map }, args[2]) orelse {
        usage(args) catch {};
        return 1;
    };
    switch (mode) {
        .naive => try histogramNaive(allocator, input),
        .hash_map => try histogramHashMap(allocator, input),
        .array_hash_map => try histogramArrayHashMap(allocator, input),
    }
    return 0;
}

fn usage(args: [][:0]const u8) !void {
    const stderr = std.io.getStdErr().writer();
    try stderr.print(
        \\Usage: {s} <file> <mode>
        \\Mode must be one of naive, hash_map, array_hash_map.
        \\
    ,
        .{if (args.len == 0) "benchmark" else args[0]},
    );
}
```
The final source file is available here.
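For anyone wanting to reproduce the measurements, a build-and-benchmark session might look roughly like this (the file name `hist.zig` and the build flags are my assumptions, not taken from the original setup):

```shell
# build an optimized binary (file name and flags assumed)
zig build-exe hist.zig -O ReleaseFast

# poop runs each command repeatedly and reports deltas against the first one
poop './hist text_ascii.txt naive' \
     './hist text_ascii.txt hash_map' \
     './hist text_ascii.txt array_hash_map'
```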
Results
Now we can finally run our program on some test inputs and benchmark it with poop.
Some ASCII text
```
Benchmark 1 (1038 runs): ./hist text_ascii.txt naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         4.44ms ± 490us     3.45ms … 6.85ms      6 ( 1%)   0%
  peak_rss          9.61MB ± 594KB     8.54MB … 10.7MB      0 ( 0%)   0%
  cpu_cycles        5.80M ± 333K       5.49M … 9.54M       58 ( 6%)   0%
  instructions      13.1M ± 145        13.1M … 13.1M        0 ( 0%)   0%
  cache_references  508K ± 18.2K       474K … 551K          0 ( 0%)   0%
  cache_misses      181K ± 14.6K       149K … 231K          3 ( 0%)   0%
  branch_misses     154 ± 25.2         110 … 229          107 (10%)   0%
Benchmark 2 (6655 runs): ./hist text_ascii.txt hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         446us ± 69.4us     376us … 2.97ms     488 ( 7%)   ⚡- 90.0% ± 0.3%
  peak_rss          1.18MB ± 1.08KB    1.11MB … 1.18MB      4 ( 0%)   ⚡- 87.7% ± 0.1%
  cpu_cycles        112K ± 4.52K       97.3K … 152K       428 ( 6%)   ⚡- 98.1% ± 0.1%
  instructions      139K ± 0.42        139K … 139K       1347 (20%)   ⚡- 98.9% ± 0.0%
  cache_references  749 ± 57.0         622 … 1.34K        290 ( 4%)   ⚡- 99.9% ± 0.1%
  cache_misses      52.2 ± 63.0        0 … 558            712 (11%)   ⚡-100.0% ± 0.2%
  branch_misses     544 ± 85.1         398 … 809          251 ( 4%)   💩+253.9% ± 3.4%
Benchmark 3 (6650 runs): ./hist text_ascii.txt array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         457us ± 69.6us     385us … 2.66ms     434 ( 7%)   ⚡- 89.7% ± 0.3%
  peak_rss          1.18MB ± 1.84KB    1.05MB … 1.18MB      3 ( 0%)   ⚡- 87.7% ± 0.1%
  cpu_cycles        127K ± 5.01K       97.8K … 183K       567 ( 9%)   ⚡- 97.8% ± 0.1%
  instructions      172K ± 0.40        172K … 172K       1120 (17%)   ⚡- 98.7% ± 0.0%
  cache_references  825 ± 59.3         692 … 1.44K        387 ( 6%)   ⚡- 99.8% ± 0.1%
  cache_misses      82.0 ± 103         0 … 621            832 (13%)   ⚡-100.0% ± 0.2%
  branch_misses     460 ± 63.7         347 … 653            4 ( 0%)   💩+199.3% ± 2.5%
```
Some Cyrillic text
```
Benchmark 1 (1041 runs): ./hist text_cyrillic.txt naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         4.43ms ± 506us     3.53ms … 9.59ms      5 ( 0%)   0%
  peak_rss          9.63MB ± 603KB     8.56MB … 10.7MB      0 ( 0%)   0%
  cpu_cycles        5.79M ± 246K       5.53M … 8.78M       30 ( 3%)   0%
  instructions      13.1M ± 147        13.1M … 13.1M        0 ( 0%)   0%
  cache_references  506K ± 18.4K       473K … 549K          0 ( 0%)   0%
  cache_misses      179K ± 14.7K       148K … 230K          4 ( 0%)   0%
  branch_misses     1.52K ± 390        414 … 1.80K        212 (20%)   0%
Benchmark 2 (6594 runs): ./hist text_cyrillic.txt hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         454us ± 59.8us     393us … 1.88ms     486 ( 7%)   ⚡- 89.7% ± 0.3%
  peak_rss          1.18MB ± 353       1.15MB … 1.18MB      1 ( 0%)   ⚡- 87.8% ± 0.2%
  cpu_cycles        167K ± 6.76K       118K … 211K        403 ( 6%)   ⚡- 97.1% ± 0.1%
  instructions      195K ± 0.44        195K … 195K       1323 (20%)   ⚡- 98.5% ± 0.0%
  cache_references  767 ± 58.9         628 … 1.37K        305 ( 5%)   ⚡- 99.8% ± 0.1%
  cache_misses      53.0 ± 66.9        0 … 565            735 (11%)   ⚡-100.0% ± 0.2%
  branch_misses     2.05K ± 187        806 … 2.42K        679 (10%)   💩+ 35.1% ± 1.0%
Benchmark 3 (6657 runs): ./hist text_cyrillic.txt array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         458us ± 62.6us     396us … 2.73ms     484 ( 7%)   ⚡- 89.7% ± 0.3%
  peak_rss          1.18MB ± 50.2      1.18MB … 1.18MB      1 ( 0%)   ⚡- 87.7% ± 0.2%
  cpu_cycles        178K ± 8.26K       119K … 246K        569 ( 9%)   ⚡- 96.9% ± 0.1%
  instructions      222K ± 0.41        222K … 222K       1140 (17%)   ⚡- 98.3% ± 0.0%
  cache_references  821 ± 55.9         702 … 1.42K        370 ( 6%)   ⚡- 99.8% ± 0.1%
  cache_misses      47.6 ± 68.6        0 … 643            625 ( 9%)   ⚡-100.0% ± 0.2%
  branch_misses     1.92K ± 179        655 … 2.24K        652 (10%)   💩+ 27.0% ± 0.9%
```
All valid codepoints
```
Benchmark 1 (70 runs): ./hist uniform naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         71.2ms ± 2.21ms    68.3ms … 80.6ms      1 ( 1%)   0%
  peak_rss          24.2MB ± 535KB     22.9MB … 24.9MB     10 (14%)   0%
  cpu_cycles        190M ± 6.19M       185M … 217M          2 ( 3%)   0%
  instructions      335M ± 160         335M … 335M          0 ( 0%)   0%
  cache_references  1.45M ± 24.7K      1.40M … 1.50M        1 ( 1%)   0%
  cache_misses      760K ± 7.61K       743K … 796K          2 ( 3%)   0%
  branch_misses     168K ± 220K        932 … 965K           1 ( 1%)   0%
Benchmark 2 (24 runs): ./hist uniform hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         211ms ± 10.8ms     200ms … 254ms        1 ( 4%)   💩+196.0% ± 3.8%
  peak_rss          55.2MB ± 721KB     53.6MB … 56.3MB      0 ( 0%)   💩+127.9% ± 1.1%
  cpu_cycles        585M ± 29.6M       561M … 709M          1 ( 4%)   💩+208.2% ± 3.9%
  instructions      949M ± 189         949M … 949M          0 ( 0%)   💩+183.1% ± 0.0%
  cache_references  17.8M ± 79.0K      17.6M … 17.9M        0 ( 0%)   💩+1124.9% ± 1.5%
  cache_misses      8.47M ± 226K       8.25M … 9.21M        2 ( 8%)   💩+1013.5% ± 7.0%
  branch_misses     3.47M ± 334K       3.09M … 4.23M        0 ( 0%)   💩+1964.7% ± 70.9%
Benchmark 3 (20 runs): ./hist uniform array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         251ms ± 5.48ms     241ms … 260ms        0 ( 0%)   💩+252.3% ± 2.3%
  peak_rss          79.9MB ± 1.23MB    78.6MB … 82.8MB      0 ( 0%)   💩+230.0% ± 1.5%
  cpu_cycles        692M ± 12.9M       672M … 718M          0 ( 0%)   💩+264.1% ± 2.2%
  instructions      824M ± 194         824M … 824M          0 ( 0%)   💩+145.8% ± 0.0%
  cache_references  16.7M ± 86.5K      16.6M … 17.0M        1 ( 5%)   💩+1050.1% ± 1.6%
  cache_misses      10.4M ± 220K       10.1M … 11.1M        2 (10%)   💩+1273.2% ± 6.8%
  branch_misses     2.09M ± 215K       1.78M … 2.63M        0 ( 0%)   💩+1144.5% ± 65.8%
```
Codepoints from 0 to 127 repeated until there are \(2^{21}\) of them
```
Benchmark 1 (95 runs): ./hist nonuniform naive
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         52.3ms ± 1.39ms    49.9ms … 56.8ms      6 ( 6%)   0%
  peak_rss          15.0MB ± 850KB     12.0MB … 16.3MB      4 ( 4%)   0%
  cpu_cycles        139M ± 2.51M       136M … 147M          5 ( 5%)   0%
  instructions      147M ± 99.5        147M … 147M          0 ( 0%)   0%
  cache_references  741K ± 15.4K       717K … 784K          0 ( 0%)   0%
  cache_misses      329K ± 9.15K       319K … 384K          6 ( 6%)   0%
  branch_misses     242 ± 32.8         188 … 346            3 ( 3%)   0%
Benchmark 2 (67 runs): ./hist nonuniform hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         75.2ms ± 1.65ms    71.5ms … 83.0ms      2 ( 3%)   💩+ 43.8% ± 0.9%
  peak_rss          4.66MB ± 609KB     3.72MB … 5.82MB      0 ( 0%)   ⚡- 68.9% ± 1.6%
  cpu_cycles        208M ± 4.38M       203M … 233M          1 ( 1%)   💩+ 50.0% ± 0.8%
  instructions      372M ± 98.8        372M … 372M          0 ( 0%)   💩+152.2% ± 0.0%
  cache_references  200K ± 13.9K       184K … 228K          0 ( 0%)   ⚡- 73.0% ± 0.6%
  cache_misses      109K ± 15.6K       70.7K … 133K         0 ( 0%)   ⚡- 66.8% ± 1.2%
  branch_misses     250K ± 12.7K       235K … 291K          4 ( 6%)   💩+102855.9% ± 1053.8%
Benchmark 3 (56 runs): ./hist nonuniform array_hash_map
  measurement       mean ± σ           min … max          outliers    delta
  wall_time         89.6ms ± 3.45ms    85.7ms … 106ms       4 ( 7%)   💩+ 71.2% ± 1.5%
  peak_rss          4.58MB ± 618KB     3.65MB … 5.82MB      0 ( 0%)   ⚡- 69.4% ± 1.7%
  cpu_cycles        249M ± 9.23M       241M … 290M          4 ( 7%)   💩+ 79.2% ± 1.4%
  instructions      501M ± 101         501M … 501M          0 ( 0%)   💩+239.7% ± 0.0%
  cache_references  201K ± 13.4K       184K … 231K          0 ( 0%)   ⚡- 72.9% ± 0.7%
  cache_misses      111K ± 13.7K       76.4K … 132K         4 ( 7%)   ⚡- 66.1% ± 1.1%
  branch_misses     653K ± 4.71K       634K … 661K          2 ( 4%)   💩+269198.2% ± 390.0%
```
So what do these values mean?
Each row corresponds to one metric measured by poop:

- wall_time is the total real time taken by the program.
- peak_rss is the largest amount of memory the program used at any point.
- cpu_cycles is the number of CPU cycles performed.
- instructions is the number of instructions executed.
- cache_references is the number of times the CPU cache was accessed.
- cache_misses is the number of times the data was not in the CPU cache and had to be fetched from main memory. This is expensive.
- branch_misses is the number of times the branch predictor predicted a branch incorrectly, so the CPU had to roll back its state to the point of the branch and take the other path. This is also expensive.
Each column shows you information about the measurements taken of this metric:

- mean is the arithmetic mean of all measured values.
- σ is the standard deviation of all measured values.
- min is the smallest value measured.
- max is the largest value measured.
- outliers is the number of measurements more than one standard deviation away from the mean.
- delta is the difference relative to the first benchmark.
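To pin down these definitions, here is a small Python sketch (my own illustration, not poop's actual implementation) that computes the summary columns for a list of wall-time samples, counting outliers as samples more than one standard deviation from the mean:

```python
import statistics

def summarize(samples: list[float]) -> dict:
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return {
        "mean": mean,
        "sigma": sigma,
        "min": min(samples),
        "max": max(samples),
        # measurements more than one standard deviation away from the mean
        "outliers": sum(1 for x in samples if abs(x - mean) > sigma),
    }

summary = summarize([10.0, 11.0, 10.5, 10.2, 15.0])
# the 15.0 sample is the only one more than one σ from the mean
```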