Programming for Maintainability 5, Reinvent That Wheel!

So several months ago, I had to write this (C++) function:

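#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

#include <boost/archive/iterators/binary_from_base64.hpp>
#include <boost/archive/iterators/transform_width.hpp>

std::vector<uint8_t> extract_bitstream(const std::string& original_bitstream)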
{
    std::string encoded_bitstream(original_bitstream);

    // Strip the whitespace
    encoded_bitstream.erase(std::remove_if(encoded_bitstream.begin(),
                                encoded_bitstream.end(),
                                // unsigned char: std::isspace is undefined for negative values
                                [](unsigned char c) { return std::isspace(c); }),
        encoded_bitstream.end());

    // Decode the result
    namespace bai    = boost::archive::iterators;
    using base64_dec = bai::transform_width<bai::binary_from_base64<char*>, 8, 6>;

    std::vector<uint8_t> bitstream(base64_dec(encoded_bitstream.data()),
        base64_dec(encoded_bitstream.data() + encoded_bitstream.length()));

    // Remove null bytes that were formed from the padding
    const size_t pad_count =
        std::count(encoded_bitstream.begin(), encoded_bitstream.end(), '=');
    bitstream.erase(bitstream.end() - pad_count, bitstream.end());

    return bitstream;
}

Can you tell what it does? The comments help a bit. We get a string from the caller (which we aren’t allowed to modify, so we have to copy it). We call std::remove_if, which shifts the non-whitespace characters of encoded_bitstream to the front and returns an iterator to the new logical end, and we pass that iterator to erase, which trims the leftover tail. Together they erase all the whitespace in the string. Then we create a base64_dec type which, for reasons which were clear at the time, has the magic numbers 8 and 6 in it. (It’s a boost::archive::iterators::transform_width. Do you know offhand what transform_width does and what those arguments mean?) And then we use that to create a vector, we count the number of equal signs in the encoded bitstream, and we remove that many bytes from the end.
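To make the whitespace step concrete, here is a minimal standalone sketch of the erase-remove idiom using only the standard library. The name strip_whitespace is mine, for illustration; it is not part of the original function.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a copy of `input` with all whitespace removed.
std::string strip_whitespace(std::string input)
{
    // std::remove_if moves the characters we keep to the front and returns
    // an iterator to the new logical end. The string's size is unchanged.
    auto new_end = std::remove_if(input.begin(), input.end(),
        [](unsigned char c) { return std::isspace(c) != 0; });

    // erase() then chops off the leftover tail after the new logical end.
    input.erase(new_end, input.end());
    return input;
}
```

Note that nothing is actually removed until erase runs, which is exactly the kind of detail a reader has to already know to trust the original four-line incantation.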

That’s a lot of words. The person writing this code needs to think about every line. In practice, maybe they copy and paste the code from Stack Overflow without fully understanding it. The code could have bugs. They have to spend time thinking about whether the 8 and the 6 are the correct numbers, and then they have to spend time thinking about where the begin() and end() calls go to strip the whitespace, and then they have to test it with various amounts of = padding to make sure it works. Then the reviewer of the code has to double check all of that. Five years from now, somebody tracking down a bug will need to reverse-engineer the same reasoning.
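For what it’s worth, here is my understanding of where the 8 and the 6 come from: each base64 character carries 6 bits, each output byte holds 8, so four characters decode to three bytes. The sketch below just does that arithmetic. The function name decoded_size is hypothetical, and it assumes padded input whose length (after whitespace stripping) is a multiple of four.

```cpp
#include <cstddef>

constexpr std::size_t bits_per_base64_char = 6; // the "6": bits going in
constexpr std::size_t bits_per_byte        = 8; // the "8": bits coming out

// Hypothetical helper: number of real payload bytes in a padded base64
// string of `encoded_len` characters, of which `pad_count` are '='.
// Each '=' accounts for one byte that has to be dropped from the end.
constexpr std::size_t decoded_size(std::size_t encoded_len, std::size_t pad_count)
{
    return encoded_len * bits_per_base64_char / bits_per_byte - pad_count;
}
```

For example, decoded_size(8, 1) is 5, which matches "aGVsbG8=" decoding to the five bytes of "hello". This is the same bookkeeping the pad_count lines in the original function do after the fact.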

But it’s a conceptually simple operation: take a string, remove the whitespace, base64-decode it. That’s all this code does. That’s all the person writing the code, or the reviewer reviewing the code, or the programmer reading the code cares to know about this code. But the code does not say that; it forces us to think at an abstraction level that is too low.

There are multiple ways to fix this. You could make some functions:

std::vector<uint8_t> base64_decode(const std::string& s) {
    // Boost-y things
}

std::string remove_whitespace(const std::string& s) {
    // std::remove etc.
}

std::vector<uint8_t> extract_bitstream(const std::string& original_bitstream) {
    const auto bitstream = remove_whitespace(original_bitstream);
    return base64_decode(bitstream);
}

Although this is an improvement, the sub-functions still contain the problematic code. Each sub-function still incurs the cost of development, review, and maintenance.

Compare this to a language like Rust, which has the language-level tooling available to re-use code from the open-source community:

fn extract_bitstream(original: &str) -> Vec<u8> {
    let bitstream: String = original.chars().filter(|c| !c.is_whitespace()).collect();
    base64::decode(&bitstream).unwrap()
}

Four lines. Here, we leverage the open-source package base64. I am trusting that base64 doesn’t have bugs, much like I was trusting that the Stack Overflow answer containing that boost code didn’t contain bugs. This code is at the abstraction level I want: I don’t have to spend company time digging into the already-solved details of base64 padding.

The astute observer will note that the Rust code iterates over the full dataset fewer times than the C++ code. In production, the strings involved are on the order of 40MB in size: copying and even just iterating over them is slow. In the C++ example, we have to make a copy of the string before we can even iterate over it to remove the whitespace. Once we’ve done that, we have to iterate over it again to decode the base64, and then go back over it once more to count the equal signs. (That last pass is only there for convenience: std::count is more readable than an if statement that checks whether the last two characters are equal signs.)
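The cheaper padding check alluded to above might look something like this. It is a sketch only: count_padding is a name I made up, and it assumes the whitespace has already been stripped so that any = characters sit at the very end of the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts trailing '=' characters, looking at no more
// than the last two characters instead of scanning the whole string.
std::size_t count_padding(const std::string& encoded)
{
    std::size_t pad_count = 0;
    // Valid base64 ends in at most two '=' characters.
    for (auto it = encoded.rbegin();
         it != encoded.rend() && pad_count < 2 && *it == '=';
         ++it) {
        ++pad_count;
    }
    return pad_count;
}
```

It touches at most two characters instead of 40MB of them, at the cost of being yet another piece of low-level code somebody has to read and review.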

The Rust code can remove the whitespace while it’s copying the new string, and then makes one more pass over the data to decode the base64. Even if we perform the optimization mentioned above and drop the std::count pass, the C++ code iterates over the dataset three times (copy, strip, decode), while the Rust code does so only twice.
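To be fair to C++, the standard library can also filter while copying. The following is a rough sketch of that approach with std::remove_copy_if; the function name is hypothetical, and I have not benchmarked whether it closes the gap.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper: builds the whitespace-free copy in a single pass,
// much like the filter(...).collect() chain in the Rust version.
std::string copy_without_whitespace(const std::string& original)
{
    std::string stripped;
    stripped.reserve(original.size()); // upper bound, avoids reallocations

    // Copies every character for which the predicate is false.
    std::remove_copy_if(original.begin(), original.end(),
        std::back_inserter(stripped),
        [](unsigned char c) { return std::isspace(c) != 0; });

    return stripped;
}
```

This folds the copy and the strip into one pass, but notice that it is still more code to write, read, and review than the Rust one-liner.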

The takeaway is, in short, that sometimes we spend a large fraction of time on things that aren’t “business logic.” By leveraging appropriate languages and tooling, we can increase the proportion of our time spent on business logic. And then step 3, profit.

Update: After benchmarking on a 40MB string containing 11% (2 of 18 characters) whitespace, the C++ version clocks in at 476ms and the Rust version at 140ms. Eliminating the slow final std::count on the C++ version brings it to 400ms.

This blog series updates every week at Programming for Maintainability