Programming for Maintainability 12, Error Code -5

Last time we talked about errors. And though I’m loath to think about anything for longer than a week at a time, errors are so core to our existence as developers that I feel we must talk about them.

(obviously, they’re core to your existence, my code has no bugs and therefore never errors)

In particular, I want to tell you the story about this error:

Error Code: -1074114855

For context, these days at work I’m playing a bit of customer support for a customer using a beta product in Japan. The particularly astute reader will know two things about me:

I live in America, and
I work on closed-source drivers.

The first of these means that my round-trip time to this customer is 24 hours, a full day. Longer than the round-trip time for JPL engineers working on the Mars rover.

The second of these means that I can’t simply give the customer the source code and say “hey, it looks like the error’s coming from here, hit it with a debugger.” I have to actually, you know, do my job.

So I crawl out of bed one morning and open up my workstation, and I have an email from the customer:

Hello Lane,

I hope we find you well. We have tried your product, but we receive this error:
<screenshot of dialog box saying "Error Code: -1074114855">

Sincerely,
The Customer

So I look up the error code in our error code database:

Error Code -1074114855: Internal Software Error.

Wünderbar.

I email back to the customer:

Dear Customer,

Please, right-click on that dialog box and tell me the error you receive.

Thank you,
Yours Truly

And I go to bed.

(Image stolen w/o permission from reddit.com/u/fatcatmikachu)

And then I wake up.

Hello Lane,

Enclosed please find a screenshot of the dialog box.

Sincerely,
The Customer

Error Code: -52050

Yes, the “more information” on the error code was another error code. It sounds odd (and, in practice, it only happened by accident), but remember this. We’ll come back to it.

The error database helpfully describes this error as a “file fault.”

So, 36 hours after the customer reported their blocking issue, we had figured out that somewhere, deep inside our hundred-thousand-line program, there was an error accessing a file. No idea which file, or what the error was.

Often, I (and a mentor of mine) describe code as having “multiple readers.” We addressed this in the very first post: Code is meant to be read by the compiler, by the reviewer, by future bug fixers…

Likewise, errors have multiple readers. They’re read by the customer, to teach them what went wrong, and they’re read by customer support, triaging what’s going on, and they’re read by developers, having to maintain code against bug reports that come in.

So we want our error codes to succinctly tell developers exactly where things went wrong, and what went wrong. And you’re probably already familiar with a very common method for doing this: tracebacks.

Traceback (most recent call last):
  File "/path/to/example.py", line 4, in <module>
    greet('Chad')
  File "/path/to/example.py", line 2, in greet
    print('Hello, ' + someon)
NameError: name 'someon' is not defined

(to steal an example from realpython.com, which is an amazing resource for learning Python)

Tracebacks are great - they tell you exactly what broke (“name ‘someon’ is not defined”), exactly where it broke (line 2 in greet), and roughly how you got there (called from line 4).
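
For reference, a traceback like the one above falls out of a script along these lines. This is only a minimal sketch: greet mirrors the function named in the traceback, while run_and_capture is a hypothetical helper added for illustration. It uses the standard-library traceback module to grab the text as a string, which is roughly what you would want to attach to a log file or a support ticket.

```python
import traceback


def greet(someone):
    # Deliberate typo: 'someon' is never defined, so this raises NameError.
    print('Hello, ' + someon)


def run_and_capture():
    """Call greet() and return the formatted traceback, or None on success."""
    try:
        greet('Chad')
    except NameError:
        # format_exc() renders the exception currently being handled.
        return traceback.format_exc()
    return None


report = run_and_capture()
print(report)
```

The captured text names the exception type, the function it came from, and the call that led there, which is the "what, where, and how" described above.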

Obviously, they aren’t perfect. They reveal source code. They have some overhead, in that you need to have various debug symbols.

For local development (i.e. narrowing down an error), there are some really wacky solutions as well.

But let’s go back to that notion of “an error code, where the ‘more info’ section is just another error code.” In today’s example, that wasn’t particularly useful: The top-level error code told us only that it was a SW error (which we already knew), and the second error code told us it was a file error (which was too vague). But what if we could marry these two in a good way? For example, the inner error code tells us what failed (it was a file fault), and the outer error tells us where it failed?

In fact, we could go even further. Code is incessantly nested: Functions call functions call functions, and you can easily go through 10-20 turtles all the way down to the hardware (and, yes, having worked at the firmware level, I can confirm: It is turtles all the way down). So an error can have multiple contexts as well: This was a file fault opening a file. We were trying to open an FPGA bitfile. We were opening an FPGA bitfile to see if the signature matched. We wanted to see if the signature matched because we were opening an FPGA session of device type X. Perhaps we’ll die.

This level of error granularity, and this ability to layer contexts, not only tells the user what may be useful to them (that there was an error accessing “mybitfile.fpga” while performing the signature check), but also tells me what I want to know: where to direct useful questions to the customer. Were they trying to open “mybitfile.fpga”? Does that file actually exist? Is it readable? Et cetera, et cetera.
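
To make the layering idea concrete, here is a rough sketch in Python. Everything in it is hypothetical and invented for illustration: the LayeredError class, the context helper, and the check_signature and open_session functions are not from any real driver. The idea is simply that each layer of the call stack appends one line of context as the error passes through on its way up.

```python
from contextlib import contextmanager


class LayeredError(Exception):
    """Hypothetical error type: one root cause plus a stack of contexts."""

    def __init__(self, cause):
        super().__init__(cause)
        self.cause = cause
        self.contexts = []  # innermost context first

    def __str__(self):
        lines = [f"error: {self.cause}"]
        lines += [f"  while {ctx}" for ctx in self.contexts]
        return "\n".join(lines)


@contextmanager
def context(description):
    """Attach `description` to any error that passes through this block."""
    try:
        yield
    except LayeredError as err:
        # Already layered: add our context and let it keep propagating.
        err.contexts.append(description)
        raise
    except OSError as err:
        # The "what": turn the low-level failure into a file fault.
        layered = LayeredError(f"file fault ({err.strerror})")
        layered.contexts.append(description)
        raise layered from err


def check_signature(path):
    with context(f"checking the signature of {path!r}"):
        with context(f"opening FPGA bitfile {path!r}"):
            with open(path, "rb") as bitfile:
                return bitfile.read(4)  # stand-in for a real signature check


def open_session(device_type, path):
    with context(f"opening an FPGA session for device type {device_type}"):
        return check_signature(path)


failure = None
try:
    open_session("X", "no-such-dir/mybitfile.fpga")
except LayeredError as err:
    failure = err
    print(failure)
```

Printed out, the error reads as a file fault followed by one "while ..." line per layer, innermost first. That is the sort of message that would have let me ask the customer about a specific file on the first round trip instead of the third.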

Of course, the takeaway from last week’s post was that you should use what your language provides. If you have exceptions with tracebacks, there’s not much you can do to inject this customer information. If you’re using C and returning integers, there’s only so much you can encode in an integer.
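
To put a number on "only so much" for the integer case, here is a sketch of cramming a "where" and a "what" into one signed 32-bit code. The bit layout and the ID values are assumptions made up for this example (they are not how any real error database is organized), and it is written in Python rather than C to match the traceback example above.

```python
WHAT_BITS = 16   # low 16 bits: what failed (hypothetical layout)
WHERE_BITS = 15  # next 15 bits: where it failed; the top bit marks an error

# Hypothetical IDs, invented for this example.
WHERE_BITFILE_SIGNATURE_CHECK = 3
WHAT_FILE_FAULT = 7


def pack_error(where, what):
    """Pack a location ID and a cause ID into a negative 32-bit error code."""
    if not 0 <= where < (1 << WHERE_BITS):
        raise ValueError("where does not fit in its field")
    if not 0 <= what < (1 << WHAT_BITS):
        raise ValueError("what does not fit in its field")
    unsigned = (1 << 31) | (where << WHAT_BITS) | what
    return unsigned - (1 << 32)  # reinterpret as a signed 32-bit value


def unpack_error(code):
    """Recover (where, what) from a code produced by pack_error."""
    unsigned = code & 0xFFFFFFFF
    where = (unsigned >> WHAT_BITS) & ((1 << WHERE_BITS) - 1)
    what = unsigned & ((1 << WHAT_BITS) - 1)
    return where, what


code = pack_error(WHERE_BITFILE_SIGNATURE_CHECK, WHAT_FILE_FAULT)
print(code, unpack_error(code))
```

Two small fields fit comfortably, but that is about it: there is no room for a file name, and no room for more than one layer of context.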

But, as you’re choosing your tooling and designing your error libraries, it pays off in the long run if you consider the expressiveness that your error scheme allows future maintainers of your code.

This blog series updates every week at Programming for Maintainability