Time is the Hunter: An HDL for the Future?

Published May 22, 2022 by Pillow Computing Consortium at /blog/2022-05-22-time-is-the-hunter/

Time. It’s an amazing song by Hans Zimmer. It’s an inexorable passage. It’s many things to many people.

Today, we’ll talk about the second one. And how it’s portrayed in programming languages.

In most programming languages, time is implicit, and subsequent instructions are “later” in time. Of course, compiler optimizations and hardware reordering can change when instructions actually execute, but from the perspective of the language, execution proceeds linearly. For example, in this C code:

int a = 5;
int b = 6;
b = a + b;
printf("%d\n", b);

Each statement executes one after the other. This program prints out 11.

This is how many languages are. But it’s not the only way to do it. Languages like Haskell and Prolog eschew the notion of time altogether: there is no ordering in time, not even an implicit one like in the languages described above. This leads to confusion among many programmers when they need to make their purely functional language interact with the real world.

Because their programming language has a different representation of time than the platform (CPU) they’re targeting.

Of course, there exist targets for which neither of these representations is ideal. FPGAs are one such target. Time on an FPGA is not linearly sequential. But it also cannot be ignored: much of what FPGAs do has to do with strict, cycle-accurate timing. Output a pulse on this edge. Copy these bits all at the same time. Read these pins.

One of the two main languages for FPGAs is Verilog. Here’s some sample Verilog code (courtesy of Alchitry):

module servo (
    input clk,
    input rst,
    input [7:0] position,
    output servo
  );
 
  reg pwm_d, pwm_q;
  reg [19:0] ctr_d, ctr_q;
 
  assign servo = pwm_q;
 
  always @(*) begin
    ctr_d = ctr_q + 1'b1;
 
    if (position + 9'd165 > ctr_q[19:8])
      pwm_d = 1'b1;
    else
      pwm_d = 1'b0;
  end
 
  always @(posedge clk) begin
    if (rst) begin
      ctr_q <= 1'b0;
    end else begin
      ctr_q <= ctr_d;
    end
 
    pwm_q <= pwm_d;
  end
 
endmodule

This code performs a fairly simple function. It takes three inputs, clk, rst, and position, and outputs a single bit, servo, which is high for anywhere between roughly 50,000 and 100,000 positive edges of clk.

That is, imagine clk is going up and down repeatedly. If position is 0x0 (i.e., eight zero bits), then servo will be high for roughly 50,000 ticks of clk, and low for the rest of the roughly one-million-tick cycle. If position is 0xff (i.e., eight one bits), then servo will be high for roughly 100,000 ticks, and low for the rest of the cycle.
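Where do those numbers come from? Working it out from the constants in the code above (the exact figures are a bit wider than the round 50,000 and 100,000, but that is the idea):

servo is high while  ctr_q[19:8] < position + 165
i.e. while           ctr_q       < (position + 165) * 256

position = 0x00:  165 * 256 = 42,240  ticks high per 2^20 (1,048,576) tick period
position = 0xff:  420 * 256 = 107,520 ticks high per 2^20 (1,048,576) tick period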

Simple enough. There are two different assignment operators, <= and =, and these represent two different concepts of time. = is what’s called a “blocking assignment”: these statements execute one after another, like in C. <= is a “non-blocking assignment”: within a block, these all take effect simultaneously. For example:

ctr_d = ctr_q + 1'b1;

if (position + 9'd165 > ctr_q[19:8])
  pwm_d = 1'b1;
else
  pwm_d = 1'b0;

These statements execute in order: the assignment to ctr_d happens first, so any statement after it in the block sees the updated value ctr_q + 1'b1. However, for this code:

if (rst) begin
  ctr_q <= 1'b0;
end else begin
  ctr_q <= ctr_d;
end

pwm_q <= pwm_d;

these assignments all happen simultaneously: every right-hand side is evaluated first, and then all of the registers update together at the clock edge.
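A classic way to see the difference between the two operators is swapping two registers (a made-up fragment, not part of the servo example):

reg a, b;

// Non-blocking: both right-hand sides are sampled before either register
// updates, so a and b swap values on every clock edge.
always @(posedge clk) begin
  a <= b;
  b <= a;
end

// Blocking: the statements run in order, so the second one reads the value
// just written to a. After the edge, a and b are both equal to the old b.
// always @(posedge clk) begin
//   a = b;
//   b = a;
// end

Same two lines, two very different circuits.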

So in order to use Verilog effectively, you must constantly switch between two different concepts of time as you program. Additionally, you must consider the effect of the always condition: always @(posedge clk) means the block happens only at a positive clock edge, while always @(*) means the block re-runs whenever any signal it reads changes, which is to say it is effectively always happening.

So you must not only switch between different types of time, you must switch between different… times. Statements can be happening one after another, on clock edges. Or they can be happening simultaneously, always. Or a combination: Some statements can be happening before other statements, while some statements are happening simultaneously.

And what do we gain for this complexity? State. This construct:

// Combinational: continuously compute the next value
always @(*) begin
  variable_d = variable_q + 1;
end

// Sequential: capture that value into the register on each clock edge
always @(posedge clk) begin
  variable_q <= variable_d;
end

is how state is implemented. At any given point, variable_d equals variable_q + 1. Exactly on a clock edge, variable_q is set to whatever variable_d is (i.e., variable_q + 1). So, combined, variable_* keeps incrementing.

So here’s a quiz for you. If you were to add more code to this - should you use variable_d, or variable_q? What are the implications of either one?

I don’t know either. Let’s compare to how C increments a variable:

int variable = 0;
while (1) {
  variable += 1;
}

Simple! Just keep incrementing the variable forever! variable always has the current value.

Or at least, so said the industry that pushes “high-level synthesis,” or HLS: compilers that translate C or C++ (or other “high-level” languages) into an HDL. They want to tap into the vast pool of C programmers to write FPGA code.

Except it isn’t that easy. I maintain that any programmer worth their salt can, in the span of months, pick up a new language as long as they understand the underlying fundamentals.

But if you don’t understand the underlying fundamentals, it doesn’t matter whether you know the syntax: you won’t be an effective programmer. All that is to say, HLS is dumb.

We need something better. And we can do better. In a world of discrete, digital logic, made of clocks and their domains, we can steal syntax from the mathematics world and express our above recurrence (that the variable keeps incrementing) like this:

variable[T] = variable[T-1] + 1;

No need to track different variable_d and variable_q variables. Additionally, with multiple variables, there’s no confusion about which order statements “run” in:

counter[T] = counter[T-1] + 1;
variable[T] = counter[T]; // Versus counter[T-1]
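For intuition, here’s roughly how those two choices map back onto the Verilog d/q idiom from earlier (a sketch, assuming one register per variable, with widths chosen arbitrarily): reading counter[T] corresponds to the combinational d signal, and counter[T-1] corresponds to the registered q signal. This is exactly the variable_d-versus-variable_q quiz from before, except the syntax now says which one you mean.

reg [7:0] counter_d, counter_q, variable_d, variable_q;

always @(*) begin
  counter_d  = counter_q + 1'b1;  // counter[T] = counter[T-1] + 1
  variable_d = counter_d;         // variable[T] = counter[T]   (this cycle's value)
  // variable_d = counter_q;      // variable[T] = counter[T-1] (last cycle's value)
end

always @(posedge clk) begin
  counter_q  <= counter_d;
  variable_q <= variable_d;
end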

Since HDLs actually have no side effects, everything can be purely functional. For example, if statements could return values:

// This counter counts up to 100
counter[T] = if counter[T-1] == 100 {
    0
} else {
    counter[T-1] + 1
};
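For reference, the closest Verilog gets to this today is the conditional operator inside an always block (a sketch, reusing the d/q naming from earlier):

always @(*) begin
  // Count up to 100, then wrap back to 0
  counter_d = (counter_q == 8'd100) ? 8'd0 : counter_q + 1'b1;
end

That works for a single condition, but longer chains of else-ifs turn into nested ternaries or case statements, whereas the if-expression above stays readable.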

Of course, we want to retain the closeness to hardware, and the variable declarations would represent that:

// counter is an 8-bit counter
bits[7:0] counter;

counter[T] = counter[T-1] + 1;

and you would need to be able to sub-index variables:

bits[7:0] counter;
bits[15:0] variable;

counter[T] = counter[T-1] + 1;

// Duplicate counter in the high and low bytes of variable
variable[T][7:0] = counter[T];
variable[T][15:8] = {counter[T][3:0], counter[T][7:4]}; // Swap the nibbles of the high byte

So with this, how do we implement the servo code from above? Maybe something like:

module servo (
    // clk is the clock
    clock clk,
    // rst is a single bit input
    bit rst,
    bits[7:0] position,
) -> ( // Outputs are specified in a functional "return" of sorts
    bit servo,
);
  bit pwm;
  bits[19:0] ctr;

  servo[T] = pwm[T-1];

  ctr[T] = if rst {
    0
  } else {
    ctr[T-1] + 1
  };

  // Note that pwm[T] is based on ctr[T]
  pwm[T] = position[T] + 9'd165 > ctr[T][19:8];
endmodule

In terms of strict lines of code, we’re down from 31 lines to 21 lines, roughly a 30% reduction, even with three additional comments added. We have simpler variables. All the code that affects a variable is in one place: it isn’t spread out between multiple always blocks. There’s a single system of time used throughout: all state updates on every “tick” of the clock. And it’s clear whether variables depend on “this” cycle’s values or the previous cycle’s.

Of course, there are potential disadvantages to this syntax too. For example, you could imagine something like this:

servo[T] = pwm[T-1000];

This is a single line of code. Yet, it generates a thousand flip-flops! So a single line of code can correspond to a large amount of synthesized area. But on the flip side, this allows writing FIR filters very intuitively:

output[T] = a[T] * input[T] + b[T] * input[T-1] + c[T] * input[T-2];
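To make the area point concrete: in today’s Verilog, that single pwm[T-1000] line would have to be written out as an explicit 1,000-deep shift register, something like this sketch (signal names are made up):

reg [999:0] pwm_delay;

always @(posedge clk) begin
  pwm_delay <= {pwm_delay[998:0], pwm};  // shift in this cycle's pwm bit
end

assign servo = pwm_delay[999];           // the pwm value from 1,000 cycles ago

It’s one flip-flop per cycle of delay either way; the proposed syntax just hides the bookkeeping.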

Above, we defined a module. But what might using that module look like? Well, everything is an expression, so perhaps something like this:

module main(
    clock clk,
    bit rst,
) -> (
    bit servo_output,
);

  bits[27:0] position;

  position[T] = position[T-1] + 1;
  servo_output[T] = servo(clk, rst, position[T][27:20]);

endmodule

if we wanted the position to ramp up and wrap around after ~268 million cycles.
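For comparison, today’s Verilog can’t treat a module as an expression; the same hookup is a named instantiation with explicit port connections (a sketch, with reset of the position counter omitted for brevity):

module main (
    input clk,
    input rst,
    output servo_output
  );

  reg [27:0] position_d, position_q;

  always @(*)           position_d = position_q + 1'b1;
  always @(posedge clk) position_q <= position_d;

  // The servo module is wired up by name, not called like a function
  servo servo_inst (
    .clk(clk),
    .rst(rst),
    .position(position_q[27:20]),
    .servo(servo_output)
  );

endmodule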

Of course, there’s a lot more to be defined to make a workable hardware description language, and a lot of trial and error along the way. For example, what’s the right way to handle state machines? How should multiple clock domains be handled? What about user-defined types - can a user define a struct like this?

struct my_struct {
  bits [7:0] counter;
  bit is_upcounting;
}

module update_mystruct(
  clock clk,
  bit rst,
  my_struct counter,
) -> (
  my_struct counter_out,
);

  // Count up to 100, then back down to 0, and repeat
  counter_out[T].is_upcounting = if rst {
    0
  } else if counter[T].counter == 100 {
    0
  } else if counter[T].counter == 0 {
    1
  } else {
    counter[T].is_upcounting
  };

  counter_out[T].counter = if rst {
    0
  } else if counter[T].is_upcounting {
    counter[T].counter + 1
  } else {
    counter[T].counter - 1
  };

endmodule

// Usage:
bits[7:0] count;
my_struct foo;

foo[T] = update_mystruct(clk, rst, foo[T-1]);
count[T] = foo[T-1].counter;

Lane Kolbly


© 2022 Pillow Computing Consortium