
PC and SP for a small CPU

Victor Yurkovsky, July 23, 2013

Ok, let's make a small stack-based CPU.

I will start where the rubber meets the road - the PC/stack subsystem that I like referring to as the 'legs'. As usual, I will present a design with a twist.

Not having a large design team, deadlines and million-dollar fab runs when designing CPUs creates a truly different environment. I can actually sit at the kitchen table and doodle around with CPU designs to my heart's content. I can try really ridiculous approaches, and work without a plan, just to see what happens. When something interesting happens, I can adjust the rest of my design to fit. I am an artist, man!

The Legs

When normal people (that is, not artists :) build CPUs, they will generally designate a register as a Program Counter (PC) and use it to address memory. The PC needs to be incremented normally; in addition it must support jumps and calls, so it is generally constructed as a loadable counter.
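For comparison with what follows, here is a minimal sketch of such a conventional loadable-counter PC. This is not the design I end up with; the module name pc_counter and its ports are made up for illustration.

```verilog
// Hypothetical conventional PC: a 16-bit loadable counter.
module pc_counter(
  input             clk,
  input             load,     // jump or call: take a new address
  input      [15:0] target,   // address to load when 'load' is set
  output reg [15:0] pc
);
  initial pc = 0;             // assumed power-up value
  always @(posedge clk)
    pc <= load ? target : pc + 16'd1;   // load wins over increment
endmodule
```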

For calls and returns, we use the Stack Pointer (SP) that addresses the memory, either the same one as the PC or a different one. SP can be either incremented or decremented.

The stack semantics dictate that the SP must be pre-decremented on push and post-incremented on pop (or the reverse). In spite of its apparent simplicity, this pre/post distinction can be tricky to implement. Some minimal implementations (the J1 stack processor, for one) give up and leave the post-increment for the next instruction (for the data stack anyway), leaving it up to the assembler to deal with the complexity.
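To make the pre/post distinction concrete, here is a rough sketch of a conventional 16-level stack with pre-decrement on push and post-increment on pop. The names (classic_stack, wr_addr and so on) are hypothetical, and I assume push and pop are never asserted together.

```verilog
// Hypothetical conventional stack: pre-decrement push, post-increment pop.
module classic_stack(
  input         clk,
  input         push,
  input         pop,
  input  [15:0] din,
  output [15:0] dout
);
  reg  [3:0]  sp = 0;              // assumed power-up value
  reg  [15:0] mem[0:15];
  wire [3:0]  wr_addr = sp - 4'd1; // push writes BELOW the current top

  assign dout = mem[sp];           // top of stack, read before any increment

  always @(posedge clk)
    if (push) begin
      mem[wr_addr] <= din;         // pre-decrement: write at sp-1 ...
      sp           <= wr_addr;     // ... and sp-1 becomes the new top
    end else if (pop)
      sp <= sp + 4'd1;             // post-increment: dout was read at old sp
endmodule
```

Note that the push path needs sp-1 as an address in the same cycle, while the pop path needs sp itself. That asymmetry is where the trickiness comes from.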

The interaction of the PC incrementor and the return address that winds up pushed onto the stack is yet another source of complexity that is hard to describe until you try to implement it. Suffice it to say that you have to either push an incremented address or increment the popped address to avoid running the same instruction twice. It is amazing how many real processors implement the PC/stack pointer subsystem in a clumsy way.
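Here is a sketch of the first option, pushing the incremented address, in a traditional design with a separate PC register and return stack. All names are hypothetical, and this is only one way a conventional design might do it.

```verilog
// Hypothetical conventional PC plus return stack; call saves pc+1.
module pc_with_calls(
  input             clk,
  input             call,
  input             ret,
  input      [15:0] target,
  output reg [15:0] pc
);
  reg [15:0] rstack[0:15];
  reg [3:0]  sp = 0;               // assumed power-up value
  initial pc = 0;

  always @(posedge clk)
    if (call) begin
      rstack[sp - 4'd1] <= pc + 16'd1;  // save the address AFTER the call
      sp <= sp - 4'd1;
      pc <= target;
    end else if (ret) begin
      pc <= rstack[sp];                 // already incremented, use as is
      sp <= sp + 4'd1;
    end else
      pc <= pc + 16'd1;
endmodule
```

Saving plain pc instead would send the return straight back to the call instruction, which is exactly the run-it-twice problem.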

The traditional PC/SP implementation impacts the rest of the processor in a very significant way. Both the PC and the SP need to address memory, often simultaneously. Given that requirement, we are faced with a hard choice to make - either dual-port the RAM or require multiple cycles for instructions. Traditionally, the first choice is not an option, but with FPGAs we could do it easily (although I am loath to do so for other reasons). The second choice is not attractive either, as it incurs a significant speed penalty and increases the complexity of the design.

Decoupled Stack

Luckily, there is a third alternative: decouple the stack memory. For minimal processors there is little reason to keep the stack in the same memory space as the code or data, especially if you are not planning on running C on them, and I have little interest in that.

A distributed RAM can be implemented very compactly on Xilinx chips: a single slice can house two 16x1-bit RAMs, one per LUT. This leads to a very compact stack memory - a 16-level 16-bit stack takes up only 8 slices!

But wait, it gets better. Each half-slice also has free incrementor logic. With that, we can eliminate the PC register altogether, and use the memory addressed by SP as PC.

This arrangement makes subroutine calling really easy. We don't have to push anything - the PC is on the stack to start with!

There are consequences to this decoupled approach. Since the stack memory is outside the normal memory space, it is inaccessible to regular memory reads. For instance, you cannot take an address of data on the return stack. Running out of the stack without a separate PC also makes it entirely impossible to store data on the return stack - there simply is no pathway to move data there. This is a little traumatic, as even Forth uses the return stack sometimes to store data. However, there are workarounds.

Let's implement the legs. I will break up the functionality into small modules - the map report will show 'utilization by hierarchy' to let us identify how big each module is.


First, the stack memory:
/******************************************************************************
  A 16-bit 16-level stack memory.
  Infers distributed RAM (16 x RAM16X1S: clocked write, asynchronous read).
  We write it every cycle with DIN and output DOUT, which may be incremented.
******************************************************************************/
module STACKRAM(
  input         C,
  input   [3:0] A,
  input  [15:0] DIN,
  output [15:0] DOUT,
  input inc
);
  reg [15:0] ram[0:15];
  assign DOUT = ram[A] + inc;
  always @(posedge C)
    ram[A] <= DIN;
endmodule

The Stack Pointer:
/******************************************************************************
  A 5 bit stack pointer
  There is no penalty for using it as a 4-bit pointer
******************************************************************************/
module SP(
  input C,
  input push,
  input pop,
  output [4:0]dout
);
  reg [4:0] SP;           //Stack Pointer
  reg [4:0] newsp;
  always @*               //combinational next-SP logic: depends on SP, push and pop
    case ({push,pop})
      2'b01:  newsp = SP+1;
      2'b10:  newsp = SP-1;
      default: newsp = SP;
    endcase 
  always @(posedge C)
    SP <= newsp;
  assign dout = newsp;
endmodule

And finally, the entire PC/SP module:
/******************************************************************************
  The complete PC/SP subsystem
******************************************************************************/
module PC(
  input clk,
  input [15:0] in,       //input vector data
  input inc,             //when set, increment PC
  input vec,             //when set, accept vector
  input push,            //push new value onto stack
  input pop,             //return value (increment SP for next cycle)
  output [15:0] out,
  output[3:0] addr
);
  //stack pointer - 2 slices...
  SP mysp(clk,push,pop,addr);

  wire [15:0] min;
  wire [15:0] mout;
  STACKRAM mem(clk,addr,min,mout,inc);
  //mux between direct input or old PC/inc
  assign min = vec? in : mout;
  assign out = min;   //output new address or inced old.
endmodule

Pretty simple. To test it I implemented the design on a Digilent Spartan S3 board. I connected the 4-digit display to the address bus, 8 sliding switches to the low byte of the vector input, and 3 buttons to signify jump, call and return. Running with a slow clock, I can watch my CPU incrementing the address, jumping or calling to a specified address, and returning to the original PC +1! The button instructions are decoded into control wires as follows:
  ...
  reg pc_inc, pc_vec, pc_push, pc_pop;
  //register the decoded controls; non-blocking assignments keep the
  //clocked block free of simulation races
  always @(posedge cpuclk) begin
    case (btn[3:1])
      3'b100: begin //jump
        pc_inc<=0;  pc_vec<=1; pc_push<=0; pc_pop<=0;
        end
      3'b010: begin //call
        pc_inc<=0;  pc_vec<=1; pc_push<=1; pc_pop<=0;
        end
      3'b001: begin //return
        pc_inc<=1;  pc_vec<=0; pc_push<=0; pc_pop<=1;
        end
      default: begin //increment PC
        pc_inc<=1;  pc_vec<=0; pc_push<=0; pc_pop<=0;
        end
    endcase
  end
    
  wire [3:0] sp;
  //switches for low 8 bits of vector 
  PC mypc(cpuclk,{8'h00,sw[7:0]},pc_inc,pc_vec,pc_push,pc_pop,ab,sp);
  ...
The tools report the size as 24 slices -- pretty close to optimal. SP should really fit into 2 slices...
| +mypc              |           | 10/24         |
| ++mem              |           | 9/9           |
| ++mysp             |           | 5/5           | 
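The same increment/jump/call/return sequence can also be checked in simulation. Below is a rough self-checking testbench sketch. The three modules are condensed copies of the listings above with small liberties: the SP copy uses always @*, its register is renamed sp_q, and the 5-bit pointer is truncated explicitly. The testbench names (tb, step, want, errors) are my own, and zeroing SP and the RAM through hierarchical references is an assumption standing in for FPGA power-up state.

```verilog
// Condensed copies of the article's three modules (see lead-in for changes).
module STACKRAM(input C, input [3:0] A, input [15:0] DIN,
                output [15:0] DOUT, input inc);
  reg [15:0] ram[0:15];
  assign DOUT = ram[A] + inc;          // asynchronous read, optional +1
  always @(posedge C) ram[A] <= DIN;   // written every cycle
endmodule

module SP(input C, input push, input pop, output [4:0] dout);
  reg [4:0] sp_q;                      // renamed from SP in this copy
  reg [4:0] newsp;
  always @*
    case ({push,pop})
      2'b01:   newsp = sp_q + 1;       // pop
      2'b10:   newsp = sp_q - 1;       // push
      default: newsp = sp_q;
    endcase
  always @(posedge C) sp_q <= newsp;
  assign dout = newsp;                 // next SP addresses the RAM right away
endmodule

module PC(input clk, input [15:0] in, input inc, input vec,
          input push, input pop, output [15:0] out, output [3:0] addr);
  wire [4:0]  spfull;
  wire [15:0] min, mout;
  SP mysp(clk, push, pop, spfull);
  assign addr = spfull[3:0];           // explicit truncation to 4 bits
  STACKRAM mem(clk, addr, min, mout, inc);
  assign min = vec ? in : mout;
  assign out = min;
endmodule

// Hypothetical testbench, not part of the original design.
module tb;
  reg         clk = 0;
  reg  [15:0] in  = 0;
  reg         inc = 0, vec = 0, push = 0, pop = 0;
  wire [15:0] out;
  wire [3:0]  addr;
  integer     i, errors;

  PC dut(clk, in, inc, vec, push, pop, out, addr);

  // Apply one instruction, check the new PC on 'out', then clock it in.
  task step(input [3:0] ctl, input [15:0] vector, input [15:0] want);
    begin
      {inc, vec, push, pop} = ctl;
      in = vector;
      #1;                              // let the combinational path settle
      if (out !== want) begin
        $display("FAIL: out=%h want=%h", out, want);
        errors = errors + 1;
      end
      clk = 1; #1; clk = 0; #1;        // one rising edge commits RAM and SP
    end
  endtask

  initial begin
    errors = 0;
    #1;
    // Assumption: mimic FPGA power-up by zeroing SP and the stack RAM.
    dut.mysp.sp_q = 0;
    for (i = 0; i < 16; i = i + 1) dut.mem.ram[i] = 0;
    step(4'b1000, 16'h0000, 16'h0001); // increment
    step(4'b1000, 16'h0000, 16'h0002); // increment
    step(4'b0100, 16'h0040, 16'h0040); // jump to 0x0040
    step(4'b1000, 16'h0000, 16'h0041); // increment: the call sits at 0x0041
    step(4'b0110, 16'h0100, 16'h0100); // call 0x0100
    step(4'b1000, 16'h0000, 16'h0101); // increment inside the subroutine
    step(4'b1001, 16'h0000, 16'h0042); // return lands on 0x0041 + 1
    if (errors == 0) $display("all checks passed");
    $finish;
  end
endmodule
```

The control word passed to step is {inc, vec, push, pop}, matching the button decoding above. The last step is the interesting one: the caller's slot still holds the address of the call instruction, so popping with inc set yields that address plus one.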
So there you have it. All that's left to do is to add the datastack, the ALUs and the instruction decoder....
