StrangeCPU #4. Microcode
Sliding windows containing runs of microcode.
Table of Contents:
Part 1: A new CPU - technology review, re-examination of the premises; StrangeCPU concepts; x86 notes.
Part 2: Sliding-Window Token Machines, an in-depth exploration of this curious technology; ARM notes.
Let's Examine the Premises
In the first three articles of the StrangeCPU series I've described a strange CPU indeed – a CPU with fixed size opcodes (8 or 9 bits) that map to virtually any number of internal CPU instructions or jumps of any size. Opcodes are fetched from the 'Blue RAM'; the current meaning of each opcode (yes, it changes depending on where it is invoked) is looked up inside a sliding window in the 'Red RAM'. The window position is decided deterministically by the PC; the value obtained is used to control the internal CPU units.
The processor is similar to a VLIW (Very Large Instruction Word) machine in that the Red RAM is as wide as needed to contain a bit to control every internal unit of the CPU. Needless to say, VLIW processor instructions are very wide. The sliding window allows us to drop the size of each opcode back to 8 or 9 bits.
Of course nothing is free, but the tradeoff is not so bad. Keeping the entire instruction control word (say, 32 bits for a simple Stack Machine CPU) allows us to use any instruction at any time. Using an 8-bit sliding window algorithm (base=PC/16) presents us with a peculiar limitatation - any 4K region of Blue code RAM must not use more than 256 unique instructions. We can live with that.
Every 16 Blue opcodes, the window slides up by one Red unit. We lose the lowest (oldest) instruction and gain an empty slot for a new one. This gives us a chance to vary the instruction set of the processor smoothly, adjusting to the task at hand. Of course, should we desire to use the instruction we just lost, we can simply repeat it at the top of the table.
Let's look at a random instruction description at location Red $1000 for example. Using the sliding formula we can see that at location Blue $10000 our instruction is addressable as opcode $00. Same at Blue $10001, $10002, etc. But at location Blue $10010 our instruction is out of range – the base of the sliding window is now $1001. So Blue $1000F is the upper bound of the range for instruction Red $1000.
Since the window is 256-instructions wide, the lowest window base for our Red $1000 instruction to be visible is Red $0F01. The corresponding Blue location is $0F010; our instruction here, at the lower bound of the range, is encoded as opcode $FF.
The total Blue range within which each instruction is visible is therefore is $1000F - $0F010, or 4095. That is another constant of this system – each instruction as represented in the Red instruction memory is visible to a range or 4095 Blue bytes.
Looking at an instruction set of any CPU, it is clear that in addition to the actual opcodes, the processor requires in-line data such as branch targets and immediate values to operate on. Lucky for us, sliding window technologies excel at representing wide values with narrow tokens. In fact, the first installment introduced this technology by an elegantly representing branch targets with 8 bit tokens. StrangeCPU #3. Instruction Slides - The Strangest CPU Yet! expands on this concept by representing instructions in the same sliding window system.
When we think of CPU instructions we envision some encoding scheme with instruction groups, with the available bits split up between immediate values and instruction bits. Usually we are in a bind as the instruction width is pretty small; encoding large values is difficult.
Sliding Window architecture allows us to keep the opcodes/tokens small. The instruction width is now dependent on the Red memory which can be pretty wide. A natural FPGA fit is to use 9-bit Blue opcodes to select 36-bit instructions in Red RAM. This is wide enough to represent 32-bit addresses and data while keeping the secondary decoder simple.
4 bits at the top can be used to select one of 16 instruction patterns to differentiate calls, jumps, regular instructions and immediate instructions.
So it appears that all data can be represented in the Red memory. Not having inline data keeps the code really clean (and makes deterministic decompilation possible).
So far our processors followed a simple pattern of execution:
Fetch an opcode from Blue RAM
Look up an instruction from Red RAM window and execute.
My descriptions have implied that our Blue PC is incremented once every 2-step cycle unless a control transfer instruction (CTI) has been executed. Similarly, so far we've assumed that Red RAM is addressed implicitly as PC/16.
What if we place a second 'PC' on the Red RAM, allowing us to execute short sequences of Red instructions instead of just one? We can dedicate a Red instruction bit to 'Increment Blue PC and fetch' function. Now we can execute code as we always have, one Red instruction at a time if that bit is set. But if it's clear, we increment the Red PC to execute the next Red instruction, until we hit one with that bit set.
Of course this arrangement is a weird implementation of microcode. Well, we've been microcoding all along, but now we can call microcode 'subroutines' using a single opcode/token – making our code even denser.
Since our system allows us to specialize opcodes (and adjust them over time) microcoding is just another way to get really specialized functionality. Of course, by creating runs of microcode we take away from the valuable pool of range-addressable opcodes, so microcode must be used sparingly.
But what is the hardware price for this new ability? Oddly enough, the price is extremely low. Using Xilinx FPGAs we need ½ slice per bit to implement Red microcode PC:
a mux to select between Blue PC/16 and our own count
a flip-flop to hold the result
This is pretty much a textbook description of a Xilinx Spartan3 half-slice.
Microcode executes faster than regular instructions. Regular instructions always take 2 cycles to execute; runs of microcode need one cycle per Red instructions plus one cycle overhead (n+1).
Complaints and Misconceptions
Needless to say, some people are resistant to new ideas. This one drew a fair share of criticism, some of it valid. Here are a few examples:
Why bother? Hardware is cheap.
True enough. Our inflated world has spoiled us with exponential improvements with concomitant price drops. However, waste is never good, even if it is 'free' by virtue of being subsidized. Wider buses, bigger memories and larger chips translate to higher energy consumption, heating issues requiring large heatsinks and fans, longer boot time, flakier designs, etc. Someday common sense may prevail; if not – consider this a thought-provoking exercise.
This scheme wastes a whole cycle with the extra memory.
The act of decoding instructions is hardly wasteful. Many processors use more than one internal cycle to decode instructions. While I'd love to be able to decode instructions in 0 cycles, I will accept one cycle as a viable real-world alternative.
Why not add some bits to indicate when to 'slide' the window?
The window does not really slide, it jumps around with the PC. We have to be able to instantly determine the window base during a jump, so any scheme that relies on the window sliding sequentially cannot be used.
You just re-invented the PC-relative jump.
Hack-a-day attempted to summarize my ideas in a simplistic manner that pretty much describes a PC-relative jump. If you bother reading my articles, you will see that what I propose is completely different.
This concludes the four-part series describing the theory behind StrangeCPU. I am happy to have had the opportunity to share this idea with you. I hope that one of you will get excited about it enough to give it a try – it is pretty simple to implement. If you do please contact me as I'd love to hear about it (and perhaps assist you in your implementation).
In the future I intend to make a StrangeCPU FPGA implementation myself. I'll be sure to share my findings with you when I do.
The concepts introduced here regarding the use of sliding window technologies for calls, literal loads and instruction decoding, are dedicated to the public domain and are free for your use. I am not aware of any patents or prior art regarding this subject. Please let me know if you decide to implement it and I will be happy to help.
Spartan3, PicoBlaze and Xilinx are registered trademarks of Xilinx Corporation.
I've just analysed the (ARM) machine code for GCC as an example. It has 27,729 32-bit words in executable sections of the elf (I chose arm for the uniform 32-bit long instructions).
Of these 27,729 instructions, there are 12051 unique instructions. This is many more than I expected.
Does this lack of duplication show that there is not enough redundancy (in jumps, immediates and register choices) for a sliding windows (or any other form of instruction compression) to be worthwhile?
I would love to know your thoughts about this.
I am approaching FPGA programming (ad I was programming Forth some decades ago) and I would try to run Forth on J1 or on your sliding windows CPU
To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.
Registering will allow you to participate to the forums on ALL the related sites and give you access to all pdf downloads.