StrangeCPU #3. Instruction Slides - The Strangest CPU Yet!
Decoding instructions with a Sliding Window system. 0-Bit Sliding Register Windows.
Table of Contents:
- Part 1: A new CPU - technology review, re-examination of the premises; StrangeCPU concepts; x86 notes.
- Part 2: Sliding-Window Token Machines, an in-depth exploration of this curious technology; ARM notes.
- Part 3. Instruction Slides - The Strangest CPU Yet! Decoding instructions with a Sliding Window system
- Part 4. Microcode Executing multi-cycle instructions
In the past articles we've examined a technique that allows 8- or 9-bit tokens to load 32-bit literals and make 32-bit calls. Today I will expand on this tecnique to perform yet another impossible feat - the same 8/9 bit tokens can be used to expand to 32-bit instructions!
You will also see what happens when the sliding window is shrunk to a single entry. Read on.
Sliding Windows Everywhere
I am very excited about the flurry of reader responses received following Part 2. Sliding Window Token Machines. Unfortunately, hackaday credited me with the discovery of the PC-relative jump... Needless to say, a number of you took the opportunity to remind me that I am an idiot (instead of reading the article). Luckily some of you actually read my article and realized that hackaday's description is wrong.
Please try to place your comments here, below this article. Some of your comments were very useful and should be available to all readers of these articles. It also allows me to answer your questions easily.
Sliding Window algorithms are a proven concept. Your TCP/IP stack is using one right now to transmit unlimited-sized streams using a fixed-size packet numbering scheme. A quick look some other Sliding Windows you may be familiar with first.
Sliding Register Windows
The venerable 6502 processor used a 256-byte zero page. Zero page, located at $0000 in RAM, was special. The 6502 allowed indirection through pointers in zero page. Like my bathroom window, Zero Page is a perfect example of a sliding window really stuck at the bottom.
6502's descendants and relatives, the 65CE02 and the 6809 had lubricated the window slide and allowed the zero page to be placed in any 256-byte-aligned page in RAM. This is an example of a very jerky slide.
The 65816 improved on the idea by allowing the zero page go be anywhere in RAM. With a true sliding window, the Sliding Window Token Machine should be easily implementable on the 65816.
Although personally I prefer stack machines, most processors feature a fixed register file. The SPARC architecture introduced the concept of a register file that is implemented as a sliding window. The window looked into a limited set of registers and required OS intervention to fill and spill the registers into/from RAM when the limit was hit (very quickly). This concept didn't really take on, although it makes a lot of sense and eliminates the senseless shuffling of registers and stack data between calls. Decompile some C code and see for yourself why your applications are huge and slow.
A sliding window of registers mapped into system RAM makes a lot of sense. The window slides upon subroutine entry, driving caller's parameters into the bottom of the register set and freeing the rest for private use (and parameters space for further calls). Since modern DRAMs are 'bursty', it's not a bad time to cache the register set into fast internal SRAM.
0-Bit Sliding Register Window
In the last article we talked about various token sizes – 8 bits, 9 bits, 16 bits. These present us with the window widths of 256, 512 and 64K. It's clear what happens as we expand the token size – the window is two to the power of token width. What happens at the other extreme? Let's reduce the token size to zero – no token at all. Our window size is now 1 (20); only one item is visible.
What use is that, you ask. Allow me to present the most commonly-used Sliding Window system – the one that uses 0-bit tokens. Since it is a degenerate form of the Sliding Window, with the tokens not even visible, it is usually not thought of as a Sliding Window at all. It is implemented in pretty much all modern CPUs. It is called 'the Stack'.
The only part of this system that is used is the slide; it is the stack pointer. The sliding action is automatic at runtime: generally, on read, we slide to the positive side after, to write, we slide negative first.
The old stack pointer is indeed a form of the sliding register window. By adding a few control bits we can access items deeper on the stack – widening the window so to speak.
Sliding Windows for Instructions
To generalize, Sliding Window algorithms can sometimes be used to map a small fixed-size selector onto a much larger data space by constraining access to a localized window. We've seen it used with the re-assembly of out-of-order TCP/IP packets using a fixed-size packet ID; accessing a large number of registers using a 5-bit index in the SPARC architecture, and selecting a subroutine address in any-sized memory using an 8-bit token in my previous articles.
Today I would like to present yet another use for a Sliding Window algorithm. This one will select CPU instructions from a much larger set of all possible instructions.
As soon as the CPU fetches an instruction, it must figure out how to execute it. This process is known as 'decoding'. The CPU has a number of units; each one is controlled by a set of control signals. Very simple CPUs have instructions that directly control these signals. Very complicated VLIW CPUs often do the same. Most CPUs you are likely to encounter today have complicated decoding circuits that eventually trigger the control signals.
Let's take a look at the instruction format of a simple stack processor, the J1 Forth CPU. This simple zero-operand stack machine processor is small and simple; it runs at 50MIPs in a Spartan3 FPGA. The Verilog source is 200 lines – I suggest you take a look at this amazing design by James Bowen.
The instruction is 16 bits:
The first 4 types of instructions deal with literals and control transfers. The fifth one is of interest here as it wires the instruction bits directly to control signals of different units. The bitfields are described here:
The ALU uses 4 bits to encode the following operations:
Finally, here are some examples of some operations that map directly onto Forth instructions:
Statistically speaking, not all instructions are created equal. Philip Koopman, in his book “Stack Computers, the New Wave” (a must-read for anyone interested in CPU design), quotes the following statistics of static instruction use:
Interesting, isn't it? A full three-quarters of the sample code uses only a dozen or so instructions.
Instruction usage exhibits a strong locality. A math package will make heavy use of ALU instructions; a group of subroutines that work on strings will require memory-access instructions with pointer incrementing and decrementing.
Finally, some combinations of the control fields are just not useful at all, and are not likely to be encountered in real code.
Since any CPU's instruction size is limited, and the number of control signals is large, most CPUs opt to restrict the number of combinations of control signals that can be used onto a much smaller subset of useful instructions. Other combinations which occasionally may be useful (such as incrementing a register while returning) are simply not accessible.
Very Large Instruction Word processors are an exception – they often present many more control signals right in the bitfields of their very wide instructions. These processors are not common – probably because they are hard to code for and debug, and gobble up enormous amounts of memory with these giant instructions.
The instruction decoder of a normal CPU is fixed. It simply maps the instruction token onto the control bits corresponding to various units that are engaged (or left to sit this instruction out). This can happen over a number of internal cycles. The mapping system can be as simple as a wide ROM (see the 6502 design). Instructions can be microcoded – a multi-cycle state machine (or another CPU) inside the CPU may have access to the control signals. But the meaning of tokens (instructions you see) is generally fixed.
Sliding Instruction Windows
Remember our magical implementation of Part 1? It new all about our call instructions simply by their locations (since the targets never change once compiled) – making the call instructions unnecessary altogether. In the real implementation we put the call tokens back in – after all something has to take up the byte of memory anyway.
The magical implementation is not limited to calls. It can know about instructions other than calls, since they don't change. Once we compile an add instruction at address $100 and the CPU knows about it, there is no reason to even keep it there... But enough foolishness – back to reality.
The same Sliding Window Token machinery can be applied to decode instructions. Since our machine is already marking Blue table entries with a bit to identify literals, we can add another bit to identify instructions. Actually, since FPGA memories have free parity bits, 32-bits is really 36 bits, four bits are free for such tagging, leaving room to spare. 3 out of 16 possible permutations are used so far:
Reference (call target)
A single 9-bit token can now index as a wide literal, a subroutine call, or a decoded instruction, depending on what it indexes.
Note that the CPU above needs only 16 control lines to function, although our table has 32 bits. The extra 16 bits are unused - but we can add other units and provide custom hardwired instructions at no additional cost to the decoder.
Will this really work?
In previous articles, instructions were handled using an escape token to switch into 'linear execution' mode. Alternatively, I suggested dedicating 32 or 64 of the possible 256/512 tokens to instructions for the stack machine.
The technique presented here is infinitely more flexible. Instead of taking a fixed number of tokens away from calls, it treats calls, literal loads and all other instructions in the same statistical maner, and is certainly no worse than the descriptions in parts 1 and 2.
Instructions are no longer fixed values – they index into the table. As the window slides, so do the instruction enumerations. Debugging is a little bit more interesting (but not much).
Instructions are subject to window underflow and overflow. The compiler should be smart enough to know that there is no more room in the table for new instructions and relocate the code accordingly. Similarly, if the instruction we need is not present in the window, the compiler should synthesize it and place it into the table.
What we gain is flexibility. Instructions that are rarely used are created only for these occasions, instead of occupying valuable instruction space. They remain accessible, for that time that you need to do a strange combination of things, such as increment the stack while performing a logical AND. Your code can be very efficient.
All instructions, not just calls, are now handled the same way, without escape tokens.
The Blue memory read takes time and introduces a latency. In the simple analysis, all instructions of our processor now require 2 cycles. That is not so bad - many processors require more than one internal cycle to complete an instruction. In future articles we will come back with ways to accelerate the execution of this processor.
What about the registers?
Instructions that include register bitfields seem to confound this architecture. However, because of ABI register conventions, register use is constrained to a subset of all possible register operations. Looking at ARM or MIPs architectures, it would be possible to use a 16-bit window to encode 32-bit instructions. This encoding would be superior to the Thumb architecture (which is a very limited subset of the ARM instruction set). I don't currently have the statistics to work with, but it's an interesting research direction.
It is also possible to exclude the registers from the sliding window system. An 8-bit bitfield can be reserved for the Sliding Window Token, while other bits are dedicated to addressing registers (hopefully using a Sliding Window system as well). If you are familiar with the operation of PicoBlaze from Xilinx, you will remember that every instruction includes a generous 'constant' field. Why not use this field for the sliding window index?
In previous articles, I described a fixed-size token processor that maps an 8 bit token to a 32-bit call target or a 32-bit literal using a Sliding Window algorithm. The processor is implementable as a VM in software and is quite useful as a Domain-Specific language.
Part 3 introduces the idea of instruction decoding using the same Sliding Window method, and the same shared Blue memory. Now we are in the probably in the FPGA domain, as complex instruction decoding is probably not worth doing in software using an off-the-shelf CPU. We will continue our exploration in the future articles.
Interesting, isn't it?
The concepts introduced here regarding the use of sliding window technologies for calls, literal loads and instruction decoding, are dedicated to the public domain and are free for your use. I am not aware of any patents or prior art regarding this subject. Please let me know if you decide to implement it and I will be happy to help.
Spartan3, PicoBlaze and Xilinx are registered trademarks of Xilinx Corporation.
What LZ has that your technique lacks is the ability to have back references (tokens in your terminology) that refer not just to individual entities but to contiguous ranges. I think it would be possible to fit this into your system if you used one of your four tag bits as a stop bit indicating that the red PC should advance to the next value.
Keep in mind that this is not a compression scheme, but a statically compiled table of pointers with a deterministic window placement algorithm. LZ 'runtime' copies runs of data from the table it builds on the fly. StrangeCPU works with a pre-built table of pointers to code and a pre-built code space. At runtime it substitutes 32-bit data for 8-bit tokens using this static information.
Thanks for your comment.
In an LZ compressed format, the byte stream consists of uncompressed entities and sliding-window'd back-references to multi-symbol sequences.
In StrangeCPU, the blue memory consists is a pool of individual decoded instructions. The red memory is a stream of uncompressed instructions and sliding window'd references to to instructions in blue memory.
The major differences are:
1) You have split streams for the memory containing the references and the memory being referenced. This seems like a good idea since it saves having to maintain a window-sized temporary buffer, which LZ needs.
2) When you encounter a reference (in red memory), that tells StrangeCPU to insert a single instruction pulled from the sliding window in blue memory. In LZ, by contrast, a reference can pull in an arbitrary sequence. I think you could enhance StrangeCPU to handle this. Consider this proposal:
* When StrangeCPU advances the red PC and encounters a blue-memory reference, it resolves the sliding window reference, places the address in a blue PC, and begins executing at that blue PC.
* When that instruction completes, the blue PC is incremented and execution from blue memory continues.
* This repeats until the blue PC points to a memory location with the STOP bit set.
* At that point, the red PC is incremented, and execution resumes from that location in red memory.
Keeping sequences of code in Blue memory and invoking them with a single token completely changes the semantics of processing... Do we now have two different kinds of subroutines - in different memory spaces? Just the thought of writing a compiler or even an assembler for this architecture makes me cringe (and a little bit excited, I must add:).
However, doing this defeats the purpose. StrangeCPU uses 8-bit tokens to refer to individual instructions. Only one of each instruction is required to be present in the window at execution. Because the window slides, an instruction placed at the top of a window for a given PC will still be available 16*511 or 8K later. Subroutines or sequences are better coded with the 9-bit RED memory, not the wide Blue memory. Once we go that way, we will have to duplicate lots of instructions to keep them in sequences.
There's definitely a lot of complications in my proposal. For example, you'd probably have to mandate that any branch instruction in blue memory must have the end bit set. I have no idea how you'd deal with the PC changing while in a blue-memory sequence.
I guess you can think of this as a special form of subroutine, but at a very fine granularity. You essentially have a single level call stack of free call stack with the call address being bounded by your sliding window.
If I had to write software for it, I'd tackle it in the assembler. I'd have the compiler emit a traditional stream of instructions, and then have the assembler do a linear pass over them. At each instruction, it looks for sequences within the currently visible window of which the following instructions are a tail. If it finds such, it replaces the sequence with the appropriate reference token instead. I think this is conceptually similar to how an LZ-based compressor (DEFLATE, gzip, etc) works.
A *real* LZ-based architecture seems impractical to me, because you'd need to keep around a temporary buffer containing the window a previously decompressed bits. Plus, to make it effective, you'd probably need to split your instruction stream into several, each composed of one of opcodes, register numbers, PC-relative offsets, and immediates. You'll get much better compression that way.
Technically, this requires a 2-1 mux, flop, and an increment logic on the Blue address input. That's 1/2 slice per bit with Spartan3, and all this happens in the first cycle. The ALU and other units operate during the second cycle. Eventually a Blue bit results in a NEXT operation getting another token.
A microcode routine can have more than one label inside, so a token can jump into the middle of it if a label is defined and if it makes sense.
I like this. There is plenty of time in the first cycle. And thinking of this as microcode makes it all work.
I do think it'd be neat if the assembler could discover repeated instruction sequences automatically, but I guess that is kind of heroic.
BTW , the J1 processor has a quirk - to post-increment the stack the assembler inserts an stack increment bits into the _next_ instruction...
To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.
Please login (on the right) if you already have an account on this platform.
Otherwise, please use this form to register (free) an join one of the largest online community for Electrical/Embedded/DSP/FPGA/ML engineers: