Would you like to be notified by email when Victor Yurkovsky publishes a new blog?
Decoding instructions with a Sliding Window system. 0-Bit Sliding Register Windows.
In the past articles we've examined a technique that allows 8- or 9-bit tokens to load 32-bit literals and make 32-bit calls. Today I will expand on this tecnique to perform yet another impossible feat - the same 8/9 bit tokens can be used to expand to 32-bit instructions!
You will also see what happens when the sliding window is shrunk to a single entry. Read on.
I am very excited about the flurry of reader responses received following Part 2. Sliding Window Token Machines. Unfortunately, hackaday credited me with the discovery of the PC-relative jump... Needless to say, a number of you took the opportunity to remind me that I am an idiot (instead of reading the article). Luckily some of you actually read my article and realized that hackaday's description is wrong.
Please try to place your comments here, below this article. Some of your comments were very useful and should be available to all readers of these articles. It also allows me to answer your questions easily.
Sliding Window algorithms are a proven concept. Your TCP/IP stack is using one right now to transmit unlimited-sized streams using a fixed-size packet numbering scheme. A quick look some other Sliding Windows you may be familiar with first.
The venerable 6502 processor used a 256-byte zero page. Zero page, located at $0000 in RAM, was special. The 6502 allowed indirection through pointers in zero page. Like my bathroom window, Zero Page is a perfect example of a sliding window really stuck at the bottom.
6502's descendants and relatives, the 65CE02 and the 6809 had lubricated the window slide and allowed the zero page to be placed in any 256-byte-aligned page in RAM. This is an example of a very jerky slide.
The 65816 improved on the idea by allowing the zero page go be anywhere in RAM. With a true sliding window, the Sliding Window Token Machine should be easily implementable on the 65816.
Although personally I prefer stack machines, most processors feature a fixed register file. The SPARC architecture introduced the concept of a register file that is implemented as a sliding window. The window looked into a limited set of registers and required OS intervention to fill and spill the registers into/from RAM when the limit was hit (very quickly). This concept didn't really take on, although it makes a lot of sense and eliminates the senseless shuffling of registers and stack data between calls. Decompile some C code and see for yourself why your applications are huge and slow.
A sliding window of registers mapped into system RAM makes a lot of sense. The window slides upon subroutine entry, driving caller's parameters into the bottom of the register set and freeing the rest for private use (and parameters space for further calls). Since modern DRAMs are 'bursty', it's not a bad time to cache the register set into fast internal SRAM.
In the last article we talked about various token sizes – 8 bits, 9 bits, 16 bits. These present us with the window widths of 256, 512 and 64K. It's clear what happens as we expand the token size – the window is two to the power of token width. What happens at the other extreme? Let's reduce the token size to zero – no token at all. Our window size is now 1 (20); only one item is visible.
What use is that, you ask. Allow me to present the most commonly-used Sliding Window system – the one that uses 0-bit tokens. Since it is a degenerate form of the Sliding Window, with the tokens not even visible, it is usually not thought of as a Sliding Window at all. It is implemented in pretty much all modern CPUs. It is called 'the Stack'.
The only part of this system that is used is the slide; it is the stack pointer. The sliding action is automatic at runtime: generally, on read, we slide to the positive side after, to write, we slide negative first.
The old stack pointer is indeed a form of the sliding register window. By adding a few control bits we can access items deeper on the stack – widening the window so to speak.
To generalize, Sliding Window algorithms can sometimes be used to map a small fixed-size selector onto a much larger data space by constraining access to a localized window. We've seen it used with the re-assembly of out-of-order TCP/IP packets using a fixed-size packet ID; accessing a large number of registers using a 5-bit index in the SPARC architecture, and selecting a subroutine address in any-sized memory using an 8-bit token in my previous articles.
Today I would like to present yet another use for a Sliding Window algorithm. This one will select CPU instructions from a much larger set of all possible instructions.
As soon as the CPU fetches an instruction, it must figure out how to execute it. This process is known as 'decoding'. The CPU has a number of units; each one is controlled by a set of control signals. Very simple CPUs have instructions that directly control these signals. Very complicated VLIW CPUs often do the same. Most CPUs you are likely to encounter today have complicated decoding circuits that eventually trigger the control signals.
Let's take a look at the instruction format of a simple stack processor, the J1 Forth CPU. This simple zero-operand stack machine processor is small and simple; it runs at 50MIPs in a Spartan3 FPGA. The Verilog source is 200 lines – I suggest you take a look at this amazing design by James Bowen.
The instruction is 16 bits:
The first 4 types of instructions deal with literals and control transfers. The fifth one is of interest here as it wires the instruction bits directly to control signals of different units. The bitfields are described here:
The ALU uses 4 bits to encode the following operations:
Finally, here are some examples of some operations that map directly onto Forth instructions:
Statistically speaking, not all instructions are created equal. Philip Koopman, in his book “Stack Computers, the New Wave” (a must-read for anyone interested in CPU design), quotes the following statistics of static instruction use:
Interesting, isn't it? A full three-quarters of the sample code uses only a dozen or so instructions.
Instruction usage exhibits a strong locality. A math package will make heavy use of ALU instructions; a group of subroutines that work on strings will require memory-access instructions with pointer incrementing and decrementing.
Finally, some combinations of the control fields are just not useful at all, and are not likely to be encountered in real code.
Since any CPU's instruction size is limited, and the number of control signals is large, most CPUs opt to restrict the number of combinations of control signals that can be used onto a much smaller subset of useful instructions. Other combinations which occasionally may be useful (such as incrementing a register while returning) are simply not accessible.
Very Large Instruction Word processors are an exception – they often present many more control signals right in the bitfields of their very wide instructions. These processors are not common – probably because they are hard to code for and debug, and gobble up enormous amounts of memory with these giant instructions.
The instruction decoder of a normal CPU is fixed. It simply maps the instruction token onto the control bits corresponding to various units that are engaged (or left to sit this instruction out). This can happen over a number of internal cycles. The mapping system can be as simple as a wide ROM (see the 6502 design). Instructions can be microcoded – a multi-cycle state machine (or another CPU) inside the CPU may have access to the control signals. But the meaning of tokens (instructions you see) is generally fixed.
Remember our magical implementation of Part 1? It new all about our call instructions simply by their locations (since the targets never change once compiled) – making the call instructions unnecessary altogether. In the real implementation we put the call tokens back in – after all something has to take up the byte of memory anyway.
The magical implementation is not limited to calls. It can know about instructions other than calls, since they don't change. Once we compile an add instruction at address $100 and the CPU knows about it, there is no reason to even keep it there... But enough foolishness – back to reality.
The same Sliding Window Token machinery can be applied to decode instructions. Since our machine is already marking Blue table entries with a bit to identify literals, we can add another bit to identify instructions. Actually, since FPGA memories have free parity bits, 32-bits is really 36 bits, four bits are free for such tagging, leaving room to spare. 3 out of 16 possible permutations are used so far:
Reference (call target)
A single 9-bit token can now index as a wide literal, a subroutine call, or a decoded instruction, depending on what it indexes.
Note that the CPU above needs only 16 control lines to function, although our table has 32 bits. The extra 16 bits are unused - but we can add other units and provide custom hardwired instructions at no additional cost to the decoder.
In previous articles, instructions were handled using an escape token to switch into 'linear execution' mode. Alternatively, I suggested dedicating 32 or 64 of the possible 256/512 tokens to instructions for the stack machine.
The technique presented here is infinitely more flexible. Instead of taking a fixed number of tokens away from calls, it treats calls, literal loads and all other instructions in the same statistical maner, and is certainly no worse than the descriptions in parts 1 and 2.
Instructions are no longer fixed values – they index into the table. As the window slides, so do the instruction enumerations. Debugging is a little bit more interesting (but not much).
Instructions are subject to window underflow and overflow. The compiler should be smart enough to know that there is no more room in the table for new instructions and relocate the code accordingly. Similarly, if the instruction we need is not present in the window, the compiler should synthesize it and place it into the table.
What we gain is flexibility. Instructions that are rarely used are created only for these occasions, instead of occupying valuable instruction space. They remain accessible, for that time that you need to do a strange combination of things, such as increment the stack while performing a logical AND. Your code can be very efficient.
All instructions, not just calls, are now handled the same way, without escape tokens.
The Blue memory read takes time and introduces a latency. In the simple analysis, all instructions of our processor now require 2 cycles. That is not so bad - many processors require more than one internal cycle to complete an instruction. In future articles we will come back with ways to accelerate the execution of this processor.
Instructions that include register bitfields seem to confound this architecture. However, because of ABI register conventions, register use is constrained to a subset of all possible register operations. Looking at ARM or MIPs architectures, it would be possible to use a 16-bit window to encode 32-bit instructions. This encoding would be superior to the Thumb architecture (which is a very limited subset of the ARM instruction set). I don't currently have the statistics to work with, but it's an interesting research direction.
It is also possible to exclude the registers from the sliding window system. An 8-bit bitfield can be reserved for the Sliding Window Token, while other bits are dedicated to addressing registers (hopefully using a Sliding Window system as well). If you are familiar with the operation of PicoBlaze from Xilinx, you will remember that every instruction includes a generous 'constant' field. Why not use this field for the sliding window index?
In previous articles, I described a fixed-size token processor that maps an 8 bit token to a 32-bit call target or a 32-bit literal using a Sliding Window algorithm. The processor is implementable as a VM in software and is quite useful as a Domain-Specific language.
Part 3 introduces the idea of instruction decoding using the same Sliding Window method, and the same shared Blue memory. Now we are in the probably in the FPGA domain, as complex instruction decoding is probably not worth doing in software using an off-the-shelf CPU. We will continue our exploration in the future articles.
Interesting, isn't it?
Add a Comment