Would you like to be notified by email when Victor Yurkovsky publishes a new blog?
An in-depth exploration of Sliding Window Token Machines; ARM notes.
In the last article '1. A New CPU' I introduced a CPU that uses 8-bit tokens to indirect through a table of targets. By using a 256-entry-wide sliding window into a large table of pointers, the processor can address gigabytes of RAM in spite of its minuscule 8-bit fixed size instructions. The price we pay is a memory read and an adder.
Is the offer too good to be true? Some readers suggested that the description smells of a perpetual engine. Relax, and consider that in addition to the 8 bits of the instruction itself, we are using the address of the instruction to gain extra information (in fact turning 32+8 bits of address/instruction into a 32-bit target address). The laws of thermodynamics and information theory are not violated.
A quick review of the processor. The CPU employs two separate memories. An 8-bit-wide memory – let's call it Red memory - contains tokens, opcodes, and occasional in-line data. A separate, 32-bit memory (let's refer to it as Blue memory) contains addresses of call targets. Note that the term 'tokens' here means 8-bit 'calls' to subroutines in the code memory.
The PC register addresses Red memory and identifies the currently executing token/opcode. The Blue equivalent of the PC, register B simultaneously points to the base of the sliding window in Blue memory, and determines the reachable set of targets for the token at PC. B is calculated as a function of the PC; the function can be as simple as PC>>4 (this results in the window sliding one Blue entry for every 16 tokens in Red memory).
For clarity, let's put the currently executing token into a register called Q.
The execution mechanism for tokens is simple. For normal mode, executing lists of tokens (subroutine calls):
Fetch token into Q; increment PC; calculate B.
Since it's always call, preserve PC on the stack;
Load PC with the value from Blue memory [B + Q] ;
In order to return from a subroutine, token 0 is reserved. When encountered, PC is popped from the stack and the process continues.
Some mechanism is needed to execute other instructions. For hardware implementations I recommend dedicating a portion of the 256 (or for 9-bit tokens, 512) tokens to linear instructions. For software implementations it works out better to have an 'escape' token that forces a 'native' mode where assembly language code is executed.
In software, the sliding window interpreter provides a viable basis for a Domain-Specific Language. In hardware, it is a decent scalable processor that provides quite a bit of flexibility to FPGA designs with minimal resource use and extremely high instruction density.
The arrangement described above imposes certain requirements and limitations.
Assuming a 32-bit address space with 4GB of Red memory, the Blue memory provides one new 32-bit pointer for every 16 tokens. This requires 1GB of Blue memory to completely fill this 32-bit system. In practice, we will need a lot less Red memory for any imaginable task. Bytecoded systems are incredibly compact (an entire forth-like environment with interactive debugging fits in less than 8K). 256K is unthinkably large (as far as code size is concerned), and that requires 64K of Blue memory. For emulated systems none of this is an issue with today's technology.
For FPGA implementations, the system can be scaled down to as little as 1 dual-ported BRAM containing both Red and Blue memories. This works out to about 1820 9-bit tokens in Red, leaving room for 114 16-bit pointers in Blue. This is enough to do some real processing for control applications. A larger system using nine 1-bit Red BRAMs provides 16K instructions and requires 1K of 16-bit Blue ram, or 1 BRAM (for a total of 10 BRAMS). Scaling up, to support an external 256K SRAM, we can dedicate 16 internal 1K x 18bit BRAMS, still in the realm of a small Xilinx FPGA. The design is infinitely scalable with no penalty. All that while remaining code-compatible with even the smallest sibling.
As for the instruction width – obviously, in emulated systems 8 bits is a natural choice. This provides the window width of 256 entries and plenty of slack should our code density vary. For FPGA implementations we have a zero-cost opportunity to go to 9 bits. This allows us to easily 'sacrifice' up to 256 tokens to the remaining non-call instruction set without losing a beat.
Finally the simple Q=PC>>4 function for calculating the Q register can be adjusted to the expected code density. As mentioned above, with 8-bit tokens and 32-bit pointers, >>4 dictates that the Blue memory is ¼ the size of Red memory. This implies highly factored code with the average subroutine size of 16 instructions, catering to Forth code. Should the subroutine length be radically different, the function can be adjusted. For instance, >>5 will work better if subroutines are averaging 32 instructions, resulting in Blue memory being half the size – 1/8 of the Red memory in the above examples.
More interesting mapping functions exist, although the simplicity of >>x is hard to beat. I haven't been able to improve on it enough to bother; generally the cost of implementation of more complex functions simply does not justify any improvement in the mapping quality. If you come up with a better one, please let us know.
The word 'threaded' is an old term implying code containing exclusively calls to subroutines (see http://en.wikipedia.org/wiki/Threaded_code for the definition and explanation of different threaded models). It has nothing to do with OS threads or multithreading.
Threaded code execution is interesting because it's highly non-linear. Even though subroutines are lists of calls to other subroutines, the lists are not executed sequentially. The CPU traverses the tree in a depth-first path, going all the way down to native code routines and winding its way back up before initiating another dive. This is similar to a needle piercing layers of fabric to come out at the bottom only to wind its way back up - the origin of the word is pretty obvious.
Following the execution chain starting at TEST, we call FOO which then calls A, a leaf node. Upon return we call B, which calls D and upon return, E, thus completing B. Returning to FOO, we are finished and return back to TEST, only to call BAR, A again, then C, return back to BAR and subsequently to TEST. Since we are done, we return from TEST.
Once again, observe that even though as we write routine FOO we see it as a sequential list of instructions - A B RET, at execution time never do we go from A to B directly - we thread all the way down in between. This is a crucial point to remember.
Normal threaded code as described in Wikipedia can take different forms, including subroutine threading. Subroutine threaded code looks like this:
$40000000: E8 0000000B CALL FOO
$40000005: E8 0000001C CALL BAR
$4000000A: C3 RET
$4000000B: ..... FOO: ....
It can be represented as a list of addresses like this:
$40000000: 4000000B FOO
$40000004: 4000001C BAR
Note that the now each call is only 4 bytes long and requires an interpreter to run it. The interpreter is very simple - it fetches a 4-byte quantity and calls through it.
Using our new scheme, the code will look like this:
$00001000: 01 FOO
$00001001: 0D BAR
In order to execute token $0C, the interpreter finds the base of the table, and indexes the second pointer and proceeds to execute there. It's not the first pointer - that would be token 0.
While this looks much more complicated than a simple call instruction, the reality is that the 4-byte address needs to be read from memory no matter what. In our new scheme, it is read from the table; x86 reads it from the instruction stream. x86 has the 5-byte overhead for every call - ours is 5 bytes for the first call and 1 byte for every additional call if it's already in the table, which is a good deal of the time. If it's not in the table, we have to duplicate the target and the cost is back to 5 bytes, so we are generally well ahead of the game.
When looking at token machines and bytecode interpreters we expect a fixed meaning attached to each bytecode. Take the Java VM, P-code or the Smalltalk VM. These interpreters emulate a virtual processor with a fixed instruction set. Our beast is quite different - the bytecodes do not have a fixed meaning - they stand in for subroutine calls to various addresses. Even more disturbingly, the enumerations keep changing as we cross the 16-byte boundaries. For instance, a sequence of calls to a routine 'FOO' located near the boundary will disassemble to:
$0000100D: 13 FOO
$0000100E: 13 FOO
$0000100F: 13 FOO
$00001010: 12 FOO ;a new enumeration kicks in
$00001011: 12 FOO
This seems insane at first, but consider the x86 call instruction: $E8 followed by relative offset from PC+5. X86 code looks like this:
$10001000: E8 10001009 CALL FOO
$10001005: E8 10001004 CALL FOO ;a new offset for every call - can't id address visually
$10001000: E8 10000FFF CALL FOO
Looking at it rationally, I can't say that one is weirder than the other. You will quickly get used to it, and deriving the target of the token is only a little harder than in the x86 world. We'll have a decompiler doing it for us anyway.
Assume that we are compiling (or assembling) code to some area in Red RAM. For simplicity's sake, let's say the sliding window in Blue RAM is half-full, and offset 128 is free:
The process is simple: we look up the name of the target in our symbol table. Now we search the window for the address of the target, starting at +0 and going up to the insertion point. If the target is present, we simply compile the index as a bytecode. If it's not, we add it to the table at the insertion point and compile its index (128 in this case).
Should a long run of targets that are not presently in the table be encountered, our table will start filling up. In an extreme situation, the table will fill up and leave us unable to compile code. In this case, the only option is to compile a bunch of NOP instructions or a jump to the next 16-byte boundary to free up a slot (by causing the window to slide). This 'table overrun' and happens rarely.
In the opposite extreme case, the table can become nearly empty. For instance, a long list of calls to the same target will cause the window to slide, 1 entry for each 16 calls. Eventually, only that target will remain as the first and only element in the table, resulting in a 'table under-run'. Subsequent calls will force the entry to be duplicated upon the next slide of the window. No harm done; the table will be repopulated as calls to other targets resume.
Please remember that the talk of 'filling the table' or the table being 'half-empty' and any notion of 'insertion points' are strictly compile-time phenomena. Nothing is written into the table during execution (except during compiling). At runtime, the table is always full (except for the very top of memory). This is not a runtime cache - it is a pre-compiled table of targets!
A careful examination of the table usage shows that the ideal situation is for the insertion point to be high up, close to the 'full window' condition. This way we make the most of the window and have access to many previously-used pointers. Getting too close to the top makes the overrun condition more likely, however.
Inserting a pointer close to the top of the window has another beneficial effect: as the window slides, the pointer will continue to be visible to the code without requiring duplication. A quick calculation shows that a pointer at index 255 will remain visible for 256 slides, or 256*16=4096 bytes of code that follows. A commonly-used routine will benefit from being placed high up.
On other side of the spectrum, there are subroutines that are used only once or twice; often a programmer will split up a large routine into several smaller ones. This is not factoring, so it's pointless to carry a pointer like that in the window for too long. Pointers like that benefit from being placed as low as possible; a routine used only once is best located at index 0, where it will be retired within 16 bytes.
The compiler will definitely work better if there is a way to sort pointers and place them high or low. Instead of an insertion point for the Blue table memory, we can instead fill the memory with 0 pointers and scan for a free slot from the top or the bottom of the window to insert pointers low or high, as needed.
We've seen that call instructions are prime candidates for factoring due to the high probability of targets being used repeatedly, and the fact that the target space is a small subset of all possible addresses. There is another likely suspect we should examine: literals.
Since our instructions are fixed size, the decoding is simplified greatly - we do not have to deal with 4-byte-long interruptions in the instruction stream to accommodate call targets. But what about immediate values? How do we load registers with literals?
Many processors differentiate the immediate instructions into 'small' and 'large' categories. This is sensible - why waste 4 bytes to load 0 or 1 into a register? Byte immediates are very valuable when parsing ASCII text, processing pixel values, doing simple comparisons. Long immediates are useful for loading pointers into registers and processing complicated data.
It seems to make sense to keep byte immediate operations - it's simple enough to fetch another byte. It does complicate instruction timing - without pre-fetching some instructions will take 2 cycles. It's also possible to fetch 2 bytes at once and discard the second byte unless the first one is an immediate instruction. It's a slight annoyance, but is doable.
It is possible to use the same table technology that is driving the call operations of our processor. We do have to differentiate pointers from literals. This can be accomplished by using an extra bit in the table. Since Xilinx FPGAs have parity bits in the BRAMs, 32-bit memories are 36 bits wide anyway, so we can use a bit to indicate that a register load is to be performed instead of a call. Our bytecoded processor will continue with its 8 (or 9) bit tokens; some of these will now be used for loading literals.
A side benefit of this is that all the factoring that previously applied to targets now works for immediate values. So common values such as 0, 1, powers of two, and immediates representing pointers used often (such as addresses of hardware registers) will sit in the window and be shared. The price - space is taken out of the table and the metrics of matching code density to our sliding functions have changed.
So far I've described a distinctly 'threaded' CPU, highly suitable for running Forth. The main reason for that was to focus on the call system and address resolution (and because my personal interests lie in that direction). But it is pretty obvious that the mechanism for resolving an 8 or 9 (or whatever size) -bit token to a full width of the address bus is pretty generic.
Modern processors use between 8 and 64 bits to identify branch targets and store immediate values inline. Any architecture with any instruction set can incorporate the sliding table resolution mechanism. ARM architecture with its limited bitfield for calls would benefit with no other changes.
A 16-bit fixed-size instruction set can accommodate a multi-register architecture; the entire second byte can be used for targets in CTIs, literal index for immediate instructions, and offset/index for other instructions.
Finally, I would like to present another option that I haven't yet discussed. So far we've seen that 8 or 9 bit tokens, with the sliding window of 256 or 512 entries, provide enough slack to support general-purpose processing, occasionally running into trouble with window overruns. What if the tokens were wider?
How about 16 bits? The window is now a super-generous 64K entries. That is a truly enormous amount of slack, and the window coverage is fantastically wide. A pointer placed at the top of the table does not get retired for 64K slides. At 16-bytes per slide, that is 64K x 16 = 1M of 2-byte instructions, or 2MB of code! Two-byte tokens are still a lot more compact than native calls in any 32-bit architecture, and we gain a virtually unlimited memory space, amazing scalability, and all the good factoring things mentioned above.
Wider tokens are a great way to create a medium-to-large-sized processor. Software versions for Domain Specific Languages work extremely well too - modern processors handle 2-byte words with ease.
In the last article I promised to continuously share implementation notes. Keeping with that promise, here is an implementation of the inner interpreter for the ARM processors. This particular code was pulled out of my FemtoForth implementation of the sliding window interpreter I built a couple of years ago for NXP's ARM Cortex M3 processors (tested on LPC1343 and LPC1768). The assembler is FASMARM, and the following conventions are used:
Token size is 8 bits
Table entry size is 32 bits
r6 is IP
r3 is current token
r2 is temporary IP
r12 is $1000FFFE, a constant used to mask IP to clean it up and align it
TSHIFT is the constant used to shift the address to create window baseTWDIV is used to shift it back
to align it (size of the pointer)
The ARM code is a little more complicated than its x86 cousin by the alignment requirements of ARM instructions and targets. A trick is also used to maintain separate tables and code areas in RAM and FLASH areas of the LPC chips (see if you can figure it out. Hint: flash is at $00000000 and RAM is at $10000000)
ldrb.w r3,[r6],1 ;2; load token
cbz r3, inner3 ;zero? return
; this also works but needs an extra register
; and r3,r12,r0,LSR 3 ;1; is 0x1000FFFE 1/8 of address, even
; bfi r0,r3,0,28 ;and keep the RAM bits
;Thread in. First, make the table base: ((a+1)>>tshift) AND tround
;To preserve the RAM bit, use the signed extract from bit TSHIFT up to RAM bit.
sbfx r2,r6,TSHIFT,29-TSHIFT ;1; shift, create an S29
ands r2,r12,r2 ;1; Mask S29 to reinstate RAM bit
ldr.w r2,[r2,r3,LSL TWDIV] ;2; r2=I. Keep caller's R6 for the CODE case...
ldrb.w r3,[r2],2 ;2; r3=#. normal +1, plus +1 in case it's code..
; cbz is longer than a compare here.
bxeq r2 ;2; 0=CODE! r6 is caller's, r2 will be abandoned.
RPUSH r6 ;2; not code, save old r6
subs r6,r2,1 ;1; not code, adjust the pointer back to +1
b inner_loop ;1;
I hope this article explained some of the many subtleties of the Sliding Window token machines. In future articles I will try to put together a simple emulation of the new CPU (one of the versions discussed above). I will present more implementation notes for the FPGA version.
I will be posting emulation software as well as Verilog over the next few months on Github. Check (and watch) https://github.com/stacksmith/sliding-cpu (don't expect much until April 2013).
I want to remind you that the technology presented here has been placed into the Public Domain and is free from any restrictions whatsoever. Please feel free to experiment and let me know how it goes.
Add a Comment