Would you like to be notified by email when Victor Yurkovsky publishes a new blog?
Summary: In this multi-part series I will share with you a design, implementation notes and code for a slightly different kind of a CPU featuring a novel token machine that resolves an 8-bit token to pretty much any address in a 32-bit or even 64-bit address space, using not much more than an adder.
I would like to be clear that the original research and implementations I am about to present are being placed into the Public Domain. I am not aware of any patents on the concepts presented (unless noted), and am not privy to any proprietary information or trade secrets of others.
--Feel free to skip my ranting - proceed to the 'The Anatomy of a Call' section --
When a new technology is introduced, invariably claims are made that it will change the world. Not yet, but soon, when the price-point comes down low enough that everyone can afford it. Years go by, and indeed the price drops and everyone has 'it'. But the changes are rarely monumental.
VCRs allowed us to watch movies at home, and yet movie theaters still thrived. Everyone can afford a high-def movie camera; the world is indeed different because of it, and yet we still watch Hollywood movies (OK, and some 'funny' youtube videos, but time-wasting technology is a whole other topic).
FPGA technologies have 'leveled the playing field'. A kid can save up $50 to buy a devboard, load up some free software and create a never-before-seen brilliant new CPU design. And implement it, without a million-dollar fab cycle. And it will run at clock speeds of at least tens of MHz. So where are all the new brilliant CPU designs, from kids or grownups? Where are the original ideas and breakthroughs? Sure, there are some improvements such as clever cache designs, branch prediction, out-of-order code execution. But most processors we see today are pretty much ripped from the pages of dusty old computer architecture books.
I've been trying to find something new and exciting for a while. When I feel particularly clever I sit down with a clean FPGA and try to write 'The Great American Processor' (TGAP). It is a humbling experience, and I strongly suggest it to all of you, for those times you feel especially clever.
Occasionally I do stumble upon something interesting, worth sharing. There I was, hyperfocusing on the call instructions, addressing issues and how different CPUs handle them. Looking at access patterns in execution threads and dozens of instruction sets, combined with ABIs of different operating systems and calling conventions of different languages, some patterns appeared. After a couple of decades I realized that there is a way to create a processor that factors out some of the redundancies in these processes. The resulting architecture hits the 'sweet spot' of having a fixed instruction size (8 or 9 bits) and incredible jump reach, and is pretty efficient in a simple implementation while allowing for later optimisations.
Let's review the technologies and agree on some naming conventions first.
A big portion of the commercial success or failure of a CPU has to do with its instruction set. An Instruction Set Architecture (ISA) of a processor is closely related to its internal register structure, data bus width and address bus width. It is also shaped by religious affiliations of the designer - RISC or CISC (or my personal favorite, MISC), Von Neumann or Harvard, etc.
ISAs are the product of hard tradeoffs - each instruction has only so many bits to spare. Are all instructions the same size or variable size? Do immediate values and addresses get squeezed into the instruction or float in-between instructions? How do we represent registers? Addressing modes?
Sacrifices are made with each decision, and the resulting instruction set is a unique artifact that instantaneously identifies the architectural details of the processor and the preferences of the designer.
Looking at modern processors, we find surprisingly little originality when it comes to Control Transfer Instructions (CTIs) - the calls, the jumps, and the conditional branches. The hard work of the processor - arithmetic and logic - is done in long sequences of linear code, but the magic happens in CTIs. The control flow is altered, sometimes unconditionally (as in jump instructions), sometimes temporarily (subroutine calls), and on occasion, in ways even harder to describe (conditional expressions, counted loops and multi-way transfers). A simple list of instructions becomes a complicated Turing machine.
Briefly looking at the history of CPUs we can see that some early processors embedded the location of the next instruction into every instruction. This arrangement allows (and forces) each instruction to be a branch to a new location in addition to its normal function. It is obviously wasteful - we do not need to branch at every instruction - there are long runs of sequential logic in every reasonable program.
An obvious optimisation is the Program Counter (PC). The PC contains the address of the next instruction and is normally incremented using a dedicated circuit. Thus instructions are executed sequentially until a CTI is encountered. At that point, the PC is loaded with the new address. Where does it come from? The instruction stream, a register, stack, or memory (as in indirect jumps).
All modern processors do this, and very little difference exists between them when CTIs are concerned. As expected, CISC processors will do more complicated things (like jump indirectly through a memory location) in one instruction, while RISC processors will jump to an address stored in a register (memory indirection will require a separate instruction).
The differences come up in the optimisations. Fast processors can run hundreds of times faster than memories, and pipelines and caches are used extensively. A CTI changes the flow of execution, making simple pipelines and caches less useful. Complex branch prediction mechanisms are employed in an attempt to keep pipelines from stalling and avoid losing hundreds of cycles per branch. Speculative execution mechanisms are often employed to improve performance. The complexity of these mechanisms is generally outside of the realm of the 'homemade processor'.
The workhorse of CTIs is the Direct Call instruction. Programmers quickly recognized that 'spaghetti code', an arrangement of code where the processor simply jumps from one place to another (my mathematically-minded friends would call it a 'directed graph'), is practically impossible to debug. A neater arrangement of code is a 'tree'. Computer languages quickly homed in on the tree arrangement. The PC register combined with the stack provides support for a clean implementation of subroutines and is a staple of all modern ISAs.
A code tree is a very particular arrangement of code. The appearance of a simple linear sequence of bytes is just that - an appearance. Some bytes are more important than others by virtue of pointing or being pointed to. While rarely addressed directly, the consequences of this arrangement reverberate through the architectures of all modern processors. The complexity associated with the instruction handling of modern processors is largely a consequence of attempting to bolt on 'aftermarket' add-ons to accelerate processing instead of starting over with a design that matches the codestream topology.
Let's agree on some terminology first:
CALL INSTRUCTION - A call is a control transfer with the option of returning to the instruction following. This of course is facilitated by the use of the stack.
CALL SITE - The memory area containing the call instruction.
CALL TARGET - The memory address which is the destination of the call.
REACH - the range of addresses reachable with the call instruction.
REFERENCE - a pointer or an offset identifying an address.
The call instruction is present in all modern architectures. Generally, the new address is placed inline as either a pointer or an offset (x86). It can also be encoded into the instruction itself (ARM). The number of bits dedicated to the encoded target address decides 'the reach' of the call (and all CTIs). Often a processor will have some far-reaching jump and call instructions, and some conditional shorter-reaching CTIs to facilitate conditional expressions.
Since the goal of the call is to direct processing to another area in memory, the reach of the call instruction is a crucial characteristic. It is preferable that a call can reach anywhere in the memory space of the processor. All attempts to limit the reach by claiming that 'xxx is enough for any program' fall on their face in more complex environments where libraries or kernels are used - parts of the memory space are not easily reachable.
Classic ARM processors fail here miserably. In order to encode a call into a fixed 32-bit instruction, only 24 bits are available to encode the target. The field is a signed word offset, resulting in a reach of only 32MB in either direction from the call site. Many would say that 32MB is plenty - how many programs are that big?
However, an ugly problem crops up with System-On-Chip implementations that so often use the ARM architecture. These SOCs have on-board flash memory and on-board RAM. Both are capable of executing code, and naturally, the designer will place a bunch of code into flash. For general-purpose computing (and some specialized computing) it is nice to run some programs from the RAM. And of course, we would love to keep some 'kernel' or 'libraries' in flash and call them from RAM...
Well, sorry, folks, here is a typical memory layout of an ARM7 SOC:
$00000000 128K Flash
$40000000 16K SRAM
To call a subroutine stored in flash memory from RAM requires a reach of about negative $40000000, but the feeble reach of the call is only -$02000000, 1/32 of what we need. To make matters worse, similar limitations exist in addressing hardware registers, tucked all the way up in the E's and F's, as well as in encoding 32-bit literal values.
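The arithmetic can be checked with a few lines of Python. This is a minimal sketch, not a model of any particular chip: it assumes the classic ARM branch encoding, in which the 24-bit field is a signed word offset (so the byte reach works out to roughly 32MB each way), and it ignores the small pipeline adjustment the real hardware applies to the PC. The function names are made up for illustration.

```python
def branch_reach(field_bits, scale):
    """Return (lowest, highest) byte offset of a signed PC-relative field.

    field_bits: width of the offset field in the instruction
    scale:      bytes per unit of the field (4 for word offsets)
    """
    lowest = -(1 << (field_bits - 1)) * scale
    highest = ((1 << (field_bits - 1)) - 1) * scale
    return lowest, highest


def can_reach(call_site, target, field_bits=24, scale=4):
    """True if a direct call at call_site can encode the given target."""
    lowest, highest = branch_reach(field_bits, scale)
    return lowest <= target - call_site <= highest


FLASH_BASE = 0x00000000   # from the memory layout above
SRAM_BASE = 0x40000000

lowest, highest = branch_reach(24, 4)
print(hex(lowest), hex(highest))
print("SRAM -> flash reachable:", can_reach(SRAM_BASE, FLASH_BASE))
print("SRAM -> SRAM reachable: ", can_reach(SRAM_BASE + 0x100, SRAM_BASE + 0x2000))
```

Calls within the SRAM are fine; the call from SRAM into flash is out of range by more than an order of magnitude, which is the problem described above.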
X86, a CISC processor, with its variable-sized instructions, can encode 32-bit pointers and offsets really well (at the expense of more complicated instruction decoding). The 64-bit extension exposes odd (in my humble opinion, poor) design choices:
-Pointers for indirect jumps must be 64 bits;
-Addresses for jumping through registers must be 64 bits;
-offsets for direct calls must be 32 bits;
-all immediate values are 32 bits except when loading a 64-bit register with MOV, in which case they may be 64 bits.
In 64-bit mode the reach problem once again rears its ugly head. 4GB should be enough for any one piece of code, but how is a loader supposed to use 8 or 12 GB when a simple jump or a call cannot cross 4GB? With thunks, tables or other such contrivances. Of course x86 processors are severely constrained by the need to maintain backward compatibility. And the ISA has to somehow share the core instruction set with the now-required FPU, MMX, SSE, SSE2, AMD 3DNow, x86-64, SSE4, AVX, AVX2, and... what am I missing? Oh yes, VMX, SMX, XSAVE, RDRAND, FSGSBASE, INVPCID, HLE, RTM and SVM extensions. OK.
Now let's get to the interesting part.
For the purposes of this discussion, let's use a hypothetical 32-bit system with a 32-bit address bus. We will assume statically compiled code without patching or self-modifying code, for now. Let's say the call instruction is a byte-long opcode followed by a 4-byte reference.
Our hypothetical processor will run a hypothetical piece of code. Looking at the code, we see that there are hundreds of subroutines (call targets), and thousands of calls to them (call sites). There are also linear runs of code, which are neither call sites nor targets.
Common sense and prior art dictate that we must have enough bits in our call instruction to reach any possible location of the subroutine - thus defining our reach. Most designers simply take it for granted. But is it true?
Actually, there is an extra term in the call instruction that is uniquely different from reach. Let's fine-tune Reach and define a new quality called Choice:
REACH - the range of addresses reachable with call instructions in general.
CHOICE - the ability of a particular call to transfer control to a number of different locations.
There is a subtle difference between the two. Reach has to do with how far the call instructions can go from 0 (in the case of absolute references) or from the current PC (in the case of relative references). Choice has to do with how many different possible locations can become a target of a particular call instruction at runtime (related to the number of bits dedicated at the callsite). Normally we do not differentiate between the two - the normal call mechanism is a degenerate case of an equation where reach and choice are one and the same. But it does not have to be the case.
Let's look at the problem from another angle. When we compile a call to a subroutine, the target is static. Technically it is usually decided by the linker or the loader. In a running system, the target will never move; the callsite, encoding the target, will therefore never change. And yet, it is capable (by virtue of having all of the 32 bits present) of encoding any address at runtime! To paraphrase, every callsite is capable of reaching any target location within its reach, but only one target will ever be used at that callsite. There is only 1 possible target out of 1 for each call, which should require 0 bits to encode.
Since the target of a call at a given site is unique and will never change, another mechanism could be used for mapping callsite address to the call target, making it a 1-byte instruction. This would actually work implemented as something similar to an associative memory (please read on to see why this is not a good solution).
Each target is likely to be used more than once; eliminating the actual reference from the call saves 32 bits per call. For a commonly used subroutine there may be thousands of wasted callsite bytes.
Finally, representing targets with 32 bits is wasteful. Only a handful of valid targets exists; the rest of the possible call destinations are not just useless, but actually harmful! Jumping into the middle of code or data is how programs die.
This is known as adding injury to insult. A call site contains 32 bits, enough information to target 4 billion locations, only hundreds of them useful and the rest harmful. And 0 bits are in fact required at the call site as we know exactly where the call will end up just by looking at its address. The actual target address can be factored out and shared by all the calls to the same target!
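To put rough numbers on the waste, here is a small back-of-the-envelope sketch. The counts (5000 calls, 300 targets) are assumptions picked only to match the 'thousands of calls, hundreds of subroutines' picture above; they are not measurements of any real program.

```python
REFERENCE_BYTES = 4  # one 32-bit reference, as in the hypothetical machine


def inline_reference_bytes(num_calls):
    """Bytes spent on references when every call site carries its own."""
    return num_calls * REFERENCE_BYTES


def factored_reference_bytes(num_targets):
    """Bytes spent when each distinct target is stored exactly once."""
    return num_targets * REFERENCE_BYTES


def bits_to_pick_target(num_targets):
    """Bits needed to select one of num_targets valid entry points."""
    return (num_targets - 1).bit_length()


calls, targets = 5000, 300  # assumed, purely illustrative

print("inline references:  ", inline_reference_bytes(calls), "bytes")
print("factored references:", factored_reference_bytes(targets), "bytes")
print("bits to choose among valid targets:", bits_to_pick_target(targets))
print("bits if the target is fixed per call site:", bits_to_pick_target(1))
```

Even if a call were allowed to pick any valid subroutine, 9 bits would do for 300 targets; with the target fixed per call site, the count drops to 0.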
In fact we could eliminate the call instruction altogether. Since the location and targets of all calls are known, the processor doesn't even need the call instruction, and can simply transfer execution to the target address (kept elsewhere) as soon as the 'callsite' (now containing nothing at all) is approached. Similarly it can 'return' at the end of the subroutine, simply by knowing how many bytes the subroutine is, also with no opcodes involved. The linear code continuum is bent into a fractal.
So let's replace our PC with this magical black box that generates addresses sequentially most of the time, but sometimes, at the former locations of our call and return instructions, simply redirects the processor to the new locations, for just the right amount of cycles! How would we implement it?
An associative memory would work. An associative memory is like a reverse phone directory - given the current PC it can look up the branch target that would ordinarily be encoded in the call instruction there. Those addresses that actually jump, call or return, provide the proper targets (with the help of the stack pointer and memory, as usual). Otherwise, we increment the previous address.
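A software stand-in for that black box might look like the sketch below, with a Python dict playing the associative memory. The encoding is an assumption made for illustration: each entry says what happens after the instruction at a given address (an integer means 'call this target', the string "return" means 'pop the return address'), and every instruction is taken to occupy one address.

```python
def trace_fetches(after, start, steps):
    """Return the sequence of addresses fetched by the 'magic' sequencer.

    after: dict mapping an address to what follows that instruction:
           an int (call that target) or the string "return".
           Addresses not in the dict simply fall through to address + 1.
    """
    pc, stack, trace = start, [], []
    for _ in range(steps):
        trace.append(pc)
        action = after.get(pc)
        if action is None:
            pc += 1                      # ordinary sequential execution
        elif action == "return":
            if not stack:
                break                    # nothing to return to: stop
            pc = stack.pop()
        else:
            stack.append(pc + 1)         # resume after the call site
            pc = action
    return trace


# Main code at 100..., a two-instruction subroutine at 200..201.
after = {101: 200, 201: "return"}
print(trace_fetches(after, start=100, steps=6))
```

Note that the code memory itself holds no call or return opcodes at all; the dict carries the entire control flow.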
Of course, we need to keep a list of targets somewhere (now in the associative memory), and that takes up space. Are we just moving change from one pocket to another? No, there are definite benefits. Each possible target is now uniquely represented. Furthermore, all the targets are now in one place. Duplication is eliminated through factoring: multiple references to the same location now take no space at all in our code memory. All kinds of new optimisations are possible: we can prefetch code more easily, cache the first few instructions of each subroutine, etc.
Unfortunately, associative memories are expensive when fast and wide, and boy, this one needs to be fast and wide. And deep, too. I will let the reader do the math to figure out why this approach is probably only magical. Oddly enough, a much simpler and better solution exists (and requires no magic)...
In our magical optimal design, our CTIs take up no space, each target takes up 32 bits, and a very large associative memory resolves each instruction address to the 'next' address. It's time to start compromising. Let's look at some approaches that can be implemented, and see how much we are willing to give up for the ability to implement.
We cannot negotiate away REACH (we do need to be able to place subroutines anywhere in RAM). Therefore, somewhere, we have to keep a table of targets, in a memory as wide as necessary for full reach. Same as before, we will keep 32 bits per target.
We can negotiate with CHOICE (which is currently 0 bits). If we sacrifice, say 8 bits per call, we can place the index of the target in the table into the call. Our calls will now take up one byte, have full reach. But, since we have only 8 bits of CHOICE, our system supports only 256 possible targets. This is hardly acceptable, even for a small microcontroller.
We could give up more CHOICE bits. With 16 bits our target table supports 64K targets, a formidable amount. This works very well, but... If you are old and paying attention, you will instantly recognize what is known as 'Token-Threaded Forth'. These indirect approaches have been very well explored by writers of Forth-like systems. But there are serious limitations to these systems. And I promised you something new anyway.
Let's get back to our 8-bit-per call system. We wasted the CHOICE bits to simply encode an index into the table, just to keep the system simple. We are not taking the callsite address into account at all. Is there a better way? Of course there is.
Our 8 bits of CHOICE can select one of 256 possible call targets. The simple table implementation above implies that every call has to select from the same 256 possible targets. But that is an unnecessary limitation - we can vary the tables for different callsites, based on the callsite address.
Those familiar with Java will recognize that 'literal pools' perform a similar task. Every function has a table of literals used inside the function. Since there is usually a handful of these literals per function, they can be addressed by a byte index into the literal pool. This technique can be used for calls, although it's wasteful and patented. There is still a better way.
Let's add another element to the callsite/target/choice/reach paradigm. Time. Actually there are two of these - run-time (the order in which subroutines are invoked) and compile-time (the order in which subroutines and calls are compiled). We are interested in compile-time for now.
We shall start with our 8-bit tabled system. At the start of each subroutine as we compile them, the address of the entry point is added to the table as a target. As we compile calls, we search the table and compile the index of the target - an 8 bit quantity - into the code.
As we proceed doing this, it becomes obvious that there is a directionality in our endeavor. Subroutines are defined, and targets are sequentially added. Calls are made to subroutines previously defined. Some subroutines are used often and as our table grows, are located farther and farther back from the insertion point. Many times, primitives are defined and used once or a few times, also sliding back from the 'NOW' point. The table is in fact a cache of recently-used targets.
Eventually we reach the 256th entry in the table. If we write a 257th entry, it will no longer be reachable. What if we retire the first table element to make room? After all, it was defined so long ago that there is a pretty good chance we don't need it anymore. Now we have a sliding window of 256 elements inside an infinitely large table.
The 'retiring' sounds complicated. How do we implement this? Surprisingly easily.
The base of the table can be a simple function of the call address. For instance, ADDRESS/16. This means that every time our PC advances 16 bytes, our table slides over by one element, freeing up more space for targets reachable by our 8-bit CHOICE. It also means that old subroutines will be phased out - and hopefully never used again. (If they are needed again, we have more room for targets - we can duplicate the target in the table. This happens very rarely.)
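The compile-time side of this can be sketched in Python. Everything below is a guess at one workable policy, not the actual compiler: the class and method names are invented, the `lead` parameter (how far ahead of the window base a new target is placed, which determines how long it stays visible) is a hypothetical tuning knob, and all 256 token values are treated as table indices.

```python
class SlidingTargetTable:
    """Compile-time model of a target table with base = address >> shift."""

    def __init__(self, shift=4, choice_bits=8, lead=128):
        self.shift = shift              # shift of 4 means ADDRESS/16
        self.window = 1 << choice_bits  # 256 tokens for 8 bits of CHOICE
        self.lead = lead                # hypothetical: slots ahead of the base
        self.slots = {}                 # table index -> target address
        self.next_free = 0

    def base(self, address):
        return address >> self.shift

    def add_target(self, here, target):
        """Enter target into the table while compiling at address here."""
        base = self.base(here)
        slot = max(self.next_free, base + self.lead)
        if slot >= base + self.window:
            raise OverflowError("targets added faster than the window slides")
        self.slots[slot] = target
        self.next_free = slot + 1
        return slot

    def token_for(self, call_site, target):
        """Token reaching target from call_site; duplicate it if retired."""
        base = self.base(call_site)
        for token in range(self.window):
            if self.slots.get(base + token) == target:
                return token
        return self.add_target(call_site, target) - base

    def resolve(self, call_site, token):
        """What the hardware would do: TABLE[call_site/16 + token]."""
        return self.slots[self.base(call_site) + token]


table = SlidingTargetTable()
table.add_target(here=0x0020, target=0x0020)   # subroutine defined at $0020
near = table.token_for(0x0040, 0x0020)         # still inside the window
far = table.token_for(0x1000, 0x0020)          # retired: entry is duplicated
print(near, far, len(table.slots))
```

The nearby call finds the original entry; the distant one discovers that the entry has slid out of its window and simply adds a second copy, at a cost of one more table slot.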
Since the table is static at the time of execution, and refers back to existing entries, I will refer to this arrangement as a Static Sliding-Window Lookback Cache. Because it is simply a static list of targets, no 'cache pollution' occurs during subroutine calls - a different part of the table is addressed. Upon return, we are back in the original part of the table.
The base of the sliding window is a function of the address. This function is tunable to match the rate of retiring of old targets with the rate of introduction of new targets during compilation. The width of the CHOICE field (8 bits in our example) provides us with the slack the system desperately needs. It controls the width of the lookback window and provides us with the 'availability' of previously defined targets at any given point of execution.
I will provide more information about the compilation process and the tuning of the parameters in the next installment.
Let's examine this new bytecoded processor.
The processor has two separate memories: a byte-wide code memory and 32-bit target memory. Instead of a PC, we have a slightly more complicated table-lookup blackbox. Each cycle, the system does the following:
NEXT = TABLE[ CURRENT/16 + BYTECODE ]
Also, save return address on the stack.
As we know, dividing by 16 is trivial (>>4 in software, no logic required in hardware) and addition and an indirection are a snap.
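In software the whole step is a shift, an add and a lookup. A minimal sketch, assuming the table is modeled as a dict of slot to address and the return address is simply the byte after the token (both are choices made here for illustration):

```python
def call_step(table, return_stack, current, bytecode, shift=4):
    """One call: NEXT = TABLE[CURRENT/16 + BYTECODE], saving the return address."""
    return_stack.append(current + 1)          # assumed: resume at the next byte
    return table[(current >> shift) + bytecode]


table = {0x17: 0x4000, 0x18: 0x5000}
stack = []

first = call_step(table, stack, current=0x0123, bytecode=5)
second = call_step(table, stack, current=0x0133, bytecode=5)
print(hex(first), hex(second), [hex(a) for a in stack])
```

The same token value 5 lands on two different targets because the two call sites sit in different 16-byte blocks and therefore see different windows of the table.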
So far all the processor can do is call. To return, let's designate opcode 0 as a return instruction, reducing our table range to 1-255.
To add more 'regular' instructions we have a couple of options. If we choose a stack-machine processor, we can simply claim 31 more slots in the instruction space for pretty much all operations we need, and leave instructions 32-255 as table indices for jumping. That works well. Otherwise we can designate an 'escape' token to switch to linear mode instruction set (the example below uses the same opcode 0 to indicate such an escape, provided that 0 is the first opcode after jumping to a subroutine).
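For the stack-machine option, the decode is a pair of comparisons. The sketch below only classifies an opcode using the ranges given above; what each of the 31 'regular' operations does is left open, and the function name and tuple format are made up for the example.

```python
def decode(opcode):
    """Classify an 8-bit opcode for the stack-machine variant.

    0       -> return
    1..31   -> one of the 'regular' operations
    32..255 -> call through the table, using the opcode as the index
    """
    if not 0 <= opcode <= 255:
        raise ValueError("opcode must fit in 8 bits")
    if opcode == 0:
        return ("return", None)
    if opcode < 32:
        return ("op", opcode)
    return ("call", opcode)


print(decode(0), decode(7), decode(200))
```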
Modern FPGAs have 9-bit memories, so switching to instruction width of 9 bits gives us even more latitude without any wasted resources:
65-511 CALL operations.
This processor is totally implementable, and in later articles I will examine some implementation notes and code.
A quick proof-of-concept can be coded in our x86 world using 8-bit tokens:
xor eax,eax ;clear upper 3 bytes for lodsb
lodsb ;al=tok, inc esi
shl eax,2 ;token*4 (pointers are 4-bytes), set flags...
jz return ;A 0 token that is not first means return
push rsi ;thread in...
shr esi,4 ;Tricky: esi shr 4 then shl 2 for alignment
mov esi,[esi*4+eax] ;and index it with token*4 resulting in CALL
shl eax,2 ;first byte of subroutine 0? Machine language code follows
jnz inner_loop ;continue threading
call rsi ;call assembly subroutine that follows
jmp return ;thanks for catching a bug, KSM
This is a partially-unrolled implementation of the new processor. Code is allocated at $40000000 and the table at $10000000. The interpreter is less than 32 bytes long and fits into a cache line. In preliminary tests it ran about 1/10 the speed of native code, not too bad for a bytecode interpreter. Of course in an FPGA implementation there is no need for such inefficiencies.
This interpreter will thread (Forth parlance for executing lists of subroutine calls) normally. If the first token of a subroutine is 0, the interpreter will issue an x86 call to the code that follows; return will get back to the interpreter. For threaded lists of subroutines, 0 means return to the caller (this 0 has to be in a non-first position in a subroutine).
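The same behaviour can be modeled in a few lines of Python, which may be easier to experiment with than the assembly. This is a sketch under several assumptions: machine code is replaced by Python callables registered at the address following a leading 0 token, the table is a dict, the window base is computed from the address just after the token (as the lodsb-then-shift order suggests), and the top-level list is treated as already 'inside' a subroutine, so a 0 there ends the run.

```python
def run(code, table, natives, entry, shift=4):
    """Thread through code starting at entry until the outermost return."""
    ip, rstack = entry, []
    while True:
        token = code[ip]
        ip += 1
        if token == 0:                    # non-first 0: return to the caller
            if not rstack:
                return
            ip = rstack.pop()
            continue
        rstack.append(ip)                 # thread in
        ip = table[(ip >> shift) + token]
        if code[ip] == 0:                 # leading 0: native code follows
            natives[ip + 1]()             # stand-in for the x86 call
            ip = rstack.pop()


state = {"x": 1}

def double():
    state["x"] *= 2

code = bytearray(0x30)
code[0x00:0x03] = bytes([5, 6, 0])   # main: double, quad, return
code[0x10] = 0                       # 'double': leading 0, native code next
code[0x20:0x23] = bytes([3, 3, 0])   # 'quad': double, double, return

table = {5: 0x10, 6: 0x20}           # slot -> target address
natives = {0x11: double}

run(code, table, natives, entry=0x00)
print(state["x"])
```

Note that 'double' is reached with token 5 from main (base 0) and with token 3 from 'quad' (base 2): both resolve to table slot 5.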
In the next article I will look at the pros and cons of this methodology. We will examine some unique characteristics of the Static Lookback Sliding Window Caches. We will learn more about the inner mechanisms and fine-tuning of the interpreter and start mapping out the FPGA version of the design.