I am resurrecting a CPU design I created some years back for use in an FPGA. It was highly optimized for minimal resource usage, both logic and BRAMs. One of the things I didn't optimize fully is the usage of the full 9 bits of BRAMs in most FPGAs. When used in byte width or larger, they use multiples of 9 bits rather than 8 to provide a parity bit. It seems a waste to ignore these bits.

I have already optimized the instruction set fairly well for a zero-operand architecture (dual stack based) using 8 bit instructions. In general, 8 bits is overkill for this sort of machine. For example, various MISC CPUs use 5 bit instructions packed into larger words. There is no need for 256 different instructions, so I used an immediate field in the jump and call instructions and still have a total of 36 instructions.

Given this as a starting point, an easy way to use this extra bit is as a return flag. There are any number of instructions that, if followed by a return, could now be combined with it, saving both the time of execution and the byte of storage for the return instruction. Referring to Koopman's data on word frequency, this will reduce the number of bytes in the code by up to 7.5% and speed by up to nearly 12%. I say "up to" because only 19 of the current 36 instructions (or about half) can be optimized this way. Instructions that use the return stack can't be combined with the RET opcode because it also uses the return stack.

The result is likely to be that half of the time this RET bit can be used, giving a 3+% savings in code size and 6% savings in clock cycles. I would never think of enlarging the program memory by 12.5% to get a 3% savings, but since this bit is actually free, it makes some sense to use it this way.

On instructions where it has a better use, such as the literal, it is not used for the return. The literal instruction goes from a 7 bit literal to an 8 bit literal shifted onto the return stack.
The jump and call instructions go from having a 4 bit immediate field to a 5 bit field. This may seem like not much, but in the code I had written for this CPU, a significant percentage of the jumps were long enough that the offset would not fit in 4 bits and required an extra byte of storage. Going to a 5 bit field would have accommodated most of these cases.

The question is whether there is a better way to use this extra bit in an instruction. I am sure it would not be productive to just add more opcodes. I haven't seen any other ideas in any designs I have read about, so the RET function seems like the best.

Rick
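The savings estimate in the post can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: every input is a figure quoted in the post, the "about half" factor is taken as exactly 0.5, and the names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted in the post.
# All inputs come from the post itself; nothing here is measured.

MAX_SIZE_SAVING = 0.075    # "up to 7.5%" fewer code bytes
MAX_SPEED_SAVING = 0.12    # "up to nearly 12%" fewer clock cycles
USABLE_FRACTION = 0.5      # "about half" of the instructions (19 of 36) can carry the RET bit


def expected_saving(max_saving, usable_fraction):
    """Scale the best-case saving by the fraction of cases where the RET bit applies."""
    return max_saving * usable_fraction


size_saving = expected_saving(MAX_SIZE_SAVING, USABLE_FRACTION)
speed_saving = expected_saving(MAX_SPEED_SAVING, USABLE_FRACTION)
memory_growth = 9 / 8 - 1  # what a ninth bit would cost if it were not already there

print(f"code size saving  : {size_saving:.2%}")
print(f"clock cycle saving: {speed_saving:.2%}")
print(f"memory growth     : {memory_growth:.1%}")
```

With these inputs the size saving comes out a little under 4% and the cycle saving at 6%, against a 12.5% memory cost that the BRAM parity bit makes free.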
Bit width in CPU cores
Started by ●December 21, 2008
Reply by ●December 21, 2008
I meant to crosspost to CLF and I also wanted to ask if anyone has looked at similarly making the word size a multiple of 9 rather than 8 bits. That not only matches the BRAMs, but also the multipliers. Is there any real advantage to this? I guess this could make it a bit hard to address 8 bit bytes in an 18 bit word. Or maybe not. Opinions?

Rick
Reply by ●December 21, 2008
In article <46017e00-ae8e-45d7-945f-020dd34b6483@20g2000yqt.googlegroups.com>, rickman <gnuarm@gmail.com> writes:
> I am resurrecting a CPU design I created some years back for use in an FPGA. It was highly optimized for minimal resource usage, both logic and BRAMs. One of the things I didn't optimize fully is the usage of the full 9 bits of BRAMs in most FPGAs. When used in byte width or larger, they use multiples of 9 bits rather than 8 to provide a parity bit. It seems a waste to ignore these bits.
>
> I have already optimized the instruction set fairly well for a zero-operand architecture (dual stack based) using 8 bit instructions. In general, 8 bits is overkill for this sort of machine. For example, various MISC CPUs use 5 bit instructions packed into larger words. There is no need for 256 different instructions, so I used an immediate field in the jump and call instructions and still have a total of 36 instructions.
>
> Given this as a starting point, an easy way to use this extra bit is as a return flag. There are any number of instructions that if followed by a return, could now be combined saving both the time of execution and the byte of storage for the return instruction.
...

Dick (Richard E.) Sweet did his thesis on this area at Stanford in 1977: "Empirical Estimates of Program Entropy". That approach was used on the Mesa instruction set at Xerox.

The ACM will sell you a copy of his thesis (or a paper derived from it), 168 pages. There is also a PARC CSL report on the Mesa work. The Computer History Museum has a copy. The library at PARC may have copies, but they have probably pitched them by now.

The goal of that work was to reduce the size of the code. (It seems a bit silly with all the bloat out there now, but we thought it was important back then. An Alto had 1/2 megabyte of RAM.)

My memory is that most of the instructions were loads and stores. I expect that will depend upon your sample set. How much code do you have?
Do you have a compiler or are you coding in assembly?

The Mesa world had a stack and a PC and two pointer registers. We called them Local and Global. Local was a pointer to the stack frame for the current subroutine. Global was the pointer to the module's storage (static in C).

The Local stack frames were not actually a stack. They came from a heap so you could do co-routines and fancy stuff like that.

Anyway, if you read Dick's thesis or the Mesa report, I'll bet it will give you some ideas. I'm pretty sure he had a version of the compiler that parsed the code and spit out normalized (unoptimized) intermediate data that he could analyze with another set of tools (aka compress).

--
These are my opinions, not necessarily my employer's. I hate spam.
Reply by ●December 21, 2008
On Dec 21, 7:13 am, rickman wrote:
> I meant to crosspost to CLF and I also wanted to ask if anyone has looked at similarly making the word size a multiple of 9 rather than 8 bits. That not only matches the BRAMs, but also the multipliers. Is there any real advantage to this? I guess this could make it a bit hard to address 8 bit bytes in an 18 bit word. Or maybe not.
> Opinions?

The Intellasys chips use 18 bit words. In several places, 9 bit addresses are used. Only whole words are addressed, not bytes. Instructions are 5 bits wide, but of the four that fit into an 18 bit word, the last has only the top 3 bits (the lower bits are assumed to always be 0).

-- Jecel
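The slot layout described in this reply (5 + 5 + 5 + 3 = 18 bits) can be sketched as a small unpacking routine. This is an illustration only: it assumes slot 0 sits in the most significant bits, and the function name is made up here rather than taken from any Intellasys documentation.

```python
WORD_BITS = 18


def unpack_slots(word):
    """Split an 18-bit word into four opcodes of 5, 5, 5 and 3 bits.

    Assumption: slot 0 occupies the most significant bits. The 3-bit
    last slot supplies only the top three bits of a 5-bit opcode; the
    two missing low bits are taken to be 0.
    """
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word must fit in 18 bits")
    slot0 = (word >> 13) & 0x1F
    slot1 = (word >> 8) & 0x1F
    slot2 = (word >> 3) & 0x1F
    slot3 = (word & 0x07) << 2   # low two opcode bits assumed to be 0
    return slot0, slot1, slot2, slot3


# Opcodes 1, 2, 3 in the full slots and the bits 101 in the short slot.
print(unpack_slots(0b00001_00010_00011_101))
```

One consequence of this layout is that the last slot can only hold opcodes whose two low bits are zero, that is, 8 of the 32 possible encodings.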
Reply by ●December 21, 2008
On Dec 21, 12:13 pm, rickman <gnu...@gmail.com> wrote:
> I meant to crosspost to CLF and I also wanted to ask if anyone has looked at similarly making the word size a multiple of 9 rather than 8 bits. That not only matches the BRAMs, but also the multipliers. Is there any real advantage to this? I guess this could make it a bit hard to address 8 bit bytes in an 18 bit word. Or maybe not.
> Opinions?
>

ANS Forth says that address units need not be bytes. You may have 18-bit AUs, 9-bit AUs, or 6-bit AUs. (In the latter case you will need a fast division by 3 :).

On a processor that used 16-bit words and ignored the least significant bit, it was easy to make a software emulation of byte access (by the way, both big- and little-endian could be emulated).

As to instructions, do you want to pack commands of the same size into 18-bit words, or is it OK to have commands of different sizes? For example, you can make a frequency study of instruction usage, and encode more frequently used instructions with a smaller number of bits. (Yes, I am a mathematician.)

You can pack up to 4 instructions into an 18-bit word, provided that only 2 of them are 5-bit. You will not need to shift by more than one bit: if the instruction is 4-bit (bits 0-3), the next instruction starts at bit 4; if the instruction is 5-bit (bits 0-4), you will need to shift the word by 1 bit to have the next instruction at the same bit 4.

Alternatively, you may have three 5-bit instructions and one 3-bit instruction (having some defaults about the two bits that are not there).

Only 1 memory access per 3 or 4 instructions ;)
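The packing claim above (four instructions per 18-bit word, of which at most two may be 5-bit) is easy to verify by brute force. The sketch below simply enumerates every mix of 4-bit and 5-bit widths; the function name and defaults are illustrative, not part of any existing tool.

```python
from itertools import product

WORD_BITS = 18


def packings(count, sizes=(4, 5), word_bits=WORD_BITS):
    """Return every sequence of `count` instruction widths that fits in one word."""
    return [combo for combo in product(sizes, repeat=count)
            if sum(combo) <= word_bits]


four = packings(4)
max_five_bit = max(combo.count(5) for combo in four)

print(max_five_bit)   # largest number of 5-bit instructions among four packed ones
print(packings(5))    # widths for five instructions that still fit, if any
```

Four instructions with two 5-bit ones use exactly 4 + 4 + 5 + 5 = 18 bits, so there is no slack left for a third 5-bit instruction or a fifth instruction of any size.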
Reply by ●December 21, 2008
On Dec 21, 11:26 am, Jecel <je...@merlintec.com> wrote:
> On Dec 21, 7:13 am, rickman wrote:
>
> > I meant to crosspost to CLF and I also wanted to ask if anyone has looked at similarly making the word size a multiple of 9 rather than 8 bits. That not only matches the BRAMs, but also the multipliers. Is there any real advantage to this? I guess this could make it a bit hard to address 8 bit bytes in an 18 bit word. Or maybe not.
> > Opinions?
>
> The Intellasys chips use 18 bit words. In several places, 9 bit addresses are used. Only whole words are addressed, not bytes. Instructions are 5 bits wide, but of the four that fit into an 18 bit word, the last has only the top 3 bits (the lower bits are assumed to always be 0).
>
> -- Jecel

The main thing I am worried about with 18 bit words is byte addressing. Bytes could be addressed as 8 bit quantities packed into an 18 bit word with two bits unused (or used as parity perhaps), or they could be 9 bits each. 9 bit bytes are unusual, but a "byte" is not a fixed number of bits. I have also seen machines (old ones admittedly) with a 36 bit word and 6 bit bytes. I think the bytes used for character data were called "field code". This was a way of optimizing character data for limited storage. But I think that is way too complex for a tiny core CPU.

I like the idea of 9 bit bytes since that matches the instruction width. I'm just wondering if it will make trouble when character data is used in arithmetic. To use bytes in arithmetic requires the sign bit to be extended. With 9 bit bytes, the question is whether to extend the ninth bit or the eighth. Doing one or the other in hardware could make the other very awkward. I guess I could just wait until I have something to write code for and see how it works. It's not like the instruction set is cast in stone at any point. It's also possible that it doesn't need to be done in hardware at all.

How do the Intellasys chips use 9 bit addresses?
Are these just immediate fields within instructions? I have trouble remembering the details of the various Forth CPUs, but I want to say the Intellasys chip is one of Chuck Moore's designs and uses a full word for an instruction that contains immediate data/address. But that would be 5 bit instruction and 13 bit immediate. That doesn't seem to match. Besides, I want to say that was in the context of a 20 bit instruction word. I guess I am mixing different incarnations.

Rick
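The sign-extension question raised in this post (extend from the ninth bit or from the eighth) can be made concrete with a short sketch. It assumes an 18-bit two's-complement word; the helper names are invented for illustration and do not describe any existing hardware.

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1


def sign_extend(value, sign_bit):
    """Extend `value` from bit `sign_bit` up to an 18-bit word (returned unsigned)."""
    value &= (1 << (sign_bit + 1)) - 1   # keep bits 0..sign_bit only
    if value & (1 << sign_bit):
        value -= 1 << (sign_bit + 1)     # sign bit set: value is negative
    return value & WORD_MASK             # wrap into 18 bits


def to_signed(word):
    """Interpret an 18-bit word as a two's-complement integer."""
    return word - (1 << WORD_BITS) if word & (1 << (WORD_BITS - 1)) else word


byte9 = 0b0_1111_1111   # ninth bit clear, low eight bits all set
print(to_signed(sign_extend(byte9, 8)))   # read as a 9-bit quantity
print(to_signed(sign_extend(byte9, 7)))   # read as an 8-bit quantity
```

The same stored bit pattern is a positive number under one convention and a negative one under the other, which is why fixing either choice in hardware makes the other one cost extra instructions.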
Reply by ●December 21, 2008
rickman wrote:
...
> How do the Intellasys chips use 9 bit addresses? Are these just immediate fields within instructions? I have trouble remembering the details of the various Forth CPUs, but I want to say the Intellasys chip is one of Chuck Moore's designs and uses a full word for an instruction that contains immediate data/address. But that would be 5 bit instruction and 13 bit immediate. That doesn't seem to match. Besides, I want to say that was in the context of a 20 bit instruction word. I guess I am mixing different incarnations.

Remember, each core has only 64 words of RAM and 64 words of ROM. From the Data Sheet concerning branches:

The size and interpretation of the branch address field is affected by opcode packing. The field is always right justified within the same instruction word as the opcode that controls it. The slot number of the branch opcode will determine the number of high order bits available to the address field. Each of the branch cases, and its effect upon address size and effective address calculation, are illustrated in Table 3.4.

Example 1: If the opcode is in Slot 0, there is room for a full nine bits of address and a tenth bit to set extended mode.

Example 2: In Slot 1, there is room for eight address bits. Note that eight bits will reach all of RAM and ROM, but cannot reach the I/O ports. In other words, P8 is always set to 0 when branching from slot 1. P9 is not affected by Slot 2 branches.

Example 3: In Slot 2, 3 address bits are available. These three bits are combined with five bits of the P register to form an 8-bit address. P8 is set to 0 when branching from Slot 2. Note that the address is in the same 8-word page as the value of the P register when the opcode executes. P9 is not affected by Slot 2 branches.

--------

Direct addressing uses the 9-bit B register.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time applications since 1973."
==================================================
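The three branch cases quoted from the data sheet can be modelled roughly as follows. This is one reading of the quoted text, not a checked implementation: it treats P as a 10-bit value whose bit 9 is the extended-mode flag, it has not been compared against the data sheet's Table 3.4, and the function name is hypothetical.

```python
def branch_target(p, slot, field):
    """Return the new 10-bit P after a taken branch from `slot`.

    p     : current P (assumed: bit 9 = extended-mode flag, bits 8..0 = address)
    field : right-justified address field taken from the instruction word
    """
    p9 = p & 0x200
    if slot == 0:
        return field & 0x3FF                     # nine address bits plus the mode bit
    if slot == 1:
        return p9 | (field & 0xFF)               # eight address bits, P8 forced to 0
    if slot == 2:
        return p9 | (p & 0xF8) | (field & 0x07)  # three bits, same 8-word page, P8 = 0
    raise ValueError("the quoted text only covers branches in slots 0 to 2")


# A slot 2 branch keeps bits 7..3 of P and replaces only the low three bits.
print(hex(branch_target(0x3AB, 2, 0b101)))
```

In this model a Slot 2 branch can never leave the current 8-word page, and neither a Slot 1 nor a Slot 2 branch can reach an address with bit 8 set, which matches the note that the I/O ports are out of reach from those slots.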
Reply by ●December 21, 2008
On Dec 21, 9:33 am, rickman <gnu...@gmail.com> wrote:
> Opinions?

I see two issues: FPGA and Forth.

There is the issue that many FPGAs offer 18-bit wide logic that can easily be used for 18-bit wide memory and math.

There is the issue that Forth comes from a tradition of being word oriented both at the semantic level and evaluation level. Chuck has said that Forth is word oriented and that we should recognize that and take advantage of it. All his hardware and software has been word addressing for decades, and early Forth implementations were often on machines which did not have eight bit bytes.

Chuck has contrasted this tradition to C, which he points out comes from an 8-bit byte addressing tradition and has spawned generations of designs to copy 8-bit bytes from one location to another.

The 21 series and the c18 are both examples of word oriented designs. In Chuck's software characters went from six bits to a variable number of bits using Shannon encoding. There is very little dealing with 8-bit bytes.

The standard recognizes words and characters and that chars and words can be the same thing in an ANS compliant system. By default most Forths run on systems designed for C which have byte addressing and 8-bit characters or extended encoding for Unicode. The standard seems to recognize the word oriented tradition while recognizing the standard practice of dealing with 8-bit bytes and byte oriented file and math operations.

It is one of the most obvious differences between the kind of Forth you see Chuck Moore do and common standard Forth code. Chuck deals with bytes and files very infrequently; he uses words and blocks.

> The main thing I am worried about with 18 bit words is byte addressing.

Having word width be a multiple of 8 bits is important when bytes are what is important. On the FPGA you can use the 18-bit hardware as 16-bit words with a little more for parity or other hardware use.
On F21 and P21 I used 20 bit characters most of the time but sometimes used two 8-bit characters packed into a 20-bit word. But without a hardware opcode to pack or unpack bytes, eight shifts and masking in software make the use of bytes not as speed efficient, even if it is a more efficient use of memory. Sometimes I pack two and a half bytes into a word on a 20-bit machine, or on an 18-bit machine I pack two and a quarter bytes into a word. But packing tightly isn't a good format for manipulating the data.

> Bytes could be addressed as 8 bit quantities packed into an 18 bit word with two bits unused (or used as parity perhaps), or they could be 9 bits each. 9 bit bytes are unusual, but a "byte" is not a fixed number of bits. I have also seen machines (old ones admittedly) with a 36 bit word and 6 bit bytes. I think the bytes used for character data were called "field code". This was a way of optimizing character data for limited storage. But I think that is way too complex for a tiny core CPU.

C comes from a tradition of 8-bit bytes in files. GCC says right up front that you want byte addressing. Forth comes from a word oriented tradition. There is some overlap in hosted systems and standards, and probably more in the future as Unicode is more widespread.

> I like the idea of 9 bit bytes since that matches the instruction width. I'm just wondering if it will make trouble when character data is used in arithmetic. To use bytes in arithmetic requires the sign bit to be extended. With 9 bit bytes, the question is whether to extend the ninth bit or the eighth. Doing one or the other in hardware could make the other very awkward.

The big issue, I think, is using code that was designed around byte addressing. Originally there was some code ported to i21 from the FSL that had been claimed to be portable code but which was clearly designed around byte oriented concepts. It assumed that 8 bits' worth carries into the next 8-bit thing.
It was called portable standard code, and it was if words were 16 or 32 bits, but the Forth standard says that non-byte addressing can be standard. As is the case with a lot of portable code, it wasn't. Though it is standard Forth, non-byte addressing broke a lot of 'portable' Forth code that had been designed after C code that had made byte addressing assumptions. The assumptions are often so ingrained that people look at the code and don't see the assumptions. At a minimum the code needed a declaration of an environmental dependency on byte addressing.

> I guess I could just wait until I have something to write code for and see how it works. It's not like the instruction set is cast in stone at any point. It's also possible that it doesn't need to be done in hardware at all.

You want the hardware to do what is done frequently. If byte manipulation is the target you probably want byte addressing in hardware and logic.

> How do the Intellasys chips use 9 bit addresses?

The address space, including all RAM, ROM, and register and multiport addressing space, is 9 bits wide. When bit-8 is false, bit-7 selects RAM or ROM and bits 0-5 select one of 64 cells. When bit-8 is true, the register/port address decoding is selected.

> immediate fields within instructions?

No, only 9 bits of address decoding is used to select RAM/ROM/REG. Only the bottom 9 bits of R are set on a call or used on a return.

A branch opcode in slot 0 is allocated a 13-bit field. This covers all of RAM/ROM/REG and uses a bit to expand "+" into plus with or without carry. 5+13=18

A branch opcode in slot 1 is allocated an 8-bit field. This is RAM/ROM but not REG. 5+5+8=18

A branch opcode in slot 2 is allocated a 3-bit field, which limits the branch range to the current 8 address page. 5+5+5+3=18

On the c18 the three slots offer 13, 8, and 3-bit addressing fields respectively, as 5, 10, and 15 bits are already used for five bit opcodes.
Things were more complicated on the 21 series because they had both byte and word addressing. They booted in byte addressing mode so one could use an 8-bit ROM or FLASH. They mapped the byte loaded into the lowest 8 bits and executed the instruction in the last slot, the lowest five bits. And the branch opcodes in slot 3 occupied the lowest five bits and used all eight bits as the address field. So each branch opcode was limited to eight addresses on the current 8-bit memory page because the branch opcode was part of the address field. The lower five bits of the addresses used by the branch opcodes were opcode bits, so you could only set the three bits above those for addressing in this 8-bit addressing mode.

It made for ugly boot code in this 8-bit mode, so we tended to just use canned boot routines to load compiled programs. I wrote a few applications that ran entirely from ROM with no RAM, but they were even uglier due to the branch constraints in the byte addressing mode.

P21 had only 10-bit address fields. i21/f21 had more branch options, with 10-bit and 14-bit page fields and home page selection and home page/current 14-bit page branch mode. All 21 series used macros for branching beyond the page address field ranges: # push ;

> I guess I am mixing different incarnations.

I hope the branch opcode review helped.

Best Wishes
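The c18 field widths and the 9-bit address decoding described in this reply can be summarised in a short sketch. It follows only what the post states; the names are made up for illustration, and because the post does not say which value of bit 7 means RAM and which means ROM, that bit is returned as-is.

```python
WORD_BITS = 18
OPCODE_BITS = 5


def branch_field_bits(slot):
    """Bits left for the address field after the opcodes in slots 0..slot."""
    return WORD_BITS - OPCODE_BITS * (slot + 1)


def decode_address(addr):
    """Split a 9-bit address the way the post describes.

    Returns ("reg", low 8 bits) when bit 8 is set, otherwise
    ("mem", bit-7 select, cell index 0..63). The RAM/ROM polarity of
    bit 7 is not given in the post, so it is left as a raw bit.
    """
    addr &= 0x1FF
    if addr & 0x100:
        return ("reg", addr & 0xFF)
    return ("mem", (addr >> 7) & 1, addr & 0x3F)


print([branch_field_bits(slot) for slot in range(3)])
print(decode_address(0x0A5))
print(decode_address(0x1A5))
```

The field widths fall straight out of the word size: each opcode ahead of the address field costs five bits, which gives the 13, 8 and 3 bit fields quoted above.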
Reply by ●December 21, 2008
On Sun, 21 Dec 2008 09:32:17 -0800 (PST), m_l_g3 wrote:
> On Dec 21, 12:13 pm, rickman <gnu...@gmail.com> wrote:
>> I meant to crosspost to CLF and I also wanted to ask if anyone has looked at similarly making the word size a multiple of 9 rather than 8 bits. That not only matches the BRAMs, but also the multipliers. Is there any real advantage to this? I guess this could make it a bit hard to address 8 bit bytes in an 18 bit word. Or maybe not.
>> Opinions?
>>
>
> ANS Forth says that address units need not be bytes.
> You may have 18-bit AUs, 9-bit AUs, or 6-bit AUs.
> (In the latter case you will need a fast division by 3 :).

An address unit or character according to the Forth standard is at least 8 bits. So a CDC 6000 would have an environmental dependency here ;-)

--
Coos
CHForth, 16 bit DOS applications
http://home.hccnet.nl/j.j.haak/forth.html
Reply by ●December 21, 2008





