Reply by Tim February 25, 2004
Ray Andraka wrote:
> In my experience, the stumbling block for custom CPUs is not so much
> the hardware as it is the compiler for it.
Jan Gray did an interesting article on this for Circuit Cellar a few years back, targeting the lcc compiler. The article should still be on www.fpgacpu.org.
Reply by john jakson February 24, 2004
jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402240132.7a92aa17@posting.google.com>...
> > Cache architecture is currently 1 way set associative, but more
> > Blockrams would allow more ways.
>
> Do you not think that the number of ways has to be at least as great
> as the number of threads? I would expect a significant amount of
> conflict misses (particularly in the I-Cache) if this is not the case.
> Hit-under-miss is a must. Otherwise all those impressive Mega-Hurtz
> will just be thrown away stalling for cache refills.
>
> Cheers,
> JonB
Hi Jon

Not necessarily. On a conventional HT cpu, the threads would all be independent and would likely fight over the cache set size, so 2 way would probably be a minimum. Since these threads are supposed to be cooperating as Occam processes would, their opcodes would be local, but that assumes sibling processes run close to each other in time space. No guarantee of that. In the HW event driven case, it's much easier to speculate about what will likely happen since the scheduling model is so much simpler. Even if there are lots of conflicts, what will happen is that the threads will just keep delaying.

In the HW time wheel, there are actually 16 threads waiting to go (or null Ps if fewer are available). These 16 represent the front of the proper P queue stored as a linked list out in memory space (only some of which might be in cache at any time). The HW only allows the front 4 of those to queue up in the Iop queue. The fetcher steals or forces available cache read slots to keep this full, rotating between the 4 queues, which live inside distributed 16b DP rams by 64 wide. Hence each running thread can buffer up to 16 small ops, or 4 extended 64b ops, or some mix.

On a side note, if the cpu were 64b wide, the HT would have to be 8 way, but then the Iqueue HW would be twice as wide too, so that still allows each P to buffer up the same number of ops. I would have to tweak the HDL code to group the rams for height vs width, keeping the output 64b wide always. Wider data ops don't really change the opcode fetch rate, since fetch now occurs half as often as before. The fractional cost of executing ops then changes from 9/8 to 17/16 cycles for ALU a<-b OP c, so a slight speed up. Putting large literals or actual addresses in code space would wipe that out.

The fetcher also writes the Pid with the opcodes, and it does a superficial check to see whether any 16b ops are bra codes or not. If it pushes a bra, it will then keep pushing just a few more words until it's past, and then rotate that Pid out and take the next one from the other 12 waiting. The other side of the Iop queue just reads the 1-4 wide opcodes with the Pid and decodes/executes the ops. It tracks the opcode size and uses any bra codes as just another control field.
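[A rough C model may make that fetch rotation concrete. The 16-slot wheel, 4 active Iop queues, 16-op queue depth, and 4 small ops per 64b read follow the description above; the names and everything else are invented for illustration, not taken from the actual HDL:]

    /* Loose C sketch of the scheduling idea: 16-entry time wheel,
       4 threads with live Iop queues, round-robin fetch. */
    #include <stdio.h>

    #define WHEEL_SLOTS  16   /* front of the P queue held in HW */
    #define ACTIVE_QS     4   /* threads with live Iop queues */
    #define QUEUE_DEPTH  16   /* small ops buffered per active thread */

    typedef struct {
        int pid;              /* -1 = null process */
        int iops_buffered;
    } Slot;

    int main(void) {
        Slot wheel[WHEEL_SLOTS];
        for (int i = 0; i < WHEEL_SLOTS; i++)
            wheel[i] = (Slot){ .pid = i < 10 ? i : -1, .iops_buffered = 0 };

        /* Fetcher: rotate over the 4 front slots, topping up each Iop
           queue from (modeled) cache read slots. On a branch push, the
           real fetcher pushes a few more words, then rotates that pid
           out and pulls the next of the 12 waiting. */
        for (int cycle = 0; cycle < 8; cycle++) {
            Slot *s = &wheel[cycle % ACTIVE_QS];
            if (s->pid >= 0 && s->iops_buffered < QUEUE_DEPTH) {
                s->iops_buffered += 4;  /* one 64b cache read = 4 small ops */
                printf("cycle %d: fetch for pid %d (%d ops queued)\n",
                       cycle, s->pid, s->iops_buffered);
            }
        }
        return 0;
    }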
By the time the bra decision is available, the Iop box will have executed more than 4 ops, but by then it will have been P switched already. The bra decision, when it does arrive, will post back the modified ip into the Pid-selected ip field. Pid rides along the datapath pipeline too. Bra pts may be used to do the outer timesharing, but I may leave that to a SW kernel. Cache misses will probably be treated the same way: if the miss is going to be long, switch to the next P in the side queue. You can imagine a little railway track figure-of-8 made up of selective pipelines & muxes holding minimal P state, with something like a Johnson counter or one-hot coded state engine in charge.

One huge difference between this HT processor and the ones you hear about, x86, Alpha etc: I expect to use RLDRAM as 2nd level cache, which RAS cycles in 20ns, about 1.5 effective cpu cycles of 13.3ns. It is 8 way banked and can support 2.5ns data rates and control. I will probably be limited to the 311MHz rate, and DDR is limited to 622MHz in the specs (a convenient 2x); this is right on the edge of what FPGAs can do and below the RLDRAM2 800MHz std. Remember x86 in particular has to be designed to work with very slow RAS el cheapo DDR Rams, which can be several hundred times slower than cpu cycles. Intel can't do a special tweak for RLDRAM since the difference is still very large, maybe 50 or more. In this cpu I could almost throw cache out and go direct to RLDRAM as main memory, which is why I am not too concerned about tiny cache. I will be building an RLDRAM model soon by faking a bunch of 8 Blockrams together with delays and muxes/demuxes. This will let me test out 1-8 cpu models running with faked RLDRAM all inside a sp-400 part. Further, a 64b 8 way HT cpu would actually cycle slower than RLDRAM, ie 26ns.

The real purpose of the cache, which is a unified data-instruction-workspace, is to satisfy the enormous bandwidth requirement of the workspace operations. Reg cpus have 1 or more reg files separate from d/i cache, but they carry the burden of very expensive context swaps. R3 keeps many workspaces in the unified cache and provides 3 ports to the datapath, 2 reads and 2 joined writes, using a pair of DP rams. The instruction and data fetch requirements could be met by fast RLDRAM without cache, though some buffering would still be needed. The T9000 style workspace caching is what makes this all work, and it means the cpus run close to RLDRAM speed. If R3 ever went ASIC and n x faster, of course the cache would go full custom and far bigger.

Hope that helps

johnjakson_usa_com
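[As a back-of-envelope check on those latency figures: the 20ns RAS and 13.3ns cycle come from the post above, while the DDR random-access latency and the 3GHz x86 clock are assumed round numbers just to reproduce the contrast being drawn:]

    #include <stdio.h>

    int main(void) {
        double cpu_cycle_ns  = 13.3;  /* effective cpu cycle, per post */
        double rldram_ras_ns = 20.0;  /* RLDRAM row cycle, per post */
        double ddr_random_ns = 50.0;  /* assumed DDR random access */
        double x86_cycle_ns  = 0.33;  /* assumed ~3GHz x86 */

        printf("RLDRAM miss on this cpu = %.1f cycles\n",
               rldram_ras_ns / cpu_cycle_ns);       /* ~1.5 */
        printf("DDR miss on fast x86    = %.0f cycles\n",
               ddr_random_ns / x86_cycle_ns);       /* ~150: the big gap */
        return 0;
    }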
Reply by john jakson February 24, 2004
Jim Granville <no.spam@designtools.co.nz> wrote in message news:<RDx_b.28109$ws.3170985@news02.tsnz.net>...
> "Hal Murray wrote: > >>I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the > >>code for it, although folks who are more saavy than I on the software side might argue that the high speed > >>hardware design is the hard part. > > > > > > How much code are you writing? Would you be willing/happy to do it in asembler? > > > > Assemblers can be pretty simple, especially if the target is raw binary running > > at loaded at 0 rather than something needing linkers and libraries. Also helps > > if the target is RISC and doesn't have messy addressing modes. > > > > How much would a reasonably clean sample assembler help? There should be > > a good example from the academic world. Just type in the new opcode table. > > > "AS" from Alfred Arnold is a good wide-cores assembler, with a choice of > Pascal or C sources : > > http://john.ccac.rwth-aachen.de:8000/as/download.html > > And HLA (High Level Assembler) is currently x86 only, but the front > end, and approach is much closer to higher level languages (but minus > the bloat). V2 will allow different back ends, for opcode outputs. > Worth watching. > > http://webster.cs.ucr.edu/AsmTools/HLA/index.html > > This is able to support quite large code efforts, and remain > close to the iron.. > > A benefit of working from the 'best assembler' end, is the ease of > support multiple/tiny core instances - which is one of the > advantages of such soft cores. > > -jg
Although an assembler is only a tiny fraction of the effort of a C compiler, once done it only opens the door just enough to bootstrap up slowly. For a processor to have much wider appeal needs the full effort, either to port or write from scratch. I will probably set the hard type semantics of C aside for a while and just add a very quick-and-dirty codegen that handles C style assembler and simple single-size expressions with none of the usual optimizations, and just play dumb. Then baseline C/Verilog/Occam/inline asm can be written that might violate some proper rules. The compiler wouldn't be able to compile itself, but I could get on with testbench and verification. Right now it can analyze itself but doesn't emit anything. It does have a nice #preprocessor built into the lexer that allows C++ like use of definitions with the same name but varying numbers of params, which is not described in the lcc book.

The usual way in the past was to define subsets of the target language and compile for that, with the compiler also being restricted to that level. The 1st pass might be an assembler. The compiler could then operate at some level on the target, and as the language subset is raised, the compiler gets to use the new features and tests them on the next round. I don't think people do that anymore unless the language is brand new and no compiler exists yet. Once a compiler does exist, it's usually easier to cross port.

This brings up a point: can a new compiler be commercially distributed if the design is largely based off previous open code? I will have to go check the license on lcc.

johnjakson_usa_com
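[To illustrate what a "very quick-and-dirty codegen ... with none of the usual optimizations" might look like: a minimal C sketch that walks an expression tree and spends a fresh virtual register per node. The node types, mnemonics, and output format are invented and bear no relation to the actual lcc internals:]

    #include <stdio.h>

    typedef enum { N_NUM, N_VAR, N_ADD, N_MUL } NodeKind;

    typedef struct Node {
        NodeKind kind;
        int value;              /* N_NUM */
        const char *name;       /* N_VAR */
        struct Node *lhs, *rhs; /* N_ADD / N_MUL */
    } Node;

    static int next_reg = 0;

    /* Emit naive three-address code, no CSE, no register allocation;
       returns the virtual register holding the result. */
    static int gen(Node *n) {
        int r = next_reg++;
        switch (n->kind) {
        case N_NUM: printf("  ldi  r%d, %d\n", r, n->value); break;
        case N_VAR: printf("  ld   r%d, [%s]\n", r, n->name); break;
        case N_ADD: { int a = gen(n->lhs), b = gen(n->rhs);
                      printf("  add  r%d, r%d, r%d\n", r, a, b); break; }
        case N_MUL: { int a = gen(n->lhs), b = gen(n->rhs);
                      printf("  mul  r%d, r%d, r%d\n", r, a, b); break; }
        }
        return r;
    }

    int main(void) {
        /* a + b*4 */
        Node four = { N_NUM, 4, 0, 0, 0 };
        Node b    = { N_VAR, 0, "b", 0, 0 };
        Node a    = { N_VAR, 0, "a", 0, 0 };
        Node mul  = { N_MUL, 0, 0, &b, &four };
        Node add  = { N_ADD, 0, 0, &a, &mul };
        gen(&add);
        return 0;
    }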
Reply by Jon Beniston February 24, 2004
> Cache architecture is currently 1 way set associative, but more
> Blockrams would allow more ways.
Do you not think that the number of ways has to be at least as great as the number of threads? I would expect a significant amount of conflict misses (particularly in the I-Cache) if this is not the case. Hit-under-miss is a must. Otherwise all those impressive Mega-Hurtz will just be thrown away stalling for cache refills.

Cheers,
JonB
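[To illustrate the conflict-miss concern, a small C sketch of how a direct-mapped (1 way) cache index is formed; the cache geometry and addresses are made-up values chosen so that two threads' fetch addresses collide on the same set:]

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES  32
    #define NUM_SETS   256                 /* 8KB direct-mapped cache */

    static unsigned set_index(uint32_t addr) {
        return (addr / LINE_BYTES) % NUM_SETS;
    }

    int main(void) {
        /* Two threads whose code happens to sit 8KB apart map to the
           same set, so each I-fetch evicts the other thread's line. */
        uint32_t t0_pc = 0x00010000;
        uint32_t t1_pc = 0x00012000;       /* +8KB: same index, new tag */
        printf("t0 set=%u  t1 set=%u\n", set_index(t0_pc), set_index(t1_pc));
        /* With at least as many ways as threads, both lines coexist. */
        return 0;
    }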
Reply by Jim Granville February 23, 2004
"Hal Murray wrote:
>> I suspect that the difficulty for just about any home grown
>> processor is going to be the tools to compile the code for it,
>> although folks who are more savvy than I on the software side might
>> argue that the high speed hardware design is the hard part.
>
> How much code are you writing? Would you be willing/happy to do it in
> assembler?
>
> Assemblers can be pretty simple, especially if the target is raw
> binary loaded at 0 rather than something needing linkers and
> libraries. Also helps if the target is RISC and doesn't have messy
> addressing modes.
>
> How much would a reasonably clean sample assembler help? There should
> be a good example from the academic world. Just type in the new
> opcode table.
"AS" from Alfred Arnold is a good wide-cores assembler, with a choice of Pascal or C sources : http://john.ccac.rwth-aachen.de:8000/as/download.html And HLA (High Level Assembler) is currently x86 only, but the front end, and approach is much closer to higher level languages (but minus the bloat). V2 will allow different back ends, for opcode outputs. Worth watching. http://webster.cs.ucr.edu/AsmTools/HLA/index.html This is able to support quite large code efforts, and remain close to the iron.. A benefit of working from the 'best assembler' end, is the ease of support multiple/tiny core instances - which is one of the advantages of such soft cores. -jg
Reply by Ray Andraka February 23, 2004
The low complexity is why I chose the architecture I did. Unfortunately, I did that design in schematics, before
I started using VHDL, so resurrecting it at this point involves more time than I can devote to it.

john jakson wrote:

> The HW part is more fun though. The 1802 takes me back, not bad in a
> twisted sort of way, it certainly used very little logic, I had it
> under a scope at Inmos.
>
> Regards
>
> johnjakson_usa_com
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Reply by Hal Murray February 23, 2004
> I suspect that the difficulty for just about any home grown processor
> is going to be the tools to compile the code for it, although folks
> who are more savvy than I on the software side might argue that the
> high speed hardware design is the hard part.
How much code are you writing? Would you be willing/happy to do it in assembler?

Assemblers can be pretty simple, especially if the target is raw binary loaded at 0 rather than something needing linkers and libraries. Also helps if the target is RISC and doesn't have messy addressing modes.

How much would a reasonably clean sample assembler help? There should be a good example from the academic world. Just type in the new opcode table.

--
The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.
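[In the spirit of "just type in the new opcode table", a toy table-driven assembler in C. The ISA, mnemonics, and 16-bit encoding are all invented; a real assembler would add a second pass for labels:]

    #include <stdio.h>
    #include <string.h>

    typedef struct { const char *mnemonic; unsigned opcode; } OpEntry;

    /* "Just type in the new opcode table": */
    static const OpEntry optab[] = {
        { "add", 0x0 }, { "sub", 0x1 }, { "and", 0x2 }, { "or", 0x3 },
    };

    static int assemble_line(const char *line, unsigned short *word) {
        char mnem[8]; unsigned rd, rs;
        if (sscanf(line, "%7s r%u, r%u", mnem, &rd, &rs) != 3) return -1;
        for (size_t i = 0; i < sizeof optab / sizeof optab[0]; i++)
            if (strcmp(mnem, optab[i].mnemonic) == 0) {
                /* invented 16-bit format: [15:12]=op [11:8]=rd [7:4]=rs */
                *word = (optab[i].opcode << 12) | (rd << 8) | (rs << 4);
                return 0;
            }
        return -1;
    }

    int main(void) {
        unsigned short w;
        if (assemble_line("add r1, r2", &w) == 0)
            printf("0x%04x\n", w);   /* prints 0x0120 */
        return 0;
    }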
Reply by john jakson February 23, 2004
Ray Andraka <ray@andraka.com> wrote in message news:<403A43E8.6338D0C1@andraka.com>...
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much
> the hardware as it is the compiler for it. I did a small
> microcontroller for a XC4036E design several years back that ran at
> 66 MHz. It was a pretty simple machine that was sort of a cross
> between a PIC and an RCA1802 in that it used a 16 deep register file
> like the 1802, and it was a Harvard architecture like the PIC. Like
> the 1802, the operands for the ALU were fetched from the register
> file and results returned to the register file. The beauty of it was
> that for control applications, you often did not even need any
> memory beyond the register file. The processor size was about 80
> CLBs (translates to 80 slices in current architectures). I'm not a
> compiler person, so the big difficulty I had with it was the
> compiler.
>
> I suspect that the difficulty for just about any home grown
> processor is going to be the tools to compile the code for it,
> although folks who are more savvy than I on the software side might
> argue that the high speed hardware design is the hard part.
Hi Ray

Half agreed. As Jan has shown, any std risc cpu project can grab lcc to do the task quite quickly by messing with the emit tables. If this were just another std risc project I'd probably do the same, but then it wouldn't be anywhere near 300MHz either, more like MicroBlaze. Only hyperthreading allows max speed, but if the processes don't communicate with each other then lcc could still be used as is, ignoring the HT stuff.

Some of my background is in compilers and other tools, but I never worked for anybody doing that. The lcc compiler (Hanson & Fraser) is possibly the best documented C compiler writing textbook around and highly recommended, as it explains thoroughly just how horrible C really is where most C books gloss over its complexity. The complexity for me comes because I am combining essentially 3 langs together and putting in a mini OS runtime. The Transputer did it before, but chose an unfriendly syntax and supported C only as an afterthought. I will probably get through it ok. I would love to pass that part on, but then that person would be knee deep in it instead. The HW part is more fun though.

The 1802 takes me back, not bad in a twisted sort of way, it certainly used very little logic, I had it under a scope at Inmos.

Regards

johnjakson_usa_com
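[This is not lcc's actual backend interface, but a tiny C sketch of the general idea of retargeting "by messing with the emit tables": each IR opcode maps to an output template, so a port mostly means rewriting the table for the new instruction set:]

    #include <stdio.h>

    enum IrOp { IR_ADD, IR_SUB, IR_LOAD, IR_STORE, IR_NOPS };

    /* %d = dest reg, %a/%b = source regs (template language invented) */
    static const char *emit_tab[IR_NOPS] = {
        [IR_ADD]   = "add  %d, %a, %b",
        [IR_SUB]   = "sub  %d, %a, %b",
        [IR_LOAD]  = "ld   %d, [%a]",
        [IR_STORE] = "st   %b, [%a]",
    };

    static void emit(enum IrOp op, int d, int a, int b) {
        for (const char *p = emit_tab[op]; *p; p++) {
            if (*p == '%') {
                switch (*++p) {
                case 'd': printf("r%d", d); break;
                case 'a': printf("r%d", a); break;
                case 'b': printf("r%d", b); break;
                }
            } else putchar(*p);
        }
        putchar('\n');
    }

    int main(void) {
        emit(IR_ADD, 3, 1, 2);   /* add  r3, r1, r2 */
        return 0;
    }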
Reply by Jim Granville February 23, 2004
Ray Andraka wrote:
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much
> the hardware as it is the compiler for it. I did a small
> microcontroller for a XC4036E design several years back that ran at
> 66 MHz. It was a pretty simple machine that was sort of a cross
> between a PIC and an RCA1802 in that it used a 16 deep register file
> like the 1802, and it was a Harvard architecture like the PIC. Like
> the 1802, the operands for the ALU were fetched from the register
> file and results returned to the register file. The beauty of it was
> that for control applications, you often did not even need any
> memory beyond the register file. The processor size was about 80
> CLBs (translates to 80 slices in current architectures). I'm not a
> compiler person, so the big difficulty I had with it was the
> compiler.
>
> I suspect that the difficulty for just about any home grown
> processor is going to be the tools to compile the code for it,
> although folks who are more savvy than I on the software side might
> argue that the high speed hardware design is the hard part.
This is right, and John admits this in another reply.

You should also add DEBUG support, as that becomes more important as the CPU targets bigger applications. Once you have a compiler, users will want to do more and more, and then debug becomes very important.

It depends a lot on the target use. Something that runs from a Block RAM inside the FPGA can be very small/very fast, but is probably best coded in some form of Assembler.

Best example of 'Advanced Assembler Art' is Randy Hyde's HLA (High Level Assembler), but that currently targets only x86 - tho I'm sure that's not hard to fix :)

This HLA allows IF..THEN..ELSIF etc, and handles the labels needed, as well as giving local scope (so is a big step up from vanilla ASM).

-jg
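[As a sketch of what such structured-IF support buys (not HLA's real implementation), a few lines of C showing an assembler front end inventing the labels and branches for an IF/ELSE; the output mnemonics are made up:]

    #include <stdio.h>

    static int label_id = 0;

    /* Expand: if (cond) { then } else { els } */
    static void emit_if(const char *bfalse, const char *then_code,
                        const char *else_code) {
        int els = label_id++, end = label_id++;
        printf("  %s L%d\n", bfalse, els);  /* branch-if-false to else */
        printf("%s", then_code);
        printf("  jmp L%d\n", end);
        printf("L%d:\n", els);
        printf("%s", else_code);
        printf("L%d:\n", end);
    }

    int main(void) {
        emit_if("bz",                       /* branch if zero flag */
                "  mov r1, 1\n",
                "  mov r1, 0\n");
        return 0;
    }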
Reply by Ray Andraka February 23, 2004
John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
it.  I did a small microcontroller for a XC4036E design several years back that ran at 66 MHz.  It was a
pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
register file like the 1802, and it was a Harvard architecture like the PIC.   Like the 1802, the operands
for the ALU were fetched from the register file and results returned to the register file.  The beauty of it
was that for control applications, you often did not even need any memory beyond the register file.  The
processor size was about 80 CLBs (translates to 80 slices in current architectures).   I'm not a compiler
person, so the big difficulty I had with it was the compiler.
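
[A minimal C model of a machine along those lines: 16 deep register file as the only data store, separate Harvard-style program memory, ALU operands fetched from and results returned to the register file. The 16-bit encoding is invented, not the actual design described above:]

    #include <stdio.h>
    #include <stdint.h>

    enum { OP_LDI, OP_ADD, OP_SUB, OP_OUT };

    int main(void) {
        uint8_t reg[16] = {0};              /* register file = all the RAM */
        const uint16_t prog[] = {           /* separate instruction store */
            /* invented format: [15:12]=op [11:8]=rd [7:4]=rs [3:0]=imm */
            (OP_LDI << 12) | (0 << 8) | 5,          /* r0 = 5   */
            (OP_LDI << 12) | (1 << 8) | 7,          /* r1 = 7   */
            (OP_ADD << 12) | (0 << 8) | (1 << 4),   /* r0 += r1 */
            (OP_OUT << 12) | (0 << 8),
        };

        for (unsigned pc = 0; pc < sizeof prog / sizeof prog[0]; pc++) {
            uint16_t w = prog[pc];
            unsigned op = w >> 12, rd = (w >> 8) & 15, rs = (w >> 4) & 15;
            switch (op) {
            case OP_LDI: reg[rd] = w & 15; break;
            case OP_ADD: reg[rd] += reg[rs]; break;
            case OP_SUB: reg[rd] -= reg[rs]; break;
            case OP_OUT: printf("r%u = %u\n", rd, reg[rd]); break;
            }
        }
        return 0;
    }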

I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more savvy than I on the software side might argue that the high speed
hardware design is the hard part.


john jakson wrote:

> jon@beniston.com (Jon Beniston) wrote in message
> news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
> > > I am even tempted to max the datapath to 64b as it only
> > > adds 3-4 pipestages and not much to the control.
> >
> > Sure, but the more pipeline stages you add, the longer the latency
> > is for each instruction. How many cycles latency will there be for
> > a single add instruction? Do you intend to make sure that the
> > number of threads is equal to this latency, so that the latency as
> > perceived by the thread executing the instruction is 0?
> >
> > What's your cache / memory architecture? Handling lots of threads
> > could be tricky.
> >
> > Cheers,
> > JonB
>
> I just posted a very long reply but the server just xxxxed it so I
> will write it again later offline.
>
> Quick answer yes, HT must match 4 or 8 etc. Cache architecture is
> currently 1 way set associative, but more Blockrams would allow more
> ways. Question of whether the FPGA should hold lots of lite cpus or 1
> monster cpu or maybe combinations of both!
>
> Regards
>
> johnjakson_usa_com
>
> 508 4800777 EST after 8pm
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759