Reply by Tim February 25, 2004
Ray Andraka wrote:
> In my experience, the stumbling block for custom CPUs is not so much
> the hardware as it is the compiler for it.
Jan Gray did an interesting article on this for Circuit Cellar a few years back, targeting the lcc compiler. The article should still be on www.fpgacpu.org.
Reply by john jakson February 24, 2004
jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402240132.7a92aa17@posting.google.com>...
> > Cache architecture is currently 1 way set associative, but more
> > Blockrams would allow more ways.
>
> Do you not think that the number of ways has to be at least as great
> as the number of threads? I would expect a significant amount of
> conflict misses (particularly in the I-Cache) if this is not the case.
> Hit-under-miss is a must. Otherwise all those impressive Mega-Hurtz
> will just be thrown away stalling for cache refills.
>
> Cheers,
> JonB
Hi Jon

Not necessarily. On a conventional HT cpu, the threads would all be independent and would likely fight over the cache set size, so 2 way would probably be a minimum. Since these threads are supposed to be cooperating as Occam processes would, their opcodes would be local, but that assumes sibling processes run close to each other in time space. No guarantee of that. In the HW event driven case, it's much easier to speculate about what will likely happen since the scheduling model is so much simpler. Even if there are lots of conflicts, what will happen is that the threads will just keep delaying.

In the HW time wheel, there are actually 16 threads waiting to go (or null Ps if fewer are available). These 16 represent the front of the proper P queue stored as a linked list out in memory space (only some of which might be in cache at any time). The HW only allows the front 4 of those to queue up in the Iop queue. The fetcher steals or forces available cache read slots to keep this full, rotating between the 4 queues, which live inside distributed 16b DP rams by 64 wide. Hence each running thread can buffer up to 16 small ops, or 4 extended 64b ops, or some mix.

On a side note, if the cpu were 64b wide, the HT would have to be 8 way, but then the Iqueue HW would be twice as wide too, so that still allows each P to buffer up the same number of ops. I would have to tweak the HDL code to group the rams for height vs width, keeping the output 64b wide always. Wider data ops don't really change the opcode fetch rate, since fetch now occurs half as often as before. The fractional cost of executing ops then changes from 9/8 to 17/16 cycles for ALU a<-b OP c, so a slight speed up. Putting large literals or actual addresses in code space would wipe that out.

The fetcher also writes the Pid with the opcodes, and it does a superficial check to see whether any 16b ops are bra codes or not. If it pushes a bra, it will then keep pushing just a few more words until it's past, and then rotate that Pid out and take the next one from the other 12 waiting. The other side of the Iop queue just reads the 1-4 wide opcodes with the Pid and decodes/executes the ops. It tracks the opcode size and uses any bra codes as just another control field.
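[A rough C model may make that fetch rotation concrete. The 16-slot wheel, 4 active Iop queues, 16-op queue depth, and 4 small ops per 64b read follow the description above; the names and everything else are invented for illustration, not taken from the actual HDL:]

    /* Loose C sketch of the scheduling idea: 16-entry time wheel,
       4 threads with live Iop queues, round-robin fetch. */
    #include <stdio.h>

    #define WHEEL_SLOTS  16   /* front of the P queue held in HW */
    #define ACTIVE_QS     4   /* threads with live Iop queues */
    #define QUEUE_DEPTH  16   /* small ops buffered per active thread */

    typedef struct {
        int pid;              /* -1 = null process */
        int iops_buffered;
    } Slot;

    int main(void) {
        Slot wheel[WHEEL_SLOTS];
        for (int i = 0; i < WHEEL_SLOTS; i++)
            wheel[i] = (Slot){ .pid = i < 10 ? i : -1, .iops_buffered = 0 };

        /* Fetcher: rotate over the 4 front slots, topping up each Iop
           queue from (modeled) cache read slots. On a branch push, the
           real fetcher pushes a few more words, then rotates that pid
           out and pulls the next of the 12 waiting. */
        for (int cycle = 0; cycle < 8; cycle++) {
            Slot *s = &wheel[cycle % ACTIVE_QS];
            if (s->pid >= 0 && s->iops_buffered < QUEUE_DEPTH) {
                s->iops_buffered += 4;  /* one 64b cache read = 4 small ops */
                printf("cycle %d: fetch for pid %d (%d ops queued)\n",
                       cycle, s->pid, s->iops_buffered);
            }
        }
        return 0;
    }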
By the time the bra decision is available, the Iop box will have executed more than 4 ops, but by then it will have been P switched already. The bra decision, when it does arrive, will post back the modified ip into the Pid-selected ip field. Pid rides along the datapath pipeline too. Bra pts may be used to do the outer timesharing, but I may leave that to a SW kernel. Cache misses will probably be treated the same way: if the miss is going to be long, switch to the next P in the side queue. You can imagine a little railway track figure-of-8 made up of selective pipelines & muxes holding minimal P state, with something like a Johnson counter or one-hot coded state engine in charge.

One huge difference between this HT processor and the ones you hear about, x86, Alpha etc: I expect to use RLDRAM as 2nd level cache, which RAS cycles in 20ns, about 1.5 effective cpu cycles of 13.3ns. It is 8 way banked and can support 2.5ns data rates and control. I will probably be limited to the 311MHz rate, and DDR is limited to 622MHz in the specs (a convenient 2x); this is right on the edge of what FPGAs can do and below the RLDRAM2 800MHz std. Remember x86 in particular has to be designed to work with very slow RAS el cheapo DDR Rams, which can be several hundred times slower than cpu cycles. Intel can't do a special tweak for RLDRAM since the difference is still very large, maybe 50 or more. In this cpu I could almost throw cache out and go direct to RLDRAM as main memory, which is why I am not too concerned about tiny cache. I will be building an RLDRAM model soon by faking a bunch of 8 Blockrams together with delays and muxes/demuxes. This will let me test out 1-8 cpu models running with faked RLDRAM all inside a sp-400 part. Further, a 64b 8 way HT cpu would actually cycle slower than RLDRAM, ie 26ns.

The real purpose of the cache, which is a unified data-instruction-workspace, is to satisfy the enormous bandwidth requirement of the workspace operations. Reg cpus have 1 or more reg files separate from d/i cache, but they carry the burden of very expensive context swaps. R3 keeps many workspaces in the unified cache and provides 3 ports to the datapath, 2 reads and 2 joined writes, using a pair of DP rams. The instruction and data fetch requirements could be met by fast RLDRAM without cache, though some buffering would still be needed. The T9000 style workspace caching is what makes this all work, and it means the cpus run close to RLDRAM speed. If R3 ever went ASIC and n x faster, of course the cache would go full custom and far bigger.

Hope that helps

johnjakson_usa_com
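[As a back-of-envelope check on those latency figures: the 20ns RAS and 13.3ns cycle come from the post above, while the DDR random-access latency and the 3GHz x86 clock are assumed round numbers just to reproduce the contrast being drawn:]

    #include <stdio.h>

    int main(void) {
        double cpu_cycle_ns  = 13.3;  /* effective cpu cycle, per post */
        double rldram_ras_ns = 20.0;  /* RLDRAM row cycle, per post */
        double ddr_random_ns = 50.0;  /* assumed DDR random access */
        double x86_cycle_ns  = 0.33;  /* assumed ~3GHz x86 */

        printf("RLDRAM miss on this cpu = %.1f cycles\n",
               rldram_ras_ns / cpu_cycle_ns);       /* ~1.5 */
        printf("DDR miss on fast x86    = %.0f cycles\n",
               ddr_random_ns / x86_cycle_ns);       /* ~150: the big gap */
        return 0;
    }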
Reply by john jakson February 24, 2004
Jim Granville <no.spam@designtools.co.nz> wrote in message news:<RDx_b.28109$ws.3170985@news02.tsnz.net>...
> "Hal Murray wrote: > >>I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the > >>code for it, although folks who are more saavy than I on the software side might argue that the high speed > >>hardware design is the hard part. > > > > > > How much code are you writing? Would you be willing/happy to do it in asembler? > > > > Assemblers can be pretty simple, especially if the target is raw binary running > > at loaded at 0 rather than something needing linkers and libraries. Also helps > > if the target is RISC and doesn't have messy addressing modes. > > > > How much would a reasonably clean sample assembler help? There should be > > a good example from the academic world. Just type in the new opcode table. > > > "AS" from Alfred Arnold is a good wide-cores assembler, with a choice of > Pascal or C sources : > > http://john.ccac.rwth-aachen.de:8000/as/download.html > > And HLA (High Level Assembler) is currently x86 only, but the front > end, and approach is much closer to higher level languages (but minus > the bloat). V2 will allow different back ends, for opcode outputs. > Worth watching. > > http://webster.cs.ucr.edu/AsmTools/HLA/index.html > > This is able to support quite large code efforts, and remain > close to the iron.. > > A benefit of working from the 'best assembler' end, is the ease of > support multiple/tiny core instances - which is one of the > advantages of such soft cores. > > -jg
Although an assembler is only a tiny fraction of the effort of a C compiler, once done it only opens the door just enough to bootstrap up slowly. For a processor to have much wider appeal needs the full effort, either to port or write from scratch. I will probably set the hard type semantics of C aside for a while and just add a very quick-and-dirty codegen that handles C style assembler and simple single-size expressions with none of the usual optimizations, and just play dumb. Then baseline C/Verilog/Occam/inline asm can be written that might violate some proper rules. The compiler wouldn't be able to compile itself, but I could get on with testbench and verification. Right now it can analyze itself but doesn't emit anything. It does have a nice #preprocessor built into the lexer that allows C++ like use of definitions with the same name but varying numbers of params, which is not described in the lcc book.

The usual way in the past was to define subsets of the target language and compile for that, with the compiler also being restricted to that level. The 1st pass might be an assembler. The compiler could then operate at some level on the target, and as the language subset is raised, the compiler gets to use the new features and tests them on the next round. I don't think people do that anymore unless the language is brand new and no compiler exists yet. Once a compiler does exist, it's usually easier to cross port.

This brings up a point: can a new compiler be commercially distributed if the design is largely based off previous open code? I will have to go check the license on lcc.

johnjakson_usa_com
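[To illustrate what a "very quick-and-dirty codegen ... with none of the usual optimizations" might look like: a minimal C sketch that walks an expression tree and spends a fresh virtual register per node. The node types, mnemonics, and output format are invented and bear no relation to the actual lcc internals:]

    #include <stdio.h>

    typedef enum { N_NUM, N_VAR, N_ADD, N_MUL } NodeKind;

    typedef struct Node {
        NodeKind kind;
        int value;              /* N_NUM */
        const char *name;       /* N_VAR */
        struct Node *lhs, *rhs; /* N_ADD / N_MUL */
    } Node;

    static int next_reg = 0;

    /* Emit naive three-address code, no CSE, no register allocation;
       returns the virtual register holding the result. */
    static int gen(Node *n) {
        int r = next_reg++;
        switch (n->kind) {
        case N_NUM: printf("  ldi  r%d, %d\n", r, n->value); break;
        case N_VAR: printf("  ld   r%d, [%s]\n", r, n->name); break;
        case N_ADD: { int a = gen(n->lhs), b = gen(n->rhs);
                      printf("  add  r%d, r%d, r%d\n", r, a, b); break; }
        case N_MUL: { int a = gen(n->lhs), b = gen(n->rhs);
                      printf("  mul  r%d, r%d, r%d\n", r, a, b); break; }
        }
        return r;
    }

    int main(void) {
        /* a + b*4 */
        Node four = { N_NUM, 4, 0, 0, 0 };
        Node b    = { N_VAR, 0, "b", 0, 0 };
        Node a    = { N_VAR, 0, "a", 0, 0 };
        Node mul  = { N_MUL, 0, 0, &b, &four };
        Node add  = { N_ADD, 0, 0, &a, &mul };
        gen(&add);
        return 0;
    }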
Reply by Jon Beniston February 24, 2004
> Cache architecture is currently 1 way set associative, but more
> Blockrams would allow more ways.
Do you not think that the number of ways has to be at least as great as the number of threads? I would expect a significant amount of conflict misses (particularly in the I-Cache) if this is not the case. Hit-under-miss is a must. Otherwise all those impressive Mega-Hurtz will just be thrown away stalling for cache refills.

Cheers,
JonB
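[To illustrate the conflict-miss concern, a small C sketch of how a direct-mapped (1 way) cache index is formed; the cache geometry and addresses are made-up values chosen so that two threads' fetch addresses collide on the same set:]

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES  32
    #define NUM_SETS   256                 /* 8KB direct-mapped cache */

    static unsigned set_index(uint32_t addr) {
        return (addr / LINE_BYTES) % NUM_SETS;
    }

    int main(void) {
        /* Two threads whose code happens to sit 8KB apart map to the
           same set, so each I-fetch evicts the other thread's line. */
        uint32_t t0_pc = 0x00010000;
        uint32_t t1_pc = 0x00012000;       /* +8KB: same index, new tag */
        printf("t0 set=%u  t1 set=%u\n", set_index(t0_pc), set_index(t1_pc));
        /* With at least as many ways as threads, both lines coexist. */
        return 0;
    }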
Reply by Jim Granville February 23, 2004
"Hal Murray wrote:
>> I suspect that the difficulty for just about any home grown
>> processor is going to be the tools to compile the code for it,
>> although folks who are more savvy than I on the software side might
>> argue that the high speed hardware design is the hard part.
>
> How much code are you writing? Would you be willing/happy to do it in
> assembler?
>
> Assemblers can be pretty simple, especially if the target is raw
> binary loaded at 0 rather than something needing linkers and
> libraries. Also helps if the target is RISC and doesn't have messy
> addressing modes.
>
> How much would a reasonably clean sample assembler help? There should
> be a good example from the academic world. Just type in the new
> opcode table.
"AS" from Alfred Arnold is a good wide-cores assembler, with a choice of Pascal or C sources : http://john.ccac.rwth-aachen.de:8000/as/download.html And HLA (High Level Assembler) is currently x86 only, but the front end, and approach is much closer to higher level languages (but minus the bloat). V2 will allow different back ends, for opcode outputs. Worth watching. http://webster.cs.ucr.edu/AsmTools/HLA/index.html This is able to support quite large code efforts, and remain close to the iron.. A benefit of working from the 'best assembler' end, is the ease of support multiple/tiny core instances - which is one of the advantages of such soft cores. -jg
Reply by Ray Andraka February 23, 2004
The low complexity is why I chose the architecture I did. Unfortunately, I did that design in schematics, before
I started using VHDL, so resurrecting it at this point involves more time than I can devote to it.

john jakson wrote:

> The HW part is more fun though. The 1802 takes me back, not bad in a
> twisted sort of way, it certainly used very little logic, I had it
> under a scope at Inmos.
>
> Regards
>
> johnjakson_usa_com
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Reply by Hal Murray February 23, 2004
> I suspect that the difficulty for just about any home grown processor
> is going to be the tools to compile the code for it, although folks
> who are more savvy than I on the software side might argue that the
> high speed hardware design is the hard part.
How much code are you writing? Would you be willing/happy to do it in assembler?

Assemblers can be pretty simple, especially if the target is raw binary loaded at 0 rather than something needing linkers and libraries. Also helps if the target is RISC and doesn't have messy addressing modes.

How much would a reasonably clean sample assembler help? There should be a good example from the academic world. Just type in the new opcode table.

--
The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.
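[In the spirit of "just type in the new opcode table", a toy table-driven assembler in C. The ISA, mnemonics, and 16-bit encoding are all invented; a real assembler would add a second pass for labels:]

    #include <stdio.h>
    #include <string.h>

    typedef struct { const char *mnemonic; unsigned opcode; } OpEntry;

    /* "Just type in the new opcode table": */
    static const OpEntry optab[] = {
        { "add", 0x0 }, { "sub", 0x1 }, { "and", 0x2 }, { "or", 0x3 },
    };

    static int assemble_line(const char *line, unsigned short *word) {
        char mnem[8]; unsigned rd, rs;
        if (sscanf(line, "%7s r%u, r%u", mnem, &rd, &rs) != 3) return -1;
        for (size_t i = 0; i < sizeof optab / sizeof optab[0]; i++)
            if (strcmp(mnem, optab[i].mnemonic) == 0) {
                /* invented 16-bit format: [15:12]=op [11:8]=rd [7:4]=rs */
                *word = (optab[i].opcode << 12) | (rd << 8) | (rs << 4);
                return 0;
            }
        return -1;
    }

    int main(void) {
        unsigned short w;
        if (assemble_line("add r1, r2", &w) == 0)
            printf("0x%04x\n", w);   /* prints 0x0120 */
        return 0;
    }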
Reply by john jakson February 23, 2004
Ray Andraka <ray@andraka.com> wrote in message news:<403A43E8.6338D0C1@andraka.com>...
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much
> the hardware as it is the compiler for it. I did a small
> microcontroller for a XC4036E design several years back that ran at
> 66 MHz. It was a pretty simple machine that was sort of a cross
> between a PIC and an RCA1802 in that it used a 16 deep register file
> like the 1802, and it was a Harvard architecture like the PIC. Like
> the 1802, the operands for the ALU were fetched from the register
> file and results returned to the register file. The beauty of it was
> that for control applications, you often did not even need any
> memory beyond the register file. The processor size was about 80
> CLBs (translates to 80 slices in current architectures). I'm not a
> compiler person, so the big difficulty I had with it was the
> compiler.
>
> I suspect that the difficulty for just about any home grown
> processor is going to be the tools to compile the code for it,
> although folks who are more savvy than I on the software side might
> argue that the high speed hardware design is the hard part.
Hi Ray

Half agreed. As Jan has shown, any std risc cpu project can grab lcc to do the task quite quickly by messing with the emit tables. If this were just another std risc project I'd probably do the same, but then it wouldn't be anywhere near 300MHz either, more like MicroBlaze. Only hyperthreading allows max speed, but if the processes don't communicate with each other then lcc could still be used as is, ignoring the HT stuff.

Some of my background is in compilers and other tools, but I never worked for anybody doing that. The lcc compiler (Hanson & Fraser) is possibly the best documented C compiler writing textbook around and highly recommended, as it explains thoroughly just how horrible C really is where most C books gloss over its complexity. The complexity for me comes because I am combining essentially 3 langs together and putting in a mini OS runtime. The Transputer did it before, but chose an unfriendly syntax and supported C only as an afterthought. I will probably get through it ok. I would love to pass that part on, but then that person would be knee deep in it instead. The HW part is more fun though.

The 1802 takes me back, not bad in a twisted sort of way, it certainly used very little logic, I had it under a scope at Inmos.

Regards

johnjakson_usa_com
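[This is not lcc's actual backend interface, but a tiny C sketch of the general idea of retargeting "by messing with the emit tables": each IR opcode maps to an output template, so a port mostly means rewriting the table for the new instruction set:]

    #include <stdio.h>

    enum IrOp { IR_ADD, IR_SUB, IR_LOAD, IR_STORE, IR_NOPS };

    /* %d = dest reg, %a/%b = source regs (template language invented) */
    static const char *emit_tab[IR_NOPS] = {
        [IR_ADD]   = "add  %d, %a, %b",
        [IR_SUB]   = "sub  %d, %a, %b",
        [IR_LOAD]  = "ld   %d, [%a]",
        [IR_STORE] = "st   %b, [%a]",
    };

    static void emit(enum IrOp op, int d, int a, int b) {
        for (const char *p = emit_tab[op]; *p; p++) {
            if (*p == '%') {
                switch (*++p) {
                case 'd': printf("r%d", d); break;
                case 'a': printf("r%d", a); break;
                case 'b': printf("r%d", b); break;
                }
            } else putchar(*p);
        }
        putchar('\n');
    }

    int main(void) {
        emit(IR_ADD, 3, 1, 2);   /* add  r3, r1, r2 */
        return 0;
    }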
Reply by Jim Granville February 23, 2004
Ray Andraka wrote:
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much
> the hardware as it is the compiler for it. I did a small
> microcontroller for a XC4036E design several years back that ran at
> 66 MHz. It was a pretty simple machine that was sort of a cross
> between a PIC and an RCA1802 in that it used a 16 deep register file
> like the 1802, and it was a Harvard architecture like the PIC. Like
> the 1802, the operands for the ALU were fetched from the register
> file and results returned to the register file. The beauty of it was
> that for control applications, you often did not even need any
> memory beyond the register file. The processor size was about 80
> CLBs (translates to 80 slices in current architectures). I'm not a
> compiler person, so the big difficulty I had with it was the
> compiler.
>
> I suspect that the difficulty for just about any home grown
> processor is going to be the tools to compile the code for it,
> although folks who are more savvy than I on the software side might
> argue that the high speed hardware design is the hard part.
This is right, and John admits this in another reply.

You should also add DEBUG support, as that becomes more important as the CPU targets bigger applications. Once you have a compiler, users will want to do more and more, and then debug becomes very important.

It depends a lot on the target use. Something that runs from a Block RAM inside the FPGA can be very small/very fast, but is probably best coded in some form of Assembler.

Best example of 'Advanced Assembler Art' is Randy Hyde's HLA (High Level Assembler), but that currently targets only x86 - tho I'm sure that's not hard to fix :)

This HLA allows IF..THEN..ELSIF etc, and handles the labels needed, as well as giving local scope (so is a big step up from vanilla ASM).

-jg
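[As a sketch of what such structured-IF support buys (not HLA's real implementation), a few lines of C showing an assembler front end inventing the labels and branches for an IF/ELSE; the output mnemonics are made up:]

    #include <stdio.h>

    static int label_id = 0;

    /* Expand: if (cond) { then } else { els } */
    static void emit_if(const char *bfalse, const char *then_code,
                        const char *else_code) {
        int els = label_id++, end = label_id++;
        printf("  %s L%d\n", bfalse, els);  /* branch-if-false to else */
        printf("%s", then_code);
        printf("  jmp L%d\n", end);
        printf("L%d:\n", els);
        printf("%s", else_code);
        printf("L%d:\n", end);
    }

    int main(void) {
        emit_if("bz",                       /* branch if zero flag */
                "  mov r1, 1\n",
                "  mov r1, 0\n");
        return 0;
    }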
Reply by Ray Andraka February 23, 2004
John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
it.  I did a small microcontroller for a XC4036E design several years back that ran at 66 MHz.  It was a
pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
register file like the 1802, and it was a Harvard architecture like the PIC.   Like the 1802, the operands
for the ALU were fetched from the register file and results returned to the register file.  The beauty of it
was that for control applications, you often did not even need any memory beyond the register file.  The
processor size was about 80 CLBs (translates to 80 slices in current architectures).   I'm not a compiler
person, so the big difficulty I had with it was the compiler.
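
[A minimal C model of a machine along those lines: 16 deep register file as the only data store, separate Harvard-style program memory, ALU operands fetched from and results returned to the register file. The 16-bit encoding is invented, not the actual design described above:]

    #include <stdio.h>
    #include <stdint.h>

    enum { OP_LDI, OP_ADD, OP_SUB, OP_OUT };

    int main(void) {
        uint8_t reg[16] = {0};              /* register file = all the RAM */
        const uint16_t prog[] = {           /* separate instruction store */
            /* invented format: [15:12]=op [11:8]=rd [7:4]=rs [3:0]=imm */
            (OP_LDI << 12) | (0 << 8) | 5,          /* r0 = 5   */
            (OP_LDI << 12) | (1 << 8) | 7,          /* r1 = 7   */
            (OP_ADD << 12) | (0 << 8) | (1 << 4),   /* r0 += r1 */
            (OP_OUT << 12) | (0 << 8),
        };

        for (unsigned pc = 0; pc < sizeof prog / sizeof prog[0]; pc++) {
            uint16_t w = prog[pc];
            unsigned op = w >> 12, rd = (w >> 8) & 15, rs = (w >> 4) & 15;
            switch (op) {
            case OP_LDI: reg[rd] = w & 15; break;
            case OP_ADD: reg[rd] += reg[rs]; break;
            case OP_SUB: reg[rd] -= reg[rs]; break;
            case OP_OUT: printf("r%u = %u\n", rd, reg[rd]); break;
            }
        }
        return 0;
    }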

I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more savvy than I on the software side might argue that the high speed
hardware design is the hard part.


john jakson wrote:

> jon@beniston.com (Jon Beniston) wrote in message
> news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
> > > I am even tempted to max the datapath to 64b as it only
> > > adds 3-4 pipestages and not much to the control.
> >
> > Sure, but the more pipeline stages you add, the longer the latency
> > is for each instruction. How many cycles latency will there be for
> > a single add instruction? Do you intend to make sure that the
> > number of threads is equal to this latency, so that the latency as
> > perceived by the thread executing the instruction is 0?
> >
> > What's your cache / memory architecture? Handling lots of threads
> > could be tricky.
> >
> > Cheers,
> > JonB
>
> I just posted a very long reply but the server just xxxxed it so I
> will write it again later offline.
>
> Quick answer yes, HT must match 4 or 8 etc. Cache architecture is
> currently 1 way set associative, but more Blockrams would allow more
> ways. Question of whether the FPGA should hold lots of lite cpus or 1
> monster cpu or maybe combinations of both!
>
> Regards
>
> johnjakson_usa_com
>
> 508 4800777 EST after 8pm
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759