
Altera Cyclone replacement

Started by Stef January 25, 2019
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language. >
Can you quantify the criticality of your real-time requirements? Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the # of cycles is known up front? Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to the # of cycles per instruction.
> > > > Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.
1 cycle per instruction, not pipelined, means that the stack can not be implemented in memory block(s). Which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.
Also, 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories. And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like a top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and a full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grade budget parts, like the slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that the Lattice Mach series is somewhat slower than even those.
The only way that I can see non-pipelined conditional branches working at 100 MHz in low end devices is if your architecture has a branch delay slot. But that by itself is a sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use the OTS Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.
On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
> On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote: > > > > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language. > > > > Can you quantify criticality of your real-time requirements?
Eh? You are asking my requirement or asking how important it is? Not sure how to answer that question. I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.
> Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front?
It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors. Otherwise multi-cycle instructions complicate the CPU instruction decoder. Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.
> Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.
Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds. My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay, oh well) so no branch prediction needed.
> > > > Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined. > > 1 cycle per instruction not pipelined means that stack can not be implemented > in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.
Huh? So my block RAM stack is pipelined, or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes the return address to the return stack and the PSW to the data stack in one cycle with no latency, so, like the other instructions, it is single cycle, again making using it much like designing with registers in the HDL code.
> Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
Or both. To get single-cycle access to the block RAMs, the read and write happen on different phases of the main clock. I think the read is on the falling edge while the write is on the rising edge, like the rest of the logic. Instructions and data are in physically separate memories within the same address map, but there is no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?
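A minimal VHDL sketch of the kind of scheme being described, with made-up names and widths (not Rick's actual code): the stack lives in a block RAM whose write port is clocked on the rising edge and whose read port on the falling edge, so a value written by one instruction can be read back in the next cycle without a wait state.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity stack_ram is
  generic (
    WIDTH : natural := 16        -- illustrative data width
  );
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  unsigned(4 downto 0);   -- 32-entry stack, depth is illustrative
    wdata : in  std_logic_vector(WIDTH-1 downto 0);
    raddr : in  unsigned(4 downto 0);
    rdata : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity;

architecture rtl of stack_ram is
  type ram_t is array (0 to 31) of std_logic_vector(WIDTH-1 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- write on the rising edge, together with the rest of the CPU logic
  process(clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(waddr)) <= wdata;
      end if;
    end if;
  end process;

  -- read on the falling edge, half a cycle later, so the value is
  -- available at the next rising edge without a separate wait state
  process(clk)
  begin
    if falling_edge(clk) then
      rdata <= ram(to_integer(raddr));
    end if;
  end process;
end architecture;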
> And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard.
Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality.
> Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those.
I only use the low grade parts. I haven't used NIOS, and this processor won't get to 380 MHz, I'm pretty sure. Pipelining it would be counter to its design goals but might be practical; I never thought about it.
> The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
Or the instruction is simple and runs fast.
> Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.
That's where my CPU lies, I think it was 600 LUT4s last time I checked. Rick C.
On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
> On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote: > > On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote: > > > > > > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language. > > > > > > > Can you quantify criticality of your real-time requirements? > > Eh? You are asking my requirement or asking how important it is?
How important they are. What happens if particular instruction most of the time takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?
> Not sure how to answer that question. I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL. > > > > Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front? > > It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors.
I don't like interrupts in small systems. Neither in MCUs nor in FPGAs. In MCUs nowadays we have bad ass DMAs. In FPGAs we can build a bad ass DMA ourselves. Or throw multiple soft cores at multiple tasks. That's why I am interested in *small* soft cores in the first place.
> Otherwise multi-cycle instructions complicate the CPU instruction decoder.
I see no connection to the decoder. Maybe you mean the microsequencer? Generally, I disagree. At least for very fast clock rates it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases. Or maybe you are thinking about variable-length instructions? That, again, is orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4 budget. I would start to consider VLI for something like 1200 LUT4s.
> Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly. > > > > Things like caches and branch predictors indeed cause variability (witch by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction. > > Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds.
I don't *want* data caches in the sort of tasks that I do with these small cores. An instruction cache is something else. I am not against them in "hard" MCUs. In the small soft cores that we are discussing right now they are impractical rather than evil. But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I haven't implemented it in my half-dozen cores (in the meantime the number is growing). But it is practical, esp. for applications that spend most of the time in very short loops.
> My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay,
I am starting to suspect that you have a very special definition of "not pipelined" that differs from the definition used in the literature.
> oh well) so no branch prediction needed. > > > > > > > Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined. > > > > 1 cycle per instruction not pipelined means that stack can not be implemented > > in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job. > > Huh? So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like > > ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes return address to return stack and PSW to data stack in one cycle with no latency so, like the other instructions is single cycle, again making using it like designing with registers in the HDL code. > > > > Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories. > > Or both. To get the block RAMs single cycle the read and write happen on different phases of the main clock. I think read is on falling edge while write is on rising edge like the rest of the logic. Instructions and data are in physically separate memory within the same address map, but no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU? >
It is less of a problem when you are in full control of the software stack. When you are not in full control, compilers sometimes like to place data, esp. jump tables for implementing the HLL switch/case construct, in program memory. Still, even with full control of the code generation tools, sometimes you want an architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash. Another, less common possible reason is saving space by placing code and data in the same memory block. Esp. when blocks are relatively big and there are few of them.
> > > And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard. > > Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality. >
It is certainly easier when branching is based on arithmetic flags rather than on the content of a register, as is the case in MIPS derivatives, including Nios2 and RISC-V. But still hard. You have to wait for the instruction to arrive from memory, decode the instruction, do logical operations on flags and select between two alternatives based on the result of the logical operation, all in one cycle. If the branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle. But even if it's somehow doable for PC-relative branches, I don't see how, assuming that the stack is stored in block memory, it is doable for *indirect* jumps. I'd guess you are somehow cutting corners here, most probably by requiring the address of an indirect jump to be in the top-of-stack register that is not in block memory.
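For concreteness, a minimal VHDL sketch of the next-address path under discussion, assuming a flag-based conditional branch and an indirect jump that takes its target from a top-of-return-stack register held in flip-flops; opcodes, names and widths are illustrative, not taken from either poster's core.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity next_pc is
  port (
    clk       : in  std_logic;
    instr     : in  std_logic_vector(15 downto 0);  -- registered instruction
    zero_flag : in  std_logic;
    rtos      : in  unsigned(11 downto 0);          -- top of return stack, kept in registers
    pc        : out unsigned(11 downto 0)
  );
end entity;

architecture rtl of next_pc is
  signal pc_r     : unsigned(11 downto 0) := (others => '0');
  signal pc_next  : unsigned(11 downto 0);
  alias  opcode   : std_logic_vector(3 downto 0) is instr(15 downto 12);
  -- hypothetical encodings
  constant OP_BRZ  : std_logic_vector(3 downto 0) := "0001"; -- branch if zero, PC-relative
  constant OP_JMPI : std_logic_vector(3 downto 0) := "0010"; -- jump indirect via rtos
begin
  -- One combinational cone: decode the registered instruction, test the
  -- flag, add the PC-relative offset, and select the next address.
  -- This is the path that has to close timing in a single cycle.
  process(pc_r, instr, zero_flag, rtos)
  begin
    if opcode = OP_BRZ and zero_flag = '1' then
      -- two's-complement offset wraps correctly at 12 bits
      pc_next <= pc_r + unsigned(instr(11 downto 0));
    elsif opcode = OP_JMPI then
      pc_next <= rtos;            -- no block-RAM read in this path
    else
      pc_next <= pc_r + 1;
    end if;
  end process;

  process(clk)
  begin
    if rising_edge(clk) then
      pc_r <= pc_next;
    end if;
  end process;

  pc <= pc_r;
end architecture;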
> > > Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those. > > I only use the low grade parts. I haven't used NIOS
Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in the early 00s Altera had a completely different architecture that was called Nios.
> and this processor won't get to 380 MHz I'm pretty sure. Pipelining it would be counter it's design goals but might be practical, never thought about it. > > > > The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW. > > Or the instruction is simple and runs fast. >
I don't doubt that you did it, but answers like that smell of hand-waving.
> > > Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about. > > That's where my CPU lies, I think it was 600 LUT4s last time I checked. >
Does it include single-cycle 32-bit shift/rotate by an arbitrary 5-bit count (5 variations: logical and arithmetic right shift, logical left shift, rotate right, rotate left)? Does it include zero-extended and sign-extended byte and half-word loads (fetches, in your language)? In my cores these two functions combined are the biggest block, bigger than the 32-bit ALU, and comparable in size with the result writeback mux. Also, I assume that your cores have no multiplier, right?
> Rick C.
On Thursday, February 14, 2019 at 8:38:47 AM UTC-5, already...@yahoo.com wrote:
> On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
> > > On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > > > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.
> > > Can you quantify criticality of your real-time requirements?
> > Eh? You are asking my requirement or asking how important it is?
> How important they are. What happens if particular instruction most of the time takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?
Of course, that depends on the application. In some cases it would simply not work correctly, because it was designed into the rest of the logic, not entirely unlike an FSM. In other cases it would make the timing indeterminate, which would make it harder to design the logic surrounding this piece.
> > Not sure how to answer that question. I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.
> > > Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front?
> > It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors.
> I don't like interrupts in small systems. Neither in MCUs nor in FPGAs. In MCUs nowadays we have bad ass DMAs. In FPGAs we can build bad ass DMA ourselves. Or throw multiple soft cores on multiple tasks. That's why I am interested in *small* soft cores in the first place.
Yup, interrupts can be very bad. But if your requirements are to do one thing in software that has real time requirements (such as servicing an ADC/DAC or fast UART) while the rest of the code is managing functions with much more relaxed real time requirements, using an interrupt can eliminate a CPU core or the design of a custom DMA with particular features that are easy in software.
There are things that are easy to do in hardware and things that are easy to do in software, with some overlap. Using a single CPU and many interrupts fits into the domain of not so easy to do. That doesn't make simple use of interrupts a bad thing.
> > Otherwise multi-cycle instructions complicate the CPU instruction decoder.
> I see no connection to decoder. May be, you mean microsequencer?
Decoder has outputs y(i) = f(x(j)) where x(j) is all the inputs and y(i) is all the outputs and f() is the function mapping inputs to outputs. If you have multiple states for instructions the decoding function has more inputs than if you only decode instructions and whatever state flags might be used such as carry or zero or interrupt input.
In general this will result in more complex instruction decoding.
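A hedged VHDL sketch of that point, with made-up opcodes: with single-cycle instructions the control outputs are a function of the opcode alone (plus a few flags); once an instruction can take several cycles, a cycle counter becomes one more input to the same decode function.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity decode is
  port (
    opcode   : in  std_logic_vector(3 downto 0);
    cycle    : in  unsigned(1 downto 0);   -- extra input only needed once instructions are multi-cycle
    alu_op   : out std_logic_vector(2 downto 0);
    mem_rd   : out std_logic;
    stk_push : out std_logic
  );
end entity;

architecture rtl of decode is
begin
  process(opcode, cycle)
  begin
    -- defaults
    alu_op   <= "000";
    mem_rd   <= '0';
    stk_push <= '0';

    case opcode is
      when "0000" =>                      -- ADD: single cycle, decoded from the opcode alone
        alu_op   <= "001";
        stk_push <= '1';
      when "0001" =>                      -- hypothetical two-cycle FETCH
        case cycle is                     -- decode now also depends on the cycle counter
          when "00"   => mem_rd   <= '1'; -- cycle 0: issue the read
          when others => stk_push <= '1'; -- cycle 1: push the returned data
        end case;
      when others =>
        null;
    end case;
  end process;
end architecture;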
> Generally, I disagree. At least for very fast clock rates it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases.
If by "easier" you mean possible, then yes. That's why they use pipelining, to achieve clock speeds that otherwise can't be met. But it is seldom simple, since pipelining is more than just adding registers. Instructions interact, and on branches the pipeline has to be flushed, etc.
> Or maybe you are thinking about variable-length instructions? That, again, is orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4 budget. I would start to consider VLI for something like 1200 LUT4s.
Nope, just talking about using multiple clock cycles for instructions. Using a variable number of clock cycles would be more complex in general, and multiple length instructions even worse... in general. There are always possibilities to simplify some aspect of this by complicating some aspect of that.
> > Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.
> > > Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.
> > Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds.
> I don't *want* data caches in the sort of tasks that I do with these small cores. Instruction cache is something else. I am not against them in "hard" MCUs.
> In small soft cores that we are discussing right now they are impractical rather than evil.
Or unneeded. If the program fits in the on-chip memory, no cache is needed. What sort of programming are you doing in <1 kLUT CPUs that would require slow off-chip program storage?
> But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I haven't implemented it in my half-dozen cores (in the meantime the number is growing). But it is practical, esp. for applications that spend most of the time in very short loops.
If the jump instruction is one clock cycle and there is no pipeline, jump prediction is not possible, I think.
> > My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay,
> I am starting to suspect that you have very special definition of "not pipelined" that differs from definition used in literature.
Ok, not sure what that means. Every instruction takes one clock cycle. While a given instruction is being executed the next instruction is being fetched, but the *actual* next instruction, not the "possible" next instruction. All branches happen during the branch instruction execution, which fetches the correct next instruction.
This guy said I was pipelining the fetch and execute... I see no purpose in calling that pipelining since it carries no baggage of any sort.
> > oh well) so no branch prediction needed.
> > > > > > Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.
> > > 1 cycle per instruction not pipelined means that stack can not be implemented in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.
> > Huh? So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like
> > ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes return address to return stack and PSW to data stack in one cycle with no latency so, like the other instructions is single cycle, again making using it like designing with registers in the HDL code.
> > > Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
> > Or both. To get the block RAMs single cycle the read and write happen on different phases of the main clock. I think read is on falling edge while write is on rising edge like the rest of the logic. Instructions and data are in physically separate memory within the same address map, but no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?
> Less of the problem when you are in full control of software stack. When you are not in full control, sometimes compilers like to place data, esp. jump tables for implementing HLL switch/case construct, in program memory.
> Still, even with full control of the code generation tools, sometimes you want architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash.
> Another, less common possible reason is saving space by placing code and data in the same memory block. Esp. when blocks are relatively big and there are few of them.
There is nothing to prevent loading code into program memory. It's all one address space and can be written to by machine code. So I guess it's not really Harvard, it's just physically separate memory. Since instructions are not a word wide, I think the program memory does not implement a full word width... to be honest, I don't recall. I haven't used this CPU in years. I've been programming in Forth on PCs more recently.
Another stack processor is the J1 which is used in a number of applications and even had a TCP/IP stack implemented in about 8 kW (kB?) (kinstructions?). You can find info on it with a google search. It is every bit as small as mine and a lot better documented and programmed in Forth while mine is programmed in assembly which is similar to Forth.
> > > And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard.
> > Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality.
> It is certainly easier when branching is based on arithmetic flags rather than on the content of register, like a case in MIPS derivatives, including Nios2 and RISC-V. But still hard. You have to wait for instruction to arrive from memory, decode an instruction, do logical operations on flags and select between two alternatives based on result of logical operation, all in one cycle.
> If branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.
I guess this is where I disagree on the pipelining aspect of my design. I register the current instruction so the memory fetch is in the previous cycle based on that instruction. So my delay path starts with the instruction, not the instruction pointer. The instruction decode for each section of the CPU is in parallel of course. The three sections of the CPU are the instruction fetch, the data path and the address path. The data path and address path roughly correspond to the data and return stacks in Forth. In my CPU they can operate separately and the return stack can perform simple math like increment/decrement/test since it handles addressing memory. In Forth everything is done on the data stack other than holding the return addresses, managing DO loop counts and user specific operations.
My CPU has both PC relative addressing and absolute addressing. One way I optimize for speed is by careful management of the low level implementation. For example I use an adder as a multiplexor when it's not adding. A+0 is A, 0+B is B, A+B is, well, A+B.
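A small VHDL sketch of the adder-as-multiplexer trick just mentioned (illustrative rendering, not the actual core): gating either operand to zero lets one adder double as a 2:1 mux when it is not adding.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_mux is
  port (
    a, b   : in  unsigned(15 downto 0);
    pass_a : in  std_logic;   -- '1': result = a   (b gated to 0)
    pass_b : in  std_logic;   -- '1': result = b   (a gated to 0)
                              -- both '0': result = a + b
    result : out unsigned(15 downto 0)
  );
end entity;

architecture rtl of add_mux is
  signal a_g, b_g : unsigned(15 downto 0);
begin
  a_g <= (others => '0') when pass_b = '1' else a;
  b_g <= (others => '0') when pass_a = '1' else b;
  result <= a_g + b_g;   -- A+0 = A, 0+B = B, A+B = A+B
end architecture;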
> But even if it's somehow doable for PC-relative branches, I don't see how, assuming that stack is stored in block memory, it is doable for *indirect* jumps. I'd guess, you are somehow cutting corners here, most probably by requiring the address of indirect jump to be in the top-of-stack register that is not in block memory.
Indirect addressing??? Indirect addressing requires multiple instructions, yes. The return stack is used for address calculations typically and that stack is fed directly into the instruction fetch logic... it is the "return" stack (or address unit, your choice) after all.
> > > Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those.
> > I only use the low grade parts. I haven't used NIOS
> Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in early 00s Altera had completely different architecture that was called Nios.
I haven't used those processors either.
> > and this processor won't get to 380 MHz I'm pretty sure. Pipelining it would be counter it's design goals but might be practical, never thought about it.
> > > The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
> > Or the instruction is simple and runs fast.
> I don't doubt that you did it, but answers like that smell hand-waving.
Ok, whatever that means.
> > > Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.
> > That's where my CPU lies, I think it was 600 LUT4s last time I checked.
> Does it include single-cycle 32-bit shift/rotate by arbitrary 5-bit count (5 variations, logical and arithmetic right shift, logical left shift, rotate right, rotate left)?
There are shift instructions. It does not have a barrel shifter if that is what you are asking. A barrel shifter is not really a CPU. It is a CPU feature and is large and slow. Why slow down the rest of the CPU with a very slow feature? That is the sort of thing that should be external hardware.
When they design CPU chips, they have already made compromises that require larger, slower logic which require pipelining. The barrel shifter is perfect for pipelining, so it fits right in.
> Does it include zero-extended and sign-extended byte and half-word loads (fetches, in you language)?
I don't recall, but I'll say no. I do recall some form of sign extension, but I may be thinking of setting the top of stack by the flags. Forth has words that treat the word on the top of stack as a word, so the mapping is better if this is implemented. I'm not convinced this is really better than using the flags directly in the asm, but for now I'm compromising. I'm not really a compiler writer, so...
> In my cores these two functions combined are the biggest block, bigger than 32-bit ALU, and comparable in size with result writeback mux.
Sure, the barrel shifter is O(n^2) like a multiplier. That's why in small CPUs it is often done in loops. Since loops can be made efficient with the right instructions, that's a good way to go. If you really need the optimum speed for barrel shifting, then I guess a large block of logic and pipelining is the way to go.
I needed to implement multiplications, but they are on 24 bit words that are being shifted into and out of a CODEC bit serially. I found a software shift and add to work perfectly well, no need for special hardware.
Bowman was using his J1 for video work (don't recall the details) but the MicroBlaze was too slow and used too much memory. The J1 did the same functions faster and in less code with generic instructions, nothing unique to the application if I remember correctly... not that the MicroBlaze is the gold standard.
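A hedged sketch of the shift-and-add idea referenced above, written here as a small multi-cycle VHDL block (the post does the same thing in software, and a hardware version is mentioned further on); widths and names are illustrative.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul_shift_add is
  port (
    clk   : in  std_logic;
    start : in  std_logic;
    a, b  : in  unsigned(23 downto 0);   -- 24-bit operands, as in the CODEC case
    p     : out unsigned(47 downto 0);
    done  : out std_logic
  );
end entity;

architecture rtl of mul_shift_add is
  signal acc   : unsigned(47 downto 0) := (others => '0');
  signal mcand : unsigned(47 downto 0) := (others => '0');
  signal mplr  : unsigned(23 downto 0) := (others => '0');
  signal count : integer range 0 to 24 := 0;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if start = '1' then
        acc   <= (others => '0');
        mcand <= resize(a, 48);
        mplr  <= b;
        count <= 24;
      elsif count > 0 then
        -- one partial product per clock: add when the low multiplier bit
        -- is set, then shift both operands
        if mplr(0) = '1' then
          acc <= acc + mcand;
        end if;
        mcand <= shift_left(mcand, 1);
        mplr  <= shift_right(mplr, 1);
        count <= count - 1;
        if count = 1 then
          done <= '1';            -- product is valid when done is seen high
        end if;
      end if;
    end if;
  end process;
  p <= acc;
end architecture;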
> Also, I assume that you cores have no multiplier, right?
By "cores" you mean CPUs? Core actually, remember the interrupt, one CPU, one interrupt. Yes, no hard multiplier as yet. The pure hardware implementation of the CODEC app used shift and add in hardware as well but new features were needed and space was running out in the small FPGA, 3 kLUTs. The slower, simpler stuff could be ported to software easily for an overall reduction in LUT4 usage along with the new features.
I don't typically try to compete with the functionality of ARMs with my CPU designs. To me they are FPGA logic adjuncts. So I try to make them as simple as the other logic.
I wrote some code for a DDS in software once as a benchmark for CPU instruction set designs. The shortest and fastest I came up with was a hybrid between a stack CPU and a register CPU where objects near the top of stack could be addressed rather than having to always move things around to put the nouns where the verbs could reach them. I have no idea how to program that in anything other than assembly which would be ok with me. I used an excel spread sheet to analyze the 50 to 90 instructions in this routine. It would be interesting to write an assembler that would produce the same outputs.
Rick C.
32 bit RISC mcus with 32 registers... do you have any actual devices in 
mind?

Hul

already5chosen@yahoo.com wrote:
> On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote: > > > > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language. > >
> Can you quantify criticality of your real-time requirements?
> Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front? > Things like caches and branch predictors indeed cause variability (witch by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.
> > > > > > Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.
> 1 cycle per instruction not pipelined means that stack can not be implemented > in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.
> Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
> And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those. > The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
> Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.
On Thursday, February 14, 2019 at 11:15:58 PM UTC+2, Hul Tytus wrote:
> 32 bit RISC mcus with 32 registers... do you have any actual devices in > mind? > > Hul >
First, I don't like answering top-posters. Next time I won't answer. The discussion was primarily about soft cores. The two most popular soft cores, Nios2 and MicroBlaze, are 32-bit RISCs with 32 registers. In "hard" MCUs there are MIPS-based products from Microchip. More recently a few RISC-V MCUs have appeared. Probably more are going to follow. In the past there were popular PPC-based MCU devices from various vendors. They are less popular today, but still exist. Freescale (now NXP) e200 core variants are designed specifically for MCU applications. https://www.nxp.com/products/product-selector:PRODUCT-SELECTOR#/category/c731_c381_c248 So, not the whole 32-bit MCU world is ARM Cortex-M. Just most of it ;-)
On 2019-02-14 at 11:07, already5chosen@yahoo.com wrote:
> On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote: >> >> Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language. >> > > Can you quantify criticality of your real-time requirements? > > Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front? > Things like caches and branch predictors indeed cause variability (witch by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction. > >> >>>> Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined. > > 1 cycle per instruction not pipelined means that stack can not be implemented > in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job. > > Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories. > > And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those. > The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW. > > Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about. >
I think the best way to get exact performance is to implement a multithreaded architecture. This is not the smallest CPU architecture, but the pipeline will run at very high frequency. The multithreaded architecture I have used has a classic three stage pipeline, fetch, decode, execute, so there are three instructions active all the time. The architecture implements ONLY 1 clock cycle in each stage.
Many CPUs implement multicycle functionality by having state machines inside the decode stage. The decode stage can either control the execute stage (the datapath) directly by decoding the instruction in the fetch stage output, or it can control the execute stage from one of several state machines implementing things like interrupt entry, interrupt exit etc. The datapath can easily require 80-120 control signals, so each state machine needs to have the same number of state registers. On top of that you need to multiplex all the state machines together. This is a considerable amount of logic.
I do it a little bit differently. The CPU has an instruction set which is basically 16 bit + immediates. This gives room for 16 registers if you want to have a decent instruction set: 8 bit instruction and 2 x 4 bit register addresses. The instruction decoder supports an extended 22 bit instruction set. This gives room for a 10 bit extended instruction set, and 2 x 6 bit register addresses. The extended register address space is used for two purposes:
1. To address special registers like the PSR
2. To address a constant ROM, for a few useful constants.
The fetch stage can fetch instructions from two places:
1. The instruction queue(2). The instruction queue only supports 16 bit instructions with 16/32 bit immediates.
2. A small ROM which provides 22 bit instructions (with 22 bit immediates)
Whenever something happens which normally would require a multicycle instruction, the thread makes a subroutine jump (0 clock cycle jump) into the ROM, and executes 22 bit instructions. A typical use would be an interrupt. To clear the interrupt flag, you want to clear one bit in the PSR. The instruction ROM contains
ANDC PSR, const22 ; AND constantROM[22] with PSR.
                  ; ConstantROM[22] == 0xFFFFFEFF
                  ; Clear bit 9 (I) of PSR
To implement multithreading, I need a single decoder, but multiple register banks, one per thread. Several special purpose registers per thread (like the PSR) are also needed. I also need multiple instruction queues (one per thread).
To speed up the pipeline, it is important to follow a simple rule: a thread cannot ever execute in a cycle if the instruction depends in any way on the result of the previous instruction. If that rule is followed, you do not need to feed back the result of an ALU operation to the ALU. The simplest way to follow the rule is to never let a thread execute during two adjacent clock cycles. This limits the performance of a thread to max 1/2 of what the CPU is capable of, but at the same time there is less logic in the critical path, so you can increase the clock frequency.
Now you suddenly can run code with exact properties. You can say that I want to execute 524 instructions per millisecond, and that is what the CPU will do. You can let all the interrupts be executed in one thread, so you do not disturb the time critical threads.
The architecture is well suited for FPGA work since you can use standard dual port RAMs for registers.
I use two dual port RAMs to implement the register banks (each has one read port and one write port). The writes are connected together, so you have in effect a register memory with 1 write port and 2 read ports. If the CPU architectural model has, let's say, 16 registers x 32 bits, and you use 2 x (256 x 32) dual port RAMs, you have storage for 16 threads: 2 x (16 CPUs x 16 registers x 32 bits). If you use 512 x 32 bit DPRAMs you have room for 32 threads.
If you want to study a real example, look at the MIPS multithreaded cores: https://www.mips.com/products/architectures/ase/multi-threading/ They decided to build that after I presented my research to their CTO. They had more focus on performance than on real time control, which is a pity. FPGA designers do not have that limitation.
AP
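A VHDL sketch of the register-bank arrangement described above, under assumed names: two simple dual-port RAMs written in parallel behave as one register file with 1 write port and 2 read ports, with the thread number forming the upper address bits (256 x 32 per RAM = 16 threads of 16 registers).

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity regbank_mt is
  port (
    clk     : in  std_logic;
    thread  : in  unsigned(3 downto 0);   -- 16 threads
    we      : in  std_logic;
    wreg    : in  unsigned(3 downto 0);   -- 16 registers per thread
    wdata   : in  std_logic_vector(31 downto 0);
    rreg_a  : in  unsigned(3 downto 0);
    rreg_b  : in  unsigned(3 downto 0);
    rdata_a : out std_logic_vector(31 downto 0);
    rdata_b : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of regbank_mt is
  type ram_t is array (0 to 255) of std_logic_vector(31 downto 0);
  signal ram_a, ram_b : ram_t := (others => (others => '0'));
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- both RAMs receive every write, so they always hold identical copies
      if we = '1' then
        ram_a(to_integer(thread & wreg)) <= wdata;
        ram_b(to_integer(thread & wreg)) <= wdata;
      end if;
      -- each RAM then supplies one independent read port
      rdata_a <= ram_a(to_integer(thread & rreg_a));
      rdata_b <= ram_b(to_integer(thread & rreg_b));
    end if;
  end process;
end architecture;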