On Thursday, February 14, 2019 at 8:38:47 AM UTC-5, already...@yahoo.com wrote:
> On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
> > > On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > > >
> > > > OK, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real-time requirements. While C doesn't prevent that, I prefer to just code in assembly language and, more importantly, use a CPU design that provides single-cycle execution of all instructions. That's why I like stack processors: they are easy to design, use a very simple instruction set, and the assembly language can be very close to the Forth high-level language.
> > > >
> > >
> > > Can you quantify the criticality of your real-time requirements?
> >
> > Eh? Are you asking for my requirements, or asking how important they are?
>
> How important they are. What happens if a particular instruction most of the time takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?
Of course, that depends on the application. In some cases it would simply not work correctly, because it was designed into the rest of the logic, not entirely unlike an FSM. In other cases it would make the timing indeterminate, which would make it harder to design the logic surrounding this piece.
> > Not sure how to answer that question. I can only say that my CPU designs give single-cycle execution, so I can design with them the same way I design the hardware in VHDL.
> >
> >
> > > Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the number of cycles is known up front?
> >
> > It increases interrupt latency, which is not a problem if you aren't using interrupts, a common technique for such embedded processors.
>
> I don't like interrupts in small systems, neither in MCUs nor in FPGAs.
> In MCUs nowadays we have badass DMAs. In FPGAs we can build badass DMAs ourselves, or throw multiple soft cores at multiple tasks. That's why I am interested in *small* soft cores in the first place.
Yup, interrupts can be very bad. But if your requirement is to do one thing in software that has real-time requirements (such as servicing an ADC/DAC or a fast UART) while the rest of the code is managing functions with much more relaxed real-time requirements, using an interrupt can eliminate a CPU core or the design of a custom DMA with particular features that are easy in software.
There are things that are easy to do in hardware and things that are easy to do in software, with some overlap. Using a single CPU and many interrupts falls into the domain of not so easy to do. That doesn't make simple use of interrupts a bad thing.
> > Otherwise, multi-cycle instructions complicate the CPU instruction decoder.
>
> I see no connection to the decoder. Maybe you mean the microsequencer?
The decoder has outputs y(i) = f(x(j)), where x(j) is all the inputs, y(i) is all the outputs, and f() is the function mapping inputs to outputs. If you have multiple states per instruction, the decoding function has more inputs than if you only decode instructions and whatever state flags might be used, such as carry or zero or the interrupt input.
In general this will result in more complex instruction decoding.
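A toy way to see the input-count argument above (my sketch, not anything from the actual design; the bit widths are assumed for illustration): treat decode as a truth table over its input bits, and count how many input combinations the logic must cover once a cycle counter is added.

```python
# Sketch: model the decode function y = f(x) as a truth table over its
# input bits. Feeding a cycle counter for multi-cycle instructions into
# the decoder widens x, so the input space the logic must map grows by
# 2**counter_bits. Widths below are assumptions, not the real CPU's.

def decode_table_size(opcode_bits, flag_bits, cycle_counter_bits=0):
    """Number of input combinations the decode logic must map."""
    return 2 ** (opcode_bits + flag_bits + cycle_counter_bits)

# Single-cycle design: 8-bit opcode plus carry/zero/interrupt flags.
single = decode_table_size(8, 3)        # 2**11 = 2048 cases
# Same design with a 2-bit cycle counter feeding the decoder.
multi = decode_table_size(8, 3, 2)      # 2**13 = 8192 cases

print(single, multi, multi // single)   # 2048 8192 4
```

The point is only that every state bit visible to the decoder multiplies its input space, which is the complication being described.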
> Generally, I disagree. At least for very fast clock rates, it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases.
If by "easier" you mean possible, then yes. That's why they use pipelining: to achieve clock speeds that otherwise can't be met. But it is seldom simple, since pipelining is more than just adding registers. Instructions interact, and on branches the pipeline has to be flushed, etc.
> Or, maybe, you are thinking about variable-length instructions? That's, again, orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4 budget. I would start to consider VLI for something like 1200 LUT4s.
Nope, just talking about using multiple clock cycles for instructions. Using a variable number of clock cycles would be more complex in general, and multiple-length instructions even worse... in general. There are always possibilities to simplify some aspect of this by complicating some aspect of that.
> > Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.
> >
> >
> > > Things like caches and branch predictors indeed cause variability (which by itself is OK for 99.9% of uses), but that's orthogonal to the number of cycles per instruction.
> >
> > Cache, branch predictors??? You have those with 1 kLUT CPUs??? I think we design in very different worlds.
>
> I don't *want* data caches in the sort of tasks that I do with these small cores. Instruction cache is something else. I am not against them in "hard" MCUs.
> In the small soft cores that we are discussing right now, they are impractical rather than evil.
Or unneeded. If the program fits in the on-chip memory, no cache is needed. What sort of programming are you doing in <1 kLUT CPUs that would require slow off-chip program storage?
> But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I didn't have it implemented in my half-dozen cores (in the meantime the number is growing). But it is practical, esp. for applications that spend most of the time in very short loops.
If the jump instruction is one clock cycle and there is no pipeline, jump prediction is not possible, I think.
> > My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me; someone insisted that it was a 2-level pipeline, but with no pipeline delay,
>
> I am starting to suspect that you have a very special definition of "not pipelined" that differs from the definition used in the literature.
OK, not sure what that means. Every instruction takes one clock cycle. While a given instruction is being executed, the next instruction is being fetched, but the *actual* next instruction, not the "possible" next instruction. All branches happen during the branch instruction's execution, which fetches the correct next instruction.
This guy said I was pipelining the fetch and execute... I see no purpose in calling that pipelining, since it carries no baggage of any sort.
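The fetch-during-execute scheme described above can be sketched as a toy simulator (my illustration only; the instruction names and the accumulator are made up, the real CPU is a stack machine). The key property is that a branch selects the next address during its own execute cycle, so the word being fetched in parallel is always the *actual* next instruction and nothing is ever flushed.

```python
# Toy model of single-cycle execute with overlapped fetch: each "clock",
# the registered instruction executes while the memory fetch of the
# actual next instruction proceeds in parallel. Branches resolve in
# their own cycle, so no prediction and no flush.

def run(program, cycles):
    pc = 0
    ir = program[pc]          # instruction register (fetched last cycle)
    acc = 0
    for _ in range(cycles):
        op, arg = ir
        next_pc = pc + 1      # default: fall through
        # execute the registered instruction...
        if op == "LIT":
            acc = arg
        elif op == "ADD":
            acc += arg
        elif op == "JMP":
            next_pc = arg     # branch resolved during its own cycle
        # ...while the actual next instruction is fetched in parallel
        pc = next_pc
        ir = program[pc]
    return acc

prog = [("LIT", 0), ("ADD", 1), ("JMP", 1)]  # increment forever
print(run(prog, 7))
```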
> > oh well) so no branch prediction needed.
> >
> >
> > > > > > Many stack-based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz, and typically are not pipelined.
> > >
> > > 1 cycle per instruction, not pipelined, means that the stack can not be implemented in memory block(s). Which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.
> >
> > Huh? So my block RAM stack is pipelined, or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like
> >
> > ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes the return address to the return stack and the PSW to the data stack in one cycle with no latency, so, like the other instructions, it is single cycle, again making using it like designing with registers in the HDL code.
> >
> >
> > > Also, 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
> >
> > Or both. To get the block RAMs single cycle, the read and write happen on different phases of the main clock. I think read is on the falling edge while write is on the rising edge, like the rest of the logic. Instructions and data are in physically separate memories within the same address map, but there is no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?
> >
>
> Less of a problem when you are in full control of the software stack.
> When you are not in full control, compilers sometimes like to place data, esp. jump tables for implementing the HLL switch/case construct, in program memory.
> Still, even with full control of the code generation tools, sometimes you want an architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash.
> Another, less common possible reason is saving space by placing code and data in the same memory block, esp. when blocks are relatively big and there are few of them.
There is nothing to prevent loading code into program memory. It's all one address space and can be written to by machine code. So I guess it's not really Harvard; it's just physically separate memory. Since instructions are not a word wide, I think the program memory does not implement a full word width... to be honest, I don't recall. I haven't used this CPU in years. I've been programming in Forth on PCs more recently.
Another stack processor is the J1, which is used in a number of applications and even had a TCP/IP stack implemented in about 8 kW (kB? kinstructions?). You can find info on it with a Google search. It is every bit as small as mine and a lot better documented, and it is programmed in Forth, while mine is programmed in assembly, which is similar to Forth.
> > > And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard.
> >
> > Not hard when the CPU is simple and designed to be easy to implement rather than designed to be like all the other CPUs with complicated functionality.
> >
>
> It is certainly easier when branching is based on arithmetic flags rather than on the contents of a register, as is the case in MIPS derivatives, including Nios2 and RISC-V. But it is still hard. You have to wait for the instruction to arrive from memory, decode the instruction, do logical operations on the flags, and select between two alternatives based on the result of the logical operation, all in one cycle.
> If the branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.
I guess this is where I disagree on the pipelining aspect of my design. I register the current instruction, so the memory fetch is in the previous cycle based on that instruction. So my delay path starts with the instruction, not the instruction pointer. The instruction decode for each section of the CPU is in parallel, of course. The three sections of the CPU are the instruction fetch, the data path, and the address path. The data path and address path roughly correspond to the data and return stacks in Forth. In my CPU they can operate separately, and the return stack can perform simple math like increment/decrement/test, since it handles addressing memory. In Forth, everything is done on the data stack other than holding the return addresses, managing DO loop counts, and user-specific operations.
My CPU has both PC-relative addressing and absolute addressing. One way I optimize for speed is by careful management of the low-level implementation. For example, I use an adder as a multiplexer when it's not adding: A+0 is A, 0+B is B, and A+B is, well, A+B.
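The adder-as-multiplexer trick can be shown in a few lines (my illustration, not the actual HDL; the 24-bit width is an assumption): one adder doubles as a 2:1 mux by zeroing the unselected operand, so no separate mux logic is needed on that path.

```python
# One adder acting as a 2:1 mux: gating an operand to zero passes the
# other operand through unchanged (A+0 = A, 0+B = B), while enabling
# both operands performs a real addition. Width is an assumed 24 bits.

MASK24 = (1 << 24) - 1

def adder_mux(a, b, sel_a, sel_b):
    """sel_a/sel_b gate the operands into the shared adder."""
    return ((a if sel_a else 0) + (b if sel_b else 0)) & MASK24

print(adder_mux(5, 9, True, False))   # passes A through -> 5
print(adder_mux(5, 9, False, True))   # passes B through -> 9
print(adder_mux(5, 9, True, True))    # really adds -> 14
```

In hardware the gating is just an AND on each operand bit, which is cheaper than a separate multiplexer plus adder.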
> But even if it's somehow doable for PC-relative branches, I don't see how, assuming that the stack is stored in block memory, it is doable for *indirect* jumps. I'd guess you are somehow cutting corners here, most probably by requiring the address of an indirect jump to be in the top-of-stack register that is not in block memory.
Indirect addressing??? Indirect addressing requires multiple instructions, yes. The return stack is typically used for address calculations, and that stack is fed directly into the instruction fetch logic... it is the "return" stack (or address unit, your choice) after all.
> > > Not impossible if your FPGA is very fast, like a top-speed Arria 10, where you can instantiate a Nios2e at 380 MHz and a full-featured Nios2f at 300 MHz+. But it does look impossible in low-speed-grade budget parts, like the slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that the Lattice Mach series is somewhat slower than even those.
> >
> > I only use the low-grade parts. I haven't used NIOS
>
> Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in the early '00s Altera had a completely different architecture that was called Nios.
I haven't used those processors either.
> > and this processor won't get to 380 MHz, I'm pretty sure. Pipelining it would be counter to its design goals but might be practical; never thought about it.
> >
> >
> > > The only way that I can see non-pipelined conditional branches working at 100 MHz in low-end devices is if your architecture has a branch delay slot. But that by itself is a sort of pipelining; instead of being done in HW, it is pipelining exposed to SW.
> >
> > Or the instruction is simple and runs fast.
> >
>
> I don't doubt that you did it, but answers like that smell of hand-waving.
Ok, whatever that means.
> > > Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use an OTS Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.
> >
> > That's where my CPU lies; I think it was 600 LUT4s last time I checked.
> >
>
> Does it include single-cycle 32-bit shift/rotate by an arbitrary 5-bit count (5 variations: logical and arithmetic right shift, logical left shift, rotate right, rotate left)?
There are shift instructions. It does not have a barrel shifter, if that is what you are asking. A barrel shifter is not really a CPU; it is a CPU feature, and it is large and slow. Why slow down the rest of the CPU with a very slow feature? That is the sort of thing that should be external hardware.
When they design CPU chips, they have already made compromises that require larger, slower logic, which requires pipelining. The barrel shifter is perfect for pipelining, so it fits right in.
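To make the size/delay argument concrete, here is a model of the standard logarithmic barrel shifter structure (my sketch of the generic technique, not either poster's design): an arbitrary 5-bit count is handled by 5 conditional-rotate stages, i.e. 5 mux layers across the full 32-bit word, which is exactly the wide, flat logic that burdens a small core.

```python
# Model of a 5-stage logarithmic barrel shifter: each stage rotates by
# a power of two (1, 2, 4, 8, 16) when the corresponding count bit is
# set, so log2(32) = 5 mux layers cover any count from 0 to 31.

MASK32 = (1 << 32) - 1

def rotate_left(x, count):
    """Rotate a 32-bit word left by count (0..31), stage by stage."""
    for stage in range(5):
        if (count >> stage) & 1:
            amt = 1 << stage
            x = ((x << amt) | (x >> (32 - amt))) & MASK32
    return x

print(hex(rotate_left(0x80000001, 4)))   # 0x18
```

The other four shift/rotate variations reuse the same stage structure with different fill bits, which is why the block grows so large when all five are supported.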
> Does it include zero-extended and sign-extended byte and half-word loads (fetches, in your language)?
I don't recall, but I'll say no. I do recall some form of sign extension, but I may be thinking of setting the top of stack by the flags. Forth has words that treat the word on the top of stack as a word, so the mapping is better if this is implemented. I'm not convinced this is really better than using the flags directly in the asm, but for now I'm compromising. I'm not really a compiler writer, so...
> In my cores these two functions combined are the biggest block, bigger than the 32-bit ALU, and comparable in size with the result writeback mux.
Sure, the barrel shifter is O(n^2), like a multiplier. That's why in small CPUs it is often done in loops. Since loops can be made efficient with the right instructions, that's a good way to go. If you really need optimum speed for barrel shifting, then I guess a large block of logic and pipelining is the way to go.
I needed to implement multiplications, but they are on 24-bit words that are being shifted into and out of a CODEC bit-serially. I found a software shift-and-add to work perfectly well, no need for special hardware.
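The software shift-and-add multiply mentioned above looks roughly like this (my sketch of the generic technique; the 24-bit width matches the CODEC words discussed, but the details of the real routine are not in the post):

```python
# Shift-and-add multiplication: walk the multiplier one bit at a time,
# adding the (progressively shifted) multiplicand whenever the current
# bit is set. Masking to 24 bits keeps the low 24 bits of the product,
# matching a 24-bit datapath.

MASK24 = (1 << 24) - 1

def mul24_shift_add(a, b):
    """Multiply two 24-bit unsigned values, keeping the low 24 bits."""
    product = 0
    for _ in range(24):
        if b & 1:                  # add shifted multiplicand
            product = (product + a) & MASK24
        a = (a << 1) & MASK24      # shift multiplicand left
        b >>= 1                    # examine next multiplier bit
    return product

print(mul24_shift_add(1234, 5678))   # 7006652
```

On a core with single-cycle shift and add instructions this is one short loop, which is why no hardware multiplier was needed for the CODEC work.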
Bowman was using his J1 for video work (don't recall the details), but the MicroBlaze was too slow and used too much memory. The J1 did the same functions faster and in less code with generic instructions, nothing unique to the application, if I remember correctly... not that the MicroBlaze is the gold standard.
> Also, I assume that your cores have no multiplier, right?
By "cores" you mean CPUs? Core, actually; remember the interrupt: one CPU, one interrupt. Yes, no hard multiplier as yet. The pure hardware implementation of the CODEC app used shift-and-add in hardware as well, but new features were needed and space was running out in the small FPGA, 3 kLUTs. The slower, simpler stuff could be ported to software easily for an overall reduction in LUT4 usage along with the new features.
I don't typically try to compete with the functionality of ARMs with my CPU designs. To me they are FPGA logic adjuncts, so I try to make them as simple as the other logic.
I wrote some code for a DDS in software once as a benchmark for CPU instruction set designs. The shortest and fastest I came up with was a hybrid between a stack CPU and a register CPU, where objects near the top of stack could be addressed rather than having to always move things around to put the nouns where the verbs could reach them. I have no idea how to program that in anything other than assembly, which would be OK with me. I used an Excel spreadsheet to analyze the 50 to 90 instructions in this routine. It would be interesting to write an assembler that would produce the same outputs.
Rick C.