
Tiny CPUs for Slow Logic

Started by Unknown March 18, 2019
gnuarm.deletethisbit@gmail.com wrote:
> On Tuesday, March 19, 2019 at 10:29:07 AM UTC-4, Theo Markettos wrote:
>
> When people talk about things like "software running on such heterogeneous cores" it makes me think they don't really understand how this could be used. If you treat these small cores like logic elements, you don't have such lofty descriptions of "system software" since the software isn't created out of some global software package. Each core is designed to do a specific job just like any other piece of hardware and it has discrete inputs and outputs just like any other piece of hardware. If the hardware clock is not too fast, the software can synchronize with and literally function like hardware, but implementing more complex logic than the same area of FPGA fabric might.
The point is that we need to understand what the whole system is doing. In the XMOS case, we can look at a piece of software with N threads, running across the cores provided on the chip. One piece of software, distributed over the hardware resource available - the system is doing one thing. Your bottom-up approach means it's difficult to see the big picture of what's going on. That means it's hard to understand the whole system, and to program from a whole-system perspective.
> Not sure what is hard to think about. It's a CPU, a small CPU with limited memory to implement small tasks that can do rather complex operations compared to a state machine really and includes memory, arithmetic and logic as well as I/O without having to write a single line of HDL. Only the actual app needs to be written.
Here are the semantic descriptions of the basic logic elements:

  LUT:  q = f(x,y,z)
  FF:   q <= d_in (delay of one cycle)
  BRAM: q = array[addr]
  DSP:  q = a*b + c

A P&R tool can build a system out of these building blocks. It's notable that the state-holding elements in this schema do nothing else except hold state. That makes writing the tools easier (and we all know how difficult the tools already are). In general, we don't tend to instantiate these primitives manually but describe the higher-level functions (eg a 64-bit add) in HDL and allow the tools to select appropriate primitives for us (eg a number of fast-adder blocks chained together).

What's the logic equation of a processor? It has state, but vastly more state than the simplicity of a flipflop. What pattern does the P&R tool need to match to infer a processor? How is any verification tool going to understand whether the processor plus its software is doing the right thing? If your answer is 'we don't need verification tools, we program by hand' then a) software has bugs, and automated verification is a handy way to catch them, and b) you're never going to be writing hundreds of different mini-programs to run on each core by hand, let alone make them correct.

If we scale the processors up a bit, I could see the merits in, say, a bank of 32 Cortex-M0s that could be interconnected as part of the FPGA fabric and programmed in software for dedicated tasks (for instance, read the I2C EEPROM on the DRAM DIMM and configure the DRAM controller at boot). But this is an SoC construct (built using SoC builder tools, and over which the programmer has some purview - although, as it turns out, sketchier than you might think [1]). Such CPUs would likely be running bigger corpora of software (for instance, the DRAM controller vendor's provided initialisation code), which would likely be in C. But in this case we could just use a soft core today (the CPU ISA is mostly irrelevant for this application, so a RISC-V/Microblaze/NIOS would be fine).

[1] https://inf.ethz.ch/personal/troscoe/pubs/hotos15-gerber.pdf

I can also see another niche, at the extreme bottom end, where a CPLD might have one of your processors plus a few hundred logic cells. That's essentially a microcontroller with FPGA, or an FPGA with microcontroller - which some of the vendors already produce (although possibly not small/cheap/low-power enough). Here I can't see the advantage of using a stack-based CPU versus paying a bit more to program in C. Although I don't have experience in markets where the retail price of the product is $1, and so every $0.001 matters.
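To make the "describe the function, let the tools pick the primitives" point concrete, here is a minimal sketch (not from the original post): a behavioural, registered 64-bit add in VHDL. No carry-chain or DSP primitive is named; synthesis is left to infer the chained fast-adder blocks itself.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add64 is
  port (
    clk : in  std_logic;
    a   : in  unsigned(63 downto 0);
    b   : in  unsigned(63 downto 0);
    q   : out unsigned(63 downto 0)
  );
end entity;

architecture rtl of add64 is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      q <= a + b;  -- the tool infers chained fast-carry/adder primitives; none is instantiated by hand
    end if;
  end process;
end architecture;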
> > I would be interested to know what applications might use heterogeneous many-cores and what performance is achievable.
>
> Yes, clearly not getting the concept. Asking about heterogeneous performance is totally antithetical to this idea.
You keep mentioning 700 MIPS, which suggests performance is important. If these are simple state-machine replacements, why do we care about performance?

In essence, your proposal has a disconnect between the situations in which existing FPGA blocks are used (implemented automatically by P&R tools) and the situations in which software is currently used (human-driven software and architectural design). It's unclear how you propose to bridge this gap.

Theo
On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
> On 19/03/19 17:35, already5chosen@yahoo.com wrote:
> > On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
> >> The "granularity" of the computation and communication will be a key to understanding what the OP is thinking.
> >
> > I don't know what Rick had in mind. I personally would go for one "hard-CPU" block per 4000-5000 6-input logic elements (i.e. Altera ALMs or Xilinx CLBs). Each block could be configured either as one 64-bit core or a pair of 32-bit cores. The block would contain hard instruction decoders/ALUs/shifters and hard register files. It could optionally borrow adjacent DSP blocks for multipliers. Adjacent embedded memory blocks can be used for data memory. Code memory should be a bit more flexible, giving the designer a choice between embedded memory blocks or distributed memory (X)/MLABs (A).
>
> It would be interesting to find an application level description (i.e. language constructs) that
> - could be automatically mapped onto those primitives by a toolset
> - was useful for more than a niche subset of applications
> - was significantly better than existing tools
>
> I wouldn't hold my breath :)
I think you are looking at it from the wrong angle. One doesn't really need new tools to design and simulate such things. What's needed is a combination of existing tools - compilers, assemblers, probably software-simulator plug-ins for existing HDL simulators - but the latter is just a luxury for speeding up simulations; in principle, feeding the HDL simulator an RTL model of the CPU core will work too.

As to niches, all "hard" blocks that we currently have in FPGAs are about niches. It's extremely rare that a user's design uses all or a majority of the features of a given FPGA device and needs LUTs, embedded memories, PLLs, multipliers, SERDESes, DDR DRAM I/O blocks etc. in exactly the amounts appearing in the device. It still makes sense, economically, to have them all built in, because masks and other NREs are mighty expensive while silicon itself is relatively cheap. Multiple small hard CPU cores are really not very different from the features mentioned above.
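A minimal sketch of that "combination of existing tools" idea, under stated assumptions: the program is built with an ordinary assembler or compiler into a hex file (the file name "program.hex" and the 16-bit word width are assumptions, not anything from the thread), and the core's code memory is just an HDL ROM initialised from that file, so any HDL simulator can run the whole system without new tooling.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_textio.all;  -- hread for pre-VHDL-2008 tools
use std.textio.all;

entity code_rom is
  port (
    clk  : in  std_logic;
    addr : in  unsigned(9 downto 0);
    data : out std_logic_vector(15 downto 0)
  );
end entity;

architecture rtl of code_rom is
  type rom_t is array (0 to 1023) of std_logic_vector(15 downto 0);

  -- Load the assembler output (one 16-bit hex word per line) at elaboration time.
  impure function load_program(fname : string) return rom_t is
    file f     : text open read_mode is fname;
    variable l : line;
    variable w : std_logic_vector(15 downto 0);
    variable r : rom_t := (others => (others => '0'));
    variable i : integer := 0;
  begin
    while not endfile(f) and i <= rom_t'high loop
      readline(f, l);
      hread(l, w);
      r(i) := w;
      i := i + 1;
    end loop;
    return r;
  end function;

  constant rom : rom_t := load_program("program.hex");  -- assumed file name
begin
  process (clk)
  begin
    if rising_edge(clk) then
      data <= rom(to_integer(addr));  -- synchronous read; maps to a block RAM
    end if;
  end process;
end architecture;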
On 20/03/19 10:41, already5chosen@yahoo.com wrote:
> On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
>> On 19/03/19 17:35, already5chosen@yahoo.com wrote:
>>> On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
>>>>
>>>> The UK Parliament is an unmitigated dysfunctional mess.
>>>>
>>>
>>> Do you prefer dysfunctional mesh ;)
>>
>> :) I'll settle for anything that /works/ predictably :(
>>
>
> The UK political system is completely off-topic in comp.arch.fpga. However, I'd say that IMHO right now your parliament is facing an unusually difficult problem on one hand, but at the same time it's not really a "life or death" sort of problem. Having troubles and appearing indecisive in such a situation is normal. It does not mean that the system is broken.
>
Firstly, you chose to snip the analogy, thus removing the context.

Secondly, there are currently /very/ plausible reasons to believe it might be life or death for my 98yo mother, and may hasten my death. No, I'm not going to elaborate on a public forum. I will note that Operation Yellowhammer will, barring miracles, be started on Monday, and that a prominent *brexiteer* (Michael Gove) is shit scared of a no-deal exit because all the chemicals required to purify our drinking water come from Europe.
already5chosen@yahoo.com wrote:
> As to niches, all "hard" blocks that we currently have in FPGAs are about niches. It's extremely rare that a user's design uses all or a majority of the features of a given FPGA device and needs LUTs, embedded memories, PLLs, multipliers, SERDESes, DDR DRAM I/O blocks etc. in exactly the amounts appearing in the device. It still makes sense, economically, to have them all built in, because masks and other NREs are mighty expensive while silicon itself is relatively cheap. Multiple small hard CPU cores are really not very different from the features mentioned above.
A lot of these 'niches' have been proven in soft-logic. Implement your system in soft-logic, discover that there are lots of multiply-adds and they're slow and take up area. A DSP block is thus an 'accelerator' (or 'most compact representation') of the same concept in soft-logic. The same goes for BRAMs (can be implemented via registers, but at too much area), adders (slow when implemented with generic LUTs), etc. Other features (SERDES, PLLs, DDR, etc) can't be done at all without hard-logic support: if you want those features, you need the hard logic, simple as that.

Through analysis of existing designs we can show a provable win of the hard logic over the soft, which makes it worthwhile putting it on the silicon and integrating it into the tools. In some of these cases, I'd guess the win over soft-logic is a 10x or more saving in area.

Rick's idea can be done today in soft-logic. So someone could build a proof of concept and measure the cases where it improves things over the baseline. If that case is compelling, let's put it in the hard logic. But thus far we haven't seen a clear case for why someone should build a proof of concept. I'm not saying it doesn't exist, but we need a clear elucidation of the problem it might solve.

Theo
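A sketch of what the structural side of such a soft-logic proof of concept could look like - entirely hypothetical: "tiny_cpu" is a placeholder for whichever small core is being evaluated (picoblaze-class, a small stack machine, etc.) and its port list is invented here. The point is only that an array of small soft cores can be instantiated in today's fabric and its area and Fmax measured against a conventional-logic implementation of the same tasks.

library ieee;
use ieee.std_logic_1164.all;

entity core_array is
  generic ( N_CORES : positive := 16 );
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;
    d_in  : in  std_logic_vector(8*N_CORES-1 downto 0);
    d_out : out std_logic_vector(8*N_CORES-1 downto 0)
  );
end entity;

architecture rtl of core_array is
  component tiny_cpu is  -- hypothetical soft core under evaluation
    port (
      clk  : in  std_logic;
      rst  : in  std_logic;
      din  : in  std_logic_vector(7 downto 0);
      dout : out std_logic_vector(7 downto 0)
    );
  end component;
begin
  -- One core per 8-bit I/O slice; each core runs its own small, fixed program.
  gen_cores : for i in 0 to N_CORES-1 generate
    u_core : tiny_cpu
      port map (
        clk  => clk,
        rst  => rst,
        din  => d_in(8*i+7 downto 8*i),
        dout => d_out(8*i+7 downto 8*i)
      );
  end generate;
end architecture;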
On 20/03/19 10:56, already5chosen@yahoo.com wrote:
> On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
>> On 19/03/19 17:35, already5chosen@yahoo.com wrote:
>>> On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
>>>> The "granularity" of the computation and communication will be a key to understanding what the OP is thinking.
>>>
>>> I don't know what Rick had in mind. I personally would go for one "hard-CPU" block per 4000-5000 6-input logic elements (i.e. Altera ALMs or Xilinx CLBs). Each block could be configured either as one 64-bit core or a pair of 32-bit cores. The block would contain hard instruction decoders/ALUs/shifters and hard register files. It could optionally borrow adjacent DSP blocks for multipliers. Adjacent embedded memory blocks can be used for data memory. Code memory should be a bit more flexible, giving the designer a choice between embedded memory blocks or distributed memory (X)/MLABs (A).
>>
>> It would be interesting to find an application level description (i.e. language constructs) that
>> - could be automatically mapped onto those primitives by a toolset
>> - was useful for more than a niche subset of applications
>> - was significantly better than existing tools
>>
>> I wouldn't hold my breath :)
>
> I think you are looking at it from the wrong angle. One doesn't really need new tools to design and simulate such things. What's needed is a combination of existing tools - compilers, assemblers, probably software-simulator plug-ins for existing HDL simulators - but the latter is just a luxury for speeding up simulations; in principle, feeding the HDL simulator an RTL model of the CPU core will work too.
That would be one perfectly acceptable embodiment of the toolset I mentioned. But more difficult than creating such a toolset is defining an application-level description that a toolset can munge.

So, define (initially by example, later more formally) inputs to the toolset and outputs from it. Then we can judge whether the concepts are more than handwaving wishes.
> As to niches, all "hard" blocks that we currently have in FPGAs are about niches. It's extremely rare that a user's design uses all or a majority of the features of a given FPGA device and needs LUTs, embedded memories, PLLs, multipliers, SERDESes, DDR DRAM I/O blocks etc. in exactly the amounts appearing in the device. It still makes sense, economically, to have them all built in, because masks and other NREs are mighty expensive while silicon itself is relatively cheap. Multiple small hard CPU cores are really not very different from the features mentioned above.
All the blocks you mention have a simple API and easily enumerated set of behaviour. The whole point of processors is that they enable much more complex behaviour that is practically impossible to enumerate. Alternatively, if it is possible to enumerate the behaviour of a processor, then it would be easy and more efficient to implement the behaviour in conventional logic blocks.
On Wednesday, March 20, 2019 at 3:37:17 PM UTC+2, Tom Gardner wrote:
>
> But more difficult than creating such a toolset is defining an application-level description that a toolset can munge.
>
> So, define (initially by example, later more formally) inputs to the toolset and outputs from it. Then we can judge whether the concepts are more than handwaving wishes.
>
I don't understand what you are asking for.

If I had such a thing, I'd use it in exactly the same way that I use soft cores (Nios2) today. I would just use them more frequently, because today they cost me logic resources (often acceptable, but not always) and synthesis and fitter time (and that's what I really hate). A "hard" core, on the other hand, would be almost free in both respects. It would be as expensive as a "soft" core, or even costlier, in HDL simulations, but until now I have managed to avoid "full system" simulations that cover everything including the CPU core and the program that runs on it. Or maybe I did it once or twice years ago and already don't remember. Anyway, for me it's not an important concern, and I consider myself a rather heavy user of soft cores.

Also, theoretically, if the performance of the hard core were non-trivially higher than that of soft cores, either due to higher IPC (I didn't measure, but would guess that for the majority of tasks Nios2-f IPC is 20-30% lower than ARM Cortex-M4) or due to higher clock rate, then it would open up even more niches. However, I'd expect that the performance factor would be less important for me, personally, than the other factors mentioned above.
On 20/03/19 14:11, already5chosen@yahoo.com wrote:
> On Wednesday, March 20, 2019 at 3:37:17 PM UTC+2, Tom Gardner wrote:
>>
>> But more difficult than creating such a toolset is defining an application-level description that a toolset can munge.
>>
>> So, define (initially by example, later more formally) inputs to the toolset and outputs from it. Then we can judge whether the concepts are more than handwaving wishes.
>>
>
> I don't understand what you are asking for.
Go back and read the parts of my post that you chose to snip. Give a handwaving indication of the concepts that avoid the conceptual problems that I mentioned. Or better still, get the OP to do it.
> If I had such a thing, I'd use it in exactly the same way that I use soft cores (Nios2) today. I would just use them more frequently, because today they cost me logic resources (often acceptable, but not always) and synthesis and fitter time (and that's what I really hate). A "hard" core, on the other hand, would be almost free in both respects. It would be as expensive as a "soft" core, or even costlier, in HDL simulations, but until now I have managed to avoid "full system" simulations that cover everything including the CPU core and the program that runs on it. Or maybe I did it once or twice years ago and already don't remember. Anyway, for me it's not an important concern, and I consider myself a rather heavy user of soft cores.
>
> Also, theoretically, if the performance of the hard core were non-trivially higher than that of soft cores, either due to higher IPC (I didn't measure, but would guess that for the majority of tasks Nios2-f IPC is 20-30% lower than ARM Cortex-M4) or due to higher clock rate, then it would open up even more niches. However, I'd expect that the performance factor would be less important for me, personally, than the other factors mentioned above.
On Wednesday, March 20, 2019 at 6:14:21 AM UTC-4, David Brown wrote:
> On 20/03/2019 03:30, gnuarm.deletethisbit@gmail.com wrote:
> > On Tuesday, March 19, 2019 at 10:29:07 AM UTC-4, Theo Markettos wrote:
> >> Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
> >>> Understand XMOS's xCORE processors and xC language, see how they complement and support each other. I found the net result stunningly easy to get working first time, without having to continually read obscure errata!
> >>
> >> I can see the merits of the XMOS approach. But I'm unclear how this relates to the OP's proposal, which (I think) is having tiny CPUs as hard logic blocks on an FPGA, like DSP blocks.
> >>
> >> I completely understand the problem of running out of hardware threads, so a means of 'just add another one' is handy. But the issue is how to combine such things with other synthesised logic.
> >>
> >> The XMOS approach is fine when the hardware is uniform and the software sits on top, but when the hardware is synthesised and the 'CPUs' sit as pieces in a fabric containing random logic (as I think the OP is suggesting) it becomes a lot harder to reason about what the system is doing and what the software running on such heterogeneous cores should look like. Only the FPGA tools have a full view of what the system looks like, and it seems stretching them to have them also generate software to run on these cores.
> >
> > When people talk about things like "software running on such heterogeneous cores" it makes me think they don't really understand how this could be used. If you treat these small cores like logic elements, you don't have such lofty descriptions of "system software" since the software isn't created out of some global software package. Each core is designed to do a specific job just like any other piece of hardware and it has discrete inputs and outputs just like any other piece of hardware. If the hardware clock is not too fast, the software can synchronize with and literally function like hardware, but implementing more complex logic than the same area of FPGA fabric might.
>
> That is software.
>
> If you want to try to get cycle-precise control of the software and use that precision for direct hardware interfacing, you are almost certainly going to have a poor, inefficient and difficult design. It doesn't matter if you say "think of it like logic" - it is /not/ logic, it is software, and you don't use that for cycle-precise control. You use it when you need flexibility, calculations, and decisions.
I suppose you can make anything difficult if you try hard enough. The point is you don't have to make it difficult by talking about "software running on such heterogeneous cores". Just talk about it being a small hunk of software that is doing a specific job. Then the mystery is gone and the task can be made as easy as the task is. In VHDL this would be a process(). VHDL programs are typically chock full of processes and no one wrings their hands worrying about how they will design the "software running on such heterogeneous cores". BTW, VHDL is software too.
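As a minimal sketch of that process() comparison (illustrative only, not code from the thread): a self-contained VHDL process doing one small, specific job - here a push-button debouncer, assuming a 1 kHz tick input; the entity and signal names are invented. This is roughly the granularity at which one of these tiny cores would be programmed.

library ieee;
use ieee.std_logic_1164.all;

entity debounce is
  port (
    clk       : in  std_logic;
    tick_1khz : in  std_logic;   -- assumed 1 kHz enable pulse
    btn_raw   : in  std_logic;
    btn_clean : out std_logic
  );
end entity;

architecture rtl of debounce is
  signal btn_prev : std_logic := '0';           -- last sampled value of the input
  signal count    : integer range 0 to 19 := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if tick_1khz = '1' then
        if btn_raw = btn_prev then
          if count = 19 then                    -- input stable for ~20 ms
            btn_clean <= btn_prev;
          else
            count <= count + 1;
          end if;
        else                                    -- input changed: restart the count
          btn_prev <= btn_raw;
          count    <= 0;
        end if;
      end if;
    end if;
  end process;
end architecture;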
> > There is no need to think about how the CPUs would communicate unless there is a specific need for them to do so. The F18A uses a handshaked parallel port in their design. They seem to have done a pretty slick job of it and can actually hang the processor waiting for the acknowledgement, saving power and getting an instantaneous wake up following the handshake. This can be used with other CPUs or
>
> Fair enough.
Ok, that's a start. Rick C.
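A rough sketch of what such a handshaked parallel port could look like on the fabric side (this is not the actual F18A interface; the entity and signal names are assumptions): a sender presents a byte, raises a request, and then stalls until the receiver acknowledges, so either side can sleep while waiting.

library ieee;
use ieee.std_logic_1164.all;

entity hs_tx is
  port (
    clk      : in  std_logic;
    send     : in  std_logic;                      -- pulse: data_in is valid
    data_in  : in  std_logic_vector(7 downto 0);
    req      : out std_logic;                      -- request to the receiving core
    ack      : in  std_logic;                      -- acknowledgement from the receiver
    data_out : out std_logic_vector(7 downto 0);
    busy     : out std_logic                       -- high while waiting for the ack
  );
end entity;

architecture rtl of hs_tx is
  signal req_i : std_logic := '0';
begin
  req  <= req_i;
  busy <= req_i;

  process (clk)
  begin
    if rising_edge(clk) then
      if req_i = '0' and send = '1' then
        data_out <= data_in;   -- present the word and assert the request
        req_i    <= '1';
      elsif req_i = '1' and ack = '1' then
        req_i    <= '0';       -- receiver has taken it; drop the request
      end if;
    end if;
  end process;
end architecture;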
On Wednesday, March 20, 2019 at 4:31:27 PM UTC+2, Tom Gardner wrote:
> On 20/03/19 14:11, already5chosen@yahoo.com wrote:
> > On Wednesday, March 20, 2019 at 3:37:17 PM UTC+2, Tom Gardner wrote:
> >>
> >> But more difficult than creating such a toolset is defining an application-level description that a toolset can munge.
> >>
> >> So, define (initially by example, later more formally) inputs to the toolset and outputs from it. Then we can judge whether the concepts are more than handwaving wishes.
> >>
> >
> > I don't understand what you are asking for.
>
> Go back and read the parts of my post that you chose to snip.
>
> Give a handwaving indication of the concepts that avoid the conceptual problems that I mentioned.
Frankly, it's starting to sound like you have never used soft CPU cores in your designs. So, for somebody like myself, who has used them routinely for different tasks since 2006, you are really not easy to understand.

Concepts? Concepts are good for new things, not for something that is a variation on something old, routine and obviously working.
> Or better still, get the OP to do it.
With that part I agree.
On Wednesday, March 20, 2019 at 6:29:50 AM UTC-4, already...@yahoo.com wrote:
> On Wednesday, March 20, 2019 at 4:32:07 AM UTC+2, gnuarm.del...@gmail.com wrote:
> > On Tuesday, March 19, 2019 at 11:24:33 AM UTC-4, Svenn Are Bjerkem wrote:
> > > On Tuesday, March 19, 2019 at 1:13:38 AM UTC+1, gnuarm.del...@gmail.com wrote:
> > > > Most of us have implemented small processors for logic operations that don't need to happen at high speed. Simple CPUs can be built into an FPGA using a very small footprint much like the ALU blocks. There are stack based processors that are very small, smaller than even a few kB of memory.
> > > >
> > > > If they were easily programmable in something other than C would anyone be interested? Or is a C compiler mandatory even for processors running very small programs?
> > > >
> > > > I am picturing this not terribly unlike the sequencer I used many years ago on an I/O board for an array processor which had its own assembler. It was very simple and easy to use, but very much not a high level language. This would have a language that was high level, just not C, rather something extensible and simple to use and potentially interactive.
> > > >
> > > > Rick C.
> > >
> > > picoblaze is such a small cpu and I would like to program it in something else but its assembler language.
> >
> > Yes, it is small. How large is the program you are interested in?
> >
> > Rick C.
>
> I don't know about Svenn Are Bjerkem, but I can tell you about myself.
> The last time I considered something like that and wrote enough of the program to make measurements, the program contained ~250 Nios2 instructions. I'd guess that on a minimalistic stack machine it would take 350-400 instructions. In the end, I didn't do it in software. Coding the same functionality in HDL turned out to be not hard, which probably suggests that my case was smaller than average.
>
> At the other extreme, where I did end up using a "small" soft core, it was much more like "real" software: 2300 Nios2 instructions.
What sorts of applications were these?

Rick C.