Reply by glen herrmannsfeldt January 28, 2009
Martin Thompson <martin.j.thompson@trw.com> wrote:
(snip)
 
> Which is exactly the point - if you do know things about parallelism,
> the tools need to let you express that to them in an easy and
> intuitive fashion. I also agree with Jonathan that CSP feels a
> good way to do it (but maybe we're both weird :).
I have worked with systolic array implementations of dynamic programming algorithms, and they look completely different from software implementations. If you want an example, look at the software and hardware implementations of CRC32. In software it can be very easily done a byte at a time with a 256 word lookup table. The hardware (high speed) implementations are completely different because what is available and fast is completely different.
> Sure - we all *want* to let the tools rip and have them do a good job,
> but I think that many algorithms will make better use of parallelism
> with some hints given to the tools about how to do it.
-- glen
Reply by Martin Thompson January 28, 2009
"HT-Lab" <hans64@ht-lab.com> writes:

> "Jonathan Bromley" <jonathan.bromley@MYCOMPANY.com> wrote in message
>
>> What could be more natural than to say (or think) "Do XYZ; but while
>> you're doing it, do as much of ABC as you can do without knowing the
>> results of XYZ"?
>
> OK, if you have this information then fine, pass it on to the tool.
> However, I believe that in most cases you just want to give the tool
> some performance/area constraints and let it rip on your code.
Which is exactly the point - if you do know things about parallelism, the tools need to let you express that to them in an easy and intuitive fashion. I also agree with Jonathan that CSP feels a good way to do it (but maybe we're both weird :).

Sure - we all *want* to let the tools rip and have them do a good job, but I think that many algorithms will make better use of parallelism with some hints given to the tools about how to do it.

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
Reply by Brian Drummond January 28, 2009
On Wed, 28 Jan 2009 11:36:45 +0000, Jonathan Bromley
<jonathan.bromley@MYCOMPANY.com> wrote:

> On Wed, 28 Jan 2009 00:35:26 +0000, Brian Drummond wrote:
>
>> In one sense, signals are already very similar to occam's
>> channels, in that their events communicate synchronisation.
>
> For sure, but there's a big difference: signals are
> broadcast and non-negotiated. occam channels handshake
> (rendezvous) between a single source and a single sink.
Thanks for an excellent summary of stuff I was hazy on.
> To do that in HDL requires at least two signals, one in
> each direction, with all the fuss and poor encapsulation
> that entails.
Signals in both directions again...
> Of course, the Ada task entry rendezvous does all that
> occam channels do, and more; it's something I sorely
> miss in HDLs, especially when writing testbenches.
Heh, maybe we need Ada2Hardware rather than C2Hardware. VHDL might give us something of a head start there...
> I'm just saying that we could move on a little further,
> but there doesn't seem to be any collective appetite
> for doing so.
I'm not so sure there's no appetite, but the path isn't exactly clear. We can identify a few shortcomings in the language, but then what? I'm sure bidirectional elements in record ports (and the reason they can't be done) have been discussed at length while I was taking a nap...

Another missing feature is "out" generics; I would like an "out" mode generic on my divider to say its latency is 8 clock cycles (versus 12 for another architecture) and let instantiating blocks adjust their pipelines automatically. There are other approaches but I still find myself adjusting pipelines by hand.

But these two won't get us very far...
> Handshake Solutions offer a CSP-like language "Haste"
> that can be synthesised to asynchronous hardware (using
> Muller C-elements and various other tricks, I believe)
> but it seems far-fetched to imagine FPGAs being a viable
> target any time soon.
Thanks for the pointer in any case. - Brian
Reply by HT-Lab January 28, 2009
"Jonathan Bromley" <jonathan.bromley@MYCOMPANY.com> wrote in message 
news:0tg0o4tt9prmne7ggbudf427o8cqhc1rld@4ax.com...
> On Wed, 28 Jan 2009 11:29:23 -0000, "HT-Lab" wrote:
>
>> The human brain is not that
>> well suited to think concurrently
>
> I absolutely, fundamentally disagree.
>
> If you're stuck in a purely-sequential straitjacket,
> you are forced to jump through absurd hoops to
> express concurrent activity.
That is not the point; all I am saying is that writing sequential code is easier than writing concurrent code. If you were asked to develop, say, an IP stack and the choice of language were yours (ignoring end application performance etc), would you go for VHDL/Verilog or for C/C++? (Fill in any sequential language you prefer.)
> What could be more natural than to say (or think) "Do XYZ; but while
> you're doing it, do as much of ABC as you can do without knowing the
> results of XYZ"?
OK, if you have this information then fine, pass it on to the tool. However, I believe that in most cases you just want to give the tool some performance/area constraints and let it rip on your code.

Hans
www.ht-lab.com
> The widespread public distaste for concurrent descriptions simply
> reflects the fact that our tools for writing those descriptions are
> poorly matched to people's expectations. Above all else, what's needed
> is flexible composition of parallel and sequential descriptions,
> together with clear and intuitive ways to express synchronisation.
> CSP works for me, but not (it seems) for everyone. Typical real-time
> operating systems seem to me to make a truly lousy job of it, being
> designed for convenience of implementation rather than for expressive
> power.
>
> Just my $0.02.
Reply by Jonathan Bromley January 28, 2009
On Wed, 28 Jan 2009 11:29:23 -0000, "HT-Lab" wrote:

> The human brain is not that
> well suited to think concurrently
I absolutely, fundamentally disagree.

If you're stuck in a purely-sequential straitjacket, you are forced to jump through absurd hoops to express concurrent activity. What could be more natural than to say (or think) "Do XYZ; but while you're doing it, do as much of ABC as you can do without knowing the results of XYZ"?

The widespread public distaste for concurrent descriptions simply reflects the fact that our tools for writing those descriptions are poorly matched to people's expectations. Above all else, what's needed is flexible composition of parallel and sequential descriptions, together with clear and intuitive ways to express synchronisation. CSP works for me, but not (it seems) for everyone. Typical real-time operating systems seem to me to make a truly lousy job of it, being designed for convenience of implementation rather than for expressive power.

Just my $0.02.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Reply by Jonathan Bromley January 28, 2009
On Wed, 28 Jan 2009 00:35:26 +0000, Brian Drummond wrote:

> In one sense, signals are already very similar to occam's
> channels, in that their events communicate synchronisation.
For sure, but there's a big difference: signals are broadcast and non-negotiated. occam channels handshake (rendezvous) between a single source and a single sink. To do that in HDL requires at least two signals, one in each direction, with all the fuss and poor encapsulation that entails. SystemVerilog's interface construct might offer a way out, but [self publicity alert] as I'll discuss in a paper at DVCon next month, interfaces have many shortcomings that need sorting out before they are useful for reasonably high-level design.

Of course, the Ada task entry rendezvous does all that occam channels do, and more; it's something I sorely miss in HDLs, especially when writing testbenches.
> And that [HDL signals] works well in simulation.
Yes, the discrete-event simulation model is powerful and highly appropriate for digital simulation at a fairly low level. No-one in their right mind would want to jettison that, with the enormous benefits it has brought to digital design. I'm just saying that we could move on a little further, but there doesn't seem to be any collective appetite for doing so.
> The trouble of course is synthesis; almost all of that gets lost,
> because those damn flip-flops only understand one clock; and Xilinx
> STILL won't put Reed-Muller gates (choose another self-timed primitive
> if you prefer) on their chips! (Achronix, anyone? Though it could take
> synth tools a while to catch up...)
Achronix looks really interesting, but note that they are currently marketing it as a way to get high performance **while preserving a synchronous design style**. They appear to have found a way to get traditional synthesis tools to work with their technology, but unfortunately their dev kit is way too expensive for idle speculative experimentation so I know no more than what it says on their web site.

Handshake Solutions offer a CSP-like language "Haste" that can be synthesised to asynchronous hardware (using Muller C-elements and various other tricks, I believe) but it seems far-fetched to imagine FPGAs being a viable target any time soon.
> So we have to re-invent our composition mechanisms again in
> excruciatingly low level detail.
Yes. Don't hold your breath waiting for that to change :-(
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Reply by jleslie48 January 27, 2009
On Jan 27, 9:18 am, Andreas Ehliar <ehliar-nos...@isy.liu.se> wrote:
> On 2009-01-27, jleslie48 <j...@jonathanleslie.com> wrote:
>
>> I've got no choice. I have a deliverable to a customer, and
>> failure is not an option. My project worked perfectly fine on a
>> DSP running C code, but the customer has dictated that it
>> must run in a pure FPGA environment. So now I'm on the hook
>> to get it to run on a FPGA.
>
> Hi,
>
> have you considered simply putting a small processor in an FPGA?
> If you have a large amount of C code you could use a 32-bit FPGA
> optimized processor like MicroBlaze if you are using Xilinx or
> Nios II if you are using Altera. (You will need to buy the EDK
> if you are using Xilinx, but it sounds like this will save you
> a lot of time right now so it is probably worth it.)
>
> If space is at a premium in the FPGA you could use an 8-bit
> processor optimized for FPGAs such as the PicoBlaze. This will
> allow you to create most of your project in an environment in
> which you are comfortable. (And as you have probably already
> noticed, low speed string processing is substantially easier
> to do in software.) There is no licensing fee for PicoBlaze
> as far as I know.
>
> You may still have to create some hardware to interface your
> processor with the outside world, but if your core algorithms
> are written in C I suspect that a processor is the most efficient
> way to get this done.
>
> If you have some performance problems you can move parts
> of your algorithm into hardware, but hopefully you won't
> need that.
>
> /Andreas
Considered and dismissed. My project is to piggy-back on an existing platform and no processor is available. Plus my real processing will be on some A/D boards' data; I'm not even going to talk to those until I get the basics down. Suffice it to say that my C code was already too slow and my new constraints dictate a 100x improvement in speed. C is out.
Reply by jleslie48 January 27, 2009
"...but it takes some courage (or
confidence) to do that in public. "

I've got no choice. I have a deliverable to a customer, and
failure is not an option. My project worked perfectly fine on a
DSP running C code, but the customer has dictated that it
must run in a pure FPGA environment.  So now I'm on the hook
to get it to run on a FPGA.


Anyway, some quickies,

I'm looking at the source code for the bucket brigade fifo, and
I can't even determine where the 16 characters are stored.

here's the source to the fifo:
http://grace.evergreen.edu/dtoi/arch06w/asm/KCPSM3/VHDL/bbfifo_16x8.vhd

and I "think" this is the storage:

 -- SRL16E data storage

  data_width_loop: for i in 0 to 7 generate
  --
  attribute INIT : string;
  attribute INIT of data_srl : label is "0000";
  --
  begin

     data_srl: SRL16E
     --synthesis translate_off
     generic map (INIT => X"0000")
     --synthesis translate_on
     port map(   D => data_in(i),
                CE => valid_write,
               CLK => clk,
                A0 => pointer(0),
                A1 => pointer(1),
                A2 => pointer(2),
                A3 => pointer(3),
                 Q => data_out(i) );

  end generate data_width_loop;


so here are the quickie questions:

1) what's a LUT?
2) what do you use the reserved words ATTRIBUTE and LABEL for?
3) what is that GENERIC MAP thing and what does (INIT => X"0000")
   mean?
4) INIT is the variable name, right, not a reserved/library word?
5) what about STRING?
6) half_full looks at pointer(3) (the highest order bit of pointer),
   which is my first clue of a 16-byte buffer for all values of pointer
   (signal pointer : std_logic_vector(3 downto 0);). Where the high
   order bit is set we know that we have 8 or more characters in the
   buffer. But where does data_out take on the value of the character
   to be sent? I don't see any mechanism for the bucket brigade to
   pass along the buckets (and to continue with the analogy, throw
   "data_out" onto the fire).
7) I imagine 1-6 will answer this, but should I want this buffer to be
   larger, I'm guessing its design is adaptable to 32, 64, 128, ...
   aka only power-of-2 sizes, so how would I go about expanding that
   buffer?





Reply by Jonathan Bromley January 27, 2009
On Tue, 27 Jan 2009 06:11:12 -0800 (PST), jleslie48 wrote:

>I see EE guys tremble in fear at a double nested bubble sort
I'm an EE guy (more or less) who knows enough about software to know that a bubble sort is to be mocked rather than feared, but I think I see what you're getting at. Interestingly, the better class of EE is increasingly obliged to get up to speed with quite sophisticated software these days - but primarily in the context of functional verification, i.e. testbenches. Oh, and in the strange twilight world of hardware-dependent software - device drivers and similar arcana - where HW and SW folk must of necessity talk to one another, but generally do so through gritted teeth.

As a by-the-by, sorting is one of the places where hardware and software folk can have a genuinely interesting dialogue. Sorting techniques that work really well in software (heapsort, Shell, radix sorting) simply don't map well to hardware implementation, because of the extremely non-uniform data flow and complicated address calculations that they involve. On the other hand, there are some fascinating parallel sort algorithms (Batcher aka bitonic sort, for example), which map extremely well to hardware provided the scale of the problem isn't too big, and whose parallelism can be exploited to get blindingly fast performance.

I've noted here in the past that, for small data sets, simple insertion sort can easily be implemented in hardware to give O(N) sorting time at the cost of O(N) hardware size; that's a tradeoff that you probably don't wish to make in software, ever, because the O(N) size is sure to pop up as another O(N) time instead, giving the insertion sort O(N^2) performance - which is just as cruddy as bubble sort :-)
> here I am with the roles reversed on something as silly as putting
> out the "hello world" message on a screen...
Nothing to be ashamed of. Memory draws a merciful veil over the embarrassments of my early attempts to learn about OOP. It's good to stretch the boundaries of your comfort zone, but it takes some courage (or confidence) to do that in public.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Reply by Andreas Ehliar January 27, 2009
On 2009-01-27, jleslie48 <jon@jonathanleslie.com> wrote:
> I've got no choice. I have a deliverable to a customer, and
> failure is not an option. My project worked perfectly fine on a
> DSP running C code, but the customer has dictated that it
> must run in a pure FPGA environment. So now I'm on the hook
> to get it to run on a FPGA.
Hi,

have you considered simply putting a small processor in an FPGA? If you have a large amount of C code you could use a 32-bit FPGA optimized processor like MicroBlaze if you are using Xilinx or Nios II if you are using Altera. (You will need to buy the EDK if you are using Xilinx, but it sounds like this will save you a lot of time right now so it is probably worth it.)

If space is at a premium in the FPGA you could use an 8-bit processor optimized for FPGAs such as the PicoBlaze. This will allow you to create most of your project in an environment in which you are comfortable. (And as you have probably already noticed, low speed string processing is substantially easier to do in software.) There is no licensing fee for PicoBlaze as far as I know.

You may still have to create some hardware to interface your processor with the outside world, but if your core algorithms are written in C I suspect that a processor is the most efficient way to get this done.

If you have some performance problems you can move parts of your algorithm into hardware, but hopefully you won't need that.

/Andreas