Reply by glen herrmannsfeldt●January 28, 2009
Martin Thompson <martin.j.thompson@trw.com> wrote:
(snip)
> Which is exactly the point - if you do know things about parallelism,
> the tools need to let you express that to them in an easy and
> intuitive fashion. I also agree with Jonathan that CSP feels a
> good way to do it (but maybe we're both weird :).
I have worked with systolic array implementations of
dynamic programming algorithms, and they look completely
different from software implementations.
If you want an example, look at the software and hardware
implementations of CRC32. In software it can be very
easily done a byte at a time with a 256 word lookup table.
The hardware (high speed) implementations are completely
different because what is available and fast is completely
different.
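For illustration (not from glen's post), the software approach he describes can be sketched in a few lines of C: a 256-entry table indexed by one input byte at a time, here for the usual reflected CRC-32 polynomial 0xEDB88320.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time CRC-32 (reflected form, polynomial 0xEDB88320).
   One 256-entry table lookup replaces eight bit-serial steps. */
static uint32_t crc_table[256];

static void crc32_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}

static uint32_t crc32(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;   /* standard initial value */
    while (len--)
        c = crc_table[(c ^ *buf++) & 0xFFu] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;     /* final inversion */
}
```

Running it over the standard check string "123456789" gives 0xCBF43926. A high-speed hardware implementation looks nothing like this: it typically unrolls the polynomial into a wide XOR network that consumes 32 or more bits per clock, which is exactly glen's point about the two looking completely different.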
> Sure - we all *want* to let the tools rip and have them do a good job,
> but I think that many algorithms will make better use of parallelism
> with some hints given to the tools about how to do it.
-- glen
Reply by Martin Thompson●January 28, 2009
"HT-Lab" <hans64@ht-lab.com> writes:
> "Jonathan Bromley" <jonathan.bromley@MYCOMPANY.com> wrote in message
>> What could be more
>> natural than to say (or think) "Do XYZ; but while
>> you're doing it, do as much of ABC as you can do
>> without knowing the results of XYZ"?
>
> OK, if you have this information then fine, pass it on to the tool. However,
> I believe that in most cases you just want to give the tool some
> performance/area constraints and let it rip on your code.
>
Which is exactly the point - if you do know things about parallelism,
the tools need to let you express that to them in an easy and
intuitive fashion. I also agree with Jonathan that CSP feels a
good way to do it (but maybe we're both weird :).
Sure - we all *want* to let the tools rip and have them do a good job,
but I think that many algorithms will make better use of parallelism
with some hints given to the tools about how to do it.
Cheers,
Martin
--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
Reply by Brian Drummond●January 28, 2009
On Wed, 28 Jan 2009 11:36:45 +0000, Jonathan Bromley
<jonathan.bromley@MYCOMPANY.com> wrote:
>On Wed, 28 Jan 2009 00:35:26 +0000, Brian Drummond wrote:
>
>>In one sense, signals are already very similar to occam's
>>channels, in that their events communicate synchronisation.
>
>For sure, but there's a big difference: signals are
>broadcast and non-negotiated. occam channels handshake
>(rendezvous) between a single source and a single sink.
Thanks for an excellent summary of stuff I was hazy on.
>To do that in HDL requires at least two signals, one in
>each direction, with all the fuss and poor encapsulation
>that entails.
Signals in both directions again...
>Of course, the Ada task entry rendezvous does all that
>occam channels do, and more; it's something I sorely
>miss in HDLs, especially when writing testbenches.
Heh, maybe we need Ada2Hardware rather than C2Hardware.
VHDL might give us something of a head start there...
> I'm just saying that
>we could move on a little further, but there doesn't
>seem to be any collective appetite for doing so.
I'm not so sure there's no appetite, but the path isn't exactly clear.
We can identify a few shortcomings in the language, but then what?
I'm sure bidirectional elements in record ports (and the reason they
can't be done) have been discussed at length while I was taking a nap...
Another missing feature is "out" generics; I would like an "out" mode
generic on my divider to say its latency is 8 clock cycles (versus 12
for another architecture) and let instantiating blocks adjust their
pipelines automatically.
There are other approaches but I still find myself adjusting pipelines
by hand.
But these two won't get us very far...
>Handshake Solutions offer a CSP-like language "Haste"
>that can be synthesised to asynchronous hardware (using
>Muller C-elements and various other tricks, I believe)
>but it seems far-fetched to imagine FPGAs being a viable
>target any time soon.
Thanks for the pointer in any case.
- Brian
Reply by HT-Lab●January 28, 2009
"Jonathan Bromley" <jonathan.bromley@MYCOMPANY.com> wrote in message
news:0tg0o4tt9prmne7ggbudf427o8cqhc1rld@4ax.com...
> On Wed, 28 Jan 2009 11:29:23 -0000, "HT-Lab" wrote:
>
>>The human brain is not that
>>well suited to think concurrently
>
> I absolutely, fundamentally disagree.
>
> If you're stuck in a purely-sequential straitjacket,
> you are forced to jump through absurd hoops to
> express concurrent activity.
That is not the point; all I am saying is that writing sequential code
is easier than writing concurrent code. If you were asked to develop,
say, an IP stack and the choice of language were yours (ignoring
end-application performance etc.), would you go for VHDL/Verilog or for
C/C++? (Fill in any sequential language you prefer.)
> What could be more
> natural than to say (or think) "Do XYZ; but while
> you're doing it, do as much of ABC as you can do
> without knowing the results of XYZ"?
OK, if you have this information then fine, pass it on to the tool. However,
I believe that in most cases you just want to give the tool some
performance/area constraints and let it rip on your code.
Hans
www.ht-lab.com
> The widespread
> public distaste for concurrent descriptions simply
> reflects the fact that our tools for writing those
> descriptions are poorly matched to people's
> expectations. Above all else, what's needed is
> flexible composition of parallel and sequential
> descriptions, together with clear and intuitive
> ways to express synchronisation. CSP works for
> me, but not (it seems) for everyone. Typical
> real-time operating systems seem to me to make
> a truly lousy job of it, being designed for
> convenience of implementation rather than for
> expressive power.
>
> Just my $0.02.
Reply by Jonathan Bromley●January 28, 2009
On Wed, 28 Jan 2009 11:29:23 -0000, "HT-Lab" wrote:
>The human brain is not that
>well suited to think concurrently
I absolutely, fundamentally disagree.
If you're stuck in a purely-sequential straitjacket,
you are forced to jump through absurd hoops to
express concurrent activity. What could be more
natural than to say (or think) "Do XYZ; but while
you're doing it, do as much of ABC as you can do
without knowing the results of XYZ"? The widespread
public distaste for concurrent descriptions simply
reflects the fact that our tools for writing those
descriptions are poorly matched to people's
expectations. Above all else, what's needed is
flexible composition of parallel and sequential
descriptions, together with clear and intuitive
ways to express synchronisation. CSP works for
me, but not (it seems) for everyone. Typical
real-time operating systems seem to me to make
a truly lousy job of it, being designed for
convenience of implementation rather than for
expressive power.
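[A software-flavoured sketch of that "do XYZ; meanwhile do ABC" idea, using POSIX threads; the names xyz/abc and the dummy computation are purely illustrative, not anyone's actual design:]

```c
#include <pthread.h>

/* XYZ runs in its own thread; ABC's independent part proceeds
   concurrently, and only its dependent part waits for XYZ's result. */
static void *xyz(void *arg)
{
    long *out = arg;
    *out = 40;                    /* stand-in for a long computation */
    return NULL;
}

long do_abc_overlapped(void)
{
    long xyz_result = 0;
    pthread_t t;
    pthread_create(&t, NULL, xyz, &xyz_result);

    long abc_partial = 2;         /* the part of ABC needing no XYZ result */

    pthread_join(t, NULL);        /* rendezvous: XYZ's result is now visible */
    return abc_partial + xyz_result;
}
```

Even this tiny example shows the mismatch: the natural statement of intent is one sentence, but the sequential-language encoding needs explicit thread plumbing and a hand-placed synchronisation point.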
Just my $0.02.
--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com
The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Reply by Jonathan Bromley●January 28, 2009
On Wed, 28 Jan 2009 00:35:26 +0000, Brian Drummond wrote:
>In one sense, signals are already very similar to occam's
>channels, in that their events communicate synchronisation.
For sure, but there's a big difference: signals are
broadcast and non-negotiated. occam channels handshake
(rendezvous) between a single source and a single sink.
To do that in HDL requires at least two signals, one in
each direction, with all the fuss and poor encapsulation
that entails. SystemVerilog's interface construct might
offer a way out, but [self publicity alert] as I'll
discuss in a paper at DVCon next month, interfaces have
many shortcomings that need sorting out before they
are useful for reasonably high-level design.
Of course, the Ada task entry rendezvous does all that
occam channels do, and more; it's something I sorely
miss in HDLs, especially when writing testbenches.
> And that [HDL signals] works well in simulation.
Yes, the discrete-event simulation model is powerful
and highly appropriate for digital simulation at a
fairly low level. No-one in their right mind would
want to jettison that, with the enormous benefits it
has brought to digital design. I'm just saying that
we could move on a little further, but there doesn't
seem to be any collective appetite for doing so.
>The trouble of course is synthesis; almost all of that gets lost,
>because those damn flip-flops only understand one clock; and Xilinx
>STILL won't put Muller C-elements (choose another self-timed primitive
>if you prefer) on their chips! (Achronix, anyone? Though it could take
>synth tools a while to catch up...)
Achronix looks really interesting, but note that they
are currently marketing it as a way to get high performance
** while preserving a synchronous design style **. They
appear to have found a way to get traditional synthesis
tools to work with their technology, but unfortunately
their dev kit is way too expensive for idle speculative
experimentation so I know no more than what it says on
their web site.
Handshake Solutions offer a CSP-like language "Haste"
that can be synthesised to asynchronous hardware (using
Muller C-elements and various other tricks, I believe)
but it seems far-fetched to imagine FPGAs being a viable
target any time soon.
>So we have to re-invent our composition mechanisms again in
>excruciatingly low level detail.
Yes. Don't hold your breath waiting for that to change :-(
--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com
The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Reply by jleslie48●January 27, 2009
On Jan 27, 9:18 am, Andreas Ehliar <ehliar-nos...@isy.liu.se> wrote:
> On 2009-01-27, jleslie48 <j...@jonathanleslie.com> wrote:
>
> > I've got no choice. I have a deliverable to a customer, and
> > failure is not an option. My project worked perfectly fine on a
> > DSP running C code, but the customer has dictated that it
> > must run in a pure FPGA environment. So now I'm on the hook
> > to get it to run on an FPGA.
>
> Hi,
>
> have you considered simply putting a small processor in an FPGA?
> If you have a large amount of C code you could use a 32-bit FPGA
> optimized processor like MicroBlaze if you are using Xilinx or
> Nios II if you are using Altera. (You will need to buy the EDK
> if you are using Xilinx, but it sounds like this will save you
> a lot of time right now so it is probably worth it.)
>
> If space is at a premium in the FPGA you could use an 8-bit
> processor optimized for FPGAs such as the PicoBlaze. This will
> allow you to create most of your project in an environment in
> which you are comfortable. (And as you have probably already
> noticed, low speed string processing is substantially easier
> to do in software.) There is no licensing fee for PicoBlaze
> as far as I know.
>
> You may still have to create some hardware to interface your
> processor with the outside world, but if your core algorithms
> are written in C I suspect that a processor is the most efficient
> way to get this done.
>
> If you have some performance problems you can move parts
> of your algorithm into hardware, but hopefully you won't
> need that.
>
> /Andreas
Considered and dismissed. My project is to piggy-back on an
existing platform and no processor is available. Plus, my real
processing will be on some A/D boards' data; I'm not even going to
talk to those until I get the basics down. Suffice it to say that my
C code was already too slow and my new constraints dictate a 100x
improvement in speed. C is out.
Reply by jleslie48●January 27, 2009
"...but it takes some courage (or
confidence) to do that in public. "
I've got no choice. I have a deliverable to a customer, and
failure is not an option. My project worked perfectly fine on a
DSP running C code, but the customer has dictated that it
must run in a pure FPGA environment. So now I'm on the hook
to get it to run on an FPGA.
Anyway, some quickies.
I'm looking at the source code for the bucket-brigade FIFO, and
I can't even determine where the 16 characters are stored.
Here's the source to the FIFO:
http://grace.evergreen.edu/dtoi/arch06w/asm/KCPSM3/VHDL/bbfifo_16x8.vhd
and I "think" this is the storage:
-- SRL16E data storage
data_width_loop: for i in 0 to 7 generate
--
attribute INIT : string;
attribute INIT of data_srl : label is "0000";
--
begin
  data_srl: SRL16E
    --synthesis translate_off
    generic map (INIT => X"0000")
    --synthesis translate_on
    port map( D   => data_in(i),
              CE  => valid_write,
              CLK => clk,
              A0  => pointer(0),
              A1  => pointer(1),
              A2  => pointer(2),
              A3  => pointer(3),
              Q   => data_out(i) );
end generate data_width_loop;
So here are the quickie questions:
1) What's a LUT?
2) What do you use the reserved words ATTRIBUTE and LABEL for?
3) What is that GENERIC MAP thing, and what does (INIT => X"0000")
mean?
4) INIT is the variable name, right - not a reserved/library word?
5) What about STRING?
6) half_full looks at pointer(3) (the highest-order bit of pointer),
which is my first clue of a 16-byte buffer for all values of pointer
(signal pointer : std_logic_vector(3 downto 0);). Where the high-order
bit is set we know that we have 8 or more characters in the buffer.
But where does data_out take on the value of the character to be
sent? I don't see any mechanism for the bucket brigade to pass along
the buckets (and, to continue with the analogy, throw "data_out" onto
the fire).
7) I imagine 1-6 will answer this, but should I want this buffer to
be larger, I'm guessing its design is adaptable to 32, 64, 128, ...
aka only power-of-2 sizes - so how would I go about expanding that
buffer?
Reply by Jonathan Bromley●January 27, 2009
On Tue, 27 Jan 2009 06:11:12 -0800 (PST), jleslie48 wrote:
>I see EE guys tremble in fear at a double nested bubble sort
I'm an EE guy (more or less) who knows enough about software
to know that a bubble sort is to be mocked rather than feared,
but I think I see what you're getting at. Interestingly,
the better class of EE is increasingly obliged to get up to
speed with quite sophisticated software these days - but
primarily in the context of functional verification, i.e.
testbenches. Oh, and in the strange twilight world of
hardware-dependent software - device drivers and similar
arcana - where HW and SW folk must of necessity talk to
one another, but generally do so through gritted teeth.
As a by-the-by, sorting is one of the places where hardware
and software folk can have a genuinely interesting dialogue.
Sorting techniques that work really well in software (heapsort,
Shell, radix sorting) simply don't map well to hardware
implementation, because of the extremely non-uniform data
flow and complicated address calculations that they involve.
On the other hand, there are some fascinating parallel
sort algorithms (Batcher aka bitonic sort, for example),
which map extremely well to hardware provided the scale
of the problem isn't too big, and whose parallelism can
be exploited to get blindingly fast performance. I've
noted here in the past that, for small data sets, simple
insertion sort can easily be implemented in hardware
to give O(N) sorting time at the cost of O(N) hardware
size; that's a tradeoff that you probably don't wish
to make in software, ever, because the O(N) size is
sure to pop up as another O(N) time instead, giving
the insertion sort O(N^2) performance - which is
just as cruddy as bubble sort :-)
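[For reference, the Batcher/bitonic sort mentioned above can be sketched as a plain C model of the sorting network - a software stand-in, not a hardware implementation; n must be a power of two:]

```c
#include <stddef.h>

/* Compare-and-swap: the network's only primitive; dir=1 means ascending. */
static void cas(int *a, size_t i, size_t j, int dir)
{
    if ((a[i] > a[j]) == dir) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Merge a bitonic sequence a[lo..lo+n-1] into sorted order. */
static void bitonic_merge(int *a, size_t lo, size_t n, int dir)
{
    if (n > 1) {
        size_t m = n / 2;
        for (size_t i = lo; i < lo + m; i++)
            cas(a, i, i + m, dir);
        bitonic_merge(a, lo, m, dir);
        bitonic_merge(a, lo + m, m, dir);
    }
}

/* Sort a[lo..lo+n-1]; n must be a power of two. Every cas() in one
   round touches disjoint pairs, so in hardware each round becomes one
   layer of comparators: O(log^2 N) layers in all. */
void bitonic_sort(int *a, size_t lo, size_t n, int dir)
{
    if (n > 1) {
        size_t m = n / 2;
        bitonic_sort(a, lo, m, 1);       /* ascending half  */
        bitonic_sort(a, lo + m, m, 0);   /* descending half */
        bitonic_merge(a, lo, n, dir);
    }
}
```

The recursion fixes the comparator positions at compile time - no data-dependent addressing at all - which is precisely why this family maps so well to hardware and heapsort does not.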
> here I am with the roles reversed on something as
> silly as putting out the "hello world" message on a screen...
Nothing to be ashamed of. Memory draws a merciful
veil over the embarrassments of my early attempts to
learn about OOP. It's good to stretch the boundaries
of your comfort zone, but it takes some courage (or
confidence) to do that in public.
--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com
The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Reply by Andreas Ehliar●January 27, 2009
On 2009-01-27, jleslie48 <jon@jonathanleslie.com> wrote:
> I've got no choice. I have a deliverable to a customer, and
> failure is not an option. My project worked perfectly fine on a
> DSP running C code, but the customer has dictated that it
> must run in a pure FPGA environment. So now I'm on the hook
> to get it to run on an FPGA.
Hi,
have you considered simply putting a small processor in an FPGA?
If you have a large amount of C code you could use a 32-bit FPGA
optimized processor like MicroBlaze if you are using Xilinx or
Nios II if you are using Altera. (You will need to buy the EDK
if you are using Xilinx, but it sounds like this will save you
a lot of time right now so it is probably worth it.)
If space is at a premium in the FPGA you could use an 8-bit
processor optimized for FPGAs such as the PicoBlaze. This will
allow you to create most of your project in an environment in
which you are comfortable. (And as you have probably already
noticed, low speed string processing is substantially easier
to do in software.) There is no licensing fee for PicoBlaze
as far as I know.
You may still have to create some hardware to interface your
processor with the outside world, but if your core algorithms
are written in C I suspect that a processor is the most efficient
way to get this done.
If you have some performance problems you can move parts
of your algorithm into hardware, but hopefully you won't
need that.
/Andreas