Fast 28x28 multiplier + adder in Virtex4

Started by gretzteam February 24, 2005
Hi,
We are using a Virtex-4 FPGA to prototype a DSP processor to be
implemented in an ASIC. We are using the ISE flow and everything works
fine except that we can't prototype at full speed. We are only able to
run at about 65MHz, which is far from the 150MHz target. The longest
combinational path is in the MAC, which contains a 28x28 multiplier
followed by a 56x56 adder. I created the multiplier and the adder
using Core Generator.

Is there a way to speed this up? The Virtex-4 has those XtremeDSP
slices, but I can't find a way to make good use of them, since our
datapath is so large.

Thank you,
David

gretzteam wrote:

> We are using a Virtex-4 FPGA to prototype a DSP processor to be
> implemented in an ASIC. We are using the ISE flow and everything works
> fine except that we can't prototype at full speed. We are only able to
> run at about 65MHz, which is far from the 150MHz target. The longest
> combinational path is in the MAC, which contains a 28x28 multiplier
> followed by a 56x56 adder. I created the multiplier and the adder
> using Core Generator.
>
> Is there a way to speed this up? The Virtex-4 has those XtremeDSP
> slices, but I can't find a way to make good use of them, since our
> datapath is so large.

Virtex4 has 18x18 multiplier hardware. Your 28x28 may be made from them, but you need to pipeline it, and also add a pipeline stage before the adder. I would guess that gets you to 150MHz, but you will have to try it to find out.

-- glen
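To make the "28x28 made from 18x18 multipliers" idea concrete, here is a minimal arithmetic sketch in Python. It is only a software model, not what Core Generator actually produces. It assumes unsigned operands and assumes each hardware multiplier is 18x18 signed, so each unsigned chunk is limited to 17 bits; the function name `mul28_from_partials` is made up for illustration.

```python
# Sketch: build an unsigned 28x28 multiply from four narrower multiplies.
# Assumption: each hardware multiplier is 18x18 signed, so unsigned
# operands are split into chunks of at most 17 bits.

CHUNK = 17
MASK = (1 << CHUNK) - 1


def mul28_from_partials(a, b):
    """Multiply two unsigned 28-bit values using four partial products."""
    assert 0 <= a < (1 << 28) and 0 <= b < (1 << 28)

    # Split each operand into a 17-bit low part and an 11-bit high part.
    a_lo, a_hi = a & MASK, a >> CHUNK
    b_lo, b_hi = b & MASK, b >> CHUNK

    # Each of these products fits in one 18x18 signed multiplier.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi

    # Shift-and-add the partial products. In hardware, each addition is a
    # natural place to put a pipeline register.
    return p_ll + ((p_lh + p_hl) << CHUNK) + (p_hh << (2 * CHUNK))


if __name__ == "__main__":
    x, y = 0x0ABCDEF, 0x0123456
    print(mul28_from_partials(x, y) == x * y)
```

The four partial products and their shifted sums are exactly the pieces that would each sit behind a register in a pipelined implementation.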
gretzteam wrote:
> Hi,
> We are using a Virtex-4 FPGA to prototype a DSP processor to be
> implemented in an ASIC. We are using the ISE flow and everything works
> fine except that we can't prototype at full speed. We are only able to
> run at about 65MHz, which is far from the 150MHz target. The longest
> combinational path is in the MAC, which contains a 28x28 multiplier
> followed by a 56x56 adder. I created the multiplier and the adder
> using Core Generator.
>
> Is there a way to speed this up? The Virtex-4 has those XtremeDSP
> slices, but I can't find a way to make good use of them, since our
> datapath is so large.
>
> Thank you,
> David

If you use the XtremeDSP slices properly, with all of their dedicated interconnects, you should be able to do a 34x34 multiply using 4 pipelined slices at full rate (450-500MHz, depending upon part speed). You might need an extra two slices to do the 56-bit accumulate. Look for the "XtremeDSP Design Considerations" guide on the Xilinx site; it describes how to do this.

I'm not sure exactly what CoreGen is producing, but it might not be completely optimized. It might be using CLB fabric for some of the operations.

-Kevin

Right now I'm not using anything fancy. I created a 28x28 multiplier
and a 56x56 adder with coregen and wired them together. I used the
multiplier component and it is supposed to use the XtremeDSP slices.
Maybe it is not wise enough to make use of other dedicated
interconnects. I will look at this "XtremeDSP Design Considerations" guide.
Thank you,
David

> Right now I'm not using anything fancy. I created a 28x28 multiplier
Pipelining is the magic word (Coregen calls it registered inputs and outputs).

Regards
Falk

I can't really use pipelining here. The MAC is all combinational; I
receive inputs at time 0, and I need an answer by time x. I don't see
how pipelining would help.
Thanks,
Dave

gretzteam wrote:
> I can't really use pipelining here. The MAC is all combinational; I
> receive inputs at time 0, and I need an answer by time x. I don't see
> how pipelining would help.

What is x?

If x is one clock cycle then you need either faster logic or a lot more of it. I believe this can be done easily with a three cycle pipeline, so that you get an answer out every cycle, with each one taking three cycles.

-- glen
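The distinction between "an answer every cycle" and "each answer takes three cycles" can be put in numbers. The Python below is a back-of-the-envelope sketch, not a timing analysis. It uses the 150MHz target from the original post and assumes an ideal pipeline that accepts one new input per clock; `pipeline_timing` is a hypothetical helper name.

```python
def pipeline_timing(clock_mhz, stages):
    """Return (latency_ns, results_per_second) for a fully pipelined unit.

    Assumes one new input is accepted on every clock cycle.
    """
    period_ns = 1000.0 / clock_mhz        # one clock period in nanoseconds
    latency_ns = stages * period_ns       # time from input to its result
    results_per_second = clock_mhz * 1e6  # one result per cycle once full
    return latency_ns, results_per_second


if __name__ == "__main__":
    # Compare an unpipelined unit with a three-stage one at the same clock.
    for stages in (1, 3):
        latency_ns, rate = pipeline_timing(150.0, stages)
        print(f"{stages} stage(s): latency {latency_ns:.1f} ns, "
              f"{rate:.0f} results/s")
```

The clock frequency does not change between the two cases; only the latency grows. What the pipeline buys is a shorter combinational path per stage, which is what lets the design reach the target clock in the first place.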
Hi,
I guess I don't understand something about pipelining. In my case, the
whole system runs at master clock, which I would like to be 100MHz or
more. Right now, the whole MAC unit is combinational logic and needs
to produce an answer for each clock cycle (time x=1/100MHz). Are you
guys saying that if I ran the MAC at 3 times the master clock
(300MHz) with a three stage pipeline, I could compute the answer fast
enough?

Thanks,
David

glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote in message news:<cvo67a$828$1@gnus01.u.washington.edu>...
> gretzteam wrote:
> > I can't really use pipelining here. The MAC is all combinational; I
> > receive inputs at time 0, and I need an answer by time x. I don't see
> > how pipelining would help.
>
> What is x?
>
> If x is one clock cycle then you need either faster logic or
> a lot more of it. I believe this can be done easily with a
> three cycle pipeline, so that you get an answer out every cycle,
> with each one taking three cycles.
>
> -- glen

David wrote:
> Hi,
> I guess I don't understand something about pipelining. In my case, the
> whole system runs at master clock, which I would like to be 100MHz or
> more. Right now, the whole MAC unit is combinational logic and needs
> to produce an answer for each clock cycle (time x=1/100MHz). Are you
> guys saying that if I ran the MAC at 3 times the master clock
> (300MHz) with a three stage pipeline, I could compute the answer fast
> enough?

Howdy David,

Using different terms, let's try another analogy on this Saturday: imagine an automobile assembly line. It puts out a certain number of cars per hour. If you add another step in the assembly process, you can still get the same number of cars per hour out; it just takes a little longer for each car to roll off the assembly line.

Circuits work the same way. If your main requirement is to be able to handle a certain number of calculations per second, you can possibly break the calculations up into smaller parts which are easier to do in series: rather than doing a multiply and an accumulate in the same cycle, do the multiply in one cycle and the addition in the next cycle. While the accumulation is occurring during this 2nd cycle, the 2nd piece of data is being multiplied. On the 3rd cycle, the 2nd piece of data is now in the accumulator and a 3rd piece of data enters the multiplier.

You get the same number of calculations per second out of the circuit (or perhaps even more, since you can meet timing now!), but each result takes 20 ns rather than 10 ns. If you can't stand the extra delay, then you may need to up the clock rate (and then you will sure enough have to pipeline!).

Hope that helps,
Marc
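The multiply-then-accumulate schedule described above can be checked with a small cycle-by-cycle model. This Python sketch is only an illustration, not RTL. It assumes the only feedback is the accumulator register itself and that the input pairs do not depend on the previous accumulated result; the function names are made up.

```python
def mac_combinational(pairs):
    """Reference: multiply and accumulate within the same cycle."""
    acc = 0
    history = []
    for a, b in pairs:
        acc += a * b
        history.append(acc)
    return history


def mac_pipelined(pairs):
    """Two-stage version: one cycle multiplies, the next accumulates."""
    product_reg = None  # pipeline register between multiplier and adder
    acc = 0
    history = []
    # One extra cycle (None input) flushes the last product through.
    for item in list(pairs) + [None]:
        # Stage 2: accumulate the product registered on the previous cycle.
        if product_reg is not None:
            acc += product_reg
            history.append(acc)
        # Stage 1: multiply the new inputs and register the product.
        product_reg = item[0] * item[1] if item is not None else None
    return history


if __name__ == "__main__":
    data = [(3, 4), (5, 6), (7, 8)]
    print(mac_combinational(data) == mac_pipelined(data))
```

Both versions produce the same running sums; the pipelined one delivers each of them one cycle later. Under these assumptions, only the accumulator adder sits inside the feedback loop, so the multiplier can be pipelined without changing the results.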
Hi,
I understand what you mean. However, I don't think it works in my case
because I have a loop (it is a MAC). In order to start the next
calculation, I need an answer to the previous one. I guess the only
solution is faster logic. I thought that a Virtex-4 would be able to
give us that kind of calculation speed...

Dave