Inferring multiple-DSP48 pipelined multiplier in VHDL

Started by Robin Bruce July 3, 2006
Robin Bruce wrote:

> Martin,
>
> >>Have you had a look in FPGA editor to see what's going on?
>
> This is where I myself look dim: I did open up the NCD file in the FPGA
> Editor. I didn't really know what to do to tell if the right
> registering was occurring. All I could see was that all 4 DSP48s were
> instantiated together in a little row. I've never used FPGA editor
> before. I'm more familiar with PlanAhead for looking at that sort of
> thing, but I don't have that on my laptop, my current working platform.
>
> >>Is it actually this bit of code that limits the timing?
>
> Well, all I can say is that I don't think so. It could very well be
> though, but I've tried writing the VHDL in very different ways, guided
> by things I've found in one or two guides to instantiating the DSP48s
> in VHDL. Every way I write the VHDL, the same performance is obtained.
> The thing is that I can see that the synthesis tool is making some kind
> of effort to pipeline the thing.
>
> This is the critical path that comes out of the synthesis report if
> this means anything to anyone:
>
> Data Path: mult_inst/Mmult__n00001 to mult_inst/Mmult__n0000_35
>                                   Gate     Net
>     Cell:in->out        fanout   Delay   Delay  Logical Name (Net Name)
>     ----------------------------------------  ------------
>     DSP48:CLK->PCOUT47       1   4.399   0.000  mult_inst/Mmult__n00001
> (mult_inst/Mmult__n00002_PCIN_to_mult_inst/Mmult__n00001_PCOUT_47)
>     DSP48:PCIN47->PCOUT47    1   2.363   0.000  mult_inst/Mmult__n00002
> (mult_inst/Mmult__n00003_PCIN_to_mult_inst/Mmult__n00002_PCOUT_47)
>     DSP48:PCIN47->PCOUT47    1   2.363   0.000  mult_inst/Mmult__n00003
> (mult_inst/Mmult__n00004_PCIN_to_mult_inst/Mmult__n00003_PCOUT_47)
>     DSP48:PCIN47->P35        1   2.270   0.534  mult_inst/Mmult__n00004
> (mult_inst/Mmult__n0000_s_69)
>     FD:D                         0.391          mult_inst/Mmult__n0000_0
>     ----------------------------------------
>     Total                      12.320ns (11.786ns logic, 0.534ns route)
>                                         (95.7% logic, 4.3% route)
>
> Cheers,
>
> Robin
>
Your design is not inferring the P register, so the adders are combinatorial. The adders get connected in a daisy chain. You may have to recode your RTL to reflect that, as the synthesizer is not really smart enough to push around the registers to the degree necessary to deal with differing latencies among the adder inputs.
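
As a quick sanity check on the quoted synthesis report, the per-cell delays can be added up to confirm that the path runs through all four DSP48 adders with no register in between. The short Python sketch below only re-does the report's arithmetic; the variable names are illustrative and the figures are copied from the report as posted.

```python
# Critical path from the quoted XST report: (cell arc, gate delay ns, net delay ns).
path = [
    ("DSP48:CLK->PCOUT47",    4.399, 0.000),
    ("DSP48:PCIN47->PCOUT47", 2.363, 0.000),
    ("DSP48:PCIN47->PCOUT47", 2.363, 0.000),
    ("DSP48:PCIN47->P35",     2.270, 0.534),
    ("FD:D",                  0.391, 0.000),  # final flip-flop, listed in the gate column
]

logic_ns = sum(gate for _, gate, _ in path)
route_ns = sum(net for _, _, net in path)
total_ns = logic_ns + route_ns
fmax_mhz = 1000.0 / total_ns  # period in ns -> frequency in MHz

print(f"logic {logic_ns:.3f} ns, route {route_ns:.3f} ns, total {total_ns:.3f} ns")
print(f"logic share {100 * logic_ns / total_ns:.1f} %, max clock about {fmax_mhz:.1f} MHz")
```

The three PCIN-to-PCOUT/P hops account for roughly 7 ns of the 12.32 ns total, which is consistent with the cascade adders being chained combinatorially between one clocked DSP48 and the output flip-flop.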
OK,

so if I've understood this properly, instead of registering the inputs
by differing amounts in order to account for the PREG cascade at the
outputs, everything is getting (pointlessly) registered by the same
amount at the input, and then summed together combinatorially at the
outputs of the DSP48s, followed by some more pointless registering at
the output?
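
To make the "differing amounts" point concrete, here is a small behavioural model in Python (not VHDL, and not what XST generates). It assumes the usual decomposition of a signed 35x35 multiply into four 18x18 partial products with 17-bit shifts on the cascade, and it models only the P register in each stage; the A, B and M registers and the 48-bit width of the real DSP48 are left out. The names Delay, split35 and mult35_pipelined are made up for this sketch.

```python
MASK17 = (1 << 17) - 1

class Delay:
    """Chain of `depth` registers; step() clocks a value in and returns the oldest."""
    def __init__(self, depth, fill):
        self.regs = [fill] * depth

    def step(self, value):
        self.regs.append(value)
        return self.regs.pop(0)

def split35(x):
    """Signed 35-bit value -> (signed upper 18 bits, unsigned lower 17 bits)."""
    return x >> 17, x & MASK17

def mult35_pipelined(pairs):
    """One (a, b) pair per clock; every adder in the cascade has its own P register."""
    # Stage k sees the cascade one clock later than stage k-1, so its operands
    # need one more register of delay: 0, 1, 2 and 3 clocks respectively.
    in2, in3, in4 = Delay(1, (0, 0)), Delay(2, (0, 0)), Delay(3, (0, 0))
    # The low result bits leave the chain early and must wait for the top bits.
    lo1, lo3 = Delay(3, 0), Delay(1, 0)
    p1 = p2 = p3 = p4 = 0
    latency = 4
    out = []
    for a, b in list(pairs) + [(0, 0)] * latency:   # extra clocks flush the pipe
        # Values visible at the register outputs during this clock.
        r1 = lo1.step(p1 & MASK17)                  # product bits 16:0
        r3 = lo3.step(p3 & MASK17)                  # product bits 33:17
        out.append((p4 << 34) + (r3 << 17) + r1)    # p4 holds bits 69:34
        ah, al = split35(a)
        bh, bl = split35(b)
        ah2, bl2 = in2.step((ah, bl))
        al3, bh3 = in3.step((al, bh))
        ah4, bh4 = in4.step((ah, bh))
        # Clock edge: all four P registers update from the old values.
        p1, p2, p3, p4 = (al * bl,
                          (p1 >> 17) + ah2 * bl2,   # 17-bit shifted cascade input
                          p2 + al3 * bh3,
                          (p3 >> 17) + ah4 * bh4)
    return out[latency:]

samples = [(3, 5), (-2**34, -2**34), (2**34 - 1, -2**34), (-12345678901, 9876543210)]
print(mult35_pipelined(samples) == [a * b for a, b in samples])
```

If the three input delay lines are all given the same depth, the partial products of different samples get added together, which is why a tool that registers every input by the same amount cannot also use the P registers in the chain.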

My interest in this is fairly academic really... The design from which
this question arose has a 35x35 multiplier generated using CoreGen, so
it works fine. It would be useful from a design methodology perspective
to have the ability to infer the DSP48s in such a simple manner. It
would make autogenerating VHDL for different pipelined multiplier
structures a walk in the park. I'm not necessarily of the opinion that
this should be possible today, just that it would be really nice and
when I came across sources that suggested it was possible to do this in
a high performance manner I decided to try it. (XtremeDSP for Virtex-4
FPGAs User Guide, www.xilinx.com/bvdocs/userguides/ug073.pdf,
and Philippe Garrault, "Accelerate design performance with HDL coding
practices")

I've been really impressed with both the informal (via Ben Jones) and
formal (via tech support) reaction to me bringing this up with Xilinx,
so even if this has all come from me being a bit naive about the
capabilities of XST, there seem to be people in Xilinx who believe that
we should be able to be so naive and expect good performance at the
same time.

Still very much a beginner,

Robin

Ray Andraka wrote:
> Your design is not inferring the P register, so the adders are
> combinatorial. The adders get connected in a daisy chain. You may have
> to recode your RTL to reflect that, as the synthesizer is not really
> smart enough to push around the registers to the degree necessary to
> deal with differing latencies among the adder inputs.
Guys,

Given feedback I've had from Xilinx, it seems that this is something
that should be possible today but isn't. Apparently a bugfix has been
made that should fix this and it will be released with ISE 9.1.

Robin

It may take a little effort and time to infer complex things using RTL,
but the simulation performance has always been well worth the effort
for me. Simulating the instantiated primitives is _very_ slow compared
to RTL.

Andy


MM wrote:
> Robin,
>
> IMHO, trying to get inferring of anything more complex than a flip-flop, or
> perhaps an adder, to work is a waste of time. Just instantiate what you
> need...
>
> /Mikhail