FPGARelated.com
Forums

faster Spartan III adder

Started by Paul Smith June 7, 2005
Hi,

I need to add a pair of 8 bit (unsigned) integers to get a 9 bit 
(unsigned) result at 250 MHz, preferably in an XC3S50-4.

Using the Coregen adder/subtractor V7 with maximum pipelining (9) and 
RPM on, the best cycle time I can get is 4.55 ns.  At each pipeline 
level the critical path is a LUT, a MUXCY, and another LUT.

Can anyone point me at some hints for a faster implementation (besides 
going to a faster part)?

TIA

Paul Smith
Indiana University Physics

One possible solution is simple: do the addition twice in parallel and
multiplex the results back into one stream. Each adder then runs at
125 MHz, and the output mux should be a single LUT, so it works at
250 MHz without problems :)

antti3cents

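A minimal VHDL sketch of the two-adder idea described above, under stated assumptions: the entity name `interleaved_adder`, the `phase` toggle and the `a0/b0/a1/b1/s0/s1` signal names are all hypothetical and not from the original posts, and the code has not been run through synthesis or timed on an XC3S50-4. Even samples go to one adder and odd samples to the other, so each sum has two 250 MHz clock periods to settle; the output stage is a registered 2:1 mux.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical sketch: two adders, each given two clock periods,
-- with the results interleaved back into one full-rate stream.
entity interleaved_adder is
  port (
    clk : in  std_logic;                      -- full-rate clock (250 MHz target)
    a   : in  std_logic_vector(7 downto 0);
    b   : in  std_logic_vector(7 downto 0);
    c   : out std_logic_vector(8 downto 0));
end interleaved_adder;

architecture rtl of interleaved_adder is
  signal phase          : std_logic := '0';   -- toggles every clock
  signal a0, b0, a1, b1 : unsigned(7 downto 0) := (others => '0');
  signal s0, s1         : unsigned(8 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      phase <= not phase;
      if phase = '0' then
        -- Bank 0: latch the sum of the operands captured two clocks ago,
        -- then capture a new pair of operands.
        s0 <= resize(a0, 9) + resize(b0, 9);
        a0 <= unsigned(a);
        b0 <= unsigned(b);
        -- Output mux: s1 was updated on the previous clock edge.
        c  <= std_logic_vector(s1);
      else
        -- Bank 1: same as bank 0, one clock later.
        s1 <= resize(a1, 9) + resize(b1, 9);
        a1 <= unsigned(a);
        b1 <= unsigned(b);
        -- Output mux: s0 was updated on the previous clock edge.
        c  <= std_logic_vector(s0);
      end if;
    end if;
  end process;
end rtl;
```

The relaxed timing only materializes if the tools are told that the paths from the input banks to `s0`/`s1` are two-cycle paths (a multi-cycle constraint in the UCF); without that they are still analyzed against the full clock period. The result appears on `c` three clock cycles after the edge that captures the inputs, and the results come out in the same order the samples went in.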
(1) Try using timing-driven packing and placement; it might help.
(2) If you can pipeline the result, i.e. accept one more sample of 
latency, that would solve the problem.
But try using plain high-level code rather than the Core Generator.
(3) Split the addition into two parallel ones at 1/2 frequency, and 
multiplex them at full frequency.

Also take a look at what percentage of the delay is contributed by routing :o(

Hope this helps

Vladislav

The trouble is probably your second LUT.  The first LUT feeds the S input of
the carry chain, yes?  This would be the LUT attached directly to the carry
chain.  The second LUT means - for reasons unknown to us - the result is
going through additional logic.  It's this logic that needs to be tweaked a
little.

Since it didn't pass through an XORCY I'm guessing this is the carry-out of
the 8-bit adder?  Look at what else feeds the LUT and try to determine why
the synthesizer wants to add logic to the adder's OUTPUT rather than in the
4-input LUT.


"Paul Smith" <ptsmith@nospam.indiana.edu> wrote in message
news:d84q4e$uga$1@rainier.uits.indiana.edu...

> I need to add a pair of 8 bit (unsigned) integers to get a 9 bit
> (unsigned) result at 250 MHz, preferably in an XC3S50-4.
>
> Using the Coregen adder/subtractor V7 with maximum pipelining (9) and
> RPM on, the best cycle time I can get is 4.55 ns.  At each pipeline
> level the critical path is a LUT, a MUXCY, and another LUT.

Hmm, strange. An 8-bit adder should fit into one level of logic. Make sure
both inputs are registered and placed correctly (close to the carry chain).
The output should be registered too, of course ;-)

OK, I did a quick test using Webpack 7.1.

A plain description reaches 3.995 ns, uhhh, tight timing ;-)
Looking at the floorplanner (after P&R) I see the mess. The registers for my
inputs are placed inside the IOBs. Not bad in general, but bad here, where
we need every fraction of a ns. So I disable the option for placing the
registers into the IOBs and run again.
BINGO! 3.5 ns.

But the automatic P&R tools are lazy bastards. A look at the floorplanner
reveals that the input registers are spread over the chip. OK, handmade is
handmade. We add some LOCs into the UCF. New run.
3.49 ns. Hmm, not too much improvement, but since the placement is fixed
this should be reliable.
See the files below.

Njoy.
Falk

-- VHDL

--------------------------------------------------------------------------------
-- Company:
-- Engineer:
--
-- Create Date:    23:08:28 06/07/05
-- Design Name:
-- Module Name:    top_adder - Behavioral
-- Project Name:
-- Target Device:
-- Tool versions:
-- Description:
--
-- Dependencies:
--
-- Revision:
-- Revision 0.01 - File Created
-- Additional Comments:
--
--------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

---- Uncomment the following library declaration if instantiating
---- any Xilinx primitives in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

entity top_adder is
    Port ( clk : in std_logic;
           a : in std_logic_vector(7 downto 0);
           b : in std_logic_vector(7 downto 0);
           c : out std_logic_vector(8 downto 0));
end top_adder;

architecture Behavioral of top_adder is

signal a_int, b_int : std_logic_vector(7 downto 0);
signal c_int : std_logic_vector(8 downto 0);

begin

process(clk)
begin
  if rising_edge(clk) then
    -- input registers
    a_int <= a;
    b_int <= b;
    -- output register
    c <= c_int;
    -- registered 9-bit sum (bit 8 is the carry-out)
    c_int <= ('0' & a_int) + ('0' & b_int);
  end if;
end process;

end Behavioral;

-- UCF

net clk period=3.5ns;

INST "c_int_0" LOC = "SLICE_X14Y4";
INST "c_int_1" LOC = "SLICE_X14Y4";
INST "c_int_2" LOC = "SLICE_X14Y5";
INST "c_int_3" LOC = "SLICE_X14Y5";
INST "c_int_4" LOC = "SLICE_X14Y6";
INST "c_int_5" LOC = "SLICE_X14Y6";
INST "c_int_6" LOC = "SLICE_X14Y7";
INST "c_int_7" LOC = "SLICE_X14Y7";
INST "c_int_8" LOC = "SLICE_X14Y8";

INST "a_int_0" LOC = "SLICE_X13Y4";
INST "a_int_1" LOC = "SLICE_X13Y4";
INST "a_int_2" LOC = "SLICE_X13Y5";
INST "a_int_3" LOC = "SLICE_X13Y5";
INST "a_int_4" LOC = "SLICE_X13Y6";
INST "a_int_5" LOC = "SLICE_X13Y6";
INST "a_int_6" LOC = "SLICE_X13Y7";
INST "a_int_7" LOC = "SLICE_X13Y7";

INST "b_int_0" LOC = "SLICE_X15Y4";
INST "b_int_1" LOC = "SLICE_X15Y4";
INST "b_int_2" LOC = "SLICE_X15Y5";
INST "b_int_3" LOC = "SLICE_X15Y5";
INST "b_int_4" LOC = "SLICE_X15Y6";
INST "b_int_5" LOC = "SLICE_X15Y6";
INST "b_int_6" LOC = "SLICE_X15Y7";
INST "b_int_7" LOC = "SLICE_X15Y7";

Paul, if you want to be fast, run lean.
You want to add, so pick an adder, not an adder/subtractor.
This design should only take 9 or 10 LUTs, and the carry chain should
be just combinatorial. And you don't have an active carry input to the
LSB. So eliminate that path from the speed analysis.
Try to get the basic functionality (without the routing) as fast as
possible. Then apply some floorplanning.
Peter Alfke, Xilinx.

Falk Brunner wrote:
> "Paul Smith" <ptsmith@nospam.indiana.edu> wrote in message
> news:d84q4e$uga$1@rainier.uits.indiana.edu...
>
>>I need to add a pair of 8 bit (unsigned) integers to get a 9 bit
>>(unsigned) result at 250 MHz, preferably in an XC3S50-4.
>>
>>Using the Coregen adder/subtractor V7 with maximum pipelining (9) and
>>RPM on, the best cycle time I can get is 4.55 ns.  At each pipeline
>>level the critical path is a LUT, a MUXCY, and another LUT.
>
> Hmm, strange. a 8 bit adder should fit into one level of logic. make sure
> both inputs are registered and placed correctly (close to the carry chain).
> The output should be registerd too, of course ;-)
>
> OK, I did a quick test using Webpack 7.1.
>
> A plain description reaches 3.995 ns, uhhh tight timing ;-)
> Looking at the floorplanner (after P&R) I see the mess.The registers for my
> inputs are placed inside the IOBs. Not bad in general, but bad here, where
> we need every fraction of a ns. So I disable the option for placing the
> registers into the IOBs and run again.
> BINGO! 3.5ns.

Or just place pre- and post-registers, or PAR your design as a macro.

Laurent
www.amontec.com

Hi Falk,

What temperature/voltage did you get these results at?  At 85 C and 1.14 
volts I'm getting just over 4 ns in an XC3S50-4.

Paul


You're getting the longer times because you're going through 2 levels of
logic.  The second LUT you mentioned will cause some grief but with proper
constraints even that might be reasonable.  Located right next to each
other, a carry chain feeding a LUT might get the timing you need but you
need to coerce the tools.

Falk's results are most definitely from a carry chain without the extra
logic level.


Or code the thing so you don't have that output level of logic.
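
To illustrate the point about the output level of logic: a sketch, assuming the extra LUT comes from some post-processing of the sum. The `invert` control used here is invented purely for illustration; the thread does not say what the extra logic in Paul's design actually is. The entity and signal names are placeholders too, and the code has not been synthesized. The sum is registered straight off the carry chain, and the extra logic gets a pipeline stage of its own.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical sketch: keep post-adder logic out of the adder's own
-- register-to-register path by giving it a separate pipeline stage.
entity adder_then_post is
  port (
    clk    : in  std_logic;
    a      : in  std_logic_vector(7 downto 0);
    b      : in  std_logic_vector(7 downto 0);
    invert : in  std_logic;                   -- made-up control input
    c      : out std_logic_vector(8 downto 0));
end adder_then_post;

architecture rtl of adder_then_post is
  signal a_r, b_r      : unsigned(7 downto 0) := (others => '0');
  signal sum_r         : unsigned(8 downto 0) := (others => '0');
  signal inv_r, inv_rr : std_logic := '0';
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: register the inputs close to the carry chain.
      a_r   <= unsigned(a);
      b_r   <= unsigned(b);
      inv_r <= invert;

      -- Stage 2: the adder only; the sum is registered directly.
      sum_r  <= resize(a_r, 9) + resize(b_r, 9);
      inv_rr <= inv_r;                        -- keep the control aligned with the sum

      -- Stage 3: the extra logic, one LUT level per output bit.
      if inv_rr = '1' then
        c <= not std_logic_vector(sum_r);
      else
        c <= std_logic_vector(sum_r);
      end if;
    end if;
  end process;
end rtl;
```

This costs one extra cycle of latency, but each register-to-register path then contains either the carry chain or the single LUT, not both.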
"Paul Smith" <ptsmith@nospam.indiana.edu> wrote in message
news:d89k4c$hpl$1@rainier.uits.indiana.edu...
> Hi Falk,
>
> What temperature/voltage did you get these results at?  At 85 C and 1.14
> volts I'm getting just over 4 ns in an XC3S50-4

OK, my 3.5 ns was with a -4 device. With -5, 85 C and 1.14 V it's 3.98 ns. Hmm.
Maybe it's better to have two adders running at half the speed and MUXing
the results.

Regards
Falk
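
For reference, converting the cycle times quoted in this thread to clock frequencies with f = 1/T (figures rounded; the 250 MHz target corresponds to a 4.00 ns period):

```latex
\begin{align*}
T_{\text{target}} &= \frac{1}{250\ \text{MHz}} = 4.00\ \text{ns} \\
f(4.55\ \text{ns}) &= \frac{1}{4.55\ \text{ns}} \approx 219.8\ \text{MHz} \\
f(3.98\ \text{ns}) &= \frac{1}{3.98\ \text{ns}} \approx 251.3\ \text{MHz} \\
T_{\text{target}} - 3.98\ \text{ns} &= 0.02\ \text{ns}
\end{align*}
```

So the 3.98 ns figure meets 250 MHz with only about 0.02 ns of slack, which is presumably why the two-adder approach is floated as the safer option.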