I'm interested in implementing a fast adder for 32-bit data. The CLA is too expensive, so I'm looking for something different. Can you point me to some references? I think a Ling adder could be a good choice, but I'm not sure. Thanks a lot
about fast adder
Started by ●July 7, 2005
Reply by ●July 7, 2005
Giox wrote:
> I'm interested in implementing a fast adder for 32-bit data.
> The CLA is too expensive, so I'm looking for something different. Can
> you point me to some references?
> I think a Ling adder could be a good choice, but I'm not sure.
> Thanks a lot
>

On an FPGA, going faster than the dedicated fast carry ripple chain for only 32 bits of data might not be easy. What is your target speed, and what is your current speed?

Sylvain
Reply by ●July 7, 2005
Hi, I'm using a Virtex 300E, and after the synthesis step (not place and route) the frequency is estimated at 82.129 MHz. The performance is better than what I need, but the occupied area is considerable. Gio
Reply by ●July 7, 2005
Giox wrote:
> Hi, I'm using a Virtex 300E, and after the synthesis step (not place and
> route) the frequency is estimated at 82.129 MHz.
> The performance is better than what I need, but the occupied
> area is considerable.
> Gio

Which software do you use to synthesize your project? Did you try pipelining? Can you show your source code?

des00
Reply by ●July 7, 2005
I'm using the Xilinx ISE pack with its synthesis tool. I'm not able to show the code, but it is simply a 32-bit CLA built from 4 different 8-bit CLAs with group propagate and generate. Gio
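For readers who want to see the structure Gio describes, here is a minimal bit-level Python model of a 32-bit adder built from four 8-bit CLA blocks joined by group propagate and generate signals. It is only a sketch of the textbook arrangement, not Gio's actual code (which was not posted); the function names `block_pg`, `cla8` and `cla32` are made up for illustration.

```python
def block_pg(a, b):
    """Per-bit and group generate/propagate for one 8-bit block."""
    g = [(a >> i) & (b >> i) & 1 for i in range(8)]    # bit generate:  a AND b
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(8)]  # bit propagate: a XOR b
    group_g = 0
    for i in range(8):
        # Unrolls to g7 | p7&g6 | p7&p6&g5 | ... (carry out of the block when cin = 0)
        group_g = g[i] | (p[i] & group_g)
    group_p = int(all(p))  # the block passes a carry only if every bit propagates
    return g, p, group_g, group_p


def cla8(a, b, cin):
    """Sum bits of one 8-bit block once its carry-in is known."""
    g, p, _, _ = block_pg(a, b)
    carry, total = cin, 0
    for i in range(8):
        total |= (p[i] ^ carry) << i
        # In hardware this recurrence is flattened into wide AND/OR terms
        carry = g[i] | (p[i] & carry)
    return total


def cla32(a, b, cin=0):
    """32-bit adder: four 8-bit blocks plus a second-level lookahead unit."""
    blocks = [((a >> 8 * k) & 0xFF, (b >> 8 * k) & 0xFF) for k in range(4)]
    carries = [cin]
    for ak, bk in blocks:
        _, _, group_g, group_p = block_pg(ak, bk)
        # Block carry-in comes from group signals only, not from rippling sum bits
        carries.append(group_g | (group_p & carries[-1]))
    total = 0
    for k, (ak, bk) in enumerate(blocks):
        total |= cla8(ak, bk, carries[k]) << (8 * k)
    return total, carries[4]


print(cla32(0xFFFFFFFF, 1))  # sum wraps to 0 with carry out 1
```

Each of the wide AND/OR terms in this model has to be built from general-purpose LUTs on an FPGA, which gives a feel for why the hand-built version occupies noticeably more area than a plain `a + b`.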
Reply by ●July 7, 2005
Have you tried a simple adder? Verilog:

module myadd (
  input clk,
  input [31:0] a, b,
  output reg [31:0] y
);
  // Registered output; non-blocking assignment for clocked logic
  always @(posedge clk)
    y <= a + b;
endmodule

The dedicated adder circuitry is very fast silicon. Trying to beat the native performance of the adder is difficult. Most people have their performance hurt by having more than one (or two) levels of logic in the adder. If you go from registered inputs to registered outputs, you should get significantly better performance than the CLA structure you're trying. Let us know how your performance changes with a simple 32-bit adder.

Giox wrote:
> I'm using the Xilinx ISE pack with its synthesis tool.
> I'm not able to show the code, but it is simply a 32-bit CLA built from 4
> different 8-bit CLAs with group propagate and generate.
> Gio
Reply by ●July 7, 2005
Specially for you, I did a simple test.
This code:
library ieee;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
use ieee.std_logic_1164.all;

entity adder is
  port
  (
    in_clock   : in  std_logic;
    in_reset_b : in  std_logic;
    in_dataA   : in  std_logic_vector(31 downto 0);
    in_dataB   : in  std_logic_vector(31 downto 0);
    in_strobe  : in  std_logic;
    out_data   : out std_logic_vector(32 downto 0);
    out_strobe : out std_logic
  );
end entity adder;

architecture adder of adder is
begin

  process (in_clock, in_reset_b) is
  begin
    if (in_reset_b = '0') then
      out_data <= (others => '0');
    elsif (rising_edge(in_clock)) then
      if (in_strobe = '1') then
        out_data <= ext(in_dataA, out_data'length) + ext(in_dataB, out_data'length);
      end if;
    end if;
  end process;

  process (in_clock) is
  begin
    if (rising_edge(in_clock)) then
      out_strobe <= in_strobe;
    end if;
  end process;

end architecture adder;
I synthesized it with Synplify 8.1 for xcv300efg256-6.
The Synplify report is:
Worst slack in design: 3.605
...................................
                  Requested    Estimated    Requested    Estimated              Clock       Clock
Starting Clock    Frequency    Frequency    Period       Period       Slack     Type        Group
---------------------------------------------------------------------------------------------------------------------
adder|in_clock    100.0 MHz    156.4 MHz    10.000       6.395        3.605     inferred    Inferred_clkgroup_0
=====================================================================================================================
...............................
Then I ran P&R in ISE 6.3 SP2 (full),
and the report is:
...............
Number of errors:   0
Number of warnings: 0
Logic Utilization:
  Number of Slice Flip Flops:                        26 out of 6,144    1%
  Number of 4 input LUTs:                            32 out of 6,144    1%
Logic Distribution:
  Number of occupied Slices:                         17 out of 3,072    1%
  Number of Slices containing only related logic:    17 out of    17  100%
  Number of Slices containing unrelated logic:        0 out of    17    0%
  *See NOTES below for an explanation of the effects of unrelated logic
Total Number of 4 input LUTs:                        32 out of 6,144    1%
Number of bonded IOBs:                              100 out of   176   56%
  IOB Flip Flops:                                     8
Number of GCLKs:                                      1 out of     4   25%
Number of GCLKIOBs:                                   1 out of     4   25%
.................
As you can see, there were no errors. The only constraints were:
NET "in_clock" TNM_NET = "in_clock";
TIMESPEC "TS_in_clock" = PERIOD "in_clock" 10.000 ns HIGH 50.00%;
#End clock constraints
# Output Constraints
OFFSET = OUT : 10.000 : AFTER in_clock ;
# Input Constraints
OFFSET = IN : 10.000 : BEFORE in_clock ;
That's why I recommend you use Synplify :) Good luck!
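As a quick sanity check on the timing figures quoted in this thread, the relationship between period, frequency and slack can be sketched in a few lines of Python. The helper names are hypothetical; the numbers are the ones from the Synplify report above and from Gio's earlier 82.129 MHz estimate.

```python
def fmax_mhz(period_ns):
    """Clock frequency in MHz for a given period in ns."""
    return 1000.0 / period_ns


def period_ns(freq_mhz):
    """Clock period in ns for a given frequency in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(requested_period_ns, estimated_period_ns):
    """Positive slack means the design meets the requested period with margin."""
    return requested_period_ns - estimated_period_ns


print(f"estimated fmax = {fmax_mhz(6.395):.1f} MHz")         # 156.4 MHz
print(f"slack          = {slack_ns(10.000, 6.395):.3f} ns")  # 3.605 ns
print(f"CLA period     = {period_ns(82.129):.3f} ns")        # 12.176 ns
```

So the hand-built CLA's synthesis estimate corresponds to a period of roughly 12.2 ns, against roughly 6.4 ns for the plain `+` version. The two numbers come from different tools and from synthesis estimates rather than post-route timing, so the comparison is only indicative.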
Reply by ●July 7, 2005
Gulp, interesting. I tested your code with my tools; it is faster with Synplify than with my tools. However, it seems that the biggest problem is the use of the CLA: the synthesis process gives better results than the CLA that I implemented by hand. I'm not as experienced as you, but is it possible that a standard (straight from a university textbook) implementation of a CLA generates conflicts that prevent the use of specific features of the FPGA? It seems so, but I would like your advice. Thanks again, Giovanni
Reply by ●July 7, 2005
Giox wrote:
> Gulp, interesting.
> I tested your code with my tools; it is faster with Synplify than with
> my tools. However, it seems that the biggest problem is the use of the CLA:
> the synthesis process gives better results than the
> CLA that I implemented by hand. I'm not as experienced as you, but is it
> possible that a standard (straight from a university textbook)
> implementation of a CLA generates conflicts that prevent the use of
> specific features of the FPGA?

You mean you didn't try the simple + first? All modern FPGAs have a dedicated carry ripple chain that allows very quick propagation of the carry from one LogicCell to the adjacent one. So by using it, you only need n LogicCells for an n-bit adder, and the carry is handled by dedicated logic. When you built your CLA, you used only generic logic, so you added extra delays. Using architectures other than the simple + for addition only pays off for very big adders.

Sylvain
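To make the "n LogicCells for an n-bit adder" point concrete, here is a small Python model of a ripple adder arranged the way a Virtex-style carry chain is usually described: one LUT per bit computes a XOR b, a dedicated mux passes or regenerates the carry, and a dedicated XOR produces the sum bit. The exact slice details are an assumption on my part for illustration, and `carry_chain_add` is a made-up name.

```python
def carry_chain_add(a, b, width=32, cin=0):
    """Ripple-carry add, one 'logic cell' per bit.

    Returns (sum, carry_out, luts_used).
    """
    carry, total, luts_used = cin, 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p = ai ^ bi            # the only general-purpose logic: one LUT per bit
        luts_used += 1
        total |= (p ^ carry) << i        # dedicated XOR: sum = p XOR carry-in
        carry = carry if p else ai       # dedicated mux: propagate or regenerate
    return total, carry, luts_used


s, cout, luts = carry_chain_add(0xFFFFFFFF, 1)
print(hex(s), cout, luts)  # 0x0 1 32
```

The model uses 32 LUT-equivalents for 32 bits, which lines up with the "Number of 4 input LUTs: 32" line in the map report earlier in the thread. The carry still ripples through every bit, but in silicon it does so over dedicated routing rather than through general-purpose LUTs and interconnect.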
Reply by ●July 7, 2005






