FPGARelated.com
Forums

modulo 2**32-1 arith

Started by Ilya Kalistru December 15, 2015
Am Dienstag, 15. Dezember 2015 11:32:14 UTC+1 schrieb Ilya Kalistru:
> Hello. > I need to add two unsigned numbers modulo 2**32-1. > Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that. > Does anybody know a better way to do that?
Which logic family? What is "better"? Less ressources? Faster? Do you need a result every cycle? If you can do with a result every other cycle, you could try following: - Make an 32b adder with carry in and carry out - First cycle: Add the two numbers with carry. - Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result. This should be below 40 LEs. But I have not tried it out... Regards, Thomas www.entner-electronics.com - Home of EEBlaster
KJ wrote:

> On Tuesday, December 15, 2015 at 4:33:51 PM UTC-5, Rob Gaddi wrote: >> I'm guessing the requirement for modulo 2**32-1 is driven by the >> algorithm, possibly some checksummy sort of thing. I know Fletcher uses >> 2**N-1 modulo pretty heavily. >> > > Could be right, at least there is some possible context now to the question. Thanks. > >> If your concern is speed rather than size, you could probably do it by >> running two parallel adders, Y0 = A+B and Y1 = A+B+1. If the Y1 >> add carries out then Y = Y1 else Y = Y0. >> > > See my earlier post today. Quartus will implement 'Y1=A+B+1' exactly the same as 'Y1=Y+1'. >
Y0=A+B is a 32-bit adder with CIN=0. Y1=A+B+1 is a second 32-bit adder with CIN=1. Choosing between them is a 32-bit 2:1 mux. So my rough math puts my answer at 96 LEs and 32 flops. But the worst case propagation path is from the LSB, through the 32-bit Y1 carry chain, to the mux select, to the output, which should be pretty screamingly fast. Whereas Ilya's (and rickman's as well, if I'm reading it right) use the carry-out of the A+B+1 add as the carry in to the "real" add. That gives you the total prop delay of a 64-bit carry chain, plus some slop. The OP, who seems to have vanished off into the ether, never specified what "efficiency" he was trying to optimize on, or whether pipeline delays were acceptable, etc.
>> Alternatively, the way you do it in an optimized Fletcher is in blocks. >> Add a whole bunch of samples together with sufficient extra bits on the >> high end to count the overflows, then periodicially stop taking in new >> data and add the overflows back into the LSBs. You could easily get >> that "periodically" down to once out of every 1024 samples. >> > Interesting idea, but I don't think it saves anything over Ilya's method. Ilya's method as originally specified (i.e. spread out over two clock cycles) takes essentially 2 logic cells per bit (but it takes ~2.4 per bit for single clock cycle). When it comes time to periodically update the overflows that you mentioned, you will end up with another adder and consume another 32 logic cells (one per bit). >
My above comment again. If I'm right about the checksum, then Ilya's method (at least on a reading) gives you an operating duty cycle of 50%, you can put new data no faster than every other clock because you're waiting for the result of that "add one more decision" to percolate back around. Leave out the pipeline registers and your fmax gets ugly. If you just carry the excess and cook it periodically, you get a "data accept" duty cycle of well past 99%. There may be some cleverer way to get to a 100% duty cycle pipeline, but with only one cup of coffee in me I don't see it.
> Kevin Jennings
-- Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix.
The tool was not using the carry in.  Here is a version that uses the 
carry in.  66 LEs in one clock cycle.  It can be reduced to about 40 LEs 
if done in two clock cycles.

BTW, did you test any of the designs you synthesized?  How do you know 
they actually work?

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use std.textio.all;

ENTITY FPGA IS
   GENERIC (
	  WIDTH	: POSITIVE	:= 32
	);
   port(
	  -- Board Clock
	  A, B	: in	unsigned(WIDTH - 1 downto 0);
	  Y	: out	unsigned(WIDTH - 1 downto 0)
	);
END FPGA;

ARCHITECTURE behavior OF FPGA IS

   signal Y1 : unsigned(WIDTH downto 0);
   signal temp : unsigned(WIDTH downto 0);

BEGIN

   Y1 <= ('0' & A) + ('0' & B) + 1;
   temp <= (A & '1') + (B & Y1(32));
   Y <= temp (WIDTH downto 1);

END;

-- 

Rick
On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote:
> The tool was not using the carry in. Here is a version that uses the > carry in. 66 LEs in one clock cycle. It can be reduced to about 40 LEs > if done in two clock cycles. >
Here is the fitter summary showing 98 logic cells for this design that you posted. Presumably you're using a different tool or different target device to come up with 66 so that number only becomes relevant if you use that same tool and target device to synthesize all of the other variants. +---------------------------------------------------------------------------------+ ; Fitter Summary ; +------------------------------------+--------------------------------------------+ ; Fitter Status ; Successful - Wed Dec 16 12:31:58 2015 ; ; Quartus II 64-Bit Version ; 13.1.0 Build 162 10/23/2013 SJ Web Edition ; ; Revision Name ; Junk ; ; Top-level Entity Name ; FPGA ; ; Family ; Cyclone IV GX ; ; Device ; EP4CGX22CF19C6 ; ; Timing Models ; Final ; ; Total logic elements ; 98 / 21,280 ( < 1 % ) ; ; Total combinational functions ; 98 / 21,280 ( < 1 % ) ; ; Dedicated logic registers ; 0 / 21,280 ( 0 % ) ; ; Total registers ; 0 ; ; Total pins ; 96 / 167 ( 57 % ) ; ; Total virtual pins ; 0 ; ; Total memory bits ; 0 / 774,144 ( 0 % ) ; ; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % ) ; ; Total GXB Receiver Channel PCS ; 0 / 4 ( 0 % ) ; ; Total GXB Receiver Channel PMA ; 0 / 4 ( 0 % ) ; ; Total GXB Transmitter Channel PCS ; 0 / 4 ( 0 % ) ; ; Total GXB Transmitter Channel PMA ; 0 / 4 ( 0 % ) ; ; Total PLLs ; 0 / 4 ( 0 % ) ; +------------------------------------+--------------------------------------------+ Kevin Jennings
On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, thomas....@gmail.com 
> If you can do with a result every other cycle, you could try following: > - Make an 32b adder with carry in and carry out > - First cycle: Add the two numbers with carry. > - Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result. > > This should be below 40 LEs. But I have not tried it out... >
I have yours coming in at 36 if I coded up what you said correctly, not tested. You're in the lead now with Ilya in second. Kevin Jennings ==== START OF CODE ==== library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; entity Custom_Adder is generic( SP1_METHOD: integer range 0 to 1; -- Controls how 'Sum + 1' is calculated METHOD: integer range 0 to 7); port( Clock: in std_ulogic; A: in unsigned(31 downto 0); B: in unsigned(31 downto 0); C: out unsigned(31 downto 0); Valid: out std_ulogic ); end Custom_Adder; -- Results: -- Logic Elements -- Method# SP1=0 SP1=1 Notes -- 0 32 32 C <= A+B, no modulo '2**32-1' as a baseline -- 1 32 32 C <= A+B+1, as another reference point -- 2 76 76 Ilya's method although implemented all in one clock cycle -- 3 97 97 Rickman's method #1 -- 4 97 97 Rickman's method #2 -- 5 65 65 Ilya's method as stated, two clock cycles -- 6 76 108 Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1 -- 7 36 36 Thomas' method, two clock cycles architecture RTL of Custom_Adder is signal Sum: unsigned(32 downto 0); signal Sum_P1: unsigned(32 downto 0); constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF"; begin Sum <= resize(A, Sum'length) + resize(B, Sum'length); Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1; -- KJ note: If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells -- increases by 27 (i.e. almost one for each bit of the adder -- KJ note: Performing all calculations on one clock cycle in order to determine logic cells. GEN_METHOD0: if (METHOD = 0) generate C <= Sum(C'range); end generate GEN_METHOD0; GEN_METHOD1: if (METHOD = 1) generate C <= Sum_P1(C'range); end generate GEN_METHOD1; GEN_METHOD2: if (METHOD = 2) generate -- Ilya's method, implemented all in one clock cycle: -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range); end generate GEN_METHOD2; GEN_METHOD3: if (METHOD = 3) generate signal Mod_Res: unsigned(32 downto 0); begin -- Rickman's first method -- sum <= RESIZE(a, 33) + RESIZE(b, 33); -- temp <= sum + 1; -- mod_result <= sum + temp(32); -- answer <= mod_result(31 downto 0); Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32)); C <= Mod_Res(C'range); end generate GEN_METHOD3; GEN_METHOD4: if (METHOD = 4) generate -- Rickman's second method -- signal Y: unsigned(31 downto 0); -- signal Y, Y1, A, B: unsigned(32 downto 0); <-- KJ note: Redclares 'Y' -- Y1 <= A + B + 1; <-- KJ note: Error: must have 33 elements not 32 -- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32); signal Y: unsigned(31 downto 0); signal Y1: unsigned(32 downto 0); begin Y1 <= resize(A, 33) + resize(B, 33) + 1; Y <= resize(A + B + resize(Y1(32 downto 32),33), 32); C <= Y; end generate GEN_METHOD4; GEN_METHOD5: if (METHOD = 5) generate -- Ilya's method: -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that signal Sum_Dlyd: unsigned(Sum'range); signal Sum_Dlyd_P1: unsigned(Sum'range); begin Sum_Dlyd <= Sum when rising_edge(Clock); Sum_Dlyd_P1 <= Sum_Dlyd + 1; C <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range); end generate GEN_METHOD5; GEN_METHOD6: if (METHOD = 6) generate -- Same as method 2 (Ilya's method, implemented all in one clock cycle) except -- the ordering of operands for the computation of 'Sum + 1' is modified. signal Sum_P1: unsigned(Sum'range); begin Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length); C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range); end generate GEN_METHOD6; GEN_METHOD7: if (METHOD = 7) generate -- Thomas' method -- Make an 32b adder with carry in and carry out -- First cycle: Add the two numbers with carry. -- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result. -- This should be below 40 LEs. But I have not tried it out... signal Sum_With_Carry: unsigned(Sum'range); signal Carry_In: unsigned(0 downto 0); signal Carry_Out: std_ulogic; begin Carry_In <= not(Carry_In); Sum_With_Carry <= resize(A, Sum_With_Carry'length) + resize(B, Sum_With_Carry'length) + Carry_In; Carry_Out <= Sum_With_Carry(32) when rising_edge(Clock); C <= Sum_With_Carry(C'range); Valid <= (Carry_In(0) and Sum_With_Carry(32)) or (not(Carry_In(0)) and not(Carry_Out)); end generate GEN_METHOD7; end RTL; ==== END OF CODE ====
On 12/16/2015 12:36 PM, KJ wrote:
> On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote: >> The tool was not using the carry in. Here is a version that uses the >> carry in. 66 LEs in one clock cycle. It can be reduced to about 40 LEs >> if done in two clock cycles. >> > Here is the fitter summary showing 98 logic cells for this design that you posted. Presumably you're using a different tool or different target device to come up with 66 so that number only becomes relevant if you use that same tool and target device to synthesize all of the other variants. > +---------------------------------------------------------------------------------+ > ; Fitter Summary ; > +------------------------------------+--------------------------------------------+ > ; Fitter Status ; Successful - Wed Dec 16 12:31:58 2015 ; > ; Quartus II 64-Bit Version ; 13.1.0 Build 162 10/23/2013 SJ Web Edition ; > ; Revision Name ; Junk ; > ; Top-level Entity Name ; FPGA ; > ; Family ; Cyclone IV GX ; > ; Device ; EP4CGX22CF19C6 ; > ; Timing Models ; Final ; > ; Total logic elements ; 98 / 21,280 ( < 1 % ) ; > ; Total combinational functions ; 98 / 21,280 ( < 1 % ) ; > ; Dedicated logic registers ; 0 / 21,280 ( 0 % ) ; > ; Total registers ; 0 ; > ; Total pins ; 96 / 167 ( 57 % ) ; > ; Total virtual pins ; 0 ; > ; Total memory bits ; 0 / 774,144 ( 0 % ) ; > ; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % ) ; > ; Total GXB Receiver Channel PCS ; 0 / 4 ( 0 % ) ; > ; Total GXB Receiver Channel PMA ; 0 / 4 ( 0 % ) ; > ; Total GXB Transmitter Channel PCS ; 0 / 4 ( 0 % ) ; > ; Total GXB Transmitter Channel PMA ; 0 / 4 ( 0 % ) ; > ; Total PLLs ; 0 / 4 ( 0 % ) ; > +------------------------------------+--------------------------------------------+
I don't know what to tell you. I used Lattice Diamond 3.3. The design is simple, two 33 bit adders, 1 LUT per bit. There is nothing special about either the tool or the device (standard 4 input LUTs with carry support for adders). I think your Quartus tool is broken. Try looking at the actual logic produced. BTW, doesn't the Cyclone IV have 6 input LUTs? What is the Altera tool doing that is so inefficient? -- Rick
On Wednesday, December 16, 2015 at 12:45:05 PM UTC-5, KJ wrote:

Correction to Method 7:
Posted:  Carry_In <= not(Carry_In);
Supposed to be:  Carry_In <= not(Carry_In) when rising_edge(Clock);

Results are not changed, still 36 Logic Elements.

Kevin
On Wednesday, December 16, 2015 at 12:12:15 PM UTC-5, Rob Gaddi wrote:
>=20 > If I'm right about the checksum, then Ilya's > method (at least on a reading) gives you an operating duty cycle of > 50%, you can put new data no faster than every other clock because > you're waiting for the result of that "add one more decision" to > percolate back around. Leave out the pipeline registers and your fmax > gets ugly.
Ilya's method (Method=3D6 in the code I posted) should be computing one out= put every clock cycle. The output will just have a one clock cycle latency= compared with the pure combinatorial logic version (Method=3D2). Thomas' = method (Method=3D7) though will only accept one input every other clock cyc= le but consumes roughly half the logic resource. Kevin Jennings
On 12/16/2015 12:45 PM, KJ wrote:
> On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, thomas....@gmail.com >> If you can do with a result every other cycle, you could try following: >> - Make an 32b adder with carry in and carry out >> - First cycle: Add the two numbers with carry. >> - Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result. >> >> This should be below 40 LEs. But I have not tried it out... >> > I have yours coming in at 36 if I coded up what you said correctly, not tested. You're in the lead now with Ilya in second. > > Kevin Jennings > > ==== START OF CODE ==== > library ieee; > use ieee.std_logic_1164.all; > use ieee.numeric_std.all; > > entity Custom_Adder is generic( > SP1_METHOD: integer range 0 to 1; -- Controls how 'Sum + 1' is calculated > METHOD: integer range 0 to 7); > port( > Clock: in std_ulogic; > A: in unsigned(31 downto 0); > B: in unsigned(31 downto 0); > C: out unsigned(31 downto 0); > Valid: out std_ulogic > ); > end Custom_Adder; > > -- Results: > -- Logic Elements > -- Method# SP1=0 SP1=1 Notes > -- 0 32 32 C <= A+B, no modulo '2**32-1' as a baseline > -- 1 32 32 C <= A+B+1, as another reference point > -- 2 76 76 Ilya's method although implemented all in one clock cycle > -- 3 97 97 Rickman's method #1 > -- 4 97 97 Rickman's method #2 > -- 5 65 65 Ilya's method as stated, two clock cycles > -- 6 76 108 Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1 > -- 7 36 36 Thomas' method, two clock cycles > > architecture RTL of Custom_Adder is > signal Sum: unsigned(32 downto 0); > signal Sum_P1: unsigned(32 downto 0); > constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF"; > begin > Sum <= resize(A, Sum'length) + resize(B, Sum'length); > Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1; > -- KJ note: If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells > -- increases by 27 (i.e. almost one for each bit of the adder > > -- KJ note: Performing all calculations on one clock cycle in order to determine logic cells. > > GEN_METHOD0: if (METHOD = 0) generate > C <= Sum(C'range); > end generate GEN_METHOD0; > > GEN_METHOD1: if (METHOD = 1) generate > C <= Sum_P1(C'range); > end generate GEN_METHOD1; > > GEN_METHOD2: if (METHOD = 2) generate > -- Ilya's method, implemented all in one clock cycle: > -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result > -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that > C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range); > end generate GEN_METHOD2; > > GEN_METHOD3: if (METHOD = 3) generate > signal Mod_Res: unsigned(32 downto 0); > begin > -- Rickman's first method > -- sum <= RESIZE(a, 33) + RESIZE(b, 33); > -- temp <= sum + 1; > -- mod_result <= sum + temp(32); > -- answer <= mod_result(31 downto 0); > Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32)); > C <= Mod_Res(C'range); > end generate GEN_METHOD3; > GEN_METHOD4: if (METHOD = 4) generate > -- Rickman's second method > -- signal Y: unsigned(31 downto 0); > -- signal Y, Y1, A, B: unsigned(32 downto 0); <-- KJ note: Redclares 'Y' > -- Y1 <= A + B + 1; <-- KJ note: Error: must have 33 elements not 32 > -- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32); > signal Y: unsigned(31 downto 0); > signal Y1: unsigned(32 downto 0); > begin > Y1 <= resize(A, 33) + resize(B, 33) + 1; > Y <= resize(A + B + resize(Y1(32 downto 32),33), 32); > C <= Y; > end generate GEN_METHOD4; > > GEN_METHOD5: if (METHOD = 5) generate > -- Ilya's method: > -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result > -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that > signal Sum_Dlyd: unsigned(Sum'range); > signal Sum_Dlyd_P1: unsigned(Sum'range); > begin > Sum_Dlyd <= Sum when rising_edge(Clock); > Sum_Dlyd_P1 <= Sum_Dlyd + 1; > C <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range); > end generate GEN_METHOD5; > > GEN_METHOD6: if (METHOD = 6) generate > -- Same as method 2 (Ilya's method, implemented all in one clock cycle) except > -- the ordering of operands for the computation of 'Sum + 1' is modified. > signal Sum_P1: unsigned(Sum'range); > begin > Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length); > C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range); > end generate GEN_METHOD6; > > GEN_METHOD7: if (METHOD = 7) generate > -- Thomas' method > -- Make an 32b adder with carry in and carry out > -- First cycle: Add the two numbers with carry. > -- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result. > -- This should be below 40 LEs. But I have not tried it out... > signal Sum_With_Carry: unsigned(Sum'range); > signal Carry_In: unsigned(0 downto 0); > signal Carry_Out: std_ulogic; > begin > Carry_In <= not(Carry_In); > Sum_With_Carry <= resize(A, Sum_With_Carry'length) + resize(B, Sum_With_Carry'length) + Carry_In; > Carry_Out <= Sum_With_Carry(32) when rising_edge(Clock); > C <= Sum_With_Carry(C'range); > Valid <= (Carry_In(0) and Sum_With_Carry(32)) or (not(Carry_In(0)) and not(Carry_Out)); > end generate GEN_METHOD7; > end RTL; > ==== END OF CODE ====
I see one issue. You are calculating Sum and Sum_P1 even for methods that don't use them. This would then depend on the tool to optimize away this logic. I see in some methods you recompute these values. Why not fix this and add it to the code for the methods that use it? -- Rick
I thank you all for this very interesting and useful discussion. There is a lot to think about.
But I should apologize: later on it turns out that the definition of "modulo 2**32-1" which is used in the algorithm description other than I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
I manged to meet timings by using Sum(32) to decide what sum I should use.

    Sum <= resize(A, 33)+resize(B, 33);
    Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
    
    process(clk)
    begin
        if rising_edge(clk) then
            A <= A_in;
            B <= B_in;
            
            if Sum(32) = '1' then
                C<=resize(Sum_P1,32);
            else
                C<=resize(Sum,32);
            end if;
        end if;
    end process;

In my case (Artix-7) this code gives me a critical path of LUT2 then 8 CARRY4 then LUT3 and it's pretty fast (3.09 ns estimated before the routing step). It consumes 96 luts estimated.

If I use
Sum_P1 <= Sum+1;
istead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes 64 luts estimated.


If we return to our original task with a proper modulo 2*32-1 addition and use Sun_P1(32) for the select input of the multiplexer
    Sum <= resize(A, 33)+resize(B, 33);
    Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
    
    process(clk)
    begin
        if rising_edge(clk) then
            A <= A_in;
            B <= B_in;
            
            if Sum_P(32) = '1' then
                C<=resize(Sum_P1,32);
            else
                C<=resize(Sum,32);
            end if;
        end if;
    end process;
we would have exactly the same results on Artix-7 as with "weird" addition. (And it makes sense)

in case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRI4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings(4.444 ns estimated before the routing step). It consumes 96 luts estimated.

if we do it like that:
    Y1  <= resize(A, 33) + resize(B, 33) + 1;
    Y   <= resize(A + B + resize(Y1(32 downto 32),33), 32); 
    
    process(clk)
    begin
        if rising_edge(clk) then
            A <= A_in;
            B <= B_in;
            
            C <= Y;
        end if;
    end process;
we have in critical path LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 and we don't meet timings(4.338 ns estimated before the routing step). It consumes 64 luts estimated.