add/sub 2:1 mux and ena in a single LE (Cyclone)

Started by Martin Schoeberl October 7, 2004
I want to realize an add/subtract function, a 2:1 mux between this adder
and a load value and an enable of the register in a single LE. As I can
see in the data sheet (Cyclone) this should be possible: There is an
extra input addnsub to decide between add and subtract. Two inputs of the
LUT are used for the add/sub, the remaining two inputs can perform the
2:1 mux. The register has an additional ena input.
However, with the following VHDL I get 2 LEs per bit instead of 1. Any
ideas?

Martin

VHDL example:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity alu is

generic (
    width        : integer := 32        -- one data word
);
port (
    clk            : in std_logic;

    lmux        : in std_logic_vector(width-1 downto 0);
    b            : in std_logic_vector(width-1 downto 0);

    sel_sub        : in std_logic;                            -- 0..add, 1..sub
    sel_amux    : in std_logic;                            -- 0..sum, 1..lmux
    ena_a        : in std_logic;                            -- 1..store new value
    dout        : out std_logic_vector(width-1 downto 0)
);
end alu;

architecture rtl of alu is

    signal a        : std_logic_vector(width-1 downto 0);

begin

-- this add/sub, the sum/lmux mux and the enable should fit into
-- a single LE.

process(clk, ena_a) begin
    if rising_edge(clk) then
        if ena_a='1' then
            if sel_amux='0' then
                if sel_sub='0' then
                    a <= std_logic_vector(signed(b) - signed(a));
                else
                    a <= std_logic_vector(signed(b) + signed(a));
                end if;
            else
                a <= lmux;
            end if;
        end if;

    end if;
end process;

    dout <= a;

end rtl;
--
----------------------------------------------
JOP - a Java Processor core for FPGAs:
http://www.jopdesign.com/



Martin,
I'll guess. Is it because addnsub is low for an add? So try:-

                 if sel_sub='1' then
                     a <= std_logic_vector(signed(b) - signed(a));
                 else
                     a <= std_logic_vector(signed(b) + signed(a));
                 end if;

Cheers, Syms.




On Thu, 07 Oct 2004 19:31:33 GMT, "Martin Schoeberl" <martin.schoeberl@chello.at> wrote:

>I want to realize an add/subtract function, a 2:1 mux between this adder
>and a load value and an enable of the register in a single LE. As I can
>see in the data sheet (Cyclone) this should be possible: There is an
>extra input addnsub to decide between add and subtract. Two inputs of the
>LUT are used for the add/sub, the remaining two inputs can perform the
>2:1 mux. The register has an additional ena input.
>However, with the following VHDL I get 2 LEs per bit instead of 1. Any
>ideas?
>
>Martin

Too many inputs to the LUT.

  1 for the A operand
  1 for the B operand
  1 for the add/sub signal
  1 for the load value
  1 for the carry in from the previous bit.

I faced this problem back in 1990 when designing R16 for the XC4005. My
solution was to make the load value come in on the "A" operand path. In
the older Xilinx architectures this is not too hard, as the carry logic
is separate from the LUT (but very close by), and the CE for the FFs is
a separate signal. While the topology of the CLBs has changed
significantly since then, I believe this can still be done in the more
recent Xilinx devices.

Many of the earlier Altera architectures implied this was possible in
their data sheets, but the architecture could not actually support it,
because the data sheets showed mutually exclusive functions, while not
explaining that they were mutually exclusive. For example, the CE signal
shared an input with the LUT, and the LUT was broken into two 3 input
LUTs to implement ADD (1 for the sum bit and one for the carry out), so
you could not even do add/sub in one LE.

I believe these deficiencies have been corrected in more recent
products, but you would need to look very carefully at what a LAB can
really do.

Philip

Philip Freidin
Fliptronics

Hi Philip, Martin:

> Too many inputs to the LUT.
>
> 1 for the A operand
> 1 for the B operand
> 1 for the add/sub signal
> 1 for the load value
> 1 for the carry in from the previous bit.

Cyclone (and Stratix and MAX II) should be able to perform this function
in 1 LE per bit.

I *think* you have uncovered a bug in Quartus 4.1 synthesis. I'll confirm
this with the synthesis team tomorrow. Basically, it looks like Quartus
will automatically make a loadable adder flop, or an adder-subtractor
flop, but not a loadable adder-subtractor flop. If I make a very simple
VHDL design that implements a synchronously loadable adder+flop, I get
1 LE/bit as expected. If I add an add/subtract selector, I get 2 LE/bit
for no apparent reason.

The LE (as shown in Figure 2-5 of the Cyclone databook
http://www.altera.com/literature/hb/cyc/cyc_c51002.pdf) should be able to
implement an sloadable adder/subtractor in 1 LE/bit. Explanation of why
below.

Cyclone has a LAB-wide addnsub signal that can be used to control whether
the A operand to each LE in the LAB is inverted or not. In addition,
addnsub can also enter the carry chain at bit 0 -- so you get
complement(A) + B + addnsub, or (complement(A) + 1) + B = -A + B.

If you consider the 4-LUT as two 3-LUTs followed by a 2:1 mux, then you
get the following assignments of signals (using some pseudo-VHDL; I'm
rusty):

    -- The programmable inversion of the A input
    Aprime(i) <= A(i) WHEN addnsub = '0' ELSE NOT(A(i));

    -- Carry chain
    carry_in(0) <= addnsub;
    For i=1..n
        carry_in(i) <= carry_out(i-1);
    End for

    -- Top-Half 4-LUT computes the sum bit:
    TopHalfLUT(i) <= Aprime(i) XOR B(i) XOR carry_in(i);

    -- Bottom-Half of 4-LUT computes the carry-in of the next LE:
    BottomHalfLUT(i) <= (Aprime(i) AND B(i)) OR (Aprime(i) AND carry_in(i)) OR
                        (B(i) AND carry_in(i));
    carry_out(i) <= BottomHalfLUT(i);

    -- Now for the flop. Note that we're only using 2 of the 4 LE inputs for
    -- A(i), B(i). carry_in(i) uses the dedicated carry input, and the addnsub
    -- signal uses the LAB-wide control signal. The LAB-wide sload control
    -- signal is used to select between TopHalfLUT(i) and the bypass from the
    -- LE's data3 input to the flop.
    when (clock) -- I'm lazy
        IF sload = '1' THEN
            flop(i) <= sloaddata(i);
        ELSE
            flop(i) <= TopHalfLUT(i);
        END IF;

So there you go -- three inputs to the LE are used for A, B, and
sload_data, and two lab-wide signals are used for addnsub and sload.

Thanks for pointing this out.

Paul Leventis
Altera Corp.

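The signal assignment in the pseudo-VHDL above can be checked with a small software model. This is only a sketch, not Altera's implementation: it assumes the polarity used in that pseudo-VHDL (a '1' on addnsub inverts A and is fed into the carry chain, so it selects B - A), and the function name `le_chain_next` and the `width` parameter are made up for illustration.

```python
def le_chain_next(a, b, sload_data, addnsub, sload, width=8):
    """Value the register chain would capture on the next clock edge.

    Models one LE per bit: programmable inversion of A, a ripple carry
    chain seeded with addnsub, and a synchronous-load bypass in front
    of the flop. Polarity follows the pseudo-VHDL (1 = subtract).
    """
    mask = (1 << width) - 1
    if sload:
        # The LAB-wide sload selects the bypass input instead of the sum.
        return sload_data & mask

    carry = addnsub  # carry_in(0) <= addnsub
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        a_prime = a_i ^ addnsub                # programmable inversion of A
        sum_bit = a_prime ^ b_i ^ carry        # top half of the LUT
        carry = (a_prime & b_i) | (a_prime & carry) | (b_i & carry)  # bottom half
        result |= sum_bit << i
    return result


# Usage: with addnsub = 1 the chain computes B - A (two's complement).
print(le_chain_next(a=3, b=10, sload_data=0, addnsub=1, sload=0))  # 10 - 3 = 7
```

Only A, B and the load data are per-bit inputs in this model; addnsub and sload are shared by every bit, which is the point of the LAB-wide control signals.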
> [...]
> So there you go -- three inputs to the LE are used for A, B, and
> sload_data, and two lab-wide signals are used for addnsub and sload.
and three lab-wide signals: addnsub, sload and ena of the FF.

I've checked again with the data sheet. In figure 2-7 you can see the LE
in 'dynamic arithmetic mode', and the resources are there for this kind
of function. When we take a look at the Analysis & Synthesis Equations
we get:

    --a[0] is a[0]
    --operation mode is normal
    a[0]_lut_out = lmux[0] & (C1_result[0] # sel_amux) # !lmux[0] & C1_result[0] & !sel_amux;
    a[0] = DFFEA(a[0]_lut_out, clk, VCC, , ena_a, , );

And it should be:

    a[0]_lut_out = C1_result[0];
    a[0] = DFFEA(a[0]_lut_out, clk, VCC, , ena_a, sel_amux, lmux[0]);

However, the last two parameters of this DFFEA are the asynchronous
inputs to the FF and we want a synchronous load. Why is there such a
thing as asynchronous inputs to a FF? Perhaps the synthesizer should
generate LPM_FF, where the synchronous load is available.

This function also uses 2 LCs per bit in a Spartan-3. As I'm not so used
to 'reading' the Xilinx diagram of the LC, I don't know if the resources
of one LC could implement this function.

> Thanks for pointing this out.

You're welcome. As you will notice, this question is related to the JOP
optimizing contest ;-)

Martin
 
> This function also uses 2 LCs per bit in a Spartan-3. As I'm not so used
> to 'read' the Xilinx diagram of the LC I don't know if the resources for
> one LC could implement this function.

Don't think so. Not in this form at least.

If I understand correctly the SLICE view of DS099-2 page 11, the muxes
like CYINIT and CY0F are set during configuration of the FPGA. So if you
enable the carry logic for a bunch of slices, it stays active all the
time. Then for the load operation to work, you must ensure your b input
is all '0'; then you can do it in 1 LC/bit. If not, the carry will
pollute the output ...

But this is only a simplified slice view and I don't know where to find
the complete one. With this view I don't see how it implements an addsub
in 1 LC/bit, but it can do it. In the view, I see nothing capable of
inverting F1 or F2 so that the carry logic knows that one operand is
inverted.

Sylvain

In Virtex-derived architectures, you can implement
 o = add ? (a + b) : c;
or
 o = sel ? (a + b) : (a + c);
or even
 o = addsub ? (addand ? a+b : a-b) : (addand ? a&b : a^b);
in one LUT per bit.

The trick is to use a MULT_AND to kill the carry propagation when add=0.
See http://www.fpgacpu.org/log/nov00.html#001112.

But as Philip points out, you'd need five input signals to do
  o = sel ? (add ? a + b : a - b) : c;
and I don't think that can be done in one LUT per bit.
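The first form, o = add ? (a + b) : c, can be sketched bit by bit. This is one reading of the MULT_AND trick, not taken from the linked note: it assumes the usual Virtex slice arrangement in which the LUT output drives XORCY and the select of MUXCY, MULT_AND (the AND of two of the LUT inputs) feeds the other MUXCY data input, and the chain's carry-in is tied low. The function name `virtex_add_or_pass` is hypothetical.

```python
def virtex_add_or_pass(a, b, c, add, width=8):
    """Model of o = add ? (a + b) : c using one 4-input LUT per bit.

    Assumed slice wiring (illustrative, see lead-in):
      LUT    = add ? (a xor b) : c       -- 4 inputs: a, b, c, add
      XORCY  = LUT xor carry_in          -- the output bit
      MUXCY  = LUT ? carry_in : MULT_AND -- the carry out
      MULT_AND = a and add               -- forces a 0 carry when add = 0
    """
    carry = 0  # carry-in of the chain tied low (assumption)
    out = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        c_i = (c >> i) & 1
        lut = (a_i ^ b_i) if add else c_i
        mult_and = a_i & add
        out |= (lut ^ carry) << i
        # Propagate the incoming carry when LUT = 1, otherwise take MULT_AND.
        carry = carry if lut else mult_and
    return out


# Usage: add = 1 gives the sum, add = 0 passes c through unchanged.
print(virtex_add_or_pass(5, 6, 9, add=1))  # 5 + 6 = 11
print(virtex_add_or_pass(5, 6, 9, add=0))  # 9
```

With add = 0 every carry out is either the (zero) incoming carry or the zeroed MULT_AND output, so the carry chain stays at 0 and XORCY passes c unchanged. Adding a dynamic subtract on top of this would need a fifth per-bit input, which is the limit noted above.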

Jan Gray


"Jan Gray" <jsgray@acm.org> wrote in message
news:CXJ9d.11048$Vm1.2497@newsread3.news.atl.earthlink.net...
> In Virtex-derived architectures, you can implement
>  o = add ? (a + b) : c;
> or
>  o = sel ? (a + b) : (a + c);
> or even
>  o = addsub ? (addand ? a+b : a-b) : (addand ? a&b : a^b);
> in one LUT per bit.
>
> The trick is to use a MULT_AND to kill the carry propagation when add=0.
> See http://www.fpgacpu.org/log/nov00.html#001112.
>
> But as Philip points out, you'd need five input signals to do
>   o = sel ? (add ? a + b : a - b) : c;
> and I don't think that can be done in one LUT per bit.

The original request needs even six inputs. In your notation I want to
achieve the following function:

  d = ena ? (sload ? c : (addnsub ? a+b : a-b)) : d

However, the Cyclone LC has LAB-wide signals for addnsub, sload and ena.
You only need three of the LUT inputs for a, b and c, which are available
in arithmetic mode.

For the Spartan LC I can see only the CE signal as an additional 'global'
input that can serve as ena. There are two inputs (FIXINA/B) for the
register load, but it seems to me that GYMUX is statically configured. So
it can't be used for the sload part.

Martin

Hi All,

> I *think* you have uncovered a bug in Quartus 4.1 synthesis. I'll confirm
> this with the synthesis team tomorrow.

First of all, I should point out that this is sub-optimal synthesis, NOT a
"bug" -- the design will function, it just uses more logic elements than
necessary. We *may* fix this in a future release of Quartus, but the
solution will not be easy to implement so don't hold your breath. The value
is rather limited due to the input limitations explained below, and the
relative rarity of this combination of functions.

In the meantime, there is a work-around. You can directly instantiate
"stratix_lcells" (the WYSIWYG cell for Stratix/Cyclone LEs). Below I give
the code (thanks to a helpful synthesis guy) for a registered
adder/subtractor with oodles of extras.

Features:
- Implements A - B or A + B (depending on signal "addnsub")
- Registers are synchronously loadable with "data" when synchronous load
  "sload" is asserted
- There is shared clock "clk", clock enable "ena", synchronous clear
  "sclr", asynchronous clear "aclr"

A couple of caveats:
- There are only 26 non-global inputs to each LAB in Cyclone (and 30 in
  Stratix). So the fitter will have to split the design over multiple LABs
  if you use more than 7 bits in Cyclone, since you need 3 inputs per bit
  (A, B, sload_data) plus 4 local control signals and 2 global signals.
  Assuming aclr and clk are global and the others (addnsub, sload, sclr,
  ena) are local, that's 4 extra signals you need.
- When you stress the number of inputs on a LAB, you run the risk of
  reduced routability, resulting in longer run-times, poor performance, or
  unroutable designs in the worst case. You should try to keep the number
  of LAB inputs around 22-24.

When Quartus splits the carry chain, it must insert extra logic elements to
end the chain and begin the next. For example, implementing a 10-bit
add/sub/load/ena/aclr/sclr/sload function requires 13 LEs. Still better
than 20 LEs, but not 1:1. Also, the remaining unused LEs in the LAB will
not be too useful, since the LAB inputs are nearly saturated.

If you have no sload or a constant sload, you can implement 10 bits/LAB
since you only need 2n + 4 LAB lines.

Hope this helps!

Paul Leventis
Altera Corp.

************************* VERILOG CODE ******************

// Thanks to Gregg Baeckler for code!

module addsub (clk,a,b,addnsub,sload,sclr,aclr,ena,data,out);

parameter WIDTH = 7;

input [WIDTH-1:0] a;       // Operand A
input [WIDTH-1:0] b;       // Operand B (+B or -B based on addnsub)
input [WIDTH-1:0] data;    // Data to load upon sload
input clk;                 // Clock
input addnsub;             // ADD=1, SUBTRACT=0
input sload;               // Triggers synchronous load of register
input sclr;                // Synchronous clear
input aclr;                // Asynchronous clear
input ena;                 // Clock enable

output [WIDTH-1:0] out;
wire [WIDTH-1:0] out;

wire [WIDTH-1:0] cout_wires;

// The first cell CIN is special since it has no carry-in.
// Its carry-in will be the addnsub signal
stratix_lcell first_cell (
    .dataa(b[0]),
    .datab(a[0]),
    .datac(data[0]),
    .sload(sload),
    .sclr(sclr),
    .ena(ena),
    .aclr(aclr),
    .clk(clk),
    .inverta(addnsub),
    .regout(out[0]),
    .cout(cout_wires[0])
);
defparam first_cell.operation_mode = "arithmetic";
defparam first_cell.synch_mode = "on";
defparam first_cell.sum_lutc_input = "cin";
defparam first_cell.lut_mask = "96b2";
defparam first_cell.output_mode = "reg_only";

// fill in the rest of the cells in this loop
genvar i;
generate
    for (i=1; i<WIDTH; i=i+1)
    begin : ads
        stratix_lcell my_cell (
            .dataa(b[i]),
            .datab(a[i]),
            .datac(data[i]),
            .sload(sload),
            .sclr(sclr),
            .ena(ena),
            .aclr(aclr),
            .clk(clk),
            .cin(cout_wires[i-1]),
            .inverta(addnsub),
            .regout(out[i]),
            .cout(cout_wires[i])
        );
        defparam my_cell.operation_mode = "arithmetic";
        defparam my_cell.synch_mode = "on";
        defparam my_cell.sum_lutc_input = "cin";
        defparam my_cell.lut_mask = "96b2";
        defparam my_cell.output_mode = "reg_only";
    end
endgenerate

endmodule

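The LAB input budget from the caveats in this post can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: 4 local control signals as counted in the post, and 10 LEs per LAB, which comes from the device data sheets rather than from this thread. The helper name `max_bits_per_lab` is made up.

```python
def max_bits_per_lab(lab_inputs, inputs_per_bit, local_controls=4, les_per_lab=10):
    """Largest adder width that fits the non-global input budget of one LAB.

    lab_inputs     -- non-global inputs available to the LAB (26 Cyclone, 30 Stratix)
    inputs_per_bit -- 3 with sload data (A, B, sload_data), 2 without
    local_controls -- locally routed control signals (addnsub, sload, sclr, ena)
    les_per_lab    -- LEs in one LAB (assumption: 10, per the data sheets)
    """
    by_inputs = (lab_inputs - local_controls) // inputs_per_bit
    return min(by_inputs, les_per_lab)


print(max_bits_per_lab(26, 3))  # Cyclone with sload: 3n + 4 <= 26
print(max_bits_per_lab(26, 2))  # Cyclone without sload: 2n + 4 <= 26, capped by LEs
```

The first call reproduces the 7-bit limit quoted for Cyclone, and the second the 10 bits/LAB figure for the case without sload. By the same arithmetic, 30 inputs with 3 per bit would give 8 bits.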
Paul,

thanks for your suggestion. However, I will stay at plain VHDL and wait
for the synthesizer update :-)

> First of all, I should point out that this is sub-optimal synthesis, NOT a
> "bug" -- the design will function, it just uses more logic elements than
> necessary. We *may* fix this in a future release of Quartus, but the

I never thought that this was a 'bug' in the sense that it produces
wrong results.

> solution will not be easy to implement so don't hold your breath. The value
> is rather limited due to the input limitations explained below, and the
> relative rarity of this combination of functions.

However, if the LAB global inputs such as 'sload' and 'ena' are not
available to the synthesizer, you're 'wasting' resources. Do you use
these signals for other functions (perhaps the loadable counter)?

BTW: Do we really need asynchronous signals such as PRN/ALD, ADATA and
CLRN (OK, this one for the asynchronous reset) these days? Isn't that a
waste of resources, useful only for the few designers who do
asynchronous design?

> In the meantime, there is a work-around. You can directly instantiate
> "stratix_lcells" (the WYSIWYG cell for Stratix/Cyclone LEs). Below I give

Is there some documentation about these WYSIWYG lcells? I was looking
for such an entity in the Megafunctions/LPM help of Quartus (before you
provided the solution) to implement this function. However, I did not
find this basic megafunction.

Martin