comp.arch.fpga | add/sub 2:1 mux and ena in a single LE (Cyclone)

I want to realize an add/subtract function, a 2:1 mux between this adder
and a load value and an enable of the register in a single LE. As I can
see in the data sheet (Cyclone) this should be possible: There is an
extra input addnsub to decide between add and subtract. Two inputs of the
LUT are used for the add/sub, the remaining two inputs can perform the
2:1 mux. The register has an additional ena input.
However, with the following VHDL I get 2 LEs per bit instead of 1. Any
ideas?

Martin

VHDL example:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity alu is

generic (
    width        : integer := 32        -- one data word
);
port (
    clk            : in std_logic;

    lmux        : in std_logic_vector(width-1 downto 0);
    b            : in std_logic_vector(width-1 downto 0);

    sel_sub        : in std_logic;                            -- 0..add, 1..sub
    sel_amux    : in std_logic;                            -- 0..sum, 1..lmux
    ena_a        : in std_logic;                            -- 1..store new value
    dout        : out std_logic_vector(width-1 downto 0)
);
end alu;

architecture rtl of alu is

    signal a        : std_logic_vector(width-1 downto 0);

begin

-- this add/sub, the sum/lmux mux and the enable should fit into
-- a single LE.

process(clk, ena_a) begin
    if rising_edge(clk) then
        if ena_a='1' then
            if sel_amux='0' then
                if sel_sub='0' then
                    a <= std_logic_vector(signed(b) - signed(a));
                else
                    a <= std_logic_vector(signed(b) + signed(a));
                end if;
            else
                a <= lmux;
            end if;
        end if;

    end if;
end process;

    dout <= a;

end rtl;
--
----------------------------------------------
JOP - a Java Processor core for FPGAs:
http://www.jopdesign.com/

Reply by Symon ● October 7, 2004

Martin,
I'll guess. Is it because addnsub is low for an add? So try:-

                 if sel_sub='1' then
                     a <= std_logic_vector(signed(b) - signed(a));
                 else
                     a <= std_logic_vector(signed(b) + signed(a));
                 end if;

Cheers, Syms.

Reply by Philip Freidin ● October 7, 2004

On Thu, 07 Oct 2004 19:31:33 GMT, "Martin Schoeberl" <martin.schoeberl@chello.at> wrote:

>I want to realize an add/subtract function, a 2:1 mux between this adder
>and a load value and an enable of the register in a single LE. As I can
>see in the data sheet (Cyclone) this should be possible: There is an
>extra input addnsub to decide between add and subtract. Two inputs of the
>LUT are used for the add/sub, the remaining two inputs can perform the
>2:1 mux. The register has an additional ena input.
>However, with the following VHDL I get 2 LEs per bit instead of 1. Any
>ideas?
>
>Martin

Too many inputs to the LUT.

1 for the A operand
1 for the B operand
1 for the add/sub signal
1 for the load value
1 for the carry in from the previous bit.

I faced this problem back in 1990 when designing R16 for the
XC4005. My solution was to make the load value come in on the
"A" operand path.

In the older Xilinx architectures this is not too hard, as the
carry logic is separate from the LUT (but very close by), and
the CE for the FFs is a separate signal. While the topology of
the CLBs has changed significantly since then, I believe this
can still be done in the more recent Xilinx devices.

Many of the earlier Altera architectures implied this was
possible in their data sheets, but the architecture could not
actually support it, because the data sheets showed mutually
exclusive functions, while not explaining that they were
mutually exclusive. For example, the CE signal shared an input
with the LUT, and the LUT was broken into two 3 input LUTs to
implement ADD (1 for the sum bit and one for the carry out),
so you could not even do add/sub in one LE. I believe these
deficiencies have been corrected in more recent products, but
you would need to look very carefully at what a LAB can really
do.

Philip

Philip Freidin
Fliptronics

Reply by Paul Leventis (at home) ● October 8, 2004

Hi Philip, Martin:

> Too many inputs to the LUT.
>
> 1 for the A operand
> 1 for the B operand
> 1 for the add/sub signal
> 1 for the load value
> 1 for the carry in from the previous bit.

Cyclone (and Stratix and MAX II) should be able to perform this function in
1 LE per bit.

I *think* you have uncovered a bug in Quartus 4.1 synthesis.  I'll confirm
this with the synthesis team tomorrow.  Basically, it looks like Quartus
will automatically make a loadable adder flop, or an adder-subtractor flop,
but not a loadable adder-subtractor flop.  If I make a very simple VHDL
design that implements a synchronously loadable adder+flop, I get 1 LE/bit
as expected.  If I add an add/subtract selector, I get 2 LE/bit for no
apparent reason.  The LE (as shown in Figure 2-5 of the Cyclone databook
http://www.altera.com/literature/hb/cyc/cyc_c51002.pdf) should be able to
implement an sloadable adder/subtractor in 1 LE/bit.  Explanation of why
below.

Cyclone has a LAB-wide addnsub signal that can be used to control whether
the A operand to each LE in the LAB is inverted or not.  In addition,
addnsub can also enter the carry chain at bit 0 -- so you get complement(A)
+ B + addnsub, or (complement(A) + 1) + B = -A + B.

If you consider the 4-LUT as two 3-LUTs followed by a 2:1 Mux, then you get
the following assignments of signals (using some pseudo-VHDL; I'm rusty):

-- The programmable inversion of the A input
Aprime(i) <= A(i) WHEN addnsub = '0' ELSE NOT A(i);

-- Carry chain
carry_in(0) <= addnsub;
For i=1..n
    carry_in(i) <= carry_out(i-1);
End for

-- Top half of the 4-LUT computes the sum bit:
TopHalfLUT(i) <= Aprime(i) XOR B(i) XOR carry_in(i);

-- Bottom half of the 4-LUT computes the carry-in of the next LE:
BottomHalfLUT(i) <= (Aprime(i) AND B(i)) OR (Aprime(i) AND carry_in(i)) OR
                    (B(i) AND carry_in(i));
carry_out(i) <= BottomHalfLUT(i);

-- Now for the flop.  Note that we're only using 2 of the 4 LE inputs for
-- A(i), B(i).  carry_in(i) uses the dedicated carry input, and the addnsub
-- signal uses the LAB-wide control signal.  The LAB-wide sload control
-- signal is used to select between TopHalfLUT(i) and the bypass from the
-- LE's data3 input to the flop.
when (clock) -- I'm lazy
    IF sload = '1' THEN
        flop(i) <= sloaddata(i);
    ELSE
        flop(i) <= TopHalfLUT(i);
    END IF;

So there you go -- three inputs to the LE are used for A, B, and sload_data,
and two lab-wide signals are used for addnsub and sload.
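
The pseudo-VHDL above can be sanity-checked with a small bit-level model. What follows is only a sketch in Python, not Altera's implementation: the function name le_chain and its argument order are made up for illustration, and it follows the polarity of the pseudo-code (addnsub = 1 inverts A and feeds a 1 into the carry chain at bit 0). Clock enable and clears are left out.

```python
def le_chain(a, b, data, addnsub, sload, width):
    """Value the flops would capture on the next clock edge.

    Hypothetical model of one carry chain of LEs, one LE per bit,
    following the pseudo-VHDL: addnsub=0 gives A + B, addnsub=1 gives B - A.
    """
    carry = addnsub                      # carry_in(0) <= addnsub
    nxt = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        d_i = (data >> i) & 1
        a_prime = a_i ^ addnsub          # programmable inversion of A
        sum_bit = a_prime ^ b_i ^ carry  # top half of the LUT
        # bottom half of the LUT: majority function, carry to the next LE
        carry = (a_prime & b_i) | (a_prime & carry) | (b_i & carry)
        bit = d_i if sload else sum_bit  # LAB-wide sload selects the bypass
        nxt |= bit << i
    return nxt
```

For example, with width 4, le_chain(3, 5, 0, 1, 0, 4) evaluates to 2 (B - A) and le_chain(3, 5, 0, 0, 0, 4) to 8 (A + B); each bit only ever looks at a(i), b(i), data(i), the incoming carry and the two shared control signals.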

Thanks for pointing this out.

Paul Leventis
Altera Corp.

Reply by Martin Schoeberl ● October 8, 2004

> Cyclone has a LAB-wide addnsub signal that can be used to control whether
> the A operand to each LE in the LAB is inverted or not.  In addition,
> addnsub can also enter the carry chain at bit 0 -- so you get complement(A)
> + B + addnsub, or (complement(A) + 1) + B = -A + B.
>
> If you consider the 4-LUT as two 3-LUTs followed by a 2:1 Mux, then you get
> the following assignments of signals (using some pseudo-VHDL; I'm rusty):
>
> -- The programmable inversion of the A input
> Aprime(i) <= A(i) WHEN addnsub = '0' ELSE NOT A(i);
>
> -- Carry chain
> carry_in(0) <= addnsub;
> For i=1..n
>     carry_in(i) <= carry_out(i-1);
> End for
>
> -- Top half of the 4-LUT computes the sum bit:
> TopHalfLUT(i) <= Aprime(i) XOR B(i) XOR carry_in(i);
>
> -- Bottom half of the 4-LUT computes the carry-in of the next LE:
> BottomHalfLUT(i) <= (Aprime(i) AND B(i)) OR (Aprime(i) AND carry_in(i)) OR
>                     (B(i) AND carry_in(i));
> carry_out(i) <= BottomHalfLUT(i);
>
> -- Now for the flop.  Note that we're only using 2 of the 4 LE inputs for
> -- A(i), B(i).  carry_in(i) uses the dedicated carry input, and the addnsub
> -- signal uses the LAB-wide control signal.  The LAB-wide sload control
> -- signal is used to select between TopHalfLUT(i) and the bypass from the
> -- LE's data3 input to the flop.
> when (clock) -- I'm lazy
>     IF sload = '1' THEN
>         flop(i) <= sloaddata(i);
>     ELSE
>         flop(i) <= TopHalfLUT(i);
>     END IF;
>
> So there you go -- three inputs to the LE are used for A, B, and sload_data,
> and two lab-wide signals are used for addnsub and sload.
>

Make that three LAB-wide signals: addnsub, sload and the ena of the FF. I've
checked the data sheet again. Figure 2-7 shows the LE in 'dynamic
arithmetic mode', and the resources are there for this kind of
function.

When we take a look in the Analysis & Synthesis Equations we get:

--a[0] is a[0]
--operation mode is normal

a[0]_lut_out = lmux[0] & (C1_result[0] # sel_amux) # !lmux[0] &
C1_result[0] & !sel_amux;
a[0] = DFFEA(a[0]_lut_out, clk, VCC, , ena_a, , );

And it should be:

a[0]_lut_out = C1_result[0];
a[0] = DFFEA(a[0]_lut_out, clk, VCC, , ena_a, sel_amux, lmux[0]);

However, the last two parameters for this DFFEA are the asynchronous
inputs to the FF and we want a synchronous load. Why is there such a
thing as asynchronous inputs to a FF?

Perhaps the synthesizer should generate LPM_FF, where the synchronous load
is available.
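
The reported a[0]_lut_out equation above is nothing more than the 2:1 mux between the adder result and lmux, which is why it costs a second LUT when it is not mapped onto the register's synchronous load. A quick truth-table check, written as a Python sketch (the function names are hypothetical; in the Quartus equation syntax '&' is AND, '#' is OR and '!' is NOT):

```python
from itertools import product

def lut_out_reported(lmux, c1_result, sel_amux):
    # lmux & (C1_result # sel_amux) # !lmux & C1_result & !sel_amux
    return (lmux & (c1_result | sel_amux)) | ((1 - lmux) & c1_result & (1 - sel_amux))

def mux2(lmux, c1_result, sel_amux):
    # sel_amux = 1 selects the load value, 0 selects the adder result
    return lmux if sel_amux else c1_result

def equations_match():
    # Compare both functions over all 8 input combinations
    return all(
        lut_out_reported(l, c, s) == mux2(l, c, s)
        for l, c, s in product((0, 1), repeat=3)
    )
```

equations_match() returns True, so moving sel_amux and lmux[0] onto a synchronous-load path would leave the LUT with only the adder result to compute.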

This function also uses 2 LCs per bit in a Spartan-3. As I'm not used
to reading the Xilinx diagram of the LC, I don't know whether the resources of
one LC could implement this function.

> Thanks for pointing this out.

You're welcome

As you will notice, this question is related to the JOP optimizing
contest ;-)

Martin

Reply by Sylvain Munaut ● October 8, 2004

 
> This function also uses 2 LCs per bit in a Spartan-3. As I'm not so used
> to 'read' the Xilinx diagram of the LC I don't know if the resources for
> one LC could implement this function.

Don't think so. Not in this form at least.

If I understand correctly the SLICE view of DS099-2 page 11, the muxes
like CYINIT, CY0F, are configured during configuration of the FPGA.

So if you enable the carry logic for a bunch of slices, it stays active
all the time. Then for the load operation to work, you must ensure your
b input is all '0', then you can do it in 1 LC/bit.

If not, the carry will pollute the output ...


But this is only a simplified slice view and I don't know where to
find the complete one. With this view I don't see how it implements
an addsub in 1 LC/bit, although it can. In the view, I see nothing capable
of inverting F1 or F2 so that the carry logic knows that one operand
is inverted.



Sylvain

Reply by Jan Gray ● October 9, 2004

In Virtex-derived architectures, you can implement
 o = add ? (a + b) : c;
or
 o = sel ? (a + b) : (a + c);
or even
 o = addsub ? (addand ? a+b : a-b) : (addand ? a&b : a^b);
in one LUT per bit.

The trick is to use a MULT_AND to kill the carry propagation when add=0.
See http://www.fpgacpu.org/log/nov00.html#001112.

But as Philip points out, you'd need five input signals to do
  o = sel ? (add ? a + b : a - b) : c;
and I don't think that can be done in one LUT per bit.
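
One way the first form, o = add ? (a + b) : c, can fit in one LUT per bit is sketched below in Python. This is an illustration only, assuming the usual Virtex slice carry structure (the LUT output drives the XORCY and the MUXCY select, and MULT_AND ANDs two of the LUT inputs); the function name and the exact LUT/MULT_AND assignment are my assumptions, not taken from the linked page.

```python
def add_or_pass(a, b, c, add, width):
    """Model of o = add ? (a + b) : c with one 4-input LUT per bit."""
    carry = 0                                # carry-in of bit 0 tied low
    out = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        c_i = (c >> i) & 1
        lut = (a_i ^ b_i) if add else c_i    # 4-input LUT of a, b, c, add
        mult_and = a_i & add                 # MULT_AND: forced to 0 when add=0
        out |= (lut ^ carry) << i            # XORCY: sum bit
        carry = carry if lut else mult_and   # MUXCY: propagate CI or take DI
    return out
```

With add = 0 the MULT_AND output is always 0, so no carry is ever generated and the XOR stage passes c through unchanged; with add = 1 the chain behaves as an ordinary ripple adder.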

Jan Gray

Reply by Martin Schoeberl ● October 9, 2004

"Jan Gray" <jsgray@acm.org> wrote in message
news:CXJ9d.11048$Vm1.2497@newsread3.news.atl.earthlink.net...
> In Virtex-derived architectures, you can implement
> o = add ? (a + b) : c;
> or
> o = sel ? (a + b) : (a + c);
> or even
> o = addsub ? (addand ? a+b : a-b) : (addand ? a&b : a^b);
> in one LUT per bit.
>
> The trick is to use a MULT_AND to kill the carry propagation when 
> add=0.
> See http://www.fpgacpu.org/log/nov00.html#001112.
>
> But as Philip points out, you'd need five input signals to do
>  o = sel ? (add ? a + b : a - b) : c;
> and I don't think that can be done in one LUT per bit.
>

The original request needs as many as six inputs. In your notation I want to
achieve the following function:
d = ena ? (sload ? c : (addnsub ? a+b : a-b)) : d
However, the Cyclone LC has LAB-wide signals for addnsub, sload and ena.
You only need three of the LUT inputs, for a, b and c, which are available
in arithmetic mode.
For the Spartan LC I can see only the CE signal as an additional 'global'
input that can serve as ena. There are two inputs (FIXINA/B) for the
register load, but it seems to me that GYMUX is statically configured, so
it can't be used for the sload part.

Martin

Reply by Paul Leventis (at home) ● October 15, 2004

Hi All,

> I *think* you have uncovered a bug in Quartus 4.1 synthesis.  I'll confirm
> this with the synthesis team tomorrow.

First of all, I should point out that this is sub-optimal synthesis, NOT a
"bug" -- the design will function, it just uses more logic elements than
necessary.  We *may* fix this in a future release of Quartus, but the
solution will not be easy to implement so don't hold your breath.  The value
is rather limited due to the input limitations explained below, and the
relative rarity of this combination of functions.

In the meantime, there is a work-around.  You can directly instantiate
"stratix_lcells" (the WYSIWYG cell for Stratix/Cyclone LEs).  Below I give
the code (thanks to a helpful synthesis guy) for a registered
adder/subtractor with oodles of extras.  Features:
  - Implements A - B or A + B (depending on signal "addnsub")
  - Registers are synchronously loadable with "data" when synchronous load
"sload" is asserted
  - There is shared clock "clk", clock enable "ena", synchronous clear
"sclr", asynchronous clear "aclr"

A couple caveats:
   - There are only 26 non-global inputs to each LAB in Cyclone (and 30 in
Stratix).  So the fitter will have to split the design over multiple LABs if
you use more than 7 bits in Cyclone, since you need 3 inputs/bit (A, B,
sload_data) plus 4 local control signals and 2 global signals.  Assuming
aclr and clk are global, and the others (addnsub, sload, sclr, ena) are
local, that's 4 extra signals you need.
   - When you stress the number of inputs on a LAB, you run the risk of
having reduced routability, resulting in longer run-times, poor performance,
or unroutable designs in the worst case.  You should try to keep the number
of LAB inputs around 22-24.

When Quartus splits the carry-chain, it must insert extra logic elements to
end the chain and begin the next.  For example, implementing a 10-bit
add/sub with ena/aclr/sclr/sload requires 13 LEs.  Still better than 20 LEs,
but not 1:1.  Also, the remaining unused LEs in the LAB will not be too
useful, since the LAB inputs are nearly saturated.

If you have no sload or a constant sload, you can implement 10 bits/LAB
since you only need 2n + 4 lab lines.
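
The input budget above is simple arithmetic, sketched here in Python. The helper name is hypothetical; the 26 and 30 input figures and the 4 local control signals come from the caveats above, and the cap of 10 LEs per LAB is the LAB size of these families.

```python
LES_PER_LAB = 10  # a Cyclone or Stratix LAB holds 10 LEs


def max_bits_per_lab(lab_inputs, inputs_per_bit, control_signals=4):
    """Bits of the add/sub chain that fit in one LAB's non-global inputs."""
    by_inputs = (lab_inputs - control_signals) // inputs_per_bit
    return min(LES_PER_LAB, by_inputs)


cyclone_with_sload = max_bits_per_lab(26, 3)  # A, B, sload_data per bit
cyclone_no_sload = max_bits_per_lab(26, 2)    # A, B per bit
stratix_with_sload = max_bits_per_lab(30, 3)
```

This reproduces the 7 bits (with sload) and 10 bits (without) quoted for Cyclone; the same arithmetic gives 8 bits with sload for Stratix's 30 inputs.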

Hope this helps!

Paul Leventis
Altera Corp.

************************* VERILOG CODE ******************

// Thanks to Gregg Baeckler for code!

module addsub (clk,a,b,addnsub,sload,sclr,aclr,ena,data,out);
parameter WIDTH = 7;

input [WIDTH-1:0] a;     // Operand A
input [WIDTH-1:0] b;     // Operand B (+B or -B based on addnsub)
input [WIDTH-1:0] data;  // Data to load upon sload
input clk;           // Clock
input addnsub;       // ADD=1, SUBTRACT=0
input sload;         // Triggers synchronous load of register
input sclr;          // Synchronous clear
input aclr;          // Asynchronous clear
input ena;           // Clock enable

output [WIDTH-1:0] out;
wire [WIDTH-1:0] out;
wire [WIDTH-1:0] cout_wires;

// The first cell CIN is special since it has no carry-in.
// Its carry-in will be the addnsub signal
stratix_lcell first_cell (
 .dataa(b[0]),
 .datab(a[0]),
 .datac(data[0]),
 .sload(sload),
 .sclr(sclr),
 .ena(ena),
 .aclr(aclr),
 .clk(clk),
 .inverta(addnsub),
 .regout(out[0]),
 .cout(cout_wires[0])
    );
    defparam first_cell .operation_mode = "arithmetic";
    defparam first_cell .synch_mode = "on";
    defparam first_cell .sum_lutc_input = "cin";
    defparam first_cell .lut_mask = "96b2";
    defparam first_cell .output_mode = "reg_only";

// fill in the rest of the cells in this loop
genvar i;
generate
  for (i=1; i<WIDTH; i=i+1)
  begin : ads
    stratix_lcell my_cell (
     .dataa(b[i]),
     .datab(a[i]),
     .datac(data[i]),
     .sload(sload),
     .sclr(sclr),
     .ena(ena),
     .aclr(aclr),
     .clk(clk),
     .cin(cout_wires[i-1]),
     .inverta(addnsub),
     .regout(out[i]),
     .cout(cout_wires[i])
    );
    defparam my_cell .operation_mode = "arithmetic";
    defparam my_cell .synch_mode = "on";
    defparam my_cell .sum_lutc_input = "cin";
    defparam my_cell .lut_mask = "96b2";
    defparam my_cell .output_mode = "reg_only";
  end
endgenerate

endmodule

Reply by Martin Schoeberl ● October 17, 2004

Paul,

thanks for your suggestion. However, I will stay with plain VHDL and wait
for the synthesizer update :-)

> First of all, I should point out that this is sub-optimal synthesis, NOT a
> "bug" -- the design will function, it just uses more logic elements than
> necessary.  We *may* fix this in a future release of Quartus, but the

I never thought of this as a 'bug' in the sense of producing wrong
results.

> solution will not be easy to implement so don't hold your breath.  The value
> is rather limited due to the input limitations explained below, and the
> relative rarity of this combination of functions.

However, if LAB-wide inputs such as 'sload' and 'ena' are not available
to the synthesizer, you're 'wasting' resources. Do you use these signals for other
functions (perhaps the loadable counter)?

BTW: Do we really need asynchronous signals such as PRN/ALD, ADATA
and CLRN (OK, this one for the asynchronous reset) these days? Isn't that a waste
of resources, useful only to the few designers who do asynchronous design?

> In the meantime, there is a work-around.  You can directly instantiate
> "stratix_lcells" (the WYSIWYG cell for Stratix/Cyclone LEs).  Below I give

Is there some documentation about these WYSIWYG lcells? I was looking for such
an entity in the Megafunctions/LPM help of Quartus (before you provided the solution)
to implement this function. However, I did not find this basic megafunction.

Martin
