# How many Altera LEs to Xilinx Slices?

Started by October 15, 2004
```
Hello All,

I've been designing with Xilinx FPGAs for a while so I'm used to the
"Slice" concept. I'm looking at Altera's Max II as a nice possible
solution for a design.

I took my VHDL code and it synthesized to 40 Slices in a Spartan III.
Then I took the same code and synthesized it for a Max II (using
Quartus II now) and it came to 71 LEs.

I realize that a blanket statement like 71 LEs (approx. =) 40 Slices is
totally dependent on how the code is synthesized.

But is an approximate 1 Slice = 2 LEs a pretty close all-around
estimate?

Thanks
Eric
```
```
Hi Eric,

> But is an approximate 1 Slice = 2 LEs a pretty close all-around
> estimate?

Give or take ~10% as a design-dependent margin and you should be OK.

Best regards,

Ben

```
The problem is not a hardware issue, but a granularity issue.  Slices
are not a good measure of how much logic your design is using.  A slice
has two LUTs and two FFs.  If even one FF is used, the whole slice is
counted as used.  You are better off determining how many LUTs and FFs are
used in each design.  Those counts are much more comparable, although there
will be family-dependent differences in how well the designs can pack into
the larger granules.  Mostly the newer parts will pack logic and FFs more
densely than the older parts.
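
To make the granularity point concrete, here is a minimal Python sketch. It assumes only the simple model described above: a slice holds two LUTs and two FFs and is reported as used as soon as anything occupies it. The helper name `slice_count_bounds` and the example numbers are made up for illustration, and a real packer obeys more constraints than this.

```python
import math

def slice_count_bounds(luts, ffs, luts_per_slice=2, ffs_per_slice=2):
    """Return (best_case, worst_case) reported slice counts for a design.

    Hypothetical helper: models a slice as a bucket of 2 LUTs + 2 FFs
    that counts as "used" as soon as one element occupies it.
    """
    # Best case: every used slice is filled as completely as possible.
    best = max(math.ceil(luts / luts_per_slice),
               math.ceil(ffs / ffs_per_slice))
    # Worst case: every LUT and every FF sits alone in its own slice.
    worst = luts + ffs
    return best, worst

# Illustrative numbers only (not taken from the thread).
best, worst = slice_count_bounds(luts=60, ffs=30)
print(best, worst)
```

Under this model the same 60 LUTs and 30 FFs could be reported as anywhere from 30 to 90 slices, which is why raw LUT and FF counts compare better across families than slice counts do.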

```
Well, given that 1 slice = 2 LUTs + 2 FFs + some more logic, and 1 LE
= 1 LUT + 1 FF + some more logic, it would be expected.

-hpa

```
```
Hi Eric,

> I realize that a blanket statement like 71 LEs (approx. =) 40 Slices is
> totally dependent on how the code is synthesized.
>
> But is an approximate 1 Slice = 2 LEs a pretty close all-around
> estimate?

Yes, that's a good 1st order estimate.  We believe that 1 Slice is equal to
about 1.8 LEs based on average results across a suite of designs, but
mileage will vary from design to design -- this lines up well with your
result though.

One thing you should do is ensure that the CAD tool is trying to use as few
LEs (and slices for Xilinx) as possible.  When you are not filling up the
device, Quartus will not try too hard to put LUTs and FFs into the same
LE -- if there's any chance it will hurt rather than help timing, it will
avoid it.  When you start filling the device close to capacity, Quartus will
try to pack more aggressively.  This is the default "auto" setting for
register packing.

To artificially force Quartus to pack as aggressively as possible into LEs,
go to the menu Assignments/Settings... select the Fitter Settings tab, and
click the "More Settings..." button.  There is a setting called "Auto Packed
Registers -- Max II".  Setting this to Minimize Area w/Chains will cause the
most aggressive packing.

Also, under the Analysis & Synthesis Settings tab, you can try out the
"area" optimization technique, which heuristically cares more about area than
delay, though it doesn't always reduce LE count.

Regards,

Paul Leventis
Altera Corp.

```
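As a quick sanity check of the numbers in this exchange, the Python sketch below compares Eric's measured result (71 LEs for 40 slices) against the two rules of thumb quoted in the thread, 2 LEs per slice and 1.8 LEs per slice. The function names are illustrative. The measured ratio works out to 1.775, about 11% below the 2-LE estimate and about 1.4% below the 1.8-LE estimate; this is one design, so it says little about the average.

```python
def le_per_slice(les, slices):
    """Measured LE-to-slice ratio for one design."""
    return les / slices

def relative_error(measured, rule):
    """Signed deviation of the measured ratio from a rule of thumb."""
    return (measured - rule) / rule

# Eric's numbers from the first post.
measured = le_per_slice(les=71, slices=40)

# The two estimates quoted in the thread.
for rule in (2.0, 1.8):
    print(f"rule {rule}: measured {measured:.3f}, "
          f"off by {relative_error(measured, rule):+.1%}")
```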
I disagree.  The two architectures are different, so you can't compare them
this way.
How many slices does the following code take?
.....
    DI    : in  std_logic;
    DO    : out std_logic;
    CLOCK : in  std_logic;
.....
.......
signal temp : std_logic_vector(15 downto 0);
......
begin

    Demo : process(CLOCK)
    begin
        if rising_edge(CLOCK) then
            -- shift left by one, bringing DI in at bit 0
            temp <= temp(14 downto 0) & DI;
        end if;
    end process Demo;

    -- serial output taken from the last stage
    DO <= temp(15);
....
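
The post does not state the answer. A plausible reading (an assumption, not something the poster confirms) is that a 16-bit shift register with no reset and a single output tap can be absorbed into one LUT on Xilinx parts whose LUTs support a 16-bit shift-register mode (SRL16), while an architecture with one FF per LE and no such mode needs one LE per bit. The Python sketch below only does that arithmetic; the function names and the `srl_depth` parameter are illustrative.

```python
import math

def le_count_for_shift_register(bits):
    """One FF per LE and no LUT shift-register mode: one LE per bit."""
    return bits

def lut_count_with_srl(bits, srl_depth=16):
    """LUTs needed if each LUT can act as an srl_depth-bit shift register.

    Assumes the synthesis tool actually infers this mode, which typically
    needs a register with no reset and only its final tap in use.
    """
    return math.ceil(bits / srl_depth)

bits = 16  # width of `temp` in the VHDL above
print(le_count_for_shift_register(bits), lut_count_with_srl(bits))
```

Under these assumptions the same code costs 16 LEs on one side and a single LUT (half a slice) on the other, so no fixed LE-per-slice ratio fits it.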

```
```>One thing you should do is ensure that the CAD tool is trying to use as few
>LEs (and slices for Xilinx) as possible.  When you are not filling up the
>device, Quartus will not try too hard to put LUTs and FFs into the same
>LE -- if there's any chance it will hurt rather than help timing, it will
>avoid it.  When you start filling the device close to capacity, Quartus will
>try to pack more aggressively.  This is the default "auto" setting for
>register packing.

What would make the timing better if the LUT and FF are not packed
in the same LE?

I'm assuming that there is a very good path connecting the LUT/FF in
the same LE because it is such a common case.  What makes not
using that faster?

```
He is not talking about a LUT and FF that are connected, he means ones
that are separate.  Like a FF with the D input connected to the output
of another FF and a LUT that has its output going to another LUT only.
Unless there is a shortage of IO in the LAB, they can share the same
LE.  Same thing in the Xilinx slice.  Due to crowding of the routing, it
may result in a faster design to keep them separate.

```
```Hi Hal, Rick:

> > What would make the timing better if the LUT and FF are not packed
> > in the same LE?
> He is not talking about a LUT and FF that are connected, he means ones
> that are separate.  Like a FF with the D input connected to the output
> of another FF and a LUT that has its output going to another LUT only.
> Unless there is a shortage of IO in the LAB, they can share the same
> LE.

Rick's got it mostly right.  The Stratix/Cyclone/Max II LE/ALMs can have a
number of register/LUT pairings:
1. LUT feeds FF
2. FF feeds LUT
3. Unrelated FF and 3-input LUT
4. FF->FF connection from adjacent LE and a 4-input LUT (a register chain)
For example, we could pack an 8-bit shift register in with 7 4-LUTs and 1
3-LUT to form 8 LEs.

As Hal observed, it seems like doing #1 (or #2) is always a win.  If you
look at one FF, in our architecture we can choose to pack it with its fan-in
(#1) or fan-out (#2).  For example, if the critical path of the design is on
the output of the FF, through only one of its LUTs, using packing #2 is the
better choice for that flop.  So there is an interesting optimization
problem here.

Some of the LEs created by #1 or #2 will have two separate LE outputs (the
flop and the LUT) in the event that the FF/LUT connection is not single
fanout.  In theory, these multiple-output LEs create a bit more routing
pressure, so you may hurt timing more by making one than you help by
bringing the FF and LUT together.  But our routing architecture has been
designed to tolerate aggressive packing.

One way that using packing #1 or #2 can be sub-optimal is when the flop
really wants to be placed somewhere in between all the things that it
feeds and that feed it.  Packing it with either source or destination might
help one path, but hurt others more than if you just left the FF in a
separate LE and thus were free to move it where it wanted to be during
placement.

Now, when you look at #3, you must be intelligent in how you pack.  If you
take two unrelated functions that otherwise would want to be in opposite
corners of the chip and put them together, you can hurt timing.  Also, as
Rick points out, LEs of this type will have 4 inputs and 2 outputs; if you
make many of them you can start stressing the routing and this can lead to
lower performance.  Incidentally, this packing problem also arises on
Stratix II when it comes to packing multiple functions into an ALM -- if
they are unrelated, you must choose pairings wisely to not hurt performance.

Packing #4 is particularly nasty from a CAD perspective.  Creating these
packings implies a group of LEs that must all be placed in the same LAB
(register chain) and must move as a group.  This further restricts placement
and routing choices, and thus has the largest chance of being a net
negative.  But it can also help reduce the number of LEs in some designs.

Note: The more you pack together into LEs, the closer in general you can
place the LEs of a design, so doing these packings can also help performance
:-)

>Same thing in the Xilinx slice.  Due to crowding of the routing, it
> may result in a faster design to keep them separate.

The trade-offs are likely different here.  The VII slice has some FF packing
capabilities.  It can do #1, but #2 requires use of local routing (I think).
It's not clear to me from the slice diagram whether packing #3 can be done.
#4 is not possible.  Also, I'm not sure how well the architecture responds
to slices with multiple outputs (using the Y and Q outputs at the same
time).  If it was not architected for heavy use of both outputs, there could
be more routing/performance trade-off here.  This is all speculation.

What I do know is when we compare half-slice vs. LE counts on a suite of
designs, we find a ~9% advantage for Quartus + LEs over ISE + slices.  We
believe that the primary reason for this difference is the increased flop
packing density available in the Altera LE.

Regards,

Paul Leventis
Altera Corp.

```
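Two bits of arithmetic from this post can be sketched in Python. The first reproduces the shift-register example: 8 FFs and 8 LUTs need 16 LEs if nothing is packed and 8 LEs if every FF shares an LE with a LUT. The second converts the "~9% advantage" into LEs per slice, assuming (this is only one reading of the figure) that it means the LE count is 9% lower than the half-slice count. The function names are illustrative.

```python
def le_count(ffs, luts, packed_pairs):
    """LEs needed when `packed_pairs` FF/LUT pairs share an LE.

    Each LE holds at most one FF and one LUT, so every packed pair
    saves one LE compared with placing the FF and the LUT separately.
    """
    if packed_pairs > min(ffs, luts):
        raise ValueError("cannot pack more pairs than there are FFs or LUTs")
    return ffs + luts - packed_pairs

# The example from the post: an 8-bit shift register plus 8 LUTs
# (seven 4-input, one 3-input).
unpacked = le_count(ffs=8, luts=8, packed_pairs=0)
packed = le_count(ffs=8, luts=8, packed_pairs=8)
print(unpacked, packed)

def le_per_slice_from_advantage(advantage):
    """LEs per slice if the LE count is `advantage` below the half-slice count."""
    half_slices_per_slice = 2
    return half_slices_per_slice * (1.0 - advantage)

print(round(le_per_slice_from_advantage(0.09), 2))
```

Read this way, a 9% advantage over half-slices gives roughly 1.82 LEs per slice, which is in line with the "about 1.8 LEs" per slice quoted earlier in the thread.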
```
In article <4170D026.5193E181@yahoo.com>, rickman  <john@bluepal.net> wrote:
>Hal Murray wrote:
>> I'm assuming that there is a very good path connecting the LUT/FF in
>> the same LE because it is such a common case.  What makes not
>> using that faster?
>
>He is not talking about a LUT and FF that are connected, he means ones
>that are separate.  Like a FF with the D input connected to the output
>of another FF and a LUT that has its output going to another LUT only.
>Unless there is a shortage of IO in the LAB, they can share the same
>LE.  Same thing in the Xilinx slice.  Due to crowding of the routing, it
>may result in a faster design to keep them separate.

Not just routing, but also placement:  The separate pieces (FFs, LUTs
etc) are not placed independently, but are packed together and then
placed.  Thus if unrelated logic is packed together inappropriately,
the placement for the packed component may be significantly worse than
if each component was placed separately.
```
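A toy Python model of this placement effect follows. It assumes each component has an ideal grid position and that the cost of a placement is the Manhattan distance from that position. Both assumptions are simplifications invented for illustration, not how any real placer scores a design. Placed separately, each component can sit at its ideal spot at zero cost; packed, they must share one location.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def packed_cost(ideal_positions, grid_size):
    """Total displacement when all components must share one location.

    Tries every cell of a grid_size x grid_size grid and keeps the best.
    """
    candidates = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    return min(sum(manhattan(c, p) for p in ideal_positions)
               for c in candidates)

# Hypothetical: an FF that wants one corner and a LUT that wants the other.
unrelated = [(0, 0), (9, 9)]
# Hypothetical: an FF and a LUT that want to be neighbours anyway.
related = [(4, 4), (4, 5)]

print(packed_cost(unrelated, grid_size=10))
print(packed_cost(related, grid_size=10))
```

In this model, packing the unrelated pair costs 18 grid units of displacement no matter where the shared LE goes, while packing the related pair costs only 1.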