>> Fire up your synthesis tool and see how many resources you'd really need!

Yes, that is my point.
The two architectures have different resources, so the way you write your
code (your coding style) makes the difference.

Walter.


"Arash Salarian" <arash.salarian@epfl.ch> wrote in message
news:417397f6$1@epflnews.epfl.ch...
> "Walter Gallegos" <walter@chasque.apc.org> wrote in message
> news:10n2h60q87tig87@news.supernews.com...
> > The answer is
> >
> >      1 slice in a Spartan 3
> >    16 LEs  in a MAX-II
> >
> > Can you compare these architectures as 1 Slice = 2 LEs?
> >
>
> I agree that there are some areas where you can't simply compare the two
> architectures. For example, I had an old design on an Altera 10K series
> part that used a fully async RAM block. Move it to a Spartan 3
> architecture and you find you would need the whole chip just to build
> that block of async RAM!
> However, it is perfectly understandable that a user might need to compare
> the available options, and to do that he or she needs rough estimates to
> compare a Xilinx device with an Altera one. For example, I recently had an
> interesting offer for an FPGA prototype board at the same price of $99
> with either an Altera EP1C12 or a Xilinx XC3S400. I would like to use a
> prototype board for very different designs, so I had to compare the two
> chips. As I program in VHDL and use synthesis tools, I don't really care
> about any specific architecture (unless something like your example or my
> example above happens), and what matters in cases like that is simply
> finding the BIGGER FPGA. To do that you need to compare, and to compare
> you can only use rough estimates.
> Personally, I find the simple equation of 1 Slice = 2 LEs a very good
> rough estimate, and for many designs it gives you a good answer. You have
> a very specific design and need a very good answer? Fire up your synthesis
> tool and see how many resources you'd really need!
>
>

Yeah, me too.

glen herrmannsfeldt wrote:

> Ray Andraka wrote:
>
> > Depends heavily on the design.  Xilinx packs tighter for certain
> > arithmetic because of the structure of the LUT and carry chain: Altera's
> > carry chain through Stratix breaks the 4-LUT into a pair of 3-LUTs, one
> > for sum and one for carry, so it limits the number of inputs per bit.
>
> (snip)
>
> I still miss the XC4000 series where the carry chain was separate
> from the LUTs, for convenient implementation of saturating adders
> and MAX(a,b) functions by feeding the carry out or overflow
> back to an LUT input.
>
> -- glen

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

 "They that give up essential liberty to obtain a little
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Ray Andraka wrote:

> Depends heavily on the design.  Xilinx packs tighter for certain
> arithmetic because of the structure of the LUT and carry chain: Altera's
> carry chain through Stratix breaks the 4-LUT into a pair of 3-LUTs, one
> for sum and one for carry, so it limits the number of inputs per bit.

(snip)

I still miss the XC4000 series where the carry chain was separate
from the LUTs, for convenient implementation of saturating adders
and MAX(a,b) functions by feeding the carry out or overflow
back to an LUT input.

-- glen
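
For readers who have not met these functions, here is a minimal sketch of
the two cases mentioned above, assuming unsigned operands. The entity name,
port names and the WIDTH generic are hypothetical and chosen only for
illustration. The adder and subtractor are each widened by one bit so that
the carry (or borrow) out is available as the select for the final
multiplexer. Whether a given tool and device can feed that carry out
straight back into a LUT, as described for the XC4000, is not something
this code guarantees.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; names and the WIDTH generic are assumptions.
entity sat_add_max is
   generic (WIDTH : positive := 8);
   port (
      A       : in  unsigned(WIDTH-1 downto 0);
      B       : in  unsigned(WIDTH-1 downto 0);
      SAT_SUM : out unsigned(WIDTH-1 downto 0);  -- A + B, clamped at all ones
      MAX_AB  : out unsigned(WIDTH-1 downto 0)   -- larger of A and B
   );
end entity sat_add_max;

architecture rtl of sat_add_max is
   -- One extra bit so the carry/borrow out of the chain is visible.
   signal sum_ext  : unsigned(WIDTH downto 0);
   signal diff_ext : unsigned(WIDTH downto 0);
begin
   sum_ext  <= ('0' & A) + ('0' & B);
   diff_ext <= ('0' & A) - ('0' & B);

   -- Saturating add: on carry out, force the result to the maximum value.
   SAT_SUM <= (others => '1') when sum_ext(WIDTH) = '1'
              else sum_ext(WIDTH-1 downto 0);

   -- MAX(a,b): the top bit of A - B is set exactly when A < B.
   MAX_AB <= B when diff_ext(WIDTH) = '1' else A;
end architecture rtl;
```

In both outputs the select signal comes off the end of the carry chain,
which is why the routing from carry out back to a LUT input matters for
the logic depth of these functions.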

Depends heavily on the design.  Xilinx packs tighter for certain
arithmetic because of the structure of the LUT and carry chain: Altera's
carry chain through Stratix breaks the 4-LUT into a pair of 3-LUTs, one
for sum and one for carry, so it limits the number of inputs per bit.
Stratix adds a little extra logic to the LE to allow implementation of an
adder/subtractor without going to two levels of logic, and there is a way
to load data bypassing the adder, which provides single-level solutions for
those specific (and fairly common) cases.  Xilinx will also allow you to
turn a LUT into a 16-element shift register, which can be very handy not
only for shift register delays, but also for reloadable LUTs, which are
useful for things like adaptive DA (distributed arithmetic) filters.
Altera has more options for the memory structure, which in many cases makes
it more efficient for certain types of designs requiring memory.  My point
is that both vendors' offerings have some strong points, and which one is
best depends heavily on your application.
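
As an illustration of the reloadable LUT idea, here is a minimal sketch of
a 16-entry table whose contents are shifted in serially and read back
through a 4-bit address. The entity and port names are hypothetical. The
coding style (no reset, clock enable, addressed read of the shift register)
is the kind that synthesis tools for Xilinx parts can commonly map onto a
single shift-register LUT, but that mapping is an assumption about the tool
and not something the code itself guarantees. On an architecture without
that mode it would be expected to build 16 flip-flops plus a 16:1
multiplexer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; all names are assumptions for illustration.
entity reload_lut is
   port (
      CLK  : in  std_logic;
      CE   : in  std_logic;             -- high while new contents are loaded
      DIN  : in  std_logic;             -- serial load data
      ADDR : in  unsigned(3 downto 0);  -- the four "LUT inputs"
      Q    : out std_logic              -- table entry selected by ADDR
   );
end entity reload_lut;

architecture rtl of reload_lut is
   -- Table contents; the initial value is an assumption so that
   -- simulation does not start with 'U' values.
   signal table : std_logic_vector(15 downto 0) := (others => '0');
begin

   -- Shift one new bit in per enabled clock; 16 enabled clocks
   -- replace the whole table.
   Load : process(CLK)
   begin
      if rising_edge(CLK) then
         if CE = '1' then
            table <= table(14 downto 0) & DIN;
         end if;
      end if;
   end process Load;

   -- Asynchronous read, exactly like a 4-input LUT lookup.
   Q <= table(to_integer(ADDR));

end architecture rtl;
```

With CE held low the block behaves as a fixed 4-input function of ADDR;
pulsing CE for 16 clocks loads a new function, which is what an adaptive
filter would do when its coefficients change.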

Guitarman wrote:

> Hello All,
>
> I've been designing with Xilinx FPGAs for a while so I'm used to the
> "Slice" concept. I'm looking at Altera's Max II as a nice possible
> solution for a design.
>
> I took my VHDL code and it synthesized to 40 Slices in a Spartan III.
> Then I took the same code and synthesized it for a Max II (using
> Quartus II now) and it was 71 LEs.
>
> I realize a blanket statement that 71 LEs (approx. =) 40 Slices is totally
> dependent on how the code is synthesized.
>
> But is an approximate 1 Slice = 2 LEs a pretty close all-around
> estimate?
>
> Thanks
> Eric

--
--Ray Andraka, P.E.

"Walter Gallegos" <walter@chasque.apc.org> wrote in message 
news:10n2h60q87tig87@news.supernews.com...
> The answer is
>
>      1 slice in a Spartan 3
>    16 LEs  in a MAX-II
>
> Can you compare these architectures as 1 Slice = 2 LEs?
>

I agree that there are some areas where you can't simply compare the two
architectures. For example, I had an old design on an Altera 10K series
part that used a fully async RAM block. Move it to a Spartan 3 architecture
and you find you would need the whole chip just to build that block of
async RAM!
However, it is perfectly understandable that a user might need to compare
the available options, and to do that he or she needs rough estimates to
compare a Xilinx device with an Altera one. For example, I recently had an
interesting offer for an FPGA prototype board at the same price of $99 with
either an Altera EP1C12 or a Xilinx XC3S400. I would like to use a
prototype board for very different designs, so I had to compare the two
chips. As I program in VHDL and use synthesis tools, I don't really care
about any specific architecture (unless something like your example or my
example above happens), and what matters in cases like that is simply
finding the BIGGER FPGA. To do that you need to compare, and to compare you
can only use rough estimates.
Personally, I find the simple equation of 1 Slice = 2 LEs a very good rough
estimate, and for many designs it gives you a good answer. You have a very
specific design and need a very good answer? Fire up your synthesis tool
and see how many resources you'd really need!

The answer is

      1 slice in a Spartan 3
    16 LEs  in a MAX-II

Can you compare these architectures as 1 Slice = 2 LEs?

Walter.


"Walter Gallegos" <walter@chasque.apc.org> wrote in message
news:10n13v2dqalbv6a@news.supernews.com...
>
> "Guitarman" <ericjohnholland@hotmail.com> wrote in message
> news:90282e35.0410151112.77a87654@posting.google.com...
> > Hello All,
> >
> > I've been designing with Xilinx FPGAs for a while so I'm used to the
> > "Slice" concept. I'm looking at Altera's Max II as a nice possible
> > solution for a design.
> >
> > I took my VHDL code and it synthesized to 40 Slices in a Spartan III.
> > Then I took the same code and synthesized it for a Max II (using
> > Quartus II now) and it was 71 LEs.
> >
> > I realize a blanket statement that 71 LEs (approx. =) 40 Slices is totally
> > dependent on how the code is synthesized.
> >
> > But is an approximate 1 Slice = 2 LEs a pretty close all-around
> > estimate?
> >
> > Thanks
> > Eric
>
> I disagree. The two architectures are different; you can't compare them this
> way. How many slices does the following code take?
> .....
>         DI : in std_logic;
>         DO : out std_logic;
>         CLOCK : in std_logic;
> .....
> .......
>    signal temp: std_logic_vector(15 downto 0);
> ......
> begin
>
>    Demo : process(CLOCK)
>    begin
>       if rising_edge(CLOCK) then
>          temp<= temp(14 downto 0) & DI;
>       end if;
>    end process Demo;
>
>    DO <= temp(15);
> ....
>
>
>
>
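
For anyone who wants to repeat the experiment in both tool chains, here is
a self-contained version of the shift register fragment quoted above. The
entity name shift16, the architecture name and the initial value of temp
are assumptions added for illustration; the ports and the process body are
taken from the quoted post. Treat it as a sketch: the 1 slice and 16 LE
figures are the ones reported in this thread, and what a particular tool
version produces may differ with its settings. Note that there is no reset,
and adding one is a common reason for a tool to fall back to 16 individual
flip-flops instead of a LUT-based shift register.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: the entity name "shift16" is not from the
-- original post, only the three ports are.
entity shift16 is
   port (
      DI    : in  std_logic;
      DO    : out std_logic;
      CLOCK : in  std_logic
   );
end entity shift16;

architecture rtl of shift16 is
   -- 16-stage delay line.  The initial value is an assumption so that
   -- simulation does not start with 'U' values.
   signal temp : std_logic_vector(15 downto 0) := (others => '0');
begin

   -- Shift left by one on every rising clock edge, taking DI as the new LSB.
   Demo : process(CLOCK)
   begin
      if rising_edge(CLOCK) then
         temp <= temp(14 downto 0) & DI;
      end if;
   end process Demo;

   -- DI appears on DO 16 clock edges after it was sampled.
   DO <= temp(15);

end architecture rtl;
```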

In article <4170D026.5193E181@yahoo.com>, rickman  <john@bluepal.net> wrote:
>Hal Murray wrote:
>> I'm assuming that there is a very good path connecting the LUT/FF in
>> the same LE because it is such a common case.  What makes not
>> using that faster?
>
>He is not talking about a LUT and FF that are connected, he means ones
>that are separate.  Like a FF with the D input connected to the output
>of another FF and a LUT that has its output going to another LUT only. 
>Unless there is a shortage of IO in the LAB, they can share the same
>LE.  Same thing in the Xilinx slice.  Due to crowding of the routing, it
>may result in a faster design to keep them separate.  

Not just routing, but also placement:  The separate pieces (FFs, LUTs
etc) are not placed independently, but are packed together and then
placed.  Thus if unrelated logic is packed together inappropriately,
the placement for the packed component may be significantly worse than
if each component was placed separately.
-- 
Nicholas C. Weaver.  to reply email to "nweaver" at the domain
icsi.berkeley.edu

Hi Hal, Rick:

> > What would make the timing better if the LUT and FF are not packed
> > in the same LE?
> He is not talking about a LUT and FF that are connected, he means ones
> that are separate.  Like a FF with the D input connected to the output
> of another FF and a LUT that has its output going to another LUT only.
> Unless there is a shortage of IO in the LAB, they can share the same
> LE.

Rick's got it mostly right.  The Stratix/Cyclone/Max II LE/ALMs can have a
number of register/LUT pairings:
1. LUT feeds FF
2. FF feeds LUT
3. Unrelated FF and 3-input LUT
4. FF->FF connection from adjacent LE and a 4-input LUT (a register chain)
For example, we could pack an 8-bit shift register in with 7 4-LUTs and 1
3-LUT to form 8 LEs.

As Hal observed, it seems like doing #1 (or #2) is always a win.  If you
look at one FF, in our architecture we can choose to pack it with its fan-in
(#1) or fan-out (#2).  For example, if the critical path of the design is on
the output of the FF, through only one of its LUTs, using packing #2 is the
better choice for that flop.  So there is an interesting optimization
problem here.

Some of the LEs created by #1 or #2 will have two separate LE outputs (the
flop and the LUT) in the event that the FF/LUT connection is not single
fanout.  In theory, these multiple output LEs create a bit more routing
pressure and so you may hurt timing more by making one than you do by
bringing the FF and LUT together.  But our routing architecture has been
designed to tolerate aggressive packing.

One way that using packing #1 or #2 can be sub-optimal is in the event where
the flop really wants to be placed somewhere in between all the things that
it feeds and that feed it.  Packing it with either source or destination might
help one path, but hurt others more than if you just left the FF in a
separate LE and thus were free to move it where it wanted to be during
placement.

Now, when you look at #3, you must be intelligent in how you pack.  If you
take two unrelated functions that otherwise would want to be in opposite
corners of the chip and put them together, you can hurt timing.  Also, as
Rick points out, LEs of this type will have 4 inputs and 2 outputs; if you
make many of them you can start stressing the routing and this can lead to
lower performance.  Incidentally, this packing problem also arises on
Stratix II when it comes to packing multiple functions into an ALM -- if
they are unrelated, you must choose pairings wisely to not hurt performance.

Packing #4 is particularly nasty from a CAD perspective.  Creating these
packings implies a group of LEs that must all be placed in the same LAB
(register chain) and must move as a group.  This further restricts placement
and routing choices, and thus has the largest chance of being a net
negative.  But it can also help reduce the number of LEs in some designs.

Note: The more you pack together into LEs, the closer in general you can
place the LEs of a design, so doing these packings can also help performance
:-)

>Same thing in the Xilinx slice.  Due to crowding of the routing, it
> may result in a faster design to keep them separate.

The trade-offs are likely different here.  The Virtex-II (VII) slice has some
FF packing capabilities.  It can do #1, but #2 requires use of local routing
(I think).
It's not clear to me from the slice diagram whether packing #3 can be done.
#4 is not possible.  Also, I'm not sure how well the architecture responds
to slices with multiple outputs (using the Y and Q outputs at the same
time).  If it was not architected for heavy use of both outputs, there could
be more routing/performance trade-off here.  This is all speculation.

What I do know is when we compare half-slice vs. LE counts on a suite of
designs, we find a ~9% advantage for Quartus + LEs over ISE + slices.  We
believe that the primary reason for this difference is the increased flop
packing density available in the Altera LE.

Regards,

Paul Leventis
Altera Corp.

Hal Murray wrote:
> 
> >One thing you should do is ensure that the CAD tool is trying to use as few
> >LEs (and slices for Xilinx) as possible.  When you are not filling up the
> >device, Quartus will not try too hard to put LUTs and FFs into the same
> >LE -- if there's any chance it will hurt rather than help timing, it will
> >avoid it.  When you start filling the device close to capacity, Quartus will
> >try to pack more aggressively.  This is the default "auto" setting for
> >register packing.
> 
> What would make the timing better if the LUT and FF are not packed
> in the same LE?
> 
> I'm assuming that there is a very good path connecting the LUT/FF in
> the same LE because it is such a common case.  What makes not
> using that faster?

He is not talking about a LUT and FF that are connected, he means ones
that are separate.  Like a FF with the D input connected to the output
of another FF and a LUT that has its output going to another LUT only. 
Unless there is a shortage of IO in the LAB, they can share the same
LE.  Same thing in the Xilinx slice.  Due to crowding of the routing, it
may result in a faster design to keep them separate.  

-- 

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

>One thing you should do is ensure that the CAD tool is trying to use as few
>LEs (and slices for Xilinx) as possible.  When you are not filling up the
>device, Quartus will not try too hard to put LUTs and FFs into the same
>LE -- if there's any chance it will hurt rather than help timing, it will
>avoid it.  When you start filling the device close to capacity, Quartus will
>try to pack more aggressively.  This is the default "auto" setting for
>register packing.

What would make the timing better if the LUT and FF are not packed
in the same LE?

I'm assuming that there is a very good path connecting the LUT/FF in
the same LE because it is such a common case.  What makes not
using that faster?

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.