Comp.Arch.FPGA | Advice on Xilinx Spelunking

There are 17 messages in this thread.

You are currently looking at messages 0 to 10.

Advice on Xilinx Spelunking - Rob Gaddi - 2010-05-25 17:32:00

I've got a Spartan 6 design that I'm working with under ISE 11.5.  A 
code block that I would expect to take up about 200 LUTs is taking 800 
instead.  600 LUTs wouldn't be the end of the world, except I'm planning 
to replicate this block 32 times, which puts me well over the top.

So the question becomes where are all of the LUTs going?  There's 
nothing in the XST status report for the section that would imply 
anywhere near this much utilization.  I've tried looking over the RTL 
schematic; it's difficult to read and from what I could make out, there 
still wasn't anything to explain all those LUTs.  Then I tried looking 
through the technology schematic instead.  The viewer took forever to 
open the schematic, and when I finally got it open it took better than a 
minute any time I wanted to refresh the screen.  Needless to say, this 
got me nowhere.

So, I'm out for advice.  Any suggestions on figuring out just where all 
of those LUTs are going?

Thanks,
Rob

-- 
Rob Gaddi, Highland Technology
Email address is currently out of order



Re: Advice on Xilinx Spelunking - glen herrmannsfeldt - 2010-05-25 17:54:00

Rob Gaddi <r...@technologyhighland.com> wrote:
> I've got a Spartan 6 design that I'm working with under ISE 11.5.  A 
> code block that I would expect to take up about 200 LUTs is taking 800 
> instead.  600 LUTs wouldn't be the end of the world, except I'm planning 
> to replicate this block 32 times, which puts me well over the top.

How full is the FPGA that you are targeting?  If not so full, I
believe that the tools don't try so hard.  Well, actually the LUT
count shouldn't be so far off, but the CLB count can change, as
it doesn't fill each CLB.

Otherwise, without knowing about the design it is hard to say.

Can you say a little about the logic?  How many counters, adders, RAMs.

Maybe it is using CLB for RAM, instead of BRAM?  

-- glen

Re: Advice on Xilinx Spelunking - John_H - 2010-05-25 18:42:00

On May 25, 5:32 pm, Rob Gaddi
<rga...@technologyhighland.com> wrote:
> I've got a Spartan 6 design that I'm working with under ISE 11.5.  A
> code block that I would expect to take up about 200 LUTs is taking 800
> instead.  600 LUTs wouldn't be the end of the world, except I'm planning
> to replicate this block 32 times, which puts me well over the top.
>
> So the question becomes where are all of the LUTs going?  There's
> nothing in the XST status report for the section that would imply
> anywhere near this much utilization.  I've tried looking over the RTL
> schematic; it's difficult to read and from what I could make out, there
> still wasn't anything to explain all those LUTs.  Then I tried looking
> through the technology schematic instead.  The viewer took forever to
> open the schematic, and when I finally got it open it took better than a
> minute any time I wanted to refresh the screen.  Needless to say, this
> got me nowhere.
>
> So, I'm out for advice.  Any suggestions on figuring out just where all
> of those LUTs are going?
>
> Thanks,
> Rob
>
> --
> Rob Gaddi, Highland Technology
> Email address is currently out of order

A good technology view will make the world of difference.  But it
seems Xilinx isn't giving you that.  I used the Synplify synthesizer's
HDL Analyst to get a superb technology view that allowed me to
understand the occasional oddity the synthesizer would produce from my
code.  I found that technology viewer to be a truly top-notch product
and sincerely helpful in keeping a design on track.

I've only glanced at the Xilinx technology viewer, seeing that it
looked like a last-gen VW beetle compared to a modern day Lexus in the
HDL Analyst.  It may do the job but it won't be a comfortable job if
it gets too involved.

Re: Advice on Xilinx Spelunking - Symon - 2010-05-25 18:55:00

On 5/25/2010 10:32 PM, Rob Gaddi wrote:
>
> So the question becomes where are all of the LUTs going?
>
> Thanks,
> Rob
>
Does ISE11.5 have FPGA editor?

Syms.


Re: Advice on Xilinx Spelunking - Rob Gaddi - 2010-05-25 19:39:00

On 5/25/2010 2:54 PM, glen herrmannsfeldt wrote:
> Rob Gaddi<r...@technologyhighland.com>  wrote:
>> I've got a Spartan 6 design that I'm working with under ISE 11.5.  A
>> code block that I would expect to take up about 200 LUTs is taking 800
>> instead.  600 LUTs wouldn't be the end of the world, except I'm planning
>> to replicate this block 32 times, which puts me well over the top.
>
> How full is the FPGA that you are targeting?  If not so full, I
> believe that the tools don't try so hard.  Well, actually the LUT
> count shouldn't be so far off, but the CLB count can change, as
> it doesn't fill each CLB.
>
> Otherwise, without knowing about the design it is hard to say.
>
> Can you say a little about the logic?  How many counters, adders, RAMs.
>
> Maybe it is using CLB for RAM, instead of BRAM?
>
> -- glen

Sure.  The widget in question does 8 pole IIR filtering of 16 bit data 
using 48-bit internal data paths.  The actual add/multiply/add math is 
taken care of by a subblock that uses a DSP48 slice and 222 LUTs that 
I'm not counting towards the 800.

The block I'm looking at is the wrapper that sequences the math 
operations and holds the internal states.  The logic infers two 48 bit 
LUT RAMs, one dual port, and one quad port.  There's a 24-bit LUT RAM 
and a 24 bit adder that I use to implement an FIR prefilter (the 8 zeros 
at z=-1 that you get from the bilinear transform of an 8 pole filter). 
There's an FSM with four states, and a couple of 3 bit counters.  There 
are two 18 bit comparators, but most of the LSBs of them should optimize 
out.

I'll append the code here.  I'm not bothering to include pkg_bus as 
well, but it just defines a simple WISHBONE bus and a few constants.

--

library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
use IEEE.STD_LOGIC_MISC.all;

use work.pkg_bus.all;

-- Xilinx specific macro library
-- library UNISIM;
-- use UNISIM.VComponents.all;

entity filter is
   port (
     -- Data path
     din   : in  signed(15 downto 0);
     nd    : in  boolean;

     dout  : out signed(15 downto 0);
     drdy  : out boolean;

     -- Coefficient path
     WB_IN : in  t_wb_mosi;
     WB_OUT  : out t_wb_miso;

     WB_SYS  : in  t_wb_sys
   );
end entity filter;

architecture Behavioral of filter is

   alias clk : std_logic is WB_SYS.CLK_I;
   alias rst : std_logic is WB_SYS.RST_I;

   -- Component declaration of the "filter_math" unit defined in
   -- file: "./src/vhdl/filter_math.vhd"
   component filter_math
   port(
     data : in SIGNED(47 downto 0);
     pre : in SIGNED(47 downto 0);
     post : in SIGNED(47 downto 0);
     k : in SIGNED(47 downto 0);
     lsd_nd : in BOOLEAN;
     ichg : out BOOLEAN;
     irdy : out BOOLEAN;
     y : out SIGNED(47 downto 0);
     lsd_rdy : out BOOLEAN;
     msd_rdy : out BOOLEAN;
     clk : in STD_LOGIC);
   end component;
   for all: filter_math use entity work.filter_math(Xilinx_DSP48A1);

   -- We're going to use a whole mess o' RAMs to store various
   -- and sundry.
   subtype t_data  is signed(47 downto 0);
   constant POLES    : integer := 8;
   constant MAX_IDX  : integer := POLES-1;

   -- Data memory is S3.45.
   subtype t_idx is integer range 0 to MAX_IDX;
   type t_ram    is array(t_idx) of t_data;
   signal ram_dat  : t_ram := (others => (others => '0'));

   subtype t_uns_idx is unsigned(2 downto 0);
   signal write_idx  : t_uns_idx;
   signal read_idx   : t_uns_idx;

   -- Coefficient memory is also S3.45, but since we're
   -- writing it from a 16 bit data bus, we need to be
   -- able to access it a word at a time.
   --
   type t_coefram  is array(t_idx) of t_wb_data;
   signal ram_k_hi : t_coefram := (others => (others => '0'));
   signal ram_k_md : t_coefram := (others => (others => '0'));
   signal ram_k_lo : t_coefram := (others => (others => '0'));

   -- As seen from the memory bus, the coefficients are
   -- 64 bits long.  The uppermost word of this is shared
   -- between all coefficients, and is the filter control
   -- word.

   signal fcw     : t_wb_data;

   -- Bits 2:0 are POLES_USED, which should be an odd number
   -- equal to the number of poles for this filter - 1.  Any
   -- even number here, including zero, will code for no filter.
   alias poles_used : std_logic_vector is fcw(2 downto 0);

   -- Hook the data up to the math core
   signal data : t_data;
   signal pre  : t_data;
   signal post : t_data;
   signal k  : t_data;
   signal y  : t_data;

   signal go   : boolean;

   signal lsd_nd : boolean;
   signal ichg   : boolean;
   signal irdy   : boolean;
   signal lsd_rdy  : boolean;
   signal msd_rdy  : boolean;

   -- Downstream of the math core we'll apply a cascade of 2 pole
   -- boxcar filters in order to put some zeros.  One bit growth per
   -- stage brings us to S1.23 when we're done.

   subtype t_fir_data is signed(din'length + POLES - 1 downto 0);
   type t_firram is array(t_idx) of t_fir_data;
   signal fir_cascade  : t_firram := (others => (others => '0'));
   signal fir_idx    : t_uns_idx;
   signal fir_din    : t_fir_data;

   -- Internal states of things
   signal fir_drdy   : boolean;
   signal use_fir_data : boolean;

   type t_state is (IDLE, FIR, IIR, RESET);
   signal state : t_state := RESET;

   -- LFSR noise generator.  When we first extend the 16 bit data to 24
   -- bits for the FIR filter, adding this noise in below the LSB helps
   -- make sure the IIR filters don't get into long, drawn out settlings.
   signal lfsr : std_logic_vector(22 downto 1) := (others => '0');

begin
   -------------------------------------------------------------------------
   --  Make sure our constants are compiled correctly.
   -------------------------------------------------------------------------

   assert (2**write_idx'length = POLES)
     report "Length of RAM index does not correspond to number of poles."
     severity failure;

   -------------------------------------------------------------------------
   --  Connect up the asynchronous data paths.
   -------------------------------------------------------------------------

   -- FIR data is in S1.23, the math core is expecting S3.45
   data  <=  SHIFT_LEFT(RESIZE(fir_din, data'length), 45-23) when use_fir_data
         else y;

   lsd_nd  <=  fir_drdy when use_fir_data else
         lsd_rdy;

   -- Everything else comes out of the RAMs.  ram_k has one r/w port and one
   -- read port, ram_dat has one write port and two read ports.
   --

   pre   <= ram_dat(TO_INTEGER(read_idx or "001"));
   post  <= ram_dat(TO_INTEGER(read_idx));
   k   <=  SIGNED(ram_k_hi(TO_INTEGER(read_idx))) &
         SIGNED(ram_k_md(TO_INTEGER(read_idx))) &
         SIGNED(ram_k_lo(TO_INTEGER(read_idx)));

   -- Instantiate our math core.
   MATH : filter_math
     port map(
       data => data,
       pre => pre,
       post => post,
       k => k,
       lsd_nd => lsd_nd,
       ichg => ichg,
       irdy => irdy,
       y => y,
       lsd_rdy => lsd_rdy,
       msd_rdy => msd_rdy,
       clk => clk
     );

   -------------------------------------------------------------------------
   --  WISHBONE coefficient readback.
   -------------------------------------------------------------------------

   WB_READBACK: process(WB_IN, fcw, ram_k_hi, ram_k_md, ram_k_lo)
     variable read_addr  : integer range 0 to MAX_IDX;
     variable word_addr  : integer range 0 to 3;

   begin
     read_addr := TO_INTEGER(WB_IN.ADDR(1 + read_idx'length downto 2));
     word_addr := TO_INTEGER(WB_IN.ADDR(1 downto 0));

     WB_OUT  <= WB_BADA_SLAVE;

     if read_addr <= MAX_IDX then
       case word_addr is
         when 0 => WB_OUT.DAT <= fcw;
         when 1 => WB_OUT.DAT <= ram_k_hi(read_addr);
         when 2 => WB_OUT.DAT <= ram_k_md(read_addr);
         when 3 => WB_OUT.DAT <= ram_k_lo(read_addr);
       end case;
     end if;

   end process WB_READBACK;

   -------------------------------------------------------------------------
   --  Wrangle the big state machine.
   -------------------------------------------------------------------------

   MACHINE: process
     variable write_addr : integer range 0 to 31;
     variable word_addr  : integer range 0 to 3;
     variable current  : t_data;

     variable unclamped  : signed(17 downto 0);  -- S3.15 number

   begin
     wait until rising_edge(clk);
     drdy    <= false;
     fir_drdy  <= false;

     if nd then
       assert (state = IDLE)
         report "New data request before IDLE state."
         severity error;
     end if;

     case state is
       when IDLE =>
         -- Hold things in the start state.
         use_fir_data  <= true;
         read_idx    <= (others => '0');
         write_idx   <= (others => '0');

         if nd then
           if (poles_used(0) = '0') then
             -- Allow for no filter at all
             dout  <= din;
             drdy  <= true;
             state <= IDLE;
           else
             -- Start our FIR filter with din at the MSBs.
             state   <= FIR;
             fir_idx   <= UNSIGNED(poles_used);
             fir_din   <= SHIFT_LEFT(
                     RESIZE(din & lfsr(lfsr'high), fir_din'length),
                     fir_din'length - din'length - 1
                   );
           end if;

         else
           state <= IDLE;
         end if;

       when FIR =>
         -- Store the value, push the average forward.
         fir_cascade(TO_INTEGER(fir_idx)) <= fir_din;

         fir_din <= SHIFT_RIGHT(fir_din, 1) +
               SHIFT_RIGHT(fir_cascade(TO_INTEGER(fir_idx)), 1);

         if (fir_idx = 0) then
           -- Start the IIR filter.  Repurpose the FIR index to count
           -- down the number of poles to do.
           state   <= IIR;
           fir_drdy  <= true;
           fir_idx   <= UNSIGNED(poles_used);
         else
           fir_idx   <= fir_idx - 1;
         end if;

       when IIR =>
         -- The main responsibilities are updating
         -- the pointers and updating the stored data.

         if msd_rdy then

           -- Update the stored data and advance the
           -- write pointer.  Also decrement the FIR index, which
           -- we're just using to count IIR stages at this point.

           ram_dat(TO_INTEGER(write_idx))  <= y;
           write_idx <= write_idx + 1;
           fir_idx   <= fir_idx - 1;

           if (fir_idx = 0) then
             state   <= IDLE;
             write_idx <= (others => '0');

             -- We've treated the data as S3.45 all the
             -- way through.  First, remap it to S3.15
             unclamped := RESIZE(SHIFT_RIGHT(y, 45-15), 18);

             -- Now clamp any excess.
             if TO_INTEGER(unclamped) >= 2**15 then
               dout  <= x"7FFF";

             elsif TO_INTEGER(unclamped) <= -(2**15) then
               dout  <= x"8001";

             else
               dout  <= RESIZE(unclamped, 16);

             end if;
             drdy    <= true;

           end if;

         elsif ichg and not lsd_nd then
           -- We can advance the read index ahead of
           -- time.
           read_idx <= write_idx + 1;

           if (fir_idx = 0) then
             use_fir_data  <= true;

           else
             use_fir_data  <= false;

           end if;

         end if;

       when RESET =>
         -- Initialize the states for both filters
         ram_dat(TO_INTEGER(write_idx))  <= (others => '0');
         fir_cascade(TO_INTEGER(fir_idx))<= (others => '0');

         if (fir_idx = 0) then
           write_idx <= (others => '0');
           state   <= IDLE;

         else
           write_idx <= write_idx + 1;
           fir_idx   <= fir_idx - 1;

         end if;

     end case;

     -- Allow bus writes to the coefficient RAM
     if is_write(WB_IN) then
       write_addr  := TO_INTEGER(WB_IN.ADDR(6 downto 2));
       word_addr := TO_INTEGER(WB_IN.ADDR(1 downto 0));

       if write_addr <= MAX_IDX then
         case word_addr is
           when 0 => fcw <= WB_IN.DAT;
           when 1 => ram_k_hi(write_addr) <= WB_IN.DAT;
           when 2 => ram_k_md(write_addr) <= WB_IN.DAT;
           when 3 => ram_k_lo(write_addr) <= WB_IN.DAT;
         end case;
       end if;
     end if;

     -- Advance the LFSR
     lfsr  <= lfsr(21 downto 1) & (lfsr(22) xnor lfsr(21));

     -- Handle the reset.
     if (rst = '1') then
       write_idx   <= (others => '0');
       read_idx    <= (others => '0');
       fir_idx     <= (others => '1');
       fcw       <= (others => '0');
       use_fir_data  <= true;
       state     <= RESET;
     end if;

   end process;

end architecture Behavioral;

-- 
Rob Gaddi, Highland Technology
Email address is currently out of order

Re: Advice on Xilinx Spelunking - Brian Drummond - 2010-05-25 20:33:00

On Tue, 25 May 2010 14:32:59 -0700, Rob Gaddi
<r...@technologyhighland.com> wrote:

>I've got a Spartan 6 design that I'm working with under ISE 11.5.  A 
>code block that I would expect to take up about 200 LUTs is taking 800 
>instead.  600 LUTs wouldn't be the end of the world, except I'm planning 
>to replicate this block 32 times, which puts me well over the top.
>
>So the question becomes where are all of the LUTs going? 

>  Then I tried looking 
>through the technology schematic instead.  The viewer took forever to 
>open the schematic, and when I finally got it open it took better than a 
>minute any time I wanted to refresh the screen.  Needless to say, this 
>got me nowhere.

Rather than use the technology viewer, I've had better luck reading the
post-synthesis netlist in a text editor! 

I'm not necessarily recommending that approach, but it has its uses. You
could quickly search for the first few instances of "ram_k_hi", then
every instance of "ram_k_hi<whatever>(63)" to see if e.g. the LUT RAMs
have been duplicated to give you enough ports.

But my recommendation would be divide and conquer  on that block; it's
not large. For example, comment or "generate" out the coefficient
readback module and see how the size changes. Or "generate" out the
whole lot then re-introduce it a block at a time, comparing the synth
result with your expectations.
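
The "generate out" idea can be sketched as below. This is only an illustration, not code from the posted design: the entity name bisect_sketch, the generic INCLUDE_READBACK, and the ports are all hypothetical. The suspect logic sits inside one if-generate, and a stub in the complementary generate ties the outputs off so the rest of the design still elaborates.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- Hypothetical entity: names are illustrative, not taken from the posted design.
entity bisect_sketch is
  generic (
    INCLUDE_READBACK : boolean := true   -- set false to drop the block under suspicion
  );
  port (
    sel    : in  std_logic;
    a, b   : in  std_logic_vector(15 downto 0);
    rb_out : out std_logic_vector(15 downto 0)
  );
end entity bisect_sketch;

architecture rtl of bisect_sketch is
begin
  -- Suspect logic, present only when the generic is true.
  GEN_RB : if INCLUDE_READBACK generate
    rb_out <= a when sel = '1' else b;
  end generate GEN_RB;

  -- Stub that ties the output off so everything else still elaborates.
  GEN_NO_RB : if not INCLUDE_READBACK generate
    rb_out <= (others => '0');
  end generate GEN_NO_RB;
end architecture rtl;
```

Synthesizing once with the generic true and once with it false, then comparing the LUT counts in the two reports, gives the cost of that one block.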

Have you allowed for the size of the coefficient rams - 3x64-bit as far
as I can tell from the posted code?  Or how are the 4 ports of the quad
port RAM organised? With more than 1 write port, that can get complex
and inefficient...

- Brian

Re: Advice on Xilinx Spelunking - Rob Gaddi - 2010-05-25 20:45:00

On 5/25/2010 5:36 PM, Brian Drummond wrote:
> On Tue, 25 May 2010 14:32:59 -0700, Rob Gaddi
> <r...@technologyhighland.com>  wrote:
>
>> I've got a Spartan 6 design that I'm working with under ISE 11.5.  A
>> code block that I would expect to take up about 200 LUTs is taking 800
>> instead.  600 LUTs wouldn't be the end of the world, except I'm planning
>> to replicate this block 32 times, which puts me well over the top.
>>
>> So the question becomes where are all of the LUTs going?
>
>>   Then I tried looking
>> through the technology schematic instead.  The viewer took forever to
>> open the schematic, and when I finally got it open it took better than a
>> minute any time I wanted to refresh the screen.  Needless to say, this
>> got me nowhere.
>
> Rather than use the technology viewer, I've had better luck reading the
> post-synthesis netlist in a text editor!
>
> I'm not necessarily recommending that approach, but it has its uses. You
> could quickly search for the first few instances of "ram_k_hi", then
> every instance of "ram_k_hi<whatever>(63)" to see if e.g. the LUT RAMs
> have been duplicated to give you enough ports.
>
> But my recommendation would be divide and conquer  on that block; it's
> not large. For example, comment or "generate" out the coefficient
> readback module and see how the size changes. Or "generate" out the
> whole lot then re-introduce it a block at a time, comparing the synth
> result with your expectations.
>
> Have you allowed for the size of the coefficient rams - 3x64-bit as far
> as I can tell from the posted code?  Or how are the 4 ports of the quad
> port RAM organised? With more than 1 write port, that can get complex
> and inefficient...
>
> - Brian

The quad port only became a quad port because XST decided to implement 
the reset logic on its own dedicated write port rather than just have 
one write port and feed it from an AND gate.

It turns out that, if I just comment out the reset logic, the 
utilization drops to 236 LUTs.  It must have been implementing something 
truly awful to try to get that extra write port in.  Why it thought it 
needed it in the first place I'll never know, but at least I'm back on 
track now.

-- 
Rob Gaddi, Highland Technology
Email address is currently out of order

Re: Advice on Xilinx Spelunking - Nial Stewart - 2010-05-26 04:54:00

> It turns out that, if I just comment out the reset logic, the utilization drops to 236 LUTs.  It 
> must have been implementing something truly awful to try to get that extra write port in.  Why it 
> thought it needed it in the first place I'll never know, but at least I'm back on track now.


Rob, some(/most) templates for inferring RAMs don't work if you have a
reset defined.
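
For reference, a minimal sketch of the kind of template that usually does infer a LUT RAM: one clocked write, an asynchronous read, and no reset branch touching the array. The entity name lutram_sketch and its ports are hypothetical; the 8 x 48-bit size is borrowed from the posted design. Whether a particular XST version maps it to distributed RAM should be confirmed in the synthesis report.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;

-- Hypothetical example entity; names are illustrative only.
entity lutram_sketch is
  port (
    clk   : in  std_logic;
    we    : in  boolean;
    waddr : in  unsigned(2 downto 0);
    raddr : in  unsigned(2 downto 0);
    wdata : in  signed(47 downto 0);
    rdata : out signed(47 downto 0)
  );
end entity lutram_sketch;

architecture rtl of lutram_sketch is
  type t_ram is array (0 to 7) of signed(47 downto 0);
  -- Power-up contents come from the initializer, not from reset logic.
  signal ram : t_ram := (others => (others => '0'));
begin
  -- The only assignment to the array: a single clocked write, no reset.
  process (clk)
  begin
    if rising_edge(clk) then
      if we then
        ram(to_integer(waddr)) <= wdata;
      end if;
    end if;
  end process;

  -- Asynchronous read, which distributed (LUT) RAM supports.
  rdata <= ram(to_integer(raddr));
end architecture rtl;
```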


Nial.




Re: Advice on Xilinx Spelunking - Brian Drummond - 2010-05-26 06:55:00

On Tue, 25 May 2010 17:45:33 -0700, Rob Gaddi
<r...@technologyhighland.com> wrote:

>On 5/25/2010 5:36 PM, Brian Drummond wrote:
>> On Tue, 25 May 2010 14:32:59 -0700, Rob Gaddi
>> <r...@technologyhighland.com>  wrote:
>>
>>> I've got a Spartan 6 design that I'm working with under ISE 11.5.  A
>>> code block that I would expect to take up about 200 LUTs is taking 800
>>> instead.  
>>  Or how are the 4 ports of the quad
>> port RAM organised? With more than 1 write port, that can get complex
>> and inefficient...

>The quad port only became a quad port because XST decided to implement 
>the reset logic on its own dedicated write port rather than just have 
>one write port and feed it from an AND gate.
>
>It turns out that, if I just comment out the reset logic, the 
>utilization drops to 236 LUTs. 

Glad you found it. 
Implementing the reset externally as you described, is the sort of trick
that is occasionally necessary to get round XST limitations.

Or eliminating the reset, and writing all those zeroes across the
wishbone bus.

If you think that XST can be usefully improved in this area, submit a
testcase to Webcase.

- Brian

Re: Advice on Xilinx Spelunking - Rob Gaddi - 2010-05-26 12:05:00

On 5/26/2010 1:54 AM, Nial Stewart wrote:
>> It turns out that, if I just comment out the reset logic, the utilization drops to 236 LUTs.  It
>> must have been implementing something truly awful to try to get that extra write port in.  Why it
>> thought it needed it in the first place I'll never know, but at least I'm back on track now.
>
>
> Rob, some(/most) templates for inferring RAMs don't work if you have a
> reset defined.
>
>
> Nial.
>

The reset logic was sequential, i.e. reset address 0, then reset address 
1, one per clock until the entire thing was done.  The intention being 
that the entire thing would take place on the normal write port of the 
RAM, which wasn't being used while it was in the reset state. 
Apparently it didn't work out that way.
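
One way to make that intention explicit is to select the write enable and write data outside the array assignment, so the RAM is written from exactly one statement. The sketch below is an assumption about how that could look, not the code that was eventually used: the entity ram_clear_sketch, the clearing flag, and the port names are hypothetical, and there is no guarantee that a given XST version infers a single write port from it.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;

-- Hypothetical example entity; names are illustrative only.
entity ram_clear_sketch is
  port (
    clk      : in  std_logic;
    clearing : in  boolean;               -- true while a sequencer steps waddr through every location
    we       : in  boolean;               -- normal write request
    waddr    : in  unsigned(2 downto 0);  -- write address, shared by both uses
    wdata    : in  signed(47 downto 0);
    raddr    : in  unsigned(2 downto 0);
    rdata    : out signed(47 downto 0)
  );
end entity ram_clear_sketch;

architecture rtl of ram_clear_sketch is
  type t_ram is array (0 to 7) of signed(47 downto 0);
  constant ZERO    : signed(47 downto 0) := (others => '0');
  signal ram       : t_ram := (others => ZERO);
  signal ram_we    : boolean;
  signal ram_wdata : signed(47 downto 0);
begin
  -- Merge the clear and the normal write before they reach the RAM.
  ram_we    <= we or clearing;
  ram_wdata <= ZERO when clearing else wdata;

  -- Single write statement, so only one write port is described.
  process (clk)
  begin
    if rising_edge(clk) then
      if ram_we then
        ram(to_integer(waddr)) <= ram_wdata;
      end if;
    end if;
  end process;

  rdata <= ram(to_integer(raddr));
end architecture rtl;
```

In the posted filter, write_idx already addresses ram_dat in both the IIR and RESET states, so only the data and enable would need this treatment.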

-- 
Rob Gaddi, Highland Technology
Email address is currently out of order