comp.arch.fpga | how to speed up my accumulator ??| page 2

Reply by Moti ● December 5, 2004

Hi Mike,

Yes, I know that, but in my design inc_value'length is almost the same as
accumulator'length (maybe I will be able to drop two bits),
so it won't give me much more slack.

Thanks. Moti.

Reply by rickman ● December 5, 2004

Moti wrote:
> 
> Hi Rickman,
> 
> First of all, thanks for the code example. It's always nicer and clearer
> to get one of these.
> There is only one thing bothering me in your code - the "accsingle"
> register is sampled on each rising edge of the clock and therefore
> does not improve the setup time (and therefore the frequency & clk
> rate). I suppose that it should be sampled on every 2nd clock. So maybe
> your code contains a typo, but the idea is "almost" clear and it's a
> very clever one.

Yes, both accsingle and accdouble are sampled on the rising edge of the
clock, but only when phase is high and so only *every other* clock.  I
guess I figured that would be obvious.  The accfast signal captures the
output of a mux on *every* clock so that it still has to run at full
speed.  But this path has no carry, so it should be faster than your
previous result.  

In any regard, you can likely improve your results by floorplanning so
that the registers involved are in adjacent (or even the same) CLBs to
optimize routing.  I see no reason that your original design would not
run at 200 MHz.  

> I presented this subject (my problem) to our algorithms guy and he
> figured out a very nice way of breaking the logic into two or more
> levels (4, 8..), but he is still working on it. I will post the code
> here when he finishes it.

You will find that approach reduces the length of the carry path.  But
the basic minimum path is from one register output through the LUT and
into a second register.  This will be the ultimate limit for any adder
design if you reduce the carry delay to a single LUT.  To reach the full
speed capability you likely will need to floorplan to get the optimally
fast routing which will be between registers in the same CLB.  At that
point your carry delay may not matter with your requirement of 5 ns. 
Typically the carry delay is < 0.1 ns/bit or < 3.2 ns for the 32 bit
adder.  
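As a rough check of that arithmetic, the Python sketch below works out the clock period, the worst-case carry delay, and what is left over for clock-to-out, LUT, routing and setup. The function name `carry_budget` is made up for illustration, and 0.1 ns/bit is only the upper bound quoted above, not a datasheet value.

```python
def carry_budget(clock_mhz, bits, carry_ns_per_bit):
    """Return (period_ns, carry_ns, remaining_ns) for a ripple-carry adder.

    carry_ns_per_bit is an assumed per-bit carry delay; real values come
    from the device data sheet or the post-route timing report.
    """
    period_ns = 1000.0 / clock_mhz          # clock period in ns
    carry_ns = bits * carry_ns_per_bit      # total carry chain delay
    remaining_ns = period_ns - carry_ns     # budget for everything else
    return period_ns, carry_ns, remaining_ns


if __name__ == "__main__":
    period, carry, rest = carry_budget(200, 32, 0.1)
    print(f"period {period:.2f} ns, carry <= {carry:.2f} ns, "
          f"remaining >= {rest:.2f} ns")
```

With 200 MHz, 32 bits and 0.1 ns/bit this gives a 5 ns period, at most 3.2 ns of carry, and roughly 1.8 ns for the register, LUT and routing delays.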

I guess all those words are trying to say that you can only do so much
with pipelining an adder.  Pipelining will break up the carry delay; the
finer you break it up, the closer you get to the reg -> LUT -> reg delay,
not zero delay.  My dual parallel approach gets you directly to the
minimum delay if that is what's needed.  But try floorplanning before
you do any more work with the algorithm.  That should be sufficient at
32 bits.  

Also, you did place and route it, right?  The timing results from
synthesis are not very accurate since they "estimate" routing times.  

-- 

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

Reply by Antti Lukats ● December 5, 2004

[lots of snipped]
> > Rick, hmmm... care to comment?
> > see synthesis and timing reports above :)
>
> This shows that my approach will run twice as fast.  It produces two
> results rather than one and so can be constrained to require two clock
> periods.  You need to set your timing constraints to reflect that.  The
> only paths that don't run at the half clock rate are the output mux
> running into accfast and the phase control signal.  Set the path delay
> on the accsingle and accdouble paths to be *two* clock periods (except
> for the enable from phase).

:) OK, well, your code "AS IS" did not synthesize, so I tried to guess at a
fix to get it to synthesize, possibly making an error in the guesswork.
YES, calculating 2 bits per clock is a solution; this is also what I
suggested in one of my earlier posts.

I presented the synthesis (and timing) of the code "as you wrote" it (after
the fix). I don't see the output mux in your code, and I did not add it either.

Generally I agree: a similar approach (if the code is correct) runs at about twice
the speed.

> But your timing numbers show both designs running at over 200 MHz which
> is the OPs requirement, IIRC.  Did you have to do any floorplanning?
> Also, are these numbers post ROUTE or the output from synthesis?  Timing
> results from synthesis are worthless.  I would like to see the details
> on the critical path in each case.

I posted both the synthesis estimate and the post place and route timings; in any
case both approaches run at 210 MHz or more.

No floorplanning; I just set the clock constraint to 5 ns, nothing more.

> The logic for my code should be a minimum of 97 LUTs.  Your result is
> only 34 slices which is a maximum of 68 LUTs.  I suspect there is some
> problem so that the code does not synthesize correctly (possibly in the
> code).

Yes, possibly I corrected your code incorrectly :(

> I have not looked at the CLB details of the newer Xilinx FPGAs.  An
> adder still requires 1 LUT per bit, right?  inc_value is a signal and
> not a constant, right?

I used 32 bit wide signals throughout, with inc_value as an input port.


Reply by Antti Lukats ● December 5, 2004

"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:41B35233.F5D96004@yahoo.com...
> Antti Lukats wrote:
> >
> > "rickman" <spamgoeshere4@yahoo.com> wrote in message
> > news:41B32744.70D3A95F@yahoo.com...
> > > Moti Cohen wrote:
> > > >
> > > > Hello all,
> > > > I've a design that contains a NCO (Numerically controlled oscillator).
> > > > The NCO consists of a 32'bit accumulator. when i write the accumulator
> > > > straight forward like this -
[snip]
> The logic for my code should be a minimum of 97 LUTs.  Your result is
> only 34 slices which is a maximum of 68 LUTs.  I suspect there is some
> problem so that the code does not synthesize correctly (possibly in the
> code).
>
> I have not looked at the CLB details of the newer Xilinx FPGAs.  An
> adder still requires 1 LUT per bit, right?  inc_value is a signal and
> not a constant, right?
>
> Rick "rickman" Collins

Hm... out of curiosity I checked the DDSX IP core in 2X mode (that is,
calculating 2 bits per clock). The following stats are for:
- 32 bit wide accumulator
- 32 bit variable phase increment value

Synthesis:
Selected Device : 3s1500fg320-5
 Number of Slices:                      33  out of  13312     0%
 Minimum period: 4.577ns (Maximum Frequency: 218.508MHz)

Post P&R Timing:
Timing constraint: TS_clk = PERIOD TIMEGRP "clk"  5 nS   HIGH 50.000000 % ;
497 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
Minimum period is   4.657ns.
--------------------------------------------------------------------------------
All constraints were met.
Design statistics:
   Minimum period:   4.657ns (Maximum frequency: 214.731MHz)


So the DDSX IP core can calculate 2 bits per clock (to be muxed or serialized) at
a max frequency of 214 MHz using 33 slices!
OK, let's add one more slice for the mux or shifter; that comes to 34 slices
:)

The DDSX IP core (in 2x mode) runs completely at 0.5 x the DDS frequency!
So if the FPGA fabric can run a 2 bit shifter at 400 MHz, then the DDS would
run at a virtual 400 MHz.
Real 400 MHz is only used in the one slice doing the shift, or not at all when the
DDR I/O cell uses 2 phases of the clock.

Antti
PS: I just ran a timing check on the 10 GHz version of DDSX, no problems either
:)
Sure, 10 GHz is only possible with V4FX or V2ProX (using GT10 as serializer).

Reply by Moti ● December 5, 2004

Hi Rickman,

I wrote ->
> there is only one thing bothering me in your code - the "accsingle"
> register is sampled on each rising edge of the clock and therefore
> does not improve the setup time (and therefore the frequency & clk
> rate). I suppose that it should be sampled on every 2nd clock

You wrote ->
> Yes, both accsingle and accdouble are sampled on the rising edge of the
> clock, but only when phase is high and so only *every other* clock

That's what I meant:
to my understanding accdouble is indeed sampled every other
clock, but accsingle is sampled on every clock, as follows:

when phase = '1', accsingle is updated:
accsingle <= accdouble + inc_value

when phase = '0', accsingle is sampled:
accfast   <= accsingle

So it seems to me that it is sampled one clock edge after it
changes (via the large logic block). Am I wrong or missing
something?

Reply by Moti ● December 6, 2004

Another question regarding the NCO...

Do any of you know an algorithm for calculating the jitter
frequency on the NCO output (MSbit)?
I know that the jitter magnitude is +/- [reference clock period / 2],
and I know that I can also see the frequency on a spectrum
analyzer, but I would be glad to have a formula for calculating it in
advance.

Thanks again, Moti.

Reply by John_H ● December 6, 2004

"Moti" <moti@terasync.net> wrote in message
news:1102341075.941696.112870@c13g2000cwb.googlegroups.com...
> Another question regarding the NCO...
>
> Do any of you know an algorithm for calculating the jitter
> frequency on the NCO output (MSbit)?
> I know that the jitter magnitude is +/- [reference clock period / 2],
> and I know that I can also see the frequency on a spectrum
> analyzer, but I would be glad to have a formula for calculating it in
> advance.
>
> Thanks again, Moti.

It's a bit involved to find the largest jitter components but I've worked
the problem in the past and found a direct correlation between my expected
jitter components and the FFT of the NCO output.  Effectively, your jitter
components are at the offsets between the frequency of your NCO and the best
fractions that approximate your NCO output-to-input frequency ratio (and the
harmonics thereof).  If you can figure the closest fractions in sequence,
you can get your main jitter components.  There is some mixing among these
frequencies but it tends to be significantly lower than the main peaks in
the scenarios I ran.
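One possible reading of that recipe, as a Python sketch: take the ratio inc_value / 2^N, list its continued-fraction convergents (used here as the "best fractions"), and report how far the NCO output frequency sits from each p/q times the clock. The function names and the choice of convergents as the "closest fractions" are assumptions made for illustration; the post does not give a formula, and harmonics and mixing products are not modelled.

```python
from fractions import Fraction


def convergents(ratio, max_terms=12):
    """Continued-fraction convergents of a non-negative Fraction."""
    h_prev2, h_prev1 = 0, 1     # numerator recurrence seeds
    k_prev2, k_prev1 = 1, 0     # denominator recurrence seeds
    x = ratio
    out = []
    for _ in range(max_terms):
        a = x.numerator // x.denominator        # next partial quotient
        h = a * h_prev1 + h_prev2
        k = a * k_prev1 + k_prev2
        out.append(Fraction(h, k))
        h_prev2, h_prev1 = h_prev1, h
        k_prev2, k_prev1 = k_prev1, k
        frac = x - a
        if frac == 0:                           # expansion terminated
            break
        x = 1 / frac
    return out


def jitter_offsets_hz(inc_value, acc_bits, f_clk_hz):
    """List (fraction, offset in Hz) between f_out and each p/q * f_clk.

    Assumption: the candidate jitter lines sit at |f_out - (p/q) * f_clk|.
    """
    ratio = Fraction(inc_value, 1 << acc_bits)  # f_out / f_clk
    rows = []
    for c in convergents(ratio):
        if c == 0 or c == ratio:                # trivial or exact: no offset
            continue
        rows.append((c, float(abs(ratio - c) * f_clk_hz)))
    return rows


if __name__ == "__main__":
    # Example values only: 32 bit accumulator, 200 MHz clock.
    for frac, offset in jitter_offsets_hz(0x1999999A, 32, 200_000_000):
        print(f"{frac.numerator}/{frac.denominator}: {offset:.3f} Hz")
```

The offsets shrink quickly as the fractions get closer, so under this reading the first few entries would be the ones to compare against an FFT of the simulated MSB.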

Reply by Moti ● December 6, 2004

Hi John,
thanks for your reply, although I have to admit that I didn't entirely
understand how to actually calculate the frequency.
Best regards, Moti

Reply by rickman ● December 6, 2004

Antti Lukats wrote:
> 
> [lots of snipped]
> > > Rick, hmmm... care to comment?
> > > see synthesis and timing reports above :)
> >
> > This shows that my approach will run twice as fast.  It produces two
> > results rather than one and so can be constrained to require two clock
> > periods.  You need to set your timing constraints to reflect that.  The
> > only paths that don't run at the half clock rate are the output mux
> > running into accfast and the phase control signal.  Set the path delay
> > on the accsingle and accdouble paths to be *two* clock periods (except
> > for the enable from phase).
> 
> :) OK, well, your code "AS IS" did not synthesize, so I tried to guess at a
> fix to get it to synthesize, possibly making an error in the guesswork.
> YES, calculating 2 bits per clock is a solution; this is also what I
> suggested in one of my earlier posts.
> 
> I presented the synthesis (and timing) of the code "as you wrote" it (after
> the fix). I don't see the output mux in your code, and I did not add it either.
> 
> Generally I agree: a similar approach (if the code is correct) runs at about twice
> the speed.

The output mux is the two assignments to accfast, one when phase is '0'
and the other when phase is '1'.  

I took another look at the code and I don't see anything that would not
synthesize.  What did the tool complain about?  


> > But your timing numbers show both designs running at over 200 MHz which
> > is the OPs requirement, IIRC.  Did you have to do any floorplanning?
> > Also, are these numbers post ROUTE or the output from synthesis?  Timing
> > results from synthesis are worthless.  I would like to see the details
> > on the critical path in each case.
> 
> I posted both the synthesis estimate and the post place and route timings; in any
> case both approaches run at 210 MHz or more.
> 
> No floorplanning; I just set the clock constraint to 5 ns, nothing more.

I only see one timing value for each example.  What were the critical
paths in each design? 


> > The logic for my code should be a minimum of 97 LUTs.  Your result is
> > only 34 slices which is a maximum of 68 LUTs.  I suspect there is some
> > problem so that the code does not synthesize correctly (possibly in the
> > code).
> 
> Yes, possibly I corrected your code incorrectly :(
> 
> > I have not looked at the CLB details of the newer Xilinx FPGAs.  An
> > adder still requires 1 LUT per bit, right?  inc_value is a signal and
> > not a constant, right?
> 
> I used 32 bit wide signals throughout, with inc_value as an input port.

That should work.  Can you post the code you worked with? 

-- 

Rick "rickman" Collins


Reply by Antti Lukats ● December 6, 2004

"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:41B49989.E7D9EFBC@yahoo.com...
> Antti Lukats wrote:
> >
> > [lots of snipped]
> > > > Rick, hmmm... care to comment?
> > > > see synthesis and timing reports above :)
> > >
[snip]
> > I used 32 bit wide signals throughout, with inc_value as an input port.
>
> That should work.  Can you post the code you worked with?
> -- 
> Rick "rickman" Collins

Rick, you can try your own code with XST; it complains about the sll at least!
Maybe there is a better (read: proper) fix than mine.

As for the timings, I always posted both the synthesis estimate and the post-P&R
timings; the news posting changed the text alignment, so it was hard to read.


Below is what I used (fast-do-not-think-at-all .. fixed) from your code:
-------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity dds is
    Port ( clk       : in std_logic;
           rst       : in std_logic;
           inc_value : in std_logic_vector(31 downto 0);
           fout      : out std_logic);
end dds;

architecture Behavioral of dds is

signal accsingle : std_logic_vector(31 downto 0);
signal accdouble : std_logic_vector(31 downto 0);
signal accfast   : std_logic_vector(31 downto 0);
signal phase     : std_logic;

begin

process (clk, rst)
begin
  if rst = '1' then
    phase       <= '0';
    accsingle   <= (others =>'0');
    accdouble   <= (others =>'0');
    accfast     <= (others =>'0');
  elsif clk'event and clk ='1' then
    phase       <= not phase;
    if (phase = '0') then
      accfast   <= accsingle;
    else
      accfast   <= accdouble;
      accsingle <= accdouble + inc_value;
      accdouble <= accdouble + (inc_value(30 downto 0) & '0');
    end if;
  end if;
end process;

fout <= accfast (accfast'high);

end Behavioral;
------------------------------------------------------------------
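To look at the timing question raised earlier in the thread (is accsingle sampled one clock after it changes?), here is a cycle-by-cycle Python model of the process above. The register names follow the VHDL; the Python function names are made up for this sketch, and it models only register updates, not gate or routing delays.

```python
MASK = (1 << 32) - 1  # 32 bit registers, as in the VHDL


def simulate_two_phase(inc_value, n_edges):
    """Return accfast after each rising clock edge, starting from reset."""
    phase = 0
    accsingle = accdouble = accfast = 0
    trace = []
    for _ in range(n_edges):
        # Compute all next values from the current ones (signal semantics).
        if phase == 0:
            next_fast = accsingle                    # mux selects accsingle
            next_single, next_double = accsingle, accdouble
        else:
            next_fast = accdouble                    # mux selects accdouble
            next_single = (accdouble + inc_value) & MASK
            # inc_value(30 downto 0) & '0' is inc_value shifted left by one
            next_double = (accdouble + ((inc_value << 1) & MASK)) & MASK
        phase ^= 1
        accsingle, accdouble, accfast = next_single, next_double, next_fast
        trace.append(accfast)
    return trace


def simulate_plain(inc_value, n_steps):
    """Reference: a plain accumulator, value before each addition."""
    acc = 0
    out = []
    for _ in range(n_steps):
        out.append(acc)
        acc = (acc + inc_value) & MASK
    return out


if __name__ == "__main__":
    fast = simulate_two_phase(0x12345678, 10)
    plain = simulate_plain(0x12345678, 9)
    print(fast[1:] == plain)
```

In this model accsingle is indeed read into accfast one clock after it is written, but that path is only register to mux to register, with no adder in it. Both adders are fed by accdouble, which changes only on every second clock, so their results have two clock periods to settle, provided the timing constraints tell the tools so.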



Below is DDSX in 2X mode. There is no mux; 2 bits are calculated per clock,
so it needs an external mux or serializer.
------------------------------------------------------------------
//
// DDSX 2x Copyright 2004 OpenChip
//
// 2 DDS phase bits per single clock
// to be used with external (to this module) 2 bit
// serializer running at 2X the clock of this module
//
`define DDSX_PHASE_WIDTH 32
`define DDSX_ACCU_WIDTH 32
`define DDSX_SERDES_WIDTH 2

module ddsx_2x(
 clk,
 rst,
 load,    // not used
 phase,   // phase increment
 txdata,  // output bits, need to be serialized at 2X clock
 debug_next_accu // for simulation
 );


input    clk;
input    rst;
input    load;

input  [`DDSX_PHASE_WIDTH-1:0]  phase;
output [`DDSX_SERDES_WIDTH-1:0] txdata;
output [`DDSX_ACCU_WIDTH-1:0]  debug_next_accu;
//
// Accumulator, on each clock advances 2 clocks !
//
reg [`DDSX_ACCU_WIDTH-1:0] accu;
//
// phase shifts for 2 steps
//
wire [`DDSX_PHASE_WIDTH  :0] phase_2;
//
// Calc phase shifts for all steps
//
assign  phase_2   = {phase[`DDSX_PHASE_WIDTH-1:0], 1'b0};
//
// Adder outputs
//
wire [`DDSX_ACCU_WIDTH-1:0] accu_p1;
wire [`DDSX_ACCU_WIDTH-1:0] accu_p2;
assign  accu_p1  = accu + phase;
assign  accu_p2  = accu + phase_2;
//
// Combine 2 next steps into one 2 bit word for high speed serdes
//
assign txdata[`DDSX_SERDES_WIDTH-1:0] = {
 accu_p2 [`DDSX_ACCU_WIDTH-1],
 accu_p1 [`DDSX_ACCU_WIDTH-1]};
//
// advance 2 clocks!
//
always @(posedge clk)
  if (rst) accu <= `DDSX_ACCU_WIDTH'b0;
  else accu <= accu_p2;

assign debug_next_accu = accu_p2;

endmodule
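
A quick Python model of the module above, to check that the two bits produced per slow clock, sent txdata[0] first, form the same MSB stream a plain accumulator would produce at twice the clock. The bit order on the wire is an assumption (the module does not say which bit the serializer sends first), and the function names are made up for this sketch.

```python
def ddsx_2x_bits(phase_inc, n_clocks, width=32):
    """Model ddsx_2x: two output bits per slow clock, txdata[0] first."""
    mask = (1 << width) - 1
    accu = 0                                         # value after reset
    bits = []
    for _ in range(n_clocks):
        accu_p1 = (accu + phase_inc) & mask          # one step ahead
        accu_p2 = (accu + (phase_inc << 1)) & mask   # two steps ahead
        bits.append(accu_p1 >> (width - 1))          # txdata[0]
        bits.append(accu_p2 >> (width - 1))          # txdata[1]
        accu = accu_p2                               # advance 2 steps
    return bits


def reference_bits(phase_inc, n_steps, width=32):
    """MSB of a plain accumulator after each single-step addition."""
    mask = (1 << width) - 1
    acc = 0
    out = []
    for _ in range(n_steps):
        acc = (acc + phase_inc) & mask
        out.append(acc >> (width - 1))
    return out


if __name__ == "__main__":
    inc = 0x9E3779B9  # arbitrary example increment
    print(ddsx_2x_bits(inc, 50) == reference_bits(inc, 100))
```

The 33 bit phase_2 in the Verilog is truncated back to the accumulator width by the assignment to accu_p2, which the mask reproduces here.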
