
how to speed up my accumulator ??

Started by Moti Cohen December 5, 2004
Hi Mike,

Yes, I know that, but in my design inc_value'length is almost the same as
accumulator'length (maybe I will be able to decrease it by two bits),
so it won't give me much more slack.

Thanks. Moti.

Moti wrote:
>
> Hi Rickman,
>
> First of all, thanks for the code example. It's always nicer and clearer
> to get one of these.
> There is only one thing bothering me in your code: the "accsingle"
> register is sampled on each rising edge of the clock and therefore
> does not improve the setup time (and therefore the frequency and clock
> rate). I suppose that it should be sampled on every 2nd clock. So maybe
> your code contains a typo, but the idea is "almost" clear and it's a
> very clever one.
Yes, both accsingle and accdouble are sampled on the rising edge of the clock, but only when phase is high and so only *every other* clock. I guess I figured that would be obvious. The accfast signal captures the output of a mux on *every* clock, so it still has to run at full speed. But this path has no carry, so it should be faster than your previous result.

In any case, you can likely improve your results by floorplanning so that the registers involved are in adjacent (or even the same) CLBs to optimize routing. I see no reason that your original design would not run at 200 MHz.
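The two-phase scheme described here can be checked with a small cycle-level model. This is only a sketch in Python, not HDL: it assumes the update expressions quoted later in the thread (accsingle <= accdouble + inc_value, accdouble <= accdouble + 2*inc_value), a reset to all zeros, and phase toggling on every clock. The function names are illustrative.

```python
MASK = (1 << 32) - 1  # 32-bit accumulator wraps modulo 2**32


def plain_accumulator(inc, n_clocks):
    """Reference design: one full 32-bit add on every clock."""
    acc, out = 0, []
    for _ in range(n_clocks):
        acc = (acc + inc) & MASK
        out.append(acc)
    return out


def dual_accumulator(inc, n_clocks):
    """Two-phase scheme: both adders load only on edges where phase is 1."""
    phase = accsingle = accdouble = accfast = 0
    out = []
    for _ in range(n_clocks):
        # Next-state values are computed from the current register contents,
        # mirroring registers that all update on the same clock edge.
        if phase == 0:
            next_fast = accsingle                    # mux selects accsingle
            next_single, next_double = accsingle, accdouble
        else:
            next_fast = accdouble                    # mux selects accdouble
            next_single = (accdouble + inc) & MASK
            next_double = (accdouble + 2 * inc) & MASK
        phase ^= 1
        accsingle, accdouble, accfast = next_single, next_double, next_fast
        out.append(accfast)
    return out


if __name__ == "__main__":
    inc = 0x12345678  # arbitrary example increment
    # After a two-clock start-up, accfast follows the plain accumulator.
    print(dual_accumulator(inc, 10)[2:] == plain_accumulator(inc, 8))
```

Under these assumptions accfast produces the same sequence as the straightforward accumulator, delayed by two clocks, while each adder result is needed only on every second edge.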
> I presented this subject (my problem) to our algorithms guy and he
> figured out a very nice way of breaking the logic into two or more
> levels (4, 8...), but he is still working on it. I will post the code
> here when he finishes it.
You will find that approach reduces the length of the carry path. But the basic minimum path is from one register output through the LUT and into a second register. This will be the ultimate limit for any adder design if you reduce the carry delay to a single LUT. To reach the full speed capability you will likely need to floorplan to get the fastest routing, which is between registers in the same CLB. At that point your carry delay may not matter with your requirement of 5 ns. Typically the carry delay is < 0.1 ns/bit, or < 3.2 ns for the 32-bit adder.

I guess all those words are trying to say that you can only do so much by pipelining an adder. Pipelining will break up the carry delay; the finer you break it up, the closer you get to the reg -> LUT -> reg delay, not zero delay. My dual parallel approach gets you directly to the minimum delay if that is what's needed.

But try floorplanning before you do any more work on the algorithm. That should be sufficient at 32 bits. Also, you did place and route it, right? The timing results from synthesis are not very accurate since they "estimate" routing times.

-- 
Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX
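The diminishing return from finer pipelining can be put into numbers with a rough sketch. Only the 0.1 ns/bit carry figure and the 5 ns budget come from the post above; the fixed reg -> LUT -> reg overhead of 1.5 ns is a hypothetical placeholder, and the real value has to come from the post-route timing report.

```python
def adder_path_ns(bits, carry_ns_per_bit=0.1, fixed_ns=1.5):
    """Estimate the register-to-register delay of one adder segment.

    carry_ns_per_bit: upper bound quoted in the post (< 0.1 ns/bit).
    fixed_ns: hypothetical reg -> LUT -> reg delay including routing;
              this number is an assumption, not taken from the thread.
    """
    return fixed_ns + bits * carry_ns_per_bit


def report(total_bits=32, budget_ns=5.0):
    """Delay estimate with the adder split into 1, 2, 4, ... pipeline stages."""
    rows = []
    stages = 1
    while stages <= total_bits:
        bits = total_bits // stages
        delay = adder_path_ns(bits)
        rows.append((stages, bits, delay, delay <= budget_ns))
        stages *= 2
    return rows


if __name__ == "__main__":
    for stages, bits, delay, ok in report():
        print(f"{stages:2d} stage(s) x {bits:2d} bits: {delay:.2f} ns, meets 5 ns: {ok}")
```

Whatever the fixed term turns out to be, the estimate approaches it as the segments shrink and never goes below it, which is the limit described above.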
[lots of snipped]
> > Rick, hmmm... care to comment?
> > see synthesis and timing reports above :)
>
> This shows that my approach will run twice as fast. It produces two
> results rather than one and so can be constrained to require two clock
> periods. You need to set your timing constraints to reflect that. The
> only paths that don't run at the half clock rate are the output mux
> running into accfast and the phase control signal. Set the path delay
> on the accsingle and accdouble paths to be *two* clock periods (except
> for the enable from phase).

:) OK, well, your code "AS IS" did not synthesize, so I tried to guess a fix to get it to synthesize, possibly making an error in the guesswork.

YES, calculating 2 bits per clock is a solution; this is also what I suggested in one of my earlier posts.

I presented the synthesis (and timing) of the code "as you wrote" it (after the fix). I don't see the output mux in your code, and I did not add it either.

Generically I agree: a similar approach (if the code is correct) runs about twice the speed.

> But your timing numbers show both designs running at over 200 MHz which
> is the OPs requirement, IIRC. Did you have to do any floorplanning?
> Also, are these numbers post ROUTE or the output from synthesis? Timing
> results from synthesis are worthless. I would like to see the details
> on the critical path in each case.

I posted both the synthesis estimate and the post place and route timings; in any case both approaches are 210 MHz+.

No floorplanning, just set the clock constraint to 5 ns, nothing more.

> The logic for my code should be a minimum of 97 LUTs. Your result is
> only 34 slices which is a maximum of 68 LUTs. I suspect there is some
> problem so that the code does not synthesize correctly (possibly in the
> code).

Yes, possibly I corrected your code incorrectly :(

> I have not looked at the CLB details of the newer Xilinx FPGAs. An
> adder still requires 1 LUT per bit, right? inc_value is a signal and
> not a constant, right?

I used 32-bit-wide signals throughout, with inc_value as an input port.
"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:41B35233.F5D96004@yahoo.com...
> Antti Lukats wrote:
> >
> > "rickman" <spamgoeshere4@yahoo.com> wrote in message
> > news:41B32744.70D3A95F@yahoo.com...
> > > Moti Cohen wrote:
> > > >
> > > > Hello all,
> > > > I've a design that contains a NCO (Numerically controlled oscillator).
> > > > The NCO consists of a 32'bit accumulator. when i write the accumulator
> > > > straight forward like this -
[snip]
> The logic for my code should be a minimum of 97 LUTs. Your result is
> only 34 slices which is a maximum of 68 LUTs. I suspect there is some
> problem so that the code does not synthesize correctly (possibly in the
> code).
>
> I have not looked at the CLB details of the newer Xilinx FPGAs. An
> adder still requires 1 LUT per bit, right? inc_value is a signal and
> not a constant, right?
>
> Rick "rickman" Collins
Hm... out of curiosity I checked the DDSX ipcore in 2X mode (that is, calculating 2 bits per clock). The following stats are for:
- 32 bit wide accumulator
- 32 bit variable phase increment value

Synthesis:
Selected Device : 3s1500fg320-5
Number of Slices: 33 out of 13312 0%
Minimum period: 4.577ns (Maximum Frequency: 218.508MHz)

Post P&R Timing:
Timing constraint: TS_clk = PERIOD TIMEGRP "clk" 5 nS HIGH 50.000000 % ;
497 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
Minimum period is 4.657ns.
--------------------------------------------------------------------------------
All constraints were met.

Design statistics:
Minimum period: 4.657ns (Maximum frequency: 214.731MHz)

So the DDSX ipcore can calculate 2 bits per clock (to be muxed or serialized) at a maximum frequency of 214 MHz using 33 slices! OK, let's add one more slice for the mux or shifter; that comes to 34 slices :)

The DDSX ipcore (in 2x mode) runs completely at 0.5 x the DDS frequency! So if the FPGA fabric can run a 2-bit shifter at 400 MHz, then the DDS would run at a virtual 400 MHz. Real 400 MHz is only used in one slice doing the shift, or not at all when the DDR iocell uses 2 phases of the clock.

Antti

PS: I just ran a timing check on the 10 GHz version of DDSX; no problems either :) Sure, 10 GHz only with V4FX or V2ProX (using GT10 as serializer).
Hi Rickman,

I wrote ->
> there is only one thing bothering me in your code - the "accsingle"
> register is sampled on each rising edge of clock and therefore
> does not improve the setup time (and therefore the frequency & clk
> rate). I suppose that it should be sampled on every 2nd clock

You wrote ->
> Yes, both accsingle and accdouble are sampled on the rising edge of the
> clock, but only when phase is high and so only *every other* clock

That's what I meant: as I understand it, accdouble is indeed sampled every other clock, but accsingle is sampled on every clock, as follows:

when phase = '1', accsingle is updated: accsingle <= accdouble + inc_value
when phase = '0', accsingle is sampled: accfast <= accsingle

So it seems to me that it is sampled one clock edge after it is changed (via the large logic block). Am I wrong or missing something?
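One way to look at this question is to list the clock edges at which each register loads and count the clock periods between launch and capture on each path. The sketch below assumes phase resets to 0 and toggles on every edge, as in the scheme under discussion; the function name and the edge numbering are illustrative only.

```python
def load_edges(n_clocks):
    """Return the clock-edge indices at which each register action happens.

    The action taken on an edge depends on the value phase held before it.
    """
    phase = 0
    adders, fast_from_single, fast_from_double = [], [], []
    for edge in range(n_clocks):
        if phase == 1:
            adders.append(edge)            # accsingle and accdouble load
            fast_from_double.append(edge)  # accfast <= accdouble
        else:
            fast_from_single.append(edge)  # accfast <= accsingle
        phase ^= 1
    return adders, fast_from_single, fast_from_double


adders, from_single, from_double = load_edges(12)

# Adder path: accdouble changes on one adder edge and the new sums are
# captured on the next adder edge.
adder_budget = {b - a for a, b in zip(adders, adders[1:])}

# Mux path: accsingle changes on an adder edge and accfast captures it on
# the next edge that selects accsingle.
mux_budget = {min(e for e in from_single if e > a) - a for a in adders[:-1]}

if __name__ == "__main__":
    print("clock periods available to the adders:", adder_budget)
    print("clock periods available to the accsingle -> accfast mux:", mux_budget)
```

In this model accsingle is indeed read one clock after it changes, but that path contains only the mux. The adder feeding accsingle starts from accdouble, which changes only every second edge, so the adder has two clock periods to settle.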
Another question regarding the NCO...

Do any of you know the algorithm for calculating the jitter
frequency on the NCO output (MSbit)?
I know that the jitter magnitude is ± [reference clock period / 2],
and I know that I can also see it (the frequency) on a spectrum
analyzer, but I would be glad to have a formula for calculating it in
advance.

Thanks again, Moti.
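The jitter magnitude mentioned in the question can be checked numerically. The sketch below is an illustration, not a measurement: it takes jitter to mean how late each rising edge of the accumulator MSB is relative to the instant an ideal, continuous phase would cross half scale, and it assumes an increment below half scale. The function name and the example increment are made up.

```python
from fractions import Fraction


def msb_edge_errors(inc, n_clocks, width=32):
    """Lateness, in clock periods, of each rising edge of the accumulator MSB."""
    half, mask = 1 << (width - 1), (1 << width) - 1
    acc, errors = 0, []
    for _ in range(n_clocks):
        nxt = (acc + inc) & mask
        if acc < half <= nxt:  # MSB goes 0 -> 1 on this clock
            # The ideal crossing happened (nxt - half) / inc clocks earlier.
            errors.append(Fraction(nxt - half, inc))
        acc = nxt
    return errors


if __name__ == "__main__":
    errs = msb_edge_errors(0x3123_4567, 10_000)  # arbitrary example increment
    print("edges:", len(errs))
    print("peak-to-peak lateness in clocks:", float(max(errs) - min(errs)))
```

Every edge is late by somewhere between zero and one clock period, so the peak-to-peak spread is below one reference clock period, which matches the ± half-period figure when measured around the mean.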

"Moti" <moti@terasync.net> wrote in message
news:1102341075.941696.112870@c13g2000cwb.googlegroups.com...
> Another question regarding the NCO...
>
> Do any of you know the algorithm for calculating the jitter
> frequency on the NCO output (MSbit)?
> I know that the jitter magnitude is ± [reference clock period / 2],
> and I know that I can also see it (the frequency) on a spectrum
> analyzer, but I would be glad to have a formula for calculating it in
> advance.
>
> Thanks again, Moti.
It's a bit involved to find the largest jitter components, but I've worked the problem in the past and found a direct correlation between my expected jitter components and the FFT of the NCO output.

Effectively, your jitter components are at the offsets between the frequency of your NCO and the best fractions that approximate your NCO output-to-input frequency ratio (and the harmonics thereof). If you can figure out the closest fractions in sequence, you can get your main jitter components. There is some mixing among these frequencies, but it tends to be significantly lower than the main peaks in the scenarios I ran.
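One possible reading of "the best fractions in sequence" is the continued-fraction convergents of the ratio inc_value / 2^32. The sketch below lists those convergents and, for each, the offset between the NCO output frequency and that fraction of the clock. This is an interpretation of the description above, not a confirmed formula; the function names, the example increment, and the 200 MHz clock are assumptions for illustration.

```python
from fractions import Fraction


def convergents(x, max_terms=12):
    """Continued-fraction convergents (best rational approximations) of x."""
    h0, k0 = 0, 1  # numerator / denominator two steps back
    h1, k1 = 1, 0  # numerator / denominator one step back
    out = []
    for _ in range(max_terms):
        a = x.numerator // x.denominator  # integer part
        h0, h1 = h1, a * h1 + h0
        k0, k1 = k1, a * k1 + k0
        out.append(Fraction(h1, k1))
        frac = x - a
        if frac == 0:                     # expansion terminated exactly
            break
        x = 1 / frac
    return out


def candidate_offsets_hz(inc, f_clk_hz, width=32, max_terms=12):
    """Pair each convergent p/q with |f_out - (p/q) * f_clk| in Hz."""
    ratio = Fraction(inc, 1 << width)
    return [(c, abs(ratio - c) * f_clk_hz) for c in convergents(ratio, max_terms)]


if __name__ == "__main__":
    # Hypothetical increment and clock, chosen only to show the output shape.
    for frac, offset in candidate_offsets_hz(0x3123_4567, 200_000_000):
        print(f"{frac.numerator}/{frac.denominator}: offset {float(offset):.3f} Hz")
```

The harmonics and the mixing products mentioned above are not modelled here; the list only gives candidate fundamental offsets to compare against an FFT of the simulated MSB.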
Hi John,
Thanks for your reply, although I have to admit that I didn't entirely
understand how to actually calculate the frequency.
Best regards, Moti

Antti Lukats wrote:
>
> [lots of snipped]
> > > Rick, hmmm... care to comment?
> > > see synthesis and timing reports above :)
> >
> > This shows that my approach will run twice as fast. It produces two
> > results rather than one and so can be constrained to require two clock
> > periods. You need to set your timing constraints to reflect that. The
> > only paths that don't run at the half clock rate are the output mux
> > running into accfast and the phase control signal. Set the path delay
> > on the accsingle and accdouble paths to be *two* clock periods (except
> > for the enable from phase).
>
> :) OK, well, your code "AS IS" did not synthesize, so I tried to guess a
> fix to get it to synthesize, possibly making an error in the guesswork.
> YES, calculating 2 bits per clock is a solution; this is also what I
> suggested in one of my earlier posts.
>
> I presented the synthesis (and timing) of the code "as you wrote" it (after
> the fix). I don't see the output mux in your code, and I did not add it either.
>
> Generically I agree: a similar approach (if the code is correct) runs about
> twice the speed.
The output mux is the two assignments to accfast, one when phase is '0' and the other when phase is '1'. I took another look at the code and I don't see anything that would not synthesize. What did the tool complain about?

> > But your timing numbers show both designs running at over 200 MHz which
> > is the OPs requirement, IIRC. Did you have to do any floorplanning?
> > Also, are these numbers post ROUTE or the output from synthesis? Timing
> > results from synthesis are worthless. I would like to see the details
> > on the critical path in each case.
>
> I posted both the synthesis estimate and the post place and route timings;
> in any case both approaches are 210 MHz+.
>
> No floorplanning, just set the clock constraint to 5 ns, nothing more.

I only see one timing value for each example. What were the critical paths in each design?

> > The logic for my code should be a minimum of 97 LUTs. Your result is
> > only 34 slices which is a maximum of 68 LUTs. I suspect there is some
> > problem so that the code does not synthesize correctly (possibly in the
> > code).
>
> Yes, possibly I corrected your code incorrectly :(
>
> > I have not looked at the CLB details of the newer Xilinx FPGAs. An
> > adder still requires 1 LUT per bit, right? inc_value is a signal and
> > not a constant, right?
>
> I used 32-bit-wide signals throughout, with inc_value as an input port.

That should work. Can you post the code you worked with?

-- 
Rick "rickman" Collins
"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:41B49989.E7D9EFBC@yahoo.com...
> Antti Lukats wrote:
> >
> > [lots of snipped]
> > > > Rick, hmmm... care to comment?
> > > > see synthesis and timing reports above :)
> > >
[snip]
> > I used all signal 32 bit wide, inc_value as input port
>
> That should work. Can you post the code you worked with?
> --
> Rick "rickman" Collins
Rick, you can try your own code with XST; it complains about the sll at least! Maybe there is a better (read: proper) fix.

As for the timings I posted: I always posted both the synthesis estimate and the post-P&R timings. The news posting changed the text alignment, so it was hard to read.

Below is what I used (a fast, do-not-think-at-all fix) from your code:

-------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity dds is
    Port ( clk       : in  std_logic;
           rst       : in  std_logic;
           inc_value : in  std_logic_vector(31 downto 0);
           fout      : out std_logic);
end dds;

architecture Behavioral of dds is
    signal accsingle : std_logic_vector(31 downto 0);
    signal accdouble : std_logic_vector(31 downto 0);
    signal accfast   : std_logic_vector(31 downto 0);
    signal phase     : std_logic;
begin

    process (clk, rst)
    begin
        if rst = '1' then
            phase     <= '0';
            accsingle <= (others => '0');
            accdouble <= (others => '0');
            accfast   <= (others => '0');
        elsif clk'event and clk = '1' then
            phase <= not phase;
            if (phase = '0') then
                -- output mux: select the "+1 step" result
                accfast <= accsingle;
            else
                -- output mux: select the "+2 step" result
                accfast   <= accdouble;
                -- both adders load only on this phase (every other clock)
                accsingle <= accdouble + inc_value;
                -- shift left by one replaces the sll: adds 2 * inc_value
                accdouble <= accdouble + (inc_value(30 downto 0) & '0');
            end if;
        end if;
    end process;

    fout <= accfast(accfast'high);

end Behavioral;
-------------------------------------------------------------------------------

Below is DDSX in 2X mode. There is no mux; 2 bits are calculated per clock, so an external mux or serializer is needed.

-------------------------------------------------------------------------------
//
// DDSX 2x Copyright 2004 OpenChip
//
// 2 DDS phase bits per single clock
// to be used with external (to this module) 2 bit
// serializer running at 2X the clock of this module
//

`define DDSX_PHASE_WIDTH 32
`define DDSX_ACCU_WIDTH 32
`define DDSX_SERDES_WIDTH 2

module ddsx_2x(
    clk,
    rst,
    load,            // not used
    phase,           // phase increment
    txdata,          // output bits, need to be serialized at 2X clock
    debug_next_accu  // for simulation
);

input clk;
input rst;
input load;
input  [`DDSX_PHASE_WIDTH-1:0]  phase;
output [`DDSX_SERDES_WIDTH-1:0] txdata;
output [`DDSX_ACCU_WIDTH-1:0]   debug_next_accu;

//
// Accumulator, on each clock advances 2 clocks !
//
reg [`DDSX_ACCU_WIDTH-1:0] accu;

//
// phase shifts for 2 steps
//
wire [`DDSX_PHASE_WIDTH:0] phase_2;

//
// Calc phase shifts for all steps
//
assign phase_2 = {phase[`DDSX_PHASE_WIDTH-1:0], 1'b0};

//
// Adder outputs
//
wire [`DDSX_ACCU_WIDTH-1:0] accu_p1;
wire [`DDSX_ACCU_WIDTH-1:0] accu_p2;

assign accu_p1 = accu + phase;
assign accu_p2 = accu + phase_2;

//
// Combine 2 next steps into one 2 bit word for high speed serdes
//
assign txdata[`DDSX_SERDES_WIDTH-1:0] = {
    accu_p2[`DDSX_ACCU_WIDTH-1],
    accu_p1[`DDSX_ACCU_WIDTH-1]};

//
// advance 2 clocks!
//
always @(posedge clk)
    if (rst)
        accu <= `DDSX_ACCU_WIDTH'b0;
    else
        accu <= accu_p2;

assign debug_next_accu = accu_p2;

endmodule
-------------------------------------------------------------------------------
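As a cross-check of the 2x idea, the Verilog above can be mirrored in a short Python model and compared with a plain accumulator that produces one bit per step. This is a sketch under two assumptions that the posting does not state: the external serializer sends txdata[0] first and txdata[1] second, and the accumulator starts from the reset value of zero. The function names are illustrative.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def msb(value):
    """Most significant bit of a WIDTH-bit value."""
    return (value >> (WIDTH - 1)) & 1


def reference_bits(phase_inc, n_bits):
    """MSB stream of a plain accumulator adding phase_inc once per output bit."""
    acc, bits = 0, []
    for _ in range(n_bits):
        acc = (acc + phase_inc) & MASK
        bits.append(msb(acc))
    return bits


def ddsx_2x_bits(phase_inc, n_clocks):
    """MSB stream of the 2x scheme: two bits per clock, accu advances by 2 * phase_inc."""
    accu, bits = 0, []
    for _ in range(n_clocks):
        accu_p1 = (accu + phase_inc) & MASK
        accu_p2 = (accu + 2 * phase_inc) & MASK
        # Assumed serializer order: txdata[0] (from accu_p1), then txdata[1].
        bits += [msb(accu_p1), msb(accu_p2)]
        accu = accu_p2
    return bits


if __name__ == "__main__":
    inc = 0x2AAA_AAAB  # arbitrary example phase increment
    print(ddsx_2x_bits(inc, 8) == reference_bits(inc, 16))
```

With that bit order, the serialized stream matches the single-step accumulator running at twice the module clock, which is the "virtual" frequency described earlier in the thread.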