How to speed up my accumulator?

Started by Moti Cohen December 5, 2004
Hello all,
I have a design that contains an NCO (numerically controlled oscillator).
The NCO consists of a 32-bit accumulator. When I write the accumulator
in the straightforward way, like this:

process (clk,resetn)
begin
	if resetn = '0' then
		accumulator	<= (others =>'0');
	elsif clk'event and clk ='1' then
		accumulator	<= accumulator + inc_value;
	end if;
end process;			
Fout <= accumulator (accumulator'high); 

the maximum frequency I can achieve for 'clk' is ~150 MHz (Spartan-3).
I need it to work at ~200 MHz, so I figured that some pipelining is
needed, but I don't know how to do it because of the accumulator
feedback. Maybe someone here can explain it to me or even give me a
code example (which would be great).

Thanks in advance, Moti.

> the maximum frequency I can achive for 'clk' is ~ 150 MHz (spartan 3).
> I need it to work in ~200 MHz so I figured out that some pipelining is
> needed but I dont know how to do it because of the accumulator
> feedback. Maybe someone here can explain it to me or even give me a
> code example (which will be great).

Google for "carry-save adder" (or counter).

The idea is to break the adder into chunks. The carry-out of each chunk goes into a FF and then into the carry-in of the next chunk. Chop it up into chunks that are small enough to meet your speed requirement.

With modern dedicated carry logic, this doesn't work as well as it did in the old days.

-- 
These are my opinions, not necessarily my employer's.

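The chunked adder described above leaves the feedback intact: each chunk keeps its own feedback loop, and only the carry between chunks passes through a flip-flop. A minimal sketch, assuming a 32-bit accumulator split into two 16-bit halves. The entity name nco_split and the signals acc_lo, acc_hi, inc_hi_d and carry are illustrative and do not come from the thread.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: 32-bit NCO accumulator built from two 16-bit chunks.
entity nco_split is
	port (
		clk       : in  std_logic;
		resetn    : in  std_logic;
		inc_value : in  std_logic_vector(31 downto 0);
		Fout      : out std_logic
	);
end entity nco_split;

architecture rtl of nco_split is
	signal acc_lo   : unsigned(15 downto 0);
	signal acc_hi   : unsigned(15 downto 0);
	signal inc_hi_d : unsigned(15 downto 0);  -- upper increment, delayed one clock
	signal carry    : std_logic;              -- registered carry between the chunks
begin
	process (clk, resetn)
		variable sum_lo : unsigned(16 downto 0);
		variable cin    : unsigned(0 downto 0);
	begin
		if resetn = '0' then
			acc_lo   <= (others => '0');
			acc_hi   <= (others => '0');
			inc_hi_d <= (others => '0');
			carry    <= '0';
		elsif rising_edge(clk) then
			-- Lower chunk: 16-bit add, carry-out captured in a flip-flop.
			sum_lo := resize(acc_lo, 17) + unsigned(inc_value(15 downto 0));
			acc_lo <= sum_lo(15 downto 0);
			carry  <= sum_lo(16);
			-- Delay the upper increment so it lines up with the registered carry.
			inc_hi_d <= unsigned(inc_value(31 downto 16));
			-- Upper chunk: uses last clock's carry and increment.
			cin(0) := carry;
			acc_hi <= acc_hi + inc_hi_d + cin;
		end if;
	end process;

	Fout <= acc_hi(15);
end architecture rtl;
```

In this sketch the upper half always runs one clock behind the lower half, so Fout appears one clock later than in the single-adder version. That is usually harmless for a free-running NCO, but the two halves should not be read together as one 32-bit word without realigning them.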
"Moti Cohen" <moti@terasync.net> wrote in message
news:c04bfe33.0412050155.7afd29ee@posting.google.com...
> the maximum frequency I can achive for 'clk' is ~ 150 MHz (spartan 3).
> I need it to work in ~200 MHz so I figured out that some pipelining is
> needed but I dont know how to do it because of the accumulator
> feedback. Maybe someone here can explain it to me or even give me a
> code example (which will be great).

http://ipcores.openchip.org/ddsx.html
NCO with a max (virtual) frequency of 11 (eleven) GHz!

For your speed you can possibly optimize the adder to get the performance. However, it is also possible to run the NCO at much higher clock frequencies than the FPGA fabric supports. It is resource-consuming, but it is a working solution.

To get 11 GHz performance (using V4 RocketIO), 40 NCO words are calculated each clock cycle and the result is then serialized with the RocketIO SERDES.

Similarly, in FPGAs with no special SERDES there would still be some speed gain from running the NCO at a lower frequency, calculating maybe 4 or 8 bits per clock, and then using a very fast shift register to shift the bits out. That approach would be usable for 400 MHz+ frequencies (within the FPGA fabric).

Antti

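The "several bits per clock" idea can be sketched as follows, assuming four output bits per fabric clock. Here inc_value is the phase step per output bit, and the accumulator advances by four steps on every slow clock. The entity name nco_x4, the port names and the bit ordering are assumptions for illustration; the fast shift register (or SERDES) that sends bits(0) first is not shown.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: produces four consecutive NCO output bits per clock.
entity nco_x4 is
	port (
		clk_slow  : in  std_logic;  -- one quarter of the output bit rate
		resetn    : in  std_logic;
		inc_value : in  std_logic_vector(31 downto 0);
		bits      : out std_logic_vector(3 downto 0)  -- bits(0) is sent first
	);
end entity nco_x4;

architecture rtl of nco_x4 is
	signal acc : unsigned(31 downto 0);
begin
	process (clk_slow, resetn)
		variable inc        : unsigned(31 downto 0);
		variable p1, p2, p3 : unsigned(31 downto 0);
	begin
		if resetn = '0' then
			acc  <= (others => '0');
			bits <= (others => '0');
		elsif rising_edge(clk_slow) then
			inc := unsigned(inc_value);
			-- Phases for the next four output bits: acc, acc+inc, acc+2*inc, acc+3*inc.
			p1 := acc + inc;
			p2 := acc + shift_left(inc, 1);
			p3 := p2 + inc;  -- two chained adders; affordable at the slow clock
			bits(0) <= acc(31);
			bits(1) <= p1(31);
			bits(2) <= p2(31);
			bits(3) <= p3(31);
			-- Advance the accumulator by four output-bit steps.
			acc <= acc + shift_left(inc, 2);
		end if;
	end process;
end architecture rtl;
```

The adders only have to meet timing at a quarter of the output rate. The price is roughly four adders instead of one, plus a small piece of logic that really does run at the full rate to shift the four bits out.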
Hi Hall,

You said -> The idea is to break the adder into chunks.

I know that I need to break up the logic, but my problem is what to do with
the feedback path. Should I break it too?

Regards, Moti.

Hi Antti,

You wrote -> http://ipcores.openchip.org/ddsx.html
NCO with max (virtual) frequency of 11 (eleven) GHz!

I couldn't find any detailed description there (only a description of
the features and deliverables for buying it).

You wrote -> For your speed you can possibly optimize the adder to get
the performance.

How would you suggest doing this?

You wrote -> Similarly, in FPGAs with no special SERDES there would
still be some speed gain from running the NCO at a lower frequency,
calculating maybe 4 or 8 bits per clock, and then using a very fast
shift register to shift the bits out. That approach would be usable for
400 MHz+ frequencies (within the FPGA fabric).

This seems to be a very interesting solution for me (higher frequency
= less jitter!), but I didn't really understand how it works, so I
would appreciate it if you could provide me with more details or a
link to a detailed description.

Thanks, Moti.

This is not elegant and it uses three times the resources, but it should run at twice your current speed.

process (clk,resetn)
begin
	if resetn = '0' then
		-- phase is a single-bit flag (it is tested with phase = '0' below),
		-- so it is reset with '0' rather than (others =>'0')
		phase		<= '0';
		accsingle	<= (others =>'0');
		accdouble	<= (others =>'0');
		accfast		<= (others =>'0');
	elsif clk'event and clk ='1' then
		phase <= not phase;
		if (phase = '0') then
			accfast		<= accsingle;	-- odd sample, computed on the previous clock
		else
			accfast		<= accdouble;	-- even sample
			-- both sums are updated only on every second clock
			accsingle	<= accdouble + inc_value;
			-- 2*inc_value by concatenation: "accdouble + inc_value sll 1" would
			-- parse as (accdouble + inc_value) sll 1, because sll binds looser than "+"
			accdouble	<= accdouble + (inc_value(inc_value'high - 1 downto 0) & '0');
		end if;
	end if;
end process;
Fout <= accfast (accfast'high);

I don't have a feel for how close your speed is to the theoretical maximum, but have you tried optimizing your current design by using the floorplanner? First, find out what your critical path is. I expect it will be from "inc_value" to "accumulator". If so, you can place "inc_value" adjacent to "accumulator" to improve the routing delay.

One other note: I don't know if the tools are smart enough to deal with a low-true async reset. I always make mine high-true, and I believe that is the way it is spec'd for the startup block in Xilinx FPGAs. If a low-true reset works, then never mind...

-- 
Rick "rickman" Collins
rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design
URL http://www.arius.com
4 King Ave                 301-682-7772 Voice
Frederick, MD 21701-3110   301-682-7666 FAX

"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:41B32744.70D3A95F@yahoo.com...
> Moti Cohen wrote:
> >
> > Hello all,
> > I've a design that contains a NCO (Numerically controlled oscillator).
> > The NCO consists of a 32'bit accumulator. when i write the accumulator
> > straight forward like this -
> >
> > process (clk,resetn)
> > begin
> >   if resetn = '0' then
> >     accumulator <= (others =>'0');
> >   elsif clk'event and clk ='1' then
> >     accumulator <= accumulator + inc_value;
> >   end if;
> > end process;
> > Fout <= accumulator (accumulator'high);

Selected Device  : 3s1500fg676-5
Number of Slices : 17 out of 13312  0%
Speed Grade      : -5
Minimum period   : 4.407ns (Maximum Frequency: 226.912MHz)

--------------------------------------------------------------------------------
  Constraint                                 | Requested  | Actual     | Logic
                                             |            |            | Levels
--------------------------------------------------------------------------------
  TS_clk = PERIOD TIMEGRP "clk"  5 nS   HIGH | 5.000ns    | 4.847ns    | 2
  50.000000 %                                |            |            |
--------------------------------------------------------------------------------

> > the maximum frequency I can achive for 'clk' is ~ 150 MHz (spartan 3).
> > I need it to work in ~200 MHz so I figured out that some pipelining is
> > needed but I dont know how to do it because of the accumulator
> > feedback. Maybe someone here can explain it to me or even give me a
> > code example (which will be great).
> >
> > Thanks in advance, Moti.
>
> This is not elegant and it uses three times the resources, but it should
> run at twice your current speed.
>
> process (clk,resetn)
> begin
>   if resetn = '0' then
>     phase <= (others =>'0');
>     accsingle <= (others =>'0');
>     accdouble <= (others =>'0');
>     accfast <= (others =>'0');
>   elsif clk'event and clk ='1' then
>     phase <= not phase;
>     if (phase = '0') then
>       accfast <= accsingle;
>     else
>       accfast <= accdouble;
>       accsingle <= accdouble + inc_value;
>       accdouble <= accdouble + inc_value sll 1;
>     end if;
>   end if;
> end process;
> Fout <= accfast (accfast'high);

Selected Device  : 3s1500fg676-5
Number of Slices : 34 out of 13312  0%
Speed Grade      : -5
Minimum period   : 4.632ns (Maximum Frequency: 215.889MHz)

--------------------------------------------------------------------------------
  Constraint                                 | Requested  | Actual     | Logic
                                             |            |            | Levels
--------------------------------------------------------------------------------
  TS_clk = PERIOD TIMEGRP "clk"  5 nS   HIGH | 5.000ns    | 4.886ns    | 2
  50.000000 %                                |            |            |
--------------------------------------------------------------------------------

Rick, hmmm... care to comment?
See the synthesis and timing reports above :)

Antti

Moti Cohen wrote:

> elsif clk'event and clk ='1' then
>   accumulator <= accumulator + inc_value;
> end if;
> end process;
> Fout <= accumulator (accumulator'high);
>
> the maximum frequency I can achive for 'clk' is ~ 150 MHz (spartan 3).
> I need it to work in ~200 MHz so I figured out that some pipelining is
> needed but I dont know how to do it because of the accumulator
> feedback.

Hmmm... If inc_value'length < accumulator'length, maybe you could do a slice addition of the lower bits, with the carry out (the result's msbit) piped to enable an increment of the upper bits.

     -- Mike Treseler

Hi Rickman,

First of all, thanks for the code example. It's always nicer and clearer
to get one of these.
There is only one thing bothering me in your code: the "accsingle"
register is sampled on each rising edge of the clock and therefore
does not improve the setup time (and therefore the clock rate). I
suppose that it should be sampled on every 2nd clock. So maybe
your code contains a typo, but the idea is "almost" clear and it's a
very clever one.

I presented this problem to our algorithms guy and he
figured out a very nice way of breaking the logic into two or more
levels (4, 8, ...), but he is still working on it. I will post the code
here when he finishes it.

Thanks Moti.

This shows that my approach will run twice as fast. It produces two results rather than one, and so it can be constrained to require two clock periods. You need to set your timing constraints to reflect that. The only paths that don't run at the half clock rate are the output mux running into accfast and the phase control signal. Set the path delay on the accsingle and accdouble paths to be *two* clock periods (except for the enable from phase).

But your timing numbers show both designs running at over 200 MHz, which is the OP's requirement, IIRC. Did you have to do any floorplanning? Also, are these numbers post-route or the output from synthesis? Timing results from synthesis are worthless. I would like to see the details of the critical path in each case.

The logic for my code should be a minimum of 97 LUTs. Your result is only 34 slices, which is a maximum of 68 LUTs. I suspect there is some problem so that the code does not synthesize correctly (possibly in the code). I have not looked at the CLB details of the newer Xilinx FPGAs. An adder still requires 1 LUT per bit, right? inc_value is a signal and not a constant, right?

-- 
Rick "rickman" Collins