FPGARelated.com
Forums

Is this phase accumulator trick well-known???

Started by Jonathan Bromley February 8, 2009
Jonathan Bromley wrote:

> hi comp.arch.fpga, > (accidentally posted to comp.lang.vhdl > a few moments ago- sorry)
> The question - repeated after the explanation - > is: here's what I think is a nifty trick; has > anyone seen it, or been aware of it, before? > I can't believe it's really new.
As far as I know, the usual system is to use a power of two modulus and supply enough bits to get the required accuracy. It is then easy to take the top bit as a (close enough to) 50% duty cycle clock.
> I have been messing around with baud rate generators > and suchlike - creating a pulse that's active for > one clock period at some required repetition rate - > and wanted to try a phase accumulator technique > instead of a simple divider. That makes it far > easier to specify the frequency - it's simply the > phase-delta input value - and easily allows for > non-integral divide ratios, at the cost of one > master clock period of jitter.
> The phase-accumulator produces pulses with a > repetition rate of > Fc * M / N > where Fc is the master clock, M is the phase delta > and N is the counter's modulus. However, to get > the huge convenience of specifying M as the required > frequency, I must make N be equal to the frequency > of Fc, and this is unlikely to be an exact power of 2. > So the phase accumulator works like this:
> on every clock pulse... > if (acc < 0) then > add := acc + N; > output_pulse <= '1'; > else > output_pulse <= '0'; > end if; > acc := acc - M; -- unconditionally
> This is fine, but it means that on the "wrap-around" > clock cycle I must add either N-M to the accumulator; > if either M or N are variable, that costs me another > adder.
Adders are pretty cheap in FPGAs. It would take some work to see if your method is a net savings.
> Today I came up with an intriguing (to me) alternative: > on the wrap-around cycle, add N to the accumulator; > on the immediately subsequent cycle, add (-2M); on > all other cycles, add (-M). This is of course rather > easy to do since 2M is just a left shift. A few > trial synthesis runs convinced me that it will give > measurably better performance than the two-adder > version. VHDL code is appended for anyone who wants > to play.
It seems to me that needs an additional MUX to save an adder. In 4LUT logic, I believe that is pretty close to the same in CLBs.
> My question is: has this trick been published anywhere? > Or is it something that "those skilled in the art" > already know about? I haven't seen it before, but that > simply means I probably haven't looked hard enough.
There are a large number of tricks that have been used over the years to make computer arithmetic faster. Many of those tricks aren't as effective in FPGA logic, so new ones will have to be developed. -- glen
"kadhiem_ayob" <kadhiem_ayob@yahoo.co.uk> wrote:

>Hi Jonathan, > >I can't quite get your idea (You may have mistyped M,N) > >This is what I do for phase accummulator(modulus not power of 2), >e.g. modulus = 1600, increment = 869 >------------------------------ >if(modulo_adder < 1600)then > modulo_adder <= modulo_adder + 869; > overflow <= '0'; >else > modulo_adder <= modulo_adder + 869 - 1600; > overflow <= '1'; >end if; >-------------------------------
Where the 869 or 1600 are variable your code requires two adders and a 2 way mux. Jonathan's code avoids adding the increment on overflow cycles and makes up by adding double the increment in the cycle following overflow. That requires 1 adder and a 3 way mux. It won't work when the increment is more than half the modulus. Well known trick or not I have no idea. --
Jonathan Bromley wrote:

> It simulates correctly too, although of course it needs to > be wrapped in something that provides a signal to connect > to its unconstrained input port.
This may be the point that Antti missed, as he mentioned "unmodified" code. Without a wrapper or constraint, I wouldn't expect elaboration to succeed: # vsim -c rate_gen # Loading /flip/usr1/modeltech/linux/../std.standard # Loading /flip/usr1/modeltech/linux/../ieee.std_logic_1164(body) # Loading /flip/usr1/modeltech/linux/../ieee.numeric_std(body) # Loading work.rate_gen(rtl)#1 # ** Fatal: (vsim-3347) Port 'rate' is not constrained. A minimal constraining modification like this: rate : in unsigned(15 downto 0) proves that the code would sim ok, with a proper instance. # vsim -c rate_gen # Loading /flip/usr1/modeltech/linux/../std.standard # Loading /flip/usr1/modeltech/linux/../ieee.std_logic_1164(body) # Loading /flip/usr1/modeltech/linux/../ieee.numeric_std(body) # Loading work.rate_gen(rtl)#1 VSIM 1> run VSIM 2> -- Mike Treseler
> if (...overflow...) then > modulo_adder <= modulo_adder + N; > else > modulo_adder <= modulo_adder + N - 1600; > >The last line requires TWO adders, in addition to the >multiplexer created by the IF. This causes a significant >performance hit. That's what I was trying to fix. I did >it by saying...
>So we modify the code like this: > > if (...overflow...) then > adder <= adder + N; > elsif (...there was an overflow on the previous clock...) > adder <= adder + 2*N; -- make up for the lost N > else > adder <= adder - 1600;
It looks like a 2 input mux with 2 adders vs a 3 input mux with 1 adder. The second adder can be pulled out and done ahead of time, aka pipelined. If speed rather than space is your goal, that might be better. -- These are my opinions, not necessarily my employer's. I hate spam.
Jonathan Bromley <jonathan.bromley@MYCOMPANY.com> wrote:

>hi comp.arch.fpga, >(accidentally posted to comp.lang.vhdl >a few moments ago- sorry) > >The question - repeated after the explanation - >is: here's what I think is a nifty trick; has >anyone seen it, or been aware of it, before? >I can't believe it's really new. > >I have been messing around with baud rate generators >and suchlike - creating a pulse that's active for >one clock period at some required repetition rate - >and wanted to try a phase accumulator technique >instead of a simple divider. That makes it far >easier to specify the frequency - it's simply the >phase-delta input value - and easily allows for >non-integral divide ratios, at the cost of one >master clock period of jitter. > >The phase-accumulator produces pulses with a >repetition rate of > Fc * M / N >where Fc is the master clock, M is the phase delta >and N is the counter's modulus. However, to get >the huge convenience of specifying M as the required >frequency, I must make N be equal to the frequency >of Fc, and this is unlikely to be an exact power of 2. >So the phase accumulator works like this: > > on every clock pulse... > if (acc < 0) then > add := acc + N; > output_pulse <= '1'; > else > output_pulse <= '0'; > end if; > acc := acc - M; -- unconditionally
What if you simply add N-M to the accumulator? on every clock pulse... if (acc < 0) then acc := acc + (N -M); output_pulse <= '1'; else output_pulse <= '0'; acc := acc - M; end if; -- Failure does not prove something is impossible, failure simply indicates you are not using the right tools... "If it doesn't fit, use a bigger hammer!" --------------------------------------------------------------
Jonathan Bromley wrote:
(snip)

> Yes, exactly. But if "869" is a variable (from an > input port) so that the rate is configurable, then > you have
> if (...overflow...) then > modulo_adder <= modulo_adder + N; > else > modulo_adder <= modulo_adder + N - 1600;
> The last line requires TWO adders, in addition to the > multiplexer created by the IF. This causes a significant > performance hit. That's what I was trying to fix.
The adders will run in parallel, so there should be no performance difference. -- glen
On Sun, 08 Feb 2009 17:37:18 -0700, Glen Herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:

>Jonathan Bromley wrote: >(snip) > >> Yes, exactly. But if "869" is a variable (from an >> input port) so that the rate is configurable, then >> you have > >> if (...overflow...) then >> modulo_adder <= modulo_adder + N; >> else >> modulo_adder <= modulo_adder + N - 1600; > >> The last line requires TWO adders, in addition to the >> multiplexer created by the IF. This causes a significant >> performance hit. That's what I was trying to fix. > >The adders will run in parallel, so there should be no >performance difference.
I don't think they can. The first line needs a 2 input adder and the second line needs two 2 input adders in a tree or a 3 input adder. No matter how you look at it, that path will be slower than a 2 input adder; whether slow enough to matter is a different issue. More precisely, there are two paths 1) module_adder.Q to module_adder.D and 2) N.Q to module_adder.D The second path here will have some extra gates on it because before N.Q can be presented to the input of the second (m_a+N) adder, it has to go through one more adder to calculate N.Q -1600, which is the extra delay. -- Muzaffer Kal DSPIA INC. ASIC/FPGA Design Services http://www.dspia.com
I remember implementing Bresenham's line drawing algorithm on 8-bit CPUs
like this:

	suba	#N	* Subtract N
	bcc	no_ov	* Branch if no borrow...
	adda	#M	* Add M
	inc	cur_y	* increment y axis or whatever...
no_ov

(bonus points if you can identify the CPU and the assembler...)

In Verilog this would be something like:

	reg [16:0] accu;

	if (accu[16]) // Did previous (- N) overflow?
	  accu <= accu + M; // Yes, add M
	else
	  accu <= accu - N;


-- 
/*  jhallen@world.std.com AB1GO */                        /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
Muzaffer Kal wrote:

(after someone wrote)
>>> if (...overflow...) then >>> modulo_adder <= modulo_adder + N; >>> else >>> modulo_adder <= modulo_adder + N - 1600;
>>>The last line requires TWO adders, in addition to the >>>multiplexer created by the IF. This causes a significant >>>performance hit. That's what I was trying to fix.
(I wrote)
>>The adders will run in parallel, so there should be no >>performance difference.
> I don't think they can. The first line needs a 2 input adder and the > second line needs two 2 input adders in a tree or a 3 input adder. No > matter how you look at it, that path will be slower than a 2 input > adder; whether slow enough to matter is a different issue.
I thought N was more or less constant. It should probably be modulo_adder <= modulo_adder + (N - 1600); to make sure it comes out that way, though. In any case, it should run at least at 100MHz even with two adders on most current FPGAs. (snip) -- glen
On Feb 9, 1:12=A0am, Mike Treseler <mtrese...@gmail.com> wrote:
> Jonathan Bromley wrote: > > It simulates correctly too, although of course it needs to > > be wrapped in something that provides a signal to connect > > to its unconstrained input port. > > This may be the point that Antti missed, > as he mentioned "unmodified" code. > Without a wrapper or constraint, I wouldn't expect > elaboration to succeed: > > # vsim -c rate_gen > # Loading /flip/usr1/modeltech/linux/../std.standard > # Loading /flip/usr1/modeltech/linux/../ieee.std_logic_1164(body) > # Loading /flip/usr1/modeltech/linux/../ieee.numeric_std(body) > # Loading work.rate_gen(rtl)#1 > # ** Fatal: (vsim-3347) Port 'rate' is not constrained. > > A minimal constraining modification like this: > =A0rate =A0: in =A0unsigned(15 downto 0) > proves that the code would sim ok, with a proper instance. > > # vsim -c rate_gen > # Loading /flip/usr1/modeltech/linux/../std.standard > # Loading /flip/usr1/modeltech/linux/../ieee.std_logic_1164(body) > # Loading /flip/usr1/modeltech/linux/../ieee.numeric_std(body) > # Loading work.rate_gen(rtl)#1 > VSIM 1> run > VSIM 2> > > =A0-- Mike Treseler
Mike I am not that dumb ;) the original code does NOT pass synthesis without wrapper after passing the synthesis with wrapper it does not pass compile for Xilinx ISIM the same code may pass __all__ others simulator, i made no statement about an of them, only that code making xilinx XST happy doest not pass ISIM so those tools VHDL support level is just different. i constrained the signal connected to rate in the wrapper as: signal rate : unsigned(15 downto 0) :=3D X"2580"; this was the _needed_ trick in the wrapper. i assumed Jonathans code does not modifications. what he claims the case if the tools are fully VHDL-93 compliant Antti