FPGARelated.com
Forums

very wide counter (42-bit)

Started by kendor December 4, 2009
hello there

for a measuring utility (running at 100 MHz) I need a counter of 42-bit width
whose value is used by several sub-blocks of my design. As a first, somewhat
dirty solution I have implemented it as shown below. Since this approach
needs quite a lot of FFs and leads to long delay times (bit 0 to bit 41),
I am looking for an alternative. I was thinking about using Block RAM
(Spartan-3) to reduce routing effort and delay times (see also
http://courses.ece.illinois.edu/ece412/References/datasheets/xapp463.pdf).

Has anyone ever done such a thing or do you have any suggestions on solving
my task?

current code: 
-------------------------------------
# I have to use std_logic_unsigned, since with numeric_std the integer width
is the normal 4 bytes (32 bits), which is not enough for 42 bits (overflow,
...)

# ...
GENERIC (
  t : NATURAL := 42;  --! counter width
  wd: NATURAL := 5    --! divider (clk/(2*wd))
);

# ...
ARCHITECTURE rtl OF worldtimeCtr IS
	SIGNAL cnt: std_logic_vector(t-1 downto 0);
BEGIN
	PROCESS(clk,rst)
		VARIABLE temp :	NATURAL RANGE 0 to wd;
	BEGIN
		IF(rst='0')THEN
			cnt <= (others =>'0');
			temp := 0;
		ELSIF(clk'event and clk='1')THEN
			IF(en='1' and temp = wd)THEN
			   temp := 0;
			   cnt <= STD_LOGIC_VECTOR(cnt + 1);
			END IF;
			temp := temp+1;
		END if;
		
	END process;
	o_worldtime <= cnt;
END rtl;

# ...
-------------------------------------

thank you in advance

kendor


On Fri, 04 Dec 2009 12:15:24 -0600
"kendor" <jonas.reber@bfh.ch> wrote:

Another option would be to pipeline the block into, say, 3 segments of 14 bits apiece, so that you don't have that one LONG carry chain trying to propagate up the whole thing.

Depending on how willing your toolchain is to rebalance registers (ISE 11 _may_ be smart enough), you might just be able to add a few stages of pipeline delay on the output of the entire 42 bits, and let it push things around across the logic. Otherwise you'd have to code it manually, which isn't the end of the world.

--
Rob Gaddi, Highland Technology
Email address is currently out of order
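A minimal sketch of what a hand-coded segmented counter could look like, assuming the counter increments on every clock (the prescaler and `en` from the original post are left out for brevity). The entity name `seg_counter42` and the signals `c_mid` and `c_hi` are made up for illustration and are not from the thread. Each carry into the next segment is decoded one cycle early and registered, so no adder carry chain is longer than 14 bits and the 42-bit value still reads correctly on every clock. It also uses `unsigned` from numeric_std, whose vectors are not limited to 32 bits; that limit applies to the `integer` type.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity seg_counter42 is  -- hypothetical name
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- active-low asynchronous reset, as in the original post
    count : out std_logic_vector(41 downto 0)
  );
end entity seg_counter42;

architecture rtl of seg_counter42 is
  constant SEG      : natural := 14;
  constant ALL_ONES : unsigned(SEG-1 downto 0) := (others => '1');
  signal lo, mid, hi  : unsigned(SEG-1 downto 0);
  signal c_mid, c_hi  : std_logic;  -- registered look-ahead carries
begin
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      lo    <= (others => '0');
      mid   <= (others => '0');
      hi    <= (others => '0');
      c_mid <= '0';
      c_hi  <= '0';
    elsif rising_edge(clk) then
      lo <= lo + 1;

      -- Decode the carries one cycle before lo wraps, so that the
      -- registered flags are high exactly while lo is all ones.
      if lo = ALL_ONES - 1 then
        c_mid <= '1';
      else
        c_mid <= '0';
      end if;

      -- mid only changes when lo wraps, so it is stable at this point.
      if lo = ALL_ONES - 1 and mid = ALL_ONES then
        c_hi <= '1';
      else
        c_hi <= '0';
      end if;

      if c_mid = '1' then
        mid <= mid + 1;
      end if;
      if c_hi = '1' then
        hi <= hi + 1;
      end if;
    end if;
  end process;

  count <= std_logic_vector(hi & mid & lo);
end architecture rtl;
```

The equality compares are wide AND functions rather than carry chains; if the 28-bit compare for `c_hi` turned out to be critical, it could be split and registered in the same way.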
On Dec 4, 1:15 pm, "kendor" <jonas.re...@bfh.ch> wrote:
If you mean the input clock is running at 100 MHz, then after your prescaler (temp) your 42-bit count runs at 1/6 of 100 MHz, if I read this code correctly? That means the entire counter has a multicycle propagation delay to itself of about 60 ns.

Did you try adding a "from : to" style timing constraint to let the tools realize this?

Regards,
Gabor
kendor wrote:
<snip>
Do you need a binary output? Before carry chains, I used linear feedback shift registers for wide counters and converted the result to binary in software.

Curt
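To illustrate the LFSR idea, here is a small 4-bit sketch (not the 42-bit counter itself). It uses the well-known maximal-length feedback from bits 4 and 3, giving a period of 15 states. For 42 bits a maximal-length tap set would have to be looked up in a published table; none is assumed here. The entity name `lfsr4` is hypothetical. The point is that the next-state logic is a single XOR regardless of width, so there is no carry chain at all.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity lfsr4 is  -- hypothetical name, 4 bits only for illustration
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- active-low asynchronous reset
    en    : in  std_logic;
    q     : out std_logic_vector(3 downto 0)
  );
end entity lfsr4;

architecture rtl of lfsr4 is
  signal r : std_logic_vector(3 downto 0);
begin
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      r <= "0001";  -- any non-zero seed; all zeros would lock up an XOR LFSR
    elsif rising_edge(clk) then
      if en = '1' then
        -- shift left, feeding back the XOR of the two top bits
        r <= r(2 downto 0) & (r(3) xor r(2));
      end if;
    end if;
  end process;

  q <= r;
end architecture rtl;
```

The states appear in pseudo-random order, which is why the value has to be converted back to a binary count afterwards, for example with a lookup or by stepping a software model of the same register.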
I would use the DSP48 circuit. It easily runs at well over 100 MHz in
Spartan6 and much faster in Virtex 5 and 6.
No need for pre-scaling or fancy carry tricks. It's all done for you!
Look at the short description in the Spartan 6  User Guide Lite:

"Each DSP48A1 slice consists of a dedicated 18 x 18 bit two's
complement multiplier and a 48-bit accumulator, both capable of
operating at 250 MHz. The DSP48A1 slice provides extensive pipelining
and extension capabilities that enhance speed and efficiency of many
applications, even beyond digital signal processing, such as wide
dynamic bus shifters, memory address generators, wide bus
multiplexers, and memory-mapped I/O register files.

The accumulator can also be used as a synchronous up/down counter."

Peter Alfke
I hope your comment on the declaration of WD is not what you really
wanted...

Also, en='0' disables the cnt increment, but not the prescaler (temp),
which will lead to problems if en is disabled at the wrong time or for
long enough.

Depending on how much latency you can tolerate (see the other posts regarding
register retiming/rebalancing), you may want to register the output of
the prescaler comparison, so that its logic path does not add to the
counter path.

Andy
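A sketch of how both points could be addressed, under stated assumptions: `wd` is reinterpreted here as the number of clock cycles per count (which is not how the original code uses it), the prescaler is frozen while `en = '0'`, and the comparison result is registered into a one-cycle pulse named `tick`. The entity name `worldtime_ctr_ce` and the signals `pre` and `tick` are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity worldtime_ctr_ce is  -- hypothetical name
  generic (
    t  : positive := 42;  -- counter width
    wd : positive := 5    -- assumption: clk cycles per count
  );
  port (
    clk         : in  std_logic;
    rst         : in  std_logic;  -- active low, as in the original post
    en          : in  std_logic;
    o_worldtime : out std_logic_vector(t-1 downto 0)
  );
end entity worldtime_ctr_ce;

architecture rtl of worldtime_ctr_ce is
  signal pre  : natural range 0 to wd-1;
  signal tick : std_logic;  -- registered one-cycle clock enable
  signal cnt  : unsigned(t-1 downto 0);
begin
  process (clk, rst)
  begin
    if rst = '0' then
      pre  <= 0;
      tick <= '0';
      cnt  <= (others => '0');
    elsif rising_edge(clk) then
      tick <= '0';
      -- The prescaler only advances while en is high, so it can
      -- never run past its range while the counter is disabled.
      if en = '1' then
        if pre = wd-1 then
          pre  <= 0;
          tick <= '1';
        else
          pre <= pre + 1;
        end if;
      end if;

      -- The wide counter sees only the registered pulse, so the
      -- prescaler compare is not in series with the carry chain.
      if tick = '1' then
        cnt <= cnt + 1;
      end if;
    end if;
  end process;

  o_worldtime <= std_logic_vector(cnt);
end architecture rtl;
```

The cost is one clock of latency between the prescaler wrapping and the counter changing.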
thank you all for your follow-ups!

In the comment I certainly mean prescaler, not divider ;)

I am using timespecs for high and low time, and ISE 11 manages to do its job (however, I have to increase its effort level, which leads to quite some processing time (30+ minutes)).

I believe adding a pipeline would be a good idea. I'm processing 4*1024 multiplexed signals, and for each signal I have 10 clock cycles for my algorithm to pass (I always switch between single incoming signals, then to the processing, and wait again for the next time the same signal is selected... around 100 us). Since I use the counter value right from the beginning, I would need to increase the counter time at the time I switch to the new signal. At the moment the data path needs 8 out of those 10 clock cycles. So there's not a lot of margin to add another pipeline stage without having to add stages throughout the whole algorithm (which works with feedbacks and loops of different delays), so I'd prefer to have the easy way :)

I didn't think of the "from : to style timing constraint" since I did not want to add 42 of those. But I'll give this a try.
Registering the prescaler comparison sounds good too.

Thanks!

---------------------------------------
This message was sent using the comp.arch.fpga web interface on http://www.FPGARelated.com
On Dec 9, 7:40 am, "kendor" <jonas.re...@bfh.ch> wrote:
> I didn't think of the "from : to style timing constraint" since I was not
> wanting to add 42 of those. But I'll give this a try.
> Registering the prescaler comparison sounds good to.

No need to add 42 constraints. You make a timing group out of the counter bits. Then you have one constraint from that group to itself, using the clock period multiplied by the prescaler count as the delay.

One good approach, as mentioned, is to register the prescaler output to create a single-cycle pulse at the prescale rate, and to write the counter logic such that it only changes when that signal is active (the "clock enable"). Then you can create the timing group based on the clock enable signal and perhaps catch some multicycle paths you didn't think of.

Regards,
Gabor
kendor <jonas.reber@bfh.ch> wrote:
 
Someone else suggested an LFSR, which seems like it might work. It depends somewhat on what you do with the count later.

I was just thinking that you could cascade counters with a latch between the carry out of one and the carry in of the next. That causes the carry to occur one cycle late, which results in a strange count sequence, but one that is fairly easy to correct externally.

Though propagating the value to other sub-blocks seems likely to take about as long as getting the carry through 42 bits. That might require more pipeline registers throughout the design.

Otherwise, 50 MHz or 25 MHz should be easy. A one- or two-bit counter at 100 MHz with the appropriate logic to generate and latch a carry signal should also work.

-- glen
On Dec 4, 10:15 am, "kendor" <jonas.re...@bfh.ch> wrote:
> hello there
>
> for a measuring utility (running @ 100MHZ) I need a counter of 42-bit width
> whose value is used by several sub blocks of my design.
> kendor
The conventional design of a synchronous counter would concatenate 42 flip-flops, using the built-in dedicated carry chain. Its carry propagation delay is extremely short, but the total delay might be too long for 100 MHz operation.

You can maintain the synchronous nature of the design, but decode an additional count enable from the first 2 flip-flops and route that signal to all the remaining 40 flip-flops in parallel. That gives the long carry chain not 10 ns, but 40 ns to stabilize, which is more than sufficient. And you still have a totally synchronous counter where all bits change on the same clock.

If you think that 42 flip-flops are too many, you can use BlockRAMs. Each dual-ported 4K BlockRAM can implement an 8-bit counter per port, easily concatenated to 16 bits per BRAM. (The two ports have the same look-up functionality, just different addressing inputs, fed back from their own outputs.) Two BlockRAMs can thus form a 32-bit fully synchronous counter, and a third BRAM can extend that to 48 bits. There is some trickery in gating the carry signals, but it never involves more than one level of combinatorial logic, which is no problem at 100 MHz. And of course you can always use a pre-scaler, as described above.

Now, if you use more modern FPGAs, like Spartan-3 DSP, Spartan-6, or Virtex-4, -5, or -6, then you can use the ready-made 48-bit accumulator (an accumulator that adds 1 per clock tick is a counter) without any design effort at all, and at a speed of up to 500 MHz.

Old FPGA families may sometimes look cheaper, but that may be deceptive. Would you today buy a car with drum brakes, no fuel injection, no CD player, no airbags and no air conditioning?

Peter Alfke
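The two-flip-flop pre-scaler described above can be sketched roughly as follows, assuming a counter that advances on every clock. The entity name `presc_counter42` and the signal `ce_hi` are hypothetical. The low 2 bits run at full speed; the upper 40 bits share one decoded enable and only change every fourth clock, so their carry chain has four clock periods to settle. The tools only take advantage of that if the upper bits are also given a matching multicycle timing constraint.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity presc_counter42 is  -- hypothetical name
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- active-low asynchronous reset
    count : out std_logic_vector(41 downto 0)
  );
end entity presc_counter42;

architecture rtl of presc_counter42 is
  signal lo    : unsigned(1 downto 0);   -- fast 2-bit pre-scaler
  signal hi    : unsigned(39 downto 0);  -- slow upper bits
  signal ce_hi : std_logic;              -- count enable for all upper bits
begin
  -- One small decode, routed in parallel to all 40 upper flip-flops.
  ce_hi <= '1' when lo = "11" else '0';

  process (clk, rst_n)
  begin
    if rst_n = '0' then
      lo <= (others => '0');
      hi <= (others => '0');
    elsif rising_edge(clk) then
      lo <= lo + 1;
      if ce_hi = '1' then
        hi <= hi + 1;  -- changes on the same edge on which lo wraps to "00"
      end if;
    end if;
  end process;

  count <= std_logic_vector(hi & lo);
end architecture rtl;
```

All 42 bits still change on the same clock edge, so the value is a plain binary count with no skew between the segments.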