hello there

for a measuring utility (running @ 100 MHz) I need a counter of 42-bit width whose value is used by several sub blocks of my design. As a first, somewhat dirty solution I have implemented it as follows. Since this approach needs quite a large number of FFs and leads to long delay times (bit 0 to 42) I am looking for an alternative. I was thinking about using Block RAM (Spartan3) to reduce routing effort and delay times. (see also http://courses.ece.illinois.edu/ece412/References/datasheets/xapp463.pdf)

Has anyone ever done such a thing or do you have any suggestions on solving my task?

current code:
-------------------------------------
# i have to use std_logic_unsigned since numeric_std has as integer width
# the normal 4 bytes width (32bit - which for 42 bits is not enough ... overflow,..)

# ...
GENERIC (
  t : NATURAL := 42;  --! counter width
  wd: NATURAL := 5    --! divider (clk/(2*wd))
);

# ...
ARCHITECTURE rtl OF worldtimeCtr IS
    SIGNAL cnt: std_logic_vector(t-1 downto 0);
BEGIN
    PROCESS(clk,rst)
        VARIABLE temp : NATURAL RANGE 0 to wd;
    BEGIN
        IF(rst='0')THEN
            cnt <= (others =>'0');
            temp := 0;
        ELSIF(clk'event and clk='1')THEN
            IF(en='1' and temp = wd)THEN
                temp := 0;
                cnt <= STD_LOGIC_VECTOR(cnt + 1);
            END IF;
            temp := temp+1;
        END if;

    END process;
    o_worldtime <= cnt;
END rtl;

# ...
-------------------------------------

thank you in advance

kendor
very wide counter (42-bit)
Started by ●December 4, 2009
Reply by ●December 4, 2009
On Fri, 04 Dec 2009 12:15:24 -0600 "kendor" <jonas.reber@bfh.ch> wrote:
> for a measuring utility (running @ 100MHZ) I need a counter of 42-bit
> width whose value is used by several sub blocks of my design.
> <snip>

Another option would be to pipeline the block into, say, 3 segments of 14 bits apiece, so that you don't have that one LONG carry chain trying to propagate up the whole thing. Depending on how willing your toolchain is to rebalance registers (ISE 11 _may_ be smart enough), you might just be able to add a few stages of pipeline delay on the output of the entire 42 bits, and let it push things around across the logic.

Otherwise you'd have to code it manually, which isn't the end of the world.

--
Rob Gaddi, Highland Technology
Email address is currently out of order
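For readers who want to try the manual route, here is a minimal sketch of one possible way to split the counter into three 14-bit segments. It is not taken from the thread: the entity name `seg_counter42`, the `ce` port and the look-ahead carry registers `c0`/`c1` are all assumptions made for illustration. Each carry is decoded one enabled clock ahead and registered, so all 42 bits still change on the same edge and no carry chain is longer than 14 bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: 42-bit counter built from three 14-bit segments.
entity seg_counter42 is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- asynchronous reset, active low (as in the original post)
    ce    : in  std_logic;  -- count enable (assumed port)
    q     : out std_logic_vector(41 downto 0)
  );
end entity seg_counter42;

architecture rtl of seg_counter42 is
  constant SEG_W   : natural := 14;
  constant SEG_MAX : unsigned(SEG_W-1 downto 0) := (others => '1');
  constant SEG_PRE : unsigned(SEG_W-1 downto 0) := SEG_MAX - 1;  -- one step before wrap

  signal seg0, seg1, seg2 : unsigned(SEG_W-1 downto 0);
  signal c0, c1           : std_logic;  -- registered look-ahead carries
begin
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      seg0 <= (others => '0');
      seg1 <= (others => '0');
      seg2 <= (others => '0');
      c0   <= '0';
      c1   <= '0';
    elsif rising_edge(clk) then
      if ce = '1' then
        -- Low segment counts on every enabled clock.
        seg0 <= seg0 + 1;

        -- c0 is '1' exactly while seg0 is all ones, so seg1 steps
        -- on the same edge on which seg0 wraps to zero.
        if seg0 = SEG_PRE then
          c0 <= '1';
        else
          c0 <= '0';
        end if;

        -- c1 is '1' exactly while seg1 and seg0 are both all ones.
        -- seg1 is stable here because it only changes when seg0 wraps.
        if seg0 = SEG_PRE and seg1 = SEG_MAX then
          c1 <= '1';
        else
          c1 <= '0';
        end if;

        if c0 = '1' then
          seg1 <= seg1 + 1;
        end if;
        if c1 = '1' then
          seg2 <= seg2 + 1;
        end if;
      end if;
    end if;
  end process;

  q <= std_logic_vector(seg2 & seg1 & seg0);
end architecture rtl;
```

The trade-off in this variant is two extra flip-flops and a wide equality compare in exchange for carry chains a third of the original length. Whether it actually meets timing on a given Spartan-3 speed grade would still have to be checked in the tools.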
Reply by ●December 4, 2009
On Dec 4, 1:15 pm, "kendor" <jonas.re...@bfh.ch> wrote:
> for a measuring utility (running @ 100MHZ) I need a counter of 42-bit width
> whose value is used by several sub blocks of my design.
> <snip>

If you mean the input clock is running 100 MHz, then after your prescaler (temp) your 42-bit count runs at 1/6 of 100 MHz if I read this code correctly? That means the entire counter has a multicycle propagation delay to itself of about 60 ns. Did you try adding a from : to style timing constraint to let the tools realize this?

Regards,
Gabor
Reply by ●December 7, 2009
kendor wrote:
> for a measuring utility (running @ 100MHZ) I need a counter of 42-bit width
> whose value is used by several sub blocks of my design.
>
> Has anyone ever done such a thing or do you have any suggestions on solving
> my task?
>
> <snip>

Do you need a binary output? Before carry chains, I used linear feedback shift registers for wide counters and converted the result to binary in software.

Curt
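As an illustration of the LFSR idea, here is a minimal sketch of a 42-bit shift-register counter. The entity name `lfsr42` and the `ce` port are assumptions, and the tap positions (42, 41, 20, 19 with XNOR feedback) are quoted from memory of the table in Xilinx XAPP052, so they should be checked against that table before use. With maximal-length taps the sequence has 2^42 - 1 states rather than 2^42, and the state has to be translated back to a binary count elsewhere, as described above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: 42-bit LFSR used as a counter.
entity lfsr42 is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- asynchronous reset, active low
    ce    : in  std_logic;  -- count enable (assumed port)
    q     : out std_logic_vector(42 downto 1)
  );
end entity lfsr42;

architecture rtl of lfsr42 is
  signal sr : std_logic_vector(42 downto 1);
  signal fb : std_logic;
begin
  -- XNOR feedback: the all-zeros state is part of the sequence,
  -- the all-ones state is the lock-up state and must be avoided.
  -- Tap positions are an assumption; verify against XAPP052.
  fb <= not (sr(42) xor sr(41) xor sr(20) xor sr(19));

  process (clk, rst_n)
  begin
    if rst_n = '0' then
      sr <= (others => '0');
    elsif rising_edge(clk) then
      if ce = '1' then
        -- Shift towards the MSB and insert the feedback bit at position 1.
        sr <= sr(41 downto 1) & fb;
      end if;
    end if;
  end process;

  q <= sr;
end architecture rtl;
```

The attraction is that the only logic between flip-flops is one 4-input XNOR, so there is no carry chain at all. The cost is that the sub blocks can no longer compare or subtract the value directly.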
Reply by ●December 7, 2009
I would use the DSP48 circuit. It easily runs at well over 100 MHz in Spartan6 and much faster in Virtex 5 and 6. No need for pre-scaling or fancy carry tricks. It's all done for you!

Look at the short description in the Spartan 6 User Guide Lite:

"Each DSP48A1 slice consists of a dedicated 18 x 18 bit two's complement multiplier and a 48-bit accumulator, both capable of operating at 250 MHz. The DSP48A1 slice provides extensive pipelining and extension capabilities that enhance speed and efficiency of many applications, even beyond digital signal processing, such as wide dynamic bus shifters, memory address generators, wide bus multiplexers, and memory-mapped I/O register files. The accumulator can also be used as a synchronous up/down counter."

Peter Alfke
Reply by ●December 7, 2009
I hope your comment on the declaration of WD is not what you really wanted...

Also, en='0' disables the cnt increment, but not the prescaler (temp), which will lead to problems if en is disabled at the wrong time or for long enough.

Depending on how much latency you can tolerate (other posts regarding register retiming/rebalancing), you may want to register the output of the prescaler comparison, so that its logic path does not add to the counter path.

Andy
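A minimal sketch of how both points could be addressed, assuming the port names from the original post (`clk`, `rst`, `en`, `o_worldtime`); the entity name `worldtime_ctr_sketch` and the signals `pre` and `tick` are made up for illustration. The prescaler only advances while `en` is high, and the terminal-count comparison is registered into a one-clock `tick` pulse before it reaches the wide counter. The sketch also uses `numeric_std`: an `unsigned` vector can be any width, and the 32-bit limit only applies to `integer`/`natural` values, so `cnt + 1` works on 42 bits without `std_logic_unsigned`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: prescaler gated by en, comparison registered as tick.
entity worldtime_ctr_sketch is
  generic (
    T  : natural := 42;  -- counter width
    WD : natural := 5    -- cnt advances once per WD enabled clocks (WD >= 1)
  );
  port (
    clk         : in  std_logic;
    rst         : in  std_logic;  -- asynchronous reset, active low
    en          : in  std_logic;
    o_worldtime : out std_logic_vector(T-1 downto 0)
  );
end entity worldtime_ctr_sketch;

architecture rtl of worldtime_ctr_sketch is
  signal pre  : natural range 0 to WD-1;  -- prescaler state
  signal tick : std_logic;                -- registered terminal count
  signal cnt  : unsigned(T-1 downto 0);
begin
  process (clk, rst)
  begin
    if rst = '0' then
      pre  <= 0;
      tick <= '0';
      cnt  <= (others => '0');
    elsif rising_edge(clk) then
      tick <= '0';  -- default: tick is high for one clock only
      if en = '1' then
        -- The prescaler is frozen while en is low, so it cannot
        -- run past its range the way temp can in the original.
        if pre = WD-1 then
          pre  <= 0;
          tick <= '1';
        else
          pre <= pre + 1;
        end if;
      end if;

      -- The wide adder only sees a registered single-bit enable.
      if tick = '1' then
        cnt <= cnt + 1;
      end if;
    end if;
  end process;

  o_worldtime <= std_logic_vector(cnt);
end architecture rtl;
```

Registering `tick` delays each increment by one clock compared with the original code, which may or may not matter for the sub blocks that read the value.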
Reply by ●December 9, 2009
> I hope your comment on the declaration of WD is not what you really
> wanted...
>
> Also, en='0' disables the cnt increment, but not the prescaler (temp),
> which will lead to problems if en is disabled at the wrong time or for
> long enough.
>
> Depending on how much latency you can tolerate (other posts regarding
> register retiming/rebalancing), you may want to register the output of
> the prescaler comparison, so that it's logic path does not add to the
> counter path.
>
> Andy

thank you all for your follow ups!

In the comment I certainly mean prescaler - not divider ;)

I am using timespecs for high and low time - ISE11 manages to do its job (however I have to increase its effort, which leads to quite some processing time (30+ minutes)).

I believe adding a pipeline would be a good idea. I'm processing 4*1024 multiplexed signals and for each signal I have 10 clock cycles for my algorithm to pass (I always switch between single incoming signals and then to the processing and wait again for the next time the same signal is selected... around 100us). Since I use the counter value right from the beginning I would need to increase the counter time at the time I switch to the new signal. At the moment the data path needs 8 out of those 10 clock cycles. So there's not a lot of margin to add in another pipeline stage without having to add those in the whole algorithm (which works with feedbacks and loops of different delays) - so I'd prefer to have the easy way :)

I didn't think of the "from : to style timing constraint" since I was not wanting to add 42 of those. But I'll give this a try.
Registering the prescaler comparison sounds good too.

Thanks!
Reply by ●December 9, 2009
On Dec 9, 7:40 am, "kendor" <jonas.re...@bfh.ch> wrote:
> <snip>
> I didn't think of the "from : to style timing constraint" since I was not
> wanting to add 42 of those. But I'll give this a try.
> Registering the prescaler comparison sounds good to.
> <snip>

No need to add 42 constraints. You make a timing group out of the counter bits. Then you have one constraint from that group to itself using the clock period multiplied by the prescaler count as the delay.

One good approach, as mentioned, is to register the prescaler to create a single-cycle pulse at the prescale rate and write the counter logic such that it only changes when that signal is active (the "clock enable"). Then you can create the timing group based on the clock enable signal and perhaps catch some multicycle paths you didn't think of.

Regards,
Gabor
Reply by ●December 11, 2009
kendor <jonas.reber@bfh.ch> wrote:
> for a measuring utility (running @ 100MHZ) I need a counter of 42-bit width
> whose value is used by several sub blocks of my design. As a first, somehow
> dirty solution I have implemented this like follows. Since this approach
> needs quite a huge amount of FFs and leads to long delaytimes (bit 0 to 42)
> I am looking for an alternative. I was thinking about using Block RAM
> (Spartan3) to reduce routing effort and delaytimes. (see also
> http://courses.ece.illinois.edu/ece412/References/datasheets/xapp463.pdf)

Someone else suggested an LFSR, which seems like it might work. It depends somewhat on what you do with the count later.

I was just thinking that you could cascade counters with a latch between the carry out of one and the carry in of the next. That causes the carry to occur one cycle late, which results in a strange count sequence, but one that is fairly easy to correct externally.

Though propagating the value to other subblocks seems likely to take about as long as getting the carry through 42 bits. That might require more pipeline registers throughout the design.

Otherwise, 50MHz or 25MHz should be easy. A one or two bit counter at 100MHz with the appropriate logic to generate and latch a carry signal should also work.

-- glen
Reply by ●December 11, 2009
On Dec 4, 10:15 am, "kendor" <jonas.re...@bfh.ch> wrote:
> hello there
>
> for a measuring utility (running @ 100MHZ) I need a counter of 42-bit width
> whose value is used by several sub blocks of my design.
> kendor

The conventional design of a synchronous counter would concatenate 42 flip-flops, using the built-in dedicated carry chain. Its carry propagation delay per bit is extremely short, but the total delay might be too long for 100 MHz operation.

You can maintain the synchronous nature of the design, but decode an additional count enable from the first 2 flip-flops and route that signal to all the remaining 40 flip-flops in parallel. That gives the long carry chain not 10 ns, but 40 ns to stabilize, which is more than sufficient. And you still have a totally synchronous counter where all bits change on the same clock.

If you think that 42 flip-flops are too many, you can use BlockRAMs. Each dual-ported 4K BlockRAM can implement an 8-bit counter per port, easily concatenated to 16 bits per BRAM. (The two ports have the same look-up functionality, just different addressing inputs, fed back from their own outputs.) Two BlockRAMs can thus form a 32-bit fully synchronous counter, and a third BRAM can extend that to 48 bits. There is some trickery in gating the carry signals, but it never involves more than one level of combinatorial logic, no problem at 100 MHz. And you can of course always use a pre-scaler, as described above.

Now, if you use more modern FPGAs, like Spartan3DSP, or Spartan6, or Virtex 4, 5, or 6, then you can use the ready-made 48-bit accumulator (an accumulator that adds 1 per clock tick is a counter) without any design effort at all, and a speed of up to 500 MHz.

Old FPGA families may sometimes look cheaper, but that may be deceptive. Would you today buy a car with drum brakes, no fuel injection, no CD player, no airbags and no air conditioning?

Peter Alfke
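The count-enable scheme from the second paragraph of this reply (two fast bits, forty enabled bits) could look roughly like the following. This is a sketch under assumptions: the entity name `ce_counter42` and the signals `lo`, `hi` and `hi_ce` are illustrative, and the counter here runs on every clock with no external enable.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: 2-bit fast counter enabling a 40-bit slow counter.
entity ce_counter42 is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- asynchronous reset, active low
    q     : out std_logic_vector(41 downto 0)
  );
end entity ce_counter42;

architecture rtl of ce_counter42 is
  signal lo    : unsigned(1 downto 0);   -- counts on every clock
  signal hi    : unsigned(39 downto 0);  -- counts on every fourth clock
  signal hi_ce : std_logic;              -- decoded from the two low flip-flops
begin
  -- A single small decode; it fans out to the 40 upper flip-flops in parallel.
  hi_ce <= '1' when lo = "11" else '0';

  process (clk, rst_n)
  begin
    if rst_n = '0' then
      lo <= (others => '0');
      hi <= (others => '0');
    elsif rising_edge(clk) then
      lo <= lo + 1;
      if hi_ce = '1' then
        -- hi changes on the same edge on which lo wraps from 3 to 0,
        -- so the 42-bit value stays a plain binary count.
        hi <= hi + 1;
      end if;
    end if;
  end process;

  q <= std_logic_vector(hi & lo);
end architecture rtl;
```

The upper 40 bits only change every fourth clock, so their adder has four clock periods to settle. The tools do not know that on their own, which is where the timing group on the clock-enabled flip-flops discussed earlier in the thread comes in.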






