FPGARelated.com
Forums

New(?) fast binary counter for FPGAs without carry logic (e.g. Actel) -- Request For Comment

Started by robotron September 11, 2012
Tim Wescott <tim@seemywebsite.com> wrote:

(after Tim wrote)

>>> Gray code counters have the same carry issues as regular binary >>> counters.
(then I wrote)
>> Are you sure? Having not designed one recently, I wouldn't be sure, but >> I thoght that they didn't. You do have to consider when each FF will >> change, though, and so synchronous logic tools might not be able to do >> the timing.
> Well, yes an no. Now that you mention it, it strikes me that the "only > one thing changes at a time" may make it _easier_ to anticipate carries > -- but _when_ any bit in a gray code counter changes is still dependent > on _all_ the other bits; if that has to propagate through a bunch of > logic, then you're back to slows-ville.
But if one input of an AND is zero, it doesn't matter if the other one changes. When that input goes from zero to one, what matters is that the other input isn't still changing. You might be able to get only one gate delay on any signal that actually changes, even though the gate chain is longer. All that said without actually looking at any circuits, so it might not apply here. -- glen
On 9/11/2012 2:47 PM, robotron wrote:
> Dear colleagues, > > I am sending you a proposal of binary counter, designed to minimize logic > path length in between flip-flops to one gate (MUX) only, at the expense of > not so straightforward binary counting. The reason for this design has > emerged while using Actel (MicroSemi) ProASIC/IGLOO architecture, lacking > any hardwired support for fast carry. > > I have placed VHDL code, schematics, testbench and sample C code to OpenCores: > > http://opencores.org/project,pcounter > > for further review. If you have GHDL, you can run the test easily by issuing > "make testrun" or "make testvcd" to examine traces. > > Background: > During our work on Actel FPGAs (basically, 3-LUT & DFF only), we were aware > of following types of faster counters: > - LFSR counter > - Johnson counter > - "RLA counter" (as tailored using Actel's SmartGen core generator) > > Johnson due to its O(2^n) (n as number of bits) can not be used for longer > counts; LFSR's are hard to invert (table lookup seems to be only known > method), therefore also impractical for wider counters. RLA counter is still > too slow and complex for wider counters and moderate speeds (e.g. >> 24bits @ >100MHz). > > As a consequence, the proposed counter uses synchronous divide-by-two > blocks, each using 1-bit pipeline and carry by single-clock pulse. Design is > simple and fast, preliminary results from Synplify and Actel Designer shows > 32bits @200MHz feasible. > > However, output bit lines are non-proportionaly delayed by discrete number > of clock periods. Therefore, to obtain linear bit word, an inversion formula > needs to be applied. Fortunately, the inversion is simple (unlike LFSR's), > in C (pcount.c): > > for (k = 1; k < n; k++) > if ((y & ((1<<k)-1)) < k) > y = y ^ (1<<k); > > -- it may be implemented in VHDL core, or within CPU as shown, depending on > application requirements. > > I am attaching design files & C language decoder/encoder of counter bit > words. If you have GHDL, you can run the test easily by issuing > "make testrun" or "make testvcd" to examine traces. > > ** My questions are: ** > - does this design exists, is it being used, and if so, what is its name? > - if not, do you find the design useful? > > > Best regards, > Marek Peca >
It's simple and elegant. If you only wanted a divide by 2^n count with a square wave output, you don't really need the inversion formula. It would also work with the initial 'en' input as a count enable, but that would mess up your inversion formula, as the carry propagates even when the input enable is false. In fact, for an n-bit version of this counter, it the input enable were off for n cycles (or more) the output would eventually settle on a normal binary pattern without the inversion. -- Gabor
On Sep 11, 11:41=A0pm, Tim Wescott <t...@seemywebsite.com> wrote:
> Well, your scheme does have the drawback that while the counter is fast, > and the sampling thereof is fast, the algorithm to get from the sampled > value to a binary one takes some clock cycles. > > Still, it'd work well as a capture for a microprocessor, or as a compare > (for something like a PWM). =A0It'd be better yet if you could make it an > up-down counter, but I bet that's harder.
Yes, the bitword inversion calculation is the price paid. I understand fully, that for demanding, high-throughput applications this is an obstacle. However, as you said, for less intensive applications like PWM or timestamping it may be useful. For now, we are going to use it within time interval meter. I have uploaded the design to OpenCores, so anybody is welcome to send their improvements.
> If you have access to a university library, do a literature search -- I'd > be astonished if this hasn't shown up before in something like the IEEE > Circuit Society journal, or in a patent someplace.
Thank you for recommending the journal, I will give a look. Best regards, Marek
On Sep 12, 4:01=A0am, Gabor <ga...@szakacs.org> wrote:
> > It's simple and elegant. =A0If you only wanted a divide by 2^n > count with a square wave output, you don't really need the inversion > formula.
Right.
>=A0It would also work with the initial 'en' input as a count > enable, but that would mess up your inversion formula, as the carry > propagates even when the input enable is false. =A0In fact, for an > n-bit version of this counter, it the input enable were off for > n cycles (or more) the output would eventually settle on a normal > binary pattern without the inversion.
Really? I do not see that at the moment. (Maybe I should try it in a simulation, rather than force the brain to boot.) Marek
robotron wrote:
> On Sep 12, 4:01 am, Gabor <ga...@szakacs.org> wrote: >> It's simple and elegant. If you only wanted a divide by 2^n >> count with a square wave output, you don't really need the inversion >> formula. > > Right. > >> It would also work with the initial 'en' input as a count >> enable, but that would mess up your inversion formula, as the carry >> propagates even when the input enable is false. In fact, for an >> n-bit version of this counter, it the input enable were off for >> n cycles (or more) the output would eventually settle on a normal >> binary pattern without the inversion. > > Really? I do not see that at the moment. > (Maybe I should try it in a simulation, rather than force the brain to > boot.) > > > Marek
It made sense to me because this is essentially a ripple carry counter with pipeline stages in the carry chain, so unless the pipeline is also dependent on the enable input, it should eventually settle on a binary output. I tried it myself, and it works as I expected. Here's my version (VHDL is not my first language so I converted to Verilog). It holds en true for 100 cycles at a time, and the output (eventually) settles on a multiple of 100 (modulo 256 in this case as my pcounter is 8 bits). `timescale 1 ns / 1 ps // pdivtwo by Marek Peca // http://opencores.org/project,pcounter // Verilog adapted from the schematic diagram `default_nettype none module pdivtwo ( input wire en, input wire clk, input wire rst, output reg p, output reg q ); always @ (posedge clk or posedge rst) if (rst) begin p <= 1'b0; q <= 1'b0; end else begin if (en) q <= !q; p <= en & q; end endmodule // pcounter by Marek Peca // http://opencores.org/project,pcounter // Verilog adapted from the schematic diagram module pcounter #( parameter WIDTH = 8 ) ( input wire en, input wire clk, input wire rst, output wire p, output wire [WIDTH-1:0] q ); // Internal carry chain elements wire [WIDTH-1:1] p_int; pdivtwo pctr [WIDTH-1:0] ( .en ({p_int,en}), .clk (clk), .rst (rst), .p ({p,p_int}), .q (q) ); endmodule module pcounter_tb; // Simple test bench enables the counter for 100 clocks at a time. // Inputs reg en; reg clk; reg rst; // Outputs wire p; wire [7:0] q; // Unit Under Test (UUT) pcounter uut ( .en (en), .clk (clk), .rst (rst), .p (p), .q (q) ); initial begin // Initialize Inputs en = 0; clk = 0; rst = 1; #101; rst = 0; end always clk = #5 !clk; integer timer = 0; always @ (posedge clk) if (!rst) begin timer <= timer + 1; case (timer) 10: en <= #1 1; 110: en <= #1 0; 210: en <= #1 1; 310: en <= #1 0; 500: timer <= 0; endcase end endmodule `default_nettype wire Regards, Gabor
robotron wrote:
> Dear colleagues, > > I am sending you a proposal of binary counter, designed to minimize logic > path length in between flip-flops to one gate (MUX) only, at the expense of > not so straightforward binary counting. The reason for this design has > emerged while using Actel (MicroSemi) ProASIC/IGLOO architecture, lacking > any hardwired support for fast carry. > > I have placed VHDL code, schematics, testbench and sample C code to OpenCores: > > http://opencores.org/project,pcounter > > for further review. If you have GHDL, you can run the test easily by issuing > "make testrun" or "make testvcd" to examine traces. > > Background: > During our work on Actel FPGAs (basically, 3-LUT & DFF only), we were aware > of following types of faster counters: > - LFSR counter > - Johnson counter > - "RLA counter" (as tailored using Actel's SmartGen core generator) > > Johnson due to its O(2^n) (n as number of bits) can not be used for longer > counts; LFSR's are hard to invert (table lookup seems to be only known > method), therefore also impractical for wider counters. RLA counter is still > too slow and complex for wider counters and moderate speeds (e.g. >> 24bits @ >100MHz). > > As a consequence, the proposed counter uses synchronous divide-by-two > blocks, each using 1-bit pipeline and carry by single-clock pulse. Design is > simple and fast, preliminary results from Synplify and Actel Designer shows > 32bits @200MHz feasible. > > However, output bit lines are non-proportionaly delayed by discrete number > of clock periods. Therefore, to obtain linear bit word, an inversion formula > needs to be applied. Fortunately, the inversion is simple (unlike LFSR's), > in C (pcount.c): > > for (k = 1; k < n; k++) > if ((y & ((1<<k)-1)) < k) > y = y ^ (1<<k); > > -- it may be implemented in VHDL core, or within CPU as shown, depending on > application requirements. > > I am attaching design files & C language decoder/encoder of counter bit > words. If you have GHDL, you can run the test easily by issuing > "make testrun" or "make testvcd" to examine traces. > > ** My questions are: ** > - does this design exists, is it being used, and if so, what is its name? > - if not, do you find the design useful? > > > Best regards, > Marek Peca
Just to see if this has some application in Xilinx FPGA's I gave it a whirl in a Spartan 6. For a 32-bit counter with registered inputs and only the final p and q going offchip (again with additional registers) the best I could do in the -3 speed grade was 425 MHz. The same size counter and architecture (including a carry out) using the built-in carry chain logic for a normal binary counter resulted in more than 470 MHz. Looking through the timing numbers it appears that routing delays for this counter negate any help you might get by losing the carry chain (in Spartan 6). I imagine it would be a win in a CPLD (if you have the extra macrocells for the 2x register count). In the past I have used LFSR's for long counters in CPLD's - partly for speed, but mostly because of the reduced connectivity requirements. Regards, Gabor
On 9/11/2012 10:01 PM, Gabor wrote:
> On 9/11/2012 2:47 PM, robotron wrote: >> Dear colleagues, >> >> I am sending you a proposal of binary counter, designed to minimize logic >> path length in between flip-flops to one gate (MUX) only, at the >> expense of >> not so straightforward binary counting. The reason for this design has >> emerged while using Actel (MicroSemi) ProASIC/IGLOO architecture, lacking >> any hardwired support for fast carry. >> >> I have placed VHDL code, schematics, testbench and sample C code to >> OpenCores: >> >> http://opencores.org/project,pcounter >> >> for further review. If you have GHDL, you can run the test easily by >> issuing >> "make testrun" or "make testvcd" to examine traces. >> >> Background: >> During our work on Actel FPGAs (basically, 3-LUT & DFF only), we were >> aware >> of following types of faster counters: >> - LFSR counter >> - Johnson counter >> - "RLA counter" (as tailored using Actel's SmartGen core generator) >> >> Johnson due to its O(2^n) (n as number of bits) can not be used for >> longer >> counts; LFSR's are hard to invert (table lookup seems to be only known >> method), therefore also impractical for wider counters. RLA counter is >> still >> too slow and complex for wider counters and moderate speeds (e.g. >>> 24bits @ >100MHz). >> >> As a consequence, the proposed counter uses synchronous divide-by-two >> blocks, each using 1-bit pipeline and carry by single-clock pulse. >> Design is >> simple and fast, preliminary results from Synplify and Actel Designer >> shows >> 32bits @200MHz feasible. >> >> However, output bit lines are non-proportionaly delayed by discrete >> number >> of clock periods. Therefore, to obtain linear bit word, an inversion >> formula >> needs to be applied. Fortunately, the inversion is simple (unlike >> LFSR's), >> in C (pcount.c): >> >> for (k = 1; k < n; k++) >> if ((y & ((1<<k)-1)) < k) >> y = y ^ (1<<k); >> >> -- it may be implemented in VHDL core, or within CPU as shown, >> depending on >> application requirements. >> >> I am attaching design files & C language decoder/encoder of counter bit >> words. If you have GHDL, you can run the test easily by issuing >> "make testrun" or "make testvcd" to examine traces. >> >> ** My questions are: ** >> - does this design exists, is it being used, and if so, what is its name? >> - if not, do you find the design useful? >> >> >> Best regards, >> Marek Peca >> > It's simple and elegant. If you only wanted a divide by 2^n > count with a square wave output, you don't really need the inversion > formula. It would also work with the initial 'en' input as a count > enable, but that would mess up your inversion formula, as the carry > propagates even when the input enable is false. In fact, for an > n-bit version of this counter, it the input enable were off for > n cycles (or more) the output would eventually settle on a normal > binary pattern without the inversion. > > -- Gabor
I've seen this used before. They added delay lines after the counter bits to produce a count output that is simple binary. This was in a high speed network interface and the front end ran very fast relative to the now antiquated FPGA technology. The actual circuit may not have been a counter, it may have been an adder, but it did have a carry chain. In essence, this circuit is a pipelined, bit serial counter. You still need to wait for all the bits to be counted or use the conversion formula. Rick
Dear Gabor,

On Wednesday, September 12, 2012 4:17:56 PM UTC+2, Gabor wrote:
> > Just to see if this has some application in Xilinx FPGA's I gave it > a whirl in a Spartan 6. For a 32-bit counter with registered inputs > and only the final p and q going offchip (again with additional > registers) the best I could do in the -3 speed grade was 425 MHz. > The same size counter and architecture (including a carry out) > using the built-in carry chain logic for a normal binary counter > resulted in more than 470 MHz. Looking through the timing numbers > it appears that routing delays for this counter negate any help > you might get by losing the carry chain (in Spartan 6). I imagine > it would be a win in a CPLD (if you have the extra macrocells > for the 2x register count). In the past I have used LFSR's for > long counters in CPLD's - partly for speed, but mostly because > of the reduced connectivity requirements.
It seems the dedicated carry logic is of real help there. OK, it makes no sense for these FPGAs. It is certainly *much* better at Actel/MicroSemi, we have already put it into our recent design. Thank you very much for the try. Marek
Hello,

On Wednesday, September 12, 2012 5:24:43 PM UTC+2, rickman wrote:
> > I've seen this used before. They added delay lines after the counter=20 > bits to produce a count output that is simple binary. This was in a=20 > high speed network interface and the front end ran very fast relative to=
=20
> the now antiquated FPGA technology. The actual circuit may not have=20 > been a counter, it may have been an adder, but it did have a carry chain. >=20 > In essence, this circuit is a pipelined, bit serial counter. You still=
=20
> need to wait for all the bits to be counted or use the conversion formula=
. interesting! 1. *please*, could you find the original work? 2. Actually, my initial design containted a bank of delay lines, recovering th= e binary counting (maybe unnecessary -- I must check Gabor's post about des= ynchronizing bits to plain binary counting). The drawback is, there is O(n^= 2) DFFs if implemented using shift registers. The k-th bit has to be delaye= d by (n-k) mod 2^k bits, what is still O(n^2). In practice, it gives huge D= FF counts for 32..64bit counters. The delayers may be also implemented using embedded counters -- since there= is no need to delay arbitrary signal, only a pulse "p" (and then divide :2= ). Such a counter may be of ordinary architecture, because it has only log(= n) bits. However, all this seem to me to be too complicated and resource us= age may asymptotically drop to O(n log(n)), but in practice, it can hardly = be less than using shift registers. I must try Gabor's Verilog to see, what happens. Thank you very much, Marek
On 9/12/2012 12:06 PM, robotron wrote:
> Hello, > > On Wednesday, September 12, 2012 5:24:43 PM UTC+2, rickman wrote: >> >> I've seen this used before. They added delay lines after the counter >> bits to produce a count output that is simple binary. This was in a >> high speed network interface and the front end ran very fast relative to >> the now antiquated FPGA technology. The actual circuit may not have >> been a counter, it may have been an adder, but it did have a carry chain. >> >> In essence, this circuit is a pipelined, bit serial counter. You still >> need to wait for all the bits to be counted or use the conversion formula. > > interesting! > > 1. *please*, could you find the original work? > > 2. > Actually, my initial design containted a bank of delay lines, recovering the binary counting (maybe unnecessary -- I must check Gabor's post about desynchronizing bits to plain binary counting). The drawback is, there is O(n^2) DFFs if implemented using shift registers. The k-th bit has to be delayed by (n-k) mod 2^k bits, what is still O(n^2). In practice, it gives huge DFF counts for 32..64bit counters. > > The delayers may be also implemented using embedded counters -- since there is no need to delay arbitrary signal, only a pulse "p" (and then divide :2). Such a counter may be of ordinary architecture, because it has only log(n) bits. However, all this seem to me to be too complicated and resource usage may asymptotically drop to O(n log(n)), but in practice, it can hardly be less than using shift registers. > > I must try Gabor's Verilog to see, what happens. > > Thank you very much, > Marek
This would not have been published. It was part of a design a colleague worked on and this just seemed like the obvious way to do it, but then once you have a solution it always seems obvious, no? The delay lines can be large, but in this design the data was quickly culled and the data rates dropped significantly. Have you thought about clumping bits by twos, threes or fours? I'm not familiar with the logic family you are using, but if it has four input LUTs a grouping of three bits will give a carry out in one LUT with the same delay as one bit of your approach. Then you can use a lot fewer delay line FFs even if it doesn't change the O(n^2) relationship. I mean, it really comes down to the number of FFs needed. If the constant is small enough and your n is not too large, all is good! I'm not sure I understand how delay counters would work. Each edge has to be delayed, but the edges will happen in less than the delay time. You would need to somehow pipeline the edges... If the counter is free running, you really only need to phase each bit correctly. The first bit is 0, 1, 0, 1... so there are only two phases, either one or no FFs to delay it rather than n-1. The second bit has a pattern of four states so the delay is modulo 4 and can be 0, 1, 2 or 3 rather than n-2. Does that help? It should take some of the sting out of a long counter. I think even with an input enable if you route it to all FFs in parallel a modulo delay will still work ok. Have you thought of switching to a device with a built in carry chain? Rick