Here are some more links regarding counter & accumulator 
carry techniques.

--------------------
Links to early Xilinx counter app notes:

Ultra-Fast Synchronous Counters in XC3000 & XC4000 FPGAs
http://www.cs.york.ac.uk/rts/docs/Xilinx-datasource-2003-q1/appnotes/xapp014.pdf

Loadable Binary Counters in a XC3000 FPGA
http://www.cs.york.ac.uk/rts/docs/Xilinx-datasource-2003-q1/appnotes/xapp004.pdf

pages 15-18 of Xcell Journal #7
http://www.xilinx.com/publications/archives/xcell/Xcell7.pdf

--------------------
Haven't found a pdf for XAPP 001 yet:
"
" High-Speed Synchronous Prescaler Counter
" (XAPP 001)
"
" This simple design provides a very basic non-loadable,
" up counter with a count-enable control. However, this 
" simplicity permits it to be both the densest and the 
" second fastest design.
"
" A prescaler (CEP/CET) technique is used to gain speed,
" permitting the ripple-carry portion of the counter 
" eight clock periods in which to settle. Without special 
" adaptation, however, this technique precludes loading 
" the counter. As a non-loadable counter, three bits can 
" be implemented in three CLBs (1 CLB/bit), with the least 
" significant six bits requiring only four CLBs; this 
" explains the compactness. Only one TILO delay is incurred
" in the ripple-carry path for each three bits. 
"
 
This technique of making the low N bits run fast, with 
the upper bits running slower by 2^N, should map well 
into a compact yet fast implementation of a non-loadable 
binary counter for your Actel part.

 I.e., use something like the pcounter scheme for the low 
few bits, then make the upper bits with a ripple carry,
enabled by the carry out of the low bits.

 You probably will need to add special timing constraints 
to get the tools to understand the multicycle carry, and
that the ripple chain is a false path after FF reset.

 The advantage of this is that you would now only need to
deskew N LSB's of the counter for straight binary output.

--------------------
ORCA-3 FPGAs had an optional register in the dedicated carry chain:

" Fast-carry logic and routing to adjacent PFUs for 
" nibble-wide, byte-wide, or longer arithmetic functions,
" with the option to register the PFU carry-out.

--------------------
More carry-pipelined accumulator references:

( I've mentioned accumulators because they are a more 
general carry design problem than are counters, and 
because I know where to look for literature describing 
high speed pipelined versions.)

"Direct Digital Synthesizers: Theory, Design and Applications", Vankka
lib.tkk.fi/Diss/2000/isbn9512253186/isbn9512253186.pdf
See pages 48-49 for accumulator pipelining techniques.


"Single Chip 500 MHz Function Generator"
P.H. Saul, W. Barber, D.G. Taylor, T. Ward
IEE Proceedings, Vol. 138, No. 2, pp 239-243, April 1991

Reprinted in "Direct Digital Frequency Synthesizers", Kroupa (ed), IEEE Press, 1999

Fig. 2 shows the one-bit-per-carry accumulator structure
Fig. 5 shows the accumulator output deskew tree

-Brian

Dear colleagues,

thank you for the pointers to prior art.
I have included link to this newsgroup thread to the project page.

Best regards,
Marek

robotron wrote:
>
> > I've seen this used before. =A0They added delay lines after the
> > counter bits to produce a count output that is simple binary.
> <snip>
> 1. *please*, could you find the original work?
>
Similar sorts of carry pipelining were common in the
early Xilinx XC2000/3000 parts; I recall there being
some fast counter techniques in application notes and
Xcell journals of that era.

Pipelined carry chains at one or two bits per carry were
also commonly used for accumulators and counters in the
GHz clock rate GaAs standard cell GaAs designs that I
worked on in the early 90's.

The pictures from the following TriQuint patent show a
few variants of the input/output deskew trees that can
be implemented for delay equalization of a loadable
accumulator having carry pipelining:
 http://www.google.com/patents/US5140540

( disclaimer : I worked with some of the authors back
when I was doing a foundry design through TriQuint )

- Brian

Am Dienstag, 11. September 2012 20:47:01 UTC+2 schrieb robotron:

>=20
> - does this design exists, is it being used, and if so, what is its name?

If I understand your description correctly it is a carry save accumulator:
http://en.wikipedia.org/wiki/Carry-save_adder

Quote: "To put it another way, we are taking a carry digit from the positio=
n on our right, and passing a carry digit to the left, just as in conventio=
nal addition; but the carry digit we pass to the left is the result of the =
previous calculation and not the current one. In each clock cycle, carries =
only have to move one step along, and not n steps as in conventional additi=
on."

Kolja Sulimma
www.cronologic.de

robotron wrote:
> Dear Gabor,
> 
> On Wednesday, September 12, 2012 4:17:56 PM UTC+2, Gabor wrote:
>> Just to see if this has some application in Xilinx FPGA's I gave it
>> a whirl in a Spartan 6.  For a 32-bit counter with registered inputs
>> and only the final p and q going offchip (again with additional
>> registers) the best I could do in the -3 speed grade was 425 MHz.
>> The same size counter and architecture (including a carry out)
>> using the built-in carry chain logic for a normal binary counter
>> resulted in more than 470 MHz.  Looking through the timing numbers
>> it appears that routing delays for this counter negate any help
>> you might get by losing the carry chain (in Spartan 6).  I imagine
>> it would be a win in a CPLD (if you have the extra macrocells
>> for the 2x register count).  In the past I have used LFSR's for
>> long counters in CPLD's - partly for speed, but mostly because
>> of the reduced connectivity requirements.
> 
> It seems the dedicated carry logic is of real help there.
> OK, it makes no sense for these FPGAs.
> 
> It is certainly *much* better at Actel/MicroSemi, we have already put it into our recent design.
> 
> Thank you very much for the try.
> Marek

Yes, the Spartan 6 look-ahead carry logic is quite fast, and in addition
to the fast gates it has its own fast dedicated routing.  Still 425 MHz
is likely to be much faster than a typically achievable system speed
for all but the most carefully tuned designs.  And it turns out that
having two flip-flops plus a LUT matches the slice architecture of
the newer Xilinx parts, so the resource usage at the slice level is
not bad for the pcounter.  In fact if I get rid of the async reset
in my code, the pcounter and the standard carry-chain counter use
the same resources (8 slices for 32 bits).

Getting back to the inversion issue, it seems to me that if you
want to work back to a binary number and also use the enable input,
you would need to base the inversion on the p bits as well as
q bits, or else base it on the history of the en input as well
as the q bits.  Essentially, knowing the current p bit values
allows the software to finish the carry propagation.

-- Gabor

On Wednesday, September 12, 2012 6:40:30 PM UTC+2, rickman wrote:
> (..)
> If the counter is free running, you really only need to phase each bit 
> correctly.  The first bit is 0, 1, 0, 1... so there are only two phases, 
> either one or no FFs to delay it rather than n-1.  The second bit has a 
> pattern of four states so the delay is modulo 4 and can be 0, 1, 2 or 3 
> rather than n-2.  Does that help?  It should take some of the sting out 
> of a long counter.

Yes, but not much. As I wrote, it's (n-k) mod 2^k.

> (..)
> Have you thought of switching to a device with a built in carry chain?

Yes and no. Currently, the Actel architecture suits our needs for
aerospace, radiation tolerant design quite well. Of course we will
port our design to other families, e.g. (much slower) Atmel and (more
recent) Spartan-6.

I only tried to mention the idea -- because it may be useful somewhere
again, in general.

Marek

On 9/12/2012 12:06 PM, robotron wrote:
> Hello,
>
> On Wednesday, September 12, 2012 5:24:43 PM UTC+2, rickman wrote:
>>
>> I've seen this used before.  They added delay lines after the counter
>> bits to produce a count output that is simple binary.  This was in a
>> high speed network interface and the front end ran very fast relative to
>> the now antiquated FPGA technology.  The actual circuit may not have
>> been a counter, it may have been an adder, but it did have a carry chain.
>>
>> In essence, this circuit is a pipelined, bit serial counter.  You still
>> need to wait for all the bits to be counted or use the conversion formula.
>
> interesting!
>
> 1. *please*, could you find the original work?
>
> 2.
> Actually, my initial design containted a bank of delay lines, recovering the binary counting (maybe unnecessary -- I must check Gabor's post about desynchronizing bits to plain binary counting). The drawback is, there is O(n^2) DFFs if implemented using shift registers. The k-th bit has to be delayed by (n-k) mod 2^k bits, what is still O(n^2). In practice, it gives huge DFF counts for 32..64bit counters.
>
> The delayers may be also implemented using embedded counters -- since there is no need to delay arbitrary signal, only a pulse "p" (and then divide :2). Such a counter may be of ordinary architecture, because it has only log(n) bits. However, all this seem to me to be too complicated and resource usage may asymptotically drop to O(n log(n)), but in practice, it can hardly be less than using shift registers.
>
> I must try Gabor's Verilog to see, what happens.
>
> Thank you very much,
> Marek

This would not have been published.  It was part of a design a colleague 
worked on and this just seemed like the obvious way to do it, but then 
once you have a solution it always seems obvious, no?

The delay lines can be large, but in this design the data was quickly 
culled and the data rates dropped significantly.

Have you thought about clumping bits by twos, threes or fours?  I'm not 
familiar with the logic family you are using, but if it has four input 
LUTs a grouping of three bits will give a carry out in one LUT with the 
same delay as one bit of your approach.  Then you can use a lot fewer 
delay line FFs even if it doesn't change the O(n^2) relationship.

I mean, it really comes down to the number of FFs needed.  If the 
constant is small enough and your n is not too large, all is good!

I'm not sure I understand how delay counters would work.  Each edge has 
to be delayed, but the edges will happen in less than the delay time. 
You would need to somehow pipeline the edges...

If the counter is free running, you really only need to phase each bit 
correctly.  The first bit is 0, 1, 0, 1... so there are only two phases, 
either one or no FFs to delay it rather than n-1.  The second bit has a 
pattern of four states so the delay is modulo 4 and can be 0, 1, 2 or 3 
rather than n-2.  Does that help?  It should take some of the sting out 
of a long counter.

I think even with an input enable if you route it to all FFs in parallel 
a modulo delay will still work ok.

Have you thought of switching to a device with a built in carry chain?

Rick

Hello,

On Wednesday, September 12, 2012 5:24:43 PM UTC+2, rickman wrote:
>
> I've seen this used before.  They added delay lines after the counter=20
> bits to produce a count output that is simple binary.  This was in a=20
> high speed network interface and the front end ran very fast relative to=
=20
> the now antiquated FPGA technology.  The actual circuit may not have=20
> been a counter, it may have been an adder, but it did have a carry chain.
>=20
> In essence, this circuit is a pipelined, bit serial counter.  You still=
=20
> need to wait for all the bits to be counted or use the conversion formula=
.

interesting!

1. *please*, could you find the original work?

2.
Actually, my initial design containted a bank of delay lines, recovering th=
e binary counting (maybe unnecessary -- I must check Gabor's post about des=
ynchronizing bits to plain binary counting). The drawback is, there is O(n^=
2) DFFs if implemented using shift registers. The k-th bit has to be delaye=
d by (n-k) mod 2^k bits, what is still O(n^2). In practice, it gives huge D=
FF counts for 32..64bit counters.

The delayers may be also implemented using embedded counters -- since there=
 is no need to delay arbitrary signal, only a pulse "p" (and then divide :2=
). Such a counter may be of ordinary architecture, because it has only log(=
n) bits. However, all this seem to me to be too complicated and resource us=
age may asymptotically drop to O(n log(n)), but in practice, it can hardly =
be less than using shift registers.

I must try Gabor's Verilog to see, what happens.

Thank you very much,
Marek

Dear Gabor,

On Wednesday, September 12, 2012 4:17:56 PM UTC+2, Gabor wrote:
> 
> Just to see if this has some application in Xilinx FPGA's I gave it
> a whirl in a Spartan 6.  For a 32-bit counter with registered inputs
> and only the final p and q going offchip (again with additional
> registers) the best I could do in the -3 speed grade was 425 MHz.
> The same size counter and architecture (including a carry out)
> using the built-in carry chain logic for a normal binary counter
> resulted in more than 470 MHz.  Looking through the timing numbers
> it appears that routing delays for this counter negate any help
> you might get by losing the carry chain (in Spartan 6).  I imagine
> it would be a win in a CPLD (if you have the extra macrocells
> for the 2x register count).  In the past I have used LFSR's for
> long counters in CPLD's - partly for speed, but mostly because
> of the reduced connectivity requirements.

It seems the dedicated carry logic is of real help there.
OK, it makes no sense for these FPGAs.

It is certainly *much* better at Actel/MicroSemi, we have already put it into our recent design.

Thank you very much for the try.
Marek

On 9/11/2012 10:01 PM, Gabor wrote:
> On 9/11/2012 2:47 PM, robotron wrote:
>> Dear colleagues,
>>
>> I am sending you a proposal of binary counter, designed to minimize logic
>> path length in between flip-flops to one gate (MUX) only, at the
>> expense of
>> not so straightforward binary counting. The reason for this design has
>> emerged while using Actel (MicroSemi) ProASIC/IGLOO architecture, lacking
>> any hardwired support for fast carry.
>>
>> I have placed VHDL code, schematics, testbench and sample C code to
>> OpenCores:
>>
>> http://opencores.org/project,pcounter
>>
>> for further review. If you have GHDL, you can run the test easily by
>> issuing
>> "make testrun" or "make testvcd" to examine traces.
>>
>> Background:
>> During our work on Actel FPGAs (basically, 3-LUT & DFF only), we were
>> aware
>> of following types of faster counters:
>> - LFSR counter
>> - Johnson counter
>> - "RLA counter" (as tailored using Actel's SmartGen core generator)
>>
>> Johnson due to its O(2^n) (n as number of bits) can not be used for
>> longer
>> counts; LFSR's are hard to invert (table lookup seems to be only known
>> method), therefore also impractical for wider counters. RLA counter is
>> still
>> too slow and complex for wider counters and moderate speeds (e.g.
>>> 24bits @ >100MHz).
>>
>> As a consequence, the proposed counter uses synchronous divide-by-two
>> blocks, each using 1-bit pipeline and carry by single-clock pulse.
>> Design is
>> simple and fast, preliminary results from Synplify and Actel Designer
>> shows
>> 32bits @200MHz feasible.
>>
>> However, output bit lines are non-proportionaly delayed by discrete
>> number
>> of clock periods. Therefore, to obtain linear bit word, an inversion
>> formula
>> needs to be applied. Fortunately, the inversion is simple (unlike
>> LFSR's),
>> in C (pcount.c):
>>
>> for (k = 1; k < n; k++)
>> if ((y & ((1<<k)-1)) < k)
>> y = y ^ (1<<k);
>>
>> -- it may be implemented in VHDL core, or within CPU as shown,
>> depending on
>> application requirements.
>>
>> I am attaching design files & C language decoder/encoder of counter bit
>> words. If you have GHDL, you can run the test easily by issuing
>> "make testrun" or "make testvcd" to examine traces.
>>
>> ** My questions are: **
>> - does this design exists, is it being used, and if so, what is its name?
>> - if not, do you find the design useful?
>>
>>
>> Best regards,
>> Marek Peca
>>
> It's simple and elegant. If you only wanted a divide by 2^n
> count with a square wave output, you don't really need the inversion
> formula. It would also work with the initial 'en' input as a count
> enable, but that would mess up your inversion formula, as the carry
> propagates even when the input enable is false. In fact, for an
> n-bit version of this counter, it the input enable were off for
> n cycles (or more) the output would eventually settle on a normal
> binary pattern without the inversion.
>
> -- Gabor

I've seen this used before.  They added delay lines after the counter 
bits to produce a count output that is simple binary.  This was in a 
high speed network interface and the front end ran very fast relative to 
the now antiquated FPGA technology.  The actual circuit may not have 
been a counter, it may have been an adder, but it did have a carry chain.

In essence, this circuit is a pipelined, bit serial counter.  You still 
need to wait for all the bits to be counted or use the conversion formula.

Rick