FPGARelated.com
Forums

Ones Count 64 bit on Xilinx in VHDL

Started by Brad Smallridge July 19, 2005
Hello Group,

What is the best way to count 64 incoming simultaneous
bit signals to determine the number of 1s (in VHDL)?
I have clock cycles to spare but the result must be pipelined
so that each clock cycle produces a new count.

Brad Smallridge
b r a d @ a i v i s i o n . c o m
 


Add them.

Add registers to your path and make your tool retime them.

This has been covered in the newsgroup in the past.  How many levels of
logic you can deal with depends on your device and your clock.  Just adding
the individual bits together will produce the desired results and you can
pipeline to your heart's content allowing a new result every clock (after
the initial latency) in the time it takes to run through one carry-chain
adder.


"Brad Smallridge" <bradsmallridge@dslextreme.com> wrote in message
news:11dr115qoteg87b@corp.supernews.com...
> Hello Group, > > What is the best way to count 64 incoming simultaneous > bit signals to determine the number of 1s (in VHDL)? > I have clock cycles to spare but the result must be pipelined > so that each clock cycle produces a new count. > > Brad Smallridge > b r a d @ a i v i s i o n . c o m
Yeah, I understand this.  But I can't wrap my head around how to code it.

Do you do like this:
if( clk'event and clk='1') then
 partial_sum1_2bit <= '0'&bit0 + '0'&bit1;
 partial_sum2_2bit <= '0'&bit0 + '0'&bit1;
 partial_sum1_3bit <= '0'&partial_sum1_2bit + '0'&partial_sum2_2bit;
 -- and so on
end if;

And then there is the question on how this all synthesizes, probably, for me
at 27MHz, opimized for area not speed.  I could use a little insight from
someone who's done this before.

Brad Smallridge
b r a d @ a i v i s i o n . c o m


I also don't understand what you mean by "having your tool
retime them". I don't have Precision or any advance tools here.


Brad,
what you need is called a Wallace-Tree Adder.
Here is a fairly efficient implementation:
Divide your input into groups of 12 bits, and use each as address bits
to a BlockRAM, loaded to output a 4-bit number that describes the
number of 1s in the address.
You can treat each port independently, so one BlockRAM handles 24
inputs and generates two 4-bit outputs. Three BlockRAMs handle 72
incoming bits, and produce six sets of 4-bit values.  You can combine
them with five 6-bit adders on 3 levels, giving you a total of 4
pipeline delays.
This is just one of many ways to solve your design problem...
I like to use BlockRAMs for unconventional purposes.
Peter Alfke

Brad Smallridge wrote:
> Yeah, I understand this. But I can't wrap my head around how to code it. > > Do you do like this: > if( clk'event and clk='1') then > partial_sum1_2bit <= '0'&bit0 + '0'&bit1; > partial_sum2_2bit <= '0'&bit0 + '0'&bit1; > partial_sum1_3bit <= '0'&partial_sum1_2bit + '0'&partial_sum2_2bit; > -- and so on > end if; > > And then there is the question on how this all synthesizes, probably, for me > at 27MHz, opimized for area not speed. I could use a little insight from > someone who's done this before. > > Brad Smallridge > b r a d @ a i v i s i o n . c o m
64 is 0+63+1
63 is 31+31+1
31 is 15+15+1
15 is 7+7+1
7 is 3+3+1
simple recursion

a few adder rows should be pretty quick and way less resources than
BlockRam, takes about 6 levels of small adders

In case you are interested in price and performance:
3 BlockRAMs plus 6 CLBs, four levels of pipelining, running at 200 MHz+
Not too bad  :-)
Peter Alfke

Brad Smallridge wrote:

>Hello Group, > >What is the best way to count 64 incoming simultaneous >bit signals to determine the number of 1s (in VHDL)? >I have clock cycles to spare but the result must be pipelined >so that each clock cycle produces a new count. > >Brad Smallridge >b r a d @ a i v i s i o n . c o m > > > > >
Brad, Basically, you want to gather bits together in small adders. a wallace tree does that using full adders to compress 3 single bit inputs, all withthe same weight into two signals, a sum and a carry. The sum has the same weight as the inputs, the carry has weight 2x the input. Then you use another layer to sum all like weighted bits, and repeat until you are left with two signals of each weight. You combine those with a conventional adder. What Peter described is going to be more clock cycle efficient because you use the BRAM in place of a wallace tree. His description isn't really a wallace tree because it doesn't have the same structure (no tree of carry-save adders, and the final outputs are complete sums of the bits for those BRAMs, not a carry vector and a sum vector like a wallace tree). You could use wallace trees to combine the results, from the BRAMs, but it isn't efficient in an FPGA. -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
If I were to do it in Verilog, I might use
always @(posedge Clk27M)
  TotalOnes <= in[0]+in[1]+in[2]+in[3]+in[4]+in[5]+in[6]+... and continue
typing until I reach  +in[63];

The synthesizer MAY produce superb results.
If it grinds, split it into groups - 8 groups of 8 or 4 groups of 16 nad add
*those* values together as a multiple-value addition.


"Brad Smallridge" <bradsmallridge@dslextreme.com> wrote in message
news:11dr26q40k9876e@corp.supernews.com...
> Yeah, I understand this. But I can't wrap my head around how to code it. > > Do you do like this: > if( clk'event and clk='1') then > partial_sum1_2bit <= '0'&bit0 + '0'&bit1; > partial_sum2_2bit <= '0'&bit0 + '0'&bit1; > partial_sum1_3bit <= '0'&partial_sum1_2bit + '0'&partial_sum2_2bit; > -- and so on > end if; > > And then there is the question on how this all synthesizes, probably, for
me
> at 27MHz, opimized for area not speed. I could use a little insight from > someone who's done this before. > > Brad Smallridge > b r a d @ a i v i s i o n . c o m
Brad,

There are so many ways of doing this, depending on your FPGA family, 
required timing and the available resouces, but other than using the 
"natural resources", simple LUTs, pipelines, even multipliers, etc., you can 
also use memories. Personally, I like using memories for state machines, 
especially for channelized state machines or LUT for pre-computed CRC 
calculation.

If we are talking about Virtex family, we have 16384 bits RAMs, which can be 
used as 4096x4 LUT, where you have a '1's counter within 12-bit vector, 
which is applied as an address of the entry. Each entry holds the number of 
'1's. It is clear how to expand this concept further to any vector, 
depending on the timing requirements and the available resources.

One way is that you can try 5 memory blocks like this and it will give you 
60 bits covered, then simply add the "data_out"s and the extra bits and 
pipeline them.
There could be more "balanced" or optimal usage of memories and FFs.

I hope i did not make any math mistake here.

Vladislav



"Brad Smallridge" <bradsmallridge@dslextreme.com> wrote in message 
news:11dr115qoteg87b@corp.supernews.com...
> Hello Group, > > What is the best way to count 64 incoming simultaneous > bit signals to determine the number of 1s (in VHDL)? > I have clock cycles to spare but the result must be pipelined > so that each clock cycle produces a new count. > > Brad Smallridge > b r a d @ a i v i s i o n . c o m > > >