Dear all, I have come up with 2 solutions in VHDL, how to count number of bits in input data. The thing I don't understand is why the 2 solutions produce different results, at least with Xilinx ISE and its XST. There is quite a substantial difference in required number of slices/ LUTs. 1. solution with unrolled loop: 41 slices, 73 LUTs 2. solution with loop: 54 slices, 100 LUTs The entity of both architectures is the same: entity one_count is Port ( din : in STD_LOGIC_vector(31 downto 0); dout : out STD_LOGIC_vector(5 downto 0) ); end one_count; The architecture with an unrolled loop is the following: library IEEE; use IEEE.STD_LOGIC_1164.ALL; use IEEE.NUMERIC_STD.ALL; entity one_count is Port ( din : in STD_LOGIC_vector(31 downto 0); dout : out STD_LOGIC_vector(5 downto 0) ); end one_count; architecture one_count_unrolled_arch of one_count is signal cnt : integer range 0 to 32; begin cnt <= to_integer(unsigned(din( 0 downto 0))) + to_integer(unsigned(din( 1 downto 1))) + to_integer(unsigned(din( 2 downto 2))) + to_integer(unsigned(din( 3 downto 3))) + to_integer(unsigned(din( 4 downto 4))) + to_integer(unsigned(din( 5 downto 5))) + to_integer(unsigned(din( 6 downto 6))) + to_integer(unsigned(din( 7 downto 7))) + to_integer(unsigned(din( 8 downto 8))) + to_integer(unsigned(din( 9 downto 9))) + to_integer(unsigned(din(10 downto 10))) + to_integer(unsigned(din(11 downto 11))) + to_integer(unsigned(din(12 downto 12))) + to_integer(unsigned(din(13 downto 13))) + to_integer(unsigned(din(14 downto 14))) + to_integer(unsigned(din(15 downto 15))) + to_integer(unsigned(din(16 downto 16))) + to_integer(unsigned(din(17 downto 17))) + to_integer(unsigned(din(18 downto 18))) + to_integer(unsigned(din(19 downto 19))) + to_integer(unsigned(din(20 downto 20))) + to_integer(unsigned(din(21 downto 21))) + to_integer(unsigned(din(22 downto 22))) + to_integer(unsigned(din(23 downto 23))) + to_integer(unsigned(din(24 downto 24))) + to_integer(unsigned(din(25 downto 25))) + to_integer(unsigned(din(26 downto 26))) + to_integer(unsigned(din(27 downto 27))) + to_integer(unsigned(din(28 downto 28))) + to_integer(unsigned(din(29 downto 29))) + to_integer(unsigned(din(30 downto 30))) + to_integer(unsigned(din(31 downto 31))); dout <= std_logic_vector(to_unsigned(cnt,6)); end one_count_unrolled_arch ; And the architecture with a loop is the following: architecture one_count_loop_arch of one_count_loop is signal cnt : integer range 0 to 32; begin process(din) is variable tmp : integer range 0 to 32; begin tmp := to_integer(unsigned(din(0 downto 0))); for i in 1 to 31 loop tmp := tmp + to_integer(unsigned(din(i downto i))); end loop; cnt <= tmp; end process; dout <= std_logic_vector(to_unsigned(cnt,6)); end one_count_loop_arch ; I would be really grateful if somebody could point out what I did wrong with the 2. solution with loop. It certainly must be my mistake, but I can not find it... Additionally, I know that this "brute-force" one counting might not be the optimal approach, but this is just my first attempt to get the job done. If somebody has a better solution, I would appreciate it if you could share it. Regards, Peter
Count bits in VHDL, with loop and unrolled loop produces different results
A VHDL designer observed that unrolled loops and standard for-loops resulted in significantly different LUT and slice utilization when implementing a 32-bit population count (bit counter) in Xilinx XST. The discussion explores how synthesis tools interpret sequential vs. concurrent descriptions and how coding styles affect resource optimization.
The community concluded that while high-end tools like Synplify Pro optimize various RTL styles to a similar result, XST is more sensitive to coding structures. The thread provides several optimized alternatives, including binary trees, lookup table functions, and arithmetic mask-and-shift methods.
- Synthesis results vary significantly between tools like XST and Synplify Pro when handling identical VHDL loop logic.
- Manual unrolling or using a recursive binary tree implementation often helps synthesis tools manage sum widths more efficiently.
- Using a 'mask and add' arithmetic approach (binary tree) proved to be one of the most resource-efficient methods across different compilers.
- XST's Verilog parser sometimes produces better optimization results than its VHDL counterpart for the same logical operation.
- Modern FPGA families with 6-input LUTs may require specific coding patterns or newer compiler versions to reach optimal area utilization.
On Mar 1, 12:59=A0pm, a s <nospa...@gmail.com> wrote:> Dear all, > > I have come up with 2 solutions in VHDL, how to count number of bits > in input data. > The thing I don't understand is why the 2 solutions produce different > results, at least with Xilinx ISE and its XST. > There is quite a substantial difference in required number of slices/ > LUTs. > > 1. solution with unrolled loop: =A0 =A0 =A0 =A041 slices, =A073 LUTs > 2. solution with loop: =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A054 slices, =100 LUTs> > The entity of both architectures is the same: > > entity one_count is > =A0 Port ( din : in =A0STD_LOGIC_vector(31 downto 0); > =A0 =A0 =A0 =A0 =A0dout : out =A0STD_LOGIC_vector(5 downto 0) > =A0 =A0 =A0 =A0 ); > end one_count; > > The architecture with an unrolled loop is the following: > > library IEEE; > use IEEE.STD_LOGIC_1164.ALL; > use IEEE.NUMERIC_STD.ALL; > > entity one_count is > =A0 Port ( din : in =A0STD_LOGIC_vector(31 downto 0); > =A0 =A0 =A0 =A0 =A0dout : out =A0STD_LOGIC_vector(5 downto 0) > =A0 =A0 =A0 =A0 ); > end one_count; > > architecture one_count_unrolled_arch of one_count is > > =A0 signal =A0cnt : integer range 0 to 32; > > begin > > =A0 =A0cnt <=3D to_integer(unsigned(din( 0 downto =A00))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 1 downto =A01))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 2 downto =A02))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 3 downto =A03))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 4 downto =A04))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 5 downto =A05))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 6 downto =A06))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 7 downto =A07))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 8 downto =A08))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din( 9 downto =A09))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(10 downto 10))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(11 downto 11))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(12 downto 12))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(13 downto 13))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(14 downto 14))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(15 downto 15))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(16 downto 16))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(17 downto 17))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(18 downto 18))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(19 downto 19))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(20 downto 20))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(21 downto 21))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(22 downto 22))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(23 downto 23))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(24 downto 24))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(25 downto 25))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(26 downto 26))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(27 downto 27))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(28 downto 28))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(29 downto 29))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(30 downto 30))) + > =A0 =A0 =A0 =A0 =A0 to_integer(unsigned(din(31 downto 31))); > > =A0 =A0dout <=3D std_logic_vector(to_unsigned(cnt,6)); > > end one_count_unrolled_arch ; > > And the architecture with a loop is the following: > > architecture one_count_loop_arch of one_count_loop is > > signal =A0cnt : integer range 0 to 32; > > begin > > =A0 process(din) is > =A0 =A0 variable =A0tmp : integer range 0 to 32; > =A0 =A0 begin > =A0 =A0 =A0 tmp :=3D to_integer(unsigned(din(0 downto 0))); > =A0 =A0 =A0 for i in 1 to 31 loop > =A0 =A0 =A0 =A0 =A0 tmp :=3D tmp + to_integer(unsigned(din(i downto i))); > =A0 =A0 =A0 end loop; > =A0 =A0 =A0 cnt <=3D tmp; > =A0 end process; > > =A0 dout <=3D std_logic_vector(to_unsigned(cnt,6)); > > end one_count_loop_arch ; > > I would be really grateful if somebody could point out what I did > wrong with the 2. solution with loop. > It certainly must be my mistake, but I can not find it... > > Additionally, I know that this "brute-force" one counting might not be > the optimal approach, > but this is just my first attempt to get the job done. If somebody has > a better solution, I would > appreciate it if you could share it. > > Regards, > Petersee what you get with this function instead (a function I have used before): function count_ones(slv : std_logic_vector) return natural is varaible n_ones : natural :=3D 0; begin for i in slv'range loop if slv(i) =3D '1' then n_ones :=3D n_ones + 1; end if; end loop; return n_ones; end function count_ones; .... inside architecture, no process needed: dout <=3D std_logic_vector( to_unsigned(count_ones(din), dout'length) ); The beauty with this is the function will work with a std_logic_vector of any length.
A good synthesis tool should be able to optimize either version to the same implementation. But there are semantic differences that Xilinx may be getting hung up on. In the unrolled version, you have a long expression, and there is freedom within vhdl to evaluate it in different orders or groups of operations. In the loop version, since you are continually updating tmp, you are describing an explicitly sequential order in which the calculation takes place. Like I said, a good synthesis tool should be able to handle either one equally well, but you get what you pay for in synthesis tools. If you are looking for a general solution to the problem for any size vector, try a recursive function implentation of a binary tree and see what happens. Just for kicks, you might also put the whole calculation in the loop (0 to 31), with temp set to 0 before the loop. Shouldn't make any difference, but then again, we're already seeing differences where there should be none. On the other hand, if what you have works (fits and meets timing), use the most maintainable, understandable version. It will save you time (=money) in the long run. It is often interesting to find out what is "the optimal" way to code something such that it results in the smallest/fastest circuit. But in the big picture, it most often does not matter, especially when you write some cryptic code to squeeze the last pS/LUT out of it, and you had plenty of slack and space to spare anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in handy every once in a while. Andy
Dear Andy, Tricky, thank you both for your valuable input. Please find my comments below. On Mar 1, 3:49=A0pm, Andy <jonesa...@comcast.net> wrote:> A good synthesis tool should be able to optimize either version to the > same implementation. But there are semantic differences that Xilinx > may be getting hung up on.Aha, that's a good thing, it means that I did not make some obvious mistake. ;-)> If you are looking for a general solution to the problem for any size > vector, try a recursive function implentation of a binary tree and see > what happens.OK, I didn't quite get that but will consider it again.> Just for kicks, you might also put the whole calculation in the loop > (0 to 31), with temp set to 0 before the loop. Shouldn't make any > difference, but then again, we're already seeing differences where > there should be none.Sorry I didn't tell you before. I have already tried that and in this case XST produces the same result.> On the other hand, if what you have works (fits and meets timing), use > the most maintainable, understandable version. It will save you time > (=3Dmoney) in the long run. It is often interesting to find out what is > "the optimal" way to code something such that it results in the > smallest/fastest circuit. But in the big picture, it most often does > not matter, especially when you write some cryptic code to squeeze the > last pS/LUT out of it, and you had plenty of slack and space to spare > anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in > handy every once in a while.Andy, I completely agree with what you have written above. One should strive for maintainable and understandable version. Although, on my particular case, I have to find a good solution in terms of LUT resources, because I need 8 instances of one counters with 64-bit input data. And the device is getting full... Tricky, your approach does indeed look very neat. I like it. Although it is far less efficient than mine. For the same input/output ports, your version with function requires 171 Slices in 313 LUTs. (The minimum that I get with unrolled version is 41 Slices and 73 LUTs).
Here's a slightly different approach to your problem.... It tries to take advantage of the fact that the LUTs are pretty good as lookup tables. It's in Verilog, but you should easily be able to convert it to VHDL. `define V1 module tst ( input [31:0] data, output [5:0] cnt ); `ifdef V1 /* Device utilization summary: --------------------------- Selected Device : 3s50pq208-5 Number of Slices: 35 out of 768 4% Number of 4 input LUTs: 62 out of 1536 4% Number of IOs: 38 Number of bonded IOBs: 0 out of 124 0% */ function [1:0] cnt3; input [2:0] d_in; begin case (d_in) 3'h0: cnt3 = 2'h0; 3'h1: cnt3 = 2'h1; 3'h2: cnt3 = 2'h1; 3'h3: cnt3 = 2'h2; 3'h4: cnt3 = 2'h1; 3'h5: cnt3 = 2'h2; 3'h6: cnt3 = 2'h2; 3'h7: cnt3 = 2'h3; endcase end endfunction assign cnt = cnt3(data[2:0]) + cnt3(data[5:3]) + cnt3(data[8:6]) + cnt3(data[11:9]) + cnt3(data[14:12]) + cnt3(data[17:15]) + cnt3(data[20:18]) + cnt3(data[23:21]) + cnt3(data[26:24]) + cnt3(data[29:27]) + cnt3({1'b0, data[31:30]}) ; `endif `ifdef V2 /* Selected Device : 3s50pq208-5 Number of Slices: 44 out of 768 5% Number of 4 input LUTs: 79 out of 1536 5% Number of IOs: 38 Number of bonded IOBs: 0 out of 124 0% */ function [2:0] cnt4; input [3:0] d_in; begin case (d_in) 4'h0: cnt4 = 3'h0; 4'h1: cnt4 = 3'h1; 4'h2: cnt4 = 3'h1; 4'h3: cnt4 = 3'h2; 4'h4: cnt4 = 3'h1; 4'h5: cnt4 = 3'h2; 4'h6: cnt4 = 3'h2; 4'h7: cnt4 = 3'h3; 4'h8: cnt4 = 3'h1; 4'h9: cnt4 = 3'h2; 4'ha: cnt4 = 3'h2; 4'hb: cnt4 = 3'h3; 4'hc: cnt4 = 3'h2; 4'hd: cnt4 = 3'h3; 4'he: cnt4 = 3'h3; 4'hf: cnt4 = 3'h4; endcase end endfunction assign cnt = cnt4(data[3:0]) + cnt4(data[7:4]) + cnt4(data[11:8]) + cnt4(data[15:12]) + cnt4(data[19:16]) + cnt4(data[23:20]) + cnt4(data[27:24]) + cnt4(data[31:28]) ; `endif endmodule John Providenza
> > If you are looking for a general solution to the problem for any size > > vector, try a recursive function implentation of a binary tree and see > > what happens. > > OK, I didn't quite get that but will consider it again. >The synthesis tool may not be able to figure out that it need not carry all the bits of the sum through every caculation in the loop. A binary tree implementation can manage the sum width at every stage. For a recursive binary tree implementation, define a function that divides the vector into two ~equal parts (i.e. n/2 and n/2 + n mod 2), calls itself on each one, and returns the sum of the two results. Stop recursion when the size of the incoming vector is 1 (just return 1 if the bit is set, and 0 if not). This is synthesizeable as long as the recursion is statically bound (which it is, by the size of the original vector). It should work out pretty close to what johnp's approach does, except work for any size input vector. Andy
In comp.arch.fpga Andy <jonesandy@comcast.net> wrote: (snip)> The synthesis tool may not be able to figure out that it need not > carry all the bits of the sum through every caculation in the loop. A > binary tree implementation can manage the sum width at every stage.Is that the same as a Carry Save Adder tree? Maybe not. The CSA has three inputs and two outputs, so it isn't exactly binary. -- glen
Peter wrote:> > If somebody has a better solution, I would appreciate it if you could share it. >When I looked at this some years back, XST worked well enough at creating an adder cascade using the old "mask and add" software trick that I never bothered writing something more optimal: gen_bcnt32: if ( ALU_WIDTH = 32 ) generate begin process(din) -- multiple variables not needed, but make intermediate steps visible in simulation variable temp : unsigned (ALU_MSB downto 0); variable temp1 : unsigned (ALU_MSB downto 0); variable temp2 : unsigned (ALU_MSB downto 0); variable temp3 : unsigned (ALU_MSB downto 0); variable temp4 : unsigned (ALU_MSB downto 0); variable temp5 : unsigned (ALU_MSB downto 0); begin temp := unsigned(din); temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND X"5555_5555"); -- 0..2 out x16 temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND X"3333_3333"); -- 0..4 out x8 temp3 := (temp2 AND X"0707_0707") + ( ( temp2 srl 4) AND X"0707_0707"); -- 0..8 out x4 temp4 := (temp3 AND X"001f_001f") + ( ( temp3 srl 8) AND X"001f_001f"); -- 0..16 out x2 temp5 := (temp4 AND X"0000_003f") + ( ( temp4 srl 16) AND X"0000_003f"); -- 0..32 out cnt <= std_logic_vector(temp5(5 downto 0)); end process; end generate gen_bcnt32; Brian
In comp.arch.fpga Brian Davis <brimdavis@aol.com> wrote: (snip)> When I looked at this some years back, XST worked well enough > at creating an adder cascade using the old "mask and add" software > trick that I never bothered writing something more optimal:(snip)> temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND > X"5555_5555"); -- 0..2 out x16 > > temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND > X"3333_3333"); -- 0..4 out x8OK, that would be a binary tree. I believe the CSA adder tree is slightly more efficient, though it might depend on the number of inputs. The binary tree cascade works especially well on word oriented machines, and can be easily written in many high-level languages (with the assumption of knowing the word size). The first stage of a CSA tree starts with N inputs, and generates N/3 two bit outputs that are the sums and carries from N/3 full adders. (If N isn't a multiple of three, then one bit may bypass the stage, and two bits go into a half adder.) The next stage takes the N/3 ones and N/3 twos, and generates N/9 ones, 2N/9 twos, and N/9 fours. You can continue until there is only one bit of each, or sometimes there are other optimizations near the end. Last time I did one, I only needed to know zero, one, two, three, or more than three, which simplifies it slightly. It also pipelines well, but then so does the binary tree. -- glen
Andy nailed it again when he said you get what you pay for regarding synthesis tool. Initially I was synthesising with ISE12.4 and the results were, hm, inconsistent. After switching the synthesis tool to SynplifyPro v2010.03 the results were as expected and, of course, even better than that. Please see the table below. Tricky's version is denoted "funct", where there are major differences between the 2 tools: ---------- 32-bit input data -------- unrolled: XST 74 LUTs, 41 slices unrolled: SynplifyPro 57 LUTs, 34 slices loop: XST 100 LUTs, 54 slices loop: SynplifyPro 57 LUTs, 34 slices funct: XST 317 LUTs, 161 slices funct: SynplifyPro 58 LUTs, 34 slices ---------- 64-bit input data -------- unrolled: XST 174 LUTs, 96 slices unrolled: SynplifyPro 129 LUTs, 80 slices loop: XST 227 LUTs, 121 slices loop: SynplifyPro 129 LUTs, 80 slices funct: XST 813 LUTs, 436 slices funct: SynplifyPro 130 LUTs, 82 slices Peter





