FPGARelated.com
Forums

Bit error counter - how to make it faster

Started by gamer June 27, 2007
My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,

Verilog code:
----------
assign mismatch[datawidth-1:0]  =  input_data ^ expected_data;
assign matched  =  ~( | mismatch);  // reduction NOR

always @(posedge clk  or  posedge reset) begin
   if (reset)
      error_count = 0;
   else if (~matched)
      for (i=0; i<datawidth; i=i+1)
         error_count = error_count + mismatch[i];
end
---------/////////----------------------
The above meets timing for small datawidths (like 8-bits) and starts
failing to meet timing once the datawidth gets larger. I will be glad
for suggestions of alternate ways to implement this bit-error counter.
In practice our datawidths could go upto 256 bits.

Thanks

> The above meets timing for small datawidths (like 8-bits) and starts > failing to meet timing once the datawidth gets larger. I will be glad > for suggestions of alternate ways to implement this bit-error counter. > In practice our datawidths could go upto 256 bits.
Spread your algorithm to work over more than one clock cycle. Use a pipeline stage over eigth clocks. The first stage calculates bit 0 to 7, the second calculates 8 to 15 and so on. In an additional stage you sum up all results. The result is always 9 clocks behind. See http://en.wikipedia.org/wiki/Pipeline_%28computer%29 for a general explanation of pipelines. hth G&#4294967295;nther Jehle
On Jun 27, 12:34 pm, gamer <csan...@gmail.com> wrote:
> My goal is to implement a bit-error counter targeting 1GHz. The > datawidth is parametrizable. I started off this way, > > Verilog code: > ---------- > assign mismatch[datawidth-1:0] = input_data ^ expected_data; > assign matched = ~( | mismatch); // reduction NOR > > always @(posedge clk or posedge reset) begin > if (reset) > error_count = 0; > else if (~matched) > for (i=0; i<datawidth; i=i+1) > error_count = error_count + mismatch[i]; > end > ---------/////////---------------------- > The above meets timing for small datawidths (like 8-bits) and starts > failing to meet timing once the datawidth gets larger. I will be glad > for suggestions of alternate ways to implement this bit-error counter. > In practice our datawidths could go upto 256 bits.
I know no verilog and am not even a C programmer, but ISTM that the summation could be highly parallel by using a tree of adders (perhaps using carry save to further accelerate the summation). Consider it like computing the population count followed by an ordinary addition: one can add together the even and odd mismatch bits in parallel, then add pairs of these sums, etc., finally adding the sum to error_count. (BTW, if I understand correctly one should not need the "reduction NOR" and the "if (~matched)" since if the popcount is zero, adding it to error_count would effectively be a no-op, assuming worst-case speed is the only consideration. As Gunther Jehle pointed out pipelining could provide higher throughput.) Paul A. Clayton just a technophile
On 2007-06-27, gamer <csanjay@gmail.com> wrote:
> else if (~matched) > for (i=0; i<datawidth; i=i+1) > error_count = error_count + mismatch[i];
Although from a software (and possibly a simulation) perspective, that looks like it should reduce the workload by computing 'matched' and "avoiding" the loop if it is not needed, in hardware your speed is going to be limited by the worst case (and the additions are happening all the time, only the enables on the error_count register are affected by your 'if'). If you must have your answer on the very next clock, your options are severely limited. If you only care about the answer occasionally (or with a fixed latency that's greater than one clock), think about how you can take advantage of that. To follow along your original lines of thinking: If you can really find 'matched' at your desired clockrate, and it is infrequent, then you could stuff 'mismatch' into a fifo whenever it's nonzero and add it in a separate loop which only has to keep up with the worst case bit error rate. Or, if, as you said, the process is fast enough for 1 bit, make yourself 'datawidth' 1-bit error counters and add their results whenever you want the total number of errors. -- Ben Jackson AD7GD <ben@ben.com> http://www.ben.com/
gamer wrote:
> My goal is to implement a bit-error counter targeting 1GHz. The > datawidth is parametrizable. I started off this way, > > Verilog code: > ---------- > assign mismatch[datawidth-1:0] = input_data ^ expected_data; > assign matched = ~( | mismatch); // reduction NOR > > always @(posedge clk or posedge reset) begin > if (reset) > error_count = 0; > else if (~matched) > for (i=0; i<datawidth; i=i+1) > error_count = error_count + mismatch[i]; > end > ---------/////////---------------------- > The above meets timing for small datawidths (like 8-bits) and starts
Remember this, it is a key to a possible solution!
> failing to meet timing once the datawidth gets larger. I will be glad > for suggestions of alternate ways to implement this bit-error counter. > In practice our datawidths could go upto 256 bits.
You need to parallelize your error counter, particularly for very wide (256-bit?) paths: A few layers of full adders to reduce the 256 input lines into a more reasonable number, then one or more carry-save accumulators to gather these into something that can be handled very quickly. Only when reading out the number of errors do you propagate the carries to generate the final/real counts. Terje -- - <Terje.Mathisen@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
On 28 Jun., 15:40, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:
> A few layers of full adders to reduce the 256 input lines into a more > reasonable number, then one or more carry-save accumulators to gather > these into something that can be handled very quickly. > > Only when reading out the number of errors do you propagate the carries > to generate the final/real counts.
In Virtex-5 the 6-LUTs point to a 6-to-3 reduction instead of full adders. You can also use BRAMs for an 11 to 4 reduction. 512 bits at 500MHz might be easier than 256 bits at 1GHz. Kolja Sulimma
gamer <csanjay@gmail.com> writes:

> My goal is to implement a bit-error counter targeting 1GHz. The > datawidth is parametrizable. I started off this way,
How much resources do you have available and how often do you want to process the counters? If you have lots of registers available and just want to sample the counters at some point in time (e.g. after one hour of operation) and present the result on a PC or whatever you could: Use datawidth X linear feedback shift registers (of sequence length error count, or some more optimal length for the LFSR sequence) as counters. This is only a single XOR delay between two registers. Then sample all the counters. Transfer the results to your PC. Convert the LFSR sequences and add the values before you present the result. However, you can't do this if you have to continuously present the result. You haven't provided enough details about this in your specification. Petter -- A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing on usenet and in e-mail?
On Jun 27, 9:34 am, gamer <csan...@gmail.com> wrote:
> My goal is to implement a bit-error counter targeting 1GHz. The > datawidth is parametrizable. I started off this way,
How is the input data created? If the register values come from bit set/clear operations, you can keep track of the number of ones with a separate counter that is updated appropriately when the register is updated. For example, suppose that you'd like to know the number of ones in a shift register. If the input bit and the bit shifted off (and thus removed from the register) are the same, the separate counter is unchanged. If the input is 1 and the bit shifted off is 0, the counter is incremented. If the input is 0 and the bit shifted off is 1, the counter is decremented. When the register as a whole is cleared, the counter is also cleared. It's easy enough to extend this to any update scheme where the number of bits that can change each cycle is small, even if those can be in arbitrary positions.
On 6 28 ,   12 34 , gamer <csan...@gmail.com> wrote:
> My goal is to implement a bit-error counter targeting 1GHz. The > datawidth is parametrizable. I started off this way, > > Verilog code: > ---------- > assign mismatch[datawidth-1:0] = input_data ^ expected_data; > assign matched = ~( | mismatch); // reduction NOR >
I guess the critical path of timming is summation chain. There're serveral binary tricks to get number of bit "1" in a word, take C code of 32-bit word for example, there's only 5 summations instead of 32. unsigned int bitcount32(unsigned int x) { x = (x & 0x55555555UL) + ((x >> 1) & 0x55555555UL); // 0-2 in 2 bits x = (x & 0x33333333UL) + ((x >> 2) & 0x33333333UL); // 0-4 in 4 bits #if 1 // Version 1 x = (x & 0x0f0f0f0fUL) + ((x >> 4) & 0x0f0f0f0fUL); // 0-8 in 8 bits x = (x & 0x00ff00ffUL) + ((x >> 8) & 0x00ff00ffUL); // 0-16 in 16 bits x = (x & 0x0000ffffUL) + ((x >> 16) & 0x0000ffffUL); // 0-31 in 32 bits return x; #else // Version 2 x = (x + (x >> 4)) & 0x0f0f0f0fUL; // 0-8 in 4 bits x += x >> 8; // 0-16 in 8 bits x += x >> 16; // 0-32 in 8 bits return x & 0xff; #endif }
> always @(posedge clk or posedge reset) begin > if (reset) > error_count = 0; > else if (~matched) > for (i=0; i<datawidth; i=i+1) > error_count = error_count + mismatch[i]; > end > ---------/////////---------------------- > The above meets timing for small datawidths (like 8-bits) and starts > failing to meet timing once the datawidth gets larger. I will be glad > for suggestions of alternate ways to implement this bit-error counter. > In practice our datawidths could go upto 256 bits. > > Thanks
YUAN, Nan wrote:

(snip)

> There're serveral binary tricks to get number of bit "1" in a word, > take C code of 32-bit word for example, there's only 5 summations > instead of 32.
> unsigned int bitcount32(unsigned int x) > { > x = (x & 0x55555555UL) + ((x >> 1) & 0x55555555UL); // 0-2 in 2 > bits > x = (x & 0x33333333UL) + ((x >> 2) & 0x33333333UL); // 0-4 in 4 > bits > #if 1
(snip)
> #else > // Version 2 > x = (x + (x >> 4)) & 0x0f0f0f0fUL; // 0-8 in 4 bits > x += x >> 8; // 0-16 in 8 bits > x += x >> 16; // 0-32 in 8 bits > return x & 0xff; > #endif
But note that you do a lot more arithmetic than is needed. This takes advantage of the availability of a 32 bit ALU. You do five 32 bit additions with carry, and four 32 bit AND operations. The hardware implementation using carry save adders is much more efficient, in terms of both logic and speed. 11 full adders for the first level, eight for the second level, five for the third level, four for the fourth level, three for the fifth level, three for the sixth level, two each for the seventh and eighth level, and one final OR gate. If I counted right. No carry logic is needed. Some can be half adders for ASIC, and there might be some other changes to make optimal use of FPGA LUTs. -- glen