Greetings, I need to detect runs. I want to look at 65 bits and show when there are 9 consecutive 1s or 0s from the byte boundaries resulting in 8 values per clock. This should be comfortably done in two logic levels (I need clean logic delays). The idea is simple but the implementation is tough. I'm working with Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to design violence to get the results that I believe are "best." Any thoughts on how to do this "better?" (the following code likes fixed fonts) - John_H ===================================== module testRun ( input clk , input [64:0] bytePlus1 , output reg [ 7:0] runByte /* synthesis xc_props = "INIT=R" */ ); // INIT included to force register as FD primitive - bleah reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg" wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah reg [ 7:0] runByte_; integer i,j,k; always @(*) begin runBits = -24'h1; runByte_ = -8'h1; k = 0; // overlapping aaa aaaa for( i=0; i<64; i=i+j ) // consecutive aaaa begin // bit regions 876543210 for( j=0; (i%8+j<8) && (j<3); j=j+1 ) runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]); runByte_[i/8] = runByte_[i/8] & runBits_[k]; k = k + 1; end end always @(posedge clk) runByte = runByte_; endmodule
Digesting runs of ones or zeros "well"
Started by ●September 30, 2003
Reply by ●October 1, 20032003-10-01
I can't say I really understand your problem statement. I also don't
see how your code is solving the problem. Can you give a better
explanation of the problem? What do you mean "from the byte
boundaries"? Are you counting only the bit sets of {0..8}, {8..16},
{16..24}...? If so, this seems like an easy problem to implement.
John_H wrote:
>
> Greetings,
>
> I need to detect runs. I want to look at 65 bits and show when there are 9
> consecutive 1s or 0s from the byte boundaries resulting in 8 values per
> clock. This should be comfortably done in two logic levels (I need clean
> logic delays).
>
> The idea is simple but the implementation is tough. I'm working with
> Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to
> design violence to get the results that I believe are "best."
>
> Any thoughts on how to do this "better?" (the following code likes fixed
> fonts)
>
> - John_H
> =====================================
> module testRun ( input clk
> , input [64:0] bytePlus1
> , output reg [ 7:0] runByte /* synthesis xc_props = "INIT=R"
> */
> ); // INIT included to force register as FD primitive - bleah
>
> reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg"
> wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah
> reg [ 7:0] runByte_;
> integer i,j,k;
>
> always @(*)
> begin
> runBits = -24'h1;
> runByte_ = -8'h1;
> k = 0; // overlapping aaa aaaa
> for( i=0; i<64; i=i+j ) // consecutive aaaa
> begin // bit regions 876543210
> for( j=0; (i%8+j<8) && (j<3); j=j+1 )
> runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]);
> runByte_[i/8] = runByte_[i/8] & runBits_[k];
> k = k + 1;
> end
> end
> always @(posedge clk) runByte = runByte_;
>
> endmodule
--
Rick "rickman" Collins
rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX
Reply by ●October 1, 20032003-10-01
"rickman" <spamgoeshere4@yahoo.com> wrote in message news:3F7B2060.A4FA04B@yahoo.com...> I can't say I really understand your problem statement. I also don't > see how your code is solving the problem. Can you give a better > explanation of the problem? What do you mean "from the byte > boundaries"? Are you counting only the bit sets of {0..8}, {8..16}, > {16..24}...? If so, this seems like an easy problem to implement.You have it right - 8:0, 16:8, 24:16.... In the 41 bits illustrated below I want to note when the sequence across [16:8] (illustrated by the 9 1s below the count) in result[1]. 11010101000101010111101111111111110010101 pattern 09876543210987654321098765432109876543210 count 00000000000000000000000011111111100000000 (reference) The trouble is that while it seems easy to implement, getting the logic to come out clean in the implementation - the "pushing the rope" problem - doesn't comes easily. If one does a loop looking for 8 adjacent equals or the 8-wide AND of 8 XNORs, the realization takes up 3 levels of logic with FDR primitives resulting in 2x-3x the resources and about twice the delay. The code with the two nested for loops breaks up the 9-bit compare into two 4-bit and one 3-bit compare to verify all the 9 bits are equal to each other and to break the implementation into 2 levels of logic rather than 3+ (the + being from the flop's reset input taking some extra routing delay). I'd love a cleaner approach that doesn't come off so complex and gets realized in the resources it "should" use. It's tough to get it where I want by pushing on the rope.> John_H wrote: > > > > Greetings, > > > > I need to detect runs. I want to look at 65 bits and show when thereare 9> > consecutive 1s or 0s from the byte boundaries resulting in 8 values per > > clock. This should be comfortably done in two logic levels (I needclean> > logic delays). > > > > The idea is simple but the implementation is tough. I'm working with > > Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to > > design violence to get the results that I believe are "best." > > > > Any thoughts on how to do this "better?" (the following code likesfixed> > fonts) > > > > - John_H > > ===================================== > > module testRun ( input clk > > , input [64:0] bytePlus1 > > , output reg [ 7:0] runByte /* synthesis xc_props ="INIT=R"> > */ > > ); // INIT included to force register as FD primitive -bleah> > > > reg [23:0] runBits; // I wanted the syn_keep on this combinatorial"reg"> > wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah > > reg [ 7:0] runByte_; > > integer i,j,k; > > > > always @(*) > > begin > > runBits = -24'h1; > > runByte_ = -8'h1; > > k = 0; // overlapping aaa aaaa > > for( i=0; i<64; i=i+j ) // consecutive aaaa > > begin // bit regions 876543210 > > for( j=0; (i%8+j<8) && (j<3); j=j+1 ) > > runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]); > > runByte_[i/8] = runByte_[i/8] & runBits_[k]; > > k = k + 1; > > end > > end > > always @(posedge clk) runByte = runByte_; > > > > endmodule
Reply by ●October 1, 20032003-10-01
John_H wrote:> > "rickman" <spamgoeshere4@yahoo.com> wrote in message > news:3F7B2060.A4FA04B@yahoo.com... > > I can't say I really understand your problem statement. I also don't > > see how your code is solving the problem. Can you give a better > > explanation of the problem? What do you mean "from the byte > > boundaries"? Are you counting only the bit sets of {0..8}, {8..16}, > > {16..24}...? If so, this seems like an easy problem to implement. > > You have it right - 8:0, 16:8, 24:16.... In the 41 bits illustrated below I > want to note when the sequence across [16:8] (illustrated by the 9 1s below > the count) in result[1]. > > 11010101000101010111101111111111110010101 pattern > 09876543210987654321098765432109876543210 count > 00000000000000000000000011111111100000000 (reference) > > The trouble is that while it seems easy to implement, getting the logic to > come out clean in the implementation - the "pushing the rope" problem - > doesn't comes easily. If one does a loop looking for 8 adjacent equals or > the 8-wide AND of 8 XNORs, the realization takes up 3 levels of logic with > FDR primitives resulting in 2x-3x the resources and about twice the delay. > > The code with the two nested for loops breaks up the 9-bit compare into two > 4-bit and one 3-bit compare to verify all the 9 bits are equal to each other > and to break the implementation into 2 levels of logic rather than 3+ (the + > being from the flop's reset input taking some extra routing delay). > > I'd love a cleaner approach that doesn't come off so complex and gets > realized in the resources it "should" use. It's tough to get it where I > want by pushing on the rope.My experience has been that it does not much matter how you code combinatorial logic like this. The tools run it through a grinder and produce an optimal version (in its own mind). When I want to optimize like this, I either use a "keep" attribute on the wire, or sometimes you can instantiate primitives. For logic I don't think primitives work since gates just get remapped. But I still don't understand your code. Why does the outer loop range over 64 values. I would code two nested loops where the outer loop ranges over the 8 outputs and the inner loop ranges over the 9 inputs for each output. Or just skip the inner loop and use two outputs from two sets of four inputs feeding a 3 input function and use keeps on the first two output arrays. Maybe that is what you are doing, but I can't figure out the code easily. I see you are incrementing the i variable by j and ranging j in the second loop by some complex control expression. Can't you just increment i by 8? for( i=0; i<64; i=i+8 ) begin k = i % 8; for( j=0; j<4; j=j+1 ) begin runBitsA_[k] = runBitsA_[k] & bytePlus1[i+j]; runBitsB_[k] = runBitsB_[k] & bytePlus1[i+j+4]; end runByte_[i] = runBitsA_[i] & runBitsB_[k] & bytePlus1[i+9]; end Put the keep on runBitsA_ and runBitsB_ and you should get your two level structure.> > John_H wrote: > > > > > > Greetings, > > > > > > I need to detect runs. I want to look at 65 bits and show when there > are 9 > > > consecutive 1s or 0s from the byte boundaries resulting in 8 values per > > > clock. This should be comfortably done in two logic levels (I need > clean > > > logic delays). > > > > > > The idea is simple but the implementation is tough. I'm working with > > > Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to > > > design violence to get the results that I believe are "best." > > > > > > Any thoughts on how to do this "better?" (the following code likes > fixed > > > fonts) > > > > > > - John_H > > > ===================================== > > > module testRun ( input clk > > > , input [64:0] bytePlus1 > > > , output reg [ 7:0] runByte /* synthesis xc_props = > "INIT=R" > > > */ > > > ); // INIT included to force register as FD primitive - > bleah > > > > > > reg [23:0] runBits; // I wanted the syn_keep on this combinatorial > "reg" > > > wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah > > > reg [ 7:0] runByte_; > > > integer i,j,k; > > > > > > always @(*) > > > begin > > > runBits = -24'h1; > > > runByte_ = -8'h1; > > > k = 0; // overlapping aaa aaaa > > > for( i=0; i<64; i=i+j ) // consecutive aaaa > > > begin // bit regions 876543210 > > > for( j=0; (i%8+j<8) && (j<3); j=j+1 ) > > > runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]); > > > runByte_[i/8] = runByte_[i/8] & runBits_[k]; > > > k = k + 1; > > > end > > > end > > > always @(posedge clk) runByte = runByte_; > > > > > > endmodule-- Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX
Reply by ●October 1, 20032003-10-01
John Pushing rope is a good analogy. Jan Gray, yeah? When writing HDL, on one extreme you can write your code to explicity tell the synthesizer how to structure your logic. You'd instantiate logic primitives and wire them all up. You're basically doing all the work for the synthesizer. The benefit is total control. The problem is a lot of code to write, which means increase chances of human error and a lot of rewriting when you make changes to your design. On the other extreme, you have code that's structureless. You imply no logic or wiring inside. It's purely a black box and leaves it up to the synthesizer to figure things out. The benefits are more compact code and "ease" of modification (if your algorithm isn't too convoluted). The problem of course is the synthesizer usually doesn't do exactly what you want. Much like a 2 year old coming up to you with a smile full of pride saying, "I went potty!" only to realize they did it on the floor. So when the synthesizer isn't smart enough, you need to help it out by putting a little more structure into your code. Basically you're giving the synthesizer hints on what you want. Also hardware engineers tend to better understand algorithms that have a bit of hardware structure to them. We once had a math major write VHDL, and the comment from one engineer was, "Man, his code looks like C." :_D Unfortunately I don't use Verilog, but hopefully some psuedo code will be enough to give you an idea of what I'm thinking. I noticed your code was bit based. Since you only are checking on byte boundaries, it can help to write your code byte based. Good luck with this psuedo code (you should see my C). data[64:0] -- 65-bit input data signal eight_0s_consec_flag[7:0] -- intermediate signals eight_1s_consec_flag[7:0] -- intermediate signals ninth_bit[7:0] -- intermediate signals consec_detect_flag[7:0] -- 8-bit output signal for byte = 0...7 lsb = byte*8 msb = byte*8+7 if data[msb:lsb] = "00000000" then eight_0s_consec_flag[byte] = '1' else eight_0s_consec_flag[byte] = '0' end if if data[msb:lsb] = "111111111" then eight_1s_consec_flag[byte] = '1' else eight_1s_consec_flag[byte] = '0' end if ninth_bit[byte] = data[msb+1] if ninth_bit[byte] = '1' then consec_detect_flag[byte] = eight_1s_consec_flag else consec_detect_flag[byte] = eight_0s_consec_flag end if; end loop Well hopefully that's correct, and makes sense. Let me know if you have any questions or see any problems. Regards, Vinh
Reply by ●October 1, 20032003-10-01
John, 1- How many 65 bit words per second (ms, ns?) do you have to process? 2- Where do the 65 bits come from? (internal, external) 3- Do they get into the FPGA in parallel or serially? 4- Why are you saying that you need two levels of logic? (trying to control delay with combinatorial logic is not a great idea). 5- Why fight with inference? Instantiate what primitives you need. Two logic levels? Two LUT's to look at two consecutive nibbles. One LUT to AND the output of the above with the next most significant bit (the ninth bit). That's it. Two levels. 24 LUT's. Is that what you wanted? -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Martin Euredjian To send private email: 0_0_0_0_@pacbell.net where "0_0_0_0_" = "martineu" "John_H" <johnhandwork@mail.com> wrote in message news:Xwoeb.26$XP3.4711@news-west.eli.net...> Greetings, > > I need to detect runs. I want to look at 65 bits and show when there are9> consecutive 1s or 0s from the byte boundaries resulting in 8 values per > clock. This should be comfortably done in two logic levels (I need clean > logic delays). > > The idea is simple but the implementation is tough. I'm working with > Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to > design violence to get the results that I believe are "best." > > Any thoughts on how to do this "better?" (the following code likes fixed > fonts) > > - John_H > ===================================== > module testRun ( input clk > , input [64:0] bytePlus1 > , output reg [ 7:0] runByte /* synthesis xc_props ="INIT=R"> */ > ); // INIT included to force register as FD primitive -bleah> > reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg" > wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah > reg [ 7:0] runByte_; > integer i,j,k; > > always @(*) > begin > runBits = -24'h1; > runByte_ = -8'h1; > k = 0; // overlapping aaa aaaa > for( i=0; i<64; i=i+j ) // consecutive aaaa > begin // bit regions 876543210 > for( j=0; (i%8+j<8) && (j<3); j=j+1 ) > runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]); > runByte_[i/8] = runByte_[i/8] & runBits_[k]; > k = k + 1; > end > end > always @(posedge clk) runByte = runByte_; > > endmodule > >
Reply by ●October 2, 20032003-10-02
Whoopsy, brain-fart. My previous code will create 3 levels of logic. If we
didn't have to detect both nine 1s or nine 0s, then it'd work okay.
Here's an idea for one that should generate 2 levels, but it looks uglier.
Definately not as compact as rickman's.
data[64:0] -- input signal
ninth_bit[7:0] -- intermediate signal
run_dibble[31:0] -- intermediate signal
run_byte[7:0] -- output signal
for byte 0...7
ninth_bit[byte] = data[(byte+1)*8]
for dibble 0...3
lsb = byte*8 + dibble*2
msb = byte*8 + dibble*2 + 1
if data[lsb] = ninth_bit[byte] AND data[msb] = ninth_bit[byte] then
run_dibble[byte*4 + dibble] = 1
else
run_dibble[byte*4 + dibble] = 0
end
end loop
lsb = byte*4
msb = byte*4 + 3
if run_dibble[msb:lsb] = "1111" then
run_byte[byte] = 1
else
run_byte[byte] = 0
end
end loop
Reply by ●October 2, 20032003-10-02
rickman <spamgoeshere4@yahoo.com> wrote in message news:<3F7B3F59.191901A4@yahoo.com>...> My experience has been that it does not much matter how you code > combinatorial logic like this. The tools run it through a grinder and > produce an optimal version (in its own mind). When I want to optimize > like this, I either use a "keep" attribute on the wire, or sometimes you > can instantiate primitives. For logic I don't think primitives work > since gates just get remapped.I overuse the syn_keep attribute and I hate the idea of instantiating LUTs. My Carnot skills aren't exactly used regularly.> But I still don't understand your code. Why does the outer loop range > over 64 values.I've had problems with bit ranges in the past where [i+4:i] is a complaint. Perhaps this isn't an issue with for loops but I've learned to avoid them in general logic. They do work fine in generate blocks, however. I stepped through every bit to make a comparison to the adjacent bit; 3 adjacent comparisons lumped into one variable (with an eventual syn_keep) would give me 4-input functions that should pack into LUTs. The complex end of the inside loop is so that the three "LUTs" per byte are 4-input, 4-input, and 3-input functions.> I would code two nested loops where the outer loop > ranges over the 8 outputs and the inner loop ranges over the 9 inputs > for each output. Or just skip the inner loop and use two outputs from > two sets of four inputs feeding a 3 input function and use keeps on the > first two output arrays. Maybe that is what you are doing, but I can't > figure out the code easily. > > I see you are incrementing the i variable by j and ranging j in the > second loop by some complex control expression. Can't you just > increment i by 8? > > for( i=0; i<64; i=i+8 ) begin > k = i % 8; > for( j=0; j<4; j=j+1 ) begin > runBitsA_[k] = runBitsA_[k] & bytePlus1[i+j]; > runBitsB_[k] = runBitsB_[k] & bytePlus1[i+j+4]; > end > runByte_[i] = runBitsA_[i] & runBitsB_[k] & bytePlus1[i+9]; > end > > Put the keep on runBitsA_ and runBitsB_ and you should get your two > level structure.This works very well for runs of ones only. I need to identify runs of ones or runs of zeros. The technique can be expanded to my needs resulting in runBitsA, B, and C where one of them needs to cover 2 comparisons, not 3 like the others. ...which is really is the approach I was coding but using consecutive bits in a vector rather than {A,B,C} and using the one statement rather than 3 to make the assignments, dealing with the 2 comparison exception by terminating the inside loop early. Thanks for the help.
Reply by ●October 2, 20032003-10-02
"Vinh Pham" <a@a.a> wrote in message news:<XcSeb.39218$5z.21702@twister.socal.rr.com>...> Whoopsy, brain-fart. My previous code will create 3 levels of logic. If we > didn't have to detect both nine 1s or nine 0s, then it'd work okay.Thanks for noticing :-) I like the code below with respect to its symmetry - it's a lot easier to read than the stuff I generated. The four 3-input LUTs feed a single 4-input LUT with (only a) little arguement from the synthesizer, I'm sure. It can be done with fewer LUTs by using 4-input LUTs covering 3 compares each but then the symmetry gets lost and the coding gets unpleasant. I think I have an acceptable solution together that gives me good speed and good utilization which I'll post separately. Thanks for your thoughs with this.> Here's an idea for one that should generate 2 levels, but it looks uglier. > Definately not as compact as rickman's. > > > data[64:0] -- input signal > ninth_bit[7:0] -- intermediate signal > run_dibble[31:0] -- intermediate signal > run_byte[7:0] -- output signal > > > for byte 0...7 > > ninth_bit[byte] = data[(byte+1)*8] > > for dibble 0...3 > > lsb = byte*8 + dibble*2 > msb = byte*8 + dibble*2 + 1 > > if data[lsb] = ninth_bit[byte] AND data[msb] = ninth_bit[byte] then > run_dibble[byte*4 + dibble] = 1 > else > run_dibble[byte*4 + dibble] = 0 > end > > end loop > > lsb = byte*4 > msb = byte*4 + 3 > > if run_dibble[msb:lsb] = "1111" then > run_byte[byte] = 1 > else > run_byte[byte] = 0 > end > > end loop
Reply by ●October 2, 20032003-10-02
"Martin Euredjian" <0_0_0_0_@pacbell.net> wrote in message news:<FALeb.6677$fB4.1788@newssvr29.news.prodigy.com>...> John, > > 1- How many 65 bit words per second (ms, ns?) do you have to process?This run detection is one part of a 100MHz-200MHz mechanism.> 2- Where do the 65 bits come from? (internal, external)Internal, blindsided from BlockRAMs with a new value per cycle.> 3- Do they get into the FPGA in parallel or serially?Entirely parallel, into the BlockRAMs at full width.> 4- Why are you saying that you need two levels of logic? (trying to control > delay with combinatorial logic is not a great idea).If I go from BlockRAM to registers, I have the (relatively) long Tcko delay for the BlockRAM read and associated routing leaving little time to manipulate the data within the period. If I register the data from the BlockRAM, it's best to generate and use the run values in the next cycle requiring moe logic after I flag the runs, suggesting minimum delay is best.> 5- Why fight with inference? Instantiate what primitives you need.The logic primitives are what I've avoided. I don't want to use LUT4 primitives with INIT attributes since I might mess up the carnot map. This is why the inference has been broken down into bits that can be retained (with syn_keep or other method).> Two logic levels? > > Two LUT's to look at two consecutive nibbles. > One LUT to AND the output of the above with the next most significant bit > (the ninth bit). > That's it. Two levels. 24 LUT's. > Is that what you wanted?Almost. The LUTs can't look at full nibbles. Since I need to make sure all bits are equal to each other, there's a "smear." One attempt was to XOR adjacent bits, then to do the 8-wide AND of the result, letting the synth give me the "best" results. It didn't. Thinking about the XOR-to-AND progression, 4 bits are needed to implement 3 bits of the AND, so two 4-bit LUTs and one 3-bit LUT are needed, feeding a 3-input AND. That's it. Two levels. 32 LUTs. But the synth doesn't like my inferrences. I think I have a solution that "works."





