FPGARelated.com
Forums

Chasing Bugs in the Fog

Started by rickman June 17, 2013
I have a bug in a test fixture that is FPGA based.  I had thought it was 
in the software which controls it, but after many hours of chasing it 
around I've concluded it must be in the FPGA code.

I didn't think it was in the VHDL because it had been simulated well and 
the nature of the bug is an occasional dropped character on the receive 
side.  Who can't design a UART?  Well, it could be in the handshake with 
the state machine, but still...

So I finally got around to adding some debug signals which I would 
monitor on an analyzer and guess what, the bug is gone!  I *hate* when 
that happens.  I can change the code so the debug signals only appear 
when a control register is set to enable them, but still, I don't like 
this.  I want to know what is causing this DURN THING!

Anyone see this happen to them before?

Oh yeah, someone in another thread (that I can't find, likely because I 
don't recall the group I posted it in) suggested I add synchronizing FFs 
to the serial data in.  Sure enough I had forgotten to do that.  Maybe 
that was the fix...  of course!  It wasn't metastability, I bet it was 
feeding multiple bits of the state machine!  Durn, I never make that 
sort of error.  Thanks to whoever it was that suggested the obvious that 
I had forgotten.

-- 

Rick
On Mon, 17 Jun 2013 20:00:01 -0400
rickman <gnuarm@gmail.com> wrote:

> So I finally got around to adding some debug signals which I would
> monitor on an analyzer and guess what, the bug is gone! I *hate* when
> that happens. I can change the code so the debug signals only appear
> when a control register is set to enable them, but still, I don't like
> this. I want to know what is causing this DURN THING!
>
> Anyone see this happen to them before?
>
> Oh yeah, someone in another thread (that I can't find, likely because I
> don't recall the group I posted it in) suggested I add synchronizing FFs
> to the serial data in. Sure enough I had forgotten to do that. Maybe
> that was the fix... of course! It wasn't metastability, I bet it was
> feeding multiple bits of the state machine! Durn, I never make that
> sort of error. Thanks to whoever it was that suggested the obvious that
> I had forgotten.
>
> --
>
> Rick
Not metastability, a race condition. Asynchronous external input
headed to multiple clocked elements, each of which it reaches via a
different path with a different delay.

When you added debugging signals you changed the netlist, which changed
the place and route, making unpredictable changes to those delays. In
this case, it happened to push it into a place where _as far as you
tested_, it seems happy. But it's still unsafe, because as you change
other parts of the design, the P&R of that section will still change
anyhow, and you start getting my favorite situation, the problem that
comes and goes based on entirely unrelated factors.

The fix you fixed fixes it. When you resynchronized it on the same
clock as you're running around the rest of the logic, you forced that
path to become timing constrained. As such, the P&R takes it upon
itself to make sure that the timing of that route is irrelevant with
respect to the clock period, and your problem goes away for good.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
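A toy Python model of the race described above, under loud simplifications: the input is a single 0-to-1 step, each flop simply captures whatever has reached its D pin by the clock edge, and setup/hold and metastability are ignored. The function names and the delay and period values are made up for illustration; they are not taken from Rick's design.

```python
import random


def sample(edge_time, delay, clock_edge):
    """Value a flop captures at clock_edge for a 0->1 step that leaves
    the pin at edge_time and needs `delay` to reach the flop's D input."""
    return 1 if edge_time + delay <= clock_edge else 0


def unsynchronized(edge_time, delays, clock_edge=0.0):
    """Every destination flop samples the raw pin through its own route."""
    return [sample(edge_time, d, clock_edge) for d in delays]


def synchronized(edge_time, sync_delay, n_dest, clock_edge=0.0):
    """One flop samples the pin; all destinations then see that single
    registered value on the following clock edge. Assumes the
    flop-to-flop paths meet timing (a modelling assumption)."""
    q = sample(edge_time, sync_delay, clock_edge)
    return [q] * n_dest


def count_disagreements(trials, period, delays, seed=1):
    """Count trials in which the destination flops capture different values."""
    rng = random.Random(seed)
    bad_raw = bad_sync = 0
    for _ in range(trials):
        # The input step lands uniformly within one period before the clock edge.
        t = -rng.uniform(0.0, period)
        raw = unsynchronized(t, delays)
        syn = synchronized(t, delays[0], len(delays))
        bad_raw += len(set(raw)) > 1
        bad_sync += len(set(syn)) > 1
    return bad_raw, bad_sync


if __name__ == "__main__":
    # Hypothetical numbers: 20 ns period, routes of 1.0 ns and 2.5 ns.
    print(count_disagreements(10000, 20.0, [1.0, 2.5]))
```

With these illustrative numbers, `unsynchronized(-2.0, [1.0, 2.5])` has the step reach one flop before the clock edge and the other after it, so the two capture different values. In this model the window in which that happens is exactly the skew between the two routes, which is why a place-and-route change can make the symptom appear or vanish. `synchronized` cannot disagree by construction, which is the point of registering the pin once before fanning out.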
>So I finally got around to adding some debug signals which I would
>monitor on an analyzer and guess what, the bug is gone! I *hate* when
>that happens. I can change the code so the debug signals only appear
>when a control register is set to enable them, but still, I don't like
>this. I want to know what is causing this DURN THING!
>
>Anyone see this happen to them before?
>
>--
>
>Rick
>
Yes, this is called a "Heisenbug". It usually involves a clock domain
crossing mistake.

John Eaton

---------------------------------------
Posted through http://www.FPGARelated.com
One mistake that is not too hard to make is forgetting to put a synchronizer
flop on the input of an edge detector, like you might have on a UART input
(so that the edge detector has two flops, total).  Depending on the routing
delays, this can cause you to miss a sizable percentage of edges.  (Not
just delayed, but missed completely.)  Using only a single flop is sometimes
known as using the "greedy path".

(Actually, to mitigate metastability as well, an edge detector ought to have
three flops and an AND gate.  Using two is sometimes known as using the
"sneaky path".)
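For reference, a cycle-level Python sketch of the two-flop and three-flop edge detectors mentioned above. It is a behavioural model only: the class name and the `stages` parameter are invented here, and a cycle-level model cannot reproduce the routing-delay failure of the single-flop version, so that variant is left out.

```python
class EdgeDetector:
    """Rising-edge detector built from a chain of flops. The pulse is
    decoded from the last two flops, so the raw pin only ever drives
    the D input of the first flop."""

    def __init__(self, stages=3):
        if stages < 2:
            raise ValueError("need at least two flops")
        self.q = [0] * stages

    def clock(self, din):
        """Advance one clock cycle and return 1 if a rising edge is flagged."""
        # All flops update together on the clock edge: shift the chain by one.
        self.q = [din] + self.q[:-1]
        # AND gate with one inverted input, fed only by registered signals.
        return self.q[-2] & (1 - self.q[-1])


if __name__ == "__main__":
    stimulus = [0, 0, 1, 1, 1, 0]
    for stages in (2, 3):
        det = EdgeDetector(stages)
        print(stages, [det.clock(bit) for bit in stimulus])
```

The three-flop version flags the edge one cycle later than the two-flop version. In exchange, the flop that samples the pin gets a full clock period to settle before anything decodes its output.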
On 6/17/2013 8:14 PM, Rob Gaddi wrote:
> On Mon, 17 Jun 2013 20:00:01 -0400
> rickman<gnuarm@gmail.com> wrote:
>
>> So I finally got around to adding some debug signals which I would
>> monitor on an analyzer and guess what, the bug is gone! I *hate* when
>> that happens. I can change the code so the debug signals only appear
>> when a control register is set to enable them, but still, I don't like
>> this. I want to know what is causing this DURN THING!
>>
>> Anyone see this happen to them before?
>>
>> Oh yeah, someone in another thread (that I can't find, likely because I
>> don't recall the group I posted it in) suggested I add synchronizing FFs
>> to the serial data in. Sure enough I had forgotten to do that. Maybe
>> that was the fix... of course! It wasn't metastability, I bet it was
>> feeding multiple bits of the state machine! Durn, I never make that
>> sort of error. Thanks to whoever it was that suggested the obvious that
>> I had forgotten.
>>
>> --
>>
>> Rick
>
> Not metastability, a race condition. Asynchronous external input
> headed to multiple clocked elements, each of which it reaches via a
> different path with a different delay.
>
> When you added debugging signals you changed the netlist, which changed
> the place and route, making unpredictable changes to those delays.
No, when changing the debug output I added the synchronization FFs, which
fixed the problem. My point was that when the other poster suggested that I
needed to sync to the clock, I mistook that for a metastability concern,
forgetting that the input went to multiple sections of logic. So actually I
made the same mistake twice... lol
> In
> this case, it happened to push it into a place where _as far as you
> tested_, it seems happy. But it's still unsafe, because as you change
> other parts of the design, the P&R of that section will still change
> anyhow, and you start getting my favorite situation, the problem that
> comes and goes based on entirely unrelated factors.
>
> The fix you fixed fixes it. When you resynchronized it on the same
> clock as you're running around the rest of the logic, you forced that
> path to become timing constrained. As such, the P&R takes it upon
> itself to make sure that the timing of that route is irrelevant with
> respect to the clock period, and your problem goes away for good.
Just to make sure of what was what (it has been two years since I last
worked with this design) I pulled the FFs out and added back just one.
Sure enough the bug reappears with no FFs, but goes away with just one.
The added debug info available allowed me to see exactly the error and
sure enough, when a start bit comes in there is a chance that the two
counters are not properly set and the error shows up in the center of
the bit where the current contents of the shift register are moved into
the holding register as a new char.

I guess what most likely happened is that when I wrote the UART code I
assumed the sync FFs would be external and when I wrote the wrapper code
I assumed the FFs were inside the UART. In other words, I didn't have a
proper spec and never gave this problem proper consideration.

I will revisit this design and look at the other inputs. No reason to
assume I didn't make the same mistake elsewhere.

--

Rick
On 6/18/2013 3:16 PM, Kevin Neilson wrote:
> One mistake that is not too hard to make is forgetting to put a
> synchronizer flop on the input of an edge detector, like you might have
> on a UART input (so that the edge detector has two flops, total).
> Depending on the routing delays, this can cause you to miss a sizable
> percentage of edges. (Not just delayed, but missed completely.) Using
> only a single flop is sometimes known as using the "greedy path".
>
> (Actually, to mitigate metastability as well, an edge detector ought to
> have three flops and an AND gate. Using two is sometimes known as using
> the "sneaky path".)
Everyone is saying the same thing, so I guess I didn't explain clearly.
Someone had already pointed out to me that I needed a synchronizer on
the received data signal in another thread that I can't find now. I
took them at their word, but was thinking they meant it was about
metastability which I figured was not a problem at these speeds (yes,
the speeds do make a difference for metastability since you never chase
it away, you just minimize it). I wasn't thinking about the serial in
signal feeding the state machine, just the shift register.

So when I made the changes, which included the synchronizer, it worked.
Because I didn't expect the synchronizer to do anything, I had forgotten
about it until I was typing the post here. I remembered at the end of
the message and realized that was what fixed the problem...

Sorry for the confusion. Still, thanks to all who replied and
especially the mystery person who suggested it in the other thread
wherever that was.

--

Rick
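The remark that speed matters for metastability, and that it can only be minimized, lines up with the usual textbook MTBF estimate for a synchronizer flop: MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data). A small Python sketch of that formula follows. The parameter names are chosen here, and `tau` and `t_window` are device-specific constants; the values in the example are invented placeholders, not figures for any real part.

```python
import math


def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Textbook mean time between synchronizer failures, in seconds.

    t_resolve: time the flop output is given to settle (s)
    tau:       resolution time constant of the flop (s), device-specific
    t_window:  metastability capture window (s), device-specific
    f_clk:     sampling clock frequency (Hz)
    f_data:    rate of asynchronous input transitions (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


if __name__ == "__main__":
    # Placeholder constants for illustration only (assumed, not measured).
    tau, t_window = 0.1e-9, 0.1e-9
    fast = synchronizer_mtbf(1e-9, tau, t_window, 200e6, 1e6)
    slow = synchronizer_mtbf(2e-9, tau, t_window, 50e6, 1e4)
    print(fast, slow)
```

Slower clocks and data rates enlarge the result linearly, and extra settling time enlarges it exponentially, but the result is finite for any finite settling time.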
On 18/06/2013 23:45, rickman wrote:

> I guess what most likely happened is that when I wrote the UART code I
> assumed the sync FFs would be external and when I wrote the wrapper code
> I assumed the FFs were inside the UART. In other words, I didn't have a
> proper spec and never gave this problem proper consideration.
Several years ago a young engineer reused my long-proven UART code and
modified it, carelessly removing the synchronizing FF. He came to see me
and complained that my UART didn't work: it hung after some unpredictable
time. I thought for a few minutes, guessed he had probably removed the FF,
and fixed his problem right away.

Nicolas
That's the same thing that happened to me when I had the problem last.  I
had an edge detector connected to a big synchronizer module that was in turn
connected to all the input pins.  When I had problems I looked inside the
synchronizer module and found that it didn't have a flop on that line; it
was just wired straight through.
rickman <gnuarm@gmail.com> wrote:
> Everyone is saying the same thing, so I guess I didn't explain clearly.
> Someone had already pointed out to me that I needed a synchronizer on
> the received data signal in another thread that I can't find now. I
> took them at their word, but was thinking they meant it was about
> metastability which I figured was not a problem at these speeds (yes,
> the speeds do make a difference for metastability since you never chase
> it away, you just minimize it). I wasn't thinking about the serial in
> signal feeding the state machine, just the shift register.
There are 3 things that could have gone wrong (and might still be going
wrong):

1. You failed to synchronise between the clock domain of the input serial
   link and the clock of your system (sounds like you fixed this one).

2. You failed to constrain the clocks and other inputs so the synthesis
   tool knows what timing budget it has to meet.

3. You failed timing analysis and didn't notice - in other words the
   synthesis tool says the design it produced doesn't meet your supplied
   timing constraints, despite its best efforts. If the failure is small
   it may still work in some voltage/temperature/silicon situations, but
   it isn't guaranteed in all cases.

Normally the last one will raise big red flags in the tool, assuming the
timing analyser does get run as part of the build. However the first two
are easy to overlook and you get no warning from the tools.

Theo
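To make the third point concrete, here is a minimal Python sketch of the setup-slack arithmetic a timing analyser performs on a register-to-register path. It ignores clock skew and jitter, the function name is invented, and the delays are placeholder numbers rather than values from any real report.

```python
def setup_slack(period, clk_to_q, logic_delay, route_delay, setup):
    """Setup slack for one register-to-register path, all arguments in the
    same time unit. Negative slack means the path fails timing.
    Simplification: clock skew and jitter are ignored."""
    return period - (clk_to_q + logic_delay + route_delay + setup)


if __name__ == "__main__":
    # Placeholder delays in ns for a 10 ns clock (illustrative only).
    print(setup_slack(10.0, 0.5, 4.0, 3.0, 0.5))   # path meets timing
    print(setup_slack(10.0, 0.5, 4.0, 5.5, 0.5))   # longer route, path fails
```

Only `route_delay` differs between the two calls, which is the term that moves when place and route is rerun. A path with slightly negative slack can still work on a given board at a given temperature, which fits the "small failure" case above.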
rickman wrote:


> I didn't think it was in the VHDL because it had been simulated well and
> the nature of the bug is an occasional dropped character on the receive
> side.  Who can't design a UART?  Well, it could be in the handshake with
> the state machine, but still...
>
Any time you recompile an FPGA and the problem disappears or changes, it
is a STRONG indication it is a timing problem. Regenerating the place &
route changes timings subtly between sections, and may eliminate a
marginal setup or hold time problem.

You should make sure all signals that cross clock boundaries are properly
synchronized, and that you are giving the right clock specification to
your clocks in the ucf file. If there are tricky timings on parts
connected to the FPGA, then you need to define the timings in the ucf file.

Jon