FPGARelated.com
Forums

Real examples of metastability causing bugs

Started by Eli Bendersky January 8, 2008
Hello,

Suppose that I'm sampling an asynchronous signal with a FF, without
using any synchronizers before it. This FF will become metastable from
time to time with a MTBF depending on the device's parameters, the
clock rate and the input signal change rate.

Can you please suggest *real life* examples of how this can make me
fail in a real design, that is, where the time of recovery for the
metastable event is indeed 0. Here are two off the top of my head:

1) The output of this FF can be used directly as the output of the
device, causing an intermediate value on the output for some time,
which can harm other devices.

2) If such an input is sampled by two different FFs for different
purposes, they may end up with different results.

Thanks in advance,
Eli
On Jan 8, 6:20 am, Eli Bendersky <eli...@gmail.com> wrote:
> Hello,
>
> Suppose that I'm sampling an asynchronous signal with a FF, without
> using any synchronizers before it. This FF will become metastable from
> time to time with a MTBF depending on the device's parameters, the
> clock rate and the input signal change rate.

Your item #2 describes the most common problem, exacerbated by
excessive routing delay differences.

Peter Alfke
On Jan 8, 6:38 pm, Peter Alfke <pe...@xilinx.com> wrote:
> On Jan 8, 6:20 am, Eli Bendersky <eli...@gmail.com> wrote:
> > Hello,
> >
> > Suppose that I'm sampling an asynchronous signal with a FF, without
> > using any synchronizers before it. This FF will become metastable from
> > time to time with a MTBF depending on the device's parameters, the
> > clock rate and the input signal change rate.
>
> Your item #2 describes the most common problem, exacerbated by
> excessive routing delay differences.
> Peter Alfke

Hi Peter, thanks for answering.
Could you provide a piece of VHDL/Verilog code that is realistic and
has this problem?
"Eli Bendersky" <eliben@gmail.com> wrote in message 
news:fa77eae8-4d70-412e-9a85-738b04c50647@1g2000hsl.googlegroups.com...
> Hi Peter, thanks for answering.
> Could you provide a piece of VHDL/Verilog code that is realistic and
> has this problem ?

Hi Eli,

    process(clock)
    begin
        if rising_edge(clock) then
            if bad_input = '1' then
                count <= (count + 2) mod 8;
            else
                count <= (count + 1) mod 8;
            end if;
        end if;
    end process;

It's possible for count to increment by 3 if the bad_input gets to bit 1
of count before it gets to bit 0.

HTH., Syms.
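A common way to avoid this failure is to register bad_input before the counter logic uses it, so that every bit of count decides from the same synchronous value. The following is a minimal sketch, not code from this thread: the entity name sync_counter, the port count_out and the signals bad_input_meta and bad_input_sync are made up for illustration, and count is assumed to be 3 bits wide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sync_counter is  -- hypothetical name
    port (
        clock     : in  std_logic;
        bad_input : in  std_logic;  -- asynchronous to clock
        count_out : out std_logic_vector(2 downto 0)
    );
end entity sync_counter;

architecture rtl of sync_counter is
    signal bad_input_meta : std_logic := '0';  -- first stage, may go metastable
    signal bad_input_sync : std_logic := '0';  -- second stage, the only copy the logic uses
    signal count          : unsigned(2 downto 0) := (others => '0');
begin
    process(clock)
    begin
        if rising_edge(clock) then
            -- Two-stage synchronizer: exactly one flip-flop samples bad_input.
            bad_input_meta <= bad_input;
            bad_input_sync <= bad_input_meta;

            -- Every bit of count now sees the same registered value.
            if bad_input_sync = '1' then
                count <= count + 2;  -- 3-bit unsigned wraps modulo 8
            else
                count <= count + 1;
            end if;
        end if;
    end process;

    count_out <= std_logic_vector(count);
end architecture rtl;
```

The price is two clock cycles of latency on bad_input. Keeping the two synchronizer flip-flops close together usually needs a vendor-specific attribute or constraint, which is left out here.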
> ...Here are two off the top of my head:
>
> 1...
> 2)...

Aren't those two reasons enough for avoiding it?
Or are we just doing your homework?

Mike
On Jan 8, 6:20 am, Eli Bendersky <eli...@gmail.com> wrote:
> Hello,
>
> Suppose that I'm sampling an asynchronous signal with a FF, without
> using any synchronizers before it. This FF will become metastable from
> time to time with a MTBF depending on the device's parameters, the
> clock rate and the input signal change rate.

Eli,
Look at XAPP094 (you can easily google it).
It shows the circuit I have used to quantify metastable delay.
The delay is short, so you have to be quick to catch it...

Peter Alfke
Symon wrote:
(snip)

> process(clock)
> begin
>     if rising_edge(clock) then
>         if bad_input = '1' then
>             count <= (count + 2) mod 8;
>         else
>             count <= (count + 1) mod 8;
>         end if;
>     end if;
> end process;

> It's possible for count to increment by 3 if the bad_input
> gets to bit 1 of count before it gets to bit 0.

I would have called this an ordinary setup/hold violation.

If the problem is due to timing of bad_input, propagated
through the MUX that I presume it generates, then it should
be setup/hold violation.

Metastability should occur due to clock rate issues, through
the appropriate propagation delay, but independent of bad_input,
and only if bad_input does satisfy setup/hold.

I would say that the usual cause of option 2 in the previous
post is also setup/hold violation.

Note that this system can fail even with perfect FFs due to
different propagation delays.

-- glen
On Jan 8, 3:49 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
I agree, #2 is independent of metastability; it is a parallel
synchronizer, which is a bad thing. If the propagation skew is more
than setup+hold to all of the destination registers, it could meet
setup and hold on all of them (avoiding metastability), while still
failing functionally (incrementing by 3).

Andy
On Jan 8, 3:49 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
Example #1

    process (event, out2) is
    begin
        if out2 = '1' then
            out1 <= '0';
        elsif rising_edge(event) then
            out1 <= '1';
        end if;
    end process;

    process (clk2) is
    begin
        if rising_edge(clk2) then
            out2 <= out1;
        end if;
    end process;

A long, long time ago, I once had this problem on a board (74f74 dual
flops), where another circuit running on clk2 was also looking at
out2. The second flop (out2) would go metastable true just long enough
to reset out1, and then settle false, so the other circuit would not
see it on the next clk2. Since out1 had been reset, it was not there
the next clk2, and was lost.

Note these are very poor design practices for FPGAs (use of async
reset for anything but initialization), but were very common on board
level designs (when done properly).

Andy
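One possible way to harden this circuit is to add a second clk2 flip-flop and take both the asynchronous reset and the downstream flag from that second stage. This is a sketch under assumptions, not what was done on the board described above. A marginal first stage can then no longer clear out1 before the event has been seen. The entity name event_catcher, the port event_in and the signals out1_meta and out2_i are illustrative and do not come from the post.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity event_catcher is  -- hypothetical name
    port (
        event_in : in  std_logic;  -- asynchronous event, rising edge is captured
        clk2     : in  std_logic;
        out2     : out std_logic   -- event flag, synchronous to clk2
    );
end entity event_catcher;

architecture rtl of event_catcher is
    signal out1      : std_logic := '0';  -- set by event_in, cleared once seen
    signal out1_meta : std_logic := '0';  -- first clk2 stage, may go metastable
    signal out2_i    : std_logic := '0';  -- second clk2 stage
begin
    -- Capture flop: same structure as the original, but the reset now
    -- comes from the second synchronizer stage instead of the first.
    process (event_in, out2_i) is
    begin
        if out2_i = '1' then
            out1 <= '0';
        elsif rising_edge(event_in) then
            out1 <= '1';
        end if;
    end process;

    process (clk2) is
    begin
        if rising_edge(clk2) then
            out1_meta <= out1;       -- only this flop samples the async signal
            out2_i    <= out1_meta;  -- has had one clk2 period to settle
        end if;
    end process;

    out2 <= out2_i;
end architecture rtl;
```

If out1_meta resolves to '0' after a marginal sample, out1 is still set and is simply captured one clk2 edge later. The trade-off is that out2 is high for two clk2 periods per captured event rather than one, and events arriving while it is high are ignored. The second stage can still fail if the first has not resolved within one clk2 period, only much less often.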
Eli Bendersky wrote:

> 1) The output of this FF can be used directly as the output of the
> device, causing an intermediate value on the output for some time,
> which can harm other devices.

This FF might be used as an input synchronizer intended to eliminate logic races. Setup and hold violations are to be expected for a synchronizer and in almost all cases synchronization succeeds anyway. But maybe once a year, the bowling ball stops on the speed bump and synchronization fails and the synchronizer causes a logic race. The race may or may not cause a bad state transition. A bad transition may or may not cause an observable error. I might be able to improve my odds to say, one synchronization failure in 100 years by using a two stage synchronizer, but I can't eliminate the possibility.
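
The "once a year" versus "once in 100 years" trade-off matches the commonly quoted synchronizer MTBF model. As a sketch, using symbols that do not appear in this thread: t_r is the time the flip-flop is given to resolve, tau and T_0 are device-dependent constants, f_clk is the sampling clock frequency and f_data is the toggle rate of the asynchronous input.

```latex
\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_0 \, f_{clk} \, f_{data}}
```

Because t_r sits in the exponent, a second stage that adds roughly one clock period of resolution time multiplies the MTBF by about e^(T_clk/tau), where T_clk = 1/f_clk. That can push the failure rate very low, but it never reaches zero.
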
> 2) If such an input is sampled by two different FFs for different
> purposes, they may end up with different results.

This is the case of the *missing* synchronizer. This is often confused with metastability, but it is really a design error. I don't have to wait nearly as long to observe an error in this case.

-- Mike Treseler