
Single Event Functional Interrupts (SEFI) in Virtex

Started by Praveen April 6, 2005
Hello all,

I am doing a literature survey on SEFIs in Xilinx FPGAs. Unfortunately,
there are not many papers on this. It might either be because this is
not a major issue or because there is not really much work done.

Did you encounter SEFIs in your design? If yes, what mitigation methods
did you use? I would appreciate any of your feedback.

Thank you.

Go to the TechXclusives paper at

http://support.xilinx.com/xlnx/xweb/xil_tx_display.jsp?sGlobalNavPick=&sSecondaryNavPick=&category=&iLanguageID=1&multPartNum=1&sTechX_ID=al_soft_vs_hard

Peter Alfke, Xilinx Applications

Praveen,

To what are you referring?

A single event error in the logic fabric (CLB SEE)?

A single event transient in the fabric (CLB SET)?

A single event upset of a memory cell (SEU config)?

A single event upset of a BRAM memory cell (SEU BRAM)?

A single event transient or single event error which affects the entire 
device (as in fooling the chip into thinking PROG was pulled low)?

We usually refer to the last event (loss of configuration, global 
reset, global tristate) as a SEFI (single event functional interrupt).

Not everyone uses this terminology, but we use it because it is 
descriptive of what happens (on very very very rare occasions!).

Our mil/aero/automotive customers also think this way, and we have 
statistics for the probability of any of the above happening, all the way 
from 1 million years for a SEE or SET in the fabric for the largest 
device, to the article that Peter pointed you to for config upsets (more 
than 1,000 years).

By the way, Virtex 4 has now improved upon the upset rates due to our 
design techniques and has shown a reduction to 60% (over Virtex II) of 
the previous FIT rates for the configuration memory.  This winds the 
clock back to the days when people didn't even think about SEUs.  Watch 
for the tech Xclusive on this subject (appearing soon).

If you want all the details, contact our mil/aero/automotive group FAEs 
who are trained in this, and have all the field tests, studies, etc. at 
their fingertips.  For us, there are no unknowns in this regard.  After 
all, we are used in airplanes, spacecraft (and automobiles) so these 
folks want to know exactly what the probabilities of failures are, and 
how to mitigate them (deal with them when they occur, or mask them so 
you never see a failure in the system).

Austin

I just realized,

So as not to confuse anyone, if VII is 1000 years MTBE (mean time 
between config upsets -- which is actually better than that based on 
more recent data, but we will just go with it for now), V4 is better, so 
it is ~ 1,667 years for the same number of config bits. (60% of 1667 
years is 1000 years).
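
The arithmetic above can be checked with a minimal sketch. It assumes only that mean time between upsets scales as the inverse of the upset (FIT) rate; the function name is made up for illustration.

```python
def mtbe_after_fit_reduction(baseline_mtbe_years, fit_ratio):
    """Mean time between upsets scales as 1/rate.

    fit_ratio is the new upset rate divided by the baseline rate,
    e.g. 0.60 for a part that shows 60% of the baseline FIT rate.
    """
    return baseline_mtbe_years / fit_ratio


v2_mtbe_years = 1000.0   # Virtex II figure used in the post
v4_mtbe_years = mtbe_after_fit_reduction(v2_mtbe_years, 0.60)

print(round(v4_mtbe_years))  # 1667
```

Multiplying the result back by 0.60 returns the 1000-year baseline, which is the "60% of 1667 years" check in the post.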

If you do nothing about upsets, 90nm is worse than 130nm, is worse than 
180nm, etc.

So, you have to do something to make it better.

We did.

(You're welcome,)

Austin

Hi Austin,

Thanks for the reply. First of all, I am using Virtex-II. I am
concerned about the SEUs occurring in controls of the device (leading to
SEFI-like behavior). I have obtained a document from the SEE consortium
that discusses the different SEFIs (like POR, SMAP and JTAG) and the
ways to mitigate them. I also found other presentations on the Xilinx
website that discuss the same thing.

I wanted to know if there are any other SEFI issues in Virtex-II and
the mitigating methods that can be used (and have been used).

Thank you,
-Praveen

Praveen,

You have the paper, and the modes discovered, and the work-arounds.

There is a lot of work on-going by the mil/aero community on radiation 
testing.  Perhaps your company would consider joining the radiation 
effects consortium that we sponsor (if you have need for this)?

Austin

Austin,

The methods discussed in the document mitigate the SEFI but will result
in the loss of data (because of reconfiguration). Could you suggest a
way (if there is one) of mitigating the SEFI without losing the
information?  We would like to try it out even if it is complicated.

Thanks for your input.

-Praveen

Praveen,

Well, the problem is that a SEFI might hit the line which controls the 
"clean-out" (zeroization of all config and BRAM).

If that happens, then basically, the upset has caused the device to 
re-initialize (or just go stupid).  There is no way to prevent this from 
happening.  We are researching how to design so this can not occur, but 
this is a very tough problem.  An event can strike any transistor.

uP, ASSP, and ASICs also have SEFI behavior.  And yes, theirs is 
incredibly rare as well.  FPGAs are similar in SEFI behavior to all the 
other devices.  Maybe better.  I haven't seen the SEFI x-section for a 
Pentium chip.

Yes, these SEFI cross sections are incredibly small, and the probability 
is also small that this happens, but presuming it does happen, there is 
really nothing at all that you can do (except detect that it happened, 
and reconfigure the device from scratch).

Systems that have to be hardened against SEFI will use a CPLD, or other 
device, to detect that a SEFI occurred, and reconfigure the FPGA.  In the 
time it takes to recognize the SEFI, and reconfigure, all data is lost 
(unless it is part of a redundant system, which is commonly used for 
critical applications).
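
As a rough illustration of the external-monitor idea, here is a minimal behavioral sketch in Python. The post does not say how the CPLD detects a SEFI; the heartbeat-timeout scheme, the class name, and the tick counts below are all assumptions made for the example, not a description of any real design.

```python
class SefiWatchdog:
    """Behavioral model of an external monitor (e.g. a CPLD).

    Assumption: the FPGA design toggles a heartbeat signal while healthy.
    If no heartbeat is seen for timeout_ticks monitor cycles, the monitor
    assumes a SEFI and requests a full reconfiguration.
    """

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_heartbeat = 0
        self.reconfig_count = 0

    def tick(self, heartbeat_seen):
        """Advance one monitor cycle; return True if reconfiguration fires."""
        if heartbeat_seen:
            self.ticks_since_heartbeat = 0
            return False
        self.ticks_since_heartbeat += 1
        if self.ticks_since_heartbeat >= self.timeout_ticks:
            # In hardware this is where PROG would be pulsed; all user
            # state in the FPGA is lost at this point.
            self.reconfig_count += 1
            self.ticks_since_heartbeat = 0
            return True
        return False


wd = SefiWatchdog(timeout_ticks=3)
heartbeats = [True, True, False, False, False, True]
events = [wd.tick(hb) for hb in heartbeats]
print(events)  # [False, False, False, False, True, False]
```

The timeout sets the trade-off described above: a shorter timeout shortens the outage, but the data held in the FPGA is lost either way unless a redundant unit carries it.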

In the order of increasing robustness:

- no measures taken (susceptible to SEU, and SEFI):  the vast majority 
of all FPGA applications fall into this category, as do ASIC, ASSP, and uP.

- use TMR on the user pattern to remove the effects of SEUs completely, 
still susceptible to SEFI:  a step that gets rid of SEU effects.  Also 
is used in some special ASICs for mil/aero.

- scrub the config memory (continually reload the config while 
operating): used by many space probes, still susceptible to SEU and 
SEFI, but recovers very quickly, and SEUs do not accumulate, leading to 
an overall availability improvement.  This is what the Mars Lander Pyro 
controller did.  The landers themselves just reconfigure once a day 
(enough to mitigate the effects they anticipated).

- scrub and use TMR:  now we only have SEFI to worry about.  The best 
choice for getting to the level of reliability where the only thing that 
can be of any trouble is a SEFI.  Good enough for just about anything 
except where human life is concerned.

- readback the config and fix the bits that flipped (use of V4 
FRAME_ECC):  similar to scrubbing, but faster and less hardware.  Same 
as above non-TMR scrubbing case.

- readback and fix config for a TMR design: only SEFIs to worry about. 
Good for just about anything excepting a human life.

- monitor the device with another device (eg CPLD) for SEFI, reconfigure 
if a SEFI occurs: used in critical space and avionics.  May also be 
doing TMR, scrubbing, etc. as well.  This is still not good enough for a 
human life situation unless the time to recover is fast enough not to 
matter.

- provide one other FPGA, dual redundant:  use of dual redundant allows 
for transfer away from a fault, used for even higher availability (each 
individual unit may be scrubbing, use TMR, etc.  There may also be a 
"voter" to switch between FPGAs in case of SEFI).  Almost the highest 
level of availability, in that we still don't trust even this 
arrangement for human life situations.  It may get used in military 
systems where the probability of death is much much higher than the 
probability of a systems failure, so added system availability is not 
needed (a real tough decision, one I gladly don't have to make).

- fully duplicated, dual redundant:  used by things like commercial 
airliners, and airports.  Two redundant systems that can be selected 
manually by the air traffic controllers or pilots in the unlikely event 
that one of the redundant systems fails.  Within each redundant system, 
various levels of protection may, or may not be necessary, since the 
entire system is duplicated.  A system with no scrubbing of the FPGAs, 
but with many self-checks that are done independent of the FPGA is used 
in fact in all US and Canadian Airports for all communications between 
the ground and air, and ground and ground.  I designed it.  If one 
redundant unit either detects a failure in itself, or the redundant unit 
it is paired with detects that its partner has gone stupid, it switches 
itself in, and the other out in less than 50 ms.  If the air traffic 
controllers can't talk to the airplanes for some reason, they have a 
manual switch they push to transfer everything over to another set of 
com links, radios, antennas, etc.
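
Two of the building blocks in the list above, TMR voting and readback-and-repair of the configuration, can be sketched in software. This is only a toy model in Python: real TMR is built from triplicated logic and voters in the fabric, and real readback goes through the configuration port. The function names and the list-of-integers "frames" are assumptions for illustration.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority, the voter used in TMR.

    Any single corrupted replica is outvoted bit by bit.
    """
    return (a & b) | (a & c) | (b & c)


def readback_and_repair(live_frames, golden_frames):
    """Compare each live config frame to a golden copy and rewrite
    only the frames that differ. Returns the number of frames repaired."""
    repaired = 0
    for i, golden in enumerate(golden_frames):
        if live_frames[i] != golden:
            live_frames[i] = golden
            repaired += 1
    return repaired


# TMR: one replica takes a single-bit upset, the vote masks it.
good = 0b1011
upset = good ^ 0b0010
print(bin(majority_vote(good, upset, good)))  # 0b1011

# Readback: one frame has a flipped bit, the repair pass restores it.
golden = [0xA5, 0x3C, 0xFF]
live = list(golden)
live[1] ^= 0x04
print(readback_and_repair(live, golden), live == golden)  # 1 True
```

Neither function helps with a SEFI: if the device itself has been reset or cleared, there is nothing left to vote on or read back, which is why the later entries in the list add an external monitor or a second unit.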

Austin

> I am doing a literature survey on SEFIs in Xilinx FPGAs. Unfortunately,
> there are not many papers on this. It might either be because this is
> not a major issue or because there is not really much work done.

You might find the following interesting/relevant (it's just something I
came across while Googling):

http://klabs.org/mapld04/presentations/session_c/4_c144_swift_s.ppt

Kris

Kris,

Nice ppt presentation.  They quote 65 years in orbit around the earth 
mean time between SEFI.

We have published results of heavy ion testing that state we are at 
(least) 1.5E-6 SEFI/day in earth orbit, which is 1,800 years between 
SEFI.  Not sure where the discrepancy comes from.

It may be that work was done on commercial parts, instead of using the 
Qpro series (which has EPI wafers).  If you are going to go into space, 
you are better off using the Qpro devices.

http://direct.xilinx.com/bvdocs/publications/ds124.pdf
Page 3, Table 3.

Not sure that EPI alone would have more than a factor of 2 improvement 
in upset rate (SEU or SEFI).

Could be that the authors of the ppt also divided by 24 (thinking our 
specification was in hours).  That yields a number closer (76 years--but 
wrong nonetheless).

Sea level sees ~ 40 times fewer upsets, so a SEFI at sea level is ~ 7,300 
years.

We have some customers with more than 250,000 Virtex IIs in the field 
(monitored), and that would mean they would have ~ 35 SEFIs a year. 
Since they have had far fewer (in fact:  none reported), one has to take 
even this projection as overly conservative for us earthlings on the ground.
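
The conversions behind these numbers can be sketched as follows. The 1.5E-6 SEFI/day rate, the 7,300-year sea-level figure, and the 250,000-device fleet are taken from the post as given; the 365.25-day year and the helper name are assumptions for the example.

```python
DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def years_between_events(rate_per_day):
    """Mean time between events, in years, for a constant daily rate."""
    return 1.0 / rate_per_day / DAYS_PER_YEAR


# Orbit: 1.5E-6 SEFI per device-day.
orbit_years = years_between_events(1.5e-6)
print(round(orbit_years))        # about 1,825 (the post rounds to 1,800)

# If the per-day rate were misread as a per-hour rate:
misread_years = orbit_years / 24
print(round(misread_years))      # about 76

# Fleet projection, using the post's sea-level figure as given.
sea_level_years = 7300.0
fleet_size = 250_000
expected_per_year = fleet_size / sea_level_years
print(round(expected_per_year))  # about 34, i.e. the "~ 35" in the post
```

The fleet line is just N devices times the per-device rate (1 / mean time between events), which is why a large monitored population is a useful check on a very small per-device rate.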

Also, space-based projections of failures use heavy ions, and earth-based 
projections of failures use protons and neutrons.  There is a factor of 
(at least) 1e5 to 1e6 there in terms of the size of the "bullet!"

For example, the cross section for a Virtex II memory cell is ~2.283E-14 
for neutrons, and is ~8E-8 for a heavy ion.  These are from recent tests 
with neutrons and with heavy ions (Xilinx Radiation Effects Consortium).

Sort of like a grain of sand vs. a locomotive engine.

This is a good analogy:  if you are hit by a train, what do you do?  If 
you are hit by a grain of sand, what do you do?
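
For a sense of scale, the ratio of the two cross sections quoted above can be computed directly. The post gives no units, so this sketch assumes both figures are in the same units; the variable names are made up for the example.

```python
import math

# Virtex II memory cell cross sections as quoted in the post
# (units not stated there; assumed identical for both values).
neutron_xsec = 2.283e-14
heavy_ion_xsec = 8e-8

ratio = heavy_ion_xsec / neutron_xsec
print(f"{ratio:.2e}")                # roughly 3.5e+06
print(round(math.log10(ratio), 1))   # about 6.5 orders of magnitude
```

That is consistent with the "(at least) 1e5 to 1e6" difference in the size of the "bullet".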

Compare our Xilinx Virtex II FPGA with a popular uP: (for SEFI)

http://www.spacemicro.com/services_files/SM_NSREC_Paper_2004.pdf

with up to a few SEFI per day (worst case), to one SEFI a year (best case).

So the next time you see the "blue screen of death" on your laptop 
computer, was that a SEFI, or was it Microsquat?

Austin