FPGARelated.com

FPGA partial/catastrophic failure mode question

Started by Neil Steiner December 18, 2008
Is it true that when FPGAs fail, they typically experience sudden and 
complete failure, rather than gradual localized degradation?

The question stems from comments that I received from a reviewer, 
concerning the ability to arbitrarily relocate circuitry inside an FPGA. 
  This reviewer had heard it said that sudden and complete failure was 
the norm, in which case the ability to move circuitry elsewhere in the 
same device would be pointless.

I would appreciate comments from anybody with actual experience in the
matter.  I am interested in your thoughts both with respect to normal
aging, and to damage from radiation or other effects.

Neil
Do you mean partial reconfiguration? If those words are foreign to
you, look it up on the Xilinx or Altera websites. I'm unfamiliar with
the procedure in detail, but understand that it's commonly done. The
key point is to lock down all of the configuration that is/isn't
changing (presumably to specific area groups); then there exists a
mechanism for changing the remaining circuitry.

Chris
On Dec 18, 9:34 pm, Neil Steiner <neil.stei...@east.isi.edu> wrote:
> Is it true that when FPGAs fail, they typically experience sudden and
> complete failure, rather than gradual localized degradation?
>
> The question stems from comments that I received from a reviewer,
> concerning the ability to arbitrarily relocate circuitry inside an FPGA.
> This reviewer had heard it said that sudden and complete failure was
> the norm, in which case the ability to move circuitry elsewhere in the
> same device would be pointless.
>
> I would appreciate comments from anybody with actual experience in the
> matter. I am interested in your thoughts both with respect to normal
> aging, and to damage from radiation or other effects.
>
> Neil

Our experience has been that there are extremely low rates of field failure in the FPGAs that we use (mostly Xilinx, some Lattice and a few Altera). The types of failure that are typically localized are usually upset events rather than silicon degradation. However, I must emphasize that our company does not deliver products into aerospace applications where they are exposed to large amounts of radiation. When a part fails, we generally replace it without looking to see if the failure was localized. Our view of the symptoms is that it doesn't work with our standard pattern, so it's broken.

Where I have seen localized degradation (partial burn-out) it has been on I/O drivers. This generally leads to an unusable device without some rewiring at the board level.

There are clearly applications for your idea at the manufacturing defect level, however. For example, Xilinx uses parts that are only tested to work with a particular pattern for their volume discount ASIC-replacement program. In addition to the reduced test time and therefore cost, this theoretically improves yields. Obviously the ability to use parts with manufacturing defects for general use would be a big plus to Xilinx, especially on the high-end parts that tend to have lower yields (check out the price tags on the XC2V8000 if you want to see what low yield does to cost). If the defects could be mapped reliably, you may have usable parts with an effectively slightly smaller fabric size at a fraction of the price of the "perfect" silicon.

In order for this sort of application to get to volume use, however, you would need to apply the relocation at the tail end of the build process. It is unlikely that large-scale users of these devices will want to run place&route for every chip that goes out the door. Small-scale users like ASIC-simulation, where the bitstream is generally only used once, would benefit from this.

Regards,
Gabor

Neil Steiner <neil.steiner@east.isi.edu> wrote in news:494B083B.8090902@east.isi.edu:

> Is it true that when FPGAs fail, they typically experience sudden and
> complete failure, rather than gradual localized degradation?
>
> The question stems from comments that I received from a reviewer,
> concerning the ability to arbitrarily relocate circuitry inside an FPGA.
> This reviewer had heard it said that sudden and complete failure was
> the norm, in which case the ability to move circuitry elsewhere in the
> same device would be pointless.
>
> I would appreciate comments from anybody with actual experience in the
> matter. I am interested in your thoughts both with respect to normal
> aging, and to damage from radiation or other effects.
>
> Neil

In the one genuinely faulty part that I've seen, it was a very localised failure in the middle of the fabric, and rerunning the P&R tools with some trivial code change (which makes it use different resources) could mask the fault.

This was repeatable, and definitely due to a bad spot on the die.

Oh, it was an engineering sample. I assume the fault was due to inadequate testing at the factory rather than some field failure.

I've also seen failures on early production parts or engineering samples due to preliminary speed files having overly optimistic timing. This causes problems when the P&R tools use a slower part of the die. In that case, moving to a different part of the die could improve matters. But the real fix is to wait for the supplier to finalise the speed files.

Perhaps your reviewer was thinking about the sort of failures associated with exceeding the absolute maximum ratings of the part.

Regards,
Allan

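
As a rough illustration of masking a localized fault by using different resources, here is a minimal sketch that searches a tile grid for a block location that avoids known-bad tiles. It is not how the vendor P&R tools work. The grid model, the `Tile` type, and the function name `find_clean_origin` are all hypothetical, chosen only for this example.

```python
from typing import Optional, Set, Tuple

Tile = Tuple[int, int]  # (column, row) on a hypothetical tile grid


def find_clean_origin(cols: int, rows: int, width: int, height: int,
                      bad_tiles: Set[Tile]) -> Optional[Tile]:
    """Return the first origin where a width x height block avoids every bad tile."""
    for row in range(rows - height + 1):
        for col in range(cols - width + 1):
            footprint = {(col + dc, row + dr)
                         for dc in range(width) for dr in range(height)}
            if footprint.isdisjoint(bad_tiles):
                return (col, row)
    return None  # no defect-free location of this size exists


# An 8x8 grid with a single bad spot near the middle of the fabric.
origin = find_clean_origin(8, 8, 4, 4, bad_tiles={(3, 3)})
```

A real flow would also have to reroute the connections to the moved block and recheck timing, which this sketch ignores.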
> Do you mean partial reconfiguration? If those words are foreign to
> you, look it up on the Xilinx or Altera websites. I'm unfamiliar with
> the procedure in detail, but understand that it's commonly done. The
> key point is to lock down all of the configuration that is/isn't
> changing (presumably to specific area groups); then there exists a
> mechanism for changing the remaining circuitry.

My defect avoidance is indeed through active partial reconfiguration, although I work at a much finer granularity than the slots that people typically use for PR. As for Altera, their products do not yet support partial reconfiguration, so while defect avoidance should still be feasible (if one had knowledge of the bitstream format), it would require a full reconfiguration of the device.
Thanks for the reply Gabor.

> Where I have seen localized degradation (partial burn-out) it
> has been on I/O drivers. This generally leads to an unusable
> device without some rewiring at the board level.

Very interesting. As you say, this would require changes outside of the device itself, which could be a problem in aerospace applications. I wonder though if the driver degradation could be reduced by over-designing the board. In other words, if I knew that a system would be difficult to access, but I wanted it to keep running as long as possible, it sounds like I might be well served by conservative (defensive?) I/O design rules.
> There are clearly applications for your idea at the manufacturing
> defect level, however. For example Xilinx uses parts that are
> only tested to work with a particular pattern for their volume
> discount ASIC-replacement program. In addition to the reduced
> test time and therefore cost, this theoretically improves yields.

I believe the test time is the most commonly cited reason for EasyPath. I'm sure Xilinx could provide some very interesting failure mode details here, but it seems a little, you know, tacky to ask them.
> Obviously the ability to use parts with manufacturing defects
> for general use would be a big plus to Xilinx, especially on the
> high-end parts that tend to have lower yields (check out the
> price tags on the XC2V8000 if you want to see what low yield
> does to cost). If the defects could be mapped reliably you
> may have usable parts with an effectively slightly smaller fabric
> size at a fraction of the price of the "perfect" silicon.

And I'm delighted to hear somebody else echoing an argument that I've made in published work. I would happily have taken "mostly good" XC2V10000 or XC2VP125 devices, but now I digress.
> In order for this sort of application to get to volume use,
> however you would need to apply the relocation at the tail
> end of the build process. It is unlikely that large-scale
> users of these devices will want to run place&route for
> every chip that goes out the door. Small-scale users like
> ASIC-simulation where the bitstream is generally only used
> once would benefit from this.

You are right, of course, that this would be inconvenient in the context of current manufacturing. I'm thinking of it in a different context though, where a device or system manages its own configuration and performs its own place and route, something that I've demonstrated for V2P. Admittedly, there needs to be a minimum of known good logic for the base design, but perhaps that's where EasyPath comes in. For mainstream systems we're certainly not there yet, but I suspect it may come to that in 10 or 15 years, depending on the yield with upcoming technologies.
> In the one genuinely faulty part that I've seen, it was a very localised
> failure in the middle of the fabric, and rerunning the P&R tools with
> some trivial code change (which makes it use different resources) could
> mask the fault.
>
> This was repeatable, and definitely due to a bad spot on the die.
>
> Oh, it was an engineering sample. I assume the fault was due to
> inadequate testing at the factory rather than some field failure.

Fascinating. I had been wondering whether a failure like that would defeat the power rails or the configuration shift registers, but apparently not necessarily so.
> Perhaps your reviewer was thinking about the sort of failures associated
> with exceeding the absolute maximum ratings of the part.

The context under consideration was an aerospace system able to perform its own placement and routing, and therefore able to work around damage that might be sustained during its lifetime. I've demonstrated a system that can do its own placement, routing, and partial reconfiguration while continuing to run, but since its back-end implementation tools are hosted on the FPGA, the defect avoidance capability is useless unless the FPGA remains mostly functional. That was the point that the reviewer was making, and the question implicit in my post.
Neil Steiner <neil.steiner@east.isi.edu> wrote:
 
> The context under consideration was an aerospace system able to perform
> its own placement and routing, and therefore able to work around damage
> that might be sustained during its lifetime.
>
> I've demonstrated a system that can do its own placement, routing, and
> partial reconfiguration while continuing to run, but since its back-end
> implementation tools are hosted on the FPGA, the defect avoidance
> capability is useless unless the FPGA remains mostly functional. That
> was the point that the reviewer was making, and the question implicit in
> my post.

I once went to a talk by someone running Linux on a PPC in a Xilinx chip, and then doing partial reconfiguration from that running Linux system. You do have to be careful not to configure yourself out, though. Also, no protection against failure modes including the PPC and its connection to the configuration lines.

-- glen

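
The "don't configure yourself out" concern can be pictured as a guard that rejects any update touching the logic that performs the updates. This is only a sketch under assumed names: the frame numbering, `PROTECTED`, `check_update`, and `SelfLockoutError` are hypothetical and do not correspond to any vendor API.

```python
from typing import Iterable, Set


class SelfLockoutError(Exception):
    """Raised when an update would overwrite the logic that performs updates."""


def check_update(target_frames: Iterable[int], protected_frames: Set[int]) -> None:
    """Reject a partial reconfiguration that touches any protected frame."""
    overlap = sorted(protected_frames.intersection(target_frames))
    if overlap:
        raise SelfLockoutError(f"update touches protected frames: {overlap}")


# Hypothetical frame numbers: 0..9 hold the processor and its configuration path.
PROTECTED = set(range(10))
check_update(range(20, 30), PROTECTED)  # accepted: no overlap with protected frames
```

As noted above, a check like this does nothing for faults in the protected region itself.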
> I once went to a talk by someone running Linux on a PPC in a Xilinx
> chip, and then doing partial reconfiguration from that running
> Linux system. You do have to be careful not to configure yourself
> out, though. Also, no protection against failure modes including
> the PPC and its connection to the configuration lines.

Thank you for stating my mantra! ;)

The key to doing this successfully is to give the system a dynamic model of itself that stays in sync with the changes that it undergoes. That not only tells it what wires and logic are or are not in use, but also allows it to avoid clobbering existing wires or logic or injected defects.

With that foundation in place, I have demonstrated the ability to implement or remove EDIF circuits at will, and arbitrarily add, extend, trim, or remove connections between those circuits and/or the base system, without requiring the slot model mandated by the PR flow. It turns out that partial active reconfiguration works really well if one is careful to avoid the kinds of things you allude to.

But returning to the original point of the post, if FPGA failures are typically sudden and catastrophic, then my ability to avoid masked defects is not particularly useful.

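
One way to picture a self-model that tracks resource usage and refuses to clobber used or defective resources is sketched below. It is an illustrative assumption, not the system described in this post: the class `FabricModel`, its methods, and the resource names are all hypothetical.

```python
from typing import Dict, Iterable, Set


class ResourceConflict(Exception):
    """Raised when a request would clobber used or defective resources."""


class FabricModel:
    """Tracks which resources (wires or logic sites) are in use and by whom."""

    def __init__(self) -> None:
        self.owner_of: Dict[str, str] = {}  # resource name -> circuit name
        self.defects: Set[str] = set()      # resources known or assumed bad

    def mark_defect(self, resource: str) -> None:
        self.defects.add(resource)

    def claim(self, circuit: str, resources: Iterable[str]) -> None:
        """Reserve resources for a circuit, all or nothing."""
        wanted = set(resources)
        blocked = {r for r in wanted if r in self.defects or r in self.owner_of}
        if blocked:
            raise ResourceConflict(f"unavailable: {sorted(blocked)}")
        for r in wanted:
            self.owner_of[r] = circuit

    def release(self, circuit: str) -> None:
        """Free everything a circuit holds, e.g. when it is removed."""
        self.owner_of = {r: c for r, c in self.owner_of.items() if c != circuit}


model = FabricModel()
model.mark_defect("wire_17")                # an injected or discovered defect
model.claim("base", ["slice_0", "wire_3"])  # the base system's resources
```

Keeping `claim` all-or-nothing means a rejected request leaves the model unchanged, so it stays in sync with what is actually configured.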
Neil Steiner <neil.steiner@east.isi.edu> wrote:
(snip)
 
> The key to doing this successfully is to give the system a dynamic model
> of itself that stays in sync with the changes that it undergoes. That
> not only tells it what wires and logic are or are not in use, but also
> allows it to avoid clobbering existing wires or logic or injected defects.
>
> With that foundation in place, I have demonstrated the ability to
> implement or remove EDIF circuits at will, and arbitrarily add, extend,
> trim, or remove connections between those circuits and/or the base
> system, without requiring the slot model mandated by the PR flow. It
> turns out that partial active reconfiguration works really well if one
> is careful to avoid the kinds of things you allude to.

Reminds me of:
http://en.wikipedia.org/wiki/Core_War

-- glen