Being over-careful is good practice, especially when presenting to management, but the facts should not be distorted for its sake.
What I mean is: if the FPGA's low-level functionality (registers, LUTs, blocks, and timing that has closed successfully) doesn't work as expected on each and every clock edge, then we either ought to give up or add mitigation logic of some type.
To my understanding, mitigation is the domain of safety-critical applications or aerospace, and it requires small designs with plenty of work to detect or correct events such as SEUs.
But I notice many FPGA engineers carry some remnants of the mitigation mindset everywhere, and it is this that I hate...
The most famous example: a state machine enters an unreachable state, and then what?
Is it enough just to add "when others"? And does offering various state encodings imply that FPGA registers could go wrong?
I believe we should then reset the state, reset all associated signals, reset the input module, reset the output module and, well, reset the whole system, and not worry about the design switching off and on as a result. Otherwise a partial mitigation is pointless.
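To make the "partial mitigation" point concrete, here is a rough behavioural sketch in Python (a model, not HDL). The three states, the one-hot encoding and the `busy` flag are all made up for illustration. It only shows that a "when others" style fallback recovers the state register while anything associated with it keeps its stale value, unless that is reset too.

```python
# Behavioural model only; states, encoding and the `busy` flag are hypothetical.
IDLE, RUN, DONE = 0b001, 0b010, 0b100   # one-hot encoding
LEGAL = {IDLE, RUN, DONE}

def step(state, busy, start, finished):
    """One clock edge: return (next_state, next_busy)."""
    if state == IDLE:
        return (RUN, True) if start else (IDLE, busy)
    if state == RUN:
        return (DONE, busy) if finished else (RUN, busy)
    if state == DONE:
        return (IDLE, False)
    # Equivalent of "when others": only the state register is recovered.
    return (IDLE, busy)

def step_full_reset(state, busy, start, finished):
    """Same machine, but an illegal state also resets the associated flag."""
    if state not in LEGAL:
        return (IDLE, False)
    return step(state, busy, start, finished)

# Upset: RUN (0b010) with busy set picks up an extra bit and becomes 0b011.
corrupted = RUN | IDLE
partial = step(corrupted, True, start=False, finished=False)
full = step_full_reset(corrupted, True, start=False, finished=False)
print(partial)  # (1, True): state is back in IDLE, busy is left stale
print(full)     # (1, False)
```

In a real design the "associated signals" would extend to other modules as well, which is exactly where the cost of doing it properly comes from.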
Another example: "if count = n ...", and somebody reviewing the code says "no sir, add if count > n as well to make it more solid, just in case".
My answer is that my counter is meant never to exceed n, so I should not bother about the unreachable state of > n. I do not want to mask any potential bugs...
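A small Python model of the two comparisons, just to show what "masking" looks like. The 4-bit width, the terminal count of 9 and the out-of-range value 12 are assumptions picked for the example. With `>=` an out-of-range value disappears after one tick; with `==` the counter runs on to the natural wrap-around, so the excursion stays visible in simulation or on a scope.

```python
WIDTH = 4                      # hypothetical counter width in bits
MASK = (1 << WIDTH) - 1

def tick_eq(count, n):
    """Wrap only on an exact match, as in 'if count = n'."""
    return 0 if count == n else (count + 1) & MASK

def tick_ge(count, n):
    """Wrap on match or overshoot, as in 'if count >= n'."""
    return 0 if count >= n else (count + 1) & MASK

def ticks_until_zero(tick, count, n):
    """Number of clock ticks before the counter is back at zero."""
    ticks = 0
    while True:
        count = tick(count, n)
        ticks += 1
        if count == 0:
            return ticks

n = 9
bad = 12   # a value the counter should never hold
print(ticks_until_zero(tick_ge, bad, n))  # 1: excursion hidden immediately
print(ticks_until_zero(tick_eq, bad, n))  # 4: runs 13, 14, 15, then wraps
```

Neither version says why the counter got to 12 in the first place; the `>=` form just makes it harder to notice that it did.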
There might be tens or hundreds of thousands of registers in a design, so why make some of them more solid and forget the others?
So in short, we either mitigate faults properly when the application is critical, or assume the FPGA will work and should be trusted, instead of applying half-baked solutions to non-critical cases.
Here we need to differentiate between the FPGA's primary logic per se and the user-made functionality built out of that logic.
FPGA primary logic should be trusted, but a user design may not be. Flaky designs are common and mitigation may be considered here, but it can also mask bugs or patch over them.
For example there might be cases when some logic is locking to some signal and loss of lock may need to be mitigated.
Another example: some designs decide I/Q pairing from a serial I-Q-I-Q stream based on just one single initial check, but I would prefer to have self-correcting logic if I am not sure about my simulation coverage or not confident about the module that generates the I-Q-I-Q stream, designed by my next-door neighbour.
This sounds like a design, unfairly, needs to check and correct its inputs, which is the responsibility of the input module.
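Here is a rough Python model of the difference. Note the big assumption: I have invented a `sync` marker that accompanies some of the I samples (say, once per frame), because some reference is needed to decide the pairing at all; a real stream may offer something different. The one-shot version aligns on the first marker only, the self-correcting version realigns on every marker.

```python
def pair_iq(stream, self_correcting):
    """stream: iterable of (value, sync); sync=True marks an I sample.
    Returns the list of (i, q) pairs the receiver believes it saw."""
    pairs = []
    expect_i = None          # unknown until the first marker
    held_i = None
    for value, sync in stream:
        if sync and (expect_i is None or self_correcting):
            expect_i = True  # (re)align on the marker
        if expect_i is None:
            continue         # not aligned yet, discard
        if expect_i:
            held_i = value
        else:
            pairs.append((held_i, value))
        expect_i = not expect_i
    return pairs

# Upstream drops Q1, so the stream slips by one sample after I1.
stream = [("I0", True), ("Q0", False), ("I1", False),
          ("I2", True), ("Q2", False), ("I3", False), ("Q3", False)]

print(pair_iq(stream, self_correcting=False))
# [('I0', 'Q0'), ('I1', 'I2'), ('Q2', 'I3')]
print(pair_iq(stream, self_correcting=True))
# [('I0', 'Q0'), ('I2', 'Q2'), ('I3', 'Q3')]
```

After the dropped sample the one-shot version stays swapped indefinitely, while the self-correcting one loses a single pair and recovers at the next marker.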
Any comments appreciated.
This is an area I can help you with, having designed a few FPGAs for space and other SIL 4 applications, along with many more commercial applications.
They do work, right up until the point they do not, and you need to ensure that there are no issues for your application and environment. It is not just SEU/SEE that needs to be considered but a whole range of environmental conditions and applications.
You generally cannot just rely upon the "when others" of the state machine without additional constraints to stop it being optimised out, as there is generally no entry path to it.
I have seen vibration on bond wires within an FPGA cause a state machine to malfunction, and it is costly to discover this at a later point in time. Hence why many people carry over good design practices. The key, of course, is knowing the difference between the approaches required: simple mitigations, e.g. ECC on memories, safe counters and safe state machines, are much simpler and have a minimal overhead compared to, say, implementing TMR, and in that case which TMR.
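For anyone who has not met TMR, the core idea is a 2-of-3 vote across three copies of the same logic. The Python sketch below models only the simplest form, a bitwise voter on a triplicated register; the 8-bit value is arbitrary, and this is not any vendor's implementation or any particular TMR variant.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 vote across three copies of a register."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011_0010
upset = good ^ 0b0000_1000      # one copy with a single flipped bit

# A single upset copy is outvoted by the other two.
print(bin(majority(good, upset, good)))    # 0b10110010

# The same bit upset in two copies wins the vote, so the error gets through.
print(majority(upset, upset, good) == upset)  # True
```

Even this toy version hints at the overhead: three copies of everything plus the voters, before asking where the voters themselves sit and whether they are triplicated too.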
As I said, I have written and spoken quite a bit about this over the years following my design experiences. I have outlined some of the issues and challenges below for those who are interested.
Just my few comments to get the discussion started.
I gave a 65-minute lecture on it at FPGA Kongress this year, which I recreated for the Xilinx blog; you can find a copy of it here: 65 minute video
I also wrote a paper on how to develop safety-critical state machines: Paper on state machines here
I spoke at EE Live a few years ago on a similar topic too, so check out this link: presentation
Some aspects of MTBF
Thanks Adam for the contribution and the links, I will look into them.
You confirmed that mitigation is not as easy as just putting back a register or so that has keeled over. However, my discussion is based on two perspectives: one is how to mitigate (critical applications), and the other is when not to (non-critical). My view is "respect the FPGA and do not add any mitigation if the application is non-critical".
In this connection I am uneasy about Quartus having a setting called "safe state machine". Maybe it determines the state encoding and implements "when others"... but it implies that otherwise the design is unsafe, so I had better use the safe setting, or else keep quiet and defend myself. All this is unnecessary for non-critical applications.
Individual designers may take these issues lightly, but in a team environment, where we have to interface with or review each other's code, they can lead to conflicts and also create a loose culture of "technical knowledge".