Guys,

We have just laid out a board and want to put the thermal analysis to bed (it's conduction cooled, so there is not much room for error). If the Xilinx estimator says we are going to use 25 watts, does anyone know the best way to code an FPGA so that it will get nice and hot?

The estimator is just that, an estimate. Is there a more accurate way of writing some code so that a particular clock input will generate a particular amount of heat? A serial chain of 2000 D-type flip-flops, where every flip-flop toggles on every clock and the end blinks an LED, is obviously one way, but it doesn't seem very elegant.

We have wired up the internal temperature-sense diode to take a look at the result (and yes, we know how noisy and inaccurate they are).

Any experiences?

Colin
making an fpga hot
Started by ●December 3, 2004
Reply by ●December 3, 2004
colin wrote:
> Guys
>
> We have just laid out a board and want to put the thermal analysis to
> bed (it's conduction cooled so not much room for error). If the xilinx
> estimator says we are going to use 25 watts does anyone know the best
> way to code an FPGA so that it will get nice and hot.
>
> The estimator is just that, but is there a more accurate way of
> writing some code so that a particular clock input will generate a
> particular amount of heat. A 2000 D type serial chain where every flip
> flop is toggling every clock which blinks an LED is obviously one way
> but doesn't seem very elegant.
>

If your goal is just to generate heat, use all the LUTs as SRLs, make use of all the BRAMs, and drive all the I/Os with a nice high current drive strength.

Marc
Reply by ●December 3, 2004
Colin,

Just make a huge shift register, or all DFFs toggling, and then just vary the clock input (or the shifted data input pattern, from ...000001 to 101010..., etc.). That is what we do.

Austin

colin wrote:
> Guys
>
> We have just laid out a board and want to put the thermal analysis to
> bed (it's conduction cooled so not much room for error). If the xilinx
> estimator says we are going to use 25 watts does anyone know the best
> way to code an FPGA so that it will get nice and hot.
>
> The estimator is just that, but is there a more accurate way of
> writing some code so that a particular clock input will generate a
> particular amount of heat. A 2000 D type serial chain where every flip
> flop is toggling every clock which blinks an LED is obviously one way
> but doesn't seem very elegant.
>
> We have wired up the internal temp sense diode to take a look at the
> result (and yes we know how noisy and inaccurate they are).
>
> Any experiences?
>
> Colin
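To get a feel for how much the data pattern changes the activity in such a shift register, one can count toggles in a small software model. The sketch below is only an illustration: the function name `toggle_density` is made up here, and it assumes the pattern circulates through a register of the same length, which is a simplification of feeding a repeating pattern into the serial input.

```python
def toggle_density(pattern: str) -> float:
    """Average fraction of flip-flops that change state per clock.

    Assumption: `pattern` (a string of '0'/'1') circulates through a
    shift register with one flip-flop per character.
    """
    n = len(pattern)
    state = [int(b) for b in pattern]
    toggles = 0
    for _ in range(n):  # one full rotation of the pattern
        nxt = [state[-1]] + state[:-1]  # shift by one position
        toggles += sum(a != b for a, b in zip(state, nxt))
        state = nxt
    return toggles / (n * n)


if __name__ == "__main__":
    for name, pat in [("1010...", "10" * 8),
                      ("1100...", "1100" * 4),
                      ("...0001", "0" * 15 + "1")]:
        print(f"{name:8s} toggle density = {toggle_density(pat):.3f}")
```

In this model the alternating pattern toggles every flip-flop on every clock, while a single 1 walking through the register toggles only two flip-flops per clock, so the data pattern gives a wide adjustment range on top of the clock frequency.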
Reply by ●December 3, 2004
"colin" <colin_toogood@yahoo.com> wrote in message news:885a4a4a.0412030423.4f6b7e7c@posting.google.com...> We have wired up the internal temp sense diode to take a look at the > result (and yes we know how noisy and innacurate they are). > > Any experiences? >Well, I've found the diode isn't particularly noisy nor especially inaccurate! It gives repeatable and consistent (between parts) results, certainly good enough for your application. You have routed its connections together and away from big switching currents, I presume?! I use copper sheet to move heat to where I can get rid of it. Cu is 400 W/m/K, about twice as good as Aluminium. Don't use copper alloys. Very useful if you've got boards stacked closely together, you can get the heat out from between the boards. I've never tried heat pipes, but they're meant to be very good indeed. Finally, you'll find that the FPGAs work at elevated temperature for a long time. I recall a thread on CAF all about FPGAs down boreholes where they were running for weeks at 175C. You might be enlightened by a quick trawl of CAF in Google Groups. So, what's the lifetime of your product? How long will you be working for that company? All part of the engineering compromise!! Good luck, Syms.
Reply by ●December 6, 2004
Reply by ●December 8, 2004
Hi Colin,

Below I try to give some insight into how to make a hot design, though I do question the motivation for doing so. A simple FF chain comes nowhere close to achieving a high (or even average) core power.

All of the phenomena I describe below are modeled in the recently released Quartus II 4.2 software via its PowerPlay Power Analyzer. Target Stratix II or Max II and you'll get very accurate estimates of how all these factors affect your power consumption. You can try out the Power Analyzer in the Quartus II 4.2 Web Edition software available from www.altera.com.

If you're trying to figure out whether a given design will work on your board after it's been made, the best bet is to try the chip out in the lab using stimulus (vectors) that reflect the worst-case operating conditions for the chip. I can make you a design that will burn many, many watts of power, but that doesn't mean your design will. A dynamic power measurement from the lab is the most accurate estimate possible. Just remember to use the manufacturer's spec for worst-case static power (at worst-case temperature), since the unit you have on your board is likely NOT worst-case.

> The estimator is just that, but is there a more accurate way of
> writing some code so that a particular clock input will generate a
> particular amount of heat. A 2000 D type serial chain where every flip
> flop is toggling every clock which blinks an LED is obviously one way
> but doesn't seem very elegant.

There are many factors that affect the overall dynamic power consumption of an FPGA design. I will highlight a few critical ones below, and make suggestions along the way for building a design that turns your FPGA into the hot-plate you desire. It is *not* as simple as making one big shift register...

(0) Transition Density. You want to toggle as much as possible every cycle. Toggle FFs and shift registers achieve this, as do XOR functions (if you want to utilize the LUT too).

(1) Routing Utilization. The routing buffers, multiplexers, and wiring in an FPGA can add up to a large amount of switching capacitance and short-circuit (crowbar) current. To maximize dynamic power, you must use a lot of routing. A simple FF chain will actually use very little routing, unless you purposely make the placement very bad by using region constraints such as LogicLock regions. You could, for example, constrain the even bits of your chain to one half of the chip and the odd bits to the other half, and this will greatly increase routing utilization. Or use something other than FFs to increase the number and fanout of the routed wires. Of course, you'll need to experiment a little to find the right balance between high utilization and still being able to route!

(2) LUT Configuration. A LUT configured as an AND gate does not burn nearly as much power as one configured as an XOR. This difference is due to the number of internal nodes in the circuit that toggle state upon the toggle of an input signal. On top of this, the output of an XOR will toggle upon the toggle of any input, so chaining together XORs will result in a cascade of glitching (if there are no pipeline registers), which can further increase your power. To get the most accurate estimate of LUT power, you must consider the functionality of the LUT. Quartus II can do this for you.

(3) Clock Network. The vast majority of the power on a high-fanout clock will be burned *inside* the LABs (on the LAB-wide clock), not on the global clock network. If you distribute a clock such that it fans out to one FF (out of 16) in every LAB of the device, this will maximize the internal LAB clock network power. You can achieve this through location constraints applied to these FFs. And the more clocks you use, the more you will burn. You can use the PLLs to step up the clock frequency to help increase the toggle rate.

(4) RAMs. A RAM can burn significant power if you perform reads and writes every cycle (keep the clock enable asserted). Just hook up all the RAMs in the device in dual-port mode, writing and reading random data every cycle, and you've got some more power.

(5) I/Os. You can burn an arbitrary amount of power with your I/Os, depending on external termination resistance, contention, I/O standard, drive strength, load capacitance, etc. Let's just pretend you don't have I/Os to make life easier.

Hopefully that gives you some ideas of where to go to burn some power. If you're using a Xilinx chip, I'm sure similar techniques will apply, though their tools may not be able to fully predict the results you will see.

Regards,

Paul Leventis
Altera Corp.
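The effect of a deliberately bad placement on routing (point 1) can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it places the flip-flops of the chain on a one-dimensional row of sites and uses total wire length between consecutive bits as a crude stand-in for switched routing capacitance. Real devices are two-dimensional with segmented routing, the function names are invented for this example, and the numbers are not power estimates.

```python
def chain_wirelength(positions):
    """Sum of distances between consecutive flip-flops in the chain."""
    return sum(abs(b - a) for a, b in zip(positions, positions[1:]))


def packed_placement(n):
    """Bit i sits at site i: every connection goes to a neighbour."""
    return list(range(n))


def split_placement(n):
    """Even bits fill the first half of the row, odd bits the second half."""
    half = n // 2
    return [i // 2 if i % 2 == 0 else half + i // 2 for i in range(n)]


if __name__ == "__main__":
    n = 2000  # chain length taken from the original question
    good = chain_wirelength(packed_placement(n))
    bad = chain_wirelength(split_placement(n))
    print(f"packed: {good} site-lengths, split: {bad} site-lengths, "
          f"ratio ~{bad / good:.0f}x")
```

In this toy model every connection in the split placement has to cross roughly half the row, so the total wire length grows with the size of the device rather than staying at one site per bit.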
Reply by ●December 8, 2004
Hi Paul,

Comments/Questions below!

"Paul Leventis (at home)" <paulleventis-news@yahoo.ca> wrote in message news:686dnTKPrvwyGyvcRVn-pQ@rogers.com...
> (2) LUT Configuration. A LUT configured as an AND gate does not burn nearly
> as much power as one configured as an XOR. This difference is due to the
> number of internal nodes in the circuit that toggle states upon the toggle
> of an input signal. On top of this, (blah, blah, XORs transition more)

Could you explain that a little more? I thought that the LUT was just a 16x1 RAM. Is the extra power consumed only when two inputs change? E.g. 00 => 11 into the XOR would still have 0 as its output, but it might transition through the 1 output state? I understand that XOR gates are more likely to transition, but you seem to be saying there's some additional internal reason why they consume power.

>
> Paul Leventis
> Altera Corp.
>

Cheers,
Syms.
Reply by ●December 8, 2004
The logic transitions in the routing and subsequent differential delays through the LUT can make for many more transitions than a simple buffer implemented in a LUT. Unless all the LUT inputs are precisely timed so that the edges change together, you wind up with a walk through several of the LUT addresses in the process of settling to the next clock. A paper presented at FPGA a few years ago went as far as to say that as much as 30-40% of the power in a typical FPGA design is due to propagating glitches in the logic between flip-flops, and they showed that by heavily pipelining the design, the power consumption improved dramatically.

Symon wrote:
> Hi Paul,
> Comments/Questions below!
>
> "Paul Leventis (at home)" <paulleventis-news@yahoo.ca> wrote in message
> news:686dnTKPrvwyGyvcRVn-pQ@rogers.com...
> > (2) LUT Configuration. A LUT configured as an AND gate does not burn
> > nearly
> > as much power as one configured as an XOR. This difference is due to the
> > number of internal nodes in the circuit that toggle states upon the toggle
> > of an input signal. On top of this, (blah, blah, XORs transition more)
>
> Could you explain that a little more? I thought that the LUT was just a 16x1
> RAM. Is the extra power consumed only when two inputs change? e.g. 00 => 11
> into the XOR would still have 0 as its output but it might transition
> through the 1 output state? I understand that XOR gates are more likely to
> transition, but you seem to be saying there's some additional internal
> reason why they consume power.
>
> >
> > Paul Leventis
> > Altera Corp.
> >
> Cheers, Syms.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
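The "walk through several of the LUT addresses" can be reproduced in a few lines. The following is a minimal sketch, assuming an idealized LUT whose output instantly follows its address and whose inputs arrive one at a time in a fixed order; `output_transitions` and the mask constants are names invented for this illustration.

```python
def output_transitions(mask, old, new, arrival_order):
    """Count output changes of a LUT while its inputs switch one by one.

    mask          -- LUT contents, bit i is the output for input address i
    old, new      -- input address before and after the clock edge
    arrival_order -- input bit indices, in the order their edges arrive
    """
    addr = old
    out = (mask >> addr) & 1
    transitions = 0
    for bit in arrival_order:
        # This input now takes its new value; the others are still in flight.
        addr = (addr & ~(1 << bit)) | (new & (1 << bit))
        nxt = (mask >> addr) & 1
        if nxt != out:
            transitions += 1
        out = nxt
    return transitions


# 4-input masks: XOR (parity), AND, and a buffer of input 0.
XOR4 = sum((bin(i).count("1") & 1) << i for i in range(16))
AND4 = 1 << 15
BUF0 = 0xAAAA

if __name__ == "__main__":
    for name, mask in [("XOR4", XOR4), ("AND4", AND4), ("BUF0", BUF0)]:
        n = output_transitions(mask, 0b0000, 0b1111, [0, 1, 2, 3])
        print(f"{name}: {n} output transition(s) for 0000 -> 1111")
```

With skewed arrivals the XOR output changes four times even though its final value equals its starting value, whereas the AND and the buffer change once. Those extra edges are the glitches that then propagate into downstream logic.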
Reply by ●December 8, 2004
Hi Symon,

> > (2) LUT Configuration. A LUT configured as an AND gate does not burn nearly
> > as much power as one configured as an XOR. This difference is due to the
> > number of internal nodes in the circuit that toggle states upon the toggle
> > of an input signal. On top of this, (blah, blah, XORs transition more)
>
> Could you explain that a little more? I thought that the LUT was just a 16x1
> RAM. Is the extra power consumed only when two inputs change? e.g. 00 => 11
> into the XOR would still have 0 as its output but it might transition
> through the 1 output state? I understand that XOR gates are more likely to
> transition, but you seem to be saying there's some additional internal
> reason why they consume power.

While logically a LUT is just a 16x1 ROM, physically it is not built the same way as a RAM. A traditional RAM is built as a 2D array of bits, where a row is selected by decoding the address, a pair of differential bit lines per cell is precharged, and then the cell pulls one side down, which is amplified by a sense amplifier to speed things up (gross simplification). In that structure, you burn the same power regardless of what you are reading, since the reads are differential, and you burn power on each read regardless of the previously read value, since all that precharge, pull-down and sensing happens on every read.

A LUT, however, is traditionally built as a multiplexer tree. You have 16 SRAM cells feeding a tree of 2:1 muxes. The 4 inputs of the LUT each control one level of the tree. There is a diagram below for a 2-LUT.

Let's take a 2-LUT implementing an XOR as an example (see diagram). We have x = A?1:0, y = A?0:1, and f = B?y:x. Let's say A switches from 0 to 1 (and B = 0). Node x toggles from 0 to 1. Node y toggles from 1 to 0. And node f toggles from 0 to 1 (with x). So you have not only the output of the LUT toggling, but also the internal stages. If you extend the example to an N-LUT, you'll see that a toggle on input A results in 2^(N-1) first-stage nodes toggling, 2^(N-2) second-stage nodes, etc., for a total of 2^N - 1 nodes toggling *internal* to the LUT. If you look at an AND instead, you'll see that only one first-stage node toggles state with a change in A.

    SRAM cells    stage 1 (select = A)    stage 2 (select = B)

      +-+
      |0|---|\
      +-+   | |---- x ----|\
      +-+   | |           | |
      |1|---|/            | |
      +-+                 | |---- f
      +-+                 | |
      |1|---|\            | |
      +-+   | |---- y ----|/
      +-+   | |
      |0|---|/
      +-+

    (The upper input of each mux is selected when its select line is 0.)

So in conclusion, an XOR not only results in a higher output switching probability (which should be modeled by your simulation vectors or assumed toggle rate), but also results in higher *internal* switching activity. Hence the power of a LUT is not constant across LUT masks. In fact, it also changes as a function of the "static probabilities" of each input, i.e. the percentage of the time those inputs are 1 or 0, since asymmetric LUT masks result in asymmetric internal states as a function of input values.

Regards,

Paul Leventis
Altera Corp.
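The node counts in this explanation can be checked with a small model of the mux tree. This is a sketch under the same idealization as the diagram (a pure tree of 2:1 muxes, input A on the first stage, no buffering or glitching); the helper names are made up for this example and do not correspond to any vendor tool.

```python
def mux_tree_nodes(mask, inputs):
    """Values of every mux output in the tree, first stage first.

    mask   -- LUT contents, bit i is the SRAM cell for input address i
    inputs -- tuple of input values; inputs[0] (A) selects in the first stage
    """
    n = len(inputs)
    level = [(mask >> i) & 1 for i in range(2 ** n)]  # the SRAM cells
    nodes = []
    for sel in inputs:
        # Each 2:1 mux picks one of a pair of nodes from the previous level.
        level = [level[2 * i + sel] for i in range(len(level) // 2)]
        nodes.extend(level)
    return nodes


def internal_toggles(mask, before, after):
    """Number of mux outputs (including the LUT output) that change."""
    a = mux_tree_nodes(mask, before)
    b = mux_tree_nodes(mask, after)
    return sum(x != y for x, y in zip(a, b))


def xor_mask(n):
    return sum((bin(i).count("1") & 1) << i for i in range(2 ** n))


def and_mask(n):
    return 1 << (2 ** n - 1)


if __name__ == "__main__":
    for n in (2, 4):
        lo = (0,) * n              # all inputs 0
        hi = (1,) + (0,) * (n - 1)  # only input A toggles
        print(f"{n}-LUT XOR: {internal_toggles(xor_mask(n), lo, hi)} nodes toggle, "
              f"AND: {internal_toggles(and_mask(n), lo, hi)} node(s) toggle")
```

For the 2-LUT XOR the model reports x, y and f all toggling, and for a 4-LUT XOR it reports 15 = 2^4 - 1 nodes, while the AND mask toggles a single first-stage node in both cases.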
Reply by ●December 8, 2004
Hi Ray et al:

Good point on glitching. On a related note, this glitching also makes power analysis difficult. Even with good-quality simulation vectors for a design, the resulting gate-level simulation results will contain glitches. Are the glitches real? If so, then they should count towards power. But sufficiently short glitches will never propagate through the routing, or even through the gate.

This is why we recommend that our users employ glitch filtering on simulation results. This can be done with the Quartus II 4.2 simulator or with 3rd-party simulators (via the control file emitted by Quartus II). We find that very glitchy designs do not correlate well unless this glitch filtering is used. In addition, the VCD files produced by 3rd-party simulators need to be filtered again by Quartus in order to improve accuracy further.

For further information on power analysis, the Quartus II PowerPlay Power Analyzer, and glitch filtering specifically, please see http://www.altera.com/literature/hb/qts/qts_qii53013.pdf.

And yes, pipelining is an excellent way to reduce glitching and thus dynamic power. At some point, the pipeline registers and additional clock routing will add more power than the removed glitches saved, but for glitch-heavy designs (anything with XORs, such as adders, multipliers, and parity trees, and "randomizing" circuits such as encryption) pipelining will help a lot.

Regards,

Paul Leventis
Altera Corp.
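The idea behind glitch filtering can be sketched as a pass over a signal's transition list that drops pulses too short to survive. This is only an illustration of the concept, not the algorithm Quartus II or any simulator actually uses; the function name, the event format, and the 200 ps threshold are all assumptions chosen for the example.

```python
def filter_glitches(events, min_width):
    """Remove pulses shorter than `min_width` from one signal's waveform.

    events    -- time-ordered (time, value) pairs; the first pair gives the
                 initial value, later pairs are transitions
    min_width -- shortest pulse (same time unit) assumed to propagate
    """
    kept = [events[0]]
    for t, v in events[1:]:
        if v == kept[-1][1]:
            continue  # not a real change of level
        if len(kept) > 1 and t - kept[-1][0] < min_width:
            # The level entered at kept[-1] lasted too short a time:
            # drop that pulse and fall back to the level before it.
            kept.pop()
            if kept[-1][1] == v:
                continue
        kept.append((t, v))
    return kept


if __name__ == "__main__":
    # Times in picoseconds; the 50 ps pulse at t=1000 is the glitch.
    wave = [(0, 0), (1000, 1), (1050, 0), (2000, 1), (3000, 0)]
    clean = filter_glitches(wave, min_width=200)
    print("raw transitions:     ", len(wave) - 1)
    print("filtered transitions:", len(clean) - 1)
    print(clean)
```

Counting transitions before and after filtering shows why unfiltered gate-level results can overstate toggle rates: each short pulse that would never make it through the routing contributes two transitions to the raw count.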






