Hi,

I finally understand why flip-flops can sometimes be replaced by latches.

Here is an excerpt from the paper "Atom Processor Core Made FPGA Synthesizable":

"Optimized for a frequency range from 800MHz to 1.86GHz, the original Atom design makes extensive use of latches to support time borrowing along the critical timing paths. With level-sensitive latches, a signal may have a delay larger than the clock period and may flush through the latches without causing incorrect data propagation, whereas the delay of a signal in designs with edge-triggered flip-flops must be smaller than the clock period to ensure the correctness of data propagation across flip-flop stages [3]. It is well known that the static timing analysis of latch-based pipeline designs with level-sensitive latches is challenging due to two salient characteristics of time borrowing [2, 3, 14]: (1) a delay in one pipeline stage depends on the delays in the previous pipeline stage. (2) in a pipeline design, not only do the longest and shortest delays from a primary input to a primary output need to be propagated through the pipeline stages, but also the critical probabilities that the delays on latches violate setup-time and hold-time constraints. Such high dependency across the pipeline stages makes it very difficult to gauge the impact of correlations among delay random variables, especially the correlations resulting from reconvergent fanouts. Due to this innate difficulty, synthesis tools like DC-FPGA simply do not support latch analysis and synthesis correctly."

In short, a pipeline of several FFs can be replaced with a pipeline that keeps the FFs at the two ends and inserts ordinary latches between them to steal time slack:

FF1 ---> FF2 ---> FF3 ---> FF4
FF1 -------> l2 --------> l3 --> FF4

I had seen such circuits before, but had not realized the underlying reason. With the above paper, I now know that the technique is not new; it originated in the 1980s.

Weng
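
The time borrowing described in the excerpt can be sketched with a small timing model. This is only an illustration under stated assumptions, not the analysis used in the paper: the latch that replaces the flop at clock edge k*T is assumed to be transparent from k*T to k*T + T/2, the stage delays are made-up numbers, and setup/hold times, latch delay, clock skew and short-path races are all ignored.

```python
def meets_timing_flops(stage_delays_ps, period_ps):
    """Edge-triggered pipeline: every stage must settle within one period."""
    return all(d <= period_ps for d in stage_delays_ps)


def meets_timing_latches(stage_delays_ps, period_ps):
    """FF -> latch -> ... -> latch -> FF with idealized time borrowing.

    Assumed model: the latch at clock edge k*T is transparent from k*T to
    k*T + T/2. Setup/hold, latch delay and skew are ignored.
    """
    half = period_ps // 2
    depart = 0                      # data leaves the launching flop at t = 0
    last = len(stage_delays_ps) - 1
    for k, delay in enumerate(stage_delays_ps):
        arrive = depart + delay
        edge = (k + 1) * period_ps  # clock edge that ends stage k
        if k == last:
            return arrive <= edge   # capturing flop: no borrowing allowed
        if arrive > edge + half:    # missed the transparent window
            return False
        depart = max(edge, arrive)  # early data waits for the latch to open
    return True


delays = [1300, 700, 1200, 800]            # hypothetical stage delays in ps
print(meets_timing_flops(delays, 1000))    # False: 1300 ps > 1000 ps
print(meets_timing_latches(delays, 1000))  # True: slow stages borrow time
```

With flops the same four stages would need a period of at least 1300 ps; in this idealized latch model they fit in 1000 ps because each slow stage borrows from the faster stage that follows it.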
Yes, latch-based design is much older than flop-based design, for the simple reason that it can be cheaper. Think about it -- every flop is really two latches! (At least for static designs that can be clocked down to DC...) Where I work (at a chip company), we're still occasionally converting latch-based designs into flop-based ones.

But (and this is a big but) FPGAs themselves (not just the design tools) are designed for flop-based design, so if you use latch-based designs with FPGAs you are not only stressing the timing tools, you are also avoiding the nice, packaged, back-to-back dedicated latches they give you called flops.

Pat

On Feb 11, 2:05 pm, Weng Tianxiang <wtx...@gmail.com> wrote:
<snip>
In comp.arch.fpga Patrick Maupin <p...@gmail.com> wrote:

> Yes, latch-based design is much older than flop-based design, for the
> simple reason that it can be cheaper. Think about it -- every flop is
> really two latches! (At least for static designs that can be clocked
> down to DC...) Where I work (at a chip company), we're still
> occasionally converting latch-based designs into flop-based ones.

Often using a two (or more) phase clock. Some latches work on one phase, some on the other. With appropriately non-overlapping phases, one avoids race conditions and the timing isn't so hard to get right.

> But (and this is a big but) FPGAs themselves (not just the design
> tools) are designed for flop-based design, so if you use latch-based
> designs with FPGAs you are not only stressing the timing tools, you
> are also avoiding the nice, packaged, back-to-back dedicated latches
> they give you called flops.

Well, you could use a sequence of FF's, clocking on different clock edges, or the same edge of two clocks. That allows for some of the advantages.

If there was enough demand, I suppose FPGA companies would build transparent latch based devices. (Who remembers the 7475?)

In pipelined processors of years past the Earle latch combined one level of logic with the latch logic, reducing the latch delay.

-- glen
On Feb 11, 3:05 pm, Weng Tianxiang <wtx...@gmail.com> wrote:
<snip>

I'm a little unclear on how this works. Is this just a matter of the outputs of the latches settling earlier if the logic path is faster so that the next stage actually has more setup time? This requires that there be a minimum delay in any given path so that the correct data is latched on the current clock cycle while the result for the next clock cycle is still propagating through the logic.

I can see where this might be helpful, but it would be a nightmare to analyze in timing, mainly because of the wide range of delays with process, voltage and temperature (PVT). I have been told you need to allow a 2:1 range when considering all three.

I think similar issues are involved when considering async design (more accurately termed self-timed design). In that design method the variations in delay affect the timing of both the data path and the clock path, so they are largely nulled out and the min delays do not need to include the full 2:1 range compared to the max. Some amount of slack time must be given so the clock arrives after the data, but otherwise all the speed of the logic is utilized at all times. This is also supposed to provide for lower noise designs because there is no chip-wide clock giving rise to simultaneous switching noise.

Self-timed logic does not really result in significant increases in processing speed because although the max speed can be faster, an application can never rely on that faster speed being available. But for applications where there is optional processing that can be done using the left over clock cycles (poor term in this case, but you know what I mean) it can be useful.

In the case of using latches in place of registers, the speed gains are always usable. But can't the same sort of gains be made by register leveling? If you have logic that is slower than a clock cycle followed by logic that is faster than a clock cycle, why not just move some of the slow logic across the register to the faster logic section?

Rick
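
The register leveling Rick asks about can be sketched as choosing where a single register sits in a chain of gates. This is a toy illustration, assuming a purely linear chain and made-up gate delays; real retiming tools work on arbitrary netlists. The function name `best_register_position` is hypothetical.

```python
def best_register_position(gate_delays_ps):
    """Return (split_index, period_ps) for one register in a gate chain.

    Gates [0:split_index] sit before the register, the rest after it.
    Gates are indivisible, which is the granularity limit of leveling.
    Returns None if the chain has fewer than two gates.
    """
    total = sum(gate_delays_ps)
    best = None
    before = 0
    for i in range(1, len(gate_delays_ps)):
        before += gate_delays_ps[i - 1]
        period = max(before, total - before)  # slower side sets the clock
        if best is None or period < best[1]:
            best = (i, period)
    return best


gates = [400, 500, 300, 300, 200]   # hypothetical gate delays in ps
split, period = best_register_position(gates)
print(split, period)                # 2 900
print(sum(gates) / 2)               # 850.0, the ideal if gates could be split
```

The gap between the 900 ps achieved and the 850 ps ideal is the granularity cost of moving whole gates across a register.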
On Feb 11, 8:33 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> In comp.arch.fpga Patrick Maupin <pmau...@gmail.com> wrote:
>
> > But (and this is a big but) FPGAs themselves (not just the design
> > tools) are designed for flop-based design, so if you use latch-based
> > designs with FPGAs you are not only stressing the timing tools, you
> > are also avoiding the nice, packaged, back-to-back dedicated latches
> > they give you called flops.
>
> Well, you could use a sequence of FF's, clocking on different clock
> edges, or the same edge of two clocks.

I actually did this in Xilinx FPGAs back in 1999. The specific problem I was solving was an insufficient number of global clocks (a lot of interconnects with source-based clocking). Xilinx has solutions for this now (regional clocks), but not back then. So I used regular interconnect for clocking, and that was very high skew, so that you couldn't guarantee that the same edge was, in fact, the same edge for all the flops on the clock.

The solution was to do as you said -- the inputs to every flop were from flops clocked on the opposite edge. That, and reducing the amount of logic in that clock domain and clock-crossing to a "real" clock domain as soon as possible.
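
Why opposite-edge clocking tolerates a high-skew clock can be seen from the usual setup and hold slack formulas. The sketch below is a simplification, assuming skew is a single worst-case magnitude that hurts both checks and a 50% duty cycle; all numbers are hypothetical and not taken from any Xilinx data sheet.

```python
def same_edge_slack(period, tco, tmin, tmax, tsu, thold, skew):
    """Worst-case (setup, hold) slack, launch and capture on the same edge."""
    setup = period - (tco + tmax + tsu) - skew
    hold = (tco + tmin) - thold - skew
    return setup, hold


def opposite_edge_slack(period, tco, tmin, tmax, tsu, thold, skew):
    """Worst-case (setup, hold) slack, capture on the opposite clock edge."""
    half = period / 2
    setup = half - (tco + tmax + tsu) - skew   # only half a period for logic
    hold = half + (tco + tmin) - thold - skew  # half a period of extra margin
    return setup, hold


# Hypothetical values in ps: slow clock, little logic, 1500 ps of skew.
args = dict(period=10000, tco=500, tmin=200, tmax=2000,
            tsu=300, thold=100, skew=1500)
print(same_edge_slack(**args))      # (5700, -900): hold violation
print(opposite_edge_slack(**args))  # (700.0, 4100.0): both checks pass
```

The price is visible in the setup slack: only half a period is left for logic, which is consistent with keeping very little logic in that clock domain.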
On Feb 12, 10:32 am, rickman <gnu...@gmail.com> wrote:
> In the case of using latches in place of registers, the speed gains
> are always usable. But can't the same sort of gains be made by
> register leveling? If you have logic that is slower than a clock
> cycle followed by logic that is faster than a clock cycle, why not
> just move some of the slow logic across the register to the faster
> logic section?

That's a similar technique, to be sure, for speed gains. But as I wrote in an earlier post, I think the primary motivation for latch-based design was originally cost. For example, since each flop is really two latches, if you are going to have logic which ANDs together the output of two flops, you could replace that with ANDing the output of two latches and outputting the result through another latch. That uses three latches instead of four, i.e. 75% of the latches.
On Feb 12, 7:35 pm, Patrick Maupin <pmau...@gmail.com> wrote:
> On Feb 12, 10:32 am, rickman <gnu...@gmail.com> wrote:
>
> > In the case of using latches in place of registers, the speed gains
> > are always usable. But can't the same sort of gains be made by
> > register leveling?
>
> That's a similar technique, to be sure, for speed-gains. But as I
> wrote in an earlier post, I think the primary motivation for latch-
> based design was originally cost.
<snip>

Your method's goal and the goal of CPU designers who insert latches into a pipeline are totally different. They use latches where a combinational signal's delay is too long to fit within one clock cycle but too short to justify two clock cycles in a pipeline, not just anywhere you might want to.

Weng
On Feb 12, 11:32 am, rickman <gnu...@gmail.com> wrote:
<snip>
>
> In the case of using latches in place of registers, the speed gains
> are always usable. But can't the same sort of gains be made by
> register leveling? If you have logic that is slower than a clock
> cycle followed by logic that is faster than a clock cycle, why not
> just move some of the slow logic across the register to the faster
> logic section?
>
> Rick

I argued with my coworker for a few days about the benefit of latches versus registers before I finally realized the advantage of latch based designs. Not only is granularity less of a problem (e.g., only able to fit 2 logic delays in a level rather than the maximum 2.8 available, losing nearly 30%) but synchronous delays are different. Rather than accounting for Tco+Tsu for every register in a chain of a few clock cycles where register leveling is helpful, only the Tito transparent latch delay (minus the Tilo LUT delay) needs to be added for each latch in the chain [using Xilinx timing nomenclature].

I agree that the register based FPGAs are probably designed (and tested) to minimize Tsu and Tco without strong consideration for Tito and that the timing analysis is NOT set up to do a good job with "latch leveled" timing analysis.

When I do use latches (when transferring data between rising/falling time domains for a fast clock, for instance) I have to specify false values around the latch for synchronous analysis rather than the precise values through the latch because the analysis wants to see registers at each stage even with the proper analysis flag turned on.

If the analyzer would recognize a chain of rise/fall/rise/fall controlled latches and automatically increase the timing constraint by a half period for each stage, we'd potentially have a powerful tool at our disposal. But they don't so we don't. At least not in FPGAs.

- John_H
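
The two effects John_H mentions reduce to simple arithmetic. The sketch below reproduces the "nearly 30%" figure from the 2 versus 2.8 logic delays in the post; the Tco, Tsu, Tito and Tilo values are placeholders chosen only for illustration and are not from any Xilinx data sheet.

```python
import math


def granularity_loss(levels_available):
    """Fraction of the period wasted when only whole logic levels fit."""
    return (levels_available - math.floor(levels_available)) / levels_available


def flop_overhead(stages, tco, tsu):
    """Sequencing overhead of a register chain: Tco + Tsu per register."""
    return stages * (tco + tsu)


def latch_overhead(stages, tito, tilo):
    """Overhead of a latch chain: Tito minus Tilo per transparent latch."""
    return stages * (tito - tilo)


print(round(granularity_loss(2.8), 3))            # 0.286, i.e. nearly 30%

# Placeholder delays in ps for a 4-stage chain (assumed, not measured).
print(flop_overhead(4, tco=500, tsu=300))         # 3200
print(latch_overhead(4, tito=700, tilo=400))      # 1200
```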
In comp.arch.fpga John_H <n...@johnhandwork.com> wrote:
(snip)

> I argued with my coworker for a few days about the benefit of latches
> versus registers before I finally realized the advantage of latch
> based designs. Not only is granularity less of a problem (e.g., only
> able to fit 2 logic delays in a level rather than the maximum 2.8
> available, losing nearly 30%) but synchronous delays are different.
> Rather than accounting for Tco+Tsu for every register in a chain of a
> few clock cycles where register leveling is helpful, only the Tito
> transparent latch delay (minus the Tilo LUT delay) needs to be added
> for each latch in the chain [using Xilinx timing nomenclature].

I would have thought that they were fast enough now for that not to matter so much. My thought would be that clock skew, even with the fancy clock distribution system, would be the important factor.

If the granularity is the problem then you might try clocking some on rising and some on falling edge (if available) or having two clocks with known phase difference. That would be especially true if the DLL's could generate the appropriate clocks.

> I agree that the register based FPGAs are probably designed (and
> tested) to minimize Tsu and Tco without strong consideration for Tito
> and that the timing analysis is NOT set up to do a good job with
> "latch leveled" timing analysis.

> When I do use latches (when transferring data between rising/falling
> time domains for a fast clock, for instance) I have to specify false
> values around the latch for synchronous analysis rather than the
> precise values through the latch because the analysis wants to see
> registers at each stage even with the proper analysis flag turned on.

> If the analyzer would recognize a chain of rise/fall/rise/fall
> controlled latches and automatically increase the timing constraint by
> a half period for each stage, we'd potentially have a powerful tool at
> our disposal. But they don't so we don't. At least not in FPGAs.

That sounds useful. If it gets popular enough, maybe they will add it.

-- glen
On Feb 13, 3:09 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
<snip>
> > Rather than accounting for Tco+Tsu for every register in a chain of a
> > few clock cycles where register leveling is helpful, only the Tito
> > transparent latch delay (minus the Tilo LUT delay) needs to be added
> > for each latch in the chain [using Xilinx timing nomenclature].
>
> I would have thought that they were fast enough now for that
> not to matter so much. My thought would be that clock skew,
> even with the fancy clock distribution system, would be the important
> factor.

Clock skew becomes entirely unimportant in the latch scheme as I know it unless CLK and CLK180 are used instead of normal and inverted versions of the same clock. The latches are explicitly alternated posedge/negedge/posedge/negedge, effectively decomposing a conceptual register into its two latches and balancing the logic between them. For clock skew to be an issue, two consecutive latches would have to be transparent long enough for the logic path plus delays to sneak through; that won't happen when using the normal and invert of the *same* clock net unless things are very, very wrong in the latch design.

> If the granularity is the problem then you might try clocking
> some on rising and some on falling edge (if available) or having
> two clocks with known phase difference. That would be especially
> true if the DLL's could generate the appropriate clocks.

Some... registers? Using the posedge and negedge in a registered arrangement would simply exacerbate the granularity problem, able to fit fewer whole delays into the same clock period by dividing the logic into two phases. The latches allow longer delays to move the valid data further toward the end of the transparent window and shorter delays to move it back, always with the safeguard that data for the next (half) cycle isn't allowed to be valid any sooner than the front edge of the transparent window.

The description comes out a little muddy, which is why it took me a few days to buy in to the whole concept. It's sweet! It just takes some timing diagrams and head scratching. And it's certainly not set up for proper analysis, especially in the Xilinx tools where I experimented with the phase domain changes.

- John_H
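
The alternating posedge/negedge scheme John_H describes can be traced in a few lines in place of a timing diagram. This is a rough sketch under assumed conditions: latch k is transparent during half-period k (from k*T/2 to (k+1)*T/2), the stage delays are invented, and latch delay, setup/hold times and skew are ignored. The function name `latch_chain_arrivals` is hypothetical.

```python
def latch_chain_arrivals(stage_delays_ps, period_ps):
    """Propagate data through latches that alternate clock phase.

    Stage k is the logic between latch k and latch k+1. Returns, for each
    receiving latch, how far into its transparent window the data arrives
    (0 means it was ready at or before the opening edge), or None if some
    arrival misses its window.
    """
    half = period_ps // 2
    depart = 0                       # latch 0 opens at t = 0
    offsets = []
    for k, delay in enumerate(stage_delays_ps):
        arrive = depart + delay
        opens = (k + 1) * half       # front edge of latch k+1's window
        closes = opens + half
        if arrive > closes:
            return None              # data too late: timing fails
        depart = max(opens, arrive)  # early data waits for the front edge
        offsets.append(depart - opens)
    return offsets


# Hypothetical half-cycle stage delays in ps, clock period 1000 ps.
print(latch_chain_arrivals([700, 400, 600, 300], 1000))  # [200, 100, 200, 0]
print(latch_chain_arrivals([300, 300], 1000))            # [0, 0]
print(latch_chain_arrivals([1100], 1000))                # None
```

In the first case the 700 ps and 600 ps stages would each fail a 500 ps half-period budget with registers; with latches, the long stages push the valid data toward the end of the window and the short stages pull it back.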