FPGARelated.com
Forums

Automatic latency balancing in VHDL-implemented complex pipelined systems

Started by Unknown September 29, 2015
On 9/29/2015 5:01 PM, Tim Wescott wrote:
> On Tue, 29 Sep 2015 16:41:08 -0400, rickman wrote:
>
>> On 9/29/2015 4:22 PM, Tim Wescott wrote:
>>> On Tue, 29 Sep 2015 06:49:02 +0000, glen herrmannsfeldt wrote:
>>>
>>>> wzab01@gmail.com wrote:
>>>>
>>>>> Last time I have spent a lot of time on development of quite complex
>>>>> high speed data processing systems in FPGA.
>>>>> They all had pipeline architecture, and data were processed in
>>>>> parallel in multiple pipelines with different latencies.
>>>>
>>>>> The worst thing was that those latencies were changing during
>>>>> development. For example some operations were performed by blocks
>>>>> with tree structure, so the number of levels depended on number of
>>>>> inputs handled by each node.
>>>>> The number of inputs in each node was varied to find the acceptable
>>>>> balance between the number of levels and maximum clock speed. I also
>>>>> had to add some pipeline registers to improve timing.
>>>>
>>>> I have heard that some synthesis software now knows how to move around
>>>> pipeline registers to optimize timing. I haven't tried using the
>>>> feature yet, though.
>>>
>>> I knew about this sort of thing ten years ago, although I've never used
>>> it (for FPGA I'm mostly an armchair coach).
>>>
>>> At the time that my FPGA friends were rhapsodizing about it, the
>>> designer still needed to specify the total delay, but the tools took
>>> the responsibility for distributing it.
>>>
>>> It makes sense to do it that way, because you're the one that has to
>>> decide how much delay is right, and who has to make sure that the
>>> timing for section A matches the timing for section B -- for the moment
>>> at least that's really beyond the tool's ability to cope.
>>
>> I'm not picturing the model you are describing. If all sections have
>> the same clock, they all have the same timing constraint, no? As to the
>> tools distributing the delays, again, each stage has the same timing
>> constraint so unless there are complications such as inputs with
>> separately specified delays, the tool just has to move logic across
>> register boundaries to make each section meet the timing spec or better
>> to balance all the delays in case you wish to have the fastest possible
>> clock rate.
>>
>> Maybe by timing you mean the clock cycles the OP is talking about?
>
> The way I've seen it, rather than carefully hand-designing a pipeline,
> you just design a system that's basically
>
>            .---------------------.     .-------.
> data in -->| combinatorial logic |---->| delay |----> data out
>            '---------------------'     '-------'
>
> where the "delay" block just delays all the outputs from the
> combinatorial block by some number of clocks.
>
> Then you tell the tool "move delays as you see fit", and it magically
> distributes the delay in a hopefully-optimal way within the combinatorial
> logic, making it pipelined.
Yes, but you talked about the tool not being able to "cope" with matching the delays in section A and B. I'm not following that.

--
Rick
On 9/29/2015 5:58 PM, wzab01@gmail.com wrote:
> On Tuesday, 29 September 2015 at 21:41:26 UTC+1, rickman wrote:
>> I'm not picturing the model you are describing. If all sections
>> have the same clock, they all have the same timing constraint, no?
>> As to the tools distributing the delays, again, each stage has the
>> same timing constraint so unless there are complications such as
>> inputs with separately specified delays, the tool just has to move
>> logic across register boundaries to make each section meet the
>> timing spec or better to balance all the delays in case you wish to
>> have the fastest possible clock rate.
>>
>> Maybe by timing you mean the clock cycles the OP is talking about?
>
> The problem I'm dealing with is just about the number of clock
> cycles, by which data in each data path are delayed.
Yes, I understand the problem you are addressing. I have never done a design where this was much of a problem, but I'm sure some designs are much larger and more complex than the ones I have done.
> The equal distribution of delay between stages of pipeline is so
> technology specific, that it probably must be handled by the vendor
> provided tools and in fact usually it is. In old Xilinx tools it was
> "register balancing", in Altera tools and in new Xilinx tools it is
> "register retiming".
>
> So my problem is not so complex. And yes, it was solved in GUI based
> tools many years ago. In old Xilinx System Generator it was a special
> "sync" block which was doing that. Just see Fig. 4 in my old paper
> from 2003 (
> http://tesla.desy.de/new_pages/TESLA_Reports/2003/pdf_files/tesla2003-05.pdf
> ).
>
> The importance of the problem is still emphasized by the vendors of
> block-based tools (e.g.
> http://www.mathworks.com/help/hdlcoder/examples/delay-balancing-and-validation-model-workflow-in-hdl-coder.html
> )
Yes, it is important to have a tool to do this when the design is large or your timing margins are tight. It can save a lot of work.
> However I've never seen a tool like this available for designs written
> in pure HDL, not composed from blocks in a GUI based tool...
>
> I have found that for designs with pipelines with lengths depending
> on different parameters and somehow interconnected in a complex way
> there is really a need for a tool for automatic verification, or even
> better for automatic adjustment of those lengths. Without that you
> can easily get an incorrect design which processes data misaligned in
> time.
>
> So that was the motivation. Sorry if my original post was somehow
> misleading.
Not to me. :)

--
Rick
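The "automatic verification" mentioned in the quoted post (checking that data from parallel pipelines arrive aligned in time) can be illustrated with a small cycle-level model. This is only a sketch in Python, not the OP's tool: the `Pipe` class, the time-tag scheme and the name `first_misaligned_cycle` are assumptions introduced for illustration. Each sample carries the cycle number at which it entered, and the merge point compares the tags it receives.

```python
from collections import deque


class Pipe:
    """Fixed-latency pipeline modeled as a shift register (hypothetical helper)."""

    def __init__(self, latency):
        # One slot per register stage; None marks "no valid data yet".
        self.regs = deque([None] * latency)

    def step(self, item):
        """Clock the pipeline once: shift `item` in, return what falls out."""
        self.regs.append(item)
        return self.regs.popleft()


def first_misaligned_cycle(latency_a, latency_b, cycles=32):
    """Return the first cycle at which the two pipelines deliver samples
    with different time tags, or None if they stay aligned."""
    pipe_a, pipe_b = Pipe(latency_a), Pipe(latency_b)
    for cycle in range(cycles):
        # Both pipelines receive a sample tagged with the current cycle.
        out_a = pipe_a.step((cycle, "a"))
        out_b = pipe_b.step((cycle, "b"))
        tag_a = out_a[0] if out_a is not None else None
        tag_b = out_b[0] if out_b is not None else None
        if tag_a != tag_b:
            return cycle
    return None


aligned = first_misaligned_cycle(4, 4)
misaligned = first_misaligned_cycle(3, 5)
print(aligned, misaligned)
```

With equal latencies the tags always agree; with latencies 3 and 5 the shorter path delivers valid data while the longer one is still empty, and the check reports that cycle.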
On Tue, 29 Sep 2015 18:33:32 -0400, rickman wrote:

> Yes, but you talked about the tool not being able to "cope" with
> matching the delays in section A and B. I'm not following that.
Basically I meant that you need to be responsible for lining up the delays in all the sections -- you can't make one section delay by five more clocks without identifying all the other pertinent sections that depend on that and make them delay by five more clocks, too.

If the tool could do everything we'd all be wiring houses for a living.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
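The bookkeeping described here (if one section gets five clocks slower, every section that meets it downstream has to be padded by the same amount) is mechanical enough to sketch. The following Python is a minimal illustration under stated assumptions: the design is a DAG of blocks with known latencies, the blocks are listed in topological order, and the names `balance`, `section_a`, `section_b` and `add` are made up for the example. It is not any vendor's or poster's actual tool.

```python
from collections import defaultdict


def balance(latency, edges):
    """Compute the padding needed to align all inputs of every block.

    latency: dict mapping block name -> latency in clock cycles,
             listed in topological order (a simplifying assumption)
    edges:   list of (source, destination) connections
    Returns (arrival, padding):
      arrival[block]     = cycle at which the block's output is valid
      padding[(src,dst)] = extra register stages to insert on that link
    """
    inputs = defaultdict(list)
    for src, dst in edges:
        inputs[dst].append(src)

    arrival, padding = {}, {}
    for block in latency:  # relies on dict insertion order being topological
        # A block can only start once its slowest input has arrived.
        start = max((arrival[src] for src in inputs[block]), default=0)
        for src in inputs[block]:
            padding[(src, block)] = start - arrival[src]
        arrival[block] = start + latency[block]
    return arrival, padding


# Hypothetical example: two parallel sections feeding one adder.
latency = {"src": 0, "section_a": 3, "section_b": 8, "add": 1}
edges = [
    ("src", "section_a"),
    ("src", "section_b"),
    ("section_a", "add"),
    ("section_b", "add"),
]
arrival, padding = balance(latency, edges)
print(padding[("section_a", "add")])  # 5 extra stages on the faster path
print(arrival["add"])                 # 9 cycles total latency
```

Changing the latency of `section_b` and re-running recomputes every affected padding value, which is the part that is tedious and error-prone to do by hand.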
On 9/29/2015 7:19 PM, Tim Wescott wrote:
> Basically I meant that you need to be responsible for lining up the
> delays in all the sections -- you can't make one section delay by five
> more clocks without identifying all the other pertinent sections that
> depend on that and make them delay by five more clocks, too.
>
> If the tool could do everything we'd all be wiring houses for a living.
Ok, but that is not the tool CAD vendors provide. That is the tool the OP is talking about.

--
Rick
No VHDL compiler can be useful unless it respects the registers entered by
the user, though it may fit an equivalent arrangement, as in register
retiming, for timing purposes.

What we are talking about is obviously register delay stages rather than
combinatorial/routing delays, which are a timing concern for each register
and which the tool decides together with any constraints from the user.

It is up to the user to decide the register delay stages. This cannot be
technology sensitive unless you are doing some high-level coding that does
not specify registers. I don't know what this level is though.

How can a user build a design without being correct about register delay?
How do you add streams, or multiply, or switch etc., and ask the tool to do
the job?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com
On 29/09/2015 22:01, Tim Wescott wrote:
> On Tue, 29 Sep 2015 16:41:08 -0400, rickman wrote:
..
>
> The way I've seen it, rather than carefully hand-designing a pipeline,
> you just design a system that's basically
>
>            .---------------------.     .-------.
> data in -->| combinatorial logic |---->| delay |----> data out
>            '---------------------'     '-------'
>
> where the "delay" block just delays all the outputs from the
> combinatorial block by some number of clocks.
>
> Then you tell the tool "move delays as you see fit", and it magically
> distributes the delay in a hopefully-optimal way within the combinatorial
> logic, making it pipelined.
>
> As I said, I've never done it -- I couldn't even tell you what search
> terms to use to find out what the tool vendors call the process.
>
As mentioned before, just search for register retiming. It works exactly
as you described, although it is not perfect. It can move combinational
logic between register pairs to balance the slack. Register retiming is
a relatively old technology and has been available in most independent
tools (like Mentor's Precision and Synopsys's Synplify) and vendor
synthesis tools for many years. From what I understand, vendor tools can
only move logic in one direction due to a patent owned by Mentor
Graphics.

# Info: [7004]: Starting retiming program ...
# Info: [7012]: Phase 1
# Info: [7012]: Phase 2
# Info: [7012]: Phase 3
# Info: [7012]: Phase 4
# Info: [7012]: Total number of DSPs processed : 0
# Info: [7012]: Total number of registers added : 138
# Info: [7012]: Total number of registers removed : 66
# Info: [7012]: Total number of logic elements added : 0

Register retiming is something you want to enable by default unless you
are planning to use an equivalence checker.

Hans
www.ht-lab.com
>As mentioned before just search for register retiming. It works exactly
>as you described although it is not perfect. It can move combinational
>logic between register pairs to balance the slack.
Register retiming is a technique to help a given path meet its setup/hold
timing. It does not and should not change the latency of the path in terms
of clock periods.

The OP is referring to the latency of a path in clock periods, not to
delay issues within a given path.

Kaz
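The distinction drawn here can be made concrete with a toy model. The sketch below (Python; the gate delays, the function name `retime` and the brute-force search are all illustrative assumptions, not how any synthesis tool actually works) treats a path as a chain of combinational gates followed by a fixed number of registers. Retiming may move registers into the chain to shorten the worst register-to-register delay, but the number of registers on the path, and therefore the latency in clocks, stays the same.

```python
from itertools import combinations


def retime(gate_delays, stages):
    """Place `stages - 1` registers inside a chain of gates (the last
    register stays at the output) so that the slowest stage is as fast
    as possible. Returns (worst_stage_delay, cut_positions).

    Brute force over all placements; fine for a toy example only.
    """
    n = len(gate_delays)
    best = None
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)
        # Delay of each stage = sum of the gate delays between two registers.
        worst = max(sum(gate_delays[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best is None or worst < best[0]:
            best = (worst, cuts)
    return best


gate_delays = [2, 1, 3, 1, 2, 1]   # hypothetical delays in ns
stages = 3                         # latency in clocks, fixed by the designer

# Before retiming: all logic sits in front of the registers.
worst_before = sum(gate_delays)
worst_after, cuts = retime(gate_delays, stages)

print(worst_before, worst_after)   # 10 4
print(len(cuts) + 1 == stages)     # True: register count (latency) unchanged
```

The achievable clock period improves from 10 ns to 4 ns in this made-up example, while the path still takes three clocks from input to output. Aligning that three-clock latency with other paths remains the designer's problem.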
On Wednesday, 30 September 2015 at 09:41:15 UTC+1, kaz wrote:
> Any VHDL compiler cannot be a useful compiler unless it respects the user
> entered registers. Though it may fit an equivalent arrangement as in
> register retiming for timing purposes.
>
> Register delay stages is obviously what we are talking about rather than
> combinatorial/routing delays which is a concern for each register timing
> and which the tool decides together with any constraints from user.
>
> It is up to user to decide the register delay stages. It cannot be
> technology sensitive unless you are doing some high level coding that does
> not specify registers. I don't know what this level is though.
>
> How come a user build a design without being correct about register delay.
> How do you add streams or multiply or switch etc. and ask the tool to do
> the job?
In the systems which I have to build there are some parametrized components whose latency depends on their parameters. Unfortunately I cannot publish the original designs, but a simplified version of one of those systems is provided on OpenCores as a demonstration of the method.

For example, I have a block for finding the maximum value among a certain number of inputs. It is a tree built from elementary comparators. When looking for the optimal implementation (in terms of resource usage and maximum clock frequency) I have to select the number of values compared simultaneously in such a basic comparator. My implementation automatically adjusts the number of stages to the number of inputs of an elementary comparator and of the whole system. Of course the number of stages affects the latency (the delay in clock cycles). There are many such blocks, and each may be adjusted independently.

Trying to keep the design adjusted properly (in the sense that all latencies in parallel pipelines are equal) is really difficult and error-prone. So that's why I needed a tool which does it for me. Of course I have to analyze the results, and sometimes introduce manual corrections...

Does it answer the question above?

Regards,
Wojtek
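To show why the latency of such a tree moves around during tuning, here is a small Python sketch. It assumes one pipeline register per tree level, which the post does not state, and the function name `tree_levels` and the input counts are made up for illustration.

```python
def tree_levels(n_inputs, fan_in):
    """Number of comparator levels needed to reduce `n_inputs` values to
    one, when each elementary comparator handles `fan_in` values."""
    if n_inputs < 1 or fan_in < 2:
        raise ValueError("need n_inputs >= 1 and fan_in >= 2")
    levels = 0
    remaining = n_inputs
    while remaining > 1:
        # Each level groups the remaining values into comparators of
        # `fan_in` inputs; a partly filled comparator still counts.
        remaining = -(-remaining // fan_in)  # ceiling division
        levels += 1
    return levels


# Latency in clocks, assuming one register per level (an assumption).
for fan_in in (2, 4, 8):
    print(fan_in, tree_levels(64, fan_in))
```

For 64 inputs this gives 6, 3 and 2 levels for comparators with 2, 4 and 8 inputs. Every change of that one parameter therefore shifts the block's latency, and every parallel path has to be re-aligned to match.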
>Trying to keep the design adjusted properly (in the sense that all
>latencies in parallel pipelines are equal) is really difficult and
>error-prone.
>So that's why I needed a tool which does it for me.
>Of course I have to analyze the results, and sometimes introduce manual
>corrections...
>Does it answer the question above?
So in short, you regenerate some components with a new latency, different from the intended and tested one. I would just balance the latency manually and run the test. I don't see much practical scope for automating such a change.

Kaz