comp.arch.fpga | Automatic latency balancing in VHDL-implemented complex pipelined systems

Hi,
Recently I have spent a lot of time developing quite complex high-speed data processing systems in FPGAs. They all had a pipelined architecture, and data were processed in parallel in multiple pipelines with different latencies.

The worst thing was that those latencies kept changing during development. For example, some operations were performed by blocks with a tree structure, so the number of levels depended on the number of inputs handled by each node. I varied the number of inputs per node to find an acceptable balance between the number of levels and the maximum clock speed. I also had to add some pipeline registers to improve timing.
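
To make the dependence concrete, here is a minimal Python sketch (not taken from the project) of how the number of levels in such a tree follows from the number of inputs per node. It assumes one register stage per level; the function name `tree_levels` is hypothetical.

```python
def tree_levels(n_inputs: int, fan_in: int) -> int:
    """Number of node levels needed to reduce n_inputs signals to one
    output when every node accepts at most fan_in inputs.

    Assumption: each level adds one pipeline stage, so the result is
    also the latency of the tree in clock cycles.
    """
    if n_inputs < 1 or fan_in < 2:
        raise ValueError("need n_inputs >= 1 and fan_in >= 2")
    levels = 0
    remaining = n_inputs
    while remaining > 1:
        # Each level groups the remaining signals fan_in at a time.
        remaining = -(-remaining // fan_in)  # ceiling division
        levels += 1
    return levels


if __name__ == "__main__":
    # Latency of a 64-input tree for three candidate node widths.
    for fan_in in (2, 4, 8):
        print(fan_in, tree_levels(64, fan_in))
```

For 64 inputs the latency is 6, 3 or 2 cycles for 2-, 4- or 8-input nodes, so every change of node width shifts the arrival time of the result relative to all parallel paths.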

The designs were written entirely in pure VHDL, so I had to adjust latencies manually to ensure that data coming from different paths arrived at the next block in the same clock cycle. It was a real nightmare, so I dreamed of an automated way to ensure proper equalization of latencies.
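
The manual adjustment described above amounts to padding every path up to the slowest one. A minimal Python sketch of that arithmetic follows; the path names and latencies are made up for illustration, and this is not the code from the project.

```python
from typing import Dict


def balancing_delays(path_latencies: Dict[str, int]) -> Dict[str, int]:
    """Extra register stages to append to each parallel path so that
    all of them reach the next block in the same clock cycle."""
    if not path_latencies:
        return {}
    # The slowest path sets the common arrival time.
    target = max(path_latencies.values())
    return {name: target - lat for name, lat in path_latencies.items()}


if __name__ == "__main__":
    # Hypothetical latencies, in clock cycles, of three parallel paths.
    paths = {"adder_tree": 6, "filter": 4, "bypass": 1}
    print(balancing_delays(paths))
```

Whenever one latency changes (say the tree goes from 6 to 3 levels), every padding value has to be recomputed, which is exactly the bookkeeping that is easy to get wrong by hand.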

After some work I have developed a solution which I'd like to share with the community. It is available under the BSD license on the OpenCores website: http://opencores.org/project,lateq . A paper with a detailed description is available on arXiv: http://arxiv.org/abs/1509.08111 .

I'd appreciate any comments.
I hope that the proposed method will be useful to others.

With best regards,
Wojtek

Reply by glen herrmannsfeldt ● September 29, 2015

wzab01@gmail.com wrote:

> Last time I have spent a lot of time on development of quite 
> complex high speed data processing systems in FPGA. 
> They all had pipeline architecture, and data were processed in 
> parallel in multiple  pipelines with different latencies.

> The worst thing was that those latencies were changing 
> during development. For example some operations were 
> performed by blocks with tree structure, so the number of 
> levels depended on number of inputs handled by each node. 
> The number of inputs in each node was varied to find the 
> acceptable balance between the number of levels and maximum 
> clock speed. I also had to add some pipeline registers to 
> improve timing.

I have heard that some synthesis software now knows how to move
around pipeline registers to optimize timing. I haven't tried
using the feature yet, though.  

I think it can move registers, but maybe not add them. You might
need enough registers in place for it to move them around.

I used to work on systolic arrays, which are really just very long
(hundreds or thousands of stages) pipelines. It is pretty hard to
hand-optimize pipelines that long.

-- glen

Reply by ● September 29, 2015

On Tuesday, 29 September 2015 07:49:09 UTC+1, glen herrmannsfeldt wrote:
> wzab01@gmail.com wrote:
>
> > Last time I have spent a lot of time on development of quite
> > complex high speed data processing systems in FPGA.
> > They all had pipeline architecture, and data were processed in
> > parallel in multiple  pipelines with different latencies.
>
> > The worst thing was that those latencies were changing
> > during development. For example some operations were
> > performed by blocks with tree structure, so the number of
> > levels depended on number of inputs handled by each node.
> > The number of inputs in each node was varied to find the
> > acceptable balance between the number of levels and maximum
> > clock speed. I also had to add some pipeline registers to
> > improve timing.
>
> I have heard that some synthesis software now knows how to move
> around pipeline registers to optimize timing. I haven't tried
> using the feature yet, though.
>
> I think it can move registers, but maybe not add them. You might
> need enough registers in place for it to move them around.
>
> I used to work on systolic arrays, which are really just very long
> (hundred or thousands of stages) pipelines. It is pretty hard to
> hand optimize them that long.
>

Yes, of course the pipeline registers may be moved (e.g. using the "retiming" feature). I usually keep this option switched on for implementation.
My method only ensures that the number of pipeline stages is the same in all parallel paths. Keeping track of that was really a huge problem in bigger designs.
--
Wojtek

Reply by kaz ● September 29, 2015

>On Tuesday, 29 September 2015 07:49:09 UTC+1, glen herrmannsfeldt wrote:
>> wzab01@gmail.com wrote:
>>
>> > Last time I have spent a lot of time on development of quite
>> > complex high speed data processing systems in FPGA.
>> > They all had pipeline architecture, and data were processed in
>> > parallel in multiple  pipelines with different latencies.
>>
>> > The worst thing was that those latencies were changing
>> > during development. For example some operations were
>> > performed by blocks with tree structure, so the number of
>> > levels depended on number of inputs handled by each node.
>> > The number of inputs in each node was varied to find the
>> > acceptable balance between the number of levels and maximum
>> > clock speed. I also had to add some pipeline registers to
>> > improve timing.
>>
>> I have heard that some synthesis software now knows how to move
>> around pipeline registers to optimize timing. I haven't tried
>> using the feature yet, though.
>>
>> I think it can move registers, but maybe not add them. You might
>> need enough registers in place for it to move them around.
>>
>> I used to work on systolic arrays, which are really just very long
>> (hundred or thousands of stages) pipelines. It is pretty hard to
>> hand optimize them that long.
>>
>
>Yes, of course the pipeline registers may be moved (e.g. using the
>"retiming" feature). I usually keep this option switched on for
>implementation.
>My method only ensures, that the number of pipeline stages is the same in
>all parallel paths. And keeping track of that was really a huge problem in
>bigger designs.
>--
>Wojtek

I'm not sure why you expect the tool to do what you should do yourself,
and to do so for the simulation tool as well. How can you simulate a design
into which synthesis will put registers for you?

Kaz

Reply by ● September 29, 2015

On Tuesday, 29 September 2015 11:50:53 UTC+1, kaz wrote:
> >On Tuesday, 29 September 2015 07:49:09 UTC+1, glen herrmannsfeldt wrote:
> >> wzab01@gmail.com wrote:
> >>
> >> > Last time I have spent a lot of time on development of quite
> >> > complex high speed data processing systems in FPGA.
> >> > They all had pipeline architecture, and data were processed in
> >> > parallel in multiple  pipelines with different latencies.
> >>
> >> > The worst thing was that those latencies were changing
> >> > during development. For example some operations were
> >> > performed by blocks with tree structure, so the number of
> >> > levels depended on number of inputs handled by each node.
> >> > The number of inputs in each node was varied to find the
> >> > acceptable balance between the number of levels and maximum
> >> > clock speed. I also had to add some pipeline registers to
> >> > improve timing.
> >>
> >> I have heard that some synthesis software now knows how to move
> >> around pipeline registers to optimize timing. I haven't tried
> >> using the feature yet, though.
> >>
> >> I think it can move registers, but maybe not add them. You might
> >> need enough registers in place for it to move them around.
> >>
> >> I used to work on systolic arrays, which are really just very long
> >> (hundred or thousands of stages) pipelines. It is pretty hard to
> >> hand optimize them that long.
> >>
> >
> >Yes, of course the pipeline registers may be moved (e.g. using the
> >"retiming" feature). I usually keep this option switched on for
> >implementation.
> >My method only ensures, that the number of pipeline stages is the same in
> >all parallel paths. And keeping track of that was really a huge problem in
> >bigger designs.
> >--
> >Wojtek
>
> Not sure why you expect the tool to do what you should do and do so for
> simulation tool. How can you you simulate a design that synthesis will put
> for you registers?
>

The tool is supposed to ensure that the appropriate number of registers is added.
With a high-level parametrized description it is really difficult to avoid mistakes, so an automated tool is preferable.
The registers are inserted not only for synthesis, but also for simulation.
I hope that my preprint explains both the motivation and the implementation more clearly.

Regards,
Wojtek

Reply by Tim Wescott ● September 29, 2015

On Tue, 29 Sep 2015 06:49:02 +0000, glen herrmannsfeldt wrote:

> wzab01@gmail.com wrote:
> 
>> Last time I have spent a lot of time on development of quite complex
>> high speed data processing systems in FPGA.
>> They all had pipeline architecture, and data were processed in parallel
>> in multiple  pipelines with different latencies.
>  
>> The worst thing was that those latencies were changing during
>> development. For example some operations were performed by blocks with
>> tree structure, so the number of levels depended on number of inputs
>> handled by each node.
>> The number of inputs in each node was varied to find the acceptable
>> balance between the number of levels and maximum clock speed. I also
>> had to add some pipeline registers to improve timing.
> 
> I have heard that some synthesis software now knows how to move around
> pipeline registers to optimize timing. I haven't tried using the feature
> yet, though.

I knew about this sort of thing ten years ago, although I've never used 
it (for FPGA I'm mostly an armchair coach).

At the time that my FPGA friends were rhapsodizing about it, the designer 
still needed to specify the total delay, but the tools took the 
responsibility for distributing it.

It makes sense to do it that way, because you're the one that has to 
decide how much delay is right, and who has to make sure that the timing 
for section A matches the timing for section B -- for the moment at least 
that's really beyond the tool's ability to cope.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Reply by rickman ● September 29, 2015

On 9/29/2015 4:22 PM, Tim Wescott wrote:
> On Tue, 29 Sep 2015 06:49:02 +0000, glen herrmannsfeldt wrote:
>
>> wzab01@gmail.com wrote:
>>
>>> Last time I have spent a lot of time on development of quite complex
>>> high speed data processing systems in FPGA.
>>> They all had pipeline architecture, and data were processed in parallel
>>> in multiple  pipelines with different latencies.
>>
>>> The worst thing was that those latencies were changing during
>>> development. For example some operations were performed by blocks with
>>> tree structure, so the number of levels depended on number of inputs
>>> handled by each node.
>>> The number of inputs in each node was varied to find the acceptable
>>> balance between the number of levels and maximum clock speed. I also
>>> had to add some pipeline registers to improve timing.
>>
>> I have heard that some synthesis software now knows how to move around
>> pipeline registers to optimize timing. I haven't tried using the feature
>> yet, though.
>
> I knew about this sort of thing ten years ago, although I've never used
> it (for FPGA I'm mostly an armchair coach).
>
> At the time that my FPGA friends were rhapsodizing about it, the designer
> still needed to specify the total delay, but the tools took the
> responsibility for distributing it.
>
> It makes sense to do it that way, because you're the one that has to
> decide how much delay is right, and who has to make sure that the timing
> for section A matches the timing for section B -- for the moment at least
> that's really beyond the tool's ability to cope.

I'm not picturing the model you are describing.  If all sections have 
the same clock, they all have the same timing constraint, no?  As to the 
tools distributing the delays, again, each stage has the same timing 
constraint so unless there are complications such as inputs with 
separately specified delays, the tool just has to move logic across 
register boundaries to make each section meet the timing spec or better 
to balance all the delays in case you wish to have the fastest possible 
clock rate.

Maybe by timing you mean the clock cycles the OP is talking about?

-- 

Rick

Reply by Tim Wescott ● September 29, 2015

On Tue, 29 Sep 2015 16:41:08 -0400, rickman wrote:

> On 9/29/2015 4:22 PM, Tim Wescott wrote:
>> On Tue, 29 Sep 2015 06:49:02 +0000, glen herrmannsfeldt wrote:
>>
>>> wzab01@gmail.com wrote:
>>>
>>>> Last time I have spent a lot of time on development of quite complex
>>>> high speed data processing systems in FPGA.
>>>> They all had pipeline architecture, and data were processed in
>>>> parallel in multiple  pipelines with different latencies.
>>>
>>>> The worst thing was that those latencies were changing during
>>>> development. For example some operations were performed by blocks
>>>> with tree structure, so the number of levels depended on number of
>>>> inputs handled by each node.
>>>> The number of inputs in each node was varied to find the acceptable
>>>> balance between the number of levels and maximum clock speed. I also
>>>> had to add some pipeline registers to improve timing.
>>>
>>> I have heard that some synthesis software now knows how to move around
>>> pipeline registers to optimize timing. I haven't tried using the
>>> feature yet, though.
>>
>> I knew about this sort of thing ten years ago, although I've never used
>> it (for FPGA I'm mostly an armchair coach).
>>
>> At the time that my FPGA friends were rhapsodizing about it, the
>> designer still needed to specify the total delay, but the tools took
>> the responsibility for distributing it.
>>
>> It makes sense to do it that way, because you're the one that has to
>> decide how much delay is right, and who has to make sure that the
>> timing for section A matches the timing for section B -- for the moment
>> at least that's really beyond the tool's ability to cope.
> 
> I'm not picturing the model you are describing.  If all sections have
> the same clock, they all have the same timing constraint, no?  As to the
> tools distributing the delays, again, each stage has the same timing
> constraint so unless there are complications such as inputs with
> separately specified delays, the tool just has to move logic across
> register boundaries to make each section meet the timing spec or better
> to balance all the delays in case you wish to have the fastest possible
> clock rate.
> 
> Maybe by timing you mean the clock cycles the OP is talking about?

The way I've seen it, rather than carefully hand-designing a pipeline, 
you just design a system that's basically

            .---------------------.     .-------.
 data in -->| combinatorial logic |---->| delay |----> data out
            '---------------------'     '-------'

where the "delay" block just delays all the outputs from the 
combinatorial block by some number of clocks.

Then you tell the tool "move delays as you see fit", and it magically 
distributes the delay in a hopefully-optimal way within the combinatorial 
logic, making it pipelined.
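
As a rough illustration of what "distributing the delay" means, here is a toy Python model. It is only a sketch: it assumes the logic is a single linear chain of elements with known integer delays, whereas real tools work on arbitrary netlists with vendor-specific timing data. The function name and the delay values are hypothetical.

```python
from typing import List


def min_clock_period(delays: List[int], stages: int) -> int:
    """Smallest worst-case stage delay achievable when a chain of logic
    elements is cut into at most `stages` contiguous segments, each
    ending in a register (toy model of register retiming)."""
    if stages < 1 or not delays:
        raise ValueError("need at least one stage and one logic element")

    def stages_needed(limit: int) -> int:
        # Greedily pack elements into a stage until the limit is exceeded.
        count, acc = 1, 0
        for d in delays:
            if acc + d > limit:
                count += 1
                acc = d
            else:
                acc += d
        return count

    # Binary search on the period: it cannot be below the slowest single
    # element and never needs to exceed the unpipelined total.
    lo, hi = max(delays), sum(delays)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= stages:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    chain = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up element delays
    for n in (1, 2, 3):
        print(n, min_clock_period(chain, n))
```

With the made-up chain above, one stage gives a period of 31, two stages 17 and three stages 14. Note that the total latency in clock cycles is still whatever the designer asked for; the tool only chooses where the registers sit.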

As I said, I've never done it -- I couldn't even tell you what search 
terms to use to find out what the tool vendors call the process.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Reply by glen herrmannsfeldt ● September 29, 2015

Tim Wescott <seemywebsite@myfooter.really> wrote:

(snip, I wrote)

>> I have heard that some synthesis software now knows how to move around
>> pipeline registers to optimize timing. I haven't tried using the feature
>> yet, though.

> I knew about this sort of thing ten years ago, although I've never used 
> it (for FPGA I'm mostly an armchair coach).

> At the time that my FPGA friends were rhapsodizing about it, the designer 
> still needed to specify the total delay, but the tools took the 
> responsibility for distributing it.

Some time ago, and before I knew about this, I was working on designs
for some very long pipelines, thousands of steps.  Each step is
fairly simple, and all are alike (except for data values). 

I figured that in an FPGA, the pipeline would go across the array,
then down and across backwards, until it got to the end.

I then figured that the delay at the end, where it turned around to
go back, would be longer than other delays, but didn't know how to
modify my code.

As with many pipelines, I can add registers to all the signals without 
affecting the results, though they will come out a little later.
But where to add the registers?

It turned out to be too expensive, so never got built, or even close.
Sometime later, I learned about this feature, but never went back
to try it.

> It makes sense to do it that way, because you're the one that has to 
> decide how much delay is right, and who has to make sure that the timing 
> for section A matches the timing for section B -- for the moment at least 
> that's really beyond the tool's ability to cope.

One could put in sets of optional registers, such that either all or
none of a set get implemented. That might not be so hard, but you
do need a way to say it.

-- glen

Reply by ● September 29, 2015

On Tuesday, 29 September 2015 21:41:26 UTC+1, rickman wrote:
> On 9/29/2015 4:22 PM, Tim Wescott wrote:
> > On Tue, 29 Sep 2015 06:49:02 +0000, glen herrmannsfeldt wrote:
> >
> >> wzab01@gmail.com wrote:
> >>
> >>> Last time I have spent a lot of time on development of quite complex
> >>> high speed data processing systems in FPGA.
> >>> They all had pipeline architecture, and data were processed in parallel
> >>> in multiple  pipelines with different latencies.
> >>
> >>> The worst thing was that those latencies were changing during
> >>> development. For example some operations were performed by blocks with
> >>> tree structure, so the number of levels depended on number of inputs
> >>> handled by each node.
> >>> The number of inputs in each node was varied to find the acceptable
> >>> balance between the number of levels and maximum clock speed. I also
> >>> had to add some pipeline registers to improve timing.
> >>
> >> I have heard that some synthesis software now knows how to move around
> >> pipeline registers to optimize timing. I haven't tried using the feature
> >> yet, though.
> >
> > I knew about this sort of thing ten years ago, although I've never used
> > it (for FPGA I'm mostly an armchair coach).
> >
> > At the time that my FPGA friends were rhapsodizing about it, the designer
> > still needed to specify the total delay, but the tools took the
> > responsibility for distributing it.
> >
> > It makes sense to do it that way, because you're the one that has to
> > decide how much delay is right, and who has to make sure that the timing
> > for section A matches the timing for section B -- for the moment at least
> > that's really beyond the tool's ability to cope.
> 
> I'm not picturing the model you are describing.  If all sections have 
> the same clock, they all have the same timing constraint, no?  As to the 
> tools distributing the delays, again, each stage has the same timing 
> constraint so unless there are complications such as inputs with 
> separately specified delays, the tool just has to move logic across 
> register boundaries to make each section meet the timing spec or better 
> to balance all the delays in case you wish to have the fastest possible 
> clock rate.
> 
> Maybe by timing you mean the clock cycles the OP is talking about?
> 
> -- 
> 
> Rick

The problem I'm dealing with concerns only the number of clock cycles by which the data in each data path are delayed.

The equal distribution of delay between pipeline stages is so technology-specific that it probably must be handled by vendor-provided tools, and in fact it usually is. In the old Xilinx tools it was called "register balancing"; in Altera tools and in the new Xilinx tools it is "register retiming".

So my problem is not that complex. And yes, it was solved in GUI-based tools many years ago.
In the old Xilinx System Generator there was a special "sync" block which did that. See Fig. 4 in my old paper from 2003 ( http://tesla.desy.de/new_pages/TESLA_Reports/2003/pdf_files/tesla2003-05.pdf ).

The importance of the problem is still emphasized by the vendors of block-based
tools (e.g. http://www.mathworks.com/help/hdlcoder/examples/delay-balancing-and-validation-model-workflow-in-hdl-coder.html ).

However, I've never seen a tool like this for designs written in pure HDL
rather than composed from blocks in a GUI-based tool...

I have found that for designs whose pipeline lengths depend on different parameters and which are interconnected in a complex way, there is a real need for a tool for automatic verification, or even better for automatic adjustment, of those lengths.
Without that you can easily end up with an incorrect design which processes data misaligned in time.
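
To show the kind of bookkeeping involved, here is a minimal Python sketch of latency checking and adjustment on a block diagram. It is not the method from the paper or the OpenCores project, only an illustration. It assumes the design is an acyclic graph of blocks with known latencies and that all primary inputs are aligned at cycle 0; all block names are hypothetical.

```python
from typing import Dict, List, Tuple


def balance(latency: Dict[str, int],
            inputs: Dict[str, List[str]]
            ) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int]]:
    """Return (output time of each block, extra delay per connection).

    latency[b] is the latency of block b in clock cycles;
    inputs[b] lists the blocks feeding b. The graph must be acyclic.
    """
    out_time: Dict[str, int] = {}
    extra: Dict[Tuple[str, str], int] = {}

    def visit(block: str) -> int:
        if block in out_time:
            return out_time[block]
        # Cycle at which each input of this block becomes valid.
        arrivals = {src: visit(src) for src in inputs.get(block, [])}
        start = max(arrivals.values(), default=0)
        for src, t in arrivals.items():
            # Registers to add on this connection to align it with the latest input.
            extra[(src, block)] = start - t
        out_time[block] = start + latency[block]
        return out_time[block]

    for block in latency:
        visit(block)
    return out_time, extra


if __name__ == "__main__":
    latency = {"src": 0, "tree": 3, "bypass": 1, "merge": 1}
    inputs = {"tree": ["src"], "bypass": ["src"], "merge": ["tree", "bypass"]}
    out_time, extra = balance(latency, inputs)
    print(out_time)
    print(extra)
```

Used for verification, a design is correctly aligned when every value in `extra` is zero; used for adjustment, the non-zero entries say how many registers to insert on each connection. In the example the `bypass` to `merge` connection needs 2 extra stages.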

So that was the motivation.
Sorry if my original post was somewhat misleading.

Regards,
Wojtek
