FPGARelated.com
Forums

Multi-FPGA Interconnection: latest techniques

Started by partha sarathy September 24, 2020
Hi Experts,

In FPGA prototyping/emulation flows, multi-FPGA partitioning limits performance due to the limited number of IO pins.
What are the latest multi-FPGA interconnection techniques available today? And how much performance improvement can be expected from using multi-gigabit transceivers?

Thanks in Advance
Parth
partha sarathy <gparthu@gmail.com> wrote:
How much performance do you want? There are transceivers upwards of 56 Gbps these days. Questions:

How many transceivers can you get at that speed?
How do you route an nn Gbps signal from one place to another?
How many transceivers can you successfully route, and at what speed?
How do you make that reliable in the face of bit errors, packet loss and other errors?
What end-to-end bandwidth can you actually achieve?
What latency impact does all that extra processing have?

Relevant paper of mine:
https://www.cl.cam.ac.uk/~atm26/pubs/FPL2014-ClusterInterconnect.pdf

Theo
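To put rough numbers behind the first and last of those questions, a back-of-envelope sketch (the lane count, line rate, efficiencies, pin count and toggle rate below are all assumptions for illustration, not measurements):

# Back-of-envelope: MGT payload bandwidth vs. ordinary parallel IO.
# Every figure here is an assumed example, not a device spec.

def mgt_payload_gbps(lanes, line_rate_gbps, encoding_eff=64.0/66.0, protocol_eff=0.9):
    # usable payload bandwidth across several bonded transceiver lanes
    return lanes * line_rate_gbps * encoding_eff * protocol_eff

def parallel_pin_gbps(pins, toggle_rate_mhz):
    # single-data-rate parallel IO, for comparison
    return pins * toggle_rate_mhz / 1000.0

print(round(mgt_payload_gbps(lanes=16, line_rate_gbps=25.0), 1))     # ~349 Gb/s
print(round(parallel_pin_gbps(pins=400, toggle_rate_mhz=200.0), 1))  # 80 Gb/s

Raw bandwidth clearly favours the transceivers; the rest of the thread is really about what latency and reliability cost on top of that.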
On Thursday, September 24, 2020 at 8:16:21 PM UTC+5:30, Theo wrote:
Hi Theo,

Thanks a lot for the reply.

On a Xilinx UltraScale board with 8 FPGAs, using automatic FPGA partitioning tools (which use muxes for pin multiplexing, i.e. HSTDM multiplexing), the maximum system performance achieved is only 10-15 MHz. An individual FPGA may run at up to 100 MHz, but overall performance is limited to 10-15 MHz because the tool inserts pin muxes with ratios of 8:1, 16:1 and so on.

Is there any interconnect technology that can achieve 70-100 MHz on a 4-8 FPGA board? Whether partitioning is done manually or by an auto-partitioning tool, can the Bluelink interconnect or GTX transceivers achieve 70-100 MHz speeds? Interconnect logic area overhead can be tolerated.
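For what it's worth, a minimal sketch of why 8:1 and 16:1 pin muxing lands in that 10-15 MHz range (the pin bit rate and overhead are assumed figures, not taken from any particular tool or board):

# Why TDM pin muxing caps the system clock: each system-clock period must carry
# tdm_ratio bits per pin, plus mux/demux/sync overhead, all at the pin bit rate.
# The 200 Mb/s pin rate and 4-bit overhead are assumptions for illustration.

def max_system_clock_mhz(pin_rate_mbps, tdm_ratio, overhead_bits=4):
    bit_time_ns = 1000.0 / pin_rate_mbps
    period_ns = (tdm_ratio + overhead_bits) * bit_time_ns
    return 1000.0 / period_ns

for ratio in (4, 8, 16):
    print(ratio, round(max_system_clock_mhz(pin_rate_mbps=200.0, tdm_ratio=ratio), 1))
# 4 -> 25.0 MHz, 8 -> ~16.7 MHz, 16 -> 10.0 MHz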
On Thursday, September 24, 2020 at 12:08:03 PM UTC-4, partha sarathy wrote:
Are you sure you aren't doing something wrong? The purpose of pin muxing would seem to be to increase the data rate. But I assume this will incur pipeline delays. Or do I not understand how this is being used?

--
Rick C.
- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
On Thursday, September 24, 2020 at 11:24:39 PM UTC+5:30, gnuarm.del...@gmail.com wrote:
Hi Rick,

Thanks for the reply with details.

Does the pipeline delay inserted by the gigabit transceiver come to more than 20 ns, say for a 50 MHz FPGA clock?

Best Regards
Parth
On Friday, September 25, 2020 at 10:25:38 PM UTC-4, partha sarathy wrote:
Sorry, I'm not at all clear about what you are doing.

Maybe I misunderstood what you meant by pin muxing. Are they using fewer pins and sending data for multiple signals over each pin? That would definitely slow things down.

Using SERDES (the gigabit transceiver you mention) should speed that up, but might include some pipeline delay. I'm not that familiar with their operation, but I assume you have to parallel load a register that is shifted out at high speed and loaded into a shift register on the receiving end, then parallel loaded into another register to be presented to the rest of the circuitry. If that is how they are working, it would indeed take a full clock cycle of latency.

--
Rick C.
+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
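A small sketch of the latency implied by that parallel-load/shift-out/shift-in picture (the word width, pipeline depths and wire delay are guesses; real transceivers add more stages such as CDR, comma alignment and elastic buffers, as noted later in the thread):

# Latency of the path Rick describes: parallel load -> serialize -> wire ->
# deserialize -> parallel present. All stage counts here are assumptions.

def serdes_latency_ns(line_rate_gbps, word_bits=32,
                      tx_pipe_words=2, rx_pipe_words=2, wire_ns=2.0):
    word_time_ns = word_bits / line_rate_gbps   # time to shift one parallel word
    return (tx_pipe_words + 1 + rx_pipe_words) * word_time_ns + wire_ns

print(round(serdes_latency_ns(line_rate_gbps=10.0), 1))   # ~18 ns with these guesses

Even that optimistic figure is close to the 20 ns budget (one 50 MHz clock period) asked about above, and the ~30-cycle latency quoted further down puts a real MGT well beyond it.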
Rick C <gnuarm.deletethisbit@gmail.com> wrote:
That's right - you get a parallel FIFO interface. There's no guarantee that what you put in will get to the other end reliably (if the BER is 10^-9, say, and your bit rate is 10 Gbps, that's one error every 0.1 s). So to make these kinds of links reliable you need some kind of error correction or retransmission. In the Bluelink case, that added hundreds of ns. Basically you end up with something approaching a full radio stack, just over wires.

Theo
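Theo's arithmetic, spelled out (the second BER value is simply an assumed "good link" figure for comparison):

# Mean time between bit errors for a given BER and line rate.

def seconds_between_errors(ber, line_rate_gbps):
    errors_per_second = ber * line_rate_gbps * 1e9
    return 1.0 / errors_per_second

print(seconds_between_errors(ber=1e-9,  line_rate_gbps=10.0))   # 0.1 s
print(seconds_between_errors(ber=1e-15, line_rate_gbps=10.0))   # 1e5 s, roughly 28 hours

Even at a very good BER, a long emulation run will eventually see errors, hence the CRC/retransmission layer and the hundreds of nanoseconds it adds.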
On Saturday, September 26, 2020 at 8:55:16 AM UTC+5:30, gnuarm.del...@gmail.com wrote:
Hi Rick,

Thanks for the clarifications. It is obvious now that SERDES is not suitable for pin muxing.

Regards
Parth
On Sunday, September 27, 2020 at 9:31:42 AM UTC+5:30, partha sarathy wrote:
Multi-Gigabit Transceiver (MGT): MGTs are configurable hard macros implemented for inter-FPGA communication. The data rate can be as high as ~10 Gbps [MGT, 2014]. Nevertheless, the MGT has a high latency (~30 fast clock cycles) that limits the system clock frequency, and only a few are available. When the TDM ratio is 4, the system clock frequency is ~7 MHz [Tang et al., 2014]. In addition, the communication between MGTs is not error-free; they come with a non-zero bit error rate (BER). Therefore, at this moment, MGTs are not used as the inter-FPGA communication architecture in multi-FPGA prototyping.
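The ~7 MHz figure in that excerpt can be reconstructed roughly as follows, assuming a combinational path cut across two FPGAs must complete one hop through the MGT within a single system-clock period (the clock, word width and margin below are assumed values, not numbers from the cited papers):

# Rough reconstruction of the ~7 MHz ballpark (assumed figures, not from the paper).

fast_clk_mhz   = 312.5    # parallel-side clock of a ~10 Gb/s lane with 32-bit words
mgt_latency_cy = 30       # the ~30 fast-clock cycles quoted above
tdm_ratio      = 4        # words time-multiplexed per system cycle
sync_margin_cy = 10       # assumed word-alignment / clock-crossing margin

period_ns = (mgt_latency_cy + tdm_ratio + sync_margin_cy) * (1000.0 / fast_clk_mhz)
print(round(1000.0 / period_ns, 1), "MHz")   # ~7.1 MHz with these assumptions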
partha sarathy <gparthu@gmail.com> wrote:
It really depends on what you mean by 'prototyping'.

If you have interconnect which is tolerant of latency, such that the system doesn't mind that messages take several cycles to get from one place to another (typical of a network-on-chip implementing, say, AXI), then using MGTs with a reliability layer is fine for functional verification.

If you mean dumping a hairball of an RTL netlist across multiple FPGAs and slowing the clock until everything works in a single cycle, then they're not right for that job.

They're both prototyping, but at different levels of abstraction.

Theo
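The gap between those two styles is easy to put in numbers (the hop latency and user clock below are assumptions, chosen to be in line with the Bluelink discussion above):

# (a) Hairball netlist: every cut combinational path must settle within one
#     system-clock period, so the inter-FPGA hop latency bounds the clock directly.
# (b) Latency-tolerant interconnect (e.g. an AXI-style NoC): the user clock is set
#     by per-FPGA timing closure; cross-chip transactions just take extra cycles.

hop_latency_ns = 300.0    # assumed MGT + reliability layer, one hop
user_clk_mhz   = 100.0    # assumed per-FPGA clock in the NoC-style design

print(round(1000.0 / hop_latency_ns, 1), "MHz")                       # (a) ~3.3 MHz
print(round(hop_latency_ns * user_clk_mhz / 1000.0), "extra cycles")  # (b) ~30 cycles at 100 MHz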