FPGARelated.com
Forums

Multi-FPGA Interconnection: latest techniques

Started by partha sarathy September 24, 2020
Hi Experts,

In FPGA prototyping/emulation flows, multi-FPGA partitioning limits performance due to the limited number of IO pins.
What are the latest multi-FPGA interconnection techniques available today? And how much performance improvement can be expected from using multi-gigabit transceivers?

Thanks in Advance
Parth
partha sarathy <gparthu@gmail.com> wrote:
How much performance do you want? There are transceivers upwards of 56 Gbps these days. Questions:

How many transceivers can you get at that speed?
How do you route an nn Gbps signal from one place to another?
How many transceivers can you successfully route, and at what speed?
How do you make that reliable in the face of bit errors, packet loss and other errors?
What end-to-end bandwidth can you actually achieve?
What latency impact does all that extra processing have?

Relevant paper of mine:
https://www.cl.cam.ac.uk/~atm26/pubs/FPL2014-ClusterInterconnect.pdf

Theo
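To put rough numbers behind the first and last of those questions, a back-of-envelope sketch (the lane count, line rate, efficiencies, pin count and toggle rate below are all assumptions for illustration, not measurements):

# Back-of-envelope: MGT payload bandwidth vs. ordinary parallel IO.
# Every figure here is an assumed example, not a device spec.

def mgt_payload_gbps(lanes, line_rate_gbps, encoding_eff=64.0/66.0, protocol_eff=0.9):
    # usable payload bandwidth across several bonded transceiver lanes
    return lanes * line_rate_gbps * encoding_eff * protocol_eff

def parallel_pin_gbps(pins, toggle_rate_mhz):
    # single-data-rate parallel IO, for comparison
    return pins * toggle_rate_mhz / 1000.0

print(round(mgt_payload_gbps(lanes=16, line_rate_gbps=25.0), 1))     # ~349 Gb/s
print(round(parallel_pin_gbps(pins=400, toggle_rate_mhz=200.0), 1))  # 80 Gb/s

Raw bandwidth clearly favours the transceivers; the rest of the thread is really about what latency and reliability cost on top of that.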
On Thursday, September 24, 2020 at 8:16:21 PM UTC+5:30, Theo wrote:
Hi Theo,

Thanks a lot for the reply.

On a Xilinx UltraScale board with 8 FPGAs, using automatic FPGA partitioning tools (which use muxes for pin multiplexing, i.e. HSTDM multiplexing), the maximum system performance achieved is only 10-15 MHz. An individual FPGA may run at up to 100 MHz, but overall performance is limited to 10-15 MHz because the tool inserts pin muxes with ratios of 8:1, 16:1 and so on.

Is there any interconnect technology that can achieve 70-100 MHz on a 4-8 FPGA board? Whether partitioning is done manually or by an auto-partitioning tool, can the Bluelink interconnect or GTX transceivers achieve 70-100 MHz speeds? Interconnect logic area overhead can be tolerated.
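For what it's worth, a minimal sketch of why 8:1 and 16:1 pin muxing lands in that 10-15 MHz range (the pin bit rate and overhead are assumed figures, not taken from any particular tool or board):

# Why TDM pin muxing caps the system clock: each system-clock period must carry
# tdm_ratio bits per pin, plus mux/demux/sync overhead, all at the pin bit rate.
# The 200 Mb/s pin rate and 4-bit overhead are assumptions for illustration.

def max_system_clock_mhz(pin_rate_mbps, tdm_ratio, overhead_bits=4):
    bit_time_ns = 1000.0 / pin_rate_mbps
    period_ns = (tdm_ratio + overhead_bits) * bit_time_ns
    return 1000.0 / period_ns

for ratio in (4, 8, 16):
    print(ratio, round(max_system_clock_mhz(pin_rate_mbps=200.0, tdm_ratio=ratio), 1))
# 4 -> 25.0 MHz, 8 -> ~16.7 MHz, 16 -> 10.0 MHz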
On Thursday, September 24, 2020 at 12:08:03 PM UTC-4, partha sarathy wrote:
Are you sure you aren't doing something wrong? The purpose of pin muxing would seem to be to increase the data rate. But I assume this will incur pipeline delays. Or do I not understand how this is being used?

--
Rick C.
- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
On Thursday, September 24, 2020 at 11:24:39 PM UTC+5:30, gnuarm.del...@gmail.com wrote:
Hi Rick,

Thanks for the reply with details.

Does the pipeline delay inserted by the gigabit transceiver come to more than 20 ns, say for a 50 MHz FPGA clock?

Best Regards
Parth
On Friday, September 25, 2020 at 10:25:38 PM UTC-4, partha sarathy wrote:
Sorry, I'm not at all clear about what you are doing.

Maybe I misunderstood what you meant by pin muxing. Are they using fewer pins and sending data for multiple signals over each pin? That would definitely slow things down.

Using SERDES (the gigabit transceiver you mention) should speed that up, but might include some pipeline delay. I'm not that familiar with their operation, but I assume you have to parallel load a register that is shifted out at high speed and loaded into a shift register on the receiving end, then parallel loaded into another register to be presented to the rest of the circuitry. If that is how they are working, it would indeed take a full clock cycle of latency.

--
Rick C.
+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
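A small sketch of the latency implied by that parallel-load/shift-out/shift-in picture (the word width, pipeline depths and wire delay are guesses; real transceivers add more stages such as CDR, comma alignment and elastic buffers, as noted later in the thread):

# Latency of the path Rick describes: parallel load -> serialize -> wire ->
# deserialize -> parallel present. All stage counts here are assumptions.

def serdes_latency_ns(line_rate_gbps, word_bits=32,
                      tx_pipe_words=2, rx_pipe_words=2, wire_ns=2.0):
    word_time_ns = word_bits / line_rate_gbps   # time to shift one parallel word
    return (tx_pipe_words + 1 + rx_pipe_words) * word_time_ns + wire_ns

print(round(serdes_latency_ns(line_rate_gbps=10.0), 1))   # ~18 ns with these guesses

Even that optimistic figure is close to the 20 ns budget (one 50 MHz clock period) asked about above, and the ~30-cycle latency quoted further down puts a real MGT well beyond it.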
Rick C <gnuarm.deletethisbit@gmail.com> wrote:
That's right - you get a parallel FIFO interface. There's no guarantee that what you put in will get to the other end reliably (if the BER is 10^-9, say, and your bit rate is 10 Gbps, that's one error every 0.1 s). So to make these kinds of links reliable you need some kind of error correction or retransmission. In the Bluelink case, that added hundreds of ns. Basically you end up with something approaching a full radio stack, just over wires.

Theo
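Theo's arithmetic, spelled out (the second BER value is simply an assumed "good link" figure for comparison):

# Mean time between bit errors for a given BER and line rate.

def seconds_between_errors(ber, line_rate_gbps):
    errors_per_second = ber * line_rate_gbps * 1e9
    return 1.0 / errors_per_second

print(seconds_between_errors(ber=1e-9,  line_rate_gbps=10.0))   # 0.1 s
print(seconds_between_errors(ber=1e-15, line_rate_gbps=10.0))   # 1e5 s, roughly 28 hours

Even at a very good BER, a long emulation run will eventually see errors, hence the CRC/retransmission layer and the hundreds of nanoseconds it adds.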
On Saturday, September 26, 2020 at 8:55:16 AM UTC+5:30, gnuarm.del...@gmail.com wrote:
Hi Rick,

Thanks for the clarifications. It is obvious now that SERDES is not suitable for pin muxing.

Regards
Parth
On Sunday, September 27, 2020 at 9:31:42 AM UTC+5:30, partha sarathy wrote:
Multi-Gigabit Transceiver (MGT): MGTs are configurable hard macros implemented for inter-FPGA communication. The data rate can be as high as ~10 Gbps [MGT, 2014]. Nevertheless, the MGT has a high latency (~30 fast clock cycles) that limits the system clock frequency, and only a few are available. When the TDM ratio is 4, the system clock frequency is ~7 MHz [Tang et al., 2014]. In addition, the communication between MGTs is not error-free; they come with a non-zero bit error rate (BER). Therefore, at this moment, MGTs are not used as the inter-FPGA communication architecture in multi-FPGA prototyping.
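The ~7 MHz figure in that excerpt can be reconstructed roughly as follows, assuming a combinational path cut across two FPGAs must complete one hop through the MGT within a single system-clock period (the clock, word width and margin below are assumed values, not numbers from the cited papers):

# Rough reconstruction of the ~7 MHz ballpark (assumed figures, not from the paper).

fast_clk_mhz   = 312.5    # parallel-side clock of a ~10 Gb/s lane with 32-bit words
mgt_latency_cy = 30       # the ~30 fast-clock cycles quoted above
tdm_ratio      = 4        # words time-multiplexed per system cycle
sync_margin_cy = 10       # assumed word-alignment / clock-crossing margin

period_ns = (mgt_latency_cy + tdm_ratio + sync_margin_cy) * (1000.0 / fast_clk_mhz)
print(round(1000.0 / period_ns, 1), "MHz")   # ~7.1 MHz with these assumptions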
partha sarathy <gparthu@gmail.com> wrote:
It really depends on what you mean by 'prototyping'.

If you have interconnect which is tolerant of latency, such that the system doesn't mind that messages take several cycles to get from one place to another (typical of a network-on-chip implementing, say, AXI), then using MGTs with a reliability layer is fine for functional verification.

If you mean dumping a hairball of an RTL netlist across multiple FPGAs and slowing the clock until everything works in a single cycle, then they're not right for that job.

They're both prototyping, but at different levels of abstraction.

Theo
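The gap between those two styles is easy to put in numbers (the hop latency and user clock below are assumptions, chosen to be in line with the Bluelink discussion above):

# (a) Hairball netlist: every cut combinational path must settle within one
#     system-clock period, so the inter-FPGA hop latency bounds the clock directly.
# (b) Latency-tolerant interconnect (e.g. an AXI-style NoC): the user clock is set
#     by per-FPGA timing closure; cross-chip transactions just take extra cycles.

hop_latency_ns = 300.0    # assumed MGT + reliability layer, one hop
user_clk_mhz   = 100.0    # assumed per-FPGA clock in the NoC-style design

print(round(1000.0 / hop_latency_ns, 1), "MHz")                       # (a) ~3.3 MHz
print(round(hop_latency_ns * user_clk_mhz / 1000.0), "extra cycles")  # (b) ~30 cycles at 100 MHz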