FPGAs and DRAM bandwidth
Started by ●November 7, 2003

How fast can you really get data in and out of an FPGA? With current pin layouts it is possible to hook four (or maybe even five) DDR memory DIMM modules to a single chip.

Let's say you can create memory controllers that run at 200MHz (as claimed in an Xcell article), for a total bandwidth of

  5 (modules/FPGA) * 64 (bits/word) * 200e6 (cycles/sec) * 2 (words/cycle) * (1 byte/8 bits)
  = 5 * 3.2 GB/s = 16 GB/s

Assuming an application that needs more BW than this, does anyone know a way around this bottleneck? Is this a physical limit with current memory technology?

Fernando
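The opening post's arithmetic can be checked with a few lines of Python. This is only a sketch of the peak figure: the function name and its parameters are made up for illustration, and it ignores refresh, bus turnaround, and controller overhead, so sustained throughput would be lower.

```python
def ddr_peak_bandwidth(modules, bus_bits=64, clock_hz=200e6, transfers_per_cycle=2):
    """Peak bytes/second for `modules` DDR DIMMs (hypothetical helper).

    bus_bits            -- data width of one DIMM (64 in the post)
    clock_hz            -- memory controller clock (200 MHz in the post)
    transfers_per_cycle -- 2 for DDR (one word on each clock edge)
    """
    return modules * bus_bits * clock_hz * transfers_per_cycle / 8


per_dimm = ddr_peak_bandwidth(1)
total = ddr_peak_bandwidth(5)
print(f"{per_dimm / 1e9:.1f} GB/s per DIMM, {total / 1e9:.1f} GB/s for 5 DIMMs")
```

Changing `modules` or `clock_hz` shows how the ceiling scales: it is linear in both, which is why the rest of the thread concentrates on pin count and clock rate.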
Reply by ●November 8, 2003
fortiz80@tutopia.com (Fernando) wrote in message news:<2658f0d3.0311071117.3bf6eaea@posting.google.com>...
> Assuming an application that needs more BW than this, does anyone know
> a way around this bottleneck? Is this a physical limit with current
> memory technology?

OTOH

If you want more bandwidth than DDR DRAM, you could go for RamBus, RLDRAM, or the other one, NetRam or whatever it's called.

The RLDRAM devices separate the I/Os for pure bandwidth, with no turning the bus or clock around, and reduce latency from the 60-80ns range down to 20ns or so, which is the true RAS cycle.

Micron & Infineon do the RLDRAM; another group does the NetRam (Hynix, Samsung maybe).

The RLDRAM can run the bus up to 400MHz, double pumped to 800MHz, and can use almost every cycle to move data (2x) and receive control (1x). It is banked 8 ways, so a new true random access can start every 2.5ns, with each individual bank accepting one access every 20ns. The architecture supports 8, 16, and 32-36 bit wide I/Os IIRC. Sizes are 256M now. I was quoted a price of about $20 something, cheap for the speed but far steeper than PC RAM. Data can come out in 1, 2, or 4 words per address. Think I got all that right. Details on Micron.com.

I was told there are Xilinx interfaces for them; I got the docs from Xilinx but haven't digested them yet. They also have interfaces for the RamBus & NetRam. AVNET (??) also has a dev board with a couple of RLDRAM parts connected to a Virtex2 part, but I think these are the 1st gen RLDRAM parts, which are 250MHz with a 25ns cycle, so the interface must work.

Anyway, I only wish my PC could use them. I'd willingly pay mucho $ for a mobo that would use them, but that will never happen. I quite fancy using one for an FPGA cpu, only I could probably keep 8 nested cpus busy, 1 bank each, since the cpus will be far closer to a 20ns cycle than 2.5ns. The interface would then be a mux-demux box on my side. The total BW would far surpass any old P4, but the latency is the most important thing for me.

Hope that helps

johnjakson_usa_com
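The banking figures quoted above (8 banks, a new access every 2.5ns, 20ns per-bank cycle) fit together because 8 * 2.5ns = 20ns. The following is a toy model of that round-robin interleaving, not a model of a real RLDRAM controller; the function name and parameters are hypothetical.

```python
def simulate_round_robin(banks=8, issue_ns=2.5, bank_cycle_ns=20.0, accesses=64):
    """Issue accesses to banks in round-robin order; return total stall time (ns).

    An access stalls if its target bank is still busy with the previous
    access sent to that bank.
    """
    bank_free_at = [0.0] * banks   # time at which each bank can accept a new access
    now = 0.0
    stall = 0.0
    for i in range(accesses):
        b = i % banks
        if bank_free_at[b] > now:          # bank still busy: wait for it
            stall += bank_free_at[b] - now
            now = bank_free_at[b]
        bank_free_at[b] = now + bank_cycle_ns
        now += issue_ns                    # next issue slot
    return stall


print(simulate_round_robin(banks=8))   # full interleave
print(simulate_round_robin(banks=4))   # too few banks to hide the 20ns cycle
```

With 8 banks each bank is revisited exactly when it becomes free, so the bus never idles. With fewer banks, or with an access pattern that keeps hitting the same bank, the 20ns cycle shows through as stalls.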
Reply by ●November 8, 2003
Lots of good points in your reply; here is why I think these technologies don't apply to a problem that requires large and fast memory.

RLDRAM: very promising, but the densities do not seem to increase significantly over time (500Mbits now ~ 64MB). To the best of my knowledge, nobody is making DIMMs with these chips, so they're stuck as cache or network memory.

RDRAM (RAMBUS): as you said, only the slowest parts can be used with FPGAs because of the very high frequency of the serial protocol. The current slowest RDRAMs run at 800 MHz, a forbidden range for FPGAs (Xilinx guys, please jump in and correct me if I'm wrong).

Am I missing something? Are there any ASICs out there that interface memory DIMMs and FPGAs? Is there any way to use the Rocket I/Os to communicate with memory chips? Or maybe a completely different solution to the memory bottleneck not mentioned here?
Reply by ●November 8, 2003
Fernando wrote:
> Assuming an application that needs more BW than this, does anyone know
> a way around this bottleneck? Is this a physical limit with current
> memory technology?

Probably can get a little better. With a 2V8000 in a FF1517 package, there are 1,108 IOs. (!) If we shared address and control lines between banks (timing is easier on these lines), it looks to me like 11 DIMMs could be supported.

  Data pins            64
  DQS pins              8
  CS, CAS, RAS, addr   12 (with sharing)
                     ====
                       92

1108/92 = 11 with 100 pins left over for VTH, VRP, VRN, clock, reset, ...

Of course, the communication to the outside world would also need to go somewhere...

--
Phil Hays
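A quick way to replay this pin budget is sketched below. It takes the post's per-DIMM total of 92 as given, even though the three rows as listed (64 + 8 + 12) sum to 84; the helper names and the idea of a `reserved` pool are illustrative assumptions, not part of the original post.

```python
TOTAL_IO = 1108        # user I/Os quoted for the 2V8000 in the FF1517 package
PINS_PER_DIMM = 92     # total stated in the post (listed rows sum to 84)


def dimms_supported(total_io, pins_per_dimm, reserved=0):
    """How many DIMMs fit after setting aside `reserved` pins (hypothetical helper)."""
    return (total_io - reserved) // pins_per_dimm


def leftover(total_io, pins_per_dimm, dimms):
    """Pins remaining after wiring `dimms` DIMMs."""
    return total_io - dimms * pins_per_dimm


print(dimms_supported(TOTAL_IO, PINS_PER_DIMM))               # no pins reserved
print(leftover(TOTAL_IO, PINS_PER_DIMM, 11))                  # spare pins with 11 DIMMs
print(dimms_supported(TOTAL_IO, PINS_PER_DIMM, reserved=96))  # with that pool reserved
```

Eleven DIMMs leave 96 pins, which matches the "100 pins left over" in round numbers. A twelfth DIMM would fit arithmetically but would leave only 4 pins for reference voltages, clocks, reset, and the link to the outside world.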
Reply by ●November 9, 2003
Sharing the control pins is a good idea; the only thing that concerns me is the PCB layout. This is not my area of expertise, but it seems to me that it would be pretty challenging to put (let's say) 10 DRAM DIMMs and a big FPGA on a single board.

It can get even uglier if symmetric traces are required to each memory sharing the control lines... (not sure if this is required)

Anyway, I'll start looking into it.

Thanks
Reply by ●November 9, 2003
Fernando wrote:
> Phil Hays <SpamPostmaster@attbi.com> wrote in message news:<3FAD39F6.5572E42C@attbi.com>...
>> Probably can get a little better. With a 2V8000 in a FF1517 package,
>> there are 1,108 IOs. (!) If we shared address and control lines between
>> banks (timing is easier on these lines), it looks to me like 11 DIMMs
>> could be supported.

Of course, the 2V8000 is REALLY expensive. I'm sure there is a pricing sweet spot where it makes sense to break it up into multiple smaller parts, providing both more pins and lower cost (something like two 2VP30's or 2VP40's [between the two: 1288 to 1608 I/Os, depending on package]). They could be interconnected using the internal SERDES. The SERDES could also be used for communicating with the outside world.

> Sharing the control pins is a good idea; the only thing that concerns
> me is the PCB layout. This is not my area of expertise, but seems to
> me that it would be pretty challenging to put (let's say) 10 DRAM
> DIMMs and a big FPGA on a single board.

It may be challenging, but that is what is encountered when trying to push the envelope, as it appears you are trying to do. This sometimes entails accepting a bit less design margin to fulfill the requirements in the allotted space or budget. Knowing what you can safely give up, and where you can give it up, requires expertise (and so if you don't have that expertise, you'll need to find someone who does).

If you are really set on meeting the memory requirements, you may need to be open to something besides DIMMs (or perhaps make your own custom DIMMs).

A possible alternative: it looks like Toshiba is in the process of releasing their 512 Mbit FCRAM. It supposedly provides 400 Mbps per data bit (using 200 MHz DDR... not a problem for modern FPGAs).

> It can get even uglier if symmetric traces are required to each memory
> sharing the control lines...(not sure if this is required)

I don't know what tools/budget you have available to you. Cadence allows you to put a bus property on as many nets as you want. You can then constrain all nets that form that bus to be within X% of each other (in terms of length).

Good luck,

Marc
Reply by ●November 9, 2003
In article <2658f0d3.0311090256.21ce5a9a@posting.google.com>, Fernando <fortiz80@tutopia.com> wrote:
>Sharing the control pins is a good idea; the only thing that concerns
>me is the PCB layout. This is not my area of expertise, but seems to
>me that it would be pretty challenging to put (let's say) 10 DRAM
>DIMMs and a big FPGA on a single board.

Simple. Use external registers for the control lines, and drive 4 registers which then drive 4 DIMMs each. Adds a cycle of latency, but so what?

--
Nicholas C. Weaver nweaver@cs.berkeley.edu
Reply by ●November 9, 2003
Fernando wrote:
> Sharing the control pins is a good idea; the only thing that concerns
> me is the PCB layout. This is not my area of expertise, but seems to
> me that it would be pretty challenging to put (let's say) 10 DRAM
> DIMMs and a big FPGA on a single board.

Don't forget simultaneous switching considerations. Driving 640 pins at 200 MHz would probably require a bit of cleverness. Maybe you could run different banks at different phases of the clock. Hopefully your app does not need to write all DIMMs at once.

Jeff
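To put a rough number on the phase-staggering idea: if the DIMMs are split as evenly as possible across several clock phases, the worst-case count of data pins switching on one edge shrinks accordingly. This is only a counting sketch under that assumption (the function name is hypothetical); it says nothing about whether a given package's SSO limits are actually met.

```python
def peak_switching_pins(dimms=10, data_bits=64, phases=1):
    """Worst-case data pins toggling on a single clock phase.

    Assumes DIMMs are distributed across `phases` as evenly as possible
    and that every data pin of a DIMM may switch at once.
    """
    largest_group = -(-dimms // phases)   # ceiling division
    return largest_group * data_bits


for p in (1, 2, 4):
    print(p, "phase(s):", peak_switching_pins(phases=p), "pins")
```

With a single phase all 640 pins can switch together, as in the post; two phases halve that, and four phases bring the worst case down to three DIMMs' worth of pins.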
Reply by ●November 9, 2003
Fernando -

Your instincts are right on with respect to the difficulty of fitting that many DIMMs on a board and interfacing to them from a single FPGA. Forget about it. The bottom line is that there's a trade-off between memory size and speed, and memory is almost always the limiting factor in system throughput. If you need lots of memory then DRAM is probably your best/only option, and the max reasonable throughput is about what you calculated, but even the 5-DIMM 320-bit-wide data bus in your example would be a very tough PCB layout.

If you can partition your memory into smaller fast-path memory and slower bulk memory, then on-chip memory is the fastest you'll find and you can use SDRAM for the bulk. Another option, if you can tolerate latency, is to spread the memory out to multiple PCBs/daughtercards, each with a dedicated memory controller, and use multiple lanes of extremely fast serial I/O between the master and slave memory controllers.

A hierarchy of smaller/faster and larger/slower memories is a common approach, e.g., on-chip core-rate L1 cache, off-chip fast L2 cache, and slower bulk SDRAM in the case of microprocessors. If you tossed out some specific system requirements here you'd probably get some good feedback because this is a common dilemma.

Robert
Reply by ●November 10, 2003
It would seem to me that the idea of using custom "serial DIMMs" combined with Virtex II Pro high speed serial I/O capabilities might be the best way to get a boost in data moving capabilities. This would avoid having to drive hundreds of pins (and related issues) and would definitely simplify board layout.

I haven't done the numbers. I'm just thinking out loud.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Euredjian
To send private email: 0_0_0_0_@pacbell.net where "0_0_0_0_" = "martineu"
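As a first pass at "the numbers" for the serial idea, the sketch below estimates how many serial lanes would be needed to match the parallel DDR figures from the start of the thread. The 3.125 Gb/s line rate and the 8b/10b coding efficiency of 0.8 are assumptions chosen for illustration, not figures given in this thread, and the function name is hypothetical. Protocol overhead and the reverse direction are ignored.

```python
import math


def lanes_needed(target_bytes_per_s, line_rate_bps=3.125e9, coding_efficiency=0.8):
    """Serial lanes (one direction) needed to carry `target_bytes_per_s`.

    line_rate_bps     -- raw line rate per lane (assumed value)
    coding_efficiency -- payload fraction after line coding (0.8 assumes 8b/10b)
    """
    payload_bytes_per_s = line_rate_bps * coding_efficiency / 8
    return math.ceil(target_bytes_per_s / payload_bytes_per_s)


print(lanes_needed(3.2e9))   # lanes to match one 64-bit DDR DIMM at 200 MHz
print(lanes_needed(16e9))    # lanes to match the 5-DIMM figure from the first post
```

Under these assumptions each lane carries about 312.5 MB/s of payload, so matching a single DIMM already takes around eleven lanes. The serial approach trades pin count and layout difficulty for lane count and latency, which is consistent with the caveat about latency raised earlier in the thread.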





