FPGARelated.com
Forums

Memory Latency

Started by Ghostboy December 19, 2009
Hi,

The memory on a fpga is limited. But the latency of external memory is
still a bottleneck. Especially because the local bus has to share its
resources. Anybody who knows a development board/fpga that has a reduced
latency of the external memory? 	   
					
---------------------------------------		
This message was sent using the comp.arch.fpga web interface on
http://www.FPGARelated.com
Ghostboy wrote:
> Hi,
>
> The memory on a fpga is limited. But the latency of external memory is
> still a bottleneck. Especially because the local bus has to share its
> resources. Anybody who knows a development board/fpga that has a reduced
> latency of the external memory?

Hi,

you don't specify what "reduced latency" is,
reduced compared to what ? What is your goal ?
Can you speak a bit about your application ?

BTW latency is (roughly) a function of
  - the price of the components
  - their age
  - the required capacity
and all 3 are very closely related.

If you only need a few megabytes,
some parts (quite expensive) are very fast :
there are synchronous SRAMs, some with dual data rate,
used in the telecom industry, that go above 200MHz
in pipelined mode. Some recent Altera/Xilinx go
even faster on expensive reference boards, IIRC.
Look at these manufacturers : ISSI, GSI, IDT, Cypress,...

Example of an interesting part that I found with
a broker : GSI's GS8322Z36B-225 has 1M words of 36 bits,
capable of 225MHz (cycle time below 5ns). Some newer
parts are even faster (350MHz ?) and have dedicated
data buses for read and write. Now, the price may be
a problem, not only the part itself but also the
PCB technology that the BGA packaging requires...

If you need a ready-made solution, it's going to cost
you what you'll get (a lot). But there are probably
many ways to turn your problem around so it does not
kill your budget : for example if your application
can be designed to use cache optimisations, strip-mining
or space-time locality, then your on-chip SRAM could be enough.
But I suppose that if large SRAMs exist, it is because
not all algorithms can be tweaked this way :-)

hope it helps,
yg
--
http://ygdes.com / http://yasep.org
On Dec 20, 2:35 am, whygee <y...@yg.yg> wrote:
latency depends on memory TYPE mostly
0 latency (async) memory costs MUCH more than dynamic memory per bit

Antti
Thank you.

I'll tell you a little bit more. The development board I have now is the XUPV2
with the Virtex-2 Pro XC2VP30 FPGA. The purpose of the project is to make
video processing possible in real-time and at high resolutions, so the
internal memory will not suffice. But I heard that the latency of the local
bus is too high to make that possible. So I thought somebody could advise me
on another board or FPGA. Or am I just badly informed about those latencies?

					
Antti wrote:
> latency depends on memory TYPE mostly
yes.
> 0 latency (async) memory
there cannot be "0 latency" :-) it's better to measure this in nanoseconds.
> costs MUCH more than dynamic memory per bit
sure. it's always a trade-off... and the reduced cost of DRAM is offset by the complexity of driving its control signals :-/
> Antti

yg

PS: is your server back online ? and your email ?
--
http://ygdes.com / http://yasep.org
On Dec 20, 9:48 am, "Ghostboy" <Ghost...@dommel.be> wrote:
> The purpose of the project is to make
> video processing possible in real-time and at high resolutions, so the
> internal memory will not suffice.

Video processing typically doesn't need access to large memories in
order to do the processing. The processing operations are relatively
local. For that, I'm sure you'll find that the Virtex has sufficient
memory. The larger, slower, external memory is used for the bulk
storage. In order to process the data, you move it from the external
memory into the internal memory, process it and store it back in the
external memory.

In that scenario, memory latency is generally not an issue, only clock
frequency.

> But I heard that the latency of the local
> bus is too high to make that possible. So I thought somebody could advise me
> on another board or FPGA. Or am I just badly informed about those latencies?

You seem to be misinformed about the requirements that you have for
your own video processing and how one would implement that function in
hardware.

Kevin Jennings
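To illustrate the point about local processing, here is a minimal software sketch, not a hardware design. It assumes a 3x3 box filter as the processing step (the thread does not say which operations are needed), and the names `box_filter_3x3_streaming` and `line_buffer_bytes` are made up for this example. Each row is fetched once from bulk storage, and only three rows are held at any time, which is roughly what on-chip line buffers would do.

```python
from collections import deque


def box_filter_3x3_streaming(rows, width):
    """Yield 3x3 box-filtered rows while holding only three input rows.

    `rows` is any iterable of pixel rows and stands in for sequential
    reads from external memory. The deque stands in for three on-chip
    line buffers. Border pixels are skipped to keep the sketch short.
    """
    window = deque(maxlen=3)
    for row in rows:  # every row is read from bulk storage exactly once
        if len(row) != width:
            raise ValueError("row width mismatch")
        window.append(row)
        if len(window) < 3:
            continue  # not enough rows buffered yet
        top, mid, bot = window
        out = []
        for x in range(1, width - 1):
            total = (sum(top[x - 1:x + 2])
                     + sum(mid[x - 1:x + 2])
                     + sum(bot[x - 1:x + 2]))
            out.append(total // 9)
        yield out


def line_buffer_bytes(width, lines=3, bytes_per_pixel=3):
    """On-chip storage needed for `lines` buffered rows of the frame."""
    return width * lines * bytes_per_pixel


# Tiny 4x5 test frame, one value per pixel for readability.
frame = [[r * 10 + c for c in range(5)] for r in range(4)]
filtered = list(box_filter_3x3_streaming(iter(frame), 5))
print(filtered)
print(line_buffer_bytes(1024), "bytes for three 1024-pixel RGB lines")
```

Under these assumptions, three lines of a 1024-pixel-wide 24-bit frame take 9216 bytes of buffer, while the full frame stays in external memory and is only streamed through.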
Thanks
But what if I need to buffer a frame with a resolution of 1024*768 and it
will constantly be updated?



Hi,

Ghostboy wrote:
> Thanks
> But what if I need to buffer a frame with a resolution of 1024*768 and it
> will constantly be updated?

You seem to be confusing latency, access time and bandwidth.
Video streams are fairly stable, so latency issues can easily be
"masked", "shadowed", pipelined... given some FIFOs here and there
for example.

Now, the speed is another issue, but it can easily be overcome:
just do the math and read the datasheets.

1024*768*3 (assuming 888 RGB) = 2.3MB
If you need double buffering (if you can't exploit some smart-ass
pointer techniques) then the requirement is about 5M bytes.
With a kit that has 8MB of RAM, you're fairly safe.

Now, the bandwidth: let's say you need to read and write
the whole buffer for every frame, 30 times per second:
2.3MB*2*30 = 141MB/s
That's more than what a PCI bus can handle (a 32-bit, 33MHz PCI bus
peaks at about 133MB/s), but on a 32-bit wide bus it only amounts
to a 35MHz transfer rate (141MB/s divided by 4 bytes per transfer).
Account for refresh cycles, bus turnaround cycles, blanking times,
inefficient packing of the RGB components (0RGB aligned to 32-bit
boundaries) and other stuff, so you need around 60MHz.
Most decent and recent kits should do this out of the box.

Even this old board
http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150398039709
(I'm not the seller) has 6MB of fast asynchronous SRAM
nicely organised in 24-bit words, so no byte is wasted
on 32-bit alignment. It can store a bit more than
2 frames of 768Kpx. Access time is 33ns, so let's say
it can work at pixel speed (27 or 30MHz) with 96 bits
each time: at 30MHz that's 360M bytes per second.
With no refresh cycle, no burst, no address multiplex
and no DRAM bank to manage...

OK I cheat: this board is obviously designed for video
applications. And I have not checked the schematics.
But if only I could make Altera's tool work on my computer... :-(

Have fun,
yg
--
http://ygdes.com / http://yasep.org
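The arithmetic in the post above can be redone with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for this example, MB means 10^6 bytes, and the overhead factors mentioned in the post (refresh, bus turnaround, blanking, 0RGB padding) are deliberately left out, so the results are lower bounds.

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def required_bandwidth(frame_size, fps, accesses_per_frame=2):
    """Bytes per second when every frame is read and written once."""
    return frame_size * accesses_per_frame * fps


def min_transfer_rate_hz(bandwidth, bus_width_bits):
    """Transfers per second a bus of the given width must sustain."""
    return bandwidth / (bus_width_bits / 8)


frame = frame_bytes(1024, 768, 3)        # 888 RGB, 3 bytes per pixel
double_buffer = 2 * frame                # two full frames in memory
bandwidth = required_bandwidth(frame, fps=30)
bus_rate = min_transfer_rate_hz(bandwidth, bus_width_bits=32)

# The SRAM board from the post: 96 bits per access at a 30 MHz pixel rate.
sram_bandwidth = 30e6 * 96 / 8

print(f"frame:          {frame / 1e6:.2f} MB")
print(f"double buffer:  {double_buffer / 1e6:.2f} MB")
print(f"bandwidth:      {bandwidth / 1e6:.1f} MB/s")
print(f"32-bit bus:     {bus_rate / 1e6:.1f} MHz minimum")
print(f"96-bit SRAM:    {sram_bandwidth / 1e6:.0f} MB/s")
```

Changing `fps`, the resolution or `bus_width_bits` shows quickly how much margin a given board leaves before the overheads are taken into account.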
Sorry for the double post,
due to an NNTP server hiccup.

-- 
http://ygdes.com / http://yasep.org