FPGARelated.com
Forums

VGA and framebuffer interface (Waste of BlockRAM)

Started by Isaac Bosompem February 5, 2006
Hi everyone, I have recently purchased an XC3S200-based board with 256KB
Flash, 256KB platform flash and 32KB SRAM. Out of interest, I figured I
would design a simple SoC as a learning exercise. I have designed a VGA
framebuffer which does 640x480 (but uses pixel doubling, so it is really
320x240 at 2 bits per pixel). A complete framebuffer is ~19KB.

At this point I decided I would have to read the framebuffer a line at
a time. A scanline in this mode needs 80 bytes of memory.

Naturally I decided to infer a block RAM with 8-bit data width (well
9-bit, but I am not using parity).

The problem, though, is that when the Block RAM is 8 bits wide, you get
almost 2KB of space!! So that means I am wasting more than 90% of the space!!
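
The sizes quoted in this post can be checked with a few lines of arithmetic. This is only a sketch: the 2048-byte block RAM capacity is an assumption (2K x 8 data bits with the parity bits unused), and the variable names are illustrative.

```python
# Figures from the post: 320x240 at 2 bits per pixel, one 8-bit-wide block RAM.
WIDTH, HEIGHT, BITS_PER_PIXEL = 320, 240, 2
BRAM_BYTES = 2048  # assumption: 2K x 8 data bits, parity bits unused

frame_bytes = WIDTH * HEIGHT * BITS_PER_PIXEL // 8   # whole framebuffer
line_bytes = WIDTH * BITS_PER_PIXEL // 8             # one scanline
unused_fraction = 1 - line_bytes / BRAM_BYTES        # block RAM left idle

print(frame_bytes)               # 19200
print(line_bytes)                # 80
print(f"{unused_fraction:.1%}")  # 96.1%
```

So a single-scanline buffer leaves roughly 96% of one block RAM unused, which is consistent with the "more than 90%" estimate.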

I was looking into using eight 128x1 distributed RAMs and wiring them in
parallel to extend the data word to 8 bits. I am not certain how much of
my logic resources this would eat up.

I am fairly new to FPGAs, so I'm not certain if these are the best
methods to buffer such a small amount of memory.
What would you do if you were in my situation?

Regards 
-Isaac

I would also like to add that the unit is working perfectly. I
would just like some suggestions. Thanks

I don't know about Spartan-3. On a Spartan-II, you can hook 8 blocks of
RAMB4_S1 in parallel, giving you a total of 4 KBytes (4096 x 8).

Hi,

I am not sure I understand your architecture. Can you please describe
exactly what you are doing with a single BRAM?

Pending further information, one thing is for sure: you don't want to
use D-RAM as long as you can avoid it.

I may have completely misunderstood, but why can't you address the
entire block RAM (with its 2K depth) using a combination of the
horizontal and vertical address lines?

So suppose you keep your current form factor for the BRAM block (8 bits
wide). Then you could have the high 7 horizontal pixel counter bits hooked
up to the low seven address lines of the BRAM, giving you 128 addressable
locations. Each location contains 4 pixels, which you can then
multiplex onto the output with additional logic (as I presume you are
already doing).

The high 4 address lines to the BRAM block can be connected to the low
4 bits of the vertical line counter. Thus, you would be using 16*80 =
1280 bytes of the total available 2K.

You would still have 4 additional vertical counter bits remaining,
which means you will have to use a total of 16 BRAM blocks. This also
implies you will need a 16-to-1 8-bit-wide MUX.

Hope this helps.
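
The addressing scheme in this post can be sketched in software as below. It is a behavioural model only, not HDL. It assumes a 9-bit horizontal counter covering the 320 logical pixels; the function names and the packing order of pixels within a byte are hypothetical.

```python
def bram_address(h_pixel: int, v_line: int) -> tuple[int, int]:
    """Map a 320x240 pixel coordinate to (11-bit BRAM address, pixel slot)."""
    assert 0 <= h_pixel < 320 and 0 <= v_line < 240
    byte_in_line = h_pixel >> 2    # high 7 bits of the pixel counter: 0..79
    line_in_bram = v_line & 0xF    # low 4 bits of the line counter: 0..15
    address = (line_in_bram << 7) | byte_in_line
    pixel_slot = h_pixel & 0x3     # which of the 4 two-bit pixels in the byte
    return address, pixel_slot


def extract_pixel(byte: int, pixel_slot: int) -> int:
    # Assumption: slot 0 lives in the two least significant bits.
    return (byte >> (2 * pixel_slot)) & 0x3


# Count how many distinct addresses 16 buffered lines actually touch.
used = {bram_address(h, v)[0] for h in range(320) for v in range(16)}
print(len(used))  # 1280
```

Because each line occupies a 128-byte slot but only fills 80 bytes of it, the 1280 used addresses are spread over the 2K address range with gaps between lines.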

You have two choices:
 - Keep the 1-line prefetch architecture and use distributed RAM. 128x8
will take you 16 slices, but you then need a 4:1 mux to select between
pixels.
 - Continue with the block RAM and fetch 8 lines at a time; then you
can use the asymmetric port width feature of the BRAM to select
between pixels.
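
The second option can be pictured with a toy software model of a memory that is written 8 bits at a time and read 2 bits at a time. This is a sketch, not a description of the actual primitive: the class name is hypothetical, and the assumption that the lowest narrow-port address maps to the least significant bits of the byte should be checked against the device documentation.

```python
class AsymmetricRam:
    """Toy model: 8-bit write port, 2-bit read port over the same storage."""

    def __init__(self, depth_bytes: int = 2048):
        self.store = bytearray(depth_bytes)

    def write_byte(self, byte_addr: int, value: int) -> None:
        self.store[byte_addr] = value & 0xFF

    def read_pixel(self, pixel_addr: int) -> int:
        # The 2-bit port sees four times as many addresses as the 8-bit port.
        byte = self.store[pixel_addr >> 2]
        slot = pixel_addr & 0x3
        # Assumption: low pixel address = low bits of the byte.
        return (byte >> (2 * slot)) & 0x3


ram = AsymmetricRam()
ram.write_byte(5, 0b11100100)                      # four pixels: 0, 1, 2, 3
pixels = [ram.read_pixel(20 + i) for i in range(4)]
print(pixels)  # [0, 1, 2, 3]
```

With this arrangement the pixel counter drives the read address directly, so no separate 4:1 pixel mux is needed in the fabric.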

Memory in all FPGAs is relatively expensive and generally limited in size.

One Xilinx feature that is generally useful for video applications is the
SRL16 mode of the LUTs. You get 16 bits of storage per LUT. With these you
can build a line FIFO in either x8 or x1 format very efficiently. We often
use these in conjunction with external memory for some of the video work we
do, and keep 2 or 3 lines of data stored within the FPGA.
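
As a rough illustration of the line FIFO idea, here is a behavioural Python model of an 80-byte delay line built from 16-stage, 1-bit shift registers. The class names are hypothetical, and the model ignores clock enables and timing; it only shows that data moves one word per clock through cascaded 16-bit stages.

```python
class Srl16:
    """Toy model of one 16-deep, 1-bit-wide addressable shift register."""

    def __init__(self):
        self.stages = [0] * 16

    def clock(self, d: int) -> None:
        # One new bit enters per clock; the oldest bit falls off the end.
        self.stages = [d & 1] + self.stages[:15]

    def q(self, a: int) -> int:
        # Tap address a selects the bit shifted in a + 1 clocks ago.
        return self.stages[a]


class LineDelay:
    """An 80-deep, 8-bit-wide delay line: 5 cascaded Srl16 per data bit."""

    def __init__(self, depth: int = 80, width: int = 8):
        assert depth % 16 == 0
        self.chains = [[Srl16() for _ in range(depth // 16)]
                       for _ in range(width)]

    def clock(self, byte: int) -> int:
        """Shift one byte in; return the byte shifted in `depth` clocks ago."""
        out = 0
        for bit, chain in enumerate(self.chains):
            out |= chain[-1].q(15) << bit   # read oldest bit before shifting
            carry = (byte >> bit) & 1
            for srl in chain:
                nxt = srl.q(15)             # bit about to leave this stage
                srl.clock(carry)
                carry = nxt
        return out


delay = LineDelay()
outs = [delay.clock(i) for i in range(200)]
print(outs[80:84])  # [0, 1, 2, 3]
```

In this model an 80x8 buffer uses 5 x 8 = 40 shift registers, and the contents can only be replaced by shifting new data in one word per clock.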

Be careful of the 128x1 macro. I am not sure if this is supported in 
Spartan-3 due to the fact that only half the LUTs can be configured as RAM 
in the Spartan-3.

John Adair
Enterpoint Ltd. - Home of Raggedstone1. The Low Cost Spartan3 Development 
Board.
http://www.enterpoint.co.uk

er ... 32 slices not 16 ...

but as John pointed out, the 128x1 macro might not work in
spartan3/virtex4 ...

abgoyal@gmail.com wrote:
> [...] why can't you address the entire block RAM (with its 2K depth)
> using a combination of the horizontal and vertical address lines?
> [...]
The problem with this method is that I am using 8-bit-wide SRAM. I wish I had 16-bit-wide SDRAM on my board!

I would like to get an SoC up, so if I make changes to the VGA hardware I would like them to take some pressure off the bus. I will have a soft-core CPU and a delta-sigma DAC running, so I would like the VGA hardware to use as little bus time as possible. So if I buffer more data, I would also like to reduce the frequency of the buffer reads.

Fortunately, I can read a byte from the SRAM every clock cycle. Even so, reading 1280 bytes during the horizontal blanking period is impossible (I believe I have close to 100 clock cycles in H blanking; I am not at home now), unless I make the SRAM reading unit operate asynchronously to the VGA dot clock. So technically I am limited to either reading the whole frame (which would need most of the BRAMs available in my Spartan-3) or reading it a scanline at a time.

I will look into using the SRL16s, but from a quick glance through the app note I can't find how to reload data into them. Do I have to shift it in a bit at a time, one per clock cycle?

It is possible that I may have to eat my losses :(
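
The blanking-period budget discussed in this reply can be estimated as follows. The timing numbers are an assumption: the commonly used 640x480 at 60 Hz mode has 800 pixel clocks per line, 640 of them visible, which gives 160 blanking clocks rather than the "close to 100" recalled from memory above. The one-byte-per-clock SRAM rate is taken from the post.

```python
# Assumption: standard 640x480@60 horizontal timing, in pixel clocks.
H_TOTAL, H_ACTIVE = 800, 640
BYTES_PER_CLOCK = 1            # from the post: one SRAM byte per clock

blank_clocks = H_TOTAL - H_ACTIVE


def fits_in_hblank(n_bytes: int) -> bool:
    """True if n_bytes can be fetched within one horizontal blanking period."""
    return n_bytes <= blank_clocks * BYTES_PER_CLOCK


print(blank_clocks)            # 160
print(fits_in_hblank(80))      # True: one scanline
print(fits_in_hblank(1280))    # False: sixteen scanlines
```

Under these assumptions one 80-byte scanline fits comfortably, while a 16-line burst does not, whichever of the two blanking estimates is used.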
"Isaac Bosompem" <x86asm@gmail.com> wrote in message 
news:1139186254.703000.140870@f14g2000cwb.googlegroups.com...
> The problem though is that when the Block RAM is 8-bits, you get almost
> 2KB of space!! So that means I am wasting more than 90% of the space!!

If your frame buffer is in the off-chip SRAM and you want the BlockRAM as the line buffer, don't look at the BlockRAM as wasting 90% of the space. Most people don't end up using all their BlockRAM, making this an ideal use. If you implement the buffer in logic to avoid wasting the BlockRAM, you end up wasting 100% of the BlockRAM by not using it, rather than the 90% you were concerned about. If you're trying to use the BlockRAMs for other functionality and are concerned about running out of memory *but* you have plenty of logic resources, then the 40-45 LUTs (for 8-9 bits at 80-byte depth) is a good tradeoff.

> I was looking into using a 8 128x1 distributed RAM and wire them in a
> way to extend the data word to 8-bits. I am not certain how much of my
> logic resources this would eat up.

Another consideration: if the BlockRAM is used as a single port (you use one address to write during the blanking and that same address to read when it's active), you have a second single-port memory in that same BlockRAM to access the remainder of that 2 KByte memory. To share resources like this usually requires that you instantiate the BlockRAM primitive rather than inferring the memory.

Another suggestion, since you're concerned about reading the data during the blanking period: are you pushing the SRAM near its maximum clock rate? (Probably not if you're doing pixel doubling.) If you increase the clock speed with the DCM, you can increase the data throughput into and out of the external SRAM. The BlockRAM can take in data at the SRAM's maximum rate with ease (as will the SRLs). Using a DCM (Digital Clock Manager, I believe) requires a little more care in your design: I suggest you take your 1X clock from the DCM's 1X output, rather than from the input clock that feeds the DCM, so that it is "phase matched" to the higher-speed clock.
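
The LUT and slice counts quoted in this thread (40-45 LUTs for an 80-deep buffer, 32 slices for 128x8) follow from simple arithmetic. The sketch below assumes 16 bits of RAM per LUT and 2 LUTs per slice, and it ignores any extra logic needed for output multiplexing; the function name is illustrative.

```python
import math

BITS_PER_LUT = 16     # assumption: each LUT provides a 16x1 RAM
LUTS_PER_SLICE = 2    # assumption: two LUTs per slice


def luts_for_ram(depth: int, width: int) -> int:
    """LUTs needed for a depth x width distributed RAM, storage only."""
    return math.ceil(depth / BITS_PER_LUT) * width


print(luts_for_ram(80, 8))                      # 40
print(luts_for_ram(80, 9))                      # 45
print(luts_for_ram(128, 8) // LUTS_PER_SLICE)   # 32
```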
John_H wrote:
> [...] If you're trying to use the BlockRAMs for other functionality and
> are concerned about running out of memory [...]
> [...] are you pushing the SRAM near its maximum clock rate? [...]
Yes, I am concerned about potentially running out of on-chip space. It is not a major issue right now, but I wanted to see how you guys would handle it, and I am very happy with the responses I got!

I'm sorry for confusing you guys, but I did instantiate the BlockRAM; I did not infer it. I was thinking of allowing external modules to access the rest of the memory through the second port. That would let me use the rest of the space in the BlockRAM when I need it in the future.

I am not pushing the SRAM to its maximum speed. The SRAM on my board has a 20 ns access time, so I get a little less than 50 MHz when taking setup and hold times into account. I might be able to use a lot more of the BlockRAM at that speed, but that would require me to use the second port, making it unavailable to external entities. The framebuffer reader will stay ahead of the raster counters.

I will try and see if a clock multiply will help. Thanks for the tip about the DCM. If you had not told me that, I would have used the original signal for the parts that run at 25 MHz.
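
To put rough numbers on the bus-load concern, the sketch below estimates what fraction of SRAM cycles the line fetch would occupy. It relies on several assumptions not stated in the thread: 800 pixel clocks per scanline, each 80-byte line fetched once and displayed on two scanlines because of the pixel doubling, and one byte transferred per SRAM clock. The 25 MHz and 20 ns figures come from the posts.

```python
SRAM_ACCESS_NS = 20      # from the post
PIXEL_CLOCK_MHZ = 25     # from the post
H_TOTAL = 800            # assumption: pixel clocks per scanline
LINE_BYTES = 80          # one 320-pixel line at 2 bits per pixel

max_sram_mhz = 1000 / SRAM_ACCESS_NS   # upper bound, ignoring setup and hold


def bus_share(sram_clock_mhz: float, scanlines_per_fetch: int = 2) -> float:
    """Fraction of SRAM cycles spent fetching video data."""
    sram_cycles = (H_TOTAL * scanlines_per_fetch
                   * sram_clock_mhz / PIXEL_CLOCK_MHZ)
    return LINE_BYTES / sram_cycles


print(max_sram_mhz)              # 50.0
print(f"{bus_share(25):.1%}")    # 5.0%
print(f"{bus_share(50):.1%}")    # 2.5%
```

Under these assumptions the single-line prefetch already leaves most of the bus free, and doubling the SRAM clock halves the video share again.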