FPGARelated.com
Forums

video buffering scheme, nonsequential access (no spatial locality)

Started by wallge January 24, 2007
I am doing some embedded video processing, where I store an incoming
frame of video and then, based on calculations in another part of the
system, warp that buffered frame. When the frame goes into the buffer
(an off-FPGA SDRAM chip), it is simply written one pixel at a time
in row-major order.

The problem is that I will not be reading it back in that order. I
may want to do some arbitrary image rotation, which means the first
pixel I want to access is not the first one I put in the buffer; it
might actually be the last one. If I do full-page reads, or even burst
reads, I will get a bunch of pixels that I do not need to determine
the output pixel value. If I just do single reads, this wastes a bunch
of clock cycles setting up the SDRAM: telling it which row to activate
and which column to read from. After the read is done, you then have to
issue the precharge command to close the row. This is highly
inefficient: it takes 5, maybe 10, clock cycles just to retrieve one
pixel value.
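
To put rough numbers on that overhead, here is a minimal sketch of the cost of one closed-row access. The timing values (`t_rcd`, `cas_latency`, `t_rp`) are illustrative assumptions, not figures from any datasheet, and the model assumes one pixel per data beat and no overlap between commands.

```python
def cycles_per_useful_pixel(burst_len, useful_pixels,
                            t_rcd=2, cas_latency=2, t_rp=2):
    """Approximate clock cycles spent per useful pixel for one row access.

    One access is modelled as ACTIVATE (t_rcd) + READ latency (cas_latency)
    + burst_len data beats + PRECHARGE (t_rp). All timings are assumed,
    illustrative values; check the real part's datasheet.
    """
    if not 1 <= useful_pixels <= burst_len:
        raise ValueError("useful_pixels must be between 1 and burst_len")
    total_cycles = t_rcd + cas_latency + burst_len + t_rp
    return total_cycles / useful_pixels


single_read = cycles_per_useful_pixel(burst_len=1, useful_pixels=1)
burst_one_useful = cycles_per_useful_pixel(burst_len=8, useful_pixels=1)
burst_all_useful = cycles_per_useful_pixel(burst_len=8, useful_pixels=8)
```

Under these assumed timings a single read costs 7 cycles per pixel, while an 8-beat burst only pays off if most of the pixels in it are actually needed.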

Does anyone know a good way to organize a frame buffer to be more
friendly (and more optimal) to nonsequential access (like the kind we
might need if we wanted to warp the input image via some
linear/nonlinear transformation)?

Well, there won't be a scheme that fits every possible transform...
(if there were, that would mean SDRAM would be as flexible as SRAM...)

Can't you narrow down the type of access you want to do?

I've done something similar in the past. In my project I was
doing small-angle rotation, so I knew ahead of time the maximum
line-to-line skew of pixels that became vertical in the output
image, and it was small (like 1). When I started the project,
however, I had the idea that the best way to handle the
general case of rotation is to build a cache memory in
the FPGA. The parts I was using at the time (XCV50s)
were a bit small to implement a decent cache, but I would
think newer parts could do this quite handily.

Also important is using the minimum burst size in the
SDRAM to reduce your cache-line access time.

HTH,
Gabor
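
The cache idea above can be explored in software before committing to HDL. The following is a toy model, not an FPGA design: it walks the output raster of a rotated image, maps each output pixel back to its nearest source pixel, and counts hits and misses in a small LRU cache whose lines are SDRAM bursts. The LRU policy, the line size, the number of lines, and the nearest-neighbour sampling are all assumptions chosen for illustration.

```python
import math
from collections import OrderedDict


def simulate_cache(width, height, angle_deg, line_pixels=4, num_lines=16):
    """Return (hits, misses) for an LRU cache of row-major SDRAM bursts."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    cx, cy = (width - 1) / 2, (height - 1) / 2
    cache = OrderedDict()  # burst index -> True, oldest entry first
    hits = misses = 0
    for oy in range(height):
        for ox in range(width):
            # Inverse-map the output pixel to the nearest source pixel.
            sx = round(c * (ox - cx) + s * (oy - cy) + cx)
            sy = round(-s * (ox - cx) + c * (oy - cy) + cy)
            if not (0 <= sx < width and 0 <= sy < height):
                continue  # falls outside the source frame
            line = (sy * width + sx) // line_pixels
            if line in cache:
                cache.move_to_end(line)  # mark as most recently used
                hits += 1
            else:
                misses += 1
                cache[line] = True
                if len(cache) > num_lines:
                    cache.popitem(last=False)  # evict least recently used
    return hits, misses


no_rotation = simulate_cache(16, 16, 0.0)
rotated = simulate_cache(16, 16, 47.2)
```

With no rotation every burst is fetched exactly once; comparing that against a rotated case for different `line_pixels` and `num_lines` values gives a feel for how much on-chip memory a cache would need.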

Hello,

It all depends on your needs, of course, but block-style ordering can
help a bit to relieve the problem by breaking the 1D orientation of
raster-scan order.

For instance, you can pack pixels in groups of 16, each group
representing a 4x4 square in your image. When retrieving the data, you
get pixels from both dimensions, which gives much better spatial
locality than a line of 16 pixels. This may help you quite a bit.
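
A minimal sketch of that 4x4 packing as an address calculation follows. The function name and the requirement that the image width be a multiple of the tile size are assumptions made for the example.

```python
def tiled_address(x, y, width, tile=4):
    """Map pixel (x, y) to a linear address with each tile stored contiguously.

    Assumes width is a multiple of tile (an assumption of this sketch).
    """
    if width % tile:
        raise ValueError("width must be a multiple of the tile size")
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    offset_in_tile = (y % tile) * tile + (x % tile)
    return tile_index * tile * tile + offset_in_tile


# The 16 pixels of the top-left 4x4 square occupy addresses 0..15,
# so a single 16-beat burst fetches the whole square.
first_tile = sorted(tiled_address(x, y, 64) for y in range(4) for x in range(4))
```

In hardware, with power-of-two sizes, this reduces to rearranging address bits, so the write side can apply it while the frame streams in.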

Peano-style or quadtree-style walking of the image could also be
investigated, but as I remember it, that is quite a bit more
complicated...

JB
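
One simple quadtree-style ordering is Morton (Z-order) addressing, which interleaves the bits of x and y. It is chosen here purely as an illustration and may not be the ordering the post above had in mind; a true Peano or Hilbert curve is more involved. The function name and the `bits` parameter are assumptions of the sketch.

```python
def morton_address(x, y, bits=16):
    """Interleave the low `bits` bits of x and y: x in even bit positions,
    y in odd bit positions, giving a Z-order (quadtree-style) address."""
    addr = 0
    for i in range(bits):
        addr |= ((x >> i) & 1) << (2 * i)
        addr |= ((y >> i) & 1) << (2 * i + 1)
    return addr


# Any aligned 2x2, 4x4, 8x8, ... square maps to one contiguous address range.
square_2x2 = [morton_address(x, y) for y in range(2) for x in range(2)]
```

In an FPGA the interleave is just wiring, so it costs no logic; the trade-off is that sequential raster writes no longer map to sequential addresses.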

"wallge" <wallge@gmail.com> writes:

> I am doing some embedded video processing, where I store an incoming
> frame of video, then based on some calculations in another part of the
> system, I warp that buffered frame of video. Now when the frame goes
> into the buffer
> (an off-FPGA SDRAM chip), it is simply written in one pixel at a time
> in row major ordering.
>
> The problem with this is that I will not be accessing it in this way. I
> may want to do some arbitrary image rotation. This means
> the first pixel I want to access is not the first one I put in the
> buffer, It might actually be the last one in the buffer. If I am doing
> full page reads, or even burst reads, I will get a bunch of pixels that
> I will not need to determine the output pixel value. If i just do
> single reads, this waists a bunch of clock cycles setting up the SDRAM,
> telling it which row to activate and which column to read from. After
> the read is done, you then have to issue the precharge command to close
> the row. There is a high degree of inefficiency to this. It takes 5,
> maybe 10 clock cycles just to retrieve one
> pixel value.
>
If you are doing truly arbitrary warping, then is it not right that you
can never get an optimal organisation for all warps?

> Does anyone know a good way to organize a frame buffer to be more
> friendly (and more optimal) to nonsequential access (like the kind we
> might need if we wanted to warp the input image via some
> linear/nonlinear transformation)?
>

Could you do some kind of caching scheme where you read an entire DRAM
row in at a time, and "hope it comes in handy" later?

Failing that, can you use SSRAM for your frame buffer?

Or, can you parallelise your task so that it operates on (eg) 4 wildly
different areas of input data at a time, which means you can use the
banking mechanism of the DRAMs to hide the latency?

Those are my initial thoughts (whilst waiting for a very loooooong
simulation to run :-)

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
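
One way to give the banking mechanism something to work with is to spread neighbouring tiles across banks, so that a read footprint straddling a tile boundary lands in banks that can be opened independently. This is a variation on the suggestion above rather than exactly what was proposed. The sketch assumes a 4-bank device and 4x4 pixel tiles; both are assumptions, as is the function name.

```python
def tile_bank(x, y, tile=4):
    """Return the bank (0..3) holding the tile that contains pixel (x, y).

    Tiles are assigned in a 2x2 checkerboard, so any four tiles meeting
    at a corner sit in four different banks (assumes a 4-bank SDRAM).
    """
    tile_x, tile_y = x // tile, y // tile
    return (tile_x % 2) + 2 * (tile_y % 2)


# Pixels on either side of the corner where four tiles meet.
corner_banks = {tile_bank(x, y) for x in (3, 4) for y in (3, 4)}
```

A controller could then activate the row in the next bank while the current bank is still bursting, hiding part of the activate and precharge time.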
wallge wrote:

> Does anyone know a good way to organize a frame buffer to be more
> friendly (and more optimal) to nonsequential access

Sounds like a RAM.
If it didn't fit in FPGA block RAM I would use an external device.

     -- Mike Treseler
"Gabor" <gabor@alacron.com> wrote in message 
news:1169736163.476029.150290@l53g2000cwa.googlegroups.com...

> I've done something similar in the past. In my project I was
> doing small-angle rotation, so I knew ahead of time the maximum
> line-to-line skew of pixels that became vertical in the output
> image, and it was small (like 1). When I started the project,
> however I had the idea that the best way to accomplish the
> general case of rotation is to make a cache memory in
> the FPGA. The parts I was using at the time (XCV50's)
> were a bit small to implement a decent cache, but I would
> think newer parts could do this quite handily.

Another option (depending on your mapping) would be to do it in two
passes. There's a transpose in the middle, so it would probably be
best to do it in small sections to an on-chip transpose buffer,
before writing it out to the intermediate store.

Have you thought about what order of filtering you need?

Check out Digital Image Warping by Wolberg, or one of Alvy Ray Smith's
scan line ordering papers.
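
For plain rotation, the two-pass idea can be checked numerically. The sketch below uses the standard separable decomposition: a horizontal pass that moves each pixel along its row, then a vertical pass that moves it along its column. The function names are hypothetical, and the decomposition degrades as the angle approaches 90 degrees because of the division by cos(theta).

```python
import math


def pass1_x(x, y, theta):
    """Horizontal pass: new column u for source pixel (x, y); row unchanged."""
    return x * math.cos(theta) - y * math.sin(theta)


def pass2_y(u, y, theta):
    """Vertical pass: new row v for intermediate pixel (u, y); column unchanged."""
    return u * math.tan(theta) + y / math.cos(theta)


def rotate_direct(x, y, theta):
    """Reference single-pass rotation of (x, y) by theta about the origin."""
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


theta = math.radians(47.2)
x, y = 10.0, 3.0
u = pass1_x(x, y, theta)
v = pass2_y(u, y, theta)
direct_u, direct_v = rotate_direct(x, y, theta)
```

The first pass reads and writes rows, which suits row-major SDRAM; the second pass works on columns, which is where the transpose buffer comes in.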
I should have been more specific in my question.

I have to use a small (64 Mbit) mobile SDRAM. I can't choose
a different storage element in the system (other than *some*
FPGA buffering, though not a full frame).

I have heard some discussion of the way graphics accelerator
boards do memory transactions, storing pixels in blocks of
neighboring pixels (instead of organizing them row-major). In other
words, the spatial locality in the SDRAM buffer might look like:

Image pixels:
                  N2 N3 N4
                  N1 P  N5
                  N8 N7 N6

Memory organization:
ADDR     DATA
0x0000      P
0x0001     N1
0x0002     N2
0x0003     N3
0x0004     N4
0x0005     N5
0x0006     N6
0x0007     N7
0x0008     N8


Where P is the central pixel of interest, and the N's are its
neighbors. We organize the pixels in the SDRAM buffer not by rows, but
by regions of interest. This way, if we are doing some kind of image
warp and want more bang for the buck in terms of read latency, we are
more likely to reuse pixels in the neighborhood of the currently
accessed pixel than if the buffer were arranged in row- or column-major
order (consider the case where we wanted to rotate an image by 47.2
degrees from input to output).
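
As a rough check on the 47.2 degree case, the sketch below counts how many distinct SDRAM bursts are touched while walking one line through the source image at that angle, once with row-major addressing and once with 4x4 blocks. The frame width, burst length, sample count, and function names are all assumptions made for the example.

```python
import math

WIDTH = 64  # assumed frame width in pixels
BURST = 16  # assumed pixels per SDRAM burst


def row_major(x, y):
    """Conventional raster address."""
    return y * WIDTH + x


def tiled_4x4(x, y):
    """Address with each 4x4 block stored contiguously (one block per burst)."""
    tile_index = (y // 4) * (WIDTH // 4) + (x // 4)
    return tile_index * 16 + (y % 4) * 4 + (x % 4)


def bursts_touched(addr_fn, angle_deg=47.2, samples=64):
    """Count distinct bursts read while stepping along a line at angle_deg."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    bursts = set()
    for i in range(samples):
        x, y = int(i * c), int(i * s)  # truncate to a source pixel
        bursts.add(addr_fn(x, y) // BURST)
    return len(bursts)


row_major_bursts = bursts_touched(row_major)
tiled_bursts = bursts_touched(tiled_4x4)
```

With row-major storage nearly every step lands on a new image row and therefore a new burst, whereas the blocked layout lets several consecutive samples share one burst.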

Has anyone seen something like this or know of any resources online
with regard to memory buffer organization schemes for graphics or image
processing?



"wallge" <wallge@gmail.com> wrote in message 
news:1169747314.537493.237140@l53g2000cwa.googlegroups.com...

> Image pixels:
> N2 N3 N4
> N1 P  N5
> N8 N7 N6
Have you thought about what order of filtering you'll need to use?
I am not doing any image filtering.
This is not a filtering operation; it is an interpolation operation,
typically bilinear or bicubic, used to do image transformations.
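
For reference, a minimal sketch of bilinear interpolation follows. It shows why each output pixel needs a 2x2 neighbourhood of source pixels, which is the access pattern the buffer layout has to serve. The image is represented as a list of rows, and clamping at the right and bottom edges is an assumption of the sketch.

```python
import math


def bilinear_sample(img, x, y):
    """Interpolate img (a list of rows) at fractional coordinates (x, y).

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1; the far
    neighbours are clamped at the right and bottom edges.
    """
    height, width = len(img), len(img[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


image = [[0, 10],
         [20, 30]]
centre = bilinear_sample(image, 0.5, 0.5)  # average of all four pixels
```

Bicubic interpolation widens the footprint to 4x4 source pixels per output pixel, which happens to match a 4x4 block only when the footprint is aligned to the block grid.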
