  "Mike Field" <mikefield1969@gmail.com> wrote in message
  news:8c1434a4-c435-463c-9783-0d35911a33b3@googlegroups.com...
  On Thursday, 8 January 2015 08:51:55 UTC+13, Tomas D.  wrote:
  > The FPGA will be Altera Cyclone V with one hard memory controller (5CEFA2
  > device). I am trying to check if it will be sufficient to use one DDR3
  > memory chip or it's better to use two devices with 32bit memory bus, thus
  > increasing the bandwidth.
  >

  The most efficient method which uses frame buffer that is external to
  the FPGA will result in one write and one read per pixel, if you use a
  DDR module you will need a bit more than 2x the bandwidth of the video
  stream, so you will need a bit over 900MB/s for 24-bit 1080p @ 60 Hz.
  You will need to carefully plan how memory will be accessed to maximise
  available memory bandwidth.

  For 1080p video, if you can hold 180 rows of pixel data inside your
  FPGA you don't actually need external memory to buffer the frames at
  all, and you can achieve lower latency too (approx the time for 181
  lines). The idea being to use a rolling buffer of 180 rows that you
  sample/extract your output pixels from. The cost of the larger FPGA
  might be offset by the savings in not requiring the external memory,
  smaller PCB and so on.

  There is a sweet spot for 720p video, where you can get away with
  holding just 128 rows for +/- 5 degrees of rotation, requiring only half
  a MB of block RAM. This assumes that you are not interpolating between
  pixels.

  If you are performing interpolation, then you might need to be really
  cunning and use the extra cycles found in the blanking interval to give
  additional cycles required for the extra memory accesses needed when you
  walk through the pixels. Your access pattern might be something like

  1234......
  ...5678...
  ......9ABC
  ..........

  In this case it takes 12 cycles to access the data needed for
  interpolating 10 pixels (because of the additional cycle required for
  access 5 & 9 when it jumps lines). You will then need something like a
  FIFO to remove the gaps in the output pixel stream. For 1080p, you have
  about 280 cycles in the horizontal blanking interval, a little more than
  what you will need for a +/- 5 degree rotation, where you will have at
  most 167 changes between lines.

---------------------------------------

Hi,
thank you for all the ideas.
The resolution I will be playing with is 1200x1080 - a little bit more
than FullHD. Since I am going to use an Altera FPGA, there's no point in
not using the Altera VIP suite. This means that I will get bursts of real
image data and there will be no blanking periods - just spaces between
bursts.
The memory will still be used for a frame buffer, because the input
video stream will have a different clock source, so I can end up dropping
frames.
However, I am interested in image rotation techniques and the literature
at the moment. You're describing the methods you used, which, for
someone like me who has never played with image processing, is kind of
difficult to follow. Do you perhaps have resources I could read about
these rotation methods first?

Thank you.

BR
Tomas D.

On 1/7/15 11:23 PM, Mike Field wrote:
> On Thursday, 8 January 2015 08:51:55 UTC+13, Tomas D.  wrote:
>> The FPGA will be Altera Cyclone V with one hard memory controller
>> (5CEFA2 device). I am trying to check if it will be sufficient to
>> use one DDR3 memory chip or it's better to use two devices with
>> 32bit memory bus, thus increasing the bandwidth.
>>
>
> The most efficient method which uses frame buffer that is external to
> the FPGA will result in one write and one read per pixel, if you use
> a DDR module you will need a bit more than 2x the bandwidth of the
> video stream, so you will need a bit over 900MB/s for 24-bit 1080p @
> 60 Hz. You will need to carefully plan how memory will be accessed to
> maximise available memory bandwidth.
>
> For 1080p video, if you can hold 180 rows of pixel data inside your
> FPGA you don't actually need external memory to buffer the frames at
> all, and you can achieve lower latency too (approx the time for 181
> lines). The idea being to use a rolling buffer of 180 rows that you
> sample/extract your output pixels from. The cost of the larger FPGA
> might be offset by the savings in not requiring the external memory,
> smaller PCB and so on.
>
> There is a sweet spot for 720p video, where you can get away with
> holding just 128 rows for +/- 5 degrees of rotation, requiring only
> half a MB of block RAM. This assumes that you are not interpolating
> between pixels.
>
> If you are performing interpolation, then you might need to be really
> cunning and use the extra cycles found in the blanking interval to
> give additional cycles required for the extra memory accesses needed
> when you walk through the pixels. Your access pattern might be
> something like
>
> 1234......
> ...5678...
> ......9ABC
> ..........
>
> In this case it takes 12 cycles to access the data needed for
> interpolating 10 pixels (because of the additional cycle required for
> access 5 & 9 when it jumps lines). You will then need something like
> a FIFO to remove the gaps in the output pixel stream. For 1080p, you
> have about 280 cycles in the horizontal blanking interval, a little
> more than what you will need for a +/- 5 degree rotation, where you
> will have at most 167 changes between lines.
>
>
> Mike
>

On the need for 12 cycles here: my experience is that FPGAs tend NOT to
have giant blocks of memory, but a lot of "smaller" blocks (perhaps of
differing sizes).  This 1/2 MB memory is likely made of smaller blocks and
could be defined as 2 separate memories, one for even lines and one for
odd, which means that 4,5 and 8,9 could be accessed simultaneously.
(Actually, if you are interpolating the pixels, you are almost always
going to want two lines of data, the line above and below your fractional
position, and the point before and after, and so may want to 4-way
interleave your memory.)
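
A rough software model of the 4-way interleave suggested above, assuming the bank is selected by the low bit of the row and of the column. The names bank_of, bank_address and neighbourhood_banks are made up for this illustration, and a real design would do the same thing with address wiring rather than code. The point it shows is that any 2x2 neighbourhood lands in four different banks, so all four pixels can be read in the same cycle.

```python
def bank_of(x, y):
    """Bank index 0..3, chosen by the low bit of the row and of the column."""
    return ((y & 1) << 1) | (x & 1)


def bank_address(x, y, width):
    """Word address inside the selected bank (each bank holds width/2 columns)."""
    return (y >> 1) * (width // 2) + (x >> 1)


def neighbourhood_banks(x, y):
    """Banks touched by the 2x2 block whose top-left pixel is (x, y)."""
    return {bank_of(x + dx, y + dy) for dy in (0, 1) for dx in (0, 1)}


# Every 2x2 block touches each of the four banks exactly once,
# whatever the position of its top-left corner.
conflict_free = all(
    neighbourhood_banks(x, y) == {0, 1, 2, 3}
    for y in range(16)
    for x in range(16)
)
print("conflict free:", conflict_free)
```

With only the even/odd line split (two banks), the same check would pass for vertical pairs but not for horizontal ones, which is why the interpolating case pushes towards four banks.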

On Thursday, 8 January 2015 08:51:55 UTC+13, Tomas D.  wrote:
> The FPGA will be Altera Cyclone V with one hard memory controller (5CEFA2
> device). I am trying to check if it will be sufficient to use one DDR3
> memory chip or it's better to use two devices with 32bit memory bus, thus
> increasing the bandwidth.
>

The most efficient method which uses frame buffer that is external to
the FPGA will result in one write and one read per pixel, if you use a
DDR module you will need a bit more than 2x the bandwidth of the video
stream, so you will need a bit over 900MB/s for 24-bit 1080p @ 60 Hz.
You will need to carefully plan how memory will be accessed to maximise
available memory bandwidth.
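
A back-of-envelope check of that figure, as a sketch only. It assumes 3 bytes per pixel and the standard 148.5 MHz pixel clock for 1080p60; the guess that the 900 MB/s number is based on the pixel clock (blanking included) rather than on active pixels only is mine, not something stated in the post.

```python
BYTES_PER_PIXEL = 3             # 24-bit colour
ACTIVE_W, ACTIVE_H = 1920, 1080
FRAME_RATE = 60
PIXEL_CLOCK_HZ = 148_500_000    # assumed: standard 1080p60 pixel clock


def stream_bandwidth(pixels_per_second, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes per second for a single pass (one read or one write)."""
    return pixels_per_second * bytes_per_pixel


# One write plus one read per pixel doubles the stream bandwidth.
frame_buffer_active = 2 * stream_bandwidth(ACTIVE_W * ACTIVE_H * FRAME_RATE)
frame_buffer_clocked = 2 * stream_bandwidth(PIXEL_CLOCK_HZ)

print(f"active pixels only : {frame_buffer_active / 1e6:.1f} MB/s")
print(f"at the pixel clock : {frame_buffer_clocked / 1e6:.1f} MB/s")
```

Neither number includes DDR overhead such as refresh, row activation or read/write turnaround, which is where the "bit more than 2x" comes in.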

For 1080p video, if you can hold 180 rows of pixel data inside your
FPGA you don't actually need external memory to buffer the frames at
all, and you can achieve lower latency too (approx the time for 181
lines). The idea being to use a rolling buffer of 180 rows that you
sample/extract your output pixels from. The cost of the larger FPGA
might be offset by the savings in not requiring the external memory,
smaller PCB and so on.

There is a sweet spot for 720p video, where you can get away with
holding just 128 rows for +/- 5 degrees of rotation, requiring only half
a MB of block RAM. This assumes that you are not interpolating between
pixels.
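
The row counts can be sanity-checked with a little trigonometry. This sketch assumes one output line tilted by the maximum angle crosses about width * sin(angle) source rows, and that pixels are stored as 3 bytes each; rows_spanned and buffer_bytes are names invented for the example. The 180 and 128 figures in the post are then these minimums rounded up with some margin.

```python
import math


def rows_spanned(width_px, max_angle_deg):
    """Source rows crossed by one output line tilted by max_angle_deg."""
    return math.ceil(width_px * math.sin(math.radians(max_angle_deg)))


def buffer_bytes(rows, width_px, bytes_per_pixel=3):
    """Size of a rolling buffer holding `rows` full lines."""
    return rows * width_px * bytes_per_pixel


for name, width, rows_held in (("1080p", 1920, 180), ("720p", 1280, 128)):
    needed = rows_spanned(width, 5)
    size = buffer_bytes(rows_held, width)
    print(f"{name}: at least {needed} rows, {rows_held} held = {size} bytes")
```

For 720p, 128 rows is also a power of two, which keeps the rolling-buffer row index a plain 7-bit counter.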

If you are performing interpolation, then you might need to be really
cunning and use the extra cycles found in the blanking interval to give
additional cycles required for the extra memory accesses needed when you
walk through the pixels. Your access pattern might be something like

1234......
...5678...
......9ABC
..........

In this case it takes 12 cycles to access the data needed for
interpolating 10 pixels (because of the additional cycle required for
access 5 & 9 when it jumps lines). You will then need something like a
FIFO to remove the gaps in the output pixel stream. For 1080p, you have
about 280 cycles in the horizontal blanking interval, a little more than
what you will need for a +/- 5 degree rotation, where you will have at
most 167 changes between lines.
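
The cycle counting can be modelled in a few lines. This is only a sketch: it assumes one read per output pixel plus one extra read each time the walk moves to a new source row, and it starts the walk exactly on a row boundary (y0 = 0). The function names are invented, and the 280 blanking cycles assume the standard 2200-clock total line length for 1080p60.

```python
import math


def count_row_changes(width_px, dy_per_pixel, y0=0.0):
    """How often the source row changes while walking one output line."""
    changes = 0
    prev_row = math.floor(y0)
    for x in range(1, width_px):
        row = math.floor(y0 + x * dy_per_pixel)
        if row != prev_row:
            changes += 1
            prev_row = row
    return changes


def read_cycles(width_px, dy_per_pixel, y0=0.0):
    """One read per pixel plus one extra read per row change."""
    return width_px + count_row_changes(width_px, dy_per_pixel, y0)


# The 10-pixel example from the post: two row jumps, so 12 reads.
print(read_cycles(10, 0.3))

# Full 1080p line at 5 degrees, compared with the blanking budget.
changes_1080p = count_row_changes(1920, math.sin(math.radians(5)))
H_BLANK_CYCLES = 2200 - 1920    # assumed standard 1080p60 line timing
print(changes_1080p, "row changes,", H_BLANK_CYCLES, "blanking cycles")
```

A non-zero fractional start position can add one more row change on some lines, so treat the count as approximate rather than as a hard upper bound.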

Mike

Tomas D. wrote:
> Hello,
> I've come up with an issue, where I need to rotate the incoming video stream 
> image by +/-5 degrees with 0.5 degree step. The problem now is to identify 
> the most resource saving approach, which would also use the memory as 
> efficiently as possible, because I need to design a new PCB and do a 
> component selection.
> The FPGA will be Altera Cyclone V with one hard memory controller (5CEFA2 
> device). I am trying to check if it will be sufficient to use one DDR3 
> memory chip or it's better to use two devices with 32bit memory bus, thus 
> increasing the bandwidth.
> 
> The incoming video stream is from the camera, which has a separate clock, 
> thus the frame buffer is a requirement.
> 
> I've come across two options so far:
> 1) Image rotation by shearing:
> https://www.ocf.berkeley.edu/~fricke/projects/israel/paeth/rotation_by_shearing.html
> 
> It seems like this is kinda easy approach, but it will require at least 
> three memory accesses. In a combination with regular 3 frames frame buffer, 
> I could end up doing 5 memory read/write cycles.
> 
> 2) Image rotation by having lookup table of each pixel. If the lookup table 
> will be placed into the memory, then this will require one access to read 
> the location and another access to read the pixels and write them to the 
> moved location.
> 
> I am not sure which method is most commonly used in FPGA video
> processing. Maybe you experts have good resources to read about this?
> 
> Thank you.
> 
> Regards
> Tomas D. 
> 
> 
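
For reference, the shearing method in option 1 rests on the identity that a rotation by theta equals an X shear by -tan(theta/2), then a Y shear by sin(theta), then the same X shear again. The sketch below only checks that identity numerically with plain 2x2 matrices; it says nothing about how the three passes would be scheduled against memory, and the helper names are made up for the example.

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def three_shear(theta):
    """X shear, Y shear, X shear: together they form a pure rotation."""
    a = -math.tan(theta / 2)
    b = math.sin(theta)
    x_shear = [[1.0, a], [0.0, 1.0]]
    y_shear = [[1.0, 0.0], [b, 1.0]]
    return matmul2(x_shear, matmul2(y_shear, x_shear))


def rotation(theta):
    """Ordinary 2x2 rotation matrix for comparison."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


theta = math.radians(5)
print(three_shear(theta))
print(rotation(theta))
```

Each shear only moves pixels along one axis, which is what makes the passes simple, at the price of touching the whole image three times.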

Some time ago I did image rotation for a check scanner that used
a line-scan camera.  My issue was mostly the general lack of BRAM
in the small (XCV50) FPGAs I used and I had to come up with an
algorithm that only read small groups of pixels at a time.  My
suggestion is to try to find a part that has as much internal
RAM as you can reasonably afford.  Then remember that when reading
you want to keep as much of the data you actually read (full bursts)
so you don't have to re-read during the same rotation pass.  I would
not think that going to a two-pass shearing algorithm will really
save much in terms of logic.  I didn't need to go that way and I
used relatively small parts with no internal hardware multipliers.
The algorithm I used simply started with the location of the first
destination pixel of the first output line, which may be located
at some point outside the actual input image.  Remember that free
rotation usually requires a larger "canvas" than the input image.
In my case I didn't really need the whole input image since the
output image was calculated by the detected corners of the check.
Then the algorithm simply walked a pixel at a time by adding a
delta to the starting location.  You need to read pixels surrounding
each computed X,Y location and interpolate.  My interpolation was
simply linear and used only the 4 nearest neighbors, but a more
robust algorithm would either use more neighboring pixels or do
some filtering on the input image before rotation.  When you get
to the end of the first output line, you go back to the original
pixel location plus an orthogonal delta to get the first pixel
of the second output line and so on.
My algorithm used reads of 4 adjacent pixels in each of three
adjacent rows to fill internal memory in 3 x 4 blocks.  The starting
point of these 3 x 4 blocks depended on the direction of rotation,
but it allowed me to do +/- 14 degrees max.  I did not need to
do sine / cosine in my design because there was a processor that
looked at the incoming raw image to find the check corners and
directly programmed the starting pixel location and X,Y deltas.
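
A software sketch of the walk described above: start at a source location, add a delta per output pixel, interpolate from the 4 nearest neighbours, and step to the next output line with the orthogonal delta. It is an illustration under assumptions rather than the original design: it uses floating point where hardware would use fixed point, treats everything outside the input image as 0, and ignores the 3 x 4 block fetching entirely. The names bilinear and rotate_by_walking are invented here.

```python
import math


def bilinear(img, x, y):
    """Linear interpolation from the 4 nearest neighbours; 0 outside the image."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def px(ix, iy):
        return img[iy][ix] if 0 <= ix < w and 0 <= iy < h else 0

    top = px(x0, y0) * (1 - fx) + px(x0 + 1, y0) * fx
    bottom = px(x0, y0 + 1) * (1 - fx) + px(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


def rotate_by_walking(img, start, delta, out_w, out_h):
    """Walk the source image one output pixel at a time.

    start: (x, y) source location of the first output pixel
    delta: (dx, dy) source step per output pixel along a line
    The step between output lines is the orthogonal delta (-dy, dx).
    """
    sx, sy = start
    dx, dy = delta
    out = []
    for row in range(out_h):
        # Line start = original start plus `row` orthogonal deltas.
        x = sx + row * -dy
        y = sy + row * dx
        line = []
        for _ in range(out_w):
            line.append(bilinear(img, x, y))
            x += dx
            y += dy
        out.append(line)
    return out


# With delta = (cos(a), sin(a)) the output is the input rotated by a.
angle = math.radians(5)
src = [[(x + y) % 256 for x in range(16)] for y in range(16)]
rotated = rotate_by_walking(src, (0.0, 0.0), (math.cos(angle), math.sin(angle)), 16, 16)
```

Passing the start point and deltas in as parameters mirrors the setup described above, where a processor computed them and the logic never needed sine or cosine.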

-- 
Gabor