comp.arch.fpga | What to do with an improved algorithm?

Hi,

I think I've got a really good way to improve a commonly used & well establ=
ished algorithm that is often used in FPGAs, and it all checks out. The imp=
lementation completes the same tasks in 2/3rds the cycles and using 2/3rds =
the resources of an standard Xilinx IP block, with comparable timing).

I've verified that the output is correct over the entire range of 32-bit in=
put values.

I can't find anything similar designs in a Google patent search, or looking=
 through journal articles. Once you are familiar with the original algorith=
m, and the optimization is explained it becomes pretty self-evident in retr=
ospect. It just seems the right way to do things.

What should I do?=20

Should I just throw the implementation on a website somewhere as a curiosit=
y?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at al=
l)=20

Attempt to patent and then commercialize it?

Thanks!

Mike

Reply by Gene Filatov ●September 3, 20182018-09-03

I think the best option is to write an article -- or a patent.

Simply because it's an extra opportunity to verify that your approach is 
correct.

If there's a hidden mistake, a student might be unable to see it.

Gene


On 03.09.2018 13:17, Mike Field wrote:
> Hi,
>
> I think I've got a really good way to improve a commonly used & well established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of an standard Xilinx IP block, with comparable timing).
>
> I've verified that the output is correct over the entire range of 32-bit input values.
>
> I can't find anything similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm, and the optimization is explained it becomes pretty self-evident in retrospect. It just seems the right way to do things.
>
> What should I do?
>
> Should I just throw the implementation on a website somewhere as a curiosity?
>
> Publish it in an article?
>
> Pass it to a local student to make a paper from it? (I'm not studying at all)
>
> Attempt to patent and then commercialize it?
>
> Thanks!
>
> Mike
>
>

Reply by Mike Butts ●September 3, 20182018-09-03

I agree with Gene, plus you might consider publishing the IP as open source=
 code on a website of your own or opencores.org.

  --Mike

On Monday, September 3, 2018 at 3:41:02 AM UTC-7, Gene Filatov wrote:
> I think the best option is to write an article -- or a patent.
>=20
> Simply because it's an extra opportunity to verify that your approach is=
=20
> correct.
>=20
> If there's a hidden mistake, a student might be unable to see it.
>=20
> Gene
>=20
>=20
> On 03.09.2018 13:17, Mike Field wrote:
> > Hi,
> >
> > I think I've got a really good way to improve a commonly used & well es=
tablished algorithm that is often used in FPGAs, and it all checks out. The=
 implementation completes the same tasks in 2/3rds the cycles and using 2/3=
rds the resources of an standard Xilinx IP block, with comparable timing).
> >
> > I've verified that the output is correct over the entire range of 32-bi=
t input values.
> >
> > I can't find anything similar designs in a Google patent search, or loo=
king through journal articles. Once you are familiar with the original algo=
rithm, and the optimization is explained it becomes pretty self-evident in =
retrospect. It just seems the right way to do things.
> >
> > What should I do?
> >
> > Should I just throw the implementation on a website somewhere as a curi=
osity?
> >
> > Publish it in an article?
> >
> > Pass it to a local student to make a paper from it? (I'm not studying a=
t all)
> >
> > Attempt to patent and then commercialize it?
> >
> > Thanks!
> >
> > Mike
> >
> >

Reply by Michael Kellett ●September 4, 20182018-09-04

On 03/09/2018 11:17, Mike Field wrote:
> Hi,
> 
> I think I've got a really good way to improve a commonly used & well established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of an standard Xilinx IP block, with comparable timing).
> 
> I've verified that the output is correct over the entire range of 32-bit input values.
> 
> I can't find anything similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm, and the optimization is explained it becomes pretty self-evident in retrospect. It just seems the right way to do things.
> 
> What should I do?
> 
> Should I just throw the implementation on a website somewhere as a curiosity?
> 
> Publish it in an article?
> 
> Pass it to a local student to make a paper from it? (I'm not studying at all)
> 
> Attempt to patent and then commercialize it?
> 
> Thanks!
> 
> Mike
> 
> 
I'd publish - since you are not already in the IP licensing/patenting 
groove I doubt if you would make any money from it but you might gain 
kudos which may help you career and business. Xilinx might want to 
publish it - which might give a lot more visibility.

If you have a web site you could put it on that.

MK

Reply by kkoorndyk ●September 4, 20182018-09-04

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> Hi,
>=20
> I think I've got a really good way to improve a commonly used & well esta=
blished algorithm that is often used in FPGAs, and it all checks out. The i=
mplementation completes the same tasks in 2/3rds the cycles and using 2/3rd=
s the resources of an standard Xilinx IP block, with comparable timing).
>=20
> I've verified that the output is correct over the entire range of 32-bit =
input values.
>=20
> I can't find anything similar designs in a Google patent search, or looki=
ng through journal articles. Once you are familiar with the original algori=
thm, and the optimization is explained it becomes pretty self-evident in re=
trospect. It just seems the right way to do things.
>=20
> What should I do?=20
>=20
> Should I just throw the implementation on a website somewhere as a curios=
ity?
>=20
> Publish it in an article?
>=20
> Pass it to a local student to make a paper from it? (I'm not studying at =
all)=20
>=20
> Attempt to patent and then commercialize it?
>=20
> Thanks!
>=20
> Mike

Licensing and selling IP comes with a bit of a learning curve and requires =
an investment on your part.  As Michael mentions, without some of that fram=
ework already in place, a license vetted by an IP attorney, and a good mark=
eting plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting =
up on a personal Github account rather than opencores.org which makes you c=
onform to their requirements (such as wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have b=
een in flux since the recent reorg), and their e-magazine Xcell Journal (ag=
ain, seems to have been discontinued and the Xcell Daily Blog archived)

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive
https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog

https://www.xilinx.com/about/xcell-publications/xcell-journal.html


--Kris

Reply by Mike Butts ●September 4, 20182018-09-04

On Tuesday, September 4, 2018 at 7:48:41 AM UTC-7, kkoorndyk wrote:
> If you want your name more prominently attached to it, I'd suggest postin=
g up on a personal Github account rather than opencores.org which makes you=
 conform to their requirements (such as wishbone interface, etc.).
>=20
OpenCores encourages use of the Wishbone interface for SoC components and t=
hey do offer coding guidelines, but there are no requirements for either. F=
or example in the entire DSP core section there are 38 entries, none of whi=
ch are marked as Wishbone compliant.

Reply by Mike Field ●September 5, 20182018-09-05

On Wednesday, 5 September 2018 02:48:41 UTC+12, kkoorndyk  wrote:
> On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> > Hi,
> >=20
> > I think I've got a really good way to improve a commonly used & well es=
tablished algorithm that is often used in FPGAs, and it all checks out. The=
 implementation completes the same tasks in 2/3rds the cycles and using 2/3=
rds the resources of an standard Xilinx IP block, with comparable timing).
> >=20
> > I've verified that the output is correct over the entire range of 32-bi=
t input values.
> >=20
> > I can't find anything similar designs in a Google patent search, or loo=
king through journal articles. Once you are familiar with the original algo=
rithm, and the optimization is explained it becomes pretty self-evident in =
retrospect. It just seems the right way to do things.
> >=20
> > What should I do?=20
> >=20
> > Should I just throw the implementation on a website somewhere as a curi=
osity?
> >=20
> > Publish it in an article?
> >=20
> > Pass it to a local student to make a paper from it? (I'm not studying a=
t all)=20
> >=20
> > Attempt to patent and then commercialize it?
> >=20
> > Thanks!
> >=20
> > Mike
>=20
> Licensing and selling IP comes with a bit of a learning curve and require=
s an investment on your part.  As Michael mentions, without some of that fr=
amework already in place, a license vetted by an IP attorney, and a good ma=
rketing plan, you might not see a return on that investment.
>=20
> If you want your name more prominently attached to it, I'd suggest postin=
g up on a personal Github account rather than opencores.org which makes you=
 conform to their requirements (such as wishbone interface, etc.).
>=20
> Xilinx always welcomes guest articles on their blogs (although those have=
 been in flux since the recent reorg), and their e-magazine Xcell Journal (=
again, seems to have been discontinued and the Xcell Daily Blog archived)
>=20
> https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive
> https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog
>=20
> https://www.xilinx.com/about/xcell-publications/xcell-journal.html
>=20
>=20
> --Kris

I never though I would agree with Rick, but....

All sounds like too much work. So here is a quick summary with C-like pseud=
o-code. I'll put the HDL code up somewhere soon once I am happy with it. I =
am removing the last rounding errors.

I've been playing with CORDIC, and have come up with what looks to be an ov=
erlooked optimization. I've done a bit of googling, and haven't found anyth=
ing - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, and a=
nd average error of around 0.6.I a am sure with a bit more analysis of wher=
e the errors are coming from I can get it more accurate.

This has two parts to it, both by themselves seem quite trivial, but comple=
ment each other quite nicely.

Scaling Z
---------
1. The 'z' value in CORDIC uses becomes smaller and smaller as stages incre=
ase:

The core of CORDIC for SIN() and COS() is:
  x =3D INITIAL;
  y =3D INITIAL;
  for(i =3D 0; i < CORDIC_REPS; i++ ) {
    int64_t tx,ty;
    // divide to scale the current vector
    tx =3D x >> (i+1);
    ty =3D y >> (i+1);

    // Either add or subtract at right angles to the current=20
    x -=3D (z > 0 ? ty : -ty);
    y +=3D (z > 0 ? tx : -tx);
    z -=3D (z > 0 ? angles[i] : -angles[i]);
}


The value for angle[] is all important, for example:

  angle[0] =3D 1238021
  angle[1] =3D 654136
  angle[2] =3D 332050
  angle[3] =3D 166670
  angle[4] =3D 83415
  angle[5] =3D 41718
  angle[6] =3D 20860
  angle[7] =3D 10430
  angle[8] =3D 5215
  angle[9] =3D 2607
  angle[10] =3D 1303
  angle[11] =3D 652
  angle[12] =3D 326
  angle[13] =3D 163
  angle[14] =3D 81
  angle[15] =3D 41
  angle[16] =3D 20
  angle[17] =3D 10
  angle[18] =3D 5
  angle[19] =3D 3
  angle[20] =3D 1

If you make the following change:

  for(i =3D 0; i < CORDIC_REPS; i++ ) {
    int64_t tx,ty;
    // divide to scale the current vector
    tx =3D x >> (i+1);
    ty =3D y >> (i+1);

    // Either add or subtract at right angles
    x -=3D (z > 0 ? ty : -ty);
    y +=3D (z > 0 ? tx : -tx);
    z -=3D (z > 0 ? angles[i] : -angles[i]);

    //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    z <<=3D 1; // Double the result of 'z'
    //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  }

Then you can use all the bits in angle[], because you can scale by 2^i (thi=
s is data from a different set of parameters, hence the values and count is=
 different):
  angle[0] =3D 1238021
  angle[1] =3D 1308273
  angle[2] =3D 1328199
  angle[3] =3D 1333354
  angle[4] =3D 1334654
  angle[5] =3D 1334980
  angle[6] =3D 1335061
  angle[7] =3D 1335082
  angle[8] =3D 1335087
  angle[9] =3D 1335088
  angle[10] =3D 1335088
  angle[11] =3D 1335088
  angle[12] =3D 1335088
  angle[13] =3D 1335088
  angle[14] =3D 1335088
  angle[15] =3D 1335088
  angle[16] =3D 1335088
  angle[17] =3D 1335088
  angle[18] =3D 1335088
  angle[19] =3D 1335088
  angle[20] =3D 1335088
  angle[21] =3D 1335088
  angle[22] =3D 1335088
  angle[23] =3D 1335088
  angle[24] =3D 1335088
  angle[25] =3D 1335088
  angle[26] =3D 1335088
  angle[27] =3D 1335088
  angle[28] =3D 1335088
  angle[29] =3D 1335088

...and angle[i] rapidly becomes a constant value after the first 9 or 10 it=
erations. This is what you would expect, as the angle gets smaller and smal=
ler.


Part 2: Add a lookup table
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D
If you split the input into:

  [2 MSB] quadrant
  [next 9 bits] an lookup table index
  [the rest] The starting CORDIC Z value, offset by 1<<(num_of_bits-1)

And have a lookup table of 512 x 36-bit values (i.e. a block RAM), which ho=
ld the SIN/COS values at the center of the range =3D e.g. initial[i] =3D sc=
ale_factor * sin(PI/2.0/1024*(2*i+1));

Because you need both the SIN() and COS() starting point, you can get them =
from the same table (screaming out "dual port memory!" to me)

You can then do a standard lookup to get the starting points, 9 cycles into=
 the CORDIC:

  /* Use Dual Port memory for this */
  if(quadrant & 1) {
    x =3D initial[index];
    y =3D initial[TABLE_SIZE-1-index];
  } else {
    x =3D initial[TABLE_SIZE-1-index];
    y =3D initial[index];
  }

   /* Subtract half the sector angle from Z */
   z -=3D 1 << (CORDIC_BITS-1);

   /* Now do standard CORDIC, with a lot of work already done */
   ...

This removes ~8 cycles of latency.

The end result
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
If you combine both of these you can get rid of the angles[] table complete=
ly - it is now a constant.

  /* Use Dual Port memory for this */
  if(quadrant & 1) {
    x =3D initial[index];
    y =3D initial[TABLE_SIZE-1-index];
  } else {
    x =3D initial[TABLE_SIZE-1-index];
    y =3D initial[index];
  }

  /* Subtract half the sector angle from Z */
  z -=3D 1 << (CORDIC_BITS-1);

  /* Now do standard CORDIC, with a lot of work already done,=20
     so less repetitions are needed for the same accuracy */

  for(i =3D 0; i < CORDIC_REPS; i++ ) {
    int64_t tx,ty;
    // Add rounding and divide to scale the current vector
    tx =3D x >> (INDEX_BITS+i);
    ty =3D y >> (INDEX_BITS+i);

    // Either add or subtract at right angles
    x -=3D (z > 0 ? ty : -ty);
    y +=3D (z > 0 ? tx : -tx);
    z -=3D (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);
    z <<=3D 1;=20
  }

Advantages of this method
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
If you have fully unrolled it to generate a full value per cycle, you end u=
p with:
- 1 BRAM block used (bad)
- 9 less CORDIC stages (good)
- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycl=
es you end up with:

- 1 BRAM block used (bad)
- 8 or 9 cycles less latency (good)
- no need for the angles[] table (good)
- Less levels of logic, for faster FMAX (good)

For 16-bit values, this could double the number of calculations you can com=
pute at a given clock rate.

You can also tune things some what - you can always throw more BRAM blocks =
at it to reduce the number of CORDIC stages/iterations required, if you hav=
e blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?

Reply by Gene Filatov ●September 5, 20182018-09-05

On 05.09.2018 8:40, Mike Field wrote:
> On Wednesday, 5 September 2018 02:48:41 UTC+12, kkoorndyk  wrote:
>> On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
>>>
>>> I think I've got a really good way to improve a commonly used & well established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of an standard Xilinx IP block, with comparable timing).
>>>
>>> I've verified that the output is correct over the entire range of 32-bit input values.
>>>
>>> I can't find anything similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm, and the optimization is explained it becomes pretty self-evident in retrospect. It just seems the right way to do things.
>>>
>>> What should I do?
>>>
>>> Should I just throw the implementation on a website somewhere as a curiosity?
>>>
>>> Publish it in an article?
>>>
>>> Pass it to a local student to make a paper from it? (I'm not studying at all)
>>>
>>> Attempt to patent and then commercialize it?
>>>
>>> Thanks!
>>>
>>> Mike
>>
>> Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part.  As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.
>>
>> If you want your name more prominently attached to it, I'd suggest posting up on a personal Github account rather than opencores.org which makes you conform to their requirements (such as wishbone interface, etc.).
>>
>> Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg), and their e-magazine Xcell Journal (again, seems to have been discontinued and the Xcell Daily Blog archived)
>>
>> https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive
>> https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog
>>
>> https://www.xilinx.com/about/xcell-publications/xcell-journal.html
>>
>>
>> --Kris
>
> I never though I would agree with Rick, but....
>
> All sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon once I am happy with it. I am removing the last rounding errors.
>
> I've been playing with CORDIC, and have come up with what looks to be an overlooked optimization. I've done a bit of googling, and haven't found anything - maybe it is a novel approach?
>
> I've tested it with 32-bit inputs and outputs, and it is within +/-2, and and average error of around 0.6.I a am sure with a bit more analysis of where the errors are coming from I can get it more accurate.
>
> This has two parts to it, both by themselves seem quite trivial, but complement each other quite nicely.
>
> Scaling Z
> ---------
> 1. The 'z' value in CORDIC uses becomes smaller and smaller as stages increase:
>
> The core of CORDIC for SIN() and COS() is:
>   x = INITIAL;
>   y = INITIAL;
>   for(i = 0; i < CORDIC_REPS; i++ ) {
>     int64_t tx,ty;
>     // divide to scale the current vector
>     tx = x >> (i+1);
>     ty = y >> (i+1);
>
>     // Either add or subtract at right angles to the current
>     x -= (z > 0 ? ty : -ty);
>     y += (z > 0 ? tx : -tx);
>     z -= (z > 0 ? angles[i] : -angles[i]);
> }
>
>
> The value for angle[] is all important, for example:
>
>   angle[0] = 1238021
>   angle[1] = 654136
>   angle[2] = 332050
>   angle[3] = 166670
>   angle[4] = 83415
>   angle[5] = 41718
>   angle[6] = 20860
>   angle[7] = 10430
>   angle[8] = 5215
>   angle[9] = 2607
>   angle[10] = 1303
>   angle[11] = 652
>   angle[12] = 326
>   angle[13] = 163
>   angle[14] = 81
>   angle[15] = 41
>   angle[16] = 20
>   angle[17] = 10
>   angle[18] = 5
>   angle[19] = 3
>   angle[20] = 1
>
> If you make the following change:
>
>   for(i = 0; i < CORDIC_REPS; i++ ) {
>     int64_t tx,ty;
>     // divide to scale the current vector
>     tx = x >> (i+1);
>     ty = y >> (i+1);
>
>     // Either add or subtract at right angles
>     x -= (z > 0 ? ty : -ty);
>     y += (z > 0 ? tx : -tx);
>     z -= (z > 0 ? angles[i] : -angles[i]);
>
>     //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>     z <<= 1; // Double the result of 'z'
>     //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>   }
>
> Then you can use all the bits in angle[], because you can scale by 2^i (this is data from a different set of parameters, hence the values and count is different):
>   angle[0] = 1238021
>   angle[1] = 1308273
>   angle[2] = 1328199
>   angle[3] = 1333354
>   angle[4] = 1334654
>   angle[5] = 1334980
>   angle[6] = 1335061
>   angle[7] = 1335082
>   angle[8] = 1335087
>   angle[9] = 1335088
>   angle[10] = 1335088
>   angle[11] = 1335088
>   angle[12] = 1335088
>   angle[13] = 1335088
>   angle[14] = 1335088
>   angle[15] = 1335088
>   angle[16] = 1335088
>   angle[17] = 1335088
>   angle[18] = 1335088
>   angle[19] = 1335088
>   angle[20] = 1335088
>   angle[21] = 1335088
>   angle[22] = 1335088
>   angle[23] = 1335088
>   angle[24] = 1335088
>   angle[25] = 1335088
>   angle[26] = 1335088
>   angle[27] = 1335088
>   angle[28] = 1335088
>   angle[29] = 1335088
>
> ...and angle[i] rapidly becomes a constant value after the first 9 or 10 iterations. This is what you would expect, as the angle gets smaller and smaller.
>
>
> Part 2: Add a lookup table
> ==========================
> If you split the input into:
>
>   [2 MSB] quadrant
>   [next 9 bits] an lookup table index
>   [the rest] The starting CORDIC Z value, offset by 1<<(num_of_bits-1)
>
> And have a lookup table of 512 x 36-bit values (i.e. a block RAM), which hold the SIN/COS values at the center of the range = e.g. initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1));
>
> Because you need both the SIN() and COS() starting point, you can get them from the same table (screaming out "dual port memory!" to me)
>
> You can then do a standard lookup to get the starting points, 9 cycles into the CORDIC:
>
>   /* Use Dual Port memory for this */
>   if(quadrant & 1) {
>     x = initial[index];
>     y = initial[TABLE_SIZE-1-index];
>   } else {
>     x = initial[TABLE_SIZE-1-index];
>     y = initial[index];
>   }
>
>    /* Subtract half the sector angle from Z */
>    z -= 1 << (CORDIC_BITS-1);
>
>    /* Now do standard CORDIC, with a lot of work already done */
>    ...
>
> This removes ~8 cycles of latency.
>
> The end result
> ==============
> If you combine both of these you can get rid of the angles[] table completely - it is now a constant.
>
>   /* Use Dual Port memory for this */
>   if(quadrant & 1) {
>     x = initial[index];
>     y = initial[TABLE_SIZE-1-index];
>   } else {
>     x = initial[TABLE_SIZE-1-index];
>     y = initial[index];
>   }
>
>   /* Subtract half the sector angle from Z */
>   z -= 1 << (CORDIC_BITS-1);
>
>   /* Now do standard CORDIC, with a lot of work already done,
>      so less repetitions are needed for the same accuracy */
>
>   for(i = 0; i < CORDIC_REPS; i++ ) {
>     int64_t tx,ty;
>     // Add rounding and divide to scale the current vector
>     tx = x >> (INDEX_BITS+i);
>     ty = y >> (INDEX_BITS+i);
>
>     // Either add or subtract at right angles
>     x -= (z > 0 ? ty : -ty);
>     y += (z > 0 ? tx : -tx);
>     z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);
>     z <<= 1;
>   }
>
> Advantages of this method
> =========================
> If you have fully unrolled it to generate a full value per cycle, you end up with:
> - 1 BRAM block used (bad)
> - 9 less CORDIC stages (good)
> - 8 or 9 cycles less latency (good)
>
> For 16-bit values this may only need 5 stages, rather than 14.
>
> If you are trying to minimize area, generating an n-bit value every ~n cycles you end up with:
>
> - 1 BRAM block used (bad)
> - 8 or 9 cycles less latency (good)
> - no need for the angles[] table (good)
> - Less levels of logic, for faster FMAX (good)
>
> For 16-bit values, this could double the number of calculations you can compute at a given clock rate.
>
> You can also tune things some what - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.
>
> What do you think?
>

As far as your "revised" angle[i] converging to a constant is concerned, 
there's a simple explanation using the first two terms of the taylor 
series for the arctan function:

arctan(x) = x - 1/3 * x^3 + o(x^3)

So that

angle[i] = arctan(2^-i) / pi * 2^i = (1/pi) * ( 1 - 1/3 * 2^-(2*i)) + 
o(2^-(2*i))

Based on which you can easily say how many stages of the conventional 
cordic algorithm do you need to skip (i.e. store the outputs in a lookup 
table) for a given bit precision.

I don't know the literature well, but I think it would be cool if you 
actually write an article detailing your approach!

Gene

Reply by Brian Davis ●September 7, 20182018-09-07

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
>
> The implementation completes the same tasks in 2/3rds 
> the cycles and using 2/3rds the resources of an 
> standard Xilinx IP block, with comparable timing).
>
If perchance this is related to your recent CORDIC rotator code, 
I've seen a number of CORDIC optimization schemes over the years 
to reduce the number of rotation stages, IIRC typically either 
by a 'jump start' or merging/optimizing rotation stages.

If I ever manage to find my folder of CORDIC papers, I'll post some links...

-------
some notes on your CORDIC implementation
http://hamsterworks.co.nz/mediawiki/index.php/CORDIC

 - instead of quadrant folding, given I & Q you can do octant folding 
   (0-45 degrees) using the top three bits of the phase

 - if you pass in the bit widths and stages as generics, you can 
   initialize the constant arctan table on-the-fly in a function 
   using VHDL reals

-Brian

Reply by Brian Davis ●September 7, 20182018-09-07

earlier, I wrote:
>
> If perchance this is related to your recent CORDIC rotator code, 
> I've seen a number of CORDIC optimization schemes over the years 
> to reduce the number of rotation stages, IIRC typically either 
> by a 'jump start' or merging/optimizing rotation stages.
> 
oops, for some reason, when first reading this thread I didn't 
see the later posts with the explanation... I'd swear they weren't
there, but maybe I was just scroll-impaired.

-Brian

Previous12 Next

What to do with an improved algorithm?

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Quick Links

About FPGARelated.com

Social Networks

The Related Media Group