Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or by looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all.)

Attempt to patent and then commercialize it?

Thanks!

Mike

# What to do with an improved algorithm?

Started by ●September 3, 2018

Reply by ●September 3, 2018

I think the best option is to write an article -- or a patent. Simply because it's an extra opportunity to verify that your approach is correct. If there's a hidden mistake, a student might be unable to see it.

Gene

On 03.09.2018 13:17, Mike Field wrote:
> [...]

Reply by ●September 3, 2018

I agree with Gene, plus you might consider publishing the IP as open source code on a website of your own or opencores.org.

--Mike

On Monday, September 3, 2018 at 3:41:02 AM UTC-7, Gene Filatov wrote:
> [...]

Reply by ●September 4, 2018

On 03/09/2018 11:17, Mike Field wrote:
> [...]

I'd publish - since you are not already in the IP licensing/patenting groove I doubt you would make any money from it, but you might gain kudos, which may help your career and business. Xilinx might want to publish it, which could give it a lot more visibility. If you have a web site you could put it on that.

MK

Reply by ●September 4, 2018

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> [...]

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting it up on a personal GitHub account rather than opencores.org, which makes you conform to their requirements (such as the Wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg), and their e-magazine Xcell Journal (again, seems to have been discontinued and the Xcell Daily Blog archived):

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive
https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog
https://www.xilinx.com/about/xcell-publications/xcell-journal.html

--Kris

Reply by ●September 4, 2018

On Tuesday, September 4, 2018 at 7:48:41 AM UTC-7, kkoorndyk wrote:
> If you want your name more prominently attached to it, I'd suggest posting it up on a personal Github account rather than opencores.org which makes you conform to their requirements (such as wishbone interface, etc.).

OpenCores encourages use of the Wishbone interface for SoC components, and they do offer coding guidelines, but there are no requirements for either. For example, in the entire DSP core section there are 38 entries, none of which are marked as Wishbone compliant.

Reply by ●September 5, 2018

On Wednesday, 5 September 2018 02:48:41 UTC+12, kkoorndyk wrote:
> [...]

I never thought I would agree with Rick, but... it all sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon, once I am happy with it - I am removing the last rounding errors.

I've been playing with CORDIC, and have come up with what looks to be an overlooked optimization. I've done a bit of googling and haven't found anything - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, with an average error of around 0.6. I am sure that with a bit more analysis of where the errors are coming from I can get it more accurate.

This has two parts. Both by themselves seem quite trivial, but they complement each other quite nicely.

Scaling Z
---------

The 'z' value in CORDIC becomes smaller and smaller as the stages progress. The core of CORDIC for SIN() and COS() is:

```c
x = INITIAL;
y = INITIAL;
for (i = 0; i < CORDIC_REPS; i++) {
    int64_t tx, ty;
    // divide to scale the current vector
    tx = x >> (i + 1);
    ty = y >> (i + 1);

    // Either add or subtract at right angles to the current vector
    x -= (z > 0 ? ty : -ty);
    y += (z > 0 ? tx : -tx);
    z -= (z > 0 ? angles[i] : -angles[i]);
}
```

The value of angles[] is all important, for example:

    angle[0]  = 1238021
    angle[1]  = 654136
    angle[2]  = 332050
    angle[3]  = 166670
    angle[4]  = 83415
    angle[5]  = 41718
    angle[6]  = 20860
    angle[7]  = 10430
    angle[8]  = 5215
    angle[9]  = 2607
    angle[10] = 1303
    angle[11] = 652
    angle[12] = 326
    angle[13] = 163
    angle[14] = 81
    angle[15] = 41
    angle[16] = 20
    angle[17] = 10
    angle[18] = 5
    angle[19] = 3
    angle[20] = 1

If you make the following change:

```c
for (i = 0; i < CORDIC_REPS; i++) {
    int64_t tx, ty;
    // divide to scale the current vector
    tx = x >> (i + 1);
    ty = y >> (i + 1);

    // Either add or subtract at right angles
    x -= (z > 0 ? ty : -ty);
    y += (z > 0 ? tx : -tx);
    z -= (z > 0 ? angles[i] : -angles[i]);

    //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    z <<= 1;  // Double the result of 'z'
    //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
}
```

then you can use all the bits in angle[], because each entry can be scaled by 2^i (this is data from a different set of parameters, hence the values and count are different):

    angle[0]  = 1238021
    angle[1]  = 1308273
    angle[2]  = 1328199
    angle[3]  = 1333354
    angle[4]  = 1334654
    angle[5]  = 1334980
    angle[6]  = 1335061
    angle[7]  = 1335082
    angle[8]  = 1335087
    angle[9]  = 1335088
    angle[10] = 1335088
    angle[11] = 1335088
    angle[12] = 1335088
    angle[13] = 1335088
    angle[14] = 1335088
    angle[15] = 1335088
    angle[16] = 1335088
    angle[17] = 1335088
    angle[18] = 1335088
    angle[19] = 1335088
    angle[20] = 1335088
    angle[21] = 1335088
    angle[22] = 1335088
    angle[23] = 1335088
    angle[24] = 1335088
    angle[25] = 1335088
    angle[26] = 1335088
    angle[27] = 1335088
    angle[28] = 1335088
    angle[29] = 1335088

...and angle[i] rapidly becomes a constant after the first 9 or 10 iterations. This is what you would expect, as the rotation angle gets smaller and smaller.

Part 2: Add a lookup table
==========================

If you split the input into:

    [2 MSB]       quadrant
    [next 9 bits] a lookup table index
    [the rest]    the starting CORDIC Z value, offset by 1 << (num_of_bits - 1)

and have a lookup table of 512 x 36-bit values (i.e. a block RAM), which holds the SIN/COS values at the center of each range - e.g. initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1)).

Because you need both the SIN() and COS() starting points, you can get them from the same table (screaming out "dual port memory!" to me).

You can then do a standard lookup to get the starting points, jumping 9 cycles into the CORDIC:

```c
/* Use Dual Port memory for this */
if (quadrant & 1) {
    x = initial[index];
    y = initial[TABLE_SIZE - 1 - index];
} else {
    x = initial[TABLE_SIZE - 1 - index];
    y = initial[index];
}

/* Subtract half the sector angle from Z */
z -= 1 << (CORDIC_BITS - 1);

/* Now do standard CORDIC, with a lot of work already done */
...
```

This removes ~8 cycles of latency.

The end result
==============

If you combine both of these, you can get rid of the angles[] table completely - it is now a constant:

```c
/* Use Dual Port memory for this */
if (quadrant & 1) {
    x = initial[index];
    y = initial[TABLE_SIZE - 1 - index];
} else {
    x = initial[TABLE_SIZE - 1 - index];
    y = initial[index];
}

/* Subtract half the sector angle from Z */
z -= 1 << (CORDIC_BITS - 1);

/* Now do standard CORDIC, with a lot of work already done,
   so fewer repetitions are needed for the same accuracy */
for (i = 0; i < CORDIC_REPS; i++) {
    int64_t tx, ty;
    // Add rounding and divide to scale the current vector
    tx = x >> (INDEX_BITS + i);
    ty = y >> (INDEX_BITS + i);

    // Either add or subtract at right angles
    x -= (z > 0 ? ty : -ty);
    y += (z > 0 ? tx : -tx);
    z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);
    z <<= 1;
}
```

Advantages of this method
=========================

If you fully unroll it to generate a full value per cycle, you end up with:

- 1 BRAM block used (bad)
- 9 fewer CORDIC stages (good)
- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycles, you end up with:

- 1 BRAM block used (bad)
- 8 or 9 cycles less latency (good)
- no need for the angles[] table (good)
- fewer levels of logic, for a faster FMAX (good)

For 16-bit values, this could double the number of calculations you can compute at a given clock rate.

You can also tune things somewhat - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?

Reply by ●September 5, 2018

On 05.09.2018 8:40, Mike Field wrote:
> [...]
> What do you think?

As far as your "revised" angle[i] converging to a constant is concerned, there's a simple explanation using the first two terms of the Taylor series for the arctan function:

    arctan(x) = x - (1/3)*x^3 + o(x^3)

so that

    angle[i] = arctan(2^-i) / pi * 2^i
             = (1/pi) * (1 - (1/3)*2^-(2*i)) + o(2^-(2*i))

Based on this, you can easily say how many stages of the conventional CORDIC algorithm you need to skip (i.e. store their outputs in a lookup table) for a given bit precision.

I don't know the literature well, but I think it would be cool if you actually wrote an article detailing your approach!

Gene

Reply by ●September 7, 2018

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> The implementation completes the same tasks in 2/3rds
> the cycles and using 2/3rds the resources of a
> standard Xilinx IP block, with comparable timing.

If perchance this is related to your recent CORDIC rotator code, I've seen a number of CORDIC optimization schemes over the years to reduce the number of rotation stages, IIRC typically either by a 'jump start' or by merging/optimizing rotation stages. If I ever manage to find my folder of CORDIC papers, I'll post some links...

-------

Some notes on your CORDIC implementation (http://hamsterworks.co.nz/mediawiki/index.php/CORDIC):

- instead of quadrant folding, given I & Q you can do octant folding (0-45 degrees) using the top three bits of the phase
- if you pass in the bit widths and stages as generics, you can initialize the constant arctan table on-the-fly in a function using VHDL reals

-Brian

Reply by ●September 7, 2018

Earlier, I wrote:
> If perchance this is related to your recent CORDIC rotator code, I've seen a number of CORDIC optimization schemes over the years to reduce the number of rotation stages, IIRC typically either by a 'jump start' or merging/optimizing rotation stages.

Oops - for some reason, when first reading this thread I didn't see the later posts with the explanation... I'd swear they weren't there, but maybe I was just scroll-impaired.

-Brian