Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds of the cycles and using 2/3rds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or by looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike

# What to do with an improved algorithm?

Started by ●September 3, 2018

Posted by ●September 3, 2018

I think the best option is to write an article -- or a patent.

Simply because it's an extra opportunity to verify that your approach is correct.

If there's a hidden mistake, a student might be unable to see it.

Gene

Posted by ●September 3, 2018

I agree with Gene, plus you might consider publishing the IP as open source code on a website of your own or opencores.org.

--Mike

Posted by ●September 4, 2018

I'd publish - since you are not already in the IP licensing/patenting groove I doubt you would make any money from it, but you might gain kudos, which may help your career and business.

Xilinx might want to publish it - which might give it a lot more visibility.

If you have a web site you could put it on that.

MK

Posted by ●September 4, 2018

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting it on a personal GitHub account rather than opencores.org, which makes you conform to their requirements (such as a Wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg) and in their e-magazine Xcell Journal (which, again, seems to have been discontinued, with the Xcell Daily Blog archived):

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive
https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog
https://www.xilinx.com/about/xcell-publications/xcell-journal.html

--Kris

Posted by ●September 4, 2018

On Tuesday, September 4, 2018 at 7:48:41 AM UTC-7, kkoorndyk wrote:
> If you want your name more prominently attached to it, I'd suggest posting up on a personal Github account rather than opencores.org which makes you conform to their requirements (such as wishbone interface, etc.).

OpenCores encourages use of the Wishbone interface for SoC components and they do offer coding guidelines, but there are no requirements for either. For example, in the entire DSP core section there are 38 entries, none of which are marked as Wishbone compliant.

Posted by ●September 5, 2018

I never thought I would agree with Rick, but....

All sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon once I am happy with it. I am removing the last rounding errors.

I've been playing with CORDIC, and have come up with what looks to be an overlooked optimization. I've done a bit of googling, and haven't found anything - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, with an average error of around 0.6. I am sure that with a bit more analysis of where the errors are coming from I can get it more accurate.

This has two parts to it. Both seem quite trivial by themselves, but they complement each other quite nicely.

Scaling Z
---------
The 'z' value that CORDIC uses becomes smaller and smaller as the stages increase.

The core of CORDIC for SIN() and COS() is:

    x = INITIAL;
    y = INITIAL;
    for(i = 0; i < CORDIC_REPS; i++ ) {
        int64_t tx,ty;
        // divide to scale the current vector
        tx = x >> (i+1);
        ty = y >> (i+1);

        // Either add or subtract at right angles to the current vector
        x -= (z > 0 ? ty : -ty);
        y += (z > 0 ? tx : -tx);
        z -= (z > 0 ? angles[i] : -angles[i]);
    }

The values in the angle[] table are all-important, for example:

    angle[0] = 1238021
    angle[1] = 654136
    angle[2] = 332050
    angle[3] = 166670
    angle[4] = 83415
    angle[5] = 41718
    angle[6] = 20860
    angle[7] = 10430
    angle[8] = 5215
    angle[9] = 2607
    angle[10] = 1303
    angle[11] = 652
    angle[12] = 326
    angle[13] = 163
    angle[14] = 81
    angle[15] = 41
    angle[16] = 20
    angle[17] = 10
    angle[18] = 5
    angle[19] = 3
    angle[20] = 1

If you make the following change:

    for(i = 0; i < CORDIC_REPS; i++ ) {
        int64_t tx,ty;
        // divide to scale the current vector
        tx = x >> (i+1);
        ty = y >> (i+1);

        // Either add or subtract at right angles
        x -= (z > 0 ? ty : -ty);
        y += (z > 0 ? tx : -tx);
        z -= (z > 0 ? angles[i] : -angles[i]);

        //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        z <<= 1; // Double the result of 'z'
        //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    }

Then you can use all the bits in angle[], because you can scale by 2^i (this is data from a different set of parameters, hence the values and count are different):

    angle[0] = 1238021
    angle[1] = 1308273
    angle[2] = 1328199
    angle[3] = 1333354
    angle[4] = 1334654
    angle[5] = 1334980
    angle[6] = 1335061
    angle[7] = 1335082
    angle[8] = 1335087
    angle[9] = 1335088
    angle[10] = 1335088
    angle[11] = 1335088
    angle[12] = 1335088
    angle[13] = 1335088
    angle[14] = 1335088
    angle[15] = 1335088
    angle[16] = 1335088
    angle[17] = 1335088
    angle[18] = 1335088
    angle[19] = 1335088
    angle[20] = 1335088
    angle[21] = 1335088
    angle[22] = 1335088
    angle[23] = 1335088
    angle[24] = 1335088
    angle[25] = 1335088
    angle[26] = 1335088
    angle[27] = 1335088
    angle[28] = 1335088
    angle[29] = 1335088

...and angle[i] rapidly becomes a constant value after the first 9 or 10 iterations. This is what you would expect, as the angle gets smaller and smaller.

Part 2: Add a lookup table
==========================
If you split the input into:

    [2 MSB]        quadrant
    [next 9 bits]  a lookup table index
    [the rest]     the starting CORDIC Z value, offset by 1<<(num_of_bits-1)

and have a lookup table of 512 x 36-bit values (i.e. a block RAM), which holds the SIN/COS values at the center of each range, e.g.

    initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1));

then, because you need both the SIN() and COS() starting points, you can get them from the same table (screaming out "dual port memory!" to me).

You can then do a standard lookup to get the starting points, 9 cycles into the CORDIC:

    /* Use Dual Port memory for this */
    if(quadrant & 1) {
        x = initial[index];
        y = initial[TABLE_SIZE-1-index];
    } else {
        x = initial[TABLE_SIZE-1-index];
        y = initial[index];
    }

    /* Subtract half the sector angle from Z */
    z -= 1 << (CORDIC_BITS-1);

    /* Now do standard CORDIC, with a lot of work already done */
    ...
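The three-way split of the input can be sketched in C as below. This is only an illustration under assumptions not stated in the post: a 32-bit phase word covering the full circle, so that 21 bits remain for z, and the names `phase_fields` and `split_phase` are made up for the example.

```c
#include <stdint.h>

#define INDEX_BITS  9                        /* 512-entry table, as in the post */
#define CORDIC_BITS (32 - 2 - INDEX_BITS)    /* assumed: 21 bits left for z */

/* Hypothetical container for the three fields of the phase word. */
typedef struct {
    unsigned quadrant;  /* top 2 bits */
    unsigned index;     /* next 9 bits: lookup table index */
    int32_t  z;         /* remaining bits, re-centred on the middle of the sector */
} phase_fields;

phase_fields split_phase(uint32_t phase)
{
    phase_fields f;
    f.quadrant = phase >> 30;
    f.index    = (phase >> CORDIC_BITS) & ((1u << INDEX_BITS) - 1);
    /* Subtract half the sector so z is signed and measured from the
       centre of the sector, which is where the table entry sits. */
    f.z = (int32_t)(phase & ((1u << CORDIC_BITS) - 1)) - (1 << (CORDIC_BITS - 1));
    return f;
}
```

With this layout z always lies in [-2^20, 2^20), i.e. within half a sector either side of the table entry.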
Starting from the table lookup removes ~8 cycles of latency.

The end result
==============
If you combine both of these you can get rid of the angles[] table completely - it is now a constant.

    /* Use Dual Port memory for this */
    if(quadrant & 1) {
        x = initial[index];
        y = initial[TABLE_SIZE-1-index];
    } else {
        x = initial[TABLE_SIZE-1-index];
        y = initial[index];
    }

    /* Subtract half the sector angle from Z */
    z -= 1 << (CORDIC_BITS-1);

    /* Now do standard CORDIC, with a lot of work already done,
       so fewer repetitions are needed for the same accuracy */

    for(i = 0; i < CORDIC_REPS; i++ ) {
        int64_t tx,ty;
        // Add rounding and divide to scale the current vector
        tx = x >> (INDEX_BITS+i);
        ty = y >> (INDEX_BITS+i);

        // Either add or subtract at right angles
        x -= (z > 0 ? ty : -ty);
        y += (z > 0 ? tx : -tx);
        z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);
        z <<= 1;
    }

Advantages of this method
=========================
If you have fully unrolled it to generate a full value per cycle, you end up with:

- 1 BRAM block used (bad)
- 9 fewer CORDIC stages (good)
- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycles, you end up with:

- 1 BRAM block used (bad)
- 8 or 9 cycles less latency (good)
- no need for the angles[] table (good)
- Fewer levels of logic, for faster FMAX (good)

For 16-bit values, this could double the number of calculations you can compute at a given clock rate.

You can also tune things somewhat - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?

Posted by ●September 5, 2018

As far as your "revised" angle[i] converging to a constant is concerned, there's a simple explanation using the first two terms of the Taylor series for the arctan function:

    arctan(x) = x - 1/3 * x^3 + o(x^3)

So that

    angle[i] = arctan(2^-i) / pi * 2^i = (1/pi) * (1 - 1/3 * 2^-(2*i)) + o(2^-(2*i))

Based on this, you can easily say how many stages of the conventional CORDIC algorithm you need to skip (i.e. store the outputs in a lookup table) for a given bit precision.

I don't know the literature well, but I think it would be cool if you actually wrote an article detailing your approach!

Gene
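The estimate above can be turned into a rough rule for where the scaled table stops changing. The sketch below is an illustration under assumptions rather than anything from the thread: it treats an entry as settled once the correction term, limit * 2^-(2k) / 3 for a stage with shift k, drops below half an LSB, and the function name `first_constant_shift` is invented for the example. Exactly where rounding flips also depends on the fractional part of the limit, so treat the result as approximate.

```c
#include <stdint.h>

/* Smallest shift k for which limit * 2^-(2k) / 3 < 1/2,
   i.e. 2 * limit < 3 * 4^k. 'limit' is the value the scaled
   angle table converges to (1335088 in the table posted earlier). */
int first_constant_shift(uint64_t limit)
{
    int k = 0;
    uint64_t four_pow_k = 1;              /* holds 4^k */
    while (2 * limit >= 3 * four_pow_k) {
        four_pow_k *= 4;
        k++;
    }
    return k;
}
```

For the posted table, `first_constant_shift(1335088)` gives 10. Since entry i in that code uses a shift of i+1, this corresponds to angle[9], which is the first entry listed as 1335088.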

Posted by ●September 7, 2018

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> The implementation completes the same tasks in 2/3rds
> the cycles and using 2/3rds the resources of an
> standard Xilinx IP block, with comparable timing).

If perchance this is related to your recent CORDIC rotator code, I've seen a number of CORDIC optimization schemes over the years to reduce the number of rotation stages, IIRC typically either by a 'jump start' or merging/optimizing rotation stages.

If I ever manage to find my folder of CORDIC papers, I'll post some links...

-------
some notes on your CORDIC implementation
http://hamsterworks.co.nz/mediawiki/index.php/CORDIC

- instead of quadrant folding, given I & Q you can do octant folding (0-45 degrees) using the top three bits of the phase

- if you pass in the bit widths and stages as generics, you can initialize the constant arctan table on-the-fly in a function using VHDL reals

-Brian
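The octant-folding note can be illustrated roughly as follows. This is a sketch under assumptions (a 32-bit phase word with full circle = 2^32, applied to sin/cos generation, and the made-up names `fold_octant` and `unfold_octant`), not code from the linked page: the top three bits pick the octant, the rest is mirrored into 0-45 degrees, and the cos/sin of the folded angle are swapped and negated to recover the full-circle values.

```c
#include <stdint.h>

/* Fold a 32-bit phase into the first octant.
   Returns the folded phase (0 .. 2^29, where 2^29 = 45 degrees)
   and stores the octant number (0..7) in *octant. */
uint32_t fold_octant(uint32_t phase, unsigned *octant)
{
    uint32_t rem = phase & 0x1FFFFFFFu;      /* position inside the octant */
    *octant = phase >> 29;                   /* top three bits */
    /* Odd octants are mirrored so the angle is measured from the nearer axis. */
    return (*octant & 1) ? (0x20000000u - rem) : rem;
}

/* Given cos/sin of the folded angle, recover cos/sin of the original angle. */
void unfold_octant(unsigned octant, int64_t c, int64_t s,
                   int64_t *cos_out, int64_t *sin_out)
{
    switch (octant & 7) {
    case 0:  *cos_out =  c; *sin_out =  s; break;
    case 1:  *cos_out =  s; *sin_out =  c; break;
    case 2:  *cos_out = -s; *sin_out =  c; break;
    case 3:  *cos_out = -c; *sin_out =  s; break;
    case 4:  *cos_out = -c; *sin_out = -s; break;
    case 5:  *cos_out = -s; *sin_out = -c; break;
    case 6:  *cos_out =  s; *sin_out = -c; break;
    default: *cos_out =  c; *sin_out = -s; break;
    }
}
```

The core then only ever has to handle angles between 0 and 45 degrees, at the cost of the small swap/negate step on the way out.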

Posted by ●September 7, 2018

earlier, I wrote:
> If perchance this is related to your recent CORDIC rotator code,
> I've seen a number of CORDIC optimization schemes over the years
> to reduce the number of rotation stages, IIRC typically either
> by a 'jump start' or merging/optimizing rotation stages.

oops, for some reason, when first reading this thread I didn't see the later posts with the explanation...

I'd swear they weren't there, but maybe I was just scroll-impaired.

-Brian