FPGARelated.com
Forums

Xilinx Square Root Unit

Started by Robin Bruce March 21, 2006
Hi guys,

I was wondering if anyone here knows the technique that the Xilinx
floating-point square-root core employs to get its results. I ask
because we've got a unit that does the same job with near-identical
latency and clock speed but uses far less resource (330 slices, versus
Xilinx's 624 slices for single-precision accuracy).

Interestingly my paper on this unit was rejected from FCCM on the
grounds that "more recent
developments in the field of square root calculation" have been made.
Could anyone cast any light on what these are? The algorithm I used is
400 years old, but why should that make any difference if it provides
the correct answer for less resource use?

If anyone knows of a unit that provides pipelined single-precision
square-root calculations over the full range possible inputs, I'd be
interested to hear about it.

Thanks for your attention,

Robin Bruce

Robin Bruce wrote:

> I was wondering if anyone here knows the technique that the Xilinx > floating-point square-root core employs to get its results. I ask > because we've got a unit that does the same job with near-identical > latency and clock speed but uses far less resource (330 slices, versus > Xilinx's 624 slices for single-precision accuracy). > Interestingly my paper on this unit was rejected from FCCM on the > grounds that "more recent > developments in the field of square root calculation" have been made.
Perhaps someone other than Xilinx has a better one.
> Could anyone cast any light on what these are? The algorithm I used is > 400 years old, but why should that make any difference if it provides > the correct answer for less resource use?
No difference at all if your goal is to use a square root on an fpga. If your goal is research existing algorithms and find a better one, then do that.
> If anyone knows of a unit that provides pipelined single-precision > square-root calculations over the full range possible inputs, I'd be > interested to hear about it.
Here's some related discussion: http://groups.google.com/groups?q=vhdl+square+root -- Mike Treseler
I posted a few hours ago but it has yet to show so I'll try again. Most
modern square root algorithms do multiple bits per clock cycle. There
was a great paper out of Thailand (I believe) a few years back with a
very simple square root algorithm that does two bits per clock cycle. I
was unable to locate the paper but I did find some C code I used to
test it. Using it I was able to do a fine FP square root operator using
half the resources of Xilinx and having half the latency for both
pipelined and unpipelined versions. I actually think I could cut the
latency in half again if I worked at it and used just a few more
resources to do two adds in a clock cycle. If anyone recognizes the C
code below, please post the source!

DWORD n = 90; //input 8 bits worth: 90 should give 9r9
DWORD bl = sizeof(n)*8;//bitlen
DWORD hibit = 1<<(bl-1);
WORD s = 0;
DWORD r = 0;//actually sizeof(s) + sign bit
for(int i = (bl>>1)-1;i >=0; i--){
	if(!(r & hibit)){
		r = (r<<2) | (n>>(i<<1)&3);
		r -= ((s<<2) | 1);
	}
	else{
		r = (r<<2) | (n>>(i<<1)&3);
		r += ((s<<2) | 3);
	}
	if(!(r & hibit))
		s = (s<<1) | 1;
	else
		s = s<<1;
}
if((r & hibit)) r += ((s<<1) | 1);

Hi,
I am interested in the algorithm too.

By googling, the following address shows 16,700 articles.

http://scholar.google.com/scholar?as_q=&num=100&btnG=Search+Scholar&as_epq=square+root&as_oq=&as_eq=&as_occt=title&as_sauthors=&as_publication=&as_ylo=2000&as_yhi=&as_allsubj=some&as_subj=eng&hl=en&lr=

Weng

Would it be this paper?

http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/proceedings/&toc=comp/proceedings/fccm/1997/8159/00/8159toc.xml&DOI=10.1109/FPGA.1997.624623

Good post, thanks.

Robin

Hi Robin,
It is not the one Brannon mentioned. The paper you list was published
in 1997.

Weng

I found the correct paper:

http://scholar.google.com/url?sa=U&q=http://www.cp.eng.chula.ac.th/~krerk/publication/iscit-sqrt.pdf

Entitled "an FPGA implementation of a fixed-point square root
operation". Of course it works for floating point too; you just run it
on the mantissa.

Hi Brannon,
Thank you for your help.

It is great for several people in a group with great interest in an
algorithm and their cooperation rediscovers its best implementation
published somewhere and then shares it with everyone.

Weng

Hi Robin,

"Robin Bruce" <robin.bruce@gmail.com> wrote in message
news:1142947226.866898.178060@e56g2000cwe.googlegroups.com...
> Hi guys,
> I was wondering if anyone here knows the technique that the Xilinx > floating-point square-root core employs to get its results. > I ask because we've got a unit that does the same job with near-identical > latency and clock speed but uses far less resource (330 slices, versus > Xilinx's 624 slices for single-precision accuracy).
I expect that it is very similar to the algorithm your own unit is using. I don't think I can go into details though. As you will see from the product datasheet, this core is based on IP that Xilinx licensed from QinetiQ (UK) Ltd. We have been working on improvements since we brought this in-house, but the focus has been on the more "mainstream" floating-point operators (e.g. add, multiply). So it's good to know that there is at least some interest in this square-root core too! To cut a long story short: we know where these inefficiencies are, and our core should indeed be a fair bit smaller when fully pipelined than it currently is. The engineering team is aware of this so there should be a further-optimized version available in a future IP release (watch this space). Incidentally, how come you have so many square-roots to do? We racked our brains and couldn't think of a "killer app" that would require a ~300MHz fully pipelined square-root unit. Perhaps this use case is more common than we thought? Thanks a lot for your feedback in any case. Cheers, -Ben- (@xilinx)
Hi Ben,

>From asking around my guess would be that you guys are doing some kind
of SRT division related technique. I'm about a million miles away from being an expert on this, but it was my understanding that this technique was best suited to the case where a division unit and a square root unit are sharing resources. <bold> I might be wrong on that though <\bold>. The technique one of my esteemed colleagues settled on was derived from Napier's bones. From what I can tell so far the technique performs well in terms of clock frequency and resource use in comparison to other techniques being deployed. It has a similar latency to the core that you guys have developed, but I understand now that there are other techniques out there that can reduce latency by calculating two bits of the result per cycle. I imagine this would reduce the clock frequency possible when you're fully-pipelined. Not to take anything away from what you guys have done though, when you've made such an effort to allow the cores to be customisable to the degree that they are, then it's perfectly understandable that they're not fully optimised in every way :) The interest I've got in the square root core is from the perspective of wanting to provide a design environment to HPC programmers who want to use FPGAs to get serious computational speed-up. For that I need fully pipelined floating-point cores for single and double precision. These floating point cores would then be linked together to implement specific HPC functions that are maximally pipelined. Given that data sets are going to have to be big to justify the use of FPGAs, and that the functions will be pipelined, the cores' main qualities should be, in order of importance: 1) High Max Clock Frequency - 2) Low Resource Use 3) Low Latency (A distant third) Latency isn't so much of a concern, as for a pipelined HPC function with a large data set, the contribution of the core latencies to total computation time should be such a tiny fraction as to make it insignificant in a lot of cases. Anyway, I think I've rambled on a bit. The point is to say that I don't know what the application is either just yet, but I'm taking a "Build it and they will come" style approach. How much are you guys considering the emerging Reconfigurable HPC crowd? I would have thought that given your experience with the floating-point cores it would be right up your street. Cheers, Robin (@Nallatech)