FPGARelated.com
Forums

proth siever in FPGA?

Started by Alex June 28, 2005
Hello,

I know FPGA's are not suitable for sieving large numbers however
one set of numbers (Proth numbers, related to the Sierpinski
conjecture) I think would fit an FPGA quite well.

Sorry if this is a relatively basic or silly question but I'm
a novice with FPGAs although I've been reading up on them and
playing around with the software.

I'd be dealing with numbers in the 32 to 64 bit range (current
limits of sieving are around 50 bit numbers) and hoping to put
many sievers on one FPGA (I have a Spartan-3 dev kit with
200k gates).

The basic algorithm is this:-

p - 64 bit number (a prime) although top bit will never be set
r - 64 bit number (0 <= r < p)
maxn - 50,000,000 but easily replaced with 2^26 if easier

The above would be supplied by a controlling computer via
parallel or RS232 connection and stored somewhere in the 1M-byte
of RAM. The controlling computer would make sure there are plenty
of these values available in memory for the sievers to grab.

An individual siever would:-

grab p and r from memory and mark this as 'in progress'.
set initial_r=r /* r being another 64 bit var */
set n=1 /* n is a 64 bit counter */
while( n <= maxn ) {
  if( r == 0 ) {
     stop. shove n back into memory and mark result as done(1)
  }
  r=r<<1; /* multiply by 2 */
  r--; /* decrement r by 1. r will always be > 1 so no need for
          bounds checking */
  if( r >= p ) {
     r-=p; /* again, always +ve result */
  }
  n++; /* increment n by one */
  if( r == initial_r ) { /* comparison */
     stop. shove n back into memory and mark result as done(2)
  }
}

Obviously I'd like as many sieve units as possible on the final
FPGA but I'd be looking to implement just one at first (without
the control unit), then the control function, then multiple
sievers.

Sound feasible for an FPGA?
Sound feasible for an FPGA novice :-) ?

Ta,

-Alex

Hi Alex,

> Sound feasible for an FPGA?
The single Siever won't be much of a problem, but you will ultimately be limited by your memory bandwidth if you want to implement multiple Sievers.
> Sound feasible for an FPGA novice :-) ?
Oh, definitely doable. Apart from the ultra-basics it will teach you FSM design, pipelining and other useful stuff. Later on you should be able to tack an embedded CPU (preferrably NIOS due to its easy bus structure) to the Siever(s). However, please use a simple SRAM. They're more expensive, but at least you don't have to complicate things by having to write the SDRAM state machine as well. Good luck! Ben
Thanks Ben!

>From my existing testing the inner loop will be iterated all 50,000,000
times on >99.9% of runs. Even if I do one loop per clock cycle (highly unrealistic) on the 50MHz Spartan-3 I' looking at one test per second per siever, and only 512bits going in and out of the siever per test. There's no memory access whilst in the main loop so I mind if it's a bit slow grabbing data or returning results. As for the memory the main chip (XC3S200FT256) has 12 18kbit block RAM (216K bits) the board comes with a 1M-Byte SRAM (in two 256Kx16 chips) on the back. No SDRAM to worry about. Ta. -Alex
Slightly off topic since I don't know this proth sieve

If you are interested in trully huge primes, perhaps the mersenne
primes search might be something to consider as background material.

If you google comp arch group for FPGA Mersenne you should find a
recent (1-2 month or so) mention about an engine that did huge 64 bit
int FFTs on I think 2^24 points, this is used to do huge number
arithmetic in shorter order. It used 384 18.18 muls on biggest V2Pro to
drive the FFT kernal. The 64b math was needed to prevent loss of
accuracy to do these huge muls needed for prime checking. The math was
mostly beyond me but the FPGA HW was quite interesting.

Ofcourse it turns out the memory management was the real problem here,
if you can imagine doing a 2^24 pt 64 precision FFT.

This must be some sort of Uni assignment right?
better than what I was offered

johnjakson at usa dot com

Hi Alex,
I don't know anything about "seiving" but  it sounds like a good first
project - the operations inside your loop are straight forward to
implement in FPGA logic (comparison/shifting/add/subtract). Although
you didn't state it, I'm assuming the overall goal is to achieve the
fastest throughtput possible given the starter kit hardware
constaints?, i.e. maximise sieved numbers per second.

Why do you feel you can't do one iteration of that loop per clock
cycle? If you pipeline your logic you will get through the loop once
per clock cycle, but there will be a latency of 2,3, 4 etc clock cycles
depending on how deeply you need to pipeline.  You should be able to
achieve well above 50MHz, the max clock rate will ultimately depend on
how deeply you pipeline.

I don't necessarily agree with Ben's statement that the memory
(external SRAM?) bandwidth will be the limiting factor on throughput.
Say each siever uses 100 slices and a XC3S200 provides 1920 slices,
then you can have 19 sievers operating in parallel.  If each siever
takes 1 second to complete the algorithm then every second you need 19x
( p read, r read, n write) which is a trivial amount of bandwidth.

If speed is your goal then optimise the algorithm first (although I
guess you will need some experience and knowledge of the hardware/FPGA
before you can do this). Maybe I didn't understand the algorithm, but
if the loop runs to completion 99.9% of the time then there must be a
way of exploiting this? Wouldn't it be better to first do a coarse run
through by going e.g r<<2, and  n+2 etc Well, make it as "coarse" as
possible. 

Regards
Andrew

Thanks for the comments (Andrew & JJ). Let me explain a bit.
You asked for this! :-)

I don't need an FFT because I don't need to multiply numbers.
The sieving can be performed on 'small' integers (64-bits
gives a very large scope) using addition, subtraction, left
shifting by one bit and comparisons.

No multiplication and no floating-point makes it much
easier.

Proth sieving has some unique characteristics that make it
different from the Mersenne primes research.

The main goal is to maximise numbers sieved based on time.
This splits into two seperate goals:-
- Maximising the speed of each individual siever.
- Maximising the number of sievers on the FPGA.

Obviously pipelining may be required to complete one
iteration of the loop within one clock cycle, but then that
may increase the number of slices used by each siever,
bringing down the number of sievers that I can place on the
FPGA.

I'm playing around with a Spartan-3 starter kit as that is
the only platform I have for development. This is a hobby
project (not University project), if I can get reasonable
performance then I may consider moving up to a larger FPGA
with more gates (to get more sievers).

I'll only ever be working with dev kits. I don't intend
(unless the performance is stunning) that I'd ever consider
to build anything with multiple FPGAs, although my inner-geek
would love to have a box with lots of blinkenlights and
serious amounts of processing power.

The 50MHz comment was purely my brain not working. I had
glanced at the description and the mentioning of a 50MHz
oscillator had stuck in my brain. I then assumed that the
final solution would be limited to 50MHz. I've no idea
how quick an XC3S200 can actually run, or would run with
this 'design'. (could anyone give a rough guess?)

As for the actual problem I want to solve, here's a 'quick'
description of Proth sieving and why I want to do it.

Proth numbers are numbers of the form k*2^n+1 where
2^n > k. k should be odd (for obvious reasons).

A Sierpinski number (of the second kind) is a value of k
so that k*2^n+1 is composite for all k.

k=78557 will always produce composite numbers for any
positive n you pick. This has been proven.

To prove that a k is not a Sierpinski number one must find a
value for n such that k*2^n+1 produces a prime number.

The Sierpinski conjecture is that 78557 is the lowest value
of k possible. Of the 39278 other odd candidates, primes
have been found for all but 9.

The conjecture is regarded as true by most people and so
primes for the other 9 k-values are expected to be found.

There's more on the Sierpinski problem here:-
    http://www.seventeenorbust.com/
(That URL looks a bit suspect but trust me on this. There
used to be 17 candidate k-values but the above project has
found primes for 8 of them.)

The above site is running a distributed attack (a la SETI,
Cancer@Home, etc) to find primes.

The numbers involved are extremely large. The most recent
prime was found for k=27653 at n=9167433.

That's 27653*2^9167433+1, a nice 2,759,677 digits in decimal.

Trying to work with those kinds of things on a FPGA would be
madness.

Primality of these numbers can be proven with a Proth Test.

http://mathworld.wolfram.com/ can be useful too if you want
to read up any more on this.

Trying to prove primality of 2.7 million digit numbers is not
an easy task. It can take a month on a 3GHz Intel P4. The
equation involved is:
    let N=k*2^n+1
and if:-
    a^((N-1)/2) mod N = N-1
then N is prime. The value of a used is usually the lowest
odd prime that is not a factor of k. Since the remaining
k-values being tested are all prime then a=3 usually works.

Now consider the info above. In the above case N is a 2.7
million digit number. (N-1)/2 may have the same, or one digit
less than N. Still an insanely large number.

3^99 (a 2 digit number) results in a 49 digit number.
3^999 (a 3 digit number) results in a 478 digit number.

Now consider 3^N where N has 2.7 million digits.

Big.

Anyway, on to the sieving.

So the project has produced a large number of k,n pairs that
need to be tested. Each Proth test takes a long time.

Sieving is the process of removing candidate numbers by trial
division with known primes. For example, one of my computers
is checking the range of primes from
   882000000000000 to 883000000000000
and, amongst others, it's found that 882347066940847 is a
factor of 24737*2^80174983+1.

It has found 17 unique factors in the week that it has been
running. That's 17 k,n pairs that do not need to be proth
tested. Those proth tests would have run for 17 months on
the same machine only to not find any primes.

This is the value of sieving.

OK, so an FPGA won't be good at dealing with large (i.e.
10 million bit) numbers. So why am I here clogging up this
newsgroup with my ramblings?

It's because Proth numbers (k*2^n+1) have some properties
that make sieving easy.

Say I'm checking when k=10223 and my prime of choice is
p=882319864863617.

I compute the remainder of (k*2)+1 mod p. Which, when
testing such large primes, is simply (k*2)+1.

r = 10223*2+1 = 20447 (that's for n=1)

Starting with (k*2^n)+1
- subtract the 1 we added on =k*2^n
- double the number (k*2^n)*2 = k*2^(n+1)
- add one = (k*2^(n+1))+1

More importantly, these actions can be performed on the
remainder without having to keep track of the actual value
of k*2^n+1. So:-

r_next = (2*(r-1))+1 mod p = 2r-1 mod p

The 'mod p' part is quite expensive. Since we are only
dealing with a doubling of the previous r (which we knew
was guaranteed to be less than p) we know that, at most,
r could grow to less than 2p. This is why 'mod p' is
replaced by a simple subtraction of p if r >= p.

If r is ever equal to 1 then it will always remain 1, we
can stop processing (for the FPGA I wouldn't even bother
passing this on as a work unit).

If r is ever equal to 0 then it means that p is factor of
k*2^n+1 at whatever n the algorithm has iterated to. We can
stop here and record the value of n.

So with relatively little effort we can check for factors
for very large values of n without having to keep track of
the large (multi-million digit) values. The snag is that
we have to compute r for every n along the way.

This is why I can't run a coarser run by skipping values.
Each iteration must be performed and the value of r checked.
There is no way to predict future values of r without
computing them.

There's another stop condition, if we ever get back to the
r value we started at (when n=1) then we know we can stop.
The r values will just follow the same pattern again. At
this point I'd like to stop and record the fact that no
factor was found (otherwise I would have stopped before) for
this k,p pair.

The modulus operation involved has an important bearing on
the size of this loop. Say I'm checking a number against
p=3. Consider all of the possible values of r and what they
transform to:-

2x-1 mod 3

0 -> (2*0)-1 mod 3 == 2.
1 -> (2*1)-1 mod 3 == 1.
2 -> (2*2)-1 mod 3 == 0.

The maximum number of values r can have is p, but r=1 is
a special case, so the maximum loop size would be p-1.

Some values of p can produce smaller sub-loops.

2x-1 mod 7:

0-> (2*0)-1 mod 7 == 6.
1-> (2*1)-1 mod 7 == 1.
2-> (2*2)-1 mod 7 == 3.
3-> (2*3)-1 mod 7 == 5.
4-> (2*4)-1 mod 7 == 0.
5-> (2*5)-1 mod 7 == 2.
6-> (2*6)-1 mod 7 == 4.

0->6->4->0
1->1
2->3->5->2

So if my initial value of r had been 2 I could loop
endlessly without ever finding a factor (r=0). This is why
it's important to check to see if we've returned to the
initial value of r.

The sieving effort involved in the SoB project above has put
an arbitrary limit on the upper bound of n of 50 million.
Proth testing a k,n pair where n~50M would take a long long
time, but sieving up to that range is relatively cheap. If,
two years down the road, the project is looking at n values
up at that range we'll be thankful that we've sieved up
there.

As I said above, the maximum size of the loop is p-1. So
for p=882319864863617 this would mean a maximum loop size
of 882,319,864,863,616.

This is why 99.9% of the runs will completely iterate up to
n=50,000,000.

It's actually closer to 99.999999%. From the numbers above
there is a 1 in 17,646,397 chance that p will be a factor of
k*2^n+1 with n between 1 and 50,000,000.
(This isn't strictly true due to the possibility of smaller
sub-loops, but gives you an idea of the numbers involved).

Also, by recording the value of r when I do reach the end
of the loop it means I can continue this at a later date.
I'd only need r at n=1 (trivial to compute) and the value
of r at a given n and I can continue on my merry way.

The faster we can sieve, the faster we can rule out more
and more k,n pairs. The less pairs we have the faster
sieving becomes (although only slightly) and so we can sieve
out higher and higher primes.

*THIS* is why I'm interested in an FPGA helping out. They
have the capability of being much much faster than
traditional CPUs.

If you've read this far then thank you. Reward yourself
with a coffee/tea/walk/cigarette/nap/whatever.

Ta, -Alex

Alex wrote:
> Thanks for the comments (Andrew & JJ). Let me explain a bit. > You asked for this! :-) > > I don't need an FFT because I don't need to multiply numbers. > The sieving can be performed on 'small' integers (64-bits > gives a very large scope) using addition, subtraction, left > shifting by one bit and comparisons. > > No multiplication and no floating-point makes it much > easier. >
snipping all to spare the BW Well that was quite an interesting read and links too. Right now its still over my head since its not my hobby or work interest but still it points to the challenges in an FPGA design. But since I know the simple seive of eratosthenes very well I'll come back to this later as a possible cpu/memory benchmark. I think starting with the starter kit is an excellent idea, when you reach best case solution you can atleast speculate what bigger HW could do. On the other hand you could avoid this HW experiment and keep it in C for quite a while longer. When I model my cpu project, I don't use Verilog simulation, I write in a form of HDL C that allows the C model to run in std VC6 compiler on a std PC, this C RTL (register transfer level) code runs about 20x slower than simple minded C code that has no HW detail but it also runs orders faster than Verilog simulation. The C RTL code though looks so much like Verilog it can be hand translated with some grepping back into Verilog for the actual FPGA synthesis. This might not be so easy for VHDL since its syntax is not like C. You could reasonably experiment with C RTL engines that are orders magnitude larger than any FPGA could actually handle and you still have the PC memory and disk resources. So far you are going backwards by about 20x if you do what I suggest, but then with a simple C RTL translation effort you have a Verilog that can be synthesized into any FPGA provided you understood the architecture of the FPGA. Now with the C RTL model you can have your simulator count cycles and HW costs without ever going to HW. But you will need to port the C RTL to Verilog to find out what your clock freq limit is going to be. What I like about this approach is that lots of architectures can be explored and with my HW exp in hand I know what to expect when I go to synthesis. Exploring architecture is relatively painless in your fav C IDE, but you can evolve the arch towards one that just flies through the FPGA tools. Spending alot of time in Webpack trying to get optimal results leads to great frustration. You might also look at HandelC but $$$. For the long term, I still think big DRAM is going to be a big heasdache since I see poor locality of reference written all over this project, ie SRAM cache will be useless. You could take a look at Micron's RLDRAM which performs like a DRAM in density and like an SRAM in speed. It does this by allowing accesses to start every 2.5ns, with 8 in flight and data accesses come out after 20ns or 8 clocks. If you can arrange for interleaved engines to sync with this, you can get 32MByte RLDRAM effectively working at 2.5ns but you need good HW pipeline design, and this is high end FPGA stuff ie V2Pro and above. These accesses can be fully random to any address provided the bank is already done with previous job. For spartan3 best grade, webpack 6.3 was giving me 225MHz or so for a 16bit cpu datapath with the 16bit adder and dual port blockram both close to the limit. I use 2 clocks to get a 32bit datapath with 4way ported BRam. Others here may have gotten better results. I mostly work on V2Pro about 1.5x faster, and it seems webpack 7.1 has slowed things down some (I dunno why). If you start adding numbers >16b unpipelined, your perf will go to hell pretty quickly, you need to pipeline either at the 16b or 32b add level. best of luck to this project regards johnjakson at usa dot com ps If you like I can send you C RTL sample code.
Ok, without looking at it in more detail the math is beyond me at this
point. If you want to get a feeling for clock rates achievable on the
FPGA, just try some simple stuff and put it through the tools. Create a
16 bit adder in VHDL/Verilog and put it through the tools. Do the same
for comparator. Try 32 bit and 64bit adders. Make sure you register the
inputs and outputs of course. Not a bad way to learn an HDL language
either. Try something simple in the HDL and have a look to see what the
synthesizer created.

Remember that the underlying spartan 3 logic cell has one 4 input
lookup table (LUT) and 1 flipflop. The flip flop can be used for
pipelining and often it is already an unused resource so effectively it
comes for free.

Yes there is always a tradeoff between FPGA resource usage and speed
(clock rate). E.g. a bit serial multiplier uses very little logic
resource but takes multiple clock cycles to compute the result.
Alternatively a fully parallel multiplier can produce a result every
clock cycle.

Regards
Andrew

Hi Alex,

First let me thank you for what has to be a trully unique posting
from a new participant to comp.arch.fpga  .

  -- Clearly stated problem (but missing density or speed as goal,
     expected execution time (i.e. typical loop count per r,p pair))

  -- Obvious effort put into defining the data to be manipulated
     and careful description of what the data looks like

  -- Well stated algorithm, with appropriate comments

  -- Some initial thoughts on bandwidth to host system with
     a proposed solution to keep the FPGA engine busy

  -- Algorithm presented in terms that cleanly map to hardware
     constructs

  -- Some boundary case analysis (the r-- , r-=p) , since special
     cases are the major pain in hardware algorithm accelaerators

  -- No silly request that we email you the answer

  -- Just really refreshingly clear!

>Obviously I'd like as many sieve units as possible on the final >FPGA but I'd be looking to implement just one at first (without >the control unit), then the control function, then multiple >sievers.
Good plan, get one going. See where the bottle necks are. See how big it is.
>Sound feasible for an FPGA?
Sure
>Sound feasible for an FPGA novice :-) ?
Depends on your other hardware design experience. Looks like you need the following types of functional units:
>grab p and r from memory and mark this as 'in progress'.
Memory interface, pair of 64 bit registers
>set initial_r=r /* r being another 64 bit var */
another 64 bit register
>set n=1 /* n is a 64 bit counter */
64 bit counter
>while( n <= maxn ) {
Since you say that maxn is 2^26 or 50E6, maybe n is not 64 bits? anyway, a comparator (includes a subtractor)
> if( r == 0 ) {
64 bit test for 0
> stop. shove n back into memory and mark result as done(1) > } > r=r<<1; /* multiply by 2 */
a 64 bit 2-1 mux
> r--; /* decrement r by 1. r will always be > 1 so no need for > bounds checking */
Ooops, looks like r is a loadable counter, rather than just a register
> if( r >= p ) {
64 bit comparitor (includes a 64 bit subtractor)
> r-=p; /* again, always +ve result */
another 64 bit subtractor
> } > n++; /* increment n by one */ > if( r == initial_r ) { /* comparison */
64 bit comparitor (no subtractor for this one)
> stop. shove n back into memory and mark result as done(2) > } >}
plus a controller. The initial design should not care about timing, just get it to do the algorithm. Then there are wonderful opportunities for pipelining: -- between the various algorithm statements, -- within the counters and adders/subtractors (because at 64 bits there may be oppportunities that do not exist for smaller structures) -- Multiple instances of the algorithm processing different r,p pairs, but sharing the pipeline. Barrel processing. This is hinted at, because it looks like your algorithm can't start the next itteration of the loop until the current one is finished. This is also referred to as C-slow approach by Nicholas Weaver who posts to this news group on occasion. Try a Google search and this: http://www.cs.berkeley.edu/~yatish/cs252/252slides.ppt Looks like you have some fun ahead of you. On 28 Jun 2005 06:58:12 -0700, "Alex" <alex@greenbank.org> wrote:
>Hello, > >I know FPGA's are not suitable for sieving large numbers however >one set of numbers (Proth numbers, related to the Sierpinski >conjecture) I think would fit an FPGA quite well. > >Sorry if this is a relatively basic or silly question but I'm >a novice with FPGAs although I've been reading up on them and >playing around with the software. > >I'd be dealing with numbers in the 32 to 64 bit range (current >limits of sieving are around 50 bit numbers) and hoping to put >many sievers on one FPGA (I have a Spartan-3 dev kit with >200k gates). > >The basic algorithm is this:- > >p - 64 bit number (a prime) although top bit will never be set >r - 64 bit number (0 <= r < p) >maxn - 50,000,000 but easily replaced with 2^26 if easier > >The above would be supplied by a controlling computer via >parallel or RS232 connection and stored somewhere in the 1M-byte >of RAM. The controlling computer would make sure there are plenty >of these values available in memory for the sievers to grab. > >An individual siever would:- > >grab p and r from memory and mark this as 'in progress'. >set initial_r=r /* r being another 64 bit var */ >set n=1 /* n is a 64 bit counter */ >while( n <= maxn ) { > if( r == 0 ) { > stop. shove n back into memory and mark result as done(1) > } > r=r<<1; /* multiply by 2 */ > r--; /* decrement r by 1. r will always be > 1 so no need for > bounds checking */ > if( r >= p ) { > r-=p; /* again, always +ve result */ > } > n++; /* increment n by one */ > if( r == initial_r ) { /* comparison */ > stop. shove n back into memory and mark result as done(2) > } >} > >Obviously I'd like as many sieve units as possible on the final >FPGA but I'd be looking to implement just one at first (without >the control unit), then the control function, then multiple >sievers. > >Sound feasible for an FPGA? >Sound feasible for an FPGA novice :-) ? > >Ta, > >-Alex
Philip Freidin Fliptronics
Hello,

Philip Freidin wrote:
> Hi Alex, > > First let me thank you for what has to be a trully unique posting > from a new participant to comp.arch.fpga .
Thanks. I've been around Usenet long enough to know what to do and, more importantly, what not to do. The project is a learning experience for me, I don't want anyone to give me the answer :-) And thank you for your comments, they've helped me look in the right directions. I'm a complete novice when it comes to FPGAs and VHDL. So I've bought _The Student's Guide To VHDL_ by Peter J. Anderson (ISBN 1558605207) based on various recommendations. I've implemented a very simple siever limited to the IEEE 1164 INTEGER type (signed 32-bit). Obviously this isn't much help for the final version (64-bit numbers) but it's helping me understand the whole thing. The algorithm has changed slightly (and for the better). I have eliminated the need for the decrement of r, and I'm now checking to see if r is set to one of 9 values. (I've been playing around with this today in a text editor. I only have the Xilinx software at home on so please read through any syntax errors :-) (Also note that this is purely functional and unoptimised.) function sieve_it ( p, c1, c2, c3, c4, c5, c6, c7, c8, c9 : in INTEGER ) return INTEGER is variable n : INTEGER := 0; variable r : INTEGER := 1; begin while n < 50000000 loop r := r*2; -- double it (max p is 30 bits with signed integers) -- with 64 bit unsigned integers then p can go up to -- 2^63-1 n := n+1; -- increment n if r >= p then r := r-p; -- r was 0 < r < p before being doubled. -- after being doubled it must be 0 < r < 2p-2 -- so we will only ever need to subtract p once to bring it -- back into the range 0 < r < p -- since r is always +ve we never have to deal with negative -- numbers. Should use unsigned integers but leave that until -- later on. end if; -- r now holds 2^n mod p. -- Check if r (2^n mod p) is equal to one of our target values -- As far as concurrency is concerned, n must have been -- incremented by this point. all other operations must be -- performed sequentially. if r=c1 or r=c2 or ... r=c9 then -- got one, return n return n; end if; end loop; return 0; end function sieve_it; Up to 9 'check' values are provided, if not all values are needed then a dummy value (either 0 or p) can be provided as r can never end up at either of these values. [MATHS BIT] Computing test cases is easy. Consider k=19249 and p=100003. We are looking to solve the discrete logarithm (= means equivalence):- 19249*2^n+1 = 0 (mod 100003) => 19249*2^n = -1 (mod 100003) : subtract 1 from each side => 2^n = -1 * 19249^-1 (mod 100003) : divide by 19249 (Don't worry how this bit works!) 19249^-1 = 3486 (mod 100003) : 3486*19249 = 1 (mod 100003) => 2^n = (100003-3486) (mod 100003) => 2^n = 96517 (mod 100003) I can compute hundreds of thousands of these test cases per second on a humble PC. This is not the tricky part. There are 8 other candidate k values (aside from 19249), hence the 9 possible values to check for. Some checking (performed on a PC) can eliminate certain k values (quadratic residues and Legendre symbols) and so not all 9 c values may be needed. [END OF MATHS BIT] So we call sieve_it with p=100003 and c1=96517 (c2..c9 are all 0). It should return n=9484 because 19249*2^9484+1 = 0 (mod 100003). (Warning, this number has around 2900 digits so put your calculators down.) I don't care which value (c1 to c9) matched, in truth I don't really care what the magic n value was. All I want to know is there was a match for any n in the range 0..50M. Here's why... The hits are so rare that just knowing that a certain p will provide a result is all that is necessary. Take the range of p from 882T (8.82*10^14) to 883T. I've sieved this range with a software siever and so I know exactly how many factors were found. On average the prime density is about 1 in 33 in that range. So there are roughly 1E12/33 = 3E10 primes between 882T and 883T. That's 3E10 individual calls to sieve_it() that need to be made. This is why I want to fit as many sievers on the FPGA as possible, and make them as fast as possible. Out of those 3E10 primes, a mere 866 were factors of the numbers. That's one hit for every ~ 35,000,000 calls to sieve_it. So the plan is something like this:- 1. Implement and test sieve_it function. I have a wrapper entity that sets up vars for the inputs, calls sieve_it and then writes the result out to an input:- entity proth_sieve is ( clk,reset : in STD_LOGIC; nresult : out INTEGER ) end proth_sieve ... proth_u_like : process is begin --assign values to tp,t1..t9 tp := 100003; t1 := 96517; result := sieve_it( tp,t1,t2,t3,t4,t5,t6,t7,t8,t9 ); nresult <= result after 100 ns; wait on clk; -- sleep until restarted end process proth_u_like; I've also got another version of the siever that completes one iteration of the loop per clock cycle. It writes out the current value of r to an output port every loop and waits on the clk. That way I can see what is going on with ModelSim. 2. Bolt on Serial port handler to allow communication between FPGA and controlling PC. Don't know if Xilinx provides anything as part of the Spartan-3 Starter Kit I purchased. There's always UART drivers on www.opencores.org anyway. PC --- Serial --- Simple_Controller --- Siever The Simple_Controller is a just a pass through, it calls the sieve_it() function directly and only when told to by the PC via the Serial connection. 3. Have multiple sievers being handled by controller on the FPGA. PC --- Serial --- Semi_Controller ---< Multiple Sievers The Semi_Controller calls multiple siever functions, all assignment handled by the PC via Serial. 4. Use the memory (1MB SRAM on Spartan-3 Dev Kit). To upload work units (p,c1..c9) to memory To download completed work units (p,result) from memory. PC --- Serial --- Controller --- Memory Sievers contact controller to request work unit from memory. Sievers return results to controller which stores in memory. Memory --- Controller --< Multiple Sievers Sievers now controlled by the on board controller. If there is work to do then controller assigns it itself. The PC -> Controller interaction is only to upload/download new/completed work units. 5. Optimise it, make it 64-bit, fit maximum number of sievers on FPGA. 6. Drive other outputs on Spartan-3 Dev kit board? 8 LEDs - Controller status, Serial activity (upload/download) 4-digit 7-segment display - # of work units processed per second? VGA output - crude graphical siever status, i.e. 5 scan lines per siever, LHS of screen is n=0, RHS is n=50M line scaled appropriately. This last stage would be pure overkill and I'd be happy to ignore them if they are to the detriment of the overall sieving performance of the FPGA (i.e. they take up space that could house another siever) but it would be nice to see the blinkenlights on the board put to some use. Ta, -Alex