Mark Gritter (markgritter) wrote,
Mark Gritter
markgritter

More on not trusting published algorithms

If you haven't read this paper on how not to implement the Sieve of Eratosthenes in a functional language, it's well worth your time: "The Genuine Sieve of Eratosthenes" by Melissa O'Neill of Harvey Mudd. It's about how a common simple functional programming example is completely wrong. (Her talk at Stanford on a new family of random number generators is worth the time too.. It's a cool hack to incorporate permutations into linear-congruential RNGs.)

I ran into a question on Quora about summing the first K prime numbers, and the person was asking whether there was a better way than trial division. When I pointed out that the Sieve was much better (even at relatively small scales!) his response was "then I need to fill all integers upto INTEGER.MAX_VALUE"--- i.e., how do we bridge the gap between "All the prime numbers up to N" and "The first K prime numbers".

There are three ways to tackle this and they're all interesting in their own way.

1) The Sieve is so much better than trial division that it almost doesn't matter. The prime number theorem tells us that there are about N/log(N) prime numbers less than N, and that the Kth prime is about K log K. So, trial division to find the Kth prime require work that's around O( (K log K)*K ) = O( K^2 log K ). We might be generous and note that most composite numbers don't require testing all previous primes in order to find a divisor, and bound below by Omega(K^2).

The sieve takes O(N log log N). If you're looking for, say, the 1000th prime, you can easily sieve up to 1000^2 = 1,000,000 and still come out ahead. (Particularly since, in practical terms, the Sieve doesn't require any division.)

2) The prime number theorem lets us come up with a pretty-good bound on how large an N we need to find the K'th prime. Run the sieve on, say, N = 2 K log K and that's a solid win.

3) Even if we could not predict where the Kth prime was likely to occur (or, say some subset of the primes of special interest), the Sieve can still be run to an indeterminate stopping point at only a constant-factor slowdown.

The paper linked above gives a sophisticated way of doing this, but brute force is fine too.

Say we sieve up to N=1000 and don't yet find the primes we're looking for. Then double N and start over from scratch. Do this until we find K primes; iteration J of the sieve has size N =(2^J)*1000. If we had an oracle to tell us J, we could have done one sieve pass. Instead we did sieves of size 1000, 2000, 4000, 8000, ... , 2^J*1000, but the sum of the sizes is just 2^(J+1)*1000 since it's a geometric series. So even if our answer was at 2^(J-1)*1000 + 1, the total amount of numbers sieved was at most 4x the minimum we needed to do.

And that's not the best we can do, if we re-use the first 1000 numbers rather than throwing them away and starting from scratch it gets somewhat better--- but that's just quibbling about constant factors. Even treating the Sieve as a black box gives us a better result than trial division.
Tags: algorithm, geek
Subscribe
  • Post a new comment

    Error

    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 0 comments