I happened across "A multimodular algorithm for computing Bernoulli numbers" which implements a cute hack. The Bernoulli number B(k) mod p can be efficiently calculated for prime p > 5. You can calculate a bunch of B(k) mod p_i in parallel, then use the Chinese Remainder Theorem to stitch those results back into the desired number. This algorithm's now part of the Sage toolkit.
I wonder what other functions could be decomposed this way? Once you start paying attention to bits (and not just words), operating "mod p" can save you a lot of time spent multiplying, both theoretically and practically.