Can this technique be efficiently parallelized, on a SIMD machine (such as GPU)?
A simple approach is to divide up the elements in one of the heaps. That is, instead of, say, [math]r(d) = d^4[/math], you could have a family of functions [math]r_i(d) = (192d+i)^4[/math], which partition the search space. Each thread gets a different function to work with. This reduces one of the heaps by a factor of 192, but the other is just as large. Where previously we did H^2 + H^2 heap operations, now we have to do H^2 + H(H/192) heap operations, and the operations on the smaller heap are more efficient in the worst-case as well. Also, every thread needs its own pair of *both* heaps which is somewhat memory-intensive, limiting either the size of the problem that can be tackled or the number of available threads.
We could also partition *all* of the inputs as long as every partition is tested against every other combination of partitions. (This is a natural optimization when some restriction on the equations can be found for some moduli, thus eliminating a broad range of possible solutions.) For example, if we divided all the inputs in half by parity, then we need to check 8 different equations. This again requires a pair of heaps per thread.
We could perhaps continue to use a shared heap for the left-hand side of the equation, at the cost of threads becoming idle when their minimum value grows too large. I don't know how to evaluate what this does to the speedup. It also introduces some serialization as the heap operation itself can't be parallelized.
A variation on the shared heap would "unroll" the left-hand heap and have a shared array of the next K values. I can't decide whether this is helpful or not.
On a side note, perhaps the availability of cheap large-capacity flash devices may make it feasible to apply techniques previously dismissed as infeasible due to space requirements.