### Parallelizing

In 1991 Daniel Bernstein published a short paper on a computational technique for solving diophantine equations of the form p(a) + q(b) = r(c) + s(d): https://cr.yp.to/papers/sortedsums.pdf He describes how a heap can be used to visit all sums p(a) + q(b), with a,b < H, in increasing order, using only near-linear space and near-linear time (in H). You can then set up two heaps and "increment" whichever one has the lowest minimum value, outputting any cases of equality.

Can this technique be efficiently parallelized, on a SIMD machine (such as GPU)?

A simple approach is to divide up the elements in one of the heaps. That is, instead of, say, [math]r(d) = d^4[/math], you could have a family of functions [math]r_i(d) = (192d+i)^4[/math], which partition the search space. Each thread gets a different function to work with. This reduces one of the heaps by a factor of 192, but the other is just as large. Where previously we did H^2 + H^2 heap operations, now we have to do H^2 + H(H/192) heap operations, and the operations on the smaller heap are more efficient in the worst-case as well. Also, every thread needs its own pair of *both* heaps which is somewhat memory-intensive, limiting either the size of the problem that can be tackled or the number of available threads.

We could also partition *all* of the inputs as long as every partition is tested against every other combination of partitions. (This is a natural optimization when some restriction on the equations can be found for some moduli, thus eliminating a broad range of possible solutions.) For example, if we divided all the inputs in half by parity, then we need to check 8 different equations. This again requires a pair of heaps per thread.

We could perhaps continue to use a shared heap for the left-hand side of the equation, at the cost of threads becoming idle when their minimum value grows too large. I don't know how to evaluate what this does to the speedup. It also introduces some serialization as the heap operation itself can't be parallelized.

A variation on the shared heap would "unroll" the left-hand heap and have a shared array of the next K values. I can't decide whether this is helpful or not.

On a side note, perhaps the availability of cheap large-capacity flash devices may make it feasible to apply techniques previously dismissed as infeasible due to space requirements.

Can this technique be efficiently parallelized, on a SIMD machine (such as GPU)?

A simple approach is to divide up the elements in one of the heaps. That is, instead of, say, [math]r(d) = d^4[/math], you could have a family of functions [math]r_i(d) = (192d+i)^4[/math], which partition the search space. Each thread gets a different function to work with. This reduces one of the heaps by a factor of 192, but the other is just as large. Where previously we did H^2 + H^2 heap operations, now we have to do H^2 + H(H/192) heap operations, and the operations on the smaller heap are more efficient in the worst-case as well. Also, every thread needs its own pair of *both* heaps which is somewhat memory-intensive, limiting either the size of the problem that can be tackled or the number of available threads.

We could also partition *all* of the inputs as long as every partition is tested against every other combination of partitions. (This is a natural optimization when some restriction on the equations can be found for some moduli, thus eliminating a broad range of possible solutions.) For example, if we divided all the inputs in half by parity, then we need to check 8 different equations. This again requires a pair of heaps per thread.

We could perhaps continue to use a shared heap for the left-hand side of the equation, at the cost of threads becoming idle when their minimum value grows too large. I don't know how to evaluate what this does to the speedup. It also introduces some serialization as the heap operation itself can't be parallelized.

A variation on the shared heap would "unroll" the left-hand heap and have a shared array of the next K values. I can't decide whether this is helpful or not.

On a side note, perhaps the availability of cheap large-capacity flash devices may make it feasible to apply techniques previously dismissed as infeasible due to space requirements.