In fact primitive Pythagorean triples have a quite lovely structure: they can be generated in a ternary tree using a set of three matrices. (In two different ways, no less! The second set of matrices was only discovered in 2008.) There are several implementations of this idea on Rosetta Code (often contrasted with the naive approach.) Primitive triples are those whose three elements have no common divisor.
We can use this structure to do better than O(N^2), under certain conditions.
Pick the largest element of N, call it L. Using depth-first search on the ternary tree, generate all primitive Pythagorean triples with hypotenuse length <= L. Note that this method of generation always produces larger hypotenuses in the children of each node, so we can effectively prune our search. There is a result from the turn of the century which says that the number of primitive Pythagorean triples <= L is, in the limit, a constant fraction of L (in fact 1/2*pi). So, the depth first search takes time O(L), because we generate a unique triple each step, and prune no more than 3x the number of valid triples.
How long does it take us to generate all triples from just the primitive ones? Well for (3,4,5) we create an additional L/5 triples. For (5,12,13) we create an additional L/13 triples. I claim that this is bounded by L*H(L), where H(k) = the k'th element in the harmonic series (i.e., 1 + 1/2 + 1/3 + 1/4 + ... + 1/k.) A little handwaving is required here because some hypotenuses appear in more than one Pythagorean triple. I was not able to find a reference online confirming my guess about this bound on the number of Pythagorean triples. (Mathworld cites only the limit result for the number of primitive triples.) However, using OEIS's list of the number of triangles I can verify that L*H(L) =~ L*log(L) is a good bound up to 10^12.
This process produces all the Pythagorean triples with no repeated elements. So, for our algorithm we put all elements on N into a hash table. (Note that randomized algorithms can produce a perfect hash in O(N) time, if we're worried about that aspect.) Then iterate through all Pythagorean triples with hypotenuse <= L, and check whether all three elements are in the hash table, which is a constant-time operation. Thus the total run time is O(L log L).
If N is O(L) then this algorithm is much better than O(N^2). That is, as long as the number of elements in the set is proportional to the largest number--- or even somewhat worse--- we win. The naive algorithm is better if there are only a few, large numbers. I coded up both in Python (they are just a few lines each) and ran a handful of experiments (times are in seconds:)
L N quadratic tree 1000000 1000 0.21 2.18 1000000 2000 0.83 2.22 1000000 4000 3.37 2.20 2000000 4000 3.39 4.82 2000000 6000 7.73 4.95
A possibly interesting follow-up question is which algorithm parallelizes better. Both algorithms have pretty trivial partitionings (tree sub-branches vs. subsets of the pairings.) With an unbounded number of processors the "quadratic" algorithm can be made constant-time, but the tree algorithm is not.
Code on pastebin (because I'm not one of the cool kids on github.)