Implement Charles Cooper's algorithm:
Steroid chksum algorithm by firstname.lastname@example.org
The basic strategy is to unroll the loop. Normally, adding in a loop
is slow because of the dependency: the CPU has to finish each add
before it starts the next one and is unable to pipeline. One
way to get around this is to use multiple registers and do
multiple adds per iteration, then consolidate at the end,
as in unroll_checksum.
That is still slow, though, because each register carries only one
char per add, which wastes most of its width, and the CPU still has
to do one add per char.
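
The unrolling idea described above can be sketched as follows. This is
a minimal illustration, assuming the checksum is a plain sum of the
bytes; the name unrolled_byte_sum_sketch and the choice of four
accumulators are hypothetical and are not taken from unroll_checksum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: sum the bytes of buf using four independent
 * accumulators, so the adds inside one iteration do not depend on
 * each other and can overlap in the pipeline. */
uint32_t unrolled_byte_sum_sketch(const unsigned char *buf, size_t len)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop: four adds per iteration, one per accumulator. */
    for (; i + 4 <= len; i += 4) {
        s0 += buf[i];
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }

    /* Tail: fewer than four bytes left. */
    for (; i < len; i++)
        s0 += buf[i];

    /* Consolidate the partial sums at the end. */
    return s0 + s1 + s2 + s3;
}
```

Note that each 32-bit accumulator still receives a single char per add,
which is the waste the next step tries to remove.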
The key is that adding one word is going to be faster
than adding two chars, and each 64-bit word has 8 chars in it,
so we get 8 adds per loop iteration. long1 + long2 is almost like
looping over two arrays of 8 chars and adding them element-wise,
except for the overflow: a carry out of one char spills into the
next one. So we have to keep track of the overflow
and get rid of it at the end.
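
The word-at-a-time idea can be sketched as below. This is not the
algorithm described above: instead of tracking the overflow and
removing it at the end, this variant masks each word into even and odd
bytes so every char gets a 16-bit lane, and folds the lanes before they
can overflow. It assumes the checksum is a plain sum of the bytes; the
names swar_byte_sum_sketch, fold_lanes and LANE_MASK are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Keeps every other byte, leaving 8 spare bits above each one. */
#define LANE_MASK 0x00FF00FF00FF00FFULL

/* Add the four 16-bit lanes of v together. */
static uint64_t fold_lanes(uint64_t v)
{
    return (v & 0xFFFF) + ((v >> 16) & 0xFFFF) +
           ((v >> 32) & 0xFFFF) + (v >> 48);
}

/* Hypothetical sketch: sum the bytes of buf, 8 chars per iteration. */
uint64_t swar_byte_sum_sketch(const unsigned char *buf, size_t len)
{
    uint64_t total = 0, even = 0, odd = 0;
    size_t i = 0, pending = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);   /* alignment-safe load */

        even += w & LANE_MASK;           /* bytes 0, 2, 4, 6 */
        odd  += (w >> 8) & LANE_MASK;    /* bytes 1, 3, 5, 7 */

        /* A lane holds at most 256 * 255 = 65280 after 256 words,
         * which still fits in 16 bits, so fold before going further. */
        if (++pending == 256) {
            total += fold_lanes(even) + fold_lanes(odd);
            even = odd = 0;
            pending = 0;
        }
    }

    total += fold_lanes(even) + fold_lanes(odd);

    /* Tail: fewer than 8 bytes left. */
    for (; i < len; i++)
        total += buf[i];

    return total;
}
```

Because every byte ends up in the same total, the result does not depend
on the byte order of the machine. The cost of the masking is two adds
per word instead of one, which is the price of not having to correct
for carries between chars afterwards.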