These following pictures show results obtained from running the
blocksearch.py script of my code generator.
The pictures shows different combination of unrolling of the innermost
loop and the blocking size used in atlas. There is no use of
prefetching. The same kind of register blockings is used for all
cases. This is usually ok, since the register blockings is mostly
independent of the loop unrollings.
The results have been obtained using atlas' kernel testers. This means
that the information displayed of the graphs does not necessarily
translate to a full kernel build. Often, a small block size will be
preferable, as it reduces the need for cleanup.
The first picture shows results for a Pentium 3. Notice the drop-off
in perfomance when the L1 cache is exceeded at at block size of 64.
This picture shows results for a Pentium 4 computing in double precision.
The drop-off in performance is more level here. Notice the big dip at
a blocksize of 128, this is probably due to some aliasing issues in
the cache. Another interesing feature (hard to see) is that a loop
unrolling of 8 is the fastest, and not the full loop unrolling. This
is probably the benefit of the trace-cache that makes this possible.
The same picture just for single precision.