TRTRI test results.

The first picture compares the speedup obtained by changing the minimum recursion level. In the most naive (and beautiful) version, the algorithm recurses all the way down to size 1. However, this produces many function calls, so it is more efficient to write special code for all the small cases and thereby avoid those calls. As the problem gets large, the computational cost is completely dominated by the work done on the large blocks, so the speedup from writing the small cases explicitly is only visible for small problem sizes.
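
The idea of a minimum recursion level can be illustrated with a minimal sketch. This is not the tested routine, only an illustration under several assumptions: the matrix is lower triangular, it is stored as a list of rows, and the helper names (invert_lower, invert_small, matmul, min_size) are hypothetical. A plain substitution loop stands in for the hand-written small cases.

```python
def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def invert_small(L):
    """Non-recursive base case: invert a small lower-triangular matrix
    column by column using forward substitution."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -s / L[i][i]
    return X


def invert_lower(L, min_size=1):
    """Recursive inverse of a lower-triangular matrix.

    Blocks of order <= min_size go to the non-recursive base case;
    min_size=1 corresponds to recursing all the way down.
    """
    n = len(L)
    if n <= max(min_size, 1):
        return invert_small(L)
    k = n // 2
    # Split L = [[L11, 0], [L21, L22]].
    L11 = [row[:k] for row in L[:k]]
    L21 = [row[:k] for row in L[k:]]
    L22 = [row[k:] for row in L[k:]]
    X11 = invert_lower(L11, min_size)
    X22 = invert_lower(L22, min_size)
    # Off-diagonal block of the inverse: -inv(L22) * L21 * inv(L11).
    X21 = [[-v for v in row] for row in matmul(matmul(X22, L21), X11)]
    top = [row + [0.0] * (n - k) for row in X11]
    bottom = [r21 + r22 for r21, r22 in zip(X21, X22)]
    return top + bottom
```

Calling invert_lower(L, 1) and invert_lower(L, 4) on the same matrix should give the same inverse up to rounding; only the number of recursive calls differs. For large matrices, almost all of the work is in the two matrix products on the large blocks.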

The following picture compares the recursive routine with the standard algorithm supplied by LAPACK, and with the LAPACK algorithm using the ATLAS block size supplied by the routine ilaenv.f. The ATLAS BLAS is used in all cases.
The new recursive algorithm is about 10% faster than the LAPACK algorithm. What is a little disturbing, however, is that the LAPACK algorithm actually runs slower when using the ATLAS block size.

For large problem sizes the situation looks like this. The recursive algorithm is still in the lead.

The next picture shows the large problem size test performed with LU on the same architecture, using all the same libraries, to see whether the LAPACK/ilaenv combination is still slower than LAPACK without knowledge of the ATLAS block size.
In this case, supplying LAPACK with ilaenv does give better performance.