Summary: The emergence of the systolic paradigm presented by {\it H. T. Kung} and {\it C. E. Leiserson} [Sparse matrix computations, Proc. Symp., Knoxville 1978, 256-282 (1979; Zbl 0404.68037)] inspired the first 2D-array parallelization of the sequential matrix multiplication algorithm. Since then, and due to its attractive and appealing features, the systolic approach has been gaining a great momentum to the point where all 2D-array parallelization attempts were exclusively systolic. As a good result, latency has been successively reduced a number of times $(5N,3N,2N,3N/2)$, where $N$ is the matrix size. But as latency was getting lower, further irregularities were introduced into the array, making the implementation severely compromized either at VLSI level or at system level. The best illustrative case of such irregularities are the two designs proposed by {\it J. C. Tsay} and {\it P. Y. Chang} [Design of efficient regular arrays for matrix multiplication by two-step regularization, IEEE Tran. on Parallel and Distributed Systems, 6, No. 2 (1995) and Some new designs of 2D-array for matrix multiplication and transitive closure, ibid. 6, No. 4 (1995)] and considered as the fastest designs $(3N/2)$ that have been developed so far. The purpose of this paper is twofold: we first demonstrate that $N+\sqrt{N}+2$ is the minimal latency that can be achieved using the systolic approach. Afterwards, we introduce a full-parallel 2D-array algorithm with $N$ latency and $2N$ I/O-bandwidth. This novel algorithm is not only the fastest algorithm, but is also the most regular one, too. A 3D parallel version with $O(\log N)$ latency is also presented.