Derivation of the matrix multiply domain flow program

The concepts of partial and total orders are essential for finding optimal domain flow algorithms. Partial orders, or posets, are the source of high-performance, low-power execution patterns.

The Linear Algebra universe is particularly rich in partial orders, something that has been exploited for centuries [1]. Matrix Computations [2] by Golub and van Loan provides a comprehensive review. What follows may be a bit technical, but keep the visualizations of the previous pages in mind as you try to picture what the math implies.

We want to evaluate the matrix-matrix multiplication $ C = A \otimes B $, where $A$, $B$, and $C$ are matrices of size $N \times N$. We picked the square-matrix version because it is easier to visualize the symmetry of the computational wavefront, but everything that follows works just as well when the matrices are rectangular.

Linear algebra operators exhibit many independent operations. For example, there are four basic vector operators:

  1. scale
  2. add
  3. multiply, and
  4. dot product

The operator $z = \alpha * x + y$ is frequently used, and although redundant, it tends to be added as a fifth operator, referred to as the saxpy operator for “Scalar Alpha X Plus Y”. The dot product is also referred to as the inner product. The inner product brings two vectors together into a scalar that measures how much the vectors point in the same direction. The outer product expands two vectors into a matrix: for vectors $x$ and $y$, the outer product is defined as $xy^T$.
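
As a concrete illustration, here is a small Python/NumPy sketch of these operators; the variable names are illustrative only and not part of the original derivation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
alpha = 2.0

scaled   = alpha * x          # scale: multiply every element by a scalar
summed   = x + y              # add: element-wise sum
product  = x * y              # multiply: element-wise product
dot      = np.dot(x, y)       # dot (inner) product: contracts two vectors to a scalar
saxpy    = alpha * x + y      # saxpy: "Scalar Alpha X Plus Y"
outer    = np.outer(x, y)     # outer product: expands two vectors into the matrix x y^T
```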

Matrix operations can be expressed in terms of these vector operators with many degrees of freedom. For example, ‘double loop’ matrix-vector multiplication $y = Ax$ can be arranged in $2! = 2$ different ways: we can evaluate it as dot products of the rows of $A$ with $x$, or as an accumulation of saxpy operations on the columns of $A$.
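
For instance, the two arrangements of $y = Ax$ can be sketched as follows (illustrative Python/NumPy, not from the original text):

```python
import numpy as np

def matvec_dot(A, x):
    """Row-oriented: y[i] is the dot product of row i of A with x."""
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        y[i] = np.dot(A[i, :], x)
    return y

def matvec_saxpy(A, x):
    """Column-oriented: y is built up by saxpy operations on the columns of A."""
    m, n = A.shape
    y = np.zeros(m)
    for j in range(n):
        y += x[j] * A[:, j]          # one saxpy per column
    return y
```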

‘Triple loop’ matrix-matrix multiplication can be arranged in $3! = 6$ ways. These arrangements have their own operators and their own modes of access, and thus the interplay with a spatial distribution of the rows and columns of the matrices is key to evaluating the efficiency of the computation. The orderings and properties of these different arrangements are shown in Table 1:

| Loop order | Inner Loop | Middle Loop | Inner Loop Data Access |
| --- | --- | --- | --- |
| ijk | dot | vector $\times$ matrix | $A$ by row, $B$ by column |
| jik | dot | matrix $\times$ vector | $A$ by row, $B$ by column |
| ikj | saxpy | row saxpy | $B$ by row |
| jki | saxpy | column saxpy | $A$ by column |
| kij | saxpy | row outer product | $B$ by row |
| kji | saxpy | column outer product | $A$ by column |

Table 1: Matrix Multiplication: Orderings and Properties (see [2])
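
To make the orderings concrete, here is an illustrative sketch of two of the six arrangements, the dot-product form (ijk) and the outer-product form (kij); the other four are analogous permutations of the same three loops:

```python
import numpy as np

def matmul_ijk(A, B):
    """ijk ordering: the inner loop is a dot product of a row of A with a column of B."""
    N = A.shape[0]
    C = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_kij(A, B):
    """kij ordering: for each k, C accumulates the outer product of column k of A
    and row k of B, one row saxpy at a time (B accessed by row)."""
    N = A.shape[0]
    C = np.zeros((N, N))
    for k in range(N):
        for i in range(N):
            for j in range(N):
                C[i, j] += A[i, k] * B[k, j]
    return C
```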

The scale, add, and multiply operators are highly parallel in that each individual vector element operation is independent of each other. The dot product adds a consolidation, or contraction phase to yield a single, scalar valued result.

In a parallel context, all these vector operators have a non-trivial information distribution phase. First, the vectors must be embedded in space, and second, they need to be aligned so that the element operations can commence. Progressing to matrix operators, we have vectors of vectors that need to be aligned. For the domain flow algorithm we demonstrated, we started from the expression of matrix multiply as $N^2$ independent dot products. In mathematical terms: if we partition the $A$ matrix into rows, and the $B$ matrix into columns:

$$A = \begin{bmatrix} a_1^T \\\\ a_2^T \\\\ \vdots \\\\ a_n^T \end{bmatrix}$$

and

$$B = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}$$

then matrix multiplication can be expressed as the collection of dot products:

$$\text{for i = 1:N} \\\\ \qquad \text{for j = 1:N} \\\\ \qquad\qquad c_{ij} = a_i^T b_j $$
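
In code, this is the familiar picture of $N^2$ independent dot products (an illustrative sketch; every $c_{ij}$ is independent of the others and could be evaluated in parallel):

```python
import numpy as np

def matmul_dot_products(A, B):
    """C[i, j] = a_i^T b_j: rows of A against columns of B, N^2 independent dot products."""
    N = A.shape[0]
    C = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C
```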

Looking just at the individual dot product, a theoretical computer scientist would say: the fastest way to evaluate a dot product is through a binary tree of depth $\log(N)$, yielding the result in $\log(N)$ steps. A spec is written and handed off to a hardware engineer. When the hardware engineer looks at this problem, a very different view emerges. In the hardware world, an algebraic operator such as multiply or add evaluates, depending on the number system, in the order of 1 nsec. But sending the result across even a modestly sized chip, say 10x10 mm, can take 10 times as long. If the result needs to be communicated across the network, it can take 1,000,000 times longer. With modern chip technology, it takes about the same time to compute a multiply or add as it does to communicate the result to a local neighborhood. From the perspective of the electrons participating in the evaluation of an algebraic operator, computation and communication are roughly equal in terms of time, and thus in the distance these electrons can ‘drift’.

What this means for evaluating the dot product is that the evaluation of $$c_{ij} = \sum\limits_{k=1}^N a_{ik} * b_{kj}$$ can be executed efficiently as a propagation through local neighborhoods of communicating functional units. In mathematical terms we can write this as a recurrence: $$c_{i,j,k} = c_{i,j,k-1} + a_{i,k} * b_{k,j}$$ You can start to recognize the domain flow algorithm as presented in our example. However, if we distribute the $c_{ij}$ propagation along the $k$-dimension, then the accesses to $a_{i,k}$ and $b_{k,j}$ are not local at all. Is there a mechanism to get the correct $a$ and $b$ elements to the right location?
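
A sketch of this recurrence for a single $c_{ij}$: the partial result is carried along the $k$-dimension and updated with one multiply-add per step (illustrative Python, not the original program):

```python
def dot_as_recurrence(a_row, b_col):
    """c_k = c_{k-1} + a_k * b_k: the dot product evaluated as a propagation along k."""
    c = 0.0                          # c_{i,j,0}
    for k in range(len(a_row)):      # each step needs only the previous partial sum and two operands
        c = c + a_row[k] * b_col[k]
    return c                         # c_{i,j,N} = the dot product
```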

Let’s take a look at the dot products for $c_{1,1}$, $c_{1,2}$, and $c_{2,1}$. Visualize the propagation of the $c$ recurrence along the $k$-dimension above the point $i = 1, j = 1, k = 0$. Let’s position the row of $A$, $a_1^T$, alongside the $c_{1,1}$ propagation in the $j = 0$ plane. Thus, $a_{1,1}$ is presented at the lattice point $(1,0,1)$, and $a_{1,N}$ is positioned at the lattice point $(1,0,N)$. Similarly, let’s position the column of $B$, $b_1$, alongside the $c_{1,1}$ propagation, but position it in the $i = 0$ plane. That would imply that $b_{1,1}$ is presented at the lattice point $(0,1,1)$, and $b_{N,1}$ is positioned at the lattice point $(0,1,N)$. This would transform the recurrence equation for $c_{1,1}$ into:

$$c_{1,1,k} = c_{1,1,k-1} + a_{1,0,k} * b_{0,1,k}$$

In this recurrence, every operand communication is a local, nearest-neighbor communication.

If we now visualize the recurrence for $c_{1,2}$ propagating in the $k$-column above the lattice point $(1,2,0)$, we recognize that for $c_{1,2}$ we need the same row vector of $A$ as the recurrence for $c_{1,1}$. We can thus propagate the elements of the row vector $a_1^T$ along the $j$-direction and build up one side of the dot products for the row $c_1^T$.

Generalized to the whole $A$ matrix, this is a set of propagation recurrences defined by: $$ a_{i,j,k} = a_{i,j-1,k}$$

Similarly, the column vector $b_1$ is shared between the recurrences for $c_{1,1}$ and $c_{2,1}$, and we can propagate the elements of the column vector $b_1$ along the $i$-direction and build up the other side of the dot products for the column $c_1$. Generalized to the whole $B$ matrix, this is a set of propagation recurrences defined by: $$ b_{i,j,k} = b_{i-1,j,k}$$

Once we have the $A$ and $B$ matrix elements distributed throughout the lattice, we can finally transform the $c$ recurrence into a local neighborhood operand communication as well:

$$c_{i,j,k} = c_{i,j,k-1} + a_{i,j-1,k} * b_{i-1,j,k}$$

This completes the transformation to all local neighborhood operand communications, with the system of recurrences we have seen before expressed as a domain flow program:


compute ( (i,j,k) | 1 <= i,j,k <= N ) {
    a: a[i,j-1,k]
    b: b[i-1,j,k]
    c: c[i,j,k-1] + a[i,j-1,k] * b[i-1,j,k]
}
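
As a sanity check, here is a small reference simulation of this system of recurrences on the $(i,j,k)$ lattice (an illustrative Python sketch, not the actual domain flow execution model): the $A$ and $B$ operands are injected on the $j = 0$ and $i = 0$ planes, and every other operand reference is a nearest-neighbor read, exactly as in the program above.

```python
import numpy as np

def domain_flow_matmul(A, B):
    """Simulate the system of recurrences on the lattice 1 <= i,j,k <= N.

    a[i, 0, k] and b[0, j, k] hold the input matrices on the j = 0 and i = 0 planes;
    all other operand references are nearest-neighbor reads.
    """
    N = A.shape[0]
    a = np.zeros((N + 1, N + 1, N + 1))
    b = np.zeros((N + 1, N + 1, N + 1))
    c = np.zeros((N + 1, N + 1, N + 1))

    # inject A on the j = 0 plane and B on the i = 0 plane (1-based lattice indices)
    for i in range(1, N + 1):
        for k in range(1, N + 1):
            a[i, 0, k] = A[i - 1, k - 1]
    for j in range(1, N + 1):
        for k in range(1, N + 1):
            b[0, j, k] = B[k - 1, j - 1]

    # evaluate the recurrences; any order consistent with the data dependencies works
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            for k in range(1, N + 1):
                a[i, j, k] = a[i, j - 1, k]        # a propagates in the j-direction
                b[i, j, k] = b[i - 1, j, k]        # b propagates in the i-direction
                c[i, j, k] = c[i, j, k - 1] + a[i, j - 1, k] * b[i - 1, j, k]

    return c[1:, 1:, N]                            # c[i, j, N] holds C[i, j]

# quick check against the library matmul
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(domain_flow_matmul(A, B), A @ B)
```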
    

[1]: History of Matrices and Determinants

[2]: Gene Golub and Charles van Loan, Matrix Computations