BLAS Level 1

BLAS Level 1 operators are of the $\mathcal{O}(N)$ complexity class. This makes them operand-access limited, and they therefore require careful operand distribution in a parallel environment.

There are four basic vector operations, and a fifth convenience operator. Let $\alpha \in \Bbb{R}$ and $x, y, z \in \Bbb{R}^n$; then:

  1. vector scale: scalar-vector multiplication: $z = \alpha x \implies (z_i = \alpha x_i)$
  2. vector element addition: $z = x + y \implies (z_i = x_i + y_i)$
  3. vector element multiply: $z = x * y \implies (z_i = x_i * y_i)$
  4. vector dot product: $c = x^T y \implies (c = \sum_{i=1}^n x_i y_i)$, aka inner-product
  5. saxpy, or scalar alpha x plus y: $z = \alpha x + y \implies (z_i = \alpha x_i + y_i)$

The fifth operator, while technically redundant, makes the expression of linear algebra algorithms more concise.
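
To make the index formulas concrete, here is a minimal sequential reference for these five operators in plain Python; the function names (vscale, vadd, vmul, dot, saxpy) are illustrative, not part of any BLAS API:

def vscale(alpha, x):                 # 1. z_i = alpha * x_i
    return [alpha * xi for xi in x]

def vadd(x, y):                       # 2. z_i = x_i + y_i
    return [xi + yi for xi, yi in zip(x, y)]

def vmul(x, y):                       # 3. z_i = x_i * y_i
    return [xi * yi for xi, yi in zip(x, y)]

def dot(x, y):                        # 4. c = sum_i x_i * y_i
    return sum(xi * yi for xi, yi in zip(x, y))

def saxpy(alpha, x, y):               # 5. z_i = alpha * x_i + y_i
    return [alpha * xi + yi for xi, yi in zip(x, y)]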

One class of domain flow programs for these operators assumes a linear distribution of the vectors, and propagation recurrences for scalar entities. For the dot product, we also use a propagation recurrence to produce the scalar result.

scalar-vector multiplication

compute ( (i,j,k) | 1 <= i <= N, j = 1, k = 1 ) {
    alpha: alpha[i-1,j,k]
    z: alpha[i-1,j,k] * x[i,j-1,k]  
}

Here the vector x being scaled is presented in space at location [i,0,1]. The scalar factor alpha is presented in space at [0,1,1] and is then propagated across the vector elements in the +i direction.

Assume for the moment that we only have nearest-neighbor communication; the broadcast of the scale factor alpha can then be implemented in the form of a propagation. It is the schedule of this broadcast that determines the available concurrency of the operator. In the description above, we have a completely sequential execution schedule due to the propagation broadcast of the scaling factor.
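
A small Python sketch can make that schedule explicit. Under the nearest-neighbor assumption, alpha reaches node i only after node i-1 has fired, so node i cannot compute z_i before step i; the fire_time list below records that O(N) critical path (the simulation is illustrative, not the domain flow notation itself):

def vscale_propagated(alpha, x):
    n = len(x)
    z = [0.0] * n
    fire_time = [0] * n
    a = alpha                   # alpha enters at [0,1,1]
    for i in range(n):          # a arrives at node i at step i+1 ...
        fire_time[i] = i + 1
        z[i] = a * x[i]         # ... which then computes its element of z
        # a is forwarded to node i+1 (the +i direction)
    return z, fire_time         # fire_time = [1, 2, ..., N]: fully sequential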

vector addition

compute ( (i,j,k) | 1 <= i <= N, j = 1, k = 1 ) {
    z: x[i,j-1,k] + y[i,j,k-1]    
}

In this articulation, the x vector is presented in space at location [i,0,1], and the y vector at location [i,1,0]. In this layout there is no contention nor any dependency between input elements, so this operator is fully concurrent and completes in a single step. The trick, of course, is getting both vectors distributed to those locations to enable that concurrency. This speaks to the problem of ‘chaining’ operators together in such a way that the output of a vector-generating operator deposits its result spatially aligned for the next operator.
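
By contrast, a sketch of the vector addition schedule needs no propagation state at all: with x and y already aligned at the nodes (i,1,1), every addition is independent, which a parallel map captures directly (the thread pool is just an illustration of the single-step schedule):

from concurrent.futures import ThreadPoolExecutor

def vadd_concurrent(x, y):
    # every node (i,1,1) holds x_i and y_i locally,
    # so all N additions can fire at once
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda pair: pair[0] + pair[1], zip(x, y)))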

This is the first inkling of the concept of domain flow. Vectors, matrices, and tensors are all aggregate data structures that typically move through the machine as structured entities under the constraints of some domain specification: defined by the compute() clause in the examples shown so far.

vector multiplication

compute ( (i,j,k) | 1 <= i <= N, j = 1, k = 1 ) {
    z: x[i,j-1,k] * y[i,j,k-1]    
}

Element-wise vector multiplication follows the same domain structure as vector addition, only the arithmetic operator switches from addition to multiplication.

dot product

compute ( (i,j,k) | 1 <= i <= N, j = 1, k = 1 ) {
    x: x[i,j-1,k]
    y: y[i,j,k-1]
    z: x[i,j-1,k] * y[i,j,k-1] + z[i-1,j,k]    
}

Using the same domain and vector layout as the element-wise vector addition described previously, we can now construct a dot, or inner-product, operator. The x and y vector elements come in from the -j and -k directions, and the starting, or previous, partial result $z_0$ comes in from [0,1,1].

Here we observe that even though the input vectors' domain flow is entirely unconstrained, the evolution of the dot product execution is completely sequential, caused by the partial sum z propagating in the +i direction.
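
The dependence chain is visible in a direct transcription of the recurrence: each step consumes the previous partial sum, so no amount of operand parallelism shortens the loop (a sketch, with the boundary value $z_0$ initialized to zero):

def dot_propagated(x, y):
    z = 0.0                          # z_0 injected at [0,1,1]
    for xi, yi in zip(x, y):         # node i: z_i = x_i * y_i + z_{i-1}
        z = xi * yi + z              # products are independent; the sum is not
    return z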

For the vector scale operator, we observe a serialization dynamic similar to that of the inner-product operator, but there the serialization is caused by having to ‘broadcast’ the scalar value across the spatial domain of the x vector.

SAXPY

compute ( (i,j,k) | 1 <= i <= N, j = 1, k = 1 ) {
    alpha: alpha[i-1,j,k]
    z: alpha[i-1,j,k] * x[i,j-1,k] + y[i,j,k-1]    
}

The final operator, scalar alpha x plus y, or saxpy, combines the vector scale and vector addition operators, and exhibits the same constraint as the vector scale operator due to the required propagation broadcast of the scaling factor.
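
A sketch of saxpy shows both patterns at once: x_i and y_i are available locally at each node, but the loop is still serialized by the hop-by-hop forwarding of alpha in the +i direction:

def saxpy_propagated(alpha, x, y):
    z = [0.0] * len(x)
    a = alpha                        # alpha enters at [0,1,1]
    for i in range(len(x)):          # serialized by the alpha hop i -> i+1
        z[i] = a * x[i] + y[i]       # node i: z_i = alpha * x_i + y_i
    return z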