2. Matrix Operations

February 10, 202

Basic Linear Algebra Subroutines (BLAS)

Basic linear algebra subroutines are a standard library interface for manipulating dense vectors and matrices. There are 3 levels of BLAS routines.

Vectors $O(n)$ flops. $O(n)$ data.
1. $x \cdot y = z$
Matrix-vector $O(n^2)$ flops. $O(n^2)$ data.
1. $Ax = B$
Matrix-matrix $O(n^3)$ flops. $O(n^2)$ data.
1. $AB = C$

Locality and memory

Matrices are usually stored as contiguous arrays, in column major order. The two costs of operations are

arithmetic operations (flops)
time spent fetching data (communication).

The cost of communication is much higher than the cost of a floating point operation.

💡 Communication takes about 100 cycles to fetch something.

Memories are optimized for

spatial locality
- It is faster to access data sequentially that is close to each other.
temporal locality
- This includes reusing data in cache.

Examples of matrix vector products

Here is a basic implementation of matrix-vector multiplication.

function matvec_row(A, x)
	m, n = size(A)
	y = zeros(eltype(x), m)
	for i = 1:m
		for j = 1:n
			y[i] += A[i,j] * x[j]
		end
	end
	return y
end

This has some spatial locality in $y$ because it is reusing cache.

It has some temporal locality in $x$ because it is sequentially calling the pointer.

However, this is slower than iterating over the $n$ then $m$! This is because the column-wise implementation has spatial locality for both $i$ and $j$.

Matrix multiplication

Notice that for BLAS3 operations, it takes $O(n^3)$ flops but uses $O(n^2)$ data. This creates a big opportunity for efficiency.

One way we can try to compute this is by the inner product. Suppose we want to compute $C = AB$. We can write $A$ as a $n\times 1$ matrix.

\[C= \begin{bmatrix} a_1^T\\\vdots\\a_m^T \end{bmatrix} \begin{bmatrix}b_1&\dots& b_n\end{bmatrix}\\ C_{ij} = a_i^T b_j\]

We can also try computing an outer product.

\[C= \begin{bmatrix}a_1&\dots& a_n\end{bmatrix}\begin{bmatrix} b_1^T\\\vdots\\b_m^T \end{bmatrix} \\ C_{ij} = a_i b_j^T\]

In practice, these can be made much faster by blocking.

Blocking

Blocking is a way to increase temporal locality by representing matrices as smaller matrices. It allocates memory that can be allocated fast.

Structured matrices

There are a couple important matrices that we should know.

Structured matrices
1. Diagonal
2. Tridiagonal
3. Permutation
  1. One nonzero per row/col