§ Proving block matmul using program analysis


It's a somewhat well-known fact that given matrix multiplication: O=ABO = AB where OR2n×2mO \in \mathbb R^{2n \times 2m} ( OO for output), AR2n×r,BRr×2mA \in \mathbb R^{2n \times r}, B \in \mathbb R^{r \times 2m} are matrices.
We can also write this as follows:
[o11o12o21o22]=[a11a12a21a22][b11b12b21b22]=[a11b11+a12b21a11b12+a12b22a21b11+a22b21a21b12+a22b22] \begin{bmatrix} o_{11} & o_{12} \\ o_{21} & o_{22} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11}+ a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{bmatrix}

When written as code, the original matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
int matmul(int N, int Z, int M, int a[N][Z], int b[Z][M], int out[N][M]) {
  for(int i = 0; i < 2*N; ++i) {
    for(int j = 0; j < 2*M; ++j) {
      for(int k = 0; k < 2Z; ++k) out[i][j] += a[i][k] * b[k][j]
    }
  }
}

and the block-based matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
int matmulBlock(int N, int Z, int M, int a[N][Z], int b[Z][M], int out[N][M]) {
  for (int BI = 0; BI < 2; ++BI) {
    for (int BJ = 0; BJ < 2; ++BJ) {
      for(int i = BI*N; i < BI*N+N; ++i) {
        for(int j = BJ*M; j < BJ*M+M; ++j) {
          for(int k = 0; k < 2Z; ++k) { out[i][j] += a[i][k] * b[k][j] }
        }
      }
    }
  }
}

we wish to show that both of these programs have the same semantics . We will do this by appealing to ideas from program analysis.

§ The key idea


We will consider the statement:
out[i][j] += a[i][k] * b[k][j]

as occuring at an abstract "point in time" (i,j,k)(i, j, k) in the matmul function. I also occurs at an abstract "point in time" (BI,BJ,i,j,k)(BI, BJ, i', j', k') in the matmulBlock function.
We will then show that the loops for(i...) for(j...) for(k...) are fully parallel, and hence we can reorder the loops any way we want.
Then, we will show that the ordering imposed by (BI,BJ,i,j,k)(BI, BJ, i', j', k') is a reordering of the original (i,j,k)(i, j, k) ordering. We do this by showing that there is a bijection:
(i=i0,j=j0,k=k0)(BI=i0/N,BJ=j0/N,i=i0%N,j=j0%N,k=k0) (i=i_0, j=j_0, k=k_0) \rightarrow (BI=i_0/N, BJ=j_0/N, i=i_0\%N, j=j_0\%N, k=k_0)

Thus, this bijection executes all loops, and does so without affecting the program semantics.

§ Schedules


We'll zoom out a little, to consider some simple programs and understan how to represent parallelism.
void eg1(int N, int M, int out[N][M]) {
for(int i = 0; i < N; ++i) {
  for(int j = 1; j < M; ++j) {
    out[i][j] = out[i][j-1];
  }
}

Notice that this program is equivalent to the program with the ii loop reversed:
void eg1rev(int N, int M, int out[N][M]) {
for(int i = N-1; i >=0; --i) {
  for(int j = 1; j < M; ++j) {
    out[i][j] = out[i][j-1];
  }
}

What's actually stopping us from reversing the loop for(j...)? it's the fact that the value of, say, out[i=0][j=1] depends on out[i=0][j=0]. We can see that in general, out[i=i_0][j=j_0] depends on out[i=i_0][j=j_0-1]. We can represent this by considering a dependence set :
{write:(i0,j01)write:(i0,j0)} \{ \texttt{write}:(i_0, j_0-1) \rightarrow \texttt{write}:(i_0, j_0) \}

in general, we can reorder statements as long as we do not change the directions of the arrows in the dependence set.
We can imagine the scenario as follows:
^
| (1, 0)->(1, 1)->(1, 2)->(1, 3) ...
| (0, 0)->(0, 1)->(0, 2)->(0, 3) ....
(i, j) -->

§ Dependence structure of matmul.


§ Fully parallel, reordering


[TODO ]

§ References