Matrix Multiplication
=====================

Matrix multiplication is one of the most widely used operators in
scientific computing and deep learning. It is typically referred to as
*GEMM* (GEneral Matrix Multiply). Let's implement its computation in
this section.

Given :math:`A\in\mathbb R^{n\times l}` and
:math:`B \in\mathbb R^{l\times m}`, if :math:`C=AB` then
:math:`C \in\mathbb R^{n\times m}` and

.. math:: C_{i,j} = \sum_{k=1}^l A_{i,k} B_{k,j}.

The elements accessed to compute :math:`C_{i,j}` are illustrated in
:numref:`fig_matmul_default`.

.. _fig_matmul_default:

.. figure:: ../img/matmul_default.svg

   Compute :math:`C_{x,y}` in matrix multiplication.

The following method returns the computing expression of matrix
multiplication.

.. code:: python

    import d2ltvm
    import numpy as np
    import tvm
    from tvm import te

    # Save to the d2ltvm package
    def matmul(n, m, l):
        """Return the computing expression of matrix multiplication
        A : n x l matrix
        B : l x m matrix
        C : n x m matrix with C = A B
        """
        k = te.reduce_axis((0, l), name='k')
        A = te.placeholder((n, l), name='A')
        B = te.placeholder((l, m), name='B')
        C = te.compute((n, m),
                       lambda x, y: te.sum(A[x, k] * B[k, y], axis=k),
                       name='C')
        return A, B, C

Let's compile a module for a square matrix multiplication.

.. code:: python

    n = 100
    A, B, C = matmul(n, n, n)
    s = te.create_schedule(C.op)
    print(tvm.lower(s, [A, B], simple_mode=True))
    mod = tvm.build(s, [A, B, C])

.. parsed-literal::
    :class: output

    // attr [C] storage_scope = "global"
    allocate C[float32 * 10000]
    produce C {
      for (x, 0, 100) {
        for (y, 0, 100) {
          C[((x*100) + y)] = 0f
          for (k, 0, 100) {
            C[((x*100) + y)] = (C[((x*100) + y)] + (A[((x*100) + k)]*B[((k*100) + y)]))
          }
        }
      }
    }

The pseudocode is simply a naive three-level nested for loop that
calculates the matrix multiplication; a plain-Python rendering of this
loop nest is sketched after the summary.

Then we verify the results. Note that NumPy may use multi-threading to
accelerate its computation, which can produce slightly different
results due to numerical error. We therefore use ``assert_allclose``
with a relatively large tolerance to test the correctness.

.. code:: python

    a, b, c = d2ltvm.get_abc((100, 100), tvm.nd.array)
    mod(a, b, c)
    np.testing.assert_allclose(np.dot(a.asnumpy(), b.asnumpy()),
                               c.asnumpy(), atol=1e-5)

Summary
-------

- We can express the computation of matrix multiplication in TVM in one
  line of code.
- The naive matrix multiplication is a three-level nested for loop.
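
As promised above, here is a minimal pure-Python rendering of the
three-level loop nest shown in the lowered pseudocode, checked against
``np.dot``. This sketch is our own addition rather than part of the
``d2ltvm`` package; it assumes the imports from the first code block
and is only practical for small sizes.

.. code:: python

    # A didactic re-implementation of the lowered loop nest above.
    # This is NOT how TVM executes the compiled kernel; it only mirrors
    # the pseudocode to connect it with the math definition.
    def matmul_ref(a, b):
        n, l = a.shape
        l2, m = b.shape
        assert l == l2, 'inner dimensions must match'
        c = np.zeros((n, m), dtype=a.dtype)
        for x in range(n):          # rows of A
            for y in range(m):      # columns of B
                for k in range(l):  # the reduction axis
                    c[x, y] += a[x, k] * b[k, y]
        return c

    a = np.random.normal(size=(8, 8)).astype('float32')
    b = np.random.normal(size=(8, 8)).astype('float32')
    np.testing.assert_allclose(matmul_ref(a, b), np.dot(a, b), atol=1e-5)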
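
The ``matmul`` expression is not limited to square matrices. The
following sketch, also an addition of ours, compiles and verifies a
small case with distinct ``n``, ``m`` and ``l``; since
``d2ltvm.get_abc`` as used above creates arrays of a single shared
shape, we construct the differently shaped inputs by hand.

.. code:: python

    # Build a module for a non-square matmul: A is n x l, B is l x m.
    n, m, l = 4, 6, 5
    A, B, C = matmul(n, m, l)
    s = te.create_schedule(C.op)
    mod = tvm.build(s, [A, B, C])

    a = tvm.nd.array(np.random.normal(size=(n, l)).astype('float32'))
    b = tvm.nd.array(np.random.normal(size=(l, m)).astype('float32'))
    c = tvm.nd.array(np.zeros((n, m), dtype='float32'))
    mod(a, b, c)
    np.testing.assert_allclose(np.dot(a.asnumpy(), b.asnumpy()),
                               c.asnumpy(), atol=1e-5)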