.. _ch_conv_cpu:

Convolution
===========

In this section, we will optimize the convolution operator defined in
:numref:`ch_conv` on CPUs. Specifically, this is a 2-D convolution operator.

Setup
-----

.. code:: python

    import d2ltvm
    import numpy as np
    import timeit
    import tvm
    from tvm import te

    target = 'llvm -mcpu=skylake-avx512'

Let's first define our performance baseline, which is the convolution
operator provided by MXNet, backed by the Intel MKL-DNN library. MKL-DNN
uses OpenMP, compiled by the Intel compiler, to implement thread-level
parallelization. In order to match the thread behavior of TVM described in
:numref:`ch_vector_add_cpu`, we specify the following Intel OpenMP
environment variable. Without setting it, the multi-threaded performance of
MKL-DNN kernels may not be stable.

.. code:: python

    import os
    os.environ['KMP_AFFINITY']='granularity=fine,noduplicates,compact,1,0'

To simplify the measurement, we use the same number of input and output
channels, and the same height and width for the input and kernel tensors.
The padding will be ``(kernel - 1) // 2`` and the stride will be 1, so that
the output has the same width and height as the input, i.e. ``SAME``
padding.

.. code:: python

    # Save to the d2ltvm package.
    def conv_gflop(oc, ic, n, k, p, s):
        """Compute the #floating point operations in a convolution.

        The arguments are output channels oc, input channels ic, input size n,
        kernel size k, padding p and stride s.
        """
        on = d2ltvm.conv_out_size(n, k, p, s)
        return 2 * oc * ic * on * on * k * k / 1e9

    # Save to the d2ltvm package.
    def conv_timer_mxnet(c, n, k, ctx):
        """Benchmark convolution in MXNet

        c : input, output channels
        n : input width and height
        k : kernel width and height
        """
        timer = timeit.Timer(
            setup='import d2ltvm\n'
            'import mxnet as mx\n'
            'c, n, k, p, s = %d, %d, %d, %d, 1\n'
            'data, weight, bias, out = d2ltvm.get_conv_data_mxnet(\n'
            '    c, c, n, k, p, s, "%s")'%(c, n, k, (k-1)//2, ctx),
            stmt='d2ltvm.conv_mxnet(data, weight, bias, out, k, p, s);'
            'out.wait_to_read()')
        return timer.timeit

    # Save to the d2ltvm package.
    def bench_conv_mxnet(sizes, ctx='cpu'):
        """Return the GFLOPS of MXNet convolution"""
        return [conv_gflop(c, c, n, k, (k-1)//2, 1) /
                d2ltvm.bench_workload(conv_timer_mxnet(c, n, k, ctx))
                for c, n, k in sizes]

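To get a feel for the magnitudes involved, here is a quick check of
``conv_gflop`` for one of the workloads benchmarked below; the particular
size (64 channels, a 64×64 input and a 3×3 kernel) is only an illustrative
choice.

.. code:: python

    # With SAME padding ((3-1)//2 = 1) and stride 1 the output is also 64x64,
    # so one convolution does 2 * 64 * 64 * 64 * 64 * 3 * 3 FLOPs, ~0.3 GFLOPs.
    print(conv_gflop(64, 64, 64, 3, 1, 1))
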
Then we benchmark its performance with various numbers of channels, while
fixing the input and kernel width/height to 64 and 3, respectively.

.. code:: python

    channels = 2**np.arange(4, 9)
    # a list of (c, n, k)
    sizes = [(int(c), 64, 3) for c in channels]
    mxnet_gflops = bench_conv_mxnet(sizes)

Similar to the ``bench_matmul_tvm`` method implemented in
:numref:`ch_matmul_cpu`, we implement a reusable method to benchmark the
convolution operator. Note that the first argument ``func`` here is a
function that takes input arguments ``oc, ic, n, k, p, s`` and returns the
schedule as well as its corresponding tensor symbolic arguments.

.. code:: python

    # Save to the d2ltvm package.
    def bench_conv_tvm(func, sizes, target):
        def workload(nrepeats):
            timer = mod.time_evaluator(mod.entry_name, ctx=ctx, number=nrepeats)
            return timer(x, k, y).mean * nrepeats
        gflops, times = [], []
        for (c, n, k) in sizes:
            args = c, c, n, k, (k-1)//2, 1  # oc, ic, n, k, p, s
            s, (X, K, Y) = func(*args)
            mod = tvm.build(s, [X, K, Y], target)
            ctx = tvm.context(target, 0)
            x, k, y = d2ltvm.get_conv_data(
                *args, lambda x: tvm.nd.array(x, ctx=ctx))
            times.append(d2ltvm.bench_workload(workload))
            gflops.append(conv_gflop(*args))
        return np.array(gflops) / np.array(times)

Blocked Convolution
-------------------

Just as we tiled matrices into blocks to improve the cache efficiency of
matrix multiplication in :numref:`ch_block_matmul_cpu`, we can do the same
for convolution. For simplicity, let's consider the case with a single
output channel. Consider a block of :math:`Y`, denoted as
``Y[i:i+th, j:j+tw]``, where ``th`` and ``tw`` are the tile sizes for height
and width, respectively. We know that it can be computed by applying the
convolution on a block of :math:`X` with kernel :math:`K`. In particular,
assuming the height and width of :math:`K` are ``kh`` and ``kw``, then

``Y[i:i+th, j:j+tw] = conv(X[i:i+th+kh-1, j:j+tw+kw-1], K)``

The elements needed for such a block are illustrated in
:numref:`fig_conv_block`. If we choose proper ``th`` and ``tw``, we can fit
the data into the cache to improve its efficiency. This generalizes easily
to multiple output channels, as it simply adds the corresponding number of
kernels.

.. _fig_conv_block:

.. figure:: ../img/conv_block.svg

   Compute a block of :math:`Y` in convolution.

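To make the block identity above concrete, below is a small NumPy sketch of
our own (it is not part of the d2ltvm package): ``conv2d`` is a naive
unpadded, stride-1 convolution for a single input and output channel, and we
check one arbitrarily chosen 8×8 block of the output against the convolution
of the corresponding input patch.

.. code:: python

    # A minimal single-channel check of the block identity; conv2d is a naive
    # unpadded, stride-1 2-D convolution written only for this illustration.
    def conv2d(x, w):
        kh, kw = w.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        y = np.empty((oh, ow), dtype=x.dtype)
        for r in range(oh):
            for c in range(ow):
                y[r, c] = (x[r:r+kh, c:c+kw] * w).sum()
        return y

    x = np.random.normal(size=(66, 66)).astype('float32')  # already padded input
    w = np.random.normal(size=(3, 3)).astype('float32')    # 3x3 kernel
    y = conv2d(x, w)                                        # full 64x64 output

    i, j, th, tw = 8, 16, 8, 8                              # an arbitrary block
    kh, kw = w.shape
    block = conv2d(x[i:i+th+kh-1, j:j+tw+kw-1], w)
    np.testing.assert_allclose(y[i:i+th, j:j+tw], block)
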
The implementation of this schedule is similar to the one for matrix
multiplication defined in :numref:`ch_block_matmul_cpu`. But also note the
following two differences:

1. We parallelize the outer data blocks across the output channels.
2. We move the inner height and width axes after the reduction axes, i.e. to
   the innermost positions, since they can be more effectively optimized
   through vectorization and unrolling.

.. code:: python

    th, tw = 8, 8  # Tile sizes for height and width

    def cached_block(oc, ic, n, k, p, s):
        X, K, Y, PaddedX = d2ltvm.conv(oc, ic, n, n, k, k, p, p, s, s)
        s = te.create_schedule(Y.op)
        CachedY = s.cache_write(Y, 'local')
        # Compute the output block for every output channel in parallel
        oc, h, w = Y.op.axis
        ho, wo, hi, wi = s[Y].tile(h, w, th, tw)
        ochw = s[Y].fuse(oc, ho, wo)
        s[Y].parallel(ochw)
        # Cache the output block, and move the inner height and width axes
        # to innermost, so we can vectorize and unroll them
        s[CachedY].compute_at(s[Y], ochw)
        _, ch, cw = CachedY.op.axis
        ric, rkh, rkw = CachedY.op.reduce_axis
        s[CachedY].reorder(ric, rkh, rkw, ch, cw)
        s[CachedY].vectorize(cw)
        s[CachedY].unroll(ch)
        # Schedule the padding by adding thread-level parallelism
        if PaddedX != X:
            s[PaddedX].parallel(PaddedX.op.axis[0])
        return s, (X, K, Y)

    s, args = cached_block(32, 32, 64, 3, 1, 1)
    tvm.lower(s, args, simple_mode=True)

.. parsed-literal::
    :class: output

    // attr [PaddedX] storage_scope = "global"
    allocate PaddedX[float32 * 139392]
    produce PaddedX {
      parallel (i0, 0, 32) {
        for (i1, 0, 66) {
          for (i2, 0, 66) {
            PaddedX[(((i0*4356) + (i1*66)) + i2)] = tvm_if_then_else(((((i1 < 1) || (65 <= i1)) || (i2 < 1)) || (65 <= i2)), 0f, X[((((i0*4096) + (i1*64)) + i2) - 65)])
          }
        }
      }
    }
    produce Y {
      parallel (c.i.outer.fused.j.outer.fused, 0, 2048) {
        // attr [Y.local] storage_scope = "local"
        allocate Y.local[float32 * 64]
        produce Y.local {
          Y.local[ramp(0, 1, 8)] = x8(0f)
          Y.local[ramp(8, 1, 8)] = x8(0f)
          Y.local[ramp(16, 1, 8)] = x8(0f)
          Y.local[ramp(24, 1, 8)] = x8(0f)
          Y.local[ramp(32, 1, 8)] = x8(0f)
          Y.local[ramp(40, 1, 8)] = x8(0f)
          Y.local[ramp(48, 1, 8)] = x8(0f)
          Y.local[ramp(56, 1, 8)] = x8(0f)
          for (ric, 0, 32) {
            for (rkh, 0, 3) {
              for (rkw, 0, 3) {
                Y.local[ramp(0, 1, 8)] = (Y.local[ramp(0, 1, 8)] + (PaddedX[ramp((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(8, 1, 8)] = (Y.local[ramp(8, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 66), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(16, 1, 8)] = (Y.local[ramp(16, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 132), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(24, 1, 8)] = (Y.local[ramp(24, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 198), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(32, 1, 8)] = (Y.local[ramp(32, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 264), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(40, 1, 8)] = (Y.local[ramp(40, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 330), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(48, 1, 8)] = (Y.local[ramp(48, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 396), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
                Y.local[ramp(56, 1, 8)] = (Y.local[ramp(56, 1, 8)] + (PaddedX[ramp(((((((ric*4356) + (floordiv(floormod(c.i.outer.fused.j.outer.fused, 64), 8)*528)) + (rkh*66)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + rkw) + 462), 1, 8)]*x8(K[((((floordiv(c.i.outer.fused.j.outer.fused, 64)*288) + (ric*9)) + (rkh*3)) + rkw)])))
              }
            }
          }
        }
        for (i.inner, 0, 8) {
          for (j.inner, 0, 8) {
            Y[((((floordiv(c.i.outer.fused.j.outer.fused, 8)*512) + (i.inner*64)) + (floormod(c.i.outer.fused.j.outer.fused, 8)*8)) + j.inner)] = Y.local[((i.inner*8) + j.inner)]
          }
        }
      }
    }

The output pseudo code becomes a bit long, but as long as you understand the
scheduling primitives we used above, the code should make sense if you take
a close look.

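Before benchmarking, we can sanity check the schedule numerically. The
following sketch is our own addition rather than part of the original
notebook; it compares the result of ``cached_block`` against a default,
unoptimized schedule built from the same ``d2ltvm.conv`` compute definition.

.. code:: python

    # Verify cached_block against a default schedule on one workload (a sketch).
    oc, ic, n, k, p, stride = 32, 32, 64, 3, 1, 1
    ctx = tvm.context(target, 0)
    data, weight, out = d2ltvm.get_conv_data(oc, ic, n, k, p, stride)

    def run(schedule_func):
        """Build the schedule, run it on copies of the same data, return the output."""
        sch, (X, K, Y) = schedule_func(oc, ic, n, k, p, stride)
        mod = tvm.build(sch, [X, K, Y], target)
        x, w, y = [tvm.nd.array(a, ctx=ctx) for a in (data, weight, out)]
        mod(x, w, y)
        return y.asnumpy()

    def default_block(oc, ic, n, k, p, s):
        """The baseline: same compute definition with the default schedule."""
        X, K, Y, _ = d2ltvm.conv(oc, ic, n, n, k, k, p, p, s, s)
        return te.create_schedule(Y.op), (X, K, Y)

    np.testing.assert_allclose(run(cached_block), run(default_block), atol=1e-5)
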
Now let's benchmark its performance and compare it with our baseline.

.. code:: python

    tvm_gflops = bench_conv_tvm(cached_block, sizes, target)
    d2ltvm.plot_gflops(channels, [mxnet_gflops, tvm_gflops], ['mxnet', 'tvm'])

.. figure:: output_conv_404e0c_13_0.svg

As we can see, our schedule does well for small channel sizes, while its
performance decreases for large channel sizes. We will optimize the latter
case in :numref:`ch_packed_conv_cpu`.

Summary
-------

-  We tile the input and output data of convolution as we did for matrix
   multiplication for better cache efficiency.

Exercises
---------

-  Try different tiling sizes; a possible starting point is sketched below.

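As one way to begin the exercise, the sketch below (our addition, not the
book's solution) sweeps a few square tile sizes on a single workload; the
candidate sizes are arbitrary choices, and note that it overwrites the
module-level ``th`` and ``tw`` that ``cached_block`` reads.

.. code:: python

    # Sweep a few tile sizes on one workload: 64 channels, 64x64 input, 3x3 kernel.
    for tile in (4, 8, 16, 32):
        th = tw = tile  # cached_block reads these module-level variables
        gflops = bench_conv_tvm(cached_block, [(64, 64, 3)], target)
        print('tile %2d x %-2d: %.1f GFLOPS' % (th, tw, gflops[0]))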