.. _ch_depthwise_conv_cpu:

Depthwise Convolution
=====================

In this section, we will follow the packing idea presented in
:numref:`ch_packed_conv_cpu` to re-define the computation of depthwise
convolution and schedule it to run efficiently on CPUs. As with 2-D
convolution, tiling the data along the channel dimension and packing it
into ``NCHW[x]c`` benefits performance significantly.

.. code:: python

    import d2ltvm
    import numpy as np
    import tvm
    from tvm import te
    import timeit
    import os

    os.environ['KMP_AFFINITY']='granularity=fine,noduplicates,compact,1,0'
    target = 'llvm -mcpu=skylake-avx512'

Packing Data and Weight
-----------------------

Recall from :numref:`ch_depthwise_conv` that depthwise convolution
differs from 2-D convolution in that each channel of the input data is
convolved with a separate kernel. Therefore, the packing mechanism for
the input data is exactly the same as in :numref:`ch_packed_conv_cpu`.
The kernel is a bit different: its size is ``[oc, 1, kh, kw]``, which
means there is no input channel dimension to tile. In other words, in
the packing method below we only pass one argument ``c`` for the number
of channels, and another argument ``tc`` for the channel tiling size.
Other than that, it works very similarly to the ``conv_pack`` method
defined in :numref:`ch_packed_conv_cpu`.

.. code:: python

    def depthwise_conv_pack(c, nh, nw, kh, kw, ph, pw, tc):
        """Pack data and weight for depthwise convolution
           Note that the input channel of kernel is specified as 1,
           and the output channel of kernel equals the input channel of data

        c : input channel of data and output channel of kernel
        nh, nw : input height and width
        kh, kw : kernel height and width
        ph, pw : height and width padding
        tc : the tiling size of channels
        """
        X = te.placeholder((c, nh, nw), name='X')
        K = te.placeholder((c, 1, kh, kw), name='K')
        PaddedX = d2ltvm.padding(X, ph, pw) if ph * pw != 0 else X
        # make sure the channel tiling is valid
        if c < tc:
            tc = c
        assert c % tc == 0
        # pack X and K
        PackedX = te.compute(
            (c//tc, nh+ph*2, nw+pw*2, tc),
            lambda c_out, x, y, c_in: PaddedX[c_out*tc + c_in, x, y],
            name='PackedX')
        PackedK = te.compute(
            (c//tc, 1, kh, kw, 1, tc),
            lambda c_out, _, x, y, __, c_in: K[
                c_out*tc + c_in, 0, x, y],
            name='PackedK')
        return X, K, PaddedX, PackedX, PackedK
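To make the packed layout concrete, the data packing performed by
``PackedX`` above is equivalent to a reshape followed by a transpose in
NumPy. The snippet below is a small illustration only; ``pack_nchwc_np``
is a hypothetical helper for this exposition, not part of the
``d2ltvm`` package.

.. code:: python

    def pack_nchwc_np(x, tc):
        """Pack a (c, h, w) array into the (c//tc, h, w, tc) layout.
           A hypothetical helper for illustration, not in d2ltvm."""
        c, h, w = x.shape
        assert c % tc == 0
        # split the channels into (c//tc, tc) tiles, then move the inner
        # tc channels to the last, fastest-varying axis
        return x.reshape(c//tc, tc, h, w).transpose(0, 2, 3, 1)

    x_np = np.random.normal(size=(32, 6, 6)).astype('float32')
    packed_np = pack_nchwc_np(x_np, 16)
    print(packed_np.shape)  # (2, 6, 6, 16)
    # agrees with PackedX: packed[c_out, i, j, c_in] == x[c_out*tc + c_in, i, j]
    print(np.array_equal(packed_np[1, 2, 3], x_np[16:32, 2, 3]))

Placing the ``tc`` innermost channels contiguously in memory is what
later allows the inner loop to be vectorized over ``c_in``.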
Computation
-----------

As in :numref:`ch_packed_conv_cpu`, we need to re-implement the
depthwise convolution computation accordingly.

.. code:: python

    def depthwise_conv(c, nh, nw, kh, kw, ph, pw, sh, sw, tc):
        """depthwise conv

        c : number of channels for both input and output
        nh, nw : input height and width
        kh, kw : kernel height and width
        ph, pw : height and width padding
        sh, sw : height and width strides
        tc : the tiling sizes of channels
        """
        X, K, PaddedX, PackedX, PackedK = depthwise_conv_pack(
            c, nh, nw, kh, kw, ph, pw, tc)
        # reduction axes
        rkh = te.reduce_axis((0, kh), name='rkh')
        rkw = te.reduce_axis((0, kw), name='rkw')
        # output height and width
        oh = d2ltvm.conv_out_size(nh, kh, ph, sh)
        ow = d2ltvm.conv_out_size(nw, kw, pw, sw)
        # compute Y in the packed layout
        PackedY = te.compute(
            (c//tc, oh, ow, tc),
            lambda c_out, x, y, c_in: te.sum(
                (PackedX[c_out, x*sh+rkh, y*sw+rkw, c_in] *
                 PackedK[c_out, 0, rkh, rkw, 0, c_in]),
                axis=[rkh, rkw]), name='PackedY')
        # unpack the result
        Y = te.compute((c, oh, ow),
                       lambda c, x, y: PackedY[c//tc, x, y, c%tc],
                       name='Y')
        return X, K, Y, PaddedX, PackedX, PackedK, PackedY

Let's quickly compile it using the default schedule to compute the
results.

.. code:: python

    c, n, k, p, s, tc = 32, 64, 3, 1, 1, 16

    X, K, Y, _, _, _, _ = depthwise_conv(c, n, n, k, k, p, p, s, s, tc)
    mod = tvm.build(te.create_schedule(Y.op), [X, K, Y])

    data, weight, out = d2ltvm.get_conv_data(c, c, n, k, p, s,
                                             tvm.nd.array,
                                             conv_type='depthwise')
    mod(data, weight, out)

Then we verify the result.

.. code:: python

    data, weight, bias, out_mx = d2ltvm.get_conv_data_mxnet(c, c, n, k, p, s,
                                                            conv_type='depthwise')
    d2ltvm.depthwise_conv_mxnet(data, weight, bias, out_mx, k, p, s)
    np.testing.assert_allclose(out_mx[0].asnumpy(), out.asnumpy(), atol=1e-5)

Schedule
--------

The optimization strategy here is almost identical to ``cache_block``
defined in :numref:`ch_packed_conv_cpu`. The main difference is in the
channel handling: due to the compute nature of depthwise convolution,
we do not need to reduce along the input channel dimension. The tiling
sizes for channel and width are set so that the working set of the
inner loop, which calculates the cached output, fits in the cache; a
back-of-the-envelope estimate of this working set is sketched after the
sanity check below.

.. code:: python

    # tiling sizes for channel and width
    tc, tw = 16, 4

    def depthwise_cached_block(c, n, k, p, s):
        X, K, Y, PaddedX, PackedX, PackedK, PackedY = depthwise_conv(
            c, n, n, k, k, p, p, s, s, tc)
        sch = te.create_schedule(Y.op)

        CachedY = sch.cache_write(PackedY, 'global')
        c_out, h, w, c_in = sch[PackedY].op.axis
        w_out, w_in = sch[PackedY].split(w, factor=tw)
        sch[PackedY].reorder(c_out, h, w_out, w_in, c_in)
        c_out_h = sch[PackedY].fuse(c_out, h)
        sch[PackedY].parallel(c_out_h)
        sch[CachedY].compute_at(sch[PackedY], w_out)

        cc_out, ch, cw, cc_in = sch[CachedY].op.axis
        kh, kw = sch[CachedY].op.reduce_axis
        sch[CachedY].reorder(cc_out, ch, kh, kw, cw, cc_in)
        sch[CachedY].vectorize(cc_in)
        sch[CachedY].unroll(cw)

        # Schedule the padding by adding thread-level parallelism
        if PaddedX != X:
            sch[PaddedX].parallel(PaddedX.op.axis[0])
        # Optimize the packing of X and K
        sch[PackedX].parallel(sch[PackedX].fuse(*PackedX.op.axis[0:2]))
        sch[PackedX].unroll(PackedX.op.axis[-1])
        sch[PackedK].parallel(sch[PackedK].fuse(*PackedK.op.axis[0:2]))
        sch[PackedK].unroll(PackedK.op.axis[-1])
        # Optimize the unpacking of Y
        sch[Y].parallel(sch[Y].fuse(*Y.op.axis[0:2]))
        sch[Y].unroll(Y.op.axis[-1])
        return sch, (X, K, Y)

    # c, n, k, p, s were defined in the previous code block
    sch, args = depthwise_cached_block(c, n, k, p, s)
    # Uncomment the following line to see the long
    # pseudo code caused by unrolling.
    # tvm.lower(sch, args, simple_mode=True)

As the schedule has changed substantially, let's do another round of
sanity check.

.. code:: python

    mod = tvm.build(sch, args, target)
    ctx = tvm.context(target, 0)
    data, weight, out = d2ltvm.get_conv_data(
        c, c, n, k, p, s, lambda x: tvm.nd.array(x, ctx=ctx),
        conv_type='depthwise')
    mod(data, weight, out)

    data, weight, bias, out_mx = d2ltvm.get_conv_data_mxnet(c, c, n, k, p, s,
                                                            conv_type='depthwise')
    d2ltvm.depthwise_conv_mxnet(data, weight, bias, out_mx, k, p, s)
    np.testing.assert_allclose(out_mx[0].asnumpy(), out.asnumpy(), atol=1e-5)
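As promised above, here is a back-of-the-envelope estimate of the inner
loop's working set. The arithmetic below is a rough sketch assuming
``float32`` data; the 32 KB figure is the typical per-core L1 data
cache size of the Skylake target used in this section.

.. code:: python

    # rough working-set estimate of the inner loop, assuming float32
    bytes_per_elem = 4
    k, s = 3, 1  # kernel size and stride used in this section
    # output block: tc channels x tw output columns
    out_block = tc * tw * bytes_per_elem
    # input block it reads: tc channels x k rows x ((tw-1)*s + k) columns
    in_block = tc * k * ((tw-1)*s + k) * bytes_per_elem
    # kernel block: tc channels x k x k weights
    kernel_block = tc * k * k * bytes_per_elem
    total = out_block + in_block + kernel_block
    print('working set: %d bytes, typical L1d: %d bytes' % (total, 32*1024))

The total is only about 2 KB, far below the L1 capacity, so the cached
output block stays resident in cache while it is being computed.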
Benchmark
---------

Finally, let's benchmark the results against MXNet. The following
benchmarking method is very similar to ``bench_conv_tvm`` defined in
:numref:`ch_conv_cpu`, with two differences:

1. The signature of the convolution functions (depthwise convolution
   only takes one channel argument).
2. The way the FLOPs are computed (the input channel dimension of
   depthwise convolution is 1).

We keep the benchmarking of depthwise convolution separate from the
``bench_conv_tvm`` method to avoid the possible confusion unifying
them may cause.

.. code:: python

    # Save to the d2ltvm package.
    def bench_depthwise_conv_tvm(func, sizes, target):
        def workload(nrepeats):
            timer = mod.time_evaluator(mod.entry_name, ctx=ctx, number=nrepeats)
            return timer(x, k, y).mean * nrepeats
        gflops, times = [], []
        for (c, n, k) in sizes:
            args = c, n, k, (k-1)//2, 1 # c, n, k, p, s
            s, (X, K, Y) = func(*args)
            mod = tvm.build(s, [X, K, Y], target)
            ctx = tvm.context(target, 0)
            x, k, y = d2ltvm.get_conv_data(
                args[0], *args, lambda x: tvm.nd.array(x, ctx=ctx),
                conv_type='depthwise')
            times.append(d2ltvm.bench_workload(workload))
            gflops.append(d2ltvm.conv_gflop(1, *args))
        return np.array(gflops) / np.array(times)

Similarly, the timing methods for depthwise convolution in MXNet are
largely duplicated from the corresponding methods defined in
:numref:`ch_conv_cpu`.

.. code:: python

    # Save to the d2ltvm package.
    def depthwise_conv_timer_mxnet(c, n, k, ctx):
        """Benchmark depthwise convolution in MXNet

        c : input and output channels
        n : input width and height
        k : kernel width and height
        """
        timer = timeit.Timer(
            setup='import d2ltvm\n'
            'import mxnet as mx\n'
            'c, n, k, p, s = %d, %d, %d, %d, 1\n'
            'data, weight, bias, out = d2ltvm.get_conv_data_mxnet(\n'
            '    c, c, n, k, p, s, "%s", "%s")'%(c, n, k, (k-1)//2, ctx, 'depthwise'),
            stmt='d2ltvm.depthwise_conv_mxnet(data, weight, bias, out, k, p, s);'
            'out.wait_to_read()')
        return timer.timeit

    # Save to the d2ltvm package.
    def bench_depthwise_conv_mxnet(sizes, ctx='cpu'):
        """Return the GFLOPS of MXNet depthwise convolution"""
        return [d2ltvm.conv_gflop(1, c, n, k, (k-1)//2, 1) /
                d2ltvm.bench_workload(depthwise_conv_timer_mxnet(c, n, k, ctx))
                for c, n, k in sizes]

Now, let's benchmark against the MXNet baseline. Our depthwise
convolution consistently outperforms MXNet. As depthwise convolution is
a memory-bound operator, the performance saturates once the number of
channels reaches 128.

.. code:: python

    channels = 2**np.arange(4, 9)
    sizes = [(int(c), 64, 3) for c in channels] # a list of (c, n, k)
    tvm_gflops = bench_depthwise_conv_tvm(depthwise_cached_block, sizes, target)
    mxnet_gflops = bench_depthwise_conv_mxnet(sizes)
    d2ltvm.plot_gflops(channels, [mxnet_gflops, tvm_gflops], ['mxnet', 'tvm'])

.. figure:: output_depthwise_conv_7f7d09_19_0.svg

Summary
-------

- We can get good performance for depthwise convolution by following
  the same rules used to optimize 2-D convolution.

Exercises
---------

- Try different tiling sizes.