.. _ch_packed_conv_cpu:

Packed Convolution
==================

We observed in :numref:`ch_conv_cpu` that the performance degrades when
increasing the channel size for the convolution operator. In this
section, we will consider tiling the channel axes and packing the data
to improve the performance.

.. code:: python

    import d2ltvm
    import numpy as np
    import tvm
    from tvm import te
    import os

    os.environ['KMP_AFFINITY']='granularity=fine,noduplicates,compact,1,0'

    target = 'llvm -mcpu=skylake-avx512'

Packing Data and Weight
-----------------------

In :numref:`ch_conv_cpu`, we first tiled the width and height axes and
then moved the inner axes to be the innermost dimensions during
computing. We apply the same idea to tile the channels, but with one
additional step: we actually move the data indexed by the inner channel
loop into the last dimension, i.e. we perform a data layout
transformation. This way we pay the data movement cost once, while
improving the data access performance significantly. The following code
block splits the channel dimension and moves the data in NumPy.

.. code:: python

    c, n, tc = 4, 2, 2  # channel, height/width, and tiling size
    x = np.arange(c*n*n).reshape((c, n, n)).astype('float32')
    print('input shape', x.shape, '\n', x)
    y = x.reshape(c//tc, tc, n, n).transpose(0, 2, 3, 1)
    print('packed shape', y.shape, '\n', y)


.. parsed-literal::
    :class: output

    input shape (4, 2, 2)
     [[[ 0.  1.]
      [ 2.  3.]]

     [[ 4.  5.]
      [ 6.  7.]]

     [[ 8.  9.]
      [10. 11.]]

     [[12. 13.]
      [14. 15.]]]
    packed shape (2, 2, 2, 2)
     [[[[ 0.  4.]
       [ 1.  5.]]

      [[ 2.  6.]
       [ 3.  7.]]]


     [[[ 8. 12.]
       [ 9. 13.]]

      [[10. 14.]
       [11. 15.]]]]


Now let's implement the pack computation in TVM.

.. code:: python

    def conv_pack(oc, ic, nh, nw, kh, kw, ph, pw, toc, tic):
        """Pack data and weight for convolution

        oc, ic : output and input channels
        nh, nw : input height and width
        kh, kw : kernel height and width
        ph, pw : height and width padding
        toc, tic : the tiling sizes of the output and input channels
        """
        X = te.placeholder((ic, nh, nw), name='X')
        K = te.placeholder((oc, ic, kh, kw), name='K')
        PaddedX = d2ltvm.padding(X, ph, pw) if ph * pw != 0 else X
        # pack X and K
        assert ic % tic == 0 and oc % toc == 0
        PackedX = te.compute(
            (ic//tic, nh+ph*2, nw+pw*2, tic),
            lambda ic_out, x, y, ic_in: PaddedX[ic_out*tic + ic_in, x, y],
            name='PackedX')
        PackedK = te.compute(
            (oc//toc, ic//tic, kh, kw, tic, toc),
            lambda oc_out, ic_out, x, y, ic_in, oc_in: K[
                oc_out*toc + oc_in, ic_out*tic + ic_in, x, y],
            name='PackedK')
        return X, K, PaddedX, PackedX, PackedK

Let's verify the packing by reproducing the NumPy example above.

.. code:: python

    X, _, _, PackedX, _ = conv_pack(c, c, n, n, 1, 1, 0, 0, tc, tc)
    mod = tvm.build(te.create_schedule(PackedX.op), [X, PackedX])
    packed_x = tvm.nd.array(np.empty((c//tc, n, n, tc), dtype='float32'))
    mod(tvm.nd.array(x), packed_x)
    np.testing.assert_equal(packed_x.asnumpy(), y)

Of note, in :numref:`ch_conv` we defined the layout of the input data
to be ``NCHW``. For the packed data ``PackedX`` here, we follow the
convention defined in :cite:`Liu.Wang.Yu.ea.2019` to define its layout
as ``NCHW[x]c``, where ``x`` describes the size of ``c``, which is 2
(the value of ``tc`` above) in this case.
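To make the layout mapping concrete, the short check below is a minimal
sketch that reuses the NumPy arrays ``x`` and ``y`` (and the sizes ``c``,
``n``, ``tc``) defined above. It verifies that element ``(c, h, w)`` of
the original array lands at position ``(c//tc, h, w, c%tc)`` of the
packed array, which is exactly the index arithmetic used by ``PackedX``.
The loop variable names are chosen only for illustration.

.. code:: python

    # Sanity check of the packed index mapping, reusing x, y, c, n, tc
    # defined above: the original element (channel, i, j) should be found
    # at (channel//tc, i, j, channel%tc) in the packed array.
    for channel in range(c):
        for i in range(n):
            for j in range(n):
                assert x[channel, i, j] == y[channel // tc, i, j, channel % tc]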
Computation
-----------

Since we changed the data layout, we need to re-implement the
convolution computation accordingly.

.. code:: python

    def conv(oc, ic, nh, nw, kh, kw, ph, pw, sh, sw, toc, tic):
        """2-D conv with packed data and weight

        oc, ic : output and input channels
        nh, nw : input height and width
        kh, kw : kernel height and width
        ph, pw : height and width padding
        sh, sw : height and width strides
        toc, tic : the tiling sizes of output and input channels
        """
        X, K, PaddedX, PackedX, PackedK = conv_pack(
            oc, ic, nh, nw, kh, kw, ph, pw, toc, tic)
        # reduction axes
        ric_in = te.reduce_axis((0, tic), name='ric_in')
        ric_out = te.reduce_axis((0, ic//tic), name='ric_out')
        rkh = te.reduce_axis((0, kh), name='rkh')
        rkw = te.reduce_axis((0, kw), name='rkw')
        # output height and width
        oh = d2ltvm.conv_out_size(nh, kh, ph, sh)
        ow = d2ltvm.conv_out_size(nw, kw, pw, sw)
        # compute Y in the packed layout
        PackedY = te.compute(
            (oc//toc, oh, ow, toc),
            lambda oc_out, x, y, oc_in: te.sum(
                PackedX[ric_out, x*sh+rkh, y*sw+rkw, ric_in] *
                PackedK[oc_out, ric_out, rkh, rkw, ric_in, oc_in],
                axis=[ric_out, rkh, rkw, ric_in]), name='PackedY')
        # unpack the result
        Y = te.compute((oc, oh, ow),
                       lambda oc, x, y: PackedY[oc//toc, x, y, oc%toc],
                       name='Y')
        return X, K, Y, PaddedX, PackedX, PackedK, PackedY

Let's compile it using the default scheduling and compute the results.

.. code:: python

    oc, ic, n, k, p, s, toc, tic = 4, 6, 12, 3, 1, 1, 2, 3
    X, K, Y, _, _, _, _ = conv(oc, ic, n, n, k, k, p, p, s, s, toc, tic)
    mod = tvm.build(te.create_schedule(Y.op), [X, K, Y])
    data, weight, out = d2ltvm.get_conv_data(oc, ic, n, k, p, s, tvm.nd.array)
    mod(data, weight, out)

And then verify the result against the MXNet baseline.

.. code:: python

    data, weight, bias, out_mx = d2ltvm.get_conv_data_mxnet(oc, ic, n, k, p, s)
    d2ltvm.conv_mxnet(data, weight, bias, out_mx, k, p, s)
    np.testing.assert_allclose(out_mx[0].asnumpy(), out.asnumpy(), atol=1e-5)

Schedule
--------

The optimization strategy here is similar to :numref:`ch_conv_cpu`. The
major differences are:

1. The innermost axis is the inner axis split from the output channel,
   because after packing these elements already sit on the last
   dimension.
2. We only split the width dimension instead of both the width and
   height dimensions.
3. We need to schedule the packing and unpacking computations as well.

.. code:: python

    # tiling sizes for output channel, input channel, and width
    toc, tic, tw = 16, 16, 4

    def cached_block(oc, ic, n, k, p, s):
        X, K, Y, PaddedX, PackedX, PackedK, PackedY = conv(
            oc, ic, n, n, k, k, p, p, s, s, toc, tic)
        s = te.create_schedule(Y.op)
        CachedY = s.cache_write(PackedY, 'local')
        oc_out, h, w, oc_in = s[PackedY].op.axis
        oc_out_h = s[PackedY].fuse(oc_out, h)
        # Parallelize over the fused first two dimensions oc_out and h
        s[PackedY].parallel(oc_out_h)
        # Optimize the computation of a cached output block
        w_out, w_in = s[PackedY].split(w, factor=tw)  # Split the columns
        s[CachedY].compute_at(s[PackedY], w_out)
        _, _, cw, oc_in = CachedY.op.axis
        ric_out, rkh, rkw, ric_in = CachedY.op.reduce_axis
        s[CachedY].reorder(ric_out, rkh, rkw, ric_in, cw, oc_in)
        s[CachedY].unroll(ric_in)
        s[CachedY].unroll(cw)
        s[CachedY].vectorize(oc_in)
        # Schedule the padding by adding thread-level parallelism
        if PaddedX != X:
            s[PaddedX].parallel(PaddedX.op.axis[0])
        # Optimize the packing of X and K
        s[PackedX].parallel(s[PackedX].fuse(*PackedX.op.axis[0:2]))
        s[PackedX].unroll(PackedX.op.axis[-1])
        s[PackedK].parallel(s[PackedK].fuse(*PackedK.op.axis[0:2]))
        s[PackedK].unroll(PackedK.op.axis[-1])
        # Optimize the unpacking of Y
        s[Y].parallel(s[Y].fuse(*Y.op.axis[0:2]))
        s[Y].unroll(Y.op.axis[-1])
        return s, (X, K, Y)

    s, args = cached_block(32, 32, 64, 3, 1, 1)
    # Uncomment the following line to see the long pseudo code
    # that results from the unrolling.
    # tvm.lower(s, args, simple_mode=True)
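Before benchmarking, we can reuse the verification pattern from above to
confirm that the scheduled kernel still matches the MXNet baseline. The
following is a minimal sketch using the ``s`` and ``args`` returned by
``cached_block`` in the previous block; it assumes, as the schedule
requires, channel sizes that are multiples of the tiling factors ``toc``
and ``tic``.

.. code:: python

    # A minimal correctness check of the scheduled kernel, reusing the
    # s and args produced by cached_block(32, 32, 64, 3, 1, 1) above.
    # The channel sizes must be multiples of the tiling factors toc and tic.
    mod = tvm.build(s, args, target)
    data, weight, out = d2ltvm.get_conv_data(32, 32, 64, 3, 1, 1, tvm.nd.array)
    mod(data, weight, out)
    data, weight, bias, out_mx = d2ltvm.get_conv_data_mxnet(32, 32, 64, 3, 1, 1)
    d2ltvm.conv_mxnet(data, weight, bias, out_mx, 3, 1, 1)
    np.testing.assert_allclose(out_mx[0].asnumpy(), out.asnumpy(), atol=1e-5)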
Let's benchmark the same workloads as :numref:`ch_conv_cpu` and compare
to our MXNet baseline. As you can see, the results are significantly
improved.

.. code:: python

    channels = 2**np.arange(4, 9)
    sizes = [(int(c), 64, 3) for c in channels]  # a list of (c, n, k)
    tvm_gflops = d2ltvm.bench_conv_tvm(cached_block, sizes, target)
    mxnet_gflops = d2ltvm.bench_conv_mxnet(sizes)
    d2ltvm.plot_gflops(channels, [mxnet_gflops, tvm_gflops], ['mxnet', 'tvm'])



.. figure:: output_packed_conv_946b6a_17_0.svg


Summary
-------

- We often tile input and output channels for better cache efficiency
  and pack data accordingly.

Exercises
---------

- What if the number of channels cannot be divided by the tiling size?
- Try different tiling sizes.