Operator Optimizations on GPUs

This chapter covers operator optimizations on Nvidia GPUs. We follow the same logic as the previous chapter: we first introduce the GPU architecture, and then optimize a set of typical operators. As a preview of the workflow used throughout the chapter, a minimal vector-add sketch follows the contents list below.

  • 1. GPU Architecture
    • 1.1. Streaming Multiprocessor
    • 1.2. GPU Architecture
    • 1.3. Summary
  • 2. Vector Add
    • 2.1. CUDA Programming
    • 2.2. Summary
  • 3. Broadcast Add
    • 3.1. Setup
    • 3.2. Continuous scheduling
    • 3.3. Alternate scheduling
    • 3.4. Summary
    • 3.5. Exercise
  • 4. Matrix Multiplication
    • 4.1. Setup
    • 4.2. Blocked Matrix Multiplication on GPU
    • 4.3. Implementation
    • 4.4. Summary
  • 5. Convolution
    • 5.1. Setup
    • 5.2. Default schedule of CONV
    • 5.3. Tiling
    • 5.4. Optimizing the data access on GPUs
    • 5.5. Summary
    • 5.6. Exercise
  • 6. Depthwise Convolution
    • 6.1. Setup
    • 6.2. Default schedule of Depthwise Convolution
    • 6.3. Scheduling of Depthwise Convolution
    • 6.4. Summary
    • 6.5. Exercise
  • 7. Pooling
    • 7.1. Scheduling
    • 7.2. Benchmarking
    • 7.3. Summary
    • 7.4. Exercise
  • 8. Batch Norm
    • 8.1. Setup
    • 8.2. Schedule
    • 8.3. Benchmark
    • 8.4. Summary
    • 8.5. Exercise
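
As a taste of what the following sections do, here is a minimal sketch of a vector add scheduled for a GPU with TVM's tensor-expression API, in the spirit of the Vector Add section. The vector length, the split factor of 64, and the cuda target are illustrative assumptions rather than the chapter's exact values, and some API names (e.g. tvm.cuda vs. the older tvm.gpu) vary across TVM versions.

import numpy as np
import tvm
from tvm import te

# Declare the computation C = A + B symbolically.
n = 1024  # illustrative size, not the chapter's benchmark size
A = te.placeholder((n,), name='A')
B = te.placeholder((n,), name='B')
C = te.compute(A.shape, lambda i: A[i] + B[i], name='C')

# Schedule for the GPU: split the single axis, then bind the outer
# part to CUDA thread blocks and the inner part to threads in a block.
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)  # 64 threads/block is an assumption
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

mod = tvm.build(s, [A, B, C], target='cuda')

# Run and verify on a CUDA device.
dev = tvm.cuda(0)
a = tvm.nd.array(np.random.rand(n).astype('float32'), dev)
b = tvm.nd.array(np.random.rand(n).astype('float32'), dev)
c = tvm.nd.array(np.zeros(n, dtype='float32'), dev)
mod(a, b, c)
np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy(), rtol=1e-5)

Binding split axes to blockIdx.x and threadIdx.x is the basic thread mapping that the later sections refine, with tiling and data-access optimizations, for the more involved operators such as matrix multiplication and convolution.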