Course Overview
This course covers the basics of distributed AI in the cloud, focusing on deep learning training and parallelism. It explores the main types of distributed training models and topologies, including data parallelism and model parallelism, and examines the challenges and communication overhead of distributed deep learning training, highlighting the trade-offs among compute, memory, and communication. It also introduces Intel's Habana Gaudi, an ASIC-based platform designed for deep learning training with flexible network topologies, and compares CPU, GPU, and XPU architectures for model parallelism, weighing their strengths and weaknesses. By the end of the course, learners will understand the fundamentals of distributed deep learning and parallelism, and the importance of balancing these factors when designing distributed AI systems.
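To make the data-parallelism concept concrete, here is a minimal sketch using PyTorch's DistributedDataParallel, one common way to implement it. The course does not prescribe a specific framework; the model, data, and hyperparameters below are placeholders chosen for illustration, not course materials.

```python
# Minimal data-parallelism sketch with PyTorch DistributedDataParallel (DDP).
# Each worker process holds a full model replica and trains on its own data
# shard; DDP all-reduces gradients during backward() to keep replicas in sync.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters
    rank = dist.get_rank()

    model = nn.Linear(10, 1)      # placeholder model
    ddp_model = DDP(model)        # wraps the replica for gradient syncing
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        inputs = torch.randn(32, 10)   # stand-in for this rank's data shard
        targets = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()                # gradient all-reduce happens here
        optimizer.step()

    if rank == 0:
        print(f"final loss: {loss.item():.4f}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=2 ddp_sketch.py`, this runs two synchronized replicas on one machine. The per-step gradient all-reduce is exactly the communication overhead the course weighs against compute and memory.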