Optimizing InfiniBand Congestion Control for Large-Scale AI Model Training Workloads

Authors

  • Rakesh Challa, Principal Engineer, Dell Technologies, USA

DOI:

https://doi.org/10.15662/IJEETR.2022.0406015

Keywords:

InfiniBand Congestion Control, AI Model Training, High-Performance Computing, GPU Utilization, Scaling Efficiency

Abstract

This paper examines InfiniBand congestion in large-scale AI model training systems, focusing on its impact on training time, GPU utilization, and scaling efficiency when thousands of GPUs operate together. On the baseline system, iteration time rose from 320 ms at 256 GPUs to 780 ms at 2048 GPUs, while GPU utilization fell from 78% to 58%. Communication time also grew, showing that network congestion is a significant bottleneck. To address this, the paper uses network counters to detect congestion and then applies congestion-optimization techniques including subnet manager tuning, virtual lane separation, and load balancing. With these techniques in place, iteration time at 2048 GPUs dropped to 590 ms, an improvement of approximately 24%. Scaling efficiency rose from 58% to 78%, peak network latency fell from 95 µs to 55 µs, and link utilization across the network became more evenly balanced. The results show that proper congestion control can improve system performance by more than 20%, enabling faster AI training and better use of computing resources in support of large-scale research and innovation.
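The headline figures in the abstract can be sanity-checked with simple arithmetic. The sketch below uses one common definition of weak-scaling efficiency (baseline iteration time divided by scaled iteration time); note that this simple definition yields roughly 41% before and 54% after tuning, so the paper's 58%/78% figures evidently use a different efficiency metric (likely tied to GPU utilization). The ~24% iteration-time improvement, however, follows directly from the reported times.

```python
# Illustrative arithmetic on the figures reported in the abstract.
# weak_scaling_efficiency uses one common definition (ideal weak scaling
# keeps per-iteration time flat as GPUs are added); the paper's own
# 58%/78% efficiency metric is evidently defined differently.

def weak_scaling_efficiency(t_base_ms, t_scaled_ms):
    """Baseline iteration time over scaled iteration time (1.0 = ideal)."""
    return t_base_ms / t_scaled_ms

def improvement(before_ms, after_ms):
    """Fractional reduction in iteration time."""
    return (before_ms - after_ms) / before_ms

base_256 = 320.0        # ms per iteration at 256 GPUs (reported)
congested_2048 = 780.0  # ms per iteration at 2048 GPUs, before tuning
tuned_2048 = 590.0      # ms per iteration at 2048 GPUs, after tuning

print(f"efficiency before tuning: {weak_scaling_efficiency(base_256, congested_2048):.1%}")
print(f"efficiency after tuning:  {weak_scaling_efficiency(base_256, tuned_2048):.1%}")
print(f"iteration-time improvement: {improvement(congested_2048, tuned_2048):.1%}")
# The improvement evaluates to about 24.4%, matching the ~24% in the abstract.
```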
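The abstract states that congestion is detected from network counters. The paper's detection code is not given, so the following is only a minimal sketch of the general approach: on Linux, InfiniBand HCAs expose per-port counters under sysfs, including `port_xmit_wait` (time a port spent unable to transmit for lack of flow-control credits), which is a standard congestion signal. The device path, polling interval, and threshold below are all illustrative assumptions, not values from the paper.

```python
# Hedged sketch: flag likely congestion when the port_xmit_wait counter
# on an InfiniBand port grows quickly. Counter availability and units
# depend on the HCA and driver; path and threshold here are assumptions.
import time
from pathlib import Path

# Illustrative sysfs path for the first port of an assumed mlx5 HCA.
COUNTER = Path("/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_wait")

def xmit_wait_rate(prev, curr, interval_s):
    """Counter increase per second between two samples."""
    return (curr - prev) / interval_s

def sample():
    """Read the current counter value from sysfs."""
    return int(COUNTER.read_text())

def watch(interval_s=1.0, threshold=1_000_000):
    """Poll the counter and report when its growth rate exceeds threshold."""
    prev = sample()
    while True:
        time.sleep(interval_s)
        curr = sample()
        if xmit_wait_rate(prev, curr, interval_s) > threshold:
            print("congestion suspected: port_xmit_wait rising fast")
        prev = curr
```

In practice a tool like `perfquery` or a fabric-wide telemetry collector would be used instead of polling a single port, but the principle, watching transmit-wait growth per interval, is the same.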

Published

2022-12-13

How to Cite

Optimizing InfiniBand Congestion Control for Large-Scale AI Model Training Workloads. (2022). International Journal of Engineering & Extended Technologies Research (IJEETR), 4(6), 5749-5757. https://doi.org/10.15662/IJEETR.2022.0406015