Network Working Group                                            H. Wang
Internet-Draft                                                    Huawei
Intended status: Standards Track                                  K. Yao
Expires: 2 September 2024                                   China Mobile
                                                                  W. Pan
                                                                H. Huang
                                                                  Huawei
                                                            1 March 2024


      Application-aware Data Center Network (APDN) Use Cases and
                              Requirements
           draft-wh-rtgwg-application-aware-dc-network-02

Abstract

   The deployment of large-scale AI services within data centers
   introduces significant challenges to established technologies,
   including load balancing and congestion control.  Additionally, the
   adoption of cutting-edge network technologies, such as in-network
   computing, is on the rise within AI-centric data centers.  These
   advanced network-assisted application acceleration technologies
   necessitate the flexible exchange of cross-layer interaction
   information between end-hosts and network nodes.

   The Application-aware Data Center Network (APDN) leverages the
   application-side extension of the Application-aware Networking (APN)
   framework to furnish the data center network with detailed
   application-aware information.  This approach facilitates the rapid
   advancement of network-application co-design technologies.  This
   document delves into the use cases of APDNs and outlines the
   associated requirements, setting the stage for enhanced performance
   and efficiency in data center operations tailored to the demands of
   AI services.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 2 September 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Requirements Language
   2.  Use Cases and Requirements for Application-aware Data Center
       Network
     2.1.  Fine-grained packet scheduling for load balancing
     2.2.  Enhancing Distributed Machine Learning Training with
           In-Network Computing
     2.3.  Enhanced Congestion Control with Precise Feedback
           Mechanisms
   3.  Encapsulation
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Acknowledgements
   Contributors
   Authors' Addresses
1.  Introduction

   The advent of large AI models like AlphaGo and ChatGPT has
   positioned distributed training for AI large models as a pivotal
   operation within large-scale data centers.  To enhance the
   efficiency of training these substantial models, a significant
   number of computing units, such as thousands of GPUs operating in
   tandem, are deployed for parallel processing, aiming to minimize the
   job completion time (JCT).  This setup necessitates frequent and
   bandwidth-heavy communications among concurrent computing nodes,
   introducing a novel multi-party communication mode that demands
   heightened throughput performance, load balancing proficiency, and
   congestion management capabilities from the data center network.

   Traditionally, data center technology primarily views the network as
   a mere conduit for data transmission for upper-layer applications,
   offering basic connectivity services.  Yet, the scenario of large AI
   model training is increasingly incorporating network-assisted
   technologies, such as offloading parts of the computation to the
   network.  This approach seeks to boost AI job efficiency through the
   joint optimization of network communication and computing
   applications.  In many current instances of network assistance,
   operators tailor and implement proprietary protocols on a limited
   scale, leading to a lack of widespread interoperability.  However,
   as AI data centers grow and diversify in offering cloud services for
   various AI tasks, emerging data center network technologies must
   account for serving different transports and applications.  Building
   large-scale data centers now involves not just ensuring device
   interoperability but also facilitating interaction between network
   devices and end-host services.

   This document illustrates use cases that require the exchange of
   application-aware information between network nodes and
   applications.  Current ways of conveying such information are
   limited by the extensibility of packet headers: only coarse-grained
   information can be transmitted between the network and the host
   through a limited header space (for example, the one-bit ECN mark
   [RFC3168] or the DSCP field in the IP layer).

   The Application-aware Networking (APN) framework
   [I-D.li-apn-framework] delineates how application-aware information,
   including an APN identification (ID) and/or parameters (e.g.,
   network performance requirements), is encapsulated by network edge
   devices.  This information is then carried in packets across an APN
   domain to support service provisioning, enable fine-grained traffic
   steering, and adjust network resources.  An extension of the APN
   framework caters to the application side
   [I-D.li-rtgwg-apn-app-side-framework], allowing APN domain resources
   to be allocated to applications that encapsulate the APN attribute
   in packets.

   This document delves into the application side of the APN framework
   to foster enriched interaction between hosts and networks within the
   data center, outlining several use cases and the corresponding
   requirements for the Application-aware Data Center Network (APDN).
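   As a purely illustrative sketch, the Python fragment below models
   the information an APN attribute conveys on the application side: an
   APN ID plus a variable-length parameters field.  The field layout
   and sizes are assumptions of this sketch, not the encoding of the
   APN header, which is specified in [I-D.draft-li-apn-header] and
   discussed in Section 3.

      # Illustrative model of the APN attribute carried in packets
      # across an APN domain.  Field names and sizes are assumptions
      # of this sketch, not the [I-D.draft-li-apn-header] encoding.
      import struct

      def build_apn_attribute(apn_id: int, parameters: bytes) -> bytes:
          # Pack a hypothetical attribute: 32-bit APN ID, 16-bit
          # parameters length, then the opaque parameters themselves
          # (e.g., the per-use-case data described in Section 2).
          return struct.pack("!IH", apn_id, len(parameters)) + parameters

      # An edge device could tag a packet with APN ID 7 and an opaque
      # two-byte parameter block before it enters the APN domain.
      attribute = build_apn_attribute(7, b"\x01\x02")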
1.1.  Terminology

   APDN:  APplication-aware Data Center Network

   SQN:  SeQuence Number

   ToR:  Top Of Rack switch

   PFC:  Priority-based Flow Control

   NIC:  Network Interface Card

   ECMP:  Equal-Cost Multi-Path routing

   AI:  Artificial Intelligence

   JCT:  Job Completion Time

   PS:  Parameter Server

   INC:  In-Network Computing

   APN:  APplication-aware Networking

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Use Cases and Requirements for Application-aware Data Center
    Network

2.1.  Fine-grained packet scheduling for load balancing

   Traditional data centers utilize the per-flow Equal-Cost Multi-Path
   (ECMP) method to distribute traffic evenly across several paths.
   These centers, primarily focused on cloud computing, handle a vast
   number of data flows.  Despite the large quantity, these flows are
   predominantly small and short-lived, allowing the ECMP method to
   achieve a nearly uniform traffic distribution across multiple
   pathways.

   By contrast, the communication dynamics shift markedly during the
   training of large AI models.  This process demands unprecedented
   bandwidth levels: a single data flow between machines can max out
   the upstream bandwidth of a server's egress Network Interface Card
   (NIC), with single-flow throughput approaching or exceeding
   100 Gb/s.  Applying traditional per-flow ECMP strategies, such as
   hash-based or round-robin algorithms, often results in the
   concurrent allocation of large ("elephant") flows to a single
   pathway.  This can lead to severe congestion, notably when two
   simultaneous 100 Gb/s flows vie for the same 100 Gb/s link,
   significantly impacting the completion time of AI jobs.

   To mitigate these issues, there is a pivotal shift towards
   implementing a fine-grained, per-packet ECMP strategy.  This
   approach distributes the packets of a single flow across multiple
   paths, improving balance and preventing congestion.  However, due to
   the varying delays (propagation and switching) across these paths,
   such a strategy may result in significant packet disorder upon
   arrival at the destination, thereby degrading the performance of
   both the transport and application layers.  The sketch below
   contrasts the two strategies.
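   The following Python fragment is a minimal sketch contrasting
   per-flow and per-packet ECMP path selection.  The four-uplink
   topology and the CRC32 hash are assumptions made for illustration
   only.

      # Per-flow vs. per-packet ECMP over four equal-cost uplinks.
      import zlib

      NUM_PATHS = 4

      def per_flow_ecmp(five_tuple: tuple) -> int:
          # Every packet of a flow hashes to the same uplink, so two
          # elephant flows may collide on one 100 Gb/s link.
          return zlib.crc32(repr(five_tuple).encode()) % NUM_PATHS

      def make_per_packet_ecmp():
          # Packets of the same flow are sprayed round-robin across
          # all uplinks, at the cost of reordering at the egress ToR.
          state = {"next": 0}
          def pick() -> int:
              path = state["next"]
              state["next"] = (path + 1) % NUM_PATHS
              return path
          return pick

      flow = ("10.0.0.1", "10.0.0.2", 17, 40000, 4791)
      path = per_flow_ecmp(flow)          # same uplink for every packet
      pick = make_per_packet_ecmp()
      paths = [pick() for _ in range(8)]  # 0, 1, 2, 3, 0, 1, 2, 3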
   A viable solution is to resequence out-of-order packets at the
   egress Top-of-Rack (ToR) switch when per-packet ECMP is employed.
   This assumes multipath transmission extends from the ingress to the
   egress ToR, with the reordering principle ensuring that the packet
   departure sequence from the last ToR mirrors the arrival sequence at
   the first ToR.

   Achieving packet reordering at the egress ToR necessitates a clear
   indication of packet arrival sequences at the ingress ToR.  Current
   protocols do not directly mark sequence numbers (SQNs) at the
   Ethernet and IP layers:

   *  Presently, SQNs are encapsulated within transport protocols
      (e.g., TCP, QUIC, RoCEv2) or application protocols.  Relying on
      these SQNs for packet reordering requires network devices to
      interpret a vast array of transport/application layer
      information.

   *  SQNs at the transport/application layer are allocated per flow,
      with each flow having a distinct sequence number space and
      initial value.  These cannot directly represent the packet
      arrival sequence at the initial ToR.  Although assigning a
      specific reordering queue to each flow at the egress ToR and
      reordering based on upper-layer SQNs is conceivable, the
      associated hardware resource demands are significant.

   *  Direct modification of upper-layer SQNs by network devices to
      reflect ToR-ToR pairwise SQNs compromises end-to-end transmission
      reliability.

   Consequently, a mechanism is needed to convey specific order
   information across the multipath forwarding domain, from the initial
   device to the final device with reordering capabilities.  The
   Application-aware Networking (APN) framework is proposed to
   transport this critical ordering information.  In this context, it
   records the sequence number of packets as they arrive at the ingress
   ToR (each ToR-ToR pair having a unique, incremental SQN),
   facilitating packet reordering by the egress ToR based on this data
   (a behavioral sketch follows the requirements list below).

   Requirements:

   *  [REQ1-1] The APN framework SHOULD tag each packet with an SQN
      alongside the APN ID to enable reordering.  The ingress ToR
      SHOULD assign and log an SQN for each packet based on its arrival
      sequence, with SQN granularity adaptable to ToR-ToR, port-port,
      or queue-queue levels.

   *  [REQ1-2] The APN-encapsulated SQN MUST remain unaltered within
      the multipathing domain and MAY be removed at the egress device.

   *  [REQ1-3] The APN framework SHOULD convey the necessary queue
      information (i.e., the sorting queue ID) to support fine-grained
      reordering.  The queue ID SHOULD match the granularity of SQN
      assignments.  Additionally, the APN framework MAY transport path
      details to expedite the differentiation between out-of-order
      packets and packet loss.
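   The fragment below is a minimal sketch of egress-ToR resequencing
   driven by an APN-carried SQN at ToR-ToR granularity, assuming the
   behavior described by [REQ1-1] and [REQ1-2].  Buffer sizing,
   timeouts, and the loss-versus-disorder distinction of [REQ1-3] are
   omitted.

      # Egress-ToR resequencer: deliver packets in the order they
      # arrived at the ingress ToR, holding any that arrive early.
      import heapq

      class Resequencer:
          def __init__(self):
              self.next_sqn = {}  # ingress ToR ID -> next expected SQN
              self.pending = {}   # ingress ToR ID -> heap of (SQN, pkt)

          def receive(self, tor_id, sqn, packet):
              self.next_sqn.setdefault(tor_id, 0)
              heap = self.pending.setdefault(tor_id, [])
              heapq.heappush(heap, (sqn, packet))
              delivered = []
              # Release buffered packets while the heap head matches
              # the next expected SQN for this ToR-ToR pair.
              while heap and heap[0][0] == self.next_sqn[tor_id]:
                  delivered.append(heapq.heappop(heap)[1])
                  self.next_sqn[tor_id] += 1
              return delivered

      # Packet 1 is held until packet 0 of the same ToR pair arrives.
      r = Resequencer()
      assert r.receive("tor-a", 1, "p1") == []
      assert r.receive("tor-a", 0, "p0") == ["p0", "p1"]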
2.2.  Enhancing Distributed Machine Learning Training with In-Network
      Computing

   Distributed machine learning training frequently employs the
   AllReduce communication mode [mpi-doc] for efficient cross-
   accelerator data transfer.  This method is pivotal in scenarios
   involving data and model parallelism, where parallel execution
   across multiple processors necessitates the exchange of intermediate
   results, such as gradient data, as a core component of the
   communication process.  The Parameter Server (PS) architecture
   [atp], which centralizes gradient data aggregation through a server
   from multiple clients and redistributes the aggregated results,
   often faces incast congestion challenges due to simultaneous large-
   volume data transmissions to the server.

   In-network computing (INC) introduces a paradigm shift by delegating
   the server's processing tasks to network switches.  Utilizing
   network devices equipped with high-capacity switching and
   computational abilities (for basic arithmetic operations) as
   surrogate parameter servers for gradient aggregation enables the
   consolidation of multiple data streams into a single network stream.
   This approach not only alleviates server-side incast congestion but
   also leverages the superior speed of on-switch computing (e.g.,
   ASICs) over traditional server-based processing (e.g., CPUs),
   offering a boon to distributed computing applications.  A behavioral
   sketch of such aggregation is given after the requirements list
   below.

   As outlined in [I-D.draft-lou-rtgwg-sinc], the realization of INC
   requires network devices to comprehend the computing tasks dictated
   by applications, including the accurate parsing of relevant data
   units and the coordination of synchronization signals across diverse
   data sources.  Present implementations like ATP [atp] and NetReduce
   [netreduce] necessitate that switches interpret upper-layer
   protocols and application-specific logic, which remains tailored to
   particular applications due to the absence of standardized transport
   or application protocols for INC.  To accommodate a broad spectrum
   of INC applications, network devices must exhibit versatility across
   various protocol formats.

   Moreover, while end users may encrypt payloads for security, they
   might be inclined to expose certain non-sensitive data to benefit
   from accelerated INC operations.  However, the current protocol
   landscape does not facilitate easy access to the necessary INC data
   without decrypting the entire payload, posing interoperability
   challenges between applications and INC functionalities.

   The Application-aware Networking (APN) framework emerges as a
   solution, capable of conveying the essential information for INC
   tasks and their associated data segments, thereby enabling the
   offloading of specific computational tasks to the network.

   Requirements:

   *  [REQ2-1] The APN framework MUST include identifiers to
      differentiate among INC tasks.

   *  [REQ2-2] The APN framework MUST accommodate the transport of
      application data in varied formats and lengths, such as gradient
      data for INC, along with the specified operations.

   *  [REQ2-3] To augment INC efficiency, the APN framework SHOULD
      transmit additional application-aware information to support
      computational processes without undermining end-to-end transport
      reliability.

   *  [REQ2-4] The APN framework MUST have the capability to convey
      comprehensive INC outcomes and document the computational status
      within data packets.
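   As a behavioral illustration of [REQ2-1] and [REQ2-2], the sketch
   below shows a switch-side aggregator that sums same-indexed gradient
   chunks from a known number of workers and releases one aggregated
   result per chunk.  The task/chunk keying and the fixed worker count
   are assumptions of this sketch; synchronization, retransmission, and
   numeric precision are out of scope.

      # Switch-side gradient aggregation for one INC task: N worker
      # streams are consolidated into a single aggregated stream.
      class IncAggregator:
          def __init__(self, num_workers: int):
              self.num_workers = num_workers
              self.partial = {}  # (task_id, chunk_id) -> (count, sums)

          def on_packet(self, task_id, chunk_id, gradients):
              # Fold one worker's chunk into the running sums; emit
              # the aggregate once every worker has contributed.
              key = (task_id, chunk_id)
              count, sums = self.partial.get(
                  key, (0, [0.0] * len(gradients)))
              sums = [s + g for s, g in zip(sums, gradients)]
              count += 1
              if count == self.num_workers:
                  self.partial.pop(key, None)
                  return sums          # aggregated result to forward
              self.partial[key] = (count, sums)
              return None              # packet consumed by the switch

      agg = IncAggregator(num_workers=2)
      assert agg.on_packet("job1", 0, [1.0, 2.0]) is None
      assert agg.on_packet("job1", 0, [3.0, 4.0]) == [4.0, 6.0]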
2.3.  Enhanced Congestion Control with Precise Feedback Mechanisms

   Data center environments encompass various congestion scenarios,
   notably:

   *  The prevalent use of multi-accelerator collaborative AI model
      training, employing AllReduce and All2All communication patterns
      (Section 2.2), often leads to server-side incast congestion as
      multiple clients simultaneously transmit substantial volumes of
      gradient data.

   *  Diverse load balancing methodologies across different flows can
      induce overload conditions on specific links.

   *  The inherent randomness of service access within data centers
      frequently triggers traffic bursts, extending queue lengths and
      precipitating congestion.

   To mitigate these challenges, the industry has developed an array of
   congestion control algorithms tailored for data center networks.
   ECN-based congestion control mechanisms, such as DCTCP [RFC8257] and
   DCQCN [dcqcn], leverage ECN marks based on switch buffer occupancy
   levels to signal congestion.  However, these approaches are
   constrained by the use of a single 1-bit mark within packet headers
   to denote congestion, limiting the scope of conveyed congestion
   details due to header space restrictions.  Alternative strategies,
   such as HPCC++ [I-D.draft-miao-ccwg-hpcc], adopt in-band telemetry
   to cumulatively append congestion data at each hop, increasing
   packet length and bandwidth consumption.

   A compromise solution, AECN [I-D.draft-shi-ippm-advanced-ecn],
   endeavors to encapsulate critical congestion indicators along the
   path, including queue delay and congested hop counts, while
   minimizing data overhead through hop-by-hop aggregation.  This model
   allows end-hosts to specify the congestion metrics of interest, with
   network devices incrementally compiling this data en route.  The APN
   framework can facilitate this nuanced exchange, enabling tailored
   congestion data accumulation (illustrated in the sketch following
   the requirements list below).

   Requirements:

   *  [REQ3-1] The APN framework MUST empower data senders to specify
      the congestion metrics they wish to gather.

   *  [REQ3-2] The APN framework MUST enable network nodes to log and
      update the selected measurements accordingly.  These may
      encompass metrics such as port queue lengths, link monitoring
      rates, PFC frame counts, and probed RTTs and their variability,
      among others.  Additionally, the APN framework MAY tag each
      measurement with its collector, assisting in the identification
      of potential congestion points.
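   The sketch below illustrates the hop-by-hop aggregation model of
   [REQ3-1] and [REQ3-2]: the sender names the metrics it wants, and
   each node folds its local measurement into a fixed-size aggregate
   rather than appending a per-hop record.  The metric names and the
   congestion threshold are assumptions of this sketch, not the AECN
   specification.

      # Sender-selected congestion telemetry, aggregated in place at
      # each hop to bound the per-packet overhead.
      def init_request(metrics):
          # The data sender lists the metrics to collect (REQ3-1).
          return {"requested": set(metrics), "queue_delay_us": 0,
                  "congested_hops": 0, "max_delay_us": 0,
                  "worst_hop": None}

      def update_at_hop(t, node_id, queue_delay_us, threshold_us=50):
          # Each node logs and updates the selected measurements
          # (REQ3-2) instead of appending per-hop records.
          if "queue_delay_us" in t["requested"]:
              t["queue_delay_us"] += queue_delay_us
          if queue_delay_us > threshold_us:
              t["congested_hops"] += 1
              if queue_delay_us > t["max_delay_us"]:
                  # Tag the measurement with its collector so the
                  # receiver can locate the congestion point.
                  t["max_delay_us"] = queue_delay_us
                  t["worst_hop"] = node_id
          return t

      t = init_request(["queue_delay_us"])
      for hop, delay in [("s1", 10), ("s2", 80), ("s3", 20)]:
          t = update_at_hop(t, hop, delay)
      assert t["congested_hops"] == 1 and t["worst_hop"] == "s2"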
3.  Encapsulation

   The encapsulation of the application-aware information required by
   the APDN use cases in the APN header [I-D.draft-li-apn-header] will
   be defined in a future version of this document.

4.  Security Considerations

   TBD.

5.  IANA Considerations

   This document has no IANA actions.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

6.2.  Informative References

   [mpi-doc]  "Message-Passing Interface Standard", August 2023.

   [dcqcn]    "Congestion Control for Large-Scale RDMA Deployments",
              n.d.

   [netreduce]
              "NetReduce: RDMA-Compatible In-Network Reduction for
              Distributed DNN Training Acceleration", n.d.

   [atp]      "ATP: In-network Aggregation for Multi-tenant Learning",
              n.d.

   [I-D.li-apn-framework]
              Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C.,
              and G. S. Mishra, "Application-aware Networking (APN)
              Framework", Work in Progress, Internet-Draft,
              draft-li-apn-framework-07, 3 April 2023,
              <https://datatracker.ietf.org/doc/html/draft-li-apn-framework-07>.

   [I-D.li-rtgwg-apn-app-side-framework]
              Li, Z. and S. Peng, "Extension of Application-aware
              Networking (APN) Framework for Application Side", Work in
              Progress, Internet-Draft,
              draft-li-rtgwg-apn-app-side-framework-00, 22 October
              2023,
              <https://datatracker.ietf.org/doc/html/draft-li-rtgwg-apn-app-side-framework-00>.

   [I-D.draft-lou-rtgwg-sinc]
              Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
              "Signaling In-Network Computing operations (SINC)", Work
              in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, 15
              September 2023,
              <https://datatracker.ietf.org/doc/html/draft-lou-rtgwg-sinc-01>.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

   [I-D.draft-miao-ccwg-hpcc]
              Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
              Tantsura, J., Alemania, A., and Y. Shpigelman, "HPCC++:
              Enhanced High Precision Congestion Control", Work in
              Progress, Internet-Draft, draft-miao-ccwg-hpcc-02, 29
              February 2024,
              <https://datatracker.ietf.org/doc/html/draft-miao-ccwg-hpcc-02>.

   [I-D.draft-shi-ippm-advanced-ecn]
              Shi, H., Zhou, T., and Z. Li, "Advanced Explicit
              Congestion Notification", Work in Progress,
              Internet-Draft, draft-shi-ippm-advanced-ecn-00, 11
              December 2023,
              <https://datatracker.ietf.org/doc/html/draft-shi-ippm-advanced-ecn-00>.

   [I-D.draft-li-apn-header]
              Li, Z., Peng, S., and S. Zhang, "Application-aware
              Networking (APN) Header", Work in Progress,
              Internet-Draft, draft-li-apn-header-04, 12 April 2023,
              <https://datatracker.ietf.org/doc/html/draft-li-apn-header-04>.

Acknowledgements

Contributors

Authors' Addresses

   Haibo Wang
   Huawei
   Email: rainsword.wang@huawei.com

   Kehan Yao
   China Mobile
   Email: yaokehan@chinamobile.com

   Wei Pan
   Huawei
   Email: tarzan.pan@huawei.com

   Hongyi Huang
   Huawei
   Email: hongyi.huang@huawei.com