Network Working Group                                            H. Wang
Internet-Draft                                                    Huawei
Intended status: Standards Track                                  K. Yao
Expires: 2 September 2024                                   China Mobile
                                                                  W. Pan
                                                                H. Huang
                                                                  Huawei
                                                            1 March 2024


      Application-aware Data Center Network (APDN) Use Cases and
                              Requirements
           draft-wh-rtgwg-application-aware-dc-network-02

Abstract

   The deployment of large-scale AI services within data centers
   introduces significant challenges to established technologies,
   including load balancing and congestion control.  Additionally, the
   adoption of cutting-edge network technologies, such as in-network
   computing, is on the rise within AI-centric data centers.  These
   advanced network-assisted application acceleration technologies
   necessitate the flexible exchange of cross-layer interaction
   information between end-hosts and network nodes.

   The Application-aware Data Center Network (APDN) leverages the
   application-side extension of the Application-aware Networking (APN)
   framework to furnish the data center network with detailed
   application-aware information.  This approach facilitates the rapid
   advancement of network-application co-design technologies.  This
   document delves into the use cases of APDNs and outlines the
   associated requirements, setting the stage for enhanced performance
   and efficiency in data center operations tailored to the demands of
   AI services.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 2 September 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Requirements Language
   2.  Use Cases and Requirements for Application-aware Data Center
       Network
     2.1.  Fine-grained packet scheduling for load balancing
     2.2.  Enhancing Distributed Machine Learning Training with
           In-Network Computing
     2.3.  Enhanced Congestion Control with Precise Feedback
           Mechanisms
   3.  Encapsulation
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Acknowledgements
   Contributors
   Authors' Addresses
1.  Introduction

   The advent of large AI models like AlphaGo and ChatGPT has
   positioned distributed training for AI large models as a pivotal
   operation within large-scale data centers.  To enhance the
   efficiency of training these substantial models, a significant
   number of computing units, such as thousands of GPUs operating in
   tandem, are deployed for parallel processing, aiming to minimize the
   job completion time (JCT).  This setup necessitates frequent and
   bandwidth-heavy communications among concurrent computing nodes,
   introducing a novel multi-party communication mode that demands
   heightened throughput performance, load balancing proficiency, and
   congestion management capabilities from the data center network.

   Traditionally, data center technology primarily views the network as
   a mere conduit for data transmission for upper-layer applications,
   offering basic connectivity services.  Yet, the scenario of large AI
   model training is increasingly incorporating network-assisted
   technologies, such as offloading parts of the computation to the
   network.  This approach seeks to boost AI job efficiency through the
   joint optimization of network communication and computing
   applications.  In many current instances of network assistance,
   operators tailor and implement proprietary protocols on a limited
   scale, leading to a lack of widespread interoperability.  However,
   as AI data centers grow and diversify in offering cloud services for
   various AI tasks, emerging data center network technologies must
   account for serving different transports and applications.  Building
   large-scale data centers now involves not just ensuring device
   interoperability but also facilitating interaction between network
   devices and end-host services.

   This document illustrates use cases that require the exchange of
   application-aware information between network nodes and
   applications.  Current ways of conveying such information are
   limited by the extensibility of packet headers: only coarse-grained
   information can be transmitted between the network and the host
   through a limited header space (for example, the one-bit ECN mark
   [RFC3168] or the DSCP field in the IP layer).

   The Application-aware Networking (APN) framework
   [I-D.li-apn-framework] delineates how application-aware information,
   including an APN identification (ID) and/or parameters (e.g.,
   network performance requirements), is encapsulated by network edge
   devices.  This information is then carried in packets across an APN
   domain to support service provisioning, enable fine-grained traffic
   steering, and adjust network resources.  An extension of the APN
   framework caters to the application side
   [I-D.li-rtgwg-apn-app-side-framework], allowing APN domain resources
   to be allocated to applications that encapsulate the APN attribute
   in packets.

   This document delves into the application side of the APN framework
   to foster enriched interaction between hosts and networks within the
   data center, outlining several use cases and the corresponding
   requirements for the Application-aware Data Center Network (APDN).
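   As a purely illustrative sketch, the Python fragment below models
   the information an APN attribute conveys on the application side: an
   APN ID plus a variable-length parameters field.  The field layout
   and sizes are assumptions of this sketch, not the encoding of the
   APN header, which is specified in [I-D.draft-li-apn-header] and
   discussed in Section 3.

      # Illustrative model of the APN attribute carried in packets
      # across an APN domain.  Field names and sizes are assumptions
      # of this sketch, not the [I-D.draft-li-apn-header] encoding.
      import struct

      def build_apn_attribute(apn_id: int, parameters: bytes) -> bytes:
          # Pack a hypothetical attribute: 32-bit APN ID, 16-bit
          # parameters length, then the opaque parameters themselves
          # (e.g., the per-use-case data described in Section 2).
          return struct.pack("!IH", apn_id, len(parameters)) + parameters

      # An edge device could tag a packet with APN ID 7 and an opaque
      # two-byte parameter block before it enters the APN domain.
      attribute = build_apn_attribute(7, b"\x01\x02")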
1.1.  Terminology

   APDN:  APplication-aware Data Center Network

   SQN:  SeQuence Number

   ToR:  Top Of Rack switch

   PFC:  Priority-based Flow Control

   NIC:  Network Interface Card

   ECMP:  Equal-Cost Multi-Path routing

   AI:  Artificial Intelligence

   JCT:  Job Completion Time

   PS:  Parameter Server

   INC:  In-Network Computing

   APN:  APplication-aware Networking

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Use Cases and Requirements for Application-aware Data Center
    Network

2.1.  Fine-grained packet scheduling for load balancing

   Traditional data centers utilize the per-flow Equal-Cost Multi-Path
   (ECMP) method to distribute traffic evenly across several paths.
   These centers, primarily focused on cloud computing, handle a vast
   number of data flows.  Despite the large quantity, these flows are
   predominantly small and short-lived, allowing the ECMP method to
   achieve a nearly uniform traffic distribution across multiple
   pathways.

   By contrast, the communication dynamics shift markedly during the
   training of large AI models.  This process demands unprecedented
   bandwidth levels: a single data flow between machines can max out
   the upstream bandwidth of a server's egress Network Interface Card
   (NIC), with single-flow throughput approaching or exceeding
   100 Gb/s.  Applying traditional per-flow ECMP strategies, such as
   hash-based or round-robin algorithms, often results in the
   concurrent allocation of large ("elephant") flows to a single
   pathway.  This can lead to severe congestion, notably when two
   simultaneous 100 Gb/s flows vie for the same 100 Gb/s link,
   significantly impacting the completion time of AI jobs.

   To mitigate these issues, there is a pivotal shift towards
   implementing a fine-grained, per-packet ECMP strategy.  This
   approach distributes the packets of a single flow across multiple
   paths, improving balance and preventing congestion.  However, due to
   the varying delays (propagation and switching) across these paths,
   such a strategy may result in significant packet disorder upon
   arrival at the destination, thereby degrading the performance of
   both the transport and application layers.  The sketch below
   contrasts the two strategies.
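   The following Python fragment is a minimal sketch contrasting
   per-flow and per-packet ECMP path selection.  The four-uplink
   topology and the CRC32 hash are assumptions made for illustration
   only.

      # Per-flow vs. per-packet ECMP over four equal-cost uplinks.
      import zlib

      NUM_PATHS = 4

      def per_flow_ecmp(five_tuple: tuple) -> int:
          # Every packet of a flow hashes to the same uplink, so two
          # elephant flows may collide on one 100 Gb/s link.
          return zlib.crc32(repr(five_tuple).encode()) % NUM_PATHS

      def make_per_packet_ecmp():
          # Packets of the same flow are sprayed round-robin across
          # all uplinks, at the cost of reordering at the egress ToR.
          state = {"next": 0}
          def pick() -> int:
              path = state["next"]
              state["next"] = (path + 1) % NUM_PATHS
              return path
          return pick

      flow = ("10.0.0.1", "10.0.0.2", 17, 40000, 4791)
      path = per_flow_ecmp(flow)          # same uplink for every packet
      pick = make_per_packet_ecmp()
      paths = [pick() for _ in range(8)]  # 0, 1, 2, 3, 0, 1, 2, 3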
   A viable solution is to resequence out-of-order packets at the
   egress Top-of-Rack (ToR) switch when per-packet ECMP is employed.
   This assumes multipath transmission extends from the ingress to the
   egress ToR, with the reordering principle ensuring that the packet
   departure sequence from the last ToR mirrors the arrival sequence at
   the first ToR.

   Achieving packet reordering at the egress ToR necessitates a clear
   indication of packet arrival sequences at the ingress ToR.  Current
   protocols do not directly mark sequence numbers (SQNs) at the
   Ethernet and IP layers:

   *  Presently, SQNs are encapsulated within transport protocols
      (e.g., TCP, QUIC, RoCEv2) or application protocols.  Relying on
      these SQNs for packet reordering requires network devices to
      interpret a vast array of transport/application layer
      information.

   *  SQNs at the transport/application layer are allocated per flow,
      with each flow having a distinct sequence number space and
      initial value.  These cannot directly represent the packet
      arrival sequence at the initial ToR.  Although assigning a
      specific reordering queue to each flow at the egress ToR and
      reordering based on upper-layer SQNs is conceivable, the
      associated hardware resource demands are significant.

   *  Direct modification of upper-layer SQNs by network devices to
      reflect ToR-ToR pairwise SQNs compromises end-to-end transmission
      reliability.

   Consequently, a mechanism is needed to convey specific order
   information across the multipath forwarding domain, from the initial
   device to the final device with reordering capabilities.  The
   Application-aware Networking (APN) framework is proposed to
   transport this critical ordering information.  In this context, it
   records the sequence number of packets as they arrive at the ingress
   ToR (each ToR-ToR pair having a unique, incremental SQN),
   facilitating packet reordering by the egress ToR based on this data
   (a behavioral sketch follows the requirements list below).

   Requirements:

   *  [REQ1-1] The APN framework SHOULD tag each packet with an SQN
      alongside the APN ID to enable reordering.  The ingress ToR
      SHOULD assign and log an SQN for each packet based on its arrival
      sequence, with SQN granularity adaptable to ToR-ToR, port-port,
      or queue-queue levels.

   *  [REQ1-2] The APN-encapsulated SQN MUST remain unaltered within
      the multipathing domain and MAY be removed at the egress device.

   *  [REQ1-3] The APN framework SHOULD convey the necessary queue
      information (i.e., the sorting queue ID) to support fine-grained
      reordering.  The queue ID SHOULD match the granularity of SQN
      assignments.  Additionally, the APN framework MAY transport path
      details to expedite the differentiation between out-of-order
      packets and packet loss.
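   The fragment below is a minimal sketch of egress-ToR resequencing
   driven by an APN-carried SQN at ToR-ToR granularity, assuming the
   behavior described by [REQ1-1] and [REQ1-2].  Buffer sizing,
   timeouts, and the loss-versus-disorder distinction of [REQ1-3] are
   omitted.

      # Egress-ToR resequencer: deliver packets in the order they
      # arrived at the ingress ToR, holding any that arrive early.
      import heapq

      class Resequencer:
          def __init__(self):
              self.next_sqn = {}  # ingress ToR ID -> next expected SQN
              self.pending = {}   # ingress ToR ID -> heap of (SQN, pkt)

          def receive(self, tor_id, sqn, packet):
              self.next_sqn.setdefault(tor_id, 0)
              heap = self.pending.setdefault(tor_id, [])
              heapq.heappush(heap, (sqn, packet))
              delivered = []
              # Release buffered packets while the heap head matches
              # the next expected SQN for this ToR-ToR pair.
              while heap and heap[0][0] == self.next_sqn[tor_id]:
                  delivered.append(heapq.heappop(heap)[1])
                  self.next_sqn[tor_id] += 1
              return delivered

      # Packet 1 is held until packet 0 of the same ToR pair arrives.
      r = Resequencer()
      assert r.receive("tor-a", 1, "p1") == []
      assert r.receive("tor-a", 0, "p0") == ["p0", "p1"]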
2.2.  Enhancing Distributed Machine Learning Training with In-Network
      Computing

   Distributed machine learning training frequently employs the
   AllReduce communication mode [mpi-doc] for efficient cross-
   accelerator data transfer.  This method is pivotal in scenarios
   involving data and model parallelism, where parallel execution
   across multiple processors necessitates the exchange of intermediate
   results, such as gradient data, as a core component of the
   communication process.  The Parameter Server (PS) architecture
   [atp], which centralizes gradient data aggregation through a server
   from multiple clients and redistributes the aggregated results,
   often faces incast congestion challenges due to simultaneous large-
   volume data transmissions to the server.

   In-network computing (INC) introduces a paradigm shift by delegating
   the server's processing tasks to network switches.  Utilizing
   network devices equipped with high-capacity switching and
   computational abilities (for basic arithmetic operations) as
   surrogate parameter servers for gradient aggregation enables the
   consolidation of multiple data streams into a single network stream.
   This approach not only alleviates server-side incast congestion but
   also leverages the superior speed of on-switch computing (e.g.,
   ASICs) over traditional server-based processing (e.g., CPUs),
   offering a boon to distributed computing applications.  A behavioral
   sketch of such aggregation is given after the requirements list
   below.

   As outlined in [I-D.draft-lou-rtgwg-sinc], the realization of INC
   requires network devices to comprehend the computing tasks dictated
   by applications, including the accurate parsing of relevant data
   units and the coordination of synchronization signals across diverse
   data sources.  Present implementations like ATP [atp] and NetReduce
   [netreduce] necessitate that switches interpret upper-layer
   protocols and application-specific logic, which remains tailored to
   particular applications due to the absence of standardized transport
   or application protocols for INC.  To accommodate a broad spectrum
   of INC applications, network devices must exhibit versatility across
   various protocol formats.

   Moreover, while end users may encrypt payloads for security, they
   might be inclined to expose certain non-sensitive data to benefit
   from accelerated INC operations.  However, the current protocol
   landscape does not facilitate easy access to the necessary INC data
   without decrypting the entire payload, posing interoperability
   challenges between applications and INC functionalities.

   The Application-aware Networking (APN) framework emerges as a
   solution, capable of conveying the essential information for INC
   tasks and their associated data segments, thereby enabling the
   offloading of specific computational tasks to the network.

   Requirements:

   *  [REQ2-1] The APN framework MUST include identifiers to
      differentiate among INC tasks.

   *  [REQ2-2] The APN framework MUST accommodate the transport of
      application data in varied formats and lengths, such as gradient
      data for INC, along with the specified operations.

   *  [REQ2-3] To augment INC efficiency, the APN framework SHOULD
      transmit additional application-aware information to support
      computational processes without undermining end-to-end transport
      reliability.

   *  [REQ2-4] The APN framework MUST have the capability to convey
      comprehensive INC outcomes and document the computational status
      within data packets.
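   As a behavioral illustration of [REQ2-1] and [REQ2-2], the sketch
   below shows a switch-side aggregator that sums same-indexed gradient
   chunks from a known number of workers and releases one aggregated
   result per chunk.  The task/chunk keying and the fixed worker count
   are assumptions of this sketch; synchronization, retransmission, and
   numeric precision are out of scope.

      # Switch-side gradient aggregation for one INC task: N worker
      # streams are consolidated into a single aggregated stream.
      class IncAggregator:
          def __init__(self, num_workers: int):
              self.num_workers = num_workers
              self.partial = {}  # (task_id, chunk_id) -> (count, sums)

          def on_packet(self, task_id, chunk_id, gradients):
              # Fold one worker's chunk into the running sums; emit
              # the aggregate once every worker has contributed.
              key = (task_id, chunk_id)
              count, sums = self.partial.get(
                  key, (0, [0.0] * len(gradients)))
              sums = [s + g for s, g in zip(sums, gradients)]
              count += 1
              if count == self.num_workers:
                  self.partial.pop(key, None)
                  return sums          # aggregated result to forward
              self.partial[key] = (count, sums)
              return None              # packet consumed by the switch

      agg = IncAggregator(num_workers=2)
      assert agg.on_packet("job1", 0, [1.0, 2.0]) is None
      assert agg.on_packet("job1", 0, [3.0, 4.0]) == [4.0, 6.0]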
2.3.  Enhanced Congestion Control with Precise Feedback Mechanisms

   Data center environments encompass various congestion scenarios,
   notably:

   *  The prevalent use of multi-accelerator collaborative AI model
      training, employing AllReduce and All2All communication patterns
      (Section 2.2), often leads to server-side incast congestion as
      multiple clients simultaneously transmit substantial volumes of
      gradient data.

   *  Diverse load balancing methodologies across different flows can
      induce overload conditions on specific links.

   *  The inherent randomness of service access within data centers
      frequently triggers traffic bursts, extending queue lengths and
      precipitating congestion.

   To mitigate these challenges, the industry has developed an array of
   congestion control algorithms tailored for data center networks.
   ECN-based congestion control mechanisms, such as DCTCP [RFC8257] and
   DCQCN [dcqcn], leverage ECN marks based on switch buffer occupancy
   levels to signal congestion.  However, these approaches are
   constrained by the use of a single 1-bit mark within packet headers
   to denote congestion, limiting the scope of conveyed congestion
   details due to header space restrictions.  Alternative strategies,
   such as HPCC++ [I-D.draft-miao-ccwg-hpcc], adopt in-band telemetry
   to cumulatively append congestion data at each hop, increasing
   packet length and bandwidth consumption.

   A compromise solution, AECN [I-D.draft-shi-ippm-advanced-ecn],
   endeavors to encapsulate critical congestion indicators along the
   path, including queue delay and congested hop counts, while
   minimizing data overhead through hop-by-hop aggregation.  This model
   allows end-hosts to specify the congestion metrics of interest, with
   network devices incrementally compiling this data en route.  The APN
   framework can facilitate this nuanced exchange, enabling tailored
   congestion data accumulation (illustrated in the sketch following
   the requirements list below).

   Requirements:

   *  [REQ3-1] The APN framework MUST empower data senders to specify
      the congestion metrics they wish to gather.

   *  [REQ3-2] The APN framework MUST enable network nodes to log and
      update the selected measurements accordingly.  These may
      encompass metrics such as port queue lengths, link monitoring
      rates, PFC frame counts, and probed RTTs and their variability,
      among others.  Additionally, the APN framework MAY tag each
      measurement with its collector, assisting in the identification
      of potential congestion points.
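   The sketch below illustrates the hop-by-hop aggregation model of
   [REQ3-1] and [REQ3-2]: the sender names the metrics it wants, and
   each node folds its local measurement into a fixed-size aggregate
   rather than appending a per-hop record.  The metric names and the
   congestion threshold are assumptions of this sketch, not the AECN
   specification.

      # Sender-selected congestion telemetry, aggregated in place at
      # each hop to bound the per-packet overhead.
      def init_request(metrics):
          # The data sender lists the metrics to collect (REQ3-1).
          return {"requested": set(metrics), "queue_delay_us": 0,
                  "congested_hops": 0, "max_delay_us": 0,
                  "worst_hop": None}

      def update_at_hop(t, node_id, queue_delay_us, threshold_us=50):
          # Each node logs and updates the selected measurements
          # (REQ3-2) instead of appending per-hop records.
          if "queue_delay_us" in t["requested"]:
              t["queue_delay_us"] += queue_delay_us
          if queue_delay_us > threshold_us:
              t["congested_hops"] += 1
              if queue_delay_us > t["max_delay_us"]:
                  # Tag the measurement with its collector so the
                  # receiver can locate the congestion point.
                  t["max_delay_us"] = queue_delay_us
                  t["worst_hop"] = node_id
          return t

      t = init_request(["queue_delay_us"])
      for hop, delay in [("s1", 10), ("s2", 80), ("s3", 20)]:
          t = update_at_hop(t, hop, delay)
      assert t["congested_hops"] == 1 and t["worst_hop"] == "s2"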
3.  Encapsulation

   The encapsulation of the application-aware information required by
   the APDN use cases in the APN header [I-D.draft-li-apn-header] will
   be defined in a future version of this document.

4.  Security Considerations

   TBD.

5.  IANA Considerations

   This document has no IANA actions.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

6.2.  Informative References

   [mpi-doc]  "Message-Passing Interface Standard", August 2023.

   [dcqcn]    "Congestion Control for Large-Scale RDMA Deployments",
              n.d.

   [netreduce]
              "NetReduce: RDMA-Compatible In-Network Reduction for
              Distributed DNN Training Acceleration", n.d.

   [atp]      "ATP: In-network Aggregation for Multi-tenant Learning",
              n.d.

   [I-D.li-apn-framework]
              Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C.,
              and G. S. Mishra, "Application-aware Networking (APN)
              Framework", Work in Progress, Internet-Draft,
              draft-li-apn-framework-07, 3 April 2023,
              <https://datatracker.ietf.org/doc/html/draft-li-apn-framework-07>.

   [I-D.li-rtgwg-apn-app-side-framework]
              Li, Z. and S. Peng, "Extension of Application-aware
              Networking (APN) Framework for Application Side", Work in
              Progress, Internet-Draft,
              draft-li-rtgwg-apn-app-side-framework-00, 22 October
              2023,
              <https://datatracker.ietf.org/doc/html/draft-li-rtgwg-apn-app-side-framework-00>.

   [I-D.draft-lou-rtgwg-sinc]
              Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
              "Signaling In-Network Computing operations (SINC)", Work
              in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, 15
              September 2023,
              <https://datatracker.ietf.org/doc/html/draft-lou-rtgwg-sinc-01>.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

   [I-D.draft-miao-ccwg-hpcc]
              Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
              Tantsura, J., Alemania, A., and Y. Shpigelman, "HPCC++:
              Enhanced High Precision Congestion Control", Work in
              Progress, Internet-Draft, draft-miao-ccwg-hpcc-02, 29
              February 2024,
              <https://datatracker.ietf.org/doc/html/draft-miao-ccwg-hpcc-02>.

   [I-D.draft-shi-ippm-advanced-ecn]
              Shi, H., Zhou, T., and Z. Li, "Advanced Explicit
              Congestion Notification", Work in Progress,
              Internet-Draft, draft-shi-ippm-advanced-ecn-00, 11
              December 2023,
              <https://datatracker.ietf.org/doc/html/draft-shi-ippm-advanced-ecn-00>.

   [I-D.draft-li-apn-header]
              Li, Z., Peng, S., and S. Zhang, "Application-aware
              Networking (APN) Header", Work in Progress,
              Internet-Draft, draft-li-apn-header-04, 12 April 2023,
              <https://datatracker.ietf.org/doc/html/draft-li-apn-header-04>.

Acknowledgements

Contributors

Authors' Addresses

   Haibo Wang
   Huawei
   Email: rainsword.wang@huawei.com

   Kehan Yao
   China Mobile
   Email: yaokehan@chinamobile.com

   Wei Pan
   Huawei
   Email: tarzan.pan@huawei.com

   Hongyi Huang
   Huawei
   Email: hongyi.huang@huawei.com