From owner-tcp-impl@lerc.nasa.gov  Tue Feb  1 20:44:16 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA12012
	for <tcpimpl-archive@odin.ietf.org>; Tue, 1 Feb 2000 20:44:12 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id SAA18070
	for tcp-impl-outgoing; Tue, 1 Feb 2000 18:01:58 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id SAA18058
	for <tcp-impl@grc.nasa.gov>; Tue, 1 Feb 2000 18:01:57 -0500 (EST)
From: pfoy@hns.com
Received: by seraph3.lerc.nasa.gov; id SAA06590; Tue, 1 Feb 2000 18:01:55 -0500 (EST)
Received: from hns3.hns.com(208.236.67.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma006561; Tue, 1 Feb 00 18:01:53 -0500
Received: from hnssysb.md.hns.com (hnssysb.hns.com [139.85.52.101])
	by hns3.hns.com (Pro-8.9.3/Pro-8.9.3) with ESMTP id RAA21722
	for <tcp-impl@grc.nasa.gov>; Tue, 1 Feb 2000 17:59:50 -0500 (EST)
Received: from ngw2.hns.com (ngw2.hns.com [139.85.177.38])
	by hnssysb.md.hns.com (8.9.0/8.8.7) with SMTP id SAA20316
	for <tcp-impl@grc.nasa.gov>; Tue, 1 Feb 2000 18:01:51 -0500 (EST)
Received: by ngw2.hns.com(Lotus SMTP MTA v4.6.5  (863.2 5-20-1999))  id 85256878.007E8223 ; Tue, 1 Feb 2000 18:01:48 -0500
X-Lotus-FromDomain: HNS
To: tcp-impl@grc.nasa.gov
Message-ID: <85256878.007E8188.00@ngw2.hns.com>
Date: Tue, 1 Feb 2000 18:01:36 -0500
Subject: TCP Options Header Fields
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk



All size references in many books and websites specify that the size of the
Options field in the TCP header is 'variable'.  However, in the Selective ACK
RFC (2018), it specifies that the Option field is a maximum of 40 bytes.

What is the maximum size of the Option field in the TCP header?

Thanks.

Patrick Foy




From owner-tcp-impl@lerc.nasa.gov  Tue Feb  1 23:26:19 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA15683
	for <tcpimpl-archive@odin.ietf.org>; Tue, 1 Feb 2000 23:26:18 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA28705
	for tcp-impl-outgoing; Tue, 1 Feb 2000 20:44:02 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA28693
	for <tcp-impl@grc.nasa.gov>; Tue, 1 Feb 2000 20:44:00 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA22928; Tue, 1 Feb 2000 20:43:59 -0500 (EST)
Received: from drawbridge.ascend.com(198.4.92.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma022914; Tue, 1 Feb 00 20:43:36 -0500
Received: from fw-ext.ascend.com (fw-ext [198.4.92.5])
	by drawbridge.ascend.com (8.9.1a/8.9.1) with SMTP id RAA24012;
	Tue, 1 Feb 2000 17:37:24 -0800 (PST)
Received: from russet.ascend.com by fw-ext.ascend.com
          via smtpd (for drawbridge.ascend.com [198.4.92.1]) with SMTP; 2 Feb 2000 01:43:34 UT
Received: from wopr.eng.ascend.com (wopr.eng.ascend.com [206.65.212.178])
	by russet.ascend.com (8.9.1a/8.9.1) with ESMTP id RAA11854;
	Tue, 1 Feb 2000 17:43:33 -0800 (PST)
Received: from wli-sun.eng.ascend.com (wli-sun.eng.ascend.com [10.40.40.132])
	by wopr.eng.ascend.com (8.9.1/8.9.1) with ESMTP id RAA02295;
	Tue, 1 Feb 2000 17:47:04 -0800 (PST)
Received: from ascend.com (localhost [127.0.0.1])
	by wli-sun.eng.ascend.com (8.8.8+Sun/8.8.8) with ESMTP id RAA08225;
	Tue, 1 Feb 2000 17:43:34 -0800 (PST)
Message-ID: <38978BC5.340901D7@ascend.com>
Date: Tue, 01 Feb 2000 17:43:33 -0800
From: william Li <liw@ascend.com>
X-Mailer: Mozilla 4.6 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: pfoy@hns.com
CC: tcp-impl@grc.nasa.gov
Subject: Re: TCP Options Header Fields
References: <85256878.007E8188.00@ngw2.hns.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

pfoy@hns.com wrote:
> 
> All size references in many books and websites specify that the size of the
> Options field in the TCP header is 'variable'.  However, in the Selective ACK
> RFC (2018), it specifies that the Option field is a maximum of 40 bytes.
> 
> What is the maximum size of the Option field in the TCP header?

The tcp hlen is 4 bits, that would limit the tcp header length
to 2^4 * 4 = 64 bytes. The standard tcp header takes 20 bytes
plus 1 required tcp option (I believe it is MSS) 4 bytes. So the
maximum option length left for SACK is 40 bytes.

 
> 
> Thanks.
> 
> Patrick Foy

-- 
Cheers

William Li
InterNetworking Systems, Lucent Technologies


From owner-tcp-impl@lerc.nasa.gov  Wed Feb  2 00:26:42 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA16630
	for <tcpimpl-archive@odin.ietf.org>; Wed, 2 Feb 2000 00:26:40 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id WAA03524
	for tcp-impl-outgoing; Tue, 1 Feb 2000 22:08:01 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id WAA03511
	for <tcp-impl@grc.nasa.gov>; Tue, 1 Feb 2000 22:08:00 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id WAA01014; Tue, 1 Feb 2000 22:08:00 -0500 (EST)
Received: from sgi.sgi.com(192.48.153.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma001004; Tue, 1 Feb 00 22:07:39 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) 
	by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam:
       SGI does not authorize the use of its proprietary
       systems or networks for unsolicited or bulk email
       from the Internet.) 
	via ESMTP id TAA06390; Tue, 1 Feb 2000 19:07:25 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [150.166.75.10])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via SMTP id TAA94824;
	Tue, 1 Feb 2000 19:07:23 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (950413.SGI.8.6.12/960327.SGI.AUTOCF) via ESMTP id TAA02189; Tue, 1 Feb 2000 19:08:57 -0800
Message-Id: <200002020308.TAA02189@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: william Li <liw@ascend.com>
Cc: tcp-impl@grc.nasa.gov
Subject: Re: TCP Options Header Fields 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "Tue, 01 Feb 2000 17:43:33 PST."
             <38978BC5.340901D7@ascend.com> 
Date: Tue, 01 Feb 2000 19:08:57 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> pfoy@hns.com wrote:
> > 
> > All size references in many books and websites specify that the size of the
> > Options field in the TCP header is 'variable'.  However, in the Selective ACK
> > RFC (2018), it specifies that the Option field is a maximum of 40 bytes.
> > 
> > What is the maximum size of the Option field in the TCP header?
> 
> The tcp hlen is 4 bits, that would limit the tcp header length
> to 2^4 * 4 = 64 bytes. The standard tcp header takes 20 bytes
> plus 1 required tcp option (I believe it is MSS) 4 bytes. So the
> maximum option length left for SACK is 40 bytes.

With hlen as 4 bits, the maximum tcp header length from a zero offset is
(2^4 - 1) << 2 = 60, giving 40 bytes maximum for tcp options.
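
The arithmetic above can be checked with a minimal Python sketch. The
constant names are illustrative assumptions and do not come from any
TCP implementation:

```python
# Illustrative constants; the names are assumptions, not from any TCP stack.
DATA_OFFSET_BITS = 4      # width of the Data Offset (hlen) field
FIXED_HEADER_BYTES = 20   # TCP header without any options

# Largest value a 4-bit field can hold.
max_offset_words = (1 << DATA_OFFSET_BITS) - 1

# The field counts 32-bit words, so shift left by 2 to convert to bytes.
max_header_bytes = max_offset_words << 2

# Whatever the fixed header does not use is available for options.
max_option_bytes = max_header_bytes - FIXED_HEADER_BYTES

print(max_offset_words, max_header_bytes, max_option_bytes)
```

Under these assumptions the three values come out as 15 words, 60 bytes
of header, and 40 bytes of options, which agrees with the 40-byte limit
quoted from RFC 2018.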

-- 
Zachary Amsden  zamsden@engr.sgi.com  (650) 933-6919  09U-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Thu Feb  3 23:28:26 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA06957
	for <tcpimpl-archive@odin.ietf.org>; Thu, 3 Feb 2000 23:28:26 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA04399
	for tcp-impl-outgoing; Thu, 3 Feb 2000 20:51:01 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA04382
	for <tcp-impl@grc.nasa.gov>; Thu, 3 Feb 2000 20:50:59 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA22305; Thu, 3 Feb 2000 20:50:59 -0500 (EST)
Received: from web1606.mail.yahoo.com(128.11.23.206) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma022292; Thu, 3 Feb 00 20:50:31 -0500
Received: (qmail 13200 invoked by uid 60001); 4 Feb 2000 01:42:30 -0000
Message-ID: <20000204014230.13199.qmail@web1606.mail.yahoo.com>
Received: from [137.65.49.122] by web1606.mail.yahoo.com; Thu, 03 Feb 2000 17:42:30 PST
Date: Thu, 3 Feb 2000 17:42:30 -0800 (PST)
From: alex r <rnalex@yahoo.com>
Subject: Re: TCP Options Header Fields 
To: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>, william Li <liw@ascend.com>
Cc: tcp-impl@grc.nasa.gov
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

From RFC 793:

 Data Offset:  4 bits

    The number of 32 bit words in the TCP Header.  This indicates where
    the data begins.  The TCP header (even one including options) is an
    integral number of 32 bits long.

This says that the field can hold the values 0 to 15 (2^4 values). A
value of 0 can be represented but would describe a zero-length header,
so effectively only the 15 values 1-15 contribute to the length, each
counting one 4-byte word. So we have 15*4 = 60 bytes max as the header
length.
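
To illustrate how the Data Offset field is read in practice, here is a
small Python sketch. The function name `tcp_header_length` and the
sample header are hypothetical; the lower bound of 5 words is added on
the grounds that the fixed TCP header is 20 bytes:

```python
import struct

def tcp_header_length(segment: bytes) -> int:
    """Return the TCP header length in bytes, read from the Data Offset field."""
    if len(segment) < 20:
        raise ValueError("shorter than the fixed 20-byte TCP header")
    # Byte 12 carries Data Offset in its upper 4 bits.
    data_offset_words = segment[12] >> 4
    if data_offset_words < 5:
        raise ValueError("Data Offset below 5 words cannot cover the fixed header")
    # The field counts 32-bit words, so multiply by 4 to get bytes.
    return data_offset_words * 4

# Hypothetical segment: ports 1234 -> 80, Data Offset = 15, other fields zero,
# followed by 40 zero bytes standing in for the options area.
fixed = struct.pack("!HHIIBBHHH", 1234, 80, 0, 0, 15 << 4, 0, 0, 0, 0)
segment = fixed + bytes(40)

header_len = tcp_header_length(segment)
option_len = header_len - len(fixed)
print(header_len, option_len)
```

With Data Offset set to its largest value, the header is 60 bytes and
the options area is 40 bytes.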

--- Zachary Amsden <zamsden@cthulhu.engr.sgi.com> wrote:
> > pfoy@hns.com wrote:
> > > 
> > > All size references in many books and websites specify that the size of the
> > > Options field in the TCP header is 'variable'.  However, in the Selective ACK
> > > RFC (2018), it specifies that the Option field is a maximum of 40 bytes.
> > > 
> > > What is the maximum size of the Option field in the TCP header?
> > 
> > The tcp hlen is 4 bits, that would limit the tcp header length
> > to 2^4 * 4 = 64 bytes. The standard tcp header takes 20 bytes
> > plus 1 required tcp option (I believe it is MSS) 4 bytes. So the
> > maximum option length left for SACK is 40 bytes.
> 
> With hlen as 4 bits, the maximum tcp header length from a zero offset is
> (2^4 - 1) << 2 = 60, giving 40 bytes maximum for tcp options.
> 
> -- 
> Zachary Amsden  zamsden@engr.sgi.com  (650) 933-6919  09U-510  Core Protocols


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  4 08:10:25 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id IAA25116
	for <tcpimpl-archive@odin.ietf.org>; Fri, 4 Feb 2000 08:10:24 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id FAA04352
	for tcp-impl-outgoing; Fri, 4 Feb 2000 05:38:40 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id FAA04348;
	Fri, 4 Feb 2000 05:38:30 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id FAA05807; Fri, 4 Feb 2000 05:38:30 -0500 (EST)
Received: from ada.cs.ucy.ac.cy(194.42.10.200) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma005775; Fri, 4 Feb 00 05:37:57 -0500
Received: from cs099 (cs099.cs.ucy.ac.cy [194.42.10.211])
	by ada.cs.ucy.ac.cy (8.8.8/8.8.8) with SMTP id MAA33184;
	Fri, 4 Feb 2000 12:40:13 +0200
Message-ID: <016801bf6efc$637a8110$d30a2ac2@cs.ucy.ac.cy>
From: "Andreas Pitsillides" <andreas.pitsillides@ucy.ac.cy>
To: <Undisclosed-Recipient:@cs.ucy.ac.cy;>
Subject: IEEE INFOCOM 2001 Call for participation
Date: Fri, 4 Feb 2000 12:40:26 +0200
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-7"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2919.6600
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6600
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

---------------------------------<>-----------------------------------
The 20th Annual Conference of IEEE Communications and Computer Societies

                     C A L L   F O R    P A P E R S

                  I E E E   I N F O C O M    2 0 0 1


              The Conference on Computer Communications
               "20 Years into the Communications Odyssey"

                    http://www.ieee-infocom.org/2001

                 April 22-26, 2001 - Anchorage, Alaska
       Sponsored by the IEEE Communications and Computer Societies

CALL FOR PAPERS
================

The major conference on computer communications and networking is
celebrating its 20th anniversary in the splendid setting of Anchorage
(Alaska) during the week of April 22-26. The conference will bring
together researchers and practitioners of every aspect of digital
communications and networks, presenting the most up-to-date results
and achievements in these fields. The IEEE INFOCOM 2001 program committee
is soliciting original papers describing state-of-the-art research and
development in all areas of computer networking and data
communications. Topics of interest include, but are not limited to,
the following:


BISDN and ATM                       Network management and control
Billing and pricing                 Network measurements and testbeds
Communication protocols             Protocol design and analysis
Congestion and admission control    Quality of service
Flow control                        Queueing theory
Cryptography, information hiding    Scheduling
Internet and web applications       Security and privacy
Optical networks                    Storage area networks
Mobile networks                     Switching and switch architectures
Multicast                           Traffic management and control
Multimedia                          Routing
Multiple access                     Web performance and caching
Network architectures               Wireless networks


PAPER SUBMISSION
================

Papers must be submitted electronically according to the instructions
described in <http://www.ieee-infocom.org/2001> and summarized
below. Proposals for panels, half- or full-day tutorials should be
submitted to the respective chairs. Please refer to the conference web
site for further details.

Papers must be formatted according to the IEEE standard format except
for the font size, which MUST be 11pt.  To make it easy to adhere to
the formatting standard, we offer templates and samples for LaTeX,
MSWord, and FrameMaker (please refer to the pertinent web pages at
<http://www.ieee-infocom.org/2001>).
-------------------------------------------------------------------
PAPERS THAT DO NOT COMPLY WITH THE ABOVE FORMAT CANNOT BE REVIEWED
-------------------------------------------------------------------

Submissions must be in PDF or Postscript.  Postscript papers must use
only standard PostScript fonts: Times Roman, Courier, Symbol, and
Helvetica.  (Please note that Postscript output from MSWord typically
does not work on non-Microsoft platforms.  The use of the Apple
LaserWriter II printer driver is strongly recommended).  The above
formatted papers can be submitted in a compressed form (gzip, zip,
WinZip, compress).

Because of the size limitation on the final manuscript, and to ensure
that the reviewed paper and the final version have a similar size,
-----------------------------------------------------
PAPERS WITH MORE THAN 11 PAGES CANNOT BE REVIEWED
-----------------------------------------------------
(this is roughly equivalent to 20 double-spaced pages).

Papers must be submitted electronically using the Web site at
<http://www.ieee-infocom.org/2001>.  This web page contains exact and
detailed instructions about the submission process. Author's contact
information must be provided during submission. To save space, authors
may omit this information from the paper itself.  Authors will receive
an immediate notification of the successful receipt of the file
containing their paper.  Subsequently, a formal notification will be
sent after verifying that the paper can be printed successfully.

-------------------------------------------------------------------------
| SUBMISSIONS WILL ONLY BE ACCEPTED BETWEEN MAY 1ST AND JULY 5TH, 2000. |
-------------------------------------------------------------------------

SUBMISSION DEADLINES ARE STRICT!  PAPERS THAT HAVE BEEN IMPROPERLY
SUBMITTED OR IMPROPERLY FORMATTED BY THE SUBMISSION DEADLINE WILL NOT BE
CONSIDERED.  TO AVOID LAST MINUTE PROBLEMS, AUTHORS ARE ENCOURAGED TO
SUBMIT THEIR PAPERS WELL IN ADVANCE OF THE DEADLINE.


THE REVIEW PROCESS
==================

Each paper will typically be reviewed by three independent reviewers,
whose reviews will be relayed to the corresponding author.  Following
last year's successful experiment, authors will have a chance to provide
a limited rebuttal on the reviews before the program committee makes
its final decision.


TRAVEL GRANTS
=============

Limited travel assistance to students, post-docs and junior faculty
presenting a paper in the conference will be available. Please refer
to the conference web site for further details.


IMPORTANT DATES
===============

   Complete paper due             July 5, 2000
   Notification of acceptance     October 31, 2000
   Final version due              December 31, 2000


PROGRAM COMMITTEE CO-CHAIRS [infocom@watson.ibm.com]
===========================

   Rene L. Cruz, UCSD
   Giovanni Pacifici, IBM Research




From owner-tcp-impl@lerc.nasa.gov  Fri Feb  4 15:51:24 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA09966
	for <tcpimpl-archive@odin.ietf.org>; Fri, 4 Feb 2000 15:51:23 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA28407
	for tcp-impl-outgoing; Fri, 4 Feb 2000 13:03:42 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA28355
	for <tcp-impl@grc.nasa.gov>; Fri, 4 Feb 2000 13:03:39 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA27185; Fri, 4 Feb 2000 13:03:36 -0500 (EST)
Received: from fwns1d.raleigh.ibm.com(204.146.167.235) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma027109; Fri, 4 Feb 00 13:03:03 -0500
Received: from rtpmail02.raleigh.ibm.com (rtpmail02.raleigh.ibm.com [9.37.172.48])
	by fwns1.raleigh.ibm.com (8.9.0/8.9.0/RTP-FW-1.2) with ESMTP id NAA23920;
	Fri, 4 Feb 2000 13:03:00 -0500
Received: from rotala.raleigh.ibm.com (rotala.raleigh.ibm.com [9.37.82.31])
	by rtpmail02.raleigh.ibm.com (8.8.5/8.8.5/RTP-ral-1.1) with ESMTP id NAA32520;
	Fri, 4 Feb 2000 13:02:59 -0500
Received: from rotala.raleigh.ibm.com (localhost [127.0.0.1]) by rotala.raleigh.ibm.com (8.9.3/8.7/RTP-ral-1.0) with ESMTP id NAA17761; Fri, 4 Feb 2000 13:02:00 -0500
Message-Id: <200002041802.NAA17761@rotala.raleigh.ibm.com>
To: Bob Braden <braden@ISI.EDU>
cc: tcp-impl@grc.nasa.gov, brittone@us.ibm.com
Subject: Re: TCP MSS option value 
In-Reply-To: Message from Bob Braden <braden@ISI.EDU> 
   of "Sat, 22 Jan 2000 00:11:15 GMT." <200001220011.AAA07222@gra.isi.edu> 
Date: Fri, 04 Feb 2000 13:02:00 -0500
From: Thomas Narten <narten@raleigh.ibm.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Bob,

> Note that the value you should send in the MSS option has no relation
> to the appearance of TCP options.  That is the problem of the sender
> only.

This makes complete sense (and the same rule applies to IP
options). I.e., what the MSS option should be communicating is precise
information that allows the sender to send a maximal sized packet that
will be properly reassembled by the receiver.  If the sender includes
options (whether TCP or IP) it needs to take that into account when
sending the TCP segment.

Likewise, a receiver has no way of knowing what options its peer will
be sending (i.e., they could vary on a segment-by-segment basis), so
it should not consider options when calculating the MSS value it
advertises.

Bottom line, the sender and receiver need to agree on what the MSS
value really means. Looking way back to RFC 879, it very clearly
states:

>    The MSS counts only data octets in the segment, it does not count the
>    TCP header or the IP header.

and later:

>    The relationship between the value of the maximum IP datagram size
>    and the maximum TCP segment size is obscure.  The problem is that
>    both the IP header and the TCP header may vary in length.  The TCP
>    Maximum Segment Size option (MSS) is defined to specify the maximum
>    number of data octets in a TCP segment exclusive of TCP (or IP)
>    header.

Later RFCs seem consistent with this, but sometimes are a bit
imprecise in the exact language used regarding options.  

RFC 2581 says:

>    RECEIVER MAXIMUM SEGMENT SIZE (RMSS):  The RMSS is the size of the
>       largest segment the receiver is willing to accept.  This is the
>       value specified in the MSS option sent by the receiver during
>       connection startup.  Or, if the MSS option is not used, 536 bytes
>       [Bra89].  The size does not include the TCP/IP headers and
>       options.

There is a bit of an ambiguity here with this definition. First, the
MSS doesn't really mean "largest segment the receiver is willing to
accept", it means "largest data part that can be accepted in a received
segment, assuming the segment has minimal IP and TCP headers (e.g.,
N-20-20)".

Mentioning "options" in the same sentence as "does not include the
TCP/IP headers and options" is confusing, as it could be read to imply
that the receiver needs to understand about what options will be
received, and adjust the MSS advertisement as appropriate. But the
receiver has no way of knowing what options will be sent (and when),
so the mentioning of options in the context of the MSS advertisement
is inappropriate. All talk about options should be dealt with in the
context of computing the Send MSS.

brittone@us.ibm.com writes:

> If I understand correctly that RFC2581 says TCP should advertise MSS opt
> value of 1448 when MTU=1500 and TCPoptionLength=12, then I have some
> further questions about how to calculate Eff.snd.MSS, the number of data
> bytes to send in a segment along with the headers and options:

My take is that RFC 2581 should not be interpreted to say that the
advertised MSS should be 1448 above.  It should be 1460, and the Send
MSS calculation of RFC 1122 still applies.
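
As a rough illustration of the numbers in this exchange, the sketch
below computes the advertised MSS and the effective send MSS using the
formula from RFC 1122, section 4.2.2.6. The function names are
hypothetical, and treating MMS_S as the MTU minus a 20-byte IP header
is an assumption made for the example:

```python
IP_HEADER = 20   # fixed IPv4 header, no options
TCP_HEADER = 20  # fixed TCP header, no options

def advertised_mss(mtu: int) -> int:
    """MSS option value: MTU minus the fixed headers, ignoring all options."""
    return mtu - IP_HEADER - TCP_HEADER

def eff_snd_mss(send_mss: int, mms_s: int, tcp_hdr_size: int, ip_option_size: int) -> int:
    """RFC 1122 4.2.2.6: min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize."""
    return min(send_mss + 20, mms_s) - tcp_hdr_size - ip_option_size

mtu = 1500
tcp_options = 12              # option length from the question above
mss_option = advertised_mss(mtu)

# Assumption for this example: MMS_S is the MTU minus a 20-byte IP header.
mms_s = mtu - IP_HEADER
data_per_segment = eff_snd_mss(mss_option, mms_s, TCP_HEADER + tcp_options, 0)

print(mss_option, data_per_segment)
```

Under these assumptions the advertised value is 1460, and the sender
places 1448 data bytes in each segment that carries 12 bytes of TCP
options.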

> I hope I have misinterpreted RFC2581's intentions regarding MSS option
> value.

I believe Ed is misinterpreting RFC 2581's intent.

Thomas


From owner-tcp-impl@lerc.nasa.gov  Fri Feb  4 17:05:05 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA11580
	for <tcpimpl-archive@odin.ietf.org>; Fri, 4 Feb 2000 17:05:04 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA12598
	for tcp-impl-outgoing; Fri, 4 Feb 2000 14:36:03 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA12561
	for <tcp-impl@grc.nasa.gov>; Fri, 4 Feb 2000 14:36:01 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA10343; Fri, 4 Feb 2000 14:35:58 -0500 (EST)
Received: from frantic.weston.bsdi.com(209.173.194.254) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma010296; Fri, 4 Feb 00 14:35:14 -0500
Received: (from dab@localhost)
	by frantic.bsdi.com (8.9.3/8.9.0) id NAA12219;
	Fri, 4 Feb 2000 13:33:24 -0600 (CST)
Date: Fri, 4 Feb 2000 13:33:24 -0600 (CST)
From: David Borman <dab@bsdi.com>
Message-Id: <200002041933.NAA12219@frantic.bsdi.com>
To: braden@ISI.EDU, narten@raleigh.ibm.com
Subject: Re: TCP MSS option value
Cc: brittone@us.ibm.com, tcp-impl@grc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Sigh, I *really* need to plow through and get the revision
to RFC 1323 done.  It has a section in Appendix A addressing
MSS, which I've attached.

			-David Borman, dab@bsdi.com

   TCP Options and MSS
              
        There has been some confusion as to what value should be filled 
        in the TCP MSS option when using TCP options.  RFC-879
        [Postel83] stated:

             The MSS counts only data octets in the segment, it does not
             count the TCP header or the IP header.

        which is unclear about what to do about TCP options.  RFC-1122
        [Braden89] attempted to clarify this in section 4.2.2.6, but 
        there still seems to be confusion.
              
        So, the MSS value to be sent in an MSS option should be equal to
        the effective MTU minus the fixed IP and TCP headers.  Since
        both IP and TCP options are ignored when calculating the value
        for the MSS option, if there are any IP or TCP options to be
        sent in a packet, then the sender must decrease the size of the
        TCP data accordingly.  The reason for this can be seen in the
        following table:

                         +--------------------+--------------------+
                         | MSS is adjusted    | MSS isn't adjusted |
                         | to include options | to include options |
        +----------------+--------------------+--------------------+
        | Sender adjusts | Packets are too    | Packets are the    |
        | length for     | short              | correct length     |
        | options        |                    |                    |
        +----------------+--------------------+--------------------+
        | Sender doesn't | Packets are the    | Packets are too    |
        | adjust length  | correct length     | long.              |
        | for options    |                    |                    |
        +----------------+--------------------+--------------------+

        Since the goal is to not send IP datagrams that have to be
        fragmented, and packets sent with the constraints in the lower
        right of this grid will cause IP fragmentation, the only way to
        guarantee that this doesn't happen is for the data sender to
        decrease the TCP data length by the size of the IP and TCP
        options.  And since the sender will be adjusting the TCP data
        length when sending IP and TCP options, there is no need to
        include the IP and TCP option lengths in the MSS value.
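
The four cells of the table can be reproduced with a short Python
sketch. The 1500-byte MTU, the 12 bytes of TCP options, and the function
name `packet_length` are assumptions chosen for illustration:

```python
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
OPTIONS = 12  # hypothetical total of TCP options carried in each segment

def packet_length(mss_includes_options: bool, sender_adjusts: bool) -> int:
    """Total IP datagram length for one cell of the table."""
    mss = MTU - IP_HEADER - TCP_HEADER
    if mss_includes_options:
        mss -= OPTIONS              # receiver shrank the advertised MSS
    data = mss - OPTIONS if sender_adjusts else mss
    return IP_HEADER + TCP_HEADER + OPTIONS + data

def verdict(length: int) -> str:
    if length < MTU:
        return "too short"
    if length > MTU:
        return "too long"
    return "correct length"

for sender_adjusts in (True, False):
    for mss_includes_options in (True, False):
        length = packet_length(mss_includes_options, sender_adjusts)
        print(sender_adjusts, mss_includes_options, length, verdict(length))
```

Only the combination where neither side accounts for options exceeds
the MTU, and only the combination where both sides do wastes space.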


From owner-tcp-impl@lerc.nasa.gov  Mon Feb  7 15:40:33 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA21198
	for <tcpimpl-archive@odin.ietf.org>; Mon, 7 Feb 2000 15:40:31 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id MAA13645
	for tcp-impl-outgoing; Mon, 7 Feb 2000 12:55:48 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id MAA13634
	for <tcp-impl@lerc.nasa.gov>; Mon, 7 Feb 2000 12:55:46 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id MAA16569; Mon, 7 Feb 2000 12:55:45 -0500 (EST)
Received: from mailhost.iprg.nokia.com(205.226.5.12) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma016546; Mon, 7 Feb 00 12:55:37 -0500
Received: from aspen.iprg.nokia.com ([205.226.14.73]) by mailhost.iprg.nokia.com (8.8.8/8.6.10) with ESMTP id JAA29468 for <tcp-impl@lerc.nasa.gov>; Mon, 7 Feb 2000 09:55:35 -0800 (PST)
From: Fred Bauer <fred@iprg.nokia.com>
Received: (fred@localhost) by aspen.iprg.nokia.com (8.8.8/8.6.12) id JAA00417 for tcp-impl@lerc.nasa.gov; Mon, 7 Feb 2000 09:55:33 -0800 (PST)
Date: Mon, 7 Feb 2000 09:55:33 -0800 (PST)
Message-Id: <200002071755.JAA00417@aspen.iprg.nokia.com>
To: tcp-impl@lerc.nasa.gov
Subject: INFOCOM 2000 Call For Participation
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

                  CALL  FOR  PARTICIPATION
		  ------------------------

		      IEEE Infocom 2000
    (Israel)  http://www.comnet.technion.ac.il/infocom2000
      (U.S.A.)  http://www.cse.ucsc.edu/~rom/infocom2000
       (Japan) http://halo.kuamp.kyoto-u.ac.jp/~infocom

               Dan Panorama Hotel, Tel Aviv, Israel

		      March 26-30, 2000

     Sponsored by the IEEE Communications and Computer Societies

IMPORTANT 
=========

Early registration cut-off date: February 28, 2000 (only 3 more weeks)

The registration fee for an IEEE member before February 28, 2000 is $500;
it includes all technical sessions, open receptions, proceedings
(CD) and three lunches. For other fees consult the web pages.
On-line registration: https://secure.computer.org/conf/infocom/register.htm

VENUE
=====

For the last 18 years, Infocom has been the major conference on computer
communications and networking, bringing together researchers and
implementors of every aspect of data communications and networks
presenting the most up-to-date results and achievements in the field.

The 19th annual conference on Computer Communications, Infocom 2000,
will be held at the Dan Panorama Hotel in Tel-Aviv, Israel, during the
week of March 26-30, 2000.  Overlooking the Mediterranean, the Dan
Panorama Tel Aviv is a city hotel in a resort setting.  Just a few steps
away are fine shops, theaters, restaurants and the corporate world of
Tel Aviv, contrasted by the ancient port city of Jaffa with its
picturesque corners and flea markets for bargain hunters.  The hotel
features a large swimming pool, beach access and a fully equipped
health & fitness center. 

SCOPE
=====

Original papers and panel discussions describing state-of-the-art
research and development in all areas of computer networking and data
communications will be presented. Browse the excellent technical
program and see the papers at
http://www.comnet.technion.ac.il/infocom2000/program.html


KEYNOTE SPEAKER
===============

Prof. Leonard Kleinrock, Chairman, Nomadix, Inc.
Keynote title: Nomadic Computing and Smart Spaces
http://www.comnet.technion.ac.il/infocom2000/key.html

TUTORIALS
=========

Full Day
--------
- Wavelength-routing optical networks (Kumar Sivarajan, Indian Institute 
							     of Science)
- The evolution of QoS in the Internet standards community (Jon Crowcroft,
                                                   University College London)
- Overview of network security (Radia Perlman, Sun Microsystems)
- Teletraffic Models and Tools: From Basics to Advanced (Khosrow Sohraby,
                                     University of Missouri, Kansas City)
- IP Multicast: past, present and future (Radia Perlman, Sun
                                    Microsystems & Christophe Diot, Sprint)

Half Day
--------
- MPLS (Loa Andersson, Nortel Networks)
- New technologies for LAN systems (Dono Van-Mierop, IBM Israel)
- Satellite IP networking (Catherine Rosenberg, Purdue University)
- Mobile IP: adding mobility to the Internet (Charles Perkins, Nokia Research)

http://www.comnet.technion.ac.il/infocom2000/tutorial.html

QUESTIONS?
==========

Write to
infocom@comnet.technion.ac.il


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 10 16:24:06 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA10898
	for <tcpimpl-archive@odin.ietf.org>; Thu, 10 Feb 2000 16:24:05 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA02461
	for tcp-impl-outgoing; Thu, 10 Feb 2000 13:29:14 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA02425
	for <tcp-impl@grc.nasa.gov>; Thu, 10 Feb 2000 13:29:11 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA13677; Thu, 10 Feb 2000 13:29:09 -0500 (EST)
Received: from info.iet.unipi.it(131.114.9.184) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma013598; Thu, 10 Feb 00 13:28:50 -0500
Received: (from luigi@localhost)
	by info.iet.unipi.it (8.9.3/8.9.3) id TAA42924;
	Thu, 10 Feb 2000 19:28:58 +0100 (CET)
	(envelope-from luigi)
From: Luigi Rizzo <luigi@info.iet.unipi.it>
Message-Id: <200002101828.TAA42924@info.iet.unipi.it>
Subject: new dummynet page
To: tcp-impl@grc.nasa.gov
Date: Thu, 10 Feb 2000 19:28:58 +0100 (CET)
X-Mailer: ELM [version 2.4ME+ PL61 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

Hi,

Over the last few weeks I have made a lot of improvements to my
"dummynet" traffic shaper/link emulator, and I took this
chance to clean up the web page

	http://www.iet.unipi.it/~luigi/ip_dummynet/

and also to build and put there a bootable floppy image which allows
people to test (and use!) it without having to install a full
FreeBSD release on their systems.

Hope you will find this useful.

	Cheers
	luigi
-----------------------------------+-------------------------------------
  Luigi RIZZO, luigi@iet.unipi.it  . Dip. di Ing. dell'Informazione
  http://www.iet.unipi.it/~luigi/  . Universita` di Pisa
  TEL/FAX: +39-050-568.533/522     . via Diotisalvi 2, 56126 PISA (Italy)
  Mobile   +39-347-0373137
-----------------------------------+-------------------------------------



From owner-tcp-impl@lerc.nasa.gov  Thu Feb 10 20:25:45 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA13856
	for <tcpimpl-archive@odin.ietf.org>; Thu, 10 Feb 2000 20:25:45 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id RAA25429
	for tcp-impl-outgoing; Thu, 10 Feb 2000 17:58:15 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id RAA25391
	for <tcp-impl@grc.nasa.gov>; Thu, 10 Feb 2000 17:58:13 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id RAA06896; Thu, 10 Feb 2000 17:58:10 -0500 (EST)
Received: from stephens.ittc.ukans.edu(129.237.125.220) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma006797; Thu, 10 Feb 00 17:57:37 -0500
Received: from mill.ittc.ukans.edu (mill.ittc.ukans.edu [129.237.125.192])
	by stephens.ittc.ukans.edu (8.9.3/8.9.3/ITTC-NOSPAM-1.0) with ESMTP id QAA17161
	for <tcp-impl@grc.nasa.gov>; Thu, 10 Feb 2000 16:57:36 -0600 (CST)
Received: from localhost by mill.ittc.ukans.edu (8.8.5/KU-4.0-client)
	id QAA03632; Thu, 10 Feb 2000 16:57:36 -0600 (CST)
Date: Thu, 10 Feb 2000 16:57:36 -0600 (CST)
From: Anupama Sundaresan <anu@ittc.ukans.edu>
To: tcp-impl@grc.nasa.gov
Subject: Anomalous TCP behaviour? 
In-Reply-To: <200001261903.TAA08209@orchard.arlington.ma.us>
Message-ID: <Pine.SO4.4.02.10002101651560.3194-100000@mill.ittc.ukans.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Hello,

	A few days ago I raised some doubts about whether TCP was
transmitting packets out of order. Thanks to all the replies, I carried out
some tests, which showed that it was in fact tcpdump that was missing
packets, presumably because of limited buffering and because it was swamped
by a huge volume of packets.


The starting sequence number is 1577980427. The following is the printk output
at the transmitter.

Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578416928
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578416928    ---> 1st Txion
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578418376
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578418376
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578419824
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578419824
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578421272
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578422720
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578424168
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578425616
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578421272
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578422720
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578424168
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578427064
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578428512
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578429960
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578425616
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578427064
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578428512
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578431408
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578432856
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578434304
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578429960
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578431408
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578432856
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578435752
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578437200
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578438648
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578434304
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578435752
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578437200
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578438648
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578439428
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578439428
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578440876
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578440876
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578442324
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578442324
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578443772
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578445220
Feb  9 21:13:24 testbed5 kernel:  Seq # in send: 1578446668
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578443772
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578445220
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578446668
Feb  9 21:13:24 testbed5 kernel:  Seq # in Eth 1578416928   ---> ReTxion
Feb  9 21:13:24 testbed5 kernel:  tp->total_rexmits 1 sk->daddr 561900929
sk->prot->retransmits 1

Seq# 1578416928 is retransmitted, but the tcpdump output at the transmitter
shows no first transmission of this packet. In the tcpdump output below, the
same packet appears as seq# 436501, relative to the starting seq#
(1578416928 - 1577980427 (starting seq#) = 436501).
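
The conversion between the absolute sequence numbers in the printk output and
the relative ones in the tcpdump output can be sketched in a few lines of
Python. This is only an illustration: the function name rel_seq is made up
here, and the modulo step is an assumption added so that the subtraction also
holds if the 32-bit sequence space wraps during a transfer.

```python
# Hypothetical helper (not from the original post): convert an absolute TCP
# sequence number into one relative to the connection's starting sequence
# number, allowing for 32-bit wraparound.
SEQ_SPACE = 2 ** 32

def rel_seq(abs_seq, start_seq):
    """Return abs_seq relative to start_seq, modulo the 32-bit sequence space."""
    return (abs_seq - start_seq) % SEQ_SPACE

START_SEQ = 1577980427          # starting sequence number quoted above
print(rel_seq(1578416928, START_SEQ))   # the retransmitted segment
```

For the values quoted above this gives 436501, which is the number to look for
in the tcpdump output.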

testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 422689:424137(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 424137:425585(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 425585:427033(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 427033:428481(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 428481:429929(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 429929:431377(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 424137
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 427033
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 429929
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 432825
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 435721
## The first transmission is not recorded by tcpdump. In fact, the transmission
of 18156 bytes (between 431377 and 449533) is not recorded by tcpdump
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 449533:450981(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 450981:452429(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 452429:453877(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 453877:455325(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 455325:456773(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 456773:458221(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 459001:460449(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 460449:461897(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 461897:463345(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 463345:464793(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 466241:467689(1448)
## The retransmission occurs here; it looks like an out-of-order packet because
the first transmission was not recorded by tcpdump
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 436501:437949(1448)


The dump output at the receiver, by contrast, shows that the transmitter did
indeed transmit all the packets (in order, NOT out of order); it is tcpdump
at the Txr that is missing some packets.

Dump output at Rxr:

We can see that the packets between 431377 and 449533 are received, except for
the loss of the segment starting at 436501

testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 429929:431377(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 431377:432825(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 432825
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 432825:434273(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 434273:435721(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 435721
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 435721:436501(780)
#loss of a segment between 436501 and 437949
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 437949:439397(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 439397:440845(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 440845:442293(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 442293:443741(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 443741:445189(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 445189:446637(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 446637:448085(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 448085:449533(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 449533:450981(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 450981:452429(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 452429:453877(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
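
The comparison of the two dumps can also be done mechanically. The following
is a minimal Python sketch, not the method actually used for the tests above:
it scans tcpdump text lines for the "start:end(len)" field and reports any
range of sequence numbers that is skipped. The function name find_gaps and the
regular expression are assumptions made for this illustration, and the input
is a short excerpt of the traces shown above.

```python
import re

# Matches the "start:end(len)" field of a tcpdump data line.
SEG = re.compile(r"(\d+):(\d+)\((\d+)\)")

def find_gaps(lines):
    """Return (gap_start, gap_end) pairs where a data segment starts beyond
    the highest sequence number seen so far in the capture."""
    gaps = []
    highest = None
    for line in lines:
        m = SEG.search(line)
        if not m:
            continue                      # ACK-only or comment line
        start, end = int(m.group(1)), int(m.group(2))
        if highest is not None and start > highest:
            gaps.append((highest, start))
        if highest is None or end > highest:
            highest = end
    return gaps

# Excerpt of the dump at the transmitter.
txr = """\
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 428481:429929(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 429929:431377(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 435721
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 449533:450981(1448)
""".splitlines()

# Excerpt of the dump at the receiver.
rxr = """\
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 434273:435721(1448)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 435721:436501(780)
testbed5.ittc.ukans.edu.1028 testbed4.ittc.ukans.edu.5001: 437949:439397(1448)
testbed4.ittc.ukans.edu.5001 testbed5.ittc.ukans.edu.1028: ack 436501
""".splitlines()

for name, trace in (("Txr", txr), ("Rxr", rxr)):
    for lo, hi in find_gaps(trace):
        print(f"{name}: bytes {lo}..{hi} not seen ({hi - lo} bytes)")
```

On these excerpts the transmitter-side capture is missing 18156 bytes, while
the receiver-side capture is missing only the one 1448-byte segment that was
actually lost.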

Thanks,
Anu.




From owner-tcp-impl@lerc.nasa.gov  Wed Feb 16 23:17:22 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA23063
	for <tcpimpl-archive@odin.ietf.org>; Wed, 16 Feb 2000 23:17:21 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA02024
	for tcp-impl-outgoing; Wed, 16 Feb 2000 20:42:37 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA01997
	for <tcp-impl@grc.nasa.gov>; Wed, 16 Feb 2000 20:42:35 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA10673; Wed, 16 Feb 2000 20:42:33 -0500 (EST)
Received: from nassau.proxinet.com(166.90.59.24) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma010645; Wed, 16 Feb 00 20:42:17 -0500
Received: from proxinet.com (IDENT:wuchang@localhost [127.0.0.1])
	by nassau.proxinet.com (8.9.3/8.9.3) with ESMTP id RAA05825
	for <tcp-impl@grc.nasa.gov>; Wed, 16 Feb 2000 17:43:49 -0800
Message-ID: <38AB5255.2D3B2281@proxinet.com>
Date: Wed, 16 Feb 2000 17:43:49 -0800
From: Wu-chang Feng <wuchang@proxinet.com>
Reply-To: wuchang@proxinet.com
X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.5-15 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: TCP Implementors <tcp-impl@grc.nasa.gov>
Subject: Palm TCP stack
Content-Type: multipart/mixed;
 boundary="------------B6714448F38FC9E55EA42E9F"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

This is a multi-part message in MIME format.
--------------B6714448F38FC9E55EA42E9F
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I'm running across some weird TCP behavior using a Palm V with a
Minstrel V modem.  In the process of downloading a file to the Palm,
if the connection is terminated you get the behavior observed in the
attached file.  In short, the Palm continually sends FIN segments and
the server continually sends ACKs back.  (Note that the connection is
terminated by calling close() and not shutdown().)  It would seem that
the Palm should be sending a reset back to the server and not the
FINs.  Does anyone know if there's a page which tracks bugs in Palm's
TCP stack?  Any help would be greatly appreciated.

Wu
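
One way to summarize a trace like the attached one is to count segments per
sender and flag. The Python sketch below is only an illustration under
assumptions: the function name tally_flags and the regular expression are
made up here, and the regular expression expects the tcpdump text layout of
the attached trace ("time src > dst: flags ..."). The input is the first few
lines after the close().

```python
import re
from collections import Counter

# tcpdump text line: "<time> <src> > <dst>: <flags> ..."
LINE = re.compile(r"^\S+ (\S+) > (\S+): (\S+) ")

def tally_flags(lines):
    """Count segments per (source, flags) pair."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            src, _dst, flags = m.groups()
            counts[(src, flags)] += 1
    return counts

trace = """\
17:05:55.550862 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 3753 win 1608
17:05:55.550884 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:56.127270 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:56.127291 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:56.610361 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:56.610381 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
""".splitlines()

for (src, flags), n in sorted(tally_flags(trace).items()):
    print(f"{src} sent {n} segment(s) with flags '{flags}'")
```

Run over the whole attachment, the same tally shows one FIN from the Palm for
every ACK from the server, with no reset from either side.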
--------------B6714448F38FC9E55EA42E9F
Content-Type: text/plain; charset=us-ascii;
 name="to_tcpimpl"
Content-Disposition: inline;
 filename="to_tcpimpl"
Content-Transfer-Encoding: 7bit

tail end of trace, close() interrupts download after byte 3753
--------------------------------------------------------------
17:05:53.378203 mynetra.8005 > 166.137.17.102.9489: . 3217:3753(536) ack 0 win 9112 (DF)
17:05:54.962306 166.137.17.102.9489 > mynetra.8005: . ack 3753 win 1608
17:05:54.962335 mynetra.8005 > 166.137.17.102.9489: . 3753:4289(536) ack 0 win 9112 (DF)
17:05:54.962352 mynetra.8005 > 166.137.17.102.9489: . 4289:4825(536) ack 0 win 9112 (DF)
17:05:55.550862 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 3753 win 1608
17:05:55.550884 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:56.127270 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:56.127291 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:56.610361 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:56.610381 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:56.624960 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:56.624977 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:56.712507 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:56.712526 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:57.190468 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:57.190497 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:57.205040 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:57.205056 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:57.216006 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:57.216025 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:57.675817 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:57.675837 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:57.770665 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:57.770684 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:57.781629 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:57.781647 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:58.157429 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:58.157449 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:58.351000 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:58.351022 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:58.361962 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:58.361979 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:58.639158 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:58.639178 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:58.843573 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:58.843591 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:58.850905 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:58.850922 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:59.124419 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:59.124437 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:59.317810 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:59.317829 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)
17:05:59.332428 166.137.17.102.9489 > mynetra.8005: F 0:0(0) ack 4289 win 0
17:05:59.332450 mynetra.8005 > 166.137.17.102.9489: . ack 1 win 9112 (DF)

--------------B6714448F38FC9E55EA42E9F--



From owner-tcp-impl@lerc.nasa.gov  Wed Feb 16 23:27:43 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA23128
	for <tcpimpl-archive@odin.ietf.org>; Wed, 16 Feb 2000 23:27:42 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id VAA03934
	for tcp-impl-outgoing; Wed, 16 Feb 2000 21:20:51 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id VAA03920
	for <tcp-impl@grc.nasa.gov>; Wed, 16 Feb 2000 21:20:49 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id VAA13882; Wed, 16 Feb 2000 21:20:48 -0500 (EST)
Received: from nassau.proxinet.com(166.90.59.24) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma013866; Wed, 16 Feb 00 21:20:06 -0500
Received: from proxinet.com (IDENT:wuchang@localhost [127.0.0.1])
	by nassau.proxinet.com (8.9.3/8.9.3) with ESMTP id SAA06379
	for <tcp-impl@grc.nasa.gov>; Wed, 16 Feb 2000 18:21:44 -0800
Message-ID: <38AB5B38.101706E@proxinet.com>
Date: Wed, 16 Feb 2000 18:21:44 -0800
From: Wu-chang Feng <wuchang@proxinet.com>
Reply-To: wuchang@proxinet.com
X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.5-15 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: TCP Implementors <tcp-impl@grc.nasa.gov>
Subject: addendum to last message
Content-Type: multipart/mixed;
 boundary="------------D9030BD7F3D34152E5289C92"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

This is a multi-part message in MIME format.
--------------D9030BD7F3D34152E5289C92
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

The situation gets worse when the connection is closed with
shutdown(fd,2).  Data segments continue to be sent to the Palm at
exponentially backed-off intervals.  (Note that shutdown maps to
NetLibSocketShutdown.)  Am I looking at a NetLib bug?

Thanks in advance,
Wu
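
The backoff can be checked directly from the timestamps in the attached trace.
The following is a minimal Python sketch (the function name
retransmit_intervals is made up for this illustration) that extracts the
send times of the segment 4289:4825 and prints the spacing between
consecutive transmissions.

```python
from datetime import datetime

def retransmit_intervals(lines, segment):
    """Return the gaps, in seconds, between successive transmissions of the
    data segment named by its 'start:end' field."""
    times = [
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in lines
        if f" {segment}(" in line
    ]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Every transmission of 4289:4825 in the attached trace.
trace = """\
18:03:01.119195 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 57 win 9112 (DF)
18:03:03.946807 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:09.607223 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:20.938099 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:43.609735 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:04:28.952961 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:05:28.957296 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
""".splitlines()

for gap in retransmit_intervals(trace, "4289:4825"):
    print(f"{gap:.2f} s")
```

The gaps come out at roughly 2.8, 5.7, 11.3, 22.7, 45.3 and 60.0 seconds: each
is about double the previous one until the last, which looks like an upper
limit of about 60 seconds on the server's retransmission timer (an inference
from this one trace, not something confirmed for that stack).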
--------------D9030BD7F3D34152E5289C92
Content-Type: text/plain; charset=us-ascii;
 name="to_tcpimpl2"
Content-Disposition: inline;
 filename="to_tcpimpl2"
Content-Transfer-Encoding: 7bit

18:03:00.056635 166.137.17.102.9492 > mynetra.8005: . ack 1609 win 1876
18:03:00.056660 mynetra.8005 > 166.137.17.102.9492: . 2681:3217(536) ack 57 win 9112 (DF)
18:03:00.348836 166.137.17.102.9492 > mynetra.8005: . ack 2145 win 1876
18:03:00.348856 mynetra.8005 > 166.137.17.102.9492: . 3217:3753(536) ack 57 win 9112 (DF)
18:03:00.732199 166.137.17.102.9492 > mynetra.8005: . ack 2681 win 1876
18:03:00.732218 mynetra.8005 > 166.137.17.102.9492: . 3753:4289(536) ack 57 win 9112 (DF)
18:03:01.119171 166.137.17.102.9492 > mynetra.8005: . ack 3217 win 1876
18:03:01.119195 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 57 win 9112 (DF)
18:03:01.608295 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 3753 win 1340
18:03:01.608314 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:01.900139 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:01.900160 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:02.381615 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:02.381651 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:02.396155 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:02.396171 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:02.480053 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:02.480070 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:02.961498 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:02.961519 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:02.976078 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:02.976096 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:03.085505 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:03.085524 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:03.446604 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:03.446624 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:03.457563 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:03.457578 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:03.639940 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:03.639958 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:03.833246 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:03.833264 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:03.946807 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:04.026944 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:04.026961 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:04.220244 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:04.220263 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:04.231195 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:04.231215 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:04.439152 166.137.17.102.9492 > mynetra.8005: F 57:57(0) ack 4289 win 0
18:03:04.439174 mynetra.8005 > 166.137.17.102.9492: . ack 58 win 9112 (DF)
18:03:05.088229 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:05.095516 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:05.183051 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:05.380049 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:05.390995 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:09.607223 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:10.885047 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:20.938099 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:27.245048 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:03:43.609735 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:03:44.723327 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:04:28.952961 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)
18:04:30.047417 166.137.17.102.9492 > mynetra.8005: . ack 4289 win 0
18:05:28.957296 mynetra.8005 > 166.137.17.102.9492: . 4289:4825(536) ack 58 win 9112 (DF)

--------------D9030BD7F3D34152E5289C92--



From owner-tcp-impl@lerc.nasa.gov  Wed Feb 23 15:44:02 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA02507
	for <tcpimpl-archive@odin.ietf.org>; Wed, 23 Feb 2000 15:44:01 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id MAA17396
	for tcp-impl-outgoing; Wed, 23 Feb 2000 12:43:28 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id MAA17368
	for <tcp-impl@lerc.nasa.gov>; Wed, 23 Feb 2000 12:43:26 -0500 (EST)
From: fred@iprg.nokia.com
Received: by seraph3.lerc.nasa.gov; id MAA18728; Wed, 23 Feb 2000 12:43:25 -0500 (EST)
Received: from mailhost.iprg.nokia.com(205.226.5.12) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma018679; Wed, 23 Feb 00 12:42:51 -0500
Received: from darkstar.iprg.nokia.com (darkstar.iprg.nokia.com [205.226.5.69])
	by mailhost.iprg.nokia.com (8.9.3/8.9.3-GLGS) with ESMTP id JAA16028
	for <tcp-impl@lerc.nasa.gov>; Wed, 23 Feb 2000 09:42:42 -0800 (PST)
Received: (from root@localhost)
	by darkstar.iprg.nokia.com (8.9.3/8.9.3-VIRSCAN) id JAA17520
	for <tcp-impl@lerc.nasa.gov>; Wed, 23 Feb 2000 09:42:41 -0800
X-Virus-Scanned:  Wed, 23 Feb 2000 09:42:41 -0800 Nokia Silicon Valley Email Exploit Scanner
Received: from <fred@iprg.nokia.com> (vienna.iprg.nokia.com [205.226.11.35]) by darkstar.iprg.nokia.com  SMTP/WTS (12.69)
 xma017251; Wed, 23 Feb 00 09:42:35 -0800
Received: (fred@localhost) by vienna.iprg.nokia.com (8.8.8/8.6.12) id JAA00515 for tcp-impl@lerc.nasa.gov; Wed, 23 Feb 2000 09:42:35 -0800 (PST)
Date: Wed, 23 Feb 2000 09:42:35 -0800 (PST)
Message-Id: <200002231742.JAA00515@vienna.iprg.nokia.com>
To: tcp-impl@lerc.nasa.gov
Subject: INFOCOM 2000: Last week for early registration
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>>>>> INFOCOM 2000 - LAST WEEK FOR EARLY REGISTRATION  <<<<<

                  CALL  FOR  PARTICIPATION
		  ------------------------

		      IEEE Infocom 2000
    (Israel)  http://www.comnet.technion.ac.il/infocom2000
      (U.S.A.)  http://www.cse.ucsc.edu/~rom/infocom2000
       (Japan) http://halo.kuamp.kyoto-u.ac.jp/~infocom

               Dan Panorama Hotel, Tel Aviv, Israel

		      March 26-30, 2000

     Sponsored by the IEEE Communications and Computer Societies

IMPORTANT 
=========

Early registration cut-off date: February 28, 2000 (last week)

The registration fee for an IEEE member before February 28, 2000 is $500;
it includes all technical sessions, open receptions, proceedings
(CD) and three lunches. For other fees consult the web pages.
On-line registration: https://secure.computer.org/conf/infocom/register.htm

VENUE
=====

For the last 18 years, Infocom has been the major conference on computer
communications and networking, bringing together researchers and
implementors of every aspect of data communications and networks
presenting the most up-to-date results and achievements in the field.

The 19th annual conference on Computer Communications, Infocom 2000,
will be held at the Dan Panorama Hotel in Tel-Aviv, Israel, during the
week of March 26-30, 2000.  Overlooking the Mediterranean, the Dan
Panorama Tel Aviv is a city hotel in a resort setting.  Just a few steps
away are fine shops, theaters, restaurants and the corporate world of
Tel Aviv, contrasted by the ancient port city of Jaffa with its
picturesque corners and flea markets for bargain hunters.  The hotel
features a large swimming pool, beach access and a fully equipped
health & fitness center. 

SCOPE
=====

Original papers and panel discussions describing state-of-the-art
research and development in all areas of computer networking and data
communications will be presented. Browse the excellent technical
program and see the papers at
http://www.comnet.technion.ac.il/infocom2000/program.html


KEYNOTE SPEAKER
===============

Prof. Leonard Kleinrock, Chairman, Nomadix, Inc.
Keynote title: Nomadic Computing and Smart Spaces
http://www.comnet.technion.ac.il/infocom2000/key.html

TUTORIALS
=========

Full Day
--------
- Wavelength-routing optical networks (Kumar Sivarajan, Indian Institute 
							     of Science)
- The evolution of QoS in the Internet standards community (Jon Crowcroft,
                                                   University College London)
- Overview of network security (Radia Perlman, Sun Microsystems)
- Teletraffic Models and Tools: From Basics to Advanced (Khosrow Sohraby,
                                     University of Missouri, Kansas City)
- IP Multicast: past, present and future (Radia Perlman, Sun
                                    Microsystems & Christophe Diot, Sprint)

Half Day
--------
- MPLS (Loa Andersson, Nortel Networks)
- New technologies for LAN systems (Dono Van-Mierop, IBM Israel)
- Satellite IP networking (Catherine Rosenberg, Purdue University)
- Mobile IP: adding mobility to the Internet (Charles Perkins, Nokia Research)

http://www.comnet.technion.ac.il/infocom2000/tutorial.html

QUESTIONS?
==========

Write to 
infocom@comnet.technion.ac.il]


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 19:45:56 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA13226
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 19:45:55 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id QAA06384
	for tcp-impl-outgoing; Thu, 24 Feb 2000 16:58:20 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id QAA06342
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 16:58:16 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id QAA22378; Thu, 24 Feb 2000 16:58:16 -0500 (EST)
Received: from sj-mailhub-3.cisco.com(171.68.224.215) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma022363; Thu, 24 Feb 00 16:58:15 -0500
Received: from csapuntz-u1.cisco.com (csapuntz-u1.cisco.com [171.69.199.29])
	by sj-mailhub-3.cisco.com (8.9.1a/8.9.1) with ESMTP id OAA01655;
	Thu, 24 Feb 2000 14:19:35 -0800 (PST)
Received: (csapuntz@localhost) by csapuntz-u1.cisco.com (8.8.8-Cisco List Logging/CISCO.WS.1.2) id NAA18955; Thu, 24 Feb 2000 13:56:46 -0800 (PST)
Date: Thu, 24 Feb 2000 13:56:46 -0800 (PST)
Message-Id: <200002242156.NAA18955@csapuntz-u1.cisco.com>
From: Costa Sapuntzakis <csapuntz@cisco.com>
To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
cc: gibbs@freebsd.org, zaitcev@metabyte.com, drich@fjst.com
Subject: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


The TCP RDMA option reduces the overhead of receiving data over
TCP-based protocols such as NFS and HTTP.  

It enables the construction of a simple hardware accelerator that
copies data directly from the incoming packet into application
buffers, avoiding expensive copies in the protocol stack.  Even
without hardware acceleration, the option enables the protocol stack
to decrease the number of copies it must do.

The TCP RDMA option is an annotation and requires no modifications to
higher layer protocols. It can be used with popular protocols such as 
HTTP, NFS, and CIFS, along with new protocols.

The TCP option also provides a bit to indicate application-level
message boundaries. The bit enables out-of-order processing of the TCP
receive queue, potentially decreasing service times in the presence of
packet drops and improving performance on parallel systems.

A draft describing the TCP RDMA option can be found at:
ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt

-Costa



From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 20:36:36 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA14014
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 20:36:36 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id SAA12423
	for tcp-impl-outgoing; Thu, 24 Feb 2000 18:00:36 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id SAA12393
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 18:00:34 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id SAA29690; Thu, 24 Feb 2000 18:00:33 -0500 (EST)
Received: from mercury.sun.com(192.9.25.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029587; Thu, 24 Feb 00 18:00:19 -0500
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with ESMTP id PAA07669
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 15:00:16 -0800 (PST)
Received: from jurassic.eng.sun.com (jurassic.Eng.Sun.COM [129.146.84.31])
	by sunmail1.Sun.COM (8.9.1b+Sun/8.9.1/ENSMAIL,v1.6.1-sunmail1) with ESMTP id PAA15164;
	Thu, 24 Feb 2000 15:00:14 -0800 (PST)
Received: from awe174-18 (awe174-18.AWE.Sun.COM [192.29.174.18])
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) with SMTP id PAA07352;
	Thu, 24 Feb 2000 15:00:12 -0800 (PST)
Date: Thu, 24 Feb 2000 14:59:38 -0800 (PST)
From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
Reply-To: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: Costa Sapuntzakis <csapuntz@cisco.com>
Cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, gibbs@freebsd.org,
        zaitcev@metabyte.com, drich@fjst.com
In-Reply-To: "Your message with ID" <200002242156.NAA18955@csapuntz-u1.cisco.com>
Message-ID: <Roam.SIMC.2.0.6.951433178.19794.nordmark@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> The TCP RDMA option reduces the overhead of receiving data over
> TCP-based protocols such as NFS and HTTP.  

Do you have any data (simulation, implementation) to back up this claim?
Or did you mean to say "provides a capability which an implementation can
use to try to reduce the overhead"?
 
> It enables the construction of a simple hardware accelerator that
> copies data directly from the incoming packet into application
> buffers, avoiding expensive copies in the protocol stack.  Even
> without hardware acceleration, the option enables the protocol stack
> to decrease the number of copies it must do.

This seems to be an overstatement as well. Are you saying that an
implementation that currently has a single copy in its receive path
(from kernel to user space) can "reduce" the number of copies without
any hardware acceleration? That would imply that the number of
copies could be reduced to zero, which I have a hard time understanding
(unless you add hardware acceleration).

> The TCP RDMA option is an annotation and requires no modifications to
> higher layer protocols. It can be used with popular protocols such as 
> HTTP, NFS, and CIFS, along with new protocols.

How will the higher layer protocol peer know what RDMA offsets to use?
For this to provide benefits to e.g. NFS it seems like you'd want the
NFS peer to have offsets that would depend on which file is being written.
Otherwise the only benefit would be for the NIC to be able to separate all
the protocol headers from the data so that the data can be placed on 
e.g. contiguous pages in memory.

> The TCP option also provides a bit to indicate application-level
> message boundaries. The bit enables out-of-order processing of the TCP
> receive queue, potentially decreasing service times in the presence of
> packet drops and improving performance on parallel systems.
> 
> A draft describing the TCP RDMA option can be found at:
> ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt

There is no DNS entry for ftpeng.cisco.com so I can't access the document.

Thanks,
   Erik




From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 21:44:27 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA15266
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 21:44:26 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA16815
	for tcp-impl-outgoing; Thu, 24 Feb 2000 19:02:51 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA16792
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 19:02:49 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA06637; Thu, 24 Feb 2000 19:02:49 -0500 (EST)
Received: from pizda.ninka.net(216.101.162.242) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma006616; Thu, 24 Feb 00 19:02:13 -0500
Received: (from davem@localhost)
	by pizda.ninka.net (8.9.3/8.9.3) id PAA16973;
	Thu, 24 Feb 2000 15:57:10 -0800
Date: Thu, 24 Feb 2000 15:57:10 -0800
Message-Id: <200002242357.PAA16973@pizda.ninka.net>
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f
From: "David S. Miller" <davem@redhat.com>
To: Erik.Nordmark@Eng.Sun.COM
CC: csapuntz@cisco.com, ips@ece.cmu.edu, tcp-impl@grc.nasa.gov,
        gibbs@freebsd.org, zaitcev@metabyte.com, drich@fjst.com
In-reply-to: <Roam.SIMC.2.0.6.951433178.19794.nordmark@jurassic> (message from
	Erik Nordmark on Thu, 24 Feb 2000 14:59:38 -0800 (PST))
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References:  <Roam.SIMC.2.0.6.951433178.19794.nordmark@jurassic>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

   Date: Thu, 24 Feb 2000 14:59:38 -0800 (PST)
   From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>

As an aside I think the RDMA proposal has a lot of holes too.  For
example, there are in-kernel HTTP accelerators that do the complete
client header parse and initial packet response in the hw interrupt
handler.  There are no user buffers involved, and static response
data is DMA'd directly from the filesystem page cache.

   > A draft describing the TCP RDMA option can be found at:
   > ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt

   There is no DNS entry for ftpeng.cisco.com so I can't access the
   document.

Here is what I get:

? host -a ftpeng.cisco.com
Trying null domain
rcode = 0 (Success), ancount=1
The following answer is not authoritative:
The following answer is not verified as authentic by the server:
ftpeng.cisco.com        84574 IN        CNAME   ftp-eng.cisco.com
For authoritative answers, see:
cisco.com       38435 IN        NS      NS1.cisco.com
cisco.com       38435 IN        NS      NS2.cisco.com
Additional information:
NS1.cisco.com   78995 IN        A       192.31.7.92
NS2.cisco.com   67536 IN        A       192.135.250.69
rcode = 0 (Success), ancount=1
The following answer is not authoritative:
The following answer is not verified as authentic by the server:
ftp-eng.cisco.com       84574 IN        A       198.92.30.33
For authoritative answers, see:
CISCO.com       38435 IN        NS      NS1.CISCO.com
CISCO.com       38435 IN        NS      NS2.CISCO.com
Additional information:
NS1.CISCO.com   78995 IN        A       192.31.7.92
NS2.CISCO.com   67536 IN        A       192.135.250.69
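
The lookup above follows a CNAME to its A record.  A minimal Python
sketch of that two-step resolution, using only the records shown in the
output (the RECORDS table and the resolve helper are hypothetical names
for illustration, not part of any resolver library):

```python
# Records copied from the host output above; names are lowercased
# because DNS names compare case-insensitively.
RECORDS = {
    "ftpeng.cisco.com": ("CNAME", "ftp-eng.cisco.com"),
    "ftp-eng.cisco.com": ("A", "198.92.30.33"),
}


def resolve(name, records=RECORDS, max_hops=8):
    """Follow CNAME records until an A record is found.

    Returns the IPv4 address as a string, or None if the name
    (or any alias target) is missing from the table.
    """
    name = name.lower().rstrip(".")
    for _ in range(max_hops):  # bound the chain to avoid CNAME loops
        entry = records.get(name)
        if entry is None:
            return None
        rtype, value = entry
        if rtype == "A":
            return value
        name = value.lower().rstrip(".")  # rtype == "CNAME": follow the alias
    return None


print(resolve("ftpeng.cisco.com"))
```

A resolver that does not follow the alias, or a cache that lacks the
CNAME, would fail on the first name even though the second resolves.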


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 21:50:27 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA15859
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 21:50:26 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA17700
	for tcp-impl-outgoing; Thu, 24 Feb 2000 19:14:50 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA17692
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 19:14:48 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA08029; Thu, 24 Feb 2000 19:14:48 -0500 (EST)
Received: from pizda.ninka.net(216.101.162.242) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma008019; Thu, 24 Feb 00 19:14:32 -0500
Received: (from davem@localhost)
	by pizda.ninka.net (8.9.3/8.9.3) id QAA16997;
	Thu, 24 Feb 2000 16:10:26 -0800
Date: Thu, 24 Feb 2000 16:10:26 -0800
Message-Id: <200002250010.QAA16997@pizda.ninka.net>
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f
From: "David S. Miller" <davem@redhat.com>
To: gibbs@freebsd.org
CC: Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
In-reply-to: <200002250009.RAA01066@caspian.plutotech.com> (gibbs@freebsd.org)
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References:  <200002250009.RAA01066@caspian.plutotech.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

   Date: Thu, 24 Feb 2000 17:09:51 -0700
   From: "Justin T. Gibbs" <gibbs@freebsd.org>

   In the case of a server response, RDMA benefits the client, not the
   server, so I fail to see why your example is problematic.  Zero
   copy send is not what this standard addresses.

With client memory bus bandwidth in the multi-gigabyte per second
range, who needs to avoid the single copy?  How much NFS and web
surfing does one need to do before this would really come into
play?

And the bus speeds will just be faster by the time something like
this could be deployed widely.

For example, look at SACK, only within the past year are there a
decent number of systems out there implementing it.  Now how many
years ago did it enter RFC state?  And there are still stacks out
there even in their current development sources not implementing any
form of it.

Later,
David S. Miller
davem@redhat.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 21:52:33 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA15884
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 21:52:32 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA17343
	for tcp-impl-outgoing; Thu, 24 Feb 2000 19:09:36 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA17306
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 19:09:33 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA07402; Thu, 24 Feb 2000 19:09:33 -0500 (EST)
Received: from caspian.plutotech.com(206.168.67.80) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma007378; Thu, 24 Feb 00 19:09:18 -0500
Received: from caspian.plutotech.com (localhost [127.0.0.1])
	by caspian.plutotech.com (8.9.3/8.9.1) with ESMTP id RAA01066;
	Thu, 24 Feb 2000 17:09:51 -0700 (MST)
	(envelope-from gibbs@caspian.plutotech.com)
Message-Id: <200002250009.RAA01066@caspian.plutotech.com>
X-Mailer: exmh version 2.1.0 09/18/1999
To: "David S. Miller" <davem@redhat.com>
cc: Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
In-reply-to: Your message of "Thu, 24 Feb 2000 15:57:10 PST."
             <200002242357.PAA16973@pizda.ninka.net> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 24 Feb 2000 17:09:51 -0700
From: "Justin T. Gibbs" <gibbs@freebsd.org>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>   Date: Thu, 24 Feb 2000 14:59:38 -0800 (PST)
>   From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
>
>As an aside I think the RDMA proposal has a lot of holes too.  For
>example, there are in-kernel HTTP accelerators that do the complete
>client header parse and initial packet response in the hw interrupt
>handler.  There are no user buffers involved, and static response
>data is DMA'd directly from the filesystem page cache.

In the case of a server response, RDMA benefits the client, not the
server, so I fail to see why your example is problematic.  Zero copy
send is not what this standard addresses.

--
Justin




From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 22:24:47 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA16315
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 22:24:46 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA19946
	for tcp-impl-outgoing; Thu, 24 Feb 2000 19:50:51 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA19927
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 19:50:49 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA11757; Thu, 24 Feb 2000 19:50:49 -0500 (EST)
Received: from caspian.plutotech.com(206.168.67.80) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma011654; Thu, 24 Feb 00 19:50:34 -0500
Received: from caspian.plutotech.com (localhost [127.0.0.1])
	by caspian.plutotech.com (8.9.3/8.9.1) with ESMTP id RAA01134;
	Thu, 24 Feb 2000 17:50:52 -0700 (MST)
	(envelope-from gibbs@caspian.plutotech.com)
Message-Id: <200002250050.RAA01134@caspian.plutotech.com>
X-Mailer: exmh version 2.1.0 09/18/1999
To: "David S. Miller" <davem@redhat.com>
cc: gibbs@freebsd.org, Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com,
        ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, zaitcev@metabyte.com,
        drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
In-reply-to: Your message of "Thu, 24 Feb 2000 16:10:26 PST."
             <200002250010.QAA16997@pizda.ninka.net> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 24 Feb 2000 17:50:52 -0700
From: "Justin T. Gibbs" <gibbs@freebsd.org>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>   In the case of a server response, RDMA benefits the client, not the
>   server, so I fail to see why your example is problematic.  Zero
>   copy send is not what this standard addresses.
>
>With client memory bus bandwidth in the multi-gigabyte per second
>range, who needs to avoid the single copy?  How much NFS and web
>surfing does one need to do before this would really come into
>play?

This is a very different argument than the one you "implied" before.
I can certainly say that I'd rather make use of my memory bandwidth,
regardless of how much I happen to have, in a more constructive manner
than copying data for no good reason.

RDMA is a general purpose feature.  Don't shoehorn it into just an
option to accelerate a few protocols.

>And the bus speeds will just be faster by the time something like
>this could be deployed widely.

And the networks will be faster by then too and the demand for
pulling more rich content will increase, etc.

>For example, look at SACK, only within the past year are there a
>decent number of systems out there implementing it.  Now how many
>years ago did it enter RFC state?  And there are still stacks out
>there even in their current development sources not implementing any
>form of it.

The performance impact of RDMA is quite a bit larger than SACK, so
I don't know that your example is relevant.  All the big vendors
implement zero-copy in some shape or form and since RDMA is a scheme
to make zero-copy work in more cases, I'm sure it will be picked up
if the proposal is deemed sane.

>Later,
>David S. Miller
>davem@redhat.com

--
Justin




From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 22:30:11 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA16625
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 22:30:11 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA20247
	for tcp-impl-outgoing; Thu, 24 Feb 2000 19:54:39 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA20210
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 19:54:35 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA12095; Thu, 24 Feb 2000 19:54:34 -0500 (EST)
Received: from mercury.sun.com(192.9.25.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma012039; Thu, 24 Feb 00 19:53:57 -0500
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with ESMTP id QAA21425
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 16:53:56 -0800 (PST)
Received: from jurassic.eng.sun.com (jurassic.Eng.Sun.COM [129.146.88.31])
	by sunmail1.Sun.COM (8.9.1b+Sun/8.9.1/ENSMAIL,v1.6.1-sunmail1) with ESMTP id QAA07619;
	Thu, 24 Feb 2000 16:53:55 -0800 (PST)
Received: from shield (shield.Eng.Sun.COM [129.146.85.114])
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) with SMTP id QAA29178;
	Thu, 24 Feb 2000 16:53:54 -0800 (PST)
Date: Thu, 24 Feb 2000 16:53:53 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: tcp-impl@grc.nasa.gov
Cc: ips@ece.cmu.edu, gibbs@freebsd.org, zaitcev@metabyte.com, drich@fjst.com
In-Reply-To: "Your message with ID" <200002242357.PAA16973@pizda.ninka.net>
Message-ID: <Roam.SIMC.2.0.6.951440033.1084.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> As an aside I think the RDMA proposal has a lot of holes too.  For
> example, there are in-kernel HTTP accelerators that do the complete
> client header parse and initial packet response in the hw interrupt
> handler.  There are no user buffers involved, and static response
> data is DMA'd directly from the filesystem page cache.

Even without the above special acceleration, section 2 in the draft
overestimates the number of copies involved in most OSes nowadays.

> In the case of a server response, RDMA benefits the client, not the
> server, so I fail to see why your example is problematic.  Zero copy
> send is not what this standard addresses.

Even without RDMA, people have done zero copy receive with TCP/IP before. 
What I see in RDMA is a way to provide message boundary in TCP so that apps
do not need to spend time on that (I call them lazy apps). I don't see right
away the other claims in the draft.  It seems to me that the authors should
re-evaluate their claims and provide support for them.  I don't see any
arguments in the draft.

If message boundary is the only benefit, I think the SCTP proposal is more
interesting.  Check draft-ietf-sigtran-sctp-06.txt.

							K. Poon.
							kcpoon@eng.sun.com




From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 22:52:03 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA17886
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 22:52:03 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA21094
	for tcp-impl-outgoing; Thu, 24 Feb 2000 20:06:35 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA21080
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 20:06:34 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA13414; Thu, 24 Feb 2000 20:06:34 -0500 (EST)
Received: from pizda.ninka.net(216.101.162.242) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma013380; Thu, 24 Feb 00 20:05:49 -0500
Received: (from davem@localhost)
	by pizda.ninka.net (8.9.3/8.9.3) id RAA17054;
	Thu, 24 Feb 2000 17:01:34 -0800
Date: Thu, 24 Feb 2000 17:01:34 -0800
Message-Id: <200002250101.RAA17054@pizda.ninka.net>
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f
From: "David S. Miller" <davem@redhat.com>
To: gibbs@freebsd.org
CC: gibbs@freebsd.org, Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com,
        ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, zaitcev@metabyte.com,
        drich@fjst.com
In-reply-to: <200002250050.RAA01134@caspian.plutotech.com> (gibbs@freebsd.org)
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References:  <200002250050.RAA01134@caspian.plutotech.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

   Date: Thu, 24 Feb 2000 17:50:52 -0700
   From: "Justin T. Gibbs" <gibbs@freebsd.org>

   The performance impact of RDMA is quite a bit larger than SACK,

It depends who you are.

For someone over a satellite link, I think SACK benefits them
much more than RDMA.

Later,
David S. Miller
davem@redhat.com


From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 23:25:31 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA18250
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 23:25:30 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA22975
	for tcp-impl-outgoing; Thu, 24 Feb 2000 20:35:07 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA22961
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 20:35:06 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA16477; Thu, 24 Feb 2000 20:35:04 -0500 (EST)
Received: from mercury.sun.com(192.9.25.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma016374; Thu, 24 Feb 00 20:34:18 -0500
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with ESMTP id RAA04582
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 17:34:17 -0800 (PST)
Received: from jurassic.eng.sun.com (jurassic.Eng.Sun.COM [129.146.86.31])
	by sunmail1.Sun.COM (8.9.1b+Sun/8.9.1/ENSMAIL,v1.6.1-sunmail1) with ESMTP id RAA15294;
	Thu, 24 Feb 2000 17:34:17 -0800 (PST)
Received: from shield (shield.Eng.Sun.COM [129.146.85.114])
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) with SMTP id RAA05249;
	Thu, 24 Feb 2000 17:34:15 -0800 (PST)
Date: Thu, 24 Feb 2000 17:34:14 -0800 (PST)
From: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Reply-To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
To: tcp-impl@grc.nasa.gov, "Justin T. Gibbs" <gibbs@freebsd.org>
Cc: csapuntz@cisco.com, ips@ece.cmu.edu, zaitcev@metabyte.com, drich@fjst.com
In-Reply-To: "Your message with ID" <200002250050.RAA01134@caspian.plutotech.com>
Message-ID: <Roam.SIMC.2.0.6.951442454.3237.kcpoon@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> The performance impact of RDMA is quite a bit larger than SACK, so
> I don't know that your example is relevant.  All the big vendors
> implement zero-copy in some shape or form and since RDMA is a scheme
> to make zero-copy work in more cases, I'm sure it will be picked up
> if the proposal is deemed sane.

Can you elaborate on this?  Suppose TCP "blindly" does zero copy everything to
an app's buffer (for example, to a web browser's receive buffer) without
RDMA.  Then the browser app looks at the data and displays it.  What is the
difference RDMA makes in this case?  Yes, RDMA can separate different messages
in the buffer.  But this can also be done by the browser app, not by TCP.

I guess this is what I suggest the authors add to the draft.  It is not
clear to me how RDMA can make a difference, especially in those cases the
authors claim a big performance difference.
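
To make the point concrete, here is a minimal sketch of an app finding
message boundaries itself in a received byte stream.  It assumes a
hypothetical 4-byte big-endian length prefix on each message; neither
the draft nor HTTP is claimed to use this framing, and split_messages
and frame are illustrative names.

```python
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length


def split_messages(buf):
    """Split a byte stream into complete messages.

    Returns (messages, leftover), where leftover holds the bytes of a
    message that has not fully arrived yet.
    """
    messages = []
    offset = 0
    while len(buf) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        if len(buf) - start < length:
            break  # body incomplete; wait for more data from TCP
        messages.append(bytes(buf[start:start + length]))
        offset = start + length
    return messages, bytes(buf[offset:])


def frame(payload):
    """Prefix a payload with its length (sender side of the same framing)."""
    return HEADER.pack(len(payload)) + payload


# Two complete messages followed by the first 5 bytes of a third.
stream = frame(b"GET /a") + frame(b"GET /b") + frame(b"partial")[:5]
print(split_messages(stream))
```

The app pays for one pass over the length headers; the payload bytes
themselves are only sliced out, not interpreted.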

							K. Poon.
							kcpoon@eng.sun.com




From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 23:47:58 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA18860
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 23:47:57 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA23614
	for tcp-impl-outgoing; Thu, 24 Feb 2000 20:42:38 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA23581
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 20:42:35 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA17437; Thu, 24 Feb 2000 20:42:33 -0500 (EST)
Received: from mercury.sun.com(192.9.25.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma017406; Thu, 24 Feb 00 20:42:09 -0500
Received: from ebaymail2.EBay.Sun.COM ([129.150.111.20])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with ESMTP id RAA07029;
	Thu, 24 Feb 2000 17:42:07 -0800 (PST)
Received: from ha10nwk.EBay.Sun.COM (phys-ha10nwkb.EBay.Sun.COM [129.150.144.211])
	by ebaymail2.EBay.Sun.COM (8.9.1b+Sun/8.9.1/ENSMAIL,v1.6) with ESMTP id RAA18385;
	Thu, 24 Feb 2000 17:42:05 -0800 (PST)
Received: from jetsun by ha10nwk.EBay.Sun.COM (8.8.8+Sun/SMI-SVR4)
	id RAA17745; Thu, 24 Feb 2000 17:42:05 -0800 (PST)
Message-Id: <200002250142.RAA17745@ha10nwk.EBay.Sun.COM>
Date: Thu, 24 Feb 2000 17:42:05 -0800 (PST)
From: David Robinson <David.Robinson@EBay.Sun.COM>
Reply-To: David Robinson <David.Robinson@EBay.Sun.COM>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
To: ips@ece.cmu.edu
Cc: tcp-impl@grc.nasa.gov
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
Content-MD5: dRBoRwrHEPdRhrLF5rMnhA==
X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.4 SunOS 5.8 sun4u sparc 
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

To efficiently determine boundaries within a packet stream, work
must be done somewhere.  In the RDMA proposal it is up to the
clients to do the work to make the server's job easier.  In traditional
intelligent NIC cards the server does the work by parsing the headers.

It seems that the design of RDMA is backwards as it relies on changes
to the many clients to enable efficiency on the server. A traditional
intelligent NIC card with a modest amount of hardware/firmware
can handle 99+% of requests from unmodified clients.  The existence
proof is checksumming NICs and NFS accelerator boards.

An efficient IP storage device will have to deal with legacy IP
client stacks (no RDMA), and a competitive IP storage vendor will
implement the smart NIC described above. Why is RDMA more compelling?

	-David
	



From owner-tcp-impl@lerc.nasa.gov  Thu Feb 24 23:53:40 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA18898
	for <tcpimpl-archive@odin.ietf.org>; Thu, 24 Feb 2000 23:53:39 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA24231
	for tcp-impl-outgoing; Thu, 24 Feb 2000 20:53:51 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA24227
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 20:53:49 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA18480; Thu, 24 Feb 2000 20:53:48 -0500 (EST)
Received: from sgi.sgi.com(192.48.153.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma018418; Thu, 24 Feb 00 20:53:03 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) 
	by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam:
       SGI does not authorize the use of its proprietary
       systems or networks for unsolicited or bulk email
       from the Internet.) 
	via ESMTP id RAA07438; Thu, 24 Feb 2000 17:52:43 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id RAA93030;
	Thu, 24 Feb 2000 17:52:30 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id RAA71116; Thu, 24 Feb 2000 17:56:26 -0800 (PST)
Message-Id: <200002250156.RAA71116@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: "David S. Miller" <davem@redhat.com>
Cc: Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "Thu, 24 Feb 2000 16:10:26 PST."
             <200002250010.QAA16997@pizda.ninka.net> 
Date: Thu, 24 Feb 2000 17:56:26 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>    Date: Thu, 24 Feb 2000 17:09:51 -0700
>    From: "Justin T. Gibbs" <gibbs@freebsd.org>
> 
>    In the case of a server response, RDMA benefits the client, not the
>    server, so I fail to see why your example is problematic.  Zero
>    copy send is not what this standard addresses.
> 
> With client memory bus bandwidth in the multi-gigabyte per second
> range, who needs to avoid the single copy?  How much NFS and web
> surfing does one need to do before this would really come into
> play?

With network bandwidth approaching memory bus bandwidth, it becomes an issue.  
A single copy receive uses 3x the bus bandwidth of a zero copy implementation 
(or worse if not aligned properly), and also poisons many more cachelines.
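
One plausible reading of the 3x figure, as a rough sketch: the NIC's DMA write crosses the memory bus once per byte, and each CPU copy adds a read and a write. The function name and the assumption that caches absorb none of the traffic are illustrative only, not measurements.

```python
# Rough memory-bus accounting for receiving `payload_bytes` from the network.
# Assumption: every byte moved by DMA or by the CPU crosses the bus once,
# and caches absorb none of the traffic.

def bus_bytes(payload_bytes: int, copies: int) -> int:
    """Bytes crossing the memory bus for one receive.

    The NIC's DMA write into memory costs one crossing per byte; each
    CPU copy adds a read and a write, i.e. two more crossings per byte.
    """
    dma_write = payload_bytes
    cpu_copy = 2 * payload_bytes * copies
    return dma_write + cpu_copy


zero_copy = bus_bytes(1460, copies=0)
single_copy = bus_bytes(1460, copies=1)
print(single_copy / zero_copy)  # 3.0
```

A misaligned copy would push the ratio higher, since partial words or cache lines get touched more than once.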

-- 
Zachary Amsden  zamsden@engr.sgi.com  (650) 933-6919  09U-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 00:01:46 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA19076
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 00:01:45 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id VAA25088
	for tcp-impl-outgoing; Thu, 24 Feb 2000 21:08:06 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id VAA25076
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 21:08:04 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id VAA19873; Thu, 24 Feb 2000 21:08:03 -0500 (EST)
Received: from pizda.ninka.net(216.101.162.242) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma019841; Thu, 24 Feb 00 21:07:57 -0500
Received: (from davem@localhost)
	by pizda.ninka.net (8.9.3/8.9.3) id SAA17132;
	Thu, 24 Feb 2000 18:03:31 -0800
Date: Thu, 24 Feb 2000 18:03:31 -0800
Message-Id: <200002250203.SAA17132@pizda.ninka.net>
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f
From: "David S. Miller" <davem@redhat.com>
To: zamsden@cthulhu.engr.sgi.com
CC: Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
In-reply-to: <200002250156.RAA71116@clock.engr.sgi.com> (message from Zachary
	Amsden on Thu, 24 Feb 2000 17:56:26 -0800)
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References:  <200002250156.RAA71116@clock.engr.sgi.com>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

   From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
   Date: Thu, 24 Feb 2000 17:56:26 -0800

   With network bandwidth approaching memory bus bandwidth, it becomes
   an issue.  A single copy receive uses 3x the bus bandwidth of a
   zero copy implementation (or worse if not aligned properly), and
   also poisons many more cachelines.

Before we discuss this point further, does anyone have real hard
evidence that cpu cycles on the client side are the issue for
the vast majority of systems out there?

In my experience cpu cycles are abundant on client machines.  Server
side is where getting precious cpu cycles back seems more important.
Client side cpu cycles are typically being spent on tasks such as
IDCT transforms to decode audio/visual streams, executing Java code,
but not copying the data from the network.

Next, have any other implementors investigated the cpu usage gains
obtainable from deferring TCP receive packet processing to user
context?  I have seen it make significant improvements, and what's more,
this helps out all systems, including old applications and
unsophisticated hardware.  (I am referring to Jacobson's idea of
nearly 10 years ago, and it appears once again he was right.)

Finally, what does RDMA do in the presence of SACK options?  Which
set of options gets kicked out?  Should the RDMA options go in
and we just drop the SACK blocks?  Or the other way around?
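
To make the option-space question concrete, here is a small sketch of the arithmetic. The 40-byte limit, the SACK layout (RFC 2018) and the padded timestamp size (RFC 1323) are standard; the RDMA option length used in the example is purely hypothetical, since the draft's layout is not quoted here.

```python
TCP_OPTION_SPACE = 40   # 60-byte maximum TCP header minus the 20-byte fixed part
TIMESTAMP_LEN = 12      # 10-byte timestamp option plus two NOPs of padding
SACK_HEADER_LEN = 2     # kind and length bytes
SACK_BLOCK_LEN = 8      # left edge and right edge, 4 bytes each


def sack_blocks_that_fit(other_option_bytes: int) -> int:
    """Number of SACK blocks that still fit after the other options."""
    remaining = TCP_OPTION_SPACE - other_option_bytes - SACK_HEADER_LEN
    return max(0, min(4, remaining // SACK_BLOCK_LEN))


HYPOTHETICAL_RDMA_LEN = 16  # assumption for illustration only

print(sack_blocks_that_fit(0))                                      # 4
print(sack_blocks_that_fit(TIMESTAMP_LEN))                          # 3
print(sack_blocks_that_fit(TIMESTAMP_LEN + HYPOTHETICAL_RDMA_LEN))  # 1
```

Under that assumed length, a segment carrying timestamps and an RDMA option has room for only one SACK block, which is why the ordering question matters.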

Later,
David S. Miller
davem@redhat.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 00:02:20 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA19101
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 00:02:19 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id VAA25970
	for tcp-impl-outgoing; Thu, 24 Feb 2000 21:24:35 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id VAA25956
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 21:24:33 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id VAA21388; Thu, 24 Feb 2000 21:24:33 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma021352; Thu, 24 Feb 00 21:23:57 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id TAA25413
	env-from <vjs>;
	Thu, 24 Feb 2000 19:23:47 -0700 (MST)
Date: Thu, 24 Feb 2000 19:23:47 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002250223.TAA25413@calcite.rhyolite.com>
To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>

> > A draft describing the TCP RDMA option can be found at:
> > ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt
>
> There is no DNS entry for ftpeng.cisco.com so I can't access the document.

ftpeng.cisco.com resolves for me to 198.92.30.33, and the URL works.
ftpeng.cisco.com does not answer ICMP Echo-Requests.  It also seems that
Cisco is filtering ICMP TTL Exceeded.

Oh, well.  I predict that soon traceroute and ping will be as
effective as if the Internet were run by the old line telco managers
who went to great lengths to keep their technical problems quiet.
The recent security hassles will be a handy (and quite silly)
excuse.  (Yes, of course, Cisco has every right to filter however
they want.  I'm talking about technical sense, not rights.)


I'm even less impressed about the proposal than Erik Nordmark,
perhaps because more than 10 years ago I saw systems shipped by
more than one competitor of Sun Microsystems that page flipped
NFS/UDP and user TCP data.  (Well, one of the other vendors might
have been a little more recent than 10 years.)

The motive for the proposal seems to be that while only a very few
CPU instructions are needed to page flip, the functions of those CPU
instructions are very hard in hardware.  I don't agree.  In today's
world of ASIC's, silicon to figure out where to drop incoming TCP
segments or NFS/UDP/IP fragments based only on old fashioned TCP and
RPC/XDR/UDP headers is nothing to write home about.  It wasn't even
all that big a deal more than 10 years ago, as everyone involved with
or who watched Protocol Engines Inc. remembers.

Hashing is almost as cool (and easy) in hardware as in software.
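
As a software illustration of the kind of lookup meant here, the sketch below hashes the TCP/IPv4 4-tuple to find a registered receive buffer. Everything in it is an assumption for illustration: the table size, the buffer identifiers, and CRC-32 standing in for whatever hash an ASIC would actually implement.

```python
import ipaddress
import struct
import zlib

TABLE_SIZE = 1024  # hypothetical number of hash buckets


def flow_key(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Pack the TCP/IPv4 4-tuple into 12 bytes in network byte order."""
    return struct.pack(
        "!4sH4sH",
        ipaddress.IPv4Address(src_ip).packed, src_port,
        ipaddress.IPv4Address(dst_ip).packed, dst_port,
    )


def bucket_for(key: bytes) -> int:
    """Map a flow key to a bucket index; CRC-32 stands in for the hardware hash."""
    return zlib.crc32(key) % TABLE_SIZE


class FlowTable:
    """Chained hash table from flow key to a receive-buffer identifier."""

    def __init__(self) -> None:
        self.buckets = [[] for _ in range(TABLE_SIZE)]

    def register(self, key: bytes, buffer_id: int) -> None:
        self.buckets[bucket_for(key)].append((key, buffer_id))

    def lookup(self, key: bytes):
        for stored_key, buffer_id in self.buckets[bucket_for(key)]:
            if stored_key == key:
                return buffer_id
        return None  # unknown flow: fall back to the normal receive path


table = FlowTable()
key = flow_key("192.0.2.1", 2049, "192.0.2.2", 1023)
table.register(key, buffer_id=7)
print(table.lookup(key))  # 7
```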


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 01:00:42 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id BAA20196
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 01:00:41 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id WAA00299
	for tcp-impl-outgoing; Thu, 24 Feb 2000 22:38:07 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id WAA00283
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 22:38:05 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id WAA28865; Thu, 24 Feb 2000 22:38:04 -0500 (EST)
Received: from ren.netconnect.com.au(203.7.198.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma028828; Thu, 24 Feb 00 22:37:41 -0500
Received: (qmail 17574 invoked from network); 25 Feb 2000 03:37:36 -0000
Received: from unknown (HELO cvs.com.au) (203.87.14.203)
  by mail.netconnect.com.au with SMTP; 25 Feb 2000 03:37:36 -0000
Message-ID: <38B5BA9B.1D7C5957@cvs.com.au>
Date: Fri, 25 Feb 2000 10:11:23 +1100
From: Charles Esson <charlese@cvs.com.au>
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Costa Sapuntzakis <csapuntz@cisco.com>
CC: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, gibbs@freebsd.org,
        zaitcev@metabyte.com, drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References: <200002242156.NAA18955@csapuntz-u1.cisco.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

1) Given that the first paragraph says section 10 of rfc2026 is
irrelevant, under what conditions is the document published?

Costa Sapuntzakis wrote:

> The TCP RDMA option reduces the overhead of receiving data over
> TCP-based protocols such as NFS and HTTP.
>
> It enables the construction of a simple hardware accelerator that
> copies data directly from the incoming packet into application
> buffers, avoiding expensive copies in the protocol stack.  Even
> without hardware acceleration, the option enables the protocol stack
> to decrease the number of copies it must do.
>
> The TCP RDMA option is an annotation and requires no modifications to
> higher layer protocols. It can be used with popular protocols such as
> HTTP, NFS, and CIFS, along with new protocols.
>
> The TCP option also provides a bit to indicate application-level
> message boundaries. The bit enables out-of-order processing of the TCP
> receive queue, potentially decreasing service times in the presence of
> packet drops and improving performance on parallel systems.
>
> A draft describing the TCP RDMA option can be found at:
> ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt
>
> -Costa



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 01:02:26 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id BAA20282
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 01:02:26 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id WAA29789
	for tcp-impl-outgoing; Thu, 24 Feb 2000 22:29:52 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id WAA29770
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 22:29:49 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id WAA27990; Thu, 24 Feb 2000 22:29:49 -0500 (EST)
Received: from pneumatic-tube.sgi.com(204.94.214.22) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma027957; Thu, 24 Feb 00 22:29:05 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id TAA00629; Thu, 24 Feb 2000 19:31:49 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id TAA68813;
	Thu, 24 Feb 2000 19:28:41 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id TAA71655; Thu, 24 Feb 2000 19:32:41 -0800 (PST)
Message-Id: <200002250332.TAA71655@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: "David S. Miller" <davem@redhat.com>
Cc: Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "Thu, 24 Feb 2000 18:03:31 PST."
             <200002250203.SAA17132@pizda.ninka.net> 
Date: Thu, 24 Feb 2000 19:32:40 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>    From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
>    Date: Thu, 24 Feb 2000 17:56:26 -0800
> 
>    With network bandwidth approaching memory bus bandwidth, it becomes
>    an issue.  A single copy receive uses 3x the bus bandwidth of a
>    zero copy implementation (or worse if not aligned properly), and
>    also poisons many more cachelines.

> In my experience cpu cycles are abundant on client machines.  Server
> side is where getting precious cpu cycles back seems more important.

Agreed for the most part.  But reducing latency and cache misses are just as 
important on servers, and it would be interesting to see whether RDMA would 
make a difference there.  Of course for file/web serving, the server receive 
fast path can all be done in-kernel anyways, and output volume dwarfs input 
volume.

I can come up with a rigged scenario where RDMA should make a difference:

Group server or database which needs large amounts of external data over the 
network via NFS/your protocol of choice.  Internet search engines would 
possibly fall into this category.  In this case, input/output ratio is 
drastically different than a typical server, and memory bandwidth used by 
copying could become an issue.

However, this is a pretty limited use, and would only need to be deployed on a 
few machines.

I don't see RDMA being beneficial for typical web/file servers, or clients, 
other than getting a couple % extra performance.

-- 
Zachary Amsden  zamsden@engr.sgi.com  (650) 933-6919  09U-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 01:09:41 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id BAA20810
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 01:09:41 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id WAA00743
	for tcp-impl-outgoing; Thu, 24 Feb 2000 22:45:37 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id WAA00715
	for <tcp-impl@grc.nasa.gov>; Thu, 24 Feb 2000 22:45:35 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id WAA29580; Thu, 24 Feb 2000 22:45:35 -0500 (EST)
Received: from newdev.eecs.harvard.edu(140.247.60.212) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029559; Thu, 24 Feb 00 22:45:32 -0500
Received: (from sob@localhost)
	by newdev.harvard.edu (8.9.3/8.9.3) id WAA01141;
	Thu, 24 Feb 2000 22:44:55 -0500 (EST)
Date: Thu, 24 Feb 2000 22:44:55 -0500 (EST)
From: Scott Bradner <sob@harvard.edu>
Message-Id: <200002250344.WAA01141@newdev.harvard.edu>
To: charlese@cvs.com.au, csapuntz@cisco.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Cc: drich@fjst.com, gibbs@freebsd.org, ips@ece.cmu.edu, tcp-impl@grc.nasa.gov,
        zaitcev@metabyte.com
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Charles asks:
> 1) Given that the first paragraph says section 10 of rfc2026 is
> irrelevant,
> under what conditions is the document published?

actually, since it has not shown up in the Internet Drafts directory it 
has not been published in the sense that it could be the topic
of a work item in an IETF working group

so, while the topic may be interesting I'd suggest that it be ignored
in the context of an IPS BOF (and note that when published it will not
say that 2026 is "irrelevant" cuz if it does it will not get published)

Scott


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 02:19:47 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA02299
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 02:19:46 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id AAA04750
	for tcp-impl-outgoing; Fri, 25 Feb 2000 00:02:55 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id AAA04721
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 00:02:52 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id AAA06801; Fri, 25 Feb 2000 00:02:51 -0500 (EST)
Received: from ns1.metabyte.com(216.218.208.34) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma006781; Fri, 25 Feb 00 00:02:22 -0500
Received: (from zaitcev@localhost)
	by ns1.metabyte.com (8.9.1a/8.9.1/akolb/110398) id VAA28688;
	Thu, 24 Feb 2000 21:01:22 -0800 (PST)
From: Pete Zaitcev <zaitcev@metabyte.com>
Message-Id: <200002250501.VAA28688@ns1.metabyte.com>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: sob@harvard.edu (Scott Bradner)
Date: Thu, 24 Feb 2000 21:01:22 -0800 (PST)
Cc: charlese@cvs.com.au, csapuntz@cisco.com, drich@fjst.com, gibbs@freebsd.org,
        ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, zaitcev@metabyte.com
In-Reply-To: <200002250344.WAA01141@newdev.harvard.edu> from "Scott Bradner" at Feb 24, 2000 10:44:55 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> > 1) Given that the first paragraph says section 10 of rfc2026 is irrelevant,
> > under what conditions is the document published?
> 
> actually, since it has not shown up in the Internet Drafts directory it 
> has not been published in the sense that it could be the topic
> of a work item in an IETF working group
> 
> so, while the topic may be interesting I'd suggest that it be ignored
> in the context of an IPS BOF (and note that when published it will not
> say that 2026 is "irrelevant" cuz if it does it will not get published)
> 
> Scott

Very well, but what about its companion document (SCOT)?
 http://search.ietf.org/internet-drafts/draft-satran-scot-00.txt
It is published, isn't it? It was somewhat disturbing to see the
notice, but on the other hand it was honest. IBM could just as
easily come up silently with a silly software patent for RDMA option
or for SCSI over TCP idea as such.

--Pete


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 02:39:05 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA03192
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 02:39:05 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id AAA05254
	for tcp-impl-outgoing; Fri, 25 Feb 2000 00:12:39 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id AAA05228
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 00:12:36 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id AAA07639; Fri, 25 Feb 2000 00:12:36 -0500 (EST)
Received: from caspian.plutotech.com(206.168.67.80) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma007625; Fri, 25 Feb 00 00:12:27 -0500
Received: from caspian.plutotech.com (localhost [127.0.0.1])
	by caspian.plutotech.com (8.9.3/8.9.1) with ESMTP id WAA01485;
	Thu, 24 Feb 2000 22:13:07 -0700 (MST)
	(envelope-from gibbs@caspian.plutotech.com)
Message-Id: <200002250513.WAA01485@caspian.plutotech.com>
X-Mailer: exmh version 2.1.0 09/18/1999
To: Kacheong Poon <Kacheong.Poon@Eng.Sun.COM>
cc: tcp-impl@grc.nasa.gov, csapuntz@cisco.com, ips@ece.cmu.edu,
        zaitcev@metabyte.com, drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
In-reply-to: Your message of "Thu, 24 Feb 2000 17:34:14 PST."
             <Roam.SIMC.2.0.6.951442454.3237.kcpoon@jurassic> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 24 Feb 2000 22:13:07 -0700
From: "Justin T. Gibbs" <gibbs@FreeBSD.org>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

>> The performance impact of RDMA is quite a bit larger than SACK, so
>> I don't know that your example is relevant.  All the big vendors
>> implement zero-copy in some shape or form and since RDMA is a scheme
>> to make zero-copy work in more cases, I'm sure it will be picked up
>> if the proposal is deemed sane.
>
>Can you elaborate on this?  Suppose TCP "blindly" does zero copy everything to
>an app's buffer (for example, to a web browser's receive buffer) without
>RDMA.  Then the browser app looks at the data and displays it.  What is the
>difference RDMA makes in this case?  Yes, RDMA can separate different messages
>in the buffer.  But this can also be done by the browser app, not by TCP.

You seem to be saying that in the common case zero copy is achievable.
Most implementations I've seen require the network driver to make
a guess about where the payload will be in an incoming packet so the header
can be stripped off and the payload dmaed to an aligned area.   A page
flip is then performed to get the data where the user wants it,
imposing the restriction that your  payload be page sized so you don't
leave gaps in the user's destination buffer.  Certainly, with a more
intelligent network adapter that knows every protocol you can determine
exactly where the data is in each packet.  If you add connection tracking
and sequence number sniffing to the nic with a mechanism to register user
buffers to connections, you can get zero copy every time*.  Unfortunately
this is not a very general purpose solution.  The point of RDMA seems to be
to allow nic manufacturers to add support for a single tcp option that, at
the very least, allows the nic to align the payload for you.  Add RID
registration with the nic and you get the payload exactly where you want it
too.  All without too much state information kept by the nic.

* This technique has been implemented with custom firmware on  Alteon
  Gig-E cards for a product I work on.
  single protocol
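
The page-flip restriction described above can be sketched as a simple check. The 4096-byte page size and the function name are assumptions for illustration; real drivers have more conditions than this.

```python
PAGE_SIZE = 4096  # assumed page size; varies by platform


def can_page_flip(payload_len: int, dest_offset: int,
                  page_size: int = PAGE_SIZE) -> bool:
    """True if a payload can be remapped into the user's buffer without gaps.

    A flip swaps whole pages, so the payload must be a whole number of
    pages and the destination offset must sit on a page boundary.
    """
    return (
        payload_len > 0
        and payload_len % page_size == 0
        and dest_offset % page_size == 0
    )


print(can_page_flip(8192, 0))    # True: two full pages, aligned destination
print(can_page_flip(1460, 0))    # False: a typical Ethernet-sized payload
print(can_page_flip(4096, 100))  # False: destination is not page aligned
```

Anything that fails the check falls back to a copy, which is the case the RDMA option is meant to shrink.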
--
Justin




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 05:48:55 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id FAA04640
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 05:48:54 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id DAA13292
	for tcp-impl-outgoing; Fri, 25 Feb 2000 03:05:56 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id DAA13277
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 03:05:54 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id DAA21496; Fri, 25 Feb 2000 03:05:53 -0500 (EST)
Received: from prue.eim.surrey.ac.uk(131.227.76.5) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma021484; Fri, 25 Feb 00 03:05:48 -0500
Received: from petra.ee.surrey.ac.uk ([131.227.88.13] ident=eep1lw)
	by prue.eim.surrey.ac.uk with esmtp (Exim 3.03 #1)
	id 12OFko-0004nz-00; Fri, 25 Feb 2000 08:05:38 +0000
Date: Fri, 25 Feb 2000 08:05:34 +0000 (GMT)
From: Lloyd Wood <l.wood@eim.surrey.ac.uk>
X-Sender: eep1lw@petra.ee.surrey.ac.uk
Reply-To: L.Wood@eim.surrey.ac.uk
To: Vernon Schryver <vjs@calcite.rhyolite.com>
cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, ietf@ietf.org
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
In-Reply-To: <200002250223.TAA25413@calcite.rhyolite.com>
Message-ID: <Pine.GSO.4.21.0002250727470.7870-100000@petra.ee.surrey.ac.uk>
Organization: speaking for none
X-url: http://www.ee.surrey.ac.uk/Personal/L.Wood/
X-no-archive: yes
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On Thu, 24 Feb 2000, Vernon Schryver wrote:

> > From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
> 
> > > A draft describing the TCP RDMA option can be found at:
> > > ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt

If this _was_ a draft, it would be available from
http://www.ietf.org/internet-drafts/

(and I imagine that it wouldn't go against RFC2026, or be copyright
Cisco. I see that e.g.
ftp://ftp.ietf.org/internet-drafts/draft-satran-scot-00.txt
has similar wording. How widespread is this practice?)


> > There is no DNS entry for ftpeng.cisco.com so I can't access the document.
> 
> ftpeng.cisco.com resolves for me to 198.92.30.33, and the URL works

It's an alias for ftp-eng.
There doesn't appear to be a reverse DNS entry, though.


> ftpeng.cisco.com does not answer ICMP Echo-Requests.  It also seems that
> Cisco is filtering ICMP TTL Exceeded.
> 
> Oh, well.  I predict that soon traceroute and ping will be as
> effective as if the Internet were run by the old line telco managers
> who went to great lengths to keep their technical problems quiet.

Bear in mind that both traceroute and ping were effective one-person
opportunistic hacks based on using an existing infrastructure in
unexpected ways.

If they'd first been tediously designed by a committee, standardised
and mandated, things might be different. They probably wouldn't work
as well, but they'd be on buzzword-compliant feature lists.


> I'm even less impressed about the proposal than Erik Nordmark,

Note the mentions of SCSI and SCSI/TCP and the tie-in with the
proposed IP Storage efforts (recent ietf general list discussion).

I'd still like to know _why_.

L.

SCSI DMA over TCP? What _is_ all this aiming for - trying to build
distributed RAID arrays with really poor performance that are subject
to WAN outages and DoS attacks?

<L.Wood@surrey.ac.uk>PGP<http://www.ee.surrey.ac.uk/Personal/L.Wood/>







From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 07:34:54 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id HAA06201
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 07:34:54 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id EAA19682
	for tcp-impl-outgoing; Fri, 25 Feb 2000 04:55:25 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id EAA19675
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 04:55:24 -0500 (EST)
From: julian_satran@il.ibm.com
Received: by seraph3.lerc.nasa.gov; id EAA29280; Fri, 25 Feb 2000 04:55:23 -0500 (EST)
Received: from d12lmsgate-2.de.ibm.com(195.212.91.200) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029275; Fri, 25 Feb 00 04:55:16 -0500
Received: from d12relay01.de.ibm.com (d12relay01.de.ibm.com [9.165.215.22])
	by d12lmsgate-2.de.ibm.com (1.0.0) with ESMTP id KAA53310;
	Fri, 25 Feb 2000 10:55:12 +0100
Received: from d12mta05.de.ibm.com (d12mta05_cs0 [9.165.222.239])
	by d12relay01.de.ibm.com (8.8.8m2/NCO v2.06) with SMTP id KAA44992;
	Fri, 25 Feb 2000 10:55:10 +0100
Received: by d12mta05.de.ibm.com(Lotus SMTP MTA v4.6.5  (863.2 5-20-1999))  id C1256890.00367577 ; Fri, 25 Feb 2000 10:54:50 +0100
X-Lotus-FromDomain: IBMIL@IBMDE
To: David Robinson <David.Robinson@EBay.Sun.COM>
cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Message-ID: <C1256890.003670A7.00@d12mta05.de.ibm.com>
Date: Fri, 25 Feb 2000 11:54:25 +0200
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk



Message boundaries are only part of the proposal - and they don't imply
additional work at the client. Doing zero copy based only on the
information in current headers is certainly possible at low speed. Over 1
Gb/s it requires some innovation and lots of silicon. The RDMA option makes
it possible at a far lower price. And the zero copy it enables might go
deep into the application space as it is only an annotation on packets.
It certainly makes sense on all the new applications (NFS4, SCSI, etc) and
the retrofit into existing ones is not that difficult either.
And placing it over TCP puts it on safer ground than having to use
completely new and unproven protocols (VIA, NGIO, etc.) at higher speeds.

Julo
Julian Satran - IBM Research

David Robinson <David.Robinson@EBay.Sun.COM> on 25/02/2000 03:42:05

Please respond to David Robinson <David.Robinson@EBay.Sun.COM>

To:   ips@ece.cmu.edu
cc:   tcp-impl@grc.nasa.gov (bcc: Julian Satran/Haifa/IBM)
Subject:  Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.




To efficiently determine boundaries within a packet stream, work
must be done somewhere.  In the RDMA proposal it is up to the
clients to do the work to make the server's job easier.  In traditional
intelligent NIC cards the server does the work by parsing the headers.

It seems that the design of RDMA is backwards as it relies on changes
to the many clients to enable efficiency on the server. A traditional
intelligent NIC card with a modest amount of hardware/firmware
can handle 99+% of requests from unmodified clients.  The existence
proof is checksumming NICs and NFS accelerator boards.

For an efficient IP storage device it will have to deal with legacy IP
client stacks (no RDMA) and a competitive IP storage vendor will
implement the smart NIC described above. Why is RDMA more compelling?

     -David







From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 07:58:21 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id HAA06434
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 07:58:19 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id FAA20362
	for tcp-impl-outgoing; Fri, 25 Feb 2000 05:13:24 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id FAA20356
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 05:13:23 -0500 (EST)
From: julian_satran@il.ibm.com
Received: by seraph3.lerc.nasa.gov; id FAA00649; Fri, 25 Feb 2000 05:13:23 -0500 (EST)
Received: from d12lmsgate-3.de.ibm.com(195.212.91.201) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma000618; Fri, 25 Feb 00 05:12:52 -0500
Received: from d12relay01.de.ibm.com (d12relay01.de.ibm.com [9.165.215.22])
	by d12lmsgate-3.de.ibm.com (1.0.0) with ESMTP id LAA113932;
	Fri, 25 Feb 2000 11:12:48 +0100
Received: from d12mta05.de.ibm.com (d12mta05_cs0 [9.165.222.239])
	by d12relay01.de.ibm.com (8.8.8m2/NCO v2.06) with SMTP id LAA69446;
	Fri, 25 Feb 2000 11:12:44 +0100
Received: by d12mta05.de.ibm.com(Lotus SMTP MTA v4.6.5  (863.2 5-20-1999))  id C1256890.0038164D ; Fri, 25 Feb 2000 11:12:37 +0100
X-Lotus-FromDomain: IBMIL@IBMDE
To: Vernon Schryver <vjs@calcite.rhyolite.com>
cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Message-ID: <C1256890.0038155B.00@d12mta05.de.ibm.com>
Date: Fri, 25 Feb 2000 12:12:25 +0200
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk



That is not completely accurate. You will need appreciably more silicon to
do what you suggest, and you can do it only with information that "passes
through the protocol".
The good thing about the proposal is that it can TAG whatever the
application wants (and that can be several layers away from the protocol).
You can't "page-flip" to buffers that you are not aware of. And page
flipping, wherever it is applicable, also assumes page boundaries for buffers.

Julo

Julian Satran - IBM Research



Vernon Schryver <vjs@calcite.rhyolite.com> on 25/02/2000 04:23:47

Please respond to Vernon Schryver <vjs@calcite.rhyolite.com>

To:   ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
cc:    (bcc: Julian Satran/Haifa/IBM)
Subject:  Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.




> From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>

> > A draft describing the TCP RDMA option can be found at:
> > ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt
>
> There is no DNS entry for ftpeng.cisco.com so I can't access the
> document.

ftpeng.cisco.com resolves for me to 198.92.30.33, and the URL works.
ftpeng.cisco.com does not answer ICMP Echo-Requests.  It also seems that
Cisco is filtering ICMP TTL Exceeded.

Oh, well.  I predict that soon traceroute and ping will be as
effective as if the Internet were run by the old line telco managers
who went to great lengths to keep their technical problems quiet.
The recent security hassles will be a handy (and quite silly)
excuse.  (Yes, of course, Cisco has every right to filter however
they want.  I'm talking about technical sense, not rights.)


I'm even less impressed by the proposal than Erik Nordmark,
perhaps because more than 10 years ago I saw systems shipped by
more than one competitor of Sun Microsystems that page flipped
NFS/UDP and user TCP data.  (Well, one of the other vendors might
have been a little more recent than 10 years.)

The motive for the proposal seems to be that while only a very few
CPU instructions are needed to page flip, the functions of those CPU
instructions are very hard in hardware.  I don't agree.  In today's
world of ASICs, silicon to figure out where to drop incoming TCP
segments or NFS/UDP/IP fragments based only on old fashioned TCP and
RPC/XDR/UDP headers is nothing to write home about.  It wasn't even
all that big a deal more than 10 years ago, as everyone involved with
or who watched Protocol Engines Inc. remembers.

Hashing is almost as cool (and easy) in hardware as in software.
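
To illustrate the kind of hashing meant here, a minimal sketch follows. It
assumes a hypothetical connection table keyed on the TCP 4-tuple; the names
flow_hash and ConnTable, the bucket count and the XOR-fold itself are
illustrative choices, not taken from any product mentioned in this thread.

```python
import ipaddress

NBUCKETS = 256  # assumed table size (a power of two so masking works)


def flow_hash(src_ip, src_port, dst_ip, dst_port):
    """Fold the 4-tuple into a bucket index using only XOR and shifts."""
    a = int(ipaddress.IPv4Address(src_ip))
    b = int(ipaddress.IPv4Address(dst_ip))
    h = a ^ b ^ (src_port << 16) ^ dst_port
    h ^= h >> 16
    h ^= h >> 8
    return h & (NBUCKETS - 1)


class ConnTable:
    """Hypothetical table mapping a connection to its receive buffer."""

    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]

    def add(self, key, buf):
        self.buckets[flow_hash(*key)].append((key, buf))

    def lookup(self, key):
        # Collisions are resolved by comparing the full 4-tuple.
        for k, buf in self.buckets[flow_hash(*key)]:
            if k == key:
                return buf
        return None
```

XOR and shift were picked only because they map directly onto gates; a real
design would weigh collision behaviour against cost.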


Vernon Schryver    vjs@rhyolite.com





From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 10:46:29 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA12251
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 10:46:15 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id HAA28231
	for tcp-impl-outgoing; Fri, 25 Feb 2000 07:41:12 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id HAA28215
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 07:41:10 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id HAA12108; Fri, 25 Feb 2000 07:41:08 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma012071; Fri, 25 Feb 00 07:40:52 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12OK1V-0007YW-00; Fri, 25 Feb 2000 12:39:09 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: gibbs@FreeBSD.org (Justin T. Gibbs)
Date: Fri, 25 Feb 2000 12:39:05 +0000 (GMT)
Cc: Kacheong.Poon@Eng.Sun.COM (Kacheong Poon), tcp-impl@grc.nasa.gov,
        csapuntz@cisco.com, ips@ece.cmu.edu, zaitcev@metabyte.com,
        drich@fjst.com
In-Reply-To: <200002250513.WAA01485@caspian.plutotech.com> from "Justin T. Gibbs" at Feb 24, 2000 10:13:07 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12OK1V-0007YW-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> flip is then performed to get the data where the user wants it,
> imposing the restriction that your  payload be page sized so you don't
> leave gaps in the user's destination buffer.  Certainly, with a more

Perhaps it's about time the world put together an official, sane, ring buffer
style mmap socket api. A lot of the requirement to align data is coming
from the existing socket API. 
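
As a rough idea of what such an interface could look like, here is a minimal
sketch. No such API is specified in this thread: the class RxRing and its
methods are hypothetical, and a bytearray stands in for memory that would be
mmap'ed between the stack and the application.

```python
class RxRing:
    """Byte ring shared by a producer (the stack) and a consumer (the app)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # total bytes written by the producer
        self.tail = 0  # total bytes consumed by the application

    def free(self):
        return self.size - (self.head - self.tail)

    def produce(self, data):
        """Stack side: append payload at any byte offset, no alignment needed."""
        if len(data) > self.free():
            return False  # no room; would correspond to a closed window
        for i, byte in enumerate(data):
            self.buf[(self.head + i) % self.size] = byte
        self.head += len(data)
        return True

    def consume(self, n):
        """Application side: read up to n bytes in arrival order."""
        n = min(n, self.head - self.tail)
        out = bytes(self.buf[(self.tail + i) % self.size] for i in range(n))
        self.tail += n
        return out
```

Because data is appended at whatever byte offset comes next, nothing in this
arrangement requires the payload to be page sized or page aligned.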

> * This technique has been implemented with custom firmware on  Alteon
>   Gig-E cards for a product I work on.

Nifty.



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 11:22:09 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA13539
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 11:22:08 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id IAA02476
	for tcp-impl-outgoing; Fri, 25 Feb 2000 08:21:41 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id IAA02462
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 08:21:39 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id IAA15972; Fri, 25 Feb 2000 08:21:38 -0500 (EST)
Received: from ren.netconnect.com.au(203.7.198.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma015951; Fri, 25 Feb 00 08:21:27 -0500
Received: (qmail 1074 invoked from network); 25 Feb 2000 13:21:59 -0000
Received: from unknown (HELO cvs.com.au) (203.87.14.203)
  by mail.netconnect.com.au with SMTP; 25 Feb 2000 13:21:59 -0000
Message-ID: <38B64383.7F0A213@cvs.com.au>
Date: Fri, 25 Feb 2000 19:55:31 +1100
From: Charles Esson <charlese@cvs.com.au>
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Costa Sapuntzakis <csapuntz@cisco.com>
CC: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, gibbs@freebsd.org,
        zaitcev@metabyte.com, drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References: <200002242156.NAA18955@csapuntz-u1.cisco.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

I must have missed something.

If we don't have this, you can take the destination port, convert it to a
table address, use the sequence number, do some calculations and come up
with a buffer address and an offset. If you want to mess up the layering
of your stack, these are all things you can do now.

Or, using RDMA:

You tell the world the contents of your table and have the server play with
the data so detailed; the server sends back info in this option.

--->An attacker plays with the data so returned.

The client, instead of doing the port-to-table translation to determine
where to send the data from the packet, does a port-to-table translation
to find the table needed to detect malicious corruptions of the option.

How can you assume anything in the option is anything but rubbish until
such a translation and check is done?
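
The port-and-sequence-number lookup described at the top of this message can
be sketched as follows. This is only an illustration under stated
assumptions: one outstanding transfer per destination port, and a
hypothetical table called transfers; none of these names come from the draft.

```python
class Transfer:
    """Hypothetical per-port record: where a bulk transfer should land."""

    def __init__(self, base_seq, length):
        self.base_seq = base_seq      # sequence number of the first payload byte
        self.buf = bytearray(length)  # the application's destination buffer


transfers = {}  # destination port -> Transfer (assumed: one transfer per port)


def place_segment(dst_port, seq, payload):
    """Copy payload to its final position, or refuse if it does not fit."""
    t = transfers.get(dst_port)
    if t is None:
        return False
    offset = (seq - t.base_seq) % 2**32  # TCP sequence numbers wrap mod 2^32
    if offset + len(payload) > len(t.buf):
        return False
    t.buf[offset:offset + len(payload)] = payload
    return True
```

Everything used here is already in the TCP header or in local state, so
nothing about the receiver's buffers has to be advertised to the peer.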


1) Haven't you only changed the reason to do the port-to-table
translation?
2) Haven't you completely messed up the layering of your stack? OK, people
do it, but as an option?
3) Haven't you provided a whole new farmyard of bugs and potential attacks?

As I read more and more RFCs, one question I am constantly asking is: why
are people making this terrific protocol more and more complicated?

Isn't it difficult enough to implement already? (I think it is, anyway.)



Costa Sapuntzakis wrote:

> The TCP RDMA option reduces the overhead of receiving data over
> TCP-based protocols such as NFS and HTTP.
>
> It enables the construction of a simple hardware accelerator that
> copies data directly from the incoming packet into application
> buffers, avoiding expensive copies in the protocol stack.  Even
> without hardware acceleration, the option enables the protocol stack
> to decrease the number of copies it must do.

1) I think you're a bit game if you assume the data in the option is
anything other than a story that aims to mislead.
2) If you want to play this game, why can't you use the port, sequence
number and a table? All are things available to you without advertising
the state of your system.

>
>
> The TCP RDMA option is an annotation and requires no modifications to
> higher layer protocols. It can be used with popular protocols such as
> HTTP, NFS, and CIFS, along with new protocols.

>
>
> The TCP option also provides a bit to indicate application-level
> message boundaries. The bit enables out-of-order processing of the TCP
> receive queue, potentially decreasing service times in the presence of
> packet drops and improving performance on parallel systems.

Out-of-order processing assumes the data that didn't arrive will one day
arrive. Is that a valid assumption on the Internet? What happens with
packets that are received more than once: do you DMA over the previous
data, thus corrupting the changes that have been made by the application?
Do you make the rule that the application may not alter the DMA buffers?
Or do you check for this condition? Is this check in your acceleration
hardware?

I assume you believe you still only move the sequence number forward if
all data has been received up to that sequence number. Is this done in the
acceleration hardware?

By the time you have checked that the option is valid, updated your
received sequence number, and made sure you are not DMAing over data the
application already has, what have you gained that could not be had
without this option and a restructure of a stack?
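
The bookkeeping implied by these questions can be sketched as below. It is a
simplified illustration, not anything taken from the draft: the class name
Reassembly is hypothetical and sequence-number wraparound is ignored.

```python
class Reassembly:
    """Tracks which bytes have been placed; wraparound ignored for brevity."""

    def __init__(self, rcv_nxt):
        self.rcv_nxt = rcv_nxt  # next in-order sequence number expected
        self.placed = []        # sorted, disjoint [start, end) ranges above rcv_nxt

    def accept(self, seq, length):
        """Return the [start, end) ranges of this segment that may be written."""
        cur, end = max(seq, self.rcv_nxt), seq + length
        writable = []
        for s, e in self.placed:
            if e <= cur:
                continue
            if s >= end:
                break
            if s > cur:
                writable.append((cur, s))  # gap before an already-placed range
            cur = max(cur, e)
        if cur < end:
            writable.append((cur, end))
        # Merge the new ranges into the placed list.
        merged = []
        for s, e in sorted(self.placed + writable):
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        # Advance rcv_nxt only when the data is contiguous from the left edge.
        if merged and merged[0][0] == self.rcv_nxt:
            self.rcv_nxt = merged.pop(0)[1]
        self.placed = merged
        return writable
```

A duplicate segment yields an empty list, so nothing is written over data the
application may already have touched, and rcv_nxt moves only once the hole
in front of out-of-order data has been filled.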

P.S. Maximum of two copies in the stack I wrote: one from card to
ip_buffer, and an optional copy from ip_buffer to application_buffer.
I don't believe I am particularly smart. I believe the introduction is
misleading.


>
>
> A draft describing the TCP RDMA option can be found at:
> ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt
>
> -Costa



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 11:23:39 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA13591
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 11:23:37 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id IAA02320
	for tcp-impl-outgoing; Fri, 25 Feb 2000 08:20:12 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id IAA02282
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 08:20:09 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id IAA15822; Fri, 25 Feb 2000 08:20:08 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma015727; Fri, 25 Feb 00 08:19:32 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12OKcF-0007bx-00; Fri, 25 Feb 2000 13:17:07 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: julian_satran@il.ibm.com
Date: Fri, 25 Feb 2000 13:17:05 +0000 (GMT)
Cc: David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <C1256890.003670A7.00@d12mta05.de.ibm.com> from "julian_satran@il.ibm.com" at Feb 25, 2000 11:54:25 AM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12OKcF-0007bx-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> Gb/s it requires some innovation and lots of silicon. The RDMA option makes
> it possible at a far lower price. And the zero copy it enables might go
> deep into the application space as it is only an annotation on packets.

I am not convinced the amount of silicon changes between the two. The
RDMA id may be faked by an attacker so must still be verified.

Van Jacobson proposed and to an extent implemented a system where the user
context does all the TCP work. In that sort of situation and with a more
sensible API than the BSD socket one you don't appear to need a lot of silicon,
in fact the worst case is the wildcard.

Alan



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 12:10:36 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA14841
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 12:10:36 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id JAA11239
	for tcp-impl-outgoing; Fri, 25 Feb 2000 09:22:26 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id JAA11202
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 09:22:25 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id JAA23761; Fri, 25 Feb 2000 09:22:24 -0500 (EST)
Received: from mail-gw.hursley.ibm.com(194.196.110.15) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma023530; Fri, 25 Feb 00 09:21:34 -0500
Received: from sp3at21.hursley.ibm.com (sp3at21.hursley.ibm.com [9.20.45.21]) by mail-gw.hursley.ibm.com (AIX4.3/UCB 8.8.8/8.8.8) with ESMTP id OAA277522; Fri, 25 Feb 2000 14:20:25 GMT
Received: from hursley.ibm.com ([9.14.4.75]) by sp3at21.hursley.ibm.com (AIX4.2/UCB 8.7/8.7.3) with ESMTP id OAA12410; Fri, 25 Feb 2000 14:20:23 GMT
Message-ID: <38B68F5E.CCCB9A5C@hursley.ibm.com>
Date: Fri, 25 Feb 2000 08:19:10 -0600
From: Brian E Carpenter <brian@hursley.ibm.com>
Organization: IBM
X-Mailer: Mozilla 4.61 [en] (Win98; I)
X-Accept-Language: en,fr
MIME-Version: 1.0
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
CC: julian_satran@il.ibm.com, David Robinson <David.Robinson@EBay.Sun.COM>,
        ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References: <E12OKcF-0007bx-00@the-village.bc.nu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

As I recall, Van's solution required an unconventional I/O system where DMA went
straight into user memory right off the LAN chip. Not too many people make
computers like that.

  Brian

Alan Cox wrote:
> 
> > Gb/s it requires some innovation and lots of silicon. The RDMA option makes
> > it possible at a far lower price. And the zero copy it enables might go
> > deep into the application space as it is only an annotation on packets.
> 
> I am not convinced the amount of silicon changes between the two. The
> RDMA id may be faked by an attacker so must still be verified.
> 
> Van Jacobson proposed and to an extent implemented a system where the user
> context does all the TCP work. In that sort of situation and with a more
> sensible API than the BSD socket one you don't appear to need a lot of silicon,
> in fact the worst case is the wildcard.
> 
> Alan


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 12:30:14 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA15293
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 12:30:14 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id JAA13997
	for tcp-impl-outgoing; Fri, 25 Feb 2000 09:39:52 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id JAA13953
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 09:39:42 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id JAA26210; Fri, 25 Feb 2000 09:39:40 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma025374; Fri, 25 Feb 00 09:36:00 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12OLlj-0007hq-00; Fri, 25 Feb 2000 14:30:59 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: brian@hursley.ibm.com (Brian E Carpenter)
Date: Fri, 25 Feb 2000 14:30:58 +0000 (GMT)
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), julian_satran@il.ibm.com,
        David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <38B68F5E.CCCB9A5C@hursley.ibm.com> from "Brian E Carpenter" at Feb 25, 2000 08:19:10 AM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12OLlj-0007hq-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> As I recall, Van's solution required an unconventional I/O system where DMA went
> straight into user memory right off the LAN chip. Not too many people make
> computers like that.

Actually it doesn't, although every PC today is your 'unconventional' system.
Many 10Mbit and almost all the 100Mbit cards do scatter-gather DMA.  To do
zero copy means landing the buffer into memory that can become accessible to
the user; the rest of the benefits for single copy/checksum/user come with
DMA landing in purely kernel controlled space.




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 14:28:38 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA18644
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 14:28:37 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id LAA01839
	for tcp-impl-outgoing; Fri, 25 Feb 2000 11:36:09 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id LAA01819
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 11:36:06 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id LAA12866; Fri, 25 Feb 2000 11:36:05 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma012800; Fri, 25 Feb 00 11:35:34 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id JAA08403
	env-from <vjs>;
	Fri, 25 Feb 2000 09:35:30 -0700 (MST)
Date: Fri, 25 Feb 2000 09:35:30 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002251635.JAA08403@calcite.rhyolite.com>
To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

] From: Charles Esson <charlese@cvs.com.au>

] I must have missed something.
]
] If we don't have this, you can take the destination port, convert to a
] table address, use the sequence number,
] do some calculations and come up with a buffer address and an offset. If
] you want to mess up the layering
] of your stack, they are all things you can do now.

Standards committees don't like hashing.  It looks complicated and
insufficiently deterministic on an overhead projector.

] or using RDMA
] ...

] --->An attacker plays with the data so returned.
] ...

That's a very good point.  The tagging in RDMA cannot be used until
after it has been validated by the receiver.  The validating consists
of looking at sequence numbers, RPC/XDR headers, etc. to figure
out where the data can and should go, and then checking that the
sender guessed right.  Why not skip the last part and ignore the
RDMA tag?  Then why send the RDMA tag?
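
The argument can be made concrete with a small sketch. The tag here is a
hypothetical (buffer_id, offset) pair, not the option layout from the draft,
and the connection record is an assumed dictionary of local state.

```python
def expected_placement(conn, seq):
    """What the receiver derives on its own from local connection state."""
    return conn["buffer_id"], (seq - conn["base_seq"]) % 2**32


def validate_tag(conn, seq, tag):
    """Accept the sender's (buffer_id, offset) hint only if it matches."""
    return tag == expected_placement(conn, seq)
```

In this sketch validate_tag can only be answered by first computing
expected_placement, which already yields the destination by itself.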

 .......


> From: Pete Zaitcev <zaitcev@metabyte.com>

> Very well, but what about its companion document (SCOT)?
>  http://search.ietf.org/internet-drafts/draft-satran-scot-00.txt
> It is published, isn't it? It was somewhat disturbing to see the
> notice, but on the other hand it was honest. IBM could just as
> easily come up silently with a silly software patent for RDMA option
> or for SCSI over TCP idea as such.

The IETF's protections against patent games are well intended, but nothing
to worry about if you want to play them and nothing to rely upon if you
don't.  The history of IETF patent games demonstrates that the IETF is
powerless to limit them (or worse), and that they're harder to play than
the players hope.  (E.g. PPP CCP and PPP 48-bit FCS, respectively)

  ......

} From: "Justin T. Gibbs" <gibbs@FreeBSD.org>

} ...
} >Can you elaborate on this?  Suppose TCP "blindly" does zero copy everything to
} >an app's buffer (for example, to a web browser's receive buffer) without
} >RDMA.  Then the browser app looks at the data and displays it.  What is the
} >difference RDMA makes in this case?  Yes, RDMA can separate different messages
} >in the buffer.  But this can also be done by the browser app, not by TCP.
}
} You seem to be saying that in the common case zero copy is achievable.
} Most implementations I've seen require the network driver to make
} a guess about where the payload will be in an incoming packet so the header
} can be stripped off and the payload dmaed to an aligned area.   A page
} flip is then performed to get the data where the user wants it,
} imposing the restriction that your  payload be page sized so you don't
} leave gaps in the user's destination buffer.

That is required only if you stick to the current API.  Obvious, 
minor changes in the direction of some operating systems that existed
before UNIX are sufficient to relax the page boundary requirement.
To use RDMA, you have to change the API.

}                                               Certainly, with a more
} intelligent network adapter that knows every protocol you can determine
} exactly where the data is in each packet.  If you add connection tracking
} and sequence number sniffing to the nic with a mechanism to register user
} buffers to connections, you can get zero copy every time*.  Unfortunately
} this is not very general purpose solution.

Only standards committees and some academics care about "every protocol"
or optimizing absolutely every application.  The rest of us (including
academics) only care about optimizing the important stuff.

Also as you say, looking at sequence numbers in the interface and relaxing
the sockets API rules about not touching any bytes in the buffer except
those that are actually received lets you avoid copies all of the time.
I don't see why that is not a general purpose solution, if you want one.

}                                             The point of RDMA seems to be
} to allow nic manufacturers to add support for a single tcp option that, at
} the very least, allows the nic to align the payload for you.  Add RID
} registration with the nic and you get the payload exactly where you want it
} too.  All without too much state information kept by the nic.

I've been hearing since the mid-1980's proposals to do TSP lookups in the
network interface instead of software because it is so incredibly difficult
to find the right TSP quickly in software.  I think those ideas are similar
to the RDMA idea.  They assume facts not in evidence, that there is a
problem that needs to be solved, and that the solution is not worse than
the nominal problem.  There are reasons why such proposals appear in
standards committees before implementations.

 ......

] From: Lloyd Wood <l.wood@eim.surrey.ac.uk>

] Note the mentions of SCSI and SCSI/TCP and the tie-in with the
] proposed IP Storage efforts (recent ietf general list discussion).
]
] I'd still like to know _why_.

] ...
] SCSI DMA over TCP? What _is_ all this aiming for - trying to build
] distributed RAID arrays with really poor performance that are subject
] to WAN outages and DoS attacks?

Why put SCSI over a protocol that measures RTTs, worries about
congestion in routers, and that expects the error rates that come
with 5000 miles of wire and 20 routers in the path?  Does anyone
really think that TCP/IP or even IP with its 64K byte packet limit
is remotely close to the right protocol, particularly given the
existing and commercially available alternatives?

A standards committee is the venue of first and last resort for
such ideas, especially a committee that is related to currently
trendy things like the SuperInfoHypeWay.

 ....

) From: julian_satran@il.ibm.com

) That is not completely accurate. You will need appreciably more silicon to
) do what you suggest.   And you can do it only with information that "passes
) through the protocol" .

Significantly more silicon than what, to do what?  Since the comment
was addressed to me, I'll assume one 'what' was looking at sequence
numbers, port numbers, and so forth to page flip.  Clearly it takes more
silicon to support page flipping in hardware than to not support page
flipping in hardware.  I will not agree that the required silicon is a
big deal, not because I have a clue about floor plans and so forth (I
don't), but because at a previous employer I fought to keep the hardware
guys from throwing in gates to do it.  They had the silicon to spare and
had heard so much about the wonderfulness of page flipping that they wanted
to get in on the fun.

Doing things in hardware is ok only if you absolutely must.  Software is
always better when it is good enough, because it is soft.

) The good thing about the  proposal is that it can TAG whatever the
) application wants (and that can be several layers away from the protocol).
) You can't "page-flip" to buffers that you are not aware of. And page
) flipping wherever is applicable assumes  also page boundaries for buffers.

That's important only if you stick close to the sockets or UNIX read()
API.  If you are not ultra-conservative, and if you know a little of the
history of file and device I/O APIs, or if you think about such things
for 10 seconds, then RDMA tagging becomes less interesting.
To use RDMA tagging, you must abandon the UNIX read() API.  If you change
the API, then you may as well think about the whole problem instead of
only a corner.  If you let the operating system tell the application where
the incoming data arrived, then you don't need elaborate hints from the
sender to the receiver's hardware to say where the receiving software will
want the data.
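
A minimal sketch of that alternative follows. It is purely illustrative:
RxPool, deliver() and recv_where() are hypothetical names, not an existing
API, and a bytearray stands in for memory shared with the application.

```python
class RxPool:
    """Hypothetical receive pool: the stack picks the location, the app asks where."""

    def __init__(self, pool_size):
        self.pool = bytearray(pool_size)
        self.next_free = 0
        self.ready = []  # (offset, length) of data not yet handed to the app

    def deliver(self, payload):
        """Stack side: put the payload wherever is convenient in the pool."""
        off = self.next_free
        if off + len(payload) > len(self.pool):
            return False  # pool exhausted; nothing is written
        self.pool[off:off + len(payload)] = payload
        self.next_free = off + len(payload)
        self.ready.append((off, len(payload)))
        return True

    def recv_where(self):
        """App side: return a view of where data landed, or None if none is ready."""
        if not self.ready:
            return None
        off, n = self.ready.pop(0)
        return memoryview(self.pool)[off:off + n]
```

Unlike read(), the caller does not name a destination buffer, so the sender
never needs to know where the receiver wants the bytes.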


) Vernon Schryver <vjs@calcite.rhyolite.com> on 25/02/2000 04:23:47
)
) Please respond to Vernon Schryver <vjs@calcite.rhyolite.com>

I did not write that!

 .....

) From: Alan Cox <alan@lxorguk.ukuu.org.uk>
)
) > flip is then performed to get the data where the user wants it,
) > imposing the restriction that your  payload be page sized so you don't
) > leave gaps in the user's destination buffer.  Certainly, with a more
)
) Perhaps its about time the world put together an official, sane, ring buffer
) style mmap socket api. A lot of the requirement to align data is coming
) from the existing socket API. 

The IETF should not get involved in APIs.  There are plenty of other
standards committees in that arena, as well as big commercial outfits
including one in the U.S. Pacific Northwest.  In other words, do you think
the IETF would be more successful arguing with Microsoft about winsock
than the IETF has been in dealing with Microsoft's obviously completely
stupid and wrong PPP ideas?

If you do get involved in standardizing such things, then *PLEASE* don't
limit yourself to #$%$#@! ring buffers!  The ancient Excelan and preceding
(I've a mental block against the name starting with 'I') ring buffer notion
was ok as an initial hack, but WRONG for something to go fast.  To start,
you don't need pointers or indices that must be written by both the
interface and the host.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 15:07:21 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA19683
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 15:07:19 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id MAA05853
	for tcp-impl-outgoing; Fri, 25 Feb 2000 12:03:55 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id MAA05800
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 12:03:52 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id MAA16767; Fri, 25 Feb 2000 12:03:52 -0500 (EST)
Received: from pneumatic-tube.sgi.com(204.94.214.22) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma016727; Fri, 25 Feb 00 12:03:13 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id JAA04975; Fri, 25 Feb 2000 09:05:59 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id JAA00610;
	Fri, 25 Feb 2000 09:02:52 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id JAA73948; Fri, 25 Feb 2000 09:06:56 -0800 (PST)
Message-Id: <200002251706.JAA73948@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Charles Esson <charlese@cvs.com.au>
Cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov, gibbs@freebsd.org,
        zaitcev@metabyte.com, drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "Fri, 25 Feb 2000 19:55:31 +1100."
             <38B64383.7F0A213@cvs.com.au> 
Date: Fri, 25 Feb 2000 09:06:55 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> I must have missed something.
> 
> If we don't have this, you can take the destination port, convert to a
> table address, use the sequence number,
> do some calculations and come up with a buffer address and an offset. If
> you want to mess up the layering
> of your stack, they are all things you can do now.

Yes, you can do that, and it isn't that tricky for simple protocols.  For any
bulk data protocol higher than TCP with fixed headers, determining the payload
offset is pretty straightforward for a single transfer per connection.
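
For the fixed-header case, a sketch of the offset calculation might look like
this. It assumes an IPv4/TCP packet held in a bytes object and an
application header of known, constant length at the start of the segment;
payload_offset and app_header_len are illustrative names.

```python
def payload_offset(packet, app_header_len):
    """Return the byte offset of the bulk payload within an IPv4/TCP packet."""
    ihl = (packet[0] & 0x0F) * 4            # IPv4 header length in bytes
    data_off = (packet[ihl + 12] >> 4) * 4  # TCP header length, options included
    return ihl + data_off + app_header_len
```

This only covers a segment that begins with the application header; later
segments of the same transfer would be placed using the sequence number.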

Once you start with multiple transfers per connection, variable headers, or 
many different protocols, it becomes harder to do all this work in silicon.  
The advantage I see in RDMA is giving a generic payload pointer for the NIC to 
separate protocol data and payload.

I wonder whether it would be better to do this at the IP layer to enable it to 
be used for UDP and other protocols as well.

Of course RDMA is only going to help in cases where you have receive bandwidth 
issues, and such a scenario isn't likely to be the case for web/file servers 
or desktop clients.

The scenario that it gives the most benefit for is a "middleman" server that 
needs to do lots of I/O to a network storage device while servicing requests.  
In this case, any network I/O running locally may well be over a different 
protocol like UDP or ST.  (ST however, has no need for RDMA acceleration, as 
it already has a buffer transfer design).

-- 
Zachary Amsden  zamsden@engr.sgi.com  (650) 933-6919  09U-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 16:04:01 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA20639
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 16:04:01 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA18477
	for tcp-impl-outgoing; Fri, 25 Feb 2000 13:21:15 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA18434
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 13:21:12 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA29738; Fri, 25 Feb 2000 13:21:12 -0500 (EST)
Received: from sj-msg-core-2.cisco.com(171.69.43.88) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029681; Fri, 25 Feb 00 13:20:44 -0500
Received: from csapuntz-u1.cisco.com (csapuntz-u1.cisco.com [171.69.199.29])
	by sj-msg-core-2.cisco.com (8.9.3/8.9.1) with ESMTP id KAA07324;
	Fri, 25 Feb 2000 10:21:10 -0800 (PST)
Received: from localhost (csapuntz@localhost) by csapuntz-u1.cisco.com (8.8.8-Cisco List Logging/CISCO.WS.1.2) with ESMTP id KAA19389; Fri, 25 Feb 2000 10:20:43 -0800 (PST)
X-Authentication-Warning: csapuntz-u1.cisco.com: csapuntz owned process doing -bs
Date: Fri, 25 Feb 2000 10:20:42 -0800 (PST)
From: Costa Sapuntzakis <csapuntz@cisco.com>
To: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
cc: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
In-Reply-To: <Roam.SIMC.2.0.6.951433178.19794.nordmark@jurassic>
Message-ID: <Pine.GSO.4.10.10002251010510.19376-100000@csapuntz-u1.cisco.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Erik,

> > The TCP RDMA option reduces the overhead of receiving data over
> > TCP-based protocols such as NFS and HTTP.  
> 
> Do you have any data (simulation, implementation) to back up this claim?
> Or did you mean to say "provides a capability which an implementation can
> use to try to reduce the overhead"?

Thanks for keeping me honest here. I have yet to code up an
implementation and generate numbers. The claims are, thus, perhaps
overstated.

> This seems to be an overstatement as well. Are you saying that an
> implementation that currently has a single copy in its receive path
> (from kernel to user space) can "reduce" the number of copies without
> any hardware acceleration? That would imply that the number of
> copies could be reduced to zero which I have a hard time understanding
> (unless you add hardware acceleration).

The copy that is being eliminated is from the TCP receive buffer to the
buffer cache.

As for the DNS entry problem, I'm a bit baffled since it works
from two external non-Cisco machines for me. I'll be forwarding this to
our internal tech guys. The domain name ftp-eng.cisco.com might
work better.

-Costa




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 16:20:20 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA20973
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 16:20:20 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA22637
	for tcp-impl-outgoing; Fri, 25 Feb 2000 13:46:06 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA22581
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 13:46:04 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA03920; Fri, 25 Feb 2000 13:45:59 -0500 (EST)
Received: from kickme.cisco.com(198.92.30.42) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma003872; Fri, 25 Feb 00 13:45:42 -0500
Received: from csapuntz-u1.cisco.com (csapuntz-u1.cisco.com [171.69.199.29])
	by kickme.cisco.com (8.9.1a/8.9.1) with ESMTP id KAA16466;
	Fri, 25 Feb 2000 10:35:05 -0800 (PST)
Received: from localhost (csapuntz@localhost) by csapuntz-u1.cisco.com (8.8.8-Cisco List Logging/CISCO.WS.1.2) with ESMTP id KAA19437; Fri, 25 Feb 2000 10:45:34 -0800 (PST)
X-Authentication-Warning: csapuntz-u1.cisco.com: csapuntz owned process doing -bs
Date: Fri, 25 Feb 2000 10:45:34 -0800 (PST)
From: Costa Sapuntzakis <csapuntz@cisco.com>
To: "David S. Miller" <davem@redhat.com>
cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
In-Reply-To: <200002242357.PAA16973@pizda.ninka.net>
Message-ID: <Pine.GSO.4.10.10002251025290.19392-100000@csapuntz-u1.cisco.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Hi,

Your input on this is very much appreciated. I'd like to clarify a couple
things, though. Some of these things may not need clarification. :)

The TCP RDMA option is not about accelerating sending of data. It's
about speeding up the receiving of bulk data.

The TCP RDMA option isn't only about accelerating servers.
It's about accelerating the receiver of data. In the case of NFS READ
RPCs, that's the NFS client.

I agree that for specific problem domains, SCSI, NFS, HTTP, people may
want to build specialized server hardware.

Perhaps explaining why I decided to explore this space will help.
Today, you have specialized silicon that, for simple bus protocols
(SCSI parallel interface and ATA), will directly transfer blocks
between the device and the buffer cache. This is not currently done
with TCP, to the best of my knowledge. The best TCP implementations
do zero copy to a TCP receive buffer.

However, in the case of most storage protocols, you don't want
the data in the receive buffer. You want it in the buffer cache, so
there is a copy to the buffer cache.

So, NFS has a CPU overhead hit as compared to optimized storage host bus
adapters. The goal was to eliminate part of this hit, by getting rid of an
extra copy.

Now, this proposal doesn't fix the interrupt overhead problem. 
Optimized FC/SCSI NICs have one interrupt/transfer or less.

-Costa

On Thu, 24 Feb 2000, David S. Miller wrote:

>    Date: Thu, 24 Feb 2000 14:59:38 -0800 (PST)
>    From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
> 
> As an aside I think the RDMA proposal has a lot of holes too.  For
> example, there are in-kernel HTTP accelerators that do the complete
> client header parse and initial packet response in the hw interrupt
> handler.  There are no user buffers involved, and static response
> data is DMA'd directly from the filesystem page cache.
> 
>    > A draft describing the TCP RDMA option can be found at:
>    > ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt
> 
>    There is no DNS entry for ftpeng.cisco.com so I can't access the
>    document.
> 
> Here is what I get:
> 
> ? host -a ftpeng.cisco.com
> Trying null domain
> rcode = 0 (Success), ancount=1
> The following answer is not authoritative:
> The following answer is not verified as authentic by the server:
> ftpeng.cisco.com        84574 IN        CNAME   ftp-eng.cisco.com
> For authoritative answers, see:
> cisco.com       38435 IN        NS      NS1.cisco.com
> cisco.com       38435 IN        NS      NS2.cisco.com
> Additional information:
> NS1.cisco.com   78995 IN        A       192.31.7.92
> NS2.cisco.com   67536 IN        A       192.135.250.69
> rcode = 0 (Success), ancount=1
> The following answer is not authoritative:
> The following answer is not verified as authentic by the server:
> ftp-eng.cisco.com       84574 IN        A       198.92.30.33
> For authoritative answers, see:
> CISCO.com       38435 IN        NS      NS1.CISCO.com
> CISCO.com       38435 IN        NS      NS2.CISCO.com
> Additional information:
> NS1.CISCO.com   78995 IN        A       192.31.7.92
> NS2.CISCO.com   67536 IN        A       192.135.250.69
> 
> 
> 



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 16:22:10 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA21042
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 16:22:10 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA23599
	for tcp-impl-outgoing; Fri, 25 Feb 2000 13:51:18 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA23584
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 13:51:15 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA04845; Fri, 25 Feb 2000 13:51:15 -0500 (EST)
Received: from sj-msg-core-2.cisco.com(171.69.43.88) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma004813; Fri, 25 Feb 00 13:50:33 -0500
Received: from csapuntz-u1.cisco.com (csapuntz-u1.cisco.com [171.69.199.29])
	by sj-msg-core-2.cisco.com (8.9.3/8.9.1) with ESMTP id KAA08982;
	Fri, 25 Feb 2000 10:51:00 -0800 (PST)
Received: from localhost (csapuntz@localhost) by csapuntz-u1.cisco.com (8.8.8-Cisco List Logging/CISCO.WS.1.2) with ESMTP id KAA19441; Fri, 25 Feb 2000 10:50:32 -0800 (PST)
X-Authentication-Warning: csapuntz-u1.cisco.com: csapuntz owned process doing -bs
Date: Fri, 25 Feb 2000 10:50:32 -0800 (PST)
From: Costa Sapuntzakis <csapuntz@cisco.com>
To: "David S. Miller" <davem@redhat.com>
cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
In-Reply-To: <200002250101.RAA17054@pizda.ninka.net>
Message-ID: <Pine.GSO.4.10.10002251047590.19392-100000@csapuntz-u1.cisco.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Hi, 

I don't think it's really a matter of SACK vs. RDMA. I think they are both
useful and somewhat orthogonal.

There is a concern that both RDMA and SACK are large options and there
is only 40 bytes of TCP options space.

Perhaps there is a way of expanding the TCP options space (yet another
TCP option!) that will enable both to co-exist peacefully.
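
To put numbers on the space concern, here is a small back-of-the-envelope
sketch.  The 40-byte limit follows from the 4-bit TCP data offset field, and
the SACK layout (2 bytes of kind/length plus 8 bytes per block) is from
RFC 2018.  The function name is made up for this example, and the size of
any other option (including an RDMA option) is just a parameter here, not a
value taken from the draft.

```python
MAX_TCP_HEADER = 15 * 4      # 4-bit data offset, counted in 32-bit words
MIN_TCP_HEADER = 20          # fixed part of the TCP header
OPTION_SPACE = MAX_TCP_HEADER - MIN_TCP_HEADER   # 40 bytes

def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Number of SACK blocks that fit beside other_option_bytes of options.

    A SACK option costs 2 bytes (kind, length) plus 8 bytes per block.
    """
    free = OPTION_SPACE - other_option_bytes - 2
    return max(free // 8, 0)
```

With no other options this gives 4 blocks, and 3 blocks next to a Timestamps
option padded to 12 bytes, which matches the figures in RFC 2018.  Passing a
larger value for other_option_bytes shows how quickly an additional large
option would squeeze SACK out of the same segment.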

-Costa

On Thu, 24 Feb 2000, David S. Miller wrote:

>    Date: Thu, 24 Feb 2000 17:50:52 -0700
>    From: "Justin T. Gibbs" <gibbs@freebsd.org>
> 
>    The performance impact of RDMA is quite a bit larger than SACK,
> 
> It depends who you are.
> 
> For someone over a satellite link, I think SACK benefits them
> much more than RDMA.
> 
> Later,
> David S. Miller
> davem@redhat.com
> 
> 
> 



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 16:38:27 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA21490
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 16:38:27 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA25973
	for tcp-impl-outgoing; Fri, 25 Feb 2000 14:07:08 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA25930
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 14:07:05 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA06806; Fri, 25 Feb 2000 14:07:00 -0500 (EST)
Message-Id: <200002251907.OAA06806@seraph3.lerc.nasa.gov>
Received: from be.be.com(208.243.144.2) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma006771; Fri, 25 Feb 00 14:06:31 -0500
Received: (qmail 10251 invoked from network); 25 Feb 2000 19:11:51 -0000
Received: from gpz.be.com (10.113.216.32)
  by mail.be.com with SMTP; 25 Feb 2000 19:11:51 -0000
To: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Date: Fri, 25 Feb 2000 11:13:08 PST
From: "Howard Berkey" <howard@be.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Reply-To: howard@be.com
X-Mailer: BeOS Mail
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


It is unclear to me what real-world benefit this REALLY offers over 
existing methods.  If your stack is already jumping through the hoops 
necessary to (a) support zero-copy on receive and (b) do it in a 
robust, secure manner, then RDMA seems to me to be of dubious value and 
an invitation to attack.

I must be missing something.

Howard


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 16:39:07 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA21503
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 16:39:05 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA26139
	for tcp-impl-outgoing; Fri, 25 Feb 2000 14:08:33 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA26104
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 14:08:31 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA07081; Fri, 25 Feb 2000 14:08:30 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma006954; Fri, 25 Feb 00 14:08:05 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12OPyF-00085a-00; Fri, 25 Feb 2000 19:00:11 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: zamsden@cthulhu.engr.sgi.com (Zachary Amsden)
Date: Fri, 25 Feb 2000 19:00:09 +0000 (GMT)
Cc: charlese@cvs.com.au (Charles Esson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
In-Reply-To: <200002251706.JAA73948@clock.engr.sgi.com> from "Zachary Amsden" at Feb 25, 2000 09:06:55 AM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12OPyF-00085a-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> I wonder whether it would be better to do this at the IP layer to enable it to 
> be used for UDP and other protocols as well.

Congratulations. You are probably the first person in the world to find a sane
use for the IPv6 flow id 8)

Alan



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 17:25:44 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA22370
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 17:25:44 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA03808
	for tcp-impl-outgoing; Fri, 25 Feb 2000 14:55:14 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA03782
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 14:55:11 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA14282; Fri, 25 Feb 2000 14:55:11 -0500 (EST)
Received: from deliverator.sgi.com(204.94.214.10) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma014236; Fri, 25 Feb 00 14:54:56 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id LAA07647; Fri, 25 Feb 2000 11:50:12 -0800 (PST)
	mail_from (aman@cthulhu.engr.sgi.com)
Received: from lhotse.engr.sgi.com (lhotse.engr.sgi.com [163.154.35.41])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id LAA90761;
	Fri, 25 Feb 2000 11:54:39 -0800 (PST)
	mail_from (aman@cthulhu.engr.sgi.com)
Received: from engr.sgi.com (localhost [127.0.0.1]) by lhotse.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id LAA29821; Fri, 25 Feb 2000 11:52:53 -0800 (PST)
Message-ID: <38B6DD95.19C9BCAD@engr.sgi.com>
Date: Fri, 25 Feb 2000 11:52:53 -0800
From: Aman Singla <aman@cthulhu.engr.sgi.com>
Organization: SGI
X-Mailer: Mozilla 4.7C-SGI [en] (X11; I; IRIX 6.5 IP32)
X-Accept-Language: en
MIME-Version: 1.0
To: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
CC: Charles Esson <charlese@cvs.com.au>, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References: <200002251706.JAA73948@clock.engr.sgi.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit


> In this case, any network I/O running locally may well be over a different 
> protocol like UDP or ST.  (ST however, has no need for RDMA acceleration, as 
> it already has a buffer transfer design).

That's an excellent point, IMO. If you're going to have to change your
client as well as the server, and add an additional formatted header
(through the TCP RDMA option), why do you want to constrain yourself
to doing it in the context of TCP only?
Why not do it using STP, a protocol that has been designed to do
RDMAs and that addresses various issues which come up when designing RDMA
s/w, firmware, and h/w?
The BDS fileserver using STP does something very similar to what you're
proposing to do with NFS. There's even a prototype doing SCSI using
RDMAs of the very same kind using STP.

thanks,

:a


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 17:30:53 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA22434
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 17:30:53 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA02699
	for tcp-impl-outgoing; Fri, 25 Feb 2000 14:49:14 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA02676
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 14:49:12 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA13353; Fri, 25 Feb 2000 14:49:10 -0500 (EST)
Received: from sgi.sgi.com(192.48.153.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma013309; Fri, 25 Feb 00 14:48:43 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) 
	by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam:
       SGI does not authorize the use of its proprietary
       systems or networks for unsolicited or bulk email
       from the Internet.) 
	via ESMTP id LAA03198; Fri, 25 Feb 2000 11:48:36 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id LAA51275;
	Fri, 25 Feb 2000 11:48:27 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id LAA74680; Fri, 25 Feb 2000 11:52:34 -0800 (PST)
Message-Id: <200002251952.LAA74680@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Costa Sapuntzakis <csapuntz@cisco.com>
Cc: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "Fri, 25 Feb 2000 11:19:50 PST."
             <Pine.GSO.4.10.10002251118280.19392-100000@csapuntz-u1.cisco.com> 
Date: Fri, 25 Feb 2000 11:52:34 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> 
> IP options do not interact well with today's routers. They take
> the packet off of the fast path and dump it into software processing mode.
> Otherwise, the IP options would be right place to do this.
> 
> -Costa

Yes, but are the types of uses we are talking about realistically going to be 
run over routed networks?  It's basically a mechanism for accelerating 
high-bandwidth storage access, which is likely to be a switched network.

Of course, IP options don't play well with the fast path in the TCP/IP stack 
either, but that can be changed much more easily than the router fast path (I 
assume it is in a hardware CAM or something similar).

If the goal is just high-bandwidth storage I/O, there are much lighter ways of 
doing this than using TCP.

-- 
Zachary Amsden  zamsden@engr.sgi.com  3-6919  31-2-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 17:52:17 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA22801
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 17:52:17 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA05473
	for tcp-impl-outgoing; Fri, 25 Feb 2000 15:06:32 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id PAA05423
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 15:06:30 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id PAA15668; Fri, 25 Feb 2000 15:06:27 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma015634; Fri, 25 Feb 00 15:06:24 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12OQyR-0008Cq-00; Fri, 25 Feb 2000 20:04:27 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: csapuntz@cisco.com (Costa Sapuntzakis)
Date: Fri, 25 Feb 2000 20:04:25 +0000 (GMT)
Cc: davem@redhat.com (David S. Miller), ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
In-Reply-To: <Pine.GSO.4.10.10002251025290.19392-100000@csapuntz-u1.cisco.com> from "Costa Sapuntzakis" at Feb 25, 2000 10:45:34 AM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12OQyR-0008Cq-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> However, in the case of most storage protocols, you don't want
> the data in the receive buffer. You want it in the buffer cache, so
> there is a copy to the buffer cache.

That assumes you require a copy to move from the network layer to the
buffer/page cache. Is that a warranted assumption? I think not.

> Now, this proposal doesn't fix the interrupt overhead problem. 
> Optimized FC/SCSI NICs have one interrupt/transfer or less.

Interrupt overhead is old hat. Mitigation schemes are well understood.

Alan



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 18:06:16 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA22955
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 18:06:16 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id PAA06954
	for tcp-impl-outgoing; Fri, 25 Feb 2000 15:17:01 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id PAA06924
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 15:16:59 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id PAA17011; Fri, 25 Feb 2000 15:16:58 -0500 (EST)
Received: from atlrel2.hp.com(156.153.255.202) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma016994; Fri, 25 Feb 00 15:16:46 -0500
Received: from hpindlm.cup.hp.com (hpindlm.cup.hp.com [15.13.95.89])
	by atlrel2.hp.com (Postfix) with ESMTP
	id B2788618; Fri, 25 Feb 2000 15:17:02 -0500 (EST)
Received: from mk731912 (h0100cmk.atl.hp.com [15.113.172.114]) by hpindlm.cup.hp.com with ESMTP (8.7.6/8.7.3 TIS 5.0.1) id MAA24980; Fri, 25 Feb 2000 12:23:28 -0800 (PST)
Message-Id: <4.2.2.20000225115214.00b73280@hpindlm.cup.hp.com>
X-Sender: krause@hpindlm.cup.hp.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.2.2 
Date: Fri, 25 Feb 2000 12:03:38 -0800
To: "David S. Miller" <davem@redhat.com>, gibbs@freebsd.org
From: Michael Krause <krause@cup.hp.com>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Cc: Erik.Nordmark@Eng.Sun.COM, csapuntz@cisco.com, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov, gibbs@freebsd.org, zaitcev@metabyte.com,
        drich@fjst.com
In-Reply-To: <200002250010.QAA16997@pizda.ninka.net>
References: <200002250009.RAA01066@caspian.plutotech.com>
 <200002250009.RAA01066@caspian.plutotech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 04:10 PM 2/24/00 -0800, David S. Miller wrote:
>    Date: Thu, 24 Feb 2000 17:09:51 -0700
>    From: "Justin T. Gibbs" <gibbs@freebsd.org>
>
>    In the case of a server response, RDMA benefits the client, not the
>    server, so I fail to see why your example is problematic.  Zero
>    copy send is not what this standard addresses.
>
>With client memory bus bandwidth in the multi-gigabyte per second
>range, who needs to avoid the single copy?  How much NFS and web
>surfing does one need to do before this would really come into
>play?
>
>And the bus speeds will just be faster by the time something like
>this could be deployed widely.

It ain't free and there are plenty of reasons to avoid copying data since 
many applications do not always touch the data being moved.  Also, http and 
web traffic are probably not the best examples to illustrate the value of 
RDMA Read and Write operations.  Storage, which was the context that started 
this effort, certainly benefits from a combination of send for 
control messages and RDMA for data movement - it doesn't matter whether this is 
in user-space / kernel-space, client or server.  Those buffers may be only 
used by the server to reflect a data set to a set of users without ever 
touching the buffers themselves.  Also, one could use this technology with 
storage devices to bypass the server and send data to one or more NICs for 
remote access - RDMA is still quite good for this type of operation and 
does not involve touching the data.

Mike



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 21:09:19 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA24230
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 21:09:18 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id RAA26493
	for tcp-impl-outgoing; Fri, 25 Feb 2000 17:51:04 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id RAA26463
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 17:51:01 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id RAA07886; Fri, 25 Feb 2000 17:51:01 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma007852; Fri, 25 Feb 00 17:50:56 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id PAA15085
	env-from <vjs>;
	Fri, 25 Feb 2000 15:50:49 -0700 (MST)
Date: Fri, 25 Feb 2000 15:50:49 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002252250.PAA15085@calcite.rhyolite.com>
To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Costa Sapuntzakis <csapuntz@cisco.com>

> ...
> Today, you have specialized silicon that, for simple bus protocols
> (SCSI parallel interface and ATA), will directly transfer blocks
> between the device and the buffer cache. This is not currently done
> with TCP, to the best of my knowledge. ...

It might be good to investigate the history of Protocol Engines Inc.,
including its goals, the reasons for its failure as a business, and what
it achieved technically.  A skewed history might be:
 1. founded to make silicon for XTP, a nominally faster protocol than TCP.
 2. when XTP protocol and the XTP chips got bogged down, shifted to making
    chips to help TCP go wire speed over FDDI.
 3. other people made TCP go wire speed over FDDI without any special
    silicon or new protocols.  That took some wind out of XTP's sails,
    and tore the sails driving PEI's TCP accelerator chips.
 4. standard standards committee problems with XTP didn't help PEI's other
    sails.

If you ask me, SCSI/IP and RDMA have striking parallels to #1 and #2. 
I bet you'll meet parallels to #3 before any real deployment.  You've
started to see #4 in some of the suggested improvements to RDMA today.
It's not that the suggestions are not good ideas.  The problem is that
committees cannot say no to good ideas, while the one thing that matters
above all in any design task is saying no to almost everything.

Protocol Engines and XTP were based on the unexamined assumption that TCP
is very difficult to implement and an unavoidably slow protocol.  Most
people just knew those "facts" 15 years ago.  I think RDMA suffers a
similar problem.  Instead of starting by assuming that a new protocol is
needed for a new goal, if you actually look within the existing boundaries,
you'll often find a solution.  Often the inside solution is better than
any possible extension of the protocol.  Protocol extensions require more
bandwidth and more processing on both sender and receiver.  They also have
problems gaining enough marketshare to survive.

Please don't misunderstand me.  Greg didn't include my name among
the authors on one of the XTP specs because I said XTP was a stupid
idea.  I still like lots of XTP.  I also think that many of the
XTP ideas can be *and have been* applied to TCP implementations.


> However, in the case of most storage protocols, you don't want
> the data in the receive buffer. You want it in the buffer cache, so
> there is a copy to the buffer cache.

Which NFS implementation written in the last 10 or at least 5 years and
intended to be fast doesn't move data between the buffer cache near the
disk and the buffer cache near the application with zero (0) copies?
Page flipping to and from buffer caches is especially easy, because
buffer caches tend to be page aligned, and file systems like to move
data in page-sized or larger chunks.


> So, NFS has a  CPU overhead hit as compared to optimized storage host bus
> adapters. The goal was to eliminate part of this hit, by getting rid of an
> extra copy.

How can you have fewer than zero copies?

> Now, this proposal doesn't fix the interrupt overhead problem. 
> Optimized FC/SCSI NICs have one interrupt/transfer or less.

Interrupts are killers, and so for the last 5 or 10 years, a competitive
NFS system has had about 0.1 interrupts per packet.  The trick is not
reducing the ratio of interrupts/packet, but reducing it only so far that
things don't slow down, and increasing the ratio when the total system
(client & server) moves into a regime that requires more interrupts.
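
One common way to get a load-dependent ratio like that is interrupt
coalescing with both a frame count and a hold-off timer.  The toy simulation
below is only a generic sketch of that idea, not a description of any
particular driver or NFS system.  The function name, the thresholds, and the
microsecond time unit are all assumptions made for the example.

```python
def count_interrupts(arrivals_us, max_frames=10, holdoff_us=1000):
    """Count interrupts for a sorted list of packet arrival times (in us).

    An interrupt fires when max_frames packets are pending, or when
    holdoff_us has passed since the first pending packet arrived.
    """
    interrupts = 0
    pending = 0
    first = None
    for t in arrivals_us:
        if pending and t - first >= holdoff_us:
            interrupts += 1      # timer expired before this packet arrived
            pending = 0
        if pending == 0:
            first = t            # this packet starts a new batch
        pending += 1
        if pending >= max_frames:
            interrupts += 1      # batch is full
            pending = 0
    if pending:
        interrupts += 1          # the timer eventually flushes the tail
    return interrupts
```

Feeding it 100 packets spaced 10 us apart yields 10 interrupts (0.1 per
packet), while 5 packets spaced 10 ms apart yield 5 interrupts, so a lightly
loaded link keeps its low latency and a busy one batches.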


] From: Michael Krause <krause@cup.hp.com>

] It ain't free and there are plenty of reasons to avoid copying data since 
] ...
] touching the buffers themselves.  Also, one could use this technology with 
] storage devices to bypass the server and send data to one or more NICs for 
] remote access - RDMA is still quite good for this type of operation and 
] does not involve touching the data.

There are other, much easier ways to separate data and control
information in the receiver than being forced to parse optional
new bits in TCP or IP headers.

For 10 years, network interfaces in commercial UNIX systems have been
putting the headers (including RPC/XDR) of incoming NFS traffic in one
place (a "small mbuf") and the data in another place (the buffer cache)
without extra copies, and without parsing any headers, not to mention
new header bits with the nasty problems of TCP or IP options.
And this despite the fact that the length of the RPC/XDR stuff is somewhere
between variable (recall the NFS group list) and hard to predict.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 21:23:35 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA24239
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 21:09:22 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id RAA24014
	for tcp-impl-outgoing; Fri, 25 Feb 2000 17:20:19 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id RAA23989
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 17:20:16 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id RAA04537; Fri, 25 Feb 2000 17:20:15 -0500 (EST)
Received: from sabre.sjf.novell.com(130.57.86.42) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma004460; Fri, 25 Feb 00 17:19:36 -0500
Received: (from mahdavi@localhost)
	by sabre.sjf.novell.com (8.9.3/8.9.3) id OAA09485;
	Fri, 25 Feb 2000 14:19:11 -0800
Reply-To: mahdavi@novell.com
To: Costa Sapuntzakis <csapuntz@cisco.com>
Cc: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References: <Pine.GSO.4.10.10002251047590.19392-100000@csapuntz-u1.cisco.com>
From: Jamshid Mahdavi <mahdavi@novell.com>
Date: 25 Feb 2000 14:19:11 -0800
In-Reply-To: Costa Sapuntzakis's message of "Fri, 25 Feb 2000 10:50:32 -0800 (PST)"
Message-ID: <yu8xln49ylgg.fsf@sabre.sjf.novell.com>
Lines: 18
X-Mailer: Gnus v5.7/Emacs 20.4
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Regarding RDMA and SACK, I don't know if there is exactly competition
for options space, since SACK primarily appears on acks, and by my quick
reading RDMA only appears on data.

(Of course, you'd still have to worry about bidirectional simultaneous
bulk data transfers, but I don't think anyone really does this.  Most
bidirectional apps are request/response, which would have little clash
between SACK and RDMA).
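
To put rough numbers on the options-space question, here is a minimal sketch
of the arithmetic.  It relies on the 40-byte limit on TCP options and on the
RFC 2018 SACK layout (2 bytes of kind and length plus 8 bytes per block).
The size of the RDMA option is not given in this thread, so it is left as a
parameter; max_sack_blocks and other_opt_len are names made up for this
illustration.

```c
#include <assert.h>

enum {
    TCP_MAX_OPT_LEN = 40, /* 60-byte maximum TCP header minus the 20-byte fixed part */
    SACK_FIXED_LEN  = 2,  /* option kind + option length */
    SACK_BLOCK_LEN  = 8   /* left edge + right edge, 4 bytes each */
};

/*
 * Number of SACK blocks that still fit in a segment whose other options
 * (timestamps, a hypothetical RDMA option, padding) already occupy
 * other_opt_len bytes.
 */
int max_sack_blocks(int other_opt_len)
{
    int room;

    assert(other_opt_len >= 0 && other_opt_len <= TCP_MAX_OPT_LEN);
    room = TCP_MAX_OPT_LEN - other_opt_len - SACK_FIXED_LEN;
    return room < SACK_BLOCK_LEN ? 0 : room / SACK_BLOCK_LEN;
}
```

With no other options this gives 4 blocks; with a padded timestamp option
(12 bytes) it gives 3.  Any additional option carried on the same segment
reduces the count further.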

That being said, I'm inclined to agree with most of the other people
who question the necessity of RDMA, for most of the other reasons
which have already been stated.

--J





From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 21:34:20 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA24469
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 21:34:20 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id SAA00522
	for tcp-impl-outgoing; Fri, 25 Feb 2000 18:57:47 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id SAA00517
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 18:57:46 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id SAA14026; Fri, 25 Feb 2000 18:57:46 -0500 (EST)
Received: from deliverator.sgi.com(204.94.214.10) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma014021; Fri, 25 Feb 00 18:57:15 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id PAA12513; Fri, 25 Feb 2000 15:52:41 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id PAA87314;
	Fri, 25 Feb 2000 15:57:12 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id QAA75358; Fri, 25 Feb 2000 16:01:20 -0800 (PST)
Message-Id: <200002260001.QAA75358@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: mahdavi@novell.com
Cc: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "25 Feb 2000 14:19:11 PST."
             <yu8xln49ylgg.fsf@sabre.sjf.novell.com> 
Date: Fri, 25 Feb 2000 16:01:20 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> 
> Regarding RDMA and SACK, I don't know if there is exactly competition
> for options space, since SACK primarily appears on acks, and my quick
> reading of RDMA only appears on data.

Not only that, the presence of multiple SACK options means you have a serious
traffic problem, and in that case I can't see RDMA helping your performance
tremendously.

-- 
Zachary Amsden  zamsden@engr.sgi.com  3-6919  31-2-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 22:46:57 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA26850
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 22:46:57 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA03394
	for tcp-impl-outgoing; Fri, 25 Feb 2000 19:49:34 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA03378
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 19:49:32 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA18801; Fri, 25 Feb 2000 19:49:32 -0500 (EST)
Received: from mercury.sun.com(192.9.25.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma018706; Fri, 25 Feb 00 19:49:18 -0500
Received: from sunmail1.Sun.COM ([129.145.1.2])
	by mercury.Sun.COM (8.9.3+Sun/8.9.3) with ESMTP id QAA10050
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 16:49:17 -0800 (PST)
Received: from jurassic.eng.sun.com (jurassic.Eng.Sun.COM [129.146.88.31])
	by sunmail1.Sun.COM (8.9.1b+Sun/8.9.1/ENSMAIL,v1.6.1-sunmail1) with ESMTP id QAA16942;
	Fri, 25 Feb 2000 16:49:17 -0800 (PST)
Received: from bobo (bobo.Eng.Sun.COM [129.146.86.130])
	by jurassic.eng.sun.com (8.9.3+Sun/8.9.3) with SMTP id QAA16151;
	Fri, 25 Feb 2000 16:49:16 -0800 (PST)
Date: Fri, 25 Feb 2000 16:49:16 -0800 (PST)
From: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
Reply-To: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: Costa Sapuntzakis <csapuntz@cisco.com>
Cc: Erik Nordmark <Erik.Nordmark@Eng.Sun.COM>, tcp-impl@grc.nasa.gov
In-Reply-To: "Your message with ID" <Pine.GSO.4.10.10002251010510.19376-100000@csapuntz-u1.cisco.com>
Message-ID: <Roam.SIMC.2.0.6.951526156.18298.nordmark@jurassic>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> The copy that is being eliminated is from the TCP receive buffer to the
> buffer cache.

Most well-performing implementations (including the BSD 4.2 networking code from
1980-something) just do a single copy from the kernel buffer to the
application's buffer in user space (assuming the NIC does DMA, i.e. the device
driver doesn't copy from a buffer on the NIC to memory).
In that case you are not saving any copies. Thus the only potential performance
benefit is to more easily place the DMA'ed data in a "good" memory location.
But a NIC that parses protocol headers could accomplish the same thing without
requiring the sender to include any new TCP options.

> As for the DNS entry problem, I'm a bit baffled since it works
> from two external non-Cisco machines for me. I'll be forwarding this to
> our internal tech guys. The domain name ftp-eng.cisco.com might
> work better.

It must have been some transient with the http proxies around here.
Sorry for the false alarm.

  Erik



From owner-tcp-impl@lerc.nasa.gov  Fri Feb 25 23:17:55 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA27124
	for <tcpimpl-archive@odin.ietf.org>; Fri, 25 Feb 2000 23:17:55 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA06964
	for tcp-impl-outgoing; Fri, 25 Feb 2000 20:49:35 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA06954
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 20:49:33 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA24555; Fri, 25 Feb 2000 20:49:32 -0500 (EST)
Received: from ibn-host12.ironbridgenetworks.com(146.115.140.12) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma024468; Fri, 25 Feb 00 20:49:03 -0500
Received: (from news@localhost)
	by ironbridgenetworks.com (8.9.3/8.9.3) id UAA10929
	for tcp-impl@grc.nasa.gov; Fri, 25 Feb 2000 20:49:03 -0500 (EST)
To: tcp-impl@grc.nasa.gov
From: James Carlson <carlson@ironbridgenetworks.com>
Newsgroups: lists.ietf.tcp-impl
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Date: 25 Feb 2000 20:49:02 -0500
Organization: IronBridge Networks
Lines: 51
Message-ID: <86d7pkka29.fsf@ironbridgenetworks.com>
References: <Pine.GSO.4.10.10002251025290.19392-100000@csapuntz-u1.cisco.com>
NNTP-Posting-Host: helios.ibnets.com
X-Newsreader: Gnus v5.5/Emacs 20.3
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

csapuntz@cisco.com (Costa Sapuntzakis) writes:
> The TCP RDMA option is not about accelerating sending of data. It's
> about speeding up the receiving of bulk data.
> 
> The TCP RDMA option isn't proposing to only about accelerating servers.
> It's about accelerating the receiver of data. In the case of NFS READ
> RPCs, that's the NFS client.

If you want to go fast on NFS client (and server!) operations, why not
use the default UDP mode of operation?  You just set your receive
offset to PAGE_SIZE-sizeof(READ3resok)-28-sizeof(layer_two), and
you're all set to page flip.  If you want to get fancy, you build this
offsetting magic into the device DMA.
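
A minimal sketch of that offset arithmetic, assuming the header lengths are
known in advance.  The 28 is a 20-byte IPv4 header without options plus the
8-byte UDP header.  reply_hdr_len stands in for sizeof(READ3resok) and
link_hdr_len for sizeof(layer_two); both names, and rx_offset itself, are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* IPv4 header without options (20) plus UDP header (8). */
enum { IP_UDP_HDR_LEN = 28 };

/*
 * Offset into a page-aligned receive buffer at which the frame should
 * start so that the NFS READ data begins exactly on the next page
 * boundary.  Assumes all headers together fit in one page.
 */
size_t rx_offset(size_t page_size, size_t reply_hdr_len, size_t link_hdr_len)
{
    size_t hdrs = reply_hdr_len + IP_UDP_HDR_LEN + link_hdr_len;

    assert(hdrs <= page_size);
    return page_size - hdrs;
}
```

For example, with a 4096-byte page, a 14-byte link header and an assumed
100 bytes of reply header, the frame would start at offset 3954 and the data
at offset 4096.  The sketch only holds while the header lengths stay fixed.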

As for TCP with arbitrary (unspecified) applications, it would make
more sense to create a new "map in TCP segment" system call than to
use RDMA.  Have this new call map in (read-only) raw system buffers to
the user's address space containing the received in-order data (with
packet headers and other nonsense), and copy out alongside them a struct uio
array pointing to the actual data the application should be examining.
That gets you a zero data copy interface from driver to application
without having to modify any peers.

(I hacked a copy of 4.4BSD to do zero data copy on transmit as well
for an embedded system we're using here.  It was rather easy to do --
I just replaced m_copydata with a new "m_refcopy" function that
created read-only reference copies of the buffers for retransmit
purposes.  TCP ack handling and m_free/m_prepend needed minor tweaks
to get the reference counter logic right.)
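
The reference-copy idea can be sketched in user space roughly as follows.
This is not the 4.4BSD mbuf code; struct rbuf and the rbuf_* functions are
hypothetical stand-ins, and the counter is not safe for concurrent use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a data buffer shared by reference. */
struct rbuf {
    unsigned char *data;
    size_t len;
    int refcnt;
};

/* Allocate a buffer holding one reference; returns NULL on failure. */
struct rbuf *rbuf_alloc(size_t len)
{
    struct rbuf *b = malloc(sizeof *b);

    if (b == NULL)
        return NULL;
    b->data = malloc(len ? len : 1);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->len = len;
    b->refcnt = 1;
    return b;
}

/* "Copy" for retransmission: share the data instead of duplicating it. */
struct rbuf *rbuf_refcopy(struct rbuf *b)
{
    assert(b != NULL && b->refcnt > 0);
    b->refcnt++;
    return b;
}

/* Drop one reference; returns 1 if the storage was actually released. */
int rbuf_free(struct rbuf *b)
{
    assert(b != NULL && b->refcnt > 0);
    if (--b->refcnt > 0)
        return 0;
    free(b->data);
    free(b);
    return 1;
}
```

The retransmit queue and the driver each hold one reference, and the data is
released only when the ACK handling and the driver have both dropped theirs.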

> So, NFS has a  CPU overhead hit as compared to optimized storage host bus
> adapters. The goal was to eliminate part of this hit, by getting rid of an
> extra copy.

That of course depends strongly on system architecture.  In any event,
RDMA doesn't appear to help.  You could just as easily create a nifty
Ethernet driver that copies only (say) the first 54 bytes of each frame
into main memory and leaves the rest in the queue until it determines
the exact destination for the payload (or hits an exception case that
requires harder work).  This would eliminate the extra memory copy and
result in perfectly aligned receive data.  Such "fast path" code is
quite common in well-designed systems.
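
A rough sketch of that split, done here in software on a frame already in
memory.  The 54 is taken to be a 14-byte Ethernet header plus 20-byte IPv4
and 20-byte TCP headers without options.  struct split_frame and
split_first_54 are hypothetical names; a real driver would do this with DMA
descriptors rather than memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { FAST_HDR_LEN = 54 }; /* 14 Ethernet + 20 IPv4 + 20 TCP, no options */

struct split_frame {
    unsigned char hdr[FAST_HDR_LEN]; /* copied eagerly so it can be inspected */
    size_t hdr_len;                  /* bytes actually copied into hdr */
    const unsigned char *payload;    /* left in place, not yet copied */
    size_t payload_len;
};

/* Copy only the leading header bytes; leave the rest where it is. */
void split_first_54(const unsigned char *frame, size_t len,
                    struct split_frame *out)
{
    assert(frame != NULL && out != NULL);
    out->hdr_len = len < FAST_HDR_LEN ? len : FAST_HDR_LEN;
    memcpy(out->hdr, frame, out->hdr_len);
    out->payload = frame + out->hdr_len;
    out->payload_len = len - out->hdr_len;
}
```

Once the headers have been examined, the payload pointer can be handed to
whatever decides the final destination.  Frames carrying IP or TCP options
would be one of the exception cases.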

I hope describing the 54-bytes-first algorithm here stops anyone from
patenting the idea.  ;-}

-- 
James Carlson, System Architect                     <carlson@ibnets.com>
IronBridge Networks / 55 Hayden Avenue   71.246W   Vox:  +1 781 372 8132
Lexington MA  02421-7996 / USA           42.423N   Fax:  +1 781 372 8090
"PPP Design and Debugging" --- http://people.ne.mediaone.net/carlson/ppp


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 00:15:28 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA28035
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 00:15:28 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id VAA10056
	for tcp-impl-outgoing; Fri, 25 Feb 2000 21:51:48 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id VAA10044
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 21:51:47 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id VAA29848; Fri, 25 Feb 2000 21:51:47 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029838; Fri, 25 Feb 00 21:51:34 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12OXJB-0000Jp-00; Sat, 26 Feb 2000 02:50:17 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: carlson@ironbridgenetworks.com (James Carlson)
Date: Sat, 26 Feb 2000 02:50:15 +0000 (GMT)
Cc: tcp-impl@grc.nasa.gov
In-Reply-To: <86d7pkka29.fsf@ironbridgenetworks.com> from "James Carlson" at Feb 25, 2000 08:49:02 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12OXJB-0000Jp-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> use RDMA.  Have this new call map in (read-only) raw system buffers to
> the user's address space containing the received in-order data (with
> packet headers and other nonsense), and copy out along a struct uio
> array pointing to the actual data the application should be examining.
> That gets you a zero data copy interface from driver to application
> without having to modify any peers.

I was thinking something along the lines of

	recvmsgiov(fd, iovec, etc...)

the difference being the kernel gets to fill in the iovec..
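
To illustrate what "the kernel fills in the iovec" could look like, here is a
user-space sketch.  No recvmsgiov call exists; struct rx_seg and fill_iov are
hypothetical and only model the bookkeeping, using the standard struct iovec
from <sys/uio.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical descriptor for one received, in-order segment. */
struct rx_seg {
    unsigned char *base; /* start of the raw buffer, headers included */
    size_t hdr_len;      /* bytes of headers preceding the data */
    size_t data_len;     /* bytes of application data */
};

/*
 * Point each iovec entry at the data portion of a segment without copying
 * anything.  Returns the number of entries written (at most iovcnt).
 */
int fill_iov(const struct rx_seg *segs, int nsegs,
             struct iovec *iov, int iovcnt)
{
    int n = 0;

    assert(segs != NULL && iov != NULL);
    for (int i = 0; i < nsegs && n < iovcnt; i++) {
        if (segs[i].data_len == 0)
            continue; /* nothing to deliver, e.g. a pure ACK */
        iov[n].iov_base = segs[i].base + segs[i].hdr_len;
        iov[n].iov_len = segs[i].data_len;
        n++;
    }
    return n;
}
```

The application would then walk the returned entries instead of supplying
its own buffers, which is the reverse of how readv and recvmsg use an iovec.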



From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 00:55:04 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA28422
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 00:55:04 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id WAA11391
	for tcp-impl-outgoing; Fri, 25 Feb 2000 22:21:49 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id WAA11382
	for <tcp-impl@grc.nasa.gov>; Fri, 25 Feb 2000 22:21:48 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id WAA02405; Fri, 25 Feb 2000 22:21:47 -0500 (EST)
Received: from ibn-host12.ironbridgenetworks.com(146.115.140.12) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma002391; Fri, 25 Feb 00 22:21:07 -0500
Received: (from carlson@localhost)
	by ironbridgenetworks.com (8.9.3/8.9.3) id WAA02821;
	Fri, 25 Feb 2000 22:21:01 -0500 (EST)
Date: Fri, 25 Feb 2000 22:21:01 -0500 (EST)
Message-Id: <200002260321.WAA02821@ironbridgenetworks.com>
X-Authentication-Warning: helios.helios: carlson set sender to carlson@ironbridgenetworks.com using -f
From: James Carlson <carlson@ironbridgenetworks.com>
To: alan@lxorguk.ukuu.org.uk
CC: tcp-impl@grc.nasa.gov
In-reply-to: <E12OXJB-0000Jp-00@the-village.bc.nu> (message from Alan Cox on
	Sat, 26 Feb 2000 02:50:15 +0000 (GMT))
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References:  <E12OXJB-0000Jp-00@the-village.bc.nu>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> I was thinking something along the lines of
> 
> 	recvmsgiov(fd, iovec, etc...)
> 
> the difference being the kernel gets to fill in the iovec..

Yep.  That's what I was thinking.  You just said it better.  ;-}

-- 
James Carlson, System Architect                     <carlson@ibnets.com>
IronBridge Networks / 55 Hayden Avenue   71.246W   Vox:  +1 781 372 8132
Lexington MA  02421-7996 / USA           42.423N   Fax:  +1 781 372 8090
"PPP Design and Debugging" --- http://people.ne.mediaone.net/carlson/ppp


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 15:46:31 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA18506
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 15:46:31 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA14078
	for tcp-impl-outgoing; Sat, 26 Feb 2000 13:01:35 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA14074
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 13:01:34 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA29884; Sat, 26 Feb 2000 13:01:34 -0500 (EST)
Received: from prue.eim.surrey.ac.uk(131.227.76.5) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029878; Sat, 26 Feb 00 13:01:17 -0500
Received: from petra.ee.surrey.ac.uk ([131.227.88.13] ident=eep1lw)
	by prue.eim.surrey.ac.uk with esmtp (Exim 3.03 #1)
	id 12OlWi-00057z-00; Sat, 26 Feb 2000 18:01:12 +0000
Date: Sat, 26 Feb 2000 18:01:09 +0000 (GMT)
From: Lloyd Wood <l.wood@eim.surrey.ac.uk>
X-Sender: eep1lw@petra.ee.surrey.ac.uk
Reply-To: L.Wood@eim.surrey.ac.uk
To: Jamshid Mahdavi <mahdavi@novell.com>
cc: Costa Sapuntzakis <csapuntz@cisco.com>, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
In-Reply-To: <yu8xln49ylgg.fsf@sabre.sjf.novell.com>
Message-ID: <Pine.GSO.4.21.0002261800160.12668-100000@petra.ee.surrey.ac.uk>
Organization: speaking for none
X-url: http://www.ee.surrey.ac.uk/Personal/L.Wood/
X-no-archive: yes
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On 25 Feb 2000, Jamshid Mahdavi wrote:

> Regarding RDMA and SACK, I don't know if there is exactly competition
> for options space, since SACK primarily appears on acks, and my quick
> reading of RDMA only appears on data.
> 
> (Of course, you'd still have to worry about bidirectional simultaneous
> bulk data transfers, but I don't think anyone really does this.

Two words: delayed acks.

L.

<L.Wood@surrey.ac.uk>PGP<http://www.ee.surrey.ac.uk/Personal/L.Wood/>



From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 15:46:33 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA18517
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 15:46:32 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id NAA14064
	for tcp-impl-outgoing; Sat, 26 Feb 2000 13:00:50 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id NAA14057
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 13:00:49 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id NAA29849; Sat, 26 Feb 2000 13:00:49 -0500 (EST)
Received: from prue.eim.surrey.ac.uk(131.227.76.5) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma029845; Sat, 26 Feb 00 13:00:07 -0500
Received: from petra.ee.surrey.ac.uk ([131.227.88.13] ident=eep1lw)
	by prue.eim.surrey.ac.uk with esmtp (Exim 3.03 #1)
	id 12OlVT-00056K-00; Sat, 26 Feb 2000 17:59:55 +0000
Date: Sat, 26 Feb 2000 17:59:51 +0000 (GMT)
From: Lloyd Wood <l.wood@eim.surrey.ac.uk>
X-Sender: eep1lw@petra.ee.surrey.ac.uk
Reply-To: L.Wood@eim.surrey.ac.uk
To: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
cc: mahdavi@novell.com, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
In-Reply-To: <200002260001.QAA75358@clock.engr.sgi.com>
Message-ID: <Pine.GSO.4.21.0002261755380.12668-100000@petra.ee.surrey.ac.uk>
Organization: speaking for none
X-url: http://www.ee.surrey.ac.uk/Personal/L.Wood/
X-no-archive: yes
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

On Fri, 25 Feb 2000, Zachary Amsden wrote:

> > Regarding RDMA and SACK, I don't know if there is exactly competition
> > for options space, since SACK primarily appears on acks, and my quick
> > reading of RDMA only appears on data.
> 
> Not only that, the presence of multiple SACK options means you have a serious 
> traffic problem,

Not necessarily.

'Packet Reordering is not Pathological Network Behavior'
 Bennett, Partridge, Shectman, IEEE/ACM Transactions on Networking,
 December 1999, (7):6, pp. 789-798

See also:
http://www.ir.bbn.com/%7Eshectman/metrics.html

Multiple SACK options do not necessarily equate to multiple losses.

L.

> and in that case I can't see RDMA helping your performance 
> tremendously in that case.
> 
> -- 
> Zachary Amsden  zamsden@engr.sgi.com  3-6919  31-2-510  Core Protocols

<L.Wood@surrey.ac.uk>PGP<http://www.ee.surrey.ac.uk/Personal/L.Wood/>



From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 17:08:29 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA19109
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 17:08:28 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA16388
	for tcp-impl-outgoing; Sat, 26 Feb 2000 14:03:06 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA16321
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 14:03:04 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA04547; Sat, 26 Feb 2000 14:03:04 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma004520; Sat, 26 Feb 00 14:02:36 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id MAA03209
	for tcp-impl@grc.nasa.gov  env-from <vjs>;
	Sat, 26 Feb 2000 12:02:31 -0700 (MST)
Date: Sat, 26 Feb 2000 12:02:31 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002261902.MAA03209@calcite.rhyolite.com>
To: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: James Carlson <carlson@ironbridgenetworks.com>

> ...
> > The TCP RDMA option isn't proposing to only about accelerating servers.
> > It's about accelerating the receiver of data. In the case of NFS READ
> > RPCs, that's the NFS client.
>
> If you want to go fast on NFS client (and server!) operations, why not
> use the default UDP mode of operation?  You just set your receive
> offset to PAGE_SIZE-sizeof(READ3resok)-28-sizeof(layer_two), and
> you're all set to page flip.  If you want to get fancy, you build this
> offsetting magic into the device DMA.

There's a simpler and much better way, motivated by the observation that
trailers are rare in IETF application and transport protocols.  Put the
first packetsize(mod pagesize) bytes in one queue or buffer and the
remaining multiple of pagesize bytes in another queue or buffer.  Notice
that this trick automatically covers variable sized TCP/IP headers due to
options.  Because the typical NFS/UDP/IP header contains 4(mod 8) bytes
and IP fragments must be 0(mod 8), another more subtle kludge of a trick
is needed if your MTU is less than 8K+about 200 and your NFS readsize is
a multiple of page sizes larger than 4K.
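
The basic modulo split can be sketched as below, assuming the data is a
whole number of pages and nothing follows it in the packet.  struct rx_split
and split_mod_pagesize are names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct rx_split {
    size_t head_len; /* first packetsize (mod pagesize) bytes: small buffer */
    size_t body_len; /* remaining whole pages: candidates for page flipping */
};

/* Split a received packet length into a header part and whole pages. */
struct rx_split split_mod_pagesize(size_t packet_len, size_t page_size)
{
    struct rx_split s;

    assert(page_size > 0);
    s.head_len = packet_len % page_size;
    s.body_len = packet_len - s.head_len;
    return s;
}
```

A packet carrying 8192 bytes of data behind 142 bytes of headers splits into
142 and 8192 with a 4096-byte page, whatever mix of options made up the 142.
A packet shorter than a page lands entirely in the first buffer.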


> ...
> That of course depends strongly on system architecture.  In any event,
> RDMA doesn't appear to help.  You could just as easily create a nifty
> Ethernet driver that copies only (say) the first 54 bytes of memory
> into main memory and leaves the rest in the queue until it determines
> the exact destination for the payload (or hits an exception case that
> requires harder work). ...

> I hope describing the 54-bytes-first algorithm here stops anyone from
> patenting the idea.  ;-}

That's been shipped commercially for many years, as has the modulo pagesize
trick.  Given the number of customers that bought proprietary UNIX source,
there should be no problem proving prior art.  Commercial FDDI chipsets
such as Motorola's CAMEL have supported the 54-bytes-first idea for more
than 5 years.  The CAMEL can be told to set aside the first 64 bytes.
That covers the usual 3 bytes of padding that does not appear on the wire,
the 13 byte MAC header, the 8-byte LLC header, and 40 bytes of TCP/IP
headers without any options.  The CAMEL can even put the pair of streams
of headers and data into different ring buffers.  Unfortunately, we all
know that the TCP/IP headers no longer have a fixed length of 40.
That limits the utility of the idea.
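
The 64 is 3 + 13 + 8 + 40.  The sketch below shows why the 40 cannot be
relied on: the real TCP/IPv4 header length has to be read from the IHL and
data offset fields.  tcpip_hdr_len is a hypothetical helper and assumes the
caller already knows the buffer holds an unfragmented IPv4 packet carrying TCP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Combined IPv4 + TCP header length of the packet starting at ip[0].
 * Returns 0 if the buffer is too short or a length field is invalid.
 * Does not check the IP version or protocol fields.
 */
size_t tcpip_hdr_len(const uint8_t *ip, size_t len)
{
    size_t ihl, doff;

    assert(ip != NULL);
    if (len < 20)
        return 0;
    ihl = (size_t)(ip[0] & 0x0f) * 4;       /* IHL: low nibble of byte 0, in 32-bit words */
    if (ihl < 20 || len < ihl + 20)
        return 0;
    doff = (size_t)(ip[ihl + 12] >> 4) * 4; /* data offset: high nibble of TCP byte 12 */
    if (doff < 20 || len < ihl + doff)
        return 0;
    return ihl + doff;
}
```

Without options the result is 40.  With 12 bytes of TCP options, such as a
padded timestamp, it is 52, and a fixed 64-byte split would then cut into the
data.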

The problem with page flipping for the last 6 or 7 years has not been
teaching your DMA hardware to do it with the information already in the
bit streams, but with increasing page sizes, and then only with input.
16 KByte pages became too small for fast computers years ago, but there
are not many media that have 64 KByte frames, and then there's the IPv4 packet
size problem and the TCP/IPv4 maximum segment size problem.
That's why letting the operating system pick the user-space address
of input buffers, as was done in systems 30 years ago, or using your
IOVEC idea is necessary.

It's always been trivial (modulo paging system overhead, especially
multi-processor locking and page table entry caching) to page flip on
output.  The hassles have always been on input.  And yes, those paging
system overheads can amount to many microseconds of latency and even
computation, and that amounts to a big deal if you care about speed.


I think all of that is obvious to anyone who has ever looked seriously
at making NFS or TCP go fast.  10 years ago it and related ideas were
considered big deal trade secrets, but that was 10 years ago.  It is also
obviously more effective, cheaper, easier, compatible, and interoperable
than RDMA.  It is why I think the RDMA proposal is like most such that
I've seen over the years, including the Motorola CAMEL header separating:
based on having a cool idea for a solution instead of actually looking at
the problems.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 17:30:11 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA19210
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 17:30:10 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id OAA18264
	for tcp-impl-outgoing; Sat, 26 Feb 2000 14:54:52 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id OAA18256
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 14:54:51 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id OAA08075; Sat, 26 Feb 2000 14:54:49 -0500 (EST)
Received: from palrel1.hp.com(156.153.255.242) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma008065; Sat, 26 Feb 00 14:54:31 -0500
Received: from tardy.cup.hp.com (tardy.cup.hp.com [15.8.80.176])
	by palrel1.hp.com (Postfix) with ESMTP
	id 7F999425; Sat, 26 Feb 2000 11:54:19 -0800 (PST)
Received: (from raj@localhost) by tardy.cup.hp.com (8.8.6 (PHNE_17190)/8.7.3 TIS 5.0.1) id LAA12506; Sat, 26 Feb 2000 11:54:15 -0800 (PST)
From: Rick Jones <raj@cup.hp.com>
Message-Id: <200002261954.LAA12506@tardy.cup.hp.com>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: carlson@ironbridgenetworks.com
Date: Sat, 26 Feb 2000 11:54:15 -0800 (PST)
Cc: tcp-impl@grc.nasa.gov
In-Reply-To: <86d7pkka29.fsf@ironbridgenetworks.com> from James Carlson at Feb "25," 2000 "08:49:02" pm
X-Mailer: ELM [$Revision: 1.17.214.2 $]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit


> That of course depends strongly on system architecture.  In any
> event, RDMA doesn't appear to help.  You could just as easily create
> a nifty Ethernet driver that copies only (say) the first 54 bytes of
> memory into main memory and leaves the rest in the queue until it
> determines the exact destination for the payload (or hits an
> exception case that requires harder work).  This would eliminate the
> extra memory copy and result in perfectly aligned receive data.
> Such "fast path" code is quite common in well-designed systems.
> 
> I hope describing the 54-bytes-first algorithm here stops anyone
> from patenting the idea.  ;-}

I cannot recall the specifics in terms of how many bytes it would
bring-in, but the "Slider" FDDI card for the HP 9000 735 and 755
workstations did just that - DMA-in some number of bytes and notify
the driver - enough to figure-out what was going-on, and then the
driver would program the remaining DMA. I think that the 735 and its
Slider FDDI card shipped around 1993.

rick jones
it's all just a question of datastructure management...


From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 22:38:19 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA22982
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 22:38:19 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA28474
	for tcp-impl-outgoing; Sat, 26 Feb 2000 19:21:52 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA28463
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 19:21:50 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA26593; Sat, 26 Feb 2000 19:21:50 -0500 (EST)
Received: from deliverator.sgi.com(204.94.214.10) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma026586; Sat, 26 Feb 00 19:21:23 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id QAA28181; Sat, 26 Feb 2000 16:16:39 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id QAA18127;
	Sat, 26 Feb 2000 16:21:10 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id QAA78952; Sat, 26 Feb 2000 16:25:18 -0800 (PST)
Message-Id: <200002270025.QAA78952@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "Sat, 26 Feb 2000 02:50:15 GMT."
             <E12OXJB-0000Jp-00@the-village.bc.nu> 
Date: Sat, 26 Feb 2000 16:25:18 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> > use RDMA.  Have this new call map in (read-only) raw system buffers to
> > the user's address space containing the received in-order data (with
> > packet headers and other nonsense), and copy out along a struct uio
> > array pointing to the actual data the application should be examining.
> > That gets you a zero data copy interface from driver to application
> > without having to modify any peers.
> 
> I was thinking something along the lines of
> 
> 	recvmsgiov(fd, iovec, etc...)
> 
> the difference being the kernel gets to fill in the iovec..

And enabled by a socket option to indicate the app is willing to accept the 
kernel placement of headers and data.  Still needs hardware support to 
separate header and payload, though.

-- 
Zachary Amsden  zamsden@engr.sgi.com  3-6919  31-2-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Sat Feb 26 22:41:12 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA23038
	for <tcpimpl-archive@odin.ietf.org>; Sat, 26 Feb 2000 22:41:11 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA28940
	for tcp-impl-outgoing; Sat, 26 Feb 2000 19:33:07 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA28924
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 19:33:05 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA27511; Sat, 26 Feb 2000 19:33:05 -0500 (EST)
Received: from orchard.hamachi.org(4.255.0.98) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma027396; Sat, 26 Feb 00 19:32:23 -0500
Received: from orchard.arlington.ma.us (localhost [[UNIX: localhost]])
	by orchard.arlington.ma.us (8.8.8/1.34) with ESMTP id AAA27346;
	Sun, 27 Feb 2000 00:28:04 GMT
Message-Id: <200002270028.AAA27346@orchard.arlington.ma.us>
To: Rick Jones <raj@cup.hp.com>
cc: carlson@ironbridgenetworks.com, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
In-Reply-To: Message from Rick Jones <raj@cup.hp.com> 
   of "Sat, 26 Feb 2000 11:54:15 PST." <200002261954.LAA12506@tardy.cup.hp.com> 
Date: Sat, 26 Feb 2000 19:28:03 -0500
From: Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> I cannot recall the specifics in terms of how many bytes it would
> bring-in, but the "Slider" FDDI card for the HP 9000 735 and 755
> workstations did just that - DMA-in some number of bytes and notify
> the driver - enough to figure-out what was going-on, and then the
> driver would program the remaining DMA. I think that the 735 and its
> Slider FDDI card shipped about 1993ish.

While we're engaged in a game of "who did it first..", I was under the
impression that Apollo's token ring did the same sort of thing, about
a decade before that, albeit with an RDMA-like hack in the link layer.
The original Apollo m68k pagesize was 1k..

As I understand it, packet headers were PIO'ed into one pool, and then
the packet data was DMA'ed into another pool of page-aligned,
page-flippable buffers.  One of the Apollo old-timers like Paul Leach
would know for sure..

					- Bill


From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 00:21:50 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id AAA23978
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 00:21:50 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id VAA03256
	for tcp-impl-outgoing; Sat, 26 Feb 2000 21:35:23 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id VAA03250
	for <tcp-impl@grc.nasa.gov>; Sat, 26 Feb 2000 21:35:22 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id VAA05261; Sat, 26 Feb 2000 21:35:21 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma005217; Sat, 26 Feb 00 21:34:36 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id TAA12952
	for tcp-impl@grc.nasa.gov  env-from <vjs>;
	Sat, 26 Feb 2000 19:34:35 -0700 (MST)
Date: Sat, 26 Feb 2000 19:34:35 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002270234.TAA12952@calcite.rhyolite.com>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Cc: tcp-impl@grc.nasa.gov
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>

> ...
> While we're engaged in a game of "who did it first..", I was under the
> impression that Apollo's token ring did the same sort of thing, about
> a decade before that, albeit with an RDMA-like hack in the link layer.
> The original Apollo m68k pagesize was 1k..
>
> As I understand it, packet headers were PIO'ed into one pool, and then
> the packet data was DMA'ed into another pool of page-aligned,
> page-flippable buffers.  One of the Apollo old-timers like Paul Leach
> would know for sure..

If you allow changes in the on-the-wire bits, then don't forget the
4.2 BSD (or earlier) "trailers", which put the headers in a little
mbuf, and the data into a string of enormous, 512-byte page mbufs.


I don't think the point has been "who did it first," but "let's not
start improving it before considering the state of the art."


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 07:52:03 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id HAA08076
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 07:52:03 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id FAA20873
	for tcp-impl-outgoing; Sun, 27 Feb 2000 05:11:25 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id FAA20869
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 05:11:24 -0500 (EST)
From: julian_satran@il.ibm.com
Received: by seraph3.lerc.nasa.gov; id FAA03089; Sun, 27 Feb 2000 05:11:24 -0500 (EST)
Received: from d12lmsgate-3.de.ibm.com(195.212.91.201) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma003087; Sun, 27 Feb 00 05:11:04 -0500
Received: from d12relay01.de.ibm.com (d12relay01.de.ibm.com [9.165.215.22])
	by d12lmsgate-3.de.ibm.com (1.0.0) with ESMTP id LAA31802;
	Sun, 27 Feb 2000 11:11:01 +0100
Received: from d12mta05.de.ibm.com (d12mta05_cs0 [9.165.222.239])
	by d12relay01.de.ibm.com (8.8.8m2/NCO v2.06) with SMTP id LAA62354;
	Sun, 27 Feb 2000 11:11:01 +0100
Received: by d12mta05.de.ibm.com(Lotus SMTP MTA v4.6.5  (863.2 5-20-1999))  id C1256892.0037EE2F ; Sun, 27 Feb 2000 11:10:54 +0100
X-Lotus-FromDomain: IBMIL@IBMDE
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
cc: David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
Message-ID: <C1256892.0037ED28.00@d12mta05.de.ibm.com>
Date: Sun, 27 Feb 2000 11:57:13 +0200
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk



True. You can do everything by processing headers... but then you need to
understand all the protocols that are amassed over TCP. To do it in silicon
it will probably make sense for a subset.
RDMA is a general-purpose solution and the user decides what to do with it.
You can look at it as a simple way to enable the protocol stack and the
application to completely separate the protocol state machine (defined by
headers and/or trailers) from the payload.

As for the vulnerability to attacks, with a good-size RDMAID and some
imagination you can get the same level of protection as with the TCP
sequence number (even a bit better, because sequence numbers can be guessed
from context).

Julo

Alan Cox <alan@lxorguk.ukuu.org.uk> on 25/02/2000 15:17:05

Please respond to Alan Cox <alan@lxorguk.ukuu.org.uk>

To:   julian_satran%ibmil.RSCS@STUTVM1.DE.IBM.COM
cc:   David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
      tcp-impl@grc.nasa.gov (bcc: Julian Satran/Haifa/IBM)
Subject:  Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.




> Gb/s it requires some innovation and lots of silicon. The RDMA option makes
> it possible at a far lower price. And the zero copy it enables might go
> deep into the application space as it is only an annotation on packets.

I am not convinced the amount of silicon changes between the two. The
RDMA id may be faked by an attacker so must still be verified.

Van Jacobson proposed and to an extent implemented a system where the user
context does all the TCP work. In that sort of situation and with a more
sensible API than the BSD socket one you don't appear to need a lot of
silicon, in fact the worst case is the wildcard.

Alan






From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 11:05:11 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA09218
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 11:05:10 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id IAA25578
	for tcp-impl-outgoing; Sun, 27 Feb 2000 08:17:26 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id IAA25574
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 08:17:25 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id IAA13074; Sun, 27 Feb 2000 08:17:24 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma013066; Sun, 27 Feb 00 08:17:13 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12P3aU-0000Ce-00; Sun, 27 Feb 2000 13:18:18 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: zamsden@cthulhu.engr.sgi.com (Zachary Amsden)
Date: Sun, 27 Feb 2000 13:18:17 +0000 (GMT)
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), tcp-impl@grc.nasa.gov
In-Reply-To: <200002270025.QAA78952@clock.engr.sgi.com> from "Zachary Amsden" at Feb 26, 2000 04:25:18 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12P3aU-0000Ce-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> > I was thinking something along the lines of
> > 
> > 	recvmsgiov(fd, iovec, etc...)
> > 
> > the difference being the kernel gets to fill in the iovec..
> 
> And enabled by a socket option to indicate the app is willing to accept the 
> kernel placement of headers and data.  Still needs hardware support to 
> separate header and payload, though.

It certainly makes it easier. Your average $30 tulip card can take a good
guess at this, especially for predictable protocols like NFS over UDP.

However, guessing wrongly is not fatal: for the case where you DMA and then
map into the application, you can map in some of the header bits without any
real security impact and point the iov further down the packet.



From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 14:15:25 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA10470
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 14:15:24 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id LAA00827
	for tcp-impl-outgoing; Sun, 27 Feb 2000 11:09:55 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id LAA00821
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 11:09:54 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id LAA23679; Sun, 27 Feb 2000 11:09:54 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma023440; Sun, 27 Feb 00 11:09:09 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12P6GM-0000Qe-00; Sun, 27 Feb 2000 16:09:42 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: julian_satran@il.ibm.com
Date: Sun, 27 Feb 2000 16:09:39 +0000 (GMT)
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox),
        David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <C1256892.00380C2C.00@d12mta02.de.ibm.com> from "julian_satran@il.ibm.com" at Feb 27, 2000 11:57:13 AM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12P6GM-0000Qe-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> understand all the protocols that are amassed over TCP. To do it in silicon
> it will probably make sense for a subset.

You don't want to do it in silicon. Forget doing all this in silicon. We have
this funky stuff called software. The silicon needs no RDMA support to do
sensible work if the API and the underlying OS are sensibly designed.

> You can look at it as a simple way to enable the protocol stack and the
> application to completely separate the protocol state machine (defined by
> headers and/or trailers) from the payload.

The two are tied together. You have to parse the TCP option stream to get
the ident in the first place. You can't act on the RDMAID until you
have checked the packet is syntactically valid and you've processed
the options including handling the SACK data mixed in with it.

It might also be fragmented of course.
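
To make the point about parsing the option stream concrete, here is a
minimal sketch of an option walker that refuses to hand back an
identifier from a malformed block. The NOP, EOL and SACK kinds are the
standard ones; the RDMA kind value, the 4-byte identifier size and the
function name are assumptions made only for illustration, since no
option format is fixed in this thread.

```python
import struct

OPT_EOL, OPT_NOP, OPT_SACK = 0, 1, 5
# Hypothetical kind value and a hypothetical 4-byte id; nothing in the
# thread assigns either of them.
OPT_RDMA_HYPOTHETICAL = 253


def walk_tcp_options(opts: bytes) -> dict:
    """Walk a TCP option block; return SACK blocks and the RDMA id, if any.

    Raises ValueError on a malformed option so the caller never acts on
    an identifier taken from a syntactically invalid header.
    """
    found = {"sack": [], "rdma_id": None}
    i = 0
    while i < len(opts):
        kind = opts[i]
        if kind == OPT_EOL:
            break
        if kind == OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(opts):
            raise ValueError("option truncated before its length byte")
        length = opts[i + 1]
        if length < 2 or i + length > len(opts):
            raise ValueError("bad option length")
        body = opts[i + 2:i + length]
        if kind == OPT_SACK:
            if len(body) % 8:
                raise ValueError("SACK body is not a whole number of blocks")
            for off in range(0, len(body), 8):
                found["sack"].append(struct.unpack_from("!II", body, off))
        elif kind == OPT_RDMA_HYPOTHETICAL:
            if len(body) != 4:
                raise ValueError("unexpected RDMA id size")
            (found["rdma_id"],) = struct.unpack("!I", body)
        i += length
    return found
```

Even in this toy form the identifier only becomes available after every
preceding option, SACK included, has been length-checked, and the
padding bytes mean it can sit at any byte offset.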


> As for the vulnerability to attacks with a good size RDMAID and some
> imagination you can get the same level of protection as with the TCP
> sequence number (even a bit better because sequence numbers can be guessed
> from context).

The TCP sequence number protects against ordering errors, not against DMAing
crap into the wrong buffer.

Different game, different cost if you lose.

Alan



From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 14:33:24 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA10610
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 14:33:24 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id LAA02012
	for tcp-impl-outgoing; Sun, 27 Feb 2000 11:42:12 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id LAA01997
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 11:42:10 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id LAA25771; Sun, 27 Feb 2000 11:42:09 -0500 (EST)
Received: from ibn-host12.ironbridgenetworks.com(146.115.140.12) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma025760; Sun, 27 Feb 00 11:42:05 -0500
Received: (from news@localhost)
	by ironbridgenetworks.com (8.9.3/8.9.3) id LAA23377
	for tcp-impl@grc.nasa.gov; Sun, 27 Feb 2000 11:42:04 -0500 (EST)
To: tcp-impl@grc.nasa.gov
From: James Carlson <carlson@ironbridgenetworks.com>
Newsgroups: lists.ietf.tcp-impl
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Date: 27 Feb 2000 11:42:04 -0500
Organization: IronBridge Networks
Lines: 21
Message-ID: <863dqer40z.fsf@ironbridgenetworks.com>
References: <200002270025.QAA78952@clock.engr.sgi.com>
NNTP-Posting-Host: helios.ibnets.com
X-Newsreader: Gnus v5.5/Emacs 20.3
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

zamsden@cthulhu.engr.sgi.com (Zachary Amsden) writes:
> And enabled by a socket option to indicate the app is willing to accept the 
> kernel placement of headers and data.  Still needs hardware support to 
> separate header and payload, though.

Actually, no, it doesn't require any special hardware support.  Who
cares to separate the header and payload at all?  If it's not
specifically NFS that we're talking about (which has its own problems
due to demand-paging), then it's sufficient to map the entire received
packet into user space -- headers and all.  All that's really needed
at the application level is a list of data-start addresses and lengths
(as would be present in a regular uio vector).  The standard recvmsg()
call already provides an interface like this.  The extra bit of magic
is that for zero-copy the user doesn't get to put his own buffers into
msghdr.msg_iov; the kernel picks.  How about SO_RCVZCOPY to enable ...?
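
For comparison, a minimal sketch of the interface as it exists today,
using Python's wrapper around recvmsg(). The caller supplies the
buffers, so the data is still copied into them; in the zero-copy
variant proposed above the kernel would choose the buffers instead.
The function name scatter_receive is an assumption for illustration,
and SO_RCVZCOPY is only the name floated above, not an existing socket
option, so it does not appear in the code.

```python
import socket


def scatter_receive(sock, sizes):
    """Receive one message into caller-supplied buffers of the given sizes.

    Returns the filled portions as a list of bytes objects, which plays
    the role of the (start, length) list an application would walk.
    """
    buffers = [bytearray(n) for n in sizes]
    # recvmsg_into() wraps recvmsg(2): the caller, not the kernel, picks
    # the buffers, which is exactly what a zero-copy variant would change.
    nbytes, _ancdata, _msg_flags, _address = sock.recvmsg_into(buffers)
    filled = []
    remaining = nbytes
    for buf in buffers:
        take = min(len(buf), remaining)
        if take == 0:
            break
        filled.append(bytes(buf[:take]))
        remaining -= take
    return filled
```

With buffer sizes of, say, 4 and 64 bytes, the first four bytes of a
message land in the first buffer and the rest in the second, much as a
header and payload would be split across an iovec.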

-- 
James Carlson, System Architect                     <carlson@ibnets.com>
IronBridge Networks / 55 Hayden Avenue   71.246W   Vox:  +1 781 372 8132
Lexington MA  02421-7996 / USA           42.423N   Fax:  +1 781 372 8090
"PPP Design and Debugging" --- http://people.ne.mediaone.net/carlson/ppp


From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 19:42:36 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA13334
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 19:42:36 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id QAA12052
	for tcp-impl-outgoing; Sun, 27 Feb 2000 16:34:42 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id QAA12041
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 16:34:40 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id QAA14960; Sun, 27 Feb 2000 16:34:40 -0500 (EST)
Received: from ren.netconnect.com.au(203.7.198.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma014950; Sun, 27 Feb 00 16:34:28 -0500
Received: (qmail 27305 invoked from network); 27 Feb 2000 21:34:31 -0000
Received: from unknown (HELO cvs.com.au) (203.87.14.203)
  by mail.netconnect.com.au with SMTP; 27 Feb 2000 21:34:31 -0000
Message-ID: <38B95A08.EF0D9124@cvs.com.au>
Date: Mon, 28 Feb 2000 04:08:24 +1100
From: Charles Esson <charlese@cvs.com.au>
X-Mailer: Mozilla 4.5 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: julian_satran@il.ibm.com
CC: Alan Cox <alan@lxorguk.ukuu.org.uk>,
        David Robinson <David.Robinson@EBay.Sun.COM>, ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
References: <C1256892.0037ED28.00@d12mta05.de.ibm.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk



julian_satran@il.ibm.com wrote:

> True. You can do everything by processing headers... but then you need to
> understand all the protocols that are amassed over TCP. To do it in silicon
> it will probably make sense for a subset.
> RDMA is a general purpose solution and the user decides what to do with it.
> You can look at it as a simple way to enable the protocol stack and the
> application to completely separate the protocol state machine (defined by
> headers and/or trailers) from the payload.
>
> As for the vulnerability to attacks with a good size RDMAID and some
> imagination

Why not apply the imagination to the writing of a zero copy TCP stack based
on the current standards? It would be a lot more general and a lot more useful
as it would work against unmodified servers.

> you can get the same level of protection as with the TCP
> sequence number (even a bit better because sequence numbers can be guessed
> from context).

Are you suggesting that the sequence number should no longer be used? Are you
suggesting that the RDMAID and sequence number should be unrelated?

If the sequence number is still used then the RDMAID is simply additional data
that can be used by the attacker. Just another farmyard of potential bugs to be
investigated and exploited?

If you are not going to use the sequence number, how does a system that doesn't
support this option receive the data? If your aim is to design a system that
isn't backward compatible, why would anyone in their right mind support it?

Aiming to keep the intellectual property rights, not backward compatible,
dubious technical merit. ummm?

Regards.

>
>
> Julo
>
> Alan Cox <alan@lxorguk.ukuu.org.uk> on 25/02/2000 15:17:05
>
> Please respond to Alan Cox <alan@lxorguk.ukuu.org.uk>
>
> To:   julian_satran%ibmil.RSCS@STUTVM1.DE.IBM.COM
> cc:   David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
>       tcp-impl@grc.nasa.gov (bcc: Julian Satran/Haifa/IBM)
> Subject:  Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
>
> > Gb/s it requires some innovation and lots of silicon. The RDMA option makes
> > it possible at a far lower price. And the zero copy it enables might go
> > deep into the application space as it is only an annotation on packets.
>
> I am not convinced the amount of silicon changes between the two. The
> RDMA id may be faked by an attacker so must still be verified.
>
> Van Jacobson proposed and to an extent implemented a system where the user
> context does all the TCP work. In that sort of situation and with a more
> sensible API than the BSD socket one you don't appear to need a lot of
> silicon, in fact the worst case is the wildcard.
>
> Alan



From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 20:08:17 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA13883
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 20:08:16 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id RAA16251
	for tcp-impl-outgoing; Sun, 27 Feb 2000 17:24:57 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id RAA16244
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 17:24:56 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id RAA18140; Sun, 27 Feb 2000 17:24:55 -0500 (EST)
Received: from ruby.cisco.com(171.69.198.43) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma018135; Sun, 27 Feb 00 17:24:34 -0500
Received: from alfonso. ([10.19.129.228])
	by ruby.cisco.com (8.8.8-Cisco List Logging/8.8.8) with SMTP id OAA20322;
	Sun, 27 Feb 2000 14:23:18 -0800 (PST)
Message-Id: <3.0.3.32.20000227141724.006bd2d4@ruby.cisco.com>
X-Sender: cheriton@ruby.cisco.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.3 (32)
Date: Sun, 27 Feb 2000 14:17:24 -0800
To: Alan Cox <alan@lxorguk.ukuu.org.uk>, julian_satran@il.ibm.com
From: "David R. Cheriton" <cheriton@cisco.com>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox),
        David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <E12P6GM-0000Qe-00@the-village.bc.nu>
References: <C1256892.00380C2C.00@d12mta02.de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

At 04:09 PM 2/27/00 +0000, Alan Cox wrote:
>> understand all the protocols that are amassed over TCP. To do it in silicon
>> it will probably make sense for a subset.
>
>You don't want to do it in silicon. Forget doing all this in silicon. We have
>this funky stuff called software. The silicon needs no RDMA support to do
>sensible work if the API and the underlying OS are sensibly designed.
>

I think it would be useful to get more quantitative.  I like to
think all of us in this discussion are well aware that some level
of performance can be achieved in software.  I'm not seeing where
"sensibly designed" leads us WRT OSs, because, if they aren't now
by whatever definition you are using, it seems pretty academic, as
they say.
  Clearly, data is being received from hardware and software does not
get to touch it until it has been stored to some memory.  My
assumption is that the storage system memory is arranged in fixed
size pages of disk/file pages.  Without hardware RDMA to the storage
level, I believe one requires an extra copy, from whatever the
hardware delivers to what the storage system expects.  Either
you use twice the bandwidth in the storage system memory system or
else you have a separate memory system for the network, and
have software/processor power adequate to copy between them at wire
speed (with all the associated support facilities for this processor).
  Unless there is something wrong with this reasoning,
it seems like a cost issue of providing the above hardware resources
vs. providing a NIC chip that can RDMA.

My guesstimate is that the software-only approach would easily be
10 times more expensive here at the higher speed rates of 10 Gbps.
If there is serious doubt about the merits of real hardware support,
we should try to quantify costs further at these speed ranges, IMHO.

>> You can look at it as a simple way to enable the protocol stack and the
>> application to completely separate the protocol state machine (defined by
>> headers and/or trailers) from the payload.
>
>The two are tied together. You have to parse the TCP option stream to get
>the ident in the first place. You can't act on the RDMAID until you
>have checked the packet is syntactically valid and you've processed
>the options including handling the SACK data mixed in with it.
>
>It might also be fragmented of course.

What you need to do >>in the common case<< before processing the RDMA
option is relatively simple and a very small portion of the
overall protocol state machines, so I presume your comment is
asking for more careful wording? Or do you really disagree with the
fundamental point?

>
>
>> As for the vulnerability to attacks with a good size RDMAID and some
>> imagination you can get the same level of protection as with the TCP
>> sequence number (even a bit better because sequence numbers can be guessed
>> from context).
>
>The tcp sequence number protects against ordering errors not against DMAing
>crap into the wrong buffer.
>
>Different game, different cost if you lose.
>
It would help me to have a more careful definition of the types of
attacks you have in mind.  In an insecure network with intruders,
presumably I can end up with bad data in the right buffer
or right data in the wrong buffer without using RDMA.
Is it your view that we have made things worse, and if so, how?
Or are you objecting to us not making things better?


>Alan
>
David Cheriton




From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 20:23:51 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA14375
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 20:23:50 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id RAA16735
	for tcp-impl-outgoing; Sun, 27 Feb 2000 17:40:41 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id RAA16731
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 17:40:40 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id RAA19221; Sun, 27 Feb 2000 17:40:40 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma019216; Sun, 27 Feb 00 17:40:13 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12PCN6-0000vD-00; Sun, 27 Feb 2000 22:41:04 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: cheriton@cisco.com (David R. Cheriton)
Date: Sun, 27 Feb 2000 22:41:01 +0000 (GMT)
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), julian_satran@il.ibm.com,
        David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <3.0.3.32.20000227141724.006bd2d4@ruby.cisco.com> from "David R. Cheriton" at Feb 27, 2000 02:17:24 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12PCN6-0000vD-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> it seems like a cost issue of providing the above hardware resources
> vs. providing a NIC chip that can RDMA.  

or a NIC chip that has more memory on the chip and sends you the header; then
you asynchronously give it a DMA location for the rest of the buffer.

You've now got rid of RDMA; instead your little bit of extra silicon is
generic, and will work with stuff like AppleTalk even. What does the
extra RAM and PCI glue cost you - probably not a lot.

That is why I prefer software solutions.

> >It might also be fragmented of course.
> What you need to do >>in the common case<< before processing the RDMA
> option is relatively simple and a very small portion of the
> overall protocol state machines, so I presume your comment is
> asking for more careful wording? Or do you really disagree with the
> fundamental point?

Yes

The sequence of operations you must execute is complex. 

You have to

Check the packet is long enough
Check the header is IPv4
Check the IPv4 IHL is valid
Check the protocol field
Check the source/dest addresses are legal
Work out if dest is for us or another node
Perform IP fragmentation checks (the checks at least cannot be deferred; defrag can)
Walk the IP options (I guess you could defer this)
Get the TCP header
Check the TCP header lengths are legal and fit the packet length
Check the TCP ports to figure out the socket
Parse the TCP options
Check the RDMA identifier, arbitrarily aligned of course
Check the RDMA identifier, port and addresses match
Perform sequence space checks

Then you can deliver the packet.
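
A rough sketch of the first part of that sequence, written against a
raw IPv4 packet held in a bytes object. It covers only the length,
version, IHL, fragment, protocol, destination and TCP data-offset
checks; checksums, source-address legality, IP options, socket lookup,
the RDMA identifier and sequence-space checks are all left out. The
function name and the local_addrs parameter are assumptions for
illustration.

```python
import struct

IPPROTO_TCP = 6


def locate_tcp_payload(pkt: bytes, local_addrs: set) -> tuple:
    """Run the cheap syntactic checks on an IPv4/TCP packet.

    Returns (src, sport, dport, tcp_options, payload) or raises ValueError.
    """
    if len(pkt) < 20:
        raise ValueError("shorter than a minimal IPv4 header")
    if pkt[0] >> 4 != 4:
        raise ValueError("not IPv4")
    ihl = (pkt[0] & 0x0F) * 4
    if ihl < 20 or ihl > len(pkt):
        raise ValueError("bad IHL")
    (total_len,) = struct.unpack_from("!H", pkt, 2)
    if total_len < ihl or total_len > len(pkt):
        raise ValueError("bad total length")
    (flags_frag,) = struct.unpack_from("!H", pkt, 6)
    if flags_frag & 0x3FFF:  # MF bit set or non-zero fragment offset
        raise ValueError("fragment: needs reassembly first")
    if pkt[9] != IPPROTO_TCP:
        raise ValueError("not TCP")
    src, dst = pkt[12:16], pkt[16:20]
    if dst not in local_addrs:
        raise ValueError("not addressed to this host")
    tcp = pkt[ihl:total_len]
    if len(tcp) < 20:
        raise ValueError("short TCP header")
    sport, dport = struct.unpack_from("!HH", tcp, 0)
    doff = (tcp[12] >> 4) * 4
    if doff < 20 or doff > len(tcp):
        raise ValueError("bad TCP data offset")
    return src, sport, dport, tcp[20:doff], tcp[doff:]
```

Only after all of this does the option block (the fourth element of the
result) become available for the RDMA identifier lookup, which is the
ordering the list above insists on.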

Alan



From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 22:50:58 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA17157
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 22:50:58 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id TAA21383
	for tcp-impl-outgoing; Sun, 27 Feb 2000 19:44:28 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id TAA21373
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 19:44:26 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id TAA27846; Sun, 27 Feb 2000 19:44:26 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma027841; Sun, 27 Feb 00 19:44:19 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id RAA02693
	env-from <vjs>;
	Sun, 27 Feb 2000 17:44:12 -0700 (MST)
Date: Sun, 27 Feb 2000 17:44:12 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002280044.RAA02693@calcite.rhyolite.com>
To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: "David R. Cheriton" <cheriton@cisco.com>

> ...
>   Clearly, data is being received from hardware and software does not
> get to touch it until it has been stored to some memory.  My 
> assumption is that the storage system memory is arranged in fixed
> size pages of disk/file pages.  Without hardware RDMA to the storage
> level, I believe one requires an extra copy, from whatever the
> hardware delivers to what the storage system expects.  Either
> you use twice the bandwidth in the storage system memory system or
> else you have a separate memory system for the network, and
> have software/processor power adequate to copy between them at wire
> speed (with all the associated support facilities for this processor).
>  Unless there is something wrong with this reasoning,
> it seems like a cost issue of providing the above hardware resources
> vs. providing a NIC chip that can RDMA.  

Depending on how you are counting copies, that reasoning has been wrong
in commercial UNIX systems for more than 10 years.
Do you use the RDMA bits before the IP checksum, the TCP checksum, and the
medium FCS or checksum have been checked?  If not, if you receive the
entire link layer frame into some kind of temporary buffer or FIFO,
probably in the "network interface card/controller," to check the trailing
FCS before using the RDMA bits, then commercial UNIX systems have been
doing as you say to save copies since the late 1980's.  As I said before,
such systems were a part of what killed Protocol Engines Inc.

If you do use the RDMA bits in the TCP header after 50-60 bytes of
the frame have arrived, but before the frame FCS, aren't you worried
about bit rot in the RDMA?

> My guesstimate is that the software-only approach would easily be
> 10 times more expensive here at the higher speed rates of 10 Gbps.
> If there is serious doubt about the merits of real hardware support,
> we should try to quantify costs further at these speed ranges, IMHO.

By "expensive," are you talking about dollars or bits/second?

Regardless, if you look at the number of CPU cycles or gates in custom
silicon required to support incoming page flipping in old, existing
implementations, I bet you'll find that they are less "expensive" than
any likely RDMA implementation.  Power of 2 modular arithmetic is awfully
cheap compared to parsing and validating TCP options.


> ...
> It would help me to have a more careful definition of the types of
> attacks you have in mind.  In an unsecure network with intruders,
> presumably I can end up with bad data in the right buffer
> or right data in the wrong buffer without using RDMA.
> Do you view we have made things worse, and if so, how?
> or are you objecting to us not making things better?

Is it possible for a bad guy to use RDMA to put bad data into memory
that is not a buffer?

If the RID does no more than choose from a safe list of buffers, then how
does RDMA usefully differ from the old FDDI, ATM, and HIPPI implementations
that put incoming page-flippable data in buffers that get into user space
with the data having been seen on the system bus the absolute minimum
number of times for any scheme, including RDMA, once?
Systems I've worked on have done mbuf allocation in the network interface
hardware, including putting page-flippable payloads into page-mbufs that
can eventually be flipped into user space.  And of course, take care of
the TCP or UDP checksum.

Given the recently described extensions to readv(), absolutely
all data received by a system like that would be page-flippable,
and without needing the silicon or CPU cycles to parse RDMA options
or requiring the sender to send RDMA options or even know that the
receiver is being fast.


Vernon Schryver    vjs@rhyolite.com


From owner-tcp-impl@lerc.nasa.gov  Sun Feb 27 23:46:18 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id XAA17949
	for <tcpimpl-archive@odin.ietf.org>; Sun, 27 Feb 2000 23:46:18 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id UAA24517
	for tcp-impl-outgoing; Sun, 27 Feb 2000 20:57:59 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id UAA24512
	for <tcp-impl@grc.nasa.gov>; Sun, 27 Feb 2000 20:57:58 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id UAA03423; Sun, 27 Feb 2000 20:57:58 -0500 (EST)
Received: from ruby.cisco.com(171.69.198.43) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma003416; Sun, 27 Feb 00 20:57:24 -0500
Received: from alfonso. ([10.19.129.228])
	by ruby.cisco.com (8.8.8-Cisco List Logging/8.8.8) with SMTP id RAA23480;
	Sun, 27 Feb 2000 17:56:10 -0800 (PST)
Message-Id: <3.0.3.32.20000227175017.006baa14@ruby.cisco.com>
X-Sender: cheriton@ruby.cisco.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.3 (32)
Date: Sun, 27 Feb 2000 17:50:17 -0800
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
From: "David R. Cheriton" <cheriton@cisco.com>
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), julian_satran@il.ibm.com,
        David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <E12PCN6-0000vD-00@the-village.bc.nu>
References: <3.0.3.32.20000227141724.006bd2d4@ruby.cisco.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

Alan,

Given you are concerned about Appletalk and PCI, perhaps we just
need to agree you are addressing a different performance level of
storage system.  I see no reason for hardware support at Appletalk
and PCI speeds either.  The RDMA option, even implemented in
a hardware NIC, would not preclude software processing of Appletalk
or anything else.

  I think it would be only fair for this group to expect you to
provide a more fully worked out design and evaluation of your
proposed NIC if this is really to be taken as a serious alternative
to an RDMA option.  I for one don't find it a competitive design
(it is also one that was previously considered.)

Regarding the complexity of TCP processing in hardware, people are 
doing this, so I regard this as a done deal.  (I'd hate
to be a high-speed NIC vendor that can't do this.)
The only issue is how to handle the next level of 
protocols that have high performance requirements, i.e. storage.
Also, NICs with hardware enhancements for protocol support,
such as the Intel Gigabit Ethernet NIC have been very successful
both in performance and market.  So, expect more there.

Finally, the RDMA option is targeted to allow use of standard
 Internet protocols for SANs in place of  specialized protocols
and networks such as  FibreChannel, especially for non-local 
access.
I.e. an extra option rather than an extra protocol stack and network.
I hope the discussion can recognize that broader issue in considering
alternatives and the general need for the RDMA option.
I.e. can SCSI over TCP really compete with FC without the RDMA option?
(In this sense, the subject line should have put SCSI first.)

DRC

At 10:41 PM 2/27/00 +0000, Alan Cox wrote:
>> it seems like a cost issue of providing the above hardware resources
>> vs. providing a NIC chip that can RDMA.  
>
>or a NIC chip that has more memory on the chip and sends you the header then
>you asynchronously give a DMA location for the rest of the buffer.
>
>You've now got rid of RDMA, instead your little bit of extra silicon is 
>generic, and will work with stuff like appletalk even. What does the
>extra RAM and PCI glue cost you - probably not a lot.
>
>That is why I prefer software solutions
>
>> >It might also be fragmented of course.
>> What you need to do >>in the common case<< before processing the RDMA
>> option is relatively simple and a very small portion of the
>> overall protocol state machines, so I presume your comment is
>> asking for more careful wording? Or do you really disagree with the
>> fundamental point?
>
>Yes
>
>The sequence of operations you must execute is complex. 
>
>You have to
>
>Check the packet is long enough
>Check the header is IPv4
>Check the IPV4 IHL is valid
>Check the protocol field
>Check the source/dest addresses are legal
>Work out if dest is for us or another node
>Perform IP fragmentation checks (the checks at least cannot be deferred; defrag can)
>Walk the IP options (I guess you could defer this)
>Get the tcp header
>Check the tcp header lengths are legal and fit the packet length
>Check the tcp ports to figure out the socket
>Parse the tcp options 
>Check the RDMA identifier, arbitrarily aligned of course
>Check the RDMA identifier, port and addresses match
>Perform sequence space checks
>
>Then you can deliver the packet.
>
>Alan
>
>
>
>
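
The first part of the check sequence quoted above can be sketched in
Python as a rough illustration. This is only a sketch under stated
assumptions: the function names are made up for the example, checksums
are not verified, and the address, socket lookup, RDMA identifier and
sequence space checks are left out because they depend on state and on
an option format that this thread does not spell out.

```python
import struct

def walk_tcp_options(opts: bytes):
    """Walk TCP options, returning a list of (kind, data) pairs."""
    found = []
    i = 0
    while i < len(opts):
        kind = opts[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation, a single byte
            i += 1
            continue
        if i + 1 >= len(opts):
            raise ValueError("truncated option")
        length = opts[i + 1]
        if length < 2 or i + length > len(opts):
            raise ValueError("bad option length")
        found.append((kind, opts[i + 2:i + length]))
        i += length
    return found

def check_segment(pkt: bytes):
    """Hypothetical helper: validate an IPv4/TCP packet far enough to
    reach the TCP options. Returns (src_port, dst_port, options)."""
    if len(pkt) < 20:
        raise ValueError("too short for an IPv4 header")
    if pkt[0] >> 4 != 4:
        raise ValueError("not IPv4")
    ihl = (pkt[0] & 0x0F) * 4
    if ihl < 20 or ihl > len(pkt):
        raise ValueError("bad IHL")
    (total_len,) = struct.unpack("!H", pkt[2:4])
    if total_len < ihl or total_len > len(pkt):
        raise ValueError("bad total length")
    (frag,) = struct.unpack("!H", pkt[6:8])
    if frag & 0x3FFF:            # MF flag set or nonzero fragment offset
        raise ValueError("fragment: needs reassembly first")
    if pkt[9] != 6:              # IP protocol number 6 is TCP
        raise ValueError("not TCP")
    tcp = pkt[ihl:total_len]
    if len(tcp) < 20:
        raise ValueError("too short for a TCP header")
    src_port, dst_port = struct.unpack("!HH", tcp[0:4])
    data_offset = (tcp[12] >> 4) * 4
    if data_offset < 20 or data_offset > len(tcp):
        raise ValueError("bad TCP header length")
    return src_port, dst_port, walk_tcp_options(tcp[20:data_offset])
```

Even this partial version has about ten distinct failure branches
before any RDMA-specific field is looked at.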



From owner-tcp-impl@lerc.nasa.gov  Mon Feb 28 05:51:40 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id FAA02530
	for <tcpimpl-archive@odin.ietf.org>; Mon, 28 Feb 2000 05:51:39 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id CAA09047
	for tcp-impl-outgoing; Mon, 28 Feb 2000 02:49:47 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id CAA09038
	for <tcp-impl@grc.nasa.gov>; Mon, 28 Feb 2000 02:49:46 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id CAA28256; Mon, 28 Feb 2000 02:49:46 -0500 (EST)
Received: from pneumatic-tube.sgi.com(204.94.214.22) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma028248; Mon, 28 Feb 00 02:49:15 -0500
Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id XAA00643; Sun, 27 Feb 2000 23:52:00 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (clock.engr.sgi.com [163.154.34.45])
	by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF)
	via ESMTP id XAA58345;
	Sun, 27 Feb 2000 23:48:52 -0800 (PST)
	mail_from (zamsden@clock.engr.sgi.com)
Received: from clock.engr.sgi.com (localhost [127.0.0.1]) by clock.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id XAA83438; Sun, 27 Feb 2000 23:52:53 -0800 (PST)
Message-Id: <200002280752.XAA83438@clock.engr.sgi.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: James Carlson <carlson@ironbridgenetworks.com>
Cc: alan@lxorguk.ukuu.org.uk, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc. 
From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
In-Reply-To: Your message of "27 Feb 2000 11:42:04 EST."
             <863dqer40z.fsf@ironbridgenetworks.com> 
Date: Sun, 27 Feb 2000 23:52:53 -0800
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> zamsden@cthulhu.engr.sgi.com (Zachary Amsden) writes:
> > And enabled by a socket option to indicate the app is willing to accept the 
> > kernel placement of headers and data.  Still needs hardware support to 
> > separate header and payload, though.
> 
> Actually, no, it doesn't require any special hardware support.  Who
> cares to separate the header and payload at all?  If it's not
> specifically NFS that we're talking about (which has its own problems
> due to demand-paging), then it's sufficient to map the entire received
> packet into user space -- headers and all.  All that's really needed
> at the application level is a list of data-start addresses and lengths
> (as would be present in a regular uio vector).  The standard recvmsg()
> call already provides an interface like this.  The extra bit of magic
> is that for zero-copy the user doesn't get to put his own buffers into
> msghdr.msg_iov; the kernel picks.  How about SO_RCVZCOPY to enable ...?

No, that situation doesn't require any hardware support.  However, a zero-copy 
receive path is not the only element of RDMA - RDMA was designed (I suppose 
from the discussion here) specifically to address header/payload issues for 
storage protocols.  Clearly one can do zero-copy receive with changes to the 
API and no hardware/firmware modifications.  But with no special hardware 
support, flipping the payload into some page with alignment constraints will 
require another copy.

There is one exception to my last statement that I know of:  If you pre-adjust 
the hardware receive buffers to make the payload align on a page boundary, you 
can flip the page into the buffer cache for (hopefully) the common case.  
However, this requires the ability to tune these header offsets and will only 
work for one protocol at a time (mostly).
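
To illustrate the pre-adjustment described above, here is a minimal
Python sketch of the offset arithmetic. The function name, the 4096
byte page size and the example header length are assumptions for the
illustration, not values from any particular driver.

```python
def receive_offset(header_len: int, page_size: int = 4096) -> int:
    """Offset within the first page at which the frame should start
    so that the byte following header_len bytes of headers falls on
    a page boundary."""
    if header_len < 0 or page_size <= 0:
        raise ValueError("header_len must be >= 0 and page_size > 0")
    return (-header_len) % page_size

# Example: 14 (Ethernet) + 20 (IP) + 20 (TCP) = 54 bytes of headers.
offset = receive_offset(54)
```

The offset is only right while the total header length stays fixed,
which is why the trick works for mostly one protocol at a time.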

Realistically, who is going to be running a storage system that requires so 
much bandwidth that avoiding receive copies is necessary, and runs on generic 
NICs with no firmware/ASIC modifications possible?  So I think using modified 
hardware is completely reasonable in those circumstances.

In any case, zero-copy receive with no payload separation, as you propose, 
would work fine with the API changes previously suggested, but that is a 
separate discussion :)

-- 
Zachary Amsden  zamsden@engr.sgi.com  3-6919  31-2-510  Core Protocols




From owner-tcp-impl@lerc.nasa.gov  Mon Feb 28 06:28:21 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id GAA03026
	for <tcpimpl-archive@odin.ietf.org>; Mon, 28 Feb 2000 06:28:21 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id DAA12935
	for tcp-impl-outgoing; Mon, 28 Feb 2000 03:43:02 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id DAA12922
	for <tcp-impl@grc.nasa.gov>; Mon, 28 Feb 2000 03:43:00 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id DAA05864; Mon, 28 Feb 2000 03:43:00 -0500 (EST)
Received: from kickme.cisco.com(198.92.30.42) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma005856; Mon, 28 Feb 00 03:42:29 -0500
Received: from csapuntz-u1.cisco.com (csapuntz-u1.cisco.com [171.69.199.29])
	by kickme.cisco.com (8.9.1a/8.9.1) with ESMTP id AAA29259;
	Mon, 28 Feb 2000 00:31:42 -0800 (PST)
Received: from localhost (csapuntz@localhost) by csapuntz-u1.cisco.com (8.8.8-Cisco List Logging/CISCO.WS.1.2) with ESMTP id AAA19859; Mon, 28 Feb 2000 00:42:27 -0800 (PST)
X-Authentication-Warning: csapuntz-u1.cisco.com: csapuntz owned process doing -bs
Date: Mon, 28 Feb 2000 00:42:27 -0800 (PST)
From: Costa Sapuntzakis <csapuntz@cisco.com>
To: Vernon Schryver <vjs@calcite.rhyolite.com>
cc: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: NFS Header/data parsing and RDMA
In-Reply-To: <200002280044.RAA02693@calcite.rhyolite.com>
Message-ID: <Pine.GSO.4.10.10002272341340.19855-100000@csapuntz-u1.cisco.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk


Ok, so doing NFSv2/v3 header/data splitting is easy on an in-order
TCP stream because NFS has fixed-length trailers. Here's a little
technique:

1) Assume that all READ/WRITE transfers are powers of 2
2) Assume all RPCs larger than 4k are WRITE RPCs and all responses larger
than 4k are READ responses
3) Take the message size from the first 4 bytes of the RPC/TCP
encapsulation
4) Round the message size down to the nearest power of 2 (call this
quantity data_size)
5) The data is the last data_size bytes of the message. Put
the last data_size bytes in a separate aligned buffer.
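
As a rough illustration of steps 3-5 (not taken from any NIC
firmware), the split can be sketched in Python. The function names and
the threshold parameter are assumptions made for the example. The only
wire-format fact used is that the RPC-over-TCP record mark is a 4-byte
big-endian word whose top bit flags the last fragment and whose low 31
bits give the fragment length.

```python
import struct

def record_length(mark: bytes) -> int:
    """Step 3: fragment length from the 4-byte RPC record mark."""
    (word,) = struct.unpack("!I", mark[:4])
    return word & 0x7FFFFFFF     # drop the last-fragment flag bit

def split_message(message_size: int, threshold: int = 4096):
    """Steps 2, 4 and 5: return (header_len, data_len).

    Messages no larger than the threshold are treated as having no
    bulk data at all."""
    if message_size <= threshold:
        return message_size, 0
    # Round down to the nearest power of two.
    data_size = 1 << (message_size.bit_length() - 1)
    return message_size - data_size, data_size

# Example: an 8k transfer preceded by 120 bytes of RPC/NFS header.
header_len, data_len = split_message(8192 + 120)
```

With those numbers the last 8192 bytes go to the aligned buffer and
the first 120 bytes are left for normal header processing.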

Note, to do this with NFS/TCP, your NIC has to do some primitive
level of TCP processing (at least keep track of flows). It also
needs to understand RPC/TCP message boundaries.

Are there significantly simpler approaches than this? 

NFSv4 doesn't seem to have fixed length trailers and neither
does CIFS in all cases. And it looks like it will be costly to parse 
NFSv4 headers. 

RDMA still has the following features:

- Per-packet (Works with arbitrary out-of-order reception of TCP
segments)
- Fixed header that's generic across all protocols (NFSv4, v5, AFS,
DFS, CIFS, etc..) 
- No page flipping necessary on solicited transfers
- Message boundary bit (which is admittedly orthogonal to RDMA) allows
out-of-order processing on TCP receive buffer. Decreases parsing latency,
esp. in the face of packet drops.

The following measures should improve security/safety:

- NIC should ascertain that TCP segment is in receive window

- NIC needs to check that the RID is valid for a given TCP conn
  for safety/security reasons
  
  If the NIC does header/data splitting, it needs to keep track of
  per-flow information because most file block transfers will span
  multiple TCP segments. So the NIC will probably have a notion of
  a TCP flow #.

  The brute-force way to check RID validity is to use a CAM to
  map from (RID, TCP flow #) -> buffer address.

  If that's too expensive, then RIDs can be hashed and the flow # and
  buffer address stored in the bucket. The flow # is verified before
  using the buffer address.
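
A minimal Python sketch of the hashed variant follows. The class and
method names are hypothetical, the bucket count is arbitrary, and
storing the RID itself in the bucket (to reject hash collisions) is an
addition of this sketch rather than part of the description above.

```python
class RidTable:
    """One entry per bucket: (rid, flow, buffer_addr)."""

    def __init__(self, buckets: int = 256):
        if buckets <= 0 or buckets & (buckets - 1):
            raise ValueError("bucket count must be a power of two")
        self._mask = buckets - 1
        self._table = [None] * buckets

    def register(self, rid: int, flow: int, buffer_addr: int) -> None:
        # A later registration that hashes to the same bucket replaces
        # the earlier one; a real design needs a collision policy.
        self._table[rid & self._mask] = (rid, flow, buffer_addr)

    def lookup(self, rid: int, flow: int):
        """Return the buffer address, or None if the RID is unknown
        or belongs to a different TCP flow."""
        entry = self._table[rid & self._mask]
        if entry is None:
            return None
        entry_rid, entry_flow, buffer_addr = entry
        if entry_rid != rid or entry_flow != flow:
            return None          # verify before using the address
        return buffer_addr

table = RidTable()
table.register(rid=7, flow=3, buffer_addr=0x10000)
```

A lookup with the right RID but the wrong flow returns None, so a
segment on another connection cannot steer data into this buffer.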

-Costa




From owner-tcp-impl@lerc.nasa.gov  Mon Feb 28 10:48:48 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA12209
	for <tcpimpl-archive@odin.ietf.org>; Mon, 28 Feb 2000 10:48:47 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id HAA26435
	for tcp-impl-outgoing; Mon, 28 Feb 2000 07:52:03 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id HAA26413
	for <tcp-impl@grc.nasa.gov>; Mon, 28 Feb 2000 07:52:01 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id HAA24184; Mon, 28 Feb 2000 07:52:00 -0500 (EST)
Received: from lightning.swansea.uk.linux.org(194.168.151.1) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma024150; Mon, 28 Feb 00 07:51:39 -0500
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
	id 12PPeo-0001qn-00; Mon, 28 Feb 2000 12:52:14 +0000
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
To: cheriton@cisco.com (David R. Cheriton)
Date: Mon, 28 Feb 2000 12:52:11 +0000 (GMT)
Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), julian_satran@il.ibm.com,
        David.Robinson@EBay.Sun.COM (David Robinson), ips@ece.cmu.edu,
        tcp-impl@grc.nasa.gov
In-Reply-To: <3.0.3.32.20000227175017.006baa14@ruby.cisco.com> from "David R. Cheriton" at Feb 27, 2000 05:50:17 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E12PPeo-0001qn-00@the-village.bc.nu>
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Content-Transfer-Encoding: 7bit

> Given you are concerned about Appletalk and PCI, perhaps we just
> need to agree you are addressing a different performance level of
> storage system.  I see no reason for hardware support at Appletalk

No, we are addressing the same level of performance in a general-purpose OS;
the difference is that my case doesn't need weird protocol hacks. When working
with a dedicated storage system RDMA becomes even less interesting because
you code the stack to the needs of the filer. You don't even need an MMU
on such kit.

> proposed NIC if this is really to be taken as a serious alternative
> to an RDMA option.  I for one don't find it a competitive design
> (it is also one that was previously considered.)

I don't find RDMA a credible, useful solution at the protocol level. It
doesn't offer any visible advantage, and it complicates the stack further, thus
punishing the majority in the interest of the few.

I'm interested in why you think such a NIC wouldn't work. Providing you have
interrupt mitigation I see no reason for it to fail. I'd be interested
to know what the flaws in that technique were in your eyes.

> doing this, so I regard this as a done deal.  (I'd hate
> to be a high-speed NIC vendor that can't do this.)

But can you do it at $6 a part in volume? That's what the other 99.9% of the
people care about.

> The only issue is how to handle the next level of 
> protocols that have high performance requirements, i.e. storage.

ST already has this sort of stuff figured out.

> Also, NICs with hardware enhancements for protocol support,
> such as the Intel Gigabit Ethernet NIC have been very successful
> both in performance and market.  So, expect more there.

Intel provide no useful documentation so that is hard to evaluate.

> Finally, the RDMA option is targeted to allow use of standard
>  Internet protocols for SANs in place of  specialized protocols
> and networks such as  FibreChannel, especially for non-local 
> access.

So is ST and ST is proven technology. If you are going to break the protocol
to add hacks to it you might as well design the protocol properly based on the
past twenty years of learning where TCP is hard to get right. You need IP to
be compatible you don't need TCP.

If you want to solve the generic problem then you don't need RDMA anyway

Alan



From owner-tcp-impl@lerc.nasa.gov  Mon Feb 28 15:35:47 2000
Received: from lombok-fi.lerc.nasa.gov (lombok-fi.lerc.nasa.gov [139.88.112.33])
	by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA19552
	for <tcpimpl-archive@odin.ietf.org>; Mon, 28 Feb 2000 15:35:44 -0500 (EST)
Received: (from listserv@localhost)
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) id MAA12791
	for tcp-impl-outgoing; Mon, 28 Feb 2000 12:44:02 -0500 (EST)
Received: from seraph3.lerc.nasa.gov (firewall-user@guardian03.lerc.nasa.gov [139.88.146.12])
	by lombok-fi.lerc.nasa.gov (NASA LeRC 8.9.1.1/8.9.1) with ESMTP id MAA12745
	for <tcp-impl@grc.nasa.gov>; Mon, 28 Feb 2000 12:44:00 -0500 (EST)
Received: by seraph3.lerc.nasa.gov; id MAA03254; Mon, 28 Feb 2000 12:43:57 -0500 (EST)
Received: from calcite.rhyolite.com(38.159.140.3) by seraph3.lerc.nasa.gov via smap (V5.0)
	id xma003122; Mon, 28 Feb 00 12:43:11 -0500
Received: (from vjs@localhost)
	by calcite.rhyolite.com (8.9.3/calcite) id KAA14348
	env-from <vjs>;
	Mon, 28 Feb 2000 10:43:01 -0700 (MST)
Date: Mon, 28 Feb 2000 10:43:01 -0700 (MST)
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Message-Id: <200002281743.KAA14348@calcite.rhyolite.com>
To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: NFS Header/data parsing and RDMA
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk

> From: Costa Sapuntzakis <csapuntz@cisco.com>

> Ok, so doing NFSv2/v3 header/data splitting is easy on an in-order
> TCP stream because NFS has fixed-length trailers. Here's a little
> technique:
> ...

> Note, to do this with NFS/TCP, your NIC has to do some primitive
> level of TCP processing (at least keep track of flows). It also
> needs to understand RPC/TCP message boundaries.

Do I understand correctly that you're applying the familiar
NFS/UDP page flipping tactic to NFS/TCP?

> Are there significantly simpler approaches than this? 

1. How about using NFS/UDP instead of NFS/TCP?
  It's well known in the NFS community that NFSv2-3/TCP is no faster or
  otherwise better than NFSv2-3/UDP except over very narrow or at least
  rather long pipes.  (Recall also the congestion control and avoidance
  mechanisms in some NFSv2-3/UDP implementations.)

2. Use NFS/TCP, but send every RPC/XDR transaction in a single TCP segment,
  and use IP fragmentation to fit the MTU.  This tactic was used 10+
  years ago in the FDDI adapters of some super computers.  It does have
  the problems of IP fragmentation, but those problems are rarely
  encountered where NFS is used.

> NFSv4 doesn't seem to have fixed length trailers and neither
> does CIFS in all cases. And it looks like it will be costly to parse 
> NFSv4 headers. 

I've not been paying attention to NFSv4.  A quick skim of the draft
suggests that it will not displace NFSv2/3 in the environments where NFS
is currently popular.  NFSv4 certainly has nothing to do with anything
like SCSI over IP.  I'm also far from convinced that NFSv4 has got some
of the extensions close enough to the underlying real filesystems to be
popular.  Even if I'm wrong, it will be years before NFSv4 is widely used.
While I think there are ways to page flip NFSv4 without special hardware,
I don't think they are worth talking about yet.  Even if I'm also wrong
about that, it is years early to be modifying TCP/IP to support NFSv4.
No one can see what NFSv4 will be like when it is popular enough to justify
modifying TCP today, if NFSv4 ever is popular.


> RDMA still has the following features:
>
> - Per-packet (Works with arbitrary out-of-order reception of TCP
> segments)
> - Fixed header that's generic across all protocols (NFSv4, v5, AFS,
> DFS, CIFS, etc..) 
> - No page flipping necessary on solicited transfers
> - Message boundary bit (which is admittedly orthogonal to RDMA) allows
> out-of-order processing on TCP receive buffer. Decreases parsing latency,
> esp. in the face of packet drops.
> ...

Knowing to which buffer an out-of-order TCP segment belongs is something
that I don't see how to do without something like RDMA.  However,
out-of-order TCP segments are both very rare and very bad for TCP
performance, regardless of whether RDMA is present.  Out of order
TCP segments must be even more rare in storage networks.

Talk about NFSv5 or even AFS/DFS does the opposite of making me think there
might be something good in RDMA.  And as I've said, it's years too early
to justify RDMA with NFSv4.

With existing techniques, if you don't want to page flip, you don't need
to.  If you are able to provide enough distinct application buffer streams
to the NIC for RDMA, then you could do the same for other techniques.

What's that about "parsing latency" and what does it have to do with 
lost segments?  Are you proposing to deliver TCP data to applications
out of order?  I trust not!

   ....

] From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>

] ...
]No, that situation doesn't require any hardware support.  However, a zero-copy 
] receive path is not the only element of RDMA - RDMA was designed (I suppose 
] from the discussion here) specifically to address header/payload issues for 
] storage protocols.  Clearly one can do zero-copy receive with changes to the 
] API and no hardware/firmware modifications.  But with no special hardware 
] support, flipping the payload into some page with alignment constraints will 
] require another copy.

What about the many systems that have been page flipping NFS in and out
of buffer caches for more than 10 years, with no changes to APIs or special
silicon?

]There is one exception to my last statement that I know of:  If you pre-adjust 
]the hardware receive buffers to make the payload align on a page boundary, you 
] can flip the page into the buffer cache for (hopefully) the common case.  
] However, this requires the ability to tune these header offsets and will only 
] work for one protocol at a time (mostly).

The page flipping systems I've worked on did not tune header offsets and
worked on more than one protocol.  (Given your email address, it might be
interesting to check the old IRIX source trees.  Besides the NFS kernel
code and the HIPPI, ATM, and FDDI drivers and firmware, check cmd/rcp and
cmd/rsh.)  UDP page flipping is trivial on protocols that have no trailers.
It requires trivial smarts in the NIC and much simpler buffer allocation
by the NIC than RDMA requires.  (I suspect RDMA needs pools of buffers
for every stream, while the classic tactic needs only two pools, "little"
and "pages"....well, for tiny improvements I've also done it with "little",
"medium" and "pages".)

] Realistically, who is going to be running a storage system that requires so 
] much bandwidth that avoiding receive copies is necessary, and runs on generic 
] NICs with no firmware/ASIC modifications possible?  So I think using modified 
] hardware is completely reasonable in those circumstances.
] ...

Even more reasonable than special hardware are modified API's and protocols
and other steps, including ensuring that out-of-order packets are very
rare, and that header offsets are few, fixed, known, and friendly.

How would you have out-of-order arrival on a storage network, other than
due to bit rot in the wires, and what storage network is going to have
significant bit rot?


Vernon Schryver    vjs@rhyolite.com


