Audio/Video Transport WG Meeting Report
17-Mar-92, San Diego

1. Introduction: Goals, Scope of this working group

The AVT WG met for three sessions on Tuesday in San Diego. Audio from the presentations and discussions at these sessions was "audiocast" via UDP and IP multicast to participants at a number of locations ranging from Australia to the UK, and the remote participants were able to ask questions over the return path.

The purpose of this working group is to specify one or more experimental protocols to foster interoperation among multiple packet audio/video implementations in experiments such as this audiocast. The focus of the WG is short-term (see the charter). Our first goal is to have the protocols defined and experimental implementations running in time for use in a second audiocast at the July 1992 IETF meeting. Therefore, in this meeting we dove right into a discussion of what the protocol should look like.

2. Data packet header formats for real-time audio and video

We need a "transport" protocol for real-time, continuous media. That means we don't want the retransmission and flow control of TCP, but we do want sequencing and checksumming. We could define a new protocol to fit directly over IP, but in keeping with the short-term scope of this working group, we chose to fit a new protocol over IP+UDP so it can be deployed quickly. Alternatively, another protocol that provides the necessary functions, such as ST-II, can be used. Those functions are port addressing, length, and (optional) checksumming. The missing function is sequencing.

Steve Casner described the data packet format of the Network Voice Protocol (NVP-II), which was serving this function for the audiocast of this meeting. The header is efficient (only 4 octets), but that makes some of the fields too small to support current requirements. To begin discussion of a replacement, the following strawman protocol with only two fields was proposed:

o 32-bit Timestamp (16 bits of seconds + 16-bit fraction)
o Sequence Number (could be less than 32 bits)

There was substantial discussion of the nature of the timestamp. It must have sufficient range to cover any network delay (segment lifetime) that might be expected, and it must have sufficient resolution to allow the desired degree of superposition and coordination among media streams. The bit allocation shown has a range of about 18 hours and a resolution of about 15 microseconds. The timestamp could be synchronous with the media sampling clock, in which case it would tick at the nominal sampling rate and drift with respect to real time, or it could be synchronous with real time. In the latter case, the timestamp could represent absolute real time if it were defined to be the middle 32 bits of a Network Time Protocol (NTP) timestamp, or it could be merely relative to real time. For purposes of synchronization among multiple media sources, real-time timestamps should be used, though they need not be absolute.

Julio Escobar from BBN gave a presentation on the Synchronization Protocol. It is based on globally synchronized clocks (e.g., using NTP) and defines a set of control protocol exchanges to establish an equalization delay for synchronized playback. Its only requirement on the data packet format is that a real-time-synchronous, relative timestamp be carried.

Although the timestamp field can be used to sequence the packets, it cannot be used to detect lost packets for media such as voice that suppress transmission when there is no activity. The sequence number serves that function. It could be smaller than 32 bits because the timestamp disambiguates wrap-around within the maximum segment lifetime. The number of bits should be large enough that the loss of exactly one full sequence space of packets is a rare enough event that failure to detect it is acceptable.
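As a concrete illustration of the strawman header, here is a minimal sketch in C. The names (avt_hdr, ntp_middle32, seq_gap) are invented for illustration and are not part of any proposal; the sketch simply assumes the timestamp is taken as the middle 32 bits of an NTP timestamp, as discussed above.

    #include <stdint.h>

    /* Strawman data packet header: a 32-bit timestamp (16 bits of
     * seconds plus a 16-bit fraction) followed by a sequence number.
     * On the wire both fields would be in network byte order. */
    struct avt_hdr {
        uint32_t timestamp;   /* 16.16 fixed-point seconds */
        uint32_t sequence;    /* increments by one per packet sent */
    };

    /* An NTP timestamp is 64 bits: 32 bits of seconds plus a 32-bit
     * fraction.  Its middle 32 bits give the 16.16 format above: a
     * range of 2^16 seconds (about 18 hours) and a resolution of
     * 2^-16 seconds (about 15 microseconds). */
    static uint32_t ntp_middle32(uint32_t ntp_sec, uint32_t ntp_frac)
    {
        return ((ntp_sec & 0xffffu) << 16) | (ntp_frac >> 16);
    }

    /* Lost packets show up as a gap between consecutive sequence
     * numbers; unsigned arithmetic handles wrap-around. */
    static uint32_t seq_gap(uint32_t prev, uint32_t next)
    {
        return next - prev - 1;   /* 0 when no packets were lost */
    }

A sequence field smaller than 32 bits would work the same way with correspondingly narrower unsigned arithmetic.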
Steve Deering proposed some additional fields/functions to be included in the data packet header:

o Checksum (to validate decryption)
o Version Number
o Encoding Type

The UDP checksum cannot be used to validate decryption because it must be applied after encryption, so a separate checksum would be required. An alternative that would not require an additional field, but would require more complex processing, is to use the successful decryption of several properly sequenced packets as the validation of the key. On the other hand, including a checksum at this level, covering either just the header or the header plus data, would also be useful with the ST-II protocol, which does not checksum higher-layer protocols.

A version number would allow implementations to distinguish among multiple versions of the protocol.

The encoding type field might be used for several purposes. It could identify the particular compression algorithm used so that the receiver could select the correct decompression. However, if that selection would be constant over the life of the session, it could be communicated in an out-of-band control protocol. If multiple media are sent on one port number, then an additional level of demultiplexing would be needed, and the encoding field could serve that purpose. For layered (embedded) coding schemes, a field is needed to identify the separate layers, but this field might be here or might be consigned to the application-layer protocol. For the network to process the separate layers at different priorities, it is expected that some priority field would be needed in the network layer.

Finally, two fields from other packet audio protocols were considered:

o Energy Level (from Xerox PARC Phoenixphone)
o Cumulative Delay (from CCITT G.764)

For audio packets, the energy level is an indication of the sound volume in the packet. This may be useful to the receiver when mixing audio streams, for example, though it could be recalculated by the receiver rather than being carried in the packet. The CCITT Recommendation G.764 Packetized Voice Protocol includes a field that records the cumulative variable queueing delays experienced by a packet in traversing the network. This may be useful for deadline-scheduling of packet forwarding, but it was decided that those experimenting with such algorithms would need to add the field in some lower layer.
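To make the field discussion concrete, the sketch below folds the proposed additions into one hypothetical layout and shows the standard Internet ones-complement checksum a receiver might recompute after decryption to validate its key. The field widths, ordering, and names here are assumptions for illustration only; none of this was decided at the meeting.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical extended header: version, encoding type, and a
     * checksum placed ahead of the strawman timestamp/sequence. */
    struct avt_ext_hdr {
        uint8_t  version;     /* protocol version number */
        uint8_t  encoding;    /* compression algorithm / demux tag */
        uint16_t checksum;    /* over the header, or header + data */
        uint32_t timestamp;   /* 16.16 fixed point, as before */
        uint32_t sequence;
    };

    /* Standard 16-bit ones-complement (Internet) checksum.  After
     * decrypting a packet, the receiver recomputes this over the
     * plaintext (with the checksum field zeroed) and compares the
     * result with the transmitted field; a mismatch suggests a bad
     * key or a damaged packet. */
    static uint16_t in_cksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len  -= 2;
        }
        if (len)                  /* pad a trailing odd byte */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)         /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

Such a checksum would also cover the gap left by ST-II's lack of a higher-layer checksum, at the cost of extra header octets, which must be weighed against the alternative of validating the key from several properly sequenced decryptions, as noted above.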
3. Field inclusion criteria

We did not attempt to decide "in real time" which fields/functions should be included or excluded; further discussion is expected via email. Instead, we established some criteria for the inclusion of these and other fields in a real-time transport protocol:

- What percentage of applications would require the field? If only a small percentage, the field should be left to the application layer.

- What application functions are we trying to support with these fields? We may be able to combine functions by choosing the fields carefully.

- How should we trade off network bandwidth against processing and the complexity of control algorithms? (The discussions of the checksum and energy fields are examples.)

- Would the field be constant in all packets at a given demultiplexing level? If so, that information could be implicit and carried in an out-of-band control protocol. Or is there a need for the data to be self-describing?

- Does the field/function "belong" at this level? Considerations include overlap with other layers, aesthetics, common practice, and understanding.

4. Addressing

In the third session we discussed how addressing (multiplexing) should be divided among the layers. Steve Deering explained:

- The IP multicast address should identify a particular session or set of recipients. Two different sets of recipients should have two different addresses.

- The destination port address must be the same for all recipients if the packets are to be multicast, so the destination port must be administratively, not dynamically, assigned. Since the space of well-known port numbers is small, we can't assign separate ports for each kind of data in a multimedia session. It may be appropriate to have a control port and a data port, or perhaps to distinguish major data types, such as audio and video. Source port numbers are dynamically assigned and can distinguish multiple participants at one IP address.

- If there are multiple flows (e.g., audio and video) to one multicast address, it may be necessary to include another level of demultiplexing in the audio/video transport layer. This relates to the "encoding" field mentioned earlier.

Further discussion is needed to decide how much multiplexing should occur at each layer. There are considerations both of address space and of implementation (whether it is better to read multiple media on one socket or on separate sockets, for example).
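For reference, the receiver side of this addressing scheme maps directly onto the BSD sockets IP multicast interface: bind the administratively assigned data port and join the session's multicast group. In this sketch, open_session is an invented helper, the group address and port are placeholders rather than assigned values, and error handling is omitted.

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Open a UDP socket bound to the session's data port and join
     * the session's IP multicast group.  Every recipient binds the
     * same destination port; the multicast address identifies the
     * session (the set of recipients). */
    int open_session(const char *group, unsigned short port)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sin;
        struct ip_mreq mr;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port        = htons(port);   /* administratively assigned */
        bind(s, (struct sockaddr *)&sin, sizeof(sin));

        mr.imr_multiaddr.s_addr = inet_addr(group);   /* session address */
        mr.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mr, sizeof(mr));
        return s;
    }

Whether an application opens one such socket per medium or reads several media from one socket (demultiplexing on an encoding field) is exactly the open implementation question noted above.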
5. Linkages between data and control

Flexible management of multimedia connections or sessions is the subject of current research and is beyond the short-term scope of this working group. For simple application modes, such as an audiocast on an advertised "channel" (e.g., an IP multicast address), operation is possible with no control protocol at all. For spontaneous communication, there is a pool of 2^16 IP multicast addresses from which an address may be chosen, but then that address must be communicated to the participants. This group may define a simple interim protocol for this purpose as a second step (after the transport protocol). Some inputs to this process would be the "session protocol" used by the vat program, the Connection Control Protocol from ISI, and the DVC control protocol (see the next section).

6. Software Encoding

Listed as a bonus topic on the agenda was a discussion of algorithms and protocols for software encoding of real-time media. This is not a main topic because such protocols should sit at a layer above the transport. However, in keeping with the working group's goal of fostering interoperation and experimentation with packet audio and video, it may be valuable to agree on some (perhaps low-performance) software compression techniques for use until hardware is generally available.

For this purpose, Paul Milazzo from BBN gave an update on the protocol used in the Desktop Video Conference (DVC) program. DVC uses the low-cost VideoPix frame-grabber card for SPARCstations plus software compression to generate video at about 5 frames per second. The DVC protocol communicates sequences of video subimage blocks over UDP and uses TCP for the control connection. A recent enhancement is the ability to decode multiple streams (up to six so far).

7. Further discussion

Thanks to Karen Sollins and Eve Schooler for taking the notes from which these minutes were prepared.

A longer report of the meeting, with more detail, will be posted to the mailing list rem-conf@es.net to stimulate discussion of the issues raised above. It is proposed that we also hold packet audio teleconference meetings as needed to augment the e-mail discussion.