Audio/Video Transport WG Meeting Report
17-Mar-92, San Diego

1. Introduction: Goals, Scope of this working group

The AVT WG met for three sessions on Tuesday in San Diego. Audio from the presentations and discussions at these sessions was "audiocast" via UDP and IP multicast to participants at a number of locations ranging from Australia to the UK, and the remote participants were able to ask questions over the return path.

The purpose of this working group is to specify one or more experimental protocols to foster interoperation among multiple packet audio/video implementations in experiments such as this audiocast. The focus of the WG is short-term (see the charter). Our first goal is to have the protocols defined and experimental implementations running in time for use in a second audiocast at the July 1992 IETF meeting. Therefore, in this meeting we dove right into a discussion of what the protocol should look like.

2. Data packet header formats for real-time audio and video

We need a "transport" protocol for real-time, continuous media. That means we don't want the retransmission and flow control of TCP, but we do want sequencing and checksumming. We could define a new protocol to fit directly over IP, but in keeping with the short-term scope of this working group, we chose to fit a new protocol over IP+UDP so it can be deployed quickly. Alternatively, another protocol that provides the necessary functions, such as ST-II, can be used. Those functions are port addressing, length, and (optional) checksumming. The missing function is sequencing.

Steve Casner described the data packet format of the Network Voice Protocol (NVP-II), which was serving this function for the audiocast of this meeting. The header is efficient (only 4 octets), but that makes some of the fields too small to support current requirements. To begin discussion of a replacement, the following strawman protocol with only two fields was proposed:

o 32-bit Timestamp (16 bits of seconds + 16-bit fraction)
o Sequence Number (could be less than 32 bits)

There was substantial discussion of the nature of the timestamp. It must have sufficient range to cover any network delay (segment lifetime) that might be expected, and it must have sufficient resolution to allow the desired degree of superposition and coordination among media streams. The bit allocation shown has a range of about 18 hours and a resolution of about 15 microseconds. The timestamp could be synchronous with the media sampling clock, in which case it would tick at the nominal sampling rate and drift with respect to real time, or it could be synchronous with real time. In the latter case, the timestamp could represent absolute real time if it were defined to be the middle 32 bits of a Network Time Protocol (NTP) timestamp, or it could be merely relative to real time. For purposes of synchronization among multiple media sources, real-time timestamps should be used, though they need not be absolute.

Julio Escobar from BBN gave a presentation on the Synchronization Protocol. It is based on globally synchronized clocks (e.g., using NTP) and defines a set of control protocol exchanges to establish an equalization delay for synchronized playback. Its only requirement on the data packet format is that a real-time-synchronous, relative timestamp be carried.

Although the timestamp field can be used to sequence the packets, it cannot be used to detect lost packets for media such as voice that suppress transmission when there is no activity. The sequence number serves that function. It could be smaller than 32 bits because the timestamp disambiguates wrap-around within the maximum segment lifetime. The number of bits should be large enough that the loss of exactly one full sequence space of packets is a rare enough event that failure to detect it is acceptable.
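As a concrete illustration of the strawman header, here is a minimal sketch in C. The names (avt_hdr, ntp_middle32, seq_gap) are invented for illustration and are not part of any proposal; the sketch simply assumes the timestamp is taken as the middle 32 bits of an NTP timestamp, as discussed above.

    #include <stdint.h>

    /* Strawman data packet header: a 32-bit timestamp (16 bits of
     * seconds plus a 16-bit fraction) followed by a sequence number.
     * On the wire both fields would be in network byte order. */
    struct avt_hdr {
        uint32_t timestamp;   /* 16.16 fixed-point seconds */
        uint32_t sequence;    /* increments by one per packet sent */
    };

    /* An NTP timestamp is 64 bits: 32 bits of seconds plus a 32-bit
     * fraction.  Its middle 32 bits give the 16.16 format above: a
     * range of 2^16 seconds (about 18 hours) and a resolution of
     * 2^-16 seconds (about 15 microseconds). */
    static uint32_t ntp_middle32(uint32_t ntp_sec, uint32_t ntp_frac)
    {
        return ((ntp_sec & 0xffffu) << 16) | (ntp_frac >> 16);
    }

    /* Lost packets show up as a gap between consecutive sequence
     * numbers; unsigned arithmetic handles wrap-around. */
    static uint32_t seq_gap(uint32_t prev, uint32_t next)
    {
        return next - prev - 1;   /* 0 when no packets were lost */
    }

A sequence field smaller than 32 bits would work the same way with correspondingly narrower unsigned arithmetic.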
Steve Deering proposed some additional fields/functions to be included in the data packet header:

o Checksum (to validate decryption)
o Version Number
o Encoding Type

The UDP checksum cannot be used to validate decryption because it must be applied after encryption, so a separate checksum would be required. An alternative that would not require an additional field, but would require more complex processing, is to use the successful decryption of several properly sequenced packets as the validation of the key. On the other hand, including a checksum at this level, covering either just the header or the header plus data, would also be useful with the ST-II protocol, which does not checksum higher-layer protocols.

A version number would allow implementations to distinguish among multiple versions of the protocol.

The encoding type field might be used for several purposes. It could identify the particular compression algorithm used so that the receiver could select the correct decompression. However, if that selection would be constant over the life of the session, it could be communicated in an out-of-band control protocol. If multiple media are sent on one port number, then an additional level of demultiplexing would be needed, and the encoding field could serve that purpose. For layered (embedded) coding schemes, a field is needed to identify the separate layers, but this field might be here or might be consigned to the application-layer protocol. For the network to process the separate layers at different priorities, it is expected that some priority field would be needed in the network layer.

Finally, two fields from other packet audio protocols were considered:

o Energy Level (from Xerox PARC Phoenixphone)
o Cumulative Delay (from CCITT G.764)

For audio packets, the energy level is an indication of the sound volume in the packet. This may be useful to the receiver when mixing audio streams, for example, though it could be recalculated by the receiver rather than being carried in the packet. The CCITT Recommendation G.764 Packetized Voice Protocol includes a field that records the cumulative variable queueing delays experienced by a packet in traversing the network. This may be useful for deadline-scheduling of packet forwarding, but it was decided that those experimenting with such algorithms would need to add the field in some lower layer.
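To make the field discussion concrete, the sketch below folds the proposed additions into one hypothetical layout and shows the standard Internet ones-complement checksum a receiver might recompute after decryption to validate its key. The field widths, ordering, and names here are assumptions for illustration only; none of this was decided at the meeting.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical extended header: version, encoding type, and a
     * checksum placed ahead of the strawman timestamp/sequence. */
    struct avt_ext_hdr {
        uint8_t  version;     /* protocol version number */
        uint8_t  encoding;    /* compression algorithm / demux tag */
        uint16_t checksum;    /* over the header, or header + data */
        uint32_t timestamp;   /* 16.16 fixed point, as before */
        uint32_t sequence;
    };

    /* Standard 16-bit ones-complement (Internet) checksum.  After
     * decrypting a packet, the receiver recomputes this over the
     * plaintext (with the checksum field zeroed) and compares the
     * result with the transmitted field; a mismatch suggests a bad
     * key or a damaged packet. */
    static uint16_t in_cksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len  -= 2;
        }
        if (len)                  /* pad a trailing odd byte */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)         /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

Such a checksum would also cover the gap left by ST-II's lack of a higher-layer checksum, at the cost of extra header octets, which must be weighed against the alternative of validating the key from several properly sequenced decryptions, as noted above.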
3. Field inclusion criteria

We did not attempt to decide "in real time" which fields/functions should be included or excluded; further discussion is expected via email. Instead, we established some criteria for the inclusion of these and other fields in a real-time transport protocol:

- What percentage of applications would require the field? If only a small percentage, the field should be left to the application layer.

- What application functions are we trying to support with these fields? We may be able to combine functions by choosing the fields carefully.

- How should we trade off network bandwidth against processing and the complexity of control algorithms? (The discussions of the checksum and energy fields are examples.)

- Would the field be constant in all packets at a given demultiplexing level? If so, that information could be implicit and carried in an out-of-band control protocol. Or is there a need for the data to be self-describing?

- Does the field/function "belong" at this level? Considerations include overlap with other layers, aesthetics, common practice, and understanding.

4. Addressing

In the third session we discussed how addressing (multiplexing) should be divided among the layers. Steve Deering explained:

- The IP multicast address should identify a particular session or set of recipients. Two different sets of recipients should have two different addresses.

- The destination port address must be the same for all recipients if the packets are to be multicast, so the destination port must be administratively, not dynamically, assigned. Since the space of well-known port numbers is small, we can't assign separate ports for each kind of data in a multimedia session. It may be appropriate to have a control port and a data port, or perhaps to distinguish major data types, such as audio and video. Source port numbers are dynamically assigned and can distinguish multiple participants at one IP address.

- If there are multiple flows (e.g., audio and video) to one multicast address, it may be necessary to include another level of demultiplexing in the audio/video transport layer. This relates to the "encoding" field mentioned earlier.

Further discussion is needed to decide how much multiplexing should occur at each layer. There are considerations both of address space and of implementation (whether it is better to read multiple media on one socket or on separate sockets, for example).
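For reference, the receiver side of this addressing scheme maps directly onto the BSD sockets IP multicast interface: bind the administratively assigned data port and join the session's multicast group. In this sketch, open_session is an invented helper, the group address and port are placeholders rather than assigned values, and error handling is omitted.

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Open a UDP socket bound to the session's data port and join
     * the session's IP multicast group.  Every recipient binds the
     * same destination port; the multicast address identifies the
     * session (the set of recipients). */
    int open_session(const char *group, unsigned short port)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sin;
        struct ip_mreq mr;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port        = htons(port);   /* administratively assigned */
        bind(s, (struct sockaddr *)&sin, sizeof(sin));

        mr.imr_multiaddr.s_addr = inet_addr(group);   /* session address */
        mr.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mr, sizeof(mr));
        return s;
    }

Whether an application opens one such socket per medium or reads several media from one socket (demultiplexing on an encoding field) is exactly the open implementation question noted above.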
5. Linkages between data and control

Flexible management of multimedia connections or sessions is the subject of current research and is beyond the short-term scope of this working group. For simple application modes, such as an audiocast on an advertised "channel" (e.g., an IP multicast address), operation is possible with no control protocol at all. For spontaneous communication, there is a pool of 2^16 IP multicast addresses from which an address may be chosen, but then that address must be communicated to the participants. This group may define a simple interim protocol for this purpose as a second step (after the transport protocol). Some inputs to this process would be the "session protocol" used by the vat program, the Connection Control Protocol from ISI, and the DVC control protocol (see the next section).

6. Software Encoding

Listed as a bonus topic on the agenda was a discussion of algorithms and protocols for software encoding of real-time media. This is not a main topic because such protocols should sit at a layer above the transport. However, in keeping with the working group's goal of fostering interoperation and experimentation with packet audio and video, it may be valuable to agree on some (perhaps low-performance) software compression techniques for use until hardware is generally available.

For this purpose, Paul Milazzo from BBN gave an update on the protocol used in the Desktop Video Conference (DVC) program. DVC uses the low-cost VideoPix frame-grabber card for SPARCstations plus software compression to generate video at about 5 frames per second. The DVC protocol communicates sequences of video subimage blocks over UDP and uses TCP for the control connection. A recent enhancement is the ability to decode multiple streams (up to six so far).

7. Further discussion

Thanks to Karen Sollins and Eve Schooler for taking the notes from which these minutes were prepared.

A longer report of the meeting, with more detail, will be posted to the mailing list rem-conf@es.net to stimulate discussion of the issues raised above. It is proposed that we also hold packet audio teleconference meetings as needed to augment the e-mail discussion.