Internet Engineering Task Force (IETF)                 M. Duckworth, Ed.
Request for Comments: 8845                                       Polycom
Category: Standards Track                                   A. Pepperell
ISSN: 2070-1721                                                    Acano
                                                               S. Wenger
                                                               June 2020

                Framework for Telepresence Multi-Streams


   This document defines a framework for a protocol to enable devices in
   a telepresence conference to interoperate.  The protocol enables
   communication of information about multiple media streams so a
   sending system and receiving system can make reasonable decisions
   about transmitting, selecting, and rendering the media streams.  This
   protocol is used in addition to SIP signaling and Session Description
   Protocol (SDP) negotiation for setting up a telepresence session.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained from the RFC
   Editor's information page for this document.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Definitions
   4.  Overview and Motivation
   5.  Description of the Framework/Model
   6.  Spatial Relationships
   7.  Media Captures and Capture Scenes
     7.1.  Media Captures
       7.1.1.  Media Capture Attributes
     7.2.  Multiple Content Capture
       7.2.1.  MCC Attributes
     7.3.  Capture Scene
       7.3.1.  Capture Scene Attributes
       7.3.2.  Capture Scene View Attributes
     7.4.  Global View List
   8.  Simultaneous Transmission Set Constraints
   9.  Encodings
     9.1.  Individual Encodings
     9.2.  Encoding Group
     9.3.  Associating Captures with Encoding Groups
   10. Consumer's Choice of Streams to Receive from the Provider
     10.1.  Local Preference
     10.2.  Physical Simultaneity Restrictions
     10.3.  Encoding and Encoding Group Limits
   11. Extensibility
   12. Examples - Using the Framework (Informative)
     12.1.  Provider Behavior
       12.1.1.  Three-Screen Endpoint Provider
       12.1.2.  Encoding Group Example
       12.1.3.  The MCU Case
     12.2.  Media Consumer Behavior
       12.2.1.  One-Screen Media Consumer
       12.2.2.  Two-Screen Media Consumer Configuring the Example
       12.2.3.  Three-Screen Media Consumer Configuring the Example
     12.3.  Multipoint Conference Utilizing Multiple Content Captures
       12.3.1.  Single Media Captures and MCC in the Same
               Advertisement
       12.3.2.  Several MCCs in the Same Advertisement
       12.3.3.  Heterogeneous Conference with Switching and
               Composition
       12.3.4.  Heterogeneous Conference with Voice-Activated
               Switching
   13. IANA Considerations
   14. Security Considerations
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Acknowledgements
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each
   other.  A major factor limiting the interoperability of telepresence
   systems is the lack of a standardized way to describe and negotiate
   the use of multiple audio and video streams comprising the media
   flows.  This document provides a framework for protocols to enable
   interoperability by handling multiple streams in a standardized way.
   The framework is intended to support the use cases described in "Use
   Cases for Telepresence Multistreams" [RFC7205] and to meet the
   requirements in "Requirements for Telepresence Multistreams"
   [RFC7262].  This includes cases using multiple media streams that are
   not necessarily telepresence.

   The basic session setup for the use cases is based on SIP [RFC3261]
   and SDP offer/answer [RFC3264].  In addition to basic SIP & SDP
   offer/answer, signaling that is ControLling mUltiple streams for
   tElepresence (CLUE) specific is required to exchange the information
   describing the multiple media streams.  The motivation for this
   framework, an overview of the signaling, and the information required
   to be exchanged are described in subsequent sections of this
   document.  Companion documents describe the signaling details
   [RFC8848], the data model [RFC8846], and the protocol [RFC8847].

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Definitions

   This document occasionally refers to the term "CLUE".  CLUE is an
   acronym for "ControLling mUltiple streams for tElepresence", which is
   the name of the IETF working group in which this document and certain
   companion documents have been developed.  Often, CLUE-* refers to
   something that has been designed by the CLUE working group; for
   example, this document may be called the CLUE-framework.

   The terms defined below are used throughout this document and in
   companion documents.  Capitalization is used in order to easily
   identify the use of a defined term.

   Advertisement:  A CLUE message a Media Provider sends to a Media
      Consumer describing specific aspects of the content of the media
      and any restrictions it has in terms of being able to provide
      certain Streams simultaneously.

   Audio Capture (AC):  Media Capture for audio.  Denoted as "ACn" in
      the examples in this document.

   Capture:  Same as Media Capture.

   Capture Device:  A device that converts physical input, such as
      audio, video or text, into an electrical signal, in most cases to
      be fed into a media encoder.

   Capture Encoding:  A specific encoding of a Media Capture, to be sent
      by a Media Provider to a Media Consumer via RTP.

   Capture Scene:  A structure representing a spatial region captured by
      one or more Capture Devices, each capturing media representing a
      portion of the region.  The spatial region represented by a
      Capture Scene may correspond to a real region in physical space,
      such as a room.  A Capture Scene includes attributes and one or
      more Capture Scene Views, with each view including one or more
      Media Captures.

   Capture Scene View (CSV):  A list of Media Captures of the same media
      type that together form one way to represent the entire Capture
      Scene.

   CLUE-capable device:  A device that supports the CLUE data channel
      [RFC8850], the CLUE protocol [RFC8847] and the principles of CLUE
      negotiation; it also seeks CLUE-enabled calls.

   CLUE-enabled call:  A call in which two CLUE-capable devices have
      successfully negotiated support for a CLUE data channel in SDP
      [RFC4566].  A CLUE-enabled call is not necessarily immediately
      able to send CLUE-controlled media; negotiation of the data
      channel and of the CLUE protocol must complete first.  Calls
      between two CLUE-capable devices that have not yet successfully
      completed negotiation of support for the CLUE data channel in SDP
      are not considered CLUE-enabled.

   Conference:  Used as defined in "A Framework for Conferencing within
      the Session Initiation Protocol (SIP)" [RFC4353].

   Configure Message:  A CLUE message a Media Consumer sends to a Media
      Provider specifying which content and Media Streams it wants to
      receive, based on the information in a corresponding
      Advertisement.

   Consumer:  Short for Media Consumer.

   Encoding:  Short for Individual Encoding.

   Encoding Group:  A set of encoding parameters representing a total
      media encoding capability to be subdivided across potentially
      multiple Individual Encodings.

   Endpoint:  A CLUE-capable device that is the logical point of final
      termination through receiving, decoding and rendering, and/or
      initiation through capturing, encoding, and sending of media
      streams.  An endpoint consists of one or more physical devices
      that source and sink media streams, and exactly one [RFC4353]
      Participant (which, in turn, includes exactly one SIP User Agent).
      Endpoints can be anything from multiscreen/multicamera rooms to
      handheld devices.

   Global View:  A set of references to one or more Capture Scene Views
      of the same media type that are defined within Scenes of the same
      advertisement.  A Global View is a suggestion from the Provider to
      the Consumer for one set of CSVs that provide a useful
      representation of all the scenes in the advertisement.

   Global View List:  A list of Global Views included in an
      Advertisement.  A Global View List may include Global Views of
      different media types.

   Individual Encoding:  A set of parameters representing a way to
      encode a Media Capture to become a Capture Encoding.

   Multipoint Control Unit (MCU):  A CLUE-capable device that connects
      two or more endpoints together into one single multimedia
      conference [RFC7667].  An MCU includes a Mixer like that described
      in [RFC4353], without the requirement of [RFC4353] to send media
      to each participant.

   Media:  Any data that, after suitable encoding, can be conveyed over
      RTP, including audio, video, or timed text.

   Media Capture (MC):  A source of Media, such as from one or more
      Capture Devices or constructed from other Media streams.

   Media Consumer:  A CLUE-capable device that intends to receive
      Capture Encodings.

   Media Provider:  A CLUE-capable device that intends to send Capture
      Encodings.

   Multiple Content Capture (MCC):  A Capture that mixes and/or switches
      other Captures of a single type (for example, all audio or all
      video).  Particular Media Captures may or may not be present in
      the resultant Capture Encoding, depending on time or space.
      Denoted as "MCCn" in the example cases in this document.

   Plane of Interest:  The spatial plane within a scene containing the
      most-relevant subject matter.

   Provider:  Same as a Media Provider.

   Render:  The process of generating a representation from media, such
      as displayed motion video or sound emitted from loudspeakers.

   Scene:  Same as a Capture Scene.

   Simultaneous Transmission Set:  A set of Media Captures that can be
      transmitted simultaneously from a Media Provider.

   Single Media Capture:  A capture that contains media from a single
      source capture device, e.g., an audio capture from a single
      microphone or a video capture from a single camera.

   Spatial Relation:  The arrangement of two objects in space, in
      contrast to relation in time or other relationships.

   Stream:  A Capture Encoding sent from a Media Provider to a Media
      Consumer via RTP [RFC3550].

   Stream Characteristics:  The media stream attributes commonly used in
      non-CLUE SIP/SDP environments (such as media codec, bitrate,
      resolution, profile/level, etc.) as well as CLUE-specific
      attributes, such as the Capture ID or a spatial location.

   Video Capture (VC):  Media Capture for video.  Denoted as "VCn" in
      the example cases in this document.

   Video Composite:  A single image that is formed, normally by an RTP
      mixer inside an MCU, by combining visual elements from separate
      sources.

4.  Overview and Motivation

   This section provides an overview of the functional elements defined
   in this document to represent a telepresence or multistream system.
   The motivations for the framework described in this document are also
   summarized.

   Two key concepts introduced in this document are the terms "Media
   Provider" and "Media Consumer".  A Media Provider represents the
   entity that sends the media and a Media Consumer represents the
   entity that receives the media.  A Media Provider provides Media in
   the form of RTP packets; a Media Consumer consumes those RTP packets.
   Media Providers and Media Consumers can reside in Endpoints or in
   Multipoint Control Units (MCUs).  A Media Provider in an Endpoint is
   usually associated with the generation of media for Media Captures;
   these Media Captures are typically sourced from cameras, microphones,
   and the like.  Similarly, the Media Consumer in an Endpoint is
   usually associated with renderers, such as screens and loudspeakers.
   In MCUs, Media Providers and Consumers can have the form of outputs
   and inputs, respectively, of RTP mixers, RTP translators, and similar
   devices.  Typically, telepresence devices, such as Endpoints and
   MCUs, would perform as both Media Providers and Media Consumers, the
   former being concerned with those devices' transmitted media and the
   latter with those devices' received media.  In a few circumstances, a
   CLUE-capable device includes only Consumer or Provider functionality,
   such as recorder-type Consumers or webcam-type Providers.

   The motivations for the framework outlined in this document include
   the following:

   (1)  Endpoints in telepresence systems typically have multiple Media
        Capture and Media Render devices, e.g., multiple cameras and
        screens.  While previous system designs were able to set up
        calls that would capture media using all cameras and display
        media on all screens, for example, there was no mechanism that
        could associate these Media Captures with each other in space
        and time, in a cross-vendor interoperable way.

   (2)  The mere fact that there are multiple capturing and rendering
        devices, each of which may be configurable in aspects such as
        zoom, leads to the difficulty that a variable number of such
        devices can be used to capture different aspects of a region.
        The Capture Scene concept allows for the description of multiple
        setups for those multiple capture devices that could represent
        sensible operation points of the physical capture devices in a
        room, chosen by the operator.  A Consumer can pick and choose
        from those configurations based on its rendering abilities and
        then inform the Provider about its choices.  Details are
        provided in Section 7.

   (3)  In some cases, physical limitations or other reasons disallow
        the concurrent use of a device in more than one setup.  For
        example, the center camera in a typical three-camera conference
        room can set its zoom objective to capture either the middle few
        seats only or all seats of a room, but not both concurrently.
        The Simultaneous Transmission Set concept allows a Provider to
        signal such limitations.  Simultaneous Transmission Sets are
        part of the Capture Scene description and are discussed in
        Section 8.

   (4)  Often, the devices in a room do not have the computational
        complexity or connectivity to deal with multiple encoding
        options simultaneously, even if each of these options is
        sensible in certain scenarios, and even if the simultaneous
        transmission is also sensible (i.e., in case of multicast media
        distribution to multiple endpoints).  Such constraints can be
        expressed by the Provider using the Encoding Group concept,
        which is described in Section 9.

   (5)  Due to the potentially large number of RTP streams required for
        a Multimedia Conference involving potentially many Endpoints,
        each of which can have many Media Captures and media renderers,
        it has become common to multiplex multiple RTP streams onto the
        same transport address, so as to avoid using the port number as
        a multiplexing point and the associated shortcomings such as
        NAT/firewall traversal.  The large number of possible
        permutations of sensible options a Media Provider can make
        available to a Media Consumer makes a mechanism desirable that
        allows it to narrow down the number of possible options that a
        SIP offer/answer exchange has to consider.  Such information is
        made available using protocol mechanisms specified in this
        document and companion documents.  The Media Provider and Media
        Consumer may use information in CLUE messages to reduce the
        complexity of SIP offer/answer messages.  Also, there are
        aspects of the control of both Endpoints and MCUs that
        dynamically change during the progress of a call, such as
        audio-level-based screen switching, layout changes, and so on,
        which need to be conveyed.  Note that these control aspects are
        complementary to those specified in traditional SIP-based
        conference management, such as Binary Floor Control Protocol
        (BFCP).  An exemplary call flow can be found in Section 5.

   Finally, all this information needs to be conveyed, and the notion of
   support for it needs to be established.  This is done by the
   negotiation of a "CLUE channel", a data channel negotiated early
   during the initiation of a call.  An Endpoint or MCU that rejects the
   establishment of this data channel, by definition, does not support
   CLUE-based mechanisms, whereas an Endpoint or MCU that accepts it is
   indicating support for CLUE as specified in this document and its
   companion documents.

5.  Description of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   A Media Provider (transmitting Endpoint or MCU) describes specific
   aspects of the content of the media and the media stream encodings it
   can send in an Advertisement; and the Media Consumer responds to the
   Media Provider by specifying which content and media streams it wants
   to receive in a Configure message.  The Provider then transmits the
   asked-for content in the specified streams.

   This Advertisement and Configure typically occur during call
   initiation, after CLUE has been enabled in a call, but they MAY also
   happen at any time throughout the call, whenever there is a change in
   what the Consumer wants to receive or (perhaps less common) what the
   Provider can send.

   An Endpoint or MCU typically acts as both Provider and Consumer at
   the same time, sending Advertisements and sending Configurations in
   response to receiving Advertisements.  (It is possible to be just one
   or the other.)

   The data model [RFC8846] is based around two main concepts: a Capture
   and an Encoding.  A Media Capture, such as of type audio or video,
   has attributes to describe the content a Provider can send.  Media
   Captures are described in terms of CLUE-defined attributes, such as
   spatial relationships and purpose of the capture.  Providers tell
   Consumers which Media Captures they can provide, described in terms
   of the Media Capture attributes.

   A Provider organizes its Media Captures into one or more Capture
   Scenes, each representing a spatial region, such as a room.  A
   Consumer chooses which Media Captures it wants to receive from the
   Capture Scenes.

   In addition, the Provider can send the Consumer a description of the
   Individual Encodings it can send in terms of identifiers that relate
   to items in SDP [RFC4566].

   The Provider can also specify constraints on its ability to provide
   Media, and a sensible design choice for a Consumer is to take these
   into account when choosing the content and Capture Encodings it
   requests in the later offer/answer exchange.  Some constraints are
   due to the physical limitations of devices; for example, a camera may
   not be able to provide zoom and non-zoom views simultaneously.  Other
   constraints are system based, such as maximum bandwidth.
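
   As an informal illustration of how a Consumer might honor a physical
   constraint of this kind, the sketch below checks whether a requested
   group of Captures fits inside at least one set of Captures that the
   Provider says it can send at the same time.  The function name, the
   capture identifiers, and the list-of-sets representation are
   assumptions made for this example; they are not taken from the CLUE
   data model.

```python
from typing import Iterable, List, Set


def can_send_together(requested: Iterable[str],
                      simultaneous_sets: List[Set[str]]) -> bool:
    """Return True if all requested captures fit in one allowed set."""
    wanted = set(requested)
    return any(wanted <= allowed for allowed in simultaneous_sets)


# Hypothetical room: VC1 is the center camera zoomed in and VC3 is the
# same camera zoomed out, so the two never appear in the same set.
SETS = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]

print(can_send_together(["VC0", "VC1"], SETS))
print(can_send_together(["VC1", "VC3"], SETS))
```

   In this sketch, a request that mixes the zoomed and non-zoomed views
   of the same camera is rejected because no single advertised set
   contains both.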

   The following diagram illustrates the information contained in an
   Advertisement.

   .  Provider Advertisement             +--------------------+      .
   .                                     | Simultaneous Sets  |      .
   .        +------------------------+   +--------------------+      .
   .        |       Capture Scene N  |   +--------------------+      .
   .      +-+----------------------+ |   | Global View List   |      .
   .      |       Capture Scene 2  | |   +--------------------+      .
   .    +-+----------------------+ | |      +----------------------+ .
   .    |  Capture Scene 1       | | |      |  Encoding Group N    | .
   .    |    +---------------+   | | |    +-+--------------------+ | .
   .    |    | Attributes    |   | | |    |   Encoding Group 2   | | .
   .    |    +---------------+   | | |  +-+--------------------+ | | .
   .    |                        | | |  |   Encoding Group 1   | | | .
   .    |    +----------------+  | | |  |     parameters       | | | .
   .    |    |  V i e w s     |  | | |  |      bandwidth       | | | .
   .    |    |  +---------+   |  | | |  | +-------------------+| | | .
   .    |    |  |Attribute|   |  | | |  | | V i d e o         || | | .
   .    |    |  +---------+   |  | | |  | | E n c o d i n g s || | | .
   .    |    |                |  | | |  | | Encoding 1        || | | .
   .    |    | View 1         |  | | |  | |                   || | | .
   .    |    |  (list of MCs) |  | |-+  | +-------------------+| | | .
   .    |    +----|-|--|------+  |-+    |                      | | | .
   .    +---------|-|--|---------+      | +-------------------+| | | .
   .              | |  |                | | A u d i o         || | | .
   .              | |  |                | | E n c o d i n g s || | | .
   .              v |  |                | | Encoding 1        || | | .
   .      +---------|--|--------+       | |                   || | | .
   .      | Media Capture N     |------>| +-------------------+| | | .
   .    +-+---------v--|------+ |       |                      | | | .
   .    | Media Capture 2     | |       |                      | |-+ .
   .  +-+--------------v----+ |-------->|                      | |   .
   .  | Media Capture  1    | | |       |                      |-+   .
   .  |  +----------------+ |---------->|                      |     .
   .  |  | Attributes     | | |_+       +----------------------+     .
   .  |  +----------------+ |_+                                      .
   .  +---------------------+                                        .
   .                                                                 .

                     Figure 1: Advertisement Structure
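
   The containment relationships in Figure 1 can also be read as a set
   of nested records.  The following is a minimal sketch of that
   reading, not the normative schema: every class and field name here is
   hypothetical, and the actual element names are defined by the data
   model document [RFC8846].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MediaCapture:
    capture_id: str                      # e.g. "VC0" or "AC0"
    media_type: str                      # "audio" or "video"
    attributes: Dict[str, str] = field(default_factory=dict)
    encoding_group_id: str = ""          # group able to encode it


@dataclass
class CaptureSceneView:
    capture_ids: List[str]               # captures of one media type


@dataclass
class CaptureScene:
    scene_id: str
    views: List[CaptureSceneView]
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class EncodingGroup:
    group_id: str
    max_bandwidth: int                   # bits per second (assumed unit)
    encoding_ids: List[str]


@dataclass
class Advertisement:
    captures: Dict[str, MediaCapture]
    scenes: List[CaptureScene]
    encoding_groups: Dict[str, EncodingGroup]
    simultaneous_sets: List[List[str]] = field(default_factory=list)
    global_views: List[List[str]] = field(default_factory=list)

    def view_is_consistent(self, view: CaptureSceneView) -> bool:
        """A view lists known captures that all share one media type."""
        if not all(c in self.captures for c in view.capture_ids):
            return False
        types = {self.captures[c].media_type for c in view.capture_ids}
        return len(types) == 1


adv = Advertisement(
    captures={
        "VC0": MediaCapture("VC0", "video", encoding_group_id="EG0"),
        "VC1": MediaCapture("VC1", "video", encoding_group_id="EG0"),
        "AC0": MediaCapture("AC0", "audio", encoding_group_id="EG1"),
    },
    scenes=[CaptureScene("CS1", [CaptureSceneView(["VC0", "VC1"]),
                                 CaptureSceneView(["AC0"])])],
    encoding_groups={
        "EG0": EncodingGroup("EG0", 6_000_000, ["ENC0", "ENC1"]),
        "EG1": EncodingGroup("EG1", 64_000, ["ENC2"]),
    },
)
```

   The consistency check reflects the definition of a Capture Scene
   View as a list of Media Captures of the same media type; the sample
   values are invented for illustration.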

   Figure 2 illustrates the call flow used by a simple system (two
   Endpoints) in compliance with this document.  A very brief outline of
   the call flow is described in the text that follows.

         +-----------+                     +-----------+
         | Endpoint1 |                     | Endpoint2 |
         +----+------+                     +-----+-----+
              | INVITE (BASIC SDP+CLUECHANNEL)   |
              |    200 OK (BASIC SDP+CLUECHANNEL)|
              | ACK                              |
              |                                  |
              |       BASIC MEDIA SESSION        |
              |                                  |
              |    CONNECT (CLUE CTRL CHANNEL)   |
              |            ...                   |
              |                                  |
              | ADVERTISEMENT 1                  |
              |                  ADVERTISEMENT 2 |
              |                                  |
              |                      CONFIGURE 1 |
              | CONFIGURE 2                      |
              |                                  |
              | REINVITE (UPDATED SDP)           |
              |              200 OK (UPDATED SDP)|
              | ACK                              |
              |                                  |
              |     UPDATED MEDIA SESSION        |
              |                                  |
              v                                  v

                      Figure 2: Basic Information Flow

   An initial offer/answer exchange establishes a basic media session,
   for example, audio-only, and a CLUE channel between two Endpoints.
   With the establishment of that channel, the endpoints have consented
   to use the CLUE protocol mechanisms and, therefore, MUST adhere to
   the CLUE protocol suite as outlined herein.

   Over this CLUE channel, the Provider in each Endpoint conveys its
   characteristics and capabilities by sending an Advertisement as
   specified herein.  The Advertisement is typically not sufficient to
   set up all media.  The Consumer in the Endpoint receives the
   information provided by the Provider and can use it for several
   purposes.  It uses it, along with information from an offer/answer
   exchange, to construct a CLUE Configure message to tell the Provider
   what the Consumer wishes to receive.  Also, the Consumer may use the
   information provided to tailor the SDP it is going to send during any
   following SIP offer/answer exchange, and its reaction to SDP it
   receives in that step.  It is often a sensible implementation choice
   to do so.  Spatial relationships associated with the Media can be
   included in the Advertisement, and it is often sensible for the Media
   Consumer to take those spatial relationships into account when
   tailoring the SDP.  The Consumer can also limit the number of
   encodings it must set up resources to receive, and not waste
   resources on unwanted encodings, because it has the Provider's
   Advertisement information ahead of time to determine what it really
   wants to receive.  The Consumer can also use the Advertisement
   information for local rendering decisions.
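
   One way a Consumer could turn an Advertisement into a Configure
   message is sketched below.  The selection rule (take the largest
   advertised view that fits on the local screens, then pair each chosen
   Capture with one Encoding) is only an assumed local policy; this
   document leaves the Consumer's choice unspecified.  The function
   names and the dictionary result are hypothetical and do not reflect
   the CLUE message format.

```python
from typing import Dict, List, Optional


def choose_view(views: List[List[str]],
                screens: int) -> Optional[List[str]]:
    """Pick the largest advertised view that fits the local screens."""
    fitting = [v for v in views if 0 < len(v) <= screens]
    return max(fitting, key=len) if fitting else None


def build_configure(views: List[List[str]], encodings: List[str],
                    screens: int) -> Dict[str, str]:
    """Map each chosen capture to one encoding; empty if none fits."""
    view = choose_view(views, screens)
    if view is None or len(view) > len(encodings):
        return {}
    return dict(zip(view, encodings))


# Hypothetical advertised views and encodings for a video scene.
VIEWS = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4", "VC5"]]
ENCODINGS = ["ENC0", "ENC1", "ENC2"]

print(build_configure(VIEWS, ENCODINGS, screens=2))
```

   A two-screen Consumer using this policy would request the two-capture
   view and ignore the three-capture view, which avoids setting up
   resources for an encoding it cannot render.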

   This initial CLUE exchange is followed by an SDP offer/answer
   exchange that not only establishes those aspects of the media that
   have not been "negotiated" over CLUE, but also has the effect of
   setting up the media transmission itself, involving potentially
   security exchanges, Interactive Connectivity Establishment (ICE), and
   whatnot.  This step is considered "plain vanilla SIP".

   During the lifetime of a call, further exchanges MAY occur over the
   CLUE channel.  In some cases, those further exchanges lead to a
   modified system behavior of Provider or Consumer (or both) without
   any other protocol activity such as further offer/answer exchanges.
   For example, a Configure Message requesting that the Provider place a
   different Capture source into a Capture Encoding, signaled over the
   CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
   re-invites.  However, in  In other cases, however, after the CLUE negotiation negotiation, an
   additional offer/answer exchange becomes necessary.  For example, if
   both sides decide to upgrade the call from a single screen to a
   multi-screen call call, and more bandwidth is required for the additional
   video channels compared to what was previously negotiated using
   offer/answer, a new O/A offer/answer exchange is required.

   One aspect of the protocol outlined herein, and specified in more
   detail in companion documents, is that it makes available to the
   Consumer information regarding the Provider's capabilities to deliver
   Media and attributes related to that Media such as their spatial
   relationship.  The operation of the renderer inside the Consumer is
   unspecified in that it can choose to ignore some information provided
   by the Provider and/or not render media streams available from the
   Provider (although the Consumer follows the CLUE protocol and,
   therefore, gracefully receives and responds to the Provider's
   information using a Configure operation).

   A CLUE-capable device interoperates with a device that does not
   support CLUE.  The CLUE-capable device can determine, by the result
   of the initial offer/answer exchange, if the other device supports
   and wishes to use CLUE.  The specific mechanism for this is described
   in [RFC8848].  If the other device does not use CLUE, then the
   CLUE-capable device falls back to behavior that does not require CLUE.

   As for the media, Provider and Consumer have an end-to-end
   communication relationship with respect to (RTP-transported) media;
   and the mechanisms described herein and in companion documents do not
   change the aspects of setting up those RTP flows and sessions.  In
   other words, the RTP media sessions conform to the negotiated SDP
   whether or not CLUE is used.

6.  Spatial Relationships

   In order for a Consumer to perform a proper rendering, it is often
   necessary (or at least helpful) for the Consumer to have received
   spatial information about the streams it is receiving.  CLUE defines
   a coordinate system that allows Media Providers to describe the
   spatial relationships of their Media Captures to enable proper
   scaling and spatially sensible rendering of their streams.  The
   coordinate system is based on a few principles:


   *  Each Capture Scene has a distinct coordinate system, unrelated to
      the coordinate systems of other scenes.


   *  Simple systems that do not have multiple Media Captures to
      associate spatially need not use the coordinate model, although it
      can still be useful to provide an Area of Capture.


   *  Coordinates can be in real, physical units (millimeters), have an
      unknown scale, or have no physical scale.  Systems that know their
      physical dimensions (for example, professionally installed
      Telepresence room systems) MUST provide those real-world
      measurements to enable the best user experience for advanced
      receiving systems that can utilize this information.  Systems that
      don't know specific physical dimensions but still know relative
      distances MUST use "Unknown Scale".  "No Scale" is intended to be
      used only where Media Captures from different devices (with
      potentially different scales) will be forwarded alongside one
      another (e.g., in the case of an MCU).

      -  "Millimeters" means the scale is in millimeters.

      -  "Unknown Scale" means the scale is not necessarily in
         millimeters, but the scale is the same for every Capture in the
         Capture Scene.

      -  "No Scale" means the scale could be different for each capture.
         An MCU Provider that advertises two adjacent captures and picks
         sources (which can change quickly) from different endpoints
         might use this value; the scale could be different and changing
         for each capture.  But the areas of capture still represent a
         spatial relation between captures.


   *  The coordinate system is right-handed Cartesian X, Y, Z with the
      origin at a spatial location of the Provider's choosing.  The
      Provider MUST use the same coordinate system with the same scale
      and origin for all coordinates within the same Capture Scene.

   The direction of increasing coordinate values is as follows: X
   increases from left to right, from the point of view of an observer
   at the front of the room looking toward the back; Y increases from
   the front of the room to the back of the room; Z increases from low
   to high (i.e., floor to ceiling).

   Cameras in a scene typically point in the direction of increasing Y,
   from front to back.  But there could be multiple cameras pointing in
   different directions.  If the physical space does not have a
   well-defined front and back, the Provider chooses any direction for
   X, Y, and Z consistent with right-handed coordinates.
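
   As an illustration only, the coordinate principles above could be
   modeled as follows.  The names Scale, Point, distance_meaning, and
   is_right_handed are hypothetical and are not defined by CLUE; the
   sketch assumes axes are given as direction vectors.

```python
from dataclasses import dataclass
from enum import Enum


class Scale(Enum):
    """The three scale values named in this section (enum is hypothetical)."""
    MILLIMETERS = "Millimeters"
    UNKNOWN_SCALE = "Unknown Scale"
    NO_SCALE = "No Scale"


@dataclass(frozen=True)
class Point:
    """A point or direction in a Capture Scene's X, Y, Z coordinate system."""
    x: float  # increases left to right (observer at front, looking back)
    y: float  # increases from the front of the room to the back
    z: float  # increases from floor to ceiling


def distance_meaning(scale: Scale) -> str:
    """Summarize what a Consumer may infer from distances between points."""
    if scale is Scale.MILLIMETERS:
        return "physical"     # distances are real-world millimeters
    if scale is Scale.UNKNOWN_SCALE:
        return "relative"     # one unit shared by every Capture in the scene
    return "per-capture"      # scale may differ and change for each capture


def is_right_handed(x_axis: Point, y_axis: Point, z_axis: Point) -> bool:
    """True if z_axis points the same way as the cross product X x Y."""
    cx = x_axis.y * y_axis.z - x_axis.z * y_axis.y
    cy = x_axis.z * y_axis.x - x_axis.x * y_axis.z
    cz = x_axis.x * y_axis.y - x_axis.y * y_axis.x
    return cx * z_axis.x + cy * z_axis.y + cz * z_axis.z > 0
```

   A Provider in a space without a well-defined front and back could use
   a check like is_right_handed to confirm that its chosen axes are
   consistent with right-handed coordinates.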

7.  Media Captures and Capture Scenes

   This section describes how Providers can describe the content of
   media to Consumers.

7.1.  Media Captures

   Media Captures are the fundamental representations of streams that a
   device can transmit.  What a Media Capture actually represents is
   flexible:

   *  It can represent the immediate output of a physical source (e.g.,
      camera, microphone) or 'synthetic' source (e.g., laptop computer,
      DVD player).

   *  It can represent the output of an audio mixer or video composer.

   *  It can represent a concept such as 'the loudest speaker'.

   *  It can represent a conceptual position such as 'the leftmost
      region'.

   To identify and distinguish between multiple Capture instances,
   Captures have a unique identity.  For instance, VC1, VC2 and AC1,
   AC2, where VC1 and VC2 refer to two different video captures and AC1
   and AC2 refer to two different audio captures.

   Some key points about Media Captures:


   *  A Media Capture is of a single media type (e.g., audio or video).

   *  A Media Capture is defined in a Capture Scene and is given an
      identity that is unique within the Advertisement.  The identity
      may be referenced outside the Capture Scene that defines it
      through a Multiple Content Capture (MCC).

   *  A Media Capture may be associated with one or more Capture Scene
      Views (CSVs).

   *  A Media Capture has exactly one set of spatial information.

   *  A Media Capture can be the source of at most one Capture Encoding.

   Each Media Capture can be associated with attributes to describe what
   it represents.

7.1.1.  Media Capture Attributes

   Media Capture Attributes describe information about the Captures.  A
   Provider can use the Media Capture Attributes to describe the
   Captures for the benefit of the Consumer of the Advertisement
   message.  All these attributes are optional.  Media Capture
   Attributes include:


   *  Spatial information, such as point of capture, point on line of
      capture, and area of capture (all of which, in combination,
      define the capture field of, for example, a camera).

   *  Other descriptive information to help the Consumer choose between
      captures (e.g., description, presentation, view, priority,
      language, person information, and type).

   The subsections below define the Capture attributes.

7.1.1.1.  Point of Capture

   The Point of Capture attribute is a field with a single Cartesian (X,
   Y, Z) point value that describes the spatial location of the
   capturing device (such as a camera).  For an Audio Capture with
   multiple microphones, the Point of Capture defines the nominal
   midpoint of the microphones.

7.1.1.2.  Point on Line of Capture

   The Point on Line of Capture attribute is a field with a single
   Cartesian (X, Y, Z) point value that describes a position in space of
   a second point on the axis of the capturing device, toward the
   direction it is pointing; the first point being the Point of Capture
   (see above).

   Together, the Point of Capture and Point on Line of Capture define
   the direction and axis of the capturing device, for example, the
   optical axis of a camera or the axis of a microphone.  The Media
   Consumer can use this information to adjust how it renders the
   received media if it so chooses.
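
   As a minimal sketch, a Consumer could derive the pointing direction
   of a capturing device from the two attributes as a unit vector.  The
   function name capture_axis and the tuple representation of points are
   assumptions made for this illustration only.

```python
import math


def capture_axis(point_of_capture, point_on_line_of_capture):
    """Return the unit vector from the Point of Capture toward the
    Point on Line of Capture.  Both arguments are (x, y, z) tuples."""
    dx, dy, dz = (b - a for a, b in
                  zip(point_of_capture, point_on_line_of_capture))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if length == 0:
        # Two identical points do not define an axis.
        raise ValueError("points must be distinct to define an axis")
    return (dx / length, dy / length, dz / length)
```

   For a camera at (0, 0, 1000) with a Point on Line of Capture of
   (0, 500, 1000), this yields a direction of increasing Y, i.e., from
   the front of the room toward the back.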

   For an Audio Capture, the Media Consumer can use this information
   along with the Audio Capture Sensitivity Pattern to define a
   three-dimensional volume of capture where sounds can be expected to
   be picked up by the microphone providing this specific audio capture.
   If the Consumer wants to associate an Audio Capture with a Video
   Capture, it can compare this volume with the area of capture for
   video media to provide a check on whether the audio capture is indeed
   spatially associated with the video capture.  For example, a video
   area of capture that fails to intersect at all with the audio volume
   of capture, or is at such a long radial distance from the microphone
   point of capture that the audio level would be very low, would be
   inappropriate.

7.1.1.3.  Area of Capture

   The Area of Capture is a field with a set of four (X, Y, Z) points as
   a value that describes the spatial location of what is being
   "captured".  This attribute applies only to video captures, not other
   types of media.  By comparing the Area of Capture for different Video
   Captures within the same Capture Scene, a Consumer can determine the
   spatial relationships between them and render them correctly.

   The four points MUST be co-planar, forming a quadrilateral, which
   defines the Plane of Interest for the particular Media Capture.
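
   The co-planarity requirement can be checked with a scalar triple
   product, as in the sketch below.  The function name is_coplanar and
   the absolute tolerance are assumptions for illustration; a suitable
   tolerance depends on the scale in use.

```python
def is_coplanar(p1, p2, p3, p4, tolerance=1e-9):
    """Return True if four (x, y, z) points lie in one plane.

    Uses the scalar triple product of the edge vectors from p1: it is
    zero exactly when the three vectors span no volume.
    """
    ax, ay, az = (p2[i] - p1[i] for i in range(3))
    bx, by, bz = (p3[i] - p1[i] for i in range(3))
    cx, cy, cz = (p4[i] - p1[i] for i in range(3))
    # Cross product a x b, then dot with c.
    nx = ay * bz - az * by
    ny = az * bx - ax * bz
    nz = ax * by - ay * bx
    return abs(nx * cx + ny * cy + nz * cz) <= tolerance
```

   For instance, the four points (0,0,0), (9,0,0), (0,0,9), and (9,0,9)
   all have Y equal to 0 and therefore pass the check.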

   If the Area of Capture is not specified, it means the Video Capture
   might be spatially related to other Captures in the same Scene, but
   there is no detailed information on the relationship.  For a switched
   Capture that switches between different sections within a larger
   area, the area of capture MUST use coordinates for the larger
   potential area.

7.1.1.4.  Mobility of Capture

   The Mobility of Capture attribute indicates whether or not the point
   of capture, point on line of capture, and area of capture values stay
   the same over time, or are expected to change (potentially
   frequently).  Possible values are static, dynamic, and highly
   dynamic.

   An example for "dynamic" is a camera mounted on a stand that is
   occasionally hand-carried and placed at different positions in order
   to provide the best angle to capture a work task.  A camera worn by a
   person who moves around the room is an example for "highly dynamic".
   In either case, the effect is that the capture point, capture axis,
   and area of capture change with time.

   The capture point of a static Capture MUST NOT move for the life of
   the CLUE session.  The capture point of dynamic Captures is
   characterized by a change in position followed by a reasonable period
   of stability, on the order of magnitude of minutes.  Highly dynamic
   captures are characterized by a capture point that is constantly
   moving.  If the "area of capture", "capture point", and "line of
   capture" attributes are included with dynamic or highly dynamic
   Captures, they indicate spatial information at the time of the
   Advertisement.

7.1.1.5.  Audio Capture Sensitivity Pattern

   The Audio Capture Sensitivity Pattern attribute applies only to audio
   captures.  This attribute gives information about the nominal
   sensitivity pattern of the microphone that is the source of the
   Capture.  Possible values include patterns such as omni, shotgun,
   cardioid, and hyper-cardioid.

7.1.1.6.  Description

   The Description attribute is a human-readable description (which
   could be in multiple languages) of the Capture.

7.1.1.7.  Presentation

   The Presentation attribute indicates that the capture originates from
   a presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data, etc.  Where more information is known about the capture, it MAY
   be expanded hierarchically to indicate the different types of
   presentation media, e.g., presentation.slides, presentation.image,
   etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.  Refer to [RFC8846]
   for how to extend the model.

7.1.1.8.  View

   The View attribute is a field with enumerated values, indicating what
   type of view the Capture relates to.  The Consumer can use this
   information to help choose which Media Captures it wishes to receive.
   Possible values are as follows:

   Room:        Captures the entire scene

   Table:       Captures the conference table with seated people

   Individual:  Captures an individual person

   Lectern:     Captures the region of the lectern including the
                presenter, for example, in a classroom-style conference
                room

   Audience:    Captures a region showing the audience in a
                classroom-style conference room

7.1.1.9.  Language

   The Language attribute indicates one or more languages used in the
   content of the Media Capture.  Captures MAY be offered in different
   languages in case of multilingual and/or accessible conferences.  A
   Consumer can use this attribute to differentiate between them and
   pick the appropriate one.

   Note that the Language attribute is defined and meaningful both for
   audio and video captures.  In case of audio captures, the meaning is
   obvious.  For a video capture, "Language" could, for example, be sign
   interpretation or text.

   The Language attribute is coded per [RFC5646].

7.1.1.10.  Person Information

   The Person Information attribute allows a Provider to provide
   specific information regarding the people in a Capture (regardless of
   whether or not the capture has a Presentation attribute).  The
   Provider may gather the information automatically or manually from a
   variety of sources; however, the xCard [RFC6351] format is used to
   convey the information.  This allows various information, such as
   Identification information (Section 6.2 of [RFC6350]), Communication
   Information (Section 6.4 of [RFC6350]), and Organizational
   information (Section 6.6 of [RFC6350]), to be communicated.  A
   Consumer may then automatically (i.e., via a policy) or manually
   select Captures based on information about who is in a Capture.  It
   also allows a Consumer to render information regarding the people
   participating in the conference or to use it for further processing.

   The Provider may supply a minimal set of information or a larger set
   of information.  However, it MUST be compliant with [RFC6350] and
   supply a "VERSION" and "FN" property.  A Provider may supply multiple
   xCards per Capture of any KIND (Section 6.1.4 of [RFC6350]).

   In order to keep CLUE messages compact, the Provider SHOULD use a URI
   to point to any LOGO, PHOTO, or SOUND contained in the xCard rather
   than transmitting the LOGO, PHOTO, or SOUND data in a CLUE message.

7.1.1.11.  Person Type

   The Person Type attribute indicates the type of people contained in
   the capture with respect to the meeting agenda (regardless of whether
   or not the capture has a Presentation attribute).  As a capture may
   include multiple people, the attribute may contain multiple values.
   However, values MUST NOT be repeated within the attribute.

   An Advertiser associates the person type with an individual capture
   when it knows that a particular type is in the capture.  If an
   Advertiser cannot link a particular type with some certainty to a
   capture, then it is not included.  On reception of a capture with a
   person type attribute, a Consumer knows with some certainty that the
   capture contains that person type.  The capture may contain other
   person types, but the Advertiser has not been able to determine that
   this is the case.

   The types of Captured people include:

   Chair:         the person responsible for running the meeting
                  according to the agenda.

   Vice-Chair:    the person responsible for assisting the chair in
                  running the meeting.

   Minute Taker:  the person responsible for recording the minutes of
                  the meeting.

   Attendee:      the person has no particular responsibilities with
                  respect to running the meeting.

   Observer:      an Attendee without the right to influence the
                  discussion.

   Presenter:     the person is scheduled on the agenda to make a
                  presentation in the meeting.  Note: This is not
                  related to any "active speaker" functionality.

   Translator:    the person is providing some form of translation or
                  commentary in the meeting.

   Timekeeper:    the person is responsible for maintaining the meeting
                  schedule.

   Furthermore, the person type attribute may contain one or more
   strings allowing the Provider to indicate custom meeting-specific
   types.

7.1.1.12.  Priority

   The Priority attribute indicates a relative priority between
   different Media Captures.  The Provider sets this priority, and the
   Consumer MAY use the priority to help decide which Captures it wishes
   to receive.
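
   A Consumer-side use of this attribute might look like the sketch
   below.  It groups Captures that share a priority number (which the
   Provider treats as equally important) and keeps Captures without a
   priority separate, since nothing can be assumed about them.  The
   function name rank_by_priority is hypothetical, and whether a larger
   or smaller integer means "more important" is NOT stated in this
   section; the higher_is_more_important flag is an assumption that a
   real implementation would settle from the data model.

```python
def rank_by_priority(captures, higher_is_more_important=True):
    """Group capture identities by priority.

    captures maps a capture identity to an integer priority, or to
    None when the Provider assigned no priority.  Returns a pair:
    (list of groups ordered from most to least important, list of
    identities with no priority).
    """
    ranked = {}
    unranked = []
    for capture_id, priority in captures.items():
        if priority is None:
            unranked.append(capture_id)   # no assumption can be made
        else:
            ranked.setdefault(priority, []).append(capture_id)
    ordered = [sorted(ranked[p])
               for p in sorted(ranked, reverse=higher_is_more_important)]
    return ordered, sorted(unranked)
```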

   The "priority" attribute is an integer that indicates a relative
   priority between Captures.  For example, it is possible to assign a
   priority between two presentation Captures that would allow a remote
   Endpoint to determine which presentation is more important.  Priority
   is assigned at the individual Capture level.  It represents the
   Provider's view of the relative priority between Captures with a
   priority.  The same priority number MAY be used across multiple
   Captures.  It indicates that they are equally important.  If no
   priority is assigned, no assumptions regarding relative importance of
   the Capture can be made.

7.1.1.13.  Embedded Text

   The Embedded Text attribute indicates that a Capture provides
   embedded textual information.  For example, the video Capture may
   contain speech-to-text information composed with the video image.

7.1.1.14.  Related To

   The Related To attribute indicates the Capture contains additional
   complementary information related to another Capture.  The value
   indicates the identity of the other Capture to which this Capture is
   providing additional information.

   For example, a conference can utilize translators or facilitators
   that provide an additional audio stream (i.e., a translation or
   description or commentary of the conference).  Where multiple
   captures are available, it may be advantageous for a Consumer to
   select a complementary Capture instead of or in addition to a Capture
   it relates to.

7.2.  Multiple Content Capture

   The MCC indicates that one or more Single Media Captures are
   multiplexed (temporally and/or spatially) or mixed in one Media
   Capture.  Only one Capture type (i.e., audio, video, etc.) is allowed
   in each MCC instance.  The MCC may contain a reference to the Single
   Media Captures (which may have their own attributes) as well as
   attributes associated with the MCC itself.  An MCC may also contain
   other MCCs.  The MCC MAY reference Captures from within the Capture
   Scene that defines it or from other Capture Scenes.  No ordering is
   implied by the order that Captures appear within an MCC.  An MCC MAY
   contain no references to other Captures to indicate that the MCC
   contains content from multiple sources, but no information regarding
   those sources is given.  MCCs either contain the referenced Captures
   and no others or have no referenced captures and, therefore, may
   contain any Capture.
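
   Because an MCC may reference other MCCs, a Consumer may want to
   expand an MCC into the Single Media Captures it can contain.  The
   sketch below is one way to do so; the function name resolve_sources
   and the dictionary layout are assumptions for illustration.  It
   returns None when an MCC with no references is reached, reflecting
   that such an MCC may contain any Capture.

```python
def resolve_sources(capture_id, mcc_members, _seen=frozenset()):
    """Return the set of Single Media Capture identities reachable
    from capture_id, or None if the content cannot be enumerated.

    mcc_members maps each MCC identity to the list of identities it
    references; identities absent from the map are Single Media
    Captures.
    """
    if capture_id not in mcc_members:
        return {capture_id}               # a Single Media Capture
    if capture_id in _seen:
        return set()                      # guard against reference cycles
    members = mcc_members[capture_id]
    if not members:
        return None                       # no references: any Capture
    result = set()
    for member in members:
        sub = resolve_sources(member, mcc_members, _seen | {capture_id})
        if sub is None:
            return None
        result |= sub
    return result
```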

   One or more MCCs may also be specified in a CSV.  This allows an
   Advertiser to indicate that several MCC captures are used to
   represent a capture scene.  Table 14 provides an example of this
   case.

   As outlined in Section 7.1, each instance of the MCC has its own
   Capture identity, e.g., MCC1.  It allows all the individual captures
   contained in the MCC to be referenced by a single MCC identity.

   The example below shows the use of a Multiple Content Capture:


              | Capture Scene #1  |                         |
              | VC1               | {MC attributes}         |
              | VC2               | {MC attributes}         |
              | VC3               | {MC attributes}         |
              | MCC1(VC1,VC2,VC3) | {MC and MCC attributes} |
              | CSV(MCC1)         |                         |

                 Table 1: Multiple Content Capture Concept

   This indicates that MCC1 is a single capture that contains the
   Captures VC1, VC2, and VC3, according to any MCC1 attributes.

7.2.1.  MCC Attributes

   Media Capture Attributes may be associated with the MCC instance and
   the Single Media Captures that the MCC references.  A Provider should
   avoid providing conflicting attribute values between the MCC and
   Single Media Captures.  Where there is a conflict, the attributes of
   the MCC override any that may be present in the individual Captures.
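
   Reading the paragraph above as "values on the MCC take precedence
   over conflicting values on a referenced Capture", a Consumer could
   compute effective attributes as sketched here.  The function names
   and the flat dictionary representation are assumptions made for this
   example.

```python
def conflicting_attributes(mcc_attrs, capture_attrs):
    """Return the names of attributes set on both with different values."""
    return sorted(name for name in mcc_attrs
                  if name in capture_attrs
                  and capture_attrs[name] != mcc_attrs[name])


def effective_attributes(mcc_attrs, capture_attrs):
    """Merge attributes so that MCC values win on conflict."""
    merged = dict(capture_attrs)   # start from the individual Capture
    merged.update(mcc_attrs)       # MCC values override on conflict
    return merged
```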

   A Provider MAY include as much or as little of the original source
   Capture information as it requires.

   There are MCC-specific attributes that MUST only be used with
   Multiple Content Captures.  These are described in the sections
   below.  The attributes described in Section 7.1.1 MAY also be used
   with MCCs.

   The spatial-related attributes of an MCC indicate its area of capture
   and point of capture within the scene, just like any other media
   capture.  The spatial information does not imply anything about how
   other captures are composed within an MCC.

   For example, a virtual scene could be constructed for the MCC capture
   with two Video Captures with a "MaxCaptures" attribute set to 2 and
   an "Area of Capture" attribute provided with an overall area.  Each
   of the individual Captures could then also include an "Area of
   Capture" attribute with a subset of the overall area.  The Consumer
   would then know how each capture is related to others within the
   scene, but not the relative position of the individual captures
   within the composed capture.


           | Capture Scene |                                   |
           | #1            |                                   |
           | VC1           |      AreaofCapture=(0,0,0)(9,0,0) |
           |               |                    (0,0,9)(9,0,9) |
           | VC2           |    AreaofCapture=(10,0,0)(19,0,0) |
           |               |                  (10,0,9)(19,0,9) |
           | MCC1(VC1,VC2) |                     MaxCaptures=2 |
           |               |     AreaofCapture=(0,0,0)(19,0,0) |
           |               |                   (0,0,9)(19,0,9) |
           | CSV(MCC1)     |                                   |

              Table 2: Example of MCC and Single Media Capture attributes
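
   Using the coordinates from Table 2, a Consumer could confirm that
   each individual Area of Capture falls within the overall area of the
   MCC.  The sketch below compares axis-aligned bounding boxes, which is
   a simplification chosen for this example; the function names are
   hypothetical.

```python
# Area of Capture values copied from Table 2.
VC1_AREA = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9)]
VC2_AREA = [(10, 0, 0), (19, 0, 0), (10, 0, 9), (19, 0, 9)]
MCC1_AREA = [(0, 0, 0), (19, 0, 0), (0, 0, 9), (19, 0, 9)]


def bounding_box(area):
    """Return (min corner, max corner) of a list of (x, y, z) points."""
    xs, ys, zs = zip(*area)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def is_within(inner, outer):
    """True if inner's bounding box lies inside outer's bounding box."""
    (inner_lo, inner_hi) = bounding_box(inner)
    (outer_lo, outer_hi) = bounding_box(outer)
    return (all(o <= i for i, o in zip(inner_lo, outer_lo)) and
            all(i <= o for i, o in zip(inner_hi, outer_hi)))
```

   Both is_within(VC1_AREA, MCC1_AREA) and is_within(VC2_AREA,
   MCC1_AREA) hold for the values in Table 2.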

   The subsections below describe the MCC-only attributes.

7.2.1.1.  Maximum Number of Captures within an MCC

   The Maximum Number of Captures MCC attribute indicates the maximum
   number of individual Captures that may appear in a Capture Encoding
   at a time.  The actual number at any given time can be less than or
   equal to this maximum.  It may be used to derive how the Single Media
   Captures within the MCC are composed/switched with regard to space
   and time.

   A Provider can indicate that the number of Captures in an MCC Capture
   Encoding is equal ("=") to the MaxCaptures value or that there may be
   any number of Captures up to and including ("<=") the MaxCaptures
   value.  This allows a Provider to distinguish between an MCC that
   purely represents a composition of sources and an MCC that represents
   switched sources or switched and composed sources.

   MaxCaptures may be set to one so that only content related to one of
   the sources is shown in the MCC Capture Encoding at a time, or it may
   be set to any value up to the total number of Source Media Captures
   in the MCC.

   The bullets below describe how the setting of MaxCaptures versus the
   number of Captures in the MCC affects how sources appear in a Capture
   Encoding:

   *  A switched case occurs when MaxCaptures is set to <= 1 and the
      number of Captures in the MCC is greater than 1 (or not
      specified).  Zero or one Captures may be switched into the Capture
      Encoding.  Note: zero is allowed because of the "<=".

   *  A switched case occurs when MaxCaptures is set to = 1 and the
      number of Captures in the MCC is greater than 1 (or not
      specified).  Only one Capture source is contained in a Capture
      Encoding at a time.

   *  A switched and composed case occurs when MaxCaptures is set to <=
      N (with N > 1) and the number of Captures in the MCC is greater
      than N (or not specified).  The Capture Encoding may contain
      purely switched sources (i.e., <=2 allows for one source on its
      own), or it may contain composed and switched sources (i.e., a
      composition of two sources switched between the sources).

   *  A switched and composed case occurs when MaxCaptures is set to = N
      (with N > 1) and the number of Captures in the MCC is greater than
      N (or not specified).  The Capture Encoding contains composed and
      switched sources (i.e., a composition of N sources switched
      between the sources).  It is not possible to have a single source.

   *  A switched and composed case occurs when MaxCaptures is set to <=
      the number of Captures in the MCC.  The Capture Encoding may
      contain media switched between any number (up to the MaxCaptures)
      of composed sources.

   *  A composed case occurs when MaxCaptures is set to = the number of
      Captures in the MCC.  All the sources are composed into a single
      Capture Encoding.

   If this attribute is not set, then as a default, it is assumed that
   all source media capture content can appear concurrently in the
   Capture Encoding associated with the MCC.
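
   The cases listed in this section can be summarized as a small
   decision function.  This is a sketch only: the function name
   classify_mcc, its parameters, and the returned labels (including
   "unrestricted" for the unset default) are assumptions made for this
   example and are not CLUE protocol elements.

```python
def classify_mcc(max_captures, exact, num_captures=None):
    """Classify how sources appear in an MCC's Capture Encoding.

    max_captures: the MaxCaptures value, or None if the attribute is
                  not set.
    exact:        True for "=", False for "<=".
    num_captures: number of Captures referenced by the MCC, or None
                  if not specified.
    """
    if max_captures is None:
        # Default: all source content can appear concurrently.
        return "unrestricted"
    if num_captures is not None and max_captures >= num_captures:
        # MaxCaptures equals the number of Captures in the MCC.
        return "composed" if exact else "switched and composed"
    if max_captures == 1:
        return "switched"
    return "switched and composed"
```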

   For example, the use of MaxCaptures equal to 1 on an MCC with three
   Video Captures, VC1, VC2, and VC3, would indicate that the Advertiser
   in the Capture Encoding would switch between VC1, VC2, and VC3 as
   there may be only a maximum of one Capture at a time.

7.2.1.2.  Policy

   The Policy MCC Attribute indicates the criteria that the Provider
   uses to determine when and/or where media content appears in the
   Capture Encoding related to the MCC.

   The attribute is in the form of a token that indicates the policy and
   an index representing an instance of the policy.  The same index
   value can be used for multiple MCCs.

   The tokens are as follows:

   SoundLevel:  This indicates that the content of the MCC is determined
      by a sound-level-detection algorithm.  The loudest (active)
      speaker (or a previous speaker, depending on the index value) is
      contained in the MCC.

   RoundRobin:  This indicates that the content of the MCC is determined
      by a time-based algorithm.  For example, the Provider provides
      content from a particular source for a period of time and then
      provides content from another source, and so on.

   An index is used to represent an instance in the policy setting.  An
   index of 0 represents the most current instance of the policy, i.e.,
   the active speaker, 1 represents the previous instance, i.e., the
   previous active speaker, and so on.
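
   To illustrate the index, the sketch below shows how a Provider might
   pick the source for an MCC with a SoundLevel policy from a history of
   active speakers.  The function name sound_level_source and the idea
   of keeping the history as a list are assumptions for this example;
   the detection algorithm itself is out of scope.

```python
def sound_level_source(speaker_history, index):
    """Return the capture identity for Policy=SoundLevel:<index>.

    speaker_history lists capture identities in the order they were
    detected as active speaker, oldest first.  Consecutive repeats are
    collapsed so that index 1 is the previous, different speaker.
    Returns None when the history is too short for the index.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    distinct = []
    for speaker in speaker_history:
        if not distinct or distinct[-1] != speaker:
            distinct.append(speaker)
    if index >= len(distinct):
        return None
    return distinct[-1 - index]
```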

   The following example shows a case where the Provider provides two
   media streams, one showing the active speaker and a second stream
   showing the previous speaker.


                | Capture Scene #1 |                     |
                | VC1              |                     |
                | VC2              |                     |
                | MCC1(VC1,VC2)    | Policy=SoundLevel:0 |
                |                  | MaxCaptures=1       |
                | MCC2(VC1,VC2)    | Policy=SoundLevel:1 |
                |                  | MaxCaptures=1       |
                | CSV(MCC1,MCC2)   |                     |

               Table 3: Example Policy MCC Attribute Usage

7.2.1.3.  Synchronization Identity

   The Synchronization Identity MCC attribute indicates how the
   individual Captures in multiple MCC Captures are synchronized.  To
   indicate that the Capture Encodings associated with MCCs contain
   Captures from the same source at the same time, a Provider should set
   the same Synchronization Identity on each of the concerned MCCs.  It
   is the Provider that determines what the source for the Captures is,
   so a Provider can choose how to group together Single Media Captures
   into a combined "source" for the purpose of switching them together
   to keep them synchronized according to the SynchronizationID
   attribute.  For example, when the Provider is in an MCU, it may
   determine that each separate CLUE Endpoint is a remote source of
   media.  The Synchronization Identity may be used across media types,
   i.e., to synchronize audio- and video-related MCCs.

   Without this attribute, it is assumed that multiple MCCs may provide
   content from different sources at any particular point in time.
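
   A Consumer could group MCCs by this attribute to learn which Capture
   Encodings will switch together.  The sketch below is illustrative
   only; the function name synchronized_groups and the use of None for
   an MCC without the attribute are assumptions.

```python
from collections import defaultdict


def synchronized_groups(mcc_sync_ids):
    """Group MCC identities by Synchronization Identity.

    mcc_sync_ids maps an MCC identity to its SynchronizationID, or to
    None when the attribute is absent.  MCCs without the attribute are
    left out, since they may carry content from any source.
    """
    groups = defaultdict(list)
    for mcc_id, sync_id in mcc_sync_ids.items():
        if sync_id is not None:
            groups[sync_id].append(mcc_id)
    return {sync_id: sorted(members) for sync_id, members in groups.items()}
```

   With MCC1, MCC2, and MCC4 carrying SynchronizationID=1 and MCC3
   carrying none, the result is a single group of MCC1, MCC2, and MCC4.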

   For example:


              | Capture Scene #1      |                     |
              | VC1                   | Description=Left    |
              | VC2                   | Description=Center  |
              | VC3                   | Description=Right   |
              | AC1                   | Description=Room    |
              | CSV(VC1,VC2,VC3)      |                     |
              | CSV(AC1)              |                     |
              | Capture Scene #2      |                     |
              | VC4                   | Description=Left    |
              | VC5                   | Description=Center  |
              | VC6                   | Description=Right   |
              | AC2                   | Description=Room    |
              | CSV(VC4,VC5,VC6)      |                     |
              | CSV(AC2)              |                     |
              | Capture Scene #3      |                     |
              | VC7                   |                     |
              | AC3                   |                     |
              | Capture Scene #4      |                     |
              | VC8                   |                     |
              | AC4                   |                     |
              | Capture Scene #5      |                     |
              | MCC1(VC1,VC4,VC7)     | SynchronizationID=1 |
              |                       | MaxCaptures=1       |
              | MCC2(VC2,VC5,VC8)     | SynchronizationID=1 |
              |                       | MaxCaptures=1       |
              | MCC3(VC3,VC6)         | MaxCaptures=1       |
              | MCC4(AC1,AC2,AC3,AC4) | SynchronizationID=1 |
              |                       | MaxCaptures=1       |
              | CSV(MCC1,MCC2,MCC3)   |                     |
              | CSV(MCC4)             |                     |

                 Table 4: Example Synchronization Identity
                            MCC Attribute Usage

   The above Advertisement would indicate that MCC1, MCC2, MCC3, and
   MCC4 make up a Capture Scene.  There would be four Capture Encodings
   (one for each MCC).  Because MCC1 and MCC2 have the same
   SynchronizationID, each Encoding from MCC1 and MCC2, respectively,
   would together have content from only Capture Scene 1 or only Capture
   Scene 2 or the combination of VC7 and VC8 at a particular point in
   time.  In this case, the Provider has decided the sources to be
   synchronized are Scene #1, Scene #2, and Scene #3 and #4 together.
   The Encoding from MCC3 would not be synchronized with MCC1 or MCC2.
   As MCC4 also has the same Synchronization Identity as MCC1 and MCC2,
   the content of the audio Encoding will be synchronized with the video
   content.

Allow Subset Choice

   The Allow Subset Choice MCC attribute is a boolean value, indicating
   whether or not the Provider allows the Consumer to choose a specific
   subset of the Captures referenced by the MCC.  If this attribute is
   true, and the MCC references other Captures, then the Consumer MAY
   select (in a Configure message) a specific subset of those Captures
   to be included in the MCC, and the Provider MUST then include only
   that subset.  If this attribute is false, or the MCC does not
   reference other Captures, then the Consumer MUST NOT select a subset.
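
   As a non-normative illustration, the subset rule above could be
   checked as in the following Python sketch.  The function name and
   its parameters are hypothetical; they are not defined by CLUE and
   only mirror the conditions stated in the paragraph above.

```python
def validate_mcc_selection(referenced, allow_subset_choice, requested):
    """Return True if a Consumer's requested subset is acceptable.

    referenced          -- Capture IDs the MCC references in the Advertisement
    allow_subset_choice -- value of the Allow Subset Choice attribute
    requested           -- Capture IDs named in the Configure, or None when
                           the Consumer names only the MCC itself
    """
    if requested is None:
        # No subset named: the MCC is taken as advertised.
        return True
    if not allow_subset_choice or not referenced:
        # Attribute is false, or the MCC references no Captures:
        # the Consumer MUST NOT select a subset.
        return False
    # A subset must be non-empty and drawn only from the referenced Captures.
    return bool(requested) and set(requested) <= set(referenced)
```

   With an MCC that references VC1, VC2, and VC3 and sets the attribute
   to true, a request for VC1 and VC2 is accepted; the same request is
   rejected when the attribute is false.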

7.3.  Capture Scene

   In order for a Provider's individual Captures to be used effectively
   by a Consumer, the Provider organizes the Captures into one or more
   Capture Scenes, with the structure and contents of these Capture
   Scenes being sent from the Provider to the Consumer in the
   Advertisement.

   A Capture Scene is a structure representing a spatial region
   containing one or more Capture Devices, each capturing media
   representing a portion of the region.  A Capture Scene includes one
   or more Capture Scene Views (CSVs), with each CSV including one or
   more Media Captures of the same media type.  There can also be Media
   Captures that are not included in a Capture Scene View.  A Capture
   Scene represents, for example, the video image of a group of people
   seated next to each other, along with the sound of their voices,
   which could be represented by some number of VCs and ACs in the
   Capture Scene Views.  An MCU can also describe in Capture Scenes what
   it constructs from media Streams it receives.

   A Provider MAY advertise one or more Capture Scenes.  What
   constitutes an entire Capture Scene is up to the Provider.  A simple
   Provider might typically use one Capture Scene for participant media
   (live video from the room cameras) and another Capture Scene for a
   computer-generated presentation.  In more-complex systems, the use of
   additional Capture Scenes is also sensible.  For example, a classroom
   may advertise two Capture Scenes involving live video: one including
   only the camera capturing the instructor (and associated audio), the
   other including camera(s) capturing students (and associated audio).

   A Capture Scene MAY (and typically will) include more than one type
   of media.  For example, a Capture Scene can include several Capture
   Scene Views for Video Captures and several Capture Scene Views for
   Audio Captures.  A particular Capture MAY be included in more than
   one Capture Scene View.

   A Provider MAY express spatial relationships between Captures that
   are included in the same Capture Scene.  However, there is no spatial
   relationship between Media Captures from different Capture Scenes.
   In other words, Capture Scenes each use their own spatial measurement
   system as outlined above in Section 6.

   A Provider arranges Captures in a Capture Scene to help the Consumer
   choose which captures it wants to render.  The Capture Scene Views in
   a Capture Scene are different alternatives the Provider is suggesting
   for representing the Capture Scene.  Each Capture Scene View is given
   an advertisement-unique identity.  The order of Capture Scene Views
   within a Capture Scene has no significance.  The Media Consumer can
   choose to receive all Media Captures from one Capture Scene View for
   each media type (e.g., audio and video), or it can pick and choose
   Media Captures regardless of how the Provider arranges them in
   Capture Scene Views.  Different Capture Scene Views of the same media
   type are not necessarily mutually exclusive alternatives.  Also note
   that the presence of multiple Capture Scene Views (with potentially
   multiple encoding options in each view) in a given Capture Scene does
   not necessarily imply that a Provider is able to serve all the
   associated media simultaneously (although the construction of such an
   over-rich Capture Scene is probably not sensible in many cases).
   What a Provider can send simultaneously is determined through the
   Simultaneous Transmission Set mechanism, described in Section 8.

   Captures within the same Capture Scene View MUST be of the same media
   type; it is not possible to mix audio and video captures in the
   same Capture Scene View, for instance.  The Provider MUST be capable
   of encoding and sending all Captures (that have an encoding group) in
   a single Capture Scene View simultaneously.  The order of Captures
   within a Capture Scene View has no significance.  A Consumer can
   decide to receive all the Captures in a single Capture Scene View,
   but a Consumer could also decide to receive just a subset of those
   captures.  A Consumer can also decide to receive Captures from
   different Capture Scene Views, all subject to the constraints set by
   Simultaneous Transmission Sets, as discussed in Section 8.
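
   The same-media-type rule for a Capture Scene View can be sketched in
   Python as follows.  This is an illustration only: the class and
   field names are assumptions made for this example and do not come
   from the CLUE data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capture:
    capture_id: str   # e.g. "VC0" or "AC0"
    media_type: str   # e.g. "video" or "audio"


@dataclass
class CaptureSceneView:
    view_id: str
    captures: list = field(default_factory=list)

    def add(self, capture):
        # Captures within one Capture Scene View MUST share a media type.
        if self.captures and capture.media_type != self.captures[0].media_type:
            raise ValueError("mixed media types in one Capture Scene View")
        self.captures.append(capture)

    @property
    def media_type(self):
        # The view's media type is that of its Captures (None if empty).
        return self.captures[0].media_type if self.captures else None
```

   Adding VC0 and VC1 (both video) to one view succeeds, while adding
   AC0 (audio) to that view raises an error, so audio has to be placed
   in a separate Capture Scene View.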

   When a Provider advertises a Capture Scene with multiple CSVs, it is
   essentially signaling that there are multiple representations of the
   same Capture Scene available.  In some cases, these multiple views
   would be used simultaneously (for instance, a "video view" and an
   "audio view").  In some cases, the views would conceptually be
   alternatives (for instance, a view consisting of three Video Captures
   covering the whole room versus a view consisting of just a single
   Video Capture covering only the center of a room).  In this latter
   example, one sensible choice for a Consumer would be to indicate
   (through its Configure and possibly through an additional
   offer/answer exchange) the Captures of that Capture Scene View that
   most closely matched the Consumer's number of display devices or
   screen layout.

   The following is an example of four potential Capture Scene Views for
   an endpoint-style Provider:

   1.  (VC0, VC1, VC2) - left, center, and right camera Video Captures

   2.  (MCC3) - Video Capture associated with loudest room segment

   3.  (VC4) - Video Capture zoomed-out view of all people in the room

   4.  (AC0) - main audio

   The first view in this Capture Scene example is a list of Video
   Captures that have a spatial relationship to each other.
   Determination of the order of these captures (VC0, VC1, and VC2) for
   rendering purposes is accomplished through use of their Area of
   Capture attributes.  The second view (MCC3) and the third view (VC4)
   are alternative representations of the same room's video, which might
   be better suited to some Consumers' rendering capabilities.  The
   inclusion of the Audio Capture in the same Capture Scene indicates
   that AC0 is associated with all of those Video Captures, meaning it
   comes from the same spatial region.  Therefore, if audio were to be
   rendered at all, this audio would be the correct choice, irrespective
   of which Video Captures were chosen.

7.3.1.  Capture Scene Attributes

   Capture Scene Attributes can be applied to Capture Scenes as well as
   to individual media captures.  Attributes specified at this level
   apply to all constituent Captures.  Capture Scene attributes include
   the following:

   *  Human-readable description of the Capture Scene, which could be in
      multiple languages;

   *  xCard scene information

   *  Scale information ("Millimeters", "Unknown Scale", "No Scale"), as
      described in Section 6.

Scene Information

   The Scene information attribute provides information regarding the
   Capture Scene rather than individual participants.  The Provider may
   gather the information automatically or manually from a variety of
   sources.  The scene information attribute allows a Provider to
   indicate information such as organizational or geographic information
   allowing a Consumer to determine which Capture Scenes are of interest
   in order to then perform Capture selection.  It also allows a
   Consumer to render information regarding the Scene or to use it for
   further processing.

   As per Section, the xCard format is used to convey this
   information and the Provider may supply a minimal set of information
   or a larger set of information.

   In order to keep CLUE messages compact, the Provider SHOULD use a URI
   to point to any LOGO, PHOTO, or SOUND contained in the xCard rather
   than transmitting the LOGO, PHOTO, or SOUND data in a CLUE message.

7.3.2.  Capture Scene View Attributes

   A Capture Scene can include one or more Capture Scene Views in
   addition to the Capture-Scene-wide attributes described above.
   Capture Scene View attributes apply to the Capture Scene View as a
   whole, i.e., to all Captures that are part of the Capture Scene View.

   Capture Scene View attributes include the following:

   *  A human-readable description (which could be in multiple
      languages) of the Capture Scene View.

7.4.  Global View List

   An Advertisement can include an optional Global View list.  Each item
   in this list is a Global View.  The Provider can include multiple
   Global Views, to allow a Consumer to choose sets of captures
   appropriate to its capabilities or application.  The choice of how to
   make these suggestions in the Global View list for what represents
   all the scenes for which the Provider can send media is up to the
   Provider.  This is very similar to how each CSV represents a
   particular scene.

   As an example, suppose an advertisement has three scenes, and each
   scene has three CSVs, ranging from one to three video captures in
   each CSV.  The Provider is advertising a total of nine video Captures
   across three scenes.  The Provider can use the Global View list to
   suggest alternatives for Consumers that can't receive all nine video
   Captures as separate media streams.  For accommodating a Consumer
   that wants to receive three video Captures, a Provider might suggest
   a Global View containing just a single CSV with three Captures and
   nothing from the other two scenes.  Or a Provider might suggest a
   Global View containing three different CSVs, one from each scene,
   with a single video Capture in each.
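
   To illustrate how a Consumer might use such suggestions, the
   following Python sketch filters a Global View list by the number of
   Captures the Consumer can receive.  It is a sketch under stated
   assumptions: the function name, the CSV identifiers, and the Capture
   identifiers are all hypothetical, and the data only loosely follows
   the three-scene example above.

```python
def views_within_budget(global_views, csv_captures, max_captures):
    """Return the Global Views whose total Capture count fits the budget.

    global_views -- list of Global Views, each a tuple of CSV IDs
    csv_captures -- CSV ID -> list of Capture IDs in that CSV
    max_captures -- number of Captures the Consumer can receive
    """
    fitting = []
    for global_view in global_views:
        # Count distinct Captures, since a Capture may appear in
        # more than one Capture Scene View.
        captures = {c for csv_id in global_view for c in csv_captures[csv_id]}
        if len(captures) <= max_captures:
            fitting.append(global_view)
    return fitting


# Hypothetical data: three scenes (A, B, C), each offering a
# three-Capture CSV and a one-Capture CSV.
CSV_CAPTURES = {
    "A3": ["A-VC1", "A-VC2", "A-VC3"], "A1": ["A-VC1"],
    "B3": ["B-VC1", "B-VC2", "B-VC3"], "B1": ["B-VC1"],
    "C3": ["C-VC1", "C-VC2", "C-VC3"], "C1": ["C-VC1"],
}
GLOBAL_VIEWS = [("A3", "B3", "C3"), ("A3",), ("A1", "B1", "C1")]
```

   For a Consumer limited to three video Captures, the sketch keeps the
   single three-Capture CSV and the one-Capture-per-scene Global View
   and drops the nine-Capture Global View.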

   Some additional rules:


   *  The ordering of Global Views in the Global View list is
      insignificant.

   *  The ordering of CSVs within each Global View is insignificant.

   *  A particular CSV may be used in multiple Global Views.

   *  The Provider must be capable of encoding and sending all Captures
      within the CSVs of a given Global View simultaneously.

   The following figure shows an example of the structure of Global
   Views in a Global View List.

      . Advertisement                                        .
      .                                                      .
      . +--------------+         +-------------------------+ .
      . |Scene 1       |         |Global View List         | .
      . |              |         |                         | .
      . | CSV1 (v)<----------------- Global View (CSV 1)   | .
      . |         <-------.      |                         | .
      . |              |  *--------- Global View (CSV 1,5) | .
      . | CSV2 (v)     |  |      |                         | .
      . |              |  |      |                         | .
      . | CSV3 (v)<---------*------- Global View (CSV 3,5) | .
      . |              |  | |    |                         | .
      . | CSV4 (a)<----------------- Global View (CSV 4)   | .
      . |         <-----------.  |                         | .
      . +--------------+  | | *----- Global View (CSV 4,6) | .
      .                   | | |  |                         | .
      . +--------------+  | | |  +-------------------------+ .
      . |Scene 2       |  | | |                              .
      . |              |  | | |                              .
      . | CSV5 (v)<-------' | |                              .
      . |         <---------' |                              .
      . |              |      |        (v) = video           .
      . | CSV6 (a)<-----------'        (a) = audio           .
      . |              |                                     .
      . +--------------+                                     .

                    Figure 3: Global View List Structure

8.  Simultaneous Transmission Set Constraints

   In many practical cases, a Provider has constraints or limitations on
   its ability to send Captures simultaneously.  One type of limitation
   is caused by the physical limitations of capture mechanisms; these
   constraints are represented by a Simultaneous Transmission Set.  The
   second type of limitation reflects the encoding resources available,
   such as bandwidth or video encoding throughput (macroblocks/second).
   This type of constraint is captured by Individual Encodings and
   Encoding Groups, discussed below.

   Some Endpoints or MCUs can send multiple Captures simultaneously;
   however, sometimes there are constraints that limit which Captures
   can be sent simultaneously with other Captures.  A device may not be
   able to be used in different ways at the same time.  Provider
   Advertisements are made so that the Consumer can choose one of
   several possible mutually exclusive usages of the device.  This type
   of constraint is expressed in a Simultaneous Transmission Set, which
   lists all the Captures of a particular media type (e.g., audio,
   video, or text) that can be sent at the same time.  There are
   different Simultaneous Transmission Sets for each media type in the
   Advertisement.  This is easier to show in an example.

   Consider the example of a room system where there are three cameras,
   each of which can send a separate Capture covering two people: VC0,
   VC1, and VC2.  The middle camera can also zoom out (using an optical
   zoom lens) and show all six people, VC3.  But the middle camera
   cannot be used in both modes at the same time; it has to
   either show the space where two participants sit or the whole six
   seats, but not both at the same time.  As a result, VC1 and VC3
   cannot be sent simultaneously.

   Simultaneous Transmission Sets are expressed as sets of the Media
   Captures that the Provider could transmit at the same time (though,
   in some cases, it is not intuitive to do so).  If a Multiple Content
   Capture is included in a Simultaneous Transmission Set, it indicates
   that the Capture Encoding associated with it could be transmitted at
   the same time as the other Captures within the Simultaneous
   Transmission Set.  It does not imply that the Single Media Captures
   contained in the Multiple Content Capture could all be transmitted at
   the same time.

   In this example, the two Simultaneous Transmission Sets are shown in
   Table 5.  If a Provider advertises one or more mutually exclusive
   Simultaneous Transmission Sets, then, for each media type, the
   Consumer MUST ensure that it chooses Media Captures that lie wholly
   within one of those Simultaneous Transmission Sets.


                            | Simultaneous Sets |
                            | {VC0, VC1, VC2}   |
                            | {VC0, VC3, VC2}   |

                          Table 5: Two Simultaneous
                              Transmission Sets
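
   The Consumer-side check described above can be sketched in Python as
   follows, using the two sets from Table 5.  The function name is
   hypothetical, and the sketch assumes that the Advertisement carries
   at least one Simultaneous Transmission Set for the media type in
   question.

```python
def fits_simultaneous_sets(chosen, simultaneous_sets):
    """True if the chosen Captures lie wholly within one advertised set."""
    chosen = set(chosen)
    return any(chosen <= set(s) for s in simultaneous_sets)


# The two Simultaneous Transmission Sets from Table 5.
SETS = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
```

   Choosing VC0 and VC3 is acceptable because both lie within the second
   set; choosing VC1 and VC3 is not, because no single set contains
   both.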

   A Provider OPTIONALLY can include the Simultaneous Transmission Sets
   in its Advertisement.  These constraints apply across all the Capture
   Scenes in the Advertisement.  It is a syntax-conformance requirement
   that the Simultaneous Transmission Sets MUST allow all the media
   Captures in any particular Capture Scene View to be used
   simultaneously.  Similarly, the Simultaneous Transmission Sets MUST
   reflect the simultaneity expressed by any Global View.

   For shorthand convenience, a Provider MAY describe a Simultaneous
   Transmission Set in terms of Capture Scene Views and Capture Scenes.
   If a Capture Scene View is included in a Simultaneous Transmission
   Set, then all Media Captures in the Capture Scene View are included
   in the Simultaneous Transmission Set.  If a Capture Scene is included
   in a Simultaneous Transmission Set, then all its Capture Scene Views
   (of the corresponding media type) are included in the Simultaneous
   Transmission Set.  The end result reduces to a set of Media Captures,
   of a particular media type, in either case.
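
   The reduction of the shorthand form to a plain set of Media Captures
   could look like the following Python sketch.  The function, its
   parameters, and the dictionary layout are assumptions made for
   illustration; CLUE does not define them.

```python
def expand_set(members, views, scenes, media_type):
    """Reduce a shorthand Simultaneous Transmission Set to Capture IDs.

    members    -- IDs of Captures, Capture Scene Views, or Capture Scenes
    views      -- view ID -> (media type, list of Capture IDs)
    scenes     -- scene ID -> list of view IDs
    media_type -- media type of the set being expanded
    """
    captures = set()
    for member in members:
        if member in scenes:
            # A Capture Scene contributes its views of the matching type.
            for view_id in scenes[member]:
                view_type, view_captures = views[view_id]
                if view_type == media_type:
                    captures.update(view_captures)
        elif member in views:
            # A Capture Scene View contributes all of its Media Captures.
            captures.update(views[member][1])
        else:
            # Anything else is taken to be a Media Capture ID.
            captures.add(member)
    return captures
```

   Whether a member names a Capture, a Capture Scene View, or a Capture
   Scene, the result is a set of Capture identifiers of one media type.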

   If an Advertisement does not include Simultaneous Transmission Sets,
   then the Provider MUST be able to simultaneously provide all the
   Captures from any one CSV of each media type from each Capture Scene.
   Likewise, if there are no Simultaneous Transmission Sets and there is
   a Global View list, then the Provider MUST be able to simultaneously
   provide all the Captures from any particular Global View (of each
   media type) from the Global View list.

   If an Advertisement includes multiple Capture Scene Views in a
   Capture Scene, then the Consumer MAY choose one Capture Scene View
   for each media type, or it MAY choose individual Captures based on
   the Simultaneous Transmission Sets.

9.  Encodings

   Individual encodings and encoding groups are CLUE's mechanisms
   allowing a Provider to signal its limitations for sending Captures,
   or combinations of Captures, to a Consumer.  Consumers can map the
   Captures they want to receive onto the Encodings, with the encoding
   parameters they want.  As for the relationship between the CLUE-
   specified mechanisms based on Encodings and the SIP offer/answer
   exchange, please refer to Section 5.

9.1.  Individual Encodings

   An Individual Encoding represents a way to encode a Media Capture as
   a Capture Encoding, to be sent as an encoded media stream from the
   Provider to the Consumer.  An Individual Encoding has a set of
   parameters characterizing how the media is encoded.

   Different media types have different parameters, and different
   encoding algorithms may have different parameters.  An Individual
   Encoding can be assigned to at most one Capture Encoding at any given
   time.

   Individual Encoding parameters are represented in SDP [RFC4566], not
   in CLUE messages.  For example, for a video encoding using H.26x
   compression technologies, this can include parameters such as:

   *  Maximum bandwidth;
   *  Maximum picture size in pixels;
   *  Maximum number of pixels to be processed per second.

   The bandwidth parameter is the only one that specifically relates to
   a CLUE Advertisement, as it can be further constrained by the maximum
   group bandwidth in an Encoding Group.

9.2.  Encoding Group

   An Encoding Group includes a set of one or more Individual Encodings,
   and parameters that apply to the group as a whole.  By grouping
   multiple individual Encodings together, an Encoding Group describes
   additional constraints on bandwidth for the group.  A single Encoding
   Group MAY refer to Encodings for different media types.

   The Encoding Group data structure contains:


   *  Maximum bitrate for all encodings in the group combined;

   *  A list of identifiers for the Individual Encodings belonging to
      the group.

   When the Individual Encodings in a group are instantiated into
   Capture Encodings, each Capture Encoding has a bitrate that MUST be
   less than or equal to the max bitrate for the particular Individual
   Encoding.  The "maximum bitrate for all encodings in the group"
   parameter gives the additional restriction that the sum of all the
   individual Capture Encoding bitrates MUST be less than or equal to
   this group value.
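
   The two bitrate limits can be expressed as a short Python check.
   This is a sketch for illustration: the function name and the
   encoding identifiers are hypothetical, and bitrates are assumed to
   be plain numbers in a common unit.

```python
def check_encoding_group(max_group_bitrate, encoding_max, configured):
    """Check configured Capture Encoding bitrates against both limits.

    max_group_bitrate -- maximum bitrate for all encodings in the group
    encoding_max      -- Individual Encoding ID -> its maximum bitrate
    configured        -- Individual Encoding ID -> bitrate of the Capture
                         Encoding instantiated from it
    """
    for enc_id, bitrate in configured.items():
        if enc_id not in encoding_max:
            return False  # encoding is not part of this group
        if bitrate > encoding_max[enc_id]:
            return False  # exceeds the Individual Encoding limit
    # The sum of all Capture Encoding bitrates must fit the group limit.
    return sum(configured.values()) <= max_group_bitrate
```

   For instance, two Capture Encodings can each be within their own
   Individual Encoding limit and still be rejected because their sum
   exceeds the group maximum.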

   The following diagram illustrates one example of the structure of a
   media Provider's Encoding Groups and their contents.

   |             Media Provider                      |
   |                                                 |
   |  ,--------------------------------------.       |
   |  | ,--------------------------------------.     |
   |  | | ,--------------------------------------.   |
   |  | | |          Encoding Group              |   |
   |  | | | ,-----------.                        |   |
   |  | | | |           | ,---------.            |   |
   |  | | | |           | |         | ,---------.|   |
   |  | | | | Encoding1 | |Encoding2| |Encoding3||   |
   |  `.| | |           | |         | `---------'|   |
   |    `.| `-----------' `---------'            |   |
   |      `--------------------------------------'   |

                     Figure 4: Encoding Group Structure

   A Provider advertises one or more Encoding Groups.  Each Encoding
   Group includes one or more Individual Encodings.  Each Individual
   Encoding can represent a different way of encoding media.  For
   example, one Individual Encoding may be 1080p60 video, another could
   be 720p30, with a third being CIF, all in, for example, H.264 format.

   While a typical three-codec/display system might have one Encoding
   Group per "codec box" (physical codec, connected to one camera and
   one screen), there are many possibilities for the number of Encoding
   Groups a Provider may be able to offer and for the encoding values in
   each Encoding Group.

   There is no requirement for all Encodings within an Encoding Group to
   be instantiated at the same time.

9.3.  Associating Captures with Encoding Groups

   Each Media Capture, including MCCs, MAY be associated with one
   Encoding Group.  To be eligible for configuration, a Media Capture
   MUST be associated with one Encoding Group, which is used to
   instantiate that Capture into a Capture Encoding.  When an MCC is
   configured, all the Media Captures referenced by the MCC will appear
   in the Capture Encoding according to the attributes of the chosen
   encoding of the MCC.  This allows an Advertiser to specify encoding
   attributes associated with the Media Captures without the need to
   provide an individual Capture Encoding for each of the inputs.

   If an Encoding Group is assigned to a Media Capture referenced by the
   MCC, it indicates that this Capture may also have an individual
   Capture Encoding.

   For example:


                  | Capture Scene #1 |                 |
                  | VC1              | EncodeGroupID=1 |
                  | VC2              |                 |
                  | MCC1(VC1,VC2)    | EncodeGroupID=2 |
                  | CSV(VC1)         |                 |
                  | CSV(MCC1)        |                 |

                    Table 6: Example Usage of Encoding
                       with MCC and Source Captures

   This would indicate that VC1 may be sent as its own Capture Encoding
   from EncodeGroupID=1 or that it may be sent as part of a Capture
   Encoding from EncodeGroupID=2 along with VC2.

   More than one Capture MAY use the same Encoding Group.

   The maximum number of Capture Encodings that can result from a
   particular Encoding Group constraint is equal to the number of
   individual Encodings in the group.  The actual number of Capture
   Encodings used at any time MAY be less than this maximum.  Any of the
   Captures that use a particular Encoding Group can be encoded
   according to any of the Individual Encodings in the group.

   It is a protocol conformance requirement that the Encoding Groups
   MUST allow all the Captures in a particular Capture Scene View to be
   used simultaneously.

10.  Consumer's Choice of Streams to Receive from the Provider

   After receiving the Provider's Advertisement message (which includes
   media captures and associated constraints), the Consumer composes its
   reply to the Provider in the form of a Configure message.  The
   Consumer is free to use the information in the Advertisement as it
   chooses, but there are a few obviously sensible design choices, which
   are outlined below.

   If multiple Providers connect to the same Consumer (i.e., in an
   MCU-less multiparty call), it is the responsibility of the Consumer
   to compose Configures for each Provider that both fulfill each
   Provider's constraints, as expressed in the Advertisement, and fit
   within its own capabilities.

   In an MCU-based multiparty call, the MCU can logically terminate the
   Advertisement/Configure negotiation in that it can hide the
   characteristics of the receiving endpoint and rely on its own
   capabilities (transcoding/transrating/etc.) to create Media Streams
   that can be decoded at the Endpoint Consumers.  The timing of an
   MCU's sending of Advertisements (for its outgoing ports) and
   Configures (for its incoming ports, in response to Advertisements
   received there) is up to the MCU and is implementation dependent.

   As a general outline, a Consumer can choose, based on the
   Advertisement it has received, which Captures it wishes to receive,
   and which Individual Encodings it wants the Provider to use to encode
   the Captures.

   On receipt of an Advertisement with an MCC, the Consumer treats the
   MCC as per other non-MCC Captures, with the following differences:


   *  The Consumer would understand that the MCC is a Capture that
      includes the referenced individual Captures (or any Captures, if
      none are referenced) and that these individual Captures are
      delivered as part of the MCC's Capture Encoding.


   *  The Consumer may utilize any of the attributes associated with the
      referenced individual Captures and any Capture Scene attributes
      from where the individual Captures were defined to choose Captures
      and for rendering decisions.


   *  If the MCC attribute Allow Subset Choice is true, then the
      Consumer may or may not choose to receive all the indicated
      Captures.  It can choose to receive a subset of Captures indicated
      by the MCC.

   For example, if the Consumer receives:

      MCC1(VC1,VC2,VC3){attributes}

   A Consumer could choose all the Captures within an MCC; however, if
   the Consumer determines that it doesn't want VC3, it can return
   MCC1(VC1,VC2).  If it wants all the individual Captures, then it
   returns only the MCC identity (i.e., MCC1).  If the MCC in the
   advertisement does not reference any individual captures, or the
   Allow Subset Choice attribute is false, then the Consumer cannot
   choose what is included in the MCC: it is up to the Provider to
   decide.

   A Configure Message includes a list of Capture Encodings.  These are
   the Capture Encodings the Consumer wishes to receive from the
   Provider.  Each Capture Encoding refers to one Media Capture and one
   Individual Encoding.

   For each Capture the Consumer wants to receive, it configures one of
   the Encodings in that Capture's Encoding Group.  The Consumer does
   this by telling the Provider, in its Configure Message, which
   Encoding to use for each chosen Capture.  Upon receipt of this
   Configure from the Consumer, common knowledge is established between
   Provider and Consumer regarding sensible choices for the media
   streams.  The setup of the actual media channels, at least in the
   simplest case, is left to a following offer/answer exchange.
   Optimized implementations may speed up the reaction to the
   offer/answer exchange by reserving the resources at the time of
   finalization of the CLUE handshake.
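
   The shape of a Configure message and the checks a Provider might
   apply to it can be sketched in Python as follows.  This is not the
   CLUE wire format: the class, the function, and the dictionary
   layouts are hypothetical and exist only to illustrate that each
   Capture Encoding pairs one Media Capture with one Individual
   Encoding from that Capture's Encoding Group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureEncoding:
    capture_id: str    # the one Media Capture this entry refers to
    encoding_id: str   # the one Individual Encoding this entry refers to


def validate_configure(configure, capture_group, group_encodings):
    """Check a list of CaptureEncoding entries against an Advertisement.

    capture_group   -- Capture ID -> Encoding Group ID (only Captures that
                       have an Encoding Group are eligible)
    group_encodings -- Encoding Group ID -> set of Individual Encoding IDs
    """
    used = set()
    for entry in configure:
        group = capture_group.get(entry.capture_id)
        if group is None:
            return False  # Capture has no Encoding Group
        if entry.encoding_id not in group_encodings[group]:
            return False  # Encoding is not in the Capture's group
        if entry.encoding_id in used:
            return False  # an Individual Encoding serves one Capture Encoding
        used.add(entry.encoding_id)
    return True
```

   A Capture without an Encoding Group is rejected, as is an entry that
   names an Individual Encoding from a different group or reuses an
   Individual Encoding that another entry has already taken.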

   CLUE advertisements and configure messages don't necessarily require
   a new SDP offer/answer for every CLUE message exchange.  But the
   resulting encodings sent via RTP must conform to the most-recent SDP
   offer/answer result.

   In order to meaningfully create and send an initial Configure, the
   Consumer needs to have received at least one Advertisement, and an
   SDP offer defining the Individual Encodings, from the Provider.

   In addition, the Consumer can send a Configure at any time during the
   call.  The Configure MUST be valid according to the most recently
   received Advertisement.  The Consumer can send a Configure either in
   response to a new Advertisement from the Provider or on its own, for
   example, because of a local change in conditions (people leaving the
   room, connectivity changes, multipoint-related considerations).

   When choosing which Media Streams to receive from the Provider, and
   the encoding characteristics of those Media Streams, the Consumer
   advantageously takes several things into account: its local
   preference, simultaneity restrictions, and encoding limits.

10.1.  Local Preference

   A variety of local factors influence the Consumer's choice of Media
   Streams to be received from the Provider:

   *  If the Consumer is an Endpoint, it is likely that it would choose,
      where possible, to receive video and audio Captures that match the
      number of display devices and audio system it has.

   *  If the Consumer is an MCU, it may choose to receive loudest
      speaker streams (in order to perform its own media composition)
      and avoid pre-composed video Captures.

   *  User choice (for instance, selection of a new layout) may result
      in a different set of Captures, or different encoding
      characteristics, being required by the Consumer.

10.2.  Physical Simultaneity Restrictions

   Often there are physical simultaneity constraints of the Provider
   that affect the Provider's ability to simultaneously send all of the
   captures the Consumer would wish to receive.  For instance, an MCU,
   when connected to a multi-camera room system, might prefer to receive
   both individual video streams of the people present in the room and
   an overall view of the room from a single camera.  Some Endpoint
   systems might be able to provide both of these sets of streams
   simultaneously, whereas others might not (if the overall room view
   were produced by changing the optical zoom level on the center
   camera, for instance).

10.3.  Encoding and Encoding Group Limits

   Each of the Provider's encoding groups has limits on bandwidth, and
   the constituent potential encodings have limits on the bandwidth,
   computational complexity, video frame rate, and resolution that can
   be provided.  When choosing the Captures to be received from a
   Provider, a Consumer device MUST ensure that the encoding
   characteristics requested for each individual Capture fit within the
   capability of the encoding it is being configured to use, as well as
   ensuring that the combined encoding characteristics for Captures fit
   within the capabilities of their associated encoding groups.  In some
   cases, this could cause an otherwise "preferred" choice of capture
   encodings to be passed over in favor of different Capture Encodings.
   For instance, if a set of three Captures could only be provided at a
   low resolution, then a three-screen device could switch to favoring
   a single, higher-quality Capture Encoding.
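
   The two checks described in this section can be sketched as follows.
   This is only an illustration under stated assumptions: the class and
   function names are hypothetical and not part of CLUE, the limit
   values are borrowed from the EG1 example later in this document, and
   the requested resolutions and bit rates are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingLimits:
    """Limits of one potential encoding (hypothetical field names)."""
    max_width: int
    max_height: int
    max_frame_rate: int
    max_pps: int        # pixels per second
    max_bandwidth: int  # bits per second


@dataclass(frozen=True)
class Request:
    """Hypothetical characteristics a Consumer asks for in a Configure."""
    width: int
    height: int
    frame_rate: int
    bandwidth: int


def fits_encoding(req, enc):
    """Check one requested Capture Encoding against its encoding's limits."""
    return (req.width <= enc.max_width
            and req.height <= enc.max_height
            and req.frame_rate <= enc.max_frame_rate
            and req.width * req.height * req.frame_rate <= enc.max_pps
            and req.bandwidth <= enc.max_bandwidth)


def fits_group(assignments, max_group_bandwidth):
    """assignments: list of (Request, EncodingLimits) pairs in one group."""
    each_ok = all(fits_encoding(r, e) for r, e in assignments)
    total = sum(r.bandwidth for r, _ in assignments)
    return each_ok and total <= max_group_bandwidth


# Limits borrowed from the EG1 example in this document.
ENC3 = EncodingLimits(1920, 1088, 60, 124416000, 4000000)
ENC4 = EncodingLimits(1280, 720, 30, 27648000, 4000000)
ENC5 = EncodingLimits(960, 544, 30, 15552000, 4000000)
EG1_MAX_GROUP_BANDWIDTH = 6000000

# Invented requests: a low-resolution and a high-resolution stream.
low = Request(960, 540, 30, 2000000)
high = Request(1920, 1080, 60, 4000000)

three_low = fits_group([(low, ENC3), (low, ENC4), (low, ENC5)],
                       EG1_MAX_GROUP_BANDWIDTH)
one_high = fits_group([(high, ENC3)], EG1_MAX_GROUP_BANDWIDTH)
two_high = fits_group([(high, ENC3), (high, ENC4)],
                      EG1_MAX_GROUP_BANDWIDTH)
```

   Under these assumed numbers, three low-resolution streams and one
   high-resolution stream are both acceptable configurations, while two
   high-resolution streams are not, which is the kind of trade-off the
   paragraph above describes.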

11.  Extensibility

   One important characteristic of the Framework is its extensibility.
   The standard for interoperability and handling multiple streams must
   be future-proof.  The framework itself is inherently extensible
   through expanding the data model types.  For example:


   *  Adding more types of media, such as telemetry, can be done by
      defining additional types of Captures in addition to audio and
      video.

   *  Adding new functionalities, such as 3-D video Captures, may
      require additional attributes describing the Captures.

   The infrastructure is designed to be extended rather than requiring
   new infrastructure elements.  Extension comes through adding to
   defined types.

12.  Examples - Using the Framework (Informative)

   This section gives some examples, first from the point of view of the
   Provider, then the Consumer, then some multipoint scenarios.

12.1.  Provider Behavior

   This section shows some examples in more detail of how a Provider can
   use the framework to represent a typical case for telepresence rooms.
   First, an endpoint is illustrated, then an MCU case is shown.

12.1.1.  Three-Screen Endpoint Provider

   Consider an Endpoint with the following description:


   *  Three cameras, three displays, and a six-person table

   *  Each camera can provide one Capture for each 1/3-section of the
      table.

   *  A single Capture representing the active speaker can be provided
      (voice-activity-based camera selection to a given encoder input
      port implemented locally in the Endpoint).

   *  A single Capture representing the active speaker with the other
      two Captures shown picture in picture (PiP) within the stream can
      be provided (again, implemented inside the endpoint).

   *  A Capture showing a zoomed out view of all six seats in the room
      can be provided.

   The video and audio Captures for this Endpoint can be described as
   follows.

   Video Captures:

   VC0   (the left camera stream), encoding group=EG0, view=table

   VC1   (the center camera stream), encoding group=EG1, view=table

   VC2   (the right camera stream), encoding group=EG2, view=table

   MCC3  (the loudest panel stream), encoding group=EG1, view=table,
         MaxCaptures=1, policy=SoundLevel

   MCC4  (the loudest panel stream with PiPs), encoding group=EG1,
         view=room, MaxCaptures=3, policy=SoundLevel

   VC5   (the zoomed out view of all people in the room), encoding
         group=EG1, view=room

   VC6   (presentation stream), encoding group=EG1, presentation

   The following diagram is a top view of the room with three cameras,
   three displays, and six seats.  Each camera captures two people.  The
   six seats are not all in a straight line.

      ,-. d
     (   )`--.__        +---+
      `-' /     `--.__  |   |
    ,-.  |            `-.._ |_-+Camera 2 (VC2)
   (   ).'     <--(AC1)-+-''`+-+
    `-' |_...---''      |   |
    ,-.c+-..__          +---+
   (   )|     ``--..__  |   |
    `-' |             ``+-..|_-+Camera 1 (VC1)
    ,-. |      <--(AC2)..--'|+-+                          ^
   (   )|     __..--'   |   |                             |
    `-'b|..--'          +---+                             |X
    ,-. |``---..___     |   |                             |
   (   )\          ```--..._|_-+Camera 0 (VC0)            |
    `-'  \     <--(AC0) ..-''`-+                          |
     ,-. \      __.--'' |   |                  <----------+
    (   ) |..-''        +---+                     Y
     `-' a                          (0,0,0) origin is under Camera 1

                       Figure 5: Room Layout Top View

   The two points labeled 'b' and 'c' are intended to be at the midpoint
   between the seating positions, and where the fields of view of the
   cameras intersect.

   The plane of interest for VC0 is a vertical plane that intersects
   points 'a' and 'b'.

   The plane of interest for VC1 intersects points 'b' and 'c'.  The
   plane of interest for VC2 intersects points 'c' and 'd'.

   This example uses an area scale of millimeters.

   Areas of capture:

       bottom left    bottom right  top left         top right
   VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
   VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
   VC2 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757)
   MCC3(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   MCC4(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   VC6 none

   Points of capture:

   VC0 (-1678,0,800)
   VC1 (0,0,800)
   VC2 (1678,0,800)
   MCC3 none
   MCC4 none
   VC5 (0,0,800)
   VC6 none

   In this example, the right edge of the VC0 area lines up with the
   left edge of the VC1 area.  It doesn't have to be this way.  There
   could be a gap or an overlap.  One additional thing to note for this
   example is the distance from 'a' to 'b' is equal to the distance from
   'b' to 'c' and the distance from 'c' to 'd'.  All these distances are
   1346 mm.  This is the planar width of each area of capture for VC0,
   VC1, and VC2.
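
   The 1346 mm figure can be checked from the bottom corners of the
   Areas of Capture listed above.  The short sketch below assumes that
   points 'a' through 'd' are the ground-plane (x, y) positions of those
   corners, which is how the planes of interest are described in this
   example; the variable and function names are illustrative only.

```python
import math

# Ground-plane (x, y) coordinates in millimeters, taken from the bottom
# corners of the VC0, VC1, and VC2 Areas of Capture in this example.
points = {
    "a": (-2011, 2850),
    "b": (-673, 3000),
    "c": (673, 3000),
    "d": (2011, 2850),
}


def planar_width(p, q):
    """Straight-line distance between two labeled points, in mm."""
    return math.dist(points[p], points[q])


# Planar width of each camera's area of capture.
widths = {seg: planar_width(seg[0], seg[1]) for seg in ("ab", "bc", "cd")}
rounded = {seg: round(w) for seg, w in widths.items()}
```

   With these coordinates, each of the three widths rounds to 1346 mm,
   in agreement with the text.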

   Note the text in parentheses (e.g., "the left camera stream") is not
   explicitly part of the model; it is just explanatory text for this
   example, and it is not included in the model with the media captures
   and attributes.  Also, MCC4 doesn't say anything about how a capture
   is composed, so the media consumer can't tell based on this capture
   that MCC4 is composed of a "loudest panel with PiPs".

   Audio Captures:

   Three ceiling microphones are located between the cameras and the
   table, at the same height as the cameras.  The microphones point down
   at an angle toward the seating positions.


   *  AC0 (left), encoding group=EG3

   *  AC1 (right), encoding group=EG3

   *  AC2 (center), encoding group=EG3

   *  AC3 being a simple pre-mixed audio stream from the room (mono),
      encoding group=EG3

   *  AC4 audio stream associated with the presentation video (mono),
      encoding group=EG3, presentation

   Point of Capture:          Point on Line of Capture:
   AC0 (-1342,2000,800)       (-1342,2925,379)
   AC1 ( 1342,2000,800)       ( 1342,2925,379)
   AC2 (    0,2000,800)       (    0,3000,379)
   AC3 (    0,2000,800)       (    0,3000,379)
   AC4 none

   The physical simultaneity information is:

      Simultaneous transmission set #1 {VC0, VC1, VC2, MCC3, MCC4, VC6}

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

   This constraint indicates that it is not possible to use all the VCs
   at the same time.  VC5 cannot be used at the same time as VC1 or MCC3
   or MCC4.  Also, using every member in the set simultaneously may not
   make sense, for example, MCC3 (loudest) and MCC4 (loudest with
   PiP).  In addition, there are encoding constraints that make choosing
   all of the VCs in a set impossible.  VC1, MCC3, MCC4, VC5, and VC6
   all use EG1, and EG1 has only three ENCs.  This constraint shows up in
   the encoding groups, not in the simultaneous transmission sets.

   In this example, there are no restrictions on which Audio Captures
   can be sent simultaneously.
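
   A Consumer could test a candidate selection of Video Captures against
   the two simultaneous transmission sets above with a simple subset
   check.  The following is a minimal sketch; the function name and the
   representation of the sets are assumptions made for illustration, and
   encoding group limits are deliberately not considered here.

```python
# The two simultaneous transmission sets from this example.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]


def can_send_together(captures, sets=SIMULTANEOUS_SETS):
    """True if all requested captures appear together in at least one set."""
    wanted = set(captures)
    return any(wanted <= s for s in sets)


three_cameras = can_send_together(["VC0", "VC1", "VC2"])
room_and_center = can_send_together(["VC5", "VC1"])
room_and_sides = can_send_together(["VC0", "VC2", "VC5", "VC6"])
```

   The three-camera selection and the room view with the two side
   cameras are both allowed, whereas VC5 together with VC1 is rejected
   because no single set contains both.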

   Encoding Groups:

   This example has three encoding groups associated with the video
   captures.  Each group can have three encodings, but with each
   potential encoding having a progressively lower specification.  In
   this example, 1080p60 transmission is possible (as ENC0 has a maxPps
   value compatible with that).  Significantly, as up to three encodings
   are available per group, it is possible to transmit some video
   Captures simultaneously that are not in the same view in the Capture
   Scene, for example, VC1 and MCC3 at the same time.  The information
   below about Encodings is a summary of what would be conveyed in SDP,
   not directly in the CLUE Advertisement.

   encodeGroupID=EG0, maxGroupBandwidth=6000000
       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxPps=124416000, maxBandwidth=4000000
       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxPps=27648000, maxBandwidth=4000000
       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxPps=15552000, maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupBandwidth=6000000
       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxPps=124416000, maxBandwidth=4000000
       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxPps=27648000, maxBandwidth=4000000
       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxPps=15552000, maxBandwidth=4000000
   encodeGroupID=EG2, maxGroupBandwidth=6000000
       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxPps=124416000, maxBandwidth=4000000
       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxPps=27648000, maxBandwidth=4000000
       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxPps=15552000, maxBandwidth=4000000

                Figure 6: Example Encoding Groups for Video

   For audio, there are five potential encodings available, so all five
   Audio Captures can be encoded at the same time.

   encodeGroupID=EG3, maxGroupBandwidth=320000
       encodeID=ENC9, maxBandwidth=64000
       encodeID=ENC10, maxBandwidth=64000
       encodeID=ENC11, maxBandwidth=64000
       encodeID=ENC12, maxBandwidth=64000
       encodeID=ENC13, maxBandwidth=64000

                 Figure 7: Example Encoding Group for Audio

   Capture Scenes:

   The following table represents the Capture Scenes for this Provider.
   Recall that a Capture Scene is composed of alternative Capture Scene
   Views covering the same spatial region.  Capture Scene #1 is for the
   main people captures, and Capture Scene #2 is for presentation.

   Each row in the table is a separate Capture Scene View.

                             | Capture Scene #1 |
                             | VC0, VC1, VC2    |
                             | MCC3             |
                             | MCC4             |
                             | VC5              |
                             | AC0, AC1, AC2    |
                             | AC3              |

                             | Capture Scene #2 |
                             | VC6              |
                             | AC4              |

                               Table 7: Example
                             Capture Scene Views

   Different Capture Scenes are distinct from each other and do not
   overlap.  A Consumer can choose a view from each Capture Scene.  In
   this case, the three Captures, VC0, VC1, and VC2, are one way of
   representing the video from the Endpoint.  These three Captures
   should appear next to each other.  Alternatively, another way of
   representing the Capture Scene is with the capture MCC3, which
   automatically shows the person who is talking; this is the same for
   the MCC4 and VC5 alternatives.

   As in the video case, the different views of audio in Capture Scene
   #1 represent the "same thing", in that one way to receive the audio
   is with the three Audio Captures (AC0, AC1, and AC2), and another way
   is with the mixed AC3.  The Media Consumer can choose an audio CSV it
   is capable of receiving.

   The spatial ordering is understood from the Media Capture attributes
   Area of Capture, Point of Capture, and Point on Line of Capture.

   A Media Consumer would likely want to choose a Capture Scene View to
   receive, based in part on how many streams it can simultaneously
   receive.  A consumer that can receive three video streams would
   probably prefer to receive the first view of Capture Scene #1 (VC0,
   VC1, and VC2) and not receive the other views.  A consumer that can
   receive only one video stream would probably choose one of the other
   views.

   If the consumer can receive a presentation stream too, it would also
   choose to receive the only view from Capture Scene #2 (VC6).

12.1.2.  Encoding Group Example

   This is an example of an Encoding Group to illustrate how it can
   express dependencies between Encodings.  The information below about
   Encodings is a summary of what would be conveyed in SDP, not directly
   in the CLUE Advertisement.

   encodeGroupID=EG0 maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

   Here, the Encoding Group is EG0.  Although the Encoding Group is
   capable of transmitting up to 6 Mbit/s, no individual video Encoding
   can exceed 4 Mbit/s.

   This encoding group also allows up to three audio encodings,
   AUDENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if so, then the group's
   overall maxGroupBandwidth value is a limit on the sum of all audio
   and video encodings configured by the consumer.  A system that does
   not wish or need to combine bandwidth limitations in this way should
   instead use separate encoding groups for audio and video in order for
   the bandwidth limitations on audio and video to not interact.

   Audio and video can be expressed in separate encoding groups, as in
   this illustration.

   encodeGroupID=EG0 maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
   encodeGroupID=EG1 maxGroupBandwidth=500000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000
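
   The effect of sharing one encoding group between audio and video can
   be illustrated by summing the configured bit rates per group.  In the
   sketch below, the function name is hypothetical, and the 3 Mbit/s
   video bit rate is an invented Consumer choice (within the 4 Mbit/s
   per-encoding limit); the group limits are those shown above.

```python
def within_group_limits(configured, group_limits):
    """configured: list of (group_id, bits_per_second) for each encoding.
    group_limits: mapping of group_id to its maxGroupBandwidth."""
    totals = {}
    for group_id, bandwidth in configured:
        totals[group_id] = totals.get(group_id, 0) + bandwidth
    return all(totals[g] <= group_limits[g] for g in totals)


VIDEO_RATE = 3000000  # invented Consumer choice, below the 4 Mbit/s maximum
AUDIO_RATE = 96000

# Audio and video in one group (first example): the sums interact.
shared = within_group_limits(
    [("EG0", VIDEO_RATE)] * 2 + [("EG0", AUDIO_RATE)] * 3,
    {"EG0": 6000000},
)

# Audio and video in separate groups (second example): no interaction.
separate = within_group_limits(
    [("EG0", VIDEO_RATE)] * 2 + [("EG1", AUDIO_RATE)] * 3,
    {"EG0": 6000000, "EG1": 500000},
)
```

   With a shared group, the three audio encodings push the total to
   6288000 bit/s, which exceeds the group limit; with separate groups,
   the same configuration fits.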

12.1.3.  The MCU Case

   This section shows how an MCU might express its Capture Scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  Each MCC is for video.  A single Audio
   Capture is provided for all single and multi-screen configurations
   that can be associated (e.g., lip-synced) with any combination of
   Video Captures (the MCCs) at the consumer.


      | Capture Scene #1         |                                  |
      | MCC0                     | for a single-screen consumer     |
      | MCC1, MCC2               | for a two-screen consumer        |
      | MCC3, MCC4, MCC5         | for a three-screen consumer      |
      | MCC6, MCC7, MCC8, MCC9   | for a four-screen consumer       |
      | AC0                      | AC representing all participants |
      | CSV(MCC0)                |                                  |
      | CSV(MCC1,MCC2)           |                                  |
      | CSV(MCC3,MCC4,MCC5)      |                                  |
      | CSV(MCC6,MCC7,MCC8,MCC9) |                                  |
      | CSV(AC0)                 |                                  |

                      Table 8: MCU Main Capture Scenes

   If/when a presentation stream becomes active within the conference,
   the MCU might re-advertise the available media as:


        | Capture Scene #2 | Note                                 |
        | VC10             | Video capture for presentation       |
        | AC1              | Presentation audio to accompany VC10 |
        | CSV(VC10)        |                                      |
        | CSV(AC1)         |                                      |

                  Table 9: MCU Presentation Capture Scene

12.2.  Media Consumer Behavior

   This section gives an example of how a Media Consumer might behave
   when deciding how to request streams from the three-screen endpoint
   described in the previous section.

   The receive side of a call needs to balance its requirements (based
   on number of screens and speakers), its decoding capabilities,
   available bandwidth, and the provider's capabilities in order to
   optimally configure the provider's streams.  Typically, it would want
   to receive and decode media from each Capture Scene advertised by the
   Provider.

   A sane, basic algorithm might be for the consumer to go through each
   Capture Scene View in turn and find the collection of Video Captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video) and then decide between alternative views
   in the video Capture Scenes based either on hard-coded preferences or
   on user choice.  Once this choice has been made, the consumer would
   then decide how to configure the provider's encoding groups in order
   to make best use of the available network bandwidth and its own
   decoding capabilities.
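
   The view-selection step of such an algorithm could look roughly like
   the sketch below.  It is one possible reading of the paragraph above,
   not a normative procedure: the function name, the tie-breaking rule,
   and the "preference" parameter (standing in for hard-coded
   preferences or user choice) are all assumptions.

```python
# Video views of Capture Scene #1 from the three-screen endpoint example.
VIDEO_VIEWS = [
    ["VC0", "VC1", "VC2"],
    ["MCC3"],
    ["MCC4"],
    ["VC5"],
]


def choose_view(views, screens, preference=()):
    """Pick the view with the most Captures that still fits the screens.

    Ties are broken by an ordered list of preferred capture IDs, then by
    the order in which the views were advertised.
    """
    candidates = [v for v in views if len(v) <= screens]
    if not candidates:
        return None
    best_size = max(len(v) for v in candidates)
    best = [v for v in candidates if len(v) == best_size]
    for capture_id in preference:
        for view in best:
            if capture_id in view:
                return view
    return best[0]


three_screen = choose_view(VIDEO_VIEWS, screens=3)
one_screen = choose_view(VIDEO_VIEWS, screens=1)
one_screen_room = choose_view(VIDEO_VIEWS, screens=1, preference=("VC5",))
```

   In this sketch, a three-screen consumer ends up with VC0, VC1, and
   VC2, while a one-screen consumer gets MCC3 unless a preference for
   another single-capture view is supplied.  Configuring the encoding
   groups would follow as a separate step.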

12.2.1.  One-Screen Media Consumer

   MCC3, MCC4, and VC5 are all different views by themselves, not
   grouped together in a single view; so, the receiving device should
   choose one of those.  The choice would come down to whether to see
   the greatest number of participants simultaneously at roughly equal
   precedence (VC5), a switched view of just the loudest region
   (MCC3), or a switched view with PiPs (MCC4).  An endpoint device with
   a small amount of knowledge of these differences could offer a
   dynamic choice of these options, in-call, to the user.

12.2.2.  Two-Screen Media Consumer Configuring the Example

   Mixing systems with an even number of screens, "2n", and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "two-screen" system is really a "two-decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than two streams can be received and spread
   across the available screen area.  To enumerate three possible
   behaviors here for the two-screen system when it learns that the far
   end is "ideally" expressed via three capture streams:

   1.  Fall back to receiving just a single stream (MCC3, MCC4, or VC5
       as per the one-screen consumer case above) and either leave one
       screen blank or use it for presentation if/when a presentation
       becomes active.

   2.  Receive three streams (VC0, VC1, and VC2) and display across two
       screens (either with each capture being scaled to 2/3 of a screen
       and the center capture being split across two screens), or, as
       would be necessary if there were large bezels on the screens,
       with each stream being scaled to 1/2 the screen width and height
       and there being a fourth "blank" panel.  This fourth panel could
       potentially be used for any presentation that became active
       during the call.

   3.  Receive three streams, decode all three, and use control
       information indicating which was the most active to switch
       between showing the left and center streams (one per screen) and
       the center and right streams.

   For an endpoint capable of all three methods of working described
   above, again it might be appropriate to offer the user the choice of
   display mode.
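
   The first layout in behavior 2 (each capture scaled to 2/3 of a
   screen, with the center capture split across the two screens) is
   simple arithmetic.  The sketch below works it out with exact
   fractions; the function name and the tuple layout are illustrative
   assumptions, and bezels and vertical scaling are ignored.

```python
from fractions import Fraction


def spans_across_screens(num_captures, num_screens):
    """Lay equal-width captures side by side across the screens.

    Returns, per capture, a list of (screen_index, start, end) pieces,
    where start and end are fractions of that screen's width.
    """
    width = Fraction(num_screens, num_captures)
    layout = []
    for i in range(num_captures):
        left, right = i * width, (i + 1) * width
        pieces = []
        screen = int(left)  # index of the screen where this capture starts
        while screen < right:
            start = max(left, screen)
            end = min(right, screen + 1)
            pieces.append((screen, start - screen, end - screen))
            screen += 1
        layout.append(pieces)
    return layout


# VC0, VC1, and VC2 shown across a two-screen consumer.
layout = spans_across_screens(3, 2)
```

   Here the left capture occupies the first 2/3 of screen 0, the center
   capture takes the last 1/3 of screen 0 and the first 1/3 of screen 1,
   and the right capture fills the remaining 2/3 of screen 1.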

12.2.3.  Three-Screen Media Consumer Configuring the Example

   This is the most straightforward case: the Media Consumer would look
   to identify a set of streams to receive that best matched its
   available screens; so, the VC0 plus VC1 plus VC2 should match
   optimally.  The spatial ordering would give sufficient information
   for the correct Video Capture to be shown on the correct screen.  The
   consumer would either need to divide a single encoding group's
   capability by 3 to determine what resolution and frame rate to
   configure the provider with or to configure the individual Video
   Captures' Encoding Groups with what makes most sense (taking into
   account the receive side decode capabilities, overall call bandwidth,
   the resolution of the screens plus any user preferences such as
   motion vs. sharpness).

12.3.  Multipoint Conference Utilizing Multiple Content Captures

   The use of MCCs allows the MCU to construct outgoing Advertisements
   describing complex media switching and composition scenarios.  The
   following sections provide several examples.

   Note: In the examples, the identities of the CLUE elements (e.g.,
   Captures, Capture Scene) in the incoming Advertisements overlap.
   This is because there is no coordination between the endpoints.  The
   MCU is responsible for making these unique in the outgoing
   Advertisement.

12.3.1.  Single Media Captures and MCC in the Same Advertisement

   Four endpoints are involved in a Conference where CLUE is used.  An
   MCU acts as a middlebox between the endpoints with a CLUE channel
   between each endpoint and the MCU.  The MCU receives the following
   Advertisements.


           | Capture Scene #1 | Description=AustralianConfRoom |
           | VC1              | Description=Audience           |
           |                  | EncodeGroupID=1                |
           | CSV(VC1)         |                                |

              Table 10: Advertisement Received from Endpoint A

             | Capture Scene #1 | Description=ChinaConfRoom |
             | VC1              | Description=Speaker       |
             |                  | EncodeGroupID=1           |
             | VC2              | Description=Audience      |
             |                  | EncodeGroupID=1           |
             | CSV(VC1, VC2)    |                           |

             Table 11: Advertisement Received from Endpoint B


   Note: Endpoint B indicates that it sends two streams.

              | Capture Scene #1 | Description=USAConfRoom |
              | VC1              | Description=Audience    |
              |                  | EncodeGroupID=1         |
              | CSV(VC1)         |                         |

             Table 12: Advertisement Received from Endpoint C

   If the MCU wanted to provide a Multiple Content Capture containing a
   round-robin switched view of the audience from the three endpoints
   and the speaker, it could construct the following advertisement:

   Advertisement sent to Endpoint F

      | Capture Scene #1      | Description=AustralianConfRoom     |
      | VC1                   | Description=Audience               |
      | CSV(VC1)              |                                    |
      | Capture Scene #2      | Description=ChinaConfRoom          |
      | VC2                   | Description=Speaker                |
      | VC3                   | Description=Audience               |
      | CSV(VC2, VC3)         |                                    |
      | Capture Scene #3      | Description=USAConfRoom            |
      | VC4                   | Description=Audience               |
      | CSV(VC4)              |                                    |
      | Capture Scene #4      |                                    |
      | MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1                |
      |                       | MaxCaptures=1                      |
      |                       | EncodingGroup=1                    |
      | CSV(MCC1)             |                                    |

        Table 13: Advertisement Sent to Endpoint F - One Encoding
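
   The renumbering visible in the table above (Endpoint B's VC1 and VC2
   become VC2 and VC3, and Endpoint C's VC1 becomes VC4) can be sketched
   as a simple sequential mapping.  This is only one way an MCU might
   make identities unique; the function name and the mapping structure
   are assumptions for illustration.

```python
def renumber_video_captures(advertisements):
    """Assign unique outgoing IDs to incoming Video Captures.

    advertisements: list of (endpoint, [local capture IDs]) in the order
    the MCU wants them to appear.  Returns a mapping from
    (endpoint, local_id) to the outgoing capture ID.
    """
    mapping = {}
    next_index = 1
    for endpoint, capture_ids in advertisements:
        for local_id in capture_ids:
            mapping[(endpoint, local_id)] = f"VC{next_index}"
            next_index += 1
    return mapping


# Incoming Advertisements from Endpoints A, B, and C in this example.
incoming = [("A", ["VC1"]), ("B", ["VC1", "VC2"]), ("C", ["VC1"])]
outgoing = renumber_video_captures(incoming)

# Constituents of the round-robin MCC in the outgoing Advertisement.
mcc1_members = list(outgoing.values())
```

   The MCU would keep such a mapping so that a Configure from Endpoint F
   referring to, say, VC3 can be related back to Endpoint B's VC2.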


   Alternatively, if the MCU wanted to provide the speaker as one media
   stream and the audiences as another, it could assign an encoding
   group to VC2 in Capture Scene #2 and provide a CSV in Capture Scene #4
   as per the example below.

   Advertisement sent to Endpoint F

       | Capture Scene #1  | Description=AustralianConfRoom       |
       | VC1               | Description=Audience                 |
       | CSV(VC1)          |                                      |
       | Capture Scene #2  | Description=ChinaConfRoom            |
       | VC2               | Description=Speaker                  |
       |                   | EncodingGroup=1                      |
       | VC3               | Description=Audience                 |
       | CSV(VC2, VC3)     |                                      |
       | Capture Scene #3  | Description=USAConfRoom              |
       | VC4               | Description=Audience                 |
       | CSV(VC4)          |                                      |
       | Capture Scene #4  |                                      |
       | MCC1(VC1,VC3,VC4) | Policy=RoundRobin:1                  |
       |                   | MaxCaptures=1                        |
       |                   | EncodingGroup=1                      |
       |                   | AllowSubset=True                     |
       | MCC2(VC2)         | MaxCaptures=1                        |
       |                   | EncodingGroup=1                      |
       | CSV2(MCC1,MCC2)   |                                      |

        Table 14: Advertisement Sent to Endpoint F - Two Encodings


   Therefore, a Consumer could choose whether or not to have a separate
   speaker-related stream and could choose which endpoints to see.  If
   it wanted the second stream but not the Australian conference room,
   it could indicate the following captures in the Configure message:


                       | MCC1(VC3,VC4) | Encoding |
                       | VC2           | Encoding |

                    Table 15: MCU Case: Consumer Response

12.3.2.  Several MCCs in the Same Advertisement

   Multiple MCCs can be used where multiple streams are used to carry
   media from multiple endpoints.  For example:

   A conference has three endpoints D, E, and F.  Each endpoint has
   three video captures covering the left, middle, and right regions of
   each conference room.  The MCU receives the following advertisements
   from D and E.


           | Capture Scene #1 | Description=AustralianConfRoom |
           | VC1              | CaptureArea=Left               |
           |                  | EncodingGroup=1                |
           | VC2              | CaptureArea=Center             |
           |                  | EncodingGroup=1                |
           | VC3              | CaptureArea=Right              |
           |                  | EncodingGroup=1                |
           | CSV(VC1,VC2,VC3) |                                |

              Table 16: Advertisement Received from Endpoint D


             | Capture Scene #1 | Description=ChinaConfRoom |
             | VC1              | CaptureArea=Left          |
             |                  | EncodingGroup=1           |
             | VC2              | CaptureArea=Center        |
             |                  | EncodingGroup=1           |
             | VC3              | CaptureArea=Right         |
             |                  | EncodingGroup=1           |
             | CSV(VC1,VC2,VC3) |                           |

             Table 17: Advertisement Received from Endpoint E

   The MCU wants to offer Endpoint F three Capture Encodings.  Each
   Capture Encoding would contain all the Captures from either Endpoint
   D or Endpoint E, based on the active speaker.  The MCU sends the
   following Advertisement:


    | Capture Scene #1    | Description=AustralianConfRoom           |
    | VC1                 |                                          |
    | VC2                 |                                          |
    | VC3                 |                                          |
    | CSV(VC1,VC2,VC3)    |                                          |
    | Capture Scene #2    | Description=ChinaConfRoom                |
    | VC4                 |                                          |
    | VC5                 |                                          |
    | VC6                 |                                          |
    | CSV(VC4,VC5,VC6)    |                                          |
    | Capture Scene #3    |                                          |
    | MCC1(VC1,VC4)       | CaptureArea=Left                         |
    |                     | MaxCaptures=1                            |
    |                     | SynchronizationID=1                      |
    |                     | EncodingGroup=1                          |
    | MCC2(VC2,VC5)       | CaptureArea=Center                       |
    |                     | MaxCaptures=1                            |
    |                     | SynchronizationID=1                      |
    |                     | EncodingGroup=1                          |
    | MCC3(VC3,VC6)       | CaptureArea=Right                        |
    |                     | MaxCaptures=1                            |
    |                     | SynchronizationID=1                      |
    |                     | EncodingGroup=1                          |
    | CSV(MCC1,MCC2,MCC3) |                                          |

                Table 18: Advertisement Sent to Endpoint F
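
   As an illustration only, the way the MCU forms MCC1 to MCC3 in
   Table 18 can be thought of as grouping the renumbered source
   captures by CaptureArea.  The sketch below is in Python; the
   function name and the dictionary layout are assumptions made for
   this example and are not defined by CLUE.

```python
# Hypothetical sketch: build one switched MCC per CaptureArea from the
# captures of several source endpoints, mirroring Scene #3 of Table 18.
# The data layout and function name are illustrative assumptions.
from collections import OrderedDict


def build_site_switched_mccs(scenes, sync_id=1, encoding_group=1):
    """scenes: list of lists of (capture_id, capture_area) tuples."""
    by_area = OrderedDict()
    for captures in scenes:
        for capture_id, area in captures:
            by_area.setdefault(area, []).append(capture_id)

    mccs = []
    for index, (area, members) in enumerate(by_area.items(), start=1):
        mccs.append({
            "id": "MCC%d" % index,
            "members": members,
            "CaptureArea": area,
            "MaxCaptures": 1,              # one source at a time
            "SynchronizationID": sync_id,  # switch all MCCs together
            "EncodingGroup": encoding_group,
        })
    return mccs


# Captures as renumbered by the MCU in Table 18.
endpoint_d = [("VC1", "Left"), ("VC2", "Center"), ("VC3", "Right")]
endpoint_e = [("VC4", "Left"), ("VC5", "Center"), ("VC6", "Right")]
mccs = build_site_switched_mccs([endpoint_d, endpoint_e])
```

   With the inputs shown, the first entry corresponds to MCC1(VC1,VC4)
   with CaptureArea=Left, matching the table.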

12.3.3.  Heterogeneous Conference with Switching and Composition

   Consider a conference between endpoints with the following
   characteristics:

   Endpoint A -  4 screens, 3 cameras

   Endpoint B -  3 screens, 3 cameras

   Endpoint C -  3 screens, 3 cameras

   Endpoint D -  3 screens, 3 cameras

   Endpoint E -  1 screen, 1 camera

   Endpoint F -  2 screens, 1 camera

   Endpoint G -  1 screen, 1 camera

   This example focuses on what the user in one of the three-camera
   multi-screen endpoints sees.  Call this person User A, at Endpoint A.
   There are four large display screens at Endpoint A.  Whenever
   somebody at another site is speaking, all the video captures from
   that endpoint are shown on the large screens.  If the person speaking
   is at a three-camera site, then the video from those three cameras
   fills three of the screens.  If the person speaking is at a
   single-camera site, then video from that camera fills one of the
   screens, while the other screens show video from other single-camera
   endpoints.

   User A hears audio from the four loudest talkers.

   User A can also see video from other endpoints, in addition to the
   current person speaking, although much smaller in size.  Endpoint A
   has four screens, so one of those screens shows up to nine other
   Media Captures in a tiled fashion.  When video from a three-camera
   endpoint appears in the tiled area, video from all three cameras
   appears together across the screen with correct spatial relationship
   among those three images.

      +---+---+---+ +-------------+ +-------------+ +-------------+
      |   |   |   | |             | |             | |             |
      +---+---+---+ |             | |             | |             |
      |   |   |   | |             | |             | |             |
      +---+---+---+ |             | |             | |             |
      |   |   |   | |             | |             | |             |
      +---+---+---+ +-------------+ +-------------+ +-------------+

                 Figure 8: Endpoint A - Four-Screen Display

   User B at Endpoint B sees a similar arrangement, except there are
   only three screens, so the nine other Media Captures are spread out
   across the bottom of the three displays, in a picture-in-picture
   (PiP) format.  When video from a three-camera endpoint appears in the
   PiP area, video from all three cameras appears together across a
   single screen with correct spatial relationship.

              +-------------+ +-------------+ +-------------+
              |             | |             | |             |
              |             | |             | |             |
              |             | |             | |             |
              | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
              | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
              +-------------+ +-------------+ +-------------+

           Figure 9: Endpoint B - Three-Screen Display with PiPs

   When somebody at a different endpoint becomes the current speaker,
   then User A and User B both see the video from the new person
   speaking appear on their large screen area, while the previous
   speaker takes one of the smaller tiled or PiP areas.  The person who
   is the current speaker doesn't see themselves; they see the previous
   speaker in their large screen area.

   One of the points of this example is that Endpoints A and B each want
   to receive three capture encodings for their large display areas and
   nine encodings for their smaller areas.  A and B are each able to
   send the same Configure message to the MCU, and each receives the
   same conceptual Media Captures from the MCU.  The differences are in
   how they are rendered and are purely a local matter at A and B.

   The Advertisements for such a scenario are described below.


             | Capture Scene #1    | Description=Endpoint x |
             | VC1                 | EncodingGroup=1        |
             | VC2                 | EncodingGroup=1        |
             | VC3                 | EncodingGroup=1        |
             | AC1                 | EncodingGroup=2        |
             | CSV1(VC1, VC2, VC3) |                        |
             | CSV2(AC1)           |                        |

                Table 19: Advertisement Received at the MCU
                          from Endpoints A to D

               | Capture Scene #1 | Description=Endpoint y |
               | VC1              | EncodingGroup=1        |
               | AC1              | EncodingGroup=2        |
               | CSV1(VC1)        |                        |
               | CSV2(AC1)        |                        |

                Table 20: Advertisement Received at the MCU
                          from Endpoints E to G

   Rather than considering what is displayed, CLUE concentrates more on
   what the MCU sends.  The MCU doesn't know anything about the number
   of screens an endpoint has.

   As Endpoints A to D each advertise that three Captures make up a
   Capture Scene, the MCU offers these in a "site switching" mode.  That
   is, there are three Multiple Content Captures (and Capture Encodings)
   each switching between Endpoints.  The MCU switches the applicable
   media into the stream based on voice activity.  Endpoint A will not
   see a capture from itself.

   Using the MCC concept, the MCU would send the following Advertisement
   to Endpoint A:


             | Capture Scene #1    | Description=Endpoint B |
             | VC4                 | CaptureArea=Left       |
             | VC5                 | CaptureArea=Center     |
             | VC6                 | CaptureArea=Right      |
             | AC1                 |                        |
             | CSV(VC4,VC5,VC6)    |                        |
             | CSV(AC1)            |                        |
             | Capture Scene #2    | Description=Endpoint C |
             | VC7                 | CaptureArea=Left       |
             | VC8                 | CaptureArea=Center     |
             | VC9                 | CaptureArea=Right      |
             | AC2                 |                        |
             | CSV(VC7,VC8,VC9)    |                        |
             | CSV(AC2)            |                        |
             | Capture Scene #3    | Description=Endpoint D |
             | VC10                | CaptureArea=Left       |
             | VC11                | CaptureArea=Center     |
             | VC12                | CaptureArea=Right      |
             | AC3                 |                        |
             | CSV(VC10,VC11,VC12) |                        |
             | CSV(AC3)            |                        |
             | Capture Scene #4    | Description=Endpoint E |
             | VC13                |                        |
             | AC4                 |                        |
             | CSV(VC13)           |                        |
             | CSV(AC4)            |                        |
             | Capture Scene #5    | Description=Endpoint F |
             | VC14                |                        |
             | AC5                 |                        |
             | CSV(VC14)           |                        |
             | CSV(AC5)            |                        |
             | Capture Scene #6    | Description=Endpoint G |
             | VC15                |                        |
             | AC6                 |                        |
             | CSV(VC15)           |                        |
             | CSV(AC6)            |                        |

                 Table 21: Advertisement Sent to Endpoint A -
                               Source Part

   The above part of the Advertisement presents information about the
   sources to the MCC.  The information is effectively the same as the
   received Advertisements, except that there are no Capture Encodings
   associated with them and the identities have been renumbered.

   In addition to the source Capture information, the MCU advertises
   site switching of Endpoints B to G in three streams.


   | Capture Scene #7  | Description=Output3streammix                  |
   |MCC1(VC4,VC7,VC10, | CaptureArea=Left                              |
   | VC13)             | MaxCaptures=1                                 |
   |                   | SynchronizationID=1                           |
   |                   | Policy=SoundLevel:0                           |
   |                   | EncodingGroup=1                               |
   |MCC2(VC5,VC8,VC11, | CaptureArea=Center                            |
   | VC14)             | MaxCaptures=1                                 |
   |                   | SynchronizationID=1                           |
   |                   | Policy=SoundLevel:0                           |
   |                   | EncodingGroup=1                               |
   |MCC3(VC6,VC9,VC12, | CaptureArea=Right                             |
   | VC15)             | MaxCaptures=1                                 |
   |                   | SynchronizationID=1                           |
   |                   | Policy=SoundLevel:0                           |
   |                   | EncodingGroup=1                               |
   |MCC4() (for audio) | CaptureArea=whole scene                       |
   |                   | MaxCaptures=1                                 |
   |                   | Policy=SoundLevel:0                           |
   |                   | EncodingGroup=2                               |
   |MCC5() (for audio) | CaptureArea=whole scene                       |
   |                   | MaxCaptures=1                                 |
   |                   | Policy=SoundLevel:1                           |
   |                   | EncodingGroup=2                               |
   |MCC6() (for audio) | CaptureArea=whole scene                       |
   |                   | MaxCaptures=1                                 |
   |                   | Policy=SoundLevel:2                           |
   |                   | EncodingGroup=2                               |
   |MCC7() (for audio) | CaptureArea=whole scene                       |
   |                   | MaxCaptures=1                                 |
   |                   | Policy=SoundLevel:3                           |
   |                   | EncodingGroup=2                               |
   |CSV(MCC1,MCC2,MCC3)|                                               |
   |CSV(MCC4,MCC5,MCC6,|                                               |
   | MCC7)             |                                               |

         Table 22: Advertisement Sent to Endpoint A - Switching Part

   The above part describes the three main switched streams that relate
   to site switching.  MaxCaptures=1 indicates that only one Capture
   from the MCC is sent at a particular time.  SynchronizationID=1
   indicates that the source sending is synchronized.  The provider can
   choose to group together VC13, VC14, and VC15 for the purpose of
   switching according to the SynchronizationID.  Therefore, when the
   provider switches one of them into an MCC, it can also switch the
   others even though they are not part of the same Capture Scene.
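
   The synchronized switching described above can be sketched as
   follows.  This is a minimal Python illustration under stated
   assumptions: the SWITCH_GROUPS table, its key names, and the
   function name are hypothetical and represent a provider-local
   choice, not anything signaled by CLUE.

```python
# Hypothetical sketch of synchronized site switching for Table 22.
# Each MCC carries exactly one of its member captures (MaxCaptures=1),
# and MCCs sharing a SynchronizationID switch sources together.
MCC_MEMBERS = {
    "MCC1": ["VC4", "VC7", "VC10", "VC13"],
    "MCC2": ["VC5", "VC8", "VC11", "VC14"],
    "MCC3": ["VC6", "VC9", "VC12", "VC15"],
}

# Provider-chosen groups of captures that are switched in together.
# The last group combines three single-camera endpoints (E, F, G),
# which belong to different Capture Scenes.
SWITCH_GROUPS = {
    "Endpoint B": ["VC4", "VC5", "VC6"],
    "Endpoint C": ["VC7", "VC8", "VC9"],
    "Endpoint D": ["VC10", "VC11", "VC12"],
    "single-camera": ["VC13", "VC14", "VC15"],
}


def select_sources(group_name):
    """Return the capture carried by each MCC for the given group."""
    group = set(SWITCH_GROUPS[group_name])
    selection = {}
    for mcc_id, members in MCC_MEMBERS.items():
        chosen = [vc for vc in members if vc in group]
        if len(chosen) != 1:
            raise ValueError("MaxCaptures=1 requires one source per MCC")
        selection[mcc_id] = chosen[0]
    return selection
```

   For example, select_sources("Endpoint C") yields VC7, VC8, and VC9
   for MCC1, MCC2, and MCC3, respectively, so the three streams always
   show a consistent left, center, and right view.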

   All the audio for the conference is included in Scene #7.  There
   isn't necessarily a one-to-one relation between any audio capture and
   video capture in this scene.  Typically, a change in the loudest
   talker will cause the MCU to switch the audio streams more quickly
   than switching video streams.

   The MCU can also supply nine media streams showing the active and
   previous eight speakers.  It includes the following in the
   Advertisement:


          | Capture Scene #8       | Description=Output9stream |
          |  MCC8(VC4,VC5,VC6,VC7, | MaxCaptures=1             |
          |     VC8,VC9,VC10,VC11, | Policy=SoundLevel:0       |
          |   VC12,VC13,VC14,VC15) | EncodingGroup=1           |
          |  MCC9(VC4,VC5,VC6,VC7, | MaxCaptures=1             |
          |     VC8,VC9,VC10,VC11, | Policy=SoundLevel:1       |
          |   VC12,VC13,VC14,VC15) | EncodingGroup=1           |
          |           to           |             to            |
          | MCC16(VC4,VC5,VC6,VC7, | MaxCaptures=1             |
          |     VC8,VC9,VC10,VC11, | Policy=SoundLevel:8       |
          |   VC12,VC13,VC14,VC15) | EncodingGroup=1           |
          |   CSV(MCC8,MCC9,MCC10, |                           |
          |     MCC11,MCC12,MCC13, |                           |
          |     MCC14,MCC15,MCC16) |                           |

               Table 23: Advertisement Sent to Endpoint A -
                             9 Switched Part

   The above part indicates that there are nine capture encodings.  Each
   of the Capture Encodings may contain any captures from any source
   site with a maximum of one Capture at a time.  Which Capture is
   present is determined by the policy.  The MCCs in this scene do not
   have any spatial attributes.
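
   One way to picture how the policy determines which Capture is
   present in each of MCC8 to MCC16 is an ordered history of speakers,
   where the MCC with Policy=SoundLevel:N carries the N-th entry.  The
   Python sketch below assumes this reading of the policy index; the
   function names and the history list are hypothetical and are not
   part of the CLUE protocol.

```python
# Hypothetical sketch: keep the active and previous speakers in order
# and map the N-th entry to the MCC advertised with SoundLevel:N.
def update_speaker_history(history, new_speaker, depth=9):
    """Return a new history with new_speaker first and no duplicates."""
    updated = [new_speaker] + [c for c in history if c != new_speaker]
    return updated[:depth]


def assign_mccs(history, first_mcc=8):
    """Map history position N to MCC(first_mcc + N), as in Table 23."""
    return {"MCC%d" % (first_mcc + n): capture
            for n, capture in enumerate(history)}


# Replay a short sequence of active-speaker changes.
history = []
for capture in ["VC4", "VC13", "VC7", "VC13"]:
    history = update_speaker_history(history, capture)
assignment = assign_mccs(history)
```

   After the sequence shown, MCC8 carries VC13 (the current speaker),
   MCC9 carries VC7, and MCC10 carries VC4.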

   Note: The Provider alternatively could provide each of the MCCs above
   in its own Capture Scene.

   If the MCU wanted to provide a composed Capture Encoding containing
   all of the nine captures, it could advertise in addition:


            | Capture Scene #9       | Description=NineTiles |
            | MCC13(MCC8,MCC9,MCC10, | MaxCaptures=9         |
            |     MCC11,MCC12,MCC13, | EncodingGroup=1       |
            |     MCC14,MCC15,MCC16) |                       |
            | CSV(MCC13)             |                       |

               Table 24: Advertisement Sent to Endpoint A -
                             9 Composed Part

   As MaxCaptures is 9, it indicates that the capture encoding contains
   information from nine sources at a time.

   The Advertisement to Endpoint B is identical to the above, other than
   the fact that captures from Endpoint A would be added and the
   captures from Endpoint B would be removed.  Whether the Captures are
   rendered on a four-screen display or a three-screen display is up to
   the Consumer to determine.  The Consumer wants to place video
   captures from the same original source endpoint together, in the
   correct spatial order, but the MCCs do not have spatial attributes.
   So, the Consumer needs to associate incoming media packets with the
   original individual captures in the advertisement (such as VC4, VC5,
   and VC6) in order to know the spatial information it needs for
   correct placement on the screens.  The Provider can use the RTCP
   CaptureId source description (SDES) item and associated RTP header
   extension, as described in [RFC8849], to convey this information to
   the Consumer.
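
   The association step can be sketched as a simple lookup on the
   Consumer side.  The Python below is illustrative only: it assumes
   the Consumer has already extracted a CaptureId string for each
   received stream, and the table layout, AREA_ORDER, and the function
   name are hypothetical.

```python
# Hypothetical sketch: use the CaptureId received with each switched
# stream to recover the source scene and CaptureArea from the Source
# Part of the Advertisement (Table 21), then order tiles spatially.
ADVERTISED_CAPTURES = {
    "VC4": ("Endpoint B", "Left"),
    "VC5": ("Endpoint B", "Center"),
    "VC6": ("Endpoint B", "Right"),
    "VC13": ("Endpoint E", None),  # single camera, no CaptureArea
}
AREA_ORDER = {"Left": 0, "Center": 1, "Right": 2}


def group_for_placement(received):
    """received maps an MCC id to the CaptureId it currently carries.

    Returns, per source scene, the MCC ids in left-to-right order.
    """
    groups = {}
    for mcc_id, capture_id in received.items():
        scene, area = ADVERTISED_CAPTURES[capture_id]
        groups.setdefault(scene, []).append(
            (AREA_ORDER.get(area, 0), mcc_id))
    return {scene: [mcc_id for _, mcc_id in sorted(entries)]
            for scene, entries in groups.items()}


received = {"MCC8": "VC6", "MCC9": "VC13", "MCC10": "VC4", "MCC11": "VC5"}
placement = group_for_placement(received)
```

   Here the three streams carrying Endpoint B's captures are grouped
   and ordered as MCC10, MCC11, MCC8, so the Consumer can render them
   side by side regardless of which MCC delivered which capture.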

12.3.4.  Heterogeneous Conference with Voice-Activated Switching

   This example illustrates how multipoint "voice-activated switching"
   behavior can be realized, with an endpoint making its own decision
   about which of its outgoing video streams is considered the "active
   talker" from that endpoint.  Then, an MCU can decide which is the
   active talker among the whole conference.

   Consider a conference between endpoints with the following
   characteristics:

   Endpoint A -  3 screens, 3 cameras

   Endpoint B -  3 screens, 3 cameras

   Endpoint C -  1 screen, 1 camera

   This example focuses on what the user at Endpoint C sees.  The user
   would like to see the video capture of the current talker, without
   composing it with any other video capture.  In this example, Endpoint
   C is capable of receiving only a single video stream.  The following
   tables describe advertisements from Endpoints A and B to the MCU, and
   from the MCU to Endpoint C, that can be used to accomplish this.


         | Capture Scene #1  | Description=Endpoint x            |
         | VC1               | CaptureArea=Left                  |
         |                   | EncodingGroup=1                   |
         | VC2               | CaptureArea=Center                |
         |                   | EncodingGroup=1                   |
         | VC3               | CaptureArea=Right                 |
         |                   | EncodingGroup=1                   |
         | MCC1(VC1,VC2,VC3) | MaxCaptures=1                     |
         |                   | CaptureArea=whole scene           |
         |                   | Policy=SoundLevel:0               |
         |                   | EncodingGroup=1                   |
         | AC1               | CaptureArea=whole scene           |
         |                   | EncodingGroup=2                   |
         | CSV1(VC1, VC2,    |                                   |
         | VC3)              |                                   |
         | CSV2(MCC1)        |                                   |
         | CSV3(AC1)         |                                   |

              Table 25: Advertisement Received at the MCU from
                             Endpoints A and B

   Endpoints A and B are advertising each individual video capture, and
   also a switched capture MCC1 that switches between the other three
   based on who is the active talker.  These endpoints do not advertise
   distinct audio captures associated with each individual video
   capture, so it would be impossible for the MCU (as a media consumer)
   to make its own determination of which video capture is the active
   talker based just on information in the audio streams.


    | Capture Scene #1     | Description=conference                   |
    | MCC1()               | CaptureArea=Left                         |
    |                      | MaxCaptures=1                            |
    |                      | SynchronizationID=1                      |
    |                      | Policy=SoundLevel:0                      |
    |                      | EncodingGroup=1                          |
    | MCC2()               | CaptureArea=Center                       |
    |                      | MaxCaptures=1                            |
    |                      | SynchronizationID=1                      |
    |                      | Policy=SoundLevel:0                      |
    |                      | EncodingGroup=1                          |
    | MCC3()               | CaptureArea=Right                        |
    |                      | MaxCaptures=1                            |
    |                      | SynchronizationID=1                      |
    |                      | Policy=SoundLevel:0                      |
    |                      | EncodingGroup=1                          |
    | MCC4()               | CaptureArea=whole scene                  |
    |                      | MaxCaptures=1                            |
    |                      | Policy=SoundLevel:0                      |
    |                      | EncodingGroup=1                          |
    | MCC5() (for audio)   | CaptureArea=whole scene                  |
    |                      | MaxCaptures=1                            |
    |                      | Policy=SoundLevel:0                      |
    |                      | EncodingGroup=2                          |
    | MCC6() (for audio)   | CaptureArea=whole scene                  |
    |                      | MaxCaptures=1                            |
    |                      | Policy=SoundLevel:1                      |
    |                      | EncodingGroup=2                          |
    | CSV1(MCC1,MCC2,MCC3) |                                          |
    | CSV2(MCC4)           |                                          |
    | CSV3(MCC5,MCC6)      |                                          |

          Table 26: Advertisement Sent from the MCU to Endpoint C

   The MCU advertises one scene, with four video MCCs.  Three of them in
   CSV1 give a left, center, and right view of the conference, with site
   switching.  MCC4 provides a single video capture representing a view
   of the whole conference.  The MCU intends for MCC4 to be switched
   between all the other original source captures.  In this example, the
   MCU's advertisement does not give all the information about the other
   endpoints' scenes or indicate which of those captures are included in
   the MCCs.  The MCU could include all that information if it wants to
   give the consumers more information, but it is not necessary for this
   example.

   The Provider advertises MCC5 and MCC6 for audio.  Both are switched
   captures, with different SoundLevel policies indicating they are the
   top two dominant talkers.  The Provider advertises CSV3 with both
   MCCs, suggesting the Consumer should use both if it can.

   Endpoint C, in its configure message to the MCU, requests to receive
   MCC4 for video and MCC5 and MCC6 for audio.  In order for the MCU to
   get the information it needs to construct MCC4, it has to send
   configure messages to Endpoints A and B asking to receive MCC1 from
   each of them, along with their AC1 audio.  Now the MCU can use audio
   energy information from the two incoming audio streams from Endpoints
   A and B to determine which of those alternatives is the current
   talker.  Based on that, the MCU uses either MCC1 from A or MCC1 from
   B as the source of MCC4 to send to Endpoint C.
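
   The MCU's final selection step can be sketched as follows.  This is
   a minimal Python illustration, assuming the MCU already has a
   recent audio energy measurement for the AC1 stream from each
   endpoint; the function name, the input dictionary, and the use of
   dBov-style negative values are assumptions for this example.

```python
# Hypothetical sketch: choose which endpoint's MCC1 feeds MCC4 by
# comparing the audio energy measured on each endpoint's AC1 stream.
def pick_mcc4_source(audio_levels):
    """audio_levels maps an endpoint name to its recent AC1 energy.

    Higher values mean louder audio.  Returns the endpoint whose MCC1
    should be forwarded as MCC4, together with that capture's id.
    """
    if not audio_levels:
        raise ValueError("no audio measurements available")
    loudest = max(audio_levels, key=audio_levels.get)
    return loudest, "MCC1"


# Endpoint B is currently louder, so its MCC1 is forwarded as MCC4.
source = pick_mcc4_source({"Endpoint A": -30.0, "Endpoint B": -12.5})
```

   Each endpoint has already chosen its own active talker when
   producing MCC1, so the MCU only needs to compare endpoints, not
   individual cameras.  A practical implementation would likely add
   smoothing or a hold time to avoid rapid switching.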

Acknowledgements

   Allyn Romanow and Brian Baldino were authors of early versions.  Mark
   Gorzynski also contributed much to the initial approach.  Many others
   also contributed, including Christian Groves, Jonathan Lennox, Paul
   Kyzivat, Rob Hansen, Roni Even, Christer Holmberg, Stephen Botzko,
   Mary Barnes, John Leslie, and Paul Coverdale.

13.  IANA Considerations

   This document does not require any IANA actions.

14.  Security Considerations

   There are several potential attacks related to telepresence,
   specifically the protocols used by CLUE.  This is the case due to
   conferencing sessions, the natural involvement of multiple endpoints,
   and the many, often user-invoked, capabilities provided by the
   systems.
   An MCU involved in a CLUE session can experience many of the same
   attacks as a conferencing system such as the one enabled by the
   Centralized Conferencing (XCON) framework [RFC5239].  Examples of
   attacks include the following: an endpoint attempting to listen to
   sessions in which it is not authorized to participate, an endpoint
   attempting to disconnect or mute other users, and theft of service by
   an endpoint attempting to create telepresence sessions it is not
   allowed to create.  Thus, it is RECOMMENDED that an MCU implementing
   the protocols necessary to support CLUE follow the security
   recommendations specified in the conference control protocol
   documents.  In the case of CLUE, SIP is the conferencing protocol;
   thus, the security considerations in [RFC4579] MUST be followed.
   Other security issues related to MCUs are discussed in the XCON
   framework [RFC5239].  The use of xCard with potentially sensitive
   information provides another reason to implement recommendations in
   Section 11 of [RFC5239].

   One primary security concern, surrounding the CLUE framework
   introduced in this document, involves securing the actual protocols
   and the associated authorization mechanisms.  These concerns apply to
   endpoint-to-endpoint sessions as well as sessions involving multiple
   endpoints and MCUs.  Figure 2 in Section 5 provides a basic flow of
   information exchange for CLUE and the protocols involved.

   As described in Section 5, CLUE uses SIP/SDP to establish the session
   prior to exchanging any CLUE-specific information.  Thus, the
   security mechanisms recommended for SIP [RFC3261], including user
   authentication and authorization, MUST be supported.  In addition,
   the media MUST be secured.  Datagram Transport Layer Security (DTLS)
   / Secure Real-time Transport Protocol (SRTP) MUST be supported and
   SHOULD be used unless the media, which is based on RTP, is secured by
   other means (see [RFC7201] and [RFC7202]).  Media security is also
   discussed in [RFC8848] and [RFC8849].  Note that SIP call setup is
   done before any CLUE-specific information is available, so the
   authentication and authorization are based on the SIP mechanisms.
   The entity that will be authenticated may use the Endpoint identity
   or the endpoint user identity; this is an application issue and not a
   CLUE-specific issue.

   A separate data channel is established to transport the CLUE protocol
   messages.  The contents of the CLUE protocol messages are based on
   information introduced in this document.  The CLUE data model
   [RFC8846] defines, through an XML schema, the syntax to be used.  One
   type of information that could possibly introduce privacy concerns is
   the xCard information, as described in Section .  The decision about
   which xCard information to send in the CLUE channel is an application
   policy for point-to-point and multipoint calls based on the
   authenticated identity, which can be the endpoint identity or the
   user of the endpoint.  For example, the telepresence multipoint
   application can authenticate a user before starting a CLUE exchange
   with the telepresence system and have a policy per user.

   In addition, the (text) description field in the Media Capture
   attribute (Section ) could possibly reveal sensitive information or
   specific identities.  The same would be true for the descriptions in
   the Capture Scene (Section 7.3.1) and Capture Scene View
   (Section 7.3.2) attributes.  An implementation SHOULD give users
   control over what sensitive information is sent in an Advertisement.
   One other important consideration for the information in the xCard as
   well as the description field in the Media Capture and Capture Scene
   View attributes is that while the endpoints involved in the session
   have been authenticated, there is no assurance that the information
   in the xCard or description fields is authentic.  Thus, this
   information MUST NOT be used to make any authorization decisions.

   While other information in the CLUE protocol messages does not reveal
   specific identities, it can reveal characteristics and capabilities
   of the endpoints.  That information could possibly uniquely identify
   specific endpoints.  It might also be possible for an attacker to
   manipulate the information and disrupt the CLUE sessions.  It would
   also be possible to mount a DoS attack on the CLUE endpoints if a
   malicious agent has access to the data channel.  Thus, it MUST be
   possible for the endpoints to establish a channel that is secure
   against both message recovery and message modification.  Further
   details on this are provided in the CLUE data channel solution
   document [RFC8850].

   There are also security issues associated with the authorization to
   perform actions at the CLUE endpoints to invoke specific capabilities
   (e.g., rearranging screens, sharing content, etc.).  However, the
   policies and security associated with these actions are outside the
   scope of this document and the overall CLUE solution.

15.  References

15.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              DOI 10.17487/RFC3261, June 2002,

   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
        potentially saving consumer resources.
     4. Change example of a CLUE exchange that does not require SDP
     5. Language attribute uses RFC5646.
     6. Change Member person type to Attendee.  Add Observer type.
     7. Clarify DTLS/SRTP MUST be supported.
     8. Change SHOULD NOT to MUST NOT regarding using xCard or
        description information for authorization decisions.
     9. Clarify definition of Global View.
     10. Refer to signaling doc regarding interoperating with a
        device that does not support CLUE.
     11. Various minor editorial changes from working group last call
     12. Capitalize defined terms.

   Changes from 18 to 19:

     1. Remove the Max Capture Encodings media capture attribute.
     2. Refer to RTP mapping document in the MCC example section.
     3. Update references to current versions of drafts in progress.

   Changes from 17 to 18:

     1. Add separate definition of Global View List.
     2. Add diagram for Global View List structure.
     3. Tweak definitions of Media Consumer and Provider.

   Changes from 16 to 17:

     1. Ticket #59 - rename Capture Scene Entry (CSE) to Capture
        Scene View (CSV)

     2. Ticket #60 - rename Global CSE List to Global View List

     3. Ticket #61 - Proposal for describing the coordinate system.
        Describe it better, without conflicts if cameras point in
        different directions.

     4. Minor clarifications and improved wording for Synchronisation
        Identity, MCC, Simultaneous Transmission Set.

     5. Add definitions for CLUE-capable device and CLUE-enabled
        call, taken from the signaling draft.

     6. Update definitions of Capture Device, Media Consumer, Media
        Provider, Endpoint, MCU, MCC.

     7. Replace "middle box" with "MCU".

     8. Explicitly state there can also be Media Captures that are
        not included in a Capture Scene View.

     9. Explicitly state "A single Encoding Group MAY refer to
        encodings for different media types."

     10. In example 12.1.1 add axes and audio captures to the
        diagram, and describe placement of microphones.

     11. Add references to data model and signaling drafts.

     12. Split references into Normative and Informative sections.
        Add heading number for references section.

   Changes from 15 to 16:

     1. Remove Audio Channel Format attribute
     2. Add Audio Capture Sensitivity Pattern attribute

     3. Clarify audio spatial information regarding point of capture
        and point on line of capture.  Area of capture does not apply
        to audio.

     4. Update section 12 example for new treatment of audio spatial

     5. Clean up wording of some definitions, and various places in
        sections 5 and 10.

     6. Remove individual encoding parameter paragraph from section

     7. Update Advertisement diagram.

     8. Update Acknowledgements.

     9. References to use cases and requirements now refer to RFCs.

     10. Minor editorial changes.

   Changes from 14 to 15:

     1. Add "=" and "<=" qualifiers to MaxCaptures attribute, and
        clarify the meaning regarding switched and composed MCC.

     2. Add section 7.3.3 Global Capture Scene Entry List, and a few
        other sentences elsewhere that refer to global CSE sets.

     3. Clarify: The Provider MUST be capable of encoding and sending
        all Captures (*that have an encoding group*) in a single
        Capture Scene Entry simultaneously.

     4. Add voice activated switching example in section 12.

     5. Change name of attributes Participant Info/Type to Person

     6. Clarify the Person Info/Type attributes have the same meaning
        regardless of whether or not the capture has a Presentation

     7. Update example section 12.1 to be consistent with the rest of
        the document, regarding MCC and capture attributes.

     8. State explicitly each CSE has a unique ID.

   Changes from 13 to 14:

     1. Fill in section for Security Considerations.

     2. Replace Role placeholder with Participant Information,
        Participant Type, and Scene Information attributes.

     3. Spatial information implies nothing about how constituent
        media captures are combined into a composed MCC.

     4. Clean up MCC example in Section 12.3.3.  Clarify behavior of
        tiled and PIP display windows.  Add audio.  Add new open
        issue about associating incoming packets to original source

     5. Remove editor's note and associated statement about RTP
        multiplexing at end of section 5.

     6. Remove editor's note and associated paragraph about
        overloading media channel with both CLUE and non-CLUE usage,
        in section 5.

     7. In section 10, clarify intent of media encodings conforming
        to SDP, even with multiple CLUE message exchanges.  Remove
        associated editor's note.

   Changes from 12 to 13:

     1. Added the MCC concept including updates to existing sections
        to incorporate the MCC concept. New MCC attributes:
        MaxCaptures, SynchronisationID and Policy.

     2. Removed the "composed" and "switched" Capture attributes due
        to overlap with the MCC concept.

     3. Removed the "Scene-switch-policy" CSE attribute, replaced by
        MCC and SynchronisationID.

     4. Editorial enhancements including numbering of the Capture
        attribute sections, tables, figures etc.

   Changes from 11 to 12:

     1. Ticket #44. Remove note questioning about requiring a
        Consumer to send a Configure after receiving Advertisement.

     2. Ticket #43. Remove ability for consumer to choose value of
        attribute for scene-switch-policy.

     3. Ticket #36. Remove computational complexity parameter,
        MaxGroupPps, from Encoding Groups.

     4. Reword the Abstract and parts of sections 1 and 4 (now 5)
        based on Mary's suggestions as discussed on the list.  Move
        part of the Introduction into a new section Overview &

     5. Add diagram of an Advertisement, in the Overview of the
        Framework/Model section.

     6. Change Intended Status to Standards Track.

     7. Clean up RFC2119 keyword language.

   Changes from 10 to 11:

     1. Add description attribute to Media Capture and Capture Scene

     2. Remove contradiction and change the note about open issue
        regarding always responding to Advertisement with a Configure

     3. Update example section, to cleanup formatting and make the
        media capture attributes and encoding parameters consistent
        with the rest of the document.

   Changes from 09 to 10:

     1. Several minor clarifications such as about SDP usage, Media
        Captures, Configure message.

     2. Simultaneous Set can be expressed in terms of Capture Scene
        and Capture Scene Entry.

     3. Removed Area of Scene attribute.

     4. Add attributes from draft-groves-clue-capture-attr-01.

     5. Move some of the Media Capture attribute descriptions back
        into this document, but try to leave detailed syntax to the
        data model.  Remove the OUTSOURCE sections, which are already
        incorporated into the data model document.

   Changes from 08 to 09:

     1. Use "document" instead of "memo".

     2. Add basic call flow sequence diagram to introduction.

     3. Add definitions for Advertisement and Configure messages.

     4. Add definitions for Capture and Provider.

     5. Update definition of Capture Scene.

     6. Update definition of Individual Encoding.

     7. Shorten definition of Media Capture and add key points in the
        Media Captures section.

     8. Reword a bit about capture scenes in overview.

     9. Reword about labeling Media Captures.

     10. Remove the Consumer Capability message.

     11. New example section heading for media provider behavior

     12. Clarifications in the Capture Scene section.

     13. Clarifications in the Simultaneous Transmission Set section.

     14. Capitalize defined terms.

     15. Move call flow example from introduction to overview section

     16. General editorial cleanup

     17. Add some editors' notes requesting input on issues
     18. Summarize some sections, and propose details be outsourced
        to other documents.

   Changes from 06 to 07:

     1. Ticket #9.  Rename Axis of Capture Point attribute to Point
        on Line of Capture.  Clarify the description of this

     2. Ticket #17.  Add "capture encoding" definition.  Use this new
        term throughout document as appropriate, replacing some usage
        of the terms "stream" and "encoding".

     3. Ticket #18.  Add Max Capture Encodings media capture

     4. Add clarification that different capture scene entries are
        not necessarily mutually exclusive.

   Changes from 05 to 06:

   1. Capture scene description attribute is a list of text strings,
      each in a different language, rather than just a single string.

   2. Add new Axis of Capture Point attribute.

   3. Remove appendices A.1 through A.6.

   4. Clarify that the provider must use the same coordinate system
      with same scale and origin for all coordinates within the same
      capture scene.

   Changes from 04 to 05:

   1. Clarify limitations of "composed" attribute.

   2. Add new section "capture scene entry attributes" and add the
      attribute "scene-switch-policy".

   3. Add capture scene description attribute and description
      language attribute.

   4. Editorial changes to examples section for consistency with the
      rest of the document.

   Changes from 03 to 04:

   1. Remove sentence from overview - "This constitutes a significant
      change ..."

   2. Clarify a consumer can choose a subset of captures from a
      capture scene entry or a simultaneous set (in section "capture
      scene" and "consumer's choice...").

   3. Reword first paragraph of Media Capture Attributes section.

   4. Clarify a stereo audio capture is different from two mono audio
      captures (description of audio channel format attribute).

   5. Clarify what it means when coordinate information is not
      specified for area of capture, point of capture, area of scene.

   6. Change the term "producer" to "provider" to be consistent (it
      was just in two places).

   7. Change name of "purpose" attribute to "content" and refer to
      RFC4796 for values.

   8. Clarify simultaneous sets are part of a provider advertisement,
      and apply across all capture scenes in the advertisement.

   9. Remove sentence about lip-sync between all media captures in a
      capture scene.

   10.   Combine the concepts of "capture scene" and "capture set"
      into a single concept, using the term "capture scene" to
      replace the previous term "capture set", and eliminating the
      original separate capture scene concept.

17. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              DOI 10.17487/RFC3261, June 2002.

   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
              with Session Description Protocol (SDP)", RFC 3264,
              DOI 10.17487/RFC3264, June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
              July 2003.

   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
              July 2006.

   [RFC4579]  Johnston, A. and O. Levin, "Session Initiation Protocol
              (SIP) Call Control - Conferencing for User Agents",
              BCP 119, RFC 4579, DOI 10.17487/RFC4579, August 2006.

   [RFC5239]  Barnes, M., Boulton, C., and O. Levin, "A Framework for
              Centralized Conferencing", RFC 5239, DOI 10.17487/RFC5239,
              June 2008.

   [RFC5646]  Phillips, A., Ed. and M. Davis, Ed., "Tags for Identifying
              Languages", BCP 47, RFC 5646, DOI 10.17487/RFC5646,
              September 2009.

   [RFC6350]  Perreault, S., "vCard Format Specification", RFC 6350,
              DOI 10.17487/RFC6350, August 2011.

   [RFC6351]  Perreault, S., "xCard: vCard XML Representation",
              RFC 6351, DOI 10.17487/RFC6351, August 2011.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8846]  Presta, R. and S P. Romano, "An XML Schema for the
              Controlling Multiple Streams for Telepresence (CLUE) Data
              Model", RFC 8846, DOI 10.17487/RFC8846, June 2020.

   [RFC8847]  Presta, R. and S P. Romano, "Controlling Multiple Streams
              for Telepresence (CLUE) Protocol", RFC 8847,
              DOI 10.17487/RFC8847, June 2020.

   [RFC8848]  Hanton, R., Kyzivat, P., Xiao, L., and C. Groves, "Session
              Signaling for Controlling Multiple Streams for
              Telepresence (CLUE)", RFC 8848, DOI 10.17487/RFC8848, June
              2020.

   [RFC8850]  Holmberg, C., "Controlling Multiple Streams for
              Telepresence (CLUE) Protocol Data Channel", RFC 8850,
              DOI 10.17487/RFC8850, June 2020.

18. Informative References

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              DOI 10.17487/RFC4353, February 2006.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              DOI 10.17487/RFC5117, January 2008.

   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014.

   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
              Framework: Why RTP Does Not Mandate a Single Media
              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
              2014.

   [RFC7205]  Romanow, A., Botzko, S., Duckworth, M., and R. Even, Ed.,
              "Use Cases for Telepresence Multistreams", RFC 7205,
              DOI 10.17487/RFC7205, April 2014.

   [RFC7262]  Romanow, A., Botzko, S., and M. Barnes, "Requirements for
              Telepresence Multistreams", RFC 7262,
              DOI 10.17487/RFC7262, June 2014.

   [RFC7667]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 7667,
              DOI 10.17487/RFC7667, November 2015.

   [RFC8849]  Even, R. and J. Lennox, "Mapping RTP Streams to
              Controlling Multiple Streams for Telepresence (CLUE) Media
              Captures", RFC 8849, DOI 10.17487/RFC8849, June 2020.

Acknowledgements

   Allyn Romanow and Brian Baldino were authors of early draft versions.
   Mark Gorzynski also contributed much to the initial approach.  Many
   others also contributed, including Christian Groves, Jonathan Lennox,
   Paul Kyzivat, Rob Hanton, Roni Even, Christer Holmberg, Stephen
   Botzko, Mary Barnes, John Leslie, and Paul Coverdale.

Authors' Addresses

   Mark Duckworth (editor)
   Andover, MA 01810
   United States of America


   Andrew Pepperell
   Uxbridge, England
   United Kingdom


   Stephan Wenger
   Vidyo, Inc.
   433 Hackensack Ave.
   Hackensack, NJ 07601
   United States of America