In general, the best compression is accomplished using RTP header compression, as it can compress the IP/UDP/RTP headers from 40 to one or two bytes. However, it only works for short-delay unicast connections on a single link.
For wide-area links that see a lot of voice traffic, e.g., for PBX interconnect, RTP muxing is far more efficient, since it avoids the overhead of IP and UDP packet headers, as well as featuring shorter RTP headers. Using RTP muxing, the overhead can be reduced to about two bytes per "channel", with one UDP/IP header for up to several dozen channels.
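To make the comparison concrete, the following sketch adds up header bytes per packetization interval for the three approaches. It assumes IPv4 without options (20 bytes), an 8-byte UDP header, and a 12-byte RTP header without CSRC entries; the function names and the figure of 24 channels are purely illustrative.

```python
IP_UDP_HEADER = 20 + 8   # IPv4 header (no options) plus UDP header, in bytes
RTP_HEADER = 12          # fixed RTP header, no CSRC list


def plain_overhead(channels: int) -> int:
    """One full IP/UDP/RTP packet per channel."""
    return channels * (IP_UDP_HEADER + RTP_HEADER)


def compressed_overhead(channels: int, compressed_header: int = 2) -> int:
    """Header compression: each packet's 40 bytes shrink to about 2 bytes."""
    return channels * compressed_header


def muxed_overhead(channels: int, per_channel: int = 2) -> int:
    """RTP muxing: one shared IP/UDP header plus about 2 bytes per channel."""
    return IP_UDP_HEADER + channels * per_channel


for n in (1, 24):
    print(n, plain_overhead(n), compressed_overhead(n), muxed_overhead(n))
```

Header compression still comes out ahead per packet, but it applies only on a single link, whereas the muxed packet keeps a regular IP/UDP header and can cross a routed wide-area path.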
A minimal version of RTP would likely contain a sequence number (SN) and a payload type (PT), with a minimum combined size of two bytes. Unfortunately, such a choice would have a number of disadvantages:
The NTP timestamps in the SR are assumed to be synchronized between all media senders within a single session. If the media sources come from the same network source, this is obviously not an issue. Receiver(s) synchronize to the sender, the only solution feasible for multicast.
Experience has shown that all other cross-media, cross-host schemes end up doing clock synchronization, usually inferior to NTP and application-specific.
The marker bit is a hint; the beginning of a talkspurt can also be computed by comparing the difference in timestamps and sequence numbers between two packets, assuming the timestamp clock rate is known.
Packets may arrive out of order, so that the packet with the marker bit is received after the second packet in the talkspurt. As long as the playout delay is longer than this reordering, the receiver can still perform delay adaptation. If not, it simply has to wait for the next talkspurt.
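A receiver that does not want to rely on the marker bit could detect the start of a talkspurt roughly as sketched below. The sketch assumes a fixed packetization interval, expressed as `samples_per_packet` in timestamp units (for example 160 for 20 ms of 8,000 Hz audio); the function name and parameters are hypothetical.

```python
SEQ_MOD = 1 << 16   # RTP sequence numbers are 16 bits wide
TS_MOD = 1 << 32    # RTP timestamps are 32 bits wide


def starts_talkspurt(prev_seq, prev_ts, seq, ts, samples_per_packet):
    """Return True if the timestamp advanced further than the sequence
    numbers alone can explain, i.e. the sender paused between packets."""
    seq_gap = (seq - prev_seq) % SEQ_MOD
    if seq_gap == 0 or seq_gap >= SEQ_MOD // 2:
        # Duplicate or reordered (older) packet: no decision possible here.
        return False
    ts_gap = (ts - prev_ts) % TS_MOD
    return ts_gap > seq_gap * samples_per_packet


# Consecutive packets inside a talkspurt: timestamp advances by one packet.
print(starts_talkspurt(10, 1600, 11, 1760, 160))
# Sequence number advances by one, timestamp jumps by one second of silence.
print(starts_talkspurt(11, 1760, 12, 9760, 160))
```

A lost packet inside a talkspurt advances both counters proportionally and is therefore not mistaken for silence, which is the reason for comparing the two differences rather than looking at the timestamp alone.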
Jitter is computed in timestamp units. For example, for an audio stream sampled at 8,000 Hz, the arrival time measured with the local clock is converted by multiplying the seconds by 8,000.
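As a rough sketch of that conversion, the class below turns the local arrival time into timestamp units and feeds the change in transit time into the running estimate used by the RTP specification (gain 1/16). The class and method names are hypothetical, and 32-bit timestamp wraparound is ignored to keep the example short.

```python
class JitterEstimator:
    """Interarrival jitter in RTP timestamp units (simplified sketch)."""

    def __init__(self, clock_rate):
        self.clock_rate = clock_rate   # e.g. 8000 for 8,000 Hz audio
        self.prev_transit = None
        self.jitter = 0.0

    def update(self, arrival_seconds, rtp_timestamp):
        # Convert the local clock reading to timestamp units.
        arrival = arrival_seconds * self.clock_rate
        # Relative transit time; the unknown clock offset cancels out below.
        transit = arrival - rtp_timestamp
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter


est = JitterEstimator(8000)
est.update(0.000, 0)
est.update(0.020, 160)         # arrives exactly on schedule
print(est.update(0.045, 320))  # arrives 5 ms (40 timestamp units) late
```

Only differences between transit times enter the estimate, so the sender and receiver clocks do not need to be synchronized.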
Steve Casner wrote:
For encodings such as MPEG that transmit data in a different order than it was sampled, this adds noise into the jitter calculation. I have heard handwavy arguments that this factor can be calculated out given that you know the shape of the noise, but my math isn't strong enough for that.
In many of the cases that we care about, the jitter introduced by MPEG will be small enough that when the network jitter is of the same order we don't have a problem anyway.
There is another problem for video in that all of the packets of a frame have the same timestamp because the whole frame is sampled at once. However, the dispersion in time of those packets really is all part of the network transfer process that the receiver must accommodate with its buffer.
It has been suggested that jitter be calculated only on the first packet of a video frame, or only on "I" frames for MPEG. However, that may color the results also because those packets may see transit delays different than the following packets see.
The main point to remember is that the primary function of the RTP timestamp is to represent the inherent notion of real time associated with the media. It also turns out to be useful for the jitter measure, but that is a secondary function.
The jitter value is not expected to be useful as an absolute value. It is more useful as a means of comparing the reception quality at two receivers or comparing the reception quality 5 minutes ago to now.
The session bandwidth is the nominal data bandwidth plus the IP, UDP and RTP headers (40 bytes). For example, for 64 kb/s PCM audio packetized in 20 ms increments, the session bandwidth would be (160 + 40) / 0.02 bytes/second or 80 kb/s. If there are multiple senders, the sum of their individual bandwidths is used.
The session bandwidth is typically defined out-of-band, e.g., in a session announcement protocol, based on reasonable estimates of the number of concurrent senders and their average bandwidth. Distributed and consistent on-line estimation of the session bandwidth may be hard as the number of senders and their bandwidth change. The absolute value is less important than that all participants agree on a common value. (After all, there is nothing special about choosing the RTCP bandwidth to be 5% of the session bandwidth; it just has to be agreed upon by all participants to avoid timing out members prematurely.)
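The arithmetic above can be reproduced with a few lines. This is only a sketch: the function names are hypothetical, and multiplying by the number of senders assumes that all senders use the same encoding and packetization, whereas in general the individual bandwidths are summed.

```python
HEADER_BYTES = 40  # IP (20) + UDP (8) + RTP (12)


def session_bandwidth_bps(payload_bytes, packet_interval_s, senders=1):
    """Session bandwidth in bits per second, headers included."""
    per_sender = (payload_bytes + HEADER_BYTES) * 8 / packet_interval_s
    return senders * per_sender


def rtcp_bandwidth_bps(session_bps, fraction=0.05):
    """Share of the session bandwidth set aside for RTCP."""
    return session_bps * fraction


# 64 kb/s PCM in 20 ms packets: 160 payload bytes per packet.
bw = session_bandwidth_bps(160, 0.02)
print(bw, rtcp_bandwidth_bps(bw))
```

For the PCM example this yields 80 kb/s for the session and 4 kb/s for RTCP at the conventional 5%.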
m=audio 12345 RTP/AVP 121
a=rtpmap:121 RT24
Note that a number of encodings are described in the RTP A/V profile which do not have a static (permanent) payload type. The RTP A/V Profile defines names for encodings which may be used by SDP or other mechanisms to specify the mapping. Encodings may also be identified by object identifiers or other names.
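A receiver has to turn such a description into a table from payload type number to encoding name before it can interpret incoming packets. The following is a minimal sketch of that step, not a complete SDP parser; the function name is hypothetical and only `a=rtpmap:` lines are examined.

```python
def dynamic_payload_map(sdp_text):
    """Collect payload type -> encoding name from a=rtpmap: lines."""
    mapping = {}
    prefix = "a=rtpmap:"
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            payload_type, encoding = line[len(prefix):].split(None, 1)
            mapping[int(payload_type)] = encoding
    return mapping


sdp = "m=audio 12345 RTP/AVP 121\na=rtpmap:121 RT24\n"
print(dynamic_payload_map(sdp))
```

Because the mapping is carried per session, payload type 121 may well name a different encoding in another session.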
Since the space for payload types is limited, only very common encodings should be assigned static types. These are typically audio and video encodings "blessed" by international standardization bodies, such as the G. series of ITU-T audio encodings. The RTP A/V Profile defines a set of criteria for making static assignments.
Also, in multicast environments, it is unlikely that every sender will use the same payload type.
However, for real-time delivery of audio and video, TCP and other reliable transport protocols such as XTP are inappropriate. The three main reasons are:
An additional small disadvantage is that the TCP and XTP headers are larger than a UDP header (at least 20 bytes for TCP, 40 bytes for XTP 3.6 and 32 bytes for XTP 4.0, compared to 8 bytes for UDP). Also, these reliable transport protocols do not contain the timestamp and encoding information needed by the receiving application, so they cannot replace RTP. (They would not need the sequence number, as these protocols ensure that no loss or reordering takes place.)
While LANs often have sufficient bandwidth and low enough losses not to trigger these problems, TCP does not offer any advantages in that scenario either, except for the recovery from rare packet losses. Even in a LAN with no losses, the TCP slow start mechanism would limit the initial rate of the source for the first few round-trip times.
RTP has no protocol state by itself and can thus be used over either connection-less networks, such as IP/UDP, or connection-oriented networks, such as XTP, ST-II or ATM (AAL3/4 or AAL5). Many real-time multimedia applications use multicast with a large fan-out, e.g., several hundred to thousands for a lecture or concert. Connection-oriented protocols like XTP have difficulty scaling to such a large number of receivers.
XTP does not offer timing or content type (media) information, and thus would need these services, as offered by RTP. XTP provides no RTP-like direct feedback of the received quality-of-service, and thus, again, would have to "import" these from another protocol. Looking at existing applications using XTP for real-time services confirms that they need to add a layer similar in content to the RTP data part "between" XTP and the actual media.
The Java Media Framework (JMF), a Java API, also supports RTP and RTCP.
There is no standard API for RTP.
The VAT header format is only described in header files. (See the VAT and NeVoT sources for details.) Many aspects of RTP and the VAT protocol are similar, but RTP improves upon the VAT protocol in a number of ways:
As specified in the RTP protocol definition, RTP data is to be carried on an even UDP port number and the corresponding RTCP packets are to be carried on the next higher (odd) port number.
Applications operating under this profile may use any such UDP port pair. For example, the port pair may be allocated randomly by a session management program. A single fixed port number pair cannot be required because multiple applications using this profile are likely to run on the same host, and there are some operating systems that do not allow multiple processes to use the same UDP port with different multicast addresses.
However, port numbers 5004 and 5005 have been registered for use with this profile for those applications that choose to use them as the default pair. Applications that operate under multiple profiles may use this port pair as an indication to select this profile if they are not subject to the constraint of the previous paragraph. Applications need not have a default and may require that the port pair be explicitly specified. The particular port numbers were chosen to lie in the range above 5000 to accommodate port number allocation practice within the Unix operating system, where port numbers below 1024 can only be used by privileged processes and port numbers between 1024 and 5000 are automatically assigned by the operating system.
Also, the multicast (version 3.5 and later) kernel sources use the following port ranges:
from | to | application | priority |
---|---|---|---|
0 | 16383 | unclassified | lowest |
16384 | 32767 | audio | highest |
32768 | 49151 | whiteboard | medium |
49152 | 65535 | video | low |
Note: The port ranges in question do not make any difference unless the traffic traverses an interface or tunnel where the multicast traffic rate exceeds the configured mrouted rate-limiter.
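The even/odd port pairing and the port ranges in the table above can be combined in a small helper. This is a sketch under assumptions: the function names are hypothetical, and the lower bound of 5002 for random allocation simply reflects the "above 5000" practice described earlier.

```python
import random

# (from, to, application, priority), copied from the table above
PORT_CLASSES = [
    (0, 16383, "unclassified", "lowest"),
    (16384, 32767, "audio", "highest"),
    (32768, 49151, "whiteboard", "medium"),
    (49152, 65535, "video", "low"),
]


def rtcp_port(rtp_port):
    """RTCP uses the next higher (odd) port after an even RTP port."""
    if rtp_port % 2 != 0:
        raise ValueError("RTP data is carried on an even port number")
    return rtp_port + 1


def random_port_pair(low=5002, high=65534):
    """Pick a random even RTP port in [low, high]; low must be even."""
    rtp = random.randrange(low, high + 1, 2)
    return rtp, rtcp_port(rtp)


def classify(port):
    """Look up how the rate limiter would treat traffic on this port."""
    for first, last, application, priority in PORT_CLASSES:
        if first <= port <= last:
            return application, priority
    raise ValueError("not a valid UDP port number")


print(rtcp_port(5004), classify(5004), classify(16384))
```

With this table, the registered default pair 5004/5005 falls into the "unclassified" range, so a session manager that cares about rate-limiter priority would pick its audio ports from 16384 upwards instead.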
If RTP is used within the H.323 framework, port assignment is done by the H.225.0 signaling messages. In SDP and SIP, the conference controller or inviting party picks the port numbers.
Section 10 of RFC 1889 says:
In a unicast session, applications SHOULD be prepared to receive RTP data and control on one port pair and send to another.
Note that the SSRC values used for each source are always different.
Ports used:
Protocol | Transport | Port |
---|---|---|
H.323 | TCP | 1720 |
H.235 | TCP | ephemeral, > 1024 |
Name | Type | Algorithm | Sampling frequency (kHz) | Bit rate |
---|---|---|---|---|
MPEG L3 | audio | | 22.05, 44.1 | 48..128 kb/s |
G.711 | audio | mu-law, A-law | 8.0 | 64 kb/s |
G.721 (subsumed by G.726) | audio | ADPCM | 8.0 | 32 kb/s |
G.722 | audio | | 16.0 (7 kHz spectrum) | 64 kb/s |
G.723 (recommendation no longer in force!) | audio | | 8.0 | 24 kb/s |
G.723.1 | audio | ACELP and MP-MLQ | 8.0 | 5.3 and 6.3 kb/s |
G.726 | audio | ADPCM | 8.0 | 16, 24, 32, 40 kb/s |
G.728 | audio | low-delay CELP | 8.0 | 16 kb/s |
G.729 | audio | CS-ACELP | 8.0 | 8 kb/s |
H.261 | video | DCT | | |
H.263 | video | DCT (improved version of H.261) | | |
For conferencing over ISDN:
The comp.speech FAQ contains many additional references, including a good summary of how different algorithms work.
Too many, some may say. vat versions 3.4 and earlier, one of the early (recent) Internet audio applications, uses mostly the same audio encodings as specified in the RTP profile, but a different protocol. There are also a number of Internet telephony applications that usually only operate on PCs and in unicast mode. There are initial efforts to interconnect the public switched telephone network and the Internet.
CuSeeMe (for Windows PC and the Macintosh) is a combined audio and video tool using reflectors rather than IP-level multicast.
The Internet Telephony Consortium maintains a listing of standards and related efforts.
Last updated Sun Feb 27 16:27:33 2000 by Henning Schulzrinne