In the previous parts of the Voice over IP Overview, we described how voice is digitized and how it is encoded using codecs, and we also touched on latency and bandwidth optimization issues. Now it’s time to learn more about how audio (and possibly video) streams are sent across the network.

The protocol used to send real-time streams of data across a network is called the Real-time Transport Protocol (RTP for short). RTP was originally defined by the IETF in RFC 1889; the current definition is in RFC 3550.

When transmitting the streams of data, the protocol needs to handle the following conditions in the network:

* The network can deliver packets out of order
* Some packets can be lost
* Jitter is introduced (jitter is the variation in packet inter-arrival time; we will get to it later in greater detail).

Out of these three, RTP aims to solve only two: out-of-order delivery and jitter, using sequence numbers and timestamps. When it comes to packet loss, the protocol prefers “real-timeness” to reliability. If some packets get lost, they stay lost, because it’s more important to keep the stream flowing in real time than to deliver every packet. For this reason, RTP usually runs on top of UDP. TCP is poorly suited to real-time streams because its retransmission scheme holds back all later data until the lost segment has been resent.
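To make the role of sequence numbers and timestamps more concrete, here is a minimal Python sketch that reads the 12-byte fixed RTP header described in RFC 3550 and puts packets that arrived out of order back into sequence. The function names (`parse_rtp_header`, `build_packet`), the SSRC value, and the timestamp step of 160 (20 ms of 8 kHz audio) are assumptions chosen for illustration. The sketch ignores header extensions, CSRC lists, and 16-bit sequence number wrap-around, all of which a real receiver would have to handle.

```python
import struct

# Fixed RTP header: 2 flag bytes, 16-bit sequence number,
# 32-bit timestamp, 32-bit SSRC, all in network byte order.
RTP_HEADER = struct.Struct("!BBHII")


def parse_rtp_header(packet: bytes) -> dict:
    """Extract the fields of the fixed RTP header from a raw packet."""
    if len(packet) < RTP_HEADER.size:
        raise ValueError("packet shorter than the 12-byte RTP fixed header")
    b0, b1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {
        "version": b0 >> 6,          # top two bits, 2 for current RTP
        "marker": bool(b1 >> 7),     # top bit of the second byte
        "payload_type": b1 & 0x7F,   # low seven bits of the second byte
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def build_packet(seq: int, timestamp: int, payload: bytes = b"") -> bytes:
    """Hypothetical helper: build a version-2 packet with payload type 0."""
    return RTP_HEADER.pack(0x80, 0, seq, timestamp, 0x1234) + payload


# Three packets that arrived in the order 2, 0, 1.
arrived = [build_packet(s, s * 160) for s in (2, 0, 1)]

# Sorting on the sequence number restores the sending order
# (wrap-around at 65535 is deliberately ignored in this sketch).
ordered = sorted((parse_rtp_header(p) for p in arrived),
                 key=lambda h: h["sequence"])
sequences = [h["sequence"] for h in ordered]
```

A gap in the sorted sequence numbers would tell the receiver that a packet was lost, while the timestamps tell it when each chunk of audio should be played out regardless of when the packet actually arrived.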