When you make a VoIP call, SIP handles the signaling: setting up the call, negotiating parameters, and tearing it down when someone hangs up. But SIP does not carry the actual voice audio. That job belongs to RTP, the Real-time Transport Protocol. RTP is the protocol that takes your voice, chops it into tiny packets, ships those packets across the internet, and reassembles them into audio at the other end. Virtually every SIP call you have ever made carried its audio as RTP packets.
Understanding RTP helps you diagnose call quality problems, configure your network correctly, and have informed conversations with your carrier when audio issues arise. SIPNEX is an FCC-licensed carrier that provides SIP trunking — our media infrastructure handles millions of RTP streams daily.
What RTP does
RTP (RFC 3550, published by the IETF in 2003 as an update to the original RFC 1889 from 1996) is a network protocol designed for delivering real-time data — audio, video, or other time-sensitive content — over IP networks. For VoIP, RTP carries the digitized voice audio between your phone system and the carrier’s media gateway.
RTP operates over UDP (User Datagram Protocol) rather than TCP. This is a deliberate design choice: TCP guarantees delivery by retransmitting lost packets, but retransmission adds delay. In real-time voice, a delayed packet is worse than a lost packet — hearing a word 500ms late is more disruptive than a brief gap. UDP sends packets without guarantees, accepting occasional loss in exchange for minimum latency. VoIP endpoints handle lost packets through concealment algorithms rather than retransmission.
Each RTP packet contains a small chunk of audio — typically 20 milliseconds of voice data. At 50 packets per second per call, a 3-minute phone conversation generates approximately 9,000 RTP packets in each direction. Each packet is independent — it has a sequence number and timestamp so the receiving end can reassemble them in the correct order, even if they arrive out of sequence due to network routing variations.
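As a rough illustration of how a receiver can use sequence numbers, the sketch below finds the packets missing from a stream that arrived out of order. RTP sequence numbers are 16 bits wide and wrap from 65535 back to 0, so the comparison has to work in modular arithmetic. The function names `seq_delta` and `missing_sequence_numbers` are illustrative, not taken from any particular VoIP stack.

```python
from typing import List


def seq_delta(a: int, b: int) -> int:
    """Signed distance from a to b in 16-bit sequence space (-32768..32767)."""
    return ((b - a + 0x8000) & 0xFFFF) - 0x8000


def missing_sequence_numbers(received: List[int]) -> List[int]:
    """Return the sequence numbers skipped between the earliest and latest packet seen."""
    if not received:
        return []
    first = received[0]
    # Express every packet as an offset from the first one, so wraparound disappears.
    offsets = sorted({seq_delta(first, s) for s in received})
    seen = set(offsets)
    return [
        (first + off) & 0xFFFF
        for off in range(offsets[0], offsets[-1] + 1)
        if off not in seen
    ]


# Packets arrive out of order across the 65535 -> 0 wrap; packet 2 never arrives.
print(missing_sequence_numbers([65534, 65535, 1, 0, 3]))
```

A real receiver does this incrementally inside its jitter buffer rather than on a complete list, but the wraparound handling is the same idea.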
The anatomy of an RTP packet
An RTP packet consists of a header and a payload.
The header (12 bytes minimum) contains: version number (2 bits, currently version 2), padding and extension flags, CSRC count, marker bit, payload type (7 bits, identifies the codec: type 0 for G.711u, type 18 for G.729), sequence number (16 bits, incremented by 1 for each packet and wrapping around after 65,535; the receiver uses this to detect lost or out-of-order packets), timestamp (32 bits, reflects the sampling instant of the first audio sample in the packet; the receiver uses this to play audio at the correct timing), and SSRC (32-bit Synchronization Source identifier, which uniquely identifies the RTP stream within a session). Optional CSRC entries and header extensions can make the header longer than 12 bytes.
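To make the layout concrete, here is a minimal sketch that decodes the fixed 12-byte header following the RFC 3550 field order. The `RtpHeader` class and `parse_rtp_header` function are names chosen for this example, and the sketch ignores CSRC entries and header extensions.

```python
import struct
from dataclasses import dataclass


@dataclass
class RtpHeader:
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# A hand-built G.711u packet: version 2, payload type 0, followed by 160 bytes of audio.
sample = struct.pack("!BBHII", 0x80, 0, 1000, 160000, 0x12345678) + bytes(160)
print(parse_rtp_header(sample))
```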
The payload contains the actual encoded audio data. For G.711u at 20ms per packet: 160 bytes of audio (8000 samples/sec × 0.020 sec × 1 byte/sample). For G.729 at 20ms per packet: 20 bytes (8,000 bits/sec × 0.020 sec ÷ 8 bits/byte). Adding the RTP (12 bytes), UDP (8 bytes), and IPv4 (20 bytes) headers gives an IP packet of 200 bytes for G.711u and 60 bytes for G.729. On Ethernet, the 14-byte frame header brings the totals to approximately 214 bytes and 74 bytes.
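The size arithmetic can be checked with a short calculation. This sketch assumes IPv4 without options and a 14-byte Ethernet header (no VLAN tag, frame check sequence not counted). The helper names are made up for the example.

```python
RTP_HEADER_BYTES = 12
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20
ETHERNET_HEADER_BYTES = 14  # assumption: no VLAN tag, 4-byte FCS not counted


def payload_bytes(codec_bps: int, ptime_ms: int = 20) -> int:
    """Audio bytes per packet for a constant-bitrate codec."""
    return codec_bps * ptime_ms // 8000


def ip_packet_bytes(codec_bps: int, ptime_ms: int = 20) -> int:
    """Payload plus RTP, UDP, and IPv4 headers."""
    headers = RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    return payload_bytes(codec_bps, ptime_ms) + headers


def ethernet_frame_bytes(codec_bps: int, ptime_ms: int = 20) -> int:
    """IP packet plus the Ethernet header."""
    return ip_packet_bytes(codec_bps, ptime_ms) + ETHERNET_HEADER_BYTES


def ip_bandwidth_bps(codec_bps: int, ptime_ms: int = 20) -> int:
    """One-way bandwidth at the IP layer, in bits per second."""
    packets_per_second = 1000 // ptime_ms
    return ip_packet_bytes(codec_bps, ptime_ms) * 8 * packets_per_second


for name, bps in (("G.711u", 64000), ("G.729", 8000)):
    print(name, payload_bytes(bps), ip_packet_bytes(bps),
          ethernet_frame_bytes(bps), ip_bandwidth_bps(bps))
```

Under these assumptions a G.711u stream needs about 80 kbps in each direction at the IP layer and a G.729 stream about 24 kbps, which shows how much of a small-payload packet is header overhead.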
How RTP flows during a call
When a VoIP call is established through SIP signaling, the SDP (Session Description Protocol) exchange in the SIP INVITE and 200 OK messages negotiates the RTP parameters: which codec to use, which IP addresses and UDP ports to send/receive media on, and which DTMF method to use.
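The sketch below pulls those negotiated values out of an SDP body. The sample SDP is hypothetical (the address comes from a documentation-only range), and `parse_audio_sdp` is an illustrative helper that handles only the handful of lines shown, not a complete SDP parser.

```python
from typing import Any, Dict

# Hypothetical SDP offer: G.711u (payload type 0) plus RFC 4733 DTMF events (type 101).
SAMPLE_SDP = """v=0
o=- 1234 1234 IN IP4 203.0.113.10
s=call
c=IN IP4 203.0.113.10
t=0 0
m=audio 10000 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=ptime:20
"""


def parse_audio_sdp(sdp: str) -> Dict[str, Any]:
    """Extract the media IP, UDP port, offered payload types, and codec names."""
    info: Dict[str, Any] = {"ip": None, "port": None, "payload_types": [], "rtpmap": {}}
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c=IN IP4 "):
            info["ip"] = line.split()[2]
        elif line.startswith("m=audio "):
            parts = line.split()
            info["port"] = int(parts[1])
            info["payload_types"] = [int(p) for p in parts[3:]]
        elif line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            info["rtpmap"][int(pt)] = encoding
    return info


print(parse_audio_sdp(SAMPLE_SDP))
```

Here the `m=audio` line carries the RTP port and the offered payload types, the `c=` line carries the media IP, and the `telephone-event` entry signals that DTMF will be sent as RTP events rather than in-band tones.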
Once the call is connected (SIP 200 OK acknowledged), RTP streams begin flowing. In a standard call, there are two RTP streams: one from your system to the carrier (your outbound audio) and one from the carrier to your system (the caller’s audio). These are independent UDP flows — each direction uses its own source and destination ports.
In a typical VICIdial configuration with canreinvite=no (the option is called directmedia in newer Asterisk releases), all RTP media flows through your Asterisk server. This is required for call recording (Asterisk must see the media to record it) and AMD (Asterisk must analyze the audio to detect answering machines). If canreinvite=yes were set, Asterisk might re-INVITE the two call legs to exchange media directly, for example between the agent’s phone and the carrier, bypassing the server, which would break recording and AMD.
RTCP: the control companion
RTCP (RTP Control Protocol, also defined in RFC 3550) runs alongside RTP and provides quality feedback. While RTP carries the audio, RTCP carries statistics about the RTP stream — packets sent, packets lost, jitter measurements, and round-trip time estimates. RTCP packets are sent periodically (typically every 5 seconds) and use the next sequential UDP port after the RTP port (if RTP uses port 10000, RTCP uses 10001).
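The jitter figure that RTCP reports is a running estimate defined in RFC 3550: for each packet, the receiver compares the spacing between arrivals with the spacing between RTP timestamps and smooths the difference with a gain of 1/16. The sketch below is a simplified version that assumes an 8000 Hz clock, arrival times in milliseconds, and no timestamp wraparound. The function name is illustrative.

```python
from typing import List


def interarrival_jitter(arrivals_ms: List[float],
                        rtp_timestamps: List[int],
                        clock_rate: int = 8000) -> float:
    """Running jitter estimate in RTP timestamp units, per the RFC 3550 formula."""
    jitter = 0.0
    for i in range(1, len(arrivals_ms)):
        # Spacing between arrivals, converted to timestamp units.
        arrival_delta = (arrivals_ms[i] - arrivals_ms[i - 1]) * clock_rate / 1000
        # Spacing between the sender's sampling instants.
        send_delta = rtp_timestamps[i] - rtp_timestamps[i - 1]
        d = abs(arrival_delta - send_delta)
        jitter += (d - jitter) / 16
    return jitter


timestamps = [0, 160, 320, 480]  # 20 ms of 8 kHz audio per packet
steady = interarrival_jitter([0, 20, 40, 60], timestamps)
bursty = interarrival_jitter([0, 20, 50, 60], timestamps)
print(steady, bursty / 8)  # dividing by 8 converts 8 kHz units to milliseconds
```

A perfectly paced stream yields zero jitter, while a single packet arriving 10 ms late nudges the estimate upward and then lets it decay over later packets.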
RTCP data is useful for real-time quality monitoring and for post-call quality analysis. Some VoIP monitoring tools consume RTCP data to generate MOS scores and quality alerts during active calls.
Common RTP problems and solutions
One-way audio. You can hear the other party but they cannot hear you (or vice versa). This is almost always a NAT (Network Address Translation) problem. Your system sends RTP from a private IP address, but the carrier needs to send return RTP to your public IP. If NAT traversal is not configured correctly, the return RTP goes to the wrong address. Solution: configure nat=force_rport,comedia in your Asterisk sip.conf. Ensure your firewall allows UDP traffic on your RTP port range (typically 10000-20000) from your carrier’s media IPs. See our VICIdial setup guide.
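The idea behind the comedia setting can be sketched in a few lines: instead of trusting the address advertised in SDP (which may be a private IP behind NAT), the receiver sends return audio to wherever the first RTP packet actually came from. The `MediaTarget` class below is a conceptual illustration only, not how Asterisk implements it, and the addresses are made-up examples.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Address = Tuple[str, int]


@dataclass
class MediaTarget:
    sdp_address: Address                      # address the far end advertised in SDP
    learned_address: Optional[Address] = None  # source address of RTP actually received

    def on_rtp_received(self, source: Address) -> None:
        """Remember the address the packet really came from (post-NAT)."""
        self.learned_address = source

    def send_to(self) -> Address:
        """Prefer the learned address; fall back to SDP until media arrives."""
        return self.learned_address or self.sdp_address


target = MediaTarget(sdp_address=("192.168.1.50", 10000))  # private IP from SDP
print(target.send_to())
target.on_rtp_received(("198.51.100.7", 34567))            # public IP and port after NAT
print(target.send_to())
```

This also shows why one-way audio persists when the firewall blocks the first inbound packets: without received media there is nothing to learn from, and return audio keeps going to the unreachable private address.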
No audio in either direction. The most likely cause is a firewall blocking the RTP ports entirely. Verify that UDP ports 10000-20000 (or your configured RTP range) are open for traffic to and from your carrier’s IP addresses.
Choppy or robotic audio. Caused by jitter (variation in packet arrival timing) or packet loss. The jitter buffer on the receiving end smooths out timing variations, but excessive jitter exhausts the buffer. Packet loss creates gaps that concealment algorithms fill imperfectly. Solution: implement QoS on your network to prioritize UDP voice traffic, use wired Ethernet instead of WiFi, and ensure your internet connection has adequate bandwidth with low jitter.
Echo. Caused by acoustic feedback (audio from the speaker feeding back into the microphone) or by impedance mismatch at analog/digital conversion points. In VoIP, echo is most common when calls traverse analog segments (FXO/FXS gateways). Solution: enable echo cancellation in your Asterisk configuration. If the echo is severe, check for hybrid echo at any analog interface points.
Audio delay. End-to-end latency exceeding 200ms creates noticeable conversational delay. Causes: network latency (distance, routing hops), jitter buffer depth (larger buffer = more delay), and codec delay (G.729 adds roughly 15ms of algorithmic delay from its 10ms frame and 5ms look-ahead, while G.711 adds essentially none). Solution: use G.711 to minimize codec delay, reduce jitter buffer size if your network is stable enough to support it, and choose a carrier with low-latency peering to your geographic region.
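The trade-off between jitter buffer depth and delay can be illustrated with a fixed-depth buffer model. In this simplified sketch, packet i is scheduled to play at the first packet's arrival time plus the buffer depth plus i × 20ms, and any packet that arrives after its slot is effectively lost. The function name and the arrival times are hypothetical. Real jitter buffers are usually adaptive.

```python
from typing import List


def late_packets(arrivals_ms: List[float],
                 ptime_ms: int = 20,
                 buffer_ms: int = 40) -> List[int]:
    """Return indexes of packets that arrive after their scheduled playout time.

    arrivals_ms[i] is the arrival time of the packet with sequence offset i.
    """
    if not arrivals_ms:
        return []
    playout_start = arrivals_ms[0] + buffer_ms
    return [
        i for i, arrival in enumerate(arrivals_ms)
        if arrival > playout_start + i * ptime_ms
    ]


# Packet 2 is delayed by the network and arrives after packet 3.
arrivals = [0, 20, 75, 60, 80]
print(late_packets(arrivals, buffer_ms=40))  # deeper buffer absorbs the delay
print(late_packets(arrivals, buffer_ms=20))  # shallower buffer drops packet 2
```

The 40ms buffer plays every packet at the cost of 20ms more delay on the whole call, and the 20ms buffer keeps latency lower but turns the late packet into a gap.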
RTP and security
Standard RTP transmits audio unencrypted, so anyone who can capture packets on the network path can decode and listen to the conversation. For sensitive communications, SRTP (Secure Real-time Transport Protocol, RFC 3711) encrypts the RTP payload using AES and authenticates each packet. SRTP adds minimal overhead (the encryption/decryption processing is fast on modern hardware) and is supported by SIPNEX alongside TLS for SIP signaling. Together, TLS + SRTP encrypt both call setup and conversation content between your system and the carrier. Protection beyond that hop depends on the networks that carry the call onward.
Frequently asked questions
What is RTP in VoIP?
RTP (Real-time Transport Protocol) is the protocol that carries actual voice audio during a VoIP call. While SIP handles call signaling (setup, teardown, routing), RTP handles the media — digitized voice data packaged into small packets (typically 20ms of audio each) sent over UDP. Each call has two RTP streams: one for each direction of audio. RTP includes sequence numbers and timestamps so the receiving end can reassemble packets in the correct order and detect losses. RTP operates over UDP rather than TCP because real-time audio requires minimum latency — retransmitting lost packets (TCP’s approach) would introduce delays worse than brief audio gaps.
What ports does RTP use?
RTP uses a range of UDP ports, typically 10000 to 20000 in most Asterisk/VICIdial configurations (configurable in rtp.conf). Each active call uses one port for RTP and the next sequential port for RTCP (the control protocol). With 200 concurrent calls, 400 UDP ports are in use simultaneously. Your firewall must allow UDP traffic on the entire RTP port range to and from your carrier’s media IP addresses. Blocking any port in the range will cause audio failures on calls that attempt to use that port.
Why does RTP use UDP instead of TCP?
Voice communication is real-time — packets must arrive within a narrow time window to be useful. TCP guarantees delivery by retransmitting lost packets, but retransmission adds hundreds of milliseconds of delay. In a phone conversation, hearing a word 500ms late is more disruptive than missing it entirely. UDP sends packets without delivery guarantees, accepting occasional loss in exchange for minimum latency. VoIP endpoints handle lost packets through concealment algorithms (interpolating missing audio from surrounding packets) rather than requesting retransmission. The result is smoother audio with occasional brief gaps rather than delayed audio with guaranteed completeness.
SIPNEX media infrastructure handles RTP at scale — optimized for high-concurrency predictive dialing with G.711 pass-through, no transcoding, and low-latency media paths. Get a trunk built for volume or see our rates.
SIPNEX
FCC-licensed carrier with its own STIR/SHAKEN SP certificate. Operator-owned. SIP trunks built for operators who dial at volume.