VoIP turns your voice into data, sends that data over the internet, and turns it back into voice at the other end. On a healthy connection, the trip from your microphone to the other party’s speaker typically takes under 200 milliseconds. This guide explains how each step works, in order, without requiring a networking degree to understand.
SIPNEX is an FCC-licensed carrier providing SIP trunking — the carrier-grade version of VoIP that connects your phone system to the telephone network. Understanding how VoIP works helps you configure your system correctly, diagnose quality problems, and make informed decisions about codecs, bandwidth, and carrier selection.
Step 1: Your voice becomes data
When you speak into a microphone (headset, handset, or laptop mic), the microphone converts sound waves into an analog electrical signal — a continuously varying voltage that represents the shape of your voice waveform. This analog signal must be converted to digital data for transmission over IP networks.
The conversion has two parts: sampling and quantization. The system measures the analog signal 8,000 times per second for standard telephony, which is the Nyquist rate for capturing voice frequencies up to 4 kHz. Each sample is then quantized to a numeric value. G.711 stores each sample as an 8-bit number (256 possible levels), spaced logarithmically (μ-law in North America and Japan, A-law elsewhere) so that quiet sounds get finer resolution than loud ones. This produces a digital audio stream at 64,000 bits per second (8,000 samples × 8 bits = 64 kbps), the foundation of the G.711 codec.
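The sampling arithmetic can be sketched in a few lines of Python. This is only an illustration: it uses the continuous μ-law companding curve followed by uniform 8-bit rounding, which approximates G.711 but is not bit-exact with the standard’s segmented encoding. The function names and the 440 Hz test tone are assumptions made for this example.

```python
import math

SAMPLE_RATE_HZ = 8000   # samples per second for standard telephony
BITS_PER_SAMPLE = 8     # size of one G.711 code word
MU = 255                # companding constant of the mu-law curve


def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1.0, 1.0] onto the mu-law curve (same range)."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def quantize_8bit(y: float) -> int:
    """Round a companded value in [-1.0, 1.0] to one of 256 levels (0..255)."""
    return min(255, int((y + 1.0) / 2.0 * 256))


def encode_tone(freq_hz: float, duration_s: float) -> bytes:
    """Sample a sine tone at 8 kHz and return one byte per sample."""
    n = round(SAMPLE_RATE_HZ * duration_s)
    samples = (math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ) for i in range(n))
    return bytes(quantize_8bit(mu_law_compress(s)) for s in samples)


chunk = encode_tone(440.0, 0.020)            # 20 ms of a hypothetical test tone
bitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # bits per second of the raw stream
print(len(chunk), "bytes per 20 ms;", bitrate, "bits per second")
```

Twenty milliseconds of audio comes out as 160 one-byte samples, which is the payload size used for G.711 packets in the next step.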
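If you are using a compressed codec like G.729, an additional encoding step reduces the 64 kbps stream to 8 kbps. Rather than sending the samples themselves, the encoder analyzes each 10 ms frame of speech and transmits the parameters of a speech model (linear-prediction filter coefficients plus codebook indices describing the excitation), from which the decoder resynthesizes the audio. This saves bandwidth at the cost of a slight reduction in audio quality.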
Step 2: Data becomes packets
The digital audio stream is divided into small chunks — typically 20 milliseconds of audio per chunk. Each chunk is placed into an RTP (Real-time Transport Protocol) packet with a header containing sequence number, timestamp, and codec identifier. The RTP packet is wrapped in a UDP packet (which provides the port addressing), which is wrapped in an IP packet (which provides the network addressing).
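As a rough sketch of the packetization step, the following Python builds the fixed 12-byte RTP header and prepends it to a 160-byte audio chunk. The field layout follows the standard RTP header; the function names, the SSRC value, and the all-zero placeholder payload are assumptions for illustration. UDP and IP headers are normally added by the operating system’s network stack, so they do not appear here.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE_PCMU = 0   # static RTP payload type number for G.711 mu-law


def build_rtp_packet(seq: int, timestamp: int, ssrc: int, payload: bytes,
                     payload_type: int = PAYLOAD_TYPE_PCMU) -> bytes:
    """Prepend a 12-byte RTP header (no padding, extension, CSRC list, or marker)."""
    first_byte = RTP_VERSION << 6        # V=2, P=0, X=0, CC=0
    second_byte = payload_type & 0x7F    # M=0, 7-bit payload type (codec identifier)
    header = struct.pack("!BBHII", first_byte, second_byte,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed header fields of an RTP packet."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": b0 >> 6, "payload_type": b1 & 0x7F,
            "seq": seq, "timestamp": ts, "ssrc": ssrc}


audio = bytes(160)  # placeholder for 20 ms of G.711 audio
# The timestamp advances by 160 per packet: one tick per 8 kHz sample.
packets = [build_rtp_packet(seq=i, timestamp=i * 160, ssrc=0x1234ABCD, payload=audio)
           for i in range(3)]
print(len(packets[0]), parse_rtp_header(packets[2]))
```

The sequence number lets the receiver detect loss and reordering, while the timestamp tells it when each chunk should be played.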
The complete packet structure for one 20ms chunk of G.711 audio: 160 bytes of audio payload + 12 bytes RTP header + 8 bytes UDP header + 20 bytes IP header = 200 bytes total. At 50 packets per second (one every 20ms), that is 80 kbps per direction at the IP layer (200 bytes × 8 bits × 50), or roughly 87 kbps once Ethernet framing is added. That is the bandwidth cost of one G.711 VoIP call.
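The per-call bandwidth arithmetic is easy to reproduce. The sketch below is a small calculator, not a measurement tool: the function name is hypothetical, the 18 bytes of Ethernet framing (14-byte header plus 4-byte checksum) is an assumption about the link layer, and the G.729 line assumes two 10-byte frames per 20ms packet.

```python
RTP_HEADER = 12
UDP_HEADER = 8
IP_HEADER = 20   # IPv4 without options


def call_bandwidth_kbps(payload_bytes: int, packet_interval_ms: int = 20,
                        link_overhead_bytes: int = 0) -> float:
    """Bandwidth of one direction of a call, in kilobits per second."""
    packet_bytes = payload_bytes + RTP_HEADER + UDP_HEADER + IP_HEADER + link_overhead_bytes
    packets_per_second = 1000 / packet_interval_ms
    return packet_bytes * 8 * packets_per_second / 1000


g711_ip = call_bandwidth_kbps(160)                                 # IP layer only
g711_ethernet = call_bandwidth_kbps(160, link_overhead_bytes=18)   # with Ethernet framing
g729_ip = call_bandwidth_kbps(20)                                  # 8 kbps codec, 20 bytes per 20 ms
print(g711_ip, g711_ethernet, g729_ip)
```

Under these assumptions a G.711 call costs 80 kbps at the IP layer and about 87 kbps on Ethernet, while G.729 drops to 24 kbps. The headers stay at 40 bytes regardless of codec, which is why an 8x smaller payload does not yield an 8x smaller call.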
Step 3: Packets travel the network
The IP packets travel from your system to your carrier’s media gateway (or directly to the other party in a peer-to-peer configuration) over the internet. Each packet is routed independently — packets may take different paths through the network depending on routing conditions. They may arrive out of order. Some may be delayed. Some may be lost entirely.
This is where network quality matters. Three factors determine whether the packets arrive in usable condition:
Latency: How long each packet takes to arrive. Under 150ms one-way is acceptable for voice. Higher latency creates conversational delay.
Jitter: How much the arrival timing varies between packets. Under 30ms is acceptable. Higher jitter means packets arrive too late for the receiver’s jitter buffer to play them on schedule, which produces choppy or robotic audio.
Packet loss: What percentage of packets never arrive. Under 1% is acceptable. Higher loss creates audio gaps.
Your internet connection’s raw speed (Mbps) matters less than these three quality metrics. A 10 Mbps connection with 20ms latency and 0.1% packet loss produces better voice quality than a 100 Mbps connection with 200ms latency and 3% packet loss.
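To make the three metrics concrete, here is a minimal sketch of how they might be computed from packet timestamps and sequence numbers. The jitter function follows the running-average style of the RTP specification’s interarrival jitter estimate, J += (|D| - J) / 16, but works in milliseconds for readability. All function names are hypothetical, and the thresholds are the ones quoted above.

```python
def interarrival_jitter_ms(send_ms, recv_ms):
    """Running jitter estimate: J += (|D| - J) / 16 for each consecutive packet pair."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # D is the change in transit time between packet i-1 and packet i.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


def packet_loss_percent(received_seqs, expected_count):
    """Percentage of expected packets that never arrived (duplicates ignored)."""
    return 100.0 * (expected_count - len(set(received_seqs))) / expected_count


def meets_voice_thresholds(latency_ms, jitter_ms, loss_percent):
    """Apply the rule-of-thumb limits: 150ms latency, 30ms jitter, 1% loss."""
    return latency_ms < 150 and jitter_ms < 30 and loss_percent < 1.0


send = [0, 20, 40, 60]     # sender clock, one packet every 20ms
recv = [50, 70, 95, 110]   # receiver clock; the third packet arrived 5ms late
print(interarrival_jitter_ms(send, recv))
print(meets_voice_thresholds(20, 5, 0.1), meets_voice_thresholds(200, 5, 3.0))
```

The last line mirrors the comparison above: the slower but cleaner connection passes, and the faster connection with 200ms latency and 3% loss fails.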
Step 4: Packets become voice again
At the receiving end, the process reverses. The RTP packets arrive at the carrier’s media gateway (for PSTN-bound calls) or at the recipient’s VoIP device (for IP-to-IP calls). A jitter buffer collects incoming packets and holds them briefly — typically 20 to 60ms — to smooth out timing variations. Packets that arrive late are either inserted at the correct position (if the buffer is large enough) or dropped (if they arrive too late).
The buffered packets are reassembled in sequence order using the RTP sequence numbers. Any missing packets are handled by Packet Loss Concealment (PLC) — algorithms that interpolate the missing audio from surrounding packets. The digital audio data is decoded back to an analog signal (or played directly through a digital audio pipeline) and sent to the speaker.
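The reordering and concealment logic can be sketched as follows. This is a deliberately simplified model: it ignores timing (real jitter buffers drop packets that miss their playout deadline), it ignores sequence-number wraparound, and it conceals a gap by repeating the previous frame, which is the crudest form of PLC. Production implementations interpolate and fade instead. The function name and the string "frames" are placeholders.

```python
def reorder_and_conceal(arrivals, first_seq, last_seq):
    """Return (seq, frame, concealed) tuples in playout order.

    arrivals: iterable of (seq, frame) pairs in the order they reached the receiver.
    """
    buffered = {}
    for seq, frame in arrivals:
        buffered.setdefault(seq, frame)   # keep the first copy, ignore duplicates

    output = []
    last_good = None                      # most recent frame that actually arrived
    for seq in range(first_seq, last_seq + 1):
        if seq in buffered:
            last_good = buffered[seq]
            output.append((seq, last_good, False))
        else:
            # Missing packet: conceal the gap by replaying the previous frame.
            output.append((seq, last_good, True))
    return output


# Packet 3 overtook packet 2 in transit, and packet 4 was lost.
arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e")]
for entry in reorder_and_conceal(arrivals, 1, 5):
    print(entry)
```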
The total time from your microphone to the recipient’s speaker: encoding (1-5ms) + packetization (20ms) + network transit (30-150ms) + jitter buffer (20-60ms) + decoding (1-5ms) = approximately 70 to 240ms end-to-end. Under 200ms feels like a normal phone conversation. Over 300ms creates the satellite-call effect where you talk over each other.
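The delay budget is a simple sum of ranges, shown here as a short Python sketch. The component figures are the ones given above; the label for the 200 to 300ms band is an assumption, since the text only characterizes delays under 200ms and over 300ms.

```python
# (best case, worst case) in milliseconds for each stage of the path
BUDGET_MS = {
    "encoding": (1, 5),
    "packetization": (20, 20),
    "network transit": (30, 150),
    "jitter buffer": (20, 60),
    "decoding": (1, 5),
}

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())


def classify(total_ms: int) -> str:
    """Describe how a given mouth-to-ear delay feels to the callers."""
    if total_ms < 200:
        return "normal conversation"
    if total_ms <= 300:
        return "noticeable delay"   # assumed label for the in-between band
    return "satellite-call effect"


print(best_case, worst_case, classify(best_case), classify(worst_case))
```

Network transit and the jitter buffer dominate the total, so those are the two places where tuning pays off.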
Step 5: The call is managed by SIP
While RTP handles the audio, SIP (Session Initiation Protocol) manages the call itself. When your phone system wants to place a call, it sends a SIP INVITE to the carrier’s SIP proxy. The INVITE contains who you are calling, your caller ID, and which codecs your system supports.
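For a sense of what an INVITE looks like on the wire, the sketch below assembles a minimal one in Python. It is illustrative only: the phone numbers, the sip.example.com domain, the 192.0.2.10 address, and the RTP port are placeholders, and a real trunk would typically add authentication and carrier-specific headers that are omitted here. The SDP body at the end is where the supported codecs are listed (payload type 0 for G.711 μ-law, 18 for G.729).

```python
import uuid


def build_invite(caller: str, callee: str, domain: str, local_ip: str, rtp_port: int) -> str:
    """Assemble a minimal SIP INVITE with an SDP offer listing two codecs."""
    sdp = "\r\n".join([
        "v=0",
        f"o=- 0 0 IN IP4 {local_ip}",
        "s=call",
        f"c=IN IP4 {local_ip}",             # where the caller wants to receive audio
        "t=0 0",
        f"m=audio {rtp_port} RTP/AVP 0 18",  # offered codecs, in preference order
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:18 G729/8000",
        "",
    ])
    headers = [
        f"INVITE sip:{callee}@{domain} SIP/2.0",              # who you are calling
        f"Via: SIP/2.0/UDP {local_ip}:5060;branch=z9hG4bK{uuid.uuid4().hex}",
        "Max-Forwards: 70",
        f"From: <sip:{caller}@{domain}>;tag={uuid.uuid4().hex[:8]}",  # your caller ID
        f"To: <sip:{callee}@{domain}>",
        f"Call-ID: {uuid.uuid4()}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller}@{local_ip}:5060>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(sdp.encode())}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + sdp


invite = build_invite("+15551230001", "+15551230002", "sip.example.com", "192.0.2.10", 40000)
print(invite)
```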
The carrier authenticates your trunk, validates your caller ID, signs the call with STIR/SHAKEN attestation, and routes it toward the destination. If the destination is a regular phone number, the carrier connects through the PSTN. If the destination is another VoIP system, the carrier may route via direct IP peering.
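When the called party answers, their side returns a 200 OK that confirms which codec will be used, your system acknowledges it with an ACK, and both sides begin sending RTP packets. When either party hangs up, that party’s endpoint sends a SIP BYE message to end the call. The RTP streams stop. The call is over.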
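The lifecycle of a call can be modeled as a small state machine driven by SIP messages. The sketch below is a simplified caller-side view: the state names are invented for this example, the provisional responses (100 Trying, 180 Ringing) and the final 200 OK and ACK are the usual SIP handshake, and failure responses, cancellation, and retransmissions are left out.

```python
# (current state, SIP message seen) -> next state
TRANSITIONS = {
    ("idle", "INVITE"): "calling",
    ("calling", "100 Trying"): "calling",
    ("calling", "180 Ringing"): "ringing",
    ("calling", "200 OK"): "answered",
    ("ringing", "200 OK"): "answered",
    ("answered", "ACK"): "in_call",    # RTP now flows in both directions
    ("in_call", "BYE"): "ending",
    ("ending", "200 OK"): "ended",     # BYE confirmed, RTP streams stop
}


def run_call(messages):
    """Walk a sequence of SIP messages and return the final call state."""
    state = "idle"
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {message!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state


flow = ["INVITE", "100 Trying", "180 Ringing", "200 OK", "ACK", "BYE", "200 OK"]
print(run_call(flow))
```

Audio only exists while the machine is in the in_call state; everything before and after is signaling.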
Where VoIP meets the telephone network
For calls between two VoIP systems (two SIP trunks, two softphones, two IP devices), the entire call stays on IP networks. SIP handles signaling end-to-end. RTP carries audio end-to-end. No traditional telephone network is involved.
For calls to regular phone numbers (cell phones, landlines), the call must transition from the IP network to the PSTN (Public Switched Telephone Network). This transition happens at the carrier level — your carrier’s gateway converts SIP signaling to SS7 signaling and RTP audio to TDM audio for delivery through the traditional telephone network. This is the core function of a SIP trunk provider — bridging your IP infrastructure to the PSTN.
The transition is invisible to you. You send a SIP INVITE. The carrier handles the conversion. The recipient’s phone rings. From your perspective, every call is a SIP call. From the recipient’s perspective, it is just a phone call.
Why carrier choice matters
Every step above is affected by your carrier selection:
Network transit (Step 3): Your carrier’s peering quality determines latency and packet loss between their network and the destination. Better peering means lower values for both.
SIP management (Step 5): Your carrier’s signaling infrastructure determines how quickly calls are set up. Faster processing means lower post-dial delay (PDD), the wait between sending the INVITE and hearing ringback.
STIR/SHAKEN signing: Whether your carrier is a direct carrier or a reseller affects attestation. Direct carriers sign at A-level (full attestation).
PSTN transition: Your carrier’s gateway capacity and interconnection agreements determine quality on PSTN-bound calls. Overloaded gateways cause quality degradation.
SIPNEX handles these carrier-side factors as an FCC-licensed carrier with our own network: low-latency peering, carrier-grade SIP infrastructure, direct STIR/SHAKEN signing with our own SP-KI certificate, and media gateways sized for high-concurrency predictive dialing.
Frequently asked questions
How does VoIP work in simple terms?
VoIP converts your voice into digital data packets, sends those packets over the internet, and converts them back to voice at the other end. Your microphone captures your voice. A codec (like G.711) digitizes it. The digital audio is split into small packets (20ms each) and sent over the internet using RTP. At the destination, the packets are reassembled and played through the speaker. SIP manages the call, setting it up when you dial and ending it when you hang up. On a healthy connection, the audio reaches the other end in under 200 milliseconds.
Does VoIP quality depend on internet speed?
Not primarily. VoIP quality depends more on connection quality (jitter, packet loss, latency) than on raw speed. Each G.711 call uses only about 80 to 90 kbps in each direction, so even a basic 5 Mbps connection can support dozens of simultaneous calls. But if that connection has high jitter (over 30ms), significant packet loss (over 1%), or high latency (over 150ms one-way), call quality suffers regardless of speed. For best results: use a stable wired internet connection, implement QoS to prioritize voice traffic, and ensure your connection quality metrics meet VoIP thresholds.
Can someone listen to my VoIP calls?
By default, standard VoIP (SIP + RTP) transmits signaling and audio unencrypted. Anyone who can capture packets on the network path could decode and listen to the conversation. This risk is mitigated by TLS (encrypts SIP signaling) and SRTP (encrypts RTP audio). When both are enabled, the call is encrypted between your system and the carrier; any leg beyond the carrier, such as the PSTN, is outside that protection. SIPNEX supports both TLS and SRTP. For sensitive communications (healthcare, financial, legal), encryption should be enabled. See our VoIP security guide for detailed configuration guidance.
SIPNEX handles the carrier side of VoIP — SIP trunking that connects your phone system to the telephone network with A-level STIR/SHAKEN, unlimited channels, and carrier-grade media infrastructure. Get started or see our rates.
SIPNEX
FCC-licensed carrier with its own STIR/SHAKEN SP certificate. Operator-owned. SIP trunks built for operators who dial at volume.