A Cheat Sheet of Voice AI Agent API Pricing

When building a conversational voice AI application, developers need to understand the costs of the three core components: speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS). Each provider charges differently—some by tokens, some by characters, others by minutes or subscription tiers—which makes direct comparison tricky.

For simplicity, we’ve converted all pricing in this post into a single metric: cost per minute. This isn’t an exact calculation, but a practical way to compare overall affordability across models for startups and developers. The use case focuses on conversational AI agents, such as those for therapy, voice assistant, education, language study, or consulting.

1. STT Pricing Comparison (Per Minute)

STT model

Price per Minute (USD)

Notes

Groq Whisper-Large-v3

$0.00185

Flat rate (no tiers): $0.111/hour ÷ 60. Starter/free tier via $10 credits (covers ~9,000 minutes). On-demand billing for business usage.

AssemblyAI Universal-2

$0.00380

Weighted: 70% starter/pay-as-you-go pre-recorded ($0.0045/min, $0.27/hour), 30% business streaming ($0.0025/min, $0.15/hour). Free: $50 credits (~3,700 minutes). No enterprise included.

Deepgram Nova-3

$0.00420

Weighted: 70% Pay As You Go/starter ($0.003/min monolingual), 30% Growth/business ($0.0036/min). Free: $200 credits (~46,500 minutes). 60% monolingual/40% multilingual split; no enterprise custom.

ElevenLabs Scribe

$0.00580

Weighted: 70% Pay As You Go/starter ($0.0067/min, $0.40/hour), 30% Business ($0.0037/min, $0.22/hour). Free: ~10-15 minutes. No enterprise custom.

OpenAI Whisper

$0.00600

Flat rate (no tiers): $0.006/min for large-v3 equivalent. Free: $5-18 credits (~13-50 minutes). Business usage same as starter.

Notes:

This aggregation assumes typical mixed workloads (e.g., 70% starter for individuals/low-volume, 30% business for regular teams). Actual costs may vary by exact usage. For free tiers, all models offer trial credits (no ongoing free usage beyond limits).

2. LLM Pricing Comparison (per minute)

LLM Model

Price per Minute (USD)

Notes

 groq Llama 3.1 8B Instant

$0.00338 (~$0.003)

Production-ready, fastest/cheapest. Slightly less fluent than Gemma for voice AI; optimize with prompts. Output: (36K/1M) × $0.08 = $0.00288; Input: (10K/1M) × $0.05 = $0.0005.

OpenAI GPT-4o-mini

$0.0069 (~$0.007)

Multimodal (text/audio/image), excellent for voice AI. Slower TPS than Groq models. Output: (9K/1M) × $0.60 = $0.0054; Input: (10K/1M) × $0.15 = $0.0015.

Google Gemini 2.5 Flash-Lite

$0.0082 (~$0.008)

Multimodal, strong for voice AI; slightly cheaper than Gemma. Output: (18K/1M) × $0.40 = $0.0072; Input: (10K/1M) × $0.10 = $0.001.

Groq Gemma 2 9B IT

$0.00872 (~$0.009)

Deprecated (Aug 2025). Fast, lightweight, text-only. Ideal for voice AI but no longer available on Groq. Output: (33,600/1M) × $0.20 = $0.00672; Input: (10K/1M) × $0.20 = $0.002.

xAi Grok-3-mini

$0.0090

Reasoning-focused but fits voice AI; matches Gemma’s cost. Output: (12K/1M) × $0.50 = $0.006; Input: (10K/1M) × $0.30 = $0.003.

Notes:

Aggregate price/min = (Output tokens × Output price/1M) + (Input tokens × Input price/1M). TPS varies by load/prompt (e.g., ±10-20%). Input tokens (10K/min) conservative for chat history; scale down for lighter use (e.g., 5K → halve input cost).

3. TTS Pricing Comparison (per minute)

TTS Model

Price per Minute (USD)

Notes

Google Gemini 2.5 Flash TTS

$0.01900

Dominated by output audio tokens ($10 per 1M tokens; ~1,920 tokens/min). Input negligible.

Google Chirp 3 HD TTS

$0.02700

Based on $30 per 1M characters; 900 characters/min. Free tier: 1M characters/month.

Deepgram Aura-2

$0.02700

$0.030 per 1,000 characters ($0.00003/char); 900 characters/min. $200 free credits initially.

Cartesia Sonic-2

$0.03000

Direct $0.03 per minute of audio output; input negligible. Plan-based with included credits.

OpenAI GPT-4o-mini TTS

$0.03000

~$0.015/min input text, ~$0.015/min output audio

ElevenLabs Flash v2.5

$0.06750

$0.000075 per character (0.5-1 credit discount for Flash); 900 characters/min. Plan-based; amortized weighted ~$0.067 across tiers.

Notes:

Assuming equal weighting across plans, using amortized rates for typical usage within limits.

Outro

The comparison table provides a clear baseline, but we recommend testing each option to ensure it meets your needs for accuracy, latency, and quality. For more guidance on choosing STT and TTS, explore our detailed comparisons of speech-to-text and text-to-speech models—covering pros, cons, pricing, API access, and documentation—to help you select the best fit for your application.

cheatsheet API pricing thumbnail
Download Cheat Sheet of AI API Pricing

Comments are closed