When building a conversational voice AI application, developers need to understand the costs of the three core components: speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS). Each provider charges differently—some by tokens, some by characters, others by minutes or subscription tiers—which makes direct comparison tricky.
For simplicity, we’ve converted all pricing in this post into a single metric: cost per minute. This isn’t an exact calculation, but a practical way to compare overall affordability across models for startups and developers. The use case focuses on conversational AI agents, such as those for therapy, voice assistant, education, language study, or consulting.
1. STT Pricing Comparison (Per Minute)
STT model | Price per Minute (USD) | Notes |
Groq Whisper-Large-v3 | $0.00185 | Flat rate (no tiers): $0.111/hour ÷ 60. Starter/free tier via $10 credits (covers ~9,000 minutes). On-demand billing for business usage. |
AssemblyAI Universal-2 | $0.00380 | Weighted: 70% starter/pay-as-you-go pre-recorded ($0.0045/min, $0.27/hour), 30% business streaming ($0.0025/min, $0.15/hour). Free: $50 credits (~3,700 minutes). No enterprise included. |
Deepgram Nova-3 | $0.00420 | Weighted: 70% Pay As You Go/starter ($0.003/min monolingual), 30% Growth/business ($0.0036/min). Free: $200 credits (~46,500 minutes). 60% monolingual/40% multilingual split; no enterprise custom. |
ElevenLabs Scribe | $0.00580 | Weighted: 70% Pay As You Go/starter ($0.0067/min, $0.40/hour), 30% Business ($0.0037/min, $0.22/hour). Free: ~10-15 minutes. No enterprise custom. |
OpenAI Whisper | $0.00600 | Flat rate (no tiers): $0.006/min for large-v3 equivalent. Free: $5-18 credits (~13-50 minutes). Business usage same as starter. |
Notes:
This aggregation assumes typical mixed workloads (e.g., 70% starter for individuals/low-volume, 30% business for regular teams). Actual costs may vary by exact usage. For free tiers, all models offer trial credits (no ongoing free usage beyond limits).
2. LLM Pricing Comparison (per minute)
LLM Model | Price per Minute (USD) | Notes |
groq Llama 3.1 8B
Instant | $0.00338 (~$0.003) | Production-ready,
fastest/cheapest. Slightly less fluent than Gemma for voice AI; optimize with
prompts. Output: (36K/1M) × $0.08 = $0.00288; Input: (10K/1M) × $0.05 =
$0.0005. |
OpenAI GPT-4o-mini | $0.0069 (~$0.007) | Multimodal
(text/audio/image), excellent for voice AI. Slower TPS than Groq models. Output: (9K/1M) × $0.60 = $0.0054; Input:
(10K/1M) × $0.15 = $0.0015. |
Google Gemini 2.5
Flash-Lite | $0.0082 (~$0.008) | Multimodal, strong for
voice AI; slightly cheaper than Gemma. Output: (18K/1M) × $0.40 = $0.0072;
Input: (10K/1M) × $0.10 = $0.001. |
Groq Gemma 2 9B IT | $0.00872 (~$0.009) | Deprecated (Aug 2025).
Fast, lightweight, text-only. Ideal for voice AI but no longer available on Groq. Output: (33,600/1M) × $0.20 = $0.00672; Input:
(10K/1M) × $0.20 = $0.002. |
xAi Grok-3-mini | $0.0090 | Reasoning-focused but
fits voice AI; matches Gemma’s cost. Output: (12K/1M) × $0.50 = $0.006;
Input: (10K/1M) × $0.30 = $0.003. |
Notes:
Aggregate price/min = (Output tokens × Output price/1M) + (Input tokens × Input price/1M). TPS varies by load/prompt (e.g., ±10-20%). Input tokens (10K/min) conservative for chat history; scale down for lighter use (e.g., 5K → halve input cost).
3. TTS Pricing Comparison (per minute)
TTS Model | Price per Minute (USD) | Notes |
Google Gemini 2.5 Flash TTS | $0.01900 | Dominated by output audio tokens ($10 per 1M tokens; ~1,920 tokens/min). Input negligible. |
Google Chirp 3 HD TTS | $0.02700 | Based on $30 per 1M characters; 900 characters/min. Free tier: 1M characters/month. |
Deepgram Aura-2 | $0.02700 | $0.030 per 1,000 characters ($0.00003/char); 900 characters/min. $200 free credits initially. |
Cartesia Sonic-2 | $0.03000 | Direct $0.03 per minute of audio output; input negligible. Plan-based with included credits. |
OpenAI GPT-4o-mini TTS | $0.03000 | ~$0.015/min input text, ~$0.015/min output audio |
ElevenLabs Flash v2.5 | $0.06750 | $0.000075 per character (0.5-1 credit discount for Flash); 900 characters/min. Plan-based; amortized weighted ~$0.067 across tiers. |
Notes:
Assuming equal weighting across plans, using amortized rates for typical usage within limits.
Outro
The comparison table provides a clear baseline, but we recommend testing each option to ensure it meets your needs for accuracy, latency, and quality. For more guidance on choosing STT and TTS, explore our detailed comparisons of speech-to-text and text-to-speech models—covering pros, cons, pricing, API access, and documentation—to help you select the best fit for your application.
