AI Agent API Pricing Comparison: A Cheat Sheet

When building a conversational voice AI application, developers need to understand the costs of the three core components: speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS). Each provider charges differently—some by tokens, some by characters, others by minutes or subscription tiers—which makes direct comparison tricky.

For simplicity, we’ve converted all pricing in this post into a single metric: cost per minute. This is an approximation rather than an exact calculation, but it gives startups and developers a practical way to compare overall affordability across models. The use case is conversational AI agents, such as those for therapy, voice assistants, education, language study, or consulting.

1. STT Pricing Comparison (per minute)

STT Model	Price per Minute (USD)	Notes
Groq Whisper-Large-v3	$0.00185	Flat rate (no tiers): $0.111/hour ÷ 60. Starter/free tier via $10 credits (covers ~5,400 minutes at this rate). On-demand billing for business usage.
AssemblyAI Universal-2	$0.00390	Weighted: 70% starter/pay-as-you-go pre-recorded ($0.0045/min, $0.27/hour), 30% business streaming ($0.0025/min, $0.15/hour). Free: $50 credits (~11,100 minutes at the pay-as-you-go rate). No enterprise included.
Deepgram Nova-3	$0.00420	Weighted: 70% Pay As You Go/starter ($0.003/min monolingual), 30% Growth/business ($0.0036/min). Free: $200 credits (~46,500 minutes). 60% monolingual/40% multilingual split; no enterprise custom.
ElevenLabs Scribe	$0.00580	Weighted: 70% Pay As You Go/starter ($0.0067/min, $0.40/hour), 30% Business ($0.0037/min, $0.22/hour). Free: ~10-15 minutes. No enterprise custom.
OpenAI Whisper	$0.00600	Flat rate (no tiers): $0.006/min for large-v3 equivalent. Free: $5-18 credits (~830-3,000 minutes at this rate). Business usage same as starter.

Notes:

Where a rate is marked "Weighted", it assumes a typical mixed workload: 70% billed at the starter or pay-as-you-go rate (individuals, low volume) and 30% at the business rate (regular team usage). Actual costs vary with your usage. All of the providers listed offer trial credits, but none offers ongoing free usage beyond those limits.
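The two conversions used in the STT table, hourly to per-minute and the 70/30 blend, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are our own, and the tier labels "starter" and "business" are shorthand for the plan names in the table.

```python
def per_minute_from_hourly(hourly_usd: float) -> float:
    """Convert an hourly audio rate (USD) to a per-minute rate."""
    return hourly_usd / 60.0


def blended_rate(tier_rates: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-minute tier prices; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(tier_rates[tier] * weights[tier] for tier in weights)


# Groq Whisper-Large-v3: flat $0.111/hour (figure from the table).
groq = per_minute_from_hourly(0.111)

# ElevenLabs Scribe: 70% starter at $0.0067/min, 30% business at $0.0037/min.
scribe = blended_rate(
    {"starter": 0.0067, "business": 0.0037},
    {"starter": 0.7, "business": 0.3},
)

print(f"Groq Whisper-Large-v3: ${groq:.5f}/min")
print(f"ElevenLabs Scribe:     ${scribe:.5f}/min")
```

If your traffic mix differs from 70/30, changing the weights is enough to get a blended figure for your own workload.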

2. LLM Pricing Comparison (per minute)

LLM Model	Price per Minute (USD)	Notes
Groq Llama 3.1 8B Instant	$0.00105	Input: 5K/1M × $0.05 = $0.00025. Output: 10K/1M × $0.08 = $0.00080. Total: $0.00105/minute.
Google Gemini 2.5 Flash-Lite	$0.00450	Input: 5K/1M × $0.10 = $0.00050. Output: 10K/1M × $0.40 = $0.00400. Total: $0.00450/minute.
xAI grok-3-mini	$0.00650	Input: 5K/1M × $0.30 = $0.00150. Output: 10K/1M × $0.50 = $0.00500. Total: $0.00650/minute.
OpenAI GPT-4o-mini	$0.00675	Input: 5K/1M × $0.15 = $0.00075. Output: 10K/1M × $0.60 = $0.00600. Total: $0.00675/minute.

Notes:

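The cost per minute is the sum of billed input and output tokens:
Price/min = (Effective Input Tokens ÷ 1M × Input $/1M) + (Output Tokens ÷ 1M × Output $/1M),
where Effective Input Tokens = Input Tokens × (1 – Input Cache Hit Rate). Caching applies to input (prompt) tokens only; output tokens are always billed in full.
A 50% input cache hit rate, for example, halves the input cost if cached tokens are not billed.
The table assumes 5K input tokens and 10K output tokens per minute with no caching. Adjust the token counts to match your workload (e.g., 10K/min for full load or 5K/min for light use), noting that real costs can vary by ±10–20%.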
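As a rough sketch, the per-minute LLM calculation can be written as a small Python function. The function name and parameters are our own, and it makes two simplifying assumptions: the cache hit rate applies to input tokens only, and cached tokens cost nothing. Many providers bill cached input at a reduced rate rather than zero, so treat the cached result as a lower bound.

```python
def llm_cost_per_minute(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_1m: float,
    output_usd_per_1m: float,
    input_cache_hit_rate: float = 0.0,
) -> float:
    """Estimate LLM cost (USD) for one minute of conversation.

    Assumption: cached input tokens are free; output tokens are never cached.
    """
    if not 0.0 <= input_cache_hit_rate <= 1.0:
        raise ValueError("input_cache_hit_rate must be between 0 and 1")
    billed_input = input_tokens * (1.0 - input_cache_hit_rate)
    input_cost = billed_input / 1_000_000 * input_usd_per_1m
    output_cost = output_tokens / 1_000_000 * output_usd_per_1m
    return input_cost + output_cost


# GPT-4o-mini row from the table: 5K input, 10K output tokens per minute.
no_cache = llm_cost_per_minute(5_000, 10_000, 0.15, 0.60)

# Same workload with half of the input tokens served from cache.
half_cached = llm_cost_per_minute(5_000, 10_000, 0.15, 0.60, input_cache_hit_rate=0.5)

print(f"No cache:        ${no_cache:.6f}/min")
print(f"50% input cache: ${half_cached:.6f}/min")
```

Because output tokens dominate the bill at these token counts, input caching changes the total only slightly here.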

3. TTS Pricing Comparison (per minute)

TTS Model	Price per Minute (USD)	Notes
Google Chirp 3 HD TTS	$0.02700	Based on $30 per 1M characters; 900 characters/min. Free tier: 1M characters/month.
Deepgram Aura-2	$0.02700	$0.030 per 1,000 characters ($0.00003/char); 900 characters/min. $200 free credits initially.
Cartesia Sonic-2	$0.03000	Direct $0.03 per minute of audio output; input negligible. Plan-based with included credits.
OpenAI GPT-4o-mini TTS	$0.03000	~$0.015/min for input text plus ~$0.015/min for output audio.
ElevenLabs Flash v2.5	$0.06750	$0.000075 per character (0.5-1 credit discount for Flash); 900 characters/min. Plan-based; amortized weighted ~$0.067 across tiers.

Notes:

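These figures assume equal weighting across plans and use amortized rates for typical usage within plan limits. Character-based prices are converted at 900 characters of synthesized text per minute.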
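For providers that bill by character, the per-minute figures come from multiplying the per-character price by an assumed speaking rate. A minimal Python sketch of that conversion follows; the function name is our own, and 900 characters per minute is the assumption used in the table, not a measured value.

```python
CHARS_PER_MINUTE = 900  # speaking-rate assumption used in the table


def tts_cost_per_minute(usd_per_char: float, chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Convert a per-character TTS price (USD) into a per-minute price."""
    return usd_per_char * chars_per_minute


# Per-character prices derived from the table rows.
chirp = tts_cost_per_minute(30 / 1_000_000)  # Google Chirp 3 HD: $30 per 1M characters
aura = tts_cost_per_minute(0.030 / 1_000)    # Deepgram Aura-2: $0.030 per 1,000 characters
flash = tts_cost_per_minute(0.000075)        # ElevenLabs Flash v2.5: $0.000075 per character

for name, cost in [("Chirp 3 HD", chirp), ("Aura-2", aura), ("Flash v2.5", flash)]:
    print(f"{name}: ${cost:.5f}/min")
```

If your agent speaks faster or slower, pass a different chars_per_minute value to see how sensitive the estimate is.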

Outro and Download

The comparison tables provide a clear baseline, but we recommend testing each option to ensure it meets your needs for accuracy, latency, and quality. For more guidance on choosing STT and TTS, explore our detailed comparisons of speech-to-text and text-to-speech models, which cover pros, cons, pricing, API access, and documentation, to help you select the best fit for your application.
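To turn the three tables into a single estimate for a full voice pipeline, you can add one per-minute figure from each. The Python sketch below uses the lowest-priced row from each table; the class name and the 10,000 minutes per month volume are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStack:
    """Per-minute prices (USD) for one STT + LLM + TTS combination."""
    stt: float
    llm: float
    tts: float

    def per_minute(self) -> float:
        return self.stt + self.llm + self.tts

    def per_month(self, minutes: float) -> float:
        return self.per_minute() * minutes


# Lowest-priced row from each table in this post.
budget = VoiceStack(stt=0.00185, llm=0.00105, tts=0.02700)

monthly_minutes = 10_000  # hypothetical usage volume
print(f"Per minute: ${budget.per_minute():.4f}")
print(f"Per month:  ${budget.per_month(monthly_minutes):.2f}")
```

In this combination TTS accounts for most of the per-minute cost, so the TTS choice is the one that moves the total the most.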