Disclaimer and Methodology for CO₂ Emission Estimates of LLM Inference
Tool: Free LLM Co2 Emission Calculator
Purpose of this document
This document explains how the CO₂ emission estimates for large language model (LLM) inference were derived, what sources were used, which assumptions were required, and where uncertainty remains. The goal is comparability and transparency, not exact accounting of any provider’s proprietary infrastructure.
These estimates should be interpreted as order-of-magnitude indicators suitable for comparison, decision-making, and sustainability analysis, rather than precise operational measurements.
Scope and boundaries
Included
- Inference-time energy use and associated CO₂-equivalent emissions
- Normalization to grams of CO₂e per 1,000,000 tokens
- Comparison across models and providers under a unified framework
Excluded
- Training-time emissions
- Embodied carbon of hardware manufacturing
- Networking, storage, logging, or end-user device energy
- Provider-specific optimizations such as aggressive batching, caching, or speculative decoding
Primary data sources
The estimates are anchored in peer-reviewed or openly published benchmarks, supplemented by documented disclosures from model providers.
1. Inference energy benchmarking
The main quantitative anchor is:
“How Hungry Is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference” (Jegham et al., 2025)
This work directly measures electricity consumption per query for a wide range of frontier and open models under controlled conditions. It reports:
- Energy consumption (Wh) per inference request
- Defined prompt sizes
- Carbon intensity factors used for conversion to CO₂e
Where a model appears in this benchmark, its values are labeled “anchored”.
2. Carbon intensity factors (CIFs)
Carbon emissions were derived by multiplying electricity consumption by region-appropriate carbon intensity factors, consistent with the benchmark and public sustainability disclosures:
- Azure/OpenAI-hosted workloads: ~0.35 kg CO₂e per kWh
- AWS-hosted workloads: ~0.287 kg CO₂e per kWh
- Google data centers: assumed ~0.20–0.25 kg CO₂e per kWh, based on Google’s public claims of low-carbon energy sourcing
These values represent average operational intensity, not real-time marginal grid emissions.
3. Vendor sustainability disclosures
For Gemini models, Google reports CO₂ emissions per “median text prompt” in public environmental reports. However:
- Token counts per “median prompt” are not disclosed
- Hardware configuration and batching details are not provided
As a result, Gemini per-token estimates are necessarily derived and presented as ranges.
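To see why the result is a range: a per-prompt figure only becomes a per-token figure once a token count per prompt is assumed. The following is a minimal sketch, not the calculator's actual implementation. The function name and all input values are hypothetical placeholders and are not taken from Google's reports.

```python
def per_million_range(g_per_prompt: float, min_tokens: int, max_tokens: int) -> tuple[float, float]:
    """Return (low, high) g CO2e per 1M tokens given assumed bounds on tokens per prompt."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("token bounds must satisfy 0 < min_tokens <= max_tokens")
    # More tokens per prompt spreads the same emissions thinner, so the
    # upper token bound yields the lower per-token estimate.
    low = g_per_prompt * 1_000_000 / max_tokens
    high = g_per_prompt * 1_000_000 / min_tokens
    return low, high

# Placeholder inputs for illustration only (assumptions, not reported data).
low, high = per_million_range(g_per_prompt=0.05, min_tokens=500, max_tokens=2_000)
print(f"{low:.0f} to {high:.0f} g CO2e per 1M tokens")
```

The width of the resulting range is driven entirely by the assumed token bounds, which is why undisclosed token counts translate directly into uncertainty.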
Normalization methodology
To make models comparable, all estimates are normalized to:
grams of CO₂e per 1,000,000 tokens
Reference workload
A standard “long prompt” definition from the benchmark is used:
- 10,000 input tokens
- 1,500 output tokens
- Total: 11,500 tokens per query
Conversion steps
- Start with measured or estimated Wh per query
- Convert electricity to emissions using a carbon intensity factor:
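g CO₂e = Wh × (kg CO₂e / kWh)
(1 Wh = 0.001 kWh and 1 kg = 1,000 g, so the two factors of 1,000 cancel and no further scaling is needed)
- Normalize to 1,000,000 tokens: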
g CO₂e / 1M tokens = g CO₂e per query × (1,000,000 / 11,500)
This approach preserves proportional differences between models and expresses them in a common per-token unit. The resulting figures still reflect the input/output mix of the reference workload.
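The conversion steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the calculator's actual implementation. The function name `co2e_per_million_tokens` and the 4.0 Wh input are hypothetical and are not benchmark measurements. Because 1 Wh is 0.001 kWh and 1 kg is 1,000 g, energy in Wh multiplied by a factor in kg CO₂e/kWh yields grams directly; the code writes both unit conversions out so the cancellation is visible.

```python
TOKENS_PER_QUERY = 10_000 + 1_500  # reference "long prompt": input + output tokens

def co2e_per_million_tokens(wh_per_query: float, cif_kg_per_kwh: float,
                            tokens_per_query: int = TOKENS_PER_QUERY) -> float:
    """Grams of CO2e per 1,000,000 tokens, derived from one reference query."""
    # Wh -> kWh divides by 1000; kg -> g multiplies by 1000.
    grams_per_query = (wh_per_query / 1000) * cif_kg_per_kwh * 1000
    return grams_per_query * (1_000_000 / tokens_per_query)

# Hypothetical input: 4.0 Wh per query at the AWS factor quoted in this document.
example = co2e_per_million_tokens(4.0, 0.287)
print(f"{example:.1f} g CO2e per 1M tokens")
```

With these placeholder inputs the result is roughly 100 g CO₂e per 1M tokens. Changing only the carbon intensity factor scales the result linearly.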
Treatment of reasoning modes
Models with explicit or implicit reasoning modes require special handling.
- Benchmarks show that reasoning-heavy inference consumes significantly more compute per token
- In some cases, energy usage increases by 5× to 15× relative to standard inference
- Where explicit benchmark data exists (for example, high-reasoning adaptive routing), it is used directly
- Where it does not, reasoning-related estimates are extrapolated based on:
  - Increased token generation
  - Longer GPU occupancy
  - Reported architectural behavior
All such rows are clearly labeled as educated guesses.
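Where no benchmark exists for a reasoning mode, the extrapolation amounts to scaling an anchored baseline by a multiplier range. The following is a minimal sketch, assuming the 5× to 15× range quoted above applies to the model in question, which the benchmarks support only in some cases. The function name, dictionary keys, and baseline value are hypothetical.

```python
def reasoning_estimate(baseline_g_per_million: float,
                       low_mult: float = 5.0, high_mult: float = 15.0) -> dict:
    """Scale a standard-inference estimate into a labeled reasoning-mode range."""
    if low_mult > high_mult:
        raise ValueError("low_mult must not exceed high_mult")
    return {
        "low_g_per_1m": baseline_g_per_million * low_mult,
        "high_g_per_1m": baseline_g_per_million * high_mult,
        "label": "educated guess",  # extrapolated, not benchmark-anchored
    }

# Hypothetical baseline of 100 g CO2e per 1M tokens for standard inference.
print(reasoning_estimate(100.0))
```

Carrying the label alongside the numbers keeps extrapolated rows distinguishable from anchored ones wherever the values are reused.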
Educated guesses and extrapolations
For models without public per-query energy data (for example, GPT-5.x and some Gemini tiers), estimates are derived using:
- Anchored models of similar size and capability as reference points
- Expected efficiency gains from newer architectures
- Documented presence or absence of reasoning controls
- Public statements about model positioning (e.g., “nano”, “flash”, “ultra”)
These estimates are intentionally conservative and presented as ranges where uncertainty is high.
Key limitations and uncertainties
1. Token variability
- Tokens are not equivalent across providers or languages
- Actual emissions per token vary with context length and output verbosity
2. Infrastructure variance
- The same model running in different data centers can differ by 5–10× in CO₂e emissions
- Reported values assume average conditions
3. Batching and caching
- Real-world systems often batch requests, reducing per-token energy
- These effects are not included to preserve comparability
4. Lack of standardized reporting
- There is no industry-wide standard for disclosing per-token energy or emissions
- All cross-provider comparisons should be interpreted cautiously
How these estimates should be used
Appropriate uses:
- Comparing relative environmental impact across models
- Informing product defaults (e.g., reasoning on vs. off)
- High-level sustainability reporting
- Scenario analysis and architectural decision-making
Inappropriate uses:
- Claiming exact emissions for billing or regulatory compliance
- Comparing against real-time grid emissions
- Auditing individual provider operations