Disclaimer and Methodology for CO₂ Emission Estimates of LLM Inference

Tool: Free LLM Co2 Emission Calculator

Purpose of this document

This document explains how the CO₂ emission estimates for large language model (LLM) inference were derived, what sources were used, which assumptions were required, and where uncertainty remains. The goal is comparability and transparency, not exact accounting of any provider’s proprietary infrastructure.

These estimates should be interpreted as order-of-magnitude indicators suitable for comparison, decision-making, and sustainability analysis, rather than precise operational measurements.

Scope and boundaries

Included

Inference-time energy use and associated CO₂-equivalent emissions
Normalization to grams of CO₂e per 1,000,000 tokens
Comparison across models and providers under a unified framework

Excluded

Training-time emissions
Embodied carbon of hardware manufacturing
Networking, storage, logging, or end-user device energy
Provider-specific optimizations such as aggressive batching, caching, or speculative decoding

Primary data sources

The estimates are anchored in peer-reviewed or openly published benchmarks, supplemented by documented disclosures from model providers.

1. Inference energy benchmarking

The main quantitative anchor is:

“How Hungry Is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference” (Jegham et al., 2025)

This work estimates electricity consumption per query for a wide range of frontier and open models, combining public API performance data with inferred hardware configurations. It reports:

Energy consumption (Wh) per inference request
Defined prompt sizes
Carbon intensity factors used for conversion to CO₂e

Where a model appears in this benchmark, its values are labeled “anchored”.

2. Carbon intensity factors (CIFs)

Carbon emissions were derived by multiplying electricity consumption by region-appropriate carbon intensity factors, consistent with the benchmark and public sustainability disclosures:

Azure / OpenAI hosted workloads: ~0.35 kg CO₂e per kWh
AWS hosted workloads: ~0.287 kg CO₂e per kWh
Google data centers: assumed ~0.20–0.25 kg CO₂e per kWh, based on Google’s public claims of low-carbon energy sourcing

These values represent average operational intensity, not real-time marginal grid emissions.
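The factors above can be kept in a small lookup so that single values and ranges are handled the same way. The following is a minimal sketch, not the calculator's actual implementation: the dictionary keys, the function name, and the 2.0 Wh example input are assumptions chosen for illustration.

```python
# Approximate carbon intensity factors from the list above, in kg CO2e per kWh.
# Each entry is (low, high) so single values and ranges share one shape.
CIF_KG_PER_KWH = {
    "azure": (0.35, 0.35),
    "aws": (0.287, 0.287),
    "google": (0.20, 0.25),  # assumed range, see the Google entry above
}


def emissions_g_range(energy_wh: float, provider: str) -> tuple[float, float]:
    """Return (low, high) grams of CO2e for an energy use given in Wh."""
    low_cif, high_cif = CIF_KG_PER_KWH[provider]
    # Wh -> kWh divides by 1000 and kg -> g multiplies by 1000,
    # so Wh x (kg CO2e / kWh) is already numerically in grams.
    return energy_wh * low_cif, energy_wh * high_cif


# Hypothetical 2.0 Wh query, used only to illustrate the arithmetic.
low, high = emissions_g_range(2.0, "google")
print(f"{low:.2f} to {high:.2f} g CO2e")
```

Storing a range even for single-valued providers keeps downstream code uniform; a provider with one published factor simply has identical bounds.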

3. Vendor sustainability disclosures

For Gemini models, Google reports CO₂ emissions per “median text prompt” in public environmental reports. However:

Token counts per “median prompt” are not disclosed
Hardware configuration and batching details are not provided

As a result, Gemini per-token estimates are necessarily derived and presented as ranges.
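One way such a range could be derived is to divide a reported per-prompt figure by a low and a high assumption for tokens per prompt. The sketch below illustrates that idea only. The function name, the token bounds, and the 0.05 g example figure are hypothetical placeholders, not values disclosed by Google or used by the calculator.

```python
def per_million_tokens_range(
    g_per_prompt: float,
    min_tokens: int,
    max_tokens: int,
) -> tuple[float, float]:
    """Bound g CO2e per 1M tokens when tokens per prompt are known only as a range."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    # More tokens per prompt spreads the same emissions thinner, so the
    # high token count yields the low bound and the low count the high bound.
    low = g_per_prompt * 1_000_000 / max_tokens
    high = g_per_prompt * 1_000_000 / min_tokens
    return low, high


# Placeholder inputs for illustration only: 0.05 g CO2e per prompt and
# an assumed 500 to 2,000 tokens per "median prompt".
low, high = per_million_tokens_range(0.05, 500, 2_000)
print(f"{low:.0f} to {high:.0f} g CO2e per 1M tokens")
```

The width of the resulting range is driven entirely by the assumed token bounds, which is why the undisclosed token count dominates the uncertainty.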

Normalization methodology

To make models comparable, all estimates are normalized to:

grams of CO₂e per 1,000,000 tokens

Reference workload

A standard “long prompt” definition from the benchmark is used:

10,000 input tokens
1,500 output tokens
Total: 11,500 tokens per query

Conversion steps

Start with measured or estimated Wh per query
Convert electricity to emissions using a carbon intensity factor:
g CO₂e = (Wh / 1000) × (kg CO₂e / kWh) × 1000 = Wh × (kg CO₂e / kWh)
Normalize to 1,000,000 tokens:
g CO₂e / 1M tokens = g CO₂e per query × (1,000,000 / 11,500)

This approach preserves proportional differences between models and expresses them in a unit that does not depend on prompt size, although the underlying energy figures still reflect the reference workload.
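The conversion steps can be sketched in a few lines of Python. This is an illustrative sketch under the reference workload defined above, not the tool's source code. The function name and the 4.6 Wh example input are assumptions. Note the unit handling: 1 Wh is 0.001 kWh and 1 kg is 1000 g, so the two factors of 1000 cancel.

```python
INPUT_TOKENS = 10_000
OUTPUT_TOKENS = 1_500
TOKENS_PER_QUERY = INPUT_TOKENS + OUTPUT_TOKENS  # 11,500 tokens per query


def g_co2e_per_million_tokens(energy_wh_per_query: float,
                              cif_kg_per_kwh: float) -> float:
    """Normalize per-query energy (Wh) to grams of CO2e per 1,000,000 tokens."""
    # Wh -> kWh, apply the carbon intensity factor (kg), then kg -> g.
    g_per_query = (energy_wh_per_query / 1000) * cif_kg_per_kwh * 1000
    # Scale from one 11,500-token query to 1,000,000 tokens.
    return g_per_query * (1_000_000 / TOKENS_PER_QUERY)


# Hypothetical 4.6 Wh query at the ~0.35 kg CO2e/kWh Azure factor:
# 4.6 x 0.35 = 1.61 g per query, or about 140 g per 1M tokens.
print(round(g_co2e_per_million_tokens(4.6, 0.35), 1))
```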

Treatment of reasoning modes

Models with explicit or implicit reasoning modes require special handling.

Benchmarks show that reasoning-heavy inference consumes significantly more compute per token
In some cases, energy usage increases by 5× to 15× relative to standard inference
Where explicit benchmark data exists (for example, high-reasoning adaptive routing), it is used directly
Where it does not, reasoning-related estimates are extrapolated based on:
- Increased token generation
- Longer GPU occupancy
- Reported architectural behavior

All such rows are clearly labeled as educated guesses.
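As a rough illustration, an extrapolated reasoning estimate can be expressed as a labeled range built from a non-reasoning baseline. The sketch below makes a simplifying assumption: it applies the 5× to 15× energy increase mentioned above as a flat multiplier, whereas the actual extrapolation also weighs token generation, GPU occupancy, and architectural behavior. The class and function names and the 140 g baseline are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    low_g_per_million: float
    high_g_per_million: float
    label: str  # "anchored" or "educated guess"


def extrapolate_reasoning(baseline_g_per_million: float,
                          low_mult: float = 5.0,
                          high_mult: float = 15.0) -> Estimate:
    """Scale a standard-inference baseline into a reasoning-mode range.

    Assumption: a flat multiplier stands in for the combined effect of
    extra token generation and longer GPU occupancy.
    """
    if baseline_g_per_million < 0 or not 0 < low_mult <= high_mult:
        raise ValueError("baseline must be >= 0 and 0 < low_mult <= high_mult")
    return Estimate(
        low_g_per_million=baseline_g_per_million * low_mult,
        high_g_per_million=baseline_g_per_million * high_mult,
        label="educated guess",  # extrapolated rows are never "anchored"
    )


# Hypothetical baseline of 140 g CO2e per 1M tokens.
est = extrapolate_reasoning(140.0)
print(est.low_g_per_million, est.high_g_per_million, est.label)
```

Carrying the label alongside the numbers makes it harder for an extrapolated value to be mistaken for a benchmarked one downstream.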

Educated guesses and extrapolations

For models without public per-query energy data (for example, GPT-5.x and some Gemini tiers), estimates are derived using:

Anchored models of similar size and capability as reference points
Expected efficiency gains from newer architectures
Documented presence or absence of reasoning controls
Public statements about model positioning (e.g., “nano”, “flash”, “ultra”)

These estimates are intentionally conservative and presented as ranges where uncertainty is high.

Key limitations and uncertainties

1. Token variability

Tokens are not equivalent across providers or languages, because tokenizers split the same text into different numbers of tokens
Actual emissions per token vary with context length and output verbosity

2. Infrastructure variance

The same model running in different data centers can vary by 5–10× in CO₂ emissions
Reported values assume average conditions

3. Batching and caching

Real-world systems often batch requests, reducing per-token energy
These effects are not included to preserve comparability

4. Lack of standardized reporting

There is no industry-wide standard for disclosing per-token energy or emissions
All cross-provider comparisons should be interpreted cautiously

How these estimates should be used

Appropriate uses:

Comparing relative environmental impact across models
Informing product defaults (e.g., reasoning on vs. off)
High-level sustainability reporting
Scenario analysis and architectural decision-making

Inappropriate uses:

Claiming exact emissions for billing or regulatory compliance
Comparing against real-time grid emissions
Auditing individual provider operations