This repository contains the code behind the "Tokenomics from first principles" project, which models and analyzes LLM inference performance with a focus on throughput prediction and bottleneck analysis.
This project provides tools to:
- Model theoretical throughput of LLM inference
- Compare theoretical vs. actual performance
- Analyze compute vs. memory bottlenecks
- Study how context length impacts performance
- Examine batch size scaling effects
Install the dependencies:
pip install -r requirements.txt
The codebase is organized into the following modules (a rough sketch of the two parameter modules follows the list):
- `model_params.py`: Parameters for LLM model architectures
- `hardware_params.py`: Parameters for hardware specifications
- `tokenomics_model.py`: Basic throughput modeling
- `advanced_tokenomics_model.py`: Advanced model with realistic memory/attention effects
- `visualizations.py`: Plotting functions
- `main.py`: Command-line interface
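As a rough mental model only (the real field names in `model_params.py` and `hardware_params.py` may differ), the two parameter modules can be thought of as plain records of architecture and hardware specs:

```python
from dataclasses import dataclass

# Hypothetical field names for illustration; see model_params.py and
# hardware_params.py for the actual definitions.
@dataclass
class ModelParams:
    name: str                  # e.g. "llama-3-3-70b"
    num_params_billion: float
    num_layers: int
    hidden_size: int
    num_kv_heads: int          # drives KV cache size

@dataclass
class HardwareParams:
    name: str
    peak_tflops: float         # peak dense FP16/BF16 compute
    mem_bandwidth_gb_s: float  # HBM bandwidth
    mem_capacity_gb: float
```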
Supported models:
- llama-3-1-8b: Llama 3.1 8B parameter model
- llama-3-3-70b: Llama 3.3 70B parameter model
To generate a theoretical throughput estimate:
python main.py
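A useful first-order intuition for the theoretical estimate is a roofline-style calculation: at batch size 1, decoding is usually memory-bandwidth bound because every generated token must stream all of the weights from GPU memory. A minimal sketch, assuming FP16 weights and a hypothetical ~3.35 TB/s of HBM bandwidth (not necessarily the values in `hardware_params.py`):

```python
# Back-of-the-envelope sketch only; the project's own model is more detailed.
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float = 2.0,        # FP16/BF16 weights (assumed)
                             mem_bandwidth_gb_s: float = 3350.0,  # hypothetical HBM bandwidth
                             num_gpus: int = 1) -> float:
    # Each decode step reads every weight once, so the step time is bounded
    # by (weight bytes) / (aggregate memory bandwidth).
    weight_gb = params_billion * bytes_per_param
    return num_gpus * mem_bandwidth_gb_s / weight_gb

print(decode_tokens_per_second(8.0))  # roughly 209 tokens/s for an 8B model at batch size 1
```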
Compare theoretical predictions with actual benchmark results:
python main.py --benchmark ../experiments/llama_33_70b_2000in_300out.json
The advanced model incorporates realistic factors like memory fragmentation, attention optimizations, and more:
python main.py --benchmark ../experiments/llama_33_70b_16000in_1000out.json --advanced
For Llama 3.1 8B on a single GPU:
python main.py --model llama-3-1-8b --gpus 1 -b ../experiments/llama_31_8b_2000in_300out.json --advanced
For long context lengths:
python main.py --model llama-3-1-8b --gpus 1 -b ../experiments/llama_31_8b_16000in_1000out.json --advanced
Command-line options:
- `--benchmark`, `-b`: Path to benchmark data JSON file
- `--model`, `-m`: Model type (llama-3-1-8b or llama-3-3-70b)
- `--input-tokens`, `-i`: Number of input tokens (default: 2035)
- `--output-tokens`, `-o`: Number of output tokens (default: 300)
- `--gpus`, `-g`: Number of GPUs (default: 4)
- `--advanced`, `-a`: Use advanced model with realistic factors
- `--context-analysis`, `-c`: Generate context length scaling analysis (see the example below)
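For example, the options above can be combined to run the advanced model and generate the context length scaling analysis:
python main.py --model llama-3-3-70b --gpus 4 --advanced --context-analysis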
The benchmark files should be JSON with the following structure:
{
  "metadata": {
    "model": "llama-3-3-70b",
    "batch_sizes": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
  },
  "results": {
    "1": {
      "avg_input_tokens": 2035,
      "avg_output_tokens": 300,
      "tokens_per_second_in_batch": 102.9,
      "avg_tokens_per_second": 102.9
    },
    "2": {
      "tokens_per_second_in_batch": 182.5,
      "avg_tokens_per_second": 91.25
    },
    ...
  }
}
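As a quick sanity check, a file in this format can be inspected with a short standalone script (not part of the project code):

```python
import json

# Print total and per-request throughput for every measured batch size.
with open("../experiments/llama_33_70b_2000in_300out.json") as f:
    data = json.load(f)

for batch_size in data["metadata"]["batch_sizes"]:
    result = data["results"].get(str(batch_size))
    if result is None:
        continue
    print(batch_size,
          result["tokens_per_second_in_batch"],  # total throughput of the whole batch
          result["avg_tokens_per_second"])       # throughput seen by a single request
```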
The tool produces several plots. The throughput plots show how many tokens per second the model can generate at different batch sizes:
- Total Throughput: Combined tokens/second across all requests in the batch
- Per-Request Throughput: Tokens/second seen by each individual request
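The two are related through the batch size: in the sample data above, batch size 2 delivers 182.5 tokens/second in total, which works out to about 91.25 tokens/second for each request.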
The bottleneck analysis plot shows whether compute or memory bandwidth is the limiting factor at each batch size (a sketch of how such a classification can be made follows the list):
- Blue markers: Compute-bound regions
- Red markers: Memory-bound regions
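A minimal sketch of that classification, assuming a hypothetical accelerator with ~989 TFLOP/s of dense FP16 compute and ~3.35 TB/s of memory bandwidth (illustrative numbers, not the project's `hardware_params.py` values):

```python
# Illustrative only: compare the time spent on matmul FLOPs with the time
# spent streaming the weights for a single decode step.
def is_compute_bound(batch_size: int,
                     params_billion: float,
                     peak_tflops: float = 989.0,        # assumed peak dense FP16 compute
                     mem_bandwidth_tb_s: float = 3.35,  # assumed HBM bandwidth
                     bytes_per_param: float = 2.0) -> bool:
    # Decoding performs roughly 2 FLOPs per parameter per token, while the
    # weights are read from memory once per step regardless of batch size.
    compute_time = 2 * params_billion * 1e9 * batch_size / (peak_tflops * 1e12)
    memory_time = params_billion * 1e9 * bytes_per_param / (mem_bandwidth_tb_s * 1e12)
    return compute_time > memory_time

for bs in [1, 8, 64, 256, 512]:
    print(bs, "compute-bound" if is_compute_bound(bs, 70.0) else "memory-bound")
```

Larger batches amortize the fixed weight traffic over more generated tokens, which is why the plots typically shift from memory-bound to compute-bound as batch size grows.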
When the advanced model is run with --context-analysis, the context length analysis visualizes:
- How throughput decreases with context length
- Memory bandwidth efficiency changes
- KV cache penalties
- Memory fragmentation effects