Data-Driven Analysis

AI Insights

Trends, patterns, and analysis derived from 207 models, 46 papers, and 24 hardware milestones.

Insight #1

The Cambrian Explosion

From 2 models in 2018 to 79 in 2024 — a 40× increase

With 207 models now tracked, the AI landscape went from a handful of research projects to an industry producing 79 new models per year. 2023 was the inflection point — the year ChatGPT's success triggered an industry-wide arms race. Every major tech company, from Apple to Amazon, scrambled to release their own models. 63 organizations across 10+ countries are now competing.

Insight #2

The Open Source Revolution

Open models went from 0% in 2021 to 59% in 2024

[Chart legend: Closed Models, Open Models]

In 2021, every major model was closed and proprietary. By 2024, open-weight models (weights are publicly released, though training data and code may not be, which enables fine-tuning but not full reproduction) and open-source models made up the majority of releases. Meta's LLaMA leak in 2023 was the spark: once researchers could study and fine-tune frontier-class models, the community produced an explosion of derivatives. This democratization may be the most consequential trend in AI history.

Insight #3

The Deepest Family Trees

GPT lineage reaches 14 generations deep

OpenAI's GPT family is the deepest evolutionary tree in AI, with 14 generations from GPT-1 to GPT-5.5. This isn't just version numbering — each generation represents genuine architectural or training breakthroughs. Anthropic's Claude lineage, while shorter at 10 generations, shows the fastest iteration pace.

Insight #4

The MoE Takeover

48 models now use Mixture-of-Experts architecture

Mixture-of-Experts (MoE), an architecture in which only a fraction of the model's parameters are active for each input, took its modern sparsely gated form in a 2017 paper (Shazeer et al.) and was mostly ignored for years. Mixtral 8x7B's viral success in late 2023 showed that MoE could deliver GPT-3.5-class quality at a fraction of the inference cost (the cost of running a trained model, as opposed to training it). Within 12 months, MoE became a common choice for the largest models, adopted by DeepSeek V2/V3, Grok, DBRX, Arctic, and Qwen.
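The "only a fraction of parameters activate per input" idea can be illustrated with a toy top-k router. This is a minimal sketch, not any lab's implementation: the experts are hypothetical scalar functions, the router logits are made-up numbers, and choosing 2 of 8 experts with a softmax over the selected logits is an assumption made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only.

    The remaining experts are never called, which is where the
    inference savings come from.
    """
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy setup: 8 experts, each just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]  # made-up scores

selected = route_top_k(router_logits, k=2)
y = moe_forward(3.0, experts, router_logits, k=2)
print("selected experts:", [i for i, _ in selected])
print("output:", round(y, 3))
```

In a real model each expert is a feed-forward network and the router logits are computed from the token's hidden state, but the control flow is the same: score, select, and mix only the chosen experts.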

Insight #5

The Context Window Explosion

Context windows grew from 512 to 10M+ tokens — a 19,000× increase

In 2018, GPT-1 had a 512-token context window (the maximum number of tokens a model can process in a single input). By 2025, Gemini had been demonstrated at up to 10 million tokens in research testing, enough to process entire codebases or dozens of novels at once. This wasn't just incremental improvement; it required fundamental innovations like Flash Attention (an IO-aware exact attention algorithm that is 2-4× faster because it minimizes GPU memory reads and writes) and rotary position embeddings. The practical impact: AI went from answering single questions to analyzing entire projects.
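The growth figure, and the reason long contexts needed new attention algorithms, can be checked with a little arithmetic. The sketch below assumes a naive implementation that stores one full n × n attention score matrix per head at 2 bytes per score; the function names are hypothetical and the byte size is an illustrative assumption.

```python
def growth_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times larger the new context window is."""
    return new_tokens / old_tokens

def naive_attention_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to materialize one n x n attention score matrix (one head).

    Assumes 2 bytes per score (16-bit floats), an illustrative choice.
    """
    return n_tokens * n_tokens * bytes_per_score

factor = growth_factor(512, 10_000_000)
print(f"context growth: {factor:,.0f}x")

for n in (512, 10_000_000):
    size = naive_attention_matrix_bytes(n)
    print(f"{n:>10,} tokens -> {size:,} bytes per head")
```

Because the score matrix grows with the square of the sequence length, a 19,000× longer context means roughly a 380-million-fold larger matrix, which is why approaches that avoid materializing it became necessary.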

Insight #6

63 Companies, One Race

Models from 63 organizations across 10+ countries

[Chart labels: OpenAI, Google DeepMind, Anthropic]

Insight #7

Innovation Pipeline: Paper to Product

Average ~2 years from paper to widespread model adoption

Paper (year) → Adoption (year): Lag
Attention Is All You Need (2017) → GPT-1 (2018): 1 year
RLHF (2017) → InstructGPT (2022): 5 years
Chain-of-Thought (2022) → o1 (2024): 2 years
MoE (Shazeer et al.) (2017) → Mixtral 8x7B (2023): 6 years
LoRA (2021) → Widespread use (2023): 2 years
Flash Attention (2022) → Default everywhere (2023): 1 year
RoPE (2021) → Default everywhere (2023): ~2 years
GQA (2023) → LLaMA 2 (2023): <1 year
Mamba (SSM) (2023) → Jamba (2024): ~4 months

The pipeline from research paper to production model has dramatically accelerated. Early innovations like RLHF (Reinforcement Learning from Human Feedback, which trains models to align with human preferences by having humans rank outputs) took 5+ years to go mainstream. Now, architectures like Mamba go from paper to production in months. This compression is both exciting (faster progress) and concerning (less time for safety evaluation).
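The "~2 years" average can be reproduced from the nine lags listed above. This is a small sketch under two stated assumptions: "<1 year" is treated as 0.5 years and "~4 months" as 4/12 of a year; the dictionary keys are just labels.

```python
from statistics import mean, median

# Paper-to-adoption lags in years, taken from the list above.
lags_years = {
    "Attention Is All You Need -> GPT-1": 1.0,
    "RLHF -> InstructGPT": 5.0,
    "Chain-of-Thought -> o1": 2.0,
    "MoE -> Mixtral 8x7B": 6.0,
    "LoRA -> widespread use": 2.0,
    "Flash Attention -> default everywhere": 1.0,
    "RoPE -> default everywhere": 2.0,
    "GQA -> LLaMA 2": 0.5,       # assumption: "<1 year" counted as 0.5
    "Mamba -> Jamba": 4 / 12,    # "~4 months"
}

mean_lag = mean(lags_years.values())
median_lag = median(lags_years.values())
print(f"mean lag:   {mean_lag:.1f} years")
print(f"median lag: {median_lag:.1f} years")
```

The mean is pulled up by the two long lags (RLHF and MoE); the median gives a similar figure here, so the "~2 years" summary holds either way.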

Insight #8

The Modality Matrix

Text dropped from 100% to just 20% of new models

[Chart legend: text, multimodal, code, image, audio, video]

Early AI was text-only. Now, over half of new models handle multiple modalities — images, audio, video, and code. The trend is unmistakable: the future of AI is models that can see, hear, speak, code, and reason simultaneously. The arrival of models like GPT-4o (text+image+audio) and Gemini 2.0 (text+image+video) marks the beginning of truly general-purpose AI.

Insight #9

The Reasoning Revolution

45 models now claim reasoning capabilities

Reasoning (structured step-by-step problem solving, often using chain-of-thought or tree-of-thought approaches) has exploded from a niche capability (chain-of-thought prompting in 2022, where the model 'thinks out loud' step by step before giving a final answer) to the most sought-after feature in AI. OpenAI's o1 proved that 'thinking longer' (test-time compute) could dramatically improve performance on hard problems. Now every major lab, including Anthropic, Google, and DeepSeek, is racing to build models that don't just pattern-match but actually reason step by step.

Insight #10

The China Factor

15 Chinese models tracked, up from 0 in 2022

[Chart legend: United States, China]

China has emerged as the world's second AI superpower. Companies like DeepSeek proved that innovative architecture (MLA, multi-head latent attention, which compresses key-value caches into a low-rank latent space) can compete with brute-force scaling. Moonshot AI's Kimi K2 (1 trillion parameters, open-weight) and MiniMax-01 (4 million token context) show that Chinese labs are no longer following; they're leading on specific frontiers. The US-China AI race is now the defining dynamic of the industry, with implications for regulation, export controls, and the future of open research.

Insight #11

The Efficiency Revolution

66 models use efficiency innovations (distillation, MoE, parameter sharing)

The AI industry hit a wall: training ever-larger models became prohibitively expensive. The response was an efficiency revolution. DistilBERT showed you could compress BERT to 60% of its size while keeping 97% of its capability. ALBERT proved parameter sharing could slash model size 18×. Mixture-of-Experts architectures (Mixtral, DeepSeek-V2) activate only a fraction of parameters per query. LoRA (Low-Rank Adaptation), which adds small trainable matrices to frozen model weights, made fine-tuning (adapting a pre-trained model to a specific task or domain by training on additional data) accessible on consumer GPUs. The Chinchilla paper proved most models were undertrained relative to their size. The new mantra: smaller, smarter, cheaper.
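Why small trainable matrices make fine-tuning cheap comes down to parameter counts. The sketch below compares updating one full weight matrix with training a low-rank pair of matrices alongside it; the 4096 × 4096 layer size and rank 8 are illustrative assumptions, not figures from the tracked models, and the function names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in);
    the original weight matrix stays frozen.
    """
    return rank * (d_out + d_in)

# Illustrative assumption: one 4096 x 4096 projection, adapter rank 8.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tune: {full:,} trainable parameters")
print(f"low-rank pair:  {lora:,} trainable parameters")
print(f"reduction:      {full // lora}x fewer")
```

Fewer trainable parameters means far less memory for gradients and optimizer state, which is the main reason this style of fine-tuning fits on consumer hardware.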

Insight #12

The Safety Imperative

2 dedicated safety models + 7 RLHF-aligned models

Safety went from an academic afterthought to an industry imperative. The RLHF paper (2017) took 5 years to become standard practice. Constitutional AI gave Anthropic a principled framework for training models to critique and revise their own outputs against a written set of principles. But the real shift came when Meta released Llama Guard, a dedicated safety classifier that any developer could use. Google followed with ShieldGemma. Meanwhile, the open-source community pushed back with 'abliteration' techniques (removing safety guardrails from a model through targeted fine-tuning or weight manipulation, controversial but popular in the open-source community), raising fundamental questions: who decides what's safe, and should guardrails be removable?

Insight #13

From Text to Everything

30 specialist models across 6 modalities and 8 specialized verticals

[Chart labels: Image Gen, Coding, Music, Speech, Embedding, Safety, Robotics]

AI has fragmented from one thing (text prediction) into a dozen specialized disciplines. Embedding models (text-embedding-3, BGE) power search engines and RAG pipelines (Retrieval-Augmented Generation: combining a language model with a search or retrieval system to ground responses in external knowledge). Safety models (Llama Guard, ShieldGemma) act as AI immune systems. Robotics models (RT-2, PaLM-E) bridge language and physical action. Music generators (Suno, Udio), speech synthesizers (VALL-E, ElevenLabs), and coding agents (Devin, Cursor, SWE-Agent) each represent billion-dollar verticals. The 'foundation model' era is giving way to an era of specialized, deeply integrated AI products.

Insight #14

What's Next: Emerging Patterns

Based on the trajectories we've tracked, here are the patterns most likely to define AI's next chapter.

🤖

Agent-first

26 models already have agentic capabilities (they can autonomously plan, execute multi-step tasks, use tools, and self-correct without human intervention). Expect this to become the default interaction model.

⚡

Efficiency over scale

Parameter growth is plateauing; efficiency (MoE, Mamba, MLA) is the new frontier. Smaller, smarter models are winning.

🔓

Open is winning… for now

Open-source share peaked at 62% but dropped to 35% in 2026 as frontier labs restrict access to their most powerful models.

🖥️

Hardware is the bottleneck

The 24 hardware milestones show compute doubling every ~18 months, but model demands are growing faster.

🎯

Specialization

Coding tools, search engines, music generators — AI is fragmenting into specialized, deeply integrated products that do one thing exceptionally well.

🌏

The US-China race

The US-China AI race will intensify: Chinese open-weight models (DeepSeek, Kimi K2) are already matching Western closed models on key benchmarks.

🏥

Vertical foundation models

Every industry vertical will have its own foundation model — legal AI, medical AI, financial AI — each trained on domain-specific data at scale.

Analysis based on 207 models, 46 papers, and 24 hardware milestones tracked in the LLM Tree of Life.