1. Overview
This paper introduces Gemma, a family of lightweight open large language models derived from the research and technology used in Gemini.
The goal is to provide efficient, high-performing, and safe open models that can be widely deployed across different computational settings.
2. Problem Definition
Existing large language models face several issues:
- High computational cost limits accessibility
- Open models often underperform compared to closed models
- Safety and alignment remain challenging
- Efficient deployment is difficult at scale
The paper aims to build smaller models that maintain strong performance while improving accessibility and safety.
3. Key Idea
The main idea is to transfer proven techniques from Gemini into smaller open models.
This includes architecture design, large-scale training, data processing, and alignment methods such as instruction tuning and reinforcement learning.
4. Method
- Model Architecture
  - Transformer decoder-based architecture
  - Uses improvements such as:
    - Multi-Query Attention (2B model; the 7B model uses multi-head attention)
    - RoPE positional embeddings
    - GeGLU activation
    - RMSNorm
  - Context length: 8192 tokens
  - Model sizes: 2B and 7B parameters
- Tokenization
  - Based on SentencePiece (a subset of the Gemini tokenizer)
  - Splits numbers into individual digits
  - Preserves whitespace
  - Uses byte-level encoding for unknown tokens
  - Vocabulary size: 256k tokens
- Pretraining
  - Trained on large-scale datasets:
    - 3T tokens (2B model)
    - 6T tokens (7B model)
  - Data sources include web text, mathematics, and code
  - Applies filtering for quality and safety
  - Uses staged training with increasing data quality
- Instruction Tuning
  Two-stage alignment process:
  - Supervised Fine-Tuning (SFT):
    - Uses synthetic and human-generated prompt–response data
  - Reinforcement Learning from Human Feedback (RLHF):
    - Trains a reward model based on human preferences
    - Optimizes the fine-tuned model against that reward model
  Both stages improve performance on benchmarks and human evaluations.
- Pipeline Summary
  text → tokenizer → transformer → pretraining → SFT → RLHF → final model
5. Contributions
- Introduces an efficient open LLM family based on Gemini
- Achieves strong performance relative to model size
- Provides both pretrained and instruction-tuned models
- Emphasizes safety and responsible deployment
6. Strengths
- Strong performance for small model sizes
- Efficient deployment on limited hardware
- Well-designed training and alignment pipeline
- Balanced focus on performance and safety
7. Limitations
- Text-only model (no multimodal capability)
- Performance still below frontier large-scale models
- Limited multilingual optimization
- Requires additional fine-tuning for specific tasks
8. Insights and Future Directions
- Alignment methods (SFT + RLHF) are critical for practical performance
- Data quality and filtering significantly impact model behavior
- Efficient scaling strategies can produce strong small models
- Future work can explore multimodal extensions and improved reasoning
9. One-line Summary
Gemma is a family of efficient open language models that transfer Gemini’s architecture and training strategies into smaller models while maintaining strong performance and safety.
Notes
The authors release both pre-trained and fine-tuned checkpoints, as well as an open-source codebase for inference and serving.
Gemma comes in two sizes: a 7 billion parameter model for efficient deployment and development on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications.
The Gemma model architecture is based on the transformer decoder.
Models are trained on a context length of 8192 tokens.
| Parameters | 2B | 7B |
|---|---|---|
| d_model | 2048 | 3072 |
| Layers | 18 | 28 |
| Feedforward hidden dims | 32768 | 49152 |
| Num heads | 8 | 16 |
| Num KV heads | 1 | 16 |
| Head size | 256 | 256 |
| Vocab size | 256128 | 256128 |
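As a rough cross-check, the table values can be turned into parameter counts. The sketch below is only an illustration: `GemmaConfig`, `embedding_params`, and `non_embedding_params` are hypothetical names, and the layer breakdown is an assumption rather than something stated in these notes (no bias terms, one RMSNorm gain vector per sub-layer plus a final one, and a feedforward block whose listed hidden dims cover the gate and up projections together).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmaConfig:
    """Values copied from the table above; the class itself is hypothetical."""
    d_model: int
    layers: int
    ff_hidden: int       # "Feedforward hidden dims" row
    num_heads: int
    num_kv_heads: int
    head_size: int
    vocab_size: int


GEMMA_2B = GemmaConfig(2048, 18, 32768, 8, 1, 256, 256128)
GEMMA_7B = GemmaConfig(3072, 28, 49152, 16, 16, 256, 256128)


def embedding_params(cfg):
    # One matrix, counted once because input and output embeddings are shared.
    return cfg.vocab_size * cfg.d_model


def non_embedding_params(cfg):
    # Attention projections: queries, keys + values, and the output projection.
    q = cfg.d_model * cfg.num_heads * cfg.head_size
    kv = 2 * cfg.d_model * cfg.num_kv_heads * cfg.head_size
    o = cfg.num_heads * cfg.head_size * cfg.d_model
    # Assumption: ff_hidden counts gate + up together, so the down
    # projection reads ff_hidden // 2 features.
    ffn = cfg.d_model * cfg.ff_hidden + (cfg.ff_hidden // 2) * cfg.d_model
    # Assumption: one RMSNorm gain vector before attention, one before the FFN.
    norms = 2 * cfg.d_model
    # The trailing d_model term is the assumed final normalization layer.
    return cfg.layers * (q + kv + o + ffn + norms) + cfg.d_model


for name, cfg in [("2B", GEMMA_2B), ("7B", GEMMA_7B)]:
    print(name, embedding_params(cfg), non_embedding_params(cfg))
```

Under these assumptions the 2B configuration comes to roughly 0.52B embedding and 1.98B non-embedding parameters, and the 7B configuration to roughly 0.79B and 7.75B. The large 256k vocabulary is why the embedding matrix is a sizeable share of the smaller model.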
Multi-Query Attention
The 7B model uses multi-head attention, while the 2B checkpoints use multi-query attention (with num_kv_heads = 1).
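The difference between the two attention variants can be sketched in plain Python. This is a minimal illustration, not the Gemma implementation: the `attention` function and its list-of-lists layout are hypothetical, and the query, key, and value projections are assumed to have been applied already. With one key/value head every query head reads the same keys and values (multi-query attention); with as many key/value heads as query heads it reduces to multi-head attention.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q_heads, k_heads, v_heads):
    """Causal attention in which query heads share key/value heads.

    q_heads:          [num_heads][seq][head_size]
    k_heads, v_heads: [num_kv_heads][seq][head_size]
    """
    num_heads, num_kv_heads = len(q_heads), len(k_heads)
    assert num_heads % num_kv_heads == 0
    group = num_heads // num_kv_heads      # query heads per key/value head
    out = []
    for h, q in enumerate(q_heads):
        k, v = k_heads[h // group], v_heads[h // group]
        scale = 1.0 / math.sqrt(len(q[0]))
        head_out = []
        for t, q_t in enumerate(q):
            # Causal mask: position t only attends to positions 0..t.
            scores = [scale * sum(a * b for a, b in zip(q_t, k[s]))
                      for s in range(t + 1)]
            weights = softmax(scores)
            head_out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                             for d in range(len(v[0]))])
        out.append(head_out)
    return out


# Two query heads, one shared key/value head (num_kv_heads = 1).
q = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
k = [[[1.0, 0.0], [0.0, 1.0]]]
v = [[[1.0, 2.0], [3.0, 4.0]]]
print(attention(q, k, v))
```

The practical benefit of sharing is that only `num_kv_heads` sets of keys and values need to be stored per token during generation, which matters most for the smaller model aimed at CPU and on-device use.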
RoPE Embeddings
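Rotary positional embeddings are used in each layer in place of absolute positional embeddings; embeddings are also shared across inputs and outputs to reduce model size.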
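To make the rotary idea concrete, here is a minimal sketch of applying a rotary embedding to a single query or key vector. The `rope` function is a hypothetical helper; the frequency base of 10000 and the pairing of element `i` with element `i + d/2` are assumptions, since these notes do not specify either.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate pairs of features of x by angles that grow with position."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        # Each pair gets its own frequency; higher pairs rotate more slowly.
        theta = position * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.0, 0.5, -0.5, 2.0]
# The two scores agree up to rounding: both pairs are 2 positions apart.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

Because each pair is only rotated, vector norms are unchanged and the query-key dot product depends on the distance between the two positions rather than on their absolute values.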
GeGLU Activations
The standard ReLU non-linearity is replaced by the approximate version of the GeGLU activation function.
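A GeGLU unit gates one linear projection of the input with the GELU of another. The sketch below assumes that "approximate version" refers to the common tanh approximation of GELU, which is an interpretation rather than something these notes state. The function names are hypothetical, and the two projections are assumed to have been computed already.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU (assumed meaning of "approximate")."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU via the Gaussian error function, for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def geglu(gate, up):
    """GeGLU on two already-projected vectors: GELU(gate) * up, elementwise."""
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


gate = [-1.0, 0.0, 1.0, 2.0]
up = [0.5, 0.5, 0.5, 0.5]
print(geglu(gate, up))
# Largest gap between the approximation and exact GELU on these inputs.
print(max(abs(gelu_tanh(g) - gelu_exact(g)) for g in gate))
```

Needing both a gate and an up projection is why a gated feedforward block carries two input-side weight matrices instead of one.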
RMSNorm
The input of each transformer sub-layer (the attention layer and the feedforward layer) is normalized with RMSNorm.
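RMSNorm rescales a vector by its root mean square and, unlike LayerNorm, does not subtract the mean. A minimal sketch follows; the `rms_norm` name, the learned `gain` vector, and the `eps` value are illustrative assumptions, not details taken from these notes.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-feature gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)   # eps guards against division by zero
    return [g * v * inv_rms for g, v in zip(gain, x)]


x = [3.0, 4.0]
y = rms_norm(x, gain=[1.0, 1.0])
print(y)
# With unit gain the output has a mean square close to 1.
print(sum(v * v for v in y) / len(y))
```

In a pre-norm block of the kind described here, this function would be applied to the residual stream before it enters the attention layer and again before the feedforward layer.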
Pretraining
Gemma 2B and 7B are trained on 3T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code.
The models are not multimodal, nor are they trained for state-of-the-art performance on multilingual tasks.
For compatibility, Gemma uses a subset of the SentencePiece tokenizer (Kudo and Richardson, 2018) of Gemini.
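The tokenizer behaviours listed in the Method section (digit splitting, whitespace preservation, byte-level fallback) can be illustrated with a toy pre-tokenizer. This is not SentencePiece and not the Gemma tokenizer: `toy_tokenize`, `byte_fallback`, and the tiny `vocab` set are hypothetical. Whole-word lookup stands in for real subword segmentation, and the `<0xNN>` spelling of byte tokens is an assumed convention.

```python
def byte_fallback(piece):
    """Spell an out-of-vocabulary piece as one token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


def toy_tokenize(text, vocab):
    tokens, word = [], ""

    def flush():
        nonlocal word
        if word:
            # Known pieces are kept whole; unknown ones fall back to bytes.
            tokens.extend([word] if word in vocab else byte_fallback(word))
            word = ""

    for ch in text:
        if ch.isdigit() or ch.isspace():
            flush()
            tokens.append(ch)   # every digit and every whitespace char is its own token
        else:
            word += ch
    flush()
    return tokens


print(toy_tokenize("pi  314 \u00e9", vocab={"pi"}))
```

Splitting every digit keeps numbers of any length representable with ten digit tokens. Byte fallback means no input ever maps to an unknown token, at the cost of longer sequences for rare text.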
Instruction Tuning
Gemma is fine-tuned using both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and both are essential for improving performance.
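For the RLHF stage, a reward model is usually fitted to pairs of responses in which human raters preferred one over the other. The sketch below shows a Bradley-Terry style pairwise loss, a common choice for this step. Treat it as an assumption about the general technique rather than a description of the exact Gemma training objective; the function name is hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)) without overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the preferred response is scored further above the other.
for chosen, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(chosen, rejected, pairwise_preference_loss(chosen, rejected))
```

Minimizing this loss over many labelled pairs pushes the reward model to score preferred responses higher, and the resulting scalar reward is what the policy is then optimized against.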