2024 - Gemma: Open Models Based on Gemini Research and Technology

2026. 4. 9. 21:34

1. Overview

This paper introduces Gemma, a family of lightweight open large language models derived from the research and technology used in Gemini.

The goal is to provide efficient, high-performing, and safe open models that can be widely deployed across different computational settings.

2. Problem Definition

Existing large language models face several issues:

High computational cost limits accessibility
Open models often underperform compared to closed models
Safety and alignment remain challenging
Efficient deployment is difficult at scale

The paper aims to build smaller models that maintain strong performance while improving accessibility and safety.

3. Key Idea

The main idea is to transfer proven techniques from Gemini into smaller open models.

This includes architecture design, large-scale training, data processing, and alignment methods such as instruction tuning and reinforcement learning.

4. Method

- Model Architecture

Transformer decoder-based architecture
Uses improvements such as:
- Multi-Query Attention (2B model only; the 7B model uses multi-head attention)
- RoPE positional embeddings
- GeGLU activation
- RMSNorm
Context length: 8192 tokens
Model sizes: 2B and 7B parameters

- Tokenization

Based on SentencePiece (subset of Gemini tokenizer)
Splits digits into smaller units
Preserves whitespace
Uses byte-level encoding for unknown tokens
Vocabulary size: 256k tokens
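
The three tokenizer behaviours listed above (digit splitting, preserved whitespace, byte-level fallback) can be illustrated with a toy pre-tokenizer. This is only a sketch and not the actual Gemini/Gemma SentencePiece model: `toy_tokenize`, `toy_detokenize`, the tiny `VOCAB`, and the whole-chunk lookup are all hypothetical. The `▁` space marker and the `<0xNN>` byte-piece spelling follow SentencePiece conventions.

```python
import re

# Hypothetical toy vocabulary; the real model learns about 256k pieces.
VOCAB = {"Gemma", "has", "parameters"} | set("0123456789")

def toy_tokenize(text: str) -> list[str]:
    """Split text into pieces: spaces kept, digits split, bytes as fallback."""
    pieces = []
    # Each chunk is one space, one digit, or a run of other characters.
    for chunk in re.findall(r" |\d|[^ \d]+", text):
        if chunk == " ":
            pieces.append("▁")  # every space survives as its own piece
        elif chunk in VOCAB:
            pieces.append(chunk)
        else:
            # Unknown chunk: fall back to one piece per UTF-8 byte.
            pieces.extend(f"<0x{b:02X}>" for b in chunk.encode("utf-8"))
    return pieces

def toy_detokenize(pieces: list[str]) -> str:
    """Invert toy_tokenize exactly, including repeated spaces."""
    out = bytearray()
    for p in pieces:
        if p == "▁":
            out += b" "
        elif re.fullmatch(r"<0x[0-9A-F]{2}>", p):
            out.append(int(p[3:5], 16))
        else:
            out += p.encode("utf-8")
    return out.decode("utf-8")

print(toy_tokenize("Gemma 256128"))
# ['Gemma', '▁', '2', '5', '6', '1', '2', '8']
```

Because nothing is dropped or normalized, detokenizing the pieces returns the original string, which is the practical meaning of "preserves whitespace" here.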

- Pretraining

Trained on large-scale datasets:
- 3T tokens (2B model)
- 6T tokens (7B model)
Data sources include web text, mathematics, and code
Applies filtering for quality and safety
Stages training so that the share of relevant, high-quality data increases toward the end

- Instruction Tuning

Two-stage alignment process:

Supervised Fine-Tuning (SFT):
- Uses synthetic and human-generated prompt–response data
Reinforcement Learning from Human Feedback (RLHF):
- Trains a reward model based on human preferences
- Optimizes model outputs accordingly

Both stages improve performance on benchmarks and human evaluations.
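
The reward-model step can be made concrete with a pairwise preference loss. One standard formulation is the Bradley-Terry model, in which the probability that the preferred response wins is the sigmoid of the reward difference. The sketch below assumes scalar rewards have already been computed for each response; `bradley_terry_loss`, `mean_preference_loss`, and the example scores are hypothetical, and whether this matches the authors' exact training setup is an assumption.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (reward_of_chosen, reward_of_rejected) pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)

# Hypothetical reward scores for three labelled preference pairs.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(mean_preference_loss(example_pairs))
```

The loss is log 2 when the reward model cannot separate the two responses and shrinks as the chosen response is scored higher. The RL stage then optimizes the policy against this learned reward.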

- Pipeline Summary

text → tokenizer → transformer → pretraining → SFT → RLHF → final model

5. Contributions

Introduces an efficient open LLM family based on Gemini
Achieves strong performance relative to model size
Provides both pretrained and instruction-tuned models
Emphasizes safety and responsible deployment

6. Strengths

Strong performance for small model sizes
Efficient deployment on limited hardware
Well-designed training and alignment pipeline
Balanced focus on performance and safety

7. Limitations

Text-only model (no multimodal capability)
Performance still below frontier large-scale models
Limited multilingual optimization
Requires additional fine-tuning for specific tasks

8. Insights and Future Directions

Alignment methods (SFT + RLHF) are critical for practical performance
Data quality and filtering significantly impact model behavior
Efficient scaling strategies can produce strong small models
Future work can explore multimodal extensions and improved reasoning

9. One-line Summary

Gemma is a family of efficient open language models that transfer Gemini’s architecture and training strategies into smaller models while maintaining strong performance and safety.

Notes

The authors release both pre-trained and fine-tuned checkpoints, as well as an open-source codebase for inference and serving.

Gemma comes in two sizes: a 7 billion parameter model for efficient deployment and development on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications.

The Gemma model architecture is based on the transformer decoder.

Models are trained on a context length of 8192 tokens.

Parameters	2B	7B
d_model	2048	3072
Layers	18	28
Feedforward hidden dims	32768	49152
Num heads	8	16
Num KV heads	1	16
Head size	256	256
Vocab size	256128	256128
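
As a sanity check, the table values can be turned into approximate parameter counts. This is a rough sketch under stated assumptions that are not in the notes above: no bias terms, two RMSNorm scale vectors per layer plus one final norm, and "Feedforward hidden dims" counting the GeGLU gate and up projections together (so the down projection reads half that width). `GemmaConfig` and the helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmaConfig:
    d_model: int
    layers: int
    ffw_hidden: int      # "Feedforward hidden dims" from the table
    num_heads: int
    num_kv_heads: int
    head_size: int
    vocab_size: int

GEMMA_2B = GemmaConfig(2048, 18, 32768, 8, 1, 256, 256128)
GEMMA_7B = GemmaConfig(3072, 28, 49152, 16, 16, 256, 256128)

def embedding_params(c: GemmaConfig) -> int:
    # Counted once, because input and output embeddings are shared.
    return c.vocab_size * c.d_model

def non_embedding_params(c: GemmaConfig) -> int:
    q = c.d_model * c.num_heads * c.head_size
    kv = 2 * c.d_model * c.num_kv_heads * c.head_size
    out = c.num_heads * c.head_size * c.d_model
    # Assumption: ffw_hidden = gate + up widths; down projection uses half of it.
    ffw = c.d_model * c.ffw_hidden + (c.ffw_hidden // 2) * c.d_model
    norms = 2 * c.d_model  # assumption: one scale vector per sub-layer norm
    return c.layers * (q + kv + out + ffw + norms) + c.d_model  # + final norm

for name, cfg in [("2B", GEMMA_2B), ("7B", GEMMA_7B)]:
    print(name, f"{embedding_params(cfg):,}", f"{non_embedding_params(cfg):,}")
# 2B 524,550,144 1,981,884,416
# 7B 786,825,216 7,751,248,896
```

Under these assumptions the totals come to about 2.5B and 8.5B parameters. The large 256k vocabulary accounts for a noticeable share of each model, especially the 2B one.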

Multi-Query Attention
The 7B model uses multi-head attention, while the 2B checkpoints use multi-query attention (num_kv_heads = 1).
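
One practical effect of the number of KV heads is the size of the key/value cache during generation. The arithmetic below is a sketch that assumes a single sequence at the full 8192-token context and 2 bytes per cached value (for example bfloat16); both are assumptions, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers: int, num_kv_heads: int, head_size: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Two tensors (K and V) per layer, each context_len x num_kv_heads x head_size.
    return 2 * layers * num_kv_heads * head_size * context_len * bytes_per_value

MIB = 1024 ** 2

gemma_2b_mqa = kv_cache_bytes(layers=18, num_kv_heads=1, head_size=256, context_len=8192)
gemma_7b_mha = kv_cache_bytes(layers=28, num_kv_heads=16, head_size=256, context_len=8192)
# Hypothetical comparison: the 2B shape with one KV head per query head.
gemma_2b_if_mha = kv_cache_bytes(layers=18, num_kv_heads=8, head_size=256, context_len=8192)

print(gemma_2b_mqa // MIB)     # 144
print(gemma_2b_if_mha // MIB)  # 1152
print(gemma_7b_mha // MIB)     # 3584
```

With a single shared KV head, the 2B cache is 8 times smaller than it would be with 8 KV heads, which matters for the CPU and on-device use the 2B model targets.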

RoPE Embeddings
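Rotary positional embeddings (RoPE) are used in each layer instead of absolute positional embeddings. Separately, embeddings are shared across inputs and outputs to reduce model size.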
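
A minimal sketch of the rotary idea: each consecutive pair of features is rotated by an angle proportional to the token position, so the query-key dot product depends only on the relative offset. The base of 10000 and the interleaved pairing are common conventions assumed here, not details taken from the notes; `rope` and `dot` are hypothetical helper names.

```python
import math

def rope(x: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive feature pairs of x by position-dependent angles."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2) at different absolute positions gives the same score.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(math.isclose(score_near, score_far, abs_tol=1e-9))  # True
```

Since a rotation does not change vector length, the norm of each query and key is also unchanged.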

GeGLU Activations
The standard ReLU non-linearity is replaced by the approximate version of the GeGLU activation.
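
A small sketch of a GeGLU feedforward block, assuming the "approximate version" refers to the common tanh approximation of GELU (an assumption, not stated in the notes). GeGLU multiplies a GELU-activated gate projection elementwise with a linear up projection before projecting back down. The function names and the tiny weight matrices are hypothetical.

```python
import math

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_exact(x: float) -> float:
    """Exact GELU via the Gaussian CDF, for comparison only."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def geglu_ffn(x, w_gate, w_up, w_down):
    """down( GELU(gate(x)) * up(x) ), with no bias terms."""
    gate = [gelu_tanh(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Tiny hypothetical weights: d_model = 2, hidden width = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
y = geglu_ffn([1.0, -2.0], w_gate=identity, w_up=identity, w_down=[[1.0, 1.0]])
print(y)
```

The gate and up projections are two separate matrices, which is why a GeGLU block carries two input-side weight matrices where a plain ReLU feedforward block has one.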

RMSNorm
The input of each transformer sub-layer (the attention layer and the feedforward layer) is normalized with RMSNorm.
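
RMSNorm divides a vector by its root mean square and applies a learned per-feature scale, without subtracting the mean. The sketch below also shows where the norm sits when it is applied to the sub-layer input. The residual connection, the `eps` value, and the helper names are assumptions for illustration rather than details from the notes.

```python
import math

def rms_norm(x: list[float], scale: list[float], eps: float = 1e-6) -> list[float]:
    """Scale x to unit root mean square, then apply a per-feature gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * s for v, s in zip(x, scale)]

def pre_norm_sublayer(x, sublayer, scale):
    """Normalize the sub-layer input, then add the result back to x."""
    y = sublayer(rms_norm(x, scale))
    return [a + b for a, b in zip(x, y)]

x = [3.0, 4.0]
normed = rms_norm(x, scale=[1.0, 1.0])
rms_after = math.sqrt(sum(v * v for v in normed) / len(normed))
print(round(rms_after, 4))  # 1.0

# Hypothetical sub-layer standing in for attention or the feedforward block.
out = pre_norm_sublayer(x, sublayer=lambda h: [2.0 * v for v in h], scale=[1.0, 1.0])
```

Normalizing the input rather than the output leaves the residual path untouched, so the original activations pass through each block unchanged.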

Pretraining

Gemma 2B and 7B are trained on 3T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code.
The models are not multimodal, nor are they trained for state-of-the-art performance on multilingual tasks.

Gemma uses a subset of Gemini's SentencePiece tokenizer (Kudo and Richardson, 2018) for compatibility.

Instruction Tuning

Gemma is fine-tuned using both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and the authors find both stages important for improving performance on automatic benchmarks and human preference evaluations.

Shijuan's AI Diary