
1. Overview

This paper introduces Gemma, a family of lightweight open large language models derived from the research and technology used in Gemini.

The goal is to provide efficient, high-performing, and safe open models that can be widely deployed across different computational settings.


2. Problem Definition

Existing large language models face several issues:

  1. High computational cost limits accessibility
  2. Open models often underperform compared to closed models
  3. Safety and alignment remain challenging
  4. Efficient deployment is difficult at scale

The paper aims to build smaller models that maintain strong performance while improving accessibility and safety.


3. Key Idea

The main idea is to transfer proven techniques from Gemini into smaller open models.

This includes architecture design, large-scale training, data processing, and alignment methods such as instruction tuning and reinforcement learning.


4. Method


- Model Architecture

  • Transformer decoder-based architecture
  • Uses improvements such as:
    • Multi-Query Attention (2B model only; the 7B model uses multi-head attention)
    • RoPE positional embeddings
    • GeGLU activation
    • RMSNorm
  • Context length: 8192 tokens
  • Model sizes: 2B and 7B parameters

- Tokenization

  • Uses a subset of the Gemini SentencePiece tokenizer
  • Splits digits into smaller units
  • Preserves whitespace
  • Uses byte-level encoding for unknown tokens
  • Vocabulary size: 256k tokens
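
Three of the behaviours listed above (digit splitting, whitespace preservation, byte-level fallback) can be illustrated with a toy tokenizer. This is only a sketch and not SentencePiece itself: `TOY_VOCAB`, `toy_tokenize`, and `toy_detokenize` are hypothetical names, and the `▁` and `<0xNN>` spellings are borrowed from SentencePiece conventions for readability.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer learns its pieces from data.
TOY_VOCAB = {"▁", "Gemma", "has", "B", "params"} | set("0123456789")


def toy_tokenize(text):
    """Split text into toy tokens: digits one by one, spaces kept as '▁',
    and anything outside TOY_VOCAB emitted as UTF-8 byte tokens."""
    tokens = []
    # Each chunk is one digit, one space, or a run of other characters.
    for chunk in re.findall(r"\d| |[^\d ]+", text):
        if chunk == " ":
            tokens.append("▁")
        elif chunk in TOY_VOCAB:
            tokens.append(chunk)
        else:
            # Byte fallback: unknown text is never dropped or mapped to <unk>.
            tokens.extend(f"<0x{b:02X}>" for b in chunk.encode("utf-8"))
    return tokens


def toy_detokenize(tokens):
    """Invert toy_tokenize, showing that no whitespace or bytes were lost."""
    out = bytearray()
    for tok in tokens:
        if tok == "▁":
            out += b" "
        elif re.fullmatch(r"<0x[0-9A-F]{2}>", tok):
            out.append(int(tok[3:5], 16))
        else:
            out += tok.encode("utf-8")
    return out.decode("utf-8")
```

With this toy vocabulary, a number such as 2024 becomes four single-digit tokens, and a character that is not in the vocabulary is emitted as its UTF-8 bytes, so the original string can always be reconstructed.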

- Pretraining

  • Trained on large-scale datasets:
    • 3T tokens (2B model)
    • 6T tokens (7B model)
  • Data sources include web text, mathematics, and code
  • Applies filtering for quality and safety
  • Uses staged training with increasing data quality
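
The staged training idea above can be sketched as a schedule of sampling weights that shifts toward higher-quality data late in training. Everything concrete here is an assumption for illustration: the source names, the stage boundary, and the weights are hypothetical and are not taken from the paper.

```python
# Hypothetical stages: (fraction of training at which the stage ends, weights).
STAGES = [
    (0.7, {"general": 0.9, "high_quality": 0.1}),
    (1.0, {"general": 0.6, "high_quality": 0.4}),
]


def weights_at(progress, stages=STAGES):
    """Return the data-source sampling weights for a training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    for stage_end, weights in stages:
        if progress <= stage_end:
            return weights
    # Unreachable when the last stage ends at 1.0; kept as a guard.
    raise ValueError("stages must cover the whole range [0, 1]")
```

A data loader would call `weights_at(step / total_steps)` and sample each batch's sources in those proportions.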

- Instruction Tuning

Two-stage alignment process:

  • Supervised Fine-Tuning (SFT):
    • Uses synthetic and human-generated prompt–response data
  • Reinforcement Learning from Human Feedback (RLHF):
    • Trains a reward model based on human preferences
    • Optimizes model outputs accordingly

Both stages improve performance on benchmarks and human evaluations.
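
The reward-model step can be made concrete with a pairwise preference loss. The sketch below assumes the Bradley-Terry formulation that is common in RLHF pipelines; the summary above does not state which loss is used, and the function names are hypothetical.

```python
import math


def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the human-preferred response wins.

    Equals -log(sigmoid(r_chosen - r_rejected)), computed in a form
    that avoids overflow for large margins of either sign.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return math.exp(-pairwise_loss(r_chosen, r_rejected))
```

Minimizing this loss over many labelled pairs pushes the reward of preferred responses above the reward of rejected ones; the policy is then optimized against the learned reward.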


- Pipeline Summary

text → tokenizer → transformer → pretraining → SFT → RLHF → final model


5. Contributions

  • Introduces an efficient open LLM family based on Gemini
  • Achieves strong performance relative to model size
  • Provides both pretrained and instruction-tuned models
  • Emphasizes safety and responsible deployment

6. Strengths

  • Strong performance for small model sizes
  • Efficient deployment on limited hardware
  • Well-designed training and alignment pipeline
  • Balanced focus on performance and safety

7. Limitations

  • Text-only model (no multimodal capability)
  • Performance still below frontier large-scale models
  • Limited multilingual optimization
  • Requires additional fine-tuning for specific tasks

8. Insights and Future Directions

  • Alignment methods (SFT + RLHF) are critical for practical performance
  • Data quality and filtering significantly impact model behavior
  • Efficient scaling strategies can produce strong small models
  • Future work can explore multimodal extensions and improved reasoning

9. One-line Summary

Gemma is a family of efficient open language models that transfer Gemini’s architecture and training strategies into smaller models while maintaining strong performance and safety.


Notes

The authors release both pre-trained and fine-tuned checkpoints, as well as an open-source codebase for inference and serving.

Gemma comes in two sizes: a 7 billion parameter model for efficient deployment and development on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications.

The Gemma model architecture is based on the transformer decoder.

Models are trained on a context length of 8192 tokens.

Parameters               2B       7B
d_model                  2048     3072
Layers                   18       28
Feedforward hidden dims  32768    49152
Num heads                8        16
Num KV heads             1        16
Head size                256      256
Vocab size               256128   256128
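
The table values can be turned into a rough parameter count. This is a sketch under stated assumptions, not an official accounting: it assumes no bias terms, one RMSNorm scale vector per sub-layer plus one final norm, a single shared embedding matrix, and that "Feedforward hidden dims" counts the gate and up projections together (so the down projection reads half that width). `Config` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    d_model: int
    layers: int
    ffn_hidden: int  # "Feedforward hidden dims" from the table
    heads: int
    kv_heads: int
    head_size: int
    vocab: int


GEMMA_2B = Config(2048, 18, 32768, 8, 1, 256, 256128)
GEMMA_7B = Config(3072, 28, 49152, 16, 16, 256, 256128)


def embedding_params(c):
    # One matrix only, because input and output embeddings are shared.
    return c.vocab * c.d_model


def non_embedding_params(c):
    q = c.d_model * c.heads * c.head_size
    kv = 2 * c.d_model * c.kv_heads * c.head_size  # key and value projections
    out = c.heads * c.head_size * c.d_model
    # Assumption: ffn_hidden = gate width + up width, so the down
    # projection maps ffn_hidden // 2 features back to d_model.
    ffn = c.d_model * c.ffn_hidden + (c.ffn_hidden // 2) * c.d_model
    norms = 2 * c.d_model  # one RMSNorm scale vector per sub-layer
    final_norm = c.d_model
    return c.layers * (q + kv + out + ffn + norms) + final_norm
```

Under these assumptions the 2B configuration comes to 524,550,144 embedding and 1,981,884,416 non-embedding parameters, and the 7B configuration to 786,825,216 and 7,751,248,896. The large vocabulary makes the embedding matrix a sizeable share of the smaller model.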

Multi-Query Attention
The 7B model uses multi-head attention, while the 2B checkpoints use multi-query attention (with num_kv_heads = 1).
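
One practical effect of the number of KV heads is the size of the key/value cache at inference time. The arithmetic below is a sketch: the 2 bytes per value (a 16-bit format) is an assumption, and `kv_cache_bytes` is a hypothetical helper, not part of any Gemma codebase.

```python
def kv_cache_bytes(layers, kv_heads, head_size, context, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_size * context * bytes_per_value


MIB = 1024 * 1024

# 2B as described: multi-query attention, a single KV head.
mqa_2b = kv_cache_bytes(layers=18, kv_heads=1, head_size=256, context=8192)
# Hypothetical 2B variant with one KV head per query head (multi-head attention).
mha_2b = kv_cache_bytes(layers=18, kv_heads=8, head_size=256, context=8192)
# 7B as described: multi-head attention with 16 KV heads.
mha_7b = kv_cache_bytes(layers=28, kv_heads=16, head_size=256, context=8192)

print(mqa_2b // MIB, mha_2b // MIB, mha_7b // MIB)  # prints: 144 1152 3584
```

At the full 8192-token context, sharing one KV head across the 8 query heads cuts the 2B cache by a factor of 8 under these assumptions.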

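RoPE Embeddings
Rotary positional embeddings (RoPE) are used in each layer instead of absolute positional embeddings. Embeddings are also shared across inputs and outputs to reduce model size.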
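
The rotary embeddings named in the heading above can be sketched in a few lines. This is a minimal illustration, not the Gemma implementation: the base of 10000 and the pairing of adjacent dimensions are assumptions (implementations differ in how they pair dimensions), and `rope` and `dot` are hypothetical helpers.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on the difference of their positions, which is what makes the encoding relative.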

GeGLU Activations
The standard ReLU non-linearity is replaced by the approximate version of the GeGLU activation function.
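
A minimal sketch of a GeGLU feedforward block follows. Reading the approximation mentioned above as the tanh approximation of GELU is an assumption, as are the absence of bias terms and all names used here (`gelu_tanh`, `geglu_ffn`, `w_gate`, `w_up`, `w_down`).

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU, x * Phi(x), for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feedforward: w_down @ (GELU(w_gate @ x) * (w_up @ x))."""
    gate = [gelu_tanh(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

The gate and up projections have the same width, and their element-wise product is what the down projection maps back to the model dimension.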

RMSNorm
The input of each transformer sub-layer (the attention layer and the feedforward layer) is normalized with RMSNorm.
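
RMSNorm applied to a sub-layer input can be sketched as follows. The epsilon value, the plain multiplicative scale, and the residual connection in `pre_norm_sublayer` are assumptions for illustration; the function names are hypothetical.

```python
import math


def rms_norm(x, scale, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-feature scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * s for v, s in zip(x, scale)]


def pre_norm_sublayer(x, scale, sublayer):
    """Normalize the sub-layer input, then add the result back to x (residual)."""
    y = sublayer(rms_norm(x, scale))
    return [xi + yi for xi, yi in zip(x, y)]
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one statistic per vector. Here `sublayer` stands in for either the attention layer or the feedforward layer.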

Pretraining

Gemma 2B and 7B are trained on 3T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code.
The models are not multimodal, nor are they trained for state-of-the-art performance on multilingual tasks.

The models use a subset of Gemini's SentencePiece tokenizer (Kudo and Richardson, 2018) for compatibility.

Instruction Tuning

Gemma is fine-tuned using both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and both are essential for improving performance.
