
IBM’s Granite foundation model: A detailed look at its training data

11/5/2024 • redhat.com

While many AI model developers publicly release research papers and describe their approaches to training data, we’ll focus on one model in particular: IBM’s Granite model, for which IBM has gone a step further and released its specific training data.

Read Full Article...

C4AIL Commentary

This may be the most transparent disclosure of training data for a competitive LLM yet, providing great insight into what it takes to bootstrap a modern LLM.

WEB DATA

  • FineWeb: More than 15T tokens of cleaned and deduplicated English data from CommonCrawl (a brief loading sketch follows this list).
  • Webhose: Unstructured web content in English converted into machine-readable data.
  • DCLM-Baseline: A 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.
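For readers who want to inspect one of these web corpora directly, here is a minimal sketch of streaming a small FineWeb sample with the Hugging Face datasets library. It assumes the public dataset repository HuggingFaceFW/fineweb and its sample-10BT configuration (neither is mentioned in the article), and it is illustrative only, not a reconstruction of IBM's pipeline.

    # Minimal sketch: stream a small FineWeb sample for inspection.
    # Assumes the public Hugging Face dataset "HuggingFaceFW/fineweb"
    # and its "sample-10BT" configuration; not IBM's actual pipeline.
    from datasets import load_dataset

    fineweb = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # avoid downloading the full corpus
    )

    # Print the source URL and the first 200 characters of text
    # for the first five records.
    for i, record in enumerate(fineweb):
        print(record["url"])
        print(record["text"][:200], "...")
        if i >= 4:
            break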

CODE

  • Code Pile: Sourced from publicly available datasets like GitHub Code Clean and StarCoderdata.
  • FineWeb-Code: Contains programming/coding-related documents filtered from the FineWeb dataset using annotation.
  • CodeContests: Competitive programming dataset with problems, test cases, and human solutions in multiple languages.

DOMAIN

  • USPTO: Collection of US patents granted from 1975 to 2023.
  • Free Law: Public-domain legal opinions from US federal and state courts.
  • PubMed Central: Biomedical and life sciences papers.
  • EDGAR Filings: Annual reports from US publicly traded companies over 25 years.

MULTILINGUAL

  • Multilingual Wikipedia: Data from 11 languages to support multilingual capabilities.
  • Multilingual Webhose: Multilingual web content converted into machine-readable data feeds.
  • MADLAD-12: Document-level multilingual dataset covering 12 languages.

INSTRUCTIONS

  • Code Instructions Alpaca: Instruction-response pairs about code generation problems.
  • Glaive Function Calling: Dataset focused on function calling in real scenarios.

ACADEMIC

  • peS2o: A collection of 40M open-access academic papers for pre-training.
  • arXiv: Scientific paper preprints posted to arXiv.
  • IEEE: Technical content acquired by IBM from IEEE.

TECHNICAL

  • Wikipedia: Technical articles sourced from Wikipedia.
  • Library of Congress Public Domain Books: More than 140,000 public domain English books.
  • Directory of Open Access Books: Publicly available technical books from the Directory of Open Access Books.
  • Cosmopedia: Synthetic textbooks, blog posts, stories, and WikiHow articles.

MATH

  • OpenWebMath: Mathematical text from the internet, filtered from 200B HTML files.
  • Algebraic-Stack: Mathematical code dataset including numerical computing and formal mathematics.
  • Stack Exchange: User-contributed content from the Stack Exchange network.
  • MetaMathQA: Dataset of rewritten mathematical questions.
  • StackMathQA: A curated collection of 2 million mathematical questions from Stack Exchange.
  • MathInstruct: Focused on chain-of-thought (CoT) and program-of-thought (PoT) rationales for mathematical reasoning.
  • TemplateGSM: Collection of over 7 million grade-school math problems with code and natural language solutions.