
IBM’s Granite foundation model: A detailed look at its training data

11/5/2024 • redhat.com

While many AI model developers publicly release research papers and describe their approaches to training data, we’ll focus on one model in particular: IBM’s Granite model, for which IBM has gone a step further and released its specific training data.

Read Full Article...

C4AIL Commentary

This may be the most transparent disclosure of training data for a competitive LLM yet, providing great insight into what it takes to bootstrap a modern LLM.

WEB DATA

  • FineWeb: More than 15T tokens of cleaned and deduplicated English data from CommonCrawl (a brief loading sketch follows this list).
  • Webhose: Unstructured web content in English converted into machine-readable data.
  • DCLM-Baseline: A 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.
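For readers who want to inspect one of these web corpora directly, here is a minimal sketch of streaming a small FineWeb sample with the Hugging Face datasets library. It assumes the public dataset repository HuggingFaceFW/fineweb and its sample-10BT configuration (neither is mentioned in the article), and it is illustrative only, not a reconstruction of IBM's pipeline.

    # Minimal sketch: stream a small FineWeb sample for inspection.
    # Assumes the public Hugging Face dataset "HuggingFaceFW/fineweb"
    # and its "sample-10BT" configuration; not IBM's actual pipeline.
    from datasets import load_dataset

    fineweb = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # avoid downloading the full corpus
    )

    # Print the source URL and the first 200 characters of text
    # for the first five records.
    for i, record in enumerate(fineweb):
        print(record["url"])
        print(record["text"][:200], "...")
        if i >= 4:
            break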

CODE

  • Code Pile: Sourced from publicly available datasets like GitHub Code Clean and StarCoderdata.
  • FineWeb-Code: Contains programming/coding-related documents filtered from the FineWeb dataset using annotation.
  • CodeContests: Competitive programming dataset with problems, test cases, and human solutions in multiple languages.

DOMAIN

  • USPTO: Collection of US patents granted from 1975 to 2023.
  • Free Law: Public-domain legal opinions from US federal and state courts.
  • PubMed Central: Biomedical and life sciences papers.
  • EDGAR Filings: Annual reports from US publicly traded companies over 25 years.

MULTILINGUAL

  • Multilingual Wikipedia: Data from 11 languages to support multilingual capabilities.
  • Multilingual Webhose: Multilingual web content converted into machine-readable data feeds.
  • MADLAD-12: Document-level multilingual dataset covering 12 languages.

INSTRUCTIONS

  • Code Instructions Alpaca: Instruction-response pairs about code generation problems.
  • Glaive Function Calling: Dataset focused on function calling in real scenarios.

ACADEMIC

  • peS2o: A collection of 40M open-access academic papers for pre-training.
  • arXiv: Scientific paper preprints posted to arXiv.
  • IEEE: Technical content acquired by IBM from IEEE.

TECHNICAL

  • Wikipedia: Technical articles sourced from Wikipedia.
  • Library of Congress Public Domain Books: More than 140,000 public domain English books.
  • Directory of Open Access Books: Publicly available technical books from the Directory of Open Access Books.
  • Cosmopedia: Synthetic textbooks, blog posts, stories, and WikiHow articles.

MATH

  • OpenWebMath: Mathematical text from the internet, filtered from 200B HTML files.
  • Algebraic-Stack: Mathematical code dataset including numerical computing and formal mathematics.
  • Stack Exchange: User-contributed content from the Stack Exchange network.
  • MetaMathQA: Dataset of rewritten mathematical questions.
  • StackMathQA: A curated collection of 2 million mathematical questions from Stack Exchange.
  • MathInstruct: Focused on chain-of-thought (CoT) and program-of-thought (PoT) rationales for mathematical reasoning.
  • TemplateGSM: Collection of over 7 million grade-school math problems with code and natural language solutions.