
Foundation models


Overview

  • Pre-trained on unlabeled datasets
  • Leverage self-supervised learning, e.g., masked-token prediction (see the sketch after this list)
  • Learn generalizable & adaptable data representations
  • Can be effectively used in multiple downstream tasks (e.g., text generation, machine translation, text classification)
  • Note: while the transformer architecture is the most prevalent in foundation models, the definition is not restricted to a particular model architecture
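
A minimal sketch of the self-supervised objective mentioned above, using masked-token prediction. The Hugging Face transformers library and the bert-base-uncased checkpoint are assumptions chosen for illustration, not something this section prescribes.

```python
# Sketch: self-supervised learning via masked-token prediction.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-uncased` checkpoint purely for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# No human labels: the training signal is the original token hidden behind [MASK].
text = "Foundation models are pre-trained on [MASK] datasets."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and decode the model's top prediction for it.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))
```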

Data Modalities

  • Natural Language
  • Speech
  • Business Data
  • IT Data
  • Sensor Data
  • Chemistry & Materials
  • Geospatial
  • Programming Languages (Code)
  • Images
  • Dialog

Architectures

Encoder-only

  • Best cost/performance trade-off for non-generative use cases
  • Most classical NLP tasks: classification, entity and relation extraction, extractive summarization, extractive question answering, etc.
  • Require task-specific labeled data for fine-tuning. Examples: BERT/RoBERTa models (usage sketch below).
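
A minimal usage sketch for the encoder-only case: a BERT-family model that has already been fine-tuned on labeled sentiment data is applied to a classification task. The checkpoint name is a public example, chosen only for illustration.

```python
# Sketch: classification with an encoder-only (BERT-family) model.
# The fine-tuned checkpoint below is a public example used for illustration.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("The onboarding process was quick and painless."))
# e.g., [{'label': 'POSITIVE', 'score': 0.99...}]
```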

Encoder-Decoder

  • Support both generative and non-generative use cases
  • Best cost/performance trade-off for generative use cases where the input is large but the generated output is small
  • Can be prompt-engineered at sizes of roughly 10B parameters and above; below that, they are typically fine-tuned with labeled data. Examples: Google T5 models, UL2 models (usage sketch below).
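
A minimal sketch of the encoder-decoder case: a T5 model handling a generative task where the input is long but the output is short (summarization). The t5-small checkpoint and the "summarize:" task prefix follow the public T5 convention and are used here only for illustration.

```python
# Sketch: summarization with an encoder-decoder (T5) model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Long input, short generated output.
article = (
    "summarize: Foundation models are pre-trained on large unlabeled corpora "
    "using self-supervised objectives and can then be adapted to many "
    "downstream tasks such as classification, extraction, and generation."
)
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```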

Decoder-only

  • Designed explicitly for generative AI use cases
  • Summarization, generative question answering, translation, copywriting
  • Architecture used in GPT-3, ChatGPT, etc. (usage sketch below)
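
A minimal sketch of the decoder-only case. The small public gpt2 checkpoint stands in for the much larger GPT-3-class models named above; it is an assumption used only for illustration.

```python
# Sketch: open-ended generation with a decoder-only model.
# `gpt2` is a small stand-in for GPT-3-class decoder-only models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Write a one-sentence product description for a solar-powered lamp:"
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
```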

Training and Tuning (⬇️ value)

Base model

  • Pre-trained on tens of terabytes of unlabeled Internet data
  • Examples: Watson Studio base LLM models, Google T5, etc.

Custom pre-trained model

  • Pre-trained on tens of gigabytes of domain- or industry-specific data
  • Examples: IBM-NASA collaboration models, Watson Code Assistant models

Fine-tuned model

  • Fine-tuned on a class of tasks
  • Examples: Watson NLP out-of-the-box (OOTB) Entity and Sentiment models, Google FLAN-T5, watsonx sandstone.instruct

Human-in-the-loop refined model

  • Refined with human feedback on model outputs (e.g., reinforcement learning from human feedback)
  • Examples: ChatGPT, watsonx sandstone.chat

Adaptation to multiple tasks (⬇️ complexity/skills, ⬆️ model size)

Prompt engineering

  • Training: None
  • Inference: Engineered Prompt + Input Text ➡️ output
  • Designing and constructing effective prompts to obtain the desired outputs (see the sketch after this list)
  • Recommended way to start
  • Advantages ➕
      • Quick experimentation across various tasks
      • Little to no training data needed
  • Disadvantages ➖
      • Success depends on the choice of prompt and on model size
      • Mostly a trial-and-error process
      • Number of examples is limited by the prompt input size limit
      • Lower accuracy compared to fine-tuning
      • Longer prompts may give better accuracy but cost more at inference
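
A minimal sketch of few-shot prompt engineering: no training happens, and the task is specified entirely through the engineered prompt assembled at inference time. The example reviews and the build_prompt helper are illustrative; the resulting string would be sent to whichever model or inference API you use.

```python
# Sketch: few-shot prompt engineering. The prompt itself carries the task
# definition and examples; the model is used as-is, with no training.
FEW_SHOT_TEMPLATE = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and charges fast.
Sentiment: Positive

Review: It stopped working after two weeks.
Sentiment: Negative

Review: {review}
Sentiment:"""


def build_prompt(review: str) -> str:
    # Engineered prompt + input text -> string sent to the model at inference.
    return FEW_SHOT_TEMPLATE.format(review=review)


print(build_prompt("Setup was confusing but support was helpful."))
```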

Prompt-tuning

  • Training: Pre-trained model + Labeled Data ➡️ Prompt-Tuning Algorithm ➡️ Tuned Soft Prompt
  • Inference: Tuned Soft Prompt + Input Text ➡️ Pre-trained model ➡️ output
  • Relatively new technique
  • Training data format is the same as for fine-tuning
  • Pre-trained models: LLMs with decoders
  • Advantages ➕
      • Faster training, since only a small number of soft-prompt parameters are learned
      • Model accuracy comparable to fine-tuning in some cases
      • The pre-trained model is reused for inference across multiple tasks
      • Middle ground between fine-tuning and prompt engineering
      • Far fewer trainable parameters than fine-tuning (see the sketch after this list)
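
A minimal sketch of prompt-tuning, here expressed with the Hugging Face PEFT library (an assumption; the section does not name a specific toolkit). Only the embeddings of a handful of "virtual tokens" (the soft prompt) are trainable; the base model stays frozen and can be shared across tasks.

```python
# Sketch: prompt-tuning with the PEFT library. Only the soft prompt
# (20 virtual-token embeddings) is trained; the base model is frozen.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()
# e.g. trainable params: 15,360 || all params: ~124M -> a tiny fraction is learned
```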

Fine-tuning

  • Training: Pre-trained model + Labeled Data ➡️ Fine-Tuning Algorithm ➡️ Fine-Tuned model
  • Inference: Input Text ➡️ Fine-Tuned model ➡️ output
  • 📈 SotA accuracy with small models on many popular NLP tasks (classification, extraction)
  • Requires data science expertise
  • Requires a separate instance of the model for each task (can be expensive)
  • Becomes difficult as model size increases (e.g., overfitting issues); typically applied to models with fewer than ~1B parameters (see the sketch after this list)
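
A minimal sketch of full fine-tuning on labeled data with the Hugging Face Trainer. The IMDB dataset, checkpoint, and hyperparameters are illustrative assumptions; note that every weight of the model is updated, which is why each task ends up with its own model instance.

```python
# Sketch: full fine-tuning of a small encoder model on labeled data.
# Dataset, checkpoint, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# All weights are updated, so each task needs its own copy of the model.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()
```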

Enterprise considerations

  • Head-on comparison with ChatGPT is a trap
  • A single solution does not fit all; trust matters
  • ROI determined by use case and inference cost
  • Need to manage risks and limitations of today's LLMs
  • Consider the ability to run workloads where desired, train models, provide trusted models, integrate with backends, deliver enterprise features, and meet other non-functional requirements (NFRs)