Foundation models
Overview
- Pre-trained on unlabeled datasets
- Leverage self-supervised learning (see the sketch after this list)
- Learn generalizable & adaptable data representations
- Can be used effectively in multiple downstream tasks (e.g., text generation, machine translation, text classification)
- Note: while the transformer architecture is the most prevalent in foundation models, the definition is not restricted to any particular model architecture
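A minimal sketch of what "self-supervised" means in practice, assuming the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint (neither is prescribed by this section): the model learns by predicting tokens masked out of raw, unlabeled text.

```python
# Self-supervised objective (masked language modeling) in miniature.
# Assumes `pip install transformers torch`; the checkpoint choice is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# No human labels are involved: the training signal is the text itself,
# with the model asked to recover the hidden token.
print(fill_mask("Foundation models are pre-trained on [MASK] data."))
```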
Data Modalities
- Natural Language
- Speech
- Business Data
- IT Data
- Sensor Data
- Chemistry & Materials
- Geospatial
- Programming Languages (Code)
- Images
- Dialog
Architectures
Encoder-only
- Best cost-performance trade-off for non-generative use cases
- Most classical NLP tasks: classification, entity and relation extraction, extractive summarization, extractive question answering, etc.
- Require task-specific labeled data for fine-tuning. Examples: BERT/RoBERTa models (see the classification sketch below).
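As a purely illustrative example of an encoder-only model on a non-generative task, the sketch below runs sentiment classification with a BERT-family checkpoint that has already been fine-tuned on labeled data; it assumes the Hugging Face `transformers` library and the public SST-2 DistilBERT checkpoint.

```python
# Classification with an encoder-only model (BERT/RoBERTa family).
# Assumes `pip install transformers torch`; the checkpoint is illustrative.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # already fine-tuned on labeled SST-2 data
)

print(classifier("The onboarding process was quick and painless."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```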
Encoder-Decoder
- Support both generative and non-generative use cases
- Best cost-performance trade-off for generative use cases when the input is large but the generated output is small.
- Can be prompt-engineered at sizes of ~10B parameters and above; below that, typically fine-tuned using labeled data. Examples: Google T5 models, UL2 models (see the summarization sketch below).
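The sketch below illustrates the "large input, small output" pattern with an encoder-decoder model; it assumes the Hugging Face `transformers` library and the public `t5-small` checkpoint, both stand-ins rather than anything mandated here.

```python
# Summarization with an encoder-decoder model: the encoder reads a long input,
# the decoder generates a short output. Checkpoint choice is illustrative.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

long_report = (
    "Foundation models are pre-trained on large unlabeled corpora using "
    "self-supervised objectives. They learn generalizable representations "
    "that can later be adapted to downstream tasks such as classification, "
    "extraction, translation, and summarization."
)

print(summarizer(long_report, min_length=5, max_length=30))
```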
Decoder-only
- Designed explicitly for generative AI use cases
- Summarization, generative question answering, translation, copywriting (see the generation sketch below)
- Architectures used in GPT-3, ChatGPT, etc.
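For completeness, a decoder-only sketch: left-to-right text generation, assuming `transformers` and the small public `gpt2` checkpoint as a stand-in for GPT-3-class models.

```python
# Open-ended generation with a decoder-only model.
# Assumes `pip install transformers torch`; `gpt2` is only a small stand-in.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Write a one-line product tagline for a note-taking app:"
print(generator(prompt, max_new_tokens=30))
```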
Training and Tuning (⬇️ value)
Base model
- Pre-trained on 10s of TBs of unlabeled Internet data
- Examples: Watson Studio base LLM models, Google T5, etc.
Custom pre-trained model
- Pre-trained on 10s of GB of domain/industry specific data
- Examples: IBM-NASA collaboration models, Watson Code Assistant models
Fine-tuned model
- Fine-tuned on a class of tasks
- Examples: Watson NLP OOTB Entity, Sentiment models, Google FLAN T5, watsonx sandstone.instruct
Human-in-the-loop refined model
- Examples: ChatGPT, watsonx sandstone.chat
Adaptation to multiple tasks (⬇️ complexity/skills, ⬆️ model size)
Prompt engineering
- Training: None
- Inference: Engineered Prompt + Input Text ➡️ Pre-trained model ➡️ output
- Designing and constructing effective prompts to obtain the desired outputs
- Recommended way to start (see the few-shot sketch after this list)
- Advantages ➕
- Quick experimentation for various tasks
- Little to no training data
- Disadvantages ➖
- Success depends on choice of prompt and model size
- Mostly a trial-and-error process
- Number of in-context examples limited by the prompt input size
- Lower accuracy compared to fine-tuning
- Longer prompts may give better accuracy but cost more
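A minimal few-shot prompt-engineering sketch, assuming `transformers` and the public instruction-tuned `google/flan-t5-small` checkpoint (an assumption, not a recommendation from this section): there is no training step, only a carefully constructed prompt.

```python
# Prompt engineering: the task instruction and a couple of labeled examples
# live entirely in the prompt; the model itself is untouched.
from transformers import pipeline

generate = pipeline("text2text-generation", model="google/flan-t5-small")

prompt = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: The battery lasts all day. Sentiment: positive\n"
    "Review: The screen cracked after a week. Sentiment: negative\n"
    "Review: Setup took five minutes and it just works. Sentiment:"
)

print(generate(prompt, max_new_tokens=5))
```

Note how the number of in-context examples is bounded by the model's input length, which is the prompt-size limitation called out above.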
Prompt-tuning
- Training: Pre-trained model + Labeled Data ➡️ Prompt-Tuning Algorithm ➡️ Tuned Soft Prompt
- Inference: Tuned Soft Prompt + Input Text ➡️ Pre-trained model ➡️ output
- Relatively new technique
- Training data format is the same as for fine-tuning
- Pre-trained models: LLMs with decoders
- Advantages ➕
- Faster training, since only a few parameters are learned
- Model accuracy comparable to fine-tuning in some cases
- The pre-trained model is reused for inference across multiple tasks
- Middle ground between fine-tuning and prompt engineering
- Fewer trainable parameters compared to fine-tuning (see the soft-prompt sketch below)
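One way to set up prompt tuning is with the Hugging Face `peft` library; the sketch below is an illustration under stated assumptions (library, checkpoint, and hyperparameters are all placeholders), showing that only a small soft prompt is trainable while the base model stays frozen.

```python
# Prompt tuning: learn a small "soft prompt" (virtual token embeddings) with
# labeled data while the pre-trained model's weights remain frozen.
# Assumes `pip install transformers peft torch`; values are illustrative.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                         # length of the learned soft prompt
    prompt_tuning_init=PromptTuningInit.TEXT,      # initialize from a natural-language hint
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path="gpt2",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # a few thousand trainable parameters vs. ~124M frozen ones
# The soft prompt is then trained on labeled data (same format as fine-tuning) and
# prepended to inputs at inference time, with the base model reused across tasks.
```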
Fine-tuning
- Training: Pre-trained model + Labeled Data ➡️ Fine-Tuning Algorithm ➡️ Fine-Tuned model
- Inference: Input Text ➡️ Fine-Tuned model ➡️ output
- 📈 SotA accuracy with small models on many popular NLP tasks (classification, extraction)
- Requires data science expertise
- Requires separate instance of the model for each task (can be expensive)
- Becomes difficult as model size increases (e.g., overfitting issues); typically used with models under ~1B parameters (see the sketch below)
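For contrast with the two lighter-weight approaches, here is a rough full fine-tuning sketch using the `transformers` Trainer and the `datasets` library; the checkpoint, dataset, and hyperparameters are illustrative assumptions, and every weight in the small (<1B-parameter) model gets updated.

```python
# Full fine-tuning of a small encoder-only model for classification.
# Assumes `pip install transformers datasets accelerate torch`; all names are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Labeled data is required; IMDB reviews stand in for a real task-specific dataset.
dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small slice to keep the sketch cheap
)

trainer.train()  # yields a separate fine-tuned model instance dedicated to this one task
```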
Enterprise considerations
- Head-on comparison with ChatGPT is a trap
- A single solution does not fit all; trust matters
- ROI determined by use case and inference cost
- Need to manage risks and limitations of today's LLMs
- Consider the ability to run workloads as desired, train models, provide trusted models, backend integration, enterprise features, and other non-functional requirements (NFRs)