(Part 1: The Past)
What does it mean when a neural network “learns” something new?
“So I prepared the talk, and when the day came, I went in and did something that young men who have had no experience in giving talks often do – I put too many equations up on the blackboard.” – Richard Feynman
Neural networks can learn to perform many types of tasks, from simple binary classification to generating human-like text, images and videos. Researchers in the field of continual learning have, for a while, been studying how a neural network trained on one or a few types of tasks can be trained sequentially to perform more tasks. As neural network architectures have evolved, so has the focus on what these ‘tasks’ are.
In this article, we’ll take a trip through the various ‘eras’ of the field of continual learning, as I see them, noting some key research papers along the way. Towards the end, I’ll discuss how the field has evolved and what, in my opinion, the next pressing challenges and interesting directions of research will be.
Who is this article for?
If you’re a casual reader, I hope you will find this an interesting introduction to the field of continual learning and what it means for modern Artificial Intelligence (AI), assuming you already have some basic machine learning background. Technical keywords are italicised, so feel free to look them up if you want to know more about them.
If you’re an AI researcher, this article should give you an idea of how various fields within AI link to continual learning and maybe even inspire you.
If you’re researching continual learning, perhaps you will learn something new of interest to you, perhaps you will disagree with my opinions, or perhaps you will notice I have missed something – either way, reach out to me!
1. Early Days: Forgetting and Knowledge Transfer
~ Up to 2018
Early neural network (NN) architectures were relatively simple and rarely consisted of the deep stacked layers that have now become the norm in AI. Early applications of NNs therefore focused on simple tasks such as image classification. Continual learning began as a study of how NNs could be trained to perform multiple such tasks sequentially. Common areas of study were training an NN to recognise different types of objects in images and training a reinforcement-learning-based NN agent to play multiple games in an arcade environment.
Research in continual learning in this era largely focused on the problem of catastrophic forgetting (CF): the degradation of an NN’s performance on tasks it was previously trained on, after it is optimised for a new task. Techniques for mitigating CF were either algorithmic (such as replay, regularisation and gradient projection) or architectural (such as parameter masking and network expansion); both families balance the amount of forgetting against the amount of learning, a balance known as the stability–plasticity trade-off. Some research in this era also focused on knowledge transfer (KT) between tasks, i.e. the improvement in performance on one task achieved by re-using knowledge acquired from a different task.
The continual learning methods proposed in this era were developed to cater to different practical constraints. The replay methods were useful when some data from previous tasks was always available to re-use after moving to subsequent tasks. When this assumption of replay data availability couldn’t be made in practice (e.g. due to data privacy), regularisation and gradient projection methods could be employed. In cases where the tasks were distinct enough (e.g. classifying animals vs classifying plants), the parameter-masking methods allowed different parts/parameters (i.e. trainable values) of the NN to specialise in different tasks, ensuring minimal interference between the tasks and thus minimal CF. At test-time, one could simply use a task-id to select the associated part of the network. In cases which allowed for the model size to increase over time (unlike on-device models where the storage is limited), the network expansion methods could be used to minimise forgetting by adding new parameters to accommodate new tasks.
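To make the replay idea concrete, here is a minimal, framework-free sketch of a replay buffer. The class name, the reservoir-sampling policy and the `mixed_batch` helper are my own illustrative choices, not taken from any specific paper:

```python
import random

class ReplayBuffer:
    """A bounded store of examples from previously seen tasks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, example):
        # Reservoir sampling: every example seen so far has an equal
        # chance of being retained, regardless of when it arrived.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def mixed_batch(self, new_batch, k):
        # Combine the new task's data with up to k replayed old examples,
        # so gradient updates also rehearse past tasks and reduce CF.
        replayed = random.sample(self.items, min(k, len(self.items)))
        return new_batch + replayed
```

In a real training loop, the mixed batch would be fed to the optimiser in place of the new-task batch alone; the buffer size controls the memory/forgetting trade-off.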
Key Papers:
- Rusu et al. (2016). “Progressive neural networks”: This paper introduced the notion of expanding the NN to accommodate new tasks and thus minimise CF while also enabling KT by connecting relevant parameters across tasks.
- Kirkpatrick et al. (2017). “Overcoming catastrophic forgetting in neural networks”: This paper began the regularisation line of research, wherein each NN parameter is assigned an importance score with respect to each task and the parameters important for learning past tasks are prevented from being changed when the NN is optimised to learn new tasks.
- Rolnick et al. (2018). “Experience Replay for Continual Learning”: This paper showed that a simple strategy of storing some data samples from previously encountered tasks, called the replay buffer, and combining these with new training data can substantially negate the CF problem.
- Serra et al. (2018). “Overcoming catastrophic forgetting with hard attention to the task”: This paper initiated the parameter-masking line of research, wherein each task is associated with an id and a corresponding set of parameters within the NN.
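The regularisation idea of Kirkpatrick et al. can be sketched as a quadratic penalty that anchors parameters in proportion to how important they were for past tasks. This is a minimal illustration assuming the per-parameter importance scores (e.g. diagonal Fisher estimates) have already been computed; the function name and flat-list parameterisation are my own simplifications:

```python
def ewc_penalty(params, old_params, importance, lam):
    """EWC-style quadratic penalty on parameter drift.

    importance[i] approximates how much changing parameter i hurts
    old-task performance; lam scales the overall penalty strength.
    The total training objective is then: new_task_loss + this penalty.
    """
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, importance)
    )
```

A parameter with high importance is effectively frozen near its old value, while unimportant parameters remain free to learn the new task — the stability–plasticity trade-off expressed as a single loss term.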
Tasks:
- Continual image classification: Each task contains a new set of objects in images. E.g. Classify cat/dog -> Classify cat/dog/deer/tiger -> Classify cat/dog/deer/tiger/leopard/elephant
- Gaming agents in arcade environments: Each task involves a new game. E.g. Playing Space Invaders -> Playing Space Invaders + Road Runner
2. The First Pre-Training Era: A Universal Shared Knowledge Base
~ 2019 to 2022
As NN architectures became more sophisticated and deep neural networks (DNNs) became easier to train, the first generation of pre-trained models, such as BERT and ViT, came into the picture. BERT was trained on large-scale text data from the internet, while ViT was trained on large-scale image data. These models could encode any sequence of text or pixels, respectively, into a highly contextual embedding vector, and these embeddings could then be used directly for tasks such as classification or segmentation. With these general-purpose models also came the ability to fine-tune the same pre-trained model for different downstream tasks. This transformed the way continual learning was approached.
A new paradigm emerged where, rather than worrying about updating the entire DNN when learning new tasks, the focus now shifted towards re-using the same pre-trained network that now contained rich real-world context. To accommodate any new application/domain specific knowledge, new components could be inserted into the network layers. The pre-trained part of the network would be left untouched to help re-use it to quickly adapt the network to any new task. In some ways this paradigm is similar to the architectural approach of the previous era – while traditional architectural methods had a separate mechanism to identify which parts of the network contained knowledge that could be shared between tasks, in this era, the pre-trained part already provided a very good shared knowledge base. Keeping the pre-trained part frozen could also easily minimise the risk of forgetting this useful shared knowledge.
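The frozen-backbone pattern described above can be sketched in a few lines: the pre-trained encoder is treated as a fixed feature extractor, and only a small new head is trained per task. This is a toy stand-in (a perceptron-style head over a stand-in `encode` function); the class and method names are hypothetical, not from any library:

```python
class FrozenBackboneClassifier:
    """Keep the pre-trained encoder fixed; train only a small task head."""

    def __init__(self, encode, dim, n_classes):
        self.encode = encode  # frozen pre-trained embedding function
        # Only these head weights are updated during continual learning,
        # so the shared pre-trained knowledge cannot be forgotten.
        self.W = [[0.0] * dim for _ in range(n_classes)]

    def logits(self, x):
        z = self.encode(x)  # no update ever touches the backbone
        return [sum(w * z_i for w, z_i in zip(row, z)) for row in self.W]

    def predict(self, x):
        scores = self.logits(x)
        return max(range(len(scores)), key=scores.__getitem__)

    def perceptron_step(self, x, label, lr=0.1):
        # Simple mistake-driven update confined to the head weights.
        pred = self.predict(x)
        if pred != label:
            z = self.encode(x)
            for i, z_i in enumerate(z):
                self.W[label][i] += lr * z_i
                self.W[pred][i] -= lr * z_i
```

In practice the head would be a gradient-trained linear or adapter layer on top of BERT/ViT embeddings, but the division of labour — frozen shared base, lightweight trainable add-on — is the same.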
Notably, the emergence of the first set of pre-trained language models (i.e. models that can take natural language text as input) in this era began the wave of interest in continual learning for textual applications.
Key Papers:
- Monaikul et al. (2021). “Continual Learning for Named Entity Recognition”: This paper adapted the idea of knowledge distillation to continual learning by re-using the model trained on the previous entity recognition task (e.g. tagging ‘person’ entities in a sentence) as a teacher to produce the old entity labels on the new data when training the model on new data containing only new entity labels (e.g. ‘location’ entities in a sentence). The pre-trained BERT encoder acts as a shared base layer, with new classifier layers added for new entities. It is the rich language understanding ability of BERT that allows the model to act as a good teacher on new data, which is particularly important when the new data contains new domains.
- Ke et al. (2021). “Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning”: This paper introduced new components at every layer of the pre-trained BERT model, to distinguish between task-specific knowledge and shareable knowledge at each layer, which is then routed accordingly to the task-specific or shared parts of the subsequent layers. The pre-trained BERT parameters are frozen to minimise CF while the new components at each layer further enhance KT and CF mitigation. While the new components manage the ‘routing’ of information between layers, it is the rich pre-trained embeddings at each layer that make it possible for such lightweight components to be effective.
- Xue et al. (2022). “Meta-attention for ViT-backed Continual Learning”: In the standard attention mechanism, which is the building block of ViT, when processing a given part of the image, information from all other parts of the image is always fully taken into account. This paper introduced the notion of a task-specific mask as part of the attention mechanism to ensure that only relevant parts of the image, as defined by the task, would be taken into account, thus improving the performance for each task. Again, it is the rich pre-trained embeddings that make it possible for such a lightweight modification to be effective.
- Wang et al. (2022). “Continual Learning with Lifelong Vision Transformer”: This paper also modified the standard attention mechanism in ViT by replacing some of the pre-trained parameters with new learnable parameters, which are then regularised to prevent CF during continual learning. This is another effective lightweight modification that combines traditional regularisation and replay techniques with a pre-trained network.
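The task-masked attention idea behind Xue et al. can be illustrated with a simplified single-query sketch: a per-task binary mask decides which positions may receive attention weight at all. This is my own minimal illustration, not the paper's exact formulation:

```python
import math

def task_masked_attention(scores, task_mask):
    """Softmax over attention scores, restricted by a per-task binary mask.

    Positions with mask 0 are excluded before normalisation, so the
    task only attends to the parts of the input deemed relevant to it.
    """
    # Shift by the max kept score for numerical stability.
    mx = max(s for s, m in zip(scores, task_mask) if m)
    exps = [math.exp(s - mx) if m else 0.0 for s, m in zip(scores, task_mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because each task carries its own mask while the underlying pre-trained attention parameters stay shared, tasks interfere with each other far less — the same intuition as the parameter-masking methods of the previous era, applied inside the attention mechanism.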
Tasks:
- Continual image classification: Each task contains a new set of objects in images. E.g. Classify cat/dog -> Classify cat/dog/deer/tiger -> Classify cat/dog/deer/tiger/leopard/elephant
- Continual text classification: Each task contains a new set of requests that can be made to a home assistant. E.g. Classify factoid/alarm -> Classify factoid/alarm/repeat/music -> Classify factoid/alarm/repeat/music/calendar/negate
3. The Generative Pre-Training Era: One Personal Assistant for Everyone
~ 2023 to 2025
The scaling up of generative DNN architectures to billions of parameters resulted in general-purpose pre-trained models that could produce highly coherent, human-like text and images. A crucial discovery of this era was that increasing model size produces interesting emergent abilities – e.g. the ability to learn directly from examples in the input, without having to optimise the network itself. This shifted the modelling of many tasks, such as classification, from something the network needs to be ‘optimised for’ with hundreds or thousands of examples to something the network needs to be ‘explained to do’ with just a few examples – in other words, this was the advent of prompt engineering. This encouraged significant research efforts into optimising the architecture and training algorithms, as well as inference speed, to improve the utility of these models. Instruction fine-tuning became a common recipe for making a pre-trained model more adept at following human instructions. While early models in this era, like GPT and GPT-2, were purely textual models – the reason for the term “large language models” (LLMs) – research quickly transitioned to multi-modality, with most models these days supporting at least two modes of interaction. The most recent LLMs can perform various complex tasks, process long inputs in the range of millions of words and even integrate with external tools (e.g. search engines), all while interacting fully in a conversational mode – of course, you’ve already seen this in action!
Naturally, this era yet again changed the focus for researchers in the continual learning field. Many simple tasks, such as object recognition in images or intent recognition in text, no longer required the network to be optimised at all. The question now became how a given pre-trained LLM could be injected with newer capabilities, such as mathematical reasoning, or aligned to human-preferred ways of responding. The objective of continual learning in a DNN thus shifted from accumulating abilities for simple tasks to accumulating higher-level functionalities. This era also revisited the stability–plasticity trade-off through a new lens – the pre-trained knowledge needed to be kept intact while the full DNN was optimised to incorporate new abilities or human preferences. Since pre-training data is inaccessible for many models, replay methods were often infeasible, and even with data access, the sheer scale of the pre-training knowledge and model size made regularisation approaches difficult (e.g. it’s hard to estimate importance scores). As a result, research either made assumptions about the pre-training data or innovated with new data-free methods.
Key Papers:
- Y. Lin, H. Lin, W. Xiong, S. Diao et al. (2024). “Mitigating the Alignment Tax of RLHF”: This paper focused on cases where a pre-trained model is fine-tuned on a dataset of human preferences using the Reinforcement Learning from Human Feedback (RLHF) technique. The paper discovered that averaging the parameter values of the pre-trained model and the RLHF fine-tuned model is a simple and effective technique to avoid CF of pre-trained abilities while incorporating the preference data.
- Srivastava et al. (2025). “Improving Multimodal Large Language Models Using Continual Learning”: This paper investigated the CF of linguistic abilities when an LLM is fine-tuned to incorporate a new modality (vision, in this case), finding that regularisation can help mitigate CF. Interestingly, the multi-modal fine-tuning was found to impact language understanding abilities positively (i.e. achieving KT in language through vision-language tasks) while causing CF of language generation abilities in base pre-trained models. On the other hand, instruction tuned models were found to be more resilient to CF overall.
- Abbes et al. (2025). “Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models”: This paper focused on the scenario when an LLM is continually trained to accumulate new languages, finding that replay and gradient projection methods can help mitigate CF. The paper provides useful recommendations regarding compute efficiency when deciding between high replay rates and increased model size.
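The weight-averaging remedy studied by Lin et al. reduces to element-wise interpolation between two checkpoints. Below is a minimal sketch over flat per-layer weight lists; real models would interpolate full tensors layer by layer, and the function name is my own:

```python
def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Element-wise interpolation between two model checkpoints.

    alpha=0 recovers the pre-trained model (no alignment, no CF);
    alpha=1 recovers the fully fine-tuned model (full alignment, full CF);
    intermediate values trade the two off without any extra training.
    """
    return {
        name: [(1 - alpha) * p + alpha * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }
```

The appeal of this approach in the LLM era is that it is entirely data-free: it needs neither the pre-training data (for replay) nor importance scores (for regularisation), only the two sets of weights.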
Tasks:
- Continual capability injection: Each task introduces new functionalities to the model. E.g. Language modelling -> Language modelling + Mathematical reasoning -> Language modelling + Mathematical reasoning + Programming
- Modality fusion: Each task introduces a new modality of operation to the model. E.g. Language modelling -> Language modelling + Image modelling
- Preference alignment: Each task introduces a new set of human preferences.
In the next part…
In the next part of this article, I discuss some ongoing research and where I think the next era of continual learning is headed.