Continual Learning: The Eras

(Part 2: The Present and Future)

In the previous part of this article, I discussed the various past eras of continual learning. In this part, I discuss where we are at present, and put forward some arguments for what I think are the important research directions for the next era.

The Next Era: Towards Artificial General Intelligence? Mammoth or Modular?

2026 onwards

With large pre-trained models having achieved incredible reasoning abilities, multi-modality and online tool usage, an interesting question is – how close are we to achieving Artificial General Intelligence (AGI), i.e. AI that matches or surpasses human abilities in cognitive tasks? Let’s approach this question through the lens of the human ability to continually learn – i.e. our ability to continually adapt to the environment by acquiring new understanding and, crucially, updating that understanding as we encounter new experiences. Consider, as an example, an AGI agent whose goal is scientific discovery, i.e. an agent capable of designing and conducting experiments and discovering new and useful insights from the resulting data. Indeed, current models already have the ability to process a multi-modal input – such as experimental data and text describing the experiment – and reason over that input – such as identifying patterns in the data. This shows the extent to which current models can cognitively process a given experience (i.e. the input). In my opinion, the next challenge towards AGI lies in establishing how these individual experiences are to be tracked over time and how they contribute towards updating the agent’s understanding – e.g. how do new evidence and interactions on a topic change the agent’s subsequent actions?

An important question that arises for this next era is whether research should now move towards a mammoth Deep Neural Network (DNN) agent with large amounts of memorised knowledge and abilities stored in its parameters, or a modular approach with a mix of DNNs and other deterministic software. Each approach has its own merits and challenges. The mammoth approach has already shown great promise – i.e. large DNN architectures are capable of high-level cognitive reasoning. The modular approach, on the other hand, allows a separation of distinct capabilities across dedicated DNNs/software, thus mitigating the stability-plasticity issue when updating model parameters. However, it requires significant engineering effort and careful design of what the constituent modules are and how they interact.

With these ideas in mind, let’s look at some specific angles through which one can approach research in this new era of continual learning:

Isolating reasoning from memory:

In order to achieve the ability to ‘learn from each experience’, a prerequisite is the ability to update prior knowledge based on new experience. The billions of parameters in current models help achieve both incredible reasoning and the memorisation of a large number of independent facts. However, since the memorised facts are stored diffusely across the parameters, it becomes difficult to accurately inject a new piece of knowledge, or to update parameters when the memorised knowledge is outdated/incorrect (leading to so-called knowledge conflicts). An important research direction, therefore, is to disentangle the core reasoning abilities from the memorised facts and associations stored in the model parameters. This has so far involved post-training techniques, such as model editing and retrieval-augmented generation (RAG). Model editing aims to update specific facts by modifying the parameters directly, but does not guarantee accuracy. RAG brings in contextually appropriate knowledge for each input using external sources as memory, but can suffer from knowledge conflicts and retrieval-accuracy issues. Therefore, I think some exciting research questions for this new era are – how can we design model architectures and training objectives that still benefit from large-scale pre-training (i.e. enable reasoning), but treat factual tokens/entities as being separate from linguistic/reasoning tokens during the inference stage? Could neuro-symbolic approaches to pre-training be a feasible alternative paradigm? What are the scaling laws for such alternative approaches – can we achieve comparable reasoning performance to current DNNs but with fewer parameters?
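To make the RAG idea concrete, here is a minimal sketch of the pattern, with facts living in an external store that can be edited without touching model parameters. The `embed` and `generate` functions are toy stand-ins of my own (a real system would use a learned text encoder and an actual LLM), so treat this as an illustration of the separation, not a reference implementation.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding: hash each word into a fixed-size vector.
    A real system would use a learned text encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def generate(prompt: str) -> str:
    """Hypothetical LLM call, stubbed so the sketch runs end to end."""
    return f"[LLM response conditioned on]\n{prompt}"

class ExternalMemory:
    """Facts live here, outside the model parameters, so updating
    knowledge means editing documents rather than editing weights."""
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.vectors = np.stack([embed(d) for d in documents])

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        scores = self.vectors @ embed(query)      # cosine similarity (unit vectors)
        top = np.argsort(scores)[::-1][:k]
        return [self.documents[i] for i in top]

def rag_answer(query: str, memory: ExternalMemory) -> str:
    context = "\n".join(memory.retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

memory = ExternalMemory(["Gallium melts at 29.8 degrees Celsius.",
                         "Gallium was discovered in 1875."])
print(rag_answer("What is the melting point of gallium?", memory))
```

Note that the retrieval step is exactly where the issues mentioned above arise: if the wrong documents score highest, or if a retrieved document contradicts what the model has memorised in its parameters, the knowledge conflict is merely relocated, not resolved.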

Storing, retrieving and updating memory:

A parallel direction to be pursued towards the goal of an agent that ‘learns from each experience’ is determining how experiences themselves are stored, and when a set of experiences represents a novel piece of knowledge for the agent. There is already a trend towards persisting an agent’s interactive sessions as its memory in digital personal assistant agents – e.g. the idea behind “memory.md” in OpenClaw or Hermes Agent. Hermes Agent even enables self-improvement: when there are errors or human-suggested corrections, it analyses traces of workflows (an example ‘workflow’ is monitoring websites and notifying the user when a pre-specified event occurs) and persists the corrected workflows. Other recent work towards such self-evolving agents focuses on iteratively evolving the prompt and executable code for performing language, math or coding tasks, where the underlying model itself remains unmodified. Currently, the experience storage in such systems appears to be largely text-based (including math and code) and the self-improvement loop is focused on externally specified objectives. Therefore, the questions that I think need to be addressed next are – how can we design agents capable of distilling abstract concepts or generalisable knowledge from a series of experiences? What kind of memory hierarchies will support this – in particular, how should we store multimodal agent experiences? This is an exciting area with different branches depending on whether a mammoth or modular approach is taken. Naturally, this area can benefit from lessons learnt from neuroscience research, giving it an interesting cross-disciplinary angle.
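As a toy illustration of what ‘distilling generalisable knowledge from a series of experiences’ might look like structurally, here is a minimal two-tier memory sketch: raw episodes are recorded verbatim and periodically consolidated into more durable abstract notes, loosely echoing the episodic/semantic distinction from neuroscience. The `distil` function is a hypothetical stand-in for an LLM summarisation call, and a real design would need far more care about what to keep, forget and index.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Episode:
    content: str                                  # raw interaction trace (text-only here)
    timestamp: float = field(default_factory=time.time)

def distil(episodes: list[Episode]) -> str:
    # Hypothetical abstraction step: a real agent would prompt an LLM to
    # extract the generalisable lesson from the raw traces.
    return f"lesson distilled from {len(episodes)} episodes"

class AgentMemory:
    def __init__(self, consolidate_every: int = 5):
        self.episodic: list[Episode] = []         # raw, high-fidelity, short-lived
        self.semantic: list[str] = []             # distilled, durable knowledge
        self.consolidate_every = consolidate_every

    def record(self, content: str) -> None:
        self.episodic.append(Episode(content))
        if len(self.episodic) >= self.consolidate_every:
            # Consolidation: keep the lesson, discard the raw detail.
            self.semantic.append(distil(self.episodic))
            self.episodic.clear()
```

Even in this toy form, the open questions surface immediately: the consolidation trigger is a fixed count rather than a measure of novelty, and nothing here extends to multimodal experiences.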

Related to this is the explore-vs-exploit trade-off – if an agent were to initiate actions towards acquiring new experiences (as opposed to merely responding to prompts, or even performing structured self-evolution tasks), then what should be the principles that guide its actions? An interesting question is – how do we enable an agent to design effective and evolving objectives for itself over time?
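For a sense of what a principled starting point could look like, here is a minimal sketch that frames ‘which objective to pursue next’ as a bandit problem and applies the classic UCB1 rule. The objectives and the scalar reward signal below are placeholder assumptions on my part – defining what ‘reward’ even means for self-generated objectives is precisely the open question.

```python
import math

class ObjectiveSelector:
    """UCB1 over a fixed set of candidate objectives."""
    def __init__(self, objectives: list[str]):
        self.objectives = objectives
        self.counts = {o: 0 for o in objectives}    # times each objective was pursued
        self.values = {o: 0.0 for o in objectives}  # running mean reward

    def select(self) -> str:
        for o in self.objectives:                   # try every objective once first
            if self.counts[o] == 0:
                return o
        total = sum(self.counts.values())
        # Exploit high-value objectives, but keep an exploration bonus that
        # shrinks as an objective is tried more often.
        return max(self.objectives, key=lambda o: self.values[o]
                   + math.sqrt(2 * math.log(total) / self.counts[o]))

    def update(self, objective: str, reward: float) -> None:
        self.counts[objective] += 1
        n = self.counts[objective]
        self.values[objective] += (reward - self.values[objective]) / n

selector = ObjectiveSelector(["explore new data", "refine model", "verify stored facts"])
for _ in range(20):
    objective = selector.select()
    reward = 1.0 if objective == "refine model" else 0.3   # toy feedback signal
    selector.update(objective, reward)
```

UCB1 gives one formal answer to ‘when should the agent stop exploiting and go looking for new experiences’ – though it presumes a fixed, enumerable set of objectives, which a genuinely self-evolving agent would not have.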

Physical world modelling:

Now, what about agents that can move in the physical world? How will such an agent accumulate knowledge from new sensory experiences? While research in physical world modelling had previously remained a largely independent strand, the generative pre-training era opened up new opportunities to merge this area with mainstream language/image/video generative models – e.g. generating large-scale video content to act as training data for robots moving in the physical world. So we now have, on the one hand, multi-modal reasoning models that can generate physically plausible action sequences (i.e. text commands/video simulations) and, on the other hand, models that can convert a text command/video simulation into movement in the physical world (i.e. embodied agents). The integration of physical movement with pre-trained knowledge/reasoning/simulation capabilities, all within one agent, does not seem far off in this next era. I believe what is required next is the creation of more sophisticated benchmarks in which to test and develop intelligent embodied agents that learn continually – e.g. measuring how an agent combines inputs at disparate points in time to form an accurate representation of a complex dynamic environment (such as a scientific lab) and of events in the environment (e.g. the progression of experiments over time, or whether unrelated events have affected ongoing experiments).

Compute efficiency for reasoning:

An agent deciding for itself which objectives to explore and which actions to perform – e.g. actions in the physical world, actions towards updating specific memory components, etc. – requires reasoning at every step. Reasoning in agents is now often achieved through chain-of-thought, where the model first generates the steps or notes for consideration before arriving at the final output, much like the way humans “think”. But the predominant strategy of generating tokens one at a time requires a lot of computation and also carries the risk that incorrect reasoning at the start can be hard to recover from. Diffusion models are emerging as an alternative paradigm for training LLMs, enabling them to generate multiple tokens at a time and making them more compute efficient (i.e. faster and cheaper). Naturally, this new way in which models “think” makes it necessary to investigate questions such as – is the chain-of-thought in diffusion models causal in the same way as in purely autoregressive models? How does this change the ways in which we rely on chain-of-thought (as an interpretability tool, as an accuracy guarantee, etc.)?
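The compute argument can be made concrete with a back-of-the-envelope sketch: autoregressive decoding needs one sequential forward pass per generated token, whereas a diffusion-style decoder runs a fixed number of denoising passes that each refine all tokens in parallel. The step counts below are illustrative assumptions, not measurements of any particular model.

```python
# Sequential-step count for generating `num_tokens` tokens under the two
# decoding paradigms. Illustrative only: per-pass cost is not equal, since
# a denoising pass touches the whole sequence, while an autoregressive pass
# (with KV caching) mostly computes for the newest token.

def autoregressive_passes(num_tokens: int) -> int:
    return num_tokens          # one forward pass per token, strictly in sequence

def diffusion_passes(num_tokens: int, denoising_steps: int = 32) -> int:
    return denoising_steps     # fixed number of passes, tokens refined in parallel

for n in (16, 256, 4096):
    print(f"{n:5d} tokens: autoregressive = {autoregressive_passes(n):5d} sequential passes, "
          f"diffusion ~ {diffusion_passes(n)} sequential passes")
```

The sequential-step gap is also why the causality question above matters: when many tokens are produced in the same pass, later ‘reasoning steps’ are no longer conditioned only on earlier ones.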

Data efficiency:

Regardless of whether we move towards a mammoth or a modular approach, current DNNs still require large-scale curated data for training. While this is feasible for general knowledge and reasoning, specialised knowledge is not always supported by large amounts of data. Therefore, investigating data-efficient training paradigms for DNNs, or novel approaches to leveraging a general reasoning model for specialised domains, will be interesting next directions of research. Along these lines, recent work involves iteratively fine-tuning the main agent with data generated by a secondary agent whose goal is to create progressively harder tasks that encourage the main agent to use external tools to complete tasks successfully. Another approach has been to fine-tune a model incrementally with data generated by the model itself, using objectives based on logical constraints that enable initially incorrect/inconsistent model predictions to be automatically corrected over time. Such approaches make use of noisy data and therefore require several iterations of data generation and training. This appears to me to transfer the problem of inefficiency from data to compute. An interesting line of research for me is – how do we design approaches that are both data and compute efficient?
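To illustrate where the compute cost accumulates in such self-generated-data loops, here is a deliberately toy sketch of the pattern: the model labels its own data, a logical-consistency check filters the pseudo-labels, and the model is re-trained on what survives. The toy model, the parity ‘constraint’ and the accuracy-bumping stand-in for fine-tuning are all my own assumptions, chosen so the loop runs end to end; the point is that every round repeats both generation and training.

```python
import random

def satisfies_constraints(x: int, y: bool) -> bool:
    # Toy 'logical constraint': the label must agree with x's parity.
    return y == (x % 2 == 0)

class ToyModel:
    def __init__(self, accuracy: float = 0.6):
        self.accuracy = accuracy

    def predict(self, x: int) -> bool:
        truth = (x % 2 == 0)
        return truth if random.random() < self.accuracy else not truth

def fine_tune(model: ToyModel, data: list) -> ToyModel:
    # Stand-in for gradient-based fine-tuning: each pass over the filtered
    # self-labelled data nudges the toy model's accuracy upward.
    return ToyModel(min(0.99, model.accuracy + 0.05 * bool(data)))

model, unlabelled = ToyModel(), list(range(100))
for _ in range(5):                    # every round repeats generation AND training
    pseudo = [(x, model.predict(x)) for x in unlabelled]
    clean = [(x, y) for x, y in pseudo if satisfies_constraints(x, y)]
    model = fine_tune(model, clean)
print(f"final accuracy: {model.accuracy:.2f}")
```

No external labels are consumed, which is the data-efficiency win, but the repeated generate-filter-train rounds are exactly where the inefficiency reappears as compute.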

Interpretability and Transparency:

Finally, while it is exciting to create agents that are independent and keep evolving and improving, it is also essential that we develop systems to keep track of, understand and intervene in their functioning. Establishing whether a model is “right for the right reasons” has always been an underlying driver of such interpretability research. Current investigations are directed towards understanding and ensuring the reliability of chain-of-thought, and towards mechanisms for tracking the underlying goals and belief states that influence model outputs. While these studies have provided several empirical insights, they are preliminary and limited in scope, and will therefore require substantial continued research effort. Additionally, a direction that appears to me to be underexplored is – how do we interpret chain-of-thought, goals and belief states, or some such equivalent, in the case of multimodal outputs?

Also, in the generative pre-training era, techniques like watermarking made it possible to differentiate LLM-generated content from human-generated content. With generative modelling evolving and models becoming increasingly multimodal, what new techniques do we need to ensure AI usage remains transparent and detectable?
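As a reference point for what ‘watermarking’ means in the text setting, here is a minimal detection sketch in the style of the well-known ‘green list’ scheme: generation softly favours a pseudo-random half of the vocabulary re-seeded by the previous token, and detection then counts how many tokens land in their green list. This toy version operates on words rather than subword tokens, and the hashing split is my own simplification.

```python
import hashlib
import math

def is_green(prev_token: str, token: str) -> bool:
    # Deterministically assign `token` to one half of the vocabulary, with
    # the green/red split re-seeded by the preceding token.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0

def watermark_z_score(tokens: list[str]) -> float:
    """z-score of the observed green-token count against the 50% expected
    by chance; large positive values suggest watermarked (machine) text.
    Assumes at least two tokens."""
    n = len(tokens) - 1                # number of (previous, current) pairs
    greens = sum(is_green(tokens[i], tokens[i + 1]) for i in range(n))
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)
```

The open question in the paragraph above is precisely that this construction is tied to discrete token sequences: for image, video or audio outputs, the vocabulary-split trick has no direct analogue, and detection must additionally survive transformations such as cropping or re-encoding.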

Concluding Thoughts

Overall, I think there are several fundamental open questions (e.g. alternative pre-training strategies, alternatives to autoregressive generation) that could lead us in directions different to the models of the generative pre-training era. Advances in one area are likely to be linked to those in another – for instance, data- and compute-efficient continual learning may require advances in memory organisation methods for modular architectures. Therefore, while it’s an exciting time to be doing research in AI (and particularly, continual learning), I believe we, as researchers, are also going to have to get out of our comfort zones more, as the field requires ever more collaboration across different areas of expertise!

The possible shift to a modular approach indicates to me that techniques previously investigated in smaller networks (e.g. replay, regularisation, parameter-masking) will be relevant again in different components of an AGI agent. Along these lines, an interesting aspect of research for me is whether an agent can automatically select/instantiate the appropriate continual learning technique with which to update its component parameters – since this could mean less hand-engineered design and more flexibility in adapting to different applications.
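As a reminder of what such a component-level technique looks like in code, here is a minimal sketch of experience replay with a fixed-size, reservoir-sampled buffer – one candidate an agent’s controller might instantiate for a given component (how that controller would choose between replay, regularisation or masking is exactly the open design question above).

```python
import random

class ReplayBuffer:
    """Fixed-size experience buffer using reservoir sampling, so the
    buffer stays a uniform sample over everything ever seen and old
    tasks remain represented as new data streams in."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, example) -> None:
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)     # classic Algorithm R
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k: int) -> list:
        # Mix replayed examples into each new-task training batch to
        # reduce catastrophic forgetting in the component being updated.
        return random.sample(self.buffer, min(k, len(self.buffer)))
```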