Publications

Conferences:

NeurIPS

Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers

Vasan, G., Elsayed, M., Azimi, S. A., He, J., Shahriar, F., Bellinger, C., White, M., & Mahmood, A. R.,

Neural Information Processing Systems (NeurIPS), 2024

PDF arXiv Code Abstract

Modern deep policy gradient methods achieve effective performance on simulated robotic tasks, but they all require large replay buffers or expensive batch updates, or both, making them incompatible with real systems that have resource-limited computers. We show that these methods fail catastrophically when limited to small replay buffers or during incremental learning, where updates only use the most recent sample without batch updates or a replay buffer. We propose a novel incremental deep policy gradient method, Action Value Gradient (AVG), and a set of normalization and scaling techniques to address the challenges of instability in incremental learning. On robotic simulation benchmarks, we show that AVG is the only incremental method that learns effectively, often achieving final performance comparable to batch policy gradient methods. This advancement enabled us to show for the first time effective deep reinforcement learning with real robots using only incremental updates, employing a robotic manipulator and a mobile robot.

RLC

Weight Clipping for Deep Continual and Reinforcement Learning

Elsayed, M., Lan, Q., Lyle, C., & Mahmood, A. R.,

Reinforcement Learning Conference (RLC), 2024

PDF arXiv Code Abstract

Many failures in deep continual and reinforcement learning are associated with increasing magnitudes of the weights, making them hard to change and potentially causing overfitting. While many methods address these learning failures, they often change the optimizer or the architecture, a complexity that hinders widespread adoption in various systems. In this paper, we focus on learning failures that are associated with increasing weight norm and we propose a simple technique that can be easily added on top of existing learning systems: clipping neural network weights to limit them to a specific range. We study the effectiveness of weight clipping in a series of supervised and reinforcement learning experiments. Our empirical results highlight the benefits of weight clipping for generalization, addressing loss of plasticity and policy collapse, and facilitating learning with a large replay ratio.
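As a rough illustration of the idea in this abstract, clipping can be applied right after an ordinary gradient step. The sketch below is not the paper's implementation: the function names, the plain SGD update, and the single symmetric `bound` are assumptions made for illustration only.

```python
def clip_weights(weights, bound):
    """Limit every weight to the range [-bound, bound]."""
    return [max(-bound, min(bound, w)) for w in weights]


def sgd_step_with_clipping(weights, grads, lr, bound):
    """One plain SGD step followed by weight clipping.

    `lr` and `bound` are illustrative hyperparameters, not values
    taken from the paper.
    """
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return clip_weights(updated, bound)


# Usage: two weights would leave the range and are clipped back to it.
new_weights = sgd_step_with_clipping(
    [0.9, -0.9, 0.0], [-5.0, 5.0, 0.0], lr=0.1, bound=1.0
)
```

Because the clipping happens after the update, a step like this can sit on top of an existing optimizer without changing it, which is the property the abstract emphasizes.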

ICML

Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning

Elsayed, M., Farrahi, H., Dangel, F., & Mahmood, A. R.,

International Conference on Machine Learning (ICML), 2024

PDF arXiv Code Abstract

Second-order information is valuable for many applications but challenging to compute. Several works focus on computing or approximating Hessian diagonals, but even this simplification introduces significant additional costs compared to computing a gradient. In the absence of efficient exact computation schemes for Hessian diagonals, we revisit an early approximation scheme proposed by Becker and LeCun (1989, BL89), which has a cost similar to gradients and appears to have been overlooked by the community. We introduce HesScale, an improvement over BL89, which adds negligible extra computation. On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute. We use this insight in reinforcement learning problems where small networks are used and demonstrate HesScale in second-order optimization and scaling the step-size parameter. In our experiments, HesScale optimizes faster than existing methods and improves stability through step-size scaling. These findings are promising for scaling second-order methods in larger models in the future.

ICLR

Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning

Elsayed, M., & Mahmood, A. R.,

International Conference on Learning Representations (ICLR), 2024

PDF arXiv Code Abstract

Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues.
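The update described in this abstract (smaller modifications for more useful units, larger ones for less useful units) can be sketched as a perturbed gradient step scaled by one minus a utility score. Everything specific here is an assumption for illustration: the abstract does not say how utility is computed, so the sketch takes per-weight utilities in [0, 1] as an input, uses Gaussian noise, and uses hypothetical names throughout.

```python
import random


def upgd_style_step(weights, grads, utilities, lr, noise_std, rng):
    """Perturbed gradient step gated by utility (illustrative sketch).

    Each weight moves by lr * (gradient + noise) * (1 - utility), so a
    utility of 1 freezes the weight and a utility of 0 applies the full
    perturbed update. `utilities` is assumed to lie in [0, 1] and to be
    supplied by some separate estimator that is not shown here.
    """
    new_weights = []
    for w, g, u in zip(weights, grads, utilities):
        noise = rng.gauss(0.0, noise_std)
        new_weights.append(w - lr * (g + noise) * (1.0 - u))
    return new_weights


# Usage: the first weight is fully protected, the second is fully updated.
rng = random.Random(0)
stepped = upgd_style_step(
    [1.0, 1.0], [2.0, 2.0], [1.0, 0.0], lr=0.5, noise_std=0.0, rng=rng
)
```

With the noise switched off, as in the usage line, the step reduces to gradient descent gated by utility; a nonzero `noise_std` adds the perturbation that the abstract credits with rejuvenating less useful units.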

Workshops and Preprints:

ICML

Multi-stream Sequence Learning

Elsayed, M., & Mahmood, A. R.,

ICML Workshop on Efficient Systems for Foundation Models, 2025

PDF OpenReview Abstract

We re-evaluate the suitability of the independent and identically distributed (IID) training paradigm for sequence learning, where long data streams are segmented into shorter and shuffled chunks, thereby breaking their natural continuity and undermining long-range credit assignment. This paper offers multi-stream sequence learning, a training framework that presents multiple data streams in their natural order. To support this framework, we propose Memora, a recurrent-only architecture whose persistent hidden states make it more suitable for sequence learning than Transformers. Memora builds on the Gated Linear Recurrent Unit (GLRU), a new lightweight recurrent unit designed for efficient parallel training and robust temporal reasoning, and achieves effective learning on long byte-level sequences. Our experiments on structured and byte-level benchmarks demonstrate that models trained under the multi-stream sequence learning framework consistently outperform standard recurrent and state-space models trained in the IID setting, underscoring the importance of preserving continuity in sequence learning.

arXiv

Streaming Deep Reinforcement Learning Finally Works

Elsayed, M., Vasan, G., & Mahmood, A. R.,

arXiv preprint arXiv:2410.14606, 2024

PDF arXiv Code Abstract

Natural intelligence processes experience as a continuous stream, sensing, acting, and learning moment-by-moment in real time. Streaming learning, the modus operandi of classic reinforcement learning (RL) algorithms like Q-learning and TD, mimics natural learning by using the most recent sample without storing it. This approach is also ideal for resource-constrained, communication-limited, and privacy-sensitive applications. However, in deep RL, learners almost always use batch updates and replay buffers, making them computationally expensive and incompatible with streaming learning. Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as the stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome the stream barrier for both prediction and control and match the sample efficiency of batch RL. Through experiments in MuJoCo Gym, DM Control Suite, and Atari Games, we demonstrate the stream barrier in existing algorithms and successful stable learning with our stream-x algorithms: stream Q, stream AC, and stream TD, achieving the best model-free performance in DM Control Dog environments. A set of common techniques underlies the stream-x algorithms, enabling their success with a single set of hyperparameters and allowing for easy extension to other algorithms, thereby reviving streaming RL.

NeurIPS

Deep Reinforcement Learning Without Experience Replay, Target Networks, or Batch Updates

Elsayed, M., Vasan, G., & Mahmood, A. R.,

NeurIPS Workshop on Fine-Tuning in Modern Machine Learning, 2024

PDF Abstract

Natural intelligence processes experience as a continuous stream, sensing, acting, and learning moment-by-moment in real time. Streaming learning, the modus operandi of classic reinforcement learning (RL) algorithms like Q-learning and TD, mimics natural learning by using the most recent sample without storing it. This approach is also ideal for resource-constrained, communication-limited, and privacy-sensitive applications. However, in deep RL, learners almost always use batch updates and replay buffers, making them computationally expensive and incompatible with streaming learning. Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as the stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome the stream barrier for both prediction and control and match the sample efficiency of batch RL. Through experiments in MuJoCo Gym, DM Control, and Atari Games, we demonstrate the stream barrier in existing algorithms and successful stable learning with our stream-x algorithms: stream Q, stream AC, and stream TD, achieving the best model-free performance in DM Control Dog environments. A set of common techniques underlies the stream-x algorithms, enabling their success with a single set of hyperparameters and allowing for easy extension to other algorithms, thereby reviving streaming RL.

NeurIPS

Utility-based Perturbed Gradient Descent: An Optimizer for Continual Learning

Elsayed, M., & Mahmood, A. R.,

NeurIPS Workshop on Optimization for Machine Learning, 2023

PDF Abstract

Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We adopt the challenging setup of streaming learning as the testing ground and design continual learning problems with hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses all methods in all problems, being demonstrably capable of addressing both issues.

NeurIPS

HesScale: Scalable Computation of Hessian Diagonals

Elsayed, M., & Mahmood, A. R.,

NeurIPS Workshop on Higher-Order Optimization in Machine Learning, 2022

PDF Abstract

Second-order optimization uses curvature information about the objective function, which can help in faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their usage in a scalable way. The absence of efficient ways of computation drove the most widely used methods to focus on first-order approximations that do not capture the curvature information. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization.

NeurIPS

ULTRA: A reinforcement learning generalization benchmark for autonomous driving

Elsayed, M., Hassanzadeh, K., Nguyen, N. M., Alban, M., Zhu, X., Graves, D., & Luo, J.,

NeurIPS Workshop on Machine Learning for Autonomous Driving, 2020

PDF Abstract

The unprotected left turn is one of the most difficult problems in real-world autonomous driving. Its difficulty is due to the diverse and hard-to-predict interactions among possibly many road participants in the absence of traffic lights. To help address this challenge, we developed a new benchmark called ULTRA (Unprotected Left Turn for Robust Agents). ULTRA offers controllable diversity and a way to measure the generalization performance of agents. It is also readily expandable as more scenarios and behavior models are developed and incorporated. Unlike prior benchmarks, ULTRA is explicitly focused on providing a rich diversity of interaction scenarios. In this way, it challenges the RL community to develop algorithms and driving policies that generalize better and are thereby more suitable for real-world autonomous driving.

Theses

Thesis

Investigating Generate and Test for Online Representation Search with Softmax Outputs

Elsayed, M.,

Master's Thesis, University of Alberta, 2022

PDF Abstract

Modern representation learning methods perform well on offline tasks and primarily revolve around batch updates. However, batch updates preclude those methods from focusing on new experience, which is essential for fast online adaptation. In this thesis, we study an online and incremental representation search algorithm called Generate and Test, which continually replaces the least useful features with newly generated features. In this algorithm, the utility of features is estimated by a heuristic tester based on the magnitude of their corresponding outgoing weights; the least useful features are those with the smallest weight magnitudes. Generate and Test was developed and evaluated only on single-output regression problems; it has not been investigated in multi-output regression problems. Moreover, it is not clear that magnitude-based testers are appropriate for other outputs such as softmax. In this thesis, we investigate Generate and Test in these new cases and introduce testbeds for online representation learning in multi-output regression, classification, and reinforcement learning environments with discrete action spaces. We show that magnitude-based feature utility may give wrong estimates of the utility when softmax outputs are used, for example, in classification and discrete control tasks. We propose a new tester to extend the scope of the Generate and Test algorithm to these cases. We empirically show that this new tester can improve representations better than the magnitude-based tester. Thus, ours is the first work to make the Generate and Test algorithm applicable beyond supervised regression tasks.
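The magnitude-based tester described in this abstract can be sketched for the single-output case as follows. This is only an illustration of that baseline, not the new tester proposed in the thesis: the function names, the uniform re-initialization of incoming weights, and the choice to give a fresh feature an outgoing weight of zero are all assumptions.

```python
import random


def least_useful(outgoing_weights, k):
    """Indices of the k features with the smallest |outgoing weight|."""
    utilities = [abs(w) for w in outgoing_weights]
    order = sorted(range(len(utilities)), key=lambda i: utilities[i])
    return order[:k]


def generate_and_test_step(input_weights, outgoing_weights, k, rng):
    """Replace the k least useful features with newly generated ones.

    `input_weights[i]` holds the incoming weights of feature i, and
    `outgoing_weights[i]` is its single outgoing weight. Both lists are
    modified in place.
    """
    for i in least_useful(outgoing_weights, k):
        # Generate: draw fresh incoming weights (assumed uniform init).
        input_weights[i] = [rng.uniform(-1.0, 1.0) for _ in input_weights[i]]
        # Assumed choice: a new feature starts with no effect on the output.
        outgoing_weights[i] = 0.0
    return input_weights, outgoing_weights


# Usage: features 1 and 3 have the smallest magnitudes and are replaced.
inp = [[1.0], [2.0], [3.0], [4.0]]
out = [0.5, -0.05, 2.0, 0.1]
generate_and_test_step(inp, out, k=2, rng=random.Random(0))
```

A practical version would also need to protect newly generated features from immediate replacement, since a zero outgoing weight makes them look least useful; the sketch omits that.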