Safety Monitoring of Deep Reinforcement Learning Agents

Amirhossein Zolfagharian, Manel Abdellatif, Lionel C. Briand, S. Ramesh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Problem. Deep Reinforcement Learning (DRL) algorithms are increasingly being used in safety-critical systems. Ensuring the safety of DRL agents is a critical concern in such contexts. However, relying solely on testing is not sufficient to ensure safety, as it does not offer guarantees. Building safety monitors is one solution to alleviate this challenge. Existing safety monitoring techniques for regular software systems often rely on formal verification to ensure compliance with safety constraints [4]. However, when it comes to DRL policies, formally verifying that their behavior satisfies safety properties is an NP-complete problem [6]. Further, monitoring DRL agents in a black-box manner is practically important, as testers and safety engineers often do not have full access to the internals or the training dataset of the DRL agent [2, 8].

Approach. We propose SMARLA, a machine learning-based safety monitoring approach designed for DRL agents, whose goal is to predict safety violations at runtime, as early as possible, by monitoring the behavior of the agent. For practical reasons, SMARLA is designed as a black-box monitoring approach, since it does not require access to the internals of the agent. SMARLA uses machine learning to predict safety violations, accurately and early, during the execution of DRL agents. It currently focuses on Q-learning algorithms, a widely used class of RL algorithms [5]. Figure 1 provides an overview of SMARLA. SMARLA predicts safety violations using a machine learning (ML) model (i.e., Random Forest) that relies on the agent's states throughout episodes as features. We need a lightweight ML model that can classify RL episodes as safe or unsafe and be effectively deployed on resource-constrained edge devices. We therefore exclude DNN models and choose Random Forest because of (1) its ability to handle a large number of features, (2) its efficiency in providing prediction results, and (3) its proven robustness to overfitting, as reported in the literature [8]. We also conducted ablation studies with other ML models, such as KNN, SVM, and Decision Trees; however, Random Forest provided the most accurate predictions. To train the ML model, we randomly execute the RL agent and extract the generated episodes, which are labeled as either safe or unsafe. Since the state space associated with DRL agents is very large, we rely on state abstraction [1] to reduce it by grouping similar concrete states, thus increasing the ability of the ML model to learn to predict violations. We then represent each episode as a binary feature vector that encodes the presence (1) or absence (0) of each abstract state within the episode. After training, the model monitors the behavior of the agent and estimates the probability of encountering an unsafe state while an episode is being executed. We rely on the confidence intervals of this probability to accurately determine the optimal time step to trigger safety mechanisms. Figure 2 shows example estimated probabilities of safety violations along some episodes. Confidence intervals are computed at each time step t [3], from which the upper bound Up(t) and lower bound Low(t) are derived. At time step t, when the upper bound Up(t) of the confidence interval exceeds a given threshold, SMARLA classifies the episode as unsafe.
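To make the monitoring pipeline more concrete, the sketch below (Python with scikit-learn) illustrates the two ingredients just described: encoding a (partial) episode as a binary abstract-state feature vector, and the conservative decision rule based on the upper bound of a confidence interval around the predicted violation probability. Everything here is an illustrative assumption rather than SMARLA's actual implementation: the hash-based abstraction stand-in, the feature-vector size, the number of trees, and in particular the way the confidence interval is derived (a normal approximation over per-tree votes).

```python
# Minimal sketch of SMARLA-style monitoring (illustrative assumptions only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_ABSTRACT_STATES = 500   # hypothetical size of the abstract state space
Z = 1.96                  # 95% confidence level (assumed)

def abstract_state(concrete_state) -> int:
    """Stand-in for the state abstraction step [1]: the paper groups similar
    concrete states; here we simply hash a coarsely discretized state vector
    into a fixed number of buckets."""
    rounded = tuple(np.round(np.asarray(concrete_state, dtype=float), 1))
    return hash(rounded) % N_ABSTRACT_STATES

def episode_features(states) -> np.ndarray:
    """Binary feature vector: 1 if an abstract state occurs in the (partial)
    episode, 0 otherwise."""
    x = np.zeros(N_ABSTRACT_STATES)
    for s in states:
        x[abstract_state(s)] = 1.0
    return x

def train_monitor(X_train, y_train) -> RandomForestClassifier:
    """Offline step: fit the Random Forest on feature vectors of randomly
    generated episodes labeled safe (0) or unsafe (1)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    return clf

def should_trigger_safety_mechanism(clf, partial_states, threshold=0.5) -> bool:
    """Online step at time step t: re-encode the partial episode, estimate the
    violation probability, and flag the episode as unsafe when the upper bound
    Up(t) of the confidence interval exceeds the threshold."""
    x = episode_features(partial_states).reshape(1, -1)
    # Per-tree probabilities of the "unsafe" class (class index 1).
    votes = np.array([tree.predict_proba(x)[0, 1] for tree in clf.estimators_])
    p = votes.mean()
    half_width = Z * votes.std(ddof=1) / np.sqrt(len(votes))
    upper = min(1.0, p + half_width)   # Up(t); conservative decision rule
    return upper >= threshold
```

In this reading, deciding on Up(t) rather than on the mean probability makes the monitor flag potentially unsafe episodes earlier, at the cost of more false alarms, which matches the conservative stance described next.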
The threshold may be adjusted depending on the agent and the specific context. We rely on the upper bound because, to ensure safety, we take a conservative approach and err on the side of caution.

Evaluation. To evaluate the effectiveness of SMARLA, we implemented our safety monitor for two well-known and widely used benchmark problems in the RL literature: (1) the Cart-Pole problem, where a pole is attached to a cart, as shown in Figure 3. The objective is to keep the pole upright by moving the cart. A safety violation occurs when the cart passes the border of the environment, which can cause damage and is therefore considered unsafe. (2) The Mountain-Car problem, where an under-powered car is placed in a valley between two hills, as illustrated in Figure 3. The objective is to control the car and build enough momentum to reach the goal state on top of the right hill as soon as possible. A safety violation is simulated by treating the crossing of the left border of the environment as an irrecoverable unsafe state that poses potential damage to the car.

Results. Our empirical evaluation is designed to answer the following research questions.
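As an illustration of how training episodes could be labeled for these two benchmarks, the sketch below assumes Gymnasium's CartPole-v1 and MountainCar-v0 observation layouts (cart/car position as the first observation component), plausible border values, and a generic policy interface; the exact environments, border positions, and agent interface used in the paper may differ.

```python
# Hypothetical episode-labeling helpers for the two benchmarks
# (assumed environment names, borders, and policy interface).
import gymnasium as gym

CARTPOLE_BORDER = 2.4      # assumed |cart position| marking the environment border
MOUNTAINCAR_LEFT = -1.2    # assumed left border of the valley

def is_unsafe_cartpole(observation) -> bool:
    # observation[0] is the cart position in CartPole-v1
    return abs(observation[0]) >= CARTPOLE_BORDER

def is_unsafe_mountaincar(observation) -> bool:
    # observation[0] is the car position in MountainCar-v0
    return observation[0] <= MOUNTAINCAR_LEFT

def label_episode(env_name: str, policy, max_steps: int = 500) -> int:
    """Run one episode with the agent's policy and return 1 (unsafe) if any
    visited state violates the safety property, else 0 (safe)."""
    env = gym.make(env_name)
    obs, _ = env.reset()
    unsafe = is_unsafe_cartpole if "CartPole" in env_name else is_unsafe_mountaincar
    label = 0
    for _ in range(max_steps):
        action = policy(obs)   # hypothetical interface to the trained DRL agent
        obs, _, terminated, truncated, _ = env.step(action)
        if unsafe(obs):
            label = 1
            break
        if terminated or truncated:
            break
    env.close()
    return label
```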

Original language: English
Title of host publication: Proceedings - 2024 ACM/IEEE 46th International Conference on Software Engineering
Subtitle of host publication: Companion, ICSE-Companion 2024
Publisher: IEEE Computer Society
Pages: 286-287
Number of pages: 2
ISBN (Electronic): 9798400705021
DOIs
Publication status: Published - 14 Apr 2024
Event: 46th International Conference on Software Engineering: Companion, ICSE-Companion 2024 - Lisbon, Portugal
Duration: 14 Apr 2024 - 20 Apr 2024

Publication series

Name: Proceedings - International Conference on Software Engineering
ISSN (Print): 0270-5257

Conference

Conference: 46th International Conference on Software Engineering: Companion, ICSE-Companion 2024
Country/Territory: Portugal
City: Lisbon
Period: 14/04/24 - 20/04/24
