| authors | citekey | publish_date | pages | last_import |
|---|---|---|---|---|
| | lazarusRuntimeSafetyAssurance2020 | 2020-10-01 | 1-9 | 2025-07-21 |
Indexing Information
Published: 2020-10
DOI: 10.1109/DASC50938.2020.9256446
Tags: #Control-systems #Reinforcement-learning #Safety #Switches #Aerospace-control #Aircraft #Atmospheric-modeling #runtime-safety-assurance #Unmanned-Aerial-Systems-UAS
#InSecondPass
[!Abstract] The airworthiness and safety of a non-pedigreed autopilot must be verified, but the cost to formally do so can be prohibitive. We can bypass formal verification of non-pedigreed components by incorporating Runtime Safety Assurance (RTSA) as a mechanism to ensure safety. RTSA consists of a meta-controller that observes the inputs and outputs of a non-pedigreed component and verifies formally specified behavior as the system operates. When the system is triggered, a verified recovery controller is deployed. Recovery controllers are designed to be safe but very likely disruptive to the operational objective of the system, and thus RTSA systems must balance safety and efficiency. The objective of this paper is to design a meta-controller capable of identifying unsafe situations with high accuracy. High dimensional and non-linear dynamics in which modern controllers are deployed along with the black-box nature of the nominal controllers make this a difficult problem. Current approaches rely heavily on domain expertise and human engineering. We frame the design of RTSA with the Markov decision process (MDP) framework and use reinforcement learning (RL) to solve it. Our learned meta-controller consistently exhibits superior performance in our experiments compared to our baseline, human engineered approach.

[!seealso] Related Papers
Annotations
Notes
![[Runtime Safety Assurance Using Reinforcement Learning-Note]]
Highlights From Zotero
[!tip] Brilliant We can bypass formal verification of non-pedigreed components by incorporating Runtime Safety Assurance (RTSA) as a mechanism to ensure safety. RTSA consists of a meta-controller that observes the inputs and outputs of a non-pedigreed component and verifies formally specified behavior as the system operates. When the system is triggered, a verified recovery controller is deployed. 2025-07-08 9:37 am
[!tip] Brilliant Recovery controllers are designed to be safe but very likely disruptive to the operational objective of the system, and thus RTSA systems must balance safety and efficiency. 2025-07-08 9:37 am
[!highlight] Highlight Unfortunately, the cost to formally verify a nonpedigreed or black-box autopilot for a variety of vehicle types and use cases is generally prohibitive. 2025-07-08 9:44 am
[!highlight] Highlight In order for this mechanism to work, the system needs to be able to distinguish between safe scenarios under which the operation should remain controlled by πn and scenarios that would likely lead to unsafe conditions in which the control should be switched to πr. We assume that a recovery controller πr is given and this work does not focus on its design or implementation. 2025-07-08 9:46 am
[!highlight] Highlight The problem that we address in this work is determining how to decide when to switch from the nominal controller πn to the recovery controller πr while balancing the trade-off between safety and efficiency. 2025-07-08 9:46 am
[!warning] Dubious We postulate that the task of navigating an aircraft from an origin to a destination by following a pre-planned path is far more complex than the task of predicting whether the aircraft is operating safely within a short time horizon of a given point. 2025-07-08 9:47 am
[!done] Important The goal of RTSA systems is to guarantee the safe operation of a system despite having black-box components as a part of its controller. Safe operation is specified by an envelope E ⊂ S which corresponds to a subset of the state space S within which the system is expected to operate ideally. RTSA continuously monitors the state of the system and switches to the recovery controller if and only if not doing so would lead the system to exit the safety envelope. 2025-07-08 9:50 am
[!warning] Dubious It must switch to the recovery control πr whenever the aircraft leaves the envelope. 2025-07-08 9:51 am
I mean really there is an envelope E_r \subset E \subset S that is the region within the safety envelope that is recoverable... no?
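For my own reference, a minimal sketch of the switching rule described in the callouts above. All function names here are my own placeholders, not the paper's: control stays with πn until the meta-controller predicts an envelope exit, and since the paper restricts attention to terminal recovery controllers, the switch to πr is one-way.

```python
# Sketch of the RTSA switching logic (hypothetical names, not the paper's code):
# the meta-controller lets the nominal controller pi_n act unless it predicts
# that doing so would take the state outside the safety envelope E, in which
# case the verified recovery controller pi_r takes over for the rest of the
# episode (terminal recovery: never switch back).

def rtsa_step(state, pi_n, pi_r, predicts_envelope_exit, recovery_engaged):
    """Return the control action and the (possibly updated) switch flag."""
    if recovery_engaged or predicts_envelope_exit(state):
        return pi_r(state), True   # recovery engaged, stays engaged
    return pi_n(state), False      # nominal operation inside the envelope
```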
[!done] Important • Its implementation must be easily verifiable. This means it must avoid black box models that are hard to verify such as deep neural networks (DNNs)[3]. 2025-07-08 9:53 am
[!done] Important We model the evolution of the flight of an aircraft equipped with an RTSA system by defining the following MDP: M = (S, A, T, R) where the elements are defined below.
• State space S ∈ ℝ^p: a vector representing the state of the environment and the vehicle.
• Action space A ∈ {deploy, continue}: whether to deploy the recovery system or let the nominal controller remain in control.
• Transition T(s, a): a function that specifies the transition probability of the next state s′ given that action a was taken at step s; in our case this will be sampled from a simulator by querying f(s, a).
• Reward R(s, a, s′): the reward collected corresponding to the transition. This will be designed to induce the desired behavior. 2025-07-08 9:56 am
[!done] Important In this model, the RTSA system is considered the agent while the states correspond to the position, velocity and other relevant information about the aircraft with respect to the envelope and the actions correspond to deploying the recovery controller or not. The agent receives a large negative reward for abandoning the envelope and a smaller negative reward for deploying the recovery controller in situations where it was not required. This reward structure is designed to heavily penalize situations in which the aircraft exits the safety envelope and simultaneously disincentivize unnecessary deployments of the recovery controller. The rewards at each step are weighted by a discount factor γ < 1 such that present rewards are worth more than future ones. 2025-07-08 9:59 am
[!done] Important Additionally, we consider the nominal controller to be a black box. All of these conditions combined lead us to operate under the assumption that the transition function is unknown and we do not have access to it. We do, however, have access to simulators from which we can query experience tuples (s, a, r, s′) by providing a state s and an action a and fetching the next state s′ and associated reward r. In this setting we can learn a policy from experience with RL. 2025-07-08 10:00 am
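A minimal sketch of how I read the decision MDP and the experience-tuple querying described in the three callouts above. The reward magnitudes, method names and the simulator interface are my own assumptions, not from the paper.

```python
# Hypothetical sketch of the RTSA decision MDP as a simulator interface.
# The agent gets a large negative reward for exiting the envelope and a
# smaller one for an unnecessary deployment; experience tuples (s, a, r, s')
# come from querying a black-box simulator f(s, a).

EXIT_PENALTY = -1000.0    # assumed magnitude: leaving the envelope dominates
DEPLOY_PENALTY = -10.0    # assumed magnitude: needless deployment is cheaper
GAMMA = 0.99              # discount factor gamma < 1

def reward(action, exited_envelope, deployment_was_needed):
    if exited_envelope:
        return EXIT_PENALTY
    if action == "deploy" and not deployment_was_needed:
        return DEPLOY_PENALTY
    return 0.0

def experience_tuple(simulator, state, action):
    """Query the simulator for one transition and return (s, a, r, s').
    The simulator methods used here are assumed, not the paper's API."""
    next_state = simulator.step(state, action)  # samples s' ~ T(s, a)
    r = reward(action,
               exited_envelope=simulator.exited_envelope(next_state),
               deployment_was_needed=simulator.deployment_needed(state))
    return state, action, r, next_state
```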
[!fail] This ain't right RL has successfully been applied to many fields [8]–[10]. 2025-07-08 10:00 am
Citation stash lmao
[!done] Important Q-learning as described above enables the estimation of tabular Q-functions which are useful for small discrete problems. However, cyber-physical systems often operate in contexts which are better described with a continuous state space. The problem with applying tabular Q-learning to these larger state spaces is not only that they would require a large state-action table but also that a vast amount of experience would be required to accurately estimate the values. An alternative approach to handle continuous state spaces is to use Q-function approximation where the state-action value function is approximated which enables the agent to generalize from limited experience to states that have not been visited before and additionally avoids storing a gigantic table. Policies based on value function approximation have been successfully demonstrated in the aerospace domain before [12]. 2025-07-08 11:14 am
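For contrast with the function-approximation variant below, the standard textbook tabular Q-learning update (my own summary, not code from the paper) looks like this:

```python
from collections import defaultdict

# Textbook tabular Q-learning update, shown only to make the "large
# state-action table" limitation of the quote above concrete.
def tabular_q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q maps (state, action) -> value; only viable for small discrete spaces."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)   # the table that blows up for continuous state spaces
```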
[!warning] Dubious Instead, we will restrict our attention to linear value function approximation, which involves defining a set of features Φ(s, a) ∈ ℝ^m that captures relevant information about the state and then use these features to estimate Q by linearly combining them with weights θ ∈ ℝ^{m×|A|} to estimate the value of each action. In this context, the value function is represented as follows: Q(s, a) = ∑_{i=1}^{m} θ_i φ_i(s, a) = θ^T Φ(s, a) (9). Our learning problem is therefore reduced to estimating the parameters θ and selecting our features. Typically, domain knowledge will be leveraged to craft meaningful features Φ(s, a) = (φ_1(s, a), φ_2(s, a), ..., φ_m(s, a)), and ideally they would capture some of the geometric information relevant for the problem, e.g. in our setting, heading, velocity and distance to the geofence. The ideas behind the Q-learning algorithm can be extended to the linear value function approximation setting. Here, we initialize our parameters θ and update them at each transition to reduce the error between the predicted value and the observed reward. The algorithm is outlined below and it forms the basis of the learning procedure used in the experiments in Section III. 2025-07-08 11:18 am
This sounds like a constantly updating principal component analysis with some best fitting going on. Interesting, but if the reward is nonlinear (which seems likely?) isn't this problematic?
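A sketch of how the linear approximation in Eq. (9) plugs into a semi-gradient Q-learning update; the feature map, parameter shapes and learning rate are placeholders of mine, not the paper's. On the note above: the approximator is linear in the features, so any nonlinearity in the dynamics or reward has to be absorbed by how Φ itself is crafted, which is exactly why the paper stresses feature engineering.

```python
import numpy as np

# Linear Q-function approximation with a semi-gradient Q-learning update,
# following Eq. (9): Q(s, a) = theta^T Phi(s, a). phi(s, a) is a placeholder
# for hand-crafted geometric features (heading, velocity, distance to the
# geofence). Shapes and the learning rate are illustrative assumptions.

ACTIONS = ("continue", "deploy")

def q_value(theta, phi, s, a):
    """theta has shape (|A|, m); phi(s, a) returns a length-m feature vector."""
    return float(theta[ACTIONS.index(a)] @ phi(s, a))

def semi_gradient_update(theta, phi, s, a, r, s_next, alpha=0.01, gamma=0.99):
    """One Q-learning step on the weights of the action actually taken."""
    target = r + gamma * max(q_value(theta, phi, s_next, b) for b in ACTIONS)
    td_error = target - q_value(theta, phi, s, a)
    theta[ACTIONS.index(a)] += alpha * td_error * phi(s, a)
    return theta
```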
[!highlight] Highlight We restrict our function family to linear functions which can be easily understood and verified. A major drawback, however, is that linear functions are less expressive than DNNs, which makes their training more difficult and requires careful crafting of features. 2025-07-08 11:25 am
[!done] Important Despite the relative simplicity of linear value function approximators when compared to DNNs, we observed that they are able to capture relevant information about the environment and are well suited for this task. 2025-07-08 11:29 am
[!highlight] Highlight One approach to address this problem is to avoid or reduce the chance of random exploration. We dramatically increase the likelihood of observing episodes where the mission is completed successfully without exiting the envelope, but we also bias the learning process towards exploitation. It is well known that in order for a policy to converge under Q-learning, exploration must proceed indefinitely [15]. Additionally, in the limit of the number of steps, the learning policy has to be greedy with respect to the Q-function [15]. Accordingly, avoiding or dramatically reducing random exploration can negatively affect the learning process and should be avoided. 2025-07-08 11:33 am
[!highlight] Highlight Instead of randomly initializing the parameters of the Q-function approximation and then manually biasing the weights to decrease the chance of randomly deploying the recovery controller, we can use a baseline policy to generate episodes in which the RTSA system exhibited a somewhat acceptable performance. From these episodes, we learn the parameters in an offline approach known as batch reinforcement learning. It is only after we learn a good initialization of our parameters that we then start the training process of our policy π_RTSA. For this purpose and to have a benchmark to compare our approach to, we define a baseline policy that consists of shrinking the safety envelope by specifying a distance threshold δ > 0. When the vehicle reaches a state that is less than δ distance away from exiting the envelope, the recovery controller is deployed. This naive approach serves both as a baseline for our experiments and also provides us with experience to initialize the weights of our policy before we do on-policy learning. 2025-07-08 11:36 am
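The baseline is simple enough to sketch; the distance helper below is an assumed placeholder, not the paper's code.

```python
# Hypothetical sketch of the baseline described above: shrink the safety
# envelope by a margin delta > 0 and deploy the recovery controller as soon
# as the vehicle gets within delta of the envelope boundary.

def baseline_policy(state, distance_to_boundary, delta):
    """distance_to_boundary(state) is an assumed helper returning the
    distance from the current state to the edge of the safety envelope."""
    return "deploy" if distance_to_boundary(state) < delta else "continue"

# Episodes rolled out with baseline_policy yield (s, a, r, s') tuples that
# are replayed offline (batch RL) to initialise theta before on-policy
# training of pi_RTSA begins.
```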
[!highlight] Highlight We used a configuration composed of a hexarotor simulator that models lift rotor aircraft features and includes a three dimensional Bezier curve trajectory definition module, nonlinear multi-rotor dynamics, input/output linearization, nested saturation, a cascade PID flight control model, and an extended Kalman filter estimation model. An illustration of the simulator environment in three dimensions is included in Figure 2 [17]. 2025-07-08 11:46 am
Notably, not really anything special with the controller setup. Shouldn't we be able to do math using the PID controller and set bounds? or perhaps it's a nonlinear issue? Probably the second thing.
[!fail] This ain't right deploying a parachute which was modeled using a simplified model that introduces a drag coefficient that only affects the z coordinates in the simulation. 2025-07-08 11:47 am
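My reading of the simplified parachute model quoted above (illustrative only, not the paper's code, and the drag coefficient value is assumed): once deployed, drag acts on the vertical axis only, leaving the x and y dynamics untouched.

```python
# Assumed form of the simplified parachute model: a quadratic drag term that
# opposes vertical velocity, affecting only the z acceleration.

def z_acceleration_with_parachute(vz, mass, drag_coeff=2.5, g=9.81):
    """drag_coeff is an illustrative value, not taken from the paper."""
    return -g - (drag_coeff / mass) * vz * abs(vz)
```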
[!fail] This ain't right The state space in our simulation is comprised of more than 250 variables. Some correspond to simulation parameters such as sampling time, simulation time and physical constants. Another set of variables represent physical magnitudes such as velocity scaling factors, the mass of components of the hexarotor, distribution of the physical components of the hexarotor, moments of inertia and drag coefficients. Other variables represent maximum and minimum roll, pitch and yaw rates, rotor speed and thrust. The sensor readings, their biases, frequencies and other characteristics are also represented by other variables. Other variables represent the state of the controller and the actions it prescribes for the different actuators in the vehicle. And a few of them correspond to the position and velocity of the hexarotor during the simulation. Figure 4 shows the evolution of the position and velocity variables for a simulation episode corresponding to the example environment configuration. All of these variables are needed to completely specify the state of the world in the simulation and illustrate the high dimensional requirements of a somewhat accurate representation of flight dynamics in a simulator. In principle, all these variables would be needed to specify a policy for our RTSA system, but as discussed in Section II we can rely on value function approximation by crafting a set of informative features which significantly reduces the dimensionality of our problem. 2025-07-08 11:50 am
(Fig. 3. Example of environment configuration and episode data with wind; legend: path, waypoints, trajectory.)
This has to be wrong. There's no way they need 250 states when many of these phenomena must be coupled. Drag coefficients for example are completely dependent on velocity and constants. That is a redundant state, less for the issue of it changing for the recovery controller. I'd still say that's a hybrid system thing.
[!highlight] Highlight In this work we restricted our attention to terminal recovery controllers. 2025-07-08 9:43 am