---
authors:
  - Alshiekh, Mohammed
  - Bloem, Roderick
  - Ehlers, Rüdiger
  - Könighofer, Bettina
  - Niekum, Scott
  - Topcu, Ufuk
citekey: alshiekhSafeReinforcementLearning2018
alias: alshiekhSafeReinforcementLearning2018
publish_date: 2018-04-29
journal: Proceedings of the AAAI Conference on Artificial Intelligence
volume: 32
issue: 1
last_import: 2025-07-30
---

# Safe Reinforcement Learning via Shielding

## Indexing Information

Published: 2018-04

DOI: [10.1609/aaai.v32i1.11797](https://doi.org/10.1609/aaai.v32i1.11797) #Formal-Methods

#InSecondPass

> [!Abstract]
> Reinforcement learning algorithms discover policies that maximize reward, but do not necessarily guarantee safety during learning or execution phases. We introduce a new approach to learn optimal policies while enforcing properties expressed in temporal logic. To this end, given the temporal logic specification that is to be obeyed by the learning system, we propose to synthesize a reactive system called a shield. The shield monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification. We discuss which requirements a shield must meet to preserve the convergence guarantees of the learner. Finally, we demonstrate the versatility of our approach on several challenging reinforcement learning scenarios.

> [!seealso] Related Papers

## Annotations

### Notes

![[Zettelkasten/Literature Notes/Notes on Papers/Safe Reinforcement Learning via Shielding-notes.md]]

### Highlights From Zotero

> [!highlight] Highlight
> Increasing use of learning-based controllers in physical systems in the proximity of humans strengthens the concern of whether these systems will operate safely.
> 2025-05-13 4:28 pm

> [!tip] Brilliant
> In this paper, we introduce shielded learning, a framework that allows applying machine learning to control systems in a way that the correctness of the system's execution against a given specification is assured during the learning and controller execution phases, regardless of how fast the learning process converges. The shield monitors the actions selected by the learning agent and corrects them if and only if the chosen action is unsafe.
> 2025-05-13 4:29 pm
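To make the mechanism concrete: below is a minimal sketch of shielded action selection, assuming the safety specification has already been compiled into a finite-state shield. Every name here (`Shield`, `is_safe`, `shielded_action`, the toy transition system) is my own illustration, not the paper's API.

```python
from typing import Hashable, Sequence

class Shield:
    """Finite-state safety monitor (illustrative sketch, not the paper's code)."""

    def __init__(self, transitions: dict, unsafe_states: set, initial: Hashable):
        self.transitions = transitions  # (state, action) -> next state
        self.unsafe = unsafe_states     # states that violate the specification
        self.state = initial

    def is_safe(self, action) -> bool:
        # An action is safe iff it does not move the monitor into an unsafe state.
        return self.transitions[(self.state, action)] not in self.unsafe

    def step(self, action) -> None:
        self.state = self.transitions[(self.state, action)]

def shielded_action(shield: Shield, proposed, alternatives: Sequence):
    """Pass the learner's action through unchanged if it is safe;
    otherwise substitute a safe alternative (the core shielding idea)."""
    if shield.is_safe(proposed):
        return proposed
    for a in alternatives:
        if shield.is_safe(a):
            return a
    raise RuntimeError("no safe action available from this state")

# Toy example: taking "open" in state "low" would violate the specification.
shield = Shield(
    transitions={("ok", "open"): "ok", ("ok", "close"): "ok",
                 ("low", "open"): "bad", ("low", "close"): "ok"},
    unsafe_states={"bad"},
    initial="low",
)
act = shielded_action(shield, "open", alternatives=["close"])
shield.step(act)  # act == "close": the shield overrode the unsafe choice
```

The "if and only if" in the highlight is the important part: whenever the proposed action is safe, the shield is invisible to the learner, which is what lets the convergence guarantees survive.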

> [!done] Important
> Last but not least, the shielding framework is compatible with mechanisms such as function approximation, employed by learning algorithms in order to improve their scalability.
> 2025-05-13 4:31 pm

> [!tip] Brilliant
> A specification that is however satisfied along all traces of the system is $\mathbf{G}(\mathbf{G}(r_a \land \neg r_b) \rightarrow \mathbf{F}\mathbf{G}\, g_a)$, which can be read as "If from some point onwards, request $r_a$ is always set to true while request proposition $r_b$ is not, then eventually, a grant is given to process $a$ for eternity."
> 2025-06-06 2:05 pm
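For reference, the standard LTL trace semantics behind that reading (a textbook definition, not quoted from the paper), for an infinite trace $\sigma = \sigma_0\sigma_1\sigma_2\dots$ where $\sigma[i..]$ denotes the suffix starting at position $i$:

$$
\begin{aligned}
\sigma \models \mathbf{G}\,\varphi &\iff \forall i \ge 0 :\ \sigma[i..] \models \varphi\\
\sigma \models \mathbf{F}\,\varphi &\iff \exists i \ge 0 :\ \sigma[i..] \models \varphi
\end{aligned}
$$

So the outer $\mathbf{G}$ quantifies over every position, the inner $\mathbf{G}(r_a \land \neg r_b)$ holding at some position captures "from some point onwards", and $\mathbf{F}\mathbf{G}\,g_a$ is "eventually, $g_a$ forever".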

> [!tip] Brilliant
> A specification is called a safety specification if every trace $\sigma$ that is not in the language represented by the specification has a prefix such that all words starting with the prefix are also not in the language. Intuitively, a safety specification states that "something bad should never happen". Safety specifications can be simple invariance properties (such as "the level of a water tank should never fall below 1 liter"), but can also be more complex (such as "whenever a valve is opened, it stays open for at least three seconds").
> 2025-06-06 2:05 pm
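The prefix condition can be stated formally; this is the standard definition of a safety language over infinite words (my notation, not the paper's):

$$
L \subseteq \Sigma^{\omega} \text{ is a safety property} \iff \forall \sigma \notin L\ \ \exists u \in \mathrm{pref}(\sigma)\ \ \forall \sigma' \in \Sigma^{\omega} :\ u \preceq \sigma' \implies \sigma' \notin L
$$

Here $\mathrm{pref}(\sigma)$ is the set of finite prefixes of $\sigma$, and $u \preceq \sigma'$ means $u$ is a prefix of $\sigma'$. The "bad prefix" $u$ is exactly what a runtime monitor such as the shield can detect after finitely many steps.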

## Follow-Ups

> [!example]
> (Bloem et al. 2015) proposed the idea of synthesizing a shield that is attached to a system to enforce safety properties at run time.

- #Follow-Up