
Adversarial attack and defense in RL-from AI security view

Paper: Chen, T., Liu, J., Xiang, Y. et al., Springer, 2019.

๐Ÿ’ Key Takeaways

Abstract

  • The first comprehensive survey of adversarial attacks on RL from the AI security point of view
  • Also gives a brief introduction to representative defense technologies against existing adversarial attacks

1. Introduction

  • Huang et al. (2017)
    • First to discover that RL is vulnerable to adversarial attacks that add small perturbations to the input.
    • cross-dataset transferability in RL: confirmed that adversarial examples can transfer between policies trained for the same task.
  • Main research areas
    • Atari Game: attacks that generate adversarial examples independently, only at specific time steps rather than at every time step
    • Auto Path Planning

Image

2. Preliminaries

Definition

  • Adversarial Example
    • an information carrier (such as an image, voice, or text) with small perturbations added
    • can remain imperceptible to the human vision system
    • 1️⃣ Implicit Adversarial Example: modifies global information by adding pixel-level perturbations that are invisible to humans
    • 2️⃣ Dominant Adversarial Example
      • a modified version of a clean map
      • changes local information by adding physical-level obstacles
  • Transferability: the property that an adversarial example misclassified by one model is also misclassified by other models trained to solve the same task
  • Threat Model
    • identifying the system's potential threats in order to establish an adversarial policy
    • considers an adversary that can inject small perturbations into the policy's raw input
  • Target Agent
    • the entity attacked by adversarial examples
    • generally an NN trained as an RL policy

The UNREAL Algorithm

  • Not an RL algorithm in itself
  • Application areas:
    • Atari Pong Game: 8.8 times human performance
    • Labyrinth (a first-person 3D maze): reached 87% of human level
  • RL algorithm: based on the A3C algorithm

UNREAL has two kinds of auxiliary tasks.

  • 1๏ธโƒฃ Control Task
    • Pixel Control: ํ”ฝ์…€ ๋‹จ์œ„๋กœ ๋ณ€ํ™”๋ฅผ ์ถ”์ ํ•˜์—ฌ environment๋ฅผ ๋” ์ž˜ ์ดํ•ดํ•˜๋„๋ก ํ•™์Šตํ•œ๋‹ค.
    • Hidden Layer Activation Control: ํ•™์Šต ์•ˆ์ •์„ฑ ๊ฐ•ํ™”์— ๋„์›€์ด ๋œ๋‹ค.
  • 2๏ธโƒฃ Back Prediction Task
    • ์ผ๋ถ€ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” feedback $r$ ์„ ์–ป์„ ์ˆ˜ ์—†๋‹ค.
    • NN์ด ๋‹ค์Œ ๋‹จ๊ณ„์˜ feedback value ์„ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต โ†’ ํ‘œํ˜„๋ ฅ ํ–ฅ์ƒ
  • UNREAL์€ ๊ณผ๊ฑฐ ์—ฐ์†๋œ ๋‹ค์ค‘ ํ”„๋ ˆ์ž„ ์ด๋ฏธ์ง€ ์ž…๋ ฅ์„ ์‚ฌ์šฉํ•ด ๋‹ค์Œ ๋‹จ๊ณ„์˜ ํ”ผ๋“œ๋ฐฑ ๊ฐ’์„ ์˜ˆ์ธกํ•˜๊ณ  ์ด๋ฅผ ํ•™์Šต ๋ชฉํ‘œ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

historical continuous multi-frame image input์œผ๋กœ ๋‹ค์Œ ๋‹จ๊ณ„์˜ feedback value๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ทธ feedback value๋ฅผ training target์œผ๋กœ ์„ค์ •ํ•œ๋‹ค. ๋˜ํ•œ history information๋ฅผ ํ™œ์šฉํ•˜์—ฌ Value Iteration Task๋ฅผ ๊ฐ•ํ™”ํ•œ๋‹ค.
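As a concrete illustration, here is a minimal PyTorch sketch of such a feedback-prediction auxiliary head (the framework choice, layer sizes, and the three-way negative/zero/positive feedback classes are my assumptions for illustration, not details taken from this survey):

```python
import torch
import torch.nn as nn

class FeedbackPredictionHead(nn.Module):
    """Auxiliary task sketch: classify the next-step feedback from a
    stack of recent frames (hypothetical shapes throughout)."""
    def __init__(self, n_frames=3, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * n_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.classifier = nn.LazyLinear(n_classes)  # infers input size on first call

    def forward(self, frames):
        # frames: (batch, 3 * n_frames, H, W) -- concatenated RGB frame history
        return self.classifier(self.encoder(frames))

# Usage: logits = FeedbackPredictionHead()(torch.randn(8, 9, 84, 84))
```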

While searching around, I found an unfinished plugin that supports TCP communication between Unreal Engine and Python deep RL algorithms. Since it can act as a bridge environment, sending and receiving data between Unreal and Python, I have included it here. The structure of this plugin is shown in the figure below.

Image

3. White-box Adversarial attack in RL

🎯 Fast gradient sign method (FGSM)

  • Application domain: Atari Pong Game
    • the agent fails to correctly judge the direction in which the ball is moving
  • RL algorithm
    • DQN (Deep Q-Network): most vulnerable → high attack success rate
    • TRPO / A3C: relatively high resistance

Image

Adding a sufficiently small perturbation $\eta$ element-wise to the original input $x$ produces an adversarial example $\tilde x = x + \eta$. Taking the dot product with a weight vector $\omega$ then gives $\omega^T\tilde x = \omega^Tx + \omega^T\eta$. The values composing $\eta$ must be greater than $-\epsilon$ and smaller than $\epsilon$, so the classifier is still expected to assign $x$ and $\tilde x$ to the same class.

\[\eta = \epsilon \cdot \mathrm{sign}(\omega), \quad \|\eta\|_{\infty} < \epsilon\]
  • $\omega$: weight vector
  • $\eta$ maximizes the change in output for the adversarial example $\tilde x$
  • $\epsilon$: hyper-parameter that controls the magnitude of the perturbation

FGSM์€ cost function $J$ ๋ฅผ linearizationํ•˜์—ฌ classifier์˜ misclassification์„ ์ผ์œผํ‚ค๋Š” $ฮท$ ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. $โˆ‡_xJ(ฮธ,x,y)$ ๋Š” $x$ ์— ๋Œ€ํ•œ $J$์˜ ๋ณ€ํ™”์œจ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค. ์ฆ‰, $ฮท$ ๋Š” GT์ธ $y$ ์— ๋ฐ˜๋Œ€๋˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๊ทธ ๋ถ€ํ˜ธ๋ฅผ ์–ป๊ฒŒ ๋œ๋‹ค. \(ฮท=ฯตโ‹…sign(โˆ‡_xJ(ฮธ,x,y))\)

🎯 Start Point-based Adversarial Attack on Q-learning (SPA)

  • Application domain: Automatic Path Finding
    • Key Point ($k$)
      • a strategically important location
      • e.g., the starting point, the target point, major branching points on the path
    • Key Vector ($v$)
      • the directional vector from the key point ($k$) to the target point ($t$)
      • $v=(t_c-k_c,\ t_r-k_r)$
      • $t_c, t_r$: coordinates of the target point
  • RL algorithm: Q-learning
    • the first to successfully find adversarial examples against Q-learning-based Automatic Path Finding (precision: 70%)
    • BUT limited to a fixed map size (28×28)
  • probabilistic output model
    • STEP 1: compute four factors
    • STEP 2: compute, as a weighted sum of the factors, the probability that each adversarial point ($a_i$) interferes with the agent's path finding
    • STEP 3: select the TOP 10 among all $p_{a_i}$ values

Image


STEP1์˜ 4๊ฐ€์ง€ factor๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

1๏ธโƒฃ Energy Point Gravitation adversarial point $k$ ๊ฐ€ key vector $v$ ์ƒ์— ์œ„์น˜ํ• ์ˆ˜๋ก ๊ณต๊ฒฉ ์„ฑ๊ณต ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง„๋‹ค. row์™€ column ์—์„œ ๊ฐ๊ฐ ๊ฐ’์ด ํ•˜๋‚˜์”ฉ ๋‚˜์˜จ๋‹ค. ์ด๋•Œ key point $k$ ์™€ adversarial point $k$ ๊ฐ€ ๋‹ค๋ฅธ ๊ฒƒ์— ์œ ์˜ํ•˜์ž.

\[\begin{cases} e_{ic} = k_c + i \cdot d' \cdot \frac{k'_c - k_c}{\sqrt{(k'_c - k_c)^2 + (k'_r - k_r)^2}} \\ e_{ir} = k_r + i \cdot d' \cdot \sqrt{1 - \left( \frac{k'_c - k_c}{\sqrt{(k'_c - k_c)^2 + (k'_r - k_r)^2}} \right)^2} \end{cases}\]
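In other words, the $i$-th energy point sits at distance $i \cdot d'$ from $k$ along the direction of $k' - k$. A compact NumPy sketch (note: the row formula above uses $\sqrt{1-\cos^2}$, the magnitude of the row component, so this signed version coincides with it when $k'_r \ge k_r$):

```python
import numpy as np

def energy_points(k, k_prime, d, n):
    """Energy points spaced d apart along the ray from the key point k
    toward k', with coordinates given as (column, row) pairs."""
    k, k_prime = np.asarray(k, float), np.asarray(k_prime, float)
    u = (k_prime - k) / np.linalg.norm(k_prime - k)  # unit direction vector
    return np.array([k + i * d * u for i in range(1, n + 1)])
```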

2๏ธโƒฃ Key Point Gravitation adversarial point๊ฐ€ ์ด key point $k$์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก ๋ฐฉํ•ด ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง:

\[d_{1i} = |a_{ic} - k_c| + |a_{ir} - k_r| \\ \text{where } (k_c, k_r) = k, \ (a_{ic}, a_{ir}) = a_i \in A\]

3๏ธโƒฃ Path Gravitation ์ ๋Œ€์  ์ ์ด ์ดˆ๊ธฐ ๊ฒฝ๋กœ $Z_1$ ์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก ๋ฐฉํ•ด ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง \(d_{2i} = \min \left\{ d_2 \ \bigg| \ d_2 = |a_{ic} - z_{jc}| + |a_{ir} - z_{jr}|, \ z_j \in Z_1 \right\} \\ \text{where } (z_{jc}, z_{jr}) = z_j, \ (a_{ic}, a_{ir}) = a_i \in A\)

4๏ธโƒฃ Included Angle key point $k$์—์„œ adversarial point $a_i$๋กœ ํ–ฅํ•˜๋Š” ๋ฒกํ„ฐ์™€ ๋ชฉํ‘œ ์ง€์  $t$ ๋กœ ํ–ฅํ•˜๋Š” ๋ฒกํ„ฐ(key vector) ๊ฐ„์˜ ๊ฐ๋„ $ฮธ_i$ ๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๊ฐ๋„ $ฮธ_i$ ๊ฐ€ ์ž‘์„์ˆ˜๋ก(key vector์™€ ์œ ์‚ฌํ•œ ๋ฐฉํ–ฅ) ๊ณต๊ฒฉ ํšจ๊ณผ๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ๋Š” adversarial point์˜ ํ‘œ๊ธฐ๊ฐ€ $a_i$ ๋กœ ๋ฐ”๋€Œ๊ณ , key vector์˜ ํ‘œ๊ธฐ๊ฐ€ ${v}_{kt}$๋กœ ๋ฐ”๋€ ๊ฒƒ์— ์œ ์˜ํ•˜์ž.

\[\begin{aligned} \mathbf{v}_{ka} &= (a_{ic} - k_c, \ a_{ir} - k_r) \\ \mathbf{v}_{kt} &= (t_c - k_c, \ t_r - k_r) \\ \cos \theta_i &= \frac{\mathbf{v}_{ka} \cdot \mathbf{v}_{kt}}{|\mathbf{v}_{ka}| \ |\mathbf{v}_{kt}|} \\ \theta_i &= \arccos(\cos \theta_i) \end{aligned}\]

In STEP 2, the probability for each adversarial point $a_i$ is computed as follows.

\[p_{a_i} = \sum_{j=1}^{4} \omega_j f_j(a_i) = \omega_1 \cdot a_{ie} + \omega_2 \cdot d'_{1i} + \omega_3 \cdot d'_{2i} + \omega_4 \cdot \theta'_i\]
  • $\omega_j$: the weight of each factor
    • computed via PCA

Finally, in STEP 3, the TOP 10 among all $p_{a_i}$ values are selected.
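Here is a minimal NumPy sketch of STEP 2 and STEP 3 (the factor values are assumed to be already normalized, and the weights are placeholders rather than actual PCA outputs):

```python
import numpy as np

def included_angle(k, a, t):
    """Factor 4: angle between v_ka and the key vector v_kt."""
    v_ka, v_kt = a - k, t - k
    cos = v_ka @ v_kt / (np.linalg.norm(v_ka) * np.linalg.norm(v_kt))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Toy candidates: rows are adversarial points a_i, columns are the four
# (already normalized) factor values [a_ie, d'_1i, d'_2i, theta'_i].
factors = np.random.rand(50, 4)
weights = np.array([0.4, 0.25, 0.2, 0.15])  # omega_j; SPA derives these via PCA

p = factors @ weights             # STEP 2: p_{a_i} = sum_j omega_j * f_j(a_i)
top10 = np.argsort(p)[::-1][:10]  # STEP 3: indices of the TOP 10 candidates
```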

🎯 White-box based adversarial attack on DQN (WBA)

  • SPA์˜ ํ™•์žฅ ๋ฒ„์ „
  • ์ ์šฉ ๋ถ„์•ผ: Automatic Path Finding
  • RL algorithm: Q-learning
  • SPA ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™•์žฅํ•˜์—ฌ DQN์˜ Q-table ๋ณ€๋™ ํŒจํ„ด์„ ๋ถ„์„ํ•˜๊ณ , ๊ฒฝ๋กœ ํƒ์ƒ‰ ๊ณผ์ •์—์„œ ์ทจ์•ฝ์ (vulnerable points)์„ ์‹๋ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•
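As a loose illustration only (this is not the paper's exact criterion), one way to flag vulnerable points from a Q-table is to look for states where the margin between the best and second-best actions is smallest:

```python
import numpy as np

def vulnerable_points(q_table, top_k=10):
    """q_table: (n_states, n_actions). States whose best action barely
    beats the runner-up are the easiest to flip, so flag those."""
    sorted_q = np.sort(q_table, axis=1)
    margin = sorted_q[:, -1] - sorted_q[:, -2]  # best minus second-best
    return np.argsort(margin)[:top_k]           # smallest margins first
```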

🎯 Common dominant adversarial examples generation method (CDG)

4. Black-box Adversarial attack in RL

🪇 Policy induction attack (PIA)

🪇 Specific time-step attack

🪇 Adversarial attack on VIN (AVI)

5. Defense technology against adversarial attack

6. Conclusion and discussion

๐Ÿ‹ After Read
