우다다

Original article: https://seohong.me/blog/q-learning-is-not-yet-scalable/ ("Q-learning is not yet scalable", Seohong Park, UC Berkeley, June 2025). I came across a good article on reinforcement learning and thought it would be worth summarizing it again in Korean ..

1. Open the skin editor. 2. Go to HTML editing and squeeze the snippet in somewhere like this. 3. Test: $$ X^2 = 9 $$

This post works through the difference between on-policy and off-policy updates in reinforcement learning. We start with some intuitive background. What criterion separates on-policy from off-policy? Q-learning (off-policy): \begin{equation} Q(a, s) \leftarrow Q(a, s)+\alpha \cdot\left(r_s+\gamma \max _{a^{\prime}} Q\left(a^{\prime}, s^{\prime}\right)-Q(a, s)\right) \end{equation} Sarsa (on-policy): \begin{equation} Q(a, s) \leftarrow Q(a, s)+\alpha \cdot\left(r_s+\gamma \cd..
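To make the contrast concrete, here is a minimal tabular sketch of the two updates. It is an illustration under assumptions, not code from the post: the function names, the nested-list Q-table indexed as `Q[s][a]` (the post writes `Q(a, s)`), and the toy numbers are all hypothetical. The Sarsa update follows the standard textbook form, which bootstraps from the action `a_next` that the behavior policy actually takes in the next state.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update: bootstrap from the greedy (max) action in s_next."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: bootstrap from the action a_next actually taken in s_next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q


def make_q():
    # Hypothetical 2-state, 2-action table; state 1 has unequal action values.
    return [[0.0, 0.0], [1.0, 5.0]]


# Same transition (s=0, a=0, r=1, s_next=1), but the behavior policy
# happens to pick the non-greedy action 0 in state 1.
q_off = q_learning_update(make_q(), 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
q_on = sarsa_update(make_q(), 0, 0, 1.0, 1, 0, alpha=0.5, gamma=0.9)
print(q_off[0][0], q_on[0][0])
```

The two results differ only because Q-learning evaluates the greedy action in the next state while Sarsa evaluates the action that was actually chosen; when the behavior policy is itself greedy, the two targets coincide.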

1. State Value Function $$ V^\pi(s)=\mathbb{E}_\pi\left[G_t \mid s_t=s\right]=\mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_{t+1+i} \mid s_t=s\right] $$ Anyone who has studied reinforcement learning has seen the state value function many times. $$ \sum_{a, s^{\prime}} \pi(a \mid s) P_{s s^{\prime}}^a\left[R\left(s, a, s^{\prime}\right)+\gamma V^\pi\left(s^{\prime}\right)\right] $$ In the end it can be written in the form of the Bellman equation; the equation derivation and diagrams showing why this is possible ..
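The Bellman form above can be used directly as an update rule. The sketch below applies it repeatedly (iterative policy evaluation) until the values stop changing. Everything concrete here is an assumption for illustration: the function name, the layout `P[s][a] = [(prob, s_next, reward), ...]`, and the two-state toy MDP are hypothetical and do not come from the post.

```python
def policy_evaluation(pi, P, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Estimate V^pi by sweeping the Bellman expectation backup over all states.

    pi[s][a] is the probability of taking action a in state s.
    P[s][a] is a list of (prob, s_next, reward) transitions.
    """
    V = [0.0] * len(P)
    for _ in range(max_iters):
        delta = 0.0
        for s in range(len(P)):
            # sum over a, s' of pi(a|s) * P * [R + gamma * V(s')]
            v_new = sum(
                pi[s][a] * prob * (reward + gamma * V[s_next])
                for a in range(len(P[s]))
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update, reused within the same sweep
        if delta < tol:
            break
    return V


# Toy MDP (hypothetical): in state 0, action 0 stays put with reward 1,
# action 1 moves to the absorbing state 1 with reward 0.
P = [
    [[(1.0, 0, 1.0)], [(1.0, 1, 0.0)]],
    [[(1.0, 1, 0.0)], [(1.0, 1, 0.0)]],
]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy
V = policy_evaluation(pi, P, gamma=0.9)
print(V)
```

For this toy problem the fixed point can be checked by hand: V(1) = 0 and V(0) = 0.5 + 0.45 V(0), so V(0) = 0.5 / 0.55, roughly 0.909.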

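The biggest drawback of DDPG is that it does not guarantee monotonic improvement in performance. TRPO addresses this by using the Minorization-Maximization algorithm together with a trust region. With a trust region in place, monotonic improvement is guaranteed. TRPO performs well, but it is hard to implement and its computational cost is too high, so it is rarely used.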
