[Paper] Dynamic Action Space Handling Method for Reinforcement Learning Models

Woo, Sangchul; Sung, Yunsick

doi:10.3745/jips.02.0146

Journal of Information Processing Systems, vol. 16, no. 5, 2020, pp. 1223-1230

Woo, Sangchul (Dept. of Multimedia Engineering, Dongguk University); Sung, Yunsick (Dept. of Multimedia Engineering, Dongguk University)

Abstract

Recently, extensive studies have been conducted to apply deep learning to reinforcement learning to solve the state-space problem. If the state-space problem were solved, reinforcement learning would become applicable in various fields. For example, users can utilize dance-tutorial systems to learn how to dance by watching and imitating a virtual instructor. The instructor, to which reinforcement learning is applied, can perform the optimal dance to the music. In this study, we propose a reinforcement learning method in which the action space is dynamically adjusted. Because actions that are not performed or are unlikely to be optimal are not learned, and no state space is allocated to them, the learning time can be shortened and the state space reduced. In an experiment, the proposed method shows results similar to those of traditional Q-learning even when its state space is reduced to approximately 0.33% of that of Q-learning. Consequently, the proposed method reduces the cost and time required for learning. Traditional Q-learning requires 6 million state spaces for 100,000 learning episodes, whereas the proposed method requires only 20,000. A higher winning rate can be achieved in a shorter period of time by retrieving 20,000 state spaces instead of 6 million.

Tables/Figures (7)

Fig. 1. Q-table.
Fig. 2. State space.
Table 1. State space number
Fig. 3. Case 1 of the action space: (a) state expression, (b) distances from last position, and (c) state space within √2.
Fig. 4. Case 2 of the action space: (a) state expression, (b) distances from last position, and (c) state space within √2.
Fig. 5. The comparison of action space search space: (a) the proposed method and (b) traditional Q-learning.
Fig. 6. The comparison of the accumulated wins.

AI summary of the main text

* The sentences below were identified automatically by AI and may include unsuitable ones; please use them with care.

Proposed method

α and γ are constants; α is the step size in the incremental mean, and γ is the depreciation (discount) rate. Because the proposed method reduces the state space by decreasing the number of selectable actions, it can be applied to various reinforcement learning algorithms without modification.
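For reference, the roles of the two constants can be read off the textbook Q-learning update. The excerpt does not reproduce the paper's own equation, so the form and the symbols $s_t$, $a_t$, and $r_{t+1}$ below are assumed from the standard formulation rather than taken from the source:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

Here α scales how far the estimate moves toward the new target (the step size of the incremental mean), and γ discounts the value of the best next action.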
Because actions that were not performed or were unlikely to be optimal were not learned, and no state space was allocated to them, the learning time could be shortened and the state space reduced. The proposed method was experimentally verified by applying it to a game of Tic-Tac-Toe. The proposed method showed results similar to those of traditional Q-learning even when the state space was reduced to approximately 0.33%.
This paper proposes a reinforcement learning algorithm in which learning is performed by dynamically adjusting the action space. Because actions that are not performed or are unlikely to be optimal are not learned, and no state space is allocated to them, the learning time can be shortened owing to the reduced state space.

Theory/Model

We propose a method to solve such a state-space problem by reducing the action space. The proposed method is applicable to various algorithms of reinforcement learning, such as the Monte Carlo method, Sarsa, and Q-learning [5], to enable their use in real time.
This section introduces a series of processes to verify the proposed method. The proposed method is applied to Tic-Tac-Toe games for verification. This section presents the Tic-Tac-Toe game and details the application process and experimental results of the proposed method.
The proposed method was applied to solve the state-space problem. The action space of the Tic-Tac-Toe game was reduced as follows, removing the state space associated with actions that were not performed or were unlikely to be performed.
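The excerpt does not spell out the reduction rule, but the captions of Fig. 3 and Fig. 4 refer to "distances from last position" and a "state space within √2". One possible reading, sketched below under that assumption, is that only empty cells within a Euclidean distance of √2 from the last move are kept as candidate actions. The function name `candidate_actions`, the board encoding, and the fallback to all empty cells are hypothetical choices made for this illustration, not the authors' implementation.

```python
import math


def candidate_actions(board, last_move, radius=math.sqrt(2)):
    """Return the empty cells within `radius` of the last move.

    board     -- 3x3 list of lists; None marks an empty cell (assumed encoding)
    last_move -- (row, col) of the previous move, or None at the start
    """
    empty = [(r, c) for r in range(3) for c in range(3) if board[r][c] is None]
    if last_move is None:
        return empty  # no previous move yet: every empty cell is a candidate
    last_row, last_col = last_move
    near = [
        (r, c)
        for (r, c) in empty
        # small tolerance so diagonal neighbours (distance sqrt(2)) are kept
        if math.hypot(r - last_row, c - last_col) <= radius + 1e-9
    ]
    # Assumed fallback: if no nearby cell is free, use all empty cells.
    return near or empty


board = [[None] * 3 for _ in range(3)]
board[0][0] = "X"
print(candidate_actions(board, (0, 0)))  # [(0, 1), (1, 0), (1, 1)]
```

With a corner as the last move, the candidate set shrinks from eight empty cells to three, which is the kind of reduction in selectable actions the text describes.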
We trained the model with the Q-learning algorithm, a temporal-difference learning method that combines the merits of the Monte Carlo method and dynamic programming. Q-learning learns an optimal policy via an action-value function, which estimates the expected return of performing a specific action in the current state.
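A minimal sketch of a tabular Q-learning step may help make this concrete. It assumes a dictionary-based Q-table in which an entry is created only when a state-action pair is actually updated, which loosely mirrors the idea of not allocating space for actions that are never tried. The function `q_update`, its parameters, and the default values of `alpha` and `gamma` are illustrative assumptions, not values from the paper.

```python
def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the dict-based table `q`.

    Unseen (state, action) pairs are read as 0.0 without being stored,
    so the table grows only for pairs that are actually updated.
    """
    # Value of the best action available in the next state (0.0 if terminal).
    best_next = max(
        (q.get((next_state, a), 0.0) for a in next_actions),
        default=0.0,
    )
    old = q.get((state, action), 0.0)
    # Temporal-difference update: move the estimate toward the target.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


q_table = {}
# Terminal transition with reward 1.0: no next actions are available.
q_update(q_table, "s1", (1, 1), 1.0, "terminal", [])
# Earlier transition that leads into "s1"; only (1, 1) and (0, 1) are candidates.
q_update(q_table, "s0", (0, 0), 0.0, "s1", [(1, 1), (0, 1)])
print(len(q_table))  # 2
```

Passing a restricted `next_actions` list is where a reduced action space would plug in: the maximum is taken only over the candidate actions, and no entry is created for the others.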

Performance/Effect

In this experiment, the proposed method and traditional Q-learning were compared according to the number of action spaces retrieved and the count of accumulated wins.
Fig. 6 shows a comparison of the two algorithms according to their accumulated wins. The proposed method resulted in approximately 45,000 wins in training with 100,000 episodes, which was similar to the count of wins for traditional Q-learning. Although the win count differed according to learning type, it did not significantly affect the performance analysis.
The proposed method was experimentally verified by applying it to a game of Tic-Tac-Toe. The proposed method showed results similar to those of traditional Q-learning even when the state space was reduced to approximately 0.33%, indicating that the proposed method could reduce the cost and time required for learning the same amount.
The proposed method recorded 44,683 wins, a count similar to that of traditional Q-learning. It was far more efficient, as it used only 20,000 state spaces, which is approximately 0.33% of the 6 million state spaces used in traditional Q-learning.
Traditional Q-learning searched approximately 6 million state spaces in learning the same number of episodes. Thus, the number of state spaces searched by the proposed method was approximately 0.33% of that by traditional Q-learning, indicating that the time and state space can be significantly reduced by decreasing the search space.
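The reported ratio can be checked with a short calculation. The two counts are the approximate figures stated in the text (20,000 and 6 million); the variable names are chosen only for this illustration.

```python
proposed_states = 20_000      # state spaces searched by the proposed method
baseline_states = 6_000_000   # state spaces searched by traditional Q-learning

ratio_percent = 100 * proposed_states / baseline_states
reduction_factor = baseline_states / proposed_states

print(f"{ratio_percent:.2f}%")   # 0.33%
print(f"{reduction_factor:.0f}x fewer state spaces")  # 300x fewer state spaces
```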
Ultimately, the proposed method significantly reduced the learning time and state space in terms of the number of action spaces searched, while showing performance similar to that of Q-learning in the cumulative win count. Although the proposed method has the disadvantage of requiring spatial data to be extracted from the action and state, it is applicable in various fields because actions typically depend on location, and it is expected to considerably improve performance.

Future work

With the proposed method, reinforcement learning can be applied to systems such as the dance-tutorial system described in the abstract. In the future, the proposed method will be combined with deep learning to study how the dance motions of virtual instructors can be improved.

References (5)

[1] V. Francois-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," Foundations and Trends in Machine Learning, vol. 11, no. 3-4, pp. 219-354, 2018.
[2] O. Alemi, J. Francoise, and P. Pasquier, "GrooveNet: real-time music-driven dance movement generation using artificial neural networks," in Workshop on Machine Learning for Creativity in conjunction with the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, 2017.
[3] A. Raghu, M. Komorowski, L. A. Celi, P. Szolovits, and M. Ghassemi, "Continuous state-space models for optimal sepsis treatment-a deep reinforcement learning approach," in Proceedings of the Machine Learning for Health Care Conference (MLHC), Boston, MA, 2017, pp. 147-163.
[4] R. Garg and D. P. Nayak, "Game of tic-tac-toe: Simulation using Min-Max algorithm," International Journal of Advanced Research in Computer Science, vol. 8, no. 7, pp. 1074-1077, 2017.
[5] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, "Is Q-learning provably efficient?," Advances in Neural Information Processing Systems, vol. 31, pp. 4863-4873, 2018.
