[Paper] A Reinforcement Learning Method Using TD-Error in Ant Colony System

이승관; 정태충

doi:10.3745/kipstb.2004.11b.1.077

Abstract

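In reinforcement learning, the temporal-credit assignment problem, that is, how to reward the action an agent selected when it chose an action in the current state and made a state transition, is an important issue. This paper reviews the existing Ant-Q learning method, which applies Q-learning to the Ant Colony System (ACS) algorithm. ACS was proposed to solve the Traveling Salesman Problem (TSP) as a new metaheuristic for hard combinatorial optimization problems, a population-based approach that uses positive feedback as well as greedy search. The paper then proposes Ant-TD, a reinforcement learning method that adds to Ant-Q a state transition based on a diversification strategy together with the TD-error. Experiments showed that the proposed method converges to the optimal solution faster than the existing ACS and Ant-Q learning.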

Abstract (English)

In reinforcement learning, how to reward the action an agent selects in the present state, after which it makes a state transition, is known as the temporal-credit assignment problem and is an important subject. This paper examines the Ant-Q learning method, a new metaheuristic for hard combinatorial optimization problems that was proposed to solve the Traveling Salesman Problem (TSP) as a population-based approach using positive feedback as well as greedy search. It then proposes the Ant-TD reinforcement learning method, which applies to Ant-Q a state transition based on a diversification strategy together with the TD-error. Experiments show that the proposed method finds an optimal solution faster than other methods such as ACS and Ant-Q learning.

AI summary of the main text

* Sentences were identified automatically by AI and some may be unsuitable, so please use them with care.

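Problem definition

This paper proposed a reinforcement learning method for Ant-Q that uses the TD-error to address the temporal-credit assignment problem and performs state transitions through a diversification strategy based on how often agents have visited each node.
This paper proposes a reinforcement learning method that uses the TD-error, introduced to address the temporal-credit assignment problem, in Ant-Q reinforcement learning.
To solve the TSP, this paper reviews the Ant-Q [8, 9] learning method, a reinforcement-learning-based approach to combinatorial optimization that builds on the new metaheuristic proposed by Colorni, Dorigo, and Maniezzo [1, 2], and proposes Ant-TD, a new reinforcement learning technique that applies the TD-error [3-5, 13] to Ant-Q learning.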

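Proposed method

To analyze the efficiency of ACS, Ant-Q learning, and the proposed Ant-TD reinforcement learning, we compared how quickly each converges to the optimal solution on arbitrary sets of cities.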

Theory/Model

In Ant-Q, agent k at node r moves to node s according to the following pseudo-random proportional action choice rule (or state transition rule).
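The rule itself is not reproduced in this record. As a rough illustration, the sketch below follows the form commonly given in the Ant-Q literature [8, 9]: with probability q0 the agent takes the edge with the highest product of learned value and heuristic value, and otherwise it samples an edge in proportion to that product. The names `aq`, `he`, `delta`, `beta`, and `q0`, and their default values, are assumptions made for this example and may differ from the paper's notation and settings.

```python
import random


def choose_next_node(r, unvisited, aq, he, delta=1.0, beta=2.0, q0=0.9, rng=random):
    """Pick the next node s for an agent at node r (illustrative sketch).

    aq[r][u] is the learned value of edge (r, u); he[r][u] is a heuristic
    value such as 1 / distance. All parameter names are assumptions.
    """
    # Score every candidate edge by learned value and heuristic desirability.
    scores = {u: (aq[r][u] ** delta) * (he[r][u] ** beta) for u in unvisited}
    if rng.random() <= q0:
        # Exploitation: take the best-scoring edge.
        return max(scores, key=scores.get)
    # Exploration: sample a node in proportion to its score
    # (requires at least one strictly positive score).
    nodes = list(scores)
    weights = [scores[u] for u in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]
```

Setting `q0=1.0` makes the choice fully greedy, while smaller values leave more room for exploration.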
The proposed Ant-TD learning is a new method for improving the performance of existing Ant-Q learning; it applies the TD-error proposed by C. J. C. H. Watkins to Ant-Q learning [3].
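The exact Ant-TD update is not given in this record. As a minimal sketch, a Q-learning-style TD-error for an edge (r, s) can be written as the reward plus the discounted best value reachable from s, minus the current estimate. The function names, the `reward` argument, and the values of `alpha` and `gamma` below are assumptions for illustration only, not the paper's settings.

```python
def td_error(aq, r, s, reward, candidates, gamma=0.3):
    """TD-error for edge (r, s): reward + gamma * max_z aq[s][z] - aq[r][s]."""
    # Best learned value among the nodes still reachable from s (0 if none).
    best_next = max((aq[s][z] for z in candidates), default=0.0)
    return reward + gamma * best_next - aq[r][s]


def td_update(aq, r, s, reward, candidates, alpha=0.1, gamma=0.3):
    """Move aq[r][s] a step of size alpha along the TD-error and return the error."""
    err = td_error(aq, r, s, reward, candidates, gamma)
    aq[r][s] += alpha * err
    return err
```

Written this way, the update is algebraically the same as the usual blend `(1 - alpha) * aq[r][s] + alpha * (reward + gamma * best_next)`; the TD-error form simply makes the correction term explicit.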

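Performance/Effects

Therefore, to account for the role of diverse exploration from the current best path toward new neighboring edges, the number of agent visits to each edge can be taken into account; this can be effective in reinforcement learning because it speeds up exploration and gradually improves its accuracy.
Table 1 shows, for the R×R grid problem, the best path length and the average path length produced by each algorithm over 10 trials of 1,000 cycles each. When the problem is small, that is, when the number of nodes is small, the results did not differ, but the proposed method performs better as the problem grows. It can therefore be applied effectively to large problems.
The proposed Ant-TD reinforcement learning method is a new method for improving the performance of existing Ant-Q learning. It applies the visit frequency of the edges that agents traverse while building a tour to the state transition rule, so that agents search the space more diversely and move into regions not yet explored, and it uses the TD-error to address the temporal-credit assignment problem, thereby converging quickly to the optimal solution.
The proposed Ant-TD reinforcement learning method searches for the goal state using state transitions based on a diversification strategy and the TD-error, and it converges to the optimal solution faster than the existing ACS and Ant-Q learning.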
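This record does not state how visit frequency enters the state transition rule. One hypothetical way to picture the idea is to count how often each edge has been used and scale down the score of frequently used edges, so that rarely visited edges become relatively more attractive. The penalty form `1 / (1 + visits) ** rho`, the parameter `rho`, and both function names are assumptions made for this sketch and are not taken from the paper.

```python
from collections import defaultdict


def record_tour(visits, tour):
    """Increment the visit count of every directed edge in a closed tour."""
    for a, b in zip(tour, tour[1:] + tour[:1]):
        visits[(a, b)] += 1


def diversified_scores(r, unvisited, aq, he, visits, delta=1.0, beta=2.0, rho=1.0):
    """Score candidate edges from r, discounting edges that were visited often.

    ASSUMPTION: the 1 / (1 + visits) ** rho penalty is illustrative only.
    """
    return {
        u: (aq[r][u] ** delta) * (he[r][u] ** beta) / (1.0 + visits[(r, u)]) ** rho
        for u in unvisited
    }


# Usage: after one tour 0 -> 1 -> 2 -> 0, edge (0, 1) is penalized, (0, 2) is not.
visits = defaultdict(int)
record_tour(visits, [0, 1, 2])
ones = [[1.0] * 3 for _ in range(3)]
scores = diversified_scores(0, [1, 2], ones, ones, visits)
```

The resulting scores could then feed a selection step such as a greedy or proportional choice; for a symmetric TSP the counts might instead be kept per undirected edge.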

Future work

Future work should also study reinforcement learning methods for Ant-TD that use an eligibility factor, a measure of how suitable the node selected in the current state is.

References (15)

A. Colorni, M. Dorigo and V. Maniezzo, 'An investigation of some properties of an ant algorithm,' Proceedings of the Parallel Problem Solving from Nature Conference (PPSN'92), R. Männer and B. Manderick (Eds.), Elsevier Publishing, pp.509-520, 1992
A. Colorni, M. Dorigo and V. Maniezzo, 'Distributed optimization by ant colonies,' Proceedings of ECAL'91 - European Conference on Artificial Life, Paris, France, F. Varela and P. Bourgine (Eds.), Elsevier Publishing, pp.134-144, 1991
C. J. C. H. Watkins, 'Learning from Delayed Rewards,' Ph.D. thesis, King's College, Cambridge, U.K., 1989
C. N. Fiechter, 'Efficient reinforcement learning,' In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pp.88-97, 1994
E. Barnard, 'Temporal-difference methods and Markov models,' IEEE Transactions on Systems, Man, and Cybernetics, Vol.23, pp.357-365, 1993
L. M. Gambardella and M. Dorigo, 'Solving symmetric and asymmetric TSPs by ant colonies,' Proceedings of IEEE International Conference on Evolutionary Computation, IEEE-EC'96, IEEE Press, pp.622-627, 1996
L. M. Gambardella and M. Dorigo, 'Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem,' IEEE Transactions on Evolutionary Computation, Vol.1, No.1, 1997
L. M. Gambardella and M. Dorigo, 'Ant-Q: a reinforcement learning approach to the traveling salesman problem,' Proceedings of ML-95, Twelfth International Conference on Machine Learning, A. Prieditis and S. Russell (Eds.), Morgan Kaufmann, pp.252-260, 1995
M. Dorigo and L. M. Gambardella, 'A study of some properties of Ant-Q,' Proceedings of PPSN IV - Fourth International Conference on Parallel Problem Solving From Nature, H. M. Voigt, W. Ebeling, I. Rechenberg and H. S. Schwefel (Eds.), Springer-Verlag, Berlin, pp.656-665, 1996
M. Dorigo, V. Maniezzo and A. Colorni, 'The ant system: optimization by a colony of cooperating agents,' IEEE Transactions on Systems, Man and Cybernetics - Part B, Vol.26, No.2, pp.29-41, 1996
M. Dorigo and G. Di Caro, 'Ant Algorithms for Discrete Optimization,' Artificial Life, Vol.5, No.3, pp.137-172, 1999
M. Dorigo and L. M. Gambardella, 'Ant Colonies for the Traveling Salesman Problem', BioSystems, 43, pp.73-81, 1997
R. C. Yee, P. E. Utgoff and A. G. Barto, 'Explaining temporal differences to create useful concepts for evaluating states', In Proceedings of the 8th National Conference on Artificial Intelligence, pp.882-888, 1990
T. Stützle and H. Hoos, 'The ant system and local search for the traveling salesman problem,' Proceedings of ICEC 1997 - 1997 IEEE 4th International Conference on Evolutionary
T. Stützle and M. Dorigo, 'ACO Algorithms for the Traveling Salesman Problem,' In K. Miettinen, M. Makela, P. Neittaanmaki, J. Periaux, editors, Evolutionary Algorithms in Engineering and Computer Science, Wiley, 1999

