Thompson strategy based online reinforcement learning system for action selection
IPC classification information
Country / Type
United States(US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06N-005/04
G06N-007/00
G06N-007/02
Application number
UP-0169503
(2005-06-29)
Registration number
US-7707131
(2010-05-20)
Inventor / Address
Chickering, David M.
Paek, Timothy S.
Horvitz, Eric J.
Applicant / Address
Microsoft Corporation
Agent / Address
Lee & Hayes, PLLC
Citation information
Times cited: 14
Cited patents: 67
Abstract
A system and method for online reinforcement learning is provided. In particular, a method for performing the explore-vs.-exploit tradeoff is provided. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model (e.g., Bayesian network model). The system includes a model which receives an input (e.g., from a user) and provides a probability distribution associated with uncertainty regarding parameters of the model to a decision engine. The decision engine can determine whether to exploit the information known to it or to explore to obtain additional information based, at least in part, upon the explore-vs.-exploit tradeoff (e.g., Thompson strategy). A reinforcement learning component can obtain additional information (e.g., feedback from a user) and update parameter(s) and/or the structure of the model. The system can be employed in scenarios in which an influence diagram is used to make repeated decisions and maximization of long-term expected utility is desired.
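The loop the abstract describes (sample parameters from their uncertainty distribution, act on the sample, update from feedback) can be illustrated with a minimal sketch. It is not the patented system: it assumes the simplest possible model, a set of independent actions with binary feedback and a Beta posterior per action (a two-outcome Dirichlet), rather than a full influence diagram. The names `thompson_select`, `update`, and `simulate` are hypothetical.

```python
import random


def thompson_select(posteriors, rng):
    """Pick an action by sampling one success probability per action.

    posteriors: list of [alpha, beta] pseudo-counts, one pair per action.
    Sampling from the posterior (rather than using its mean) is what makes
    uncertain actions get explored occasionally.
    """
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)


def update(posteriors, action, reward):
    """Fold one piece of binary feedback into the chosen action's posterior."""
    if reward:
        posteriors[action][0] += 1.0
    else:
        posteriors[action][1] += 1.0


def simulate(true_probs, steps, seed=0):
    """Run the select/observe/update loop against assumed true success rates.

    true_probs is an assumption for the demo: it stands in for the
    environment (for example, a user) that supplies feedback.
    """
    rng = random.Random(seed)
    posteriors = [[1.0, 1.0] for _ in true_probs]  # uniform Beta(1, 1) priors
    pulls = [0] * len(true_probs)
    for _ in range(steps):
        action = thompson_select(posteriors, rng)
        reward = rng.random() < true_probs[action]
        update(posteriors, action, reward)
        pulls[action] += 1
    return posteriors, pulls


if __name__ == "__main__":
    posteriors, pulls = simulate([0.1, 0.9], steps=2000)
    print("pulls per action:", pulls)
```

Because the action is chosen from a posterior sample, no separate exploration rate is needed in this sketch: as feedback accumulates, the samples concentrate and the better action is chosen more often.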
Representative claims
What is claimed is:
1. An online reinforcement learning system comprising components embodied on a computer readable storage medium, the components when executed by one or more processors, update a model based upon reinforcement learning, the components comprising: a model comprising an influence diagram with at least one chance node, the model receiving an input and providing a probability distribution associated with uncertainty regarding parameters of the model; a decision engine that selects an action based, at least in part, upon the probability distribution, the decision engine employing a Thompson strategy heuristic technique to maximize long term expected utility when selecting the action, wherein the decision engine decreases a variance of a distribution of the parameters as a last decision instance is approached; and a computer-implemented reinforcement learning component that modifies at least one of the parameters of the model based upon feedback associated with the selected action, the parameters defining distributions over discrete variables and continuous variables, uncertainty of the parameters expressed using Dirichlet priors for conditional distributions of discrete variables of the model, and, Normal-Wishart priors for distributions of continuous variables of the model, wherein the modified model is stored.
2. The system of claim 1, used when the parameters of the model are changing over time.
3. The system of claim 1, wherein the decision engine employs a maximum a posterior of the parameters when there is only one more decision instance remaining.
4. The system of claim 1, wherein the decision engine artificially increases the variance of a distribution of the parameters.
5. The system of claim 1, wherein the computer-implemented reinforcement learning component further modifies the structure of the model based, at least in part, upon the feedback associated with the selected action.
6. The system of claim 1, wherein the feedback comprises an input from a user of the system.
7. The system of claim 6, wherein the input from the user comprises a verbal utterance.
8. The system of claim 1, wherein the feedback comprises a lack of an input from a user of the system in a threshold period of time.
9. The system of claim 1, where one or more parameters of the model change over a period of time.
10. The system of claim 1, the parameters defining distributions over variables, where the variables comprise chance variables, decision variables and/or value variables.
11. The system of claim 1, employed repeatedly to facilitate decision making.
12. The system of claim 11, wherein the parameter(s) are updated prior to a next repetition.
13. The system of claim 1, the model comprising a Markov decision process represented as an Influence diagram.
14. The system of claim 1 employed as part of a dialog system.
15. An online reinforcement learning method comprising: determining a probability distribution associated with uncertainty regarding parameters of a model, the model comprising an influence diagram with at least one chance node; employing a computer-implemented Thompson strategy heuristic technique to select an action based, at least in part, upon the probability distribution, wherein a variance of a distribution of the parameters is artificially increased to be large enough that the model continues to adapt; updating at least one parameter of the model based, at least in part, upon feedback associated with the selected action, the parameters defining distributions over discrete variables and continuous variables, uncertainty of the parameters expressed using Dirichlet priors for conditional distributions of discrete variables of the model, and, Normal-Wishart priors for distributions of continuous variables of the model; and storing the updated model on a computer readable storage medium.
16. The method of claim 15, wherein the feedback comprises an input from a user or a lack of an input from the user in a threshold period of time.
17. A computer readable medium having stored thereon computer executable instructions for carrying out the method of claim 15.
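The variance adjustments recited in claims 1, 3, 4 and 15 can be sketched for a single Dirichlet-distributed parameter vector. The claims do not say how the variance is changed, so everything below is an assumption made for illustration: the variance is adjusted by multiplying all Dirichlet pseudo-counts by a common factor (which leaves the mean unchanged), the schedule in `parameters_for_decision` is hypothetical, and the function names are invented. Normal-Wishart priors for continuous variables are not covered.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_map(alpha):
    """Mode (maximum a posteriori point) of Dirichlet(alpha).

    The closed form below is only valid when every pseudo-count exceeds 1.
    """
    if any(a <= 1.0 for a in alpha):
        raise ValueError("mode formula needs every alpha > 1")
    denom = sum(alpha) - len(alpha)
    return [(a - 1.0) / denom for a in alpha]


def scaled_alpha(alpha, scale):
    """Rescale pseudo-counts: the mean is unchanged, the variance is not.

    scale > 1 tightens the distribution (less exploration);
    scale < 1 widens it, so the model keeps adapting.
    """
    return [a * scale for a in alpha]


def parameters_for_decision(alpha, remaining, rng):
    """Choose parameter values for one decision instance.

    remaining: decision instances left, counting the current one.
    Assumed schedule (hypothetical): scale = 1 + 1 / (remaining - 1), so the
    sampling variance shrinks as the last decision approaches, and the final
    decision uses the MAP point instead of a sample.
    """
    if remaining <= 1:
        return dirichlet_map(alpha)
    scale = 1.0 + 1.0 / (remaining - 1)
    return sample_dirichlet(scaled_alpha(alpha, scale), rng)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = [3.0, 2.0, 2.0]
    for remaining in (100, 10, 2, 1):
        print(remaining, parameters_for_decision(alpha, remaining, rng))
```

Scaling the pseudo-counts is only one way to change the variance; it was chosen here because a Dirichlet component's variance is inversely related to the total pseudo-count plus one, so a single factor controls exploration without shifting the expected parameter values.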