The deep Q-learning technique trains the weights of an artificial neural network using several distinctive features, including separate target and prediction networks and random experience replay, which avoids the problems caused by temporally correlated training samples. A hardware architecture tuned for deep Q-learning is described. Inference cores use the prediction network to determine an action to apply to an environment. A replay memory stores the results of that action. Training cores update the weights of the prediction network using a loss function derived from the outputs of both the target and prediction networks. A high-speed copy engine periodically copies the weights of the prediction network to the target network.
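The abstract above outlines the standard deep Q-learning loop. The following minimal sketch shows how the prediction network, target network, replay memory, and copy engine interact. The toy linear `QNet`, the tuple format, and all hyperparameters are illustrative assumptions for the sketch, not details taken from this document, which claims a hardware realization of the same loop.

```python
import random

# Toy stand-in for a Q-network: Q(s, a) = w[a] * s for a scalar state s.
# A real implementation would use a deep network; this keeps the sketch
# self-contained while preserving the prediction/target split.
class QNet:
    def __init__(self, n_actions):
        self.w = [0.0] * n_actions

    def scores(self, s):                 # output scores for all actions
        return [wa * s for wa in self.w]

    def copy_from(self, other):          # role of the high-speed copy engine
        self.w = list(other.w)

def train_step(pred, target, replay, gamma=0.99, lr=0.1):
    # Random experience replay: sample a stored (s, a, r, s') transition.
    s, a, r, s_next = random.choice(replay)
    # TD target uses the *target* network; the score being corrected
    # comes from the *prediction* network.
    y = r + gamma * max(target.scores(s_next))
    q = pred.scores(s)[a]
    # One gradient-descent step on the squared loss (y - q)^2
    # with respect to the prediction weights.
    pred.w[a] += lr * (y - q) * s
```

In a full loop, an inference core would pick the action (for example, epsilon-greedily from `pred.scores(s)`), store the resulting tuple in the replay memory, and the copy engine would call `target.copy_from(pred)` every N training steps.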
Representative Claims
1. A method for training a prediction artificial neural network, the method comprising: applying, by one or more inference cores, state information for time step t to a prediction artificial neural network having weights stored in a prediction network weight memory, to obtain output scores for a set of actions; selecting an action from the set of actions based on the output scores, for application to an environment, to advance the environment to time step t+1; storing a tuple for a transition from state s_t to state s_{t+1} into a replay memory, the tuple including the selected action and a reward provided by the environment; and adjusting, by one or more training cores, weights of the prediction artificial neural network stored in the prediction network weight memory based on application of states s_t and s_{t+1} from the tuple to the prediction artificial neural network and a target artificial neural network having weights stored in a target network weight memory, respectively.

2. The method of claim 1, wherein adjusting the weights of the prediction artificial neural network includes: sampling, by the one or more training cores, one or more tuples from the replay memory, where each tuple includes a state s_j, an action a_j, a reward for the action r_j, and a subsequent state s_{j+1}.

3. The method of claim 2, wherein adjusting the weights of the prediction artificial neural network further includes: applying, by the one or more training cores, state s_{j+1} to the target artificial neural network and obtaining a highest action score output from the target artificial neural network.

4. The method of claim 3, wherein adjusting the weights of the prediction artificial neural network further includes: applying, by the one or more training cores, state s_j to the prediction artificial neural network to obtain an action score for action a_j.

5. The method of claim 4, wherein adjusting the weights of the prediction artificial neural network further includes: determining, by the one or more training cores, a loss function based on the highest action score output by the target artificial neural network for state s_{j+1}, the action score for action a_j output by the prediction artificial neural network, and the reward r_j.

6. The method of claim 5, wherein adjusting the weights of the prediction artificial neural network further includes: performing, by the one or more training cores, a gradient descent operation on the loss function with respect to the weights of the prediction artificial neural network.

7. The method of claim 1, further comprising: periodically updating the weights of the target artificial neural network via a copy engine by copying the weights of the prediction artificial neural network into the target network weight memory.

8. The method of claim 1, further comprising: repeating the applying, selecting, storing, and adjusting steps for each step of an episode of training.

9. The method of claim 8, further comprising: performing multiple episodes of training to train the prediction artificial neural network.

10. A machine learning device for training a prediction artificial neural network, the machine learning device comprising: a set of memories including a replay memory, a prediction network weight memory, and a target network weight memory; one or more inference cores configured to apply state information for time step t to a prediction artificial neural network having weights stored in the prediction network weight memory, to obtain output scores for a set of actions; an action selection processor, comprising one of the one or more inference cores or a processor other than the one or more inference cores, configured to select an action from the set of actions based on the output scores, for application to an environment, to advance the environment to time step t+1; a tuple storing processor, comprising one of the one or more inference cores or a processor other than the one or more inference cores, configured to store a tuple for a transition from state s_t to state s_{t+1} into the replay memory, the tuple including the selected action and a reward provided by the environment; and one or more training cores configured to adjust weights of the prediction artificial neural network stored in the prediction network weight memory based on application of states s_t and s_{t+1} from the tuple to the prediction artificial neural network and a target artificial neural network having weights stored in the target network weight memory, respectively.

11. The machine learning device of claim 10, wherein adjusting the weights of the prediction artificial neural network includes: sampling, by the one or more training cores, one or more tuples from the replay memory, where each tuple includes a state s_j, an action a_j, a reward for the action r_j, and a subsequent state s_{j+1}.

12. The machine learning device of claim 11, wherein adjusting the weights of the prediction artificial neural network further includes: applying, by the one or more training cores, state s_{j+1} to the target artificial neural network and obtaining a highest action score output from the target artificial neural network.

13. The machine learning device of claim 12, wherein adjusting the weights of the prediction artificial neural network further includes: applying, by the one or more training cores, state s_j to the prediction artificial neural network to obtain an action score for action a_j.

14. The machine learning device of claim 13, wherein adjusting the weights of the prediction artificial neural network further includes: determining, by the one or more training cores, a loss function based on the highest action score output by the target artificial neural network for state s_{j+1}, the action score for action a_j output by the prediction artificial neural network, and the reward r_j.

15. The machine learning device of claim 14, wherein adjusting the weights of the prediction artificial neural network further includes: performing, by the one or more training cores, a gradient descent operation on the loss function with respect to the weights of the prediction artificial neural network.

16. The machine learning device of claim 10, further comprising: a copy engine configured to periodically update the weights of the target artificial neural network by copying the weights of the prediction artificial neural network into the target network weight memory.

17. The machine learning device of claim 10, wherein the one or more inference cores, the action selection processor, the tuple storing processor, and the one or more training cores are further configured to: repeat the applying, selecting, storing, and adjusting for each step of an episode of training.

18. The machine learning device of claim 17, wherein the one or more inference cores, the action selection processor, the tuple storing processor, and the one or more training cores are further configured to: perform multiple episodes of training to train the prediction artificial neural network.

19. A computing device for training a prediction artificial neural network, the computing device comprising: a central processor configured to interface with an environment by applying actions to the environment and observing states and rewards output by the environment; and a machine learning device for training the prediction artificial neural network, the machine learning device comprising: a set of memories including a replay memory, a prediction network weight memory, and a target network weight memory; one or more inference cores configured to apply state information for time step t to a prediction artificial neural network having weights stored in the prediction network weight memory, to obtain output scores for a set of actions; an action selection processor, comprising one of the one or more inference cores, configured to select an action from the set of actions based on the output scores, for application to the environment, to advance the environment to time step t+1; a tuple storing processor, comprising one of the one or more inference cores, configured to store a tuple for a transition from state s_t to state s_{t+1} into the replay memory, the tuple including the selected action and a reward provided by the environment; and one or more training cores configured to adjust weights of the prediction artificial neural network stored in the prediction network weight memory based on application of states s_t and s_{t+1} from the tuple to the prediction artificial neural network and a target artificial neural network having weights stored in the target network weight memory, respectively.

20. The computing device of claim 19, wherein adjusting the weights of the prediction artificial neural network includes: sampling, by the one or more training cores, one or more tuples from the replay memory, where each tuple includes a state s_j, an action a_j, a reward for the action r_j, and a subsequent state s_{j+1}.
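Claims 3 through 6 (and their device counterparts, claims 12 through 15) together define the loss computation: the target network supplies the highest action score for state s_{j+1}, the prediction network supplies the score for the taken action a_j, and the reward r_j completes the temporal-difference target. A hedged sketch of that loss, assuming the standard squared-error form and a discount factor gamma, neither of which the claims specify:

```python
def td_loss(q_pred_sj_aj, max_q_target_sj1, r_j, gamma=0.99):
    # TD target: y = r_j + gamma * max_a Q_target(s_{j+1}, a)
    y = r_j + gamma * max_q_target_sj1
    # Squared difference against Q_pred(s_j, a_j); the training cores
    # then run gradient descent on this quantity (claim 6).
    return (y - q_pred_sj_aj) ** 2
```

When the prediction already matches the target exactly, the loss is zero and the gradient step leaves the weights unchanged.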
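The replay memory recited in claims 1, 2, 10, 11, and 20 stores transition tuples and is sampled at random, which is what breaks the temporal correlation mentioned in the abstract. A minimal sketch, assuming a bounded FIFO buffer and uniform sampling without replacement; the capacity and batch size here are illustrative choices, not taken from the claims:

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest tuples evicted first

    def store(self, s, a, r, s_next):      # one (s_j, a_j, r_j, s_{j+1}) tuple
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):          # uniform random, without replacement
        return random.sample(list(self.buf), batch_size)
```

A dedicated tuple-storing processor, as in claims 10 and 19, would perform the `store` step, while the training cores would call `sample`.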