Researchers at DeepMind, Google’s London-based AI company, say they have developed a new approach to reinforcement learning, in which a system learns by trial and error and is motivated to get things right by the rewards it receives.
Imagine using the technology to determine how to conduct search advertising and organic optimization. This is the potential and the curse of artificial intelligence.
In a typical reinforcement learning system, the algorithm predicts the average reward it receives from multiple attempts at a task and uses this prediction to decide how to act, according to the post on the company's website. Random disturbances in the environment can alter the system's behavior by changing the exact amount of reward it receives.
The idea is that the algorithm learns, weighs rewards based on what it has learned, and almost seems to eventually develop its own personality based on outside influences. In a new paper, DeepMind researchers show it is possible to model not only the average reward but also the full range of ways the reward can vary. Researchers call this the "value distribution," or the distribution of the reward.
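The difference between an average reward and a value distribution can be shown with a small simulation. This is only a sketch under stated assumptions: the task, its payoffs (1.0 normally, -3.0 after a random setback) and the 20% setback probability are all hypothetical and do not come from DeepMind's paper.

```python
import random
from collections import Counter


def sample_reward(rng):
    """Hypothetical task: pays 1.0 most of the time, but a random
    setback (assumed 20% chance) turns the reward into -3.0."""
    return -3.0 if rng.random() < 0.2 else 1.0


def summarize(n_attempts=10_000, seed=0):
    """Repeat the task many times and report both summaries."""
    rng = random.Random(seed)
    rewards = [sample_reward(rng) for _ in range(n_attempts)]

    # What a conventional system tracks: a single average.
    average = sum(rewards) / n_attempts

    # What a distributional view keeps: how often each reward occurs.
    counts = Counter(rewards)
    distribution = {r: c / n_attempts for r, c in sorted(counts.items())}
    return average, distribution


average, distribution = summarize()
print(f"average reward: {average:.2f}")
for reward, probability in distribution.items():
    print(f"reward {reward:+.1f} occurs with probability {probability:.2f}")
```

The average lands near 0.2, a reward the task never actually pays out. The distribution shows the two outcomes that really occur, which is the extra information the researchers propose to model.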
Modeling the value distribution makes reinforcement learning systems more accurate and faster to train than previous models. More importantly, according to the researchers, it opens the possibility of rethinking the entire reinforcement learning process.
Researchers give a short example of how the system learns from rewards: calculating the average commute time of a train. Using Bellman's equation, they predict the average commute time and then update the estimate of when the train will actually arrive as it makes its way through several stops.
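The commute example can be sketched in a few lines. Bellman's equation says the expected time remaining at a stop equals the expected time of the next segment plus the expected time remaining at the following stop. The segment times and delay probabilities below are hypothetical, chosen only for illustration, and the function names are not taken from DeepMind's work.

```python
from collections import defaultdict

# Hypothetical line with three segments. Each segment lists its
# possible (minutes, probability) outcomes; a delay is the rarer one.
SEGMENTS = [
    [(15, 0.9), (30, 0.1)],  # stop 0 -> stop 1
    [(10, 1.0)],             # stop 1 -> stop 2
    [(20, 0.8), (40, 0.2)],  # stop 2 -> destination
]


def expected_remaining_time(segments):
    """Bellman backup on averages: V(s) = E[time of segment s] + V(s + 1)."""
    values = [0.0] * (len(segments) + 1)  # nothing remains at the destination
    for s in reversed(range(len(segments))):
        expected_segment = sum(minutes * p for minutes, p in segments[s])
        values[s] = expected_segment + values[s + 1]
    return values


def remaining_time_distribution(segments, start=0):
    """Same backup, but carrying the whole distribution of total times."""
    dist = {0: 1.0}  # at the destination, zero minutes remain for certain
    for s in reversed(range(start, len(segments))):
        backed_up = defaultdict(float)
        for minutes, p in segments[s]:
            for total, q in dist.items():
                backed_up[minutes + total] += p * q
        dist = dict(backed_up)
    return dist


avg_values = expected_remaining_time(SEGMENTS)
print("expected minutes remaining at each stop:", avg_values)

for total, probability in sorted(remaining_time_distribution(SEGMENTS).items()):
    print(f"{total} minutes with probability {probability:.2f}")
```

The first function yields one number per stop. The second yields every possible total commute time with its probability, and its mean matches the first function's estimate at the starting stop.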
The idea is to make predictions around the likely average, but in the end the technology revises its expectations based on what it actually observes, even when the predictions and averages turn out to be incorrect. Accounting for the randomness in rewards helps the AI learn, much as a human would.