DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms

Bibliographic details
Main author: MACHADO, Mateus Gonçalves
Publication date: 2022
Document type: Master's thesis (Dissertação)
Language: eng
Source: Repositório Institucional da UFPE
Full text: https://repositorio.ufpe.br/handle/123456789/46630
Abstract: Reinforcement Learning (RL) is an emerging subfield of Machine Learning in which an agent interacts with an environment and leverages its experiences to learn, by trial and error, which actions are most appropriate for each state. At each step the agent receives a positive or negative reward signal, which is the main feedback used for learning. RL finds applications in many areas, such as robotics, stock trading, and even cooling systems, and has achieved superhuman performance in learning to play board games (Chess and Go) and video games (Atari games, Dota 2, and StarCraft II). However, RL methods still struggle in environments with sparse rewards. For example, an agent may receive very few goal-scoring rewards in a soccer game, making it hard to associate rewards (goals) with the actions that produced them. Researchers frequently introduce multiple intermediate rewards to circumvent this problem and aid learning. However, adequately combining multiple rewards into the single scalar reward signal used by RL methods is frequently not an easy task. This work addresses this specific problem by introducing DyLam. It extends existing policy gradient methods by decomposing the reward function used in the environment and dynamically weighting each component as a function of the agent’s performance on the associated task. We prove the convergence of the proposed method and show empirically that it outperforms competing methods in the evaluated environments in terms of learning speed and, in some cases, final performance.
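The abstract describes the weighting mechanism only at a high level. The sketch below is a minimal, hypothetical illustration of dynamic reward weighting in that spirit, not the dissertation's actual DyLam algorithm: the per-component return bounds (r_min, r_max), the exponential moving average of performance, and the normalization rule are all assumptions made for the example.

```python
import numpy as np

# Hypothetical sketch of dynamic reward weighting in the spirit of the
# abstract; the dissertation's actual DyLam update rule may differ.
class DynamicRewardWeighter:
    """Weights decomposed reward components by how far the agent still is
    from its best expected performance on each associated task."""

    def __init__(self, r_min, r_max, tau=0.995):
        self.r_min = np.asarray(r_min, dtype=float)  # assumed worst-case episodic return per component
        self.r_max = np.asarray(r_max, dtype=float)  # assumed best-case episodic return per component
        self.ema = self.r_min.copy()                 # smoothed per-component performance estimate
        self.tau = tau                               # EMA smoothing factor (assumed)

    def update(self, episode_returns):
        # Track a moving average of each component's episodic return.
        self.ema = self.tau * self.ema + (1.0 - self.tau) * np.asarray(episode_returns, dtype=float)

    def weights(self):
        # Normalize performance to [0, 1]; components the agent still
        # performs poorly on receive larger weights.
        progress = np.clip((self.ema - self.r_min) / (self.r_max - self.r_min), 0.0, 1.0)
        unnormalized = (1.0 - progress) + 1e-8
        return unnormalized / unnormalized.sum()

    def scalarize(self, reward_vector):
        # Collapse the decomposed reward into the single scalar signal
        # consumed by the underlying policy gradient method.
        return float(np.dot(self.weights(), np.asarray(reward_vector, dtype=float)))


# Example with a hypothetical two-component soccer reward:
# [sparse goal reward, dense ball-possession shaping].
weighter = DynamicRewardWeighter(r_min=[0.0, 0.0], r_max=[1.0, 100.0])
weighter.update([0.0, 40.0])             # component returns observed at episode end
print(weighter.weights())                # the lagging "goal" component is weighted more
print(weighter.scalarize([0.0, 0.5]))    # scalar reward for one environment step
```

In a training loop, scalarize would be applied to each step's decomposed reward vector before the policy gradient update, and update would be called once per episode with the accumulated component returns.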
id UFPE_a6a94a39c43948906d1943ce145c67f2
oai_identifier_str oai:repositorio.ufpe.br:123456789/46630
network_acronym_str UFPE
network_name_str Repositório Institucional da UFPE
repository_id_str 2221
dc.description.sponsorship.fl_str_mv FACEPE
dc.title.pt_BR.fl_str_mv DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
title DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
spellingShingle DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
MACHADO, Mateus Gonçalves
Engenharia da computação
Aprendizagem
title_short DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
title_full DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
title_fullStr DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
title_full_unstemmed DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
title_sort DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
author MACHADO, Mateus Gonçalves
author_facet MACHADO, Mateus Gonçalves
author_role author
dc.contributor.authorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/6336642250934748
dc.contributor.advisorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/1931667959910637
dc.contributor.author.fl_str_mv MACHADO, Mateus Gonçalves
dc.contributor.advisor1.fl_str_mv BASSANI, Hansenclever de França
contributor_str_mv BASSANI, Hansenclever de França
dc.subject.por.fl_str_mv Engenharia da computação
Aprendizagem
topic Engenharia da computação
Aprendizagem
description Reinforcement Learning (RL) is an emerging subfield of Machine Learning in which an agent interacts with an environment and leverages its experiences to learn, by trial and error, which actions are most appropriate for each state. At each step the agent receives a positive or negative reward signal, which is the main feedback used for learning. RL finds applications in many areas, such as robotics, stock trading, and even cooling systems, and has achieved superhuman performance in learning to play board games (Chess and Go) and video games (Atari games, Dota 2, and StarCraft II). However, RL methods still struggle in environments with sparse rewards. For example, an agent may receive very few goal-scoring rewards in a soccer game, making it hard to associate rewards (goals) with the actions that produced them. Researchers frequently introduce multiple intermediate rewards to circumvent this problem and aid learning. However, adequately combining multiple rewards into the single scalar reward signal used by RL methods is frequently not an easy task. This work addresses this specific problem by introducing DyLam. It extends existing policy gradient methods by decomposing the reward function used in the environment and dynamically weighting each component as a function of the agent’s performance on the associated task. We prove the convergence of the proposed method and show empirically that it outperforms competing methods in the evaluated environments in terms of learning speed and, in some cases, final performance.
publishDate 2022
dc.date.accessioned.fl_str_mv 2022-09-22T12:00:22Z
dc.date.available.fl_str_mv 2022-09-22T12:00:22Z
dc.date.issued.fl_str_mv 2022-06-07
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv MACHADO, Mateus Gonçalves. DyLam: a dynamic reward weighting method for reinforcement learning policy gradient algorithms. 2022. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2022.
dc.identifier.uri.fl_str_mv https://repositorio.ufpe.br/handle/123456789/46630
identifier_str_mv MACHADO, Mateus Gonçalves. DyLam: a dynamic reward weighting method for reinforcement learning policy gradient algorithms. 2022. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2022.
url https://repositorio.ufpe.br/handle/123456789/46630
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/3.0/br/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Pernambuco
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv UFPE
dc.publisher.country.fl_str_mv Brasil
publisher.none.fl_str_mv Universidade Federal de Pernambuco
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFPE
instname:Universidade Federal de Pernambuco (UFPE)
instacron:UFPE
instname_str Universidade Federal de Pernambuco (UFPE)
instacron_str UFPE
institution UFPE
reponame_str Repositório Institucional da UFPE
collection Repositório Institucional da UFPE
bitstream.url.fl_str_mv https://repositorio.ufpe.br/bitstream/123456789/46630/4/DISSERTA%c3%87%c3%83O%20Mateus%20Gon%c3%a7alves%20Machado.pdf.txt
https://repositorio.ufpe.br/bitstream/123456789/46630/5/DISSERTA%c3%87%c3%83O%20Mateus%20Gon%c3%a7alves%20Machado.pdf.jpg
https://repositorio.ufpe.br/bitstream/123456789/46630/1/DISSERTA%c3%87%c3%83O%20Mateus%20Gon%c3%a7alves%20Machado.pdf
https://repositorio.ufpe.br/bitstream/123456789/46630/2/license_rdf
https://repositorio.ufpe.br/bitstream/123456789/46630/3/license.txt
bitstream.checksum.fl_str_mv 63d2e937bfade6384ad349c80c0caa83
3a8e4f620936c084c13160af94edc787
2c59f6eca849f5c7301dd91b52fdd546
e39d27027a6cc9cb039ad269a5db8e34
6928b9260b07fb2755249a5ca9903395
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE)
repository.mail.fl_str_mv attena@ufpe.br
_version_ 1802310724315250688