A framework for closed domain question answering systems in the low data regime.

Carmo, Vinícius Cleves de Oliveira

Bibliographic details
Main author:	Carmo, Vinícius Cleves de Oliveira
Publication date:	2022
Document type:	Master's thesis
Language:	eng
Source title:	Biblioteca Digital de Teses e Dissertações da USP
Full text:	https://www.teses.usp.br/teses/disponiveis/3/3141/tde-24052023-152815/
Abstract:	Question Answering (QA) systems that operate over textual databases aim to improve on traditional information retrieval systems: while the latter retrieve a number of relevant documents from a document pool, the former can also find and present direct answers to users. Recent improvements in QA have been based on deep neural networks, which require large volumes of labeled data for training. Most existing datasets target general knowledge and, even though there are a few datasets for specific domains (such as biomedicine), for most domains there is no labeled, or easy-to-label, dataset available. This creates an obstacle to the development of domain-specific QA systems. We propose a framework for developing domain-specific QA systems by leveraging unsupervised learning so as to avoid the costs of large-scale dataset labeling. The contributions of this work are twofold. First, we apply domain-adaptive pretraining to improve the out-of-domain performance of reading comprehension and question answering systems. This technique achieves state-of-the-art results on two Reading Comprehension datasets, and it exceeds the performance of state-of-the-art domain adaptation techniques in the literature by a significant margin: 2.3 in exact match and 5.2 in F1-score on BioASQ. Second, we propose a framework for domain-specific question answering in the low data regime. For document retrieval, we combine BM25 with a custom text processing pipeline. We find that, in a low data setting, statistical document retrieval models outperform neural models when the data in the desired domain differ from the data used for training. For answer selection, we apply a neural reader trained with domain-adaptive pretraining to improve generalization on the desired domain. We also perform a case study by applying the proposed framework to the offshore engineering domain.
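
The document retrieval step named in the abstract (BM25 over preprocessed text) can be illustrated with a minimal sketch. It assumes the Okapi BM25 scoring formula with a smoothed IDF and commonly used defaults (k1 = 1.5, b = 0.75), plus a hypothetical preprocessing step (lowercasing, alphanumeric tokenization, stopword removal). The thesis's actual pipeline, parameters, and corpus are not given in this record, so the names `preprocess` and `BM25`, the stopword list, and the sample documents are illustrative assumptions only.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a domain-tuned one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are", "what"}


def preprocess(text):
    """Lowercase, keep alphanumeric tokens, drop stopwords (assumed pipeline)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class BM25:
    """Okapi BM25 with smoothed IDF: log(1 + (N - df + 0.5) / (df + 0.5))."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tf = [Counter(preprocess(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        # Average document length; fall back to 1.0 to avoid division by zero.
        self.avgdl = (sum(self.doc_len) / max(len(docs), 1)) or 1.0
        df = Counter()
        for tf in self.doc_tf:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, idx):
        """BM25 score of document `idx` for `query`."""
        tf = self.doc_tf[idx]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avgdl)
        total = 0.0
        for term in preprocess(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def rank(self, query, top_k=3):
        """Indices of the top_k documents with a positive score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tf))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for s, i in scored[:top_k] if s > 0]


# Illustrative documents (not taken from the thesis corpus).
docs = [
    "Mooring lines keep a floating platform in position.",
    "The riser transports fluids from the seabed to the platform.",
    "BM25 ranks documents by term frequency and document length.",
]
index = BM25(docs)
print(index.rank("What is a riser?"))
```

In a retriever-reader setup like the one the abstract describes, the ranked documents would then be passed to the neural reader for answer selection; that step is not sketched here.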

Item metadata

id	USP_4a8c4ffcc30fe199387186497ba29ac5
oai_identifier_str	oai:teses.usp.br:tde-24052023-152815
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str	2721
dc.title.none.fl_str_mv	A framework for closed domain question answering systems in the low data regime. Uma abordagem para o projeto de sistemas de resposta a perguntas com escassez de dados.
title	A framework for closed domain question answering systems in the low data regime.
author	Carmo, Vinícius Cleves de Oliveira
author_facet	Carmo, Vinícius Cleves de Oliveira
author_role	author
dc.contributor.none.fl_str_mv	Cozman, Fabio Gagliardi
dc.contributor.author.fl_str_mv	Carmo, Vinícius Cleves de Oliveira
dc.subject.por.fl_str_mv	Information retrieval; Machine learning; Neural networks; Question answering systems
topic	Information retrieval; Machine learning; Neural networks; Question answering systems
publishDate	2022
dc.date.none.fl_str_mv	2022-12-06
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/3/3141/tde-24052023-152815/
url	https://www.teses.usp.br/teses/disponiveis/3/3141/tde-24052023-152815/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Release the content for public access. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Release the content for public access.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br || atendimento@aguia.usp.br || virginia@if.usp.br
_version_	1815256533273935872
