Associating genotype sequence properties to haplotype inference errors

Bibliographic details
Main author: ROSA, Rogério dos Santos
Publication date: 2015
Document type: Thesis
Language: por
Source title: Repositório Institucional da UFPE
dARK ID: ark:/64986/001300000k22p
Full text: https://repositorio.ufpe.br/handle/123456789/16011
Abstract: Haplotype information plays a central role in understanding and diagnosing certain illnesses, and in evolutionary studies. Because this information is hard to obtain directly, computational methods that infer haplotypes from genotype data have received great attention from the computational biology community. Unfortunately, haplotype inference is a very hard computational problem, and existing methods identify correct solutions only partially. I present neural network models that use different properties of the data to predict when a method is more prone to errors. I construct models for three different Haplotype Inference (HI) approaches and show that the models are accurate and statistically relevant. The experimental results offer valuable insights into the performance of those methods, opening opportunities for combining strategies or improving individual approaches. I formally demonstrate that Linkage Disequilibrium (LD) and heterozygosity are very strong indicators of Switch Error tendency for four methods studied, and I delineate scenarios, based on LD measures, that reveal a higher or lower propensity of the HI methods to make inference errors; the correlation between LD and the occurrence of errors varies among regions along the genotypes. I present evidence that considering windows of length 10 immediately to the left of a SNP (the upstream region) and eliminating non-informative SNPs through Fisher's Test leads to a more suitable correlation between LD and inference errors. I apply Multiple Linear Regression to explore the relevance of several biologically meaningful properties of the genotype sequences to the accuracy of haplotype inference results, developing models for two databases (human data only) and using two error metrics. The accuracy of the results and the stability of the proposed models are supported by statistical evidence.
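The Switch Error metric named in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's exact formulation, so the definition used here (the fraction of consecutive heterozygous-site pairs whose relative phase differs between the true and the inferred haplotypes) is an assumption, as are the function name `switch_error_rate` and the 0/1 allele encoding.

```python
from typing import Sequence


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    inferred_h1: Sequence[int],
    inferred_h2: Sequence[int],
) -> float:
    """Fraction of adjacent heterozygous-site pairs with inconsistent phase.

    Assumes biallelic SNPs encoded as 0/1 and an inferred haplotype pair
    that is consistent with the true genotype (hypothetical setup, not
    taken from the thesis).
    """
    n = len(true_h1)
    if not (len(true_h2) == len(inferred_h1) == len(inferred_h2) == n):
        raise ValueError("all haplotypes must have the same length")

    # Only heterozygous sites carry phase information.
    het_sites = [i for i in range(n) if true_h1[i] != true_h2[i]]
    if len(het_sites) < 2:
        return 0.0  # no adjacent pair of heterozygous sites to compare

    # At each heterozygous site, record whether inferred_h1 follows true_h1.
    follows_h1 = [inferred_h1[i] == true_h1[i] for i in het_sites]

    # A switch occurs whenever that assignment flips between neighbours.
    switches = sum(1 for a, b in zip(follows_h1, follows_h1[1:]) if a != b)
    return switches / (len(het_sites) - 1)


if __name__ == "__main__":
    true_a, true_b = [0, 1, 0, 1, 0], [1, 0, 1, 0, 1]
    # The inferred pair follows the truth for two sites, then swaps phase once.
    inf_a, inf_b = [0, 1, 1, 0, 1], [1, 0, 0, 1, 0]
    print(switch_error_rate(true_a, true_b, inf_a, inf_b))
```

Under this definition, a haplotype pair that is reported in the opposite order but is otherwise correct scores zero, because only changes of phase between neighbouring heterozygous sites are counted.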
id UFPE_176dfdf3ab7f8a7e83d8203803ed8dc9
oai_identifier_str oai:repositorio.ufpe.br:123456789/16011
network_acronym_str UFPE
network_name_str Repositório Institucional da UFPE
repository_id_str 2221
dc.title.pt_BR.fl_str_mv Associating genotype sequence properties to haplotype inference errors
author ROSA, Rogério dos Santos
author_facet ROSA, Rogério dos Santos
author_role author
dc.contributor.advisorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/8994178236264483
dc.contributor.author.fl_str_mv ROSA, Rogério dos Santos
dc.contributor.advisor1.fl_str_mv GUIMARÃES, Katia Silva
contributor_str_mv GUIMARÃES, Katia Silva
dc.subject.por.fl_str_mv Linear Regression
Statistical Analysis
SNPs
Haplotypes
Genotype Data
Haplotype Inference
publishDate 2015
dc.date.issued.fl_str_mv 2015-03-12
dc.date.accessioned.fl_str_mv 2016-03-16T15:28:48Z
dc.date.available.fl_str_mv 2016-03-16T15:28:48Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://repositorio.ufpe.br/handle/123456789/16011
dc.identifier.dark.fl_str_mv ark:/64986/001300000k22p
url https://repositorio.ufpe.br/handle/123456789/16011
identifier_str_mv ark:/64986/001300000k22p
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Pernambuco
dc.publisher.program.fl_str_mv Programa de Pos Graduacao em Ciencia da Computacao
dc.publisher.initials.fl_str_mv UFPE
dc.publisher.country.fl_str_mv Brazil
publisher.none.fl_str_mv Universidade Federal de Pernambuco
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFPE
instname:Universidade Federal de Pernambuco (UFPE)
instacron:UFPE
instname_str Universidade Federal de Pernambuco (UFPE)
instacron_str UFPE
institution UFPE
reponame_str Repositório Institucional da UFPE
collection Repositório Institucional da UFPE
bitstream.url.fl_str_mv https://repositorio.ufpe.br/bitstream/123456789/16011/5/RogerioSantosRosa_Tese.pdf.jpg
https://repositorio.ufpe.br/bitstream/123456789/16011/1/RogerioSantosRosa_Tese.pdf
https://repositorio.ufpe.br/bitstream/123456789/16011/2/license_rdf
https://repositorio.ufpe.br/bitstream/123456789/16011/3/license.txt
https://repositorio.ufpe.br/bitstream/123456789/16011/4/RogerioSantosRosa_Tese.pdf.txt
bitstream.checksum.fl_str_mv e73c2ab554a22cd3bbb267dc3c93b2fd
aa346f64c34419c4b83269ccb99ade6a
66e71c371cc565284e70f40736c94386
4b8a02c7f2818eaf00dcf2260dd5eb08
181c2363b8027366944d06190851c41a
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE)
repository.mail.fl_str_mv attena@ufpe.br
_version_ 1815172843539791872