Associating genotype sequence properties to haplotype inference errors
Main author: | ROSA, Rogério dos Santos |
---|---|
Publication date: | 2015 |
Document type: | Thesis |
Language: | por |
Source title: | Repositório Institucional da UFPE |
dARK ID: | ark:/64986/001300000k22p |
Full text: | https://repositorio.ufpe.br/handle/123456789/16011 |
Abstract: | Haplotype information plays a central role in understanding and diagnosing certain illnesses, and in evolutionary studies. Because this information is hard to obtain directly, computational methods that infer haplotypes from genotype data have received great attention from the computational biology community. Unfortunately, haplotype inference is a very hard computational problem, and existing methods identify correct solutions only partially. I present neural network models that use different properties of the data to predict when a method is more prone to errors. I construct models for three different Haplotype Inference (HI) approaches and show that the models are accurate and statistically relevant. The experimental results offer valuable insights into the performance of those methods, opening the way to combining strategies or improving individual approaches. I formally demonstrate that Linkage Disequilibrium (LD) and heterozygosity are very strong indicators of Switch Error tendency for four methods studied, and I delineate scenarios, based on LD measures, that reveal a higher or lower propensity of the HI methods to make inference errors; the correlation between LD and the occurrence of errors thus varies among regions along the genotypes. I present evidence that considering windows of length 10 immediately to the left of a SNP (the upstream region) and eliminating non-informative SNPs through Fisher's test leads to a more suitable correlation between LD and inference errors. I apply Multiple Linear Regression to explore the relevance of several biologically meaningful properties of the genotype sequences to the accuracy of haplotype inference results, developing models for two databases (considering only humans) and using two error metrics. The accuracy of the results and the stability of the proposed models are supported by statistical evidence. |
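
The abstract refers to the Switch Error metric and to LD measured over a window of 10 SNPs upstream of a site, but this record does not define either quantity. The Python sketch below illustrates both under commonly used definitions: a switch error is a change of phase between consecutive heterozygous sites, and LD is the pairwise r² statistic. The function names, the 0/1 string encoding of haplotypes, and the choice to average r² over the window are assumptions made for illustration; they are not taken from the thesis and may differ from its actual procedure.

```python
from typing import Sequence


def switch_error_rate(true_h1: str, true_h2: str, inferred_h1: str) -> float:
    """Fraction of consecutive heterozygous site pairs whose phase flips.

    Assumes the inferred haplotype pair is consistent with the genotype,
    so only one inferred haplotype is needed.
    """
    # Heterozygous sites are those where the two true haplotypes differ.
    het = [i for i in range(len(true_h1)) if true_h1[i] != true_h2[i]]
    if len(het) < 2:
        return 0.0  # no phase to get wrong with fewer than two het sites
    # Phase at a site: does the inferred haplotype follow true_h1 there?
    phase = [inferred_h1[i] == true_h1[i] for i in het]
    switches = sum(a != b for a, b in zip(phase, phase[1:]))
    return switches / (len(het) - 1)


def r_squared(haps: Sequence[str], i: int, j: int) -> float:
    """Pairwise LD (r^2) between sites i and j over a haplotype sample."""
    n = len(haps)
    p_a = sum(h[i] == "1" for h in haps) / n
    p_b = sum(h[j] == "1" for h in haps) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined for monomorphic sites; returning 0.0 is an assumption.
    return d * d / denom if denom > 0 else 0.0


def upstream_mean_r2(haps: Sequence[str], site: int, window: int = 10) -> float:
    """Mean r^2 between `site` and up to `window` sites immediately to its left.

    Averaging is an illustrative aggregation choice, not the thesis's method.
    """
    upstream = range(max(0, site - window), site)
    values = [r_squared(haps, j, site) for j in upstream]
    return sum(values) / len(values) if values else 0.0
```

For example, with true haplotypes `"0101"` and `"1010"` and inferred haplotype `"0110"`, the phase flips once across three consecutive heterozygous pairs, so `switch_error_rate` returns 1/3. The Fisher's test filtering of non-informative SNPs mentioned in the abstract is not sketched here.
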
id |
UFPE_176dfdf3ab7f8a7e83d8203803ed8dc9 |
---|---|
oai_identifier_str |
oai:repositorio.ufpe.br:123456789/16011 |
network_acronym_str |
UFPE |
network_name_str |
Repositório Institucional da UFPE |
repository_id_str |
2221 |
dc.title.pt_BR.fl_str_mv |
Associating genotype sequence properties to haplotype inference errors |
title |
Associating genotype sequence properties to haplotype inference errors |
author |
ROSA, Rogério dos Santos |
author_facet |
ROSA, Rogério dos Santos |
author_role |
author |
dc.contributor.advisorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/8994178236264483 |
dc.contributor.author.fl_str_mv |
ROSA, Rogério dos Santos |
dc.contributor.advisor1.fl_str_mv |
GUIMARÃES, Katia Silva |
contributor_str_mv |
GUIMARÃES, Katia Silva |
dc.subject.por.fl_str_mv |
Linear Regression; Statistical Analysis; SNPs; Haplotypes; Genotype Data; Haplotype Inference |
topic |
Linear Regression; Statistical Analysis; SNPs; Haplotypes; Genotype Data; Haplotype Inference |
publishDate |
2015 |
dc.date.issued.fl_str_mv |
2015-03-12 |
dc.date.accessioned.fl_str_mv |
2016-03-16T15:28:48Z |
dc.date.available.fl_str_mv |
2016-03-16T15:28:48Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufpe.br/handle/123456789/16011 |
dc.identifier.dark.fl_str_mv |
ark:/64986/001300000k22p |
url |
https://repositorio.ufpe.br/handle/123456789/16011 |
identifier_str_mv |
ark:/64986/001300000k22p |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Pernambuco |
dc.publisher.program.fl_str_mv |
Programa de Pos Graduacao em Ciencia da Computacao |
dc.publisher.initials.fl_str_mv |
UFPE |
dc.publisher.country.fl_str_mv |
Brazil |
publisher.none.fl_str_mv |
Universidade Federal de Pernambuco |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFPE instname:Universidade Federal de Pernambuco (UFPE) instacron:UFPE |
instname_str |
Universidade Federal de Pernambuco (UFPE) |
instacron_str |
UFPE |
institution |
UFPE |
reponame_str |
Repositório Institucional da UFPE |
collection |
Repositório Institucional da UFPE |
bitstream.url.fl_str_mv |
https://repositorio.ufpe.br/bitstream/123456789/16011/5/RogerioSantosRosa_Tese.pdf.jpg https://repositorio.ufpe.br/bitstream/123456789/16011/1/RogerioSantosRosa_Tese.pdf https://repositorio.ufpe.br/bitstream/123456789/16011/2/license_rdf https://repositorio.ufpe.br/bitstream/123456789/16011/3/license.txt https://repositorio.ufpe.br/bitstream/123456789/16011/4/RogerioSantosRosa_Tese.pdf.txt |
bitstream.checksum.fl_str_mv |
e73c2ab554a22cd3bbb267dc3c93b2fd aa346f64c34419c4b83269ccb99ade6a 66e71c371cc565284e70f40736c94386 4b8a02c7f2818eaf00dcf2260dd5eb08 181c2363b8027366944d06190851c41a |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE) |
repository.mail.fl_str_mv |
attena@ufpe.br |
_version_ |
1815172843539791872 |