Systemic-Functional modeling of text complexity in Brazilian Portuguese

Bibliographic details
Main author: Rodrigo Araujo e Castro
Publication date: 2021
Document type: Thesis
Language: eng
Source title: Repositório Institucional da UFMG
Full text: http://hdl.handle.net/1843/39311
https://orcid.org/0000-0002-7433-7904
Abstract: Investigating text complexity is a significant step towards modeling text simplification tasks, since text simplification is the reduction of the complexity of a text. In the last two decades, studies in Natural Language Processing (NLP) have attempted to discover efficient simplification strategies. Although some attempts to address this issue by building computer models based on language theories have provided potentially valuable insights, they remain insufficient to deal with the task effectively. Aiming to fill this gap and drawing on a comprehensive theory of language, Systemic Functional Linguistics (SFL) (Halliday & Matthiessen, 2014), this thesis explores text complexity with a view to gathering findings that may inform text simplification tasks aimed at producing more accessible texts in Brazilian Portuguese. To that end, SIM-Pt (Simplified Brazilian Portuguese), a monolingual parallel corpus of aligned text segments in the physics, biology, and psychology domains, was compiled. Text segments were organized into two paired datasets: (1) two sets of naturally occurring segments, made up of, respectively, simpler and more complex segments extracted from science texts found on the Web; and (2) two sets of manually constructed segments based on the naturally occurring segments, ensuring distinct complexity levels. Each set contains approximately 200 text segments. Clauses in segments were manually analyzed in terms of Ideational, Interpersonal, and Textual meanings, and lexicogrammatical patterns were obtained on the basis of systemic and structural frequencies that could yield variables closely related to different levels of grammatical metaphor. By examining text complexity within the strata of Lexicogrammar, Semantics, and Context, we proposed a relationship between text complexity and experiential grammatical metaphor. The results show that, from the experiential viewpoint, a higher degree of experiential grammatical metaphor on average correlates with higher text complexity. The main pieces of evidence supporting this claim from the perspective of lexicogrammar were the higher frequency of relational and existential clauses in combination with middle voice and embedded clauses, and the higher frequency of class shifts (especially nominalizations) and rank shifts (Ravelli, 1999). The findings of this thesis are expected to contribute to text simplification accounts for Brazilian Portuguese in both applied linguistics and NLP.
id UFMG_80c4b04d22edf009d366fb8940ddc1c6
oai_identifier_str oai:repositorio.ufmg.br:1843/39311
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
spelling CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico
ORIGINAL Castro_2021_final.pdf application/pdf 3553466
CC-LICENSE license_rdf application/rdf+xml; charset=utf-8 811
LICENSE license.txt text/plain; charset=utf-8 2118
Repositório de Publicações PUB https://repositorio.ufmg.br/oai
dc.title.pt_BR.fl_str_mv Systemic-Functional modeling of text complexity in Brazilian Portuguese
dc.title.alternative.pt_BR.fl_str_mv Modelagem Sistêmico-Funcional de complexidade textual do português brasileiro
author Rodrigo Araujo e Castro
author_facet Rodrigo Araujo e Castro
author_role author
dc.contributor.advisor1.fl_str_mv Adriana Silvina Pagano
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/9048531014341931
dc.contributor.advisor2.fl_str_mv David Butt
dc.contributor.advisor-co1.fl_str_mv Ilka Afonso Reis
dc.contributor.advisor-co2.fl_str_mv Annabelle Lukin
dc.contributor.referee1.fl_str_mv Giacomo Patrocinio Figueredo
dc.contributor.referee2.fl_str_mv Thiago Castro Ferreira
dc.contributor.referee3.fl_str_mv Igor Antonio Lourenço da Silva
dc.contributor.referee4.fl_str_mv Kicila Ferreguetti de Oliveira
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/1352640534438831
dc.contributor.author.fl_str_mv Rodrigo Araujo e Castro
dc.subject.por.fl_str_mv Systemic Functional Linguistics
Applied Linguistics
Text simplification
Text complexity
Grammatical metaphor
Science texts
Brazilian Portuguese
dc.subject.other.pt_BR.fl_str_mv Translation and interpretation
Applied linguistics
Linguistics – Data processing
Functionalism (Linguistics)
Corpus linguistics
publishDate 2021
dc.date.issued.fl_str_mv 2021-11-11
dc.date.accessioned.fl_str_mv 2022-02-08T20:09:54Z
dc.date.available.fl_str_mv 2022-02-08T20:09:54Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/39311
dc.identifier.orcid.pt_BR.fl_str_mv https://orcid.org/0000-0002-7433-7904
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/3.0/pt/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-nd/3.0/pt/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Estudos Linguísticos
dc.publisher.initials.fl_str_mv UFMG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv FALE - FACULDADE DE LETRAS
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
bitstream.url.fl_str_mv https://repositorio.ufmg.br/bitstream/1843/39311/4/Castro_2021_final.pdf
https://repositorio.ufmg.br/bitstream/1843/39311/2/license_rdf
https://repositorio.ufmg.br/bitstream/1843/39311/5/license.txt
bitstream.checksum.fl_str_mv 826f875ea754cb6b510ed805fb0cccf1
cfd6801dba008cb6adbd9838b81582ab
cda590c95a0b51b4d15f60c9642ca272
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_ 1803589301050540032