Systemic-Functional modeling of text complexity in Brazilian Portuguese
Main author: | Rodrigo Araujo e Castro |
---|---|
Publication date: | 2021 |
Document type: | Thesis |
Language: | eng |
Source title: | Repositório Institucional da UFMG |
Full text: | http://hdl.handle.net/1843/39311 https://orcid.org/0000-0002-7433-7904 |
Abstract: | Investigating text complexity is a significant step towards modeling text simplification tasks, since text simplification is the reduction of the complexity of a text. Over the last two decades, studies in Natural Language Processing (NLP) have attempted to discover efficient simplification strategies. Although some attempts to address this issue by building computer models based on language theories have provided potentially valuable insights, they remain insufficient to deal with the task effectively. Aiming to fill this gap and drawing on a comprehensive theory of language, Systemic Functional Linguistics (SFL) (Halliday & Matthiessen, 2014), this thesis explores text complexity with a view to gathering findings that may inform text simplification tasks aimed at producing more accessible texts in Brazilian Portuguese. To that end, SIM-Pt (Simplified Brazilian Portuguese), a monolingual parallel corpus of aligned text segments in the physics, biology, and psychology domains, was compiled. Text segments were organized into two paired datasets: (1) two sets of naturally occurring segments, consisting respectively of simpler and more complex segments extracted from science texts found on the Web; and (2) two sets of manually constructed segments based on the naturally occurring segments and written to ensure distinct complexity levels. Each set contains approximately 200 text segments. Clauses in the segments were manually analyzed in terms of Ideational, Interpersonal, and Textual meanings, and lexicogrammatical patterns were derived from systemic and structural frequencies that could yield variables closely related to different levels of grammatical metaphor. By examining text complexity within the strata of Lexicogrammar, Semantics, and Context, we proposed a relationship between text complexity and experiential grammatical metaphor. 
The results show that, from the experiential viewpoint, a higher degree of experiential grammatical metaphor correlates on average with higher text complexity. From the perspective of lexicogrammar, the main evidence supporting this claim was the higher frequency of relational and existential clauses in combination with middle voice and embedded clauses, together with the higher frequency of class shifts (especially nominalizations) and rank shifts (Ravelli, 1999). The findings of this thesis are expected to contribute to accounts of text simplification for Brazilian Portuguese in both applied linguistics and NLP. |
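
The abstract describes comparing lexicogrammatical frequencies between simpler and more complex segments. A minimal sketch of that kind of comparison follows. It is not the procedure used in the thesis: the function names, the label set, and the toy annotations are all assumptions made for illustration, and none of the values come from SIM-Pt.

```python
from collections import Counter


def relative_frequencies(labels):
    """Return the share of each label among all annotated clauses."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def frequency_shift(simple_labels, complex_labels):
    """Difference in relative frequency (complex minus simple) per label."""
    simple = relative_frequencies(simple_labels)
    complex_ = relative_frequencies(complex_labels)
    all_labels = sorted(set(simple) | set(complex_))
    return {
        label: complex_.get(label, 0.0) - simple.get(label, 0.0)
        for label in all_labels
    }


# Invented toy annotations (one process-type label per clause);
# these are placeholders, not data from the corpus.
simple_set = ["material", "material", "material", "relational"]
complex_set = ["material", "relational", "relational", "existential"]

shift = frequency_shift(simple_set, complex_set)
for label, delta in shift.items():
    print(f"{label}: {delta:+.2f}")
```

A positive value means the label is relatively more frequent in the more complex set. The same pattern could be applied to any other manually annotated feature, such as voice or embedding, if each clause carries one label for it.
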
id |
UFMG_80c4b04d22edf009d366fb8940ddc1c6 |
---|---|
oai_identifier_str |
oai:repositorio.ufmg.br:1843/39311 |
network_acronym_str |
UFMG |
network_name_str |
Repositório Institucional da UFMG |
repository_id_str |
|
spelling |
CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico; Castro_2021_final.pdf (application/pdf, 3553466); license_rdf (application/rdf+xml; charset=utf-8, 811); license.txt (text/plain; charset=utf-8, 2118) |
dc.title.pt_BR.fl_str_mv |
Systemic-Functional modeling of text complexity in Brazilian Portuguese |
dc.title.alternative.pt_BR.fl_str_mv |
Modelagem Sistêmico-Funcional de complexidade textual do português brasileiro |
author |
Rodrigo Araujo e Castro |
author_facet |
Rodrigo Araujo e Castro |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Adriana Silvina Pagano |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/9048531014341931 |
dc.contributor.advisor2.fl_str_mv |
David Butt |
dc.contributor.advisor-co1.fl_str_mv |
Ilka Afonso Reis |
dc.contributor.advisor-co2.fl_str_mv |
Annabelle Lukin |
dc.contributor.referee1.fl_str_mv |
Giacomo Patrocinio Figueredo |
dc.contributor.referee2.fl_str_mv |
Thiago Castro Ferreira |
dc.contributor.referee3.fl_str_mv |
Igor Antonio Lourenço da Silva |
dc.contributor.referee4.fl_str_mv |
Kicila Ferreguetti de Oliveira |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/1352640534438831 |
dc.contributor.author.fl_str_mv |
Rodrigo Araujo e Castro |
contributor_str_mv |
Adriana Silvina Pagano; David Butt; Ilka Afonso Reis; Annabelle Lukin; Giacomo Patrocinio Figueredo; Thiago Castro Ferreira; Igor Antonio Lourenço da Silva; Kicila Ferreguetti de Oliveira |
dc.subject.por.fl_str_mv |
Systemic Functional Linguistics; Applied Linguistics; Text simplification; Text complexity; Grammatical metaphor; Science texts; Brazilian Portuguese |
topic |
Systemic Functional Linguistics; Applied Linguistics; Text simplification; Text complexity; Grammatical metaphor; Science texts; Brazilian Portuguese; Translation and interpreting; Applied linguistics; Linguistics – Data processing; Functionalism (Linguistics); Corpus linguistics |
dc.subject.other.pt_BR.fl_str_mv |
Translation and interpreting; Applied linguistics; Linguistics – Data processing; Functionalism (Linguistics); Corpus linguistics |
publishDate |
2021 |
dc.date.issued.fl_str_mv |
2021-11-11 |
dc.date.accessioned.fl_str_mv |
2022-02-08T20:09:54Z |
dc.date.available.fl_str_mv |
2022-02-08T20:09:54Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1843/39311 |
dc.identifier.orcid.pt_BR.fl_str_mv |
https://orcid.org/0000-0002-7433-7904 |
url |
http://hdl.handle.net/1843/39311 https://orcid.org/0000-0002-7433-7904 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
http://creativecommons.org/licenses/by-nc-nd/3.0/pt/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-nd/3.0/pt/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Estudos Linguísticos |
dc.publisher.initials.fl_str_mv |
UFMG |
dc.publisher.country.fl_str_mv |
Brazil |
dc.publisher.department.fl_str_mv |
FALE - FACULDADE DE LETRAS |
publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
instname_str |
Universidade Federal de Minas Gerais (UFMG) |
instacron_str |
UFMG |
institution |
UFMG |
reponame_str |
Repositório Institucional da UFMG |
collection |
Repositório Institucional da UFMG |
bitstream.url.fl_str_mv |
https://repositorio.ufmg.br/bitstream/1843/39311/4/Castro_2021_final.pdf https://repositorio.ufmg.br/bitstream/1843/39311/2/license_rdf https://repositorio.ufmg.br/bitstream/1843/39311/5/license.txt |
bitstream.checksum.fl_str_mv |
826f875ea754cb6b510ed805fb0cccf1 cfd6801dba008cb6adbd9838b81582ab cda590c95a0b51b4d15f60c9642ca272 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
repository.mail.fl_str_mv |
|
_version_ |
1803589301050540032 |
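
The bitstream.checksum and bitstream.checksumAlgorithm fields above list one MD5 value per file. A minimal sketch of checking a locally downloaded copy against those values follows. The function names and the local file path are assumptions for illustration; the checksums are copied from this record and paired with the file names by their order in bitstream.url.

```python
import hashlib

# MD5 values as listed in this record, paired with the file names
# in the order they appear in the bitstream.url field.
EXPECTED_MD5 = {
    "Castro_2021_final.pdf": "826f875ea754cb6b510ed805fb0cccf1",
    "license_rdf": "cfd6801dba008cb6adbd9838b81582ab",
    "license.txt": "cda590c95a0b51b4d15f60c9642ca272",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, name):
    """Return True if the local file's MD5 equals the value in the record."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `matches_record("downloads/Castro_2021_final.pdf", "Castro_2021_final.pdf")` would return `True` only for a byte-identical copy, where `downloads/` is a hypothetical local directory. MD5 is adequate for detecting accidental corruption but should not be relied on as a defence against deliberate tampering.
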