SEGMENTING CORPORA OF TEXTS

Bibliographic details
Main author: Sardinha, Tony Berber
Publication date: 2018
Document type: Article
Language: eng
Source title: DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
Full text: https://revistas.pucsp.br/index.php/delta/article/view/38793
Abstract: This article reports on a corpus-based method for discourse analysis built on the notion of segmentation, that is, the division of texts into cohesive portions. For the purposes of this investigation, a segment is defined as a contiguous portion of written text consisting of at least two sentences. The segmentation procedure developed for the study, LSM (link set median), is based on the identification of lexical repetition in text. The data analysed were three corpora of 100 texts each, with each corpus representing one genre: research articles, annual business reports, and encyclopaedia entries. Together the three corpora totalled 1,262,710 words. The segments inserted in the texts by the LSM procedure were compared with the internal section divisions of the texts, and the LSM results were then compared with segmentation carried out at random. The results indicated that the LSM procedure performed better than random segmentation, suggesting that lexical repetition accounts in part for the way texts are segmented into sections.
id PUC_SP-4_b716686d59cee04b3d9539af283b6406
oai_identifier_str oai:ojs.pkp.sfu.ca:article/38793
network_acronym_str PUC_SP-4
network_name_str DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
title SEGMENTING CORPORA OF TEXTS
author Sardinha, Tony Berber
author_role author
topic Corpus linguistics
Discourse analysis
Segmentation
Lexical cohesion
Repetition
publishDate 2018
dc.date.none.fl_str_mv 2018-08-08
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
url https://revistas.pucsp.br/index.php/delta/article/view/38793
language eng
dc.relation.none.fl_str_mv https://revistas.pucsp.br/index.php/delta/article/view/38793/26327
dc.rights.driver.fl_str_mv Copyright (c) 2018 DELTA: Documentação e Estudos em Linguística Teórica e Aplicada
info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Pontifícia Universidade Católica de São Paulo
dc.source.none.fl_str_mv DELTA: Documentação e Estudos em Linguística Teórica e Aplicada; v. 18 n. 2 (2002)
1678-460X
0102-4450
reponame:DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
instname:Pontifícia Universidade Católica de São Paulo (PUC-SP)
instacron:PUC_SP
repository.name.fl_str_mv DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada - Pontifícia Universidade Católica de São Paulo (PUC-SP)
repository.mail.fl_str_mv delta@pucsp.br
_version_ 1799129302828056576