SEGMENTING CORPORA OF TEX
Main author: | Sardinha, Tony Berber |
---|---|
Publication date: | 2018 |
Document type: | Article |
Language: | eng |
Source title: | DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada |
Full text: | https://revistas.pucsp.br/index.php/delta/article/view/38793 |
Abstract: | This article reports on a corpus-based method for discourse analysis built on the notion of segmentation, that is, the division of texts into cohesive portions. For the purposes of this investigation, a segment is defined as a contiguous portion of written text consisting of at least two sentences. The segmentation procedure developed for the study, called LSM (link set median), is based on the identification of lexical repetition in text. The data analysed were three corpora of 100 texts each, with each corpus composed of texts of one particular genre: research articles, annual business reports, and encyclopaedia entries. The three corpora totalled 1,262,710 words. The segments inserted in the texts by the LSM procedure were compared to the internal section divisions in the texts. The results obtained through the LSM procedure were then compared to segmentation carried out at random. The results indicated that the LSM procedure performed better than random segmentation, suggesting that lexical repetition accounts in part for the way texts are segmented into sections. |
id |
PUC_SP-4_b716686d59cee04b3d9539af283b6406 |
---|---|
oai_identifier_str |
oai:ojs.pkp.sfu.ca:article/38793 |
network_acronym_str |
PUC_SP-4 |
network_name_str |
DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada |
repository_id_str |
|
dc.title.none.fl_str_mv |
SEGMENTING CORPORA OF TEX |
author |
Sardinha, Tony Berber |
author_facet |
Sardinha, Tony Berber |
author_role |
author |
dc.contributor.author.fl_str_mv |
Sardinha, Tony Berber |
dc.subject.por.fl_str_mv |
Corpus linguistics; Discourse analysis; Segmentation; Lexical cohesion; Repetition |
topic |
Corpus linguistics; Discourse analysis; Segmentation; Lexical cohesion; Repetition |
publishDate |
2018 |
dc.date.none.fl_str_mv |
2018-08-08 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://revistas.pucsp.br/index.php/delta/article/view/38793 |
url |
https://revistas.pucsp.br/index.php/delta/article/view/38793 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
https://revistas.pucsp.br/index.php/delta/article/view/38793/26327 |
dc.rights.driver.fl_str_mv |
Copyright (c) 2018 DELTA: Documentação e Estudos em Linguística Teórica e Aplicada info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Copyright (c) 2018 DELTA: Documentação e Estudos em Linguística Teórica e Aplicada |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Pontifícia Universidade Católica de São Paulo |
publisher.none.fl_str_mv |
Pontifícia Universidade Católica de São Paulo |
dc.source.none.fl_str_mv |
DELTA: Documentação e Estudos em Linguística Teórica e Aplicada; v. 18 n. 2 (2002) 1678-460X 0102-4450 reponame:DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada instname:Pontifícia Universidade Católica de São Paulo (PUC-SP) instacron:PUC_SP |
instname_str |
Pontifícia Universidade Católica de São Paulo (PUC-SP) |
instacron_str |
PUC_SP |
institution |
PUC_SP |
reponame_str |
DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada |
collection |
DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada |
repository.name.fl_str_mv |
DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada - Pontifícia Universidade Católica de São Paulo (PUC-SP) |
repository.mail.fl_str_mv |
delta@pucsp.br |
_version_ |
1799129302828056576 |