SEGMENTING CORPORA OF TEX

Sardinha, Tony Berber

Bibliographic details
Main author:	Sardinha, Tony Berber
Publication date:	2018
Document type:	Article
Language:	eng
Source title:	DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
Full text:	https://revistas.pucsp.br/index.php/delta/article/view/38793
Abstract:	This paper reports on a corpus-based method for discourse analysis built on the notion of segmentation, that is, the division of texts into cohesive portions. For the purposes of this investigation, a segment is defined as a contiguous portion of written text consisting of at least two sentences. The segmentation procedure developed for the study is called LSM (link set median) and is based on the identification of lexical repetition in text. The data analysed were three corpora of 100 texts each. Each corpus was composed of texts of one particular genre: research articles, annual business reports, and encyclopaedia entries. The three corpora together totalled 1,262,710 words. The segments inserted in the texts by the LSM procedure were compared to the internal section divisions in the texts. The results obtained through the LSM procedure were then compared to segmentation carried out at random. The results indicated that the LSM procedure worked better than random, suggesting that lexical repetition accounts in part for the way texts are segmented into sections.
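
The abstract names the ingredients of the procedure (lexical repetition links, a median over each link set, segments of at least two sentences) but not its details. The Python sketch below is only one possible reading of those ingredients and should not be taken as the author's LSM procedure. Everything in it is an assumption made for illustration: the stopword list, the rule that two sentences are linked when they share at least one content word, the fallback for unlinked sentences, the `threshold` value, and all function names.

```python
import re
from statistics import median

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "on", "for"}

def content_words(sentence):
    """Lowercased word tokens minus the stopwords above."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}

def link_sets(sentences):
    """For each sentence, indices of the other sentences sharing >= 1 content word."""
    words = [content_words(s) for s in sentences]
    return [
        [j for j in range(len(sentences)) if j != i and words[i] & words[j]]
        for i in range(len(sentences))
    ]

def link_set_medians(sentences):
    """Median of each link set; an unlinked sentence falls back to its own index."""
    return [median(links) if links else i
            for i, links in enumerate(link_sets(sentences))]

def boundaries(sentences, threshold=2.0, min_len=2):
    """Indices where a new segment starts (hypothetical jump-in-median rule)."""
    meds = link_set_medians(sentences)
    cuts, last = [], 0
    for i in range(1, len(sentences)):
        # Keep every segment at least `min_len` sentences long.
        long_enough = i - last >= min_len and len(sentences) - i >= min_len
        if long_enough and abs(meds[i] - meds[i - 1]) >= threshold:
            cuts.append(i)
            last = i
    return cuts

demo_sentences = [
    "Corpus linguistics studies corpus data.",
    "A corpus holds language data.",
    "Corpus data reveal language patterns.",
    "Annual reports describe company profits.",
    "Company profits rose last year.",
    "Reports list profits by year.",
]
print(link_set_medians(demo_sentences))
print(boundaries(demo_sentences))
```

In this toy input the first three sentences repeat "corpus" and "data" while the last three repeat "profits", so the medians jump between the third and fourth sentences and the sketch places a single boundary there. A random baseline of the kind mentioned in the abstract could be obtained by drawing the same number of boundaries at random positions and comparing both against the texts' own section breaks.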

Item metadata

id	PUC_SP-4_b716686d59cee04b3d9539af283b6406
oai_identifier_str	oai:ojs.pkp.sfu.ca:article/38793
network_acronym_str	PUC_SP-4
network_name_str	DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
repository_id_str
dc.title.none.fl_str_mv	SEGMENTING CORPORA OF TEX
author	Sardinha, Tony Berber
author_facet	Sardinha, Tony Berber
author_role	author
dc.contributor.author.fl_str_mv	Sardinha, Tony Berber
dc.subject.por.fl_str_mv	Corpus linguistics Discourse analysis Segmentation Lexical cohesion Repetition
topic	Corpus linguistics Discourse analysis Segmentation Lexical cohesion Repetition
publishDate	2018
dc.date.none.fl_str_mv	2018-08-08
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://revistas.pucsp.br/index.php/delta/article/view/38793
url	https://revistas.pucsp.br/index.php/delta/article/view/38793
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	https://revistas.pucsp.br/index.php/delta/article/view/38793/26327
dc.rights.driver.fl_str_mv	Copyright (c) 2018 DELTA: Documentação e Estudos em Linguística Teórica e Aplicada info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Copyright (c) 2018 DELTA: Documentação e Estudos em Linguística Teórica e Aplicada
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Pontifícia Universidade Católica de São Paulo
publisher.none.fl_str_mv	Pontifícia Universidade Católica de São Paulo
dc.source.none.fl_str_mv	DELTA: Documentação e Estudos em Linguística Teórica e Aplicada; v. 18 n. 2 (2002) 1678-460X 0102-4450 reponame:DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada instname:Pontifícia Universidade Católica de São Paulo (PUC-SP) instacron:PUC_SP
instname_str	Pontifícia Universidade Católica de São Paulo (PUC-SP)
instacron_str	PUC_SP
institution	PUC_SP
reponame_str	DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
collection	DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
repository.name.fl_str_mv	DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada - Pontifícia Universidade Católica de São Paulo (PUC-SP)
repository.mail.fl_str_mv	\|\|delta@pucsp.br
_version_	1799129302828056576
