Identifying idiolect in forensic authorship attribution : an n-gram textbite approach
Main author: | Johnson, Alison |
---|---|
Publication date: | 2017 |
Other authors: | Wright, David |
Document type: | Article |
Language: | por |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://ojs.letras.up.pt/index.php/LLLD/article/view/2443 |
Abstract: | Forensic authorship attribution is concerned with identifying authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist “approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [...], their own idiolect” (Coulthard, 2004: 31). However, given the difficulty in empirically substantiating a theory of idiolect, there is growing concern in the field that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams, two to six word chunks of language, similar to the popular notion of soundbites: small segments of no more than a few seconds of speech that journalists are able to recognise as having news value and which characterise the important moments of talk. The soundbite offers an intriguing parallel for authorship attribution studies, with the following question arising: looking at any set of texts by any author, is it possible to identify ‘n-gram textbites’, small textual segments that characterise that author’s writing, providing DNA-like chunks of identifying material? Drawing on a corpus of 63,000 emails and 2.5 million words written by 176 employees of the former American energy corporation Enron, a case study approach is adopted, first showing through stylistic analysis that one Enron employee repeatedly produces the same stylistic patterns of politely encoded directives in a way that may be considered habitual. Then a statistical experiment with the same case study author finds that word n-grams can assign anonymised email samples to him with success rates as high as 100%. This paper argues that, if sufficiently distinctive, these textbites are able to identify authors by reducing a mass of data to key segments that move us closer to the elusive concept of idiolect. |
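
The word n-gram idea described in the abstract can be illustrated with a minimal sketch. It extracts two to six word n-grams from each text and assigns a sample to the candidate author whose known writing shares the largest proportion of n-grams with it. The abstract does not say which similarity measure the study uses, so the Jaccard overlap below is an assumption made purely for illustration, and the function names `word_ngrams`, `jaccard`, and `attribute` are hypothetical. Tokenisation is a simple lowercase whitespace split, which is cruder than a real forensic analysis would require.

```python
def word_ngrams(text, n_min=2, n_max=6):
    """Return the set of word n-grams (as tuples) of length n_min..n_max."""
    tokens = text.lower().split()  # naive tokenisation: punctuation stays attached
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def attribute(sample, known_texts):
    """Score a sample against each candidate's known text.

    known_texts maps an author label to a string of that author's writing.
    Returns the best-scoring label and the full score table.
    """
    sample_grams = word_ngrams(sample)
    scores = {
        author: jaccard(sample_grams, word_ngrams(text))
        for author, text in known_texts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Invented toy data, not drawn from the Enron corpus.
    known = {
        "A": "please let me know if you have any questions",
        "B": "thanks a lot talk to you soon",
    }
    best, scores = attribute("let me know if you need anything", known)
    print(best, scores)
```

With realistic data, each candidate's known text would be a large sample of their emails, and the scores would need to be interpreted against a reference population rather than taken at face value.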
id | RCAP_162410a2c5175f47b3353bbf63b5605e |
---|---|
oai_identifier_str | oai:ojs.pkp.sfu.ca:article/2443 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
dc.title.none.fl_str_mv | Identifying idiolect in forensic authorship attribution : an n-gram textbite approach |
author | Johnson, Alison |
author_facet | Johnson, Alison; Wright, David |
author_role | author |
author2 | Wright, David |
author2_role | author |
dc.contributor.author.fl_str_mv | Johnson, Alison; Wright, David |
dc.subject.por.fl_str_mv | Artigos/Articles |
topic | Artigos/Articles |
publishDate | 2017 |
dc.date.none.fl_str_mv | 2017-05-30T00:00:00Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
format | article |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | https://ojs.letras.up.pt/index.php/LLLD/article/view/2443 |
url | https://ojs.letras.up.pt/index.php/LLLD/article/view/2443 |
dc.language.iso.fl_str_mv | por |
language | por |
dc.relation.none.fl_str_mv | 2183-3745 |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.publisher.none.fl_str_mv | Faculdade de Letras da Universidade do Porto |
publisher.none.fl_str_mv | Faculdade de Letras da Universidade do Porto |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799130434972418048 |