Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research
Main author: | Lưu, Alex |
---|---|
Publication date: | 2022 |
Other authors: | Koval, Pavel; Malamud, Sophia; Dubinina, Irina |
Document type: | Article |
Language: | eng por |
Source title: | Bakhtiniana |
Full text: | https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831 |
Abstract: | The BiRCh Project (The Corpus of Bilingual Russian Child Speech) involves collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia, Ukraine, Germany, the U.S., and Canada. We are building a large-scale corpus based on a subset of this data, the “Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)”, with two basic components: (1) 1-million-word transcripts which are time-aligned with the audio speech signal and fully text-searchable, and (2) a 500K-word morphologically annotated and parsed portion of the transcripts, also audio-aligned. We are using this corpus to investigate various phenomena in the linguistic input and the developmental trajectory of heritage bilinguals, e.g., case, gender, passives, impersonals, politeness markers, disfluencies, and discourse markers. This article focuses on the challenges and solutions of the BiRCh development and the implications for research on the richly annotated data provided by the corpus. |
id |
PUC_SP-6_c6f3603a541dc8f7d9b0d460e3d1c23a |
---|---|
oai_identifier_str |
oai:ojs.pkp.sfu.ca:article/55831 |
network_acronym_str |
PUC_SP-6 |
network_name_str |
Bakhtiniana |
repository_id_str |
|
dc.title.none.fl_str_mv |
Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research / A construção de corpus de larga escala da fala bilíngue de crianças e da fala bilíngue dirigida à criança, anotado e alinhado aos arquivos de áudio: desafios, soluções e implicações para a pesquisa |
title |
Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research |
author |
Lưu, Alex |
author_facet |
Lưu, Alex; Koval, Pavel; Malamud, Sophia; Dubinina, Irina |
author_role |
author |
author2 |
Koval, Pavel; Malamud, Sophia; Dubinina, Irina |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Lưu, Alex; Koval, Pavel; Malamud, Sophia; Dubinina, Irina |
dc.subject.por.fl_str_mv |
Spoken Russian corpus; Disfluency annotation; Morphological tagging; Syntactic parsing; Bilingual and heritage speakers |
topic |
Spoken Russian corpus; Disfluency annotation; Morphological tagging; Syntactic parsing; Bilingual and heritage speakers |
description |
The BiRCh Project (The Corpus of Bilingual Russian Child Speech) involves collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia, Ukraine, Germany, the U.S., and Canada. We are building a large-scale corpus based on a subset of this data, the “Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)” with two basic components: (1) 1-million-word transcripts which are time-aligned with the audio speech signal and fully text-searchable, and (2) a 500K-word morphologically annotated and parsed portion of the transcripts, also audio-aligned. We are using this corpus to investigate various phenomena in the linguistic input and the developmental trajectory of heritage bilinguals, e.g., case, gender, passives, impersonals, politeness markers, disfluencies, and discourse markers. This article focuses on the challenges and solutions of the BiRCh development and the implications for research on the richly annotated data provided by the corpus. |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-10-28 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831 |
url |
https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831 |
dc.language.iso.fl_str_mv |
eng por |
language |
eng por |
dc.relation.none.fl_str_mv |
https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831/40759 https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831/40760 |
dc.rights.driver.fl_str_mv |
Copyright (c) 2022 Bakhtiniana. Revista de Estudos do Discurso https://creativecommons.org/licenses/by/4.0 info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Copyright (c) 2022 Bakhtiniana. Revista de Estudos do Discurso https://creativecommons.org/licenses/by/4.0 |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf application/pdf |
dc.publisher.none.fl_str_mv |
Pontifícia Universidade Católica de São Paulo |
publisher.none.fl_str_mv |
Pontifícia Universidade Católica de São Paulo |
dc.source.none.fl_str_mv |
Bakhtiniana. Revista de Estudos do Discurso; Vol. 17 No. 4 (2022); Port. 223-261 / Eng. 229-263; ISSN 2176-4573; reponame:Bakhtiniana; instname:Pontifícia Universidade Católica de São Paulo (PUC-SP); instacron:PUC_SP |
instname_str |
Pontifícia Universidade Católica de São Paulo (PUC-SP) |
instacron_str |
PUC_SP |
institution |
PUC_SP |
reponame_str |
Bakhtiniana |
collection |
Bakhtiniana |
repository.name.fl_str_mv |
Bakhtiniana - Pontifícia Universidade Católica de São Paulo (PUC-SP) |
repository.mail.fl_str_mv |
bakhtinianarevista@gmail.com |
_version_ |
1799138684090449920 |