Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research

Bibliographic details
Main author: Lưu, Alex
Publication date: 2022
Other authors: Koval, Pavel, Malamud, Sophia, Dubinina, Irina
Document type: Article
Language: eng, por
Source title: Bakhtiniana
Full text: https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831
Abstract: The BiRCh Project (The Corpus of Bilingual Russian Child Speech) involves collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia, Ukraine, Germany, the U.S., and Canada. We are building a large-scale corpus based on a subset of this data, the “Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)” with two basic components: (1) 1-million-word transcripts which are time-aligned with the audio speech signal and fully text-searchable, and (2) a 500K-word morphologically annotated and parsed portion of the transcripts, also audio-aligned. We are using this corpus to investigate various phenomena in the linguistic input and the developmental trajectory of heritage bilinguals, e.g., case, gender, passives, impersonals, politeness markers, disfluencies, and discourse markers. This article focuses on the challenges and solutions of the BiRCh development and the implications for research on the richly annotated data provided by the corpus.
id PUC_SP-6_c6f3603a541dc8f7d9b0d460e3d1c23a
oai_identifier_str oai:ojs.pkp.sfu.ca:article/55831
network_acronym_str PUC_SP-6
network_name_str Bakhtiniana
repository_id_str
dc.title.none.fl_str_mv Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research
A construção de corpus de larga escala da fala bilíngue de crianças e da fala bilíngue dirigida à criança, anotado e alinhado aos arquivos de áudio: desafios, soluções e implicações para a pesquisa
title Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research
author Lưu, Alex
author_facet Lưu, Alex
Koval, Pavel
Malamud, Sophia
Dubinina, Irina
author_role author
author2 Koval, Pavel
Malamud, Sophia
Dubinina, Irina
author2_role author
author
author
dc.contributor.author.fl_str_mv Lưu, Alex
Koval, Pavel
Malamud, Sophia
Dubinina, Irina
dc.subject.por.fl_str_mv Spoken Russian corpus
Disfluency annotation
Morphological tagging
Syntactic parsing
Bilingual and heritage speakers
Corpus de fala em russo
Anotação de disfluências
Marcação morfológica
Análise sintática
Falantes bilíngues
Falantes de herança
publishDate 2022
dc.date.none.fl_str_mv 2022-10-28
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831
url https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831
dc.language.iso.fl_str_mv eng
por
language eng
por
dc.relation.none.fl_str_mv https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831/40759
https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831/40760
dc.rights.driver.fl_str_mv Copyright (c) 2022 Bakhtiniana. Revista de Estudos do Discurso
https://creativecommons.org/licenses/by/4.0
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Copyright (c) 2022 Bakhtiniana. Revista de Estudos do Discurso
https://creativecommons.org/licenses/by/4.0
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Pontifícia Universidade Católica de São Paulo
publisher.none.fl_str_mv Pontifícia Universidade Católica de São Paulo
dc.source.none.fl_str_mv Bakhtiniana. Revista de Estudos do Discurso ; Vol. 17 No. 4 (2022); Port. 223-261 / Eng. 229-263
2176-4573
reponame:Bakhtiniana
instname:Pontifícia Universidade Católica de São Paulo (PUC-SP)
instacron:PUC_SP
instname_str Pontifícia Universidade Católica de São Paulo (PUC-SP)
instacron_str PUC_SP
institution PUC_SP
reponame_str Bakhtiniana
collection Bakhtiniana
repository.name.fl_str_mv Bakhtiniana - Pontifícia Universidade Católica de São Paulo (PUC-SP)
repository.mail.fl_str_mv bakhtinianarevista@gmail.com