A corpus of European Portuguese child and child-directed speech
Main author: | Santos, Ana Lúcia |
---|---|
Publication date: | 2014 |
Other authors: | Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10451/30661 |
Abstract: | We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntax and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model. |
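
The abstract mentions generating a line with part-of-speech information compatible with the CLAN interface. The following is a minimal sketch of what such a step might look like, not the authors' implementation: it assumes a hypothetical tagger output of (word form, lemma, tag) triples and writes them as `tag|lemma` items on a `%mor:` dependent tier under the speaker line. The tag names, the lowercase convention, and the helper names `mor_tier` and `annotate` are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical tagger output: (word form, lemma, part-of-speech tag).
# The tags used here are placeholders, not the tagset of the paper.
TaggedToken = Tuple[str, str, str]


def mor_tier(tokens: List[TaggedToken]) -> str:
    """Build a CHAT-style dependent tier with one `tag|lemma` item per token."""
    items = [f"{tag.lower()}|{lemma}" for _form, lemma, tag in tokens]
    return "%mor:\t" + " ".join(items)


def annotate(speaker: str, tokens: List[TaggedToken]) -> str:
    """Return the speaker's main line followed by its part-of-speech tier."""
    words = " ".join(form for form, _lemma, _tag in tokens)
    return f"*{speaker}:\t{words} .\n{mor_tier(tokens)}"


# Illustrative utterance only; it is not taken from the corpus.
example = [("quero", "querer", "V"), ("água", "água", "N")]
print(annotate("CHI", example))
```

Keeping the tier builder separate from the main-line formatter means the same function could be reused for child and adult speaker lines alike.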
id | RCAP_0f617b01735517cee03de711eb53d915 |
---|---|
oai_identifier_str | oai:repositorio.ul.pt:10451/30661 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
spelling | A corpus of European Portuguese child and child-directed speech; Acquisition; Child corpus; Part-of-speech-tagging; We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntax and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.; European Language Resources Association; Repositório da Universidade de Lisboa; Santos, Ana Lúcia; Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana; 2018-01-17T12:39:41Z; 2014; 2014-01-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf; http://hdl.handle.net/10451/30661; eng; Santos, Ana Lúcia, Michel Généreux, Aida Cardoso, Celina Agostinho & Silvana Abalada (2014): "A corpus of European Portuguese child and child-directed speech". In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik: European Language Resources Association (ELRA), pp. 1488-1491.; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2023-11-08T16:23:31Z; oai:repositorio.ul.pt:10451/30661; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-19T21:46:17.365658; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv | A corpus of European Portuguese child and child-directed speech |
title | A corpus of European Portuguese child and child-directed speech |
spellingShingle | A corpus of European Portuguese child and child-directed speech; Santos, Ana Lúcia; Acquisition; Child corpus; Part-of-speech-tagging |
title_short | A corpus of European Portuguese child and child-directed speech |
title_full | A corpus of European Portuguese child and child-directed speech |
title_fullStr | A corpus of European Portuguese child and child-directed speech |
title_full_unstemmed | A corpus of European Portuguese child and child-directed speech |
title_sort | A corpus of European Portuguese child and child-directed speech |
author | Santos, Ana Lúcia |
author_facet | Santos, Ana Lúcia; Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana |
author_role | author |
author2 | Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana |
author2_role | author; author; author; author |
dc.contributor.none.fl_str_mv | Repositório da Universidade de Lisboa |
dc.contributor.author.fl_str_mv | Santos, Ana Lúcia; Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana |
dc.subject.por.fl_str_mv | Acquisition; Child corpus; Part-of-speech-tagging |
topic | Acquisition; Child corpus; Part-of-speech-tagging |
description | We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntax and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model. |
publishDate | 2014 |
dc.date.none.fl_str_mv | 2014; 2014-01-01T00:00:00Z; 2018-01-17T12:39:41Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
format | article |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10451/30661 |
url | http://hdl.handle.net/10451/30661 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | Santos, Ana Lúcia, Michel Généreux, Aida Cardoso, Celina Agostinho & Silvana Abalada (2014): "A corpus of European Portuguese child and child-directed speech". In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik: European Language Resources Association (ELRA), pp. 1488-1491. |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | European Language Resources Association |
publisher.none.fl_str_mv | European Language Resources Association |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799134386889687040 |