A corpus of European Portuguese child and child-directed speech

Santos, Ana Lúcia; Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana

Bibliographic details
Main author:	Santos, Ana Lúcia
Publication date:	2014
Other authors:	Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	http://hdl.handle.net/10451/30661
Abstract:	We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntactic and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.

Item metadata

id	RCAP_0f617b01735517cee03de711eb53d915
oai_identifier_str	oai:repositorio.ul.pt:10451/30661
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
dc.title.none.fl_str_mv	A corpus of European Portuguese child and child-directed speech
author	Santos, Ana Lúcia
author_facet	Santos, Ana Lúcia; Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana
author_role	author
author2	Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana
author2_role	author; author; author; author
dc.contributor.none.fl_str_mv	Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv	Santos, Ana Lúcia; Généreux, Michel; Cardoso, Aida; Agostinho, Celina; Abalada, Silvana
dc.subject.por.fl_str_mv	Acquisition; Child corpus; Part-of-speech-tagging
topic	Acquisition; Child corpus; Part-of-speech-tagging
publishDate	2014
dc.date.none.fl_str_mv	2014; 2014-01-01T00:00:00Z; 2018-01-17T12:39:41Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10451/30661
url	http://hdl.handle.net/10451/30661
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	Santos, Ana Lúcia, Michel Généreux, Aida Cardoso, Celina Agostinho & Silvana Abalada (2014): "A corpus of European Portuguese child and child-directed speech". In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik: European Language Resources Association (ELRA), pp. 1488-1491.
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	European Language Resources Association
publisher.none.fl_str_mv	European Language Resources Association
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_	1799134386889687040
