Casa de la Lhéngua: A set of language resources and natural language processing tools for Mirandese
Main author: | Ferreira, J. P. |
---|---|
Publication date: | 2014 |
Other authors: | Chesi, C.; Baldewijns, D.; Dias, M. S.; Braga, D.; Pinto, F. M.; Cho, H.; Correia, M.; Ferreira, A. |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10071/25540 |
Abstract: | This paper describes the construction of language resources and NLP tools for Mirandese, a minority language spoken in north-eastern Portugal, now available on a community-led portal, Casa de la Lhéngua. The resources were developed within a collaborative citizenship project led by Microsoft, as part of the creation of the first text-to-speech (TTS) system for Mirandese. Development encompassed the compilation of a corpus of over 1M tokens and the construction of a GTP system and of syllable-division, inflection, and part-of-speech (POS) tagger modules, leading to an inflected lexicon of about 200,000 entries with phonetic transcription, detailed POS tagging, syllable division, and stress mark-up. Alongside these tasks, which were made easier by adapting and reusing existing tools for closely related languages, a casting call for voice talents was held in the speaker community, and the first Mirandese speech database for speech synthesis was recorded. These resources were combined to meet the requirements of a well-tested statistical parametric synthesis model, yielding an intelligible voice font. The language resources are freely available at Casa de la Lhéngua, with the aim of promoting the development of real-life applications and fostering linguistic research on Mirandese. |
id |
RCAP_ddc01b86cb23d664b709a7ff543641f1 |
---|---|
oai_identifier_str |
oai:repositorio.iscte-iul.pt:10071/25540 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Casa de la Lhéngua: A set of language resources and natural language processing tools for Mirandese |
author |
Ferreira, J. P. |
author_facet |
Ferreira, J. P.; Chesi, C.; Baldewijns, D.; Dias, M. S.; Braga, D.; Pinto, F. M.; Cho, H.; Correia, M.; Ferreira, A. |
author_role |
author |
author2 |
Chesi, C.; Baldewijns, D.; Dias, M. S.; Braga, D.; Pinto, F. M.; Cho, H.; Correia, M.; Ferreira, A. |
author2_role |
author author author author author author author author |
dc.subject.por.fl_str_mv |
Language resources; Minority language; Mirandese; Speech synthesis; Lexical database |
publishDate |
2014 |
dc.date.none.fl_str_mv |
2014-01-01T00:00:00Z 2014 2022-05-25T11:45:11Z 2023-06-26T13:04:03Z |
dc.type.driver.fl_str_mv |
conference object |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10071/25540 |
dc.language.iso.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
978-295174088-4 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
European Language Resources Association (ELRA) |
dc.source.none.fl_str_mv |
reponame: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname: Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron: RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
mluisa.alvim@gmail.com |
_version_ |
1817546551302553600 |