Casa de la Lhéngua: A set of language resources and natural language processing tools for Mirandese

Bibliographic details
Main author: Ferreira, J. P.
Publication date: 2014
Other authors: Chesi, C., Baldewijns, D., Dias, M. S., Braga, D., Pinto, F. M., Cho, H., Correia, M., Ferreira, A.
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/25540
Abstract: This paper describes the construction of language resources and NLP tools for Mirandese, a minority language spoken in north-eastern Portugal, now available on a community-led portal, Casa de la Lhéngua. The resources were developed within a collaborative citizenship project led by Microsoft, as part of the creation of the first text-to-speech (TTS) system for Mirandese. Development efforts encompassed the compilation of a corpus of over 1M tokens and the construction of a GTP system and syllable-division, inflection, and Part-of-Speech (POS) tagger modules, leading to the creation of an inflected lexicon of about 200,000 entries with phonetic transcription, detailed POS tagging, syllable division, and stress mark-up. Alongside these tasks, which were made easier by adapting and reusing existing tools for closely related languages, a casting for voice talents was conducted among the speaking community, and the first Mirandese speech database for speech synthesis was recorded. These resources were combined to fulfil the requirements of a well-tested statistical parameter synthesis model, leading to an intelligible voice font. The language resources are freely available at Casa de la Lhéngua, with the aim of promoting the development of real-life applications and fostering linguistic research on Mirandese.
id RCAP_ddc01b86cb23d664b709a7ff543641f1
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/25540
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Casa de la Lhéngua: A set of language resources and natural language processing tools for Mirandese
dc.contributor.author.fl_str_mv Ferreira, J. P.
Chesi, C.
Baldewijns, D.
Dias, M. S.
Braga, D.
Pinto, F. M.
Cho, H.
Correia, M.
Ferreira, A.
dc.subject.por.fl_str_mv Language resources
Minority language
Mirandese
Speech synthesis
Lexical database
publishDate 2014
dc.date.none.fl_str_mv 2014-01-01T00:00:00Z
2014
2022-05-25T11:45:11Z
2023-06-26T13:04:03Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/25540
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 978-295174088-4
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv European Language Resources Association (ELRA)
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação