Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information

Bibliographic details
Main author: Balbachan, Fernando
Publication date: 2010
Other authors: Dell'Era, Diego
Document type: Article
Language: por
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: https://linguamatica.com/index.php/linguamatica/article/view/60
Abstract: The Argument from the Poverty of the Stimulus (APS) is the main arena of epistemological debate between the symbolic and statistical paradigms in computational linguistics (Pullum and Scholz 2002). Since 2000, several works within the statistical paradigm have challenged the APS by proposing unsupervised, general-purpose algorithms for language acquisition. Among the most important contributions, Clark’s Ph.D. thesis (2001) draws on diverse statistical techniques to arrive at an unsupervised, general-purpose algorithm for inducing language and, more precisely, a complete context-free grammar (CFG) for English. Clark (2001) uses a different induction technique for each linguistic phenomenon modeled: hidden Markov models (HMM) for morphology, clustering for POS tagging, and so on. This paper focuses on the induction of syntactic constituents from a POS-tagged corpus, as a step prior to inducing a complete CFG. In his thesis, Clark acknowledges that more cross-linguistic evidence is needed to support the psycholinguistic plausibility of his approach. To date, no work has tested Clark’s approach on a highly inflected language with free constituent order, such as Spanish. Our work aims to contribute that cross-linguistic evidence by analyzing the feasibility of applying Clark’s constituent-induction algorithm to Spanish. The algorithm applies K-means clustering to group sequences of morpho-syntactic (POS) tags according to their distributional information. The resulting clusters are then filtered, to keep those that actually correspond to constituents, using a criterion based on the mutual information between the symbols that occur immediately before and after the sequences. This criterion avoids the typical bias of sparse corpora and distinguishes the dependency between the boundaries of candidate sequences from the natural entropy of symbols that co-occur at a short distance (Li 1990). Our implementation was tested on a prototype-sized corpus, with promising results: recall of 74%, precision of 58% and an F-measure of 65% at this prototype stage. These results encourage us to continue our long-term research, whose goal is an algorithm for the complete acquisition of Spanish.
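The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy corpus, the sentence-boundary markers and the threshold value are all assumptions made for illustration, and the K-means clustering stage is omitted. The sketch collects the POS tags immediately before and after each occurrence of a candidate tag sequence and computes the mutual information between them. The last helper assumes the balanced F-measure (harmonic mean of precision and recall), which the abstract does not state explicitly.

```python
from collections import Counter
from math import log2


def context_pairs(corpus, sequence):
    """Collect (tag before, tag after) for every occurrence of `sequence`."""
    n = len(sequence)
    target = tuple(sequence)
    pairs = []
    for sentence in corpus:
        # Hypothetical boundary markers so that edge occurrences have a context.
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for i in range(1, len(padded) - n):
            if tuple(padded[i:i + n]) == target:
                pairs.append((padded[i - 1], padded[i + n]))
    return pairs


def mutual_information(pairs):
    """Mutual information (in bits) between left and right context symbols."""
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    mi = 0.0
    for (l, r), count in joint.items():
        p_lr = count / total
        # p(l, r) / (p(l) * p(r)) written with raw counts.
        mi += p_lr * log2(p_lr * total * total / (left[l] * right[r]))
    return mi


def f_measure(precision, recall):
    """Balanced F-measure; assumed here, the abstract does not name the variant."""
    return 2 * precision * recall / (precision + recall)


# Toy POS-tagged corpus, invented for illustration (not the paper's data).
corpus = [
    ["VERB", "DET", "NOUN", "PREP"],
    ["PREP", "DET", "NOUN", "VERB"],
    ["VERB", "DET", "NOUN", "PREP"],
    ["PREP", "DET", "NOUN", "VERB"],
]
pairs = context_pairs(corpus, ("DET", "NOUN"))
THRESHOLD = 0.5  # hypothetical cutoff in bits, not a value from the paper
keep = mutual_information(pairs) > THRESHOLD
```

In this toy corpus the tag before DET NOUN fully determines the tag after it, so the candidate passes the assumed threshold. Under the balanced-F assumption, `f_measure(0.58, 0.74)` is approximately 0.65, which is consistent with the figures reported in the abstract.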
id RCAP_409bd78047856822437787198e7726fa
oai_identifier_str oai:linguamatica.com:article/60
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
dc.title.none.fl_str_mv Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information
Inducción de constituyentes sintácticos en español con técnicas de clustering y filtrado por información mutua
author Balbachan, Fernando
author_facet Balbachan, Fernando
Dell'Era, Diego
author_role author
author2 Dell'Era, Diego
author2_role author
dc.contributor.author.fl_str_mv Balbachan, Fernando
Dell'Era, Diego
dc.subject.por.fl_str_mv computational linguistics
statistical parsing
syntax constituency
distributional information
lingüística computacional
parsing estadístico
constituyentes sintácticos
información distribucional
lingüística computacional
parsing estatístico
componentes sintático
informação mútua
publishDate 2010
dc.date.none.fl_str_mv 2010-06-09
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv https://linguamatica.com/index.php/linguamatica/article/view/60
dc.language.iso.fl_str_mv por
dc.relation.none.fl_str_mv https://linguamatica.com/index.php/linguamatica/article/view/60
https://linguamatica.com/index.php/linguamatica/article/view/60/84
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade do Minho e Universidade de Vigo
dc.source.none.fl_str_mv Linguamática; Vol. 2 No. 2; 39-57
1647-0818
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799133553130209280