Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information

Balbachan, Fernando; Dell'Era, Diego

Bibliographic details
Main author:	Balbachan, Fernando
Publication date:	2010
Other authors:	Dell'Era, Diego
Document type:	Article
Language:	por
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	https://linguamatica.com/index.php/linguamatica/article/view/60
Abstract:	The Argument from the Poverty of the Stimulus (APS) is the main arena of epistemological debate between the symbolic and statistical paradigms in computational linguistics. Since 2000, several works within the statistical paradigm have attacked the APS by proposing unsupervised general-purpose algorithms for language acquisition. Among the most important contributions, Clark's Ph.D. thesis (2001) draws on diverse statistical techniques to arrive at an unsupervised general-purpose algorithm for inducing language and, more precisely, a complete Context-Free Grammar (CFG) for English. Clark (2001) uses a different induction technique for each linguistic phenomenon modeled: morphology with Hidden Markov Models (HMM), POS tagging with clustering, etc. In this paper we are interested specifically in the induction of syntactic constituency, given a POS-tagged corpus, as a step prior to inducing a complete CFG. In his thesis, the author acknowledges that more crosslinguistic evidence is needed to support the psycholinguistic plausibility of an approach such as his. To date, no work has set out to test Clark's approach on highly inflected languages with free constituent order, such as Spanish. Our work is therefore intended to contribute such crosslinguistic evidence by analyzing the feasibility of applying Clark's constituency-induction algorithm to Spanish. Clark's (2001) algorithm applies K-means clustering to group sequences of morpho-syntactic labels according to their distributional information. The clusters are then filtered with a criterion based on the mutual information between the symbols that occur immediately before and after the sequences. This criterion avoids the bias typical of sparse corpora and, in turn, succeeds in distinguishing the co-occurrence of adjacent symbols above the threshold of default entropy at short distances (Li 1990). Our implementation has been tested on a prototype-sized corpus, with promising results. We obtained recall = 74%, precision = 58% and F-measure = 65% at this prototype stage. These results encourage us to continue our long-term research, whose goal is to develop an algorithm for the complete acquisition of Spanish.
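The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy tag names and the toy data are assumptions made for illustration, and the estimator is the plain plug-in mutual information, without the sparse-corpus bias handling the abstract refers to. The second function only checks that the reported F-measure is the harmonic mean of the reported precision and recall.

```python
from collections import Counter
from math import log2


def context_mutual_information(contexts):
    """Mutual information (in bits) between the tag before and the tag after a sequence.

    `contexts` is a list of (left_tag, right_tag) pairs, one pair per
    occurrence of a candidate tag sequence in the corpus (hypothetical input format).
    """
    n = len(contexts)
    if n == 0:
        return 0.0
    joint = Counter(contexts)
    left = Counter(l for l, _ in contexts)
    right = Counter(r for _, r in contexts)
    mi = 0.0
    for (l, r), count in joint.items():
        # p(l, r) / (p(l) * p(r)) simplifies to count * n / (left[l] * right[r])
        mi += (count / n) * log2(count * n / (left[l] * right[r]))
    return mi


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: in `dependent` the right tag is fully determined by
# the left tag; in `independent` every left tag occurs with every right tag.
dependent = [("DET", "VERB"), ("PREP", "PUNCT")] * 4
independent = [(l, r) for l in ("DET", "PREP") for r in ("VERB", "PUNCT")] * 2

print(context_mutual_information(dependent))
print(context_mutual_information(independent))
print(round(f_measure(0.58, 0.74), 2))
```

In a filter of this kind, a cluster whose sequences show context mutual information above some chosen threshold would be kept as a candidate constituent; the threshold itself is not given in this record, so none is assumed here.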

Item metadata

id	RCAP_409bd78047856822437787198e7726fa
oai_identifier_str	oai:linguamatica.com:article/60
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information / Inducción de constituyentes sintácticos en español con técnicas de clustering y filtrado por información mutua. Abstract (from the Spanish version): The Argument from the Poverty of the Stimulus (APS) is the main arena of epistemological debate between the symbolic and statistical paradigms in computational linguistics (Pullum and Scholz 2002). From 2000 onward, several works within the statistical paradigm set out to attack the APS by proposing a general unsupervised algorithm for the complete acquisition of language. Among the most important contributions, Clark's doctoral thesis (2001) draws on diverse statistical techniques to arrive at a general unsupervised algorithm for language induction and, in particular, for inducing a context-free grammar for English. Clark (2001) works with a different induction technique for each linguistic phenomenon modeled: morphology through Markov models, categorization (POS tagging) through clustering, etc. In this work we are interested specifically in the induction of syntactic constituents, given a corpus tagged by word class (POS-tagged), as a step prior to inducing a context-free grammar. In his thesis, the author acknowledges that more crosslinguistic evidence is needed to support the psycholinguistic plausibility of an approach such as his. At present, no work has set out to test Clark's (2001) approach to syntax induction on inflected languages with free constituent order, such as Spanish. Our work therefore aims to contribute such crosslinguistic evidence by studying the feasibility of applying Clark's (2001) constituent-induction algorithm to Spanish. The algorithm in question applies K-means clustering to group sequences of word-class tags according to their distributional information. The results are then filtered to find the clusters that actually correspond to groups of constituents, using a criterion of mutual information between the symbols immediately preceding and following those sequences. This filtering criterion avoids the bias of a sparse corpus while distinguishing the sought dependency between the boundaries of candidate constituent sequences above the threshold of the natural entropy of symbols that co-occur at a given distance in language (Li 1990). Our implementation of the algorithm was evaluated on a prototype-sized corpus, with promising results: recall of 74%, precision of 58% and an F-measure of 65% at the prototype stage. These results encourage continued long-term research, with the goal of achieving a robust algorithm for the complete acquisition of language for Spanish.
dc.title.none.fl_str_mv	Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information / Inducción de constituyentes sintácticos en español con técnicas de clustering y filtrado por información mutua
title	Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information
spellingShingle	Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information; Balbachan, Fernando; computational linguistics; statistical parsing; syntax constituency; distributional information; syntactic constituents; syntactic components; mutual information
author	Balbachan, Fernando
author_facet	Balbachan, Fernando; Dell'Era, Diego
author_role	author
author2	Dell'Era, Diego
author2_role	author
dc.contributor.author.fl_str_mv	Balbachan, Fernando; Dell'Era, Diego
dc.subject.por.fl_str_mv	computational linguistics; statistical parsing; syntax constituency; distributional information; syntactic constituents; syntactic components; mutual information
topic	computational linguistics; statistical parsing; syntax constituency; distributional information; syntactic constituents; syntactic components; mutual information
publishDate	2010
dc.date.none.fl_str_mv	2010-06-09
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://linguamatica.com/index.php/linguamatica/article/view/60
url	https://linguamatica.com/index.php/linguamatica/article/view/60
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	https://linguamatica.com/index.php/linguamatica/article/view/60 https://linguamatica.com/index.php/linguamatica/article/view/60/84
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade do Minho e Universidade de Vigo
publisher.none.fl_str_mv	Universidade do Minho e Universidade de Vigo
dc.source.none.fl_str_mv	Linguamática; Vol. 2 No. 2; 39-57; 1647-0818; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799133553130209280

Related records