Semantic Similarity Match for Data Quality

Bibliographic details
Main author: Martins, Fernando
Publication date: 2007
Other authors: Falcão, André; Couto, Francisco M.
Document type: Report
Language: por
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10451/14158
Abstract: Data quality is a critical aspect of applications that support business operations. Entities are often represented more than once in data repositories, and because duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed with lexical approaches, which do not capture the sense of the text, so detection becomes harder when duplicates must be identified by sense. This work presents a semantic similarity approach, based on a text sense matching mechanism, that detects text units which are similar in sense. The goal of the proposed approach is therefore to perform the duplicate detection task in a data quality process.
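The contrast the abstract draws between lexical and sense-based matching can be illustrated with a minimal sketch. This is not the method of the report: the `TOY_SENSES` table, the sense labels, and both function names are hypothetical, and a real system would take its senses from a lexical resource such as WordNet (listed among the record's keywords) rather than from a hand-written dictionary.

```python
from difflib import SequenceMatcher

# Hypothetical toy sense inventory (an assumption for illustration only).
# Each word maps to a made-up sense label; synonyms share a label.
TOY_SENSES = {
    "car": "vehicle.motor",
    "automobile": "vehicle.motor",
    "buy": "act.buy",
    "purchase": "act.buy",
}


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it ignores word meaning."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sense_set(text: str) -> set:
    """Map each token to its sense label, or to itself if it is unknown."""
    return {TOY_SENSES.get(tok, tok) for tok in text.lower().split()}


def sense_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two texts' sense sets, in [0, 1]."""
    sa, sb = sense_set(a), sense_set(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    left, right = "buy car", "purchase automobile"
    print("lexical:", lexical_similarity(left, right))
    print("sense:  ", sense_similarity(left, right))
```

Under these assumptions, the two strings share almost no characters, so the lexical score stays low, while their sense sets coincide and the sense score is maximal. A duplicate detector built only on the lexical score would miss this pair.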
id RCAP_33a12691348cb22b9dda534961712cbc
oai_identifier_str oai:repositorio.ul.pt:10451/14158
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Semantic Similarity Match for Data Quality
author Martins, Fernando
author_facet Martins, Fernando
Falcão, André
Couto, Francisco M.
author_role author
author2 Falcão, André
Couto, Francisco M.
author2_role author
author
dc.contributor.none.fl_str_mv Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Martins, Fernando
Falcão, André
Couto, Francisco M.
dc.subject.por.fl_str_mv semantic similarity
data cleaning
data quality
wordnet
similarity match
publishDate 2007
dc.date.none.fl_str_mv 2007-10
2007-10-01T00:00:00Z
2009-02-10T13:12:03Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/report
format report
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/14158
url http://hdl.handle.net/10451/14158
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Department of Informatics, University of Lisbon
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799134258574393344